Submit Hadoop Job
The Submit Hadoop Job operator submits jobs to Hadoop clusters provided by different cloud providers. Currently, the only supported Hadoop cluster type is Google Dataproc:
- Google Dataproc (Spark, PySpark, SparkSQL, Hive)
The operator has one input port, 'jobConfig', which is optional and is used to dynamically overwrite the static configuration. A dynamic job config is represented as a JSON object whose structure depends on the job type and the service provider. See the example job configs below; a sketch of how a dynamic config is merged onto the static parameters follows the examples.
The operator has one output port: 'result'. The output is a message whose attributes contain metadata: each message carries the cluster and job configuration in its attributes, as well as attributes specific to the job type. The attribute 'message.error' is 'true' if the job failed and 'false' if it succeeded.
{ "sparkJobSpec": { "mainJarFileUri":"gs://bucket/file.jar", "mainClass": "org.apache.spark.ClassName", "args": ["10"], "fileUris": ["gs://bucket/file1.txt","gs://bucket/file2.txt"], "jarFileUris": ["gs://bucket/jar1.jar", "gs://bucket/jar2.jar"], "archiveUris": ["models1.zip", "models2.zip"], "properties": {"prop1": "v1", "prop2": "v2"}, "loggingConfig": {"driverLogLevels": {"org.apache.spark": "DEBUG"}} }, "labels": {"label1": "v1", "label2": "v2"} }
{ "sparkSqlJobSpec": { "queryList": {"queries": ["DROP TABLE t2"]}, "scriptVariables": {"a": "b", "spark.property": "20"}, "fileUris": ["gs://bucket/file1.txt","gs://bucket/file2.txt"], "jarFileUris": ["gs://bucket/jar1.jar"], "properties": {"prop1": "v1", "prop2": "v2"}, "loggingConfig": {"driverLogLevels": {"org": "DEBUG"}} }, "labels": {"label1": "v1", "label2": "v2"} }
{ "pySparkJobSpec": { "mainPythonFileUri":"gs://bucket/file.py", "args": ["10"], "pythonFileUris": ["gs://bucket/f1.py","gs://bucket/f2.py"], "fileUris": ["gs://bucket/file1.txt","gs://bucket/file2.txt"], "jarFileUris": ["gs://bucket/jar1.jar", "gs://bucket/jar2.jar"], "archiveUris": ["models1.zip", "models2.zip"], "properties": {"prop1": "v1", "prop2": "v2"}, "loggingConfig": {"driverLogLevels": {"org.apache.spark": "DEBUG"}} }, "labels": {"label1": "v1", "label2": "v2"} }
{ "hiveJobSpec": { "queryList": {"queries": ["DROP TABLE t2"]}, "scriptVariables": {"a": "b", "hive.property": "20"}, "fileUris": ["gs://bucket/file1.txt","gs://bucket/file2.txt"], "jarFileUris": ["gs://bucket/jar1.jar"], "properties": {"prop1": "v1", "prop2": "v2"}, "loggingConfig": {"driverLogLevels": {"org": "DEBUG"}} }, "labels": {"label1": "v1", "label2": "v2"} }
There is a sample graph that shows how to use this operator. See com.sap.demo.hadoop.dataproc.
Configuration Parameters
Parameter | Type | Description |
---|---|---|
dataprocJobType | string | Mandatory: A Dataproc job type. Default: "spark" |
dataprocJarFileUris | string | Jar file URIs separated by ','. Default: "" |
dataprocLabels | string | Labels to associate with the Dataproc job. Default: "" |
dataprocSparkArgs | string | Spark job arguments separated by ','. Default: "" |
dataprocSparkConf | string | Spark configs, similar to the --conf parameters of spark-submit but given as a JSON object (see the example after this table). Default: "" |
dataprocSparkFileUris | string | Spark file URIs separated by ','. Default: "" |
dataprocSparkArchiveUris | string | Spark archive URIs separated by ','. Default: "" |
dataprocSparkMainPythonFileUri | string | Mandatory: The main PySpark file URI. Default: "" |
dataprocSparkPythonFileUris | string | Additional Python file URIs separated by ','. Default: "" |
dataprocSparkMainJarFileUri | string | Mandatory: The main Spark jar file URI. Default: "" |
dataprocSparkClassName | string | Mandatory: A class name to execute. Default: "" |
dataprocSparkSqlQueryFileUri | string | A file URI with Spark SQL queries. Default: "" |
dataprocSparkSqlQueries | string | A field for Spark SQL queries. Default: "" |
dataprocSparkSqlScriptVariables | string | A field for Spark SQL script variables, equivalent to the Spark SQL command SET name="value". Default: "" |
dataprocSparkDriverLogLevels | string | Spark driver log levels represented as a JSON object. Default: "" |
dataprocHiveContinueOnFailure | bool | Mandatory: A flag that determines whether to continue executing Hive queries if a query fails. Default: false |
dataprocHiveConf | string | Hive configs represented as a JSON object. Default: "" |
dataprocHiveQueryFileUri | string | A file URI with Hive queries. Default: "" |
dataprocHiveQueries | string | A field for Hive queries. Default: "" |
dataprocHiveScriptVariables | string | A field for Hive script variables, equivalent to the Hive command SET name="value". Default: "" |
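The JSON-valued parameters above expect a plain JSON object serialized as a string. The snippet below is only an illustration: the Spark property names are ordinary spark-submit properties and the URIs are placeholders, not values prescribed by this operator.

```python
import json

# Illustrative values only; property names, log levels, and URIs are placeholders.
dataprocSparkConf = json.dumps({
    "spark.executor.memory": "4g",   # analogous to: spark-submit --conf spark.executor.memory=4g
    "spark.executor.cores": "2",
})

dataprocSparkDriverLogLevels = json.dumps({
    "org.apache.spark": "DEBUG",
    "root": "INFO",
})

# Comma-separated parameters such as dataprocJarFileUris are plain strings.
dataprocJarFileUris = "gs://bucket/jar1.jar,gs://bucket/jar2.jar"

print(dataprocSparkConf, dataprocSparkDriverLogLevels, dataprocJarFileUris, sep="\n")
```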
Input
Input | Type | Description |
---|---|---|
jobConfig | string | A JSON object representing a dynamic job configuration, which overwrites or appends to the static properties. |
Output
Output | Type | Description |
---|---|---|
success | message | The operator sends a message with attributes to this port if the job is successful. |
failure | message | The operator sends a message with attributes to this port if the job is unsuccessful. |
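How the result is consumed depends on the rest of the graph. The sketch below only illustrates branching on the 'message.error' attribute mentioned above; representing the message as a dict with an 'attributes' map is an assumption made for this example, not the platform API.

```python
# Minimal sketch, assuming the result message is available as a dict with an
# 'attributes' map; only the 'message.error' attribute comes from the
# documentation above, everything else is illustrative.
def handle_result(msg):
    attributes = msg.get("attributes", {})
    if attributes.get("message.error") == "true":
        print("Dataproc job failed; attributes:", attributes)
    else:
        print("Dataproc job succeeded; attributes:", attributes)

# Example call with a made-up message.
handle_result({"attributes": {"message.error": "false"}, "body": ""})
```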