Modeling Guide

Submit Hadoop Job

The Submit Hadoop Job operator submits jobs to Hadoop clusters provided by different cloud providers. Currently, the only supported Hadoop cluster type is Google Dataproc.

Supported services:
  • Google Dataproc (Spark, PySpark, Spark SQL, Hive)

The operator has one input port, 'jobConfig', which is optional and is used to dynamically overwrite the static configuration. A dynamic job config is represented as a JSON object whose structure depends on the job type and the service provider. See the examples of job configs below.

The operator has two output ports: 'success' and 'failure'. The output is represented as a message whose attributes contain metadata. For example, each message carries the cluster and job configuration in its attributes, as well as attributes specific to each job type. The message also contains an attribute named 'message.error', which is 'true' in case of failure and 'false' in case of success.
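
As an illustration, a downstream operator connected to either output port could branch on the 'message.error' attribute. The sketch below is plain Python over a generic attributes dictionary; the function name and the logging are illustrative and not part of the operator's API.

# Minimal sketch: inspect the attributes of a message emitted by Submit Hadoop
# Job. Only the attribute name 'message.error' comes from the description
# above; everything else is a placeholder.
def handle_result(attributes: dict) -> None:
    failed = str(attributes.get("message.error", "false")).lower() == "true"
    if failed:
        print("Hadoop job failed; attributes:", attributes)
    else:
        print("Hadoop job succeeded; attributes:", attributes)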

A dynamic Dataproc Spark job configuration can look like the example below (include only needed fields):
{
  "sparkJobSpec": {
    "mainJarFileUri":"gs://bucket/file.jar",
    "mainClass": "org.apache.spark.ClassName",
    "args": ["10"],
    "fileUris": ["gs://bucket/file1.txt","gs://bucket/file2.txt"],
    "jarFileUris": ["gs://bucket/jar1.jar", "gs://bucket/jar2.jar"],
    "archiveUris": ["models1.zip", "models2.zip"],
    "properties": {"prop1": "v1", "prop2": "v2"},
    "loggingConfig": {"driverLogLevels": {"org.apache.spark": "DEBUG"}}
  },
  "labels": {"label1": "v1", "label2": "v2"}
}
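
For instance, a preceding operator could assemble such a configuration at runtime and send it as a JSON string to the 'jobConfig' input port. The sketch below is plain Python using json.dumps; the bucket, jar, class name, and argument are placeholders, and only the fields to be overwritten need to be included.

import json

# Sketch: build a dynamic Spark job config and serialize it for the
# 'jobConfig' input port. All URIs and names below are placeholders.
def build_spark_job_config(num_partitions: int) -> str:
    config = {
        "sparkJobSpec": {
            "mainJarFileUri": "gs://bucket/file.jar",
            "mainClass": "org.apache.spark.ClassName",
            "args": [str(num_partitions)],
        },
        "labels": {"label1": "v1"},
    }
    return json.dumps(config)
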
A dynamic Dataproc Spark SQL job configuration can look like the example below (include only needed fields):
{
  "sparkSqlJobSpec": {
    "queryList": {"queries": ["DROP TABLE t2"]},
    "scriptVariables": {"a": "b", "spark.property": "20"},
    "fileUris": ["gs://bucket/file1.txt","gs://bucket/file2.txt"],
    "jarFileUris": ["gs://bucket/jar1.jar"],
    "properties": {"prop1": "v1", "prop2": "v2"},
    "loggingConfig": {"driverLogLevels": {"org": "DEBUG"}}
  },
  "labels": {"label1": "v1", "label2": "v2"}
}
A dynamic Dataproc PySpark job configuration can look like the example below (include only needed fields):
{
  "pySparkJobSpec": {
    "mainPythonFileUri":"gs://bucket/file.py",
    "args": ["10"],
    "pythonFileUris": ["gs://bucket/f1.py","gs://bucket/f2.py"],
    "fileUris": ["gs://bucket/file1.txt","gs://bucket/file2.txt"],
    "jarFileUris": ["gs://bucket/jar1.jar", "gs://bucket/jar2.jar"],
    "archiveUris": ["models1.zip", "models2.zip"],
    "properties": {"prop1": "v1", "prop2": "v2"},
    "loggingConfig": {"driverLogLevels": {"org.apache.spark": "DEBUG"}}
  },
  "labels": {"label1": "v1", "label2": "v2"}
}
A dynamic Dataproc Hive job configuration can look like the example below (include only needed fields):
{
  "hiveJobSpec": {
    "queryList": {"queries": ["DROP TABLE t2"]},
    "scriptVariables": {"a": "b", "hive.property": "20"},
    "fileUris": ["gs://bucket/file1.txt","gs://bucket/file2.txt"],
    "jarFileUris": ["gs://bucket/jar1.jar"],
    "properties": {"prop1": "v1", "prop2": "v2"},
    "loggingConfig": {"driverLogLevels": {"org": "DEBUG"}}
  },
  "labels": {"label1": "v1", "label2": "v2"}
}
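
Similarly, a query-based configuration (Hive or Spark SQL) can be assembled from a list of statements and a set of script variables. The sketch below is plain Python; the queries, variable values, and label are placeholders.

import json

# Sketch: build a dynamic Hive job config from a list of queries and script
# variables (the latter are equivalent to the Hive command SET name="value").
def build_hive_job_config(queries: list, variables: dict) -> str:
    config = {
        "hiveJobSpec": {
            "queryList": {"queries": queries},
            "scriptVariables": variables,
        },
        "labels": {"label1": "v1"},
    }
    return json.dumps(config)

# Example: build_hive_job_config(["DROP TABLE t2"], {"a": "b"})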

There is a sample graph that shows how to use this operator. See com.sap.demo.hadoop.dataproc.

Configuration Parameters

Parameter Type Description
dataprocJobType string Mandatory: A Dataproc job type.

Default: "spark"

dataprocJarFileUris string Jar file URIs separated by ','.

Default: ""

dataprocLabels string Labels to associate with the Dataproc job.

Default: ""

dataprocSparkArgs string Spark job arguments separated by ','.

Default: ""

dataprocSparkConf string Spark configs, similar to the --conf parameters of spark-submit but represented as a JSON object. See the example below the parameter list.

Default: ""

dataprocSparkFileUris string Spark file URIs separated by ','.

Default: ""

dataprocSparkArchiveUris string Spark archive URIs separated by ','.

Default: ""

dataprocSparkMainPythonFileUri string Mandatory: The main PySpark file URI.

Default: ""

dataprocSparkPythonFileUris string Additional Python file URIs separated by ','.

Default: ""

dataprocSparkMainJarFileUri string Mandatory: The main Spark jar file URI.

Default: ""

dataprocSparkClassName string Mandatory: A class name to execute.

Default: ""

dataprocSparkSqlQueryFileUri string A file URI with Spark SQL queries.

Default: ""

dataprocSparkSqlQueries string A field for Spark SQL queries.

Default: ""

dataprocSparkSqlScriptVariables string A field for Spark SQL script variables. Equivalent to the Spark SQL command: SET name="value".

Default: ""

dataprocSparkDriverLogLevels string Spark driver log levels represented as a JSON object.

Default: ""

dataprocHiveContinueOnFailure bool Mandatory: A flag that determines whether to continue executing Hive queries if a query fails.

Default: false

dataprocHiveConf string Hive configs represented as a JSON object.

Default: ""

dataprocHiveQueryFileUri string A file URI with Hive queries.

Default: ""

dataprocHiveQueries string A field for Hive queries.

Default: ""

dataprocHiveScriptVariables string A field for Hive script variables. Equivalent to the Hive command: SET name="value".

Default: ""
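
As mentioned for dataprocSparkConf above, several parameters expect JSON objects passed as strings. The sketch below builds plausible values in Python; the Spark properties are common settings chosen only as examples, and the exact shape expected by dataprocSparkDriverLogLevels is assumed to be the package-to-level map used in the loggingConfig examples above.

import json

# Sketch: example JSON values for the dataprocSparkConf and
# dataprocSparkDriverLogLevels parameters. Property names and levels are
# illustrative; adjust them to your job.
spark_conf = json.dumps({
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
})
driver_log_levels = json.dumps({
    "org.apache.spark": "DEBUG",  # mirrors the loggingConfig examples above
})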

Input

Input Type Description
jobConfig string A JSON object representing a dynamic job configuration, which will overwrite or append to the static properties.

Output

Output Type Description
success message The operator sends a message with attributes to this port if the job is successful.
failure message The operator sends a message with attributes to this port if the job is unsuccessful.