Spark Submit

The Spark Submit operator is a wrapper for spark-submit.

It requires a Spark installation, and the SPARK_HOME environment variable must point to it. If YARN is used (via the master parameter), then HADOOP_CONF_DIR must point to a directory containing core-site.xml and yarn-site.xml with the correct settings to connect to the Hadoop cluster with YARN.

If SAP Vora Pipeline Engine is running in cluster mode, it uses a Docker image that provides the necessary environment. The configuration for YARN is retrieved from the yarn.resourcemanager.address, yarn.resourcemanager.hostname, and fs.defaultFS parameters. Because the appjar path is resolved inside that environment, it must point to a file inside the Docker container; otherwise, use the binary input to stream in a JAR.
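Conceptually, the operator assembles a spark-submit invocation from the parameters listed below. As a rough sketch only (the exact call the operator builds is not documented here), the default values correspond to a command line such as:

  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    /usr/local/spark/examples/spark-examples_2.10-1.1.1.jar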

Configuration Parameters

master (string)
The value for the "--master" argument of spark-submit.
Default: "yarn"

deploymode (string)
The value for the "--deploy-mode" argument of spark-submit.
Default: "cluster"

class (string)
Mandatory. The value for the "--class" argument of spark-submit.
Default: "org.apache.spark.examples.SparkPi"

appjar (string)
Mandatory. The path to the JAR to be executed. Optionally, the &workingDirectory& variable can be used, which expands to the path of the current operator directory in the repository (usage: "&workingDirectory&/my_app.jar").
Default: "/usr/local/spark/examples/spark-examples_2.10-1.1.1.jar"

jars (string)
JARs that are added to spark-submit via the "--jars" argument. As with appjar, the &workingDirectory& variable can be used here as well.
Default: ""

packages (string)
Packages that are added to spark-submit via the "--packages" argument.
Default: ""

conf (string)
The value for the "--conf" argument of spark-submit. Each "key=value" pair is separated by a newline (see the example after this table).
Default: ""

impersonateUser (string)
The user name used to access YARN and HDFS, and, if the cluster is kerberized, the value for the "--proxy-user" argument of spark-submit.
Default: "vora"

shutdownOnFailure (boolean)
Specify true if the component should exit when a single error occurs; otherwise, specify false.
Default: false

secContext (string)
The security context to be used to connect to the Hadoop system.
Default: "default"

Input

args (string)
A string of arguments that is passed to the JAR that is executed.

binary (blob)
An application JAR that is executed. If this input is connected, appjar is ignored.
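For example, with the default org.apache.spark.examples.SparkPi class, sending the string "100" to args would run SparkPi with 100 slices, since that Spark example reads the slice count from its first argument.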

Output

success (string)
On success of a Spark job, outputs the corresponding arguments passed in via args, with "done" appended.

failure (string)
On failure of a Spark job, outputs the corresponding arguments passed in via args, with "error" appended.
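To make the class and appjar contract concrete, the following is a minimal sketch of a Spark application whose main class could be supplied via the class parameter and whose compiled JAR via appjar or the binary input. The object name MyApp and its workload are hypothetical, not part of the operator:

  // Hypothetical minimal Spark application (MyApp is an assumed name).
  // Arguments sent to the operator's args input arrive here as `args`.
  import org.apache.spark.{SparkConf, SparkContext}

  object MyApp {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("MyApp")
      val sc = new SparkContext(conf)

      // Example workload: count a trivial distributed range.
      val n = if (args.nonEmpty) args(0).toInt else 100
      val count = sc.parallelize(1 to n).count()
      println(s"Counted $count elements")

      sc.stop()
    }
  }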