Modeling Guide

Write File

The Write File operator writes files to a storage service.

Supported services are:
  • Azure Data Lake Store (ADLS)

  • Local File System (file)

  • Google Cloud Storage (GCS)

  • HDFS

  • Amazon S3

  • Azure Storage Blob (WASB)

  • WebHDFS

Configuration Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| mode | string | Controls whether the target file should be appended to, created (avoiding overwrites), or overwritten (truncated if it already exists, created if it does not exist). May be set dynamically through the message header `storage.writeMode`. Default: "append" |
| path | string | A formatted string describing the output path for files. See Path Formatting below for details and examples. Default: `"/tmp/file_<counter>.txt"` |
| numRetryAttempts | int | The number of times to retry a connection. Default: 0 |
| retryPeriodInMs | int | The time interval in milliseconds between connection attempts. Default: 0 |
| terminateOnError | boolean | Whether the graph should terminate when the operator fails. Default: true |
| connection | object | Holds the connection information for the services. |
| configurationType | string | Connection parameter: which type of connection information is used: manual (user input) or retrieved from the Connection Management Service. Default: "" |
| connectionID | string | Connection parameter: the ID of the connection information to retrieve from the Connection Management Service. Default: "" |
| connectionProperties | object | Connection parameter: all the connection properties for the selected service, for manual input. |
| clientId | string | ADLS parameter (mandatory): the client ID from ADLS. Default: "" |
| tenantId | string | ADLS parameter (mandatory): the tenant ID from ADLS. Default: "" |
| clientKey | string | ADLS parameter (mandatory): the client key from ADLS. Default: "" |
| accountName | string | ADLS parameter (mandatory): the account name from ADLS. Default: "" |
| rootPath | string | ADLS parameter: the optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder). Default: "/MyFolder/MySubfolder" |
| host | string | HDFS parameter (mandatory): the IP address of the Hadoop name node. Default: "127.0.0.1" |
| port | string | HDFS parameter (mandatory): the port of the Hadoop name node. Default: "9000" |
| user | string | HDFS parameter (mandatory): the Hadoop user name. Default: "hdfs" |
| rootPath | string | HDFS parameter: the optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder). Default: "/MyFolder/MySubfolder" |
| keyFile | string | GCS parameter (mandatory): the service account JSON key. Default: "" |
| projectId | string | GCS parameter (mandatory): the ID of the project to use. Default: "projectID" |
| rootPath | string | GCS parameter: the optional root path name for browsing. Starts with a slash and the bucket name (e.g. /MyBucket/MyFolder). Default: "/MyBucket/MyFolder" |
| accessKey | string | S3 parameter (mandatory): the AWS access key ID. Default: "AWSAccessKeyId" |
| secretKey | string | S3 parameter (mandatory): the AWS secret access key. Default: "AWSSecretAccessKey" |
| endpoint | string | S3 parameter: an optional custom endpoint URL (e.g. http://awsEndpointURL). Default: "" |
| awsProxy | string | S3 parameter: the optional proxy URL. Default: "" |
| region | string | S3 parameter (mandatory): the AWS region to create the bucket in. Default: "eu-central-1" |
| rootPath | string | S3 parameter: the optional root path name for browsing. Starts with a slash and the bucket name (e.g. /MyBucket/MyFolder). Default: "/MyBucket/MyFolder" |
| protocol | string | S3 parameter (mandatory): the protocol scheme to be used (HTTP or HTTPS). Default: "HTTP" |
| accountName | string | WASB parameter (mandatory): the account name from WASB. Default: "" |
| accountKey | string | WASB parameter (mandatory): the account key from WASB. Default: "" |
| rootPath | string | WASB parameter: the optional root path name for browsing. Starts with a slash and the container name (e.g. /MyContainer/MyFolder). Default: "/MyContainer/MyFolder" |
| protocol | boolean | WASB parameter: the protocol scheme to be used (WASBS/HTTPS or WASB/HTTP). Default: true |
| rootPath | string | WebHDFS parameter: the optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder). Default: "/MyFolder/MySubfolder" |
| protocol | string | WebHDFS parameter (mandatory): the scheme used for the WebHDFS connection (webhdfs/http or swebhdfs/https). Default: "webhdfs" |
| host | string | WebHDFS parameter (mandatory): the IP address of the WebHDFS node. Default: "127.0.0.1" |
| port | string | WebHDFS parameter (mandatory): the port of the WebHDFS node. Default: "9000" |
| user | string | WebHDFS parameter (mandatory): the WebHDFS user name. Default: "hdfs" |
| webhdfsToken | string | WebHDFS parameter: the token to authenticate to WebHDFS with. Default: "" |
| webhdfsOAuthToken | string | WebHDFS parameter: the OAuth token to authenticate to WebHDFS with. Default: "" |
| webhdfsDoAs | string | WebHDFS parameter: the user to impersonate. Has to be used together with `webhdfsUser`. Default: "" |
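For illustration only, the sketch below shows how these parameters might be assembled for a manual S3 connection. It is expressed as a Python dict; all credential values are placeholders, and the exact layout expected by your graph configuration may differ:

```python
# Hypothetical Write File configuration (placeholder values throughout);
# key names follow the parameter table above.
write_file_config = {
    "mode": "overwrite",                # append | create | overwrite
    "path": "reports/<date>/${kafka.topic}.csv",
    "numRetryAttempts": 3,
    "retryPeriodInMs": 1000,
    "terminateOnError": True,
    "connection": {
        "configurationType": "Manual",  # manual input, not the Connection Management Service
        "connectionProperties": {
            "accessKey": "AWSAccessKeyId",      # placeholder
            "secretKey": "AWSSecretAccessKey",  # placeholder
            "region": "eu-central-1",
            "protocol": "HTTPS",
            "rootPath": "/MyBucket/MyFolder",
        },
    },
}
```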

Input

| Input | Type | Description |
| --- | --- | --- |
| inFile | message | A message whose body (blob) will be written to a file. There are no requirements on the message's headers other than those referenced by the path and mode configuration parameters. |

Output

| Output | Type | Description |
| --- | --- | --- |
| outFilename | string | The path to the file to which content is written or appended. Whether this path is relative or absolute depends on how it was given in the path configuration. |
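To make the mode semantics concrete, here is a minimal sketch of the write step for the local file system. The `write_blob` function and its signature are assumptions for illustration, not the operator's implementation:

```python
import os

def write_blob(path: str, blob: bytes, mode: str = "append") -> str:
    """Write a message body to a file, mimicking the mode semantics
    described above; returns the path, as the outFilename port does."""
    if mode == "append":
        flags = "ab"   # create if missing, append if present
    elif mode == "create":
        flags = "xb"   # avoid overwrites: fail if the file already exists
    elif mode == "overwrite":
        flags = "wb"   # truncate if it already exists, create if it does not
    else:
        raise ValueError(f"unknown mode: {mode}")
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, flags) as f:
        f.write(blob)
    return path
```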

Path Formatting

Strings in the path configuration are subject to the following rules (an illustrative implementation is sketched after the Limitations list below):

  • Schemes can be invoked using angle brackets: the string <foo> will be replaced by the result of the scheme named "foo". Available schemes are:
    • counter: an incremental integer

    • date: the current local date in the format YYYYMMDD

    • time: the current local time in the format HHMMSS

    Any other (unrecognized) scheme name causes an error to be thrown.

  • Message headers can be queried using ${...}. For example, ${bar} will be replaced by the value of the header "bar" in the message given to inFile. Note that the dollar sign must always be escaped with a backslash, otherwise it is treated as a substitution parameter.
    • A default value can be set using an equals sign: ${bar=lorem} will be replaced by the value "lorem" whenever the input message lacks the "bar" header. If no default value is set and the message is missing the header, an error will be thrown.

  • Anything else (not enclosed between < and > or ${ and }) will be left untouched.

Limitations

  • The following characters cannot appear in scheme or message header names: < > $ { }.

  • Empty scheme (<>) or header (${}) names will be left untouched.
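The rules above amount to a small template language. The following is an illustrative Python implementation of them, not the operator's actual code; the `format_path` name and the `headers` dict argument are assumptions:

```python
import re
from datetime import datetime
from itertools import count

_counter = count()  # shared incremental integer for the counter scheme

def format_path(template: str, headers: dict) -> str:
    """Illustrative implementation of the path-formatting rules above."""
    def expand_scheme(match: re.Match) -> str:
        name = match.group(1)
        if name == "counter":
            return str(next(_counter))
        if name == "date":
            return datetime.now().strftime("%Y%m%d")
        if name == "time":
            return datetime.now().strftime("%H%M%S")
        raise ValueError(f"unrecognized scheme: {name}")

    def expand_header(match: re.Match) -> str:
        name, sep, default = match.group(1).partition("=")
        if name in headers:
            return str(headers[name])
        if sep:  # an '=' was present, so a default value was given
            return default
        raise KeyError(f"missing header with no default: {name}")

    # Names cannot contain < > $ { }; empty names (<> and ${}) do not
    # match because of the '+' quantifiers, so they are left untouched.
    template = re.sub(r"<([^<>${}]+)>", expand_scheme, template)
    return re.sub(r"\$\{([^<>${}]+)\}", expand_header, template)
```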

Example: Basic Usage

Suppose you have messages coming from a Kafka Consumer whose topic describes the type of sensor that sent the message. Then, you can use a path such as:
mydir_<date>_<time>/${kafka.topic}.csv
to produce output files like the following:
mydir_20170131_234550/noise.csv
mydir_20170201_080010/temperature.csv
mydir_20170201_080010/humidity.csv
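Using the illustrative `format_path` sketch from the Path Formatting section, this expansion could be reproduced as follows (the header value here is invented for the demonstration):

```python
headers = {"kafka.topic": "noise"}
print(format_path("mydir_<date>_<time>/${kafka.topic}.csv", headers))
# prints something like: mydir_20170131_234550/noise.csv
```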

Example: Copying Directories

If we want to reproduce an entire directory structure, we can set a File, HDFS, or S3 Consumer to poll this directory and use the storage.pathInPolledDirectory header to refer to each file's location in it:
outputDir/${storage.pathInPolledDirectory}
can be expanded to, for example:
outputDir/YearReport.docx
outputDir/January/sales.pdf
outputDir/January/finance.xlsx
outputDir/February/sales.pdf
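The same illustrative `format_path` helper shows how the header drives this expansion (the header value below is invented):

```python
headers = {"storage.pathInPolledDirectory": "January/sales.pdf"}
print(format_path("outputDir/${storage.pathInPolledDirectory}", headers))
# prints: outputDir/January/sales.pdf
```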