Modeling Guide

Read File

The Read File operator reads a file, or periodically polls a directory for its contents, from a storage service.

Supported services are:
  • Azure Data Lake Store (ADLS)

  • Local File System (file)

  • Google Cloud Storage (GCS)

  • HDFS

  • Amazon S3

  • Azure Storage Blob (WASB)

  • WebHDFS

Polling directories: when the given path points to a directory, this operator polls all files inside that directory. Sub-directories are included only if the recursive parameter is set to true.

Configuration Parameters

Parameter

Type

Description

service

string

The file service to use. Additional parameters may depend on the selected service.

Default: "file"

path

string

A directory to be polled (ends with /) or a file to be read. This only applies if inPath is not connected.

Default: "/tmp/test.txt"

deleteAfterSend

bool

A flag that indicates whether the file should be deleted after its contents have been sent.

Default: false

chunkSize

string

The maximum number of bytes that can be read from files at once. It reads the bytes in blocks until it reaches the end of the file. This can be used to reduce graph latency and memory usage.

If chunkSize is zero, each file is read in a single chunk. Otherwise, the file is split into chunks of at most `chunkSize` bytes. The value may be customized dynamically through the message header storage.chunkSize. This field accepts metric prefixes and an optional "i" to indicate binary (power-of-two) multiples. For example:
  • "0", "0B": unlimited
  • "1", "1B": 1 byte
  • "2kb", "2KB": 2*10^3 bytes
  • "2kib", "2KiB": 2*2^10 bytes
  • "3mb", "3MB": 3*10^6 bytes
  • "3mib", "3MiB": 3*2^20 bytes
Prefixes for Giga (G), Tera (T), and Peta (P) are also supported.

Default: "0"
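
The chunkSize notation can be illustrated with a small parser. This is only a sketch of how such strings might be interpreted, not the operator's actual implementation: the function name parse_chunk_size is hypothetical, and it assumes that only the forms shown above are accepted (a plain number, or a number followed by an optional prefix, an optional "i", and "b").

```python
import re

# Exponent for each supported prefix letter (kilo through peta).
_PREFIX_POWER = {"": 0, "k": 1, "m": 2, "g": 3, "t": 4, "p": 5}

# Digits, then optionally: [prefix[i]]b, case-insensitive (assumed grammar).
_CHUNK_RE = re.compile(r"^\s*(\d+)\s*(?:(?:([kmgtp])(i)?)?b)?\s*$", re.IGNORECASE)


def parse_chunk_size(text: str) -> int:
    """Return the chunk size in bytes; 0 means the file is read in one chunk."""
    match = _CHUNK_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported chunk size: {text!r}")
    number, prefix, binary = match.groups()
    power = _PREFIX_POWER[(prefix or "").lower()]
    base = 1024 if binary else 1000  # "i" selects powers of two
    return int(number) * base ** power
```

Under these assumptions, parse_chunk_size("2KiB") yields 2048 and parse_chunk_size("3mb") yields 3000000.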

numRetryAttempts

int

The number of times to retry a connection.

Default: 0

retryPeriodInMs

int

The time interval in milliseconds between connection attempts.

Default: 0
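
The interaction of numRetryAttempts and retryPeriodInMs can be pictured as a simple retry loop. The sketch below is illustrative only: connect_with_retries and its connect callback are hypothetical names, and it assumes that numRetryAttempts counts the retries made after the first failed attempt (so the default of 0 means a single attempt).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retries(connect: Callable[[], T],
                         num_retry_attempts: int = 0,
                         retry_period_in_ms: int = 0) -> T:
    """Call connect(); on failure retry up to num_retry_attempts more times."""
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            if attempt >= num_retry_attempts:
                raise  # retries exhausted: surface the last error
            attempt += 1
            # Wait for the configured period before the next attempt.
            time.sleep(retry_period_in_ms / 1000.0)
```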

pollPeriodInMs

int

The time interval in milliseconds between successive polls. If no interval is needed, the value `0` should be used.

Default: 1000

batchRead

bool

A flag that controls whether all files in a polled directory are read as a single batch. If set to true and a directory is being polled, outFilename receives a list with one filename per line, and outFile receives a message whose body is a list of messages, each of which corresponds to a single file.

Default: false

recursive

bool

A flag that controls whether a directory listing should recursively include all sub-directories.

Default: false

pattern

string

A regular expression used to filter file paths before reading them. If empty, all files are accepted. The expression is applied to a path after it has been converted to an absolute path, and only if:
  • it came from inPath and points to a regular file (not a directory); or

  • it is a file found in the polled directory, regardless of whether the directory was specified by inPath or in path.

Default: ""
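
As an illustration of the filtering described for pattern, the sketch below converts each path to an absolute one and then applies the regular expression. It is not the operator's implementation: filter_paths and base_dir are hypothetical names, and it assumes an unanchored match (the expression may match anywhere in the path) using Python's regular expression dialect, which may differ from the one the operator uses.

```python
import posixpath
import re
from typing import Iterable, List


def filter_paths(paths: Iterable[str], pattern: str, base_dir: str = "/") -> List[str]:
    """Keep the paths whose absolute form matches pattern; an empty pattern accepts all."""
    # Relative paths are resolved against base_dir; absolute paths are kept.
    absolute = [posixpath.normpath(posixpath.join(base_dir, p)) for p in paths]
    if pattern == "":
        return absolute
    regex = re.compile(pattern)
    return [p for p in absolute if regex.search(p)]
```

For example, with base_dir set to "/data", the pattern `\.csv$` would keep "a.csv" and "sub/c.csv" (as "/data/a.csv" and "/data/sub/c.csv") and drop "b.txt".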

onlyReadOnChange

bool

If true, a file is output only if it is new or has changed, which avoids repeatedly reading unchanged files. Changes are detected from the latest modification time reported by the file system, not from the actual file contents.

Default: false
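
The change detection described for onlyReadOnChange can be sketched for the local file system as follows. This is a minimal illustration under stated assumptions, not the operator's code: changed_files and the seen dictionary are hypothetical, and any difference in modification time is treated as a change.

```python
import os
from typing import Dict, List


def changed_files(directory: str, seen: Dict[str, float],
                  recursive: bool = False) -> List[str]:
    """Return files that are new or whose modification time differs from the last poll."""
    changed = []
    for root, dirs, files in os.walk(directory):
        if not recursive:
            dirs.clear()  # do not descend into sub-directories
        for name in sorted(files):
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if seen.get(path) != mtime:
                seen[path] = mtime  # remember the time seen in this poll
                changed.append(path)
    return changed
```

Calling this function once per poll period with the same seen dictionary would report each file on the first poll and afterwards only when its modification time changes, even if the contents are identical.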

terminateOnError

boolean

Specifies whether the graph should terminate when the operator fails.

Default: "true"

connection

object

Holds the connection information for the selected service.

Default:

configurationType

string

connection parameter: The type of connection information to use: Manual (user input) or retrieved from the Connection Management Service.

Default: ""

connectionID

string

connection parameter: The ID of the connection information to retrieve from the Connection Management Service.

Default: ""

connectionProperties

object

connection parameter: All the connection properties for the selected service for manual input.

clientId

string

ADL parameter: Mandatory. The client ID from ADLS.

Default: ""

tenantId

string

ADL parameter: Mandatory. The tenant ID from ADLS.

Default: ""

clientKey

string

ADL parameter: Mandatory. The client key from ADLS.

Default: ""

accountName

string

ADL parameter: Mandatory. The account name from ADLS.

Default: ""

rootPath

string

ADL parameter: The optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder).

Default: "/MyFolder/MySubfolder"

host

string

HDFS parameter: Mandatory. The IP address of the Hadoop name node.

Default: "127.0.0.1"

port

string

HDFS parameter: Mandatory. The port of the Hadoop name node.

Default: "9000"

user

string

HDFS parameter: Mandatory. The Hadoop user name.

Default: "hdfs"

rootPath

string

HDFS parameter: The optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder).

Default: "/MyFolder/MySubfolder"

keyFile

string

GCS parameters: Mandatory. The service account JSON key.

Default: ""

projectId

string

GCS parameters: Mandatory. The ID of the project to use.

Default: "projectID"

rootPath

string

GCS parameters: The optional root path name for browsing. Starts with a slash and the **bucket** name (e.g. /MyBucket/MyFolder).

Default: "/MyBucket/MyFolder"

accessKey

string

S3 parameter: Mandatory. The AWS access key ID.

Default: "AWSAccessKeyId"

secretKey

string

S3 parameter: Mandatory. The AWS secret access key.

Default: "AWSSecretAccessKey"

endpoint

string

S3 parameter: An optional custom endpoint (e.g. http://awsEndpointURL).

Default: ""

awsProxy

string

S3 parameter: The optional proxy URL.

Default: ""

region

string

S3 parameter: Mandatory. The AWS region to create the bucket in.

Default: "eu-central-1"

rootPath

string

S3 parameter: Mandatory. The optional root path name for browsing. Starts with a slash and the bucket name (e.g. /MyBucket/MyFolder).

Default: "/MyBucket/MyFolder"

protocol

string

S3 parameter: Mandatory. The protocol schema to be used (HTTP or HTTPS).

Default: "HTTP"

accountName

string

WASB parameter: Mandatory. The account name from WASB.

Default: ""

accountKey

string

WASB parameter: Mandatory. The account key from WASB.

Default: ""

rootPath

string

WASB parameter: Mandatory. The optional root path name for browsing. Starts with a slash and the **container** name (e.g. /MyContainer/MyFolder).

Default: "/MyContainer/MyFolder"

protocol

boolean

WASB parameter: The protocol schema to be used (WASBS/HTTPS or WASB/HTTP)

Default: true

rootPath

string

WebHDFS parameter: The optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder).

Default: "/MyFolder/MySubfolder"

protocol

string

WebHDFS parameter: Mandatory. The scheme used on WebHDFS connection (webhdfs/http or swebhdfs/https).

Default: "webhdfs"

host

string

WebHDFS parameter: Mandatory. The IP address of the WebHDFS node.

Default: "127.0.0.1"

port

string

WebHDFS parameter: Mandatory. The port of the WebHDFS node.

Default: "9000"

user

string

WebHDFS parameter: Mandatory. The WebHDFS user name.

Default: "hdfs"

webhdfsToken

string

WebHDFS parameter: The token used to authenticate to WebHDFS.

Default: ""

webhdfsOAuthToken

string

WebHDFS parameter: The OAuth token used to authenticate to WebHDFS.

Default: ""

webhdfsDoAs

string

WebHDFS parameter: The user to impersonate. Has to be used together with webhdfsUser.

Default: ""

Input

Input

Type

Description

inPath

string

The path (relative or absolute) of a file or directory (ends with /) to be read.

Output

Output

Type

Description

outFilename

string

The path of the file. This will be equal to the path that prompted the reading (either inPath or path).

outFile

message

A message whose headers describe the file read and whose body contains the file's contents as a blob. The message contains the following headers:
  • storage.chunkIndex (type int)
    The current chunk's index, starting from zero.
  • storage.chunkCount (type int)
    The total number of chunks for this file.
  • storage.fileSize (type int64)
    The size of the file, in bytes.
  • storage.endOfFile (type bool)
    A flag that indicates whether this chunk is the last one. This is simply a convenience: endOfFile == (chunkIndex == chunkCount-1).
  • storage.filename (type string)
    The name of the file, including extension (if any), but excluding its directory.
  • storage.directory (type string)
    The absolute path of the directory where the file resides.
  • storage.path (type string)
    The absolute path of the file (equivalent to <directory>/<filename>).
  • storage.polledDirectory (type string)
    The absolute path to the directory being polled, if applicable.
  • storage.pathInPolledDirectory (type string)
    The file path relative to polledDirectory. This is the result of subtracting polledDirectory from directory.
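
When chunkSize is non-zero, a downstream consumer may want to reassemble the chunks of each file. The sketch below shows one way the headers listed above could be used for that. It is only an illustration: ChunkAssembler is a hypothetical helper, each outFile message is modeled as a plain dictionary of headers plus a bytes body, and chunks of a given file are assumed to arrive in order.

```python
from typing import Dict, List, Optional, Tuple


class ChunkAssembler:
    """Collects outFile chunks per file and returns the full contents at end of file."""

    def __init__(self) -> None:
        # Chunks received so far, keyed by the absolute file path.
        self._parts: Dict[str, List[bytes]] = {}

    def add(self, headers: dict, body: bytes) -> Optional[Tuple[str, bytes]]:
        path = headers["storage.path"]
        index = headers["storage.chunkIndex"]
        parts = self._parts.setdefault(path, [])
        if index != len(parts):  # assumes in-order delivery, starting at zero
            raise ValueError(f"unexpected chunk {index} for {path}")
        parts.append(body)
        if not headers["storage.endOfFile"]:
            return None  # more chunks to come
        data = b"".join(self._parts.pop(path))
        if len(data) != headers["storage.fileSize"]:
            raise ValueError(f"size mismatch for {path}")
        return path, data
```

The add method returns None until the chunk flagged with storage.endOfFile arrives, at which point it returns the file path together with the concatenated contents.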