Read File
The Read File operator reads a file, or periodically polls a directory for its contents, in a storage service. The following services are supported:
- Azure Data Lake Store (ADLS)
- Local File System (file)
- Google Cloud Storage (GCS)
- HDFS
- Amazon S3
- Azure Storage Blob (WASB)
- WebHDFS
Polling directories: when the given path points to a directory, this operator polls all files inside that directory; if the `recursive` flag is set, sub-directories are included as well.
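The polling behavior can be approximated with the short sketch below. This is a hypothetical stand-alone illustration, not the operator's actual implementation; it lists the files one polling cycle would pick up given the `path`, `recursive`, and `pattern` settings described in the configuration table.

```python
import os
import re

def poll(path, recursive=False, pattern=""):
    """Return the files one polling cycle would pick up.

    Mirrors the documented behavior: `path` names a directory,
    `recursive` includes sub-directories, and `pattern` (a regular
    expression; empty accepts everything) filters absolute file paths.
    """
    regex = re.compile(pattern) if pattern else None
    matches = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            abs_path = os.path.abspath(os.path.join(root, name))
            if regex is None or regex.search(abs_path):
                matches.append(abs_path)
        if not recursive:
            break  # only the top-level directory
    return sorted(matches)
```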
Configuration Parameters
Parameter | Type | Description
---|---|---
service | string | The file service to operate on. Additional parameters may depend on the selected service. Default: "file"
path | string | A directory to be polled (ends with `/`) or a file to be read. Only used if inPath is not connected. Default: "/tmp/test.txt"
deleteAfterSend | bool | Whether the file should be deleted after its contents have been sent. Default: false
chunkSize | string | The maximum number of bytes that can be read from a file at once. Bytes are read in blocks until the end of the file is reached, which can reduce graph latency and memory usage. If chunkSize is zero, the file is read in a single chunk; otherwise it is broken into chunks of at most `chunkSize` bytes. The value may be overridden per message through the `storage.chunkSize` header. Metric prefixes are allowed, with an optional "i" to indicate binary bases (e.g. "1K" for 1000 bytes, "1Ki" for 1024 bytes). Default: "0"
numRetryAttempts | int | The number of times to retry a connection. Default: 0
retryPeriodInMs | int | The time interval in milliseconds between connection attempts. Default: 0
pollPeriodInMs | int | The time interval in milliseconds between successive polls. Use `0` if no interval is needed. Default: 1000
batchRead | bool | Whether all files should be read in batches. If set to true and a directory is being polled, outFilename is given a list with one filename per line, and outFile is a message whose body is a list of messages, each corresponding to a single file. Default: false
recursive | bool | Whether a directory listing should recursively include all sub-directories. Default: false
pattern | string | A regular expression used to filter file paths before reading them. If empty, all files are accepted. The expression is applied to the path after it has been converted to an absolute path. Default: ""
onlyReadOnChange | bool | If true, a file is only output when it is new or has changed, which avoids repeatedly reading unchanged files. Change detection uses the last-modification time reported by the file system rather than the actual file contents. Default: false
terminateOnError | boolean | Whether the graph should terminate when the operator fails. Default: "true"
connection | object | Holds connection information for the services.
configurationType | string | connection parameter: The type of connection information to use: manual (user input) or retrieved from the Connection Management Service. Default: ""
connectionID | string | connection parameter: The ID of the connection information to retrieve from the Connection Management Service. Default: ""
connectionProperties | object | connection parameter: All the connection properties for the selected service, for manual input.
clientId | string | ADL parameter: Mandatory. The client ID from ADLS. Default: ""
tenantId | string | ADL parameter: Mandatory. The tenant ID from ADLS. Default: ""
clientKey | string | ADL parameter: Mandatory. The client key from ADLS. Default: ""
accountName | string | ADL parameter: Mandatory. The account name from ADLS. Default: ""
rootPath | string | ADL parameter: The optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder). Default: "/MyFolder/MySubfolder"
host | string | HDFS parameter: Mandatory. The IP address of the Hadoop name node. Default: "127.0.0.1"
port | string | HDFS parameter: Mandatory. The port of the Hadoop name node. Default: "9000"
user | string | HDFS parameter: Mandatory. The Hadoop user name. Default: "hdfs"
rootPath | string | HDFS parameter: The optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder). Default: "/MyFolder/MySubfolder"
keyFile | string | GCS parameter: Mandatory. The service account JSON key. Default: ""
projectId | string | GCS parameter: Mandatory. The ID of the project to be used. Default: "projectID"
rootPath | string | GCS parameter: The optional root path name for browsing. Starts with a slash and the **bucket** name (e.g. /MyBucket/MyFolder). Default: "/MyBucket/MyFolder"
accessKey | string | S3 parameter: Mandatory. The AWS access key ID. Default: "AWSAccessKeyId"
secretKey | string | S3 parameter: Mandatory. The AWS secret access key. Default: "AWSSecretAccessKey"
endpoint | string | S3 parameter: Allows a custom endpoint, e.g. http://awsEndpointURL. Default: ""
awsProxy | string | S3 parameter: The optional proxy URL. Default: ""
region | string | S3 parameter: Mandatory. The AWS region to create the bucket in. Default: "eu-central-1"
rootPath | string | S3 parameter: The optional root path name for browsing. Starts with a slash and the bucket name (e.g. /MyBucket/MyFolder). Default: "/MyBucket/MyFolder"
protocol | string | S3 parameter: Mandatory. The protocol schema to be used (HTTP or HTTPS). Default: "HTTP"
accountName | string | WASB parameter: Mandatory. The account name from WASB. Default: ""
accountKey | string | WASB parameter: Mandatory. The account key from WASB. Default: ""
rootPath | string | WASB parameter: The optional root path name for browsing. Starts with a slash and the **container** name (e.g. /MyContainer/MyFolder). Default: "/MyContainer/MyFolder"
protocol | boolean | WASB parameter: The protocol schema to be used (WASBS/HTTPS or WASB/HTTP). Default: true
rootPath | string | WebHDFS parameter: The optional root path name for browsing. Starts with a slash (e.g. /MyFolder/MySubfolder). Default: "/MyFolder/MySubfolder"
protocol | string | WebHDFS parameter: Mandatory. The scheme used for the WebHDFS connection (webhdfs/http or swebhdfs/https). Default: "webhdfs"
host | string | WebHDFS parameter: Mandatory. The IP address of the WebHDFS node. Default: "127.0.0.1"
port | string | WebHDFS parameter: Mandatory. The port of the WebHDFS node. Default: "9000"
user | string | WebHDFS parameter: Mandatory. The WebHDFS user name. Default: "hdfs"
webhdfsToken | string | WebHDFS parameter: The token used to authenticate to WebHDFS. Default: ""
webhdfsOAuthToken | string | WebHDFS parameter: The OAuth token used to authenticate to WebHDFS. Default: ""
webhdfsDoAs | string | WebHDFS parameter: The user to impersonate. Has to be used together with webhdfsUser. Default: ""
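The chunkSize notation above (metric prefixes plus an optional "i" for binary bases) can be sketched as follows. The exact set of accepted prefixes is an assumption based on the description, not taken from the operator's source.

```python
# Decimal (metric) and binary multipliers for chunkSize-style strings.
# Assumed prefix set; the operator may accept more.
_DECIMAL = {"K": 10**3, "M": 10**6, "G": 10**9}
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def parse_size(value):
    """Parse a size string such as "0", "512", "1K", or "4Mi" into bytes."""
    value = value.strip()
    # Check binary suffixes first, since "Ki" also ends in a decimal "i"-less form.
    for suffix, factor in _BINARY.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    for suffix, factor in _DECIMAL.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # plain byte count; "0" means a single chunk
```

For example, `parse_size("1K")` yields 1000 bytes while `parse_size("1Ki")` yields 1024.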
Input
Input | Type | Description
---|---|---
inPath | string | The path (relative or absolute) of a file or directory (ends with `/`) to be read.
Output
Output | Type | Description
---|---|---
outFilename | string | The path of the file. This is equal to the path that prompted the reading (either inPath or path).
outFile | message | A message whose headers describe the file read and whose body contains the file's contents as a blob.
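As a rough illustration of the onlyReadOnChange behavior described in the configuration table, the sketch below tracks last-modification times and skips unchanged files. It is a hypothetical stand-in, not the operator's code; the class and method names are invented for this example.

```python
import os

class ChangeTracker:
    """Skip files whose modification time has not changed since last seen."""

    def __init__(self):
        self._seen = {}  # path -> last observed mtime

    def should_read(self, path):
        mtime = os.path.getmtime(path)
        if self._seen.get(path) == mtime:
            return False  # unchanged since the previous poll
        self._seen[path] = mtime
        return True
```

Note that, as with the operator itself, a rewrite that preserves the modification time would go undetected, since only the file system's timestamp is consulted, not the contents.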