Modeling Guide for SAP Data Hub

Read File

The Read File operator is used to read a file or periodically poll a directory for its contents in a storage service.

The operation takes only one input parameter: the path of the file. This is given as a string in the body of a message in the inPath port. If no input is connected to the port, the operator will periodically poll from the configured path.

The file content is output as the body of a message in the outFile port. Further details of the operation are reported as headers of the message, as listed in the port documentation.

An example of usage is given in the com.sap.demo.file graph.

Polling directories: when the given path points to a directory, this operator polls all files inside that directory. Sub-directories are included only if the recursive parameter is set to true.
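The effect of a single poll, with and without recursion, can be illustrated with a small Python sketch. The function name list_polled_files is hypothetical and is not part of the operator; the sketch only mimics the file listing on a local file system and says nothing about how the operator is implemented.

```python
import os
from typing import List


def list_polled_files(directory: str, recursive: bool = False) -> List[str]:
    """Return the sorted absolute paths of the regular files under `directory`.

    Hypothetical helper: it imitates one poll of a directory and descends
    into sub-directories only when `recursive` is true.
    """
    root = os.path.abspath(directory)
    found = []
    if recursive:
        # os.walk visits the root and every sub-directory below it.
        for current_dir, _subdirs, filenames in os.walk(root):
            for name in filenames:
                found.append(os.path.join(current_dir, name))
    else:
        # Only the direct children of the root are considered.
        with os.scandir(root) as entries:
            for entry in entries:
                if entry.is_file():
                    found.append(entry.path)
    return sorted(found)
```

With recursive left at false, a file such as sub/b.txt below the polled directory is not listed; with recursive set to true, it is.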

Configuration Parameters

Parameter

Type

Description

service

string

The file service to operate. Additional parameters may depend on the selected service.

Default: "file"

path

string

A directory to be polled (ends with /) or a file to be read. This only applies if inPath is not connected.

Default: "/tmp/test.txt"

deleteAfterSend

bool

A flag that indicates whether the file should be deleted after its contents have been sent.

Default: false

chunkSize

string

The maximum number of bytes that can be read from a file at once. The operator reads the file in blocks of at most this size until it reaches the end of the file. This can be used to reduce graph latency and memory usage.

If chunkSize is zero, each file is read in a single chunk. Otherwise, the file is split into chunks of at most `chunkSize` bytes. The value may be overridden dynamically through the message header storage.chunkSize. This field accepts metric prefixes and an optional "i" to indicate binary bases. For example:
  • "0", "0B": unlimited
  • "1", "1B": 1 byte
  • "2kb", "2KB": 2*10^3 bytes
  • "2kib", "2KiB": 2*2^10 bytes
  • "3mb", "3MB": 3*10^6 bytes
  • "3mib", "3MiB": 3*2^20 bytes
Prefixes for Giga (G), Tera (T), and Peta (P) are also supported.

Default: "0"
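To make the size notation concrete, the following Python sketch converts such strings into a byte count. It is an illustration only: parse_chunk_size is a hypothetical name, and treating the trailing "B" as optional after a prefix (for example "2k") is an assumption, since the examples above only show it omitted for plain numbers.

```python
import re

# number, optional prefix (k, m, g, t, p), optional "i" for binary, optional "b"
_SIZE_RE = re.compile(r"(\d+)(?:([kmgtp])(i)?)?b?", re.IGNORECASE)
_EXPONENT = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5}


def parse_chunk_size(text: str) -> int:
    """Convert a size string such as "2KiB" into a number of bytes.

    A result of 0 stands for "unlimited", that is, a single chunk.
    """
    match = _SIZE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("unsupported chunk size: %r" % text)
    number, prefix, binary = match.groups()
    if prefix is None:
        return int(number)
    base = 1024 if binary else 1000  # "i" selects powers of 2^10
    return int(number) * base ** _EXPONENT[prefix.lower()]
```

For instance, parse_chunk_size("2KB") yields 2000 and parse_chunk_size("2KiB") yields 2048.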

numRetryAttempts

int

The number of times to retry a connection.

Default: 0

retryPeriodInMs

int

The time interval in milliseconds between connection attempts.

Default: 0
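How numRetryAttempts and retryPeriodInMs interact can be sketched as a plain retry loop in Python. The helper with_retries is hypothetical, and the sketch assumes that the first attempt does not count as a retry, so an action runs at most numRetryAttempts + 1 times; the operator's exact behavior is not stated here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T],
                 num_retry_attempts: int = 0,
                 retry_period_in_ms: int = 0) -> T:
    """Run `action`, retrying on OSError (which includes ConnectionError).

    Assumption: the initial attempt is not counted as a retry.
    """
    retries_done = 0
    while True:
        try:
            return action()
        except OSError:
            if retries_done >= num_retry_attempts:
                raise  # retries exhausted: propagate the last error
            retries_done += 1
            time.sleep(retry_period_in_ms / 1000.0)
```

With the defaults of 0 for both parameters, the first failure is propagated immediately.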

pollPeriodInMs

int

The time interval in milliseconds between successive polls. If no interval is needed, the value `0` should be used.

Default: 1000

batchRead

bool

A flag that controls whether all files should be read in batches. If set to true and a directory is being polled, outFilename receives a list with one filename per line, and outFile receives a message whose body is a list of messages, each of which corresponds to a single file.

Default: false
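The shape of the two batch outputs can be pictured with the following Python sketch. The dictionary layout with "headers" and "body" keys is a hypothetical stand-in for a message, and build_batch is an illustrative name; only the "one filename per line" and "list of messages" structure is taken from the description above.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]  # hypothetical stand-in: {"headers": ..., "body": ...}


def build_batch(paths: List[str]) -> Tuple[Message, Message]:
    """Build stand-ins for the outFilename and outFile messages of one batch."""
    file_messages = []
    for path in paths:
        with open(path, "rb") as handle:
            file_messages.append(
                {"headers": {"storage.path": path}, "body": handle.read()}
            )
    # outFilename: one filename per line.
    out_filename = {"headers": {}, "body": "\n".join(paths)}
    # outFile: a message whose body is a list of per-file messages.
    out_file = {"headers": {}, "body": file_messages}
    return out_filename, out_file
```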

recursive

bool

A flag that controls whether a directory listing should recursively include all sub-directories.

Default: false

pattern

string

A regular expression used to filter file paths before reading them. If empty, all files are accepted. The expression is applied to a path after it has been converted to an absolute path, and only if:
  • it came from inPath and points to a regular file (not a directory); or
  • it is a file found in the polled directory, regardless of whether the directory was specified by inPath or in path.

Default: ""
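A minimal Python sketch of this filter is shown below. The function name accepts is hypothetical, and two points are assumptions rather than documented behavior: the sketch uses Python's regular expression syntax, and it treats the pattern as unanchored, so it may match anywhere in the absolute path.

```python
import os
import re


def accepts(path: str, pattern: str) -> bool:
    """Return True if `path` passes the filter.

    An empty pattern accepts every file. Otherwise the pattern is searched
    in the absolute form of the path (assumption: unanchored match).
    """
    if pattern == "":
        return True
    return re.search(pattern, os.path.abspath(path)) is not None
```

Because the match runs against the absolute path, a pattern such as `\.csv$` selects files by extension, while a pattern containing a directory name selects files by location.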

onlyReadOnChange

bool

If true, a file is output only if it is new or has changed, which avoids repeatedly reading unchanged files. Changes are detected from the last-modification time reported by the file system, not from the actual file contents.

Default: false
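The modification-time check can be sketched in Python as follows. ChangeTracker is a hypothetical class, and keeping the timestamps in an in-memory dictionary is an assumption made for the example; it illustrates the idea of comparing timestamps instead of contents, nothing more.

```python
import os
from typing import Dict


class ChangeTracker:
    """Remember the last modification time seen for each file."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def should_read(self, path: str) -> bool:
        """Return True if the file is new or its modification time changed."""
        key = os.path.abspath(path)
        modified = os.stat(key).st_mtime_ns
        if self._last_seen.get(key) == modified:
            return False  # unchanged since the previous poll
        self._last_seen[key] = modified
        return True
```

A consequence of this approach is that a file rewritten with identical contents still counts as changed, because only its timestamp is compared.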

terminateOnError

boolean

Determines whether the graph should terminate when the operator fails.

Default: "true"

connection

object

Holds the connection information for the services. The connection parameters of each service are documented separately:

configurationType

string

connection parameter: Which type of connection information will be used: Manual (user input) or retrieved by the Connection Management Service.

Default: ""

connectionID

string

connection parameter: The ID of the connection information to retrieve from the Connection Management Service.

Default: ""

connectionProperties

object

connection parameter: All the connection properties for the selected service for manual input.

Input

Input

Type

Description

inPath

message

A message whose body is the path (relative or absolute) of a file or directory (ends with /) to be read. When reading a single file, the message header storage.offset may be set to read a specific chunk from a file, which is also subject to the chunkSize configuration.
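The combination of storage.offset and chunkSize can be pictured with the Python sketch below. It assumes that the offset is a byte position counted from the start of the file, which is not stated above, and read_at_offset is a hypothetical name.

```python
def read_at_offset(path: str, offset: int = 0, chunk_size: int = 0) -> bytes:
    """Read one chunk starting at `offset`.

    Assumption: `offset` is a byte position from the start of the file.
    A `chunk_size` of 0 means "no limit", so the rest of the file is read.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        if chunk_size == 0:
            return handle.read()
        return handle.read(chunk_size)
```

If fewer than chunk_size bytes remain after the offset, the returned chunk is simply shorter.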

Output

Output

Type

Description

outFilename

message

A message whose body is the path of the file. This will be equal to the path that prompted the reading (either inPath or path).

outFile

message

A message whose headers describe the file read and whose body contains the file's contents as a blob. The message contains the following headers:
  • storage.chunkIndex (type int)
    The current chunk's index, starting from zero.
  • storage.chunkCount (type int)
    The total number of chunks for this file.
  • storage.fileSize (type int64)
    The size of the file, in bytes.
  • storage.endOfFile (type bool)
    A flag that indicates whether this chunk is the last. This is simply a convenience: endOfFile == (chunkIndex == chunkCount-1).
  • storage.filename (type string)
    The name of the file, including extension (if any), but excluding its directory.
  • storage.directory (type string)
    The absolute path of the directory where the file resides.
  • storage.path (type string)
    The absolute path of the file (equivalent to <directory>/<filename>).
  • storage.polledDirectory (type string)
    The absolute path to the directory being polled, if applicable.
  • storage.pathInPolledDirectory (type string)
    The file path relative to polledDirectory. This is the result of removing the polledDirectory prefix from directory.
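How these headers relate to one another can be shown with a Python sketch that reads a local file in chunks and builds a header dictionary for each chunk. The function read_in_chunks and the use of plain dictionaries are hypothetical. Two details are assumptions: an empty file is reported as one empty chunk, and a file located directly in the polled directory gets "." as pathInPolledDirectory.

```python
import os
from typing import Any, Dict, Iterator, Optional, Tuple


def read_in_chunks(path: str,
                   chunk_size: int = 0,
                   polled_directory: Optional[str] = None
                   ) -> Iterator[Tuple[Dict[str, Any], bytes]]:
    """Yield (headers, body) pairs, one per chunk of the file."""
    absolute = os.path.abspath(path)
    directory, filename = os.path.split(absolute)
    file_size = os.path.getsize(absolute)
    if chunk_size <= 0 or file_size == 0:
        chunk_count = 1  # single chunk (assumption for empty files)
    else:
        chunk_count = (file_size + chunk_size - 1) // chunk_size  # ceiling
    base: Dict[str, Any] = {
        "storage.chunkCount": chunk_count,
        "storage.fileSize": file_size,
        "storage.filename": filename,
        "storage.directory": directory,
        "storage.path": absolute,
    }
    if polled_directory is not None:
        polled = os.path.abspath(polled_directory)
        base["storage.polledDirectory"] = polled
        # directory with the polledDirectory prefix removed
        base["storage.pathInPolledDirectory"] = os.path.relpath(directory, polled)
    with open(absolute, "rb") as handle:
        for index in range(chunk_count):
            body = handle.read() if chunk_size <= 0 else handle.read(chunk_size)
            headers = dict(base)
            headers["storage.chunkIndex"] = index
            headers["storage.endOfFile"] = index == chunk_count - 1
            yield headers, body
```

A consumer that needs the whole file can concatenate the bodies until it sees a chunk whose storage.endOfFile header is true.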