Modeling Guide for SAP Data Hub

Text Analysis

The Text Analysis operator connects to a text analysis server through an external connector binary. It uses the stdmuxer component.

The operator receives a request for text analysis in JSON format. There are two types of requests:
  1. Data: The request contains the content of the document to be analyzed. The operator outputs the result of the analysis as a string in CSV format.
  2. Folder: The request contains the path of a folder holding the documents to be analyzed. The result of the analysis is written directly to the specified location in two files: “[folder_name]_TA.csv” and “[folder_name]_TADOC.csv”.
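The naming rule for the two result files of a folder request can be sketched as follows. This is only an illustration: the helper name expected_output_files is hypothetical, and the sketch assumes that [folder_name] is the last component of the folder path and that both files are placed inside that folder. Neither assumption is confirmed by this guide.

```python
import posixpath
from typing import Tuple


def expected_output_files(folderpath: str) -> Tuple[str, str]:
    """Return the assumed paths of the _TA.csv and _TADOC.csv result files.

    Assumption: [folder_name] is the last path component, and the files
    are written inside the analyzed folder itself.
    """
    # Drop a trailing slash so that basename() returns the folder name.
    trimmed = folderpath.rstrip("/")
    folder_name = posixpath.basename(trimmed)
    ta_file = posixpath.join(trimmed, folder_name + "_TA.csv")
    tadoc_file = posixpath.join(trimmed, folder_name + "_TADOC.csv")
    return ta_file, tadoc_file


if __name__ == "__main__":
    for path in ("/tmp/folder", "hdfs://vora-hdfs:8020/user/hdfs/folder"):
        print(expected_output_files(path))
```

The sketch uses posixpath rather than os.path so that local paths and hdfs:// locations are handled the same way on any platform.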

A sample graph called 'Text Analysis Example' (com.sap.textanalysis.example) using this operator can be found in the graph library.

See the Text Analysis section of the Developer Guide for SAP Vora for more information about the text analysis configurations and output format.

Request Parameters

Parameter        Description

endpoint         Host name and port number of the vora-textanalysis service. For example: 'vora-textanalysis:2204'.

taconfig         Default text analysis configuration. One of: LINGANALYSIS_BASIC, LINGANALYSIS_STEMS, LINGANALYSIS_FULL, EXTRACTION_CORE, EXTRACTION_CORE_ENTERPRISE, EXTRACTION_CORE_PUBLIC_SECTOR, EXTRACTION_CORE_VOICEOFCUSTOMER. If empty, LINGANALYSIS_BASIC is used.

languages        A list of languages used for language detection, specified as ISO 639-1 codes. For example: 'EN,DE,ES'. If no language is specified, automatic detection is attempted.

mime_type        The type of the input documents. Allowed values are 'text/plain', 'text/html', 'text/xml', and 'text'. The value 'text' indicates that the input is one of plain text, HTML, or XML. If not set, or if the value is 'text', document identification and conversion are performed. Documents in binary formats can also be analyzed, but automatic format detection is used in that case.

text_encoding    If the document contains text, this parameter indicates the encoding. For example: 'UTF-8'. If not set and the MIME type indicates text, encoding detection and conversion are performed.

document         The text of the document to be analyzed.

folderpath       The location of the files to be analyzed. The currently supported file systems are local and HDFS. For example: '/tmp/folder' or 'hdfs://vora-hdfs:8020/user/hdfs/folder'.

recursive_flag   If true, the analysis is performed recursively in the subfolders of the specified location. Output files are written locally in each subfolder.

document_id      The starting ID used to identify documents in the result.
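As an illustration of how the parameters above might be combined into a request, the following sketch assembles a JSON string in Python. It rests on several assumptions that this guide does not confirm: that the request is a flat JSON object keyed by the parameter names, that a data request carries document while a folder request carries folderpath, and that unset optional parameters may simply be omitted. The function name build_request is hypothetical. See the Developer Guide for SAP Vora for the authoritative format.

```python
import json
from typing import Optional

# MIME types listed as allowed values for mime_type in the table above.
ALLOWED_MIME_TYPES = {"text/plain", "text/html", "text/xml", "text"}


def build_request(
    endpoint: str,
    *,
    document: Optional[str] = None,
    folderpath: Optional[str] = None,
    recursive_flag: Optional[bool] = None,
    taconfig: Optional[str] = None,
    languages: Optional[str] = None,
    mime_type: Optional[str] = None,
    text_encoding: Optional[str] = None,
    document_id: Optional[int] = None,
) -> str:
    """Build a request string (assumed layout: flat JSON object)."""
    # Assumption: a request is either a data request or a folder request.
    if (document is None) == (folderpath is None):
        raise ValueError("specify exactly one of document or folderpath")
    if mime_type is not None and mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError("unsupported mime_type: %r" % mime_type)

    fields = {
        "endpoint": endpoint,
        "taconfig": taconfig,
        "languages": languages,
        "mime_type": mime_type,
        "text_encoding": text_encoding,
        "document": document,
        "folderpath": folderpath,
        "recursive_flag": recursive_flag,
        "document_id": document_id,
    }
    # Leave out parameters that were not set so that the defaults apply.
    return json.dumps({k: v for k, v in fields.items() if v is not None})


if __name__ == "__main__":
    print(build_request("vora-textanalysis:2204",
                        document="SAP is based in Walldorf.",
                        languages="EN,DE", mime_type="text/plain"))
    print(build_request("vora-textanalysis:2204",
                        folderpath="/tmp/folder", recursive_flag=True))
```

The first call would produce a string for the inFileData port and the second a string for the inFolderPath port, again under the assumed layout.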

Input

Input            Type      Description

inFileData       string    A request including the document to be analyzed.

inFolderPath     string    A request including a folder to be processed.

Output

Output           Type      Description

outFileData      string    Result of the analysis of the document received through the inFileData port, in CSV format.

outFolderPath    string    A string containing the input request specification and the number of successfully analyzed documents.
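Because the outFileData port delivers the result as a CSV string, a downstream step will usually need to split it into rows. The sketch below does this with Python's standard csv module. It assumes a comma delimiter and standard double-quote escaping, and it makes no assumption about column names or about whether a header row is present, since the output columns are described in the Developer Guide for SAP Vora rather than here. The function name parse_csv_result is hypothetical.

```python
import csv
import io
from typing import List


def parse_csv_result(result: str) -> List[List[str]]:
    """Split a CSV string into a list of rows, each a list of field strings.

    Assumption: comma-separated fields with standard double-quote escaping.
    """
    # newline="" lets the csv module handle line endings inside quoted fields.
    reader = csv.reader(io.StringIO(result, newline=""))
    return [row for row in reader if row]  # skip completely empty lines


if __name__ == "__main__":
    # Made-up sample input, not actual operator output.
    sample = 'a,b\n"x, y",2\n'
    for row in parse_csv_result(sample):
        print(row)
```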