Example Graph for Text Analysis with HDFS
Use the text analysis example graph in the SAP Data Hub Modeler to build applications with natural language processing capabilities.
The SAP Data Hub Modeler provides a text analysis example graph: com.sap.textanalysis.hdfs. This graph performs text analysis on files stored in HDFS. The graph watches for changes to files under a given root folder and sends requests to a text analysis server. The server tokenizes the contents of the files in that folder and stores the tokenization results on HDFS.
The graph then loads the tokenization results into SAP Vora. If the createIndex parameter is set to true, it also builds a full-text index on the file collection from those results, which speeds up query execution.
For more information on SAP Vora, see Developer Guide for SAP Vora.
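The change detection described above can be illustrated with a minimal sketch. It assumes a local folder as a stand-in for HDFS and compares file modification times between scans; the function name `scan_changes` and the `seen` dictionary are hypothetical and are not part of the graph or of any SAP API. It is not how the Read File operator is implemented, only an illustration of polling a root folder for changed files.

```python
import os
from typing import Dict, List


def scan_changes(root: str, seen: Dict[str, int], recursive: bool = True) -> List[str]:
    """Return files under root whose modification time changed since the last scan.

    `seen` maps file paths to the modification time (in nanoseconds) recorded
    by earlier scans and is updated in place.
    """
    changed: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        if not recursive:
            dirnames.clear()  # stop os.walk from descending into subfolders
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime_ns
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
    return changed
```

Calling `scan_changes` once per poll interval with the same `seen` dictionary returns every file on the first call and only new or modified files on later calls, which loosely mirrors the Only Read On Change setting.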
- You have an HDFS server available that is reachable through the network firewall, if one is in place.
- You have the hostname for the HDFS server and its port number.
- If Kerberos is not enabled, the folder and subfolders in HDFS where your files are located allow write permissions for the 'root' user.
For more information, see 624ce81c22f94cb99a1100f4aae925e8.html
- You have installed the vora-dqp services and have the port number of the transaction coordinator.
- You have installed the vora-textanalysis service and have the port number.
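Before configuring the graph, it can help to confirm that the host names and port numbers collected above accept TCP connections. The following is a minimal sketch using only the Python standard library; the helper names `is_reachable` and `check_services` are hypothetical. A successful connection only shows that the port is open from the machine running the check. It does not verify Kerberos settings, HDFS permissions, or that the expected service is listening.

```python
import socket
from typing import Dict, Tuple


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_services(services: Dict[str, Tuple[str, int]], timeout: float = 3.0) -> Dict[str, bool]:
    """Map each service label to whether its (host, port) pair is reachable."""
    return {
        label: is_reachable(host, port, timeout)
        for label, (host, port) in services.items()
    }
```

For example, `check_services({"hdfs": ("<hdfs-host>", <hdfs-port>), "vora-textanalysis": ("<ta-host>", <ta-port>)})` returns a dictionary of labels to `True` or `False`; the placeholders stand for the values from your own landscape.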
- Start the SAP Data Hub Modeler.
- In the navigation pane, select the Graphs tab.
- In the search box, enter com.sap.textanalysis.hdfs.
The tool loads the selected graph in the graph editor.
- Select an operator in the graph and, in the right pane, select the Configuration tab to set the configuration parameter values.
| Operator | Configuration Parameter | Value |
| --- | --- | --- |
| Read File (Operator id: readfile1) | Service | hdfs |
| | Connection Properties | Host name, port number, and user name used to log on to the HDFS server |
| | Path | Path to the folder on HDFS to be analyzed |
| | Delete after Send | false |
| | Recursive | If true, subfolders in the given location are analyzed recursively. |
| | Only Read On Change | true |
| | Poll Period | Interval between two content change detection events; must be >= 1000 |
| Read File (Operator id: readfile2) | Service | hdfs |
| | Connection Properties | Host name, port number, and user name used to log on to the HDFS server (must be the same as in readfile1) |
| | Path | Path to the folder on HDFS to be analyzed (must be the same as in readfile1) |
| | Delete after Send | false |
| | Recursive | If true, subfolders in the given location are analyzed recursively. |
| | Only Read On Change | false |
| | Poll Period | Interval between two content change detection events; must be >= 1000 |
| Modification Checker (Operator id: javascriptoperator5) | folderpath | Path to the folder on HDFS to be analyzed, without trailing ‘/’ or ‘\’ (must be the same as in readfile1) |
| | hadoopNameNode | Host name and port number of the HDFS server (must be the same as in readfile1) |
| | schemaname | Name of the schema in SAP Vora under which the analysis results tables are to be created |
| | tablenamesuffix | Suffix to the names of the analysis results tables |
| | duration | (Recommended) At least 2 seconds larger than the Poll Period value set in readfile2 |
| TA Request Creator (Operator id: javascriptoperator4) | serverendpoints | Host name and port number of the vora-textanalysis service |
| | taconfig | One of LINGANALYSIS_BASIC, LINGANALYSIS_STEMS, LINGANALYSIS_FULL, EXTRACTION_CORE, EXTRACTION_CORE_ENTERPRISE, EXTRACTION_CORE_PUBLIC_SECTOR, or EXTRACTION_CORE_VOICEOFCUSTOMER. For a description of each configuration, see the Text Analysis section in the Developer Guide for SAP Vora. |
| | languages (Optional) | A list of languages used for language detection, specified as ISO 639-1 codes. For example: 'EN,DE,ES'. If no language is specified, automatic detection is attempted. |
| | mime_type (Optional) | The type of the input documents. Allowed values are 'text/plain', 'text/html', 'text/xml', and 'text'. The value 'text' indicates that the input is plain text, HTML, or XML. If not set, or if the value is 'text', document identification and conversion are performed. |
| | text_encoding (Optional) | If the document contains text, this parameter indicates the encoding. For example: 'UTF-8'. If not set and the MIME type indicates text, encoding detection and conversion are performed. |
| SQL Creator (Operator id: javascriptoperator6) | createIndex | If true, the engine builds a full-text index on the file collection. |
| SAP Vora Client (Operator id: sapvoraclient1) | Connection Properties | Host name of the vora-tx-coordinator service, tc port number, and authentication credentials. Right-click the operator and choose Open Documentation for more information on the format of these values. |
| SAP Vora Client (Operator id: sapvoraclient2) | Connection Properties | Host name of the vora-tx-coordinator service, tc port number, and authentication credentials. Right-click the operator and choose Open Documentation for more information on the format of these values. |
- In the editor toolbar, choose Save to save the graph.
- In the editor toolbar, choose Run to execute the graph.
The Status tab in the bottom pane shows the graph status as running while the graph is being executed.
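Several of the configuration parameters above depend on each other, for example the readfile2 path must match readfile1 and the Poll Period must be at least 1000. A minimal sketch of checking these rules before running the graph is shown here. It assumes the values have been copied by hand into a plain Python dictionary; the dictionary layout and key names such as `poll_period` and `connection` are hypothetical and do not correspond to any Modeler export format. The duration recommendation is left out because the parameter descriptions do not state its unit.

```python
import re
from typing import Dict, List

# Allowed values as listed in the configuration parameter descriptions.
TA_CONFIGS = {
    "LINGANALYSIS_BASIC", "LINGANALYSIS_STEMS", "LINGANALYSIS_FULL",
    "EXTRACTION_CORE", "EXTRACTION_CORE_ENTERPRISE",
    "EXTRACTION_CORE_PUBLIC_SECTOR", "EXTRACTION_CORE_VOICEOFCUSTOMER",
}
MIME_TYPES = {"text/plain", "text/html", "text/xml", "text"}


def validate(cfg: Dict[str, dict]) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: List[str] = []
    rf1, rf2 = cfg["readfile1"], cfg["readfile2"]
    checker, creator = cfg["javascriptoperator5"], cfg["javascriptoperator4"]

    for op_id, rf in (("readfile1", rf1), ("readfile2", rf2)):
        if rf["poll_period"] < 1000:
            problems.append(f"{op_id}: Poll Period must be >= 1000")
    for key in ("connection", "path"):
        if rf2[key] != rf1[key]:
            problems.append(f"readfile2: {key} must be the same as in readfile1")

    folder = checker["folderpath"]
    if folder.endswith(("/", "\\")):
        problems.append("folderpath: must not end with '/' or '\\'")
    elif folder != rf1["path"].rstrip("/\\"):
        problems.append("folderpath: must point to the same folder as readfile1")

    if creator["taconfig"] not in TA_CONFIGS:
        problems.append(f"taconfig: unknown configuration {creator['taconfig']!r}")
    mime = creator.get("mime_type")
    if mime and mime not in MIME_TYPES:
        problems.append(f"mime_type: {mime!r} is not an allowed value")
    langs = creator.get("languages")
    if langs and not re.fullmatch(r"[A-Za-z]{2}(,[A-Za-z]{2})*", langs):
        problems.append("languages: expected ISO 639-1 codes such as 'EN,DE,ES'")
    return problems
```

The function reports every violation it finds rather than stopping at the first one, so a single pass shows all parameters that need attention.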
The results of the text analysis are loaded into two tables in SAP Vora: TA_tablenamesuffix and TADOC_tablenamesuffix, where tablenamesuffix is defined in the configuration of the Modification Checker operator. These two tables are in the SAP Vora relational disk engine.
If the createIndex parameter in the configuration of the SQL Creator operator is set to true, a full-text index is created from the results of the text analysis. The index is associated with the filename column of the table TADOC_tablenamesuffix_mem. To query using this index, use the file_contains predicate:
select * from TADOC_tablenamesuffix_mem where file_contains(filename, 'search_key');
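Because the table names are derived from the tablenamesuffix value, the names and the query text can be generated rather than typed by hand. The sketch below only builds strings and does not connect to SAP Vora. The function names are hypothetical, and two details are assumptions rather than documented behavior: the suffix is restricted to letters, digits, and underscores, and single quotes in the search key are escaped by doubling them, as in standard SQL.

```python
import re
from typing import Dict


def _check_suffix(suffix: str) -> None:
    # Assumption: only letters, digits, and underscores are accepted here.
    if not re.fullmatch(r"[A-Za-z0-9_]+", suffix):
        raise ValueError(f"unsupported table name suffix: {suffix!r}")


def result_tables(suffix: str) -> Dict[str, str]:
    """Return the result table names derived from the configured suffix."""
    _check_suffix(suffix)
    return {
        "ta": f"TA_{suffix}",
        "tadoc": f"TADOC_{suffix}",
        "tadoc_mem": f"TADOC_{suffix}_mem",
    }


def file_contains_query(suffix: str, search_key: str) -> str:
    """Build the file_contains query for the indexed table."""
    table = result_tables(suffix)["tadoc_mem"]
    # Assumption: single quotes are escaped by doubling, as in standard SQL.
    escaped = search_key.replace("'", "''")
    return f"select * from {table} where file_contains(filename, '{escaped}');"
```

With a suffix of `demo`, `file_contains_query("demo", "vora")` produces the query shown above with `TADOC_demo_mem` as the table and `vora` as the search key.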