Example Graph for Text Analysis with HDFS

Use the text analysis example graph in the SAP Data Hub Modeler to build applications with natural language processing capabilities.

The SAP Data Hub Modeler provides a text analysis example graph: com.sap.textanalysis.hdfs. This graph performs text analysis on files stored in an HDFS file system. The graph listens for changes to files under a given root folder and sends requests to a text analysis server. The server tokenizes the contents of the files in that folder and stores the tokenization results on HDFS.

The graph also loads the tokenization results into SAP Vora. After loading the results, it creates a full-text index on the file collection from those results to speed up query execution.

For more information on SAP Vora, see Developer Guide for SAP Vora.

Prerequisites to execute the text analysis example graph

You have an HDFS server available that is reachable through the network firewall, if any.
You have the hostname for the HDFS server and its port number.
If Kerberos is not enabled, the folder and subfolders in HDFS where your files are located allow write permissions for the 'root' user.
For more information, see 624ce81c22f94cb99a1100f4aae925e8.html
You have installed the vora-dqp services and have the port number of the transaction coordinator.
You have installed the vora-textanalysis service and have the port number.
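
The reachability prerequisites above can be checked before the graph is configured. The following is a minimal sketch, not part of the product: is_reachable is a hypothetical helper, and the host name and port in the usage note are placeholders. It only confirms that a TCP connection can be opened; it does not verify HDFS permissions or Kerberos settings.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, is_reachable("hdfs.example.com", 8020) could be called with your own HDFS host name and port, and the same check can be repeated for the vora-textanalysis service and the transaction coordinator.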

Executing the com.sap.textanalysis.hdfs graph

Start the SAP Data Hub Modeler.
In the navigation pane, select the Graphs tab.
In the search box, enter com.sap.textanalysis.hdfs.
The tool loads the selected graph in the graph editor.

Select an operator in the graph and, in the right pane, select the Configuration tab to set the configuration parameter values.

Operator	Configuration Parameter	Value
Read File (Operator id: readfile1)	Service	hdfs
	Connection Properties	Host name, port number, and username to log on to the HDFS server
	Path	Path to the folder on HDFS to be analyzed
	Delete after Send	false
	Recursive	If true, subfolders in the given location are analyzed recursively.
	Only Read On Change	true
	Poll Period	Interval between two content change detection events, must be >= 1000
Read File (Operator id: readfile2)	Service	hdfs
	Connection Properties	Host name, port number, and username to log on to the HDFS server (must be the same as in readfile1)
	Path	Path to the folder on HDFS to be analyzed (must be the same as in readfile1)
	Delete after Send	false
	Recursive	If true, subfolders in the given location are analyzed recursively.
	Only Read On Change	false
	Poll Period	Interval between two content change detection events, must be >= 1000
Modification Checker (Operator id: javascriptoperator5)	folderpath	Path to the folder on HDFS to be analyzed, without a trailing '/' or '\' (must be the same as in readfile1)
	hadoopNameNode	Host name and port number of the HDFS server (must be the same as in readfile1).
	schemaname	Name of the schema in SAP Vora under which the analysis results tables are to be created.
	tablenamesuffix	Suffix for the names of the analysis results tables. Caution: Tables named `TA_tablenamesuffix` and `TADOC_tablenamesuffix` must not already exist in SAP Vora. If either table exists, the graph terminates immediately.
	duration	Recommended: at least 2 seconds longer than the Poll Period value set in readfile2. Note: The value of the duration parameter must be large enough for the Modification Checker operator to receive the complete set of modifications to the file collection reported by the HDFS consumers. On a slow network, a value that is too small may lead to incorrect text analysis results.
TA Request Creator (Operator id: javascriptoperator4)	serverendpoints	Host name and port number of the vora-textanalysis service
	taconfig	Use either LINGANALYSIS_BASIC, LINGANALYSIS_STEMS, LINGANALYSIS_FULL, EXTRACTION_CORE, EXTRACTION_CORE_ENTERPRISE, EXTRACTION_CORE_PUBLIC_SECTOR, or EXTRACTION_CORE_VOICEOFCUSTOMER. For more information on the description of each configuration, see the Text Analysis section in the Developer Guide for SAP Vora.
	languages (Optional)	A list of languages used for language detection specified in ISO 639-1 codes. For example: 'EN,DE,ES'. If no language is specified, then automatic detection is attempted.
	mime_type (Optional)	The type of the input documents. Allowed values are 'text/plain', 'text/html', 'text/xml', and 'text'. The value 'text' indicates that the input is plain text, HTML, or XML. If not set, or if the value is 'text', document identification and conversion are performed. Note: Analysis of documents in binary formats is also supported, but automatic format detection is used in this case.
	text_encoding (Optional)	If the document contains text, this parameter indicates the encoding. For example: 'UTF-8'. If not set and the MIME type indicates text, encoding detection and conversion are performed.
SQL Creator (Operator id: javascriptoperator6)	createIndex	If true, the engine builds a full-text index on the file collection.
SAP Vora Client (Operator id: sapvoraclient1)	Connection Properties	Host name of the vora-tx-coordinator service, tc port number, and authentication credentials. Right-click the operator and choose Open Documentation for more information on the format of these values.
SAP Vora Client (Operator id: sapvoraclient2)	Connection Properties	Host name of the vora-tx-coordinator service, tc port number, and authentication credentials. Right-click the operator and choose Open Documentation for more information on the format of these values.
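
Several rows in the table above constrain each other (matching paths, the minimum Poll Period, the recommended duration). The following is a rough sketch of how those rules could be checked before the values are entered in the Modeler. The dictionary layout and the function name check_graph_config are hypothetical and are not a Modeler format. The sketch also assumes that Poll Period and duration are both expressed in milliseconds, which the table does not state explicitly.

```python
# Allowed taconfig values, as listed in the table above.
TA_CONFIGS = {
    "LINGANALYSIS_BASIC", "LINGANALYSIS_STEMS", "LINGANALYSIS_FULL",
    "EXTRACTION_CORE", "EXTRACTION_CORE_ENTERPRISE",
    "EXTRACTION_CORE_PUBLIC_SECTOR", "EXTRACTION_CORE_VOICEOFCUSTOMER",
}


def check_graph_config(cfg):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    r1, r2 = cfg["readfile1"], cfg["readfile2"]
    checker = cfg["modification_checker"]

    # Both Read File operators require a Poll Period of at least 1000.
    for name, reader in (("readfile1", r1), ("readfile2", r2)):
        if reader["poll_period"] < 1000:
            problems.append(f"{name}: Poll Period must be >= 1000")

    # readfile2 must point at the same folder as readfile1.
    if r2["path"] != r1["path"]:
        problems.append("readfile2: Path must be the same as in readfile1")

    # folderpath must not end with '/' or '\'.
    if checker["folderpath"].endswith(("/", "\\")):
        problems.append("folderpath: remove the trailing '/' or '\\'")

    # ASSUMPTION: duration and Poll Period are both in milliseconds,
    # so "2 seconds longer" is modeled as + 2000.
    if checker["duration"] < r2["poll_period"] + 2000:
        problems.append(
            "duration: should exceed the readfile2 Poll Period by at least 2 seconds"
        )

    if cfg["taconfig"] not in TA_CONFIGS:
        problems.append("taconfig: unknown configuration name")
    return problems
```

The function returns messages instead of raising an error so that all problems can be reported in one pass.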

In the editor toolbar, choose (Save) to save the graph.
In the editor toolbar, choose (Run) to execute the graph.
The Status tab in the bottom pane shows the graph status as running while the graph is being executed.

Querying text analysis results in SAP Vora

The results of the text analysis are loaded into two tables in SAP Vora: TA_tablenamesuffix and TADOC_tablenamesuffix, where tablenamesuffix is defined in the configuration of the Modification Checker operator. These two tables are in the SAP Vora relational disk engine.
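
The naming rule above can be illustrated with a small sketch that derives the two table names from a suffix and reports any that already exist. The helper names are hypothetical, and the list of existing tables is assumed to come from your own catalog lookup; the comparison here is an exact string match.

```python
def result_tables(tablenamesuffix):
    """Return the names of the two result tables for a given suffix."""
    return [f"TA_{tablenamesuffix}", f"TADOC_{tablenamesuffix}"]


def conflicting_tables(tablenamesuffix, existing_tables):
    """Return the result table names that are already present."""
    existing = set(existing_tables)
    return [name for name in result_tables(tablenamesuffix) if name in existing]
```

With the suffix NEWS, for example, result_tables returns TA_NEWS and TADOC_NEWS. A non-empty result from conflicting_tables suggests choosing a different suffix before running the graph.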

If the createIndex parameter in the configuration of the SQL Creator operator is set to true, a full-text index is created from the results of the text analysis. The index is associated with the filename column of the table TADOC_tablenamesuffix_mem. To query using this index, use the file_contains predicate:

select * from TADOC_tablenamesuffix_mem where file_contains(filename, 'search_key');
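
If the query is generated from application code, the statement text could be assembled as in the following sketch. The function name file_contains_query is hypothetical. Doubling single quotes to escape the search key follows common SQL practice and is assumed, not confirmed, to apply to SAP Vora. The sketch only builds the statement string; how the statement is sent to SAP Vora is not shown.

```python
import re


def file_contains_query(tablenamesuffix, search_key):
    """Build a file_contains query against TADOC_<suffix>_mem."""
    # Restrict the suffix to plain identifier characters so that it
    # cannot alter the structure of the statement.
    if not re.fullmatch(r"[A-Za-z0-9_]+", tablenamesuffix):
        raise ValueError("tablenamesuffix must contain only letters, digits, and '_'")
    # ASSUMPTION: single quotes in string literals are escaped by doubling them.
    escaped_key = search_key.replace("'", "''")
    return (
        f"select * from TADOC_{tablenamesuffix}_mem "
        f"where file_contains(filename, '{escaped_key}');"
    )
```

For the suffix NEWS and the search key vora, the function produces a statement of the same shape as the query above, with the table name TADOC_NEWS_mem.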