Modeling Guide

Example Graph for Text Analysis with HDFS

Use the text analysis example graph in the SAP Data Hub Modeler to build applications with natural language processing capabilities.

The SAP Data Hub Modeler provides a text analysis example graph: com.sap.textanalysis.hdfs. This graph performs text analysis on files stored on an HDFS file system. It listens for changes to files under a given root folder and sends requests to a text analysis server. The server tokenizes the contents of the files in that folder and stores the tokenization results on HDFS.

The graph also loads the tokenization results into SAP Vora. It then uses those results to create a full-text index on the file collection, which speeds up query execution.

For more information on SAP Vora, see the Developer Guide for SAP Vora.
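The change detection that the graph performs on the root folder can be pictured with a small sketch. It is not the graph's actual implementation: the function name detect_changes and the snapshot format (a mapping from file path to modification time) are assumptions made for illustration only.

```python
from typing import Dict, List


def detect_changes(previous: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """Return the paths that are new or whose modification time increased.

    Both arguments map a file path to its last modification time.
    This snapshot format is hypothetical, not the format used by the graph.
    """
    changed = []
    for path, mtime in current.items():
        old = previous.get(path)
        # A file counts as changed if it was not seen before or is newer now.
        if old is None or mtime > old:
            changed.append(path)
    return sorted(changed)


# Two consecutive polls of a hypothetical folder.
previous_snapshot = {"/docs/a.txt": 100.0, "/docs/b.txt": 100.0}
current_snapshot = {"/docs/a.txt": 100.0, "/docs/b.txt": 250.0, "/docs/c.txt": 260.0}
print(detect_changes(previous_snapshot, current_snapshot))
```

In this sketch only the modified file and the new file would be sent on for tokenization; the unchanged file is skipped.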

Prerequisites to execute the text analysis example graph
  • You have an HDFS server available that is reachable through the network firewall, if one is in place.
  • You have the hostname for the HDFS server and its port number.
  • If Kerberos is not enabled, the folder and subfolders in HDFS where your files are located grant write permissions to the 'root' user.

    For more information, see 624ce81c22f94cb99a1100f4aae925e8.html

  • You have installed the vora-dqp services and have the port number of the transaction coordinator.
  • You have installed the vora-textanalysis service and have the port number.
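Before starting the graph, it can help to confirm that the HDFS server, the transaction coordinator, and the vora-textanalysis service accept connections on the ports you noted. The following is a minimal sketch using only the Python standard library. The function name is_reachable is hypothetical, and a successful TCP connection shows only that the port is open, not that the service behind it is correctly configured.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Example call (host name and port are placeholders, not real defaults):
# is_reachable("hdfs.example.com", 8020)
```

Run the check from the machine that hosts the SAP Data Hub Modeler so that the result reflects the same network path, including any firewall.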
Executing the com.sap.textanalysis.hdfs graph
  1. Start the SAP Data Hub Modeler.
  2. In the navigation pane, select the Graphs tab.
  3. In the search box, enter com.sap.textanalysis.hdfs and select the graph from the search results.

    The tool loads the selected graph in the graph editor.

  4. Select each operator in the graph in turn and, in the right pane, choose the Configuration tab to set the configuration parameter values listed below.

    Each operator is listed with its configuration parameters in the form Parameter: Value.

    Read File (Operator id: readfile1)
      • Service: hdfs
      • Connection Properties: Host name, port number, and user name to log on to the HDFS server
      • Path: Path to the folder on HDFS to be analyzed
      • Delete after Send: false
      • Recursive: If true, subfolders in the given location are analyzed recursively.
      • Only Read On Change: true
      • Poll Period: Interval in milliseconds between two content change detection events; must be >= 1000

    Read File (Operator id: readfile2)
      • Service: hdfs
      • Connection Properties: Host name, port number, and user name to log on to the HDFS server (must be the same as in readfile1)
      • Path: Path to the folder on HDFS to be analyzed (must be the same as in readfile1)
      • Delete after Send: false
      • Recursive: If true, subfolders in the given location are analyzed recursively.
      • Only Read On Change: false
      • Poll Period: Interval in milliseconds between two content change detection events; must be >= 1000

    Modification Checker (Operator id: javascriptoperator5)
      • folderpath: Path to the folder on HDFS to be analyzed, without a trailing '/' or '\' (must be the same as in readfile1)
      • hadoopNameNode: Host name and port number of the HDFS server (must be the same as in readfile1)
      • schemaname: Name of the schema in SAP Vora under which the analysis results tables are to be created
      • tablenamesuffix: Suffix for the names of the analysis results tables
      • duration: Recommended: at least 2 seconds longer than the Poll Period value set in readfile2

    TA Request Creator (Operator id: javascriptoperator4)
      • serverendpoints: Host name and port number of the vora-textanalysis service
      • taconfig: One of LINGANALYSIS_BASIC, LINGANALYSIS_STEMS, LINGANALYSIS_FULL, EXTRACTION_CORE, EXTRACTION_CORE_ENTERPRISE, EXTRACTION_CORE_PUBLIC_SECTOR, or EXTRACTION_CORE_VOICEOFCUSTOMER. For a description of each configuration, see the Text Analysis section in the Developer Guide for SAP Vora.
      • languages (Optional): A list of languages used for language detection, specified as ISO 639-1 codes. For example: 'EN,DE,ES'. If no language is specified, automatic detection is attempted.
      • mime_type (Optional): The type of the input documents. Allowed values are 'text/plain', 'text/html', 'text/xml', and 'text'. The value 'text' indicates that the input is plain text, HTML, or XML. If not set, or if the value is 'text', document identification and conversion are performed.
      • text_encoding (Optional): If the document contains text, this parameter indicates the encoding. For example: 'UTF-8'. If not set and the MIME type indicates text, encoding detection and conversion are performed.

    SQL Creator (Operator id: javascriptoperator6)
      • createIndex: If true, the engine builds a full-text index on the file collection.

    SAP Vora Client (Operator id: sapvoraclient1)
      • Connection Properties: Host name of the vora-tx-coordinator service, tc port number, and authentication credentials. Right-click the operator and choose Open Documentation for more information on the format of these values.

    SAP Vora Client (Operator id: sapvoraclient2)
      • Connection Properties: Host name of the vora-tx-coordinator service, tc port number, and authentication credentials. Right-click the operator and choose Open Documentation for more information on the format of these values.

  5. In the editor toolbar, choose (Save) to save the graph.
  6. In the editor toolbar, choose (Run) to execute the graph.

    The Status tab in the bottom pane shows the graph as running while it is being executed.
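Several of the operator parameters in step 4 must agree with each other, so a quick consistency check before running the graph can catch mistakes. The sketch below is an illustration only: the dictionary layout, the key names (connection, path, poll_period, and so on), and the function validate_config are hypothetical and are not part of the Modeler. The duration recommendation is not checked here.

```python
from typing import Dict, List

ALLOWED_TACONFIG = {
    "LINGANALYSIS_BASIC", "LINGANALYSIS_STEMS", "LINGANALYSIS_FULL",
    "EXTRACTION_CORE", "EXTRACTION_CORE_ENTERPRISE",
    "EXTRACTION_CORE_PUBLIC_SECTOR", "EXTRACTION_CORE_VOICEOFCUSTOMER",
}
ALLOWED_MIME_TYPES = {"text/plain", "text/html", "text/xml", "text"}


def validate_config(cfg: Dict[str, Dict[str, object]]) -> List[str]:
    """Return a list of problems found in a hypothetical config dictionary."""
    problems = []
    rf1, rf2 = cfg["readfile1"], cfg["readfile2"]

    # readfile2 must use the same connection and path as readfile1.
    for key in ("connection", "path"):
        if rf1.get(key) != rf2.get(key):
            problems.append(f"readfile1 and readfile2 differ in '{key}'")

    # Poll Period must be >= 1000 for both Read File operators.
    for name, rf in (("readfile1", rf1), ("readfile2", rf2)):
        if int(rf.get("poll_period", 0)) < 1000:
            problems.append(f"{name}: Poll Period must be >= 1000")

    # folderpath must match the readfile1 path and have no trailing separator.
    folderpath = str(cfg["modification_checker"].get("folderpath", ""))
    if folderpath.endswith(("/", "\\")):
        problems.append("folderpath must not end with a path separator")
    if folderpath.rstrip("/\\") != str(rf1.get("path", "")).rstrip("/\\"):
        problems.append("folderpath differs from the readfile1 path")

    ta = cfg["ta_request_creator"]
    if ta.get("taconfig") not in ALLOWED_TACONFIG:
        problems.append("taconfig is not one of the allowed values")
    mime = ta.get("mime_type")  # optional parameter
    if mime is not None and mime not in ALLOWED_MIME_TYPES:
        problems.append("mime_type is not one of the allowed values")
    return problems


# Placeholder values, not real defaults.
example_config = {
    "readfile1": {"connection": "hdfs.example.com:8020", "path": "/docs", "poll_period": 1000},
    "readfile2": {"connection": "hdfs.example.com:8020", "path": "/docs", "poll_period": 1000},
    "modification_checker": {"folderpath": "/docs"},
    "ta_request_creator": {"taconfig": "LINGANALYSIS_FULL", "mime_type": "text/plain"},
}
print(validate_config(example_config))
```

An empty list means that none of the checked constraints is violated; it does not guarantee that the graph will run successfully.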

Querying text analysis results in SAP Vora

The results of the text analysis are loaded into two tables in SAP Vora: TA_tablenamesuffix and TADOC_tablenamesuffix, where tablenamesuffix is defined in the configuration of the Modification Checker operator. These two tables are in the SAP Vora relational disk engine.

If the createIndex parameter in the configuration of the SQL Creator operator is set to true, a full-text index is created from the text analysis results. The index is associated with the filename column of the table TADOC_tablenamesuffix_mem. To query using this index, use the file_contains predicate:

select * from TADOC_tablenamesuffix_mem where file_contains(filename, 'search_key');
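If you generate this query from a script, the table name and the search key have to be filled in for your own suffix. The helper below is a minimal sketch: the function name build_file_contains_query is hypothetical, and doubling single quotes follows the usual SQL string-literal convention, which is assumed (not confirmed here) to apply to SAP Vora as well.

```python
import re


def build_file_contains_query(table_suffix: str, search_key: str) -> str:
    """Build the file_contains query for a given table name suffix."""
    # Restrict the suffix to characters that are safe in an identifier.
    if not re.fullmatch(r"[A-Za-z0-9_]+", table_suffix):
        raise ValueError("table_suffix may contain only letters, digits, and '_'")
    # Escape single quotes by doubling them (assumed standard SQL behavior).
    escaped = search_key.replace("'", "''")
    return (
        f"select * from TADOC_{table_suffix}_mem "
        f"where file_contains(filename, '{escaped}');"
    )


print(build_file_contains_query("mysuffix", "invoice"))
```

Here mysuffix stands for the tablenamesuffix value set in the Modification Checker operator, and invoice is an example search key.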