Modeling Guide for SAP Data Hub

Text Analysis HDFS

This graph demonstrates text analysis on the HDFS file system. It listens for changes to files under a given root folder ("folderpath") and sends requests to a text analysis server, which analyzes the contents of that folder. For every subfolder under (and including) the root folder, the analysis results for the files directly in that subfolder are written to the CSV files "_TA.csv" and "_TADOC.csv".
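The per-folder grouping described above can be illustrated with a minimal Python sketch. It is not the graph's actual implementation: the function names group_by_folder and result_files are hypothetical, plain path strings stand in for HDFS, and the location of the result files (inside the analyzed folder itself) is an assumption, since this page names the files but not their location.

```python
from collections import defaultdict
from posixpath import dirname, join


def group_by_folder(file_paths):
    """Map each folder to the files that sit directly inside it."""
    groups = defaultdict(list)
    for path in file_paths:
        # dirname() gives the direct parent, so files in nested
        # subfolders are grouped under their own folder, not the root.
        groups[dirname(path)].append(path)
    return dict(groups)


def result_files(folder):
    """Return the two result file paths for one folder.

    Assumption: the CSV files are written into the analyzed folder.
    """
    return join(folder, "_TA.csv"), join(folder, "_TADOC.csv")
```

Each key of the returned mapping corresponds to one analysis unit; folders that contain no files directly do not appear in it.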

Requests are sent only when files are modified, and only for the modified subfolder. The analysis results are loaded into two tables in SAP Vora, "TA_tablenamesuffix" and "TADOC_tablenamesuffix", where tablenamesuffix is a configuration parameter. A full-text index is created on the file collection from the analysis results and can be used by querying the table "TA_tablenamesuffix_mem".
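As a small illustration of the naming scheme, the following Python sketch derives the three table names from the tablenamesuffix parameter. The helper vora_table_names and its dictionary keys are hypothetical, and the sketch assumes the suffix is substituted literally into the patterns quoted above.

```python
def vora_table_names(tablenamesuffix):
    """Derive the table names used by the graph from the configured suffix."""
    return {
        "ta": f"TA_{tablenamesuffix}",             # analysis results
        "tadoc": f"TADOC_{tablenamesuffix}",       # analysis results
        "fulltext": f"TA_{tablenamesuffix}_mem",   # table to query for the full-text index
    }
```

For example, with the suffix "demo" the full-text index would be queried through the table "TA_demo_mem".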

See the Example Graph for Text Analysis with HDFS page of the Tutorials section in the Developer Guide for Data Pipelines for more information.

Prerequisites

To run the graph, set the following configuration parameters:

Operator: readfile1
  • Service: "hdfs"

  • Connection Properties

  • Path (Folder path to be analyzed)

  • Delete after Send: "false"

  • Recursive: "true"

  • Only Read on Change: "true"

Operator: readfile2
  • Service: "hdfs"

  • Connection Properties

  • Path (Folder path to be analyzed)

  • Delete after Send: "false"

  • Recursive: "true"

  • Only Read on Change: "false"
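The two file readers share almost all of their settings. The Python sketch below writes them as plain dictionaries to make the one difference explicit. The keys mirror the labels listed above rather than actual property names, the Path value is a hypothetical placeholder, and Connection Properties are omitted because they depend on the landscape.

```python
# Settings for readfile1, keyed by the labels used on this page.
readfile1 = {
    "Service": "hdfs",
    "Path": "/path/to/root",  # hypothetical placeholder for the folder to analyze
    "Delete after Send": "false",
    "Recursive": "true",
    "Only Read on Change": "true",
}

# readfile2 uses the same settings except for "Only Read on Change".
readfile2 = {**readfile1, "Only Read on Change": "false"}

# List the settings whose values differ between the two operators.
differing = sorted(k for k in readfile1 if readfile1[k] != readfile2[k])
print(differing)  # ['Only Read on Change']
```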

Operator: javascriptoperator5 (label: Modification Checker)
  • folderpath (Folder path to be analyzed, without trailing '/')

  • hadoopNameNode

  • schemaname

  • tablenamesuffix

  • duration (recommended to be at least 2 seconds longer than the pollPeriodInMs of the file consumers)
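Two of these constraints can be checked before the graph is started. The Python sketch below is only an illustration: the helper names are hypothetical, and it assumes that duration is expressed in milliseconds like pollPeriodInMs, which this page does not state.

```python
def normalize_folderpath(folderpath):
    """Remove trailing '/' characters, keeping a bare root path intact."""
    stripped = folderpath.rstrip("/")
    return stripped or "/"


def duration_is_safe(duration_ms, poll_period_ms, margin_ms=2000):
    """Check the recommendation: duration >= pollPeriodInMs + 2 seconds.

    Assumption: both values are given in milliseconds.
    """
    return duration_ms >= poll_period_ms + margin_ms
```

With a poll period of 1000 ms, a duration of 3000 ms meets the recommendation, while 2500 ms does not.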

Operator: javascriptoperator4 (label: TA Request Creator)
  • serverendpoints

  • taconfig

Operator: sapvoraclient1
  • Connection Properties

Operator: sapvoraclient2
  • Connection Properties