Preprocessor (SAP Library - Search and Classification (TREX))

Preprocessor

The preprocessor prepares documents and search queries for further processing.

Document preprocessing comprises the following steps:

· Loading documents

If the application transmits the documents as URIs rather than directly, TREX resolves the URIs. This involves fetching the documents from the repository that the URIs reference.

· Filtering documents

Documents can exist in various formats, such as Microsoft Word, Microsoft PowerPoint, and PDF. The preprocessor extracts the textual content from the documents and converts it to Unicode (UTF-8) for further processing.

· Analyzing documents linguistically

Linguistic analysis involves splitting text into individual words and reducing the words to their base forms (stems). For this, the preprocessor uses a lexicon that is available for several languages.
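The three steps above can be illustrated with a minimal Python sketch. It is not the TREX implementation or API: the function names (load, filter_document, analyze, preprocess), the toy LEXICON, and the use of tag stripping as a stand-in for real format filters are all assumptions made for illustration.

```python
import re
from typing import List
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical toy lexicon mapping inflected forms to base forms.
# A real lexicon would be far larger and language-specific.
LEXICON = {"documents": "document", "indexed": "index", "searches": "search"}


def load(source: str) -> bytes:
    """Step 1: resolve a URI to document bytes, or accept inline content."""
    if urlparse(source).scheme in ("http", "https", "file"):
        with urlopen(source) as response:
            return response.read()
    # Not a supported URI: treat the string as the document itself.
    return source.encode("utf-8")


def filter_document(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Step 2: extract text and convert it to UTF-8.

    Stripping markup tags stands in for real Word/PowerPoint/PDF filters.
    """
    text = raw.decode(source_encoding)
    text = re.sub(r"<[^>]+>", " ", text)
    return text.encode("utf-8")


def analyze(utf8_text: bytes) -> List[str]:
    """Step 3: split text into words and reduce them to base forms."""
    words = re.findall(r"\w+", utf8_text.decode("utf-8").lower())
    return [LEXICON.get(word, word) for word in words]


def preprocess(source: str, source_encoding: str = "utf-8") -> List[str]:
    """Run the three steps in order."""
    return analyze(filter_document(load(source), source_encoding))


terms = preprocess("<p>Documents indexed</p>")
```

In this sketch, each step consumes the output of the previous one, so a different filter or lexicon could be swapped in without touching the other steps.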

For search queries, the preprocessor also performs a linguistic analysis. It transmits the results of the analysis to the index server, which continues processing the query.
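The following sketch shows, under simplifying assumptions, why queries go through linguistic analysis: when query words are reduced to base forms, they can match terms from documents that were reduced in the same way. The ToyIndexServer class, the analyze function, and the LEXICON are hypothetical stand-ins, not TREX components.

```python
import re
from typing import Dict, List, Set

# Hypothetical toy lexicon of inflected forms and their base forms.
LEXICON = {"documents": "document", "searching": "search"}


def analyze(text: str) -> List[str]:
    """Split text into lowercase words and reduce them to base forms."""
    return [LEXICON.get(w, w) for w in re.findall(r"\w+", text.lower())]


class ToyIndexServer:
    """Minimal in-memory inverted index standing in for an index server."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = {}

    def add(self, doc_id: str, terms: List[str]) -> None:
        for term in terms:
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, terms: List[str]) -> Set[str]:
        # Return the documents that contain every query term.
        sets = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()


index = ToyIndexServer()
index.add("doc1", analyze("Searching documents"))
index.add("doc2", analyze("Filtering formats"))

# The query is analyzed first; only the resulting terms reach the index.
hits = index.search(analyze("document search"))
```

Here the query "document search" finds doc1 even though the document contains the forms "Searching" and "documents", because both sides are compared in base form.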