Preprocessor

Use

The preprocessor preprocesses documents and search queries.

Document preprocessing comprises the following steps:

  • Loading documents

    If the application transmits the documents as URIs rather than directly, TREX resolves the URIs. "Resolution" involves fetching the documents from the repository that the URIs reference.

  • Filtering documents

    Documents can exist in various formats, such as Microsoft Word, Microsoft PowerPoint, and PDF. The preprocessor extracts the textual content from the documents and converts it to UTF-8 (Unicode) for further processing.

  • Analyzing documents linguistically

    Linguistic analysis involves splitting text into individual words and reducing the words to their base forms (stems). For this, the preprocessor uses a lexicon that is available in several languages.
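
The three steps above can be pictured as a small pipeline. The following Python sketch is illustrative only and does not use any TREX interface: the function names, the `repo://` URI scheme, the dictionary that stands in for a repository, the tag-stripping filter, and the toy lexicon are all assumptions made for this example.

```python
import re

# Hypothetical toy lexicon mapping inflected words to base forms.
# A real lexicon is language-specific and far larger.
LEXICON = {"searching": "search", "documents": "document", "queries": "query"}


def load_document(source, repository):
    """Step 1: return the raw bytes of a document.

    A str is treated as a URI and resolved against the repository
    (here a plain dict); bytes are treated as directly transmitted content.
    """
    if isinstance(source, str):
        return repository[source]
    return source


def filter_document(raw, source_encoding):
    """Step 2: extract the textual content and convert it to UTF-8.

    Stripping angle-bracket tags stands in for a real format filter
    (Word, PowerPoint, PDF, ...), which is much more involved.
    """
    text = raw.decode(source_encoding)
    text = re.sub(r"<[^>]+>", " ", text)
    return text.encode("utf-8")


def analyze(utf8_text):
    """Step 3: split the text into words and reduce them to base forms."""
    words = re.findall(r"\w+", utf8_text.decode("utf-8").lower())
    return [LEXICON.get(word, word) for word in words]


def preprocess(source, repository, source_encoding="utf-8"):
    """Run loading, filtering, and linguistic analysis in sequence."""
    raw = load_document(source, repository)
    return analyze(filter_document(raw, source_encoding))


# Usage: one document passed as a URI, one passed directly.
repository = {"repo://docs/1": "<p>Searching documents</p>".encode("latin-1")}
from_uri = preprocess("repo://docs/1", repository, "latin-1")
direct = preprocess(b"Queries", repository)
```

Words that are not in the lexicon pass through unchanged in this sketch; how an actual lexicon handles unknown words is not specified here.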

For search queries, the preprocessor performs a linguistic analysis of the query. It transmits the results of the analysis to the index server, which continues processing the query.
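
One way to see why queries receive the same linguistic analysis as documents: if indexed words are stored as base forms, the query terms must be reduced to base forms too, or they will not match. The sketch below is a minimal illustration under that assumption. `IndexServerStub`, `analyze_query`, and the toy lexicon are hypothetical names and do not reflect the real TREX index server or its interfaces.

```python
import re

# Hypothetical toy lexicon; not taken from TREX.
BASE_FORMS = {"invoices": "invoice", "paid": "pay"}


def analyze_query(query):
    """Split the query into words and reduce each word to its base form."""
    words = re.findall(r"\w+", query.lower())
    return [BASE_FORMS.get(word, word) for word in words]


class IndexServerStub:
    """Stand-in for an index server: maps base forms to document IDs."""

    def __init__(self, index):
        self.index = index

    def search(self, terms):
        # Return the documents that contain every analyzed term.
        matches = [self.index.get(term, set()) for term in terms]
        return set.intersection(*matches) if matches else set()


server = IndexServerStub({"invoice": {"doc1", "doc2"}, "pay": {"doc2"}})
terms = analyze_query("Paid invoices")
result = server.search(terms)
```

Without the analysis step, the raw words "paid" and "invoices" would find nothing in this index, because only the base forms are stored.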