The preprocessor prepares documents and search queries for further processing.
Document preprocessing comprises the following steps:
Loading documents
If the application transmits the documents as URIs rather than directly, TREX resolves the URIs. "Resolution" involves fetching the documents from the repository that the URIs reference.
Filtering documents
Documents can exist in various formats, such as Microsoft Word, Microsoft PowerPoint, and PDF. The preprocessor extracts the textual content from the documents and converts it to UTF-8-encoded Unicode for further processing.
Analyzing documents linguistically
Linguistic analysis involves splitting text into individual words and reducing each word to its base form (stem). For this, the preprocessor uses lexicons, which are available for several languages.
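The three steps above can be illustrated with a minimal Python sketch. It is not TREX code: the in-memory repository, the toy lexicon, the function names, and the assumption that the source document is plain Latin-1 text are all hypothetical stand-ins chosen to keep the example self-contained. A real filter would extract text from formats such as Word or PDF.

```python
import re

# Hypothetical repository: maps URIs to raw document bytes (Latin-1 here).
REPOSITORY = {
    "repo://docs/1": b"Caf\xe9 menus were updated",
}

# Hypothetical toy lexicon: maps inflected word forms to base forms.
LEXICON_EN = {"menus": "menu", "were": "be", "updated": "update"}


def load_document(uri):
    """Step 1: resolve the URI by fetching the document from the repository."""
    return REPOSITORY[uri]


def filter_document(raw, source_encoding="latin-1"):
    """Step 2: extract the text and convert it to UTF-8.

    Assumption: the raw bytes are already plain text in source_encoding.
    """
    return raw.decode(source_encoding).encode("utf-8")


def analyze(utf8_text, lexicon):
    """Step 3: split the text into words and reduce each to its base form."""
    words = re.findall(r"\w+", utf8_text.decode("utf-8").lower())
    # Words missing from the lexicon are kept unchanged.
    return [lexicon.get(word, word) for word in words]


def preprocess_document(uri):
    """Run loading, filtering, and linguistic analysis in sequence."""
    return analyze(filter_document(load_document(uri)), LEXICON_EN)
```

Calling `preprocess_document("repo://docs/1")` runs the steps in order and returns the list of base forms for that document.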
When processing a search query, the preprocessor analyzes the query linguistically. It transmits the results of the analysis to the index server, which continues processing the query.
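The following sketch shows why a query benefits from the same linguistic analysis as the documents: both are reduced to the same base forms, so differently inflected words still match. The toy lexicon, the sample documents, and the simple dictionary standing in for the index server are assumptions made for illustration and do not reflect how TREX stores or searches its indexes.

```python
import re

# Hypothetical toy lexicon; a real lexicon is far larger and language-specific.
LEXICON_EN = {"houses": "house", "searching": "search", "searched": "search"}


def analyze(text, lexicon):
    """Split text into lowercase words and reduce each to its base form."""
    return [lexicon.get(w, w) for w in re.findall(r"\w+", text.lower())]


# Stand-in for the index server: maps each base form to a set of document IDs.
# The documents are analyzed with the same function before being indexed.
documents = {1: "Searching old houses", 2: "A house by the sea"}
index = {}
for doc_id, text in documents.items():
    for stem in analyze(text, LEXICON_EN):
        index.setdefault(stem, set()).add(doc_id)


def search(query):
    """Analyze the query, then return documents containing every base form."""
    stems = analyze(query, LEXICON_EN)
    hits = [index.get(stem, set()) for stem in stems]
    return set.intersection(*hits) if hits else set()
```

Here the query "searched houses" is reduced to the base forms "search" and "house", which is how it can match document 1 even though that document contains "Searching" rather than "searched".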