Indexing is a complex process consisting of several phases. One phase is the preprocessing of documents by the preprocessor. Preprocessing comprises several steps, including the filtering of documents that are not in text or HTML form.
Preprocessing can take about as much time and consume about as many system resources as the actual indexing process. Filtering a large number of large documents that are not in text or HTML form (for example, large PDFs) can be particularly time- and resource-intensive.
To increase preprocessing throughput, you can distribute preprocessing across multiple hosts. For example, you can dedicate one or more hosts exclusively to preprocessing documents. This is useful when a large number of documents has to be preprocessed for the initial indexing run.
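The idea of spreading preprocessing work across several hosts can be illustrated with a minimal sketch. It is not the product's actual distribution mechanism, which is configured as described in the following sections. The function name `assign_round_robin`, the host names, and the document names are all hypothetical and serve only to show the principle of dividing documents among preprocessing hosts.

```python
from collections import defaultdict
from itertools import cycle


def assign_round_robin(documents, hosts):
    """Assign each document to a preprocessing host in round-robin order.

    Hypothetical helper for illustration only; it does not reflect how
    the real preprocessor distributes its work.
    """
    if not hosts:
        raise ValueError("at least one preprocessing host is required")
    assignment = defaultdict(list)
    # cycle() repeats the host list for as long as documents remain.
    for document, host in zip(documents, cycle(hosts)):
        assignment[host].append(document)
    return dict(assignment)


# Hypothetical documents and host names.
documents = ["a.pdf", "b.html", "c.pdf", "d.txt", "e.pdf"]
hosts = ["prep-host-1", "prep-host-2"]
plan = assign_round_robin(documents, hosts)
```

Round-robin assignment ignores document size. Because large non-text documents such as PDFs are the most expensive to filter, a size-aware assignment would probably balance the load better in practice.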
The following sections contain information on the distributed preprocessing of documents.
The preprocessor is involved in processing search and text-mining requests as well as in indexing. In all of these processes, the preprocessor has the task of preparing the actual processing.
The sections below relate only to the preprocessing of documents for indexing. The role of the preprocessor in processing search and text-mining requests is not described here.