Distributed Preprocessing of Documents

Purpose

Indexing is a complex process consisting of several phases. One phase is the preprocessing of documents by the preprocessor. Preprocessing includes the following steps:

Loading the documents, if the application transmitted them as URIs
Filtering
Carrying out a linguistic analysis
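
The three steps above can be pictured as a small pipeline. The following Python sketch is purely illustrative and is not the preprocessor's actual implementation: the `Document` class, the `fetch` callback, and the placeholder logic for filtering (stripping HTML-like tags) and linguistic analysis (lowercased tokenization) are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Document:
    """Hypothetical document: either a URI to load or inline content."""
    uri: Optional[str] = None
    content: Optional[str] = None


def load(doc: Document, fetch: Callable[[str], str]) -> Document:
    # Step 1: load the content only if the document was transmitted as a URI.
    if doc.content is None and doc.uri is not None:
        doc.content = fetch(doc.uri)
    return doc


def filter_text(raw: str) -> str:
    # Step 2: placeholder filter that replaces HTML-like tags with spaces.
    # A real filter would also handle formats such as PDF.
    return re.sub(r"<[^>]+>", " ", raw)


def analyze(text: str) -> List[str]:
    # Step 3: placeholder linguistic analysis (lowercased word tokens).
    return re.findall(r"\w+", text.lower())


def preprocess(doc: Document, fetch: Callable[[str], str]) -> List[str]:
    # Run the three steps in order and return the analyzed tokens.
    doc = load(doc, fetch)
    return analyze(filter_text(doc.content or ""))
```

A document passed with inline content skips the loading step, whereas a document passed as a URI is fetched first through the supplied callback.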

Preprocessing can take about as much time and use about as many system resources as the actual indexing process. Filtering can be particularly time- and resource-consuming if there are many large documents that are not in text or HTML format (for example, large PDFs).

To increase preprocessing throughput, you can distribute the preprocessing among multiple hosts. For example, you can use one or more hosts exclusively for preprocessing documents. This is useful if a large number of documents have to be preprocessed for the initial indexing run.
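
As a rough illustration of what distributing work among hosts means, the following Python sketch assigns documents to preprocessing hosts in round-robin order. It is a minimal sketch under stated assumptions and does not reflect how the product actually distributes load: the function name `assign_round_robin` and the host names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List, Sequence


def assign_round_robin(doc_ids: Sequence[str], hosts: Sequence[str]) -> Dict[str, List[str]]:
    """Assign each document to a preprocessing host in turn (illustrative only)."""
    if not hosts:
        raise ValueError("at least one preprocessing host is required")
    assignment: Dict[str, List[str]] = {host: [] for host in hosts}
    # cycle() repeats the host list endlessly; zip() stops when documents run out.
    for doc_id, host in zip(doc_ids, cycle(hosts)):
        assignment[host].append(doc_id)
    return assignment


# Hypothetical host names, used only for this example.
plan = assign_round_robin(["d1", "d2", "d3", "d4", "d5"], ["prephost1", "prephost2"])
```

Round-robin ignores document size, so a host that receives several large PDFs can still become a bottleneck. This is one reason why load distribution is normally made controllable.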

The following sections contain information on the distributed preprocessing of documents.

The section Fundamentals explains the preprocessing flow for indexing. It also describes the distribution options and how to control load distribution and performance.
The section Configuration explains how to configure distributed preprocessing.
Note
The preprocessor is involved in processing search and text-mining requests as well as in indexing. In all of these processes, the preprocessor has the task of preparing the actual processing.

The sections below relate only to the preprocessing of documents for indexing. They do not describe the role of the preprocessor in processing search and text-mining requests.