The following parameters in the preprocessor and queue server are important for performance:
The number of preprocessors running on a host
(Number of preprocessors per host)
The number of threads in a preprocessor process
(Number of threads per preprocessor)
The number of preprocessor clients in the queue server
(Pool size per queue server)
You can use the pool size for the queue servers to directly influence the number of preprocessor threads. The number of preprocessor threads and the pool size are connected as follows: <queue server pool size> = <number of preprocessor threads>. For more information, see Preprocessor Threads and Queue Server Pool Size.
Configuration Rules for Preprocessor and Queue Server
You must take into account the following relationships and configuration rules for a high-performance configuration of distributed preprocessing:
<maximum number of preprocessors per host> = <number of CPUs>
That is, a maximum of one preprocessor per CPU.
<maximum number of threads per preprocessor> = 3
That is, a maximum of three threads per preprocessor and per CPU.
<total pool size of all queue servers> = <total number of CPUs for all preprocessor hosts> * 3
These relationships are explained in more detail below.
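The rules above can be turned into a small calculation. The following Python sketch is illustrative only: the function name, the host names, and the CPU counts are assumptions made for the example and are not part of the product configuration.

```python
THREADS_PER_PREPROCESSOR = 3  # upper limit stated in the rules above


def recommended_limits(cpus_per_host):
    """Return per-host upper limits and the total queue server pool size.

    cpus_per_host maps a (hypothetical) host name to its number of CPUs.
    """
    per_host = {}
    for host, cpus in cpus_per_host.items():
        per_host[host] = {
            "max_preprocessors": cpus,  # at most one preprocessor per CPU
            "max_threads": cpus * THREADS_PER_PREPROCESSOR,
        }
    # Total pool size of all queue servers = total CPUs of all hosts * 3
    total_pool_size = sum(cpus_per_host.values()) * THREADS_PER_PREPROCESSOR
    return per_host, total_pool_size


# Two hypothetical preprocessor hosts with 2 and 4 CPUs
limits, pool_size = recommended_limits({"hostA": 2, "hostB": 4})
print(limits["hostA"], limits["hostB"], pool_size)
```

With 6 CPUs in total, the sketch yields a total pool size of 18. These values are upper limits; a smaller configuration is valid if the hosts must keep resources free for other tasks.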
How Many Preprocessors Can Run On a Host?
The number of preprocessors that can run on a host is limited by the available main memory and the number of CPUs.
Each preprocessor process has its own main memory area. If there are multiple preprocessors running, they need a correspondingly large amount of main memory. The main memory requirement of a preprocessor depends on the following factors:
How big are the documents?
What format do the documents have (PDF, HTML, and so on)?
For how many languages is language recognition activated?
If language recognition is activated for one language, the main memory requirement is between 30 and 40 MB per preprocessor. If it is activated for several languages, the main memory requirement is normally around 100 MB per preprocessor.
In some cases, the main memory requirement may be between 500 MB and 1 GB. This worst case can occur if language recognition is activated for all languages and a large number of preprocessor threads are processing large documents at the same time.
If the host has enough main memory, the following upper limit is valid:
<maximum number of preprocessors per host> = <number of CPUs>
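The interaction between the memory limit and the CPU limit can be sketched as follows. The CPU bound is the rule stated above. The memory bound (available memory divided by the requirement per preprocessor) is an assumption of this sketch, as are the function name and the sample figures.

```python
def max_preprocessors(num_cpus, available_memory_mb, memory_per_preprocessor_mb):
    """Estimate an upper bound on the number of preprocessors for one host.

    The CPU bound follows the rule above. The memory bound is an assumption
    of this sketch: the memory budget divided by the need of one process.
    """
    if memory_per_preprocessor_mb <= 0:
        raise ValueError("memory_per_preprocessor_mb must be positive")
    memory_bound = available_memory_mb // memory_per_preprocessor_mb
    return max(0, min(num_cpus, memory_bound))


# Hypothetical host: 4 CPUs and 2048 MB available for preprocessing.
typical = max_preprocessors(4, 2048, 100)   # about 100 MB per preprocessor
heavy = max_preprocessors(4, 2048, 1024)    # worst case, about 1 GB each
print(typical, heavy)
```

In the typical case the CPU count is the limiting factor. In the worst case the main memory becomes the limiting factor before the CPU count does.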
What Is the Maximum Possible Number of Preprocessor Threads?
A preprocessor process can consist of one or more threads. If there are multiple threads, the preprocessor can distribute the requests among the threads and process the requests in parallel. The preprocessor automatically starts the number of threads that is required for processing.
A maximum of three preprocessor threads should be started for each preprocessor process, and therefore for each CPU:
<maximum number of preprocessor threads per preprocessor process> = 3
Since only one preprocessor per CPU and only three threads per preprocessor should be started, this results in the following relationship:
<maximum number of all preprocessor threads running on a host> = <number of CPUs> * 3
You use the queue server pool size to indirectly configure the number of preprocessor threads. For more information, see Queue Server Pool Size and Preprocessor Threads.
If the preprocessor uses the maximum number of threads, it also uses the maximum amount of system resources. The CPUs are then almost fully loaded.
If you want the preprocessor to use fewer system resources, you can configure a smaller number of threads. However, you should not configure a larger number of threads, since this can reduce performance.
The more threads that run in parallel, the more time the operating system needs to manage them (to start, stop, and monitor them). If too many threads run in parallel, the operating system is overwhelmed by thread management.
More Preprocessors or More Threads?
If you want to optimize preprocessing performance, you need to decide whether to increase the number of preprocessors or the number of preprocessor threads. Your decision depends on the following factors:
Required load distribution among the hosts
System resources of the hosts (number of CPUs and available main memory)
If only one host is preprocessing documents, it makes no difference whether one preprocessor is running with multiple threads or several with one thread each.
If several hosts are preprocessing documents, the parameters have the following effect:
The number of preprocessors running on each host controls the load distribution among the hosts.
The more preprocessors running on a host, the more load that host receives.
Example: Preprocessing takes place on the master host and on one preprocessor host. Because the master host also carries out indexing, it should receive a smaller preprocessing load. There is therefore only one preprocessor on the master host, but two preprocessors on the preprocessor host.
The load is distributed between the two hosts in the ratio 1:2.
The number of preprocessor threads controls the performance on one host.
The more threads there are, the more documents a preprocessor can process in parallel.
You cannot increase the number of preprocessor threads indefinitely using the queue server pool size and the number of preprocessors. The maximum number depends on the available system resources.
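The 1:2 load ratio from the example with the master host and the preprocessor host can be reproduced with a short calculation. The sketch below assumes that each host receives a share of the load exactly proportional to its number of preprocessors. The function name and host names are hypothetical.

```python
from fractions import Fraction


def load_shares(preprocessors_per_host):
    """Return each host's share of the preprocessing load.

    Assumption of this sketch: the load is distributed in proportion to
    the number of preprocessors running on each host.
    """
    total = sum(preprocessors_per_host.values())
    if total == 0:
        raise ValueError("at least one preprocessor is required")
    return {host: Fraction(n, total) for host, n in preprocessors_per_host.items()}


# One preprocessor on the master host, two on the preprocessor host
shares = load_shares({"master": 1, "preprocessor_host": 2})
for host, share in shares.items():
    print(f"{host}: {share}")
```

The master host receives one third of the load and the preprocessor host two thirds, which corresponds to the ratio 1:2.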
Availability can also play a part when deciding on the number of preprocessors and preprocessor threads.
Using multiple preprocessors increases the availability of the system. This is because separate processes (preprocessors) have less impact on one another than the threads of a single process do. If a thread hangs, this can affect other threads of the same process, but not the threads of another process.
However, using multiple preprocessors also requires more main memory (see How Many Preprocessors Can Run On a Host? above).