Indexing and Optimization

The queue parameters in the table below define the following:

How much data is indexed at one time

When the queue server triggers optimization

Queue Parameter	Purpose
Index Bulk Size	Maximum number of documents to be indexed at one time. If more documents are awaiting indexing, the queue server distributes them over more than one indexing run. Tip: Indexing takes place at hourly intervals. You want to index a maximum of 1,000 documents at a time. When the start condition is next reached, 3,000 documents are ready for indexing. The queue server distributes the documents over three indexing runs. If fewer documents are ready for indexing, the queue server triggers indexing nevertheless. Tip: Indexing takes place at hourly intervals. You want to index a maximum of 1,000 documents at a time. When the start condition is next reached, only 900 documents are ready for indexing. Although this is fewer than specified in the Index Bulk Size parameter, the queue server triggers indexing nevertheless.
Max Size of Index Bulk	Maximum number of bytes to be indexed at one time. The duration of indexing depends on the size of the documents. If documents are several MB in size, indexing takes correspondingly longer. This parameter therefore defines an upper limit for the data quantity. If the documents exceed this limit, the queue server distributes them over more than one indexing run.
Optimize Bulk Size	Specifies the number of indexing intervals after which the queue server triggers optimization. In the optimization phase, the index server rebuilds the index. It inserts new documents in the index, removes deleted objects from the index, and optimizes the index structure so that it can reply to search queries as quickly as possible. While optimization is running, the queue server does not trigger any more indexing. It waits until optimization is completed.
Initial Indexing Mode	Specifies the situations in which the queue server triggers optimization. This parameter influences the performance of the initial indexing of large data sets (100,000 documents or more). The following settings are possible: Off: The queue server triggers optimization when one of the following situations arises: (1) <Index Bulk Size> * <Optimize Bulk Size> documents have been indexed; (2) the system has indexed all documents that were waiting when the start condition was reached. This setting is suitable if: (1) small document sets are to be indexed initially; (2) the initial indexing of large data sets is complete and you are moving to routine updates of the index. On: The queue server triggers optimization when it has indexed <Index Bulk Size> * <Optimize Bulk Size> documents. This setting is suitable if large document sets are to be indexed initially. It ensures that the index server only optimizes after a larger number of documents have been indexed, whereas the setting Off may also cause the index server to optimize small document sets. Note: Optimizing one larger quantity of documents generally runs quicker than optimizing numerous small quantities of documents. This is because the index server rewrites the index when optimization takes place. The larger the index, the longer the write process takes. If the index server carries out optimization frequently, it must also rewrite the index frequently. This can hamper performance. Note: If you change the parameter from On to Off after the initial indexing run, the Flush function is triggered internally. This causes the queue server to process the remaining content of the queue in its entirety.
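
The way Index Bulk Size and Max Size of Index Bulk split waiting documents into indexing runs can be sketched as follows. This is only an illustration: the function name plan_indexing_runs and the in-order, greedy packing are assumptions, since the text does not describe how the queue server actually distributes documents. A single document larger than the byte limit is assumed to get a run of its own.

```python
def plan_indexing_runs(doc_sizes, index_bulk_size, max_bulk_bytes=None):
    """Split waiting documents (given as sizes in bytes) into indexing runs.

    Hypothetical model: documents are taken in order, and a new run starts
    as soon as the next document would exceed either limit.
    """
    runs = []
    current = []
    current_bytes = 0
    for size in doc_sizes:
        too_many = len(current) >= index_bulk_size
        too_big = (
            max_bulk_bytes is not None
            and len(current) > 0
            and current_bytes + size > max_bulk_bytes
        )
        if too_many or too_big:
            runs.append(current)
            current = []
            current_bytes = 0
        current.append(size)
        current_bytes += size
    if current:
        # Fewer documents than Index Bulk Size still form a run.
        runs.append(current)
    return runs


# 3,000 waiting documents with Index Bulk Size 1,000: three runs.
print([len(run) for run in plan_indexing_runs([1] * 3000, 1000)])  # [1000, 1000, 1000]
# Only 900 waiting documents: indexing is triggered anyway, in one run.
print([len(run) for run in plan_indexing_runs([1] * 900, 1000)])   # [900]
```

The two calls reproduce the 3,000-document and 900-document cases described for Index Bulk Size.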

When changing the parameters, remember that if the queue has the status Indexing or Optimizing, the changes do not affect actions that are currently taking place. The changes only take effect when the queue server has completed the actions.

Example

Initial indexing of large database tables

You want to index a large database table with around 200 million data records. TREX treats each data record as a document that consists only of attributes.

The queue server and index server should process the documents as efficiently as possible. Both servers should have a reasonable load. To avoid idle time and reduce administration overheads, the document sets should not be too small. However, the document sets should also not be so large that they overload both servers.

The application sends the table content to TREX in packages that each contain 25,000 documents. The queue server should always collect 100,000 documents before triggering indexing. You should only index a maximum of 50,000 documents at a time. The index server should only carry out optimization once it has indexed 20 million documents. This means that there must be 400 indexing runs, for 50,000 documents each, before the queue server triggers optimization.

In this case, you set the queue parameters as follows:

Parameters	Value
Schedule Type	Count
Schedule Max Documents	100000
Index Bulk Size	50000
Optimize Bulk Size	400
Initial Indexing Mode	On

Note

The setting Initial Indexing Mode = On ensures that optimization only takes place after the index server has indexed 20 million documents.

If you set the Initial Indexing Mode to Off and keep the remaining configuration, the system triggers indexing as soon as it has collected 100,000 documents. When indexing is complete (that is, 2 * 50,000 documents have been transmitted), the queue server triggers optimization. This means that optimization takes place after 100,000 documents have been indexed rather than after 20 million documents.
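
The difference between the two settings comes down to simple arithmetic, which the following sketch reproduces. The function name and its arguments are hypothetical and only restate the rules given for Initial Indexing Mode; they are not part of any TREX interface.

```python
def documents_before_optimization(index_bulk_size, optimize_bulk_size,
                                  waiting_documents, initial_indexing_mode_on):
    """Number of indexed documents after which optimization is triggered."""
    threshold = index_bulk_size * optimize_bulk_size
    if initial_indexing_mode_on:
        # On: only the product of the two parameters counts.
        return threshold
    # Off: optimization also starts once all waiting documents are indexed.
    return min(threshold, waiting_documents)


# Settings from this example: Index Bulk Size 50,000, Optimize Bulk Size 400,
# and 100,000 documents collected when the start condition is reached.
print(documents_before_optimization(50_000, 400, 100_000, True))   # 20000000
print(documents_before_optimization(50_000, 400, 100_000, False))  # 100000
```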

Initial indexing of large document collections

You want to index 200 million documents (Word files, PDF files, and so on). Indexing should be as efficient as possible.

Processing documents with text content takes much more effort than processing documents that consist only of attributes. Therefore, the queue server should always collect only 10,000 documents before triggering indexing. You should only index a maximum of 10,000 documents at a time, with a maximum of 100 MB. The index server should trigger optimization once 100,000 documents have been indexed.

In this case, you set the queue parameters as follows:

Parameters	Value
Schedule Type	Count
Schedule Max Documents	10000
Index Bulk Size	10000
Max Size of Index Bulk	104857600 (corresponds to 100 MB)
Optimize Bulk Size	10
Initial Indexing Mode	On
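
As a rough check, the parameter values can be derived from the requirements stated above. The helper derive_queue_parameters is hypothetical, and it assumes binary megabytes (1 MB = 1024 * 1024 bytes); whether TREX interprets the byte limit this way is not stated in the text.

```python
def derive_queue_parameters(docs_per_run, docs_before_optimization,
                            max_run_megabytes=None):
    """Translate requirements into queue parameter values (illustrative only)."""
    if docs_before_optimization % docs_per_run != 0:
        raise ValueError("optimization threshold must be a multiple of the run size")
    params = {
        "Index Bulk Size": docs_per_run,
        # Number of indexing runs before optimization is triggered.
        "Optimize Bulk Size": docs_before_optimization // docs_per_run,
    }
    if max_run_megabytes is not None:
        # Assumption: 1 MB = 1024 * 1024 bytes.
        params["Max Size of Index Bulk"] = max_run_megabytes * 1024 * 1024
    return params


# At most 10,000 documents and 100 MB per run, optimization after 100,000 documents.
print(derive_queue_parameters(10_000, 100_000, max_run_megabytes=100))
# {'Index Bulk Size': 10000, 'Optimize Bulk Size': 10, 'Max Size of Index Bulk': 104857600}
```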

Daily update of index

You want to update a large index every day. To avoid creating a heavy load during the day, you schedule the update for 2 a.m.

In this case, you set the queue parameters as follows:

Parameters	Value
Schedule Type	Time
Schedule Time	All(02:00 AM)
Index Bulk Size	10000
Max Size of Index Bulk	104857600 (corresponds to 100 MB)
Optimize Bulk Size	1
Initial Indexing Mode	Off

The queue server triggers indexing at 2 a.m. If there are 10,000 documents or fewer, and the quantity of data does not exceed 100 MB, the documents are indexed in one run and then optimized.

If there are more than 10,000 documents, the queue server triggers indexing and optimization for the first 10,000 documents. The queue server waits until the optimization of these documents is complete. It then processes the next batch of documents.
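
The resulting sequence of indexing and optimization steps can be modeled with a small simulation. This is a simplified sketch under stated assumptions: the function simulate_update is hypothetical, it ignores the byte limit, and it counts indexing runs to decide when the Optimize Bulk Size threshold is reached.

```python
def simulate_update(waiting, index_bulk_size, optimize_bulk_size,
                    initial_indexing_mode_on=False):
    """Return the ordered actions for a given number of waiting documents."""
    actions = []
    runs_since_optimize = 0
    while waiting > 0:
        batch = min(waiting, index_bulk_size)
        actions.append(f"index {batch}")
        waiting -= batch
        runs_since_optimize += 1
        threshold_reached = runs_since_optimize >= optimize_bulk_size
        # With Initial Indexing Mode = Off, an empty queue also triggers optimization.
        queue_drained = waiting == 0 and not initial_indexing_mode_on
        if threshold_reached or queue_drained:
            # No further indexing starts until optimization has finished.
            actions.append("optimize")
            runs_since_optimize = 0
    return actions


# 25,000 waiting documents, Index Bulk Size 10,000, Optimize Bulk Size 1.
print(simulate_update(25_000, 10_000, 1))
# ['index 10000', 'optimize', 'index 10000', 'optimize', 'index 5000', 'optimize']
```

With 10,000 documents or fewer, the same call yields a single indexing run followed by one optimization.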

Hourly update of index

You want to index a smaller document set that changes continually. This index should be updated hourly.

Because the document set to be indexed initially is not large, you can set the queue parameters as follows from the outset:

Parameters	Value
Schedule Type	Time
Schedule Time	All-1
Index Bulk Size	10000
Max Size of Index Bulk	104857600 (corresponds to 100 MB)
Optimize Bulk Size	1
Initial Indexing Mode	Off