Indexing and Optimization

The queue parameters in the table below define the following:

  • What quantity of data is indexed at one time
  • When the queue server triggers optimization

    Queue Parameter        Purpose

    Index Bulk Size

    Maximum number of documents to be indexed at one time.

    If there are more documents awaiting indexing, the queue server distributes them over more than one indexing run.

    If there are fewer documents ready for indexing, the queue server nevertheless triggers indexing.

    Max Size of Index Bulk

    Maximum number of bytes to be indexed at one time.

    The duration of indexing depends on the size of the documents. If documents are several MB in size, indexing takes a corresponding amount of time. This parameter therefore defines an upper limit for the data quantity. If the documents exceed this limit, the queue server distributes them accordingly.

    Optimize Bulk Size

    Specifies the number of indexing runs after which the queue server triggers optimization.

    In the optimization phase, the index server rebuilds the index. It inserts new documents in the index, removes deleted objects from the index, and optimizes the index structure so that it can reply to search queries as quickly as possible.

    While optimization is running, the queue server does not trigger any more indexing. It waits until optimization is completed.

    Initial Indexing Mode

    Specifies the situations in which the queue server triggers optimization.

    This parameter influences the performance of the initial indexing of large data sets (100,000 documents or more).

    The following settings are possible:

    • Off

      The queue server triggers optimization when one of the following situations arises:

      • <Index Bulk Size> * <Optimize Bulk Size> documents have been indexed.
      • The system has indexed all documents that were waiting when the start condition was reached.

      This setting is suitable if:

      • No large document sets are to be indexed initially
      • The initial indexing of large data sets is complete and you are moving to routine updates of the index
    • On

      The queue server triggers optimization when it has indexed <Index Bulk Size> * <Optimize Bulk Size> documents.

      This setting is suitable if large document sets are to be indexed initially.

      This setting ensures that the index server only optimizes after a larger number of documents has been indexed. With the setting Off, by contrast, the index server may also optimize after indexing only a small document set.
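
The trigger rules for Initial Indexing Mode can be sketched as a small decision function. This is only an illustration of the rules described above, not TREX code: the function name, its arguments, and the check that at least one document has been indexed since the last optimization are assumptions made for the sketch.

```python
def should_optimize(indexed_since_optimize: int,
                    queue_drained: bool,
                    index_bulk_size: int,
                    optimize_bulk_size: int,
                    initial_indexing_mode_on: bool) -> bool:
    """Return True if the described rules would trigger optimization.

    indexed_since_optimize: documents indexed since the last optimization.
    queue_drained: True once all documents that were waiting when the
        start condition was reached have been indexed (hypothetical flag).
    """
    # Both modes: optimize after <Index Bulk Size> * <Optimize Bulk Size> documents.
    threshold = index_bulk_size * optimize_bulk_size
    if indexed_since_optimize >= threshold:
        return True
    # Off only: also optimize once the waiting documents have been indexed.
    # Assumption: nothing is optimized if no document was indexed at all.
    if not initial_indexing_mode_on and queue_drained and indexed_since_optimize > 0:
        return True
    return False


# 5,000 documents indexed, queue empty, threshold 10,000 * 10 = 100,000:
print(should_optimize(5_000, True, 10_000, 10, False))  # Off: True
print(should_optimize(5_000, True, 10_000, 10, True))   # On: False
```

With the setting On, the small set of 5,000 documents is left unoptimized until the threshold is reached, which is the behavior intended for large initial loads.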

When changing the parameters, remember that if the queue has the status Indexing or Optimizing, the changes do not affect actions that are currently taking place. The changes only take effect when the queue server has completed the actions.

Example

Initial indexing of large database tables

You want to index a large database table with around 200 million data records. TREX treats each data record as a document that consists only of attributes.

The queue server and index server should process the documents as efficiently as possible. Both servers should have a reasonable load. To avoid idle time and reduce administrative overhead, the document sets should not be too small. However, the document sets should also not be so large that they overload both servers.

The application sends the table content to TREX in packages that each contain 25,000 documents. The queue server should always collect 100,000 documents before triggering indexing. No more than 50,000 documents should be indexed at a time. The index server should only carry out optimization once it has indexed 20 million documents. This means that there must be 400 indexing runs of 50,000 documents each before the queue server triggers optimization.
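
The arithmetic behind these figures can be checked with a short calculation. The helper name below is hypothetical and only serves the illustration; the numbers are taken from the scenario above, and the total of 200 million records is approximate.

```python
def optimize_bulk_size_for(docs_before_optimize: int, index_bulk_size: int) -> int:
    """Number of indexing runs needed before optimization is triggered."""
    runs, remainder = divmod(docs_before_optimize, index_bulk_size)
    if remainder:
        raise ValueError("docs_before_optimize must be a multiple of index_bulk_size")
    return runs


# Indexing runs of 50,000 documents until 20 million documents are indexed.
runs_before_optimize = optimize_bulk_size_for(20_000_000, 50_000)

# Each collected set of 100,000 documents is split into runs of 50,000.
runs_per_trigger = 100_000 // 50_000

# Rough number of optimizations for about 200 million records.
optimizations_total = 200_000_000 // 20_000_000

print(runs_before_optimize, runs_per_trigger, optimizations_total)  # 400 2 10
```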

In this case, you set the queue parameters as follows:

Parameter                 Value
Schedule Type             Count
Schedule Max Documents    100000
Index Bulk Size           50000
Optimize Bulk Size        400
Initial Indexing Mode     On

Initial indexing of large document collections

You want to index 200 million documents (Word files, PDF files, and so on). Indexing should be as efficient as possible.

Processing documents with text content takes much more effort than processing documents that consist only of attributes. Therefore, the queue server should collect only 10,000 documents before triggering indexing. No more than 10,000 documents, with a total size of no more than 100 MB, should be indexed at a time. The queue server should trigger optimization once 100,000 documents have been indexed.
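
Because Max Size of Index Bulk is specified in bytes, the 100 MB limit has to be converted. A minimal sketch of the conversion, assuming that one MB means 1024 * 1024 bytes (the helper name is hypothetical):

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def mb_to_bytes(megabytes: int) -> int:
    """Convert a size in MB to bytes."""
    return megabytes * BYTES_PER_MB


max_size_of_index_bulk = mb_to_bytes(100)
print(max_size_of_index_bulk)  # 104857600
```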

In this case, you set the queue parameters as follows:

Parameter                 Value
Schedule Type             Count
Schedule Max Documents    10000
Index Bulk Size           10000
Max Size of Index Bulk    104857600 (corresponds to 100 MB)
Optimize Bulk Size        10
Initial Indexing Mode     On

Daily update of index

You want to update a large index every day. To avoid creating a heavy load during the day, you schedule the update for 2:00 a.m.

In this case, you set the queue parameters as follows:

Parameter                 Value
Schedule Type             Time
Schedule Time             All(02:00 AM)
Index Bulk Size           10000
Max Size of Index Bulk    104857600 (corresponds to 100 MB)
Optimize Bulk Size        1
Initial Indexing Mode     Off

The queue server triggers indexing at 2:00 a.m. If there are 10,000 documents or fewer, and the quantity of data does not exceed 100 MB, the documents are indexed in one run and then optimized.

If there are more than 10,000 documents, the queue server triggers indexing and optimization for the first 10,000 documents. The queue server waits until the optimization of these documents is complete. It then processes the next batch of documents.
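
How waiting documents are distributed over several indexing runs can be illustrated with a small simulation. This is a sketch under assumptions, not the queue server's actual algorithm: the function name is hypothetical, documents are represented only by their size in bytes, and a single document larger than the byte limit is assumed to be indexed in a run of its own.

```python
from typing import Iterable, Iterator, List


def split_into_runs(doc_sizes: Iterable[int],
                    index_bulk_size: int,
                    max_bulk_bytes: int) -> Iterator[List[int]]:
    """Yield one list of document sizes per indexing run."""
    run: List[int] = []
    run_bytes = 0
    for size in doc_sizes:
        # Close the current run if adding this document would exceed
        # the document limit or the byte limit.
        if run and (len(run) >= index_bulk_size or run_bytes + size > max_bulk_bytes):
            yield run
            run, run_bytes = [], 0
        run.append(size)
        run_bytes += size
    if run:
        yield run


# 25,000 waiting documents of 1 KB each, limits of 10,000 documents and 100 MB.
waiting = [1024] * 25_000
runs = list(split_into_runs(waiting, 10_000, 100 * 1024 * 1024))
print([len(r) for r in runs])  # [10000, 10000, 5000]
```

With Optimize Bulk Size set to 1, each of these three runs would be followed by an optimization before the next run starts.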

Hourly update of index

You want to index a smaller document set that changes continually. This index should be updated hourly.

Because the document set to be indexed initially is not large, you can set the queue parameters as follows from the start:

Parameter                 Value
Schedule Type             Time
Schedule Time             All-1
Index Bulk Size           10000
Max Size of Index Bulk    104857600 (corresponds to 100 MB)
Optimize Bulk Size        1
Initial Indexing Mode     Off