Documents can exist in various languages and file formats. The TREX preprocessor converts the documents into UTF-8 encoded HTML so that TREX can process them. If no information on the document language is available, the preprocessor also carries out language recognition before processing the document further. You specify the languages that the preprocessor should recognize in the std.langid-config configuration file, either during the TREX installation or afterwards. For more information, see Configuring Language Recognition Manually. The language of a document is needed so that the document can be placed in the correct language version of the index.
Language recognition is based on statistical methods. Because the frequency of certain letter combinations is characteristic of a language, these combinations can be used to identify the language with a reasonable degree of probability. A frequency file exists for each of the languages supported by TREX. It contains frequency ratings and weightings for the letter combinations that are typical of the language in question. The TREX preprocessor checks whether the text of the document being identified contains these combinations. The document is then assigned to the language to which it is most similar.
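The statistical idea described above can be illustrated with a minimal sketch. This is not the TREX implementation: the trigram profiles, the overlap-based similarity measure, the two tiny sample texts, and all function names are assumptions chosen only to show how letter-combination frequencies can point to a language.

```python
from collections import Counter

def trigram_profile(text):
    """Return the relative frequency of each letter trigram in text."""
    # Keep letters only, lower-cased; everything else becomes a word boundary.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    padded = " " + " ".join(cleaned.split()) + " "
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(grams.values())
    return {g: n / total for g, n in grams.items()} if total else {}

def similarity(doc_profile, lang_profile):
    """Overlap of two profiles; a higher value means more similar."""
    return sum(min(w, lang_profile.get(g, 0.0)) for g, w in doc_profile.items())

def identify_language(text, lang_profiles):
    """Assign text to the language whose profile it overlaps with most."""
    doc = trigram_profile(text)
    return max(lang_profiles, key=lambda lang: similarity(doc, lang_profiles[lang]))

# Hypothetical stand-ins for the per-language frequency files.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the cat runs with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft die katze mit den anderen tieren",
}
LANG_PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}

print(identify_language("the dog and the cat", LANG_PROFILES))
print(identify_language("der hund und die katze", LANG_PROFILES))
```

Real frequency files would be derived from far larger text collections than the two sample sentences used here, which is why the sketch is only reliable on text that closely resembles its samples.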
Because the language of documents with only a small amount of text cannot be reliably identified, TREX preprocessor language recognition is only activated if at least 7 terms (default value) can be recognized in a document. Once the language has been identified, the term recognition process that follows language recognition can use user-specific dictionaries to increase the number of terms recognized.
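The minimum-term rule can be sketched as a simple guard. The value 7 is the default stated above. The function and constant names, and the definition of a term as a run of letters, are assumptions for illustration and do not reflect how TREX counts terms internally.

```python
import re

MIN_TERMS_DEFAULT = 7  # default from the text; the name of the real setting is not given

def count_terms(text):
    """Count runs of letters as a rough stand-in for recognized terms."""
    return len(re.findall(r"[^\W\d_]+", text))

def language_recognition_enabled(text, min_terms=MIN_TERMS_DEFAULT):
    """Return True only if the document has enough terms to identify reliably."""
    return count_terms(text) >= min_terms

print(language_recognition_enabled("Short note"))
print(language_recognition_enabled("one two three four five six seven"))
```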
Not all words that appear in the text of a document are equally significant for representing that document in an index. This is why language processing takes place after language recognition. Language processing uses various text operations to generate the terms that are significant for creating an index.
Text Operations for Language Processing
Tokenization: Determining word and sentence boundaries
Normalization: Normalizing orthography
Tagging: Determining word types
Stemming: Reducing words to their stem form (for example, mice → mouse)
Stop words: Eliminating frequent words (such as "and" and "or")
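The text operations listed above can be chained into a small pipeline, sketched below under simplifying assumptions. The stop-word list, the irregular-stem table, and the plural-stripping rule are all hypothetical and cover only a few English examples. Tagging is omitted because it needs a trained model or a lexicon, and removing stop words before stemming is a design choice of this sketch, not a documented TREX order.

```python
import re

# Hypothetical resources; a real system would load these per language.
STOP_WORDS = {"and", "or", "the", "a", "of"}
IRREGULAR_STEMS = {"mice": "mouse"}

def tokenize(text):
    """Tokenization: split text into word tokens (runs of letters)."""
    return re.findall(r"[^\W\d_]+", text)

def normalize(token):
    """Normalization: reduce orthographic variation, here only letter case."""
    return token.lower()

def stem(token):
    """Stemming: map a word to its stem using a table and one naive rule."""
    if token in IRREGULAR_STEMS:
        return IRREGULAR_STEMS[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]  # naive plural stripping, for illustration only
    return token

def index_terms(text):
    """Return the terms of text that would be candidates for indexing."""
    terms = []
    for token in tokenize(text):
        word = normalize(token)
        if word in STOP_WORDS:  # stop words are dropped before stemming
            continue
        terms.append(stem(word))
    return terms

print(index_terms("The mice and the cats"))  # ['mouse', 'cat']
```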