Configuring Language Recognition (SAP Library - Search and Classification (TREX))

Configuring Language Recognition

Use

TREX recognizes the language of documents using the lexicon software of third-party providers and the TREX preprocessor. You can configure both types of language recognition.

Modifying Language Recognition with Lexicon Software

Language recognition with lexicon software includes the following areas:

· Configuring additional languages

When TREX is installed, you select the languages to be identified by language recognition. You can subsequently configure language recognition for additional languages.

Note

Only activate the languages that appear in your documents and that you also want to index. Doing this optimizes the performance of the language recognition procedure and of indexing in general. Moreover, the fewer languages used, the better the results of language recognition.

· Disregarding parts of HTML or XML documents

You can configure the system so that certain parts of HTML or XML documents are ignored when the language recognition procedure takes place. Documents that are to be indexed and are in HTML or XML format often contain elements (such as JavaScript programs) that impair the performance of the language recognition procedure.

· Changing the number of characters for language recognition

Language recognition using lexicon software takes only a certain number of characters into consideration. This number is usually set to its optimum value at delivery. If languages are not being recognized correctly, you can increase the quantity of text that is taken into consideration.

Note

In certain cases, language recognition might not deliver the correct language. In particular, problems can occur when processing documents that are very short or that contain a large number of abbreviations or words borrowed from another language.

You make modifications to the lexicon software language recognition in the configuration file <TREX_Installation_Directory>\Lexicon\std.langid-config on the TREX preprocessor. The settings are valid for all indexes on the preprocessor. If you are using more than one TREX preprocessor, you need to modify the configuration file of each preprocessor.


1. Open the configuration file <TREX_Installation_Directory>\Lexicon\std.langid-config with a text editor.

2. In the section <encodings-languages-covered>, check the list of languages to be taken into consideration for the language recognition procedure. The list is under <list key = "utf_8">.

Delete languages that you do not need, or flag them as comments using <!-- and -->.

You can add more languages to the list as needed as long as the languages in question are supported by the language recognition service. The following list shows which languages you can use, and gives the entry that you enter into the list for each language.

Languages supported by TREX

Language	Entry
Chinese (simplified)	<item key = "simplified-chinese" />
Chinese (traditional)	<item key = "traditional-chinese" />
Danish	<item key = "danish" />
German	<item key = "german" />
English	<item key = "english" />
Finnish	<item key = "finnish" />
French	<item key = "french" />
Dutch	<item key = "dutch" />
Italian	<item key = "italian" />
Japanese	<item key = "japanese" />
Korean	<item key = "korean" />
Norwegian (Bokmål)	<item key = "bokmal" />
Norwegian (Nynorsk)	<item key = "nynorsk" />
Portuguese	<item key = "portuguese" />
Swedish	<item key = "swedish" />
Spanish	<item key = "spanish" />

Languages supported by TREX with limited functionality

Language	Entry
Arabic	<item key = "arabic" />
Greek	<item key = "greek" />
Hebrew	<item key = "hebrew" />
Polish	<item key = "polish" />
Romanian	<item key = "romanian" />
Russian	<item key = "russian" />
Thai	<item key = "thai" />
Czech	<item key = "czech" />
Turkish	<item key = "turkish" />
Hungarian	<item key = "hungarian" />

Only limited text-mining functions are currently available for these additional languages. For more information on these languages, see Supported Languages with Restricted Functionality.

Example

In the following example, English, French, and Danish are taken into consideration. Italian is not.

<encodings-languages-covered>

...

All documents are converted to UTF-8 Unicode format before language recognition takes place. Therefore, only the section <list key = "utf_8"> is relevant in the language list. The entries for the other encodings do not need to be modified.
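The following Python sketch illustrates how entries flagged as comments drop out of the language list. It assumes a simplified layout of the section <encodings-languages-covered>: the fragment in SAMPLE is hypothetical and is not copied from a delivered std.langid-config file, and the function name active_languages is invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. The layout of a real
# std.langid-config file may differ from this assumption.
SAMPLE = """
<encodings-languages-covered>
  <list key = "utf_8">
    <item key = "english" />
    <item key = "french" />
    <item key = "danish" />
    <!-- <item key = "italian" /> -->
  </list>
</encodings-languages-covered>
"""


def active_languages(xml_text, encoding_key="utf_8"):
    """Return the language keys listed for one encoding.

    The default ElementTree parser discards XML comments, so
    commented-out <item> entries are not returned.
    """
    root = ET.fromstring(xml_text.strip())
    for lst in root.iter("list"):
        if lst.get("key") == encoding_key:
            return [item.get("key") for item in lst.findall("item")]
    return []


if __name__ == "__main__":
    print(active_languages(SAMPLE))
```

A check of this kind can be useful after editing the file by hand, because it shows which languages a standard XML parser still sees in the utf_8 list.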

3. The section <remove-markup-content> contains a list of markup elements that are ignored when language recognition takes place. All text enclosed in these elements is ignored.

Example

The following example shows a section of the list.

<remove-markup-content>

...

If, for example, JavaScript programs (marked by <script>) occur in HTML documents, they are ignored when language recognition takes place.

If necessary, you can add more elements to the list, or remove existing elements from the list.

Example

You have French documents that also contain a short summary in English. The summary is marked with the tag <English-Abstract>: <English-Abstract> This is the abstract in English. ... </English-Abstract>. Add the line <item key = "english-abstract" /> to the list mentioned above.

On the other hand, if you want text marked with <title> to be taken into consideration for language recognition, you need to remove the line <item key = "title" /> from the list.
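The effect of the list can be pictured with the following Python sketch. It is not the TREX implementation. It is a simplified illustration that removes the content of the listed elements from a document before any language detection would run. The function name strip_ignored_elements is hypothetical, and the regular expression handles only simple, non-nested elements.

```python
import re


def strip_ignored_elements(markup, ignored_tags):
    """Remove each listed element, including its content, from the markup.

    Illustration only: tag names are matched case-insensitively and
    nested elements of the same name are not handled.
    """
    for tag in ignored_tags:
        pattern = re.compile(
            r"<{0}\b[^>]*>.*?</{0}\s*>".format(re.escape(tag)),
            re.IGNORECASE | re.DOTALL,
        )
        markup = pattern.sub(" ", markup)
    return markup


if __name__ == "__main__":
    document = (
        "<p>Ceci est un document en français.</p>"
        "<English-Abstract>This is the abstract in English.</English-Abstract>"
    )
    # Only the French text remains for language recognition.
    print(strip_ignored_elements(document, ["english-abstract", "script"]))
```

With the English abstract removed, only French text remains, which is the reason for adding the element to the list in the example above.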

4. The section <detection-buffer-size> determines what quantity of text is taken into consideration when a document is subjected to the language recognition procedure. You can increase this value if you think that the quantity of text is too small.

Caution

Increase this value only in exceptional circumstances. The larger the quantity of text, the longer language recognition, and therefore indexing, takes.

The value in the section <detection-buffer-size> cannot be greater than the value in the section <langid-buffer-size>.

5. Save the file and close the text editor.
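Before saving, you may want to check the constraint from step 4. The following Python sketch shows the comparison. The function name check_buffer_sizes and the numbers used are hypothetical, and the requirement that both values be positive is an assumption. How the two values are stored in the configuration file is not shown in this section, so the sketch takes them as plain integers.

```python
def check_buffer_sizes(detection_buffer_size, langid_buffer_size):
    """Return True if detection-buffer-size does not exceed langid-buffer-size."""
    if detection_buffer_size <= 0 or langid_buffer_size <= 0:
        # Assumption: buffer sizes are positive numbers.
        raise ValueError("buffer sizes must be positive")
    return detection_buffer_size <= langid_buffer_size


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(check_buffer_sizes(2000, 4000))
    print(check_buffer_sizes(8000, 4000))
```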

Modifying Language Recognition Using the TREX Preprocessor

Because the language of documents with only a small amount of text cannot be reliably identified, TREX preprocessor language recognition is only activated if at least 7 terms (default value) can be recognized per document. You can change this value if you are using TREX in a scenario with very short sentences.

You make modifications to TREX preprocessor language recognition in the configuration file <TREX_Installation_Directory>\TREXPreprocessor.ini.

1. Open the configuration file <TREX_Installation_Directory>\TREXPreprocessor.ini with a text editor.

2. In the section [lexicon], change the parameter min_valid_tokens.

The default value of this parameter is 7. Choose a lower value if you want the TREX preprocessor to try to identify the language of documents with fewer terms per document.
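If you prefer to script the change, the following Python sketch shows one possible approach using the standard configparser module. It assumes that the file can be read as a conventional INI file, which is not confirmed here. The sample content in SAMPLE_INI is hypothetical, and the function name set_min_valid_tokens is invented for illustration. Note that configparser does not preserve comments when it writes a file, so test a change of this kind on a copy first.

```python
import configparser
import io

# Hypothetical excerpt. A real TREXPreprocessor.ini contains other
# sections and parameters that are not shown here.
SAMPLE_INI = """\
[lexicon]
min_valid_tokens = 7
"""


def set_min_valid_tokens(ini_text, value):
    """Return the INI text with [lexicon] min_valid_tokens set to value."""
    if value < 1:
        raise ValueError("min_valid_tokens must be at least 1")
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section("lexicon"):
        parser.add_section("lexicon")
    parser.set("lexicon", "min_valid_tokens", str(value))
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(set_min_valid_tokens(SAMPLE_INI, 3))
```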

Restarting the TREX Preprocessor


You need to stop and restart the preprocessor for the new settings to take effect. To do this, use the function for starting and stopping the TREX servers in the TREX admin tool (stand-alone). Note that the TREX daemon automatically restarts the server after it has been stopped. The settings apply to all documents indexed after the TREX preprocessor is restarted.

Caution

The new settings do not affect documents that have already been indexed. This means that if, for example, a document that has already been indexed has been assigned the wrong language, you must reindex it.