Configuring Language Recognition

Use

Language recognition takes place first using the lexicon software of third-party providers and then using the TREX preprocessor. You can configure both types of language recognition.

Naming Convention

Central directory for executable files <CENTRAL_DIR>
- On UNIX: /usr/sap/<SAPSID>/SYS/exe/nuc/<OS>
- On Windows: <drive>:\usr\sap\<SAPSID>\SYS\exe\nuc\<OS>
  As part of the CPE (Central Patch Environment), the sapcpe program takes on the automatic synchronization of executable files and copies them from the central directory for executable files, <CENTRAL_DIR>, into the local directory for executable files, <TREX_DIR>\exe. When you restart TREX, the system automatically launches the sapcpe program. During all subsequent starts, sapcpe checks whether or not the local executable files are up-to-date and copies new or changed executable files from the central directory to the local directory, <TREX_DIR>\exe.
TREX installation directory <TREX_INSTALL>
- UNIX: /usr/sap/<sapsid>/trx<instance_number>/<TREX_host_name>
- Windows: <disk_drive>:\usr\sap\<SAPSID>\TRX<instance_number>\<trex_hostname>

Modifying Language Recognition with Lexicon Software

Language recognition with lexicon software includes the following areas:

Configure additional languages
You can retrospectively configure language recognition for additional languages. When TREX is installed, you select the languages to be identified by language recognition.

Note
Only activate the languages that appear in your documents and that you also want to index. Doing this optimizes the performance of the language recognition procedure and of indexing in general. Moreover, the fewer languages are activated, the better the results of language recognition.
Disregarding parts of HTML or XML documents
You can configure the system so that certain parts of HTML or XML documents are ignored when the language recognition procedure takes place. Documents that are to be indexed and are in HTML or XML format often contain elements (such as JavaScript programs) that impair the performance of the language recognition procedure.
Changing the number of characters for language recognition
Language recognition using lexicon software is set up so that only a certain number of characters are taken into consideration. This is usually set to its optimum value at delivery. If it turns out that languages are not being recognized correctly, you can increase the quantity of text that is taken into consideration.

Note
In certain cases, language recognition might not deliver the correct language. In particular, problems can occur when processing documents that are very short or that contain a large number of abbreviations or words loaned from another language.

You modify the lexicon software language recognition by editing the std.langid-config configuration file on the TREX preprocessor. The settings are valid for all indexes on the preprocessor. If you are using more than one TREX preprocessor, you need to modify the configuration file of each preprocessor.

Open the std.langid-config configuration file in the central directory for executable files, <CENTRAL_DIR>\lexicon, in a text editor.
In the section <encodings-languages-covered>, check the list of languages to be taken into consideration for the language recognition procedure. The list is under <list key = "utf_8">.
Delete languages that you do not need, or flag them as comments.

You can add more languages to the list as needed as long as the languages in question are supported by the language recognition service. The following list shows which languages you can use, and gives the entry that you enter into the list for each language.

Languages supported by TREX

Language	Entry
Chinese (simplified)	<item key = "simplified-chinese" />
Chinese (traditional)	<item key = "traditional-chinese" />
Danish	<item key = "danish" />
German	<item key = "german" />
English	<item key = "english" />
Finnish	<item key = "finnish" />
French	<item key = "french" />
Dutch	<item key = "dutch" />
Italian	<item key = "italian" />
Japanese	<item key = "japanese" />
Korean	<item key = "korean" />
Norwegian (Bokmal)	<item key = "bokmal" />
Norwegian (Nynorsk)	<item key = "nynorsk" />
Portuguese	<item key = "portuguese" />
Swedish	<item key = "swedish" />
Spanish	<item key = "spanish" />

Languages supported by TREX with limited functionality

Language	Entry
Arabic	<item key = "arabic" />
Greek	<item key = "greek" />
Hebrew	<item key = "hebrew" />
Polish	<item key = "polish" />
Romanian	<item key = "romanian" />
Russian	<item key = "russian" />
Thai	<item key = "thai" />
Czech	<item key = "czech" />
Turkish	<item key = "turkish" />
Hungarian	<item key = "hungarian" />

Only limited text-mining functions are currently available for these additional languages. For more information about these languages, see Supported Languages with Restricted Functionality.

Tip

In the following example, English, French, and Danish are taken into consideration. Italian is not.

<encodings-languages-covered>

...

All documents are converted to UTF-8 Unicode format before language recognition takes place. Therefore, only the section <list key = "utf_8"> is relevant in the language list. The lists for the other encodings do not need to be modified.
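To check which languages are currently active in the utf_8 list, a small script can help. The following is a minimal sketch, not part of TREX: the sample fragment is hypothetical, it is built from the <item key = "..." /> entries listed above, and it assumes that deactivated languages are flagged with XML-style comments. The function name active_languages is made up for this illustration.

```python
import re

# Hypothetical fragment of the utf_8 language list. The XML-style comment
# around the Italian entry is an assumption made for this illustration.
SAMPLE = """
<list key = "utf_8">
    <item key = "english" />
    <item key = "french" />
    <item key = "danish" />
    <!-- <item key = "italian" /> -->
</list>
"""

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
ITEM = re.compile(r'<item\s+key\s*=\s*"([^"]+)"\s*/>')


def active_languages(fragment: str) -> list[str]:
    """Return the keys of all item entries that are not commented out."""
    without_comments = COMMENT.sub("", fragment)
    return ITEM.findall(without_comments)


print(active_languages(SAMPLE))  # ['english', 'french', 'danish']
```

Running such a check on a copy of the file before restarting TREX makes it easier to confirm that only the intended languages remain active.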

The section <remove-markup-content> contains a list of markup elements that are ignored when language recognition takes place. All text enclosed in these elements is ignored.
Tip
The following example shows a section of the list.

<remove-markup-content>

<item key = "applet" />

<item key = "code" />

<item key = "script" />

...

<item key = "title" />

For example, if JavaScript programs (marked by <script>) occur in HTML documents, they are ignored when language recognition takes place.

If necessary, you can add more elements to the list, or remove existing elements from the list.

Tip
You have French documents that also contain a short summary in English. The summary is marked with the tag <English-Abstract>: <English-Abstract> This is the abstract in English. ... </English-Abstract>. Add the line <item key = "english-abstract" /> to the list mentioned above.

On the other hand, if you want text marked with <title> to be taken into consideration for language recognition, you need to remove the line <item key = "title" /> from the list.
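The effect of the <remove-markup-content> list can be pictured with a short script. This is a sketch of the idea only, using Python's standard HTML parser rather than the TREX implementation. The class name, the helper visible_text, and the sample document are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class TextOutsideIgnoredTags(HTMLParser):
    """Collect text that is not enclosed in any of the ignored elements."""

    def __init__(self, ignored):
        super().__init__()
        self.ignored = {tag.lower() for tag in ignored}
        self.depth = 0      # > 0 while inside an ignored element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.ignored:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignored and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html, ignored):
    parser = TextOutsideIgnoredTags(ignored)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical French document with an English abstract.
DOC = (
    "<html><head><title>Rapport annuel</title>"
    "<script>var x = 1;</script></head><body>"
    "<English-Abstract>This is the abstract in English.</English-Abstract>"
    "<p>Ceci est le texte principal.</p></body></html>"
)

# With "title" in the list, the title text is skipped as well.
print(visible_text(DOC, ["script", "title", "english-abstract"]))
# With "title" removed from the list, the title text is kept.
print(visible_text(DOC, ["script", "english-abstract"]))
```

In the first call only the French body text remains. In the second call the title is included as well, which mirrors removing <item key = "title" /> from the list.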
The section <detection-buffer-size> determines the quantity of text that is taken into consideration when a document is subjected to the language recognition procedure. You can increase this value if you think that the quantity of text is too small.
Caution
However, this should only be done in exceptional circumstances. The larger the quantity of text, the longer language recognition, and therefore indexing, takes.

The value in the section <detection-buffer-size> cannot be greater than the value in the section <langid-buffer-size>.
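The relationship between the two buffer sizes can be expressed as a simple check. The following sketch is not TREX code: the function names and the numeric values are hypothetical, and it assumes that both values count characters, as described for the lexicon language recognition above.

```python
def check_buffer_sizes(detection_buffer_size: int, langid_buffer_size: int) -> None:
    """Raise ValueError if the detection buffer exceeds the langid buffer."""
    if detection_buffer_size <= 0:
        raise ValueError("detection-buffer-size must be positive")
    if detection_buffer_size > langid_buffer_size:
        raise ValueError(
            "detection-buffer-size must not be greater than langid-buffer-size"
        )


def text_for_detection(document_text: str, detection_buffer_size: int) -> str:
    """Return the leading part of the text that would be examined."""
    return document_text[:detection_buffer_size]


# Hypothetical values, chosen only to demonstrate the check.
check_buffer_sizes(detection_buffer_size=1000, langid_buffer_size=4000)
print(text_for_detection("Dies ist ein kurzer Beispieltext.", 12))
```

A larger detection buffer means that more characters are examined for each document, which is why increasing it lengthens language recognition and indexing.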
Save the file and close the text editor.
Restart TREX.
For the changes to the std.langid-config configuration file in the <CENTRAL_DIR>\lexicon directory to take effect, you must restart TREX. When you restart TREX, the sapcpe program copies the changed configuration files from the central directory for executable files, <CENTRAL_DIR>\lexicon, to the local TREX directory, <TREX_DIR>\exe\lexicon, and overwrites the std.langid-config configuration file there.

Caution
You can also use the TREX admin tool (stand-alone), area Landscape → Ini, to change the std.langid-config configuration file and then have the changes take effect by restarting the TREX preprocessor. Note that only the file in the <TREX_DIR>\exe\lexicon directory is changed if you use this method. If you have changed the std.langid-config configuration file in the central directory, <CENTRAL_DIR>\lexicon, as described above and restarted TREX, the system overwrites the changed file in the local directory, <TREX_DIR>\exe\lexicon, during the automatic synchronization by the CPE and the changes are lost.
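Why local changes are lost can be illustrated with a small simulation. This sketch does not call or reproduce sapcpe. It only mimics the documented behavior of copying new or changed files from a central directory to a local one. Comparing file contents is an assumption, the function name sync_directory is hypothetical, and the directories are temporary stand-ins for <CENTRAL_DIR>\lexicon and <TREX_DIR>\exe\lexicon.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def sync_directory(central: Path, local: Path) -> list[str]:
    """Copy files that are new or differ from central to local."""
    local.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(central.iterdir()):
        if not src.is_file():
            continue
        dst = local / src.name
        # Assumption: "changed" is decided by comparing file contents.
        if not dst.exists() or not filecmp.cmp(src, dst, shallow=False):
            shutil.copy2(src, dst)
            copied.append(src.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    central = Path(tmp) / "central" / "lexicon"
    local = Path(tmp) / "local" / "exe" / "lexicon"
    central.mkdir(parents=True)
    name = "std.langid-config"

    (central / name).write_text("central version 1\n")
    first = sync_directory(central, local)      # initial copy

    (local / name).write_text("local edit\n")   # change made locally only
    (central / name).write_text("central version 2\n")
    second = sync_directory(central, local)     # next restart

    final_text = (local / name).read_text()

print(first, second, final_text.strip())
```

After the second synchronization the local file again holds the central content, so the locally made edit is gone. Keeping the central file as the single place for changes avoids this.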

Modifying Language Recognition with the TREX Preprocessor

The language of documents with only a small amount of text cannot be reliably identified. For this reason, TREX preprocessor language recognition is only activated if at least seven terms (default value) can be recognized in a document. You can change this value if you are using TREX in a scenario with very short texts.

You make modifications for TREX preprocessor language recognition in the TREXPreprocessor.ini configuration file.

Open the <TREX_INSTALL>\TREXPreprocessor.ini configuration file with a text editor.
In the section [lexicon], change the min_valid_tokens parameter.
The default value of this parameter is 7. Choose a lower value if you want the TREX preprocessor to try to identify the language of documents with fewer terms per document.
Restart the TREX preprocessor.
You need to stop and restart the preprocessor for the new settings to take effect. You do this with the TREX admin tool (stand-alone), using the function for starting and stopping the TREX servers. Note that the TREX daemon automatically restarts the server after it has been stopped. The settings are valid for all documents indexed after the TREX preprocessor is restarted.

Caution
The new settings do not affect documents that have already been indexed. This means that if, for example, a document that has already been indexed has been assigned to the wrong language, it must be reindexed.
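The role of min_valid_tokens can be sketched as follows. This is an illustration under assumptions, not the TREX logic: the sample file content is hypothetical, it is assumed that the [lexicon] section can be read with Python's configparser, and splitting on whitespace is a simplification of what counts as a recognized term.

```python
import configparser

# Hypothetical excerpt of TREXPreprocessor.ini with a lowered threshold.
SAMPLE_INI = """
[lexicon]
min_valid_tokens = 3
"""


def read_min_valid_tokens(ini_text: str, default: int = 7) -> int:
    """Read min_valid_tokens from [lexicon], falling back to the default 7."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getint("lexicon", "min_valid_tokens", fallback=default)


def recognition_attempted(text: str, threshold: int) -> bool:
    """Simplified check: enough whitespace-separated terms in the text?"""
    return len(text.split()) >= threshold


threshold = read_min_valid_tokens(SAMPLE_INI)
print(threshold)
print(recognition_attempted("Bitte sofort antworten", threshold))
print(recognition_attempted("Bitte sofort antworten", 7))
```

With the lowered threshold, the three-term text qualifies for language recognition. With the default of 7 it does not.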