Manually Performing Preprocessing Steps
In the Preprocessor: View Docs window, you can manually perform the processing steps that the preprocessor carries out automatically during indexing.
You can specify a document for the preprocessor to process as follows:
● Load and filter
● Load, filter, and analyze linguistically
You use this function for test purposes and for troubleshooting.
The graphic below depicts the structure of the Preprocessor: View Docs window.
The function bars contain the following fields and buttons:
Field/Button |
Description |
File/URL |
Path or URI of the document that the preprocessor is to process. |
Show Original |
Opens the original document. An application for opening the document must be installed. |
Show Filtered |
Displays the filtered document in the browser. |
None |
Clears the output area. |
Get+Filter |
Loads and filters the document. You use this function to check which HTML code the filters generate from the original document. |
Get+Filter+Lex |
Loads and filters the document and analyzes it linguistically. You can also use this function to check the results of the linguistic analysis as follows:
● Did language recognition determine the correct document language?
● Which word types and base forms (stems) did the linguistic analysis assign to the terms?
● Which terms from the document would TREX include in a full-text index?
● Which document attributes would TREX include in the index? |
Index |
Only relevant for Get+Filter+Lex. Specifies whether the preprocessor uses global or index-specific settings for processing.
● Do not select an index if you want TREX to use the global settings.
● Select the required index if you want TREX to use index-specific settings.
This affects the following settings:
● Python extensions: Python extensions can be activated locally for an index or globally.
● Word separators: Word separators are characters such as /\;,.:-. The linguistic analysis uses the defined word separators to split a text into individual words. The global word separators are defined in the configuration file TREXPreprocessor.ini. |
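The effect of the word separators described above can be illustrated with a minimal Python sketch. It is not TREX code and does not read TREXPreprocessor.ini. The function name split_words, the separator set (copied from the example characters in the table), and the treatment of whitespace as a separator are all assumptions made for illustration.

```python
import re

# Assumed separator set, taken from the example characters in the text.
# The real global set is defined in TREXPreprocessor.ini.
WORD_SEPARATORS = "/\\;,.:-"


def split_words(text, separators=WORD_SEPARATORS):
    """Split text into words at any separator character.

    Whitespace is also treated as a separator; this is an assumption
    of the sketch, not documented TREX behavior.
    """
    # re.escape makes characters such as '\', '.' and '-' safe inside
    # the character class.
    pattern = "[" + re.escape(separators) + r"\s]+"
    return [word for word in re.split(pattern, text) if word]


# Global-style separators split the hyphenated term ...
print(split_words("price/performance ratio: high-end"))
# ... while a hypothetical index-specific set without '-' keeps it whole.
print(split_words("high-end", separators="/"))
```

Comparing the two calls shows why the Index selection matters for Get+Filter+Lex: a different separator set yields different terms from the same document.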
The output areas display the results of the analysis. The following output areas are available:
Output Area |
Description |
Document content |
Document content that the preprocessor outputs after processing. Depending on the selected function, you see one of the following:
● The HTML version without linguistic analysis
● The version after linguistic analysis
In the version after linguistic analysis, you see only the terms that TREX would include in a full-text index. All other terms are hidden. The displayed terms are also the basis on which TREX would select terms for text mining. This selection takes place using term generation rules. If you place the cursor over a term, you see the details of the linguistic analysis in the middle area of the status bar (see the Status Bar section below). Use the menu path Action → Find In Content to search for terms if necessary. |
Python extensions and document attributes |
Only relevant for Get+Filter+Lex.
● Python extensions that are activated locally for the index
● Document attributes that TREX would include in the index
You see attributes that are defined in the document itself as well as attributes that may be generated by Python extensions.
Example: Suppose you have activated Python extensions that generate document attributes from the <meta> tags of an HTML document. You can then check the results generated by the Python extensions in this area. |
Categories |
Only relevant for Get+Filter+Lex. Word types that the linguistic analysis assigned to the terms. If the list contains the category ‘unknown’, the linguistic analysis was unable to assign a word type to some of the terms. TREX includes these terms in the text-mining index as nouns by default (category: nn). This setting is defined in TREXMiningIndex.ini.
If there are a large number of terms in the category ‘unknown’, it is possible that language recognition determined the language incorrectly. However, terms may be classified as unknown even if the correct document language was determined. This might be due to proper nouns such as the names of people, products, and places. Recognition support for proper nouns (named entities) is planned for a later TREX release but is not available in the current release. The NE node is therefore currently inactive. |
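To make the <meta> tag example in the table above more concrete, the following sketch shows how attributes could be collected from the <meta> tags of an HTML document. It uses only the Python standard library and does not use the TREX Python extension interface, which is not described here. The class and function names are hypothetical, and lowercasing the attribute names is an assumption of the sketch.

```python
from html.parser import HTMLParser


class MetaAttributeCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            # Normalizing the attribute name to lowercase is an assumption.
            self.attributes[name.lower()] = content


def extract_meta_attributes(html_text):
    """Return a dict of document attributes found in <meta> tags."""
    collector = MetaAttributeCollector()
    collector.feed(html_text)
    return collector.attributes


sample = (
    '<html><head>'
    '<meta name="author" content="Jane Doe">'
    '<meta name="Keywords" content="search, index"/>'
    '</head><body>Text</body></html>'
)
print(extract_meta_attributes(sample))
```

Attributes produced in this way are the kind of result you would expect to check in the Python extensions and document attributes output area after running Get+Filter+Lex.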
If you have executed the Get+Filter+Lex function, the status bar displays the following information:
Area |
Description |
Left |
Document language that the language recognition determined. |
Middle |
If you place the cursor on a term in the output area containing the document content, the following information appears:
● normal:<form> Word form that appears in the original document.
● num:<position> Position of the term in the text.
● stem:<form> Root form (base form) of the term. For example, the singular form is the root form of an English noun. In the case of compound terms, you also see the parts into which the linguistic analysis split the term. TREX includes the compound and the constituent parts in the index. Example: The English term ‘annual sales’ is split into ‘annual’ and ‘sales’. This is displayed in the following form: stem:annual#sales
● lex:<category> Word type of the term |
Right |
You can use the <html> field to change the view of the document content. You can switch between views with and without HTML tags. |
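If you copy values from the middle area of the status bar into your own notes or test scripts, the key:value notation described above is easy to take apart. The following is a minimal sketch that assumes a single field such as stem:annual#sales is passed in as a string. The helper names are hypothetical and are not part of TREX.

```python
def parse_field(field):
    """Split one status-bar field such as 'lex:nn' into (key, value)."""
    key, sep, value = field.partition(":")
    if not sep:
        raise ValueError("expected 'key:value', got %r" % field)
    return key.strip(), value.strip()


def compound_parts(stem_field):
    """Return the constituent parts of a 'stem:' field.

    The parts of a compound are separated by '#', as in the
    'stem:annual#sales' example from the text.
    """
    key, value = parse_field(stem_field)
    if key != "stem":
        raise ValueError("not a stem field: %r" % stem_field)
    return value.split("#")


print(parse_field("lex:nn"))
print(compound_parts("stem:annual#sales"))
```

A stem field without a # separator yields a single-element list, which corresponds to a term that the linguistic analysis did not split.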