Manually Performing Preprocessing Steps

Use

In the Preprocessor: View Docs window you can manually carry out processing steps that the preprocessor carries out automatically when indexing takes place.

You can specify a document for the preprocessor to process as follows:

  • Load and filter

  • Load, filter, and analyze linguistically

You use this function for test purposes and for troubleshooting.

Features

The graphic below depicts the structure of the Preprocessor: View Docs window.

Function Bars

The function bars contain the following fields and buttons:

Field/Button

Description

File/URL

Path or URI of the document that the preprocessor is to process.

Show Original

Opens the original document.

An application for opening the document must be installed.

Show Filtered

Displays the filtered document in the browser.

None

Clears the output area.

Get+Filter

Loads and filters the document.

You use this function to check which HTML code the filters generate from the original document.

Get+Filter+Lex

Loads and filters the document and analyzes it linguistically.

You can also use this function to check the results of the linguistic analysis as follows:

  • Did language recognition determine the correct document language?

  • Which word types and base forms (stems) did the linguistic analysis assign to the terms?

  • Which terms from the document would TREX include in a full-text index?

  • Which document attributes would TREX include in the index?

Index

Only relevant for Get+Filter+Lex.

Defines whether the preprocessor is to use global or index-specific settings for processing.

  • Do not select an index if you want TREX to use the global settings.

  • Select the required index if you want TREX to use index-specific settings.

    This affects the following settings:

  • Python extensions

    Python extensions can be activated locally for an index or globally.

  • Word separators

    Word separators are characters such as \/\;,.:-. The linguistic analysis uses the defined word separators to split a text into individual words.

    The global word separators are defined in the configuration file TREXPreprocessor.ini.
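The effect of word separators can be illustrated with a minimal Python sketch. It does not use any TREX API. The separator set is an assumption modeled on the example characters above, and treating whitespace as a separator is also an assumption. The function name split_words is hypothetical.

```python
import re

# Assumed separator set, modeled on the example characters in the text:
# backslash, slash, semicolon, comma, period, colon, hyphen.
SEPARATORS = "\\/;,.:-"


def split_words(text, separators=SEPARATORS):
    """Split text into words at any run of separators or whitespace."""
    # re.escape makes every separator literal inside the character class.
    pattern = "[" + re.escape(separators) + r"\s]+"
    # Drop empty strings produced by leading or trailing separators.
    return [word for word in re.split(pattern, text) if word]


print(split_words("annual-sales: 2003/2004, final.report"))
# ['annual', 'sales', '2003', '2004', 'final', 'report']
```

With a different separator set, for example one without the hyphen, 'annual-sales' would remain a single word. This is the kind of difference you might see between global and index-specific settings.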

Output Areas

The output areas display the results of the analysis. There are the following output areas:

Output area

Description

Document content

Document content that the preprocessor issues after processing.

Depending on the selected function, you see the following:

  • The HTML version without linguistic analysis

  • The version after linguistic analysis

    You only see terms that TREX would include in a full-text index here. All other terms are hidden.

    The displayed terms are also the basis on which TREX would select terms for text-mining. This selection takes place using term generation rules.

    If you place the cursor over a term you see the details of the linguistic analysis in the middle area of the status bar (see the Status Bar section below).

    Use the menu path Action → Find In Content to search for terms if necessary.

Python Extensions and Document Attributes

Only relevant for Get+Filter+Lex.

  • Python extensions that are activated locally for the index

  • Document attributes that TREX would include in the index

    You see attributes that are defined in the document itself as well as attributes that may be generated by Python extensions.

    Example

    You have activated Python extensions that generate document attributes from the <meta> tags of an HTML document. You can then check the results generated by the Python extension in this area.
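The following sketch shows the kind of result such an extension might produce. It is not the TREX Python extension interface, which is not described here. It only uses the Python standard library to collect name/content pairs from <meta> tags. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaAttributeParser(HTMLParser):
    """Collects name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Skip <meta> tags without a name/content pair, such as charset.
        if name and content is not None:
            self.attributes[name.lower()] = content


def extract_meta_attributes(html):
    """Return a dict of document attributes found in <meta> tags."""
    parser = MetaAttributeParser()
    parser.feed(html)
    parser.close()
    return parser.attributes


sample = (
    '<html><head><meta charset="utf-8">'
    '<meta name="Author" content="Jane Doe">'
    '<meta name="keywords" content="search, index"/>'
    "</head><body>Text</body></html>"
)
print(extract_meta_attributes(sample))
# {'author': 'Jane Doe', 'keywords': 'search, index'}
```

Comparing output of this kind with the attributes listed in this area can help you confirm that the activated extension picked up the expected tags.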

Categories

Only relevant for Get+Filter+Lex.

Word types that the linguistic analysis assigned to the terms.

If the list contains the category 'unknown', the linguistic analysis was not able to assign a word type to some of the terms. TREX includes these terms in the text-mining index as nouns by default (category: nn). This setting is defined in TREXMiningIndex.ini.

Note

If there are a large number of terms in the category 'unknown', it is possible that language recognition determined the language incorrectly.

Terms may be classified as unknown even if the correct document language was determined. This might be due to proper nouns such as the names of people, products, and places.

It is planned to add recognition support for proper nouns (named entities) in a later TREX release. However, this function is not available in the current release. The NE node is therefore currently inactive.
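The handling of the category 'unknown' can be illustrated with a small sketch. It assumes that you have copied the categories of the terms into a Python list by hand. The function names and the category label 'vb' are hypothetical. Only 'unknown' and 'nn' come from the description above. The sketch does not read TREXMiningIndex.ini.

```python
from collections import Counter

UNKNOWN = "unknown"
DEFAULT_CATEGORY = "nn"  # default for unknown terms, as described above


def unknown_share(categories):
    """Return the fraction of terms whose category is 'unknown'."""
    if not categories:
        return 0.0
    return Counter(categories)[UNKNOWN] / len(categories)


def apply_default(categories, default=DEFAULT_CATEGORY):
    """Replace 'unknown' with the default category."""
    return [default if cat == UNKNOWN else cat for cat in categories]


# Made-up sample data; 'vb' is a hypothetical category label.
sample = ["nn", "unknown", "vb", "unknown"]
print(unknown_share(sample))   # 0.5
print(apply_default(sample))   # ['nn', 'nn', 'vb', 'nn']
```

A high share of unknown terms does not prove that the language was recognized incorrectly, because proper nouns can also be classified as unknown. It is only a hint to check the language shown in the status bar.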

Status Bar

If you have executed the function Get+Filter+Lex, the status bar displays the following information.

Area

Description

Left

Document language that the language recognition determined.

Middle

If you place the cursor on a term in the output area containing the document content, the following information appears:

  • normal:<form>

    Word form that appears in the original document.

  • num:<position>

    Position of the term in the text.

  • stem:<form>

    Root form of the term.

    The root form is the base form of the term. For example, the singular form is the root form of an English noun.

    In the case of compound terms you also see the parts into which the linguistic analysis split the term. TREX includes the compound and the constituent parts in the index.

    Example

    The English term 'annual sales' is split into 'annual' and 'sales'. This is displayed in the following form:

    stem:annual#sales

  • lex:<category>

    Word type of the term.
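If you note down the status bar details for several terms, a short script can make them easier to compare. The sketch below assumes that each detail is available as a separate string of the form key:value, as listed above. How the status bar separates the fields is not specified here. The function name and the sample values are made up for illustration.

```python
def parse_status_fields(fields):
    """Turn strings such as 'stem:annual#sales' into a dictionary."""
    info = {}
    for field in fields:
        # Split at the first colon only, so values may contain colons.
        key, _, value = field.partition(":")
        info[key] = value
    if "num" in info:
        info["num"] = int(info["num"])
    # A '#' in the stem marks the constituent parts of a compound term.
    stem = info.get("stem", "")
    info["parts"] = stem.split("#") if "#" in stem else []
    return info


# Made-up sample values for the compound term from the example above.
details = parse_status_fields(
    ["normal:annual sales", "num:7", "stem:annual#sales", "lex:nn"]
)
print(details["parts"])  # ['annual', 'sales']
```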

Right

You can use the <html> field to change the view of the document content. You can switch between views with and without HTML tags.