XML (Extensible Markup Language) and HTML (HyperText Markup Language) are so-called mark-up languages, which structure and display the text in a document using mark-up elements. In the <TREX-installation_directory>\Lexicon\std.html-config file, you can use these mark-up elements to define which texts within HTML and XML documents should not be indexed.
This makes sense in the following cases:
You can exclude technical information from indexing, for example JavaScript program code, which is marked in HTML by the tags <script type="text/javascript"...> ... </script>. The program code that is marked by these tags does not contain any characteristic content for the respective document and can therefore be ignored during the indexing run.
You can exclude text parts from indexing if they are identical in more than one XML or HTML file and thus do not contain any information about the respective document content.
Excluding Technical Information From Indexing
You must change entries in the sections <remove-region> and <multimedia-markup> in the std.html-config file. In each of these sections, you can find a list of mark-up elements for XML or HTML code. The texts that are marked by these elements in the XML or HTML file are not taken into account during indexing. In the case of HTML, these are mark-up elements that contain technical information about processing and displaying HTML files.
The following examples each contain an extract from these lists:
<item key = "applet" />
<item key = "code" />
<item key = "script" />
...
<item key = "title" />
<item key = "applet" />
<item key = "code" />
<item key = "script" />
<item key = "server" />
...
<item key = "title" />
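To illustrate the effect of such a list, the following Python sketch drops the text inside listed elements from an HTML fragment and keeps the rest. It is not how TREX works internally; it only mimics the described behavior. The names EXCLUDED_KEYS, IndexableTextExtractor, and indexable_text are hypothetical, and the key set is a subset taken from the extract above.

```python
from html.parser import HTMLParser

# Hypothetical subset of keys, copied from the list extract above.
EXCLUDED_KEYS = {"applet", "code", "script", "title"}


class IndexableTextExtractor(HTMLParser):
    """Collects text that lies outside every excluded element."""

    def __init__(self, excluded_keys):
        super().__init__()
        self.excluded_keys = excluded_keys
        self.depth = 0    # greater than 0 while inside an excluded element
        self.chunks = []  # text fragments that would remain for indexing

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in self.excluded_keys:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded_keys and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside excluded elements.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html, excluded_keys=EXCLUDED_KEYS):
    """Return the text of an HTML fragment without the excluded regions."""
    parser = IndexableTextExtractor(excluded_keys)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a page whose head contains a <title> and a <script> element and whose body contains one paragraph and one <code> element, only the paragraph text is returned.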
Special features of XML files
The selection of the mark-up elements in the std.html-config configuration file is based on the fact that all HTML language elements are standardized and defined on an international level. Thus, for HTML it is guaranteed that the listed mark-up elements contain only technical information (<applet>, <script>, <code>, and so on), and not texts relevant to the document content.
However, in XML you can use a DTD (Document Type Definition) or an XML schema to define your own XML language elements, whose names can be identical to those of HTML language elements and which can contain text that is relevant to the document content. If you are indexing XML files, you therefore need to check whether any of your XML elements have the same names as mark-up elements in the list, and remove any affected elements from the list so that TREX indexes their text.
You do this, for example, by deleting the line <item key = "applet" /> from the list.
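The name check described above can be sketched in Python as follows. This is a minimal sketch under several assumptions: the function names are hypothetical, the configuration file is scanned with a regular expression for <item key = "..." /> lines because its full structure is not shown here, and names are compared exactly and case-sensitively, which may differ from how TREX compares them.

```python
import re
import xml.etree.ElementTree as ET

# Matches lines such as: <item key = "applet" />
ITEM_PATTERN = re.compile(r'<item\s+key\s*=\s*"([^"]+)"\s*/>')


def excluded_keys(config_text):
    """Return the keys of all <item key = "..." /> entries in the text."""
    return set(ITEM_PATTERN.findall(config_text))


def element_names(xml_text):
    """Return the local names of all elements used in an XML document."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}name"; keep only "name".
    return {elem.tag.rsplit("}", 1)[-1] for elem in root.iter()}


def colliding_elements(config_text, xml_text):
    """Return element names that occur both in the list and in the document."""
    return sorted(excluded_keys(config_text) & element_names(xml_text))
```

Each name returned by colliding_elements is a candidate for removal from the list, provided that the element carries text relevant to the document content in your XML files.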
To exclude an additional mark-up element from indexing, add the line <item key = "Markup_Element" /> to the above list, replacing Markup_Element with the name of the element that you want to exclude from processing.
Note that the list in the std.html-config configuration file contains certain default elements that are not taken into account during indexing.
Note that the TREX daemon automatically restarts the server after it has been stopped. The settings are valid for all documents indexed after the TREX preprocessor is restarted. The new settings do not affect documents that have already been indexed.
Excluding Redundant Text Parts From Indexing
To exclude redundant text parts from indexing, proceed as follows:
Note that, in the case of an XML file, you must define the new mark-up element in the associated DTD or XML schema; otherwise the XML document is not valid. In the case of HTML, the browser ignores the new mark-up element when displaying the document, because it is not part of the HTML standard.