Excluding Parts of XML and HTML Files From Indexing

Use

XML (Extensible Markup Language) and HTML (Hypertext Markup Language) are mark-up languages that use mark-up elements to structure the text in a document and to control how it is displayed. In the <TREX_Installation_Directory>\Lexicon\std.html-config file, you can use these mark-up elements to define which texts within HTML and XML documents are not to be indexed.

This makes sense in the following cases:

  • Excluding technical information from indexing

    For example, you can exclude JavaScript program code from indexing. In HTML, this code is marked by the tags <script type="text/javascript"...> ... </script>. It does not contain any content that is characteristic of the respective document and can therefore be ignored during the indexing run.

  • Excluding redundant text parts from indexing

    You can exclude text parts from indexing if they are identical in more than one XML or HTML file and thus do not contain any information about the respective document content.
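
    To make the two cases concrete, the following is a small, made-up XHTML page. Its content and structure are assumptions for illustration only and are not taken from a TREX installation. The comments mark which parts are technical information, which are redundant text, and which are characteristic of the document.

    ```xml
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Quarterly Report</title>
        <!-- Technical information: says nothing about the subject of the document -->
        <script type="text/javascript">
          function toggleMenu() { document.getElementById("menu").hidden = true; }
        </script>
      </head>
      <body>
        <!-- Redundant text: identical on every page of this hypothetical site -->
        <div>Home | Products | Contact</div>
        <!-- Characteristic content: this is what a search should find -->
        <p>Revenue grew in the third quarter.</p>
      </body>
    </html>
    ```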

Excluding Technical Information From Indexing

  1. Open the <TREX_Installation_Directory>\Lexicon\std.html-config configuration file with a text editor.

    You must change entries in the sections <remove-region> and <multimedia-markup> in the std.html-config file. In each of these sections, you can find a list of mark-up elements for XML or HTML code. The texts that are marked by these elements in the XML or HTML file are not taken into account during indexing. In the case of HTML, these are mark-up elements that contain technical information about processing and displaying HTML files.

    The following examples each contain an extract from these lists:

    • <remove-region>

      <item key = "applet" />

      <item key = "code" />

      <item key = "script" />

      ...

      <item key = "title" />

    • <multimedia-markup>

      <item key = "applet" />

      <item key = "code" />

      <item key = "script" />

      <item key = "server" />

      ...

      <item key = "title" />

      Caution

      Special features of XML files

      The selection of the mark-up elements in the std.html-config configuration file is based on the fact that all HTML language elements are standardized and defined on an international level. For HTML, it is therefore guaranteed that the mark-up elements listed contain only technical information (<applet>, <script>, <code>, and so on), and not texts relevant to the document content.

      However, in XML you can use a DTD (Document Type Definition) or an XML schema to define your own XML language elements. Their names can be identical to the names of HTML language elements, and they can contain text that is relevant to the document content. If you are indexing XML files, you therefore need to check whether any of your XML elements have the same names as mark-up elements in the lists. Remove any affected elements from the lists so that TREX indexes the text they contain.

  2. Remove an element from the list or add an element to the list:
    • Remove an element from the list if you want the system to index the text that is marked by this element.

      You do this, for example, by deleting the line <item key = "applet" /> from the list.

    • Add an element to the list if you do not want the system to index the text that is marked by this element.

      You do this by adding the line <item key = "Markup_Element" /> to the above list. In doing this, you replace Markup_Element with the element that you want to exclude from processing.

      Note

      Note that the list in the std.html-config configuration file contains certain default elements that are not taken into account during indexing.

  3. Save the file and close the text editor.
  4. Stop the TREX preprocessor and restart it so that the new settings take effect. You start and stop the preprocessor using the function for starting and stopping the TREX servers in the TREX admin tool (stand-alone).
    Note

    Note that the TREX daemon automatically restarts the server after it has been stopped. The settings are valid for all documents indexed after the TREX preprocessor is restarted. The new settings do not affect documents that have already been indexed.
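
As an illustration of step 2, the following sketch shows how the <remove-region> section might look after the line for applet has been deleted and a line for a new element has been added. Only the items quoted in the extract above come from this document. The element name navigation is a hypothetical example, the rest of the list is elided, and the sketch assumes that the section is closed by a matching </remove-region> tag. The same changes would be made in the <multimedia-markup> section.

```xml
<remove-region>
  <!-- The applet line was deleted here: text inside <applet> is indexed again -->
  <item key = "code" />
  <item key = "script" />
  <!-- ... further default items, left unchanged ... -->
  <item key = "title" />
  <!-- Hypothetical addition: text inside <navigation> is no longer indexed -->
  <item key = "navigation" />
</remove-region>
```

As described in steps 3 and 4, a change of this kind takes effect only after the file has been saved and the TREX preprocessor has been restarted.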

Excluding Redundant Text Parts From Indexing

To exclude redundant text parts from indexing, proceed as follows:

  1. Flag these text parts within the XML or HTML code in the relevant XML or HTML documents using a dedicated mark-up element (for example, <trexignore> ... </trexignore>).
    Caution

    Note that, in the case of an XML file, you must define the new mark-up element in the associated DTD or XML schema; otherwise the XML document is not valid. In the case of HTML, the browser ignores the new mark-up element when displaying the document, because it is not part of the HTML standard.

  2. Add the newly defined mark-up element (for example, <trexignore> ... </trexignore>) to both the <remove-region> and the <multimedia-markup> sections of the std.html-config file as <item key = "trexignore" />, as described in the procedure Excluding Technical Information From Indexing (see above).
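
The document side of step 1 can be sketched as follows for an XML file. This is a minimal example under stated assumptions: the document type article, the element para, and all text content are hypothetical, and only the element name trexignore comes from the procedure above. The internal DTD declares trexignore so that the document remains valid.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical document type; only trexignore is taken from the procedure -->
<!DOCTYPE article [
  <!ELEMENT article (trexignore | para)*>
  <!ELEMENT trexignore (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
]>
<article>
  <!-- Identical in every file, so it carries no document-specific information -->
  <trexignore>Example Corp. All rights reserved. Home | Products | Contact</trexignore>
  <!-- Document-specific text that is meant to be indexed -->
  <para>Text that describes this particular document.</para>
</article>
```

With <item key = "trexignore" /> present in both sections of the configuration file, the text inside <trexignore> would be excluded from indexing, while the text inside <para> would still be indexed.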