HTML Property Extractors (SAP Library

HTML Property Extractors

Use

A Web repository manager extracts values for various standard document properties from the content of HTML documents. You can use HTML property extractors to combine multiple properties.

An HTML property extractor consists of a group of properties. Each property specification consists of a property name and rules that define how to extract the property value from the HTML content.

Integration

Each HTML property extractor has its own name, so the same extractor can be used in more than one Web repository manager (specified in the repository manager's HTML property extractors configuration property).

Features

The Web repository manager supports property extraction from HTML content only. To extract a new resource property, you have to define the name of the resource property and specify the HTML tags from which you want to extract the value of the document property. In addition, you can define filter expressions to exclude certain values.

For each document property you want to extract, you can use the following parameters:

Parameters for the Specification of Document Properties

Parameters	Definition
Name	Name of the property specification
Case-insensitive	If activated, the system does not distinguish between uppercase and lowercase.
Select HREF	Same as Select All, but all extracted values are handled as hypertext links and changed to links to resources in the portal.
Exclude	Regular expression denoting values of HTML tags and attributes that are not to be extracted.
Namespace	Namespace to which the property name belongs (optional)
Property Name	Name of document property.
Select All	Comma-separated list of HTML tags and attributes whose values are to be stored in the document property in question
Select All META	Name of an HTML META tag whose values are saved in the respective document property, for example, description or author
Select First	Same as Select All, but only the first occurrence of each tag is extracted
Select First META	Same as Select All META, but only the first occurrence of each META tag is extracted
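
The parameters above can be pictured as a small data structure. The following Python sketch is only an illustration of how a property specification and an extractor relate to each other. The class names, field names, and the extractor name web_images are hypothetical and are not part of the portal configuration. The default namespace is taken from the example later in this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Default namespace (sap:) as stated in the images example of this document.
DEFAULT_NAMESPACE = "http://sapportals.com/xmlns/cm"


@dataclass
class HtmlProperty:
    """One property specification (hypothetical model of the table above)."""
    name: str                                  # Name of the specification
    property_name: str                         # Name of the document property
    namespace: str = DEFAULT_NAMESPACE         # Optional namespace
    case_insensitive: bool = False
    select_all: List[str] = field(default_factory=list)
    select_first: List[str] = field(default_factory=list)
    select_href: List[str] = field(default_factory=list)
    select_all_meta: Optional[str] = None
    select_first_meta: Optional[str] = None
    exclude: Optional[str] = None              # Regular expression


@dataclass
class HtmlPropertyExtractor:
    """A named group of property specifications."""
    name: str
    properties: List[HtmlProperty] = field(default_factory=list)


images = HtmlProperty(
    name="images",
    property_name="images",
    select_href=["img/@src", "@background"],
    exclude=r"\.(gif|GIF)$",
)
extractor = HtmlPropertyExtractor(name="web_images", properties=[images])
print(extractor.name, [p.property_name for p in extractor.properties])
```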

Note

Extracting properties affects performance: the more complex an HTML property extractor is, the slower such resources are indexed. Filtering values with Exclude is not particularly beneficial because of multi-threading problems with regular expressions.

Activities

To define properties that you want to include in HTML property extractors, choose Content Management → Repository Managers → HTML Property Extractors → HTML Properties. Choose New and define the property.

Once you have defined properties, you can combine them in HTML property extractors: Choose Content Management → Repository Managers → HTML Property Extractors → HTML Property Extractor. To create a new HTML property extractor, choose New and enter a name for the new HTML property extractor. Now select the properties that you want to combine in this HTML property extractor.

When you configure a Web repository manager, you can now select the HTML property extractor from the HTML Property Extractors field and assign it to the Web repository manager in question.

Example

Listing the Images Contained in an HTML page

You want to create a document property named 'images' that contains the links to all images in an HTML page.

Since every name in CM belongs to a certain namespace, you can specify a namespace. If, for example, you want to define the resource property sap:images, where sap: is the namespace http://sapportals.com/xmlns/cm, you define the property name as follows:

Namespace = http://sapportals.com/xmlns/cm
Property Name = images

(The sap: namespace is the default namespace, so you could omit the namespace definition.)

Links to images in HTML occur in two places, in the SRC attribute of IMG tags and in BACKGROUND attributes of BODY and TABLE tags. You extend your definition accordingly:

Namespace = http://sapportals.com/xmlns/cm
Property Name = images
Select All = img/@src, @background

Select All lists the HTML tags and attributes from which values are extracted. Here it selects all BACKGROUND attributes and all SRC attributes of IMG tags.
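
To see which values a selection such as img/@src, @background picks up, you can try it on a small page outside the portal. The following Python sketch uses the standard html.parser module. It is not the portal's implementation, and the class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class SelectAllSketch(HTMLParser):
    """Collects img/@src and every @background attribute in document order."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        for name, value in attrs:
            if value is None:
                continue
            if (tag == "img" and name == "src") or name == "background":
                self.values.append(value)


def select_all_images(html):
    parser = SelectAllSketch()
    parser.feed(html)
    parser.close()
    return parser.values


page = """
<body background="./paper.jpg">
  <table background="tile.png"><tr><td><img src="./white.gif"></td></tr></table>
  <IMG SRC="logo.png">
</body>
"""
print(select_all_images(page))
# prints ['./paper.jpg', 'tile.png', './white.gif', 'logo.png']
```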

For example, the value of a SRC attribute could be './white.gif'. To turn such values into resource identifiers, such as '/web/server/image/white.gif', use Select HREF instead of Select All:

Namespace = http://sapportals.com/xmlns/cm
Property Name = images
Select HREF = img/@src, @background
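
The effect of Select HREF can be approximated as resolving each extracted link against the URL of the page and then mapping the result to a repository path. The following Python sketch assumes that the page is located at http://server/image/index.html and that the Web repository uses the prefix /web. Both values, the function name, and the mapping rule (prefix, then host, then path) are assumptions made for illustration and may differ from what the portal actually does.

```python
from urllib.parse import urljoin, urlsplit


def to_resource_id(page_url, link, repository_prefix="/web"):
    """Resolve a link against the page URL and map it to a portal-style path.

    The mapping prefix + "/" + host + path is an assumption of this sketch.
    """
    absolute = urljoin(page_url, link)  # resolves "./" and "../" segments
    parts = urlsplit(absolute)
    return f"{repository_prefix}/{parts.netloc}{parts.path}"


print(to_resource_id("http://server/image/index.html", "./white.gif"))
# prints /web/server/image/white.gif
```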

If you want to exclude GIF files from being listed in the images property, define an exclude filter as follows:

Exclude       = \\.(gif|GIF)$
Namespace     = http://sapportals.com/xmlns/cm
Property Name = images
Select HREF   = img/@src, @background
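
Before you enter an exclude filter in the configuration, you can check its effect on a few sample values. The following Python sketch assumes that a value is dropped as soon as the expression matches anywhere in it (search semantics). The portal may apply the expression differently. The pattern is written as a Python raw string, so the literal dot is escaped with a single backslash. The function name and sample values are hypothetical.

```python
import re

# Matches values that end in ".gif" or ".GIF".
EXCLUDE = re.compile(r"\.(gif|GIF)$")


def apply_exclude(values, pattern=EXCLUDE):
    """Keep only the values that the exclude expression does not match."""
    return [value for value in values if not pattern.search(value)]


extracted = [
    "/web/server/image/white.gif",
    "/web/server/image/logo.png",
    "/web/server/image/OLD.GIF",
]
print(apply_exclude(extracted))
# prints ['/web/server/image/logo.png']
```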

Excluding certain links when searching an HTML page

You want to exclude particular links from being pursued by a crawler when the content of a Web repository is analyzed.

Background documentation

All links in an HTML document on a Web site that is depicted in a Web repository are stored in the embedded links resource property. During the analysis of the Web repository, the crawler pursues the links that are stored in this property.

You can control the crawler by filtering the links that are stored in the embedded links property. To filter links, define a property and a property extractor in the Configuration iView. This property excludes links that you do not want to have analyzed, and writes links that you do want to have analyzed to the embedded links property.

Carry out the following steps:

1. Define a property.

Choose Content Management → Repository Managers → HTML Property Extractors → HTML Properties.

Choose New and specify a name for the property. Enter the following in the Property Name field:

Property Name = embedded-links

In the Exclude field, enter a regular expression that matches the links that you do not want the crawler to pursue.

For example, the regular expression .*spiegel.de\/sport.* excludes all links that contain the path spiegel.de/sport in their URL.

2. Define an HTML property extractor and assign a property or properties to it.

Choose Content Management → Repository Managers → HTML Property Extractors → HTML Property Extractor. Choose New and define the HTML property extractor.

Assign the property or properties to the HTML property extractor.

3. Create a Web repository manager and assign the corresponding HTML property extractor to it in the HTML Property Extractors field (see Web Repository Managers).
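
The exclude expression from step 1 can be tried out on a list of sample links before you configure it. The following Python sketch writes the expression with a leading and trailing .* so that it covers the whole URL, and it therefore uses a full match. Whether the portal matches the whole value or only a part of it is not stated here, so treat this as an assumption. The function name and sample URLs are hypothetical.

```python
import re

# Leading and trailing ".*" let the expression cover the complete URL.
EXCLUDE_LINKS = re.compile(r".*spiegel\.de/sport.*")


def links_to_crawl(links, exclude=EXCLUDE_LINKS):
    """Return the links that would remain in the embedded links property."""
    return [link for link in links if not exclude.fullmatch(link)]


found = [
    "http://www.spiegel.de/politik/index.html",
    "http://www.spiegel.de/sport/fussball/index.html",
    "http://www.spiegel.de/sport/",
    "http://www.spiegel.de/wirtschaft/index.html",
]
for link in links_to_crawl(found):
    print(link)
```

In this sketch, only the politik and wirtschaft links remain for the crawler to pursue.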

Listing the headings contained in an HTML page

The following property definition selects all textual content of the HTML header tags and puts it into the resource property 'headers' (in the namespace 'http://mycompany.com/xmlns/htmlprops').

Namespace = http://mycompany.com/xmlns/htmlprops
Property Name = headers
Select All = H1, H2, H3, H4

Listing the first headings contained in an HTML page

If you want to select only the first occurrence of each tag or attribute in an HTML document, use the Select First parameter:

Namespace = http://mycompany.com/xmlns/htmlprops
Property Name = headers
Select First = H1, H2, H3, H4
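
The difference between Select All and Select First can be illustrated with a small Python sketch based on the standard html.parser module. It follows the parameter table, where Select First extracts only the first occurrence of each listed tag. The class name, function name, and sample page are hypothetical, and the portal's own extraction may treat nested markup differently.

```python
from html.parser import HTMLParser


class HeadingSketch(HTMLParser):
    """Collects the text of the listed tags, optionally first occurrence only."""

    def __init__(self, tags, first_only=False):
        super().__init__()
        self.tags = {tag.lower() for tag in tags}
        self.first_only = first_only
        self.seen = set()      # tags that have already produced a value
        self.values = []
        self._current = None   # heading tag that is currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags and self._current is None:
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            if not (self.first_only and tag in self.seen):
                self.values.append("".join(self._text).strip())
            self.seen.add(tag)
            self._current = None


def extract_headings(html, tags=("H1", "H2", "H3", "H4"), first_only=False):
    parser = HeadingSketch(tags, first_only)
    parser.feed(html)
    parser.close()
    return parser.values


page = "<h1>Overview</h1><h2>Setup</h2><p>text</p><h2>Usage</h2><h1>Appendix</h1>"
print(extract_headings(page))                   # Select All
# prints ['Overview', 'Setup', 'Usage', 'Appendix']
print(extract_headings(page, first_only=True))  # Select First
# prints ['Overview', 'Setup']
```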

Listing Information in META Tags

You want to create a document property that contains the authors of an HTML page. To do this, the META tag author is read from the HTML code in documents.

Name = author

Property Name = prop1

Select All META = author

As an example, here is an HTML document that contains the following HTML code:

<head>
<meta name="author" content="Susan Summer">
<meta name="author" content="Paul Winter">
</head>

The multi-value property prop1 is created for this HTML document. Its value is a list containing the entries “Susan Summer” and “Paul Winter.”

If you want to select only the first occurrence of the META tag author in an HTML document, use the parameter Select First META:

Name = author

Property Name = prop2

Select First META = author

This results in the single-value property prop2 with the value “Susan Summer.”
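
The two META parameters can be compared with a short Python sketch. The sample document below is an assumption: it contains two META tags named author whose contents correspond to the values mentioned above. The class and function names are hypothetical, and the sketch is not the portal's implementation.

```python
from html.parser import HTMLParser


class MetaSketch(HTMLParser):
    """Collects the content attribute of META tags with a given name."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name.lower()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name == self.meta_name and content is not None:
            self.values.append(content)


def select_meta(html, meta_name, first_only=False):
    parser = MetaSketch(meta_name)
    parser.feed(html)
    parser.close()
    return parser.values[:1] if first_only else parser.values


document = """
<head>
<meta name="author" content="Susan Summer">
<meta name="author" content="Paul Winter">
</head>
"""
print(select_meta(document, "author"))                   # Select All META
# prints ['Susan Summer', 'Paul Winter']
print(select_meta(document, "author", first_only=True))  # Select First META
# prints ['Susan Summer']
```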