Web Property Extractors

Use

A Web repository manager extracts values for various standard document properties from the content of HTML documents. You can use Web property extractors to combine multiple properties.

A Web property extractor consists of a group of properties. Each property specification consists of a property name and rules that define how to extract the property value from the HTML content.

 

Integration

Each Web property extractor has its own name, so the same extractor can be used in more than one Web repository manager (specified in the Web Property Extractors parameter for the repository manager).

 

Features

The Web repository manager supports property extraction from HTML content only.

You can create the following types of properties.

     HTML properties

Can be used if the values to be extracted are between HTML tags

     Text properties

Can be used if the values to be extracted are in text on HTML pages

 

Note

Parameters that require regular expressions use Sun's regular expression syntax for J2EE 1.4.

 

HTML Properties

To extract a property value, you have to define an HTML property and specify the HTML tags from which you want to extract the property value. In addition, you can define filter expressions to exclude certain values.

Parameters for the Specification of HTML Properties

| Parameter | Definition |
| --- | --- |
| Name | Name of the property specification |
| Case-insensitive | If activated, the system does not distinguish between uppercase and lowercase. |
| Select HREF | Same as Select All, but all extracted values are handled as hypertext links and changed to links to resources in the portal. |
| Exclude | Regular expression denoting values of HTML tags and attributes that are not to be extracted. |
| Namespace | Namespace to which the property name belongs (optional) |
| Property Name | Name of the HTML property |
| Select All | Comma-separated list of HTML tags and attributes whose values are to be stored in the HTML property in question |
| Select All META | Specification of an HTML META tag whose values are saved in the respective HTML property. For example, description or author |
| Select First | Same as Select All, but only the first occurrence of each tag is extracted |
| Select First META | Same as Select All META, but only the first occurrence of each META tag is extracted |

 

For examples of using HTML properties, see the end of this section.

 

Text Properties

Text properties allow the extraction of properties at text level.

Parameters for the Specification of Text Properties

| Parameter | Description |
| --- | --- |
| Name | Name of the property specification |
| Case Sensitive Matching | If activated, the system takes uppercase and lowercase into account during matching. |
| Include Start and End Strings | If activated, the strings matched by the start and end patterns are included at the start and end of the result. |
| Match only first occurrence | Determines whether the property is complete after the first match (and contains only the one value) or whether all other matches are collected as a list in the property. Activated = Only the first match is added to the property. Deactivated = All matches are added to the property. |
| Maximum Length | Specifies the maximum number of characters that a matched character string can contain to be valid. For example, a search is terminated if the End Pattern has not been found after this number of characters. |
| End Pattern | Regular expression that describes the end of the character string to be extracted. If you do not make an entry here, the character string is extracted from the start pattern to the end of the line. |
| Match Pattern | To include only some of the characters of the character string found in the property, use the Match Pattern and Report Expression parameters. Match Pattern is a regular expression that is applied to the character string found. To flag parts of the text, enclose them in parentheses (). You can use the flagged text in the Report Expression parameter to define the value of the property. For more information about regular expressions, see the JDK 1.4 documentation at the Internet address java.sun.com/j2ee/1.4. |
| Namespace | Namespace to which the property name belongs. |
| Property Name | Name of the text property |
| Report Expression | Regular expression to define the value of the property from the Match Pattern found. |
| Start Pattern | Regular expression that describes the start of the character string to be extracted. |

 

For examples of using text properties, see Examples of the Configuration of Text Properties.
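
The interplay of the text property parameters can be illustrated with a small Java sketch. It is not the Web repository manager's implementation: the class and method names are hypothetical, the sample text is invented, and it assumes that Report Expression accepts $1-style references to the flagged groups, as in java.util.regex replacement strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of Start Pattern, End Pattern, Maximum Length and Match Pattern. */
final class TextPropertySketch {

    /** Collects the text between each start-pattern match and the next end-pattern match. */
    static List<String> extract(String text, String startPattern, String endPattern,
                                int maxLength, boolean firstOnly) {
        List<String> values = new ArrayList<String>();
        Matcher start = Pattern.compile(startPattern).matcher(text);
        Pattern end = Pattern.compile(endPattern);
        int from = 0;
        while (from <= text.length() && start.find(from)) {
            Matcher endMatcher = end.matcher(text);
            if (!endMatcher.find(start.end())) {
                break; // no end pattern follows this start pattern
            }
            String candidate = text.substring(start.end(), endMatcher.start());
            if (candidate.length() <= maxLength) {
                values.add(candidate);
                if (firstOnly) {
                    break; // corresponds to "Match only first occurrence"
                }
                from = Math.max(endMatcher.end(), from + 1);
            } else {
                // Too long to be valid: resume the search after this start match.
                from = Math.max(start.end(), from + 1);
            }
        }
        return values;
    }

    /** Applies a match pattern to a found string and builds the value from the flagged groups. */
    static String report(String found, String matchPattern, String reportExpression) {
        Matcher m = Pattern.compile(matchPattern).matcher(found);
        if (!m.find()) {
            return null;
        }
        StringBuffer sb = new StringBuffer();
        m.appendReplacement(sb, reportExpression); // appends the prefix plus the expanded expression
        return sb.substring(m.start());            // drop the prefix, keep the expanded expression
    }

    public static void main(String[] args) {
        String text = "Author: Susan Summer (Walldorf)\nAuthor: Paul Winter (Berlin)\n";
        List<String> all = extract(text, "Author: ", "\\n", 100, false);
        System.out.println(all); // [Susan Summer (Walldorf), Paul Winter (Berlin)]
        System.out.println(report(all.get(0), "(\\w+) (\\w+)", "$2, $1")); // Summer, Susan
    }
}
```

In Java source code the backslash is doubled inside string literals; in a configuration field the same patterns would be entered with single backslashes.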

 

Activities

To define HTML or text properties that you want to include in Web property extractors, choose Content Management → Repository Managers → Web Property Extractors → HTML Properties or Text Properties. Choose New and define the property.

Once you have defined properties, you can combine them in Web property extractors: Choose Content Management → Repository Managers → Web Property Extractors → Web Property Extractor.

To create a new Web property extractor, choose New and enter a name for the new Web property extractor. Now select the properties that you want to combine in this Web property extractor.

During configuration of a Web repository manager, you can select the Web property extractor in the Web Property Extractors parameter.

 

Examples

Listing the Images Contained in an HTML Page

You want to create a document property named 'images' that contains the links to all images in an HTML page.

Since every name in CM belongs to a certain namespace, you can specify a namespace. If, for example, you want to define the resource property sap:images, where sap: is the namespace http://sap.com/xmlns/cm, you define the parameter name as follows:

 

Namespace     = http://sap.com/xmlns/cm
Property Name = images

(The sap: namespace is the default namespace, so you could have omitted the namespace definition).

Links to images in HTML occur in two places, in the SRC attribute of IMG tags and in BACKGROUND attributes of BODY and TABLE tags. You extend your definition accordingly:

 

Namespace     = http://sap.com/xmlns/cm
Property Name = images
Select All    = img/@src, @background

 

Select All lists the HTML tags and attributes from which values are extracted. Here it selects all BACKGROUND attributes and all SRC attributes of IMG tags.

 

For example, the value of a SRC attribute could be './white.gif'. To turn such attribute values into resource identifiers, such as '/web/server/image/white.gif', use Select HREF instead of Select All:

 

Namespace     = http://sap.com/xmlns/cm
Property Name = images
Select HREF   = img/@src, @background

 

If you want to exclude GIF files from being listed in the images property, define an exclude filter as follows:

 

Exclude       = \\.(gif|GIF)$
Namespace     = http://sap.com/xmlns/cm
Property Name = images
Select HREF   = img/@src, @background
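
The effect of the exclude filter can be checked outside the portal with a few lines of Java. This is a sketch under assumptions, not the repository manager's code: the class name and sample links are hypothetical, the expression is written as the Java string literal "\\.(gif|GIF)$" (the regular expression \.(gif|GIF)$), and the filter is assumed to search within the value rather than require a match of the whole value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of an Exclude expression applied to extracted values. */
final class ExcludeFilterSketch {

    /** Returns the values in which the exclude expression is not found. */
    static List<String> applyExclude(List<String> values, String excludeRegex) {
        Pattern exclude = Pattern.compile(excludeRegex);
        List<String> kept = new ArrayList<String>();
        for (String value : values) {
            if (!exclude.matcher(value).find()) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> images = Arrays.asList(
                "/web/server/image/white.gif",
                "/web/server/image/logo.GIF",
                "/web/server/image/photo.jpg",
                "/web/server/image/gif.png"); // contains "gif" but does not end in ".gif"
        // Prints [/web/server/image/photo.jpg, /web/server/image/gif.png]
        System.out.println(applyExclude(images, "\\.(gif|GIF)$"));
    }
}
```

The expression as written does not exclude mixed-case extensions such as '.Gif'; whether the Case-insensitive parameter also applies to the exclude expression is not stated here.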

 

Excluding Certain Links When Searching an HTML Page

You want to exclude particular links from being pursued by a crawler when the content of a Web repository is analyzed.

 

Background documentation

All links in an HTML document on a Web site that is represented in a Web repository are stored in the embedded links resource property. During the analysis of the Web repository, the crawler pursues the links that are stored in this property.

 

You can control the crawler by filtering the links that are stored in the embedded links property. To filter links, define a property and a property extractor in the Configuration iView. This property excludes links that you do not want to have analyzed, and writes links that you do want to have analyzed to the embedded links property.

 

Carry out the following steps:

 

       1.      Define a property.

Choose Content Management → Repository Managers → Web Property Extractors → HTML Properties.

Choose New and specify a name for the property. Enter the following in the Property Name field:

Property Name = embedded-links

In the Exclude field, enter a regular expression that matches the links that you do not want the crawler to pursue.

For example, the regular expression .*spiegel.de\/sport.* excludes all links that contain the path spiegel.de/sport in their URL.

 

       2.      Define a Web property extractor and assign a property or properties to it.

Choose Content Management → Repository Managers → Web Property Extractors → Web Property Extractor. Choose New and define the Web property extractor.

Assign the property or properties to the Web property extractor.

 

       3.      Create a Web repository manager and assign the corresponding Web property extractor to it in the Web Property Extractors field (see Web Repository Manager).
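
Before entering an exclude expression for the crawler, it can help to test it against a few sample links. The following Java sketch assumes the expression is written as .*spiegel.de\/sport.* and must match the whole link. The class name, method name, and sample URLs are hypothetical. It also shows that the unescaped dot in spiegel.de matches any character, which a stricter variant with \. avoids.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of testing a crawler exclude expression against links. */
final class CrawlLinkFilterSketch {

    /** True if the whole link matches the exclude expression. */
    static boolean isExcluded(String link, String excludeRegex) {
        return Pattern.matches(excludeRegex, link);
    }

    public static void main(String[] args) {
        String loose = ".*spiegel.de\\/sport.*";   // regex: .*spiegel.de\/sport.*
        String strict = ".*spiegel\\.de/sport.*";  // regex: .*spiegel\.de/sport.*

        System.out.println(isExcluded("http://www.spiegel.de/sport/index.html", loose));   // true
        System.out.println(isExcluded("http://www.spiegel.de/politik/index.html", loose)); // false
        // The unescaped dot also accepts other characters in place of ".":
        System.out.println(isExcluded("http://www.spiegelxde/sport/", loose));             // true
        System.out.println(isExcluded("http://www.spiegelxde/sport/", strict));            // false
    }
}
```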

 

Listing the Headings Contained in an HTML Page

The following property definition selects all textual content of the HTML heading tags H1 to H4 and puts it into the resource property 'headers' (in the namespace 'http://mycompany.com/xmlns/htmlprops').

Namespace     = http://mycompany.com/xmlns/htmlprops
Property Name = headers
Select All    = H1, H2, H3, H4

 

Listing the First Headings Contained in an HTML Page

If you want to select only the first tag/attribute in an HTML document, use the Select First parameter:

Namespace     = http://mycompany.com/xmlns/htmlprops
Property Name = headers
Select First  = H1, H2, H3, H4
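
The difference between Select All and Select First can be sketched in Java as follows. This is only an approximation for illustration: the class name and sample HTML are hypothetical, a regular expression stands in for a real HTML parser, and the values are grouped by tag rather than listed in document order, which may differ from the actual extractor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of Select All versus Select First for heading tags. */
final class HeadingSelectionSketch {

    /** Collects the text content of the given tags; with firstOnly, one value per tag at most. */
    static List<String> select(String html, List<String> tags, boolean firstOnly) {
        List<String> values = new ArrayList<String>();
        for (String tag : tags) {
            Pattern p = Pattern.compile("<" + tag + "\\b[^>]*>(.*?)</" + tag + ">",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(html);
            while (m.find()) {
                values.add(m.group(1).trim());
                if (firstOnly) {
                    break; // keep only the first occurrence of this tag
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<h1>Overview</h1><h2>Setup</h2><p>text</p><h2>Usage</h2><h3>Options</h3>";
        List<String> tags = Arrays.asList("H1", "H2", "H3", "H4");
        System.out.println(select(html, tags, false)); // [Overview, Setup, Usage, Options]
        System.out.println(select(html, tags, true));  // [Overview, Setup, Options]
    }
}
```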

 

Listing Information in META Tags

You want to create a document property that contains the authors of an HTML page. To do this, the META tag author is read from the HTML code in documents.

Name              = author

Property Name   = prop1

Select All META = author

 

As an example, here is an HTML document that contains the following HTML code:

 

 <head>

   <meta name="author" content="Susan Summer" />

   <meta name="author" content="Paul Winter" />

   <meta name="description" content="This is an example." />

 </head>

 

You create the multi-value property prop1 for this HTML document. The property value is a list containing the entries “Susan Summer” and “Paul Winter”.

 

If you want to select only the first occurrence of the META tag author in an HTML document, use the parameter Select First META: 

Name              = author

Property Name     = prop2

Select First META = author

This results in the single-value property prop2 with the value “Susan Summer.”
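
The two results above can be reproduced with a short Java sketch. It is an illustration, not the extractor's implementation: the class and method names are hypothetical, and the regular expression assumes that the name attribute precedes the content attribute, as in the sample document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of Select All META versus Select First META. */
final class MetaTagSketch {

    static final String SAMPLE =
            "<head>\n"
            + "  <meta name=\"author\" content=\"Susan Summer\" />\n"
            + "  <meta name=\"author\" content=\"Paul Winter\" />\n"
            + "  <meta name=\"description\" content=\"This is an example.\" />\n"
            + "</head>\n";

    /** Returns the content values of all META tags with the given name, or only the first. */
    static List<String> selectMeta(String html, String name, boolean firstOnly) {
        Pattern p = Pattern.compile(
                "<meta\\s+name=\"" + Pattern.quote(name) + "\"\\s+content=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        List<String> values = new ArrayList<String>();
        while (m.find()) {
            values.add(m.group(1));
            if (firstOnly) {
                break;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(selectMeta(SAMPLE, "author", false)); // [Susan Summer, Paul Winter]
        System.out.println(selectMeta(SAMPLE, "author", true));  // [Susan Summer]
    }
}
```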

 

 
