Web Repository Manager

Use

You use a Web repository (manager) to provide read access to documents stored on remote Web servers. The documents are made available as CM resources, which allows them to be indexed and searched in Content Management.

Prerequisites

The remote Web servers whose content you want to access using a Web repository manager must be defined as HTTP systems in the CM system landscape.

Features

Access to a Web repository normally takes place using the crawler server. Its content is processed by the index management service. Once the documents are indexed, users can access them using a search or by browsing in taxonomies.

Using a Web repository manager, you can retrieve the content and properties of HTML documents and of other documents that are provided by a Web server. You cannot, however, modify or delete the content and properties of those documents. The Web repository manager uses HTTP/1.1 to access other Web servers. This protocol does not allow modification of documents. In contrast, the WebDAV repository manager accesses other Web servers using WebDAV, an extension to the HTTP protocol that allows modification of content and metadata.

Since the document collections accessed using HTTP have a network-like structure instead of a hierarchical structure, they are mapped onto plain CM resources only, not onto CM collections. If a resource represents an HTML document containing hyperlinks, those links are stored in a property of the resource.

Dynamic and Static Web Repositories

Web repositories can be configured so that they comprise only the content of one or more predefined Web sites, or so that their namespace is expanded dynamically when other servers are accessed. Accordingly, there are two basic configuration types for a Web repository: dynamic and static.

Dynamic Repositories

The dynamic type is the most powerful kind of Web repository. It allows access to any HTML document on the Internet or an intranet, provided that the Web servers are accessible from the server on which the Web repository is running.

Let us assume such a Web repository is configured with the prefix /web. You could then use the URI /web/www.example.com/ to access the home page http://www.example.com/. A Web server reachable under http://company:81/ could be accessed as /web/company:81/.

The resource reachable under /web is a collection. It is called the root collection of the repository. Its children are all resources that have been accessed so far. In our example, /web would comprise /web/www.example.com and /web/company:81 as children.

Whenever a new server is accessed successfully for the first time, another child is added to the root collection. Thus, the list of children grows dynamically.
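
The behavior described above can be sketched in a few lines of Python. This is only an illustration, not the Web repository manager's implementation: the class DynamicWebRepository and its methods are hypothetical, all servers are assumed to be reached over plain HTTP, and no network request is made (a real repository adds a child only after the server was accessed successfully).

```python
class DynamicWebRepository:
    """Toy model of a dynamic Web repository (hypothetical, for illustration)."""

    def __init__(self, prefix):
        self.prefix = prefix.rstrip("/")
        self.children = []  # servers accessed so far, in order of first access

    def _split(self, uri):
        """Split /prefix/host:port/path into (host:port, path)."""
        if not uri.startswith(self.prefix + "/"):
            raise ValueError(f"{uri} is outside the namespace {self.prefix}")
        host, _, path = uri[len(self.prefix) + 1:].partition("/")
        if not host:
            raise ValueError("the URI does not name a server")
        return host, path

    def to_url(self, uri):
        """Map a repository URI to the HTTP URL on the remote server."""
        host, path = self._split(uri)
        return f"http://{host}/{path}"

    def access(self, uri):
        """Resolve a URI and record its server as a child of the root collection."""
        host, _ = self._split(uri)
        if host not in self.children:
            self.children.append(host)
        return self.to_url(uri)


repo = DynamicWebRepository("/web")
print(repo.access("/web/www.example.com/"))  # http://www.example.com/
print(repo.access("/web/company:81/"))       # http://company:81/
print([f"{repo.prefix}/{host}" for host in repo.children])
```

The last line lists the children of the root collection, which grows as further servers are accessed.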

When you set up a dynamic Web repository, you can start with an empty root collection, or you can specify a set of predefined Web sites that form its initial root collection. You specify such an initial root collection in the same way as you specify the root collection of a static repository (see below).

Note

SAP recommends that you only set up one dynamic Web repository. If you use more than one dynamic Web repository, crawlers may find the same document in more than one Web repository, and select the location of the document at random. This can make indexing and classifying documents more difficult.

Static Repositories

Static Web repositories are limited to the content of one or more predefined Web sites. They do not connect to servers and Web sites that are not specified in the configuration settings.

For example, a static Web repository with prefix /web2 is configured to give access to http://example.com/ and http://www.sap.com/. The repository /web2 would have only /web2/example.com and /web2/www.sap.com as children. If a service or user tried to access /web2/www.example.com, the attempt would fail.

Because the children of the root collection are static, you can treat them differently than in the dynamic setup:

  • You can assign arbitrary names to the children of the root collection (using the displayname property). For example, you could use the display name Example! for example.com.
  • You can make them point directly to a specific document on a server. For example, instead of linking to http://www.sap.com/ you could directly link to the FAQ at http://www.sap.com/contactsap. Then /web2/SAP-CONTACT could point to that HTML document directly.

You specify the Web site(s) to be accessed by a static Web repository by selecting them from the Systems list in the Web repository manager configuration. As a prerequisite, you must have defined these Web sites.

To do this, choose Content Management → Repository Managers → Web Sites.

The following table shows the differences between the parameter settings for the different types of Web repositories described above.

Parameter Combinations for Different Types of Web Repositories

Repository Type                        Web Sites                        Dynamic
Dynamic w/o initial root collection    Blank                            Selected
Dynamic with initial root collection   One or more Web sites selected   Selected
Static                                 One or more Web sites selected   Deselected
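
Read as a decision rule, the table can be expressed as a short function. This is a sketch only; the function name repository_type and its arguments are hypothetical and merely mirror the Web Sites and Dynamic parameters.

```python
def repository_type(web_sites, dynamic):
    """Name the repository type for a combination of the two parameters."""
    if dynamic:
        if web_sites:
            return "dynamic with initial root collection"
        return "dynamic without initial root collection"
    if web_sites:
        return "static"
    return None  # combination not listed in the table


print(repository_type([], True))
print(repository_type(["spiegel", "faz"], True))
print(repository_type(["spiegel", "faz"], False))
```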

URI to URL Mapping

The resources in a Web repository's root collection (or the single top-level resource of a repository that maps a single Web site) are mapped onto URLs on other servers. You define this mapping by specifying system identifiers and paths on the remote server, either directly in the Web repository manager configuration, or in the configuration of a Web site that is accessed by the Web repository manager. As a prerequisite, you also have to register the system identifiers in the system landscape service.

Defining a mapping for every HTML document that you want to access would be very time-consuming. A Web repository therefore maps CM resources to documents on the remote server automatically, creating mappings relative to the root resources. This mapping is also used when the embedded-links resource property is generated (see the Resource Properties section).

For example, if a repository with the prefix /cnn is configured to point to http://www.cnn.com/, you can reach any document on the CNN Web site by just appending the path. You can access the weather area (http://www.cnn.com/WEATHER/) at /cnn/WEATHER/, world news at /cnn/WORLD/, and so on. This works on every level, so http://www.cnn.com/2002/TECH/industry/01/29/microsoft.reut/index.html would be available at /cnn/2002/TECH/industry/01/29/microsoft.reut/index.html.

Example of URI Mapping

URI in Knowledge Management   HTTP URI
/web                          http://host/a
/web/index.html               http://host/a/index.html
/web/b/c/                     http://host/a/b/c/
/web/search?key=sap           http://host/a/search?key=sap

In order to map HTTP URIs containing parameter separators (such as the last entry in the table above) to resource names that are valid filenames, the Filenames property in the Web repository manager configuration must have the value true (see below).

Some hyperlinks cannot be mapped. For example, if index.html contained a link to http://host, this link could not be mapped onto the namespace of the Web repository. How such links are treated depends on whether or not the respective HTTP URI is mapped to another Web repository, and on the value of the External Server URI Handling property in the repository settings (see below).

This illustrates why configuring a Web repository to point to a specific document (for example, /web2/SAP-CONTACT pointing to http://www.sap.com/contact/) limits the scope of the repository. In this example, the documents http://www.sap.com/ and http://www.sap.com/something/else are not accessible through this Web repository. They cannot be mapped to a name in the namespace of the Web repository /web2.
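
The relative mapping and its limits can be illustrated with a minimal Python sketch. It uses the values from the URI mapping example (/web mapped to http://host/a); the function names uri_to_url and url_to_uri are hypothetical, and the sketch does not model the External Server URI Handling property.

```python
from urllib.parse import urlsplit

ROOT_URI = "/web"            # repository prefix
ROOT_URL = "http://host/a"   # remote location that the root resource maps to


def uri_to_url(uri):
    """Map a repository URI to the HTTP URL on the remote server."""
    if uri != ROOT_URI and not uri.startswith(ROOT_URI + "/"):
        raise ValueError(f"{uri} is not in the namespace {ROOT_URI}")
    return ROOT_URL + uri[len(ROOT_URI):]


def url_to_uri(url):
    """Map an HTTP URL to a repository URI, or return None if it is out of scope."""
    root, target = urlsplit(ROOT_URL), urlsplit(url)
    if (target.scheme, target.netloc) != (root.scheme, root.netloc):
        return None
    if target.path != root.path and not target.path.startswith(root.path + "/"):
        return None
    uri = ROOT_URI + target.path[len(root.path):]
    return uri + ("?" + target.query if target.query else "")


print(uri_to_url("/web/b/c/"))                    # http://host/a/b/c/
print(url_to_uri("http://host/a/search?key=sap")) # /web/search?key=sap
print(url_to_uri("http://host"))                  # None
```

The last call returns None because http://host lies above the mapped root http://host/a, so it has no name in the /web namespace.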

Resource Properties

Resources created by the Web repository manager have standard properties such as contenttype, contentlength, displayname, and description.

The following data is extracted from documents on the Web server and stored as properties of the CM resources that represent the documents in question. While displayname and description are standard properties of CM resources in any kind of repository, the other two properties are specific to resources in Web repositories.

Property Name       Data
displayname         Content of the <title> HTML tag
description         Content of the description HTML META tag
embedded-keywords   List of keywords specified in the keywords HTML META tag
embedded-links      List of hyperlinks, rewritten to be valid relative to the resource in the Web repository namespace

The embedded-keywords property is useful for indexing an HTML document, using keywords specified by the document author.

The embedded-links property can be used to traverse a Web server. The embedded-links property for /web/www.cnn.com contains links to all resources that are linked to from the HTML page at http://www.cnn.com. In this example, these are the resources /web/www.cnn.com/WEATHER/ and /web/www.cnn.com/WORLD/. The embedded-links property of these resources in turn contains all resources referenced by these HTML pages, and so on.

Tip

The URI http://host/site is mapped to the resource /web. A client requests the resource /web/index.html. The Web repository manager requests the document http://host/site/index.html from the remote server.

If index.html looks like this:

<html><head>
  <title>Tension Band Wiring</title>
  <META NAME="keywords" CONTENT="fracture, fixation technique">
</head><body>
  <a href="this.html">link one</a>
  <a href="./there/another.html">link two</a>
</body></html>

The resource /web/index.html has the following properties:

displayname = Tension Band Wiring
embedded-keywords = fracture, fixation technique
embedded-links = "/web/this.html", "/web/there/another.html"

The description property is missing, because index.html does not have a description META tag.
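
The extraction in this example can be reproduced with Python's standard library. The following is a minimal sketch, not the extractor used by the Web repository manager: the class PropertyExtractor is hypothetical, and it only resolves relative links against the resource URI, whereas the actual rewriting rules may be more involved.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PropertyExtractor(HTMLParser):
    """Collects title, META keywords/description, and hyperlinks from HTML."""

    def __init__(self, resource_uri):
        super().__init__()
        self.resource_uri = resource_uri
        self.properties = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive in lowercase
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description") and attrs.get("content"):
                key = "embedded-keywords" if name == "keywords" else "description"
                self.properties[key] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            # Resolve the link relative to the resource's own URI.
            self.links.append(urljoin(self.resource_uri, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            title = self.properties.get("displayname", "") + data
            self.properties["displayname"] = title


page = """<html><head>
<title>Tension Band Wiring</title>
<META NAME="keywords" CONTENT="fracture, fixation technique">
</head><body>
<a href="this.html">link one</a>
<a href="./there/another.html">link two</a>
</body></html>"""

extractor = PropertyExtractor("/web/index.html")
extractor.feed(page)
extractor.properties["embedded-links"] = extractor.links
print(extractor.properties)
```

As in the example, no description entry is produced because the page has no description META tag.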

Resource Content

For all MIME types except text/html, the resource content is exactly the content of the HTTP resource on the remote server. HTML pages obtained from the remote server are given an HTML BASE tag so that all links on the page point to the original server. When the page is opened, the browser behaves as if the user had accessed the remote server directly.

Caution

Links embedded in scripting languages such as JavaScript are not detected and therefore do not work correctly.

Authentication

The Web repository is capable of authenticating users at the remote server. The following types of authentication are supported:

  • Basic authentication
  • Digest authentication
  • NTLM authentication

There are two options for specifying credentials. You can specify the User and Password parameters statically in the HTTP system. If you do not specify these parameters, the credentials are read dynamically from the user mappings of the portal. For more information, see Permissions for Accessing Web Repositories.

Consideration of ROBOTS Entries

When Web repositories are crawled, the Web site's robots.txt file is analyzed.

In HTML documents, the following ROBOTS entries are taken into account:

  • <META NAME="ROBOTS" CONTENT="NOFOLLOW">
  • <META NAME="ROBOTS" CONTENT="NOINDEX">
  • <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

For more information, see Crawlers and Crawler Parameters.
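
How a crawler might evaluate both kinds of ROBOTS information can be sketched with Python's standard library. This is an illustration under stated assumptions, not the crawler's actual logic: the class RobotsMetaParser, the user agent name example-crawler, and the sample robots.txt rules are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Reads the ROBOTS META tag of an HTML document."""

    def __init__(self):
        super().__init__()
        self.index = True   # may the document be indexed?
        self.follow = True  # may its links be followed?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            directives = {d.strip().lower() for d in content.split(",")}
            if "noindex" in directives:
                self.index = False
            if "nofollow" in directives:
                self.follow = False


# robots.txt rules, parsed from text here instead of being fetched from a server.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
print(robots.can_fetch("example-crawler", "http://www.example.com/private/a.html"))
print(robots.can_fetch("example-crawler", "http://www.example.com/index.html"))

meta = RobotsMetaParser()
meta.feed('<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">')
print(meta.index, meta.follow)
```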

Form-Based Logon

The Web repository manager supports access to Web servers that authenticate users through HTML pages with form-based logon dialogs.

To be able to use this function, you must specify the parameters for form-based logon in the configuration of a Web site:

  • Login Form ID
  • Login URI
  • Login User Agent
  • Login Timeout

For a description of the parameters, see Web Sites.

The user information for logging on to the remote Web server is determined in the usual way: either the user specified in the HTTP system is used, or the information is read from the portal user mappings. The Web repository manager then fetches the HTML page with the logon dialog box (specified in the Login URI parameter) from the Web server. The logon dialog box, normally an HTML form, is filled in with the user data and sent to the remote Web server, which returns an HTTP cookie. The cookie identifies the user session and is used to authenticate subsequent requests.

Limitation: Remote Web servers that use URL rewriting instead of cookies to identify sessions are not supported. (With URL rewriting, the session ID is stored in the URL.) Because such URLs are valid only for a short time, they are generally not suited to indexing with TREX.
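
The form-filling step of this procedure can be sketched as follows. This is a rough illustration only: it assumes that Login Form ID corresponds to the id attribute of the HTML form, the names LoginFormParser and build_logon_request and the form field names are hypothetical, and the network part (fetching the page, posting the form, storing the returned cookie) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class LoginFormParser(HTMLParser):
    """Collects the input fields of the form with the given id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.fields = {}
        self._inside = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self._inside = True
        elif tag == "input" and self._inside and attrs.get("name"):
            # Keep preset values such as hidden fields.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def build_logon_request(page, form_id, user_field, password_field, user, password):
    """Fill the logon form with the user data and return the encoded POST body."""
    parser = LoginFormParser(form_id)
    parser.feed(page)
    fields = dict(parser.fields)
    fields[user_field] = user
    fields[password_field] = password
    return urlencode(fields)


page = """<form id="logon" action="/login" method="post">
<input type="hidden" name="token" value="abc">
<input type="text" name="user">
<input type="password" name="password">
</form>"""

print(build_logon_request(page, "logon", "user", "password", "jdoe", "secret"))
```

In a complete client, the encoded body would be posted to the server and the session cookie from the response would be sent with every further request.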

HTTP and HTTPS Support

Remote Web servers, whose URLs must be entered in HTTP systems, can be accessed using HTTP or HTTPS.

Timeouts

Cache Timeout

The Cache Timeout parameter determines how long cached content is considered to be up-to-date. With a timeout value of 3000, for example, the properties of a resource would not be refreshed for three seconds after they have been calculated. Depending on the type of content on a Web site, you usually set the value to several minutes, if not a complete day. The optimal value is determined by the update frequency of the remote site.

To facilitate finding the best setting for the timeout value, the Web repository also respects the Expires header from a remote server. If the server sends this header, the repository adds the timeout to the expiry date. So, when you set the timeout to one hour, this hour is added to all resource expiry information. The remote server knows how to set expiry dates based on the individual update frequency of a resource.
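
The arithmetic can be sketched as follows. This is an illustration, not the repository's caching code: the function fresh_until is hypothetical, and it assumes that without an Expires header the timeout counts from the moment the resource was fetched.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def fresh_until(fetched_at, cache_timeout_ms, expires_header=None):
    """Return the moment until which a cached resource counts as up-to-date."""
    timeout = timedelta(milliseconds=cache_timeout_ms)
    if expires_header:
        # The timeout is added to the expiry date sent by the remote server.
        return parsedate_to_datetime(expires_header) + timeout
    return fetched_at + timeout


fetched = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(fresh_until(fetched, 3000))
print(fresh_until(fetched, 3_600_000, "Mon, 01 Jan 2024 13:00:00 GMT"))
```

With a one-hour timeout and an Expires header of 13:00 GMT, the resource is treated as up-to-date until 14:00 GMT.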

Note

If the Web repositories are updated at regular intervals, you should configure the indexing interval so that the system indexes the repositories shortly after the cache timeout takes place (see Assigning Data Sources). This ensures that the latest version is available in the cache.

HTTP Timeout

The HTTP timeout defines the idle time for an operation against the remote server, after which the operation is aborted. This is a useful protection against faulty connections or incorrectly configured Web servers.

Idle time means the time between successful read/write operations. For example, if the remote server does not answer a request with response data in the given time interval, the request is considered unsuccessful and an error is reported.

Note that this is not the time for reading a complete response. When the Web repository manager is reading a large document, idle time is counted as the time between successful reads of individual pieces of data, not the time for retrieving the complete document. (Explanation: HTTP Timeout determines the SO_TIMEOUT value of the connection socket to the remote server.)
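
The difference between an idle timeout and a total transfer time can be demonstrated with a local socket pair. The sketch uses Python's socket.settimeout, which behaves analogously for read operations; it is not the Web repository manager's code, and the 0.2-second value is chosen only to keep the example fast.

```python
import socket

# A connected pair of local sockets stands in for client and remote server.
client, server = socket.socketpair()
client.settimeout(0.2)  # idle timeout in seconds (HTTP Timeout is given in ms)

server.sendall(b"first chunk")
print(client.recv(1024))  # data arrived in time, so the read succeeds

try:
    client.recv(1024)  # the peer stays silent for longer than the timeout
    timed_out = False
except socket.timeout:
    timed_out = True
print("timed out:", timed_out)

client.close()
server.close()
```

Each read gets its own timeout window, so a large document that keeps delivering data is never aborted merely because the whole transfer takes longer than the timeout.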

Parameters

Enter the following parameters in the configuration of a Web repository manager:

Web Repository Manager Parameters

Parameter: Name
Required: Yes
Description: Name of the repository manager.

Parameter: Description
Required: No
Description: Description of the repository manager.

Parameter: Prefix
Required: Yes
Description: The URI prefix for which the manager is registered. The prefix appears as an entry in the list in the root directory. You must enter the prefix with a leading forward slash, for example, /WEB.

Parameter: Active
Required: No
Description: Activates or deactivates the repository manager.

Parameter: Hide in Root Folder
Required: No
Description: Specifies whether the repository is listed in the root directory. If you activate this parameter, the repository is not listed in the root directory.

Parameter: Repository Services
Required: No
Description: Identifiers of the repository services that you want to use with the repository.

Parameter: Security Manager
Required: No
Description: Selection of the security manager that controls access to repository content. If you want CM to perform an authorization check when resources are accessed, you need to specify a security manager. If the Web server supports authentication, you can use the WebSecurityManager. In addition, you have to carry out user mapping in the portal. For more information, see Permissions for Accessing Web Repositories.

Parameter: ACL Manager Cache
Required: No
Description: Cache identification for resource ACLs. This parameter is required if an ACL security manager is specified in the Security Manager Class parameter. The ca_rsrc_acl cache is preset in the KM standard configuration that is delivered with CM (see Caches).

Parameter: Send Events
Required: No
Description: Specifies whether the repository sends events when operations such as delete and update content are performed. The repository sends events if this parameter is activated. This is necessary in order to use services such as the subscription service.

Parameter: Case-Sensitive URI Handling
Required: No
Description: Determines whether the repository manager distinguishes between lowercase and uppercase letters in resource URLs. If this parameter is active, the system distinguishes between INDEX.HTM and index.htm, for example.

Parameter: Dynamic
Required: No
Description: Determines whether the repository can dynamically add new remote HTTP servers to its namespace. The default value is false.

Parameter: Filenames
Required: No
Description: Determines whether the repository generates resource names that are valid file names. Links inside an HTML page also use such names. The default value is false.

Parameter: Persistent Caching
Required: No
Description: If activated, the database is used to cache objects. The database cache is used when the memory cache reaches the defined size or the maximum number of entries is reached. The maximum possible number of database cache entries is defined in the configuration of the Web garbage collector scheduler task. The cache is used by all Web repository managers for which this parameter is activated.

Parameter: Use System Default Proxy Settings
Required: No
Description: Determines whether the settings for the default proxy system are used.

Parameter: External Server URI Handling
Required: Yes
Description: Determines how URIs (links) inside HTML pages that point to external servers (outside the scope of this Web repository) are handled. Three values are possible. The default value is none.

  • none: URIs are passed unchanged.
  • rewrite: URIs are rewritten to point to URIs on this server, if possible.
  • report: Same as rewrite, but the rewritten URIs are also reported in the 'embedded-URIs' property of the resource.

Parameter: Cache Stale Timeout
Required: No
Description: If defined, this determines the time in milliseconds after which old resources are deleted from the database. The lifetime of cached resources is determined by the Cache Timeout parameter. After that timeout, they are stale. A resource older than the value in Cache Stale Timeout is removed from the cache.

Parameter: Cache Timeout
Required: No
Description: Timeout in milliseconds for resources in the cache. The timeout determines the amount of time for which cached resources (content and properties) are not refreshed. The optimal value is determined by the update frequency of the remote site. However, the timeout value should be shorter than the update interval of the remote location, so that the cache is updated with the latest content. For resources supplied with an 'Expires' header by the remote HTTP server, the timeout is added to the expiry date.

Parameter: HTTP Timeout
Required: No
Description: Timeout in milliseconds after which operations on the server are aborted. The default value is 60000 ms (that is, one minute).

Parameter: Web Sites
Required: No
Description: List of Web sites that you want to include in the root collection of the repository manager. You can configure the options presented here by choosing Content Management → Repository Managers → Web Sites in the Configuration iView. You specify this property only if the repository is to map more than one Web site. In this case, you must not specify the System ID property.

Parameter: HTML Property Extractors
Required: No
Description: Identifier of a Web property extractor to be used with this repository manager for extracting additional resource properties (see Web Property Extractors). If you make changes to the Web property extractor used, you have to empty the cache using the cache monitor.

Parameter: Memory Cache
Required: Yes
Description: Identifier of the memory cache to be used by the Web repository manager for caching both content and properties of generated resources. The Maximum Entry Size property in this cache determines whether the content of a resource is stored in the cache. Only resource content smaller than the Maximum Entry Size is cached. In contrast, resource properties are always cached.

Note

The Web repository manager automatically creates a virtual, persistent cache with the name 'WebRepositoryPersistence-/x', where '/x' is the prefix of the repository. You cannot configure this cache. If you make changes to the HTML property extractor used, you have to empty the cache using the cache monitor.

Activities

To configure a Web repository manager, choose Content Management → Repository Managers → Web Repository. You assign Web sites to the Web repository manager during configuration.

If you have upgraded or migrated your system, you can still launch the configuration of your previous Web repository managers at Content Management → Repository Managers → [Legacy Web Repository].

Example

The following is a sample configuration of a Web repository registered with the URI prefix /web. The repository recognizes two Web sites with the names spiegel and faz. You must configure these Web sites. The remote servers have to be registered as HTTP systems in the system landscape. The repository is static, so it can only access the Web sites specified. It uses a memory cache with the ID web_cache. Requested resources are held in the cache for 3600000 ms (one hour) before they are refreshed on subsequent requests.

Configuration of the Web Repository Manager web

Name          = web
Prefix        = /web
Dynamic       = false
Web Sites     = spiegel, faz
Cache         = web_cache
Cache Timeout = 3600000

Configuration of the Web Site spiegel

Name          = spiegel
Display Name  = Der Spiegel
System ID     = WEB_SPIEGEL
System Path   = politics

Configuration of the Web Site faz

Name          = faz
Display Name  = Frankfurter Allgemeine Zeitung
System ID     = WEB_FAZ
System Path   = news

Specification of the Remote Server http://www.spiegel.de in an HTTP System

System ID        = WEB_SPIEGEL
Server-URL       = http://www.spiegel.de

Specification of the Remote Server http://www.faz.net in an HTTP System

System ID        = WEB_FAZ
Server-URL       = http://www.faz.net