
Crawlers and Crawler Parameters

Use

Knowledge Management uses crawlers to collect resources located in internal or external repositories. The resources found, together with their hierarchical or network-like structures, are transmitted to various services and applications for further processing.

You use crawler parameters to control the behavior of crawlers that are active in your system.

 

Integration

You use the crawler monitor to monitor the activity of crawler tasks. You can also suspend crawler tasks there and resume them at a later point.

 

Prerequisites

The crawler service is activated.

 

Features

KM uses the following crawlers for various tasks.

 

·        Content Exchange Crawler

Used to collect offers in content exchange.

 

·        Subscription Crawler

Used to collect and stage subscription-specific data.

 

·        Taxonomy Crawler

Used to collect and stage data in taxonomies.

 

There is one instance of each of these crawlers. They are preconfigured, so you do not need to change their parameters.

 

·        General Purpose Crawler

Used for various index management tasks.

Crawlers of this type are used to search data sources. This crawler type allows normal and delta crawling. The gathered objects are transmitted to TREX to be indexed.

The standard delivery contains a preconfigured instance named standard. You can create and configure further instances of this crawler, for example, if you want to use the logging function or to specify resource filters.

When you have created a new instance of this crawler, you can select it in index administration.

 

Crawler Parameters

·        Name (required)

Name of the set of crawler parameters.

·        Description (optional)

Description of the set.

·        Maximum Depth (optional)

Maximum number of recursion levels that the crawler takes into account.

For example, a recursion level of 2 means that, starting from a given document, all documents referenced by hyperlinks in the start document, and all documents referenced in those documents in turn, are included in the result set. In hierarchically structured repositories, such as a file-system repository, the recursion levels correspond to the hierarchy levels.

A value of 0 or an empty field means that the depth is unrestricted.
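The depth limit can be pictured as a small recursive walk over a hierarchical repository. A minimal sketch; the repository layout and function names below are invented for the example and are not part of the KM crawler API:

```python
# Illustrative sketch only: shows how a maximum-depth setting bounds a
# recursive crawl. A max_depth of 0 means unrestricted depth.

def crawl(repo, start, max_depth):
    """Collect resources reachable from `start`, descending at most
    `max_depth` recursion levels below the start node."""
    results = []
    seen = set()

    def visit(node, depth):
        if node in seen:
            return
        seen.add(node)
        results.append(node)
        # 0 (or an empty value) means the depth is unrestricted.
        if max_depth and depth >= max_depth:
            return
        for child in repo.get(node, []):
            visit(child, depth + 1)

    visit(start, 0)
    return results

# A small hierarchical repository: folders referencing their children.
repo = {
    "/": ["/docs", "/img"],
    "/docs": ["/docs/a.txt", "/docs/b.txt"],
    "/img": ["/img/logo.png"],
}

print(crawl(repo, "/", 1))  # ['/', '/docs', '/img']
```

With a depth of 1, only the start node and its direct children are collected; with 0, the walk is unbounded (the `seen` set guards against cycles).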

·        Number of Retriever Threads (required)

Number of retriever threads. Retriever threads search the repositories and collect the resources.

The default value is 1. Make sure that the specified number of retrievers is also supported by the remote server.

·        Number of Provider Threads (required)

Number of provider threads. Provider threads transfer the resources found to the service or application that receives the results.

The default value is 3. Choose a value that corresponds to the processing capacity of your system. The higher the number of providers, the higher the system load.
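The division of labor between retriever and provider threads is a classic producer-consumer arrangement. A minimal sketch with the default thread counts (1 retriever, 3 providers); all names are illustrative and this is not the KM crawler implementation:

```python
# Producer-consumer sketch of the retriever/provider split. Retriever
# threads collect resources from the repository; provider threads hand
# the results over to the receiving service.
import queue
import threading

def run_crawl(resources, num_retrievers=1, num_providers=3):
    found = queue.Queue()          # resources waiting to be delivered
    delivered = []
    lock = threading.Lock()

    # Split the work between the retriever threads.
    chunks = [resources[i::num_retrievers] for i in range(num_retrievers)]

    def retriever(chunk):
        for res in chunk:
            found.put(res)         # "collect" a resource

    def provider():
        while True:
            res = found.get()
            if res is None:        # poison pill: shut down
                return
            with lock:
                delivered.append(res)   # "transfer" it to the receiver

    r_threads = [threading.Thread(target=retriever, args=(c,)) for c in chunks]
    p_threads = [threading.Thread(target=provider) for _ in range(num_providers)]
    for t in r_threads + p_threads:
        t.start()
    for t in r_threads:
        t.join()
    for _ in p_threads:            # one poison pill per provider
        found.put(None)
    for t in p_threads:
        t.join()
    return delivered

docs = ["/doc%d" % i for i in range(10)]
print(sorted(run_crawl(docs)))
```

Raising the provider count drains the queue faster but, as the parameter description notes, also raises the load on the receiving system.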

·        Repository Access Delay (optional)

Time in milliseconds that a crawler thread waits before accessing the next document. This delay can be used to lower the load on the source repository or the network during crawling.

·        Document Retrieval Timeout (optional)

Time in seconds after which the crawler stops processing a document and moves on to the next one.
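How the access delay and the per-document timeout might interact can be sketched as follows. All function names are invented for the example; on timeout, the sketch simply abandons the worker thread rather than killing it:

```python
# Sketch of applying a repository access delay and a per-document
# retrieval timeout around fetching. Illustrative only.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def fetch_with_timeout(fetch, doc, timeout_s):
    """Run fetch(doc), giving up after timeout_s seconds.
    The worker thread is abandoned, not killed, on timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fetch, doc).result(timeout=timeout_s)
    except TimeoutError:
        return None                              # move on to the next document
    finally:
        pool.shutdown(wait=False)

def crawl_documents(documents, fetch, access_delay_ms=0, timeout_s=5.0):
    results, skipped = [], []
    for doc in documents:
        if access_delay_ms:
            time.sleep(access_delay_ms / 1000.0)  # lower repository/network load
        content = fetch_with_timeout(fetch, doc, timeout_s)
        (results if content is not None else skipped).append(doc)
    return results, skipped
```

A slow source would show up in `skipped` instead of stalling the whole crawl.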

·        Resource Filters (Scope) (optional)

Resource filters that limit the scope of the crawling process.

·        Resource Filters (Result) (optional)

Resource filters applied to the results of the crawling process.

·        Follow Links (or Redirects on Web Sites) (optional)

If activated, the crawler follows links in hierarchical repositories and redirects in Web repositories.

·        Verify Modification Using Checksum (optional)

If activated, a checksum comparison is used to determine whether a file has been modified.

·        Verify Modification Using ETag (optional)

If activated, an ETag comparison is used to determine whether a file has been modified.

·        Condition for Treating a Document as Modified (required)

Specifies the conditions that must be met for a document to be treated as modified.

OR: The document is treated as modified if the date, the ETag, or the checksum has changed.

AND: The document is treated as modified only if the date, the ETag, and the checksum have all changed.
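The OR/AND combination can be sketched as follows. The field names and the use of MD5 as the checksum are illustrative assumptions, not the KM implementation:

```python
# Sketch of the OR/AND modification check combining the date, ETag, and
# checksum criteria. Field names and the MD5 digest are illustrative.
import hashlib

def is_modified(old, new, condition="OR", use_checksum=True, use_etag=True):
    """old/new are dicts with 'date', 'etag', and 'content' fields."""
    checks = [old["date"] != new["date"]]
    if use_etag:
        checks.append(old["etag"] != new["etag"])
    if use_checksum:
        digest = lambda d: hashlib.md5(d["content"]).hexdigest()
        checks.append(digest(old) != digest(new))
    # OR: any enabled criterion changed; AND: all enabled criteria changed.
    return any(checks) if condition == "OR" else all(checks)

old = {"date": 1, "etag": "x", "content": b"a"}
new = {"date": 2, "etag": "x", "content": b"a"}   # only the date changed
print(is_modified(old, new, "OR"))   # True
print(is_modified(old, new, "AND"))  # False
```

Under OR, a date change alone is enough; under AND, a document whose date changed but whose content and ETag did not is left untouched.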

·        Crawl Hidden Documents (optional)

If activated, the crawler takes hidden documents into account.

·        Crawl Document Versions (optional)

If activated, the crawler takes document versions into account.

·        Maximum Log Level (required)

Defines the amount of information written to the log files.

off: No log file is written.

error: Errors are written to a log file.

info: Like error, but all documents found are also listed.

Information on when crawlers start and stop, as well as crawler service messages, is written to the application log.
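As an illustration only, the three levels could be mapped onto a standard logging framework like this (Python's logging module; the mapping is an assumption, not how the crawler service is implemented):

```python
# Illustrative mapping of the crawler log levels onto Python's logging
# module. The level names come from the parameter above; the rest is
# an assumption for this sketch.
import logging

LEVELS = {
    "off": logging.CRITICAL + 1,   # suppress everything
    "error": logging.ERROR,        # errors only
    "info": logging.INFO,          # errors plus every document found
}

def make_logger(max_log_level):
    logger = logging.getLogger("crawler")
    logger.setLevel(LEVELS[max_log_level])
    return logger
```

With "error", only failed documents would be recorded; with "info", the log additionally lists every document found, which is mainly useful for test runs.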

·        Path for Log Files (optional)

Path to the location of the log files. If you do not specify a path, the directory /etc/log/crawler is used.

·        Maximum Size of a Single Log File (optional)

Maximum size of a log file in bytes.

·        Maximum Number of Backed-Up Log Files (optional)

Maximum number of log files that can be stored.

·        Test Mode (optional)

Specifies whether the crawler runs in test mode. Activate this parameter if you want to carry out a test run with the crawler. The results of the crawler are not processed further and no indexing takes place.

Set the log file parameters appropriately for a test.

 

If errors occur during crawling and cause the crawler to terminate, the errors are written to the application log.

If the portal is restarted while crawling is in progress, the crawlers automatically resume from the point at which processing was interrupted.
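This resume behavior can be pictured as a crawl that persists a checkpoint after each processed document. A minimal sketch, assuming an invented JSON checkpoint file; KM's actual persistence mechanism is internal and not shown here:

```python
# Sketch of a resumable crawl: progress is written to a checkpoint file
# after every document, so a restart skips what was already processed.
# The checkpoint format is invented for this example.
import json
import os

def crawl_resumable(documents, process, checkpoint_path):
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))      # skip already-processed documents
    for doc in documents:
        if doc in done:
            continue
        process(doc)
        done.add(doc)
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)    # persist progress after each document
```

If `process` is interrupted partway through, a second call with the same checkpoint path picks up where the first one stopped instead of reprocessing everything.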

 

Note

Crawlers that are used by the subscription service and the content exchange service do not continue their activities when the portal is restarted. They are restarted only at the next time specified in the corresponding scheduler tasks.

 

 

Note

The size of the database has a considerable influence on the speed of the crawling process. With a large database, the crawlers work faster than with a small one.

 

The file robots.txt is evaluated when a Web repository is crawled.
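As an illustration of robots.txt evaluation, the following sketch uses Python's standard urllib.robotparser with a made-up rule set; the KM Web repository crawler performs its own evaluation internally:

```python
# Sketch of checking robots.txt rules before crawling a web repository,
# using Python's standard urllib.robotparser. The rules are a made-up
# example parsed offline (no network access).
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "http://example.com/public/page.html"))   # True
print(parser.can_fetch("*", "http://example.com/private/page.html"))  # False
```

A crawler that honors these rules would skip everything under /private/ while collecting the rest of the site.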

 

Activities

To create a new set of crawler parameters, choose Content Management → Global Services → Crawlers → Standard Crawler.

 

See also:

Crawler Monitor

 

 

 
