Crawlers and Crawler Parameters 
Knowledge Management uses crawlers to collect resources located in internal or external repositories. The resources and hierarchical or net-like structures found are transmitted to various services and applications for further processing.
You use crawler parameters to control the behavior of crawlers that are active in your system.
You use the crawler monitor to monitor the activity of crawler tasks. You can also suspend them there and then continue them at a later point.
Prerequisite: The crawler service must be activated.
KM uses the following crawlers for various tasks.
· Content Exchange Crawler
Used to collect offers in content exchange.
· Subscription Crawler
Used to collect and stage subscription-specific data.
· Taxonomy Crawler
Used to collect and stage data in taxonomies.
There is one instance of each of these crawlers. They are preconfigured, so you do not need to change their parameters.
· General Purpose Crawler
Used for various index management tasks.
Crawlers of this type are used to search data sources. This crawler type supports both normal and delta crawling, and the gathered objects are transmitted to TREX to be indexed (a sketch of the delta check follows this list).
The standard delivery contains the preconfigured instance standard. You can create and configure further instances of this crawler, for example, if you want to use the logging function or specify resource filters.
When you have created a new instance of this crawler, you can select it in index administration.
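The crawler API itself is not documented here; the following is a minimal, hypothetical Java sketch of the delta-crawl decision (the Resource class and needsIndexing method are illustrative, not part of KM). In a delta run, only resources modified since the previous crawl are forwarded for indexing, while a normal run forwards everything.

// Hypothetical sketch of a delta-crawl decision; names are illustrative
// and not part of the actual KM crawler API.
import java.util.Date;

public class DeltaCrawlExample {

    /** Minimal stand-in for a resource found in a repository. */
    static class Resource {
        String rid;            // repository path of the resource
        Date lastModified;     // modification timestamp reported by the repository

        Resource(String rid, Date lastModified) {
            this.rid = rid;
            this.lastModified = lastModified;
        }
    }

    /**
     * In a delta run, only resources changed since the previous crawl are
     * forwarded for indexing; in a normal (full) run, every resource is sent.
     */
    static boolean needsIndexing(Resource resource, Date lastCrawlTime, boolean deltaRun) {
        if (!deltaRun || lastCrawlTime == null) {
            return true;                                      // full crawl: index everything
        }
        return resource.lastModified.after(lastCrawlTime);    // delta crawl: only changed resources
    }

    public static void main(String[] args) {
        Date lastCrawl = new Date(System.currentTimeMillis() - 86_400_000L); // one day ago
        Resource doc = new Resource("/documents/reports/q1.pdf", new Date());
        System.out.println("Re-index? " + needsIndexing(doc, lastCrawl, true));
    }
}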
Crawler Parameters
Parameter | Required | Description
Name | Yes | Name of the set of crawler parameters.
Description | No | Description of the set.
Maximum Depth | No | Maximum number of recursion levels to be taken into account by the crawler. For example, a recursion level of '2' means that, starting from a given document, all documents referenced by hyperlinks in the start document and all documents referenced in turn in those documents are included in the results set. With hierarchically structured repositories, such as a file-system repository, the recursion level corresponds to the hierarchy levels. 0 or an empty field means that the depth is unrestricted.
Number of Retriever Threads | Yes | Number of retriever threads. Retriever threads search the repositories and collect the resources. The default value is 1. Make sure that the specified number of retrievers is also supported by the remote server.
Number of Provider Threads | Yes | Number of provider threads. Provider threads transfer the resources found to the service or application that receives the results. The default value is 3. Choose a value that corresponds to the processing activity of your system. The higher the number of providers, the higher the system load.
Repository Access Delay | No | Time in milliseconds that a crawler thread waits before accessing the next document. This delay can be used to reduce the load on the source repository or network during the loading process.
Document Retrieval Timeout | No | Time in seconds after which the crawler finishes the crawling process for one document and moves on to the next.
Resource Filters (Scope) | No | Resource filters that limit the scope of the crawling process.
Resource Filters (Result) | No | Resource filters to be applied to the results of the crawling process.
Follow Links (or Redirects on Web-Sites) | No | If activated, the crawler follows links in hierarchical repositories and redirects in Web repositories.
Verify Modification Using Checksum | No | If activated, a checksum comparison is carried out to check whether a file has been modified.
Verify Modification Using ETag | No | If activated, an ETag comparison is carried out to check whether a file has been modified.
Condition for Treating a Document as Modified | Yes | Specifies the conditions that must be met for a document to be treated as modified. OR: The document is treated as modified if the date, ETag, or checksum has changed. AND: The document is treated as modified only if the date, ETag, and checksum have all changed.
Crawl Hidden Documents | No | If activated, the crawler takes hidden documents into account.
Crawl Document Versions | No | If activated, the crawler takes versions of documents into account.
Maximum Log Level | Yes | Defines the amount of information written to the log files. off: No log file is written. error: Errors are written to a log file. info: Like error, but all documents found are also listed. Information on when the crawlers start and stop, as well as crawler service messages, is written to the application log.
Path for Log Files | No | Path to the location of the log files. If you do not specify a path, the directory /etc/log/crawler is used.
Maximum Size of a Single Log File | No | Maximum size of a log file in bytes.
Maximum Number of Backed Up Log Files | No | Maximum number of log files that can be stored.
Test Mode | No | Specifies whether the crawler runs in test mode. Activate this parameter if you want to carry out a test with the crawler. The crawler results are not processed further and no indexing takes place. Set the log file parameters appropriately for a test.
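As an illustration of the Condition for Treating a Document as Modified parameter, the following hypothetical Java sketch (not the KM implementation) shows how the OR and AND settings combine the date, ETag, and checksum comparisons, assuming the corresponding verification parameters are activated.

// Illustrative sketch of the "Condition for Treating a Document as Modified"
// logic; the class and its methods are hypothetical, not the KM implementation.
public class ModificationCheckExample {

    enum Condition { OR, AND }

    /**
     * Combines the individual comparisons according to the configured condition.
     * With OR, one changed indicator is enough; with AND, all indicators
     * (date, ETag, checksum) must have changed.
     */
    static boolean isModified(boolean dateChanged, boolean etagChanged,
                              boolean checksumChanged, Condition condition) {
        if (condition == Condition.OR) {
            return dateChanged || etagChanged || checksumChanged;
        }
        return dateChanged && etagChanged && checksumChanged;
    }

    public static void main(String[] args) {
        // Only the modification date differs: treated as modified under OR, not under AND.
        System.out.println(isModified(true, false, false, Condition.OR));   // true
        System.out.println(isModified(true, false, false, Condition.AND));  // false
    }
}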
If errors occur when using the crawler, and the crawler terminates as a result, the errors are written to the application log.
If the portal is restarted while crawling is in progress, the crawlers automatically resume from the point where processing was interrupted.

Crawlers used by the subscription service and the content exchange service do not continue their activities when the portal is restarted. They are restarted only at the next time specified in the corresponding scheduler tasks.

The size of the database has a significant influence on the speed of the crawling process. The crawlers work faster with a large database than with a small one.
The file robots.txt is evaluated when a Web repository is crawled.
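As a rough, hypothetical illustration of what such an evaluation typically involves, the following sketch skips URLs whose paths fall under a Disallow prefix; the real Web repository crawler may apply further rules, such as user-agent sections and Allow entries.

// Rough, hypothetical illustration of robots.txt evaluation; not the KM implementation.
import java.util.List;

public class RobotsTxtExample {

    /** Returns true if the given URL path matches one of the Disallow prefixes. */
    static boolean isDisallowed(String path, List<String> disallowPrefixes) {
        for (String prefix : disallowPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return true;   // path falls under a Disallow rule: do not crawl it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> disallowed = List.of("/private/", "/tmp/");
        System.out.println(isDisallowed("/private/report.html", disallowed)); // true
        System.out.println(isDisallowed("/public/index.html", disallowed));   // false
    }
}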
To create a new set of crawler parameters, choose Content Management → Global Services → Crawlers → Standard Crawler.