Crawlers and Crawler Parameters 
Crawlers are used in Knowledge Management to collect resources that are stored in internal or external repositories. The resources found and the hierarchical or net-like structures are forwarded to various services and applications for further processing.
You can use crawler parameters to determine the behavior of the active crawlers in the system.
In the crawler monitor, you monitor the activity of crawler tasks and can suspend them if necessary and continue them at a later time.
Prerequisite: The crawler service is activated.
In KM, the following crawlers are used for various tasks:
● Content Exchange Crawler
Is used to collect and group offers for content exchange.
● Subscription Crawler
Is used to collect and make available subscription-specific data.
● Taxonomy Crawler
Is used to collect and make data available in taxonomies.
There is exactly one instance of each of these crawlers. Since they are preconfigured, you do not need to change their parameters.
● General Purpose Crawler
Is used for various tasks in index management.
Crawlers of this type are used to search for object addresses in data sources. This crawler type allows both normal crawling and delta crawling. The addresses collected are provided to TREX for indexing.
The standard system contains the preconfigured instance standard. You can create and configure further instances of this crawler, for example, if you want to use the log functions or specify resource filters.
Once you have created a new instance of this crawler, you can select it in Index Administration.
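The delta crawling supported by the General Purpose Crawler can be sketched as a comparison of two snapshots of object addresses: only addresses that are new or changed since the last run need to be passed on for re-indexing. The following Python sketch is illustrative only; the function name and data shapes are assumptions, not the KM API.

```python
# Hypothetical sketch of delta crawling: compare the address snapshot of
# the previous run with the current run and report what changed.
# The modification marker per address could be a date, ETag, or checksum.

def delta_crawl(previous, current):
    """Compare two {address: modification_marker} snapshots.

    Returns (added, modified, deleted) sets of addresses.
    """
    prev_keys, curr_keys = set(previous), set(current)
    added = curr_keys - prev_keys
    deleted = prev_keys - curr_keys
    modified = {a for a in curr_keys & prev_keys
                if current[a] != previous[a]}
    return added, modified, deleted


if __name__ == "__main__":
    last_run = {"/docs/a.txt": "etag-1", "/docs/b.txt": "etag-2"}
    this_run = {"/docs/a.txt": "etag-1", "/docs/b.txt": "etag-3",
                "/docs/c.txt": "etag-1"}
    print(delta_crawl(last_run, this_run))
    # → ({'/docs/c.txt'}, {'/docs/b.txt'}, set())
```

A normal crawling run would correspond to treating every address in the current snapshot as new.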
Crawler Parameters
Parameter | Required | Description
Name | Yes | Name of the set of crawler parameters
Description | No | Description of the set
Maximum Depth | No | Maximum number of recursion levels that the crawler takes into account. For example, a recursion level of 2 means that, starting from a given document, all documents referenced by hyperlinks in the start document, and all documents referenced in turn in those documents, are included in the results set. With hierarchically structured repositories, such as a file system repository, the recursion level corresponds to the hierarchy levels. The entry 0 or an empty input field stands for unrestricted depth.
Number of Retriever Threads | Yes | Number of retriever threads. Retriever threads search the repositories and collect the resources. The default value is 1. Make sure that the specified number of retrievers is also supported by the remote server.
Number of Provider Threads | Yes | Number of provider threads. Provider threads transfer the resources found to the service or application that receives the results. The default value is 3. Choose a value that corresponds to the processing capacity of your system. The higher the number of providers, the higher the system load.
Repository Access Delay | No | Time in milliseconds that a crawler thread waits before accessing the next document. This delay can be used to reduce the load on the source repository or the network during the crawling run.
Document Retrieval Timeout | No | Time interval in seconds after which the crawler ends the crawling run for a document and moves on to the next document.
Resource Filters (Scope) | No | Resource filters that reduce the scope of the crawling run (see Resource Filters).
Resource Filters (Result) | No | Resource filters that are applied to the results of the crawling run (see Resource Filters).
Follow Links (or Redirects on Websites) | No | If this parameter is activated, the crawler follows links in hierarchical repositories or redirects in Web repositories.
Verify Modification Using Checksum | No | If this parameter is activated, the system performs a checksum comparison to detect modifications made to a file.
Verify Modification Using ETag | No | If this parameter is activated, the system performs an ETag comparison to detect modifications made to a file.
Condition for Treating a Document as Modified | Yes | Specifies which conditions must be fulfilled for a document to be considered modified. OR: The document is considered modified if the date, the ETag, or the checksum has changed. AND: The document is considered modified only if the date, the ETag, and the checksum have all changed.
Crawl Hidden Documents | No | If this parameter is activated, the crawler takes hidden documents into account.
Crawl Document Versions | No | If this parameter is activated, the crawler takes versions of documents into account.
Maximum Log Level | Yes | Degree of information written to the application logs. off: No application log is written. error: An application log is written for errors. info: Like error, but all documents found are also listed. Starting and stopping the crawler and messages from the crawler service are written to the application log.
Path for Log Files | No | Path where the log files are stored. If you do not specify a path, the system uses the directory /etc/log/crawler to store the log files.
Maximum Size of a Single Log File | No | Maximum size of a log file in bytes.
Maximum Number of Backed Up Log Files | No | Maximum number of log files that are saved.
Test Mode | No | Specifies whether the crawler operates in test mode. Activate this parameter if you want to test the crawler. The results of the crawler are not processed further and indexing does not take place. Set the log file parameters accordingly during a test.
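The OR/AND logic of the Condition for Treating a Document as Modified parameter can be sketched as follows. This is a minimal illustration, not the KM implementation; in practice the date, ETag, and checksum checks presumably participate only if the corresponding Verify Modification parameters are activated.

```python
# Hypothetical sketch of the "Condition for Treating a Document as
# Modified" parameter: the date, ETag, and checksum comparisons are
# combined with OR or AND. Field names are illustrative assumptions.

def is_modified(old, new, condition="OR"):
    """old/new are dicts with 'date', 'etag', and 'checksum' keys."""
    changes = [old[k] != new[k] for k in ("date", "etag", "checksum")]
    return any(changes) if condition == "OR" else all(changes)
```

With OR, a changed date alone is enough to mark the document as modified; with AND, all three properties must differ before the crawler re-processes the document.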
If there are errors when using the crawler that terminate the process, they are recorded in the application log.
If the portal is restarted during the crawling run, the crawlers automatically continue their activities from the point where they were interrupted.

Note that crawlers that are used by the subscription service and the content exchange do not continue their activities when the portal is restarted. They start again at the next time that is entered in the corresponding scheduler tasks.

Note that the size of the database has a significant influence on the speed of the crawling process. If you are using a large database, the crawlers work more quickly than if you use a small database.
When a Web repository is crawled, the robots.txt file is analyzed.
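Python's standard-library robots.txt parser illustrates the kind of check a Web crawler performs before fetching a URL. This shows the general robots.txt mechanism only, not the KM crawler's internal implementation.

```python
# Parse a robots.txt file and check whether given URLs may be fetched,
# using Python's standard-library urllib.robotparser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "http://example.com/private/page.html"))  # → False
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # → True
```

URLs disallowed for the crawler's user agent are excluded from the crawling run.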
To create a new set of crawler parameters, choose Content Management → Global Services → Crawler Parameters → General Purpose Crawler.