Crawlers and Crawler Parameters 
Knowledge Management uses crawlers to collect resources located in internal or external repositories. The resources and hierarchical or net-like structures found are transmitted to various services and applications for further processing.
You use crawler parameters to control the behavior of crawlers that are active in your system.
You use the crawler monitor to monitor the activity of crawler tasks. You can also suspend them there and then continue them at a later point.
Prerequisite: The crawler service must be activated.
KM uses the following crawlers for various tasks.
· Content Exchange Crawler
Used to collect offers in content exchange.
· Subscription Crawler
Used to collect and stage subscription-specific data.
· Taxonomy Crawler
Used to collect and stage data in taxonomies.
There is one instance of each of these crawlers. They are preconfigured, so you do not need to change their parameters.
· General Purpose Crawler
Used for various index management tasks.
Crawlers of this type are used to search data sources. This crawler type supports both normal and delta crawling, and the gathered objects are transmitted to TREX to be indexed (a sketch of the delta check follows this list).
The standard delivery contains the preconfigured instance standard. You can create and configure further instances of this crawler, for example, if you want to use the logging function or specify resource filters.
When you have created a new instance of this crawler, you can select it in index administration.
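The crawler API itself is not documented here; the following is a minimal, hypothetical Java sketch of the delta-crawl decision (the Resource class and needsIndexing method are illustrative, not part of KM). In a delta run, only resources modified since the previous crawl are forwarded for indexing, while a normal run forwards everything.

// Hypothetical sketch of a delta-crawl decision; names are illustrative
// and not part of the actual KM crawler API.
import java.util.Date;

public class DeltaCrawlExample {

    /** Minimal stand-in for a resource found in a repository. */
    static class Resource {
        String rid;            // repository path of the resource
        Date lastModified;     // modification timestamp reported by the repository

        Resource(String rid, Date lastModified) {
            this.rid = rid;
            this.lastModified = lastModified;
        }
    }

    /**
     * In a delta run, only resources changed since the previous crawl are
     * forwarded for indexing; in a normal (full) run, every resource is sent.
     */
    static boolean needsIndexing(Resource resource, Date lastCrawlTime, boolean deltaRun) {
        if (!deltaRun || lastCrawlTime == null) {
            return true;                                      // full crawl: index everything
        }
        return resource.lastModified.after(lastCrawlTime);    // delta crawl: only changed resources
    }

    public static void main(String[] args) {
        Date lastCrawl = new Date(System.currentTimeMillis() - 86_400_000L); // one day ago
        Resource doc = new Resource("/documents/reports/q1.pdf", new Date());
        System.out.println("Re-index? " + needsIndexing(doc, lastCrawl, true));
    }
}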
Crawler Parameters
Parameter | Required | Description
Name | Yes | Name of the set of crawler parameters.
Description | No | Description of the set.
Maximum Depth | No | Maximum number of recursion levels to be taken into account by the crawler. For example, a recursion level of '2' means that, starting from a given document, all documents referenced by hyperlinks in the start document and all documents referenced in turn in those documents are included in the results set. With hierarchically structured repositories, such as a file-system repository, the recursion level corresponds to the hierarchy levels. 0 or an empty field means that the depth is unrestricted.
Number of Retriever Threads | Yes | Number of retriever threads. Retriever threads search the repositories and collect the resources. The default value is 1. Make sure that the specified number of retrievers is also supported by the remote server.
Number of Provider Threads | Yes | Number of provider threads. Provider threads transfer the resources found to the service or application that receives the results. The default value is 3. Choose a value that corresponds to the processing activity of your system. The higher the number of providers, the higher the system load.
Repository Access Delay | No | Time in milliseconds that a crawler thread waits before accessing the next document. This delay can be used to reduce the load on the source repository or network during the loading process.
Document Retrieval Timeout | No | Time in seconds after which the crawler finishes the crawling process for one document and moves on to the next.
Resource Filters (Scope) | No | Resource filters that limit the scope of the crawling process.
Resource Filters (Result) | No | Resource filters to be applied to the results of the crawling process.
Follow Links (or Redirects on Web-Sites) | No | If activated, the crawler follows links in hierarchical repositories and redirects in Web repositories.
Verify Modification Using Checksum | No | If activated, a checksum comparison is carried out to check whether a file has been modified.
Verify Modification Using ETag | No | If activated, an ETag comparison is carried out to check whether a file has been modified.
Condition for Treating a Document as Modified | Yes | Specifies the conditions that must be met for a document to be treated as modified. OR: The document is treated as modified if the date, ETag, or checksum has changed. AND: The document is treated as modified only if the date, ETag, and checksum have all changed.
Crawl Hidden Documents | No | If activated, the crawler takes hidden documents into account.
Crawl Document Versions | No | If activated, the crawler takes versions of documents into account.
Maximum Log Level | Yes | Defines the amount of information written to the log files. off: No log file is written. error: Errors are written to a log file. info: Like error, but all documents found are also listed. Information on when the crawlers start and stop, as well as crawler service messages, is written to the application log.
Path for Log Files | No | Path to the location of the log files. If you do not specify a path, the directory /etc/log/crawler is used.
Maximum Size of a Single Log File | No | Maximum size of a log file in bytes.
Maximum Number of Backed Up Log Files | No | Maximum number of log files that can be stored.
Test Mode | No | Specifies whether the crawler runs in test mode. Activate this parameter if you want to carry out a test with the crawler. The crawler results are not processed further and no indexing takes place. Set the log file parameters appropriately for a test.
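As an illustration of the Condition for Treating a Document as Modified parameter, the following hypothetical Java sketch (not the KM implementation) shows how the OR and AND settings combine the date, ETag, and checksum comparisons, assuming the corresponding verification parameters are activated.

// Illustrative sketch of the "Condition for Treating a Document as Modified"
// logic; the class and its methods are hypothetical, not the KM implementation.
public class ModificationCheckExample {

    enum Condition { OR, AND }

    /**
     * Combines the individual comparisons according to the configured condition.
     * With OR, one changed indicator is enough; with AND, all indicators
     * (date, ETag, checksum) must have changed.
     */
    static boolean isModified(boolean dateChanged, boolean etagChanged,
                              boolean checksumChanged, Condition condition) {
        if (condition == Condition.OR) {
            return dateChanged || etagChanged || checksumChanged;
        }
        return dateChanged && etagChanged && checksumChanged;
    }

    public static void main(String[] args) {
        // Only the modification date differs: treated as modified under OR, not under AND.
        System.out.println(isModified(true, false, false, Condition.OR));   // true
        System.out.println(isModified(true, false, false, Condition.AND));  // false
    }
}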
If errors occur when using the crawler, and the crawler terminates as a result, the errors are written to the application log.
If the portal is restarted while crawling is in progress, the crawlers automatically resume from the point where processing was interrupted.

Crawlers used by the subscription service and the content exchange service do not continue their activities when the portal is restarted. They are restarted only at the next time specified in the corresponding scheduler tasks.

The size of the database has a significant influence on the speed of the crawling process. The crawlers work faster with a large database than with a small one.
The file robots.txt is evaluated when a Web repository is crawled.
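As a rough, hypothetical illustration of what such an evaluation typically involves, the following sketch skips URLs whose paths fall under a Disallow prefix; the real Web repository crawler may apply further rules, such as user-agent sections and Allow entries.

// Rough, hypothetical illustration of robots.txt evaluation; not the KM implementation.
import java.util.List;

public class RobotsTxtExample {

    /** Returns true if the given URL path matches one of the Disallow prefixes. */
    static boolean isDisallowed(String path, List<String> disallowPrefixes) {
        for (String prefix : disallowPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return true;   // path falls under a Disallow rule: do not crawl it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> disallowed = List.of("/private/", "/tmp/");
        System.out.println(isDisallowed("/private/report.html", disallowed)); // true
        System.out.println(isDisallowed("/public/index.html", disallowed));   // false
    }
}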
To create a new set of crawler parameters, choose Content Management → Global Services → Crawlers → Standard Crawler.