Crawlers and Crawler Parameters 
Crawlers are used in Knowledge Management to collect resources that are stored in internal or external repositories. The resources found and the hierarchical or net-like structures are forwarded to various services and applications for further processing.
You can use crawler parameters to determine the behavior of the active crawlers in the system.
In the crawler monitor, you monitor the activity of crawler tasks and can suspend them if necessary and continue them at a later time.
Prerequisite: The crawler service is activated.
In KM, the following crawlers are used for various tasks:
● Content Exchange Crawler
Is used to collect and group offers for content exchange.
● Subscription Crawler
Is used to collect and make available subscription-specific data.
● Taxonomy Crawler
Is used to collect and make data available in taxonomies.
There is exactly one instance of each of these crawlers. Since they are preconfigured, you do not need to change their parameters.
● General Purpose Crawler
Is used for various tasks in index management.
Crawlers of this type are used to search for object addresses in data sources. This crawler type allows both normal crawling and delta crawling. The addresses collected are provided to TREX for indexing.
The standard system contains the preconfigured instance standard. You can create and configure further instances of this crawler, for example, if you want to use the log functions or specify resource filters.
Once you have created a new instance of this crawler, you can select it in Index Administration.
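The delta crawling supported by the General Purpose Crawler can be sketched as a comparison of two snapshots of object addresses: only addresses that are new or changed since the last run need to be passed on for re-indexing. The following Python sketch is illustrative only; the function name and data shapes are assumptions, not the KM API.

```python
# Hypothetical sketch of delta crawling: compare the address snapshot of
# the previous run with the current run and report what changed.
# The modification marker per address could be a date, ETag, or checksum.

def delta_crawl(previous, current):
    """Compare two {address: modification_marker} snapshots.

    Returns (added, modified, deleted) sets of addresses.
    """
    prev_keys, curr_keys = set(previous), set(current)
    added = curr_keys - prev_keys
    deleted = prev_keys - curr_keys
    modified = {a for a in curr_keys & prev_keys
                if current[a] != previous[a]}
    return added, modified, deleted


if __name__ == "__main__":
    last_run = {"/docs/a.txt": "etag-1", "/docs/b.txt": "etag-2"}
    this_run = {"/docs/a.txt": "etag-1", "/docs/b.txt": "etag-3",
                "/docs/c.txt": "etag-1"}
    print(delta_crawl(last_run, this_run))
    # → ({'/docs/c.txt'}, {'/docs/b.txt'}, set())
```

A normal crawling run would correspond to treating every address in the current snapshot as new.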
Crawler Parameters
Parameter | Required | Description
Name | Yes | Name of the set of crawler parameters
Description | No | Description of the set
Maximum Depth | No | Maximum number of recursion levels that the crawler takes into account. For example, a recursion level of 2 means that, starting from a given document, all documents referenced by hyperlinks in the start document, and all documents referenced in turn in those documents, are included in the results set. With hierarchically structured repositories, such as a file system repository, the recursion level corresponds to the hierarchy levels. The entry 0 or an empty input field stands for unrestricted depth.
Number of Retriever Threads | Yes | Number of retriever threads. Retriever threads search the repositories and collect the resources. The default value is 1. Make sure that the specified number of retrievers is also supported by the remote server.
Number of Provider Threads | Yes | Number of provider threads. Provider threads transfer the resources found to the service or application that receives the results. The default value is 3. Choose a value that corresponds to the processing capacity of your system. The higher the number of providers, the higher the system load.
Repository Access Delay | No | Time in milliseconds that a crawler thread waits before accessing the next document. This delay can be used to reduce the load on the source repository or the network during the crawling run.
Document Retrieval Timeout | No | Time interval in seconds after which the crawler ends the crawling run for a document and moves on to the next document.
Resource Filters (Scope) | No | Resource filters that reduce the scope of the crawling run (see Resource Filters).
Resource Filters (Result) | No | Resource filters that are applied to the results of the crawling run (see Resource Filters).
Follow Links (or Redirects on Websites) | No | If this parameter is activated, the crawler follows links in hierarchical repositories or redirects in Web repositories.
Verify Modification Using Checksum | No | If this parameter is activated, the system performs a checksum comparison to detect modifications made to a file.
Verify Modification Using ETag | No | If this parameter is activated, the system performs an ETag comparison to detect modifications made to a file.
Condition for Treating a Document as Modified | Yes | Specifies which conditions must be fulfilled for a document to be considered modified. OR: The document is considered modified if the date, the ETag, or the checksum has changed. AND: The document is considered modified only if the date, the ETag, and the checksum have all changed.
Crawl Hidden Documents | No | If this parameter is activated, the crawler takes hidden documents into account.
Crawl Document Versions | No | If this parameter is activated, the crawler takes versions of documents into account.
Maximum Log Level | Yes | Degree of information written to the application logs. off: No application log is written. error: An application log is written for errors. info: Like error, but all documents found are also listed. Starting and stopping the crawler and messages from the crawler service are written to the application log.
Path for Log Files | No | Path where the log files are stored. If you do not specify a path, the system uses the directory /etc/log/crawler to store the log files.
Maximum Size of a Single Log File | No | Maximum size of a log file in bytes.
Maximum Number of Backed Up Log Files | No | Maximum number of log files that are saved.
Test Mode | No | Specifies whether the crawler operates in test mode. Activate this parameter if you want to test the crawler. The results of the crawler are not processed further and indexing does not take place. Set the log file parameters accordingly during a test.
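The OR/AND logic of the Condition for Treating a Document as Modified parameter can be sketched as follows. This is a minimal illustration, not the KM implementation; in practice the date, ETag, and checksum checks presumably participate only if the corresponding Verify Modification parameters are activated.

```python
# Hypothetical sketch of the "Condition for Treating a Document as
# Modified" parameter: the date, ETag, and checksum comparisons are
# combined with OR or AND. Field names are illustrative assumptions.

def is_modified(old, new, condition="OR"):
    """old/new are dicts with 'date', 'etag', and 'checksum' keys."""
    changes = [old[k] != new[k] for k in ("date", "etag", "checksum")]
    return any(changes) if condition == "OR" else all(changes)
```

With OR, a changed date alone is enough to mark the document as modified; with AND, all three properties must differ before the crawler re-processes the document.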
If there are errors when using the crawler that terminate the process, they are recorded in the application log.
If the portal is restarted during the crawling run, the crawlers automatically continue their activities from the point where they were interrupted.

Note that crawlers that are used by the subscription service and the content exchange do not continue their activities when the portal is restarted. They start again at the next time that is entered in the corresponding scheduler tasks.

Note that the size of the database has a significant influence on the speed of the crawling process. If you are using a large database, the crawlers work more quickly than if you use a small database.
When a Web repository is crawled, the robots.txt file is analyzed.
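Python's standard-library robots.txt parser illustrates the kind of check a Web crawler performs before fetching a URL. This shows the general robots.txt mechanism only, not the KM crawler's internal implementation.

```python
# Parse a robots.txt file and check whether given URLs may be fetched,
# using Python's standard-library urllib.robotparser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "http://example.com/private/page.html"))  # → False
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # → True
```

URLs disallowed for the crawler's user agent are excluded from the crawling run.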
To create a new set of crawler parameters, choose Content Management → Global Services → Crawler Parameters → General Purpose Crawler.