Crawlers are used in Knowledge Management (KM) to collect resources that are stored in internal or external repositories. The resources found, together with their hierarchical or net-like structures, are forwarded to various services and applications for further processing.

You can use crawler parameters to determine the behavior of the active crawlers in the system. In the crawler monitor, you can monitor the activity of crawler tasks, suspend them if necessary, and resume them at a later time.

Prerequisite: the crawler service is activated.
In KM, the following crawlers are used for various tasks:

- Content exchange crawler: Collects and groups offers for the content exchange. For more information, see Content Exchange.
- Subscription crawler: Collects and makes available subscription-specific data.
- Taxonomy crawler: Collects and makes available data in taxonomies.

There is one instance of each of these three crawlers. Since they are preconfigured, you do not need to change their parameters.

- Index Management Crawler: Used for various tasks in index management. Crawlers of this type search for object addresses in data sources; both normal crawling and delta crawling are supported. The addresses collected are passed on to TREX for indexing. The standard system contains the preconfigured instance standard. You can create and configure further instances of this crawler, for example, if you want to use the log functions or specify resource filters. Once you have created a new instance of this crawler, you can select it in Administering Indexes.
- Generic crawler: Provides generic functions and can be addressed using API calls in your own projects.
Crawler Parameters
Parameter | Required | Description |
---|---|---|
Name | Yes | Name of the set of crawler parameters |
Description | No | Description of the set |
Maximum Depth | No | Maximum number of recursion levels that the crawler takes into account. For example, a recursion level of 2 means that, starting from a given document, all documents referenced by hyperlinks in the start document and all documents referenced in turn in those documents are included in the result set. With hierarchically structured repositories, such as a file system repository, the recursion level corresponds to the hierarchy levels. The entry 0 or an empty input field stands for unrestricted depth. |
Number of Retriever Threads | Yes | Number of retriever threads. Retriever threads search the repositories and collect the resources. The default value is 1. Make sure that the specified number of retrievers is also supported by the remote server. |
Number of Provider Threads | Yes | Number of provider threads. Provider threads handle the transfer of the resources found to the service or application that receives the results. The default value is 3. Choose a value that corresponds to the processing capacity of your system. The higher the number of providers, the higher the system load. |
Repository Access Delay | No | Time in milliseconds that a crawler thread waits before accessing the next document. This delay can be used to reduce the load on the source repository or the network during the crawling run. |
Document Retrieval Timeout | No | Time interval in seconds after which the crawler stops the crawling run for a document and moves on to the next document. |
Resource Filters (Scope) | No | Resource filters that reduce the scope of the crawling run (see Resource Filters). |
Resource Filters (Result) | No | Resource filters that are applied to the results of the crawling run (see Resource Filters). |
Follow Links | No | If this parameter is activated, the crawler follows links in hierarchical repositories. |
Follow Redirects on Web-Sites | No | If this parameter is activated, the crawler follows redirects in Web repositories. In a static Web repository, redirects are only followed if they point to the same server. In a dynamic Web repository, redirects are also followed if they point to a different server. |
Respect the 'index-content' Property | No | If this parameter is activated, the crawler takes the index-content custom property of documents into account. If index-content = false, the documents are searched for links, but the documents themselves are not passed on to TREX for indexing. |
Verify Modification Using Checksum | No | If this parameter is activated, the system performs a checksum comparison to detect modifications made to a file. |
Verify Modification Using ETag | No | If this parameter is activated, the system performs an ETag comparison to detect modifications made to a file. |
Condition for Treating a Document as Modified | Yes | Specifies which conditions must be fulfilled for a document to be considered modified. OR: the document is considered modified if the date, the ETag, or the checksum has changed. AND: the document is considered modified only if the date, the ETag, and the checksum have all changed. |
Crawl Hidden Documents | No | If this parameter is activated, the crawler takes hidden documents into account. |
Crawl Document Versions | No | If this parameter is activated, the crawler takes versions of documents into account. |
Maximum Log Level | Yes | Defines the degree of information written to application logs. off: no application log is written. error: an application log is written for errors. info: like error, but all documents found are also listed. Starting and stopping the crawler and messages from the crawler service are also written to the application log. |
Path for Log Files | No | Path where the log files are stored. If you do not specify a path, the system uses the directory /etc/log/crawler. |
Maximum Size of a Single Log File | No | Maximum size of a log file in bytes. |
Maximum Number of Backed Up Log Files | No | Maximum number of log files that are saved. |
Test Mode | No | Specifies whether the crawler is operated in test mode. Activate this parameter if you want to test the crawler. The results of the crawler are not processed further and indexing does not take place. Set the log file parameters accordingly during a test. |
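The OR/AND logic behind Condition for Treating a Document as Modified can be sketched as follows. This is a generic illustration, not the SAP implementation; the snapshot fields (`date`, `etag`, `checksum`) are hypothetical names for the three indicators from the table.

```python
def is_modified(old, new, condition="OR"):
    """Compare two document snapshots (dicts with 'date', 'etag',
    'checksum' keys) under the configured modification condition."""
    checks = [
        old["date"] != new["date"],
        old["etag"] != new["etag"],
        old["checksum"] != new["checksum"],
    ]
    if condition == "OR":
        return any(checks)   # one changed indicator is enough
    return all(checks)       # AND: every indicator must have changed
```

With only the date changed, OR reports the document as modified while AND does not, which is why OR is the more sensitive (and more common) setting.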
If errors occur while using the crawler that terminate the process, they are recorded in the applications.<n>.log file (see Trace and Log Files).

If the portal is restarted during a crawling run, the crawlers automatically continue their activities from the point where they were interrupted.

Note that crawlers used by the subscription service and the content exchange do not continue their activities when the portal is restarted. They start again at the next time entered in the corresponding scheduler tasks.
Note that the size of the database has a significant influence on the speed of the crawling process. If you are using a large database, the crawlers work more quickly than if you use a small database.
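The effect of the Maximum Depth parameter described above can be illustrated with a small breadth-first traversal. This is a generic sketch, not SAP code; the repository is modeled as a plain dict mapping each document to the documents it links to.

```python
from collections import deque

def crawl(repository, start, max_depth=0):
    """Return documents reachable from start within max_depth link
    levels; max_depth=0 stands for unrestricted depth, as in the
    parameter table."""
    limit = max_depth if max_depth > 0 else float("inf")
    seen = {start}
    queue = deque([(start, 0)])
    results = []
    while queue:
        doc, depth = queue.popleft()
        results.append(doc)
        if depth >= limit:
            continue  # maximum recursion level reached for this branch
        for link in repository.get(doc, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return results
```

For a chain a → b → c → d, a depth of 2 collects a, b, and c; a depth of 0 collects the whole chain.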
Consideration of ROBOTS Entries
When crawling Web repositories (see Web Repository Manager), the robots.txt file for the Web site is analyzed.
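A robots.txt check of this kind can be sketched with Python's standard-library robot parser; this is a generic illustration of the protocol, not the SAP implementation.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse rules as they might appear in a site's robots.txt.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler checks each URL before fetching it.
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
```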
In HTML documents, the following ROBOTS meta tag entries are taken into account:

- NOFOLLOW: The crawler does not follow any links in the document. However, the crawler passes documents with this meta tag on to TREX for indexing.
- NOINDEX: The crawler does not pass documents with this meta tag on to TREX for indexing, so the search results do not contain these documents. However, the crawler follows all links in the documents in question.
- NOINDEX, NOFOLLOW: The document is not indexed and its links are not followed.
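Mapping a ROBOTS meta tag's content attribute to the two decisions above (index the document, follow its links) can be sketched as follows. This is an illustrative helper, not SAP code; it also handles the standard NONE shorthand, which is equivalent to NOINDEX, NOFOLLOW.

```python
def robots_flags(content):
    """Return (index, follow) flags for a ROBOTS meta content value,
    e.g. "NOINDEX, NOFOLLOW"."""
    tokens = {t.strip().lower() for t in content.split(",")}
    index = "noindex" not in tokens and "none" not in tokens
    follow = "nofollow" not in tokens and "none" not in tokens
    return index, follow
```

For example, `robots_flags("NOFOLLOW")` yields indexed-but-not-followed, matching the first entry in the list above.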
Excluding Individual Documents From Indexing
To exclude a specific document in a repository from indexing, create the index-content custom property for the document and enter the value false. For the Index Management Crawler, the Respect the 'index-content' Property parameter is activated by default. If index-content = false, the crawler searches the document for links but does not pass it on to TREX for indexing.
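This behavior can be modeled as follows. The sketch is illustrative only, not the SAP implementation; the Document type, its fields, and the queue names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    links: list = field(default_factory=list)
    properties: dict = field(default_factory=dict)

def process(doc, index_queue, link_queue, respect_index_content=True):
    """Collect links from the document; pass it on for indexing only
    if index-content is not set to false."""
    link_queue.extend(doc.links)  # links are always followed
    if respect_index_content and doc.properties.get("index-content") == "false":
        return  # excluded from indexing, but its links were still collected
    index_queue.append(doc)
```

Note that exclusion is one-sided: the document's links still enter the crawl, so documents reachable through it are unaffected.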
To create a new set of crawler parameters for indexing purposes, choose System Administration → System Configuration → Knowledge Management → Content Management → Global Services → Crawler Parameters → Index Management Crawler in the portal.