Crawlers are used in Knowledge Management (KM) to collect resources that are stored in internal or external repositories. The resources found, together with their hierarchical or net-like structures, are forwarded to various services and applications for further processing.

You can use crawler parameters to determine the behavior of the active crawlers in the system. In the crawler monitor, you can monitor the activity of crawler tasks, suspend them if necessary, and resume them at a later time.

Prerequisite: the crawler service is activated.
In KM, the following crawlers are used for various tasks:

- Content exchange crawler: Collects and groups offers for the content exchange. For more information, see Content Exchange.
- Subscription crawler: Collects and makes available subscription-specific data.
- Taxonomy crawler: Collects and makes available data in taxonomies.

There is one instance of each of these three crawlers. Since they are preconfigured, you do not need to change their parameters.

- Index Management Crawler: Used for various tasks in index management. Crawlers of this type search for object addresses in data sources; both normal crawling and delta crawling are supported. The addresses collected are passed on to TREX for indexing. The standard system contains the preconfigured instance standard. You can create and configure further instances of this crawler, for example, if you want to use the log functions or specify resource filters. Once you have created a new instance of this crawler, you can select it in Administering Indexes.
- Generic crawler: Provides generic functions and can be addressed using API calls in your own projects.
Crawler Parameters
Parameter | Required | Description |
---|---|---|
Name | Yes | Name of the set of crawler parameters |
Description | No | Description of the set |
Maximum Depth | No | Maximum number of recursion levels that the crawler takes into account. For example, a recursion level of 2 means that, starting from a given document, all documents referenced by hyperlinks in the start document and all documents referenced in turn in those documents are included in the result set. With hierarchically structured repositories, such as a file system repository, the recursion level corresponds to the hierarchy levels. The entry 0 or an empty input field stands for unrestricted depth. |
Number of Retriever Threads | Yes | Number of retriever threads. Retriever threads search the repositories and collect the resources. The default value is 1. Make sure that the specified number of retrievers is also supported by the remote server. |
Number of Provider Threads | Yes | Number of provider threads. Provider threads handle the transfer of the resources found to the service or application that receives the results. The default value is 3. Choose a value that corresponds to the processing capacity of your system. The higher the number of providers, the higher the system load. |
Repository Access Delay | No | Time in milliseconds that a crawler thread waits before accessing the next document. This delay can be used to reduce the load on the source repository or the network during the crawling run. |
Document Retrieval Timeout | No | Time interval in seconds after which the crawler stops the crawling run for a document and moves on to the next document. |
Resource Filters (Scope) | No | Resource filters that reduce the scope of the crawling run (see Resource Filters). |
Resource Filters (Result) | No | Resource filters that are applied to the results of the crawling run (see Resource Filters). |
Follow Links | No | If this parameter is activated, the crawler follows links in hierarchical repositories. |
Follow Redirects on Web-Sites | No | If this parameter is activated, the crawler follows redirects in Web repositories. In a static Web repository, redirects are only followed if they point to the same server. In a dynamic Web repository, redirects are also followed if they point to a different server. |
Respect the 'index-content' Property | No | If this parameter is activated, the crawler takes the index-content custom property of documents into account. If index-content = false, the documents are searched for links, but the documents themselves are not passed on to TREX for indexing. |
Verify Modification Using Checksum | No | If this parameter is activated, the system performs a checksum comparison to detect modifications made to a file. |
Verify Modification Using ETag | No | If this parameter is activated, the system performs an ETag comparison to detect modifications made to a file. |
Condition for Treating a Document as Modified | Yes | Specifies which conditions must be fulfilled for a document to be considered modified. OR: the document is considered modified if the date, the ETag, or the checksum has changed. AND: the document is considered modified only if the date, the ETag, and the checksum have all changed. |
Crawl Hidden Documents | No | If this parameter is activated, the crawler takes hidden documents into account. |
Crawl Document Versions | No | If this parameter is activated, the crawler takes versions of documents into account. |
Maximum Log Level | Yes | Defines the degree of information written to application logs. off: no application log is written. error: an application log is written for errors. info: like error, but all documents found are also listed. Starting and stopping the crawler and messages from the crawler service are also written to the application log. |
Path for Log Files | No | Path where the log files are stored. If you do not specify a path, the system uses the directory /etc/log/crawler. |
Maximum Size of a Single Log File | No | Maximum size of a log file in bytes. |
Maximum Number of Backed Up Log Files | No | Maximum number of log files that are saved. |
Test Mode | No | Specifies whether the crawler is operated in test mode. Activate this parameter if you want to test the crawler. The results of the crawler are not processed further and indexing does not take place. Set the log file parameters accordingly during a test. |
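The OR/AND logic behind Condition for Treating a Document as Modified can be sketched as follows. This is a generic illustration, not the SAP implementation; the snapshot fields (`date`, `etag`, `checksum`) are hypothetical names for the three indicators from the table.

```python
def is_modified(old, new, condition="OR"):
    """Compare two document snapshots (dicts with 'date', 'etag',
    'checksum' keys) under the configured modification condition."""
    checks = [
        old["date"] != new["date"],
        old["etag"] != new["etag"],
        old["checksum"] != new["checksum"],
    ]
    if condition == "OR":
        return any(checks)   # one changed indicator is enough
    return all(checks)       # AND: every indicator must have changed
```

With only the date changed, OR reports the document as modified while AND does not, which is why OR is the more sensitive (and more common) setting.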
If errors occur while using the crawler that terminate the process, they are recorded in the applications.<n>.log file (see Trace and Log Files).

If the portal is restarted during a crawling run, the crawlers automatically continue their activities from the point where they were interrupted.

Note that crawlers used by the subscription service and the content exchange do not continue their activities when the portal is restarted. They start again at the next time entered in the corresponding scheduler tasks.
Note that the size of the database has a significant influence on the speed of the crawling process. If you are using a large database, the crawlers work more quickly than if you use a small database.
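The effect of the Maximum Depth parameter described above can be illustrated with a small breadth-first traversal. This is a generic sketch, not SAP code; the repository is modeled as a plain dict mapping each document to the documents it links to.

```python
from collections import deque

def crawl(repository, start, max_depth=0):
    """Return documents reachable from start within max_depth link
    levels; max_depth=0 stands for unrestricted depth, as in the
    parameter table."""
    limit = max_depth if max_depth > 0 else float("inf")
    seen = {start}
    queue = deque([(start, 0)])
    results = []
    while queue:
        doc, depth = queue.popleft()
        results.append(doc)
        if depth >= limit:
            continue  # maximum recursion level reached for this branch
        for link in repository.get(doc, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return results
```

For a chain a → b → c → d, a depth of 2 collects a, b, and c; a depth of 0 collects the whole chain.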
Consideration of ROBOTS Entries
When crawling Web repositories (see Web Repository Manager), the robots.txt file for the Web site is analyzed.
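A robots.txt check of this kind can be sketched with Python's standard-library robot parser; this is a generic illustration of the protocol, not the SAP implementation.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse rules as they might appear in a site's robots.txt.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler checks each URL before fetching it.
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
```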
In HTML documents, the following ROBOTS meta tag entries are taken into account:

- NOFOLLOW: The crawler does not follow any links in the document. However, the crawler passes documents with this meta tag on to TREX for indexing.
- NOINDEX: The crawler does not pass documents with this meta tag on to TREX for indexing, so the search results do not contain these documents. However, the crawler follows all links in the documents in question.
- NOINDEX, NOFOLLOW: The document is not indexed and its links are not followed.
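Mapping a ROBOTS meta tag's content attribute to the two decisions above (index the document, follow its links) can be sketched as follows. This is an illustrative helper, not SAP code; it also handles the standard NONE shorthand, which is equivalent to NOINDEX, NOFOLLOW.

```python
def robots_flags(content):
    """Return (index, follow) flags for a ROBOTS meta content value,
    e.g. "NOINDEX, NOFOLLOW"."""
    tokens = {t.strip().lower() for t in content.split(",")}
    index = "noindex" not in tokens and "none" not in tokens
    follow = "nofollow" not in tokens and "none" not in tokens
    return index, follow
```

For example, `robots_flags("NOFOLLOW")` yields indexed-but-not-followed, matching the first entry in the list above.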
Excluding Individual Documents From Indexing
To exclude a specific document in a repository from indexing, create the index-content custom property for the document and enter the value false. For the Index Management Crawler, the Respect the 'index-content' Property parameter is activated by default. If index-content = false, the crawler searches the document for links but does not pass it on to TREX for indexing.
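This behavior can be modeled as follows. The sketch is illustrative only, not the SAP implementation; the Document type, its fields, and the queue names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    links: list = field(default_factory=list)
    properties: dict = field(default_factory=dict)

def process(doc, index_queue, link_queue, respect_index_content=True):
    """Collect links from the document; pass it on for indexing only
    if index-content is not set to false."""
    link_queue.extend(doc.links)  # links are always followed
    if respect_index_content and doc.properties.get("index-content") == "false":
        return  # excluded from indexing, but its links were still collected
    index_queue.append(doc)
```

Note that exclusion is one-sided: the document's links still enter the crawl, so documents reachable through it are unaffected.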
To create a new set of crawler parameters for indexing purposes, choose System Administration → System Configuration → Knowledge Management → Content Management → Global Services → Crawler Parameters → Index Management Crawler in the portal.