
Managing Crawls

 

Use this procedure for the following tasks:

  • Adding crawls to an index

  • Editing crawls

  • Starting, canceling, suspending, resuming, and updating crawls

  • Deleting crawls

Procedure

  1. Launch the administration cockpit.

  2. Choose the search object connectors of type SAP File Search.

    The system displays all search object connectors with their respective status.

  3. Expand the node for the first search object connector for which you want to add, edit, or delete a crawl.

    The system displays the index that was created for the search object connector. If the index contains information about authorization checks, this is visible in the node.

    If you expand the index node, you can display a list of the crawls for this index.

    Note

    As soon as a crawl for which the authorization check is active is added to a public index, this is visible in the node text as Index contains authorization data.

    End of the note.
Adding Crawls to an Index
  1. Choose the index that you want to add a crawl to and choose   Crawl   Create  .

  2. Specify the following settings for the crawl:

    Screen Element

    Value

    Seed URL

    Enter the URL of the file share that you want to connect to Enterprise Search, for example \\abc12345\docs.

    Repository Type

    Specify whether your file repository is a file server, a Web server, or a WebDAV repository.

    Crawling Algorithm

    Down the Tree: Choose this value if the crawl should only contain the files that are located in the node of the seed URL that you specified above (without referenced files).

    Same Server: Choose this value if the crawl should also contain the referenced files that are stored on this server.

    Same Domain: Choose this value if the crawl should also contain the referenced files that are stored in the same domain.

    Top-Level Domain: Choose this value if the crawl should also contain the referenced files that are stored in the same top-level domain.

    Free: Choose this value if the crawl should also contain all referenced files regardless of their location.
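
    As an illustration only, the selection logic behind these algorithms can be sketched as follows. This is a hypothetical Python sketch using HTTP URLs, not SAP's implementation; the function name and algorithm keys are made up:

    ```python
    from urllib.parse import urlparse

    def in_scope(seed: str, url: str, algorithm: str) -> bool:
        """Hypothetical sketch of the crawl scope rules (not SAP's implementation)."""
        seed_p, url_p = urlparse(seed), urlparse(url)
        if algorithm == "down_the_tree":
            # Only files below the seed path on the same server
            return url_p.netloc == seed_p.netloc and url_p.path.startswith(seed_p.path)
        if algorithm == "same_server":
            return url_p.netloc == seed_p.netloc
        if algorithm == "same_domain":
            # e.g. a.example.com and b.example.com share the domain example.com
            return url_p.netloc.split(".")[-2:] == seed_p.netloc.split(".")[-2:]
        if algorithm == "top_level_domain":
            # e.g. both hosts end in .com
            return url_p.netloc.split(".")[-1] == seed_p.netloc.split(".")[-1]
        if algorithm == "free":
            return True  # any referenced file, regardless of location
        raise ValueError(f"unknown algorithm: {algorithm}")

    print(in_scope("http://host/docs/", "http://host/docs/a.pdf", "down_the_tree"))  # True
    print(in_scope("http://host/docs/", "http://host/other/a.pdf", "same_server"))   # True
    ```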

    Directory

    Do not Index Directories: Choose this option to exclude directory names (and attributes) from the index. The index then contains only the files that are stored in a directory, but not the directory information.

    Index Directories: Choose this option to add directory information (name, attributes) to the index.

    Robot Rules

    Choose whether or not the system should comply with robot rules.

    Web server administrators can use crawler rules (robots.txt files or meta tags in HTML files) to restrict crawler access to their pages.

    Set this flag to have the crawler exclude directories or files that are subject to crawler rules.
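
    To see how such robot rules behave in practice, the following sketch uses Python's standard robots.txt parser. It only illustrates the rule semantics; it says nothing about how Enterprise Search evaluates these rules internally:

    ```python
    from urllib.robotparser import RobotFileParser

    # A sample robots.txt that forbids one directory for all crawlers
    rules = """\
    User-agent: *
    Disallow: /private/
    """.splitlines()

    parser = RobotFileParser()
    parser.parse(rules)

    print(parser.can_fetch("*", "http://example.com/docs/a.pdf"))     # True
    print(parser.can_fetch("*", "http://example.com/private/b.pdf"))  # False
    ```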

    Access Rights

    Do not Get Access Rights: The search results are not checked for authorizations.

    Get Access Rights: Choose this option if you want the system to perform an authorization check for the search results. To use this option, authorization checking must be configured for the Enterprise Search file search.

    More information: Authorization Checking for File Search

    Configuring Authorization Checking for File Search

    User, Password

    Enter the name and password of a user with read access for the MS Active Directory that also has authorization to extract authorization data.

    Domain

    Enter the domain that the user belongs to.

    Maximum Depth

    Enter the maximum number of subdirectories that the crawler should search below the seed URL.

    Minimum Size (Byte)

    Enter the minimum size of the documents that the crawler should take into account.

    Maximum Size (Byte)

    Enter the maximum size of the documents that the crawler should take into account.

    Pool Size

    Enter the number of parallel requests to the server on which the documents are located. For example, if you define the pool size as 3, the system retrieves three documents at a time.

    Note

    Make sure that you choose a value suitable for the server on which the crawl is to be performed. If the pool size is too large, server performance can suffer to an unacceptable extent; if it is too small, the crawling process is very slow. The most suitable value depends on the individual server: some servers can fail with a pool size of 4, while others have no problems with a pool size of 20.

    End of the note.

    Note

    If you have more than one seed URL (crawl), the pool sizes of all crawls are added together. The total pool size should not be too large, because simultaneous crawl requests increase the load on Enterprise Search. The suitable maximum pool size depends on your hardware (for example, 40 for a blade configuration and 10 for a single host).

    End of the note.

    Access Delay

    Define the interval in seconds that the crawler waits between two requests.

    You can use this parameter, for example, for Web servers that cannot process more than one query in parallel. You can also use it to avoid triggering denial-of-service protection: some Web server administrators block clients that send too many requests within a defined time period.
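
    How pool size and access delay interact can be pictured with a minimal sketch. The values are hypothetical and the fetch function is a simplified stand-in for the crawler, not SAP's implementation:

    ```python
    import time
    from concurrent.futures import ThreadPoolExecutor

    POOL_SIZE = 3        # parallel requests to the server (hypothetical value)
    ACCESS_DELAY = 0.05  # seconds between two requests of one worker (hypothetical)

    def fetch(doc):
        # Stand-in for retrieving one document from the file or Web server
        time.sleep(ACCESS_DELAY)  # throttle so the server is not overloaded
        return f"content of {doc}"

    docs = [f"doc{i}" for i in range(6)]

    # At most POOL_SIZE documents are in flight at any one time
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        results = list(pool.map(fetch, docs))

    print(len(results))  # 6
    ```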

    Regular Expression

    Enter a regular expression that a URL must satisfy in order to be crawled; URLs that do not satisfy it are skipped. Restricting the crawl with a regular expression is useful, for example, if you have selected the crawling algorithm Free.

    Example

    http://www.sap.com or abcd01234

    End of the example.
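
    A sketch of how such a filter could behave. Whether the crawler requires a full match or, as assumed here, a substring match is not specified above:

    ```python
    import re

    # Hypothetical restriction: crawl only URLs containing www.sap.com
    pattern = re.compile(r"http://www\.sap\.com")

    urls = [
        "http://www.sap.com/docs/a.pdf",
        "http://other.example.com/b.pdf",
    ]

    # URLs that do not satisfy the expression are not crawled
    crawled = [u for u in urls if pattern.search(u)]
    print(crawled)  # ['http://www.sap.com/docs/a.pdf']
    ```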

    Included Extensions

    Enter the extensions of the file types that you want the crawl to include, separated by commas.

    Example

    doc,pdf,xls

    End of the example.

    Note

    If you leave this field blank, all file types are crawled (except for any file types you enter under Excluded Extensions).

    End of the note.

    Excluded Extensions

    Enter the extensions of the file types that you want the crawl to exclude.

    Example

    .tmp

    End of the example.
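
    The combined effect of the two extension fields can be sketched as follows. This is an illustrative filter under the assumption that an empty include list means all file types are allowed; it is not SAP's implementation:

    ```python
    def extension_allowed(filename, included, excluded):
        """Illustrative filter: an empty 'included' set means all file types
        are crawled except those in 'excluded' (assumption, not SAP's code)."""
        ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
        if ext in excluded:
            return False
        return not included or ext in included

    included = {"doc", "pdf", "xls"}  # from the Included Extensions field
    excluded = {"tmp"}                # from the Excluded Extensions field

    print(extension_allowed("report.pdf", included, excluded))  # True
    print(extension_allowed("cache.tmp", set(), excluded))      # False
    ```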
  3. Choose Next.

  4. Optional: Enter user information for reference paths.

    If the files to be crawled refer to a path that requires a different user for access, you can define the relevant path and user information here:

    Choose Add and enter the path to the directory and the name, password, and domain of the user required to access this directory.

  5. Choose Next.

  6. Optional: Exclude specific directories from the crawl as follows:

    Choose Add and enter the path to the directory that you want to exclude, relative to the seed URL in the Directories column.

  7. Choose Finish.

Editing Crawls

Note

You can only edit a crawl as long as it has not been started.

End of the note.
  1. Expand the node of the index for which you want to edit a crawl.

    You can display the list of all crawls defined for this index.

  2. Choose the relevant crawl and then Edit on the Crawl Settings tab.

  3. Edit the crawl settings. For details, see Adding Crawls to an Index above.

    Note

    Not all settings can be changed after the crawl has been created.

    End of the note.
  4. Choose Save to apply your changes.

The system asks you whether you want to perform a complete update using the new crawl settings now or at the next scheduled update.

Flushing Index Queues

You can trigger immediate flushing of the index queue. As soon as the queue has been flushed, the crawler data that was written to the index queue is available for searching.

  1. Select the index that you want to flush the queue for.

  2. Choose   Crawl   Flush Index Queue  .

Performing Crawls

Proceed as follows to start, cancel, suspend, or resume a crawl:

  1. Choose the index for which the crawl is defined and expand the node.

  2. Choose the crawl in question and then   Crawl   and one of the following commands:

    Screen Element

    Function

    Start

    Starts the crawl for the first time. This makes sense only for crawls that have never been run before.

    Cancel

    Cancels a running crawl process. You can do this only for crawls with the status Indexing.

    Restart

    Deletes all the data the crawl has written to the index (but not the index itself) and starts crawling again from scratch, filling the index according to the currently defined crawl settings.

    This option is required if a crawl was unsuccessful and the index is corrupt.

    Suspend

    Suspends a crawl process without deleting the data the crawl has already added to the index. You can do this only for crawls with the status Indexing.

    Resume

    Restarts the crawl at the point where you suspended it. This is only relevant for crawls with the status Suspended.

    Update

    Checks only documents that have changed (based on checksum or change date) and updates the index only for the files for which an attribute has changed.

    Full Update

    Crawls all the files in the index and indexes them again, including files that have not been changed. This makes sense if you have changed the crawl settings, for example.

Deleting Crawls
  1. Choose the index for which the crawl is defined and expand the node.

  2. Choose the crawl that you want to delete and choose   Crawl   Delete  .