
Managing Crawls

 

Use this procedure for the following tasks:

  • Adding crawls to an index

  • Editing crawls

  • Starting, canceling, suspending, resuming, and updating crawls

  • Deleting crawls

Procedure

  1. Launch the administration cockpit.

  2. Choose the search object connectors of type SAP File Search.

    The system displays all search object connectors with their respective status.

  3. Expand the node for the first search object connector for which you want to add, edit, or delete a crawl.

    The system displays the index that was created for the search object connector. If the index contains information about authorization checks, this is visible in the node.

    If you expand the index node, you can display a list of the crawls for this index.

    Note

    As soon as a crawl for which the authorization check is active is added to a public index, this is visible in the node text as Index contains authorization data.

    End of the note.
Adding Crawls to an Index
  1. Choose the index that you want to add a crawl to and choose   Crawl   Create  .

  2. Specify the following settings for the crawl:

    Screen Element

    Value

    Seed URL

    Enter the URL of the file share that you want to connect to Enterprise Search, for example \\abc12345\docs.

    Repository Type

    Specify whether your file repository is a file server, a Web server, or a WebDAV repository.

    Crawling Algorithm

    Down the Tree: Choose this value if the crawl should only contain the files that are located in the node of the seed URL that you specified above (without referenced files).

    Same Server: Choose this value if the crawl should also contain the referenced files that are stored on this server.

    Same Domain: Choose this value if the crawl should also contain the referenced files that are stored in the same domain.

    Top-Level Domain: Choose this value if the crawl should also contain the referenced files that are stored in the same top-level domain.

    Free: Choose this value if the crawl should also contain all referenced files regardless of their location.
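
    As an illustration only, the selection logic behind these algorithms can be sketched as follows. This is a hypothetical Python sketch using HTTP URLs, not SAP's implementation; the function name and algorithm keys are made up:

    ```python
    from urllib.parse import urlparse

    def in_scope(seed: str, url: str, algorithm: str) -> bool:
        """Hypothetical sketch of the crawl scope rules (not SAP's implementation)."""
        seed_p, url_p = urlparse(seed), urlparse(url)
        if algorithm == "down_the_tree":
            # Only files below the seed path on the same server
            return url_p.netloc == seed_p.netloc and url_p.path.startswith(seed_p.path)
        if algorithm == "same_server":
            return url_p.netloc == seed_p.netloc
        if algorithm == "same_domain":
            # e.g. a.example.com and b.example.com share the domain example.com
            return url_p.netloc.split(".")[-2:] == seed_p.netloc.split(".")[-2:]
        if algorithm == "top_level_domain":
            # e.g. both hosts end in .com
            return url_p.netloc.split(".")[-1] == seed_p.netloc.split(".")[-1]
        if algorithm == "free":
            return True  # any referenced file, regardless of location
        raise ValueError(f"unknown algorithm: {algorithm}")

    print(in_scope("http://host/docs/", "http://host/docs/a.pdf", "down_the_tree"))  # True
    print(in_scope("http://host/docs/", "http://host/other/a.pdf", "same_server"))   # True
    ```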

    Directory

    Do not Index Directories: Choose this option to exclude directory names (and attributes) from the index. The index then contains only the files that are stored in a directory, but not the directory information.

    Index Directories: Choose this option to add directory information (name, attributes) to the index.

    Robot Rules

    Choose whether or not the system should comply with robot rules.

    Web server administrators can use crawler rules (robots.txt files or meta tags in HTML files) to restrict crawler access to their pages.

    Set this flag to have the crawler exclude directories or files that are subject to crawler rules.
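
    To see how such robot rules behave in practice, the following sketch uses Python's standard robots.txt parser. It only illustrates the rule semantics; it says nothing about how Enterprise Search evaluates these rules internally:

    ```python
    from urllib.robotparser import RobotFileParser

    # A sample robots.txt that forbids one directory for all crawlers
    rules = """\
    User-agent: *
    Disallow: /private/
    """.splitlines()

    parser = RobotFileParser()
    parser.parse(rules)

    print(parser.can_fetch("*", "http://example.com/docs/a.pdf"))     # True
    print(parser.can_fetch("*", "http://example.com/private/b.pdf"))  # False
    ```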

    Access Rights

    Do not Get Access Rights: The search results are not checked for authorizations.

    Get Access Rights: Choose this option if you want the system to perform an authorization check for the search results. To use this option, authorization checking must be configured for the Enterprise Search file search.

    More information: Authorization Checking for File Search

    Configuring Authorization Checking for File Search

    User, Password

    Enter the name and password of a user with read access for the MS Active Directory that also has authorization to extract authorization data.

    Domain

    Enter the domain that the user belongs to.

    Maximum Depth

    Enter the maximum number of subdirectories that the crawler should search below the seed URL.

    Minimum Size (Byte)

    Enter the minimum size of the documents that the crawler should take into account.

    Maximum Size (Byte)

    Enter the maximum size of the documents that the crawler should take into account.

    Pool Size

    Enter the number of parallel requests to the server on which the documents are located. For example, if you define the pool size as 3, the system retrieves three documents at a time.

    Note

    Make sure that you choose a value suitable for the server on which the crawl is to be performed. If the pool size is too large, server performance can suffer to an unacceptable extent; if it is too small, the crawling process is very slow. The most suitable value depends on the individual server: some servers can fail with a pool size of 4, while others have no problems with a pool size of 20.

    End of the note.

    Note

    If you have more than one seed URL (crawl), the pool sizes of all crawls are added together. The total pool size should not be too large, because simultaneous crawl requests increase the load on Enterprise Search. The suitable maximum pool size depends on your hardware (for example, 40 for a blade configuration and 10 for a single host).

    End of the note.

    Access Delay

    Define the interval in seconds that the crawler waits between two requests.

    You can use this parameter, for example, for Web servers that cannot process more than one query in parallel. You can also use it to avoid triggering denial-of-service protection: some Web server administrators block clients that send too many requests within a defined time period.
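
    How pool size and access delay interact can be pictured with a minimal sketch. The values are hypothetical and the fetch function is a simplified stand-in for the crawler, not SAP's implementation:

    ```python
    import time
    from concurrent.futures import ThreadPoolExecutor

    POOL_SIZE = 3        # parallel requests to the server (hypothetical value)
    ACCESS_DELAY = 0.05  # seconds between two requests of one worker (hypothetical)

    def fetch(doc):
        # Stand-in for retrieving one document from the file or Web server
        time.sleep(ACCESS_DELAY)  # throttle so the server is not overloaded
        return f"content of {doc}"

    docs = [f"doc{i}" for i in range(6)]

    # At most POOL_SIZE documents are in flight at any one time
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        results = list(pool.map(fetch, docs))

    print(len(results))  # 6
    ```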

    Regular Expression

    Enter a regular expression that a URL must satisfy in order to be crawled; URLs that do not satisfy it are skipped. Restricting the crawl with a regular expression is useful, for example, if you have selected the crawling algorithm Free.

    Example

    http://www.sap.com or abcd01234

    End of the example.
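
    A sketch of how such a filter could behave. Whether the crawler requires a full match or, as assumed here, a substring match is not specified above:

    ```python
    import re

    # Hypothetical restriction: crawl only URLs containing www.sap.com
    pattern = re.compile(r"http://www\.sap\.com")

    urls = [
        "http://www.sap.com/docs/a.pdf",
        "http://other.example.com/b.pdf",
    ]

    # URLs that do not satisfy the expression are not crawled
    crawled = [u for u in urls if pattern.search(u)]
    print(crawled)  # ['http://www.sap.com/docs/a.pdf']
    ```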

    Included Extensions

    Enter the extensions of the file types that you want the crawl to include, separated by commas.

    Example

    doc,pdf,xls

    End of the example.

    Note

    If you leave this field blank, all file types are crawled (except for any file types you enter under Excluded Extensions).

    End of the note.

    Excluded Extensions

    Enter the extensions of the file types that you want the crawl to exclude.

    Example

    .tmp

    End of the example.
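
    The combined effect of the two extension fields can be sketched as follows. This is an illustrative filter under the assumption that an empty include list means all file types are allowed; it is not SAP's implementation:

    ```python
    def extension_allowed(filename, included, excluded):
        """Illustrative filter: an empty 'included' set means all file types
        are crawled except those in 'excluded' (assumption, not SAP's code)."""
        ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
        if ext in excluded:
            return False
        return not included or ext in included

    included = {"doc", "pdf", "xls"}  # from the Included Extensions field
    excluded = {"tmp"}                # from the Excluded Extensions field

    print(extension_allowed("report.pdf", included, excluded))  # True
    print(extension_allowed("cache.tmp", set(), excluded))      # False
    ```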
  3. Choose Next.

  4. Optional: Enter user information for reference paths.

    If the files to be crawled refer to a path that requires a different user for access, you can define the relevant path and user information here:

    Choose Add and enter the path to the directory and the name, password, and domain of the user required to access this directory.

  5. Choose Next.

  6. Optional: Exclude specific directories from the crawl as follows:

    Choose Add and enter the path to the directory that you want to exclude, relative to the seed URL in the Directories column.

  7. Choose Finish.

Editing Crawls

Note

You can only edit a crawl as long as it has not been started.

End of the note.
  1. Expand the node of the index for which you want to edit a crawl.

    You can display the list of all crawls defined for this index.

  2. Choose the relevant crawl and then Edit on the Crawl Settings tab.

  3. Edit the crawl settings. For details, see Adding Crawls to an Index above.

    Note

    Not all settings can be changed after the crawl has been created.

    End of the note.
  4. Choose Save to apply your changes.

The system asks you whether you want to perform a complete update using the new crawl settings now or at the next scheduled update.

Flushing Index Queues

You can trigger immediate flushing of the index queue. As soon as the queue has been flushed, the crawler data that was written to the index queue is available for searching.

  1. Select the index that you want to flush the queue for.

  2. Choose   Crawl   Flush Index Queue  .

Performing Crawls

Proceed as follows to start, cancel, suspend, or resume a crawl:

  1. Choose the index for which the crawl is defined and expand the node.

  2. Choose the crawl in question and then   Crawl   and one of the following commands:

    Screen Element

    Function

    Start

    Starts the crawl for the first time. This makes sense only for crawls that have never been run before.

    Cancel

    Cancels a running crawl process. You can do this only for crawls with the status Indexing.

    Restart

    Deletes all the data the crawl has written to the index (but not the index itself) and starts crawling again from scratch, filling the index according to the currently defined crawl settings.

    This option is required if a crawl was unsuccessful and the index is corrupt.

    Suspend

    Suspends a crawl process without deleting the data the crawl has already added to the index. You can do this only for crawls with the status Indexing.

    Resume

    Restarts the crawl at the point where you suspended it. This is only relevant for crawls with the status Suspended.

    Update

    Checks only documents that have changed (based on checksum or change date) and updates the index only for the files for which an attribute has changed.

    Full Update

    Crawls all the files in the index and indexes them again, including files that have not been changed. This makes sense if you have changed the crawl settings, for example.

Deleting Crawls
  1. Choose the index for which the crawl is defined and expand the node.

  2. Choose the crawl that you want to delete and choose   Crawl   Delete  .