
Crawling and Indexing a Web Repository

Purpose

To make documents that are stored on external Web servers searchable in the portal, you have to configure the corresponding Web repositories and then crawl and index them.

Prerequisites
  • The service user of the index management service must be properly configured as a system principal. Note that the index management service user is passed on to the crawler as a parameter.
  • If you want to index a Web site that requires user authentication, user mapping must be defined in portal user management for the external system.
Process Flow

Depending on your requirements, you can use the simple Web repository or, if you need special functions, a standard Web repository.

Case A: Simple Configuration

Carry out the following steps to make the content of remote Web servers searchable with a minimum of configuration.

  1. Create a Web address for each remote server.
  2. Create a new index (see Creating an Index). Assign the simple Web repository to the index and use the standard crawler.

All Web addresses that you then create can be reached using the simple Web repository and can be indexed periodically.
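Conceptually, the simple setup amounts to periodically fetching each configured Web address and feeding the retrieved documents into the index. The following Python sketch illustrates that crawl-and-index cycle with a stubbed fetch function; all names and URLs here are illustrative and are not part of the portal or TREX API.

```python
# Illustrative sketch of the simple crawl-and-index cycle.
# fetch() stands in for the HTTP retrieval done by the standard crawler.

def fetch(url):
    # Stub: a real crawler would issue an HTTP GET here.
    pages = {
        "http://docs.example.com/intro": "getting started with the portal",
        "http://docs.example.com/faq": "frequently asked questions",
    }
    return pages.get(url, "")

def crawl_and_index(web_addresses):
    """Build a simple inverted index: term -> set of URLs containing it."""
    index = {}
    for url in web_addresses:
        for term in fetch(url).split():
            index.setdefault(term, set()).add(url)
    return index

index = crawl_and_index([
    "http://docs.example.com/intro",
    "http://docs.example.com/faq",
])
print(sorted(index["portal"]))  # URLs whose content contains "portal"
```

In the portal, this cycle runs on the schedule you define for the index; the sketch only shows the principle of turning crawled pages into searchable terms.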

Note

If required, specify a proxy server in the configuration of the simple Web repository manager. For more information, see Simple Web Repository Manager.
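For illustration only: the effect of configuring a proxy is that the crawler routes its HTTP requests through that host. In Python's standard library the equivalent looks like the sketch below; the proxy host is a placeholder, and in the portal this is set in the repository manager configuration, not in code.

```python
import urllib.request

# Hypothetical proxy host; replace with your organization's proxy.
proxy = urllib.request.ProxyHandler(
    {"http": "http://proxy.example.com:8080"}
)
opener = urllib.request.build_opener(proxy)
# opener.open(url) would now route HTTP requests through the proxy.
```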

If you want to implement filters for crawling, you can additionally carry out steps 5 - 7 of the extensive configuration (case B) in case A.

Case B: Extensive Configuration

If you want to use functions such as form-based registration or filters when crawling the content of remote Web servers, you have to create a standard Web repository.

Carry out the following steps to make the content of a standard Web repository searchable.

  1. Register the remote Web servers in HTTP systems (see HTTP System).
  2. Create Web sites. Assign an HTTP system to every Web site.
  3. Configure a cache for the Web repository manager (see Caches).
  4. Configure a Web repository manager. In the configuration, select the cache and the Web sites that you just created.
    Note

    The repository service properties must be activated in the configuration of the Web repositories so that their content can be classified.

  5. Optional: You can create resource filters to restrict the scope or the results of the crawling procedure.
  6. Optional: You can configure a new set of crawler parameters (see Crawlers and Crawler Parameters). Specify the resource filters in the parameter set.
  7. Create an index (see Creating an Index) or edit an existing index and assign the Web repository to it. Select the standard crawler or the crawler parameter set that you created; this crawler is used to crawl the repository. For more information, see Assigning Data Sources. Optionally, you can also define a schedule.
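The resource filters from step 5 restrict which URLs the crawler follows or indexes. Conceptually they behave like include/exclude patterns applied to each candidate URL, as in the following sketch; the patterns, function name, and matching rules are assumptions for illustration and do not reflect the portal's actual filter syntax.

```python
import fnmatch

def passes_filters(url, include_patterns, exclude_patterns):
    """Return True if the URL survives include/exclude filtering.

    A URL must match at least one include pattern and no exclude pattern.
    """
    if not any(fnmatch.fnmatch(url, p) for p in include_patterns):
        return False
    return not any(fnmatch.fnmatch(url, p) for p in exclude_patterns)

include = ["http://docs.example.com/*"]          # stay within this site
exclude = ["*.pdf", "*/archive/*"]               # skip PDFs and archived pages

print(passes_filters("http://docs.example.com/guide.html", include, exclude))
# True: inside the site, not excluded
print(passes_filters("http://docs.example.com/archive/old.html", include, exclude))
# False: matches the archive exclude pattern
```

Restricting the crawl scope this way keeps the crawler from leaving the intended site and keeps irrelevant document types out of the index.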

The following applies in both cases.

You can use the crawler monitor to monitor the crawl process. When the crawler has transmitted the results to TREX, you can monitor the indexing process using the TREX monitor. When TREX has indexed the content of the Web repository, you can search for the documents stored there.