Assigning Data Sources

Use

You assign one or more data sources to an index in order to index the contents of internal or external repositories.

You can assign one or more data sources of the following types to an index:

Hierarchical repository

If you want to index a part of a hierarchical repository, you can navigate within the repository and select the folder that you want to index. The system indexes the contents of this folder including the contents of all subfolders.

Note that a folder cannot be assigned to more than one index.

Web repository

You cannot navigate in Web repositories. You can only assign a complete repository to an index. However, you can define a start page for the crawler.

Note
The crawler only searches Web sites or parts of Web sites that are not protected by robot instructions. Robot instructions are part of Internet standards. They allow Web site owners to permit or forbid the crawling of their sites or parts thereof.
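
The effect of robot instructions can be illustrated with a short sketch. It uses the Python standard library module urllib.robotparser, not the Knowledge Management crawler itself. The rules, the host www.example.com, and the user agent name are assumptions chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robot instructions, as a site owner might publish them in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(url, user_agent="example-crawler"):
    """Return True if the illustrative rules above permit crawling the URL."""
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file without any network access.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_crawl("http://www.example.com/main.html"))
print(may_crawl("http://www.example.com/private/report.html"))
```

Under these assumed rules, the first page may be crawled, whereas the page below /private/ is excluded by the site owner.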

Crawler Parameters

Depending on the type of repository, you may have to set up a crawler and a schedule.

For Web repositories, the index is updated using a crawler. When you assign a Web repository to an index for the first time, it is indexed immediately. After that, you need to schedule the crawler to run regularly so that the index stays up to date.
For hierarchical repositories, the index is updated by events, so the crawler does not have to be started at regular intervals. However, you can start the crawler at regular intervals so that the index picks up changes for which no event is triggered. This can be the case if documents have been created, changed, or deleted directly in the file system without using Knowledge Management.

Start Page (Only Web Repositories)

The start page is the page on which the crawler begins the crawling process. You can specify the name of an HTML page or a complete path. The character string that you enter in the Start Page field is added to the URL that is defined in the Web repository configuration.

Example

The following URL is defined in the configuration of a Web repository:

http://www.<my-website>.com/

The initial access page for this site, which contains links for navigating the entire site, is the file main.html. The complete URL for this page is http://www.<my-website>.com/main.html.

In the Start Page field, enter only the part of the URL that is not defined in the configuration of the Web repository:

main.html
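
The following is a minimal sketch of how the start page combines with the configured URL. It assumes plain concatenation with exactly one slash between the two parts; the actual behavior of the crawler may differ. The function name and the host www.example.com (a stand-in for the placeholder www.<my-website>.com) are hypothetical.

```python
def build_start_url(repository_url, start_page):
    """Append the Start Page entry to the URL from the Web repository configuration.

    Assumption: the two parts are joined with exactly one slash.
    An empty start page leaves the configured URL unchanged.
    """
    if not start_page:
        return repository_url
    return repository_url.rstrip("/") + "/" + start_page.lstrip("/")


# Hypothetical values mirroring the example above.
print(build_start_url("http://www.example.com/", "main.html"))
print(build_start_url("http://www.example.com/", "docs/index.html"))
```

The second call shows that a complete path, not only a page name, can be appended in the same way.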

Note

A Web repository can only be assigned to a single index. However, it is possible to assign a Web repository to the same index more than once, whereby different start pages are specified on each occasion.

If you change the start page at a later date, the system deletes the assignment of the index folder to the index and creates a new assignment.

Scheduler

The scheduler defines one or more time intervals at which the crawler is started.

Note

For example, to start the crawler weekly at 3 pm on Mondays and weekly at 3 pm on Thursdays, you create two time intervals.

Note that crawling takes place at local server time. The time zone of the server and the current local server time are displayed in the iView.
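
The way several weekly time intervals determine the next crawler start can be sketched as follows. This is an illustration only, not the scheduler's implementation. The data layout and the function name are assumptions, and naive datetime values stand in for local server time.

```python
from datetime import datetime, time, timedelta

# Hypothetical representation of two weekly time intervals:
# (weekday, start time), with weekday as in datetime.weekday() (Monday = 0).
WEEKLY_INTERVALS = [
    (0, time(15, 0)),  # Mondays at 3 pm
    (3, time(15, 0)),  # Thursdays at 3 pm
]


def next_crawl_start(now, intervals):
    """Return the earliest interval start that lies strictly after `now`."""
    candidates = []
    for weekday, start in intervals:
        days_ahead = (weekday - now.weekday()) % 7
        candidate = datetime.combine(now.date() + timedelta(days=days_ahead), start)
        if candidate <= now:
            # This week's start has already passed; use next week's.
            candidate += timedelta(days=7)
        candidates.append(candidate)
    return min(candidates)


# 2 January 2024 is a Tuesday, so the next start falls on Thursday, 4 January.
print(next_crawl_start(datetime(2024, 1, 2, 10, 0), WEEKLY_INTERVALS))
```

Because the calculation uses local server time, an administrator in another time zone has to convert the displayed start times accordingly.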

Prerequisites

The documents to be indexed are located in a repository that has been set up in the configuration of Content Management. If you want to index an external Web site, you have to configure a Web repository first.

For classification into taxonomies, the application property service (properties) is activated in the repository that contains the objects to be classified.

You have configured crawler parameters in the configuration of Content Management (Global Services). For more information, see Crawlers and Crawler Parameters.

Procedure

You are in the Index Administration iView (by default in the KM Admin workset).

Select one or more data sources.

By default, the system assigns the crawler parameters of the index (Properties dialog box) to the data sources.
Select another crawler if necessary.
Enter a start page if necessary (only for Web repositories).
Save the data source.

The system immediately begins the initial indexing run.

Afterwards, the crawler is only started again if you have defined a schedule.
Define a schedule.

The Define Schedule pushbutton is only displayed after you have saved the data source.

After you have defined a schedule, you reach the same dialog box again by choosing Modify Schedule.

The settings for your schedule take effect immediately. You do not need to save the data sources again.