SAP NetWeaver 7.50 SP 13 KMC

Package com.sapportals.wcm.service.xcrawler

Provides a service that crawls repositories to obtain references to resources.


Package Specification

Purpose
Detailed Concept
Interfaces
Configuration
Code Samples

Purpose

A crawler is a process that traverses Content Management (CM) repositories and provides the resources they contain to result receivers. The behaviour of a crawler is controlled via crawler parameters (replacing the former crawler profiles). Crawlers store their status and the information they obtain in the database for two reasons:

  1. survival of portal downtimes - interrupted crawlers are restarted automatically at the next portal startup and continue at the point where they were interrupted
  2. incremental updates - only the differences between the current content of the crawled repositories and their content at the time of the previous crawl are reported to the result receiver

The CM xcrawler service provides means to start, stop, suspend, resume, and delete crawlers. It is a replacement for the former crawler service and comes with several enhancements.

Several CM services use the CM xcrawler service, among them the indexing, subscription, and content exchange (ICE) services (see the crawler parameter categories in the Configuration section below).

Detailed Concept

Abstract

This document describes the new version of the CM xcrawler service. The main reason for changing the crawler service implementation was the need for crawlers that can be resumed after a restart of the underlying J2EE engine. Crawling a large amount of data may take longer than the maintenance interval of the system the crawler is running on, and during maintenance it is not unusual that the system must be restarted due to configuration changes. As a consequence, long-running crawls could never be completed: ongoing crawls could not be resumed at the point where they were interrupted and had to be restarted from the beginning.

Other reasons were the simplification of the crawler interfaces, performance, and stability.

The CM xcrawler service uses the CM job processor service to run crawlers.

Terminology

XCrawler service - CM service that offers the functionality to start, stop, suspend, resume, and delete crawlers
Crawler - Process that traverses Content Management (CM) repositories and provides the resources they contain to result receivers
Data source - Set of resources
Result receiver - Java class that operates on resources provided by a crawler; a result receiver is specified by the process which runs a crawler
Crawler parameters - Set of attributes which controls the behaviour of a crawler; crawler parameters can be persisted in the configuration framework
Retriever thread - Thread that retrieves resources from repositories
Provider thread - Thread that provides documents to result receivers
Postprocessor thread - Thread that compares the result of the previous crawl to the current one to identify unvisited unchanged, unvisited changed, or unvisited deleted documents

New Features

  • content length (min, max)
  • mime type (wildcards supported)
  • URL (wildcards supported)
  • name (wildcards supported)
  • time since the last modification (days)

Removed Features

Architecture

Components  

A crawler is capable of crawling a set of data sources. Each data source is crawled using dedicated crawler parameters. A crawler starts threads which do the work, so that retrieving and providing of resources are decoupled: blocking operations on one side do not affect the ongoing process on the other side. Multiple threads can be used for retrieving or providing to speed up the crawl.

A retriever thread collects resources and passes what it finds to a provider thread. Starting with the start resources of the data source, it searches for descendants. For each data source, multiple retriever threads can be used.

A provider thread passes the resources which have been found by some retriever thread to the result receiver. For each data source multiple provider threads can be used.

A postprocessor thread is executed when the retrievers and providers are finished. It compares the result of the previous crawl to the current one to identify differences. It is used for delta crawls (incremental updates). For each data source, one postprocessor thread is used.

Resources which are collected and processed by the threads are stored in memory-cached sets in the database. Each set holds resources in a certain state; for example, the todo-set contains resources which still have to be processed by retriever threads. The following sets are used: todo, retrieving, found, providing, finished, postprocessing, postprocessed, old, error. For each data source, one such assemblage of sets is used.

Sets

The todo-set is prefilled at crawl start with the start resources of a data source. It contains the resources which still have to be processed by the retriever threads. The retriever threads fetch resources from this set and move them to the retrieving-set before collecting the direct descendants of these resources. The collected descendants are put into the todo-set so that they are processed by a retriever thread, too.

Direct descendants of a resource are the children of the resource (in case the resource is a collection) and embedded links (in case the resource is from a web repository).

The retrieving-set holds the resources which are currently being processed by retriever threads. In case the system crashes and the crawler is resumed at the next restart, they are moved back to the todo-set and are processed again, so no information is lost through a restart.

The resources (not collections) which are collected by the retriever threads are also added to the found-set. The found-set is the interface between the retriever threads and the provider threads. The provider threads fetch resources from the found-set and move them to the providing-set before passing them to the result receiver.

All resources which are added to the found-set are removed from the old-set. The old-set contains the resources that have been found during the previous crawl.

The providing-set holds the resources which are currently being processed by provider threads. In case the system crashes and the crawler is resumed at the next restart, they are moved back to the found-set and are processed again, so no information is lost through a restart.

Resources which have been provided to the result receiver are moved to the finished-set.

After the retriever threads and provider threads are done the postprocessor thread is executed. It fetches resources from the old-set and moves them to the postprocessing-set. Since the resources which have been found during the current crawl were removed from the old-set, this set only contains the resources which were found during the previous crawl but not during the current one.

The postprocessor thread checks whether these resources still exist and, if so, whether they have changed.

The postprocessing-set holds the resources which are currently being processed by the postprocessor thread. In case the system crashes and the crawler is resumed at the next restart, they are moved back to the old-set and are processed again, so no information is lost through a restart.

The resources processed by the postprocessor thread are moved to the postprocessed-set.

After the postprocessor thread has emptied the old-set, all resources from the postprocessed-set are moved back to the old-set. All resources from the finished-set are moved to the old-set, too. Since resources were removed from the old-set when they were added to the found-set, the finished-set and the postprocessed-set are completely disjoint.

The old-set now contains a consolidated result of all previous crawls and can be used in future delta crawls to track all changes of the descendants of the data source.

Resources which could not be retrieved from the repositories during the crawl are moved to the error-set.
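
The following sketch illustrates the set lifecycle described above. It is an illustration only: it uses plain Java collections and hypothetical names (SetLifecycleSketch, move, retrieve, provide) rather than the database-backed, memory-cached sets of the actual xcrawler service.

import java.util.*;

class SetLifecycleSketch {

  enum SetName { TODO, RETRIEVING, FOUND, PROVIDING, FINISHED,
                 POSTPROCESSING, POSTPROCESSED, OLD, ERROR }

  private final Map<SetName, Set<String>> sets = new EnumMap<SetName, Set<String>>(SetName.class);

  SetLifecycleSketch() {
    for (SetName name : SetName.values()) {
      sets.put(name, new HashSet<String>());
    }
  }

  // move a resource identifier (RID) from one set to another
  void move(String rid, SetName from, SetName to) {
    sets.get(from).remove(rid);
    sets.get(to).add(rid);
  }

  // retriever side: take a resource from the todo-set, collect its direct
  // descendants, queue them for crawling, and mark them as found
  void retrieve(String rid, List<String> descendants) {
    move(rid, SetName.TODO, SetName.RETRIEVING);
    for (String d : descendants) {
      sets.get(SetName.TODO).add(d);   // descendants are crawled, too
      sets.get(SetName.FOUND).add(d);  // only non-collections in the real service
      sets.get(SetName.OLD).remove(d); // seen in this crawl, so no longer "old"
    }
    // further transitions of the processed resource itself are omitted here
  }

  // provider side: pass a found resource to the result receiver
  void provide(String rid) {
    move(rid, SetName.FOUND, SetName.PROVIDING);
    // ... the result receiver would be called here ...
    move(rid, SetName.PROVIDING, SetName.FINISHED);
  }
}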

Threads

A retriever thread operates on the todo-set, the retrieving-set, the found-set, and the error-set. It fetches resources from the todo-set, moves them to the retrieving-set, collects their descendants, adds the descendants to the todo-set, and adds the descendants (which are not collections) to the found-set. Resources which could not be retrieved from the repository are moved to the error-set. The retriever thread informs the result receiver about deleted resources (for delta crawls). 

The provider thread operates on the found-set, the providing-set, the finished-set, and the error-set. It fetches resources from the found-set, moves them to the providing-set, passes them to the result receiver, and finally moves them to the finished-set. The provider thread informs the result receiver about found resources (for normal crawls) and about new or changed resources (for delta crawls).

The postprocessor thread operates on the old-set, the postprocessing-set, the postprocessed-set, and the finished-set. It fetches resources from the old-set, moves them to the postprocessing-set, and checks whether they still exist and, if so, whether they have changed. Then it moves the resources to the postprocessed-set. After the old-set is empty, all resources from the postprocessed-set are moved back to the old-set. The resources of the finished-set are moved to the old-set as well. The postprocessor thread informs the result receiver about unchanged, changed, and deleted unvisited resources (for delta crawls).
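
As an illustration of the delta logic, the following hypothetical sketch shows what the postprocessor does with the resources that remain in the old-set: such resources were found in the previous crawl but not visited in the current one, so they are classified as unchanged, changed, or deleted. All names (PostprocessorSketch, DeltaCallback, existsInRepository, hasChangedSinceLastCrawl) are assumptions made for this sketch and are not part of the xcrawler API.

import java.util.*;

class PostprocessorSketch {

  // hypothetical callback, loosely modeled after the delta result receiver concept
  interface DeltaCallback {
    void unvisitedUnchanged(String rid);
    void unvisitedChanged(String rid);
    void unvisitedDeleted(String rid);
  }

  // hypothetical repository lookups
  boolean existsInRepository(String rid) { return true; }
  boolean hasChangedSinceLastCrawl(String rid) { return false; }

  void postprocess(Set<String> oldSet, DeltaCallback callback) {
    for (Iterator<String> it = oldSet.iterator(); it.hasNext(); ) {
      String rid = it.next();
      it.remove(); // corresponds to the move from the old-set to the postprocessing-set
      if (!existsInRepository(rid)) {
        callback.unvisitedDeleted(rid);        // resource no longer exists
      } else if (hasChangedSinceLastCrawl(rid)) {
        callback.unvisitedChanged(rid);        // resource exists but was modified
      } else {
        callback.unvisitedUnchanged(rid);      // resource exists and is unchanged
      }
      // in the real service the resource is then moved to the postprocessed-set
      // and, once the old-set is empty, back to the old-set
    }
  }
}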

Interfaces

IXCrawlerService

Interface of the global CM xcrawler service. It provides methods for creation of crawler parameters, creation of crawlers and retrieval of information about running or finished crawlers.

(Implemented by the crawler service).
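
For orientation, this is the lookup used to obtain the service; it is the same call as in the code samples at the end of this page:

IXCrawlerService crawlerService = (IXCrawlerService) ResourceFactory.getInstance()
    .getServiceFactory().getService(IServiceTypesConst.XCRAWLER_SERVICE);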

IXCrawlerResultReceiverFactory

A factory which is used by a crawler to create its result receiver. This factory must be provided by the application that uses the crawler. The reason for using a factory here is that crawlers may resume after a restart of the system; in this case they have lost the result receiver instance they used before the restart. The application which uses the crawler must provide a way to instantiate a new result receiver instance by implementing this factory. The class name of the factory is persisted by the crawler in the database. After a restart of the system, the crawler creates an instance of the factory using the CRT class loader.

(Implemented by the applications that use the crawler)
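
The following is a minimal sketch of such a factory for a crawler that is meant to survive restarts: instead of relying on state held in memory, it creates a fresh receiver instance every time it is asked. MyResultReceiver is a hypothetical application-specific implementation of IXCrawlerResultReceiver (comparable to the ResultReceiverImpl in the code samples below), and the service user "index_service" is taken from those samples; both are assumptions of this sketch.

public class RestartSafeReceiverFactory implements IXCrawlerResultReceiverFactory {

  public IXCrawlerResultReceiver createResultReceiver(String taskID,
      String userDataForFactory, boolean delta) throws Exception {
    // the crawler persists this factory's class name and the user data string,
    // so both are available again after a restart of the engine
    IResourceContext context = ResourceFactory.getInstance().getServiceContext("index_service");
    return new MyResultReceiver(context); // hypothetical application-specific receiver
  }
}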

IXCrawlerResultReceiver

General result receiver interface. A result receiver must implement this interface to provide a resource context for crawling. The crawler calls methods of the result receiver to announce the start of the crawl, the stop of the crawl, and failures. For every collection processed during a crawl, the implementation of this interface can decide whether it is crawled or not.

(Implemented by the applications that use the crawler)

IXCrawlerPushedResultReceiver, IXCrawlerPushedDeltaResultReceiver, IXCrawlerListResultReceiver, IXCrawlerListDeltaResultReceiver

There are two kinds of result receivers. Pushed result receivers get the results of a crawl while the crawl is running. List result receivers get the results all at once in a list when the crawl is finished (the list is persisted in the database, so no results are lost if the system is restarted during a crawl).

For each kind of result receiver there are two interfaces: one for normal crawls and one for delta crawls (incremental updates).

(Implemented by the applications that use the crawler)

IXCrawlerParameters

Parameters for a crawler. Crawler parameters can be created either based on configuration or by specifying the individual values.

(Implemented by the crawler service).

IXCrawlerTaskSummary

Summary of information concerning a running or finished crawler.

(Implemented by the crawler service).

IXCrawlerJob

Container for an IJob which contains a crawler task.

(Implemented by the crawler service).

IXCrawlerException

Exception thrown by the XCrawler service.

(Implemented by the crawler service).

Configuration

The CM xcrawler service is configured in the Content Management configuration in the Global Services section:

Parameter Name Mandatory Default Value Description
Active mandatory true Activate/deactivate CM xcrawler service
Default Crawler Parameters mandatory standard Default crawler parameters
Connection Pool mandatory dbcon_rep JDBC connection pool for database access

The various crawler parameters are configured in the Crawler Parameters section in the Global Services section. There are 4 categories:

  1. General Purpose Crawler - for crawlers used by the CM indexing service and crawlers started via the CM xcrawler service Java API
  2. Content Exchange Crawler - for crawlers used by the CM content exchange service (ICE) (parameter subset only)
  3. Subscription Crawler - for crawlers used by the CM subscription service (parameter subset only)
  4. Taxonomy Crawler - for crawlers used by the CM indexing service for taxonomy crawling (parameter subset only)

Crawler Parameters are:

Parameter Name Mandatory Default Value Description
Description optional - Description
Maximum Depth optional - Maximum depth (0 or empty=unlimited)
Number of Retriever Threads mandatory 1 Number of retriever threads
Number of Provider Threads mandatory 1 Number of provider threads
Repository Access Delay optional - Time the crawler waits after each repository access to limit backend server load (0 or empty = no delay)
Document Retrieval Timeout optional - Timeout for retrieval of a single document (0 or empty=unlimited)
Resource Filters (Scope) optional - Resource filters narrowing the scope of the crawl
Resource Filters (Result) optional - Resource Filters which are applied to the result of the crawl but do not narrow the scope
Follow Links (Not Applicable for Web-Sites) mandatory false Set to true if resources referred to by links should be crawled, too
Verify Modification Using Checksum mandatory false Set to true if the crawler should detect changes of resources by calculating a checksum
Verify Modification Using ETag mandatory false Set to true if the crawler should detect changes of resources using the ETag provided by web-servers
Condition for Treating a Document as Modified  mandatory Date OR Checksum OR ETag differ Condition for treating a document as modified 
Crawl Hidden Documents mandatory false Set to true if hidden documents should be included in the crawl
Crawl Document Versions mandatory false Set to true if document versions should be included in the crawl
Maximum Log Level mandatory Off Maximum log level
Path for Log Files optional - Path where log files are created
Maximum Size of a Single Log File optional 1000000 Maximum size of a single log file in bytes
Maximum Number of Backed Up Log Files optional 2 Maximum number of backed up log files
Test Mode mandatory false Set to true if no documents should be passed to the result receiver

Code Samples

Crawl the documents repository

private static final String SERVICE_USER = "index_service";

void runCrawler() throws Exception {

  // get the xcrawler service
  IXCrawlerService crawlerService = (IXCrawlerService) ResourceFactory.getInstance()
      .getServiceFactory().getService(IServiceTypesConst.XCRAWLER_SERVICE);

  // get the landscape service
  ILandscapeService landscapeService = (ILandscapeService) ResourceFactory.getInstance()
      .getServiceFactory().getService(IServiceTypesConst.LANDSCAPE_SERVICE);

  // set up the start resources
  IRidList[] startResourceArray = new RidList[1];
  startResourceArray[0] = new RidList();
  startResourceArray[0].add(RID.getRid("/documents"));

  // create the crawler parameters
  IXCrawlerParameters parameters = crawlerService.createCrawlerParameters(
    0,                                            // maxDepth
    3,                                            // retrieverCount
    3,                                            // providerCount
    false,                                        // useETag
    false,                                        // useChecksum
    false,                                        // followLinks
    false,                                        // crawlVersions
    false,                                        // crawlHidden
    false,                                        // crawlSystem
    0,                                            // requestDelay
    IXCrawlerParameters.ModificationCheckMode.OR, // modificationCheckMode
    false,                                        // findAllDocsInDepth
    true,                                         // respectRobots
    false,                                        // test
    new IResourceFilter[0],                       // scopeFilters
    new IResourceFilter[0],                       // resultFilters
    0,                                            // maxLogFileSizeInBytes
    2,                                            // maxBacklogFiles
    null,                                         // logFilePath
    IXCrawlerParameters.LogLevel.OFF,             // maxLogLevel
    0                                             // documentTimeoutInSeconds
  );
  IXCrawlerParameters[] parametersArray = new IXCrawlerParameters[1];
  parametersArray[0] = parameters;

  // create the result receiver and register it with the factory
  ResultReceiverImpl receiver = new ResultReceiverImpl(getResourceContext());
  ResultReceiverFactory.setResultReceiver(receiver);

  // run the crawler
  crawlerService.runCrawlerTask(
    "documents_crawler",                                  // taskID
    "Crawler for /documents",                             // taskDisplayName
    startResourceArray,                                   // start resources
    parametersArray,                                      // parameters
    ResultReceiverFactory.class.getName(),                // resultReceiverFactoryClassName
    "",                                                   // userDataForFactory
    false,                                                // survivesRestart
    false,                                                // delta
    landscapeService.getSystemFactory().getLocalSystem(), // node
    true                                                  // delete after completion
  );

  // wait until the result receiver has been notified that the crawl terminated
  while (!receiver.isTerminated()) {
    Thread.sleep(100);
  }

  if (!receiver.isSuccessful()) {
    fail("crawl failed");
  }
}

public IResourceContext getResourceContext() {
  return ResourceFactory.getInstance().getServiceContext(SERVICE_USER);
}

public static class ResultReceiverFactory implements IXCrawlerResultReceiverFactory {

  private static IXCrawlerResultReceiver s_resultReceiver = null;

  public static void setResultReceiver(IXCrawlerResultReceiver resultReceiver) {
    s_resultReceiver = resultReceiver;
  }

  public IXCrawlerResultReceiver createResultReceiver(String taskID, String userDataForFactory, boolean delta) throws Exception {
    // returns the statically registered receiver; this is sufficient here because
    // the crawler is started with survivesRestart=false and runs in the same VM
    return s_resultReceiver;
  }
}

public static class ResultReceiverImpl implements IXCrawlerPushedResultReceiver {

  private IResourceContext m_resourceContext;
  private ArrayList m_rids = new ArrayList();
  private volatile boolean m_terminated = false;
  private volatile boolean m_success = false;

  public ResultReceiverImpl(IResourceContext resourceContext) {
    m_resourceContext = resourceContext;
  }

  public void crawlStarted() throws Exception {
  }

  public void crawlFinished() throws Exception {
    m_success = true;
    m_terminated = true;
  }

  public void crawlStopped() throws Exception {
    m_terminated = true;
  }

  public void crawlFailed(Exception e) throws Exception {
    m_terminated = true;
  }

  public void indicateStop() {
  }

  public void finalGetFaulty(ArrayList faulty) {
  }

  public IResourceContext getResourceContext() throws Exception {
    return m_resourceContext;
  }

  public boolean approveCollectionCrawling(RID rid) throws Exception {
    return true;
  }

  public synchronized boolean receive(RID rid, ArrayList faulty) throws Exception {
    m_rids.add(rid);
    return true;
  }

  public ArrayList getResult() {
    return m_rids;
  }

  public boolean isTerminated() {
    return m_terminated;
  }

  public boolean isSuccessful() {
    return m_success;
  }
}


Copyright 2018 SAP AG Complete Copyright Notice