Interface | Description |
---|---|
IXCrawlerExtendedPushedDeltaResultReceiver | Extension of IXCrawlerPushedDeltaResultReceiver that also calls an inform method for already provided resources; version for incremental updates. |
IXCrawlerListDeltaResultReceiver | Result receiver which receives the result of a crawl as lists of RIDs when the crawl is over. |
IXCrawlerListResultReceiver | Result receiver which receives the result of a crawl as lists of RIDs when the crawl is over. |
IXCrawlerParameters | Parameters determining the behavior of a crawl. |
IXCrawlerPushedDeltaResultReceiver | Result receiver which receives the result of a crawl as single RIDs as the crawl goes on. |
IXCrawlerPushedResultReceiver | Result receiver which receives the result of a crawl as single RIDs as the crawl goes on. |
IXCrawlerResultReceiver | Abstract result receiver; all result receivers must implement this interface. |
IXCrawlerResultReceiverFactory | Factory which creates result receivers for a crawler task. |
IXCrawlerService | Global service for crawling repositories. |
IXCrawlerTaskSummary | State summary of a crawler task. |
IXCrawlerTaskSummary.IOperation | Operations carried out in crawler states. Added in 7.X |
Class | Description |
---|---|
AbstractEnum | Enum as Object representation. |
IXCrawlerParameters.LogLevel | Log levels for crawler log files. |
IXCrawlerParameters.ModificationCheckMode | Modes for checking whether a resource was modified. |
IXCrawlerTaskSummary.OperationType | Types of operations within crawler states. Added in 7.X |
IXCrawlerTaskSummary.TaskState | States of a crawler task. |
IXCrawlerTaskSummary.ThreadState | States of a crawler thread. |
Exception | Description |
---|---|
XCrawlerException | Exception which is thrown in case of crawler errors. |
Provides a service that crawls repositories to obtain references to resources.
- Purpose
- Detailed Concept
- Interfaces
- Configuration
- Code Samples
A crawler is a process that traverses Content Management (CM) repositories and provides the resources they contain to result receivers. The behavior of a crawler is controlled via crawler parameters (replacing the former crawler profiles). Crawlers store their status and the information they obtain in the database for two reasons: crawls can be resumed after a restart without losing information, and the result of a previous crawl can be compared to the current one for delta crawls (incremental updates).
The CM xcrawler service provides means to start, stop, suspend, resume, and delete crawlers. It replaces the former crawler service and comes with several enhancements.
Several CM services use the CM xcrawler service.
This document describes the new version of the CM xcrawler service. The main reason for reimplementing the crawler service was the need for crawlers that can be resumed after a restart of the underlying J2EE engine. Crawling a large amount of data may take longer than the maintenance interval of the system the crawler runs on, and during maintenance it is not unusual that the system must be restarted due to configuration changes. As a consequence, long-lasting crawls could never be completed: ongoing crawls could not be resumed at the point they were interrupted but had to be restarted from the beginning.
Other reasons were the simplification of the crawler interfaces, performance, and stability.
The CM xcrawler service uses the CM job processor service to run crawlers.
Term | Description |
---|---|
XCrawler service | CM service that offers the functionality to start, stop, suspend, resume, and delete crawlers |
Crawler | Process that traverses Content Management (CM) repositories and provides the resources they contain to result receivers |
Data source | Set of resources |
Result receiver | Java class that operates on resources provided by a crawler; a result receiver is specified by the process which runs a crawler |
Crawler parameters | Set of attributes which controls the behavior of a crawler; crawler parameters can be persisted in the configuration framework |
Retriever thread | Thread that retrieves resources from repositories |
Provider thread | Thread that provides documents to result receivers |
Postprocessor thread | Thread that compares the result of the previous crawl to the current one to identify unvisited, unvisited changed, or unvisited deleted documents |
Resource filters can restrict a crawl based on the following criteria:
- content length (min, max)
- mime type (wildcards supported)
- URL (wildcards supported)
- name (wildcards supported)
- time since the last modification (days)
A crawler is capable of crawling a set of data sources. Each data source is crawled using dedicated crawler parameters. A crawler starts threads which do the work, so the retrieval and the provision of resources are decoupled: blocking operations on one side do not affect the ongoing process of the other side. Multiple threads can be used for retrieving or providing to speed up the crawl.
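The decoupling described above can be sketched with a hand-off queue between a retriever thread and a provider thread. This is an illustrative sketch built from plain JDK types, not the actual crawler classes; in the real service the hand-off happens through the database-backed found-set:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class DecoupledCrawlSketch {
    private static final String END = "<end>"; // marker: retrieval is finished

    public static List<String> crawl(List<String> startResources) throws InterruptedException {
        BlockingQueue<String> found = new LinkedBlockingQueue<>();
        List<String> provided = new CopyOnWriteArrayList<>();

        // retriever thread: collects resources and hands them over via the queue
        Thread retriever = new Thread(() -> {
            for (String rid : startResources) {
                found.add(rid);
            }
            found.add(END);
        });

        // provider thread: passes resources on to the "result receiver";
        // a slow retriever does not block it beyond waiting on the queue
        Thread provider = new Thread(() -> {
            try {
                String rid;
                while (!(rid = found.take()).equals(END)) {
                    provided.add(rid);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        retriever.start();
        provider.start();
        retriever.join();
        provider.join();
        return provided;
    }
}
```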
A retriever thread collects resources and passes what it finds to a provider thread. Starting with the start resources of the data source, their descendants are searched. For each data source, multiple retriever threads can be used.
A provider thread passes the resources which have been found by some retriever thread to the result receiver. For each data source multiple provider threads can be used.
A postprocessor thread is executed when the retrievers and providers are finished. It compares the result of the previous crawl to the current one to identify differences. It is used for delta crawls (incremental updates). For each data source, one postprocessor thread is used.
Resources which are collected and processed by the threads are stored in memory-cached sets in the database. Each set holds resources in a certain state; for example, the todo-set contains resources which still have to be processed by retriever threads. The following sets are used: todo, retrieving, found, providing, finished, postprocessing, postprocessed, old, error. For each data source, one assemblage of sets is used.
The todo-set is prefilled at crawl start with the start resources of a data source. It contains the resources which still have to be processed by the retriever threads. The retriever threads fetch resources from this set and move them to the retrieving-set before collecting the direct descendants of these resources. The collected descendants are put into the todo-set so that they are processed by some retriever thread, too.
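A single retriever step over these sets can be sketched with plain in-memory sets (the real sets are memory-cached and database-backed; `listDescendants` is a stand-in for the repository access, and the trailing-slash convention for collections is an assumption of this sketch):

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class RetrieverStepSketch {
    final Set<String> todo = new LinkedHashSet<>();
    final Set<String> retrieving = new LinkedHashSet<>();
    final Set<String> found = new LinkedHashSet<>();

    // one retriever step: take a resource from the todo-set, collect its
    // direct descendants, queue them for further retrieval, and add the
    // non-collections to the found-set
    void step(Function<String, List<String>> listDescendants) {
        Iterator<String> it = todo.iterator();
        if (!it.hasNext()) {
            return;
        }
        String rid = it.next();
        it.remove();
        retrieving.add(rid); // in-flight; moved back to todo after a crash
        for (String child : listDescendants.apply(rid)) {
            todo.add(child);
            if (!child.endsWith("/")) { // convention here: collections end with "/"
                found.add(child);
            }
        }
        retrieving.remove(rid); // done with this resource
    }
}
```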
Direct descendants of a resource are the children of the resource (in case the resource is a collection) and embedded links - in case the resource is from a web repository.
The retrieving-set holds the resources which are currently being processed by retriever threads. In case the system crashes and the crawler is resumed at the next restart, they are moved back to the todo-set and are processed again, so no information is lost through a restart.
The resources (not collections) which are collected by the retriever threads are added as well to the found-set. The found-set is the interface between the retriever threads and the provider threads. The provider threads fetch resources from the found-set and move them to the providing-set before passing them to the result receiver.
All resources which are added to the found-set are removed from the old-set. The old-set contains the resources that have been found during the previous crawl.
The providing-set holds the resources which are currently being processed by provider threads. In case the system crashes and the crawler is resumed at the next restart, they are moved back to the found-set and are processed again, so no information is lost through a restart.
Resources which have been provided to the result receiver are moved to the finished-set.
After the retriever threads and provider threads are done the postprocessor thread is executed. It fetches resources from the old-set and moves them to the postprocessing-set. Since the resources which have been found during the current crawl were removed from the old-set, this set only contains the resources which were found during the previous crawl but not during the current one.
The postprocessor thread checks whether these resources still exist and, if they do, whether they have changed.
The postprocessing-set holds the resources which are currently being processed by the postprocessor thread. In case the system crashes and the crawler is resumed at the next restart, they are moved back to the old-set and are processed again, so no information is lost through a restart.
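The restart behavior described for the retrieving-, providing-, and postprocessing-sets follows one pattern, sketched below: on resume, everything left in an in-flight set is drained back into its source set so it is processed again. This is a minimal in-memory sketch; the real sets are persisted in the database:

```java
import java.util.Set;

public class ResumeRecoverySketch {
    // on resume, move all in-flight resources back to their source set
    // (retrieving -> todo, providing -> found, postprocessing -> old) so
    // they are processed again; nothing is lost through the restart
    public static void recover(Set<String> inFlight, Set<String> source) {
        source.addAll(inFlight);
        inFlight.clear();
    }
}
```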
The resources processed by the postprocessor thread are moved to the postprocessed-set.
After the postprocessor thread has emptied the old-set, all resources from the postprocessed-set are moved back to the old-set. All resources from the finished-set are moved to the old-set, too. Since resources were removed from the old-set by the retriever threads when they moved them to the found-set, the two sets are completely distinct.
The old-set now contains a consolidated result of all previous crawls and can be used in future delta crawls to track all changes of the descendants of the data source.
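The delta-crawl bookkeeping just described can be summarized with plain in-memory sets (an illustrative sketch only: `oldSet` stands for the previous crawl's result, `foundNow` for the resources found during the current crawl):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DeltaConsolidationSketch {
    // resources found in the previous crawl but not in the current one;
    // these are what the postprocessor has to check for deletion or change
    public static Set<String> unvisited(Set<String> oldSet, Set<String> foundNow) {
        Set<String> result = new LinkedHashSet<>(oldSet);
        result.removeAll(foundNow); // found resources were removed from the old-set
        return result;
    }

    // the consolidated old-set for the next delta crawl: the postprocessed
    // resources plus everything provided during this crawl (finished-set)
    public static Set<String> consolidate(Set<String> postprocessed, Set<String> finished) {
        Set<String> result = new LinkedHashSet<>(postprocessed);
        result.addAll(finished);
        return result;
    }
}
```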
Resources which could not be retrieved from the repositories during the crawl are moved to the error-set.
A retriever thread operates on the todo-set, the retrieving-set, the found-set, and the error-set. It fetches resources from the todo-set, moves them to the retrieving-set, collects their descendants, adds the descendants to the todo-set, and adds the descendants (which are not collections) to the found-set. Resources which could not be retrieved from the repository are moved to the error-set. The retriever thread informs the result receiver about deleted resources (for delta crawls).
The provider thread operates on the found-set, the providing-set, the finished-set, and the error-set. It fetches resources from the found-set, moves them to the providing-set, passes them to the result receiver, and finally moves them to the finished-set. The provider thread informs the result receiver about found resources (for normal crawls) or about new or changed resources (for delta crawls).
The postprocessor thread operates on the old-set, the postprocessing-set, the postprocessed-set, and the finished-set. It fetches resources from the old-set, moves them to the postprocessing-set, checks whether they still exist and, if they do, whether they have changed. Then it moves the resources to the postprocessed-set. After the old-set is empty, all resources from the postprocessed-set are moved back to the old-set. The resources of the finished-set are moved to the old-set as well. The postprocessor thread informs the result receiver about unchanged, changed, and deleted unvisited resources (for delta crawls).
IXCrawlerService
Interface of the global CM xcrawler service. It provides methods for creation of crawler parameters, creation of crawlers and retrieval of information about running or finished crawlers.
(Implemented by the crawler service).
IXCrawlerResultReceiverFactory
A factory which is used by a crawler to create its result receiver. This factory must be provided by the application that uses the crawler. A factory is used here because crawlers may resume after a restart of the system; in that case the result receiver instance used before the restart is lost. The application which uses the crawler must provide a way to instantiate a new result receiver instance by implementing this factory. The class name of the factory is persisted by the crawler in the database. After a restart of the system, the crawler creates an instance of the factory using the CRT class loader.
(Implemented by the applications that use the crawler)
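The restart behavior this factory enables can be sketched as follows. The stand-in types below are assumptions for illustration (the real crawler instantiates the persisted `IXCrawlerResultReceiverFactory` class name with the CRT class loader); only the `Class.forName` mechanism is the point:

```java
public class FactoryRestoreSketch {
    // stand-in for IXCrawlerResultReceiverFactory / IXCrawlerResultReceiver
    public interface ReceiverFactory {
        String createResultReceiver(String taskID);
    }

    // the application's factory implementation; only its class name is persisted
    public static class MyFactory implements ReceiverFactory {
        public String createResultReceiver(String taskID) {
            return "receiver-for-" + taskID;
        }
    }

    // after a restart, the crawler only has the persisted class name and
    // re-creates the factory reflectively to obtain a fresh result receiver
    public static ReceiverFactory restore(String factoryClassName) throws Exception {
        return (ReceiverFactory) Class.forName(factoryClassName)
                .getDeclaredConstructor().newInstance();
    }
}
```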
IXCrawlerResultReceiver
General result receiver interface. A result receiver must implement this interface to provide a resource context for crawling. The crawler calls methods of the result receiver to announce the start of the crawl, the stop of the crawl, and failures. For every collection processed during a crawl, the implementation of this interface can decide whether it is crawled or not.
(Implemented by the applications that use the crawler)
IXCrawlerPushedResultReceiver, IXCrawlerPushedDeltaResultReceiver, IXCrawlerListResultReceiver, IXCrawlerListDeltaResultReceiver
There are two kinds of result receivers. Pushed result receivers get the results of a crawl while the crawl is running. List result receivers get the results in a list at once when the crawl is finished (the list is persisted in the database, so no results are lost if the system is restarted during a crawl).
For each kind of result receiver there are two interfaces: one for normal crawls and one for delta crawls (incremental updates).
(Implemented by the applications that use the crawler)
IXCrawlerParameters
Parameters for a crawler. Crawler parameters can either be created based on configuration or by specifying the single values.
(Implemented by the crawler service).
IXCrawlerTaskSummary
Summary of information concerning a running or finished crawler.
(Implemented by the crawler service).
IXCrawlerJob
Container for an IJob which contains a crawler task.
(Implemented by the crawler service).
XCrawlerException
Exception thrown by the XCrawler service.
(Implemented by the crawler service).
The CM xcrawler service is configured in the Content Management configuration in the Global Services section:
Parameter Name | Required | Default Value | Description |
---|---|---|---|
Active | mandatory | true | Activate/deactivate the CM xcrawler service |
Default Crawler Parameters | mandatory | standard | Default crawler parameters |
Connection Pool | mandatory | dbcon_rep | JDBC connection pool for database access |
The various crawler parameters are configured in the Crawler Parameters section within the Global Services section. There are four categories:
Crawler Parameters are:
Parameter Name | Required | Default Value | Description |
---|---|---|---|
| optional | - | Description |
| optional | - | Maximum depth (0 or empty = unlimited) |
Number of Retriever Threads | mandatory | 1 | Number of retriever threads |
Number of Provider Threads | mandatory | 1 | Number of provider threads |
| optional | - | Time the crawler waits after each repository access to limit backend server load (0 or empty = no delay) |
| optional | - | Timeout for retrieval of a single document (0 or empty = unlimited) |
| optional | - | Resource filters narrowing the scope of the crawl |
| optional | - | Resource filters which are applied to the result of the crawl but do not narrow the scope |
| mandatory | false | Set to true if resources referred to by links should be crawled, too |
| mandatory | false | Set to true if the crawler should detect changes of resources by calculating a checksum |
| mandatory | false | Set to true if the crawler should detect changes of resources using the ETag provided by web servers |
| mandatory | Date OR Checksum OR ETag differ | |
| mandatory | false | Set to true if hidden documents should be included in the crawl |
| mandatory | false | Set to true if document versions should be included in the crawl |
| mandatory | Off | Maximum log level |
| optional | - | Path where log files are created |
| optional | 1000000 | Maximum size of a single log file |
| optional | 2 | Maximum number of backed-up log files |
| mandatory | false | Set to true if no documents should be passed to the result receiver |
```java
private static final String SERVICE_USER = "index_service";

void runCrawler() throws Exception {
    // get the xcrawler service
    IXCrawlerService crawlerService = (IXCrawlerService) ResourceFactory.getInstance()
            .getServiceFactory().getService(IServiceTypesConst.XCRAWLER_SERVICE);

    // get the landscape service
    ILandscapeService landscapeService = (ILandscapeService) ResourceFactory.getInstance()
            .getServiceFactory().getService(IServiceTypesConst.LANDSCAPE_SERVICE);

    // set up the start resources
    IRidList[] startResourceArray = new RidList[1];
    startResourceArray[0] = new RidList();
    startResourceArray[0].add(RID.getRid("/documents"));

    // create the crawler parameters
    IXCrawlerParameters parameters = crawlerService.createCrawlerParameters(
        0,     // maxDepth
        3,     // retrieverCount
        3,     // providerCount
        false, // useETag
        false, // useChecksum
        false, // followLinks
        false, // crawlVersions
        false, // crawlHidden
        false, // crawlSystem
        0,     // requestDelay
        IXCrawlerParameters.ModificationCheckMode.OR, // modificationCheckMode
        false, // findAllDocsInDepth
        true,  // respectRobots
        false, // test
        new IResourceFilter[0], // scopeFilters
        new IResourceFilter[0], // resultFilters
        0,     // maxLogFileSizeInBytes
        2,     // maxBacklogFiles
        null,  // logFilePath
        IXCrawlerParameters.LogLevel.OFF, // maxLogLevel
        0      // documentTimeoutInSeconds
    );
    // use the interface type for the array; a concrete implementation class
    // here risks an ArrayStoreException
    IXCrawlerParameters[] parametersArray = new IXCrawlerParameters[1];
    parametersArray[0] = parameters;

    // create the result receiver
    ResultReceiverImpl receiver = new ResultReceiverImpl(getResourceContext());
    ResultReceiverFactory.setResultReceiver(receiver);

    // run the crawler
    crawlerService.runCrawlerTask(
        "documents_crawler",      // taskID
        "Crawler for /documents", // taskDisplayName
        startResourceArray,       // start resources
        parametersArray,          // parameters
        ResultReceiverFactory.class.getName(), // resultReceiverFactoryClassName
        "",                       // userDataForFactory
        false,                    // survivesRestart
        false,                    // delta
        landscapeService.getSystemFactory().getLocalSystem(), // node
        true                      // delete after completion
    );

    // wait until the crawl has terminated
    while (!receiver.isTerminated()) {
        Thread.sleep(100);
    }
    if (!receiver.isSuccessful()) {
        fail("crawl failed");
    }
}

public IResourceContext getResourceContext() {
    return ResourceFactory.getInstance().getServiceContext(SERVICE_USER);
}

public static class ResultReceiverFactory implements IXCrawlerResultReceiverFactory {
    private static IXCrawlerResultReceiver s_resultReceiver = null;

    public static void setResultReceiver(IXCrawlerResultReceiver resultReceiver) {
        s_resultReceiver = resultReceiver;
    }

    public IXCrawlerResultReceiver createResultReceiver(String taskID,
            String userDataForFactory, boolean delta) throws Exception {
        return s_resultReceiver;
    }
}

public static class ResultReceiverImpl implements IXCrawlerPushedResultReceiver {
    private IResourceContext m_resourceContext;
    private ArrayList m_rids = new ArrayList();
    private volatile boolean m_terminated = false;
    private volatile boolean m_success = false;

    public ResultReceiverImpl(IResourceContext resourceContext) {
        m_resourceContext = resourceContext;
    }

    public void crawlStarted() throws Exception {
    }

    public void crawlFinished() throws Exception {
        m_success = true;
        m_terminated = true;
    }

    public void crawlStopped() throws Exception {
        m_terminated = true;
    }

    public void crawlFailed(Exception e) throws Exception {
        m_terminated = true;
    }

    public void indicateStop() {
    }

    public void finalGetFaulty(ArrayList faulty) {
    }

    public IResourceContext getResourceContext() throws Exception {
        return m_resourceContext;
    }

    public boolean approveCollectionCrawling(RID rid) throws Exception {
        return true;
    }

    public synchronized boolean receive(RID rid, ArrayList faulty) throws Exception {
        m_rids.add(rid);
        return true;
    }

    public ArrayList getResult() {
        return m_rids;
    }

    public boolean isTerminated() {
        return m_terminated;
    }

    public boolean isSuccessful() {
        return m_success;
    }
}
```
Copyright 2018 SAP AG Complete Copyright Notice