com.sapportals.wcm.service.xcrawler

Interface IXCrawlerService


public interface IXCrawlerService

Global service for crawling repositories.

Copyright (c) SAP AG 2004


Field Summary
static int MAX_TASK_DISPLAY_NAME_LENGTH
           
static int MAX_TASK_ID_LENGTH
           
static int MAX_USER_DATA_LENGTH
           
 
Method Summary
 IXCrawlerParameters createCrawlerParameters(int maxDepth, int retrieverCount, int providerCount, boolean useETag, boolean useChecksum, boolean followLinks, boolean followRedirects, boolean crawlVersions, boolean crawlHidden, boolean crawlSystem, long requestDelayInMilliseconds, IXCrawlerParameters.ModificationCheckMode modificationCheckMode, boolean findAllDocsInDepth, boolean respectRobots, boolean respectNoIndex, boolean test, IResourceFilter[] scopeFilters, IResourceFilter[] resultFilters, long maxLogFileSizeInBytes, int maxBacklogFiles, String logFilePath, IXCrawlerParameters.LogLevel maxLogLevel, long documentTimeoutInSeconds)
          Create crawler parameters.
 IXCrawlerParameters createCrawlerParameters(int maxDepth, int retrieverCount, int providerCount, boolean useETag, boolean useChecksum, boolean followLinks, boolean followRedirects, boolean crawlVersions, boolean crawlHidden, boolean crawlSystem, long requestDelayInMilliseconds, IXCrawlerParameters.ModificationCheckMode modificationCheckMode, boolean findAllDocsInDepth, boolean respectRobots, boolean test, IResourceFilter[] scopeFilters, IResourceFilter[] resultFilters, long maxLogFileSizeInBytes, int maxBacklogFiles, String logFilePath, IXCrawlerParameters.LogLevel maxLogLevel, long documentTimeoutInSeconds)
          Create crawler parameters.
 IXCrawlerParameters createCrawlerParameters(int maxDepth, int retrieverCount, int providerCount, boolean useETag, boolean useChecksum, boolean followLinks, boolean crawlVersions, boolean crawlHidden, boolean crawlSystem, long requestDelayInMilliseconds, IXCrawlerParameters.ModificationCheckMode modificationCheckMode, boolean findAllDocsInDepth, boolean respectRobots, boolean test, IResourceFilter[] scopeFilters, IResourceFilter[] resultFilters, long maxLogFileSizeInBytes, int maxBacklogFiles, String logFilePath, IXCrawlerParameters.LogLevel maxLogLevel, long documentTimeoutInSeconds)
          Create crawler parameters.
 IXCrawlerParameters createCrawlerParameters(String parameterName)
          Create crawler parameters from a configurable in the configuration plugin /cm/services/xcrawlers.
 void deleteCrawlerTask(String taskID)
          Delete a crawler task.
 String[] getCrawlerParameterNames()
          Get the names of the available crawler parameters.
 IXCrawlerTaskSummary[] getCrawlerTaskSummaries()
          Get the state summaries of all crawler tasks.
 IXCrawlerTaskSummary getCrawlerTaskSummary(String taskID)
          Get the state summary of a crawler task.
 String getDefaultCrawlerParameterName()
          Get the name of the default crawler parameters.
 boolean isFiltered(IResource resource, IXCrawlerParameters parameters, RID crawlStartPath)
          Check if a resource would be filtered out during a crawl with specific crawler parameters.
 boolean isRunning(String taskID)
          Check if a crawler task is running for the specified taskID.
 boolean isScheduled(String taskID)
          Check if a crawler task is scheduled for the specified taskID
(and will run once any running or suspended crawler tasks for the
same taskID have finished).
 boolean isSuspended(String taskID)
          Check if a crawler task is suspended for the specified taskID.
 void recrawlErrors(String taskID)
          Restart a crawler task by crawling only the documents that failed during the last crawl.
 void reloadResourceFilters(String taskID)
          Reload the current version of the resource filters for a crawler.
 void resumeCrawlerTask(String taskID)
          Resume a crawler task.
 void runCrawlerTask(String taskID, String taskDisplayName, IRidList[] startResources, IXCrawlerParameters[] parameters, String resultReceiverFactoryClassName, String userDataForFactory, boolean survivesRestart, boolean delta, ISystem node, boolean deleteAfterCompletion)
          Run a crawler task.
 void stopCrawlerTask(String taskID)
          Stop a crawler task.
 void stopCrawlerTaskAsync(String taskID)
          Stop a crawler task asynchronously.
 void suspendCrawlerTask(String taskID)
          Suspend a crawler task.
 

Field Detail

MAX_TASK_ID_LENGTH

public static final int MAX_TASK_ID_LENGTH
See Also:
Constant Field Values

MAX_TASK_DISPLAY_NAME_LENGTH

public static final int MAX_TASK_DISPLAY_NAME_LENGTH
See Also:
Constant Field Values

MAX_USER_DATA_LENGTH

public static final int MAX_USER_DATA_LENGTH
See Also:
Constant Field Values
Method Detail

getCrawlerParameterNames

public String[] getCrawlerParameterNames()
                                  throws XCrawlerException
Get the names of the available crawler parameters.
These are the configurables in the configuration plugin /cm/services/xcrawlers of the config class XCrawler.

Returns:
the names of the available crawler parameters
Throws:
XCrawlerException

getDefaultCrawlerParameterName

public String getDefaultCrawlerParameterName()
                                      throws XCrawlerException
Get the name of the default crawler parameters.

Returns:
the name of the default crawler parameters
Throws:
XCrawlerException

createCrawlerParameters

public IXCrawlerParameters createCrawlerParameters(String parameterName)
                                            throws XCrawlerException
Create crawler parameters from a configurable in the configuration plugin /cm/services/xcrawlers.

Parameters:
parameterName - name of the configurable
Returns:
the created crawler parameters
Throws:
XCrawlerException
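
The three lookup and creation methods above are typically used together: list the configured names, fall back to the default name when the wanted one is missing, then create the parameters from the chosen name. The following is a minimal sketch of that selection step only. ParameterNameSource and FixedParameterNames are hypothetical stand-ins so the example runs without a portal, the configurable names "standard" and "web" are invented, and the stand-in methods omit the XCrawlerException that the real methods declare.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the two lookup methods; the real ones throw XCrawlerException. */
interface ParameterNameSource {
    String[] getCrawlerParameterNames();
    String getDefaultCrawlerParameterName();
}

/** Fixed-value fake so the sketch runs without a portal. */
final class FixedParameterNames implements ParameterNameSource {
    private final String[] names;
    private final String defaultName;

    FixedParameterNames(String[] names, String defaultName) {
        this.names = names;
        this.defaultName = defaultName;
    }

    public String[] getCrawlerParameterNames() { return names.clone(); }
    public String getDefaultCrawlerParameterName() { return defaultName; }
}

final class ParameterNameChooser {
    private ParameterNameChooser() {}

    /** Returns the requested name if it is configured, otherwise the default name. */
    static String choose(ParameterNameSource source, String requested) {
        String[] names = source.getCrawlerParameterNames();
        if (requested != null && names != null && Arrays.asList(names).contains(requested)) {
            return requested;
        }
        return source.getDefaultCrawlerParameterName();
    }

    public static void main(String[] args) {
        // "standard" and "web" are invented configurable names.
        ParameterNameSource fake =
            new FixedParameterNames(new String[] {"standard", "web"}, "standard");
        System.out.println(choose(fake, "web"));     // requested name is configured
        System.out.println(choose(fake, "missing")); // falls back to the default name
    }
}
```

With the real service, the chosen name would then be passed to createCrawlerParameters(String parameterName).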

createCrawlerParameters

public IXCrawlerParameters createCrawlerParameters(int maxDepth,
                                                   int retrieverCount,
                                                   int providerCount,
                                                   boolean useETag,
                                                   boolean useChecksum,
                                                   boolean followLinks,
                                                   boolean crawlVersions,
                                                   boolean crawlHidden,
                                                   boolean crawlSystem,
                                                   long requestDelayInMilliseconds,
                                                   IXCrawlerParameters.ModificationCheckMode modificationCheckMode,
                                                   boolean findAllDocsInDepth,
                                                   boolean respectRobots,
                                                   boolean test,
                                                   IResourceFilter[] scopeFilters,
                                                   IResourceFilter[] resultFilters,
                                                   long maxLogFileSizeInBytes,
                                                   int maxBacklogFiles,
                                                   String logFilePath,
                                                   IXCrawlerParameters.LogLevel maxLogLevel,
                                                   long documentTimeoutInSeconds)
                                            throws XCrawlerException
Create crawler parameters.
Old version: does not include followRedirects; that parameter is internally set to the value of followLinks.

Parameters:
maxDepth - maximum depth of the crawl (0 is unlimited)
retrieverCount - number of threads which retrieve the resources from the repositories
providerCount - number of threads which provide the found resources to the result receivers
useETag - true if the ETag of a resource should be used to detect modification
useChecksum - true if the checksum of the resource content should be used to detect modification
followLinks - true if links should be followed during the crawl
crawlVersions - true if versions of resources should be included in the crawl
crawlHidden - true if hidden resources should be included in the crawl
crawlSystem - true if system resources should be included in the crawl
requestDelayInMilliseconds - number of milliseconds between two consecutive resource retrievals (to limit repository load)
modificationCheckMode - mode of resource modification detection (ETag AND checksum, ETag OR checksum)
findAllDocsInDepth - true if resources should be found on the shortest possible path
respectRobots - true if robot-rules of web-servers should be respected
test - true if no resources should be provided to the result receiver
scopeFilters - resource filters narrowing the scope of the crawl
resultFilters - resource filters which are applied to the result of the crawl but do not narrow the scope
maxLogFileSizeInBytes - maximum size of the crawler log file in bytes (0 is unlimited)
maxBacklogFiles - maximum number of old crawler log files
logFilePath - path to the crawler log file (if null the current system path is used)
maxLogLevel - maximum log level
documentTimeoutInSeconds - the document retrieval timeout in seconds
Returns:
the created crawler parameters
Throws:
XCrawlerException
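
The relationship between this older overload and the newer one can be pictured as a simple delegation in which followRedirects takes the value of followLinks. The sketch below reduces the parameter list to the two flags involved. LinkSettings and OverloadSketch are hypothetical names for illustration and are not part of the documented API; how the real implementation delegates internally is not stated here.

```java
/** Hypothetical holder for the two flags; not the real IXCrawlerParameters. */
final class LinkSettings {
    final boolean followLinks;
    final boolean followRedirects;

    LinkSettings(boolean followLinks, boolean followRedirects) {
        this.followLinks = followLinks;
        this.followRedirects = followRedirects;
    }
}

final class OverloadSketch {
    private OverloadSketch() {}

    /** Newer form: both flags are given explicitly. */
    static LinkSettings create(boolean followLinks, boolean followRedirects) {
        return new LinkSettings(followLinks, followRedirects);
    }

    /** Older form: followRedirects mirrors followLinks. */
    static LinkSettings create(boolean followLinks) {
        return create(followLinks, followLinks);
    }

    public static void main(String[] args) {
        LinkSettings old = create(true);
        System.out.println(old.followLinks + " " + old.followRedirects);
    }
}
```

A caller that wants links followed but redirects ignored therefore needs the overload that takes followRedirects explicitly.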

createCrawlerParameters

public IXCrawlerParameters createCrawlerParameters(int maxDepth,
                                                   int retrieverCount,
                                                   int providerCount,
                                                   boolean useETag,
                                                   boolean useChecksum,
                                                   boolean followLinks,
                                                   boolean followRedirects,
                                                   boolean crawlVersions,
                                                   boolean crawlHidden,
                                                   boolean crawlSystem,
                                                   long requestDelayInMilliseconds,
                                                   IXCrawlerParameters.ModificationCheckMode modificationCheckMode,
                                                   boolean findAllDocsInDepth,
                                                   boolean respectRobots,
                                                   boolean test,
                                                   IResourceFilter[] scopeFilters,
                                                   IResourceFilter[] resultFilters,
                                                   long maxLogFileSizeInBytes,
                                                   int maxBacklogFiles,
                                                   String logFilePath,
                                                   IXCrawlerParameters.LogLevel maxLogLevel,
                                                   long documentTimeoutInSeconds)
                                            throws XCrawlerException
Create crawler parameters.
New version: includes followRedirects

Parameters:
maxDepth - maximum depth of the crawl (0 is unlimited)
retrieverCount - number of threads which retrieve the resources from the repositories
providerCount - number of threads which provide the found resources to the result receivers
useETag - true if the ETag of a resource should be used to detect modification
useChecksum - true if the checksum of the resource content should be used to detect modification
followLinks - true if links should be followed during the crawl
followRedirects - true if redirects in Web-RMs should be followed during the crawl
crawlVersions - true if versions of resources should be included in the crawl
crawlHidden - true if hidden resources should be included in the crawl
crawlSystem - true if system resources should be included in the crawl
requestDelayInMilliseconds - number of milliseconds between two consecutive resource retrievals (to limit repository load)
modificationCheckMode - mode of resource modification detection (ETag AND checksum, ETag OR checksum)
findAllDocsInDepth - true if resources should be found on the shortest possible path
respectRobots - true if robot-rules of web-servers should be respected
test - true if no resources should be provided to the result receiver
scopeFilters - resource filters narrowing the scope of the crawl
resultFilters - resource filters which are applied to the result of the crawl but do not narrow the scope
maxLogFileSizeInBytes - maximum size of the crawler log file in bytes (0 is unlimited)
maxBacklogFiles - maximum number of old crawler log files
logFilePath - path to the crawler log file (if null the current system path is used)
maxLogLevel - maximum log level
documentTimeoutInSeconds - the document retrieval timeout in seconds
Returns:
the created crawler parameters
Throws:
XCrawlerException

createCrawlerParameters

public IXCrawlerParameters createCrawlerParameters(int maxDepth,
                                                   int retrieverCount,
                                                   int providerCount,
                                                   boolean useETag,
                                                   boolean useChecksum,
                                                   boolean followLinks,
                                                   boolean followRedirects,
                                                   boolean crawlVersions,
                                                   boolean crawlHidden,
                                                   boolean crawlSystem,
                                                   long requestDelayInMilliseconds,
                                                   IXCrawlerParameters.ModificationCheckMode modificationCheckMode,
                                                   boolean findAllDocsInDepth,
                                                   boolean respectRobots,
                                                   boolean respectNoIndex,
                                                   boolean test,
                                                   IResourceFilter[] scopeFilters,
                                                   IResourceFilter[] resultFilters,
                                                   long maxLogFileSizeInBytes,
                                                   int maxBacklogFiles,
                                                   String logFilePath,
                                                   IXCrawlerParameters.LogLevel maxLogLevel,
                                                   long documentTimeoutInSeconds)
                                            throws XCrawlerException
Create crawler parameters.
New version: includes followRedirects and respectNoIndex.

Parameters:
maxDepth - maximum depth of the crawl (0 is unlimited)
retrieverCount - number of threads which retrieve the resources from the repositories
providerCount - number of threads which provide the found resources to the result receivers
useETag - true if the ETag of a resource should be used to detect modification
useChecksum - true if the checksum of the resource content should be used to detect modification
followLinks - true if links should be followed during the crawl
followRedirects - true if redirects in Web-RMs should be followed during the crawl
crawlVersions - true if versions of resources should be included in the crawl
crawlHidden - true if hidden resources should be included in the crawl
crawlSystem - true if system resources should be included in the crawl
requestDelayInMilliseconds - number of milliseconds between two consecutive resource retrievals (to limit repository load)
modificationCheckMode - mode of resource modification detection (ETag AND checksum, ETag OR checksum)
findAllDocsInDepth - true if resources should be found on the shortest possible path
respectRobots - true if robot-rules of web-servers should be respected
respectNoIndex - true if the index-content property should be respected
test - true if no resources should be provided to the result receiver
scopeFilters - resource filters narrowing the scope of the crawl
resultFilters - resource filters which are applied to the result of the crawl but do not narrow the scope
maxLogFileSizeInBytes - maximum size of the crawler log file in bytes (0 is unlimited)
maxBacklogFiles - maximum number of old crawler log files
logFilePath - path to the crawler log file (if null the current system path is used)
maxLogLevel - maximum log level
documentTimeoutInSeconds - the document retrieval timeout in seconds
Returns:
the created crawler parameters
Throws:
XCrawlerException
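
The difference between scopeFilters and resultFilters is easy to miss in the parameter list. One plausible reading, sketched below, is that a resource rejected by a scope filter is not traversed at all, while a resource rejected by a result filter is still traversed but not handed to the result receiver. SketchNode and FilterSketch are hypothetical, the predicates stand in for IResourceFilter, and the real crawler's traversal order and filter semantics may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical tree node standing in for a repository resource. */
final class SketchNode {
    final String name;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name) { this.name = name; }

    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }
}

final class FilterSketch {
    private FilterSketch() {}

    /** Builds docs/{drafts/a.txt, public/b.txt}. */
    static SketchNode sampleTree() {
        return new SketchNode("docs")
            .add(new SketchNode("drafts").add(new SketchNode("a.txt")))
            .add(new SketchNode("public").add(new SketchNode("b.txt")));
    }

    /** Returns the names that would reach the result receiver under the assumed semantics. */
    static List<String> crawl(SketchNode root,
                              Predicate<SketchNode> inScope,
                              Predicate<SketchNode> inResult) {
        List<String> delivered = new ArrayList<>();
        visit(root, inScope, inResult, delivered);
        return delivered;
    }

    private static void visit(SketchNode node,
                              Predicate<SketchNode> inScope,
                              Predicate<SketchNode> inResult,
                              List<String> delivered) {
        if (!inScope.test(node)) {
            return; // out of scope: the node and its subtree are skipped
        }
        if (inResult.test(node)) {
            delivered.add(node.name); // passes the result filter: delivered
        }
        for (SketchNode child : node.children) {
            visit(child, inScope, inResult, delivered);
        }
    }

    public static void main(String[] args) {
        // Excluding "drafts" as a scope filter also hides a.txt.
        System.out.println(crawl(sampleTree(),
            n -> !n.name.equals("drafts"), n -> n.name.endsWith(".txt")));
        // Excluding "drafts" as a result filter still reaches a.txt.
        System.out.println(crawl(sampleTree(),
            n -> true, n -> !n.name.equals("drafts")));
    }
}
```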

runCrawlerTask

public void runCrawlerTask(String taskID,
                           String taskDisplayName,
                           IRidList[] startResources,
                           IXCrawlerParameters[] parameters,
                           String resultReceiverFactoryClassName,
                           String userDataForFactory,
                           boolean survivesRestart,
                           boolean delta,
                           ISystem node,
                           boolean deleteAfterCompletion)
                    throws XCrawlerException
Run a crawler task.
Multiple lists of start resources can be specified. Each list has its own crawler parameters. The number of crawler parameters must match the number of lists of start resources.
The crawler task is started asynchronously.
Tasks with the same ID are started sequentially.

Parameters:
taskID - ID of the new task (maximum length is MAX_TASK_ID_LENGTH)
taskDisplayName - display name of the new task (maximum length is MAX_TASK_DISPLAY_NAME_LENGTH, may be null)
startResources - lists of start resources
parameters - crawler parameters for the lists of start resources
resultReceiverFactoryClassName - class which creates result receivers; the name of the class is persisted in the database and reused via reflection when the crawler task is resumed; the class must implement IXCrawlerResultReceiverFactory
userDataForFactory - this string is passed to the createResultReceiver() method of the resultReceiverFactory; here the result receiving application can store any data up to MAX_USER_DATA_LENGTH characters in length (may be null)
survivesRestart - if true the crawler can be resumed even after a restart of CM
delta - true if an incremental update should be performed
node - cluster node on which the task should be executed
deleteAfterCompletion - true if the crawler should be deleted after it is complete
Throws:
XCrawlerException
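
Several constraints on the runCrawlerTask arguments are stated above: length limits for the task ID, display name, and user data, and one set of crawler parameters per list of start resources. A caller may want to check these before invoking the service. The sketch below is one way to do so. RunArgumentCheck is a hypothetical helper, the three ASSUMED_ limits are placeholder values because the real constant values are not given here, and treating a null or empty taskID as an error is an assumption.

```java
final class RunArgumentCheck {
    // Placeholder limits. Real code would read MAX_TASK_ID_LENGTH,
    // MAX_TASK_DISPLAY_NAME_LENGTH and MAX_USER_DATA_LENGTH from IXCrawlerService.
    static final int ASSUMED_MAX_TASK_ID_LENGTH = 100;
    static final int ASSUMED_MAX_TASK_DISPLAY_NAME_LENGTH = 200;
    static final int ASSUMED_MAX_USER_DATA_LENGTH = 1000;

    private RunArgumentCheck() {}

    /** Returns a description of the first violated constraint, or null if none is violated. */
    static String firstProblem(String taskID,
                               String taskDisplayName,
                               int startResourceListCount,
                               int parameterCount,
                               String userDataForFactory) {
        if (taskID == null || taskID.isEmpty()) {
            return "taskID is missing"; // assumption: an ID is required
        }
        if (taskID.length() > ASSUMED_MAX_TASK_ID_LENGTH) {
            return "taskID is too long";
        }
        // The display name and the user data may be null according to the documentation.
        if (taskDisplayName != null
                && taskDisplayName.length() > ASSUMED_MAX_TASK_DISPLAY_NAME_LENGTH) {
            return "taskDisplayName is too long";
        }
        if (userDataForFactory != null
                && userDataForFactory.length() > ASSUMED_MAX_USER_DATA_LENGTH) {
            return "userDataForFactory is too long";
        }
        if (startResourceListCount != parameterCount) {
            return "parameter count does not match start resource list count";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstProblem("index-1", null, 2, 2, null));
        System.out.println(firstProblem("index-1", "Nightly crawl", 2, 1, null));
    }
}
```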

suspendCrawlerTask

public void suspendCrawlerTask(String taskID)
                        throws XCrawlerException
Suspend a crawler task.
The task must be running.

Parameters:
taskID - ID of the task
Throws:
XCrawlerException

resumeCrawlerTask

public void resumeCrawlerTask(String taskID)
                       throws XCrawlerException
Resume a crawler task.
The task must be suspended.

Parameters:
taskID - ID of the task
Throws:
XCrawlerException
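
Because suspendCrawlerTask requires a running task and resumeCrawlerTask requires a suspended one, callers can guard each call with isRunning or isSuspended. The following sketch shows such a guard as a pause toggle. SuspendableTasks and InMemoryTasks are hypothetical stand-ins that mirror the documented method names without the XCrawlerException, and the in-memory state model is invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the four service methods used here. */
interface SuspendableTasks {
    boolean isRunning(String taskID);
    boolean isSuspended(String taskID);
    void suspendCrawlerTask(String taskID);
    void resumeCrawlerTask(String taskID);
}

/** In-memory fake: maps a task ID to TRUE when suspended and FALSE when running. */
final class InMemoryTasks implements SuspendableTasks {
    private final Map<String, Boolean> suspendedById = new HashMap<>();

    void start(String taskID) { suspendedById.put(taskID, Boolean.FALSE); }

    public boolean isRunning(String taskID) {
        return Boolean.FALSE.equals(suspendedById.get(taskID));
    }

    public boolean isSuspended(String taskID) {
        return Boolean.TRUE.equals(suspendedById.get(taskID));
    }

    public void suspendCrawlerTask(String taskID) {
        if (!isRunning(taskID)) {
            throw new IllegalStateException("task is not running: " + taskID);
        }
        suspendedById.put(taskID, Boolean.TRUE);
    }

    public void resumeCrawlerTask(String taskID) {
        if (!isSuspended(taskID)) {
            throw new IllegalStateException("task is not suspended: " + taskID);
        }
        suspendedById.put(taskID, Boolean.FALSE);
    }
}

final class PauseToggle {
    private PauseToggle() {}

    /** Suspends a running task, resumes a suspended one, and otherwise does nothing. */
    static String toggle(SuspendableTasks tasks, String taskID) {
        if (tasks.isRunning(taskID)) {
            tasks.suspendCrawlerTask(taskID);
            return "suspended";
        }
        if (tasks.isSuspended(taskID)) {
            tasks.resumeCrawlerTask(taskID);
            return "resumed";
        }
        return "unchanged";
    }

    public static void main(String[] args) {
        InMemoryTasks tasks = new InMemoryTasks();
        tasks.start("index-1");
        System.out.println(toggle(tasks, "index-1"));
        System.out.println(toggle(tasks, "index-1"));
    }
}
```

Against the real service the state can change between the check and the call, so the XCrawlerException still has to be handled.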

stopCrawlerTask

public void stopCrawlerTask(String taskID)
                     throws XCrawlerException
Stop a crawler task.
The method returns after the task is stopped.

Parameters:
taskID - ID of the task
Throws:
XCrawlerException

stopCrawlerTaskAsync

public void stopCrawlerTaskAsync(String taskID)
                          throws XCrawlerException
Stop a crawler task asynchronously.
The method returns immediately.

Parameters:
taskID - ID of the task
Throws:
XCrawlerException
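
A caller that uses the asynchronous stop but still needs to know when the task has ended can poll the task state with a bounded number of attempts. The sketch below assumes that isRunning eventually returns false once the stop has taken effect, which is not stated in this documentation. StoppableTasks and CountdownTasks are hypothetical stand-ins, and the poll count and interval are arbitrary.

```java
/** Hypothetical stand-in for the two service methods used here. */
interface StoppableTasks {
    void stopCrawlerTaskAsync(String taskID);
    boolean isRunning(String taskID);
}

/** Fake that keeps reporting "running" for a fixed number of polls after the stop request. */
final class CountdownTasks implements StoppableTasks {
    private boolean stopRequested;
    private int remainingPolls;

    CountdownTasks(int pollsUntilStopped) { this.remainingPolls = pollsUntilStopped; }

    public void stopCrawlerTaskAsync(String taskID) { stopRequested = true; }

    public boolean isRunning(String taskID) {
        if (!stopRequested) {
            return true;
        }
        if (remainingPolls > 0) {
            remainingPolls--;
            return true;
        }
        return false;
    }
}

final class AsyncStop {
    private AsyncStop() {}

    /** Requests the stop, then polls up to maxPolls times; returns true once the task is no longer running. */
    static boolean stopAndWait(StoppableTasks tasks, String taskID, int maxPolls, long pollMillis) {
        tasks.stopCrawlerTaskAsync(taskID);
        for (int i = 0; i < maxPolls; i++) {
            if (!tasks.isRunning(taskID)) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return !tasks.isRunning(taskID);
    }

    public static void main(String[] args) {
        System.out.println(stopAndWait(new CountdownTasks(2), "index-1", 5, 10L));
    }
}
```

When blocking is acceptable, stopCrawlerTask already provides this behavior, since it returns only after the task is stopped.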

recrawlErrors

public void recrawlErrors(String taskID)
                   throws XCrawlerException
Restart a crawler task by crawling only the documents that failed during the last crawl.
The task must be down, done, failed, or stopped.

Parameters:
taskID - ID of the task
Throws:
XCrawlerException
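
The precondition "down, done, failed, or stopped" can be expressed as a small state check performed before calling recrawlErrors. The enum below is hypothetical: its constant names are taken from the wording in this documentation, and how the real task state is exposed (presumably through IXCrawlerTaskSummary) is not shown here.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical state names derived from the wording of this documentation. */
enum SketchTaskState { RUNNING, SUSPENDED, SCHEDULED, DOWN, DONE, FAILED, STOPPED }

final class RecrawlGuard {
    private static final Set<SketchTaskState> RECRAWLABLE = EnumSet.of(
        SketchTaskState.DOWN, SketchTaskState.DONE,
        SketchTaskState.FAILED, SketchTaskState.STOPPED);

    private RecrawlGuard() {}

    /** True if the state is one of the four states that permit recrawling errors. */
    static boolean canRecrawlErrors(SketchTaskState state) {
        return state != null && RECRAWLABLE.contains(state);
    }

    public static void main(String[] args) {
        for (SketchTaskState state : SketchTaskState.values()) {
            System.out.println(state + " -> " + canRecrawlErrors(state));
        }
    }
}
```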

deleteCrawlerTask

public void deleteCrawlerTask(String taskID)
                       throws XCrawlerException
Delete a crawler task.
The task is stopped before deletion.
The task is deleted in the database.

Parameters:
taskID - ID of the task
Throws:
XCrawlerException

getCrawlerTaskSummaries

public IXCrawlerTaskSummary[] getCrawlerTaskSummaries()
                                               throws XCrawlerException
Get the state summaries of all crawler tasks.

Returns:
the state summaries of all crawler tasks
Throws:
XCrawlerException

getCrawlerTaskSummary

public IXCrawlerTaskSummary getCrawlerTaskSummary(String taskID)
                                           throws XCrawlerException
Get the state summary of a crawler task.

Parameters:
taskID - ID of the task
Returns:
the state summary of a crawler task (or null if no summary exists for this task)
Throws:
XCrawlerException

isRunning

public boolean isRunning(String taskID)
                  throws XCrawlerException
Check if a crawler task is running for the specified taskID.

Parameters:
taskID - ID of the task
Returns:
true if a crawler task is running for the specified taskID
Throws:
XCrawlerException

isSuspended

public boolean isSuspended(String taskID)
                    throws XCrawlerException
Check if a crawler task is suspended for the specified taskID.

Parameters:
taskID - ID of the task
Returns:
true if a crawler task is suspended for the specified taskID
Throws:
XCrawlerException

isScheduled

public boolean isScheduled(String taskID)
                    throws XCrawlerException
Check if a crawler task is scheduled for the specified taskID
(and will run once any running or suspended crawler tasks for the
same taskID have finished).

Parameters:
taskID - ID of the task
Returns:
true if a crawler task is scheduled for the specified taskID
Throws:
XCrawlerException

isFiltered

public boolean isFiltered(IResource resource,
                          IXCrawlerParameters parameters,
                          RID crawlStartPath)
                   throws XCrawlerException
Check if a resource would be filtered out during a crawl with specific crawler parameters.

Parameters:
resource - the resource
parameters - the crawler parameters
crawlStartPath - path of the related datasource that is attached to the index (for depth calculation)
Returns:
true if the resource would be filtered (i.e. NOT passed to the result receiver)
Throws:
XCrawlerException

reloadResourceFilters

public void reloadResourceFilters(String taskID)
                           throws XCrawlerException
Reload the current version of the resource filters for a crawler.
Works only for suspended crawlers; the new filters apply after the next resume.
Works only for crawlers whose crawler parameters have been created from a configurable (via createCrawlerParameters(String parameterName)).

Parameters:
taskID - ID of the task
Throws:
XCrawlerException
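
Since filters can only be reloaded while the crawler is suspended, applying changed filters to a running crawler takes three steps: suspend, reload, resume. The sketch below shows that sequence and leaves an already suspended task suspended. FilterReloadTasks and RecordingTasks are hypothetical stand-ins that mirror the documented method names without the XCrawlerException, and the sketch assumes that suspendCrawlerTask has taken effect by the time it returns.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the five service methods used here. */
interface FilterReloadTasks {
    boolean isRunning(String taskID);
    boolean isSuspended(String taskID);
    void suspendCrawlerTask(String taskID);
    void reloadResourceFilters(String taskID);
    void resumeCrawlerTask(String taskID);
}

/** Fake that records the order of the calls it receives. */
final class RecordingTasks implements FilterReloadTasks {
    final List<String> calls = new ArrayList<>();
    private boolean running;
    private boolean suspended;

    RecordingTasks(boolean running, boolean suspended) {
        this.running = running;
        this.suspended = suspended;
    }

    public boolean isRunning(String taskID) { return running; }
    public boolean isSuspended(String taskID) { return suspended; }

    public void suspendCrawlerTask(String taskID) {
        running = false;
        suspended = true;
        calls.add("suspend");
    }

    public void reloadResourceFilters(String taskID) { calls.add("reload"); }

    public void resumeCrawlerTask(String taskID) {
        suspended = false;
        running = true;
        calls.add("resume");
    }
}

final class FilterReload {
    private FilterReload() {}

    /** Reloads the filters if the task is suspended or running; returns false otherwise. */
    static boolean reload(FilterReloadTasks tasks, String taskID) {
        if (tasks.isSuspended(taskID)) {
            tasks.reloadResourceFilters(taskID); // stays suspended; filters apply on resume
            return true;
        }
        if (!tasks.isRunning(taskID)) {
            return false;
        }
        tasks.suspendCrawlerTask(taskID);
        try {
            tasks.reloadResourceFilters(taskID);
        } finally {
            tasks.resumeCrawlerTask(taskID); // resume even if the reload fails
        }
        return true;
    }

    public static void main(String[] args) {
        RecordingTasks tasks = new RecordingTasks(true, false);
        System.out.println(reload(tasks, "index-1") + " " + tasks.calls);
    }
}
```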


Copyright 2006 SAP AG. All rights reserved.