Oracle Secure Enterprise Search Java API Reference
10g Release 1 (10.1.8)

B32260-01


oracle.search.sdk.crawler
Interface CrawlingThreadService


public interface CrawlingThreadService

CrawlingThreadService is an interface used by a crawler plugin to perform crawl-related tasks. Its execution context is specific to the crawling thread that invokes the plugin's crawl() method.


Field Summary
static int DOC_EXCLUDED_BY_MIMETYPE
          document excluded by mimetype
static int DOC_EXCLUDED_BY_SIZE
          document excluded by document size
static int DOC_EXCLUDED_BY_URL_BOUNDARY
          document excluded by url boundary
static int DOC_INCLUDED
          document should be included

 

Method Summary
 int checkDocumentExcluded(DocumentMetadata meta)
          check whether the document should be crawled. The check stops as soon as one rule excludes the document, and only the status code for that rule is returned.
 java.lang.String inferMimeType(java.lang.String url)
          check the mime type based on the URL suffix.
 void markStatusNotChanged(DocumentMetadata meta)
          mark a url entry as not requiring any changes or updates.
 void submitForProcessing(DocumentContainer target)
          submit the document for processing.

 

Field Detail

DOC_INCLUDED

public static final int DOC_INCLUDED
document should be included
See Also:
Constant Field Values

DOC_EXCLUDED_BY_URL_BOUNDARY

public static final int DOC_EXCLUDED_BY_URL_BOUNDARY
document excluded by url boundary
See Also:
Constant Field Values

DOC_EXCLUDED_BY_MIMETYPE

public static final int DOC_EXCLUDED_BY_MIMETYPE
document excluded by mimetype
See Also:
Constant Field Values

DOC_EXCLUDED_BY_SIZE

public static final int DOC_EXCLUDED_BY_SIZE
document excluded by document size
See Also:
Constant Field Values

Method Detail

submitForProcessing

public void submitForProcessing(DocumentContainer target)
                         throws ProcessingException
submit the document for processing. It will be indexed if its status code is DocumentContainer.STATUS_OK_FOR_INDEX. After processing is done, the document is automatically removed from the queue. Note that the DocumentMetadata in the submitted target is cleared automatically if the operation succeeds.
Parameters:
target - the document container containing the content and metadata.
Throws:
ProcessingException

markStatusNotChanged

public void markStatusNotChanged(DocumentMetadata meta)
                          throws ProcessingException
mark a url entry as not requiring any changes or updates. This simply removes the entry from the URL queue; the url entry is not re-indexed and no additional operations are performed on it. Use this when re-crawling content and a particular URL has not changed.
Parameters:
meta - the metadata object corresponding to the url entry
Throws:
ProcessingException
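
The choice between submitForProcessing and markStatusNotChanged during a re-crawl can be sketched as follows. This is only an illustration under stated assumptions: Meta, Container, ThreadService, RecordingService, and the changedSinceLastCrawl flag are hypothetical stand-ins so that the example runs without the SDK on the classpath. They are not the real DocumentMetadata, DocumentContainer, or CrawlingThreadService types, and how a plugin detects a change is left to the plugin.

```java
import java.util.ArrayList;
import java.util.List;

public class RecrawlSketch {
    // Hypothetical stand-ins for DocumentMetadata, DocumentContainer and
    // ProcessingException; no real members of the SDK types are assumed.
    static class Meta {
        final String url;
        Meta(String url) { this.url = url; }
    }

    static class Container {
        final Meta meta;
        Container(Meta meta) { this.meta = meta; }
    }

    static class ProcessingException extends Exception {
        ProcessingException(String message) { super(message); }
    }

    // Mirrors only the two method signatures documented above.
    interface ThreadService {
        void submitForProcessing(Container target) throws ProcessingException;
        void markStatusNotChanged(Meta meta) throws ProcessingException;
    }

    // Stand-in service that records each call instead of touching a real queue.
    static class RecordingService implements ThreadService {
        final List<String> calls = new ArrayList<>();

        @Override
        public void submitForProcessing(Container target) {
            calls.add("submit:" + target.meta.url);
        }

        @Override
        public void markStatusNotChanged(Meta meta) {
            calls.add("unchanged:" + meta.url);
        }
    }

    // Changed entries are resubmitted; unchanged entries are only dequeued.
    static void recrawl(ThreadService service, Meta meta, boolean changedSinceLastCrawl) {
        try {
            if (changedSinceLastCrawl) {
                service.submitForProcessing(new Container(meta));
            } else {
                service.markStatusNotChanged(meta);
            }
        } catch (ProcessingException e) {
            // Wrapped as unchecked only to keep this sketch's callers short.
            throw new IllegalStateException("crawl step failed for " + meta.url, e);
        }
    }

    public static void main(String[] args) {
        RecordingService service = new RecordingService();
        recrawl(service, new Meta("http://example.com/a"), false);
        recrawl(service, new Meta("http://example.com/b"), true);
        System.out.println(service.calls);
    }
}
```

In this sketch the first URL is recorded as unchanged and the second as submitted, which matches the documented intent that an unchanged entry leaves the queue without being re-indexed.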

checkDocumentExcluded

public int checkDocumentExcluded(DocumentMetadata meta)
check whether the document should be crawled. The check stops as soon as one rule excludes the document, and only the status code for that rule is returned.
To avoid the overhead of processing excluded documents, call this method before enqueuing the document, or before submitting it if no queue is used. If the document size or mimetype is not available, the rules based on size or mimetype are not applied. The check order is boundary, mimetype, and size.

The internal exclusion check always runs when documents are submitted.

Returns:
DOC_INCLUDED (0), DOC_EXCLUDED_BY_URL_BOUNDARY (1), DOC_EXCLUDED_BY_MIMETYPE (2), DOC_EXCLUDED_BY_SIZE (4)
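
The rule order and short-circuit behaviour described above can be illustrated with a small stand-alone sketch. It is not the SDK's implementation. The numeric codes are the ones listed under Returns; the boundary prefix, the allowed mimetype set, the size limit, and the convention of passing null or a negative size for "not available" are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExclusionCheckSketch {
    // Numeric values as listed under Returns above.
    static final int DOC_INCLUDED = 0;
    static final int DOC_EXCLUDED_BY_URL_BOUNDARY = 1;
    static final int DOC_EXCLUDED_BY_MIMETYPE = 2;
    static final int DOC_EXCLUDED_BY_SIZE = 4;

    // Hypothetical rule settings; a real crawler takes these from its configuration.
    static final String BOUNDARY_PREFIX = "http://example.com/docs/";
    static final Set<String> ALLOWED_MIME_TYPES =
            new HashSet<>(Arrays.asList("text/html", "text/plain"));
    static final long MAX_SIZE_BYTES = 1_000_000L;

    // Assumption: mimeType == null or size < 0 means the value is not available,
    // in which case the corresponding rule is skipped.
    static int check(String url, String mimeType, long size) {
        // Rule 1: URL boundary.
        if (!url.startsWith(BOUNDARY_PREFIX)) {
            return DOC_EXCLUDED_BY_URL_BOUNDARY;
        }
        // Rule 2: mimetype, only when known.
        if (mimeType != null && !ALLOWED_MIME_TYPES.contains(mimeType)) {
            return DOC_EXCLUDED_BY_MIMETYPE;
        }
        // Rule 3: size, only when known.
        if (size >= 0 && size > MAX_SIZE_BYTES) {
            return DOC_EXCLUDED_BY_SIZE;
        }
        return DOC_INCLUDED;
    }

    public static void main(String[] args) {
        String inside = BOUNDARY_PREFIX + "guide.html";
        System.out.println(check("http://other.example/x", "image/png", 5_000_000L));
        System.out.println(check(inside, "image/png", 5_000_000L));
        System.out.println(check(inside, null, 5_000_000L));
        System.out.println(check(inside, "text/html", 10L));
    }
}
```

A document that violates several rules reports only the first violated rule in the boundary, mimetype, size order, so a caller can compare the result with a single constant rather than treating it as a bit mask.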

inferMimeType

public java.lang.String inferMimeType(java.lang.String url)
check the mime type based on the URL suffix.
Parameters:
url - the URL whose suffix is examined
Returns:
a mimetype such as "text/html". If no application is associated with the suffix, or the URL has no suffix, null is returned.
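
Suffix-based inference of this kind can be sketched as a lookup on the last path segment of the URL. The code below is a stand-in written for illustration, not the SDK's implementation: the suffix table, the case-insensitive match, and the stripping of query strings and fragments are assumptions. Only the null result for a missing or unknown suffix follows the description above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MimeInferenceSketch {
    // Hypothetical suffix table; the real mapping is owned by the crawler.
    static final Map<String, String> SUFFIX_TO_MIME = new HashMap<>();
    static {
        SUFFIX_TO_MIME.put("html", "text/html");
        SUFFIX_TO_MIME.put("txt", "text/plain");
        SUFFIX_TO_MIME.put("pdf", "application/pdf");
    }

    static String inferMimeType(String url) {
        // Drop any query string or fragment before looking at the suffix.
        int cut = url.length();
        int query = url.indexOf('?');
        int fragment = url.indexOf('#');
        if (query >= 0) {
            cut = Math.min(cut, query);
        }
        if (fragment >= 0) {
            cut = Math.min(cut, fragment);
        }
        String path = url.substring(0, cut);

        // Only the last path segment can carry a file suffix.
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        if (dot < 0 || dot == lastSegment.length() - 1) {
            return null; // no suffix
        }
        String suffix = lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUFFIX_TO_MIME.get(suffix); // null when the suffix is unknown
    }

    public static void main(String[] args) {
        System.out.println(inferMimeType("http://example.com/a/index.HTML?x=1"));
        System.out.println(inferMimeType("http://example.com/a/readme"));
    }
}
```

Because null is a normal result, a caller should test for it before passing the value on, for example by leaving the mimetype unset so that mimetype-based exclusion rules are not applied.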

Copyright © 2006, Oracle. All rights reserved.