9 Oracle Ultra Search Developer's Guide and API Reference

This chapter explains the Oracle Ultra Search APIs and related information. This chapter contains the following topics:

Overview of Oracle Ultra Search APIs
Oracle Ultra Search Query API
Customizing the Query Syntax Expansion
Oracle Ultra Search Query Tag Library
Oracle Ultra Search Crawler Agent API
Oracle Ultra Search Java E-mail API
Oracle Ultra Search URL Rewriter API
Oracle Ultra Search Document Service API
Oracle Ultra Search Query Applications

Overview of Oracle Ultra Search APIs

Oracle Ultra Search provides the following APIs:

Query API

The query API works with indexed data. The Java API does not impose any HTML rendering elements. The application can completely customize the HTML interface.

Crawler Agent API

The crawler agent API crawls and indexes proprietary document repositories.

E-Mail API

The e-mail API is used by the Oracle Ultra Search query application to display e-mails. It can also be used when building your own custom query application.

URL Rewriter API

The URL rewriter API is used by the crawler to filter and rewrite extracted URL links before they are inserted into the URL queue.

Document Service API

The document service API allows generation of attribute data based on the document contents.

Oracle Ultra Search also includes highly functional query applications to query and display search results. The query applications are J2EE-compliant Web applications.

The following dependencies for ultrasearch_query.jar must be included if you are using it outside OC4J:

classes12.jar
orai18n.jar
orai18n-mapping.jar
orai18n-translation.jar
$ORACLE_HOME/oc4j/j2ee/home/jazn.jar
$ORACLE_HOME/oc4j/j2ee/home/jazncore.jar
$ORACLE_HOME/jlib/ldapjclnt10.jar

Oracle Ultra Search Query API

Oracle Ultra Search provides a Java API for querying indexed data. The API methods retrieve and display query results. Because it is written in Java, it is compatible with a large spectrum of Web application servers that support any Java-based technology, such as JSP version 1.1 and higher. The API uses JDBC connection pooling for scalability.

The Java API does not impose any HTML rendering elements. The application can completely customize the HTML interface. For example, you can build the following:

Basic search form
Advanced search form
Query result display
Help page
Feedback page
Register URL

You embed Oracle Ultra Search query functionality in your Web application with the supplied Oracle Ultra Search Java query API. The API supports two kinds of methods:

Methods that retrieve query result data only.
Methods that retrieve HTML code containing query result data.

The data-only methods do not return any HTML and can be used when you require full control over the HTML code to be rendered. The methods that retrieve HTML code support features such as allowing you to embed query input boxes and result lists in your Web application.

With the Oracle Ultra Search Java query API you can:

Retrieve query results
Set query properties, such as the total number of hits to return, and so on
Set the query session language
Access Oracle Ultra Search tables to retrieve Oracle Ultra Search dictionary data, such as all defined data groups and attributes
Customize and generate your query interface and search result screen with procedures that return blocks of HTML code that you can embed into your Web application
Let search end users submit URLs to the seed URL list

The Oracle Ultra Search Java query API is encapsulated in the oracle.ultrasearch.query package.

See Also:

"Tuning Query Performance"

Customizing the Query Syntax Expansion

Oracle Ultra Search uses the Oracle Text engine to index and search documents. When a user specifies a certain query string, Oracle Ultra Search takes that string and transforms it into an Oracle Text query expression. This process is called query syntax expansion.

You can customize Oracle Ultra Search to use your own implementation of the query syntax expansion.

The default query expansion lets you specify a query syntax similar to most internet search engines. The syntax boosts scores for documents that match the user's query in the document title string attribute. The syntax for Contains is the same when used on the document content and on string attributes.

The default query syntax expansion is implemented in the oracle.ultrasearch.query.Contains class. To customize query expansion, use the oracle.ultrasearch.query.CtxContains class.

This section describes the default query expansion rules, and how to customize the query syntax expansion to suit your organization's preferences.

Default Query Syntax Expansion Implementation

The default query syntax expansion implementation directly affects the following:

End user query syntax: The way a query string is entered
Scoring: The way the documents matching the query are scored
Expansion rules: The way the user's query string is transformed into an Oracle Text query string

The default query syntax expansion is implemented in the oracle.ultrasearch.query.Contains class. The query applications use this syntax expansion for content search as well as string attribute search.

End User Query Syntax

The end user query syntax defined by the default query syntax expansion implementation is similar to the standard text query syntax employed by most search engines on the Web.

Token: A token is a single word, or a phrase enclosed in double quotes (").
Operators: The default implementation defines three operators: [+], [-], and [*]. You can change these operators to whatever you prefer in your own custom implementation.

The plus operator [+] specifies that the token immediately following it must appear in all documents included in the search result.

The minus operator [-] specifies that the token immediately following it cannot appear in any document included in the search result.

The asterisk [*] specifies a wildcard search. It matches zero or more characters. A token starting with the asterisk is ignored. The asterisk can only be specified at the end (right side) or middle of a token. For example, "hel*o" and "hell*" use the asterisk correctly, but "*ello" is unacceptable.

The following table summarizes the rules for the Oracle Ultra Search end user query syntax:

Note:

All end user query strings are shown in square brackets. For example, the end user query string Oracle Applications is written as [Oracle Applications].

Rule	Description
Single word search	Entering one word finds documents that contain that word. For example, searching for [Oracle] finds all documents that contain the word "Oracle" anywhere in that document. Note: Searching for [Oracle] is not equivalent to [Oracle*].
Multiple word search	Entering more than one word finds documents that contain any of the words in any order. For example, searching for [Oracle Applications] finds documents that contain "Oracle" or "Applications" or "Oracle Applications."
Compulsory inclusion [+]	Attaching a [+] in front of a word requires that the word be found in all matching documents. For example, searching for [Oracle + Applications] only finds documents that contain the word "Applications." Note: In a multiple word search, you can attach a [+] in front of every token including the very first token.
Compulsory exclusion [-]	Attaching a [-] in front of a word requires that the word not be found in any matching document. For example, searching for [Oracle - Applications] only finds documents that do not contain the word "Applications". Note: In a multiple word search, you can attach a [-] in front of every token except the very first token.
Phrase matching ["..."]	Putting quotes around a set of words only finds documents that contain that precise phrase. For example, searching for ["Oracle Applications"] finds only documents that contain the string "Oracle Applications."
Wildcard matching [*]	Attaching a [*] to the right-hand side of a word returns left side partial matches. For example, searching for the string [Ora*] finds documents that contain any word beginning with "Ora," such as "Oracle" and "Orator." You can also insert an asterisk in the middle of a word. For example, searching for the string [A*e] retrieves documents that contain words such as "Apple", "Ate", "Ape", and so on. Wildcard matching requires more computational processing power and is generally slower than other types of queries.
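
The rules in the preceding table can be illustrated with a small tokenizer. This is only a sketch of the syntax as described here, not the code inside oracle.ultrasearch.query.Contains; the class name QueryTokenizer, the Token type, and the tolerance for a missing closing quote are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits an end user query string into classified tokens. */
class QueryTokenizer {

    /** One token plus the operator that preceded it ('+', '-', or ' ' for none). */
    static final class Token {
        final char operator;
        final String text;
        final boolean phrase;

        Token(char operator, String text, boolean phrase) {
            this.operator = operator;
            this.text = text;
            this.phrase = phrase;
        }

        @Override
        public String toString() {
            String body = phrase ? "\"" + text + "\"" : text;
            return operator == ' ' ? body : operator + body;
        }
    }

    static List<Token> tokenize(String query) {
        List<Token> tokens = new ArrayList<>();
        char pending = ' ';
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '+' || c == '-') {
                pending = c;                     // applies to the next token
                i++;
            } else {
                boolean phrase = (c == '"');
                String text;
                if (phrase) {
                    int end = query.indexOf('"', i + 1);
                    if (end < 0) end = query.length();   // assumption: tolerate a missing quote
                    text = query.substring(i + 1, end);
                    i = Math.min(end + 1, query.length());
                } else {
                    int end = i;
                    while (end < query.length() && !Character.isWhitespace(query.charAt(end))) end++;
                    text = query.substring(i, end);
                    i = end;
                }
                // Rule from the text: a token starting with the asterisk is ignored.
                if (!text.isEmpty() && !text.startsWith("*")) {
                    tokens.add(new Token(pending, text, phrase));
                }
                pending = ' ';
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Oracle + Applications"));
        System.out.println(tokenize("\"Oracle Applications\" -Forms hel*o *ello"));
    }
}
```

The sketch does not enforce the rule that the first token cannot carry a [-]; a caller would have to reject or ignore that case separately.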

Scoring Classes

There are three ways documents are matched against an end user query string. These three ways are known as scoring "classes." Documents are scored and ranked higher if they satisfy the requirements for a higher class. Within each class, documents are also ranked differently depending on how well they match the conditions of the scoring class.

Class 1 is the highest class. The score is derived from the number of occurrences of a precise phrase in a document. A document that has more instances of the precise phrase has a higher score than another document that has fewer occurrences of the precise phrase.

Class 2 is the next highest class. In this class, the closer the tokens appear in a document, the higher the score becomes. For example, an end user query string [Oracle Applications Financials] can result in three documents found. None of the three documents contain the precise phrase "Oracle Applications Financials." However, document X contains all three tokens "Oracle", "Applications", and "Financials" in the same sentence separated by other words. Document Y contains the individual tokens in the same paragraph but in different sentences. Document Z contains the same three tokens, but each token resides in a different paragraph. In this scenario, document X has the highest score, because the tokens are closest together. Likewise, Y has a higher score than Z.

Class 3 is the last class. A document that has more tokens gets a higher score. For example, an end user query string [Oracle Applications Financials] can result in three documents found. Document X might contain all three tokens. Document Y might contain the tokens "Oracle" and "Applications" only. Document Z might contain only the token "Oracle." In this scenario, document X has a higher score than Y. Likewise, Y has a higher score than Z.
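
The Class 3 ordering in the example above can be reproduced with a simple count of distinct query tokens found in each document. This is a toy illustration only: the real score is computed by Oracle Text, and the class name TokenCountScore, the case-insensitive comparison, and the sample document strings are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of Class 3: more matching tokens means a higher rank. */
class TokenCountScore {

    /** Counts how many of the query tokens occur as whole words in the document. */
    static int matchedTokens(String document, List<String> queryTokens) {
        // Assumption: words are compared case-insensitively and split on non-word characters.
        Set<String> words = new HashSet<>(
                Arrays.asList(document.toLowerCase(Locale.ROOT).split("\\W+")));
        int count = 0;
        for (String token : queryTokens) {
            if (words.contains(token.toLowerCase(Locale.ROOT))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("Oracle", "Applications", "Financials");
        System.out.println(matchedTokens("Oracle Applications include Financials", query)); // document X
        System.out.println(matchedTokens("Oracle Applications overview", query));           // document Y
        System.out.println(matchedTokens("Oracle database", query));                        // document Z
    }
}
```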

Expansion Rules

As mentioned, the end user query is expanded to an Oracle Text query. The expanded query string rules are captured in BNF (Backus Naur Form) notation. These are the rules that Oracle Ultra Search uses in its default query syntax expansion implementation.

The rules that define an expanded query:

<expanded query> ::= (<expression> within <title section>)*2, <expression>

<generic query expression> ::= (([ <plus expression>*100 & ]) (<main expression>)) [ <minus expression> ]

<simple query expression> ::= (<phrase expression>)*2, (<main expression>)

<main expression> ::= (<near expression>)*2, (<accum expression>)

The following list explains the terms used in the preceding rules:

A <plus expression> is an AND expression of all plus tokens.

A <minus expression> is a NOT expression of all minus tokens.

A <phrase expression> is a PHRASE formed by all tokens in the <main expression>

A <near expression> is a NEAR expression of all tokens but minus tokens.

An <accum expression> is an ACCUMULATE expression of all tokens but minus tokens.

A <simple query expression> is used only when the end user query has multiple tokens and does not have any operator or a double quote. Otherwise, a <generic query expression> is used.

If there is no token that is neither plus token nor minus token, then the <plus expression> and the <accum expression> are eliminated.

Examples of Applying the Rules

The following table illustrates how the default query syntax expansion implementation converts end user query strings into Oracle Text compatible query strings.

End User Query String	Expanded Query String Understandable by Oracle Text
[Oracle]	((({Oracle}) within TITLE__31)*2,({Oracle}))
[Oracle + Applications]	((((({Applications})*10)*10&(({Oracle};{Applications})*2,({Oracle},{Applications}))) within TITLE__31)*2,((({Applications})*10)*10&(({Oracle};{Applications})*2,({Oracle},{Applications}))))
[Oracle - Applications]	(((({Oracle})~{Applications}) within TITLE__31)*2,(({Oracle})~{Applications}))
["Oracle Applications"]	((({Oracle Applications}) within TITLE__31)*2,({Oracle Applications}))
[Ora*]	((((Ora%)) within TITLE__31)*2,((Ora%)))
[Oracle Applications]	(((({Oracle Applications})*2,(({Oracle};{Applications})*2,({Oracle},{Applications}))) within TITLE__31)*2,(({Oracle Applications})*2,(({Oracle};{Applications})*2,({Oracle},{Applications}))))
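
To make the string shapes in the table easier to follow, the following sketch assembles an expanded query from already-separated tokens. It is a re-creation based only on the query shapes listed in the table, not the implementation of oracle.ultrasearch.query.Contains. The class name ExpansionSketch and the calling convention (a leading + or - on a token, a phrase passed as one token containing spaces) are assumptions, and queries with more than one operator token are deliberately rejected because the table does not show them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical re-creation of the default expansion for the query shapes in the table. */
class ExpansionSketch {
    static final String TITLE_SECTION = "TITLE__31";   // section name as it appears in the table

    /** Wildcard tokens become (Ora%); every other token is escaped as {token}. */
    static String render(String token) {
        return token.contains("*") ? "(" + token.replace('*', '%') + ")" : "{" + token + "}";
    }

    static String expand(String... tokens) {
        List<String> nonMinus = new ArrayList<>();
        String plus = null;
        String minus = null;
        boolean simple = tokens.length > 1;             // multiple tokens, no operator, no quotes
        for (String t : tokens) {
            if (t.startsWith("+") || t.startsWith("-")) {
                if (plus != null || minus != null) {
                    throw new UnsupportedOperationException("sketch handles one operator token");
                }
                if (t.charAt(0) == '+') {
                    plus = render(t.substring(1));
                    nonMinus.add(plus);
                } else {
                    minus = render(t.substring(1));
                }
                simple = false;
            } else {
                if (t.contains(" ") || t.contains("*")) simple = false;
                nonMinus.add(render(t));
            }
        }
        if (nonMinus.isEmpty()) throw new IllegalArgumentException("no searchable token");

        // <main expression>: NEAR (;) weighted by 2, accumulated (,) with ACCUMULATE.
        String main = nonMinus.size() == 1
                ? "(" + nonMinus.get(0) + ")"
                : "((" + String.join(";", nonMinus) + ")*2,(" + String.join(",", nonMinus) + "))";

        String expr;
        if (simple) {
            // <simple query expression>: the whole input as a phrase, weighted by 2.
            expr = "(({" + String.join(" ", tokens) + "})*2," + main + ")";
        } else {
            expr = main;
            if (plus != null) expr = "(((" + plus + ")*10)*10&" + main + ")";
            if (minus != null) expr = "(" + main + "~" + minus + ")";
        }
        // <expanded query>: matches inside the title section count twice.
        return "((" + expr + " within " + TITLE_SECTION + ")*2," + expr + ")";
    }

    public static void main(String[] args) {
        System.out.println(expand("Oracle"));
        System.out.println(expand("Oracle", "-Applications"));
        System.out.println(expand("Ora*"));
    }
}
```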

Customizing the Rules

Customize this expansion to suit your organization's purposes by defining and implementing your own query syntax expansion. You should have detailed understanding of Oracle Text queries using the ctxsys.contains operator. Oracle Text offers a rich set of linguistic features, such as thesaurus, theme, stemming, and soundex as a part of its query language.

To customize Oracle Ultra Search and use your own implementation of the query syntax expansion, use the oracle.ultrasearch.query.CtxContains class in your query application instead of the oracle.ultrasearch.query.Contains class. CtxContains lets you use any Oracle Text query as a part of an Oracle Ultra Search query. Perform the following steps:

Construct an Oracle Text query based on the user's input. For example, if the user's input is "cat", using the stemming feature, you can construct a Text query "$cat", which will find documents with "cat" or "cats". You can use any tool to construct the Text query, as long as it is a string object. Depending on the complexity of the user's query syntax, you might want to leverage some existing lexers in Java.

Construct a CtxContains using the Text query. For example:

String textQuery = "$cat";
oracle.ultrasearch.query.Query query = new oracle.ultrasearch.query.CtxContains(textQuery);

The preceding code constructs a query for documents with "cat" or "cats". You can also limit that query to document titles (not content) as follows:

String textQuery = "cat";
StringAttribute titleAttribute = instanceMetaData.getStringAttribute("TITLE");
oracle.ultrasearch.query.Query query = new oracle.ultrasearch.query.CtxContains(textQuery, titleAttribute);

You can optionally combine the CtxContains with any other Oracle Ultra Search query by joining them with the And/Or query operators.
Run the query by invoking the getResult method with the constructed query object.
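
The first step, constructing the Text query string, could look like the following sketch. It relies only on the Oracle Text stem operator ($) mentioned above and the Oracle Text AND operator (&). The class name StemQueryBuilder and the policy of stripping everything except letters and digits are assumptions chosen for the example, and Oracle Text reserved words in the input are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns plain user input into a stemmed Oracle Text query. */
class StemQueryBuilder {

    /** Applies the stem operator ($) to each word and requires all words with AND (&). */
    static String buildStemQuery(String userInput) {
        List<String> terms = new ArrayList<>();
        for (String word : userInput.trim().split("\\s+")) {
            // Assumption: keep only letters and digits so no Oracle Text operator leaks through.
            String cleaned = word.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!cleaned.isEmpty()) {
                terms.add("$" + cleaned);
            }
        }
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("no searchable words in input");
        }
        return String.join(" & ", terms);
    }

    public static void main(String[] args) {
        System.out.println(buildStemQuery("cat"));
        System.out.println(buildStemQuery("black cat!"));
    }
}
```

The returned string is what would be passed to the CtxContains constructor in the second step.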

See Also:

Oracle Ultra Search Java API Reference for detailed information on the oracle.ultrasearch.query.CtxContains API

Oracle Ultra Search Query Tag Library

On top of the Java query API, Oracle Ultra Search provides a JSP tag library as an alternative for developing search applications. Based on the Sun Microsystems JavaServer Pages specification version 1.1, the Oracle Ultra Search tag library better separates the dynamic/Java development effort from the static/HTML development effort, and enables Web developers who are unfamiliar with Java to incorporate search functionality into their applications.

The Oracle Ultra Search tag library provides a subset of the features in the Java query API. Advanced features, such as custom query expansion and URL submission, are not available as tags. The main features of the tag library are the following: the ability to retrieve search attributes, groups, languages, and LOVs for rendering the advanced query form; and the ability to iterate through the result set and retrieve document attributes and properties for rendering the result page.

The tag library is summarized in the following table:

Tag	Description	Attributes
instance	This tag establishes a connection to an Oracle Ultra Search instance.	instanceId username password url dataSourceName instanceName tablePagePath emailPagePath filePagePath
iterAttributes	For an advanced query, use this tag to show the list of attributes available.	instance locale
iterGroups	For an advanced query, use this tag to show the list of groups.	instance locale
iterLanguages	For an advanced query, use this tag to show the list of languages defined in the instance.	instance
iterLOV	Show all values defined for a search attribute.	instance locale attributeName attributeType
getResult	Perform the search.	resultId instance query queryLocale documentLanguage from to boostTerm withCount
fetchAttribute	This is a nested tag within getResult to specify which attributes of each document should be fetched along with the query results. There can be any number of nested fetchAttribute tags.	attributeName attributeType
showHitCount	If withCount="true" in the getResult tag, then the result includes a total number of hits, and you can use showHitCount to display this number.	result
iterResult	Renders the results of the search.	result instance
showAttributeValue	Renders a document attribute.	attributeName attributeType

Details of these tags are described in the following subsections. Note the following requirements for using Oracle Ultra Search tags:

Install the file ultrasearch_query.jar and include it in classpath or the WEB-INF/lib directory of the Web application. This file is provided with the Oracle Ultra Search installation under the ultrasearch/lib directory.
Make sure that the tag library description file, ultrasearch-taglib.tld, is deployed with the application and is in the location specified in the taglib directives of your JSP pages, such as in the following example: <%@ taglib uri="/WEB-INF/ultrasearch-taglib.tld" prefix="US" %>

The Oracle Ultra Search tag library definition (TLD) file can be found in $ORACLE_HOME/ultrasearch/sample/query/WEB-INF/ultrasearch-taglib.tld after sample.ear has been deployed. It is also packaged with ultrasearch_query.jar under the name META-INF/taglib.tld.

Query Tag Descriptions

The following section describes each Oracle Ultra Search tag, its attributes, and action. Examples are shown without any static HTML, which can be inserted to format the output.

<instance> Tag: Connecting to the Oracle Ultra Search Instance

This tag establishes a connection to an Oracle Ultra Search instance. Some basic parameters must be established for this tag to work, such as JDBC connection string, schema user name/password, Oracle Ultra Search instance name, and so on.

Attribute Name	Description
instanceId="name"	Names the instance defined by this tag. This name is then used by other Oracle Ultra Search tags to specify the instance being searched.
username	The database user name used to create the database connection.
password	The password used to create the database connection.
url	The JDBC URL used to create the database connection. This attribute is optional if dataSourceName is specified.
dataSourceName	The JNDI name that identifies a JDBC data source. Set either url or dataSourceName. This attribute is optional if url is specified.
instanceName	The name of the Oracle Ultra Search instance that is owned by the schema user. If the schema user owns only one Oracle Ultra Search instance, then this is optional.
tablePagePath	The URL path of the Web application that renders the contents of a database table.
emailPagePath	The URL path of the Web application that renders the contents of an e-mail.
filePagePath	The URL path of the Web application that renders the contents of a file.

This tag defines a scripting variable with the name set by the instanceId attribute. All the other tag attributes correspond to a property in the oracle.ultrasearch.query.QueryInstance class. Set either the url or the dataSourceName attribute; they are mutually exclusive.

The following example uses the url attribute to connect to the database.

<US:instance 
 instanceId="mybookstore"
 url="jdbc:oracle:thin:@dbhost:1521:inst1"
 username="scott"
 password="tiger"
 tablePagePath="../display.jsp"
 emailPagePath="../mail.jsp"
 filePagePath="../display.jsp"
/>

<iterAttributes> Tag: Show All Search Attributes

When a user wants to perform an advanced query, the application needs to show the list of attributes that are available, the list of groups, and the list of languages defined in the instance. This can be done using some iteration tags that define script variables for page rendering.

Each attribute in Oracle Ultra Search has a name, a type, and a display name that is translated depending on the locale that is set for the QueryInstance tag. The attribute type should be used to determine which operators can be used on this attribute and how to parse the user's input.

Attribute Name	Description
instance="name"	This is a mandatory attribute to refer to the object defined by the instance tag.
locale="locale"	This determines the display name fetched using this tag.

This tag is an iteration tag. It loops through all the search attributes in the instance referred to by the instance tag attribute. In each loop, it defines a scripting variable named "attribute", which is an oracle.ultrasearch.query.Attribute object. It also defines a string variable named "displayname", which is the localized name of the attribute.

The following example shows all the attributes in "mybookstore" instance, using their English display names.

<US:iterAttributes instance="mybookstore" locale="<%=Locale.ENGLISH%>" >
<%= attribute %>
<%= displayname %>
</US:iterAttributes>

<iterGroups> Tag: Show All Search Groups

Similar to the iterAttributes tag, the iterGroups tag iterates through all the groups defined in an instance.

Attribute Name	Description
instance="name"	This is a mandatory attribute to refer to the object defined by the instance tag.
locale="locale"	This determines the display name fetched using this tag.

This tag loops through all the search groups in the instance referred to by the instance tag attribute. In each loop, it defines a scripting variable named "group", which is an oracle.ultrasearch.query.Group object. It also defines a string variable named "displayname", which is the localized name of the group.

The following example shows all the groups in "mybookstore" instance, using their English display names.

<US:iterGroups instance="mybookstore" locale="<%=Locale.ENGLISH%>" >
<%= group %>
<%= displayname %>
</US:iterGroups >

<iterLanguages> Tag: Show All Search Languages

Similar to the iterAttributes tag, the iterLanguages tag iterates through all the languages defined in an instance. Because each language is defined by a java.util.Locale object, its display name is not handled by Oracle Ultra Search. Therefore, this tag does not define the displayname scripting variable.

Attribute Name	Description
instance="name"	This is a mandatory attribute to refer to the object defined by the instance tag.

This tag is an iteration tag. It loops through all the search languages in the instance referred to by the instance tag attribute. In each loop, it defines a scripting variable named "language", which is a java.util.Locale object. The display name for the language is provided by Java as a property of the object itself (through the getDisplayName method).

The following example shows all the languages in "mybookstore" instance, using their English display names.

<US:iterLanguages instance="mybookstore">
<%= language %>
<%= language.getDisplayName (Locale.ENGLISH) %>
</US:iterLanguages >

<iterLOV> Tag: Show All Values Defined for a Search Attribute

Attribute Name	Description
instance="name"	This is a mandatory attribute to refer to the object defined by the instance tag.
locale="locale"	This determines the display name fetched using this tag.
attributeName="attname"	The name of the attribute whose LOV is being fetched.
attributeType="string | number | date"	The type of the attribute whose LOV is being fetched. This is required because the attribute name does not uniquely identify an attribute in the instance.

This tag is an iteration tag. It loops through all the values in a search attribute's LOV. In each loop, it defines a scripting variable named "value", which is either a java.lang.String, java.util.Date, or java.math.BigDecimal object, depending on the attribute type. It also defines a string variable named "displayname", which is the localized display name of the value.

The following example shows all the values for a string attribute named "Dept" in "mybookstore" instance, using their English display names.

<US:iterLOV instance="mybookstore" locale="<%=Locale.ENGLISH%>" attributeName="Dept" attributeType="string" >
<%= value %>
<%= displayname %>
</US:iterLOV >

Formulating the Query

Oracle Ultra Search supports a set of classes for building queries. Currently these classes do not have any tag equivalents.

<getResult> Tag: Perform Search

This tag performs the search and returns the result by defining a scripting variable of the type oracle.ultrasearch.query.Result.

Attribute Name	Description
resultId="name"	Names the result generated by this tag. This name is then used by other tags to render the result on the page.
instance="name"	This is a mandatory attribute to refer to the object defined by the instance tag.
query="<%= expression %>"	This specifies a query object to search with.
queryLocale="locale"	This specifies the locale of the query object.
documentLanguage="locale"	This specifies the language of the documents. This is optional. If the language is not specified, then all languages are included in the search.
from="number"	This specifies the index of the first result.
to="number"	This specifies the index of the last result.
boostTerm="string"	This specifies the search term that is used for relevance boosting. This is optional.
withCount="true | false"	This specifies whether the result has an estimate of the total result count. This is optional. If unspecified, the behavior is the same as withCount=false.

The <getResult> tag corresponds to the getResult method of the oracle.ultrasearch.query.QueryInstance class. The attributes of the tag map to the parameters of the method, except that the getResult method can also specify the attributes to fetch. The <getResult> tag requires nested <fetchAttribute> tags to accomplish metadata selection.

The following example searches French documents with a query in English and returns the first 20 results.

<US:getResult 
 resultId="searchresult"
 instance="mybookstore"
 query="<%= query %>"
 queryLocale="<%=Locale.ENGLISH%>"
 documentLanguage="<%=Locale.FRENCH%>"
 from="1" to="20">
</US:getResult>

<fetchAttribute> Tag: Metadata Selection

This tag is used as nested tag inside <getResult>. It specifies which attributes of each document should be fetched along with the query result. Each <getResult> can have any number of nested <fetchAttribute> tags.

Attribute Name	Description
attributeName="attname"	The name of the attribute to fetch.
attributeType="string | number | date"	The type of the attribute to fetch. This is needed because the attribute name does not uniquely identify an attribute in the instance.

Each occurrence of the <fetchAttribute> adds to the list of attributes passed to the getResult invoked by the <getResult> tag.

The following example shows the search in <getResult> tag, fetching title and publication-date attributes of each book.

<US:getResult 
 resultId="searchresult"
 instance="mybookstore"
 query="<%= query %>"
 queryLocale="<%=Locale.ENGLISH%>"
 documentLanguage="<%=Locale.FRENCH%>"
 from="1" to="20">
<US:fetchAttribute 
 attributeName="title"
 attributeType="string" />
<US:fetchAttribute 
 attributeName="publication-date"
 attributeType="date" />
</US:getResult>

<showHitCount> Tag: Show Estimated Hit Count

After the search is performed, the result must be rendered. If withCount=true is in the <US:getResult> tag, then the result contains a count of total hits, and <showHitCount> tag can be used to display it.

Attribute Name	Description
result="name"	This refers to the resultId specified in the <US:getResult> tag.

The following example shows the hit count of a search result.

<US:showHitCount result="searchresult" />

<iterResult> Tag: Render the Results

This tag is an iteration tag. It loops through all the documents in a search result.

Attribute Name	Description
result="name"	This refers to the resultId specified in the <US:getResult> tag.
instance="name"	This refers to the instanceId specified in the <US:instance> tag.

The tag loops through all the documents in a search result and defines a scripting variable "doc" that is an oracle.ultrasearch.query.Document object. In addition, it can have nested <showAttributeValue> tags, which help to render the document's attributes. The result must come from a search on the specified instance; otherwise, an error occurs.

The following example shows the URL of all documents in a search result.

<US:iterResult
result="searchresult" 
instance="mybookstore">
</US:iterResult>

<showAttributeValue> Tag: Render a Document Attribute

This tag shows an attribute of a document within the <US:iterResult> tag.

Attribute Name	Description
attributeName="attname"	The name of the document attribute.
attributeType="string | number | date"	The type of the document attribute. This is needed because the attribute name does not uniquely identify an attribute in the instance.
default="default string"	A value to output when the document has no value for this attribute. This is useful when a document has no title. The string "No Title" can be displayed as the default value.

This tag looks up the document attribute value and displays it on the page. If the attribute was not fetched as a part of the search result, then no output is displayed.

The following example shows the title and publication dates of all documents in a search result.

<US:iterResult
result="searchresult" 
instance="mybookstore">
<US:showAttributeValue attributeName="title" attributeType="string" default="No Title" />
<US:showAttributeValue attributeName="publication-date" attributeType="date" />
</US:iterResult>

Oracle Ultra Search Crawler Agent API

You can implement a crawler agent to crawl and index a proprietary document repository, such as Lotus Notes or Documentum. In Oracle Ultra Search, the proprietary repository is called a user-defined data source. The module that enables the crawler to access the data source is called a crawler agent.

The agent collects the document URLs and associated metadata from the user-defined data source and returns the information to the Oracle Ultra Search crawler, which then enqueues it for later crawling. The crawler agent must be implemented in Java using the Oracle Ultra Search crawler agent API.

Oracle Ultra Search provides a sample implementation of user-defined crawler agents using the Oracle Ultra Search agent API. Upon invocation, this sample agent connects to a specified Oracle database and retrieves the contents of a table for the crawler to collect and index.

The sample agents are fully functional and can be customized to adapt to other database-based data sources. These agents perform the following tasks:

Read data source parameters
Connect to the database that contains the data source
Initialize fetching document URL and attributes from the data source
Fetch document URL and attributes from the data source
Disconnect from the data source
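
The task sequence above (read parameters, connect, fetch one URL at a time, disconnect) can be pictured with the following self-contained sketch. The SketchAgent interface, the TableAgent class, and the displayUrlPrefix parameter are invented for this illustration; they are not the Oracle Ultra Search crawler agent API, whose actual interfaces are described in the Java API reference.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the agent lifecycle; none of these types come from Oracle. */
class AgentLifecycleSketch {

    /** Invented stand-in for a crawler agent. */
    interface SketchAgent {
        void open(Map<String, String> parameters);   // read parameters and connect
        String nextUrl();                            // next document URL, or null when exhausted
        void close();                                // disconnect from the data source
    }

    /** In-memory agent that streams the rows of a pretend table, one URL at a time. */
    static final class TableAgent implements SketchAgent {
        private final List<String> rowKeys;
        private Iterator<String> cursor;
        private String baseUrl;

        TableAgent(List<String> rowKeys) {
            this.rowKeys = rowKeys;
        }

        public void open(Map<String, String> parameters) {
            baseUrl = parameters.get("displayUrlPrefix");   // hypothetical parameter name
            cursor = rowKeys.iterator();
        }

        public String nextUrl() {
            return cursor.hasNext() ? baseUrl + cursor.next() : null;
        }

        public void close() {
            cursor = null;
        }
    }

    /** Crawler side: drain the agent into the queue and always disconnect afterwards. */
    static List<String> enqueueAll(SketchAgent agent, Map<String, String> parameters) {
        List<String> queue = new ArrayList<>();
        agent.open(parameters);
        try {
            for (String url = agent.nextUrl(); url != null; url = agent.nextUrl()) {
                queue.add(url);
            }
        } finally {
            agent.close();
        }
        return queue;
    }
}
```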

Crawler Agent Overview

A crawler agent does the following:

Authenticates the crawler for accessing the data source
Provides access to the data source document through an HTTP URL (display URL)
Provides the metadata of the document in the form of document attributes
Maps each document attribute to a common attribute name used by end users
Provides a "flattened" view of the data source, such that documents are retrieved one by one in a streaming fashion
Instructs the crawler to parse the URL document for standard metadata, like author and title, if necessary
Optionally provides the list of URLs that have changed since a given time stamp
Optionally provides an access URL in addition to the display URL for the processing of the document

From the crawler's perspective, the agent retrieves the list of URLs from the target data source and saves it in the crawler queue before processing it.

Note:

If the crawler is interrupted for any reason, then the agent invocation process is repeated with the original last crawl time stamp. If the crawler has finished enqueuing URLs fetched from the agent and is halfway through crawling, then the crawler only starts the agent, but does not try to fetch URLs from the agent. Instead, it finishes crawling the URLs already enqueued.

There are two kinds of crawler agents:

Standard Agent
Smart Agent

Standard Agent

The standard agent returns the list of URLs currently existing in the data source. It does not know whether any of the URLs had been crawled before, and it relies on the crawler to find any updates to the target data source. The standard agent's interaction with the crawler is the following:

Crawler marks all existing URLs of this data source for garbage collection, assuming they no longer exist in the target data source.
Crawler calls the agent to get an updated list of URLs. Every returned URL that already exists is marked for crawling. Every returned URL that is new is inserted into the URL table and queue.
Crawler deletes the URLs that are still marked for garbage collection.
Crawler goes through every URL marked for crawling and checks for updates.
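
The four steps above amount to a mark-and-sweep pass over the URL table. The following is a minimal sketch of that idea, not the crawler's actual implementation: StandardAgentCycle, its urlTable map, and its queue list are hypothetical in-memory stand-ins for the crawler's internal tables.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the crawler's URL table; not an Oracle Ultra Search class.
class StandardAgentCycle {
    enum Mark { GARBAGE, CRAWL }

    final Map<String, Mark> urlTable = new LinkedHashMap<>();
    final List<String> queue = new ArrayList<>();

    /** Runs one crawl cycle against the URL list a standard agent returned. */
    List<String> run(List<String> urlsFromAgent) {
        // Step 1: assume every known URL has disappeared from the data source.
        urlTable.replaceAll((url, mark) -> Mark.GARBAGE);
        // Step 2: URLs the agent still reports are marked for crawling; new ones are also enqueued.
        for (String url : urlsFromAgent) {
            if (urlTable.put(url, Mark.CRAWL) == null) {
                queue.add(url);
            }
        }
        // Step 3: delete whatever is still marked for garbage collection.
        List<String> deleted = new ArrayList<>();
        urlTable.entrySet().removeIf(entry -> {
            if (entry.getValue() == Mark.GARBAGE) {
                deleted.add(entry.getKey());
                return true;
            }
            return false;
        });
        // Step 4 (checking each remaining URL for updates) is left to the crawler.
        return deleted;
    }
}
```

Calling run a second time with a list that omits a previously returned URL reports that URL as deleted, which is how a standard agent lets the crawler detect removals without tracking them itself.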

Smart Agent

The smart agent uses a modified-since time stamp (provided by the crawler) to return the list of URLs that have been updated, inserted, and deleted. The crawler only crawls URLs returned by the agent and does not recrawl existing ones. For URLs that were deleted, the crawler removes them from the URL table. If the smart agent can only return updated or inserted URLs but not deleted URLs, then deleted URLs are not detected by the crawler. In this case, you must change the schedule crawler recrawl policy to periodically run the schedule in force recrawl mode. Force recrawl mode signals to the agent to return every URL in the data source.

The agent API isDeltaCrawlingCapable tells the crawler whether the agent it invokes is a standard agent or a smart agent. The agent API startCrawling(boolean forceRecrawl, Date lastCrawlTime) allows the crawler to tell the agent the last crawl time and whether the crawler is running in force recrawl mode.
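
As an illustration of how a smart agent might use these two calls, here is a minimal sketch. It assumes that isDeltaCrawlingCapable returns a boolean and that startCrawling returns nothing. DeltaAwareAgent is a hypothetical stand-in for the real CrawlerAgent interface, which has more methods. InMemorySmartAgent, addDocument, and pendingUrls are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the two CrawlerAgent methods named above;
// the real interface has more methods and its exact signatures may differ.
interface DeltaAwareAgent {
    boolean isDeltaCrawlingCapable();
    void startCrawling(boolean forceRecrawl, Date lastCrawlTime);
}

// A smart agent over an in-memory "repository" of URL -> last-modified time.
class InMemorySmartAgent implements DeltaAwareAgent {
    private final Map<String, Date> repository = new LinkedHashMap<>();
    private final List<String> pending = new ArrayList<>();

    void addDocument(String url, Date lastModified) {
        repository.put(url, lastModified);
    }

    @Override
    public boolean isDeltaCrawlingCapable() {
        return true; // a standard agent would return false here
    }

    @Override
    public void startCrawling(boolean forceRecrawl, Date lastCrawlTime) {
        pending.clear();
        for (Map.Entry<String, Date> doc : repository.entrySet()) {
            // Force recrawl (or a first crawl with no time stamp) returns every URL;
            // otherwise only documents modified after the last crawl are returned.
            if (forceRecrawl || lastCrawlTime == null || doc.getValue().after(lastCrawlTime)) {
                pending.add(doc.getKey());
            }
        }
    }

    // Hypothetical accessor used here instead of the real URL-fetching methods.
    List<String> pendingUrls() {
        return pending;
    }
}
```

This sketch handles only inserted and updated documents. An agent that also needs to report deletions would have to keep its own record of removed URLs, which is outside the scope of the example.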

Document Attributes and Properties

Document attributes, or metadata, describe document properties. Some attributes can be irrelevant to your application. The crawler agent creator must decide which document attributes should be extracted and saved. The agent also can be created such that the list of collected attributes is configurable. Oracle Ultra Search automatically registers attributes returned by the agent. The agent can decide which attributes to return for a document.

Library Path and Java Class Path

Any other Java class needed by the agent should be included in the agent jar file. This is because Oracle Ultra Search automatically adds the agent jar file to the crawler Java class path, and Oracle Ultra Search does not allow you to add other class paths from the administration interface. To add a new class path, see Appendix B, "Altering the Crawler Java Classpath".

If the agent code also relies on a particular library file (for example, a .dll file on Windows or a .so file on UNIX), then the library path environment variable (PATH on Windows, LD_LIBRARY_PATH on UNIX) must contain the path to it. Make sure that Oracle is started from this environment. Because the crawler is spawned by the Oracle process, it automatically inherits all environment variables from Oracle, including the library path.

Crawler Agent Functionality

This section describes aspects of the crawler agent.

Data Source Type Registration

A data source type is an abstraction of a data source. You can define new data source types with the following attributes:

Name of data source type: For example, Lotus Notes. The name cannot be more than 100 bytes.
ID of data source type: This is automatically assigned.
Description of the data source type: This limit is 4000 bytes.
Agent Java class name: For example, WebDbAgent. The location of this class is predefined by Oracle Ultra Search in $ORACLE_HOME/ultrasearch/lib/agent/ and cannot be changed.
Agent Java jar file name: The agent class can be stored in a Java jar file. This jar file must be in $ORACLE_HOME/ultrasearch/lib/agent/, where $ORACLE_HOME is the Oracle home directory where the Oracle Ultra Search backend, not the middle tier, is installed.
Parameters: Parameters are the properties of a data source; for example, seed URL, inclusion pattern, and robots exclusion for a Web data source. Define a parameter by specifying a parameter name (100 bytes maximum) and a description (4000 bytes maximum). By default, a parameter is not encrypted.
Encryption: Should the value of this parameter be encrypted when stored.

Oracle Ultra Search does not enforce the occurrence of parameters. You cannot specify a particular parameter to have 0 or more, at least 1, or only 1 occurrence.

Agent Class Dependency

A crawler agent class can depend on other libraries. For example, suppose the agent class depends on log4j.jar, Apache's logging facility. If log4j is not on the crawler class path, then the crawler hangs when the log4j logger class is first referenced in the agent Java class. This section describes how to add such a dependency to the class path used by Oracle Ultra Search. You can bundle the log4j classes and the agent class into one jar file and specify this jar file for the custom type in the administration tool. Alternatively, you can package your own MANIFEST.MF file in the agent jar file and use it to specify a dependent library path, which the JVM loads. For example, assume that you are in the /home/pdevulap/make_jar directory and that you have the following files and directories:

META-INF/MANIFEST.MF
oracle/marketing/search/crawleragent/SearchCrawlerAgent.java + other class files
log4j-1.2.8.jar

Example of the contents of MANIFEST.MF:

Manifest-Version: 1.0
Created-By: Praveen Devulapalli
Main-Class: oracle.marketing.search.crawleragent.SearchCrawlerAgent
Class-Path: log4j-1.2.8.jar

If the agent depends on other jars, list them in the Class-Path entry, separated by spaces.

Create the agent jar file:
```
jar cvfm SearchCrawlerAgent.jar META-INF/MANIFEST.MF oracle/marketing/search/crawleragent/*
```
This creates SearchCrawlerAgent.jar, along with the manifest file specified. In creating the jar file, use lowercase 'm' to tell the jar utility to use your manifest file.

Note: log4j is NOT included in this jar file.
Move the jar files to the appropriate location of Oracle Ultra Search; that is, $ORACLE_HOME/ultrasearch/lib/agent/SearchCrawlerAgent.jar and $ORACLE_HOME/ultrasearch/lib/agent/log4j-1.2.8.jar

To update the existing agent jar file with the new class path:

Check if the jar already has a MANIFEST.MF file:
```
% jar tf my.jar
META-INF/MANIFEST.MF
a/b/c/d/xyz.class
...
```
If the jar does not have a MANIFEST.MF file, then skip the extraction step that follows.

Extract the manifest file:

% jar xf my.jar META-INF/MANIFEST.MF  --you must specify the complete path as it appears in the jar
 
META-INF/MANIFEST.MF

Change the contents of MANIFEST.MF, and update the jar file with the new version:
```
% jar umf META-INF/MANIFEST.MF my.jar
```
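
Before copying the jar files into place, it can help to confirm that the manifest carries the Class-Path entry you expect. The following is a minimal sketch using the standard java.util.jar classes. ManifestClassPath and entries are hypothetical names, and the manifest is read from a string so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class ManifestClassPath {
    /** Returns the space-separated Class-Path entries of a manifest, or an empty list. */
    static List<String> entries(String manifestText) {
        Manifest manifest;
        try {
            manifest = new Manifest(
                    new ByteArrayInputStream(manifestText.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String classPath = manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        if (classPath == null || classPath.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(classPath.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Every manifest line, including the last, must end with a line break.
        String text = "Manifest-Version: 1.0\n"
                + "Main-Class: oracle.marketing.search.crawleragent.SearchCrawlerAgent\n"
                + "Class-Path: log4j-1.2.8.jar\n";
        System.out.println(entries(text));
    }
}
```

To check a jar file on disk instead, the manifest can be obtained with java.util.jar.JarFile and its getManifest method and inspected in the same way.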

Data Source Registration

After a data source type is defined, any instance of that data source type can be defined:

Data source name
Description of the data source, limit to 4000 bytes
Data source type ID
Default language; default is 'en' (English)
Parameter values, for example, seed - http://www.oracle.com depth - 8

Data Source Attribute Registration

You can add new attributes to Oracle Ultra Search by providing the attribute name and the attribute data type. The data type can be string, number, or date. Attributes with the same name but different data type can be added. Attributes returned by an agent are automatically registered if they have not been defined.

User-Implemented Crawler Agent

The crawler agent has the following requirements:

The agent must be implemented in Java.
The agent must support the Java agent APIs defined by Oracle Ultra Search.
The agent must return the URL attributes and properties.
The agent optionally can authenticate the crawler's access to the data source.
The agent must "flatten" the data source such that each document is retrieved one by one in a streaming fashion. This is to encapsulate the crawling logic of a specific data source into the agent.
The agent must decide which document attributes Oracle Ultra Search should keep. Any attribute not defined in Oracle Ultra Search is registered automatically.
The agent can map attributes to data source properties. For example, if an attribute "ID" is the unique ID of a document, then the agent should return (document_key, 4) where "ID" has been mapped to the property "document_key" and its value is 4 for this particular document.
If an attribute list of values (LOV) is available, then the agent returns it upon request.

Interaction Between the Crawler and the Crawler Agent

The crawler crawls data sources defined by the user through the invocation of the user-supplied crawler agent. The crawler can do the following:

Invoke the crawler agent of the defined data source
Supply data source parameter information to the agent
Authenticate itself with the agent if needed
Retrieve a list of URLs and associated attributes/properties that must be crawled
Use the URL provided by the agent to retrieve the document
Detect insert, update, and delete to the data source
Retrieve attribute LOV data if available

Crawler Agent APIs and Classes

The crawler agent API is a collection of methods used to implement a crawler agent. A sample implementation of a crawler agent SampleAgent.java is provided under $ORACLE_HOME/ultrasearch/extension/.

UrlData: The crawler agent uses this interface to populate document properties and attribute values. Oracle Ultra Search provides a basic implementation of this interface that the agent can use directly or extend if necessary. The class is DocAttributes, which has a no-argument constructor. The agent might decide to create a pool of UrlData objects and cycle through them during crawling. In the simplest implementation, the agent creates one DocAttributes object, repeatedly resets and populates the data, and returns this object.

LovInfo: The crawler agent uses this interface to submit attribute LOV definitions.

DataSourceParams: The crawler agent uses this interface to read and write data source parameters.

AgentException: The crawler agent uses this exception class when an error occurs.

CrawlerAgent: This interface lets the crawler communicate with the user-defined data source. The crawler agent must implement this interface.

Sample Agent Files

The sample agent files are located in the $ORACLE_HOME/ultrasearch/extension directory. You can view the agent source code using your preferred text editor.

There is a SampleAgent_readme.htm file and a SampleAgent.java file. These are for the sample crawler agent implementation using agent APIs.

Setting up the Sample Crawler Agent

This section describes how to set up the sample crawler agent.

Compiling and Building the Agent Jar File

The Java source code for the sample agent first must be compiled into class files and put into a jar file in the $ORACLE_HOME/ultrasearch/lib/agent/ directory, where $ORACLE_HOME is the Oracle home directory where the Oracle Ultra Search backend, not the middle tier, is installed.

The classes needed for compilation are the JDK class (classes.zip), Oracle JDBC Thin Driver (classes12.zip), and ultrasearch.jar. For example:

$ORACLE_HOME/jdk/bin/javac -J-ms16m -J-mx96m -O -classpath $ORACLE_HOME/dbjava/lib/classes12.zip:$ORACLE_HOME/ultrasearch/lib/ultrasearch.jar SampleAgent.java

To build the SampleAgent.jar file, enter the following:

$ORACLE_HOME/jdk/bin/jar cv0f $ORACLE_HOME/ultrasearch/lib/agent/SampleAgent.jar SampleAgent.class 'SampleAgent$DocNode.class'

Creating a Data Source Type

A data source type that uses the sample agent must be created first.

Name: URL table type
Description: Table with rows of URLs
Agent Name: SampleAgent
Agent Jar File: SampleAgent.jar

Defining Data Source Parameters

Define parameters for a data source type:

Database Connect String (DB connection)
User Name (schema owner of the URL table)
Password (schema owner password, encrypted)
Table Name (URL table name)
URL Column (Column holding doc URLs)
Ignore Flag Column (1 for ignoring, 0 otherwise)
Language Column (Document Language)
Attribute List (List of column for attributes)
It is in the following format: [column name/attribute name/data type][column name/attribute name/data type]... where data type 0 is number, 1 is string, and 2 is date. For example, if the document has four attributes (Company Name, Category, Revenue, and Rating), then the list is specified as: [Company Name/Company/1][Category/Classification/1][Revenue/Revenue/0][Rating/Analyst Rating/1]
Log File Name (log file)
Log Directory (Location of log file)
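
To make the Attribute List format concrete, the following sketch parses a value written the way the bracketed examples are written: each entry is a column name, an attribute name, and a data type code, separated by slashes. This is an illustration only and is not taken from SampleAgent.java. AttributeListParser and Entry are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeListParser {
    // One parsed [column name/attribute name/data type] entry.
    static final class Entry {
        final String column;
        final String attribute;
        final String type; // "number", "string", or "date"

        Entry(String column, String attribute, String type) {
            this.column = column;
            this.attribute = attribute;
            this.type = type;
        }
    }

    // Optional whitespace, then [column/attribute/code] where code is 0, 1, or 2.
    private static final Pattern ENTRY =
            Pattern.compile("\\s*\\[([^/\\]]+)/([^/\\]]+)/([012])\\]");
    private static final String[] TYPES = {"number", "string", "date"};

    static List<Entry> parse(String attributeList) {
        String text = attributeList.trim();
        List<Entry> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        int consumed = 0;
        // Accept entries only while they follow one another without stray text.
        while (m.find() && m.start() == consumed) {
            entries.add(new Entry(m.group(1).trim(), m.group(2).trim(),
                    TYPES[Integer.parseInt(m.group(3))]));
            consumed = m.end();
        }
        if (consumed != text.length()) {
            throw new IllegalArgumentException("Malformed attribute list near offset " + consumed);
        }
        return entries;
    }
}
```

Rejecting the whole value when any entry is malformed is a design choice for this sketch. It surfaces configuration mistakes when parameters are read rather than partway through a crawl.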

Defining a Data Source of this Type

Defining a data source of this type initializes the data source parameters. For example, the parameter values that follow access a table with the following schema:

CREATE TABLE NEWS (
    ARTICLE_NO    NUMBER,
    NEWS_URL      VARCHAR2(740),
    TITLE         VARCHAR2(200),
    AUTHOR        VARCHAR2(100),
    PUB_DATE      DATE default SYSDATE,
    PUBLISHER     VARCHAR2(100),
    PRICE         NUMBER,
    LANG          VARCHAR2(10),
    IGNORE        NUMBER DEFAULT 0,
    PRIMARY KEY (NEWS_URL)
    );

Database Connect String: dlsun1710:5521:search
User Name: SCOTT
Password: TIGER
Table Name: NEWS
URL Column: NEWS_URL
Ignore Flag Column: IGNORE
Language Column: LANG
Attribute List: [ARTICLE_NO/Article Number/0][TITLE/Article Title/1][AUTHOR/Author/1][PUB_DATE/Report Date/2][PUBLISHER/Newspaper/1][PRICE/Download Cost/0]
Log File Name: testagent.log
Log Directory: /tmp/ultrasearch/

Oracle Ultra Search Java E-mail API

Oracle Ultra Search provides a Java API for accessing archived e-mails. The Oracle Ultra Search query application uses the API to display e-mails addressed to mailing lists that have been indexed by the Oracle Ultra Search system. The API can also be used to build your own custom query application.

The application user-interface logic is entirely controlled in the JSP. Therefore, you can customize the look-and-feel to your needs.

E-mail documents contain valuable information, but they are not structured to find specific relevant information easily. Oracle Ultra Search lets you retrieve and index e-mails on a server that supports the IMAP4 protocol.

An e-mail source is a data source that derives its content from e-mails sent to a specific e-mail address. When the Oracle Ultra Search crawler searches an e-mail source, the crawler collects all e-mails that have the specific e-mail address in any of the "To:" or "Cc:" e-mail header fields.
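
The membership rule above can be sketched as a simple check on header values. This is an illustration, not the crawler's implementation. EmailSourceFilter and belongsToSource are hypothetical names. The sketch assumes case-insensitive address comparison and splits recipients on commas, which does not handle display names that contain commas.

```java
import java.util.List;
import java.util.Locale;

class EmailSourceFilter {
    /** True if listAddress appears in any of the given To: or Cc: header values. */
    static boolean belongsToSource(String listAddress, List<String> toHeaders, List<String> ccHeaders) {
        String wanted = listAddress.trim().toLowerCase(Locale.ROOT);
        return containsAddress(toHeaders, wanted) || containsAddress(ccHeaders, wanted);
    }

    private static boolean containsAddress(List<String> headerValues, String wanted) {
        for (String header : headerValues) {
            // A header may list several comma-separated recipients.
            for (String recipient : header.split(",")) {
                String address = recipient.trim().toLowerCase(Locale.ROOT);
                // Accept both "user@host" and "Display Name <user@host>".
                int lt = address.lastIndexOf('<');
                int gt = address.lastIndexOf('>');
                if (lt >= 0 && gt > lt) {
                    address = address.substring(lt + 1, gt).trim();
                }
                if (address.equals(wanted)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```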

Note:

Oracle Ultra Search stores copies of all retrieved e-mails in the local file system of the Oracle Ultra Search server installation.

A possible application of an e-mail source is where an e-mail source represents all e-mails sent to a mailing list. In such a scenario, multiple e-mail sources are defined where each e-mail source represents an e-mail list.

Oracle Ultra Search e-mail crawling and rendering is built on top of the JavaMail API using Sun Microsystems' reference implementation of JavaMail. This enables Oracle Ultra Search to provide a Java API for accessing indexed e-mails. The API is known as the Oracle Ultra Search Java e-mail API. This API lets you retrieve information such as e-mail header information, e-mail body content, and attachments of an e-mail.

Use this API to embed Oracle Ultra Search e-mail browsing functionality into JavaServer Page (JSP) or servlet-based Web applications. Oracle Ultra Search ships a fully functional JSP Web application that directly uses this API to render indexed e-mails. Because the source code is viewable, you can use it as an example for building your own customized e-mail browser.

JavaMail Implementation

Oracle Ultra Search requires a JavaMail 1.1 compliant implementation. The reference implementation by Sun Microsystems is JavaMail version 1.2. This reference implementation is shipped with Oracle Ultra Search.

Java E-mail API

The Oracle Ultra Search Java e-mail API is encapsulated in the oracle.ultrasearch.query package.

Mailing List Browser Application Files

The mailing list browser applications files are located in the $ORACLE_HOME/ultrasearch/sample/query directory. You can directly view the mailing list browser application source code using your preferred text editor.

The following tables describe all mailing list browser application files, README file, and stylesheets:

File	Description
SampleAgent_readme.html	Readme
mail.css	Style sheet for the e-mail Web application

JavaServer Page Mailing List Browser Applications Files:

File	Description
mail.jsp	Mailing list browser applications that selectively include HTML code returned by other JSP files, depending on what the end user wants to view
mailindex.jsp	JSP page that displays all e-mail sources (mailing lists) of an Oracle Ultra Search instance
mailmsgs.jsp	JSP page that displays all e-mails for an e-mail source (mailing list)
mailreader.jsp	JSP page that displays an e-mail
mailutil.jsp	JSP page that defines various functions that are used by mailreader.jsp

Graphics Files for All Applications:

File	Description
images/ultra_mediumbanner.gif	Oracle Ultra Search banner
images/wsd.gif	Background image used in query application

Setting up the Mailing List Browser Application

For detailed instructions on setting up the JSP mailing list browser application, see "Installing the Oracle Ultra Search Middle Tier".

Oracle Ultra Search URL Rewriter API

A URL rewriter is a user-supplied Java module that implements the Oracle Ultra Search UrlRewriter Java interface. When activated, it is used by the crawler to filter and rewrite extracted URL links before they are inserted into the URL queue.

Web crawling generally consists of the following steps:

Get the next URL from the URL queue. (Web crawling stops when the queue is empty.)
Fetch the contents of the URL.
Extract URL links from the contents.
Insert the links into the URL queue.

The generated new URL link is subject to all existing host, path, and mimetype inclusion and exclusion rules.

There are two possible operations that can be done on the extracted URL link:

Filtering: removes the unwanted URL link
Rewriting: transforms the URL link

URL Link Filtering

Users control what type of URL links are allowed to be inserted into the queue with the following mechanisms supported by the Oracle Ultra Search crawler:

robots.txt file on the target Web site, for example, disallow URLs from the /cgi directory
Hosts inclusion and exclusion rules, for example, only allow URLs from www.acme.com
File path inclusion and exclusion rules, for example, only allow URLs under the /archive directory
Mimetype inclusion rules, for example, only allow HTML and PDF files
Robots metatag NOFOLLOW, for example, do not extract any link from that page
Black list URL, for example, URL explicitly singled out not to be crawled

Note:

All URLs must pass domain rules before being checked for path rules. Path rules let you further restrict the crawling space. Path rules are host-specific, but you can specify more than one path rule for each host. For example, on the same host, you can include path files://host/doc and exclude path files://host/doc/unwanted.

With these mechanisms, only URL links that meet the filtering criteria are processed. However, there are other criteria that users might want to use to filter URL links. For example:

Allow URLs with certain file name extensions
Allow URLs only from a particular port number
Disallow any PDF file if it is from a particular directory

The set of possible criteria could be very large, which is why filtering is delegated to a user-implemented module that the crawler can use when evaluating an extracted URL link.
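
The three example criteria above could be expressed as a single predicate, as in the following sketch. It is plain Java and does not use the UrlRewriter interface. CustomLinkFilter, the allowed extensions, port 8080, and the /drafts/ directory are all hypothetical values chosen for illustration.

```java
import java.net.URI;
import java.util.Locale;

class CustomLinkFilter {
    /** Returns true if the link should be kept, false if it should be discarded. */
    static boolean accept(String link) {
        URI uri;
        try {
            uri = URI.create(link);
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        // Criterion 1: allow only certain file name extensions.
        boolean allowedExtension =
                path.endsWith(".html") || path.endsWith(".htm") || path.endsWith(".pdf");
        // Criterion 2: allow only a particular port (getPort returns -1 if none is given).
        boolean allowedPort = uri.getPort() == 8080;
        // Criterion 3: disallow PDF files from one particular directory.
        boolean blockedPdf = path.endsWith(".pdf") && path.startsWith("/drafts/");
        return allowedExtension && allowedPort && !blockedPdf;
    }
}
```

In a real rewriter, a predicate like this would decide whether the rewrite method discards the link or passes it on.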

URL Link Rewriting

For some applications, due to security reasons, the URL crawled is different from the one seen by the end user. For example, crawling is done on an internal Web site behind a firewall without security checking, but when queried by an end user, a corresponding mirror URL outside the firewall must be used.

A display URL is a URL string used for search result display. This is the URL used when users click the search result link. An access URL is a URL string used by the crawler for crawling and indexing. An access URL is optional. If it does not exist, then the crawler uses the display URL for crawling and indexing. If it does exist, then it is used by the crawler instead of the display URL for crawling.

For regular Web crawling, there are only display URLs available. But in some situations, the crawler needs an access URL for crawling the internal site while keeping a display URL for the external use. For every internal URL, there is an external mirrored one.

For example:

http://www.acme-qa.us.com:9393/index.html
http://www.acme.com/index.html

When the URL link http://www.acme-qa.us.com:9393/index.html is extracted and before it is inserted into the queue, the crawler generates a new display URL and a new access URL for it:

Access URL:

http://www.acme-qa.us.com:9393/index.html

Display URL:

http://www.acme.com/index.html

The extracted URL link is rewritten, and the crawler crawls the internal Web site without exposing it to the end user.
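
The mapping in this example can be sketched as follows. The code is plain Java and does not use the Oracle Ultra Search UrlData type. MirrorRewriter and UrlPair are hypothetical names, and the host names and port are the ones from the example above.

```java
import java.net.URI;
import java.net.URISyntaxException;

class MirrorRewriter {
    // Hypothetical pair of URLs; not the Oracle Ultra Search UrlData type.
    static final class UrlPair {
        final String accessUrl;
        final String displayUrl;

        UrlPair(String accessUrl, String displayUrl) {
            this.accessUrl = accessUrl;
            this.displayUrl = displayUrl;
        }
    }

    static final String INTERNAL_HOST = "www.acme-qa.us.com";
    static final int INTERNAL_PORT = 9393;
    static final String EXTERNAL_HOST = "www.acme.com";

    static UrlPair rewrite(String extractedLink) {
        try {
            URI link = new URI(extractedLink);
            // Links that are not on the internal site keep a single URL for both roles.
            if (!INTERNAL_HOST.equalsIgnoreCase(link.getHost()) || link.getPort() != INTERNAL_PORT) {
                return new UrlPair(extractedLink, extractedLink);
            }
            // Same scheme, path, query, and fragment; external host and default port.
            URI display = new URI(link.getScheme(), null, EXTERNAL_HOST, -1,
                    link.getPath(), link.getQuery(), link.getFragment());
            return new UrlPair(extractedLink, display.toString());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + extractedLink, e);
        }
    }
}
```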

Another example is when the links that the crawler picks up are generated dynamically and can be different (depending on referencing page or other factor) even though they all point to the same page. For example:

http://compete3.acme.com/rt/rt.wwv_media.show?p_type=text&p_id=4424&p_currcornerid=281&p_textid=4423&p_language=us

http://compete3.acme.com/rt/rt.wwv_media.show?p_type=text&p_id=4424&p_currcornerid=498&p_textid=4423&p_language=us

Because the crawler detects different URLs with the same contents only when there is a sufficient number of duplicates, the URL queue could grow to a huge number of URLs, causing excessive URL link generation. In this situation, allow "normalization" of the extracted links so that URLs pointing to the same page have the same URL. The algorithm for rewriting these URLs is application dependent and cannot be handled by the crawler in a generic way.
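
For the two example links above, the only difference is the value of the p_currcornerid parameter, so one possible normalization is to drop that parameter. The following sketch does this with plain string handling. LinkNormalizer is a hypothetical name, and the choice of parameter to drop is application specific. The sketch also assumes that links carry no fragment.

```java
import java.util.ArrayList;
import java.util.List;

class LinkNormalizer {
    /** Removes every occurrence of one query parameter and keeps the rest in order. */
    static String normalize(String link, String volatileParam) {
        int q = link.indexOf('?');
        if (q < 0) {
            return link; // no query string, nothing to normalize
        }
        String base = link.substring(0, q);
        List<String> kept = new ArrayList<>();
        for (String pair : link.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            if (!pair.isEmpty() && !name.equals(volatileParam)) {
                kept.add(pair);
            }
        }
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }
}
```

With this normalization, both example links reduce to the same URL, so the queue holds one entry for the page instead of one for each referencing page.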

When a URL link goes through a rewriter, there are the following possible outcomes:

The link is inserted with no changes made to it.
The link is discarded; it is not inserted.
A new display URL is returned, replacing the URL link for insertion.
A display URL and an access URL are returned. The display URL may or may not be identical to the URL link.

Creating and Using a URL Rewriter

Follow these steps to create and use a URL rewriter:

Create a new Java file implementing the UrlRewriter interface open, close, and rewrite methods. A rewriter, SampleRewriter.java, is available for reference under $ORACLE_HOME/ultrasearch/extension/.

Compile the rewriter Java file into a class file. For example:

/jdk1.3.1/bin/javac -O -classpath $ORACLE_HOME/ultrasearch/lib/ultrasearch.jar SampleRewriter.java

Package the rewriter class file into a jar file under the $ORACLE_HOME/ultrasearch/lib/agent/ directory. For example:
```
/jdk1.3.1/bin/jar cv0f $ORACLE_HOME/ultrasearch/lib/agent/sample.jar SampleRewriter.class 
```
Specify the rewriter class name and jar file name (for example, SampleRewriter and sample.jar) in the administration tool in step 2 of "Creating Web Sources" or in the crawler parameters page of an existing Web data source.
Enable the UrlRewriter option from Web Sources page in the administration tool.
Crawl the target Web data source by launching the corresponding schedule. The crawler log file confirms the use of the URL rewriter with the message Loading URL rewriter "SampleRewriter"...

Note:

URL rewriting is available for Web data sources only.

See Also:

Oracle Ultra Search API Reference for the API (oracle.ultrasearch.crawler package)
The URL rewriter SampleRewriter.java under $ORACLE_HOME/ultrasearch/extension/
"Web Sources"

Oracle Ultra Search Document Service API

The Document Service crawler agent API allows generation of attribute data based on the document contents. It accepts robot metatag instructions from the agent for the target document, and it transforms the original document contents for indexing control.

The document service agent is a user-implemented Java module that implements the DocumentService Java interface such that it can interact with the crawler to provide more control on the crawled document. It is a Java interface in the form of a crawler agent that allows callout during crawling for user-defined document processing. It has the following features:

Returns processed results, by user's choice, to the crawler as document attributes (for example, classification, theme, and gist) to be indexed and made searchable
Allows the agent to issue robots metatag instructions for the target document: whether to follow its links, and whether to index it
Allows transformation and filtering of the original document contents for indexing
Allows assigning one document service agent for all data sources or for a particular data source
Supports definition of document service agent

Each crawling thread has a copy of the service agent object. The document service agent jar file or class file must be under the $ORACLE_HOME/ultrasearch/lib/agent/ directory.

APIs and Classes

The DataSourceParams interface and AgentException class are used by the interface introduced here. They are used by the Oracle Ultra Search user to implement their own service agent. The agent Java code should import the following classes:

import oracle.ultrasearch.crawler.UrlData;

import oracle.ultrasearch.crawler.DocAttributes;

import oracle.ultrasearch.crawler.DataSourceParams;

import oracle.ultrasearch.crawler.AgentException;

Interface DocumentService

Interface DocumentService processes documents for summarization, classification, or any transformation function that takes the document text as input and produces some kind of text output.

boolean open(DataSourceParams params, PrintWriter log) throws AgentException:

This is always the first method that the crawler calls when the agent is loaded. It lets the agent perform initialization work.

The crawler passes the agent parameters through the DataSourceParams interface. The agent should verify the parameter and raise an agent fatal exception if any error is detected.

log is the crawler log file, to which the agent can write any information.

A document service session is established.

void close() throws AgentException:

This is always the last method called by the crawler. It lets the agent perform clean up work.

The document service session is terminated.

int doService(String documentUrl, number urlId, Reader docReader) throws AgentException:

Requests service on the submitted document. This is invoked right before extraction of links and attributes.

This function returns a status code indicating what kind of result it has for this document. The possible codes are:

NO_CHANGE: No further information about this document

FOLLOW_UP: Call getRobotsControl, getAttribute, and getContents to retrieve the value

Any other status code value is treated as NO_CHANGE.

An agent exception is thrown if there is a problem processing the document. Fatal agent exceptions stop the crawler. Warning agent exceptions are treated as NO_CHANGE with the exception printed to the crawler log.

UrlData getAttribute(number urlId):

This is called only when doService returns FOLLOW_UP status code.

The agent returns a UrlData object that contains attribute data for this document. If there is no attribute to be added, then the agent can simply return null. The attribute will be automatically registered if it has not been registered.

int getRobotControl (number urlId):

This is called only when doService returns FOLLOW_UP status code.

This function returns a status code indicating what kind of robots control it has for this document. The possible codes are:

USE_CURRENT: Use existing setting

FOLLOW_AND_INDEX: Follow links, and index the document

FOLLOW_AND_NO_INDEX: Follow links, but do not index this document

NO_FOLLOW_AND_INDEX: No link extraction, but index the document

NO_FOLLOW_AND_NO_INDEX: No link extraction, and do not index

Reader getContents(number urlId):

This is called only when doService returns FOLLOW_UP status code.

The returned Reader object contains the new document contents in HTML to be indexed. The original document contents will be discarded.

The crawler closes the Reader when the crawler finishes reading it. The returned Reader should not be a filter Reader based on the original Reader passed in from doService, because the original Reader is closed when getContents returns a new Reader.

If the return is null, then no contents replacement will happen.

Checksum of the original document is not changed.

void received(number urlId) throws AgentException:

This is called after finishing work with the target document. It allows the agent to perform clean up work associated with its service.

Calls to doService and received are always paired.

The crawler is guaranteed not to hold on to the UrlData object returned by the agent. This allows the agent to reuse the object.
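
The call sequence described in this section can be sketched with a small stand-in interface. This is an illustration under stated assumptions, not the real API. DocumentServiceSketch replaces DataSourceParams and UrlData with plain maps, replaces AgentException with ordinary return codes, uses long for the URL ID type, and uses placeholder values for NO_CHANGE and FOLLOW_UP. LengthAgent and CrawlerDriver are hypothetical names, and the robots control call is omitted for brevity.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in that mirrors the method list above; types and constants are assumptions.
interface DocumentServiceSketch {
    int NO_CHANGE = 0;
    int FOLLOW_UP = 1;
    boolean open(Map<String, String> params, PrintWriter log);
    int doService(String documentUrl, long urlId, Reader docReader);
    Map<String, String> getAttribute(long urlId);
    Reader getContents(long urlId);
    void received(long urlId);
    void close();
}

// Example agent: reports the document length in characters as an attribute.
class LengthAgent implements DocumentServiceSketch {
    private PrintWriter log;
    private int length;

    public boolean open(Map<String, String> params, PrintWriter log) {
        this.log = log; // initialization work happens here
        return true;
    }

    public int doService(String documentUrl, long urlId, Reader docReader) {
        try {
            length = 0;
            while (docReader.read() != -1) {
                length++;
            }
        } catch (IOException e) {
            log.println("Skipping " + documentUrl + ": " + e);
            return NO_CHANGE;
        }
        return FOLLOW_UP;
    }

    public Map<String, String> getAttribute(long urlId) {
        return Collections.singletonMap("Length", Integer.toString(length));
    }

    public Reader getContents(long urlId) {
        return null; // null means that the original contents are kept
    }

    public void received(long urlId) {
        length = 0; // per-document clean up
    }

    public void close() {
        log.flush();
    }
}

// Drives the agent in the order in which the crawler is described as calling it.
class CrawlerDriver {
    static Map<String, String> process(DocumentServiceSketch agent, String url, long id, String text) {
        Map<String, String> attributes = Collections.emptyMap();
        int status = agent.doService(url, id, new StringReader(text));
        try {
            if (status == DocumentServiceSketch.FOLLOW_UP) {
                attributes = agent.getAttribute(id);
                agent.getContents(id);
            }
        } finally {
            agent.received(id); // doService and received are always paired
        }
        return attributes;
    }
}
```

Keeping per-document state in instance fields is safe in this sketch only because each crawling thread is described as having its own copy of the service agent object.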

Agent Registration Client Interface

A document service agent is registered through the PL/SQL API in the wkds_adm package. You register a document service agent, define an agent instance, and assign it to one or all data sources in the form of a crawler preference. Details of the API are in wk0ds.pkh under $ORACLE_HOME/ultrasearch/admin/.

After a document service agent instance is created, use the following API to assign it to be loaded for all data sources:

wk_crw.update_crawler_config(wk_crw.CRAWLER_COMMON,'CC_AGENT_INSTANCE','<agent instance name>');

Use the following if it is only to be used for a particular data source:

wk_crw.update_crawler_config(<data source id>,'CC_AGENT_INSTANCE','<agent instance name>');

To remove the agent instance, use the null value for the instance name:

wk_crw.update_crawler_config(wk_crw.CRAWLER_COMMON,'CC_AGENT_INSTANCE',null);

Note:

These are internal APIs, which are subject to change.

Example of Setting Up the Document Service Agent

This example assumes that DocServiceAgent.java has been compiled into DocServiceAgent.class file and is archived in a jar file called wkagent.jar.

declare
  g_tid1  number;  -- value returned by new_agent
  g_dsid0 number;  -- value returned by new_agent_inst
  g_pid0  number;  -- value returned by add_agent_param for 'Admin_user'
  g_pid1  number;  -- value returned by add_agent_param for 'Password'
begin
  -- All API calls must start with wk_adm.use_instance
  wk_adm.use_instance('<INSTANCE NAME>');

  -- Register the agent.
  -- The agent class name is 'DocServiceAgent', located in wkagent.jar under
  -- $OH/ultrasearch/lib/agent/
  g_tid1 := wkds_adm.new_agent('Simple Agent','Document Service Test Agent',
            'DocServiceAgent','wkagent.jar',wkds_adm.DOC_SERVICE_TYPE);

  -- Define agent parameters
  g_pid0 := wkds_adm.add_agent_param(g_tid1,'Admin_user','The user');
  g_pid1 := wkds_adm.add_agent_param(g_tid1,'Password','password','Y'); -- encrypted

  -- Define an agent instance based on the registered agent
  g_dsid0 := wkds_adm.new_agent_inst('Simple Agent Instance',g_tid1);

  -- Set agent parameter values for instance 'Simple Agent Instance'
  wkds_adm.set_agent_param_value(g_dsid0,g_pid0,'WK_TEST'); -- user name
  wkds_adm.set_agent_param_value(g_dsid0,g_pid1,'WK_TEST'); -- password

  -- Associate the agent instance with all data sources
  wk_crw.update_crawler_config(wk_crw.CRAWLER_COMMON,'CC_AGENT_INSTANCE','Simple Agent Instance');

  -- Or associate it with one particular data source:
  -- wk_crw.update_crawler_config(<Data Source id>,'CC_AGENT_INSTANCE','Simple Agent Instance');
exception
  when others then
    wk_err.raise;
end;
/

Oracle Ultra Search Query Applications

Oracle Ultra Search provides several query applications and a sample crawler agent. Use the query applications as examples for creating your own query application. The query applications are written as J2EE-compliant Web applications. Your query application uses the Oracle Ultra Search query API. You can also use the sample crawler agent to create your own crawler agent.

Note:

Pointers to the query applications and the sample crawler agent Java source code, as well as their corresponding readmes, are on the Oracle Ultra Search welcome page: http://hostname.domainname:port/ultrasearch/index.html

The query application is designed to showcase the keyword in context and highlighting features. These features are implemented in search.jsp and its dependent files.

Keyword in context shows a section of the original document that contains the search terms. Highlighting shows the entire document with the search terms in a different color. For highlighting to work, the crawler must be configured to keep the cache file. Highlighting is implemented in cache.jsp, which you can customize.

Note that framed HTML documents contain only the frame layout and frame content specification, not the actual content. Therefore, the cached version of these documents appears blank in a browser.
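
The difference between keyword in context and highlighting can be illustrated on plain text. The following is a minimal sketch under stated assumptions, not the implementation in search.jsp or cache.jsp: the class HighlightSketch and its methods are hypothetical, the input is treated as plain text rather than parsed HTML, and the highlight color is fixed to red.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical illustration; not the code used by the query applications. */
class HighlightSketch {

    /** Highlighting: returns the whole text with every search term wrapped in a colored span. */
    static String highlight(String text, List<String> terms) {
        if (terms.isEmpty()) {
            return text;
        }
        // One alternation pattern, so inserted markup is never matched again.
        String alternation = terms.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);
        return p.matcher(text)
                .replaceAll("<span style=\"color:red\">$0</span>");
    }

    /**
     * Keyword in context: returns only the section around the first match,
     * up to radius characters on each side, or an empty string if no match.
     */
    static String keywordInContext(String text, String term, int radius) {
        Matcher m = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE)
                .matcher(text);
        if (!m.find()) {
            return "";
        }
        int start = Math.max(0, m.start() - radius);
        int end = Math.min(text.length(), m.end() + radius);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String doc = "Oracle Ultra Search indexes documents";
        System.out.println(highlight(doc, List.of("ultra")));
        System.out.println(keywordInContext(doc, "search", 6));
    }
}
```

Highlighting needs the full document text, which is why it depends on the crawler cache, whereas keyword in context needs only a short excerpt.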

The query applications are shipped as a deployed J2EE Web application (sample.ear). This component depends on a J2EE container to host the Web pages, a JDBC driver, and the JavaMail API for displaying e-mail results. After Oracle Containers for J2EE (OC4J) deploys the sample.ear file, you see a set of JSP files that demonstrate the query API usage.

The query applications include a search portlet. The Oracle Ultra Search portlet demonstrates how to write a search portlet for use in Oracle Application Server Portal.

When the user issues a query in any of the query applications, a result list containing query results is returned. The user can select a document to view from the result list. A result list can include HTML documents, files, database table content, archived e-mails, or Oracle Application Server items. The Oracle Ultra Search query applications also incorporate an e-mail browser for reading and browsing e-mails.

The Oracle Ultra Search administration tool and the Oracle Ultra Search query applications are part of the Oracle Ultra Search middle tier. However, the administration tool is independent of the query applications, so the two can be hosted on different computers to enhance security or scalability.

If you do not want to use the query applications, you can build your own query application by directly invoking the Oracle Ultra Search Java query API. Because the API is coded in Java, you can invoke the API methods from any Java-based application, such as from a Java servlet or a JavaServer Page (as in the case of the provided query applications). For rendering e-mails that have been crawled and indexed, you can also directly invoke the Oracle Ultra Search Java e-mail API methods.

Query Applications

The query applications are located in the $ORACLE_HOME/ultrasearch/sample directory.

JavaServer Page Concepts

As mentioned earlier, you can use JSP code and the supplied Java APIs to create your Web application. Typically, your Web application runs in an application server, such as Oracle Application Server. The application server typically runs on a separate computer from the Oracle server for performance and scalability reasons. The Oracle server holds the Oracle Ultra Search indexes.

JSP applications are compiled into Java servlets at runtime. The compiled servlets run in one or more Java Virtual Machine processes. The JSP application communicates with the Oracle server through the Oracle JDBC driver.

As in any Java application, you must include the following files in your servlet engine classpath to use the Java query and e-mail APIs:

$ORACLE_HOME/ultrasearch/lib/ultrasearch_query.jar
$ORACLE_HOME/lib/mail.jar
$ORACLE_HOME/lib/activation.jar
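
Before configuring the servlet engine, it can help to confirm that these three files exist under your Oracle home. The following is a minimal sketch: the class QueryClasspathCheck is hypothetical, and it checks only that the files are present on disk, not that the servlet engine classpath actually includes them.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not shipped with Oracle Ultra Search. */
class QueryClasspathCheck {

    /** Returns the three JAR locations listed above, resolved against oracleHome. */
    static List<Path> requiredJars(String oracleHome) {
        return Arrays.asList(
            Paths.get(oracleHome, "ultrasearch", "lib", "ultrasearch_query.jar"),
            Paths.get(oracleHome, "lib", "mail.jar"),
            Paths.get(oracleHome, "lib", "activation.jar"));
    }

    /** Returns the required JAR files that are not present as regular files. */
    static List<Path> missingJars(String oracleHome) {
        List<Path> missing = new ArrayList<>();
        for (Path jar : requiredJars(oracleHome)) {
            if (!Files.isRegularFile(jar)) {
                missing.add(jar);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String home = System.getenv("ORACLE_HOME");
        if (home == null) {
            System.err.println("ORACLE_HOME is not set");
            return;
        }
        for (Path jar : missingJars(home)) {
            System.out.println("Missing: " + jar);
        }
    }
}
```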

Figure 9-1 shows how your Web query application calls the Oracle Ultra Search Java query API.

Figure 9-1 Calling JavaServer Pages
