Oracle® Text Reference 10g Release 2 (10.2) Part Number B14218-01
This appendix lists the document formats supported by the automatic (AUTO_FILTER) filtering technology.
Oracle Text's automatic filtering technology, licensed from Verity, Inc., enables you to index most document formats. This technology also enables you to convert documents to HTML for document presentation with the CTX_DOC package.
To use automatic filtering for indexing and DML processing, you must specify the AUTO_FILTER object in your filter preference.
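As a minimal sketch of that setup, the Python helper below only assembles the two statements involved: a `CTX_DDL.CREATE_PREFERENCE` call that creates a filter preference of type `AUTO_FILTER`, and a `CREATE INDEX` statement that names the preference in its `PARAMETERS` clause. The helper itself, its identifier check, and the names used in the usage note are hypothetical, and the helper does not connect to a database; running the statements is left to whatever client you use.

```python
import re

# Conservative check for simple, unquoted identifiers. This rule is an
# assumption made for the sketch; quoted identifiers are not handled.
_IDENT = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]{0,29}$")


def auto_filter_statements(table, column, index_name, pref_name):
    """Return (plsql_block, create_index_sql) for an AUTO_FILTER-based index."""
    for name in (table, column, index_name, pref_name):
        if not _IDENT.match(name):
            raise ValueError(f"not a simple identifier: {name!r}")

    # Create a filter preference whose object type is AUTO_FILTER.
    plsql_block = (
        "BEGIN\n"
        f"  ctx_ddl.create_preference('{pref_name}', 'AUTO_FILTER');\n"
        "END;"
    )
    # Reference the preference by name in the index PARAMETERS clause.
    create_index_sql = (
        f"CREATE INDEX {index_name} ON {table}({column})\n"
        "  INDEXTYPE IS ctxsys.context\n"
        f"  PARAMETERS ('FILTER {pref_name}')"
    )
    return plsql_block, create_index_sql
```

For example, `auto_filter_statements("docs", "text", "docs_idx", "my_auto_filter")` returns a PL/SQL block and a `CREATE INDEX` statement for a hypothetical `docs` table whose `text` column holds the documents.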
To use automatic filtering technology for converting documents to HTML with the CTX_DOC package, you need not use the AUTO_FILTER indexing preference, but you must still set up your environment to use this filtering technology, as described in this appendix.
The supported platforms and formats listed in this appendix apply to this release. The list of supported formats is updated for patch releases. To view the latest formats, refer to the Oracle Technology Network:
http://www.oracle.com/technology/products/text
Password-protected documents and documents with password-protected content are not supported by the AUTO_FILTER filter.
For other limitations, refer to the sections in this appendix that cover specific document types.
AUTO_FILTER filter technology is supported on the following platforms:
Microsoft Windows Server 2003 (x86 and IA-64)
Microsoft Windows XP (Service Packs 1 and 2)
Microsoft Windows 2000 x86 (Service Pack 2)
Microsoft Windows NT 4.0 x86 (Intel) (Service Pack 6a)
Sun Solaris 8.0 and 9.0
HP-UX 11.0 and 11i, PA-RISC
HP-UX 11i v11.23, IA-64
IBM AIX 5.1 and 5.2L
Red Hat Linux 7.3 and 8.0
Red Hat Enterprise Linux AS 2.1 and 3.0 (x86)
Red Hat Enterprise Linux AS 3.0 (IA-64)
SuSE Linux Standard Server 8 (x86)
The tables in this section list the document formats that Oracle Text supports for filtering. Oracle Text licenses its filtering technology from Verity, Inc.
Document filtering is used for indexing, DML, and for converting documents to HTML with the CTX_DOC package.
Note:
These lists do not represent the complete list of formats that Oracle Text is able to process. The external filter framework enables Oracle Text to process any document format, provided an external filter exists that can filter it to text. Plain-text, HTML, XHTML, XML, and SGML formats pass through the filter without any conversion.
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
ANSI (TXT) | all versions | Y | Y | n/a |
ASCII (TXT) | all versions | Y | Y | n/a |
HTML | 2.0, 3.2, 4.0 | Y | Y | n/a |
IBM DCA/RFT (Revisable Form Text) (DC) | SC23-0758-1 | character sets 500 and 1026 only | N | N |
Rich Text Format (RTF) | 1 through 1.7 | Y | Y | Y |
Unicode Text | 3, 4 | Y | Y | n/a |
XHTML | 1.0 | Y | Y | n/a |
Generic XML | 1.0 | Y | Y | n/a |
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
Adobe Maker Interchange Format (MIF) | 5, 5.5, 6, 7 | character set 1252 only | N | N |
Applix Words (AW) | 3.11, 4, 4.1, 4.2, 4.3, 4.4 | character set 1252 only | N | N |
DisplayWrite (IP) | 4 | character sets 500 and 1026 only | N | N |
Folio Flat File (FFF) | 3.1 | character set 1252 only | N | N |
Fujitsu Oasys (OA2) | 7 | Y | Japanese only | N |
JustSystems Ichitaro (JTD) | 8, 9, 10, 12 | Y | Japanese only | N |
Lotus AMI Pro (SAM) | 2, 3 | Y | Simplified Chinese, Traditional Chinese, Japanese, and Thai only | Y |
Lotus Word Pro (LWP) | 96, 97, Millennium Edition R9, 9.8 (supported on Windows 32-bit platform only) | Y | Y | Y |
Lotus Master (MWP) | 96, 97, Millennium Edition R9, 9.8 (supported on Windows 32-bit platform only) | Y | Y | Y |
Lotus Master (MWP) | 96, 97 (supported on Windows 32-bit platform only) | Y | Y | N |
Microsoft Word for PC (DOC) | 4, 5, 5.5, 6 | character set 1252 only | N | N |
Microsoft Word for Windows (DOC) | 1 through 2003 | Y | N: versions 1-2; Y: versions 6, 7, 8, 95, 97, 2000, XP, 2002, 2003 | N: versions 1-2; Hebrew only: versions 6, 7, 8, 95; Y: versions 97, 2000, XP, 2002, 2003 |
Microsoft Word for Windows XML format | 2003 (No formatting extracted) | Y | Y | Y |
Microsoft Word for Macintosh (DOC) | 4, 5, 6, 98 | Y (version 98) | N (version 98) | Y (version 98) |
Microsoft Works (WPS) | 1 through 2000 | Y | Japanese only | N |
Microsoft Windows Write (WRI) | 1, 2, 3 | Y | Japanese only | N |
OpenOffice (SXW) | 1, 1.1 (No formatting extracted) | Y | Y | Y |
StarOffice (SXW) | 6, 7 (No formatting extracted) | Y | Y | Y |
WordPad | through 2003 | Y | Y | Y |
WordPerfect for Windows (WO) | 5, 5.1 | Y | N | Y |
WordPerfect for Windows (WPD) | 6, 7, 8, 10, 2000, 2002, 11 | Y | N | N |
WordPerfect for Macintosh | 1.02, 2, 2.1, 2.2, 3, 3.1 | Y | N | N |
WordPerfect for Linux | 6 | Y | N | N |
XyWrite (XY4) | 4.12 | character set 1252 only | N | N |
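The Microsoft Word for Windows (DOC) row is the only one in this table whose Asian and bi-directional support varies by version. The following Python lookup is one way to make that row easier to query; it is a sketch that transcribes the values from the table, and the function and dictionary names are hypothetical.

```python
# Support flags for Microsoft Word for Windows (DOC), transcribed from the
# word processing table. Each value is "Y", "N", or "Hebrew only".
_V1_2 = {"single_byte": "Y", "asian": "N", "bidi": "N"}
_V6_95 = {"single_byte": "Y", "asian": "Y", "bidi": "Hebrew only"}
_V97_2003 = {"single_byte": "Y", "asian": "Y", "bidi": "Y"}

WORD_FOR_WINDOWS_SUPPORT = {
    **{v: _V1_2 for v in ("1", "2")},
    **{v: _V6_95 for v in ("6", "7", "8", "95")},
    **{v: _V97_2003 for v in ("97", "2000", "XP", "2002", "2003")},
}


def word_support(version):
    """Return the support flags for a version label, or None if the
    table does not list that version explicitly."""
    return WORD_FOR_WINDOWS_SUPPORT.get(str(version).strip().upper())
```

For example, `word_support(95)["bidi"]` yields `"Hebrew only"`, while a label that the table does not name returns `None` rather than a guess.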
The following limitations apply to filtering of word processing documents:
Mixed-page orientation (landscape and portrait) within the same word processing document is not supported.
When text color in a Microsoft Word document is set to Automatic on a dark background, the resulting text is rendered as black. If the text color is explicitly set, the resulting text is rendered correctly in the same color as the original document.
If a graphic or table appears in a word processing text box, the filter cannot position it correctly in the HTML output.
Nested tables (a table inside another table) in word processing documents are not supported.
Comments in Microsoft Word documents are not filtered.
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
Applix Spreadsheets (AS) | 4.2, 4.3, 4.4 | character set 1252 only | N | N |
Corel Quattro Pro (QPW, WB3) | 6, 7, 8, 10, 2000, 2002, 11 | Y | N | N |
Lotus 1-2-3 (123) | 96, 97, Millennium Edition R9, 9.8 | Y | Y | Y |
Lotus 1-2-3 (WK4) | 2, 3, 4, 5 | Y | Y | N |
Lotus 1-2-3 Charts (123) | 2, 3, 4, 5 | Y | Y | N |
Microsoft Excel for Windows (XLS) | 2.2 through 2003 | Y | Y | Y |
Microsoft Excel for Windows XML format | 2003 (No formatting extracted) | Y | Y | Y |
Microsoft Excel for Macintosh (XLS) | 98 | Y | N | N |
Microsoft Excel Charts (XLS) | 2, 3, 4, 5, 6, 7 | Y | Y | N |
Microsoft Works Spreadsheet (S30,S40) | 1, 2, 3, 4 | Y | N | N |
OpenOffice (SXC) | 1, 1.1 (No formatting extracted) | Y | Y | Y |
StarOffice (SXC) | 6, 7 (No formatting extracted) | Y | Y | Y |
The following limitations apply to the filtering of spreadsheets:
Cell outline borders in Microsoft Excel spreadsheets are not filtered.
Microsoft Excel "Donut," "Radar," "Surface," and custom charts are not supported.
Comments in Microsoft Excel spreadsheets are not filtered.
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
Applix Presents (AG) | 4.0, 4.2, 4.3, 4.4 | character set 1252 only | N | N |
Corel Presentations (SHW) | 6, 7, 8, 10, 2000, 2002, 11 | character set 1252 only | N | N |
Lotus Freelance Graphics (PRE) | 2, 96, 97, 98, Millennium Edition R9, 9.8 | character set 850 only (V96 and higher) | N (V96 and higher) | N (V96 and higher) |
Lotus Freelance Graphics 2 (PRE) | 2 | Y | Japanese, Simplified Chinese, Traditional Chinese, and Thai only | N |
Microsoft PowerPoint for Windows (PPT) | 95 through 2003 | Y | Japanese, Simplified Chinese, Traditional Chinese, and Korean only | Hebrew only |
Microsoft PowerPoint for PC (PPT) | 4 | character set 1252 only | Traditional Chinese only | N |
Microsoft PowerPoint for Macintosh (PPT) | 98 | Y | N | Y |
Microsoft Project (MPP) | 98, 2000, 2002 (XP) | character set 1252 only | N | N |
Microsoft Visio (VSD) | 6 | Y | Y | N |
Microsoft Visio XML format | 2003 (No formatting extracted) | Y | Y | Y |
OpenOffice (SXI, SXP) | 1, 1.1 (No formatting extracted) | Y | Y | Y |
StarOffice (SXI, SXP) | 6, 7 (No formatting extracted) | Y | Y | Y |
Format | Version | Single-byte | Asian (and Most Multi-byte) | Bi-directional? |
---|---|---|---|---|
Adobe Portable Document Format (PDF) | 1.1 (Acrobat 2.0) to 1.5 (Acrobat 6.0) | Y | Japanese, Simplified and Traditional Chinese, and Korean | N |
Multi-byte PDFs are supported, provided the PDF document is created using Character ID-keyed (CID) fonts, predefined CJK CMap files, or ToUnicode font encodings, and the document does not contain embedded fonts. See the Adobe website and the Adobe Acrobat documentation for more information.
To determine the type of font encodings that are used in a PDF, open the PDF document in Adobe Acrobat, and select File->Document Info->Fonts. If the Encodings column lists Custom or Embedded encodings, then you may encounter problems filtering the PDF document.
The following limitations apply to PDF documents:
All PDF security attributes are supported except for user and master passwords.
Embedded fonts in a PDF document are not filtered correctly.
If an unsupported font is encountered during conversion of a PDF document, the default font, Times New Roman, is substituted. If the original font is wider than the substituted font, extra whitespace will appear in the output HTML.
The following color spaces are supported:
DeviceRGB
DeviceGray
DeviceCMYK
CalGray
CalRGB
Index color spaces are supported as long as they are used with a supported basic color space.
Hyperlinks in PDF documents are not supported.
All predefined CMaps in the PDF 1.3 specification are supported. CMaps added in the PDF 1.4 and PDF 1.5 specifications are not supported.
Annotations, such as notes, sound, or movie, are not supported.
The following features of PDF 1.5 for Acrobat 6.0 are not supported:
Tagged PDFs
Images compressed in JPEG2000
Crypt Filter encryption
Hidden content in a PDF document, such as Optional Content and OCG-State Actions
Interactive forms
Embedded multimedia presentations
Digital signatures and signature fields
Interactive presentations, that is, navigation between pages and transition actions.
Vector images are not supported. Since background colors are defined in PDF as vector images, background colors are also not supported. Raster images are supported.
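Several of the PDF conditions above can be checked before a document is submitted for filtering. The Python sketch below collects them into a list of warnings. It assumes you already know the PDF version, whether the file is password-protected, whether it embeds fonts, and the encodings shown in Acrobat's font listing; the `PdfInfo` class and `pdf_filter_warnings` function are hypothetical names, and the code does not parse PDF files itself.

```python
from dataclasses import dataclass, field

# Versions covered by the range "1.1 (Acrobat 2.0) to 1.5 (Acrobat 6.0)".
SUPPORTED_PDF_VERSIONS = {"1.1", "1.2", "1.3", "1.4", "1.5"}


@dataclass
class PdfInfo:
    version: str                                        # for example "1.4"
    password_protected: bool = False
    font_encodings: list = field(default_factory=list)  # from Acrobat's font listing
    has_embedded_fonts: bool = False


def pdf_filter_warnings(info):
    """Return a list of reasons why filtering this PDF may fail or degrade."""
    warnings = []
    if info.version not in SUPPORTED_PDF_VERSIONS:
        warnings.append(f"PDF version {info.version} is outside 1.1 to 1.5")
    if info.password_protected:
        warnings.append("password-protected documents are not supported")
    if info.has_embedded_fonts:
        warnings.append("embedded fonts are not filtered correctly")
    # Custom or Embedded encodings are the ones flagged as problematic.
    risky = [e for e in info.font_encodings
             if e.strip().lower() in ("custom", "embedded")]
    if risky:
        warnings.append("problematic font encodings: " + ", ".join(risky))
    return warnings
```

An empty result means only that none of these particular conditions was detected; the remaining limitations in this list, such as hyperlinks or vector images, are not covered by the sketch.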
Table B-1 lists the graphic formats that the AUTO_FILTER filter recognizes. Indexing a text column that contains any of these formats produces no error, so it is safe for the column to contain them.
Formats are categorized as either embedded graphics or standalone graphics. Embedded graphics are inserted or referenced within a document.
Note:
This filter cannot extract textual information from graphics.
Table B-1 Supported Graphics Formats for AUTO_FILTER Filter
Graphics Format | Version | Bidirectional? |
---|---|---|
AutoCAD Drawing format (DWG) | R13, R14, and R2000 (standalone only) | |
AutoCAD Drawing format (DXF) | R13, R14, and R2000 (standalone only) | |
Encapsulated PostScript (EPS) (raster only) | TIFF header only | |
Enhanced Metafile (EMF) | no specific version | N |
Graphics Interchange Format (GIF) | 87, 89 | |
JPEG File Interchange Format | no specific version | |
Lotus AMIDraw Graphics (SDW) | no specific version | |
Lotus Pic (PIC) | no specific version | |
Macintosh Raster (PICT/PCT) | 2 | |
MacPaint (PNTG) | no specific version | |
Microsoft Windows Bitmap (BMP) | no specific version | |
PC Paintbrush (PCX) | 3 | |
Portable Network Graphics (PNG) | no specific version | |
SGI RGB Image (RGB) | no specific version | |
Sun Raster Image (RS) | no specific version | |
Tagged Image File (TIFF) | 5 | N |
Truevision TARGA (TGA) | 2 | |
Windows Animated Cursor (ANI) | no specific version | |
Windows Metafile (WMF) | 3 | N |
WordPerfect Graphics (WPG) | 1 | N |
WordPerfect Graphics 2 (WPG) | 2, 7 | N |