Text in Oracle: The Search Platform and Ultra Search

Omar Alonso, Senior Product Manager, Oracle Corp.
Stefan Buchta, Principal Product Manager, Oracle Corp.

NoCOUG, May 16th, 2001



Agenda

What is Oracle Text?
Introducing Oracle Text
Text in the database – Why Integration is Key
Performance and scalability
Ease of Use
Global Solutions
Search Quality
Specialized Indexes
XML
Document Services
Ultra Search
Summary

What is Oracle Text?

Formerly known as interMedia Text
Oracle Text adds powerful text search and intelligent text management capabilities to the Oracle database.
Oracle Text:
  – Is fully integrated with the database
  – Offers premier text search quality
  – Provides several advanced features for text management, document services, XML, etc.
  – Has the best set of internationalization features for multilingual text search applications.

Introducing Oracle Text – An example

create index description_idx on PRODUCT_INFORMATION(PRODUCT_DESCRIPTION)
  indextype is ctxsys.context;

select score(1), product_id, product_name
from product_information
where contains(product_description, 'monitor NEAR "high resolution"', 1) > 0
order by score(1) desc;

SCORE(1) PRODUCT_ID PRODUCT_NAME
-------- ---------- ------------------------------
      29       3331 Monitor 21/HR
      27       3060 Monitor 17/HR
      14       1726 LCD Monitor 11/PM
      14       3054 Plasma Monitor 10/XGA
      14       2252 Monitor 21/HR/M
      14       2243 Monitor 17/HR/F
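To make the proximity query above more concrete, here is a minimal Python sketch of how a positional inverted index can answer something like monitor NEAR "high resolution". It is a toy illustration only, not Oracle Text's implementation: the function names, the max_span window, and the sample descriptions are all assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased token to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_positions(index, phrase, doc_id):
    """Start positions where the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    starts = index.get(words[0], {}).get(doc_id, [])
    return [p for p in starts
            if all(p + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words))]

def near(index, term, phrase, max_span=10):
    """Doc ids where `term` occurs within `max_span` tokens of `phrase`."""
    hits = []
    for doc_id, term_pos in index.get(term.lower(), {}).items():
        starts = phrase_positions(index, phrase, doc_id)
        if any(abs(t - s) <= max_span for t in term_pos for s in starts):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical product descriptions keyed by product id.
docs = {
    3331: "21 inch monitor with very high resolution",
    1726: "LCD monitor for the office",
}
index = build_index(docs)
print(near(index, "monitor", "high resolution"))
```

A real engine would also turn the distance between the terms into a relevance score; this sketch only answers the yes/no question of whether the terms are close enough.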

Integration with the database

The attempt to separate text and normal business (structured) data fails:
  – Cost
  – Complexity
  – High latency of development and deployment
  – Performance

No Integration - Separate Everything

[Diagram: the application works against two separate stacks.]

Repository        Index      Search Engine (API)
Oracle Database   B-Tree     SQL
File System       Inverted   C API

Full Integration – text, index, API, optimizer

[Diagram: the application works against a single stack.]

Repository        Index      Search Engine (API)
Oracle Database   B-Tree     SQL

Integration Benefits

Low cost
Low complexity
High performance
High integrity
Manageability
Leveraging existing skills

Oracle Uses Oracle Text

Oracle Internet File System
Oracle Portal
Oracle CRM
Oracle E-Business Suite
Oracle eXchange
Ultra Search
Oracle.com
OTN

Oracle Internet File System

Oracle E-Business Suite

Performance – illustration

Large doc set – 100 GB (20 million web pages)
Hardware: Enterprise Sparc
Task: web query
  – Web-style query syntax
  – 2-3 words
  – Return first 100 hits
40 queries/second
90% of requests take < 0.5 second
7 hours to create index

Performance – Query throughput

[Chart: Throughput Comparison. X-axis: Number of Users (0 to 30). Y-axis: Queries per Second (0 to 80). Series: CONTEXT queries per second, TCOMP queries per second.]

Oracle Text vs one of the best-known specialist text search engines

Ease of Use, Ease of Development

Simple SQL and PL/SQL interface
  – Can be used by any developer who knows SQL
  – Can be called by any tool that knows SQL
  – Using any language: Java, JSP, PL/SQL, C, etc.
Choice of datastores
  – Stored in the database
  – Stored in the file system
  – Stored on the web (URL)
  – User-defined datastore

Global Solutions

Basic indexing/search works in any NLS language
Special support for Japanese, Chinese, Korean
Theme search and services available in any single-byte, white space-delimited language
Can mix languages, character sets in a single column
Can query across languages

Chinese, Japanese, Korean Text

• Character sets:
  • Japanese: JA16SJIS, JA16EUC
  • Simplified Chinese: GBK, GB2312-80
  • Traditional Chinese: BIG5, EUC, TRIS
  • Korean: KO16KSC5601
  • Unicode: UTF8
• Lexing:
  • Lexical segmentation for Japanese, Chinese
  • Morphological segmentation for Korean

Multilingual Search

Cross-language queries
Can mix languages, character sets within a document collection (e.g. Chinese and English documents)
Can use English to query e.g. Chinese terms or vice versa.
Query a term which is expressed differently in simplified and traditional Chinese.

select score(1), product_id, product_name
from product_information
where contains(product_description, 'TRSYN(monitor, Chinese)', 1) > 0
order by score(1) desc;

Find products whose description contains 'monitor' or its Chinese equivalents.
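The idea behind a translation-synonym query can be sketched in a few lines of Python: the query term is expanded with its translations from a thesaurus, and a document matches if it contains any of them. This is an illustration under assumptions, not how Oracle Text evaluates TRSYN: the THESAURUS dictionary, its entries, and both function names are hypothetical.

```python
# Hypothetical thesaurus: term -> {language: [translations]}.
# The two entries are the simplified and traditional Chinese forms.
THESAURUS = {
    "monitor": {"chinese": ["显示器", "顯示器"]},
}

def expand_trsyn(term, language, thesaurus=THESAURUS):
    """Return the term followed by its translations in `language`."""
    translations = thesaurus.get(term.lower(), {}).get(language.lower(), [])
    return [term] + translations

def matches(description, term, language):
    """True if the description contains the term or any translation."""
    text = description.lower()
    return any(t.lower() in text for t in expand_trsyn(term, language))

print(expand_trsyn("monitor", "Chinese"))
print(matches("21寸显示器", "monitor", "Chinese"))
```

Substring matching is used here only because Chinese text is not whitespace-delimited; a real system would rely on the lexer described on the previous slide.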

Search Quality

Exact word
Boolean expression
Phrase
Proximity
Fuzzy
Stemming
Wildcards
  – Prefix, substring index
Thesaurus, multiple thesauri
ABOUT search
Theme (concept-based) search
Accumulate scores
Term weighting
Advanced XML search
XPath support
Query Feedback

ABOUT – themes and theme queries

"We ordered a bottle of chardonnay to go with the fish, and cabernet sauvignon for the steak …"

select id from docs
where contains(text, 'ABOUT(wine)') > 0;

The knowledge base allows Oracle Text to associate words and concepts.
The knowledge base contains over 400,000 concepts.
You can extend the knowledge base to include
  – Words and concepts from your specialist field, e.g. medicine
  – Associations of words and spellings to guide novice/internet users

Catalog Index

Optimized for response time on small text fields
True transactional DML
Supports structured query, including range query
Subset of CONTEXT query language
  – No fuzzy, stemming, about
  – User-friendly web-like query syntax

Classification

CTXRULE is an index type designed for classification/routing applications
Efficiently take a document and find matching queries

[Diagram, labeled 9i: incoming documents enter the classification application, which compares them against rules; matched documents trigger an action.]
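The classification flow inverts the usual search direction: the queries are stored, and each incoming document is tested against them. A minimal Python sketch of that idea follows. It is not the CTXRULE implementation: the classify function, the rule format (a list of words that must all be present), and the sample rules are assumptions chosen for illustration.

```python
def classify(document, rules):
    """Return the names of rules whose required words all occur in the document.

    `rules` maps a rule name to a list of required words (a toy AND query).
    """
    words = set(document.lower().split())
    return sorted(name for name, required in rules.items()
                  if all(w.lower() in words for w in required))

# Hypothetical routing rules.
rules = {
    "databases": ["oracle", "index"],
    "wine": ["chardonnay"],
}

print(classify("Oracle Text adds a new index type", rules))
```

An application would then perform an action per matched rule, for example routing the document to a queue named after the rule.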

Prefix, substring index

Prefix and Substring are flavors of the CONTEXT index

Prefix will add more tokens to the CONTEXT index to efficiently process prefix searches (e.g. 'ora%')

Substring will add an index on substrings of each token, to efficiently process substring searches (e.g. '%oxy%')
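The trade-off described above, extra index entries in exchange for faster wildcard lookups, can be sketched in Python. This is a toy model, not Oracle Text's storage format: the function names and the min_len/max_len limits are assumptions.

```python
from collections import defaultdict

def build_prefix_index(tokens, min_len=1, max_len=5):
    """Map each prefix (min_len..max_len characters) to the tokens that start with it."""
    index = defaultdict(set)
    for tok in tokens:
        for n in range(min_len, min(max_len, len(tok)) + 1):
            index[tok[:n]].add(tok)
    return index

def build_substring_index(tokens, min_len=2):
    """Map each substring of at least min_len characters to the tokens containing it."""
    index = defaultdict(set)
    for tok in tokens:
        for i in range(len(tok)):
            for j in range(i + min_len, len(tok) + 1):
                index[tok[i:j]].add(tok)
    return index

tokens = ["oracle", "oral", "oxygen", "epoxy"]
prefix_index = build_prefix_index(tokens)
substring_index = build_substring_index(tokens)

print(sorted(prefix_index["ora"]))     # lookup for 'ora%'
print(sorted(substring_index["oxy"]))  # lookup for '%oxy%'
```

Each wildcard query becomes a single dictionary lookup instead of a scan over every token, at the cost of a larger index.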

Storing XML in Oracle

Decomposition
  – decompose documents into atomic elements
  – store elements in columns/rows
  – compose XML documents using SQL
xmltype
  – store XML as xmltype, use xmltype methods
Store as LOB or varchar
  – Store XML as-is, in a LOB or VARCHAR
  – Search using Oracle Text section searching or XPath

Content search and XML

Create index
  create index BOOKINDEX on BOOKS(text) indextype is ctxsys.context;

Query by content
  select PRICE from BOOKS
  where contains(text, 'Harry Potter') > 0
  order by price desc;

Create index to include section info
  create index BOOKINDEX on BOOKS(text) indextype is ctxsys.context
    parameters ('section group my_auto_section_group');

Limit content search to a section of text
  select price from books
  where contains(text, 'Harry Potter within title') > 0
  order by price desc;

Advanced XML searches

Nested section search
  <movie><title>The Matrix</title></movie>
  <book><title>Introduction to Matrix Algebra</title></book>

  select price from media
  where contains(description, 'matrix within title within movie') > 0;

Search inside attribute values
  <book author="Barry Hughart">Bridge of Birds</book>

  select title from books
  where contains(text, 'Hughart within book@author') > 0;
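The semantics of the two searches above can be mimicked with Python's standard XML parser. This sketch parses each document at query time, whereas a section-aware text index records section boundaries when the document is indexed; the function names within and within_attr are hypothetical.

```python
import xml.etree.ElementTree as ET

def within(xml_text, word, inner, outer):
    """True if `word` occurs in an `inner` element nested inside an `outer` element."""
    root = ET.fromstring(xml_text)
    for outer_el in root.iter(outer):
        for inner_el in outer_el.iter(inner):
            text = "".join(inner_el.itertext()).lower()
            if word.lower() in text.split():
                return True
    return False

def within_attr(xml_text, word, tag, attr):
    """True if `word` occurs in attribute `attr` of any `tag` element."""
    root = ET.fromstring(xml_text)
    return any(word.lower() in el.get(attr, "").lower().split()
               for el in root.iter(tag))

movie = "<movie><title>The Matrix</title></movie>"
book = "<book><title>Introduction to Matrix Algebra</title></book>"
novel = '<book author="Barry Hughart">Bridge of Birds</book>'

print(within(movie, "matrix", "title", "movie"))
print(within(book, "matrix", "title", "movie"))
print(within_attr(novel, "Hughart", "book", "author"))
```

Only the movie document satisfies the nested search, even though both titles contain the word, which is the point of nesting the section names.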

More advanced XML searches

Map multiple tags to the same name
  <H1>The Diamond Age</H1>
  <H2>or, A Young Lady’s Illustrated Primer</H2>
  (map H1 and H2 to section name of "headline")

  select author from articles
  where contains(text, 'Diamond within headline') > 0;

Doctype limiters to handle tag collisions
  <!DOCTYPE foo> … <address>[email protected] …
  <!DOCTYPE bar> … <address>123 Meheula Pkwy …
  map (foo)address to "email", (bar)address to "address"

Document Services

Extract Themes (major concepts)
  – Extract hierarchical structure
Extract Gist
  – Generic or Point-of-View
  – Sentence- or Paragraph-level
View a document as HTML
  – Highlight search terms, highlight navigation
Return results in a table or a PL/SQL table
Basis for Clustering, More Like This, …

Summary

Fully integrated with the database
Premier text search quality
Advanced features for text management, document services, and XML.
Best multilingual features in the market.

Introducing Oracle Ultra Search

Issues in Corporate Search

Information Management Crisis
  – Explosive growth of information flowing over corporate intranets.
  – Knowledge scattered across IT repositories, billions of documents, and data fragments.
  – Non-uniform information
      Structured - in databases.
      Unstructured - word processing documents, presentations.

Impacts of Bad Search

Customers - Turn to competitor’s Website.
Employees - Waste time and money on useless searches.
Oracle Ultra Search
  – Solves the problem of finding relevant information.
  – Across your company’s many disparate information repositories.

Oracle Ultra Search

Out-of-the-Box solution that
  – Searches text across multiple repositories: Databases, HTML Pages, Files, Mail Servers.
  – Provides the best relevance ranking and globalization in the industry.
  – Provides value added Portal functionality.
  – Presents Web style interface.
Built on Oracle’s proven, reliable Text Retrieval software and Oracle9i server.

Oracle Ultra Search

[Screenshot: search results, with callouts for document title and relevance.]

Ultra Search Applications

Portal Search
  – Most powerful search for Oracle9iAS Portal.
  – Build your own portal.
  – Special ‘Portlet’ crawls inside and outside of Portal Repository.
Canned Web Search for Oracle Text
Library or Archive Search
Content Management Platform Search

Search Multiple Repositories

Value Added Portal Functionality

‘Canned’ Web-Style Search
Aggregates Information For Indexing
  – Documents stay in their own repositories.
  – Search returns ‘normalized’ results, uniformly ranked by relevance.
Organize & Categorize Content From Multiple Repositories
  – Extract valuable metadata.
  – Improve search by narrowing through ‘fielded search’.

‘Out-of-the-Box’ Web-Style Search

Oracle Text Application
  – Uses public Text interfaces.
  – Enhanced with expertise about gathering and indexing information for best quality search.
  – No coding against low-level APIs.
Oracle Text Retrieval Engine
  – Highly integrated with Oracle9i server.
  – Best interoperability with dynamic data.
  – Scalability and reliability of the Oracle platform.

Aggregates Information

[Diagram: Gather, Analyze, Make Queryable, Maintain]

Gather
  – Crawls Web, corporate repositories
Analyze
  – Create index required for querying, filter
Make Queryable
  – Embed through API
Maintain
  – Schedule crawling
  – Easy administration

Powerful Fielded Search

Narrow search to parts of document - title, body, name of author.
Extract and use repository metadata
  – Word processing documents: Author, Title.
  – Databases: Identify Columns.
  – Email: Header/Body/Attachment.
Unify repositories in common, logical terms.
  – Uniform set of results, ranked by overall relevance.
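One way to picture "unifying repositories in common, logical terms" is a mapping from each repository's own field names to a shared set of search attributes. The Python sketch below is an assumption-laden illustration, not Ultra Search's data model: FIELD_MAP, the field names in it, and both functions are hypothetical.

```python
# Hypothetical per-repository field names mapped to common search attributes.
FIELD_MAP = {
    "document": {"Author": "author", "Title": "title"},
    "email": {"From": "author", "Subject": "title"},
    "database": {"CREATED_BY": "author", "DOC_NAME": "title"},
}

def normalize(source_type, record):
    """Rename repository-specific fields to the common attribute names."""
    mapping = FIELD_MAP[source_type]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def fielded_search(records, attribute, value):
    """Return normalized records whose `attribute` contains `value`."""
    return [r for r in records
            if value.lower() in r.get(attribute, "").lower()]

records = [
    normalize("document", {"Author": "Omar Alonso", "Title": "Text in Oracle"}),
    normalize("email", {"From": "Stefan Buchta", "Subject": "Ultra Search demo"}),
    normalize("database", {"CREATED_BY": "omar", "DOC_NAME": "Crawler notes"}),
]

print(fielded_search(records, "author", "omar"))
```

Once every source is expressed in the same attributes, a single query such as author contains "omar" can span documents, mail, and database rows.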

Flexible Metadata Mapping

[Diagram: a search term is mapped to metadata fields across repositories.]

Ultra Search Architecture

Architecture

Simple, Robust Architecture Built on:
  – Oracle9i Server Platform
  – Oracle’s Text Retrieval Engine
Flexible Deployment
  – Server-Tier
  – Mid-Tier

Ultra Search Components

Crawler
Server Component
Query API & Application
Administration Tool
Mail API

Ultra Search Crawler

Multi-threaded Java process.
  – Gathers documents from repositories you specify on a set schedule.
  – Maps and analyzes link relationships.
  – Filters (150+) non-HTML documents, extracts valuable metadata.
  – Indexes documents and data fragments.
Flexible Configuration
  – Run on one or more machines: ‘Remote crawling’

Ultra Search Crawler

Set Inclusion/Exclusion Domains
  – Limit crawling to corporate net or specific sections of it.
Maintain Fresh Search Results
  – Set crawling schedules for each Web site or repository.
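Inclusion and exclusion domains amount to a filter that the crawler applies to every URL before fetching it. The Python sketch below shows one plausible rule (exclusions win over inclusions, and a domain covers its subdomains). These rules, the function names, and the example hosts are assumptions, not documented Ultra Search behavior.

```python
from urllib.parse import urlparse

def in_domain(host, domain):
    """True if host equals the domain or is a subdomain of it."""
    return host == domain or host.endswith("." + domain)

def should_crawl(url, include, exclude):
    """Crawl only URLs inside an included domain and outside every excluded one."""
    host = (urlparse(url).hostname or "").lower()
    if any(in_domain(host, d) for d in exclude):
        return False
    return any(in_domain(host, d) for d in include)

include = ["example.com"]      # hypothetical corporate domain
exclude = ["hr.example.com"]   # hypothetical section to skip

for url in ("http://intranet.example.com/docs",
            "http://hr.example.com/salaries",
            "http://www.other.org/"):
    print(url, should_crawl(url, include, exclude))
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives such as a path that merely contains the domain name.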

Crawling Abilities

Web Sites (HTTP Protocol)
Database Tables
  – Oracle and any ODBC compliant database.
  – Local (Ultra Search instance) or remote database
  – Crawls both fulltext and ‘fielded’ columns.
Files (file:// Protocol)
  – Ultra Search filters, extracts text and metadata.
Emails (IMAP Protocol)
  – Crawl and index mailing lists through IMAP.

Ultra Search Query API

‘Embed’ Ultra Search in your Portal or Application.
  – Customize look-and-feel to your requirements.
  – Easy integration with your application.
API for Java (JSP) and PL/SQL (PSP).
Returns data with or without HTML markup.
  – Build: Basic Search Form, Search Result Form...
Includes Highly Functional Query Application.

Java Query API Illustration

[Figure: illustration with three numbered callouts (1, 2, 3).]

Administration Environment

Browser-based, Self-Service Web Application.
Define Ultra Search Instances.
Configure and Schedule Crawler.
Set Query Options To Narrow Searches.
  – Document Attributes (e.g. TITLE, AUTHOR).
  – Define ‘Data Source Groups’.
Manage Administrative Users.

Administration Environment

Summary

Eliminate the Chaos Inside Your Firewalls!
Oracle Ultra Search
  – Crawls, indexes, and makes searchable your Intranet.
  – Provides Web-style search without the need for coding.
  – Organizes, categorizes, and unifies content from multiple repositories.
  – Leverages Oracle9i platform - reliable, scalable, always available.

Questions & Answers