One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005



Page 1: Text Mining with Oracle - Text Mining Summit

One Tool, Many Industries: Text Mining with Oracle

Omar Alonso, Chuck Adams

Oracle Corp.

Text Mining Summit, Boston, 2005

Page 2: Text Mining with Oracle - Text Mining Summit

Agenda

• Introduction
• Text mining
• Define problems
• Present solutions
• A look at Oracle’s technology stack
• Oracle’s roadmap
• A case study
• Conclusions

Page 3: Text Mining with Oracle - Text Mining Summit

Data mining and Text mining

Structured Data: OLTP, OLAP, DM

Unstructured Data: Keyword search, BK, TM

TM
• Classification
• Clustering
• Ontologies
• NLP
• Inexact match

Page 4: Text Mining with Oracle - Text Mining Summit

An analogy

• RFID and robot vision
– Put tags on everything instead of having the robot do the vision
• Similar approach for text mining
– Language is very social, not technical
– Instead, start with a unified storage model
– Then do mining

Page 5: Text Mining with Oracle - Text Mining Summit

What about text mining?

• Text mining is one of many features in text technology
• Real future of text technology is business intelligence (BI)
• What is BI?
– Ability to make better decisions
• What are the obstacles today?
– Structured data is well understood
– Unstructured data is different

Page 6: Text Mining with Oracle - Text Mining Summit

Text and XML

Increased exploitation of structure:
• Plain Old File System
• File System on Steroids (WinFS)
• Records Mgmt, ECM
• Dynamic Doc Generation
• Traditional Content Mgmt
• XML Content Mgmt.

Page 7: Text Mining with Oracle - Text Mining Summit

First problem: access

• No uniform access over all sources
• Each source has separate storage and algebra
• Examples
– Email
– Databases
– Applications
– Web

Page 8: Text Mining with Oracle - Text Mining Summit

Second problem: management

• Management of unstructured data is very poor compared with structured data
• Cleaning
• Noise is larger than in structured data
• Security
• Multilingual

Page 9: Text Mining with Oracle - Text Mining Summit

Third problem – user needs

• Perception with current search engines
• Large data -> 80/20 rule
• Doesn't provide uniform information
• Two users type same query and get the same results
– Cricket the game or cricket the bug?

Page 10: Text Mining with Oracle - Text Mining Summit

Foundations

• XML as the common model
• XML allows:
– Manipulating data with standards
– Text mining becomes more like data mining
– RDF emerging as a complementary model
• The more structure you can explore the better you can do mining
• Integration use cases
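
The idea that more structure makes text mining look more like data mining can be illustrated with a minimal sketch. It assumes a hypothetical document schema (the element names `doc`, `author`, `topic`, and `body` and the sample documents are invented for illustration and do not come from the talk). Once the text is wrapped in XML, a keyword question turns into an ordinary group-by over a structured field.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical documents: the element names and contents are
# illustrative assumptions, not a schema from the talk.
XML = """
<docs>
  <doc><author>alice</author><topic>sales</topic><body>Quarterly financials look strong.</body></doc>
  <doc><author>bob</author><topic>engineering</topic><body>Bug in the financials report module.</body></doc>
  <doc><author>alice</author><topic>sales</topic><body>New customer deal closed.</body></doc>
</docs>
"""

def topic_counts(xml_text, keyword):
    """Count documents per topic whose body mentions the keyword."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for doc in root.iter("doc"):
        body = doc.findtext("body", default="")
        if keyword.lower() in body.lower():
            # The structured <topic> field is what makes this a group-by.
            counts[doc.findtext("topic", default="unknown")] += 1
    return counts

result = topic_counts(XML, "financials")
print(dict(result))
```

The free-text part (`body`) is still searched by keyword, but the grouping comes entirely from the structure around it.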

Page 11: Text Mining with Oracle - Text Mining Summit

Foundations - II

• Unstructured data is too AI
• Too easy to get fooled by the complexity
• Hybrid solution
• Domain knowledge
– You know your domain
– You own the content
– You can do better

Page 12: Text Mining with Oracle - Text Mining Summit

Remember?

Page 13: Text Mining with Oracle - Text Mining Summit

Personalization problem

• Lack of personalization
• You own the content, you own the user
• Two users type the same query: “financials”
– Sales rep looks for customers and other deals
– Tech guy looks for bugs, architecture, etc.
• LDAP shows who they are
• Combination with query logs shows patterns in the same peer group
• Recommendation systems
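
A rough sketch of the peer-group idea follows. Everything in it is assumed for illustration: the user-to-group mapping stands in for what an LDAP lookup might return, and the query log, user names, and document names are invented. It reorders a hit list for the query “financials” by how often each document was clicked within the user's own peer group.

```python
from collections import Counter, defaultdict

# Hypothetical data: user -> peer group (as an LDAP lookup might return)
# and a query log of (user, query, clicked_document) tuples.
PEER_GROUP = {"ann": "sales", "raj": "sales", "kim": "engineering", "lee": "engineering"}
QUERY_LOG = [
    ("ann", "financials", "customer-deals"),
    ("raj", "financials", "customer-deals"),
    ("raj", "financials", "quarterly-forecast"),
    ("kim", "financials", "bug-report"),
    ("lee", "financials", "architecture-notes"),
    ("lee", "financials", "bug-report"),
]

def build_click_model(log, groups):
    """Count clicks per (peer group, query) pair."""
    model = defaultdict(Counter)
    for user, query, doc in log:
        model[(groups[user], query)][doc] += 1
    return model

def personalize(hits, user, query, model, groups):
    """Put documents popular in the user's peer group first.

    sorted() is stable, so ties keep the original engine order.
    """
    clicks = model[(groups[user], query)]
    return sorted(hits, key=lambda doc: -clicks[doc])

model = build_click_model(QUERY_LOG, PEER_GROUP)
hits = ["architecture-notes", "bug-report", "customer-deals", "quarterly-forecast"]
sales_view = personalize(hits, "ann", "financials", model, PEER_GROUP)
tech_view = personalize(hits, "kim", "financials", model, PEER_GROUP)
print(sales_view)
print(tech_view)
```

The same query and the same hit list yield different orderings for the sales rep and the engineer, which is the behavior the slide asks for.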

Page 14: Text Mining with Oracle - Text Mining Summit

Better Answers: Beyond Keywords

• Noise theory
– As you cast your nets ever wider, you catch disproportionately more junk
• Must develop new models of quality in the face of comprehensiveness
– Combine link analysis with context-sensitive relevance
– Personalization
• Must summarize information
– Theme Maps, Gists
• Show patterns in information vs. many pages of hit-lists
– Tree Maps, Stretch Viewer
• Ability to post-process and refine search hit lists
– Dynamic categories for navigation
– Reorder by date
• Progressive query relaxation
– Nearest inexact match
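
Progressive query relaxation can be sketched as a sequence of increasingly loose match stages that stops as soon as enough hits are found. The three stages chosen here (exact phrase, all terms, any term), the `wanted` threshold, and the mini-corpus are assumptions made for illustration; this is plain Python and not Oracle Text's API.

```python
import re

DOCS = {  # hypothetical mini-corpus
    1: "oracle text mining summit",
    2: "text mining with oracle database",
    3: "mining for gold in california",
}

def tokens(text):
    return re.findall(r"\w+", text.lower())

def phrase_match(query, doc):
    """Strictest stage: the query words appear consecutively."""
    q, d = tokens(query), tokens(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

def all_terms(query, doc):
    """Relaxed stage: every query word appears somewhere."""
    return set(tokens(query)) <= set(tokens(doc))

def any_term(query, doc):
    """Loosest stage: at least one query word appears."""
    return bool(set(tokens(query)) & set(tokens(doc)))

STAGES = [("phrase", phrase_match), ("all terms", all_terms), ("any term", any_term)]

def relaxed_search(query, docs, wanted=2):
    """Run stricter stages first; relax only while fewer than `wanted` hits."""
    hits = []
    seen = set()
    for name, matcher in STAGES:
        for doc_id, text in docs.items():
            if doc_id not in seen and matcher(query, text):
                hits.append((doc_id, name))
                seen.add(doc_id)
        if len(hits) >= wanted:
            break
    return hits

print(relaxed_search("oracle text", DOCS))
print(relaxed_search("text mining oracle", DOCS))
```

Each hit records the stage that produced it, so stricter matches naturally sort ahead of the inexact ones.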

Page 15: Text Mining with Oracle - Text Mining Summit

Technology Stack

Better Answers

Relevance Toward BI

Progressive Relaxation

Multi-Criterion Support

Visualization

Classification

Personalization

Direct Answers

Link Analysis

Query Log Analysis

Metadata Extraction

Keyword Ranking

Intelligent Match

Duplicate Elimination

Page 16: Text Mining with Oracle - Text Mining Summit
Page 17: Text Mining with Oracle - Text Mining Summit
Page 18: Text Mining with Oracle - Text Mining Summit
Page 19: Text Mining with Oracle - Text Mining Summit

Oracle’s position

• Text mining is one of many tools for information retrieval and discovery in many assets
• Text mining is best used in the context of other techniques
– Personalization
– Search query logs
– Visualization
• Product: one integrated platform

Page 20: Text Mining with Oracle - Text Mining Summit

Oracle platform

Integrated platform vs. niche technology

Capability: niche products
• Full-text searching: Google, FAST
• XML: Tamino
• Classification: Autonomy
• Clustering: Vivisimo
• Visualization: Inxight
• Application search: SAP/TREX

Integrated platform: one platform, low cost, low complexity

Niche technology: several products, different APIs, performance, maintenance cost, etc.

Page 21: Text Mining with Oracle - Text Mining Summit

Oracle platform

“If I have seen further, it is by standing on the shoulders of giants” – Isaac Newton

• Oracle provides you all the functionality
– Plus you get backup, recovery, scalability, and other benefits
• You build the mining application

Page 22: Text Mining with Oracle - Text Mining Summit

Case study

• Federal customer
• High Performance Text Information Mining and Entity Extraction

Page 23: Text Mining with Oracle - Text Mining Summit

Business Need

• Enterprise Search Capability
• Information Fusion
• Profiles and alerting
• Security – user need to know
• Entity identification and extraction
• High Performance ingestion, search, and indexing
• Scalability

Page 24: Text Mining with Oracle - Text Mining Summit

Challenges

• Search quality
• Performance
• Scalability
• Document formats
• Integration
• Operations and maintenance

Page 25: Text Mining with Oracle - Text Mining Summit

Solutions Architecture

• Oracle 10g Integrated Framework
• 10g release 2
– Oracle Real Application Clusters
– Oracle Text: full text and rule based indexing, extensible thesauri, document classification, document filters
– Oracle Partitioning
– Oracle Virtual Private Database
– Oracle Advanced Security
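
Rule based indexing inverts the usual search direction: queries are stored, and each incoming document is matched against them, which is also one way to drive profiles and alerting. The sketch below shows only that idea in plain Python. The profile names, the all-terms matching rule, and the sample documents are assumptions made for illustration, and none of it is Oracle Text's API.

```python
import re

# Hypothetical profiles: each is a stored query expressed as the set of
# terms that must all appear in a document for the profile to fire.
PROFILES = {
    "energy-watch": {"pipeline", "export"},
    "finance-watch": {"bank", "transfer"},
}

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

def matching_profiles(document, profiles):
    """Return the sorted names of all profiles whose terms all occur in the document."""
    words = tokenize(document)
    return sorted(name for name, terms in profiles.items() if terms <= words)

alerts = matching_profiles("New export pipeline announced by the ministry.", PROFILES)
print(alerts)
```

A real deployment would index the stored queries so that matching cost does not grow linearly with the number of profiles; the loop here is only for clarity.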

Page 26: Text Mining with Oracle - Text Mining Summit

Technical Architecture

Architecture diagram. Components:
• EDL Portal Users
• Load Balancer
• Application Servers
• Oracle 10g RAC nodes and Oracle 10g RAC Interconnect
• Enterprise MetaData Layer
• Scalar, Domain, and B*Tree Indices
• ADS OID
• Existing Mission Systems

Diagram annotations:
• Process Isolated RAC DB Nodes. 1 tuned for user query and the other for data synchronization
• Key meta data consolidated and indexed for enterprise data layer access.
• CIA PKI Authentication from ADSN clients
• ADS LDAP Integrated for Client and Server Authentication
• Network Based Integration Hub and EDL Synchronization Services
• Federated Data Access J2EE Services for mission system drill

Page 27: Text Mining with Oracle - Text Mining Summit

Scalable load and indexing

Oracle 9i & 9i Text
• Raw Payload
• Payload Index
• Scalar Indexes
• XML Indexes

Load pipeline (diagram labels):
• Data Collection
• Preprocess & Filtering
• Java Load Distribution Process
• Java Load Thread (eight parallel threads)
• Standardized XML DTD
• UTF8 Text Extracted from Collection
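
The load distribution step can be sketched as a distributor that splits incoming documents into batches and hands each batch to a worker thread. The slide names Java load threads; the sketch below uses Python only for brevity, and the function names, the round-robin split, and the whitespace normalization standing in for preprocessing are all assumptions. A real loader would write rows to the database instead of returning them.

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(raw: bytes) -> str:
    """Decode to UTF-8 text and normalize whitespace (stand-in for preprocessing and filtering)."""
    return " ".join(raw.decode("utf-8", errors="replace").split())

def load_batch(batch):
    """Stand-in for one load thread; a real loader would insert rows into the database."""
    return [(doc_id, preprocess(raw)) for doc_id, raw in batch]

def distribute(docs, workers=8):
    """Split documents round-robin into one batch per worker."""
    return [docs[i::workers] for i in range(workers)]

def load_all(docs, workers=8):
    batches = distribute(docs, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        loaded = [row for rows in pool.map(load_batch, batches) for row in rows]
    return sorted(loaded)

docs = [(i, f"  document   {i}\n".encode("utf-8")) for i in range(20)]
rows = load_all(docs)
print(rows[:3])
```

Eight workers mirror the eight load threads drawn on the slide; the right number in practice depends on the hardware and on how much the database can absorb.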

Page 28: Text Mining with Oracle - Text Mining Summit

Real world results

• Single search for user
• Profiles and alerts
• Query response within a couple of seconds
• 80,000,000+ documents indexed
• 1.2 TB raw text and growing
• 700 GB index size
• Incremental indexing of 1-2 GB per day
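
Two derived figures help put these numbers in perspective: the average document size and the index size relative to the raw text. The short calculation below assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), which the slide does not specify.

```python
DOCS = 80_000_000
RAW_BYTES = 1.2e12     # 1.2 TB, assuming decimal units
INDEX_BYTES = 700e9    # 700 GB, same assumption

avg_doc_kb = RAW_BYTES / DOCS / 1e3     # average document size in KB
index_ratio = INDEX_BYTES / RAW_BYTES   # index size relative to raw text

print(f"average document: {avg_doc_kb:.0f} KB")
print(f"index / raw text: {index_ratio:.0%}")
```

Under that assumption the average document is about 15 KB and the index is a little under 60% of the raw text.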

Page 29: Text Mining with Oracle - Text Mining Summit

Next Steps

Diagram labels:
• Oracle 10g Text Index structure
• Entity identification and extraction engine
• Language specific dictionary (four shown)
• Extracted Entities
• XML Interface
• Relationship detection engine

• Entity Extraction and Relationship Awareness
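
One simple way to picture the combination of language specific dictionaries, entity extraction, and relationship detection is a dictionary lookup followed by sentence-level co-occurrence. The sketch below is a deliberately naive illustration under those assumptions: the dictionaries, entity types, and sample text are invented, and co-occurrence is a stand-in technique, not the engine described on the slide.

```python
import re
from itertools import combinations

# Hypothetical language-specific dictionaries mapping surface forms to entity types.
DICTIONARIES = {
    "en": {"oracle": "ORGANIZATION", "boston": "LOCATION", "jane doe": "PERSON"},
    "es": {"madrid": "LOCATION"},
}

def extract_entities(text, lang):
    """Return sorted (surface form, type) pairs found by dictionary lookup."""
    lowered = text.lower()
    found = []
    for form, etype in DICTIONARIES[lang].items():
        if re.search(r"\b" + re.escape(form) + r"\b", lowered):
            found.append((form, etype))
    return sorted(found)

def detect_relationships(document, lang):
    """Pair up entities that co-occur in the same sentence."""
    pairs = set()
    for sentence in re.split(r"[.!?]+", document):
        names = [form for form, _ in extract_entities(sentence, lang)]
        pairs.update(combinations(names, 2))
    return sorted(pairs)

doc = "Jane Doe presented in Boston. Oracle builds databases."
entities = extract_entities(doc, "en")
relations = detect_relationships(doc, "en")
print(entities)
print(relations)
```

The person and the location are linked because they share a sentence, while the organization in the second sentence stays unrelated.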

Page 30: Text Mining with Oracle - Text Mining Summit

Oracle database 10g release 2

• Enterprise Search Capability
• Information Fusion
• Profiles and alerting
• Security – user need to know
• Entity identification and extraction
• High Performance ingestion, search, and indexing
• Scalability

Page 31: Text Mining with Oracle - Text Mining Summit

Conclusions

• Text mining is one of many features needed for BI on unstructured data
– Not a silver bullet in itself
• Must exploit other approaches – metadata (XML, RDF), personalization, classification, entity extraction, full-text search, …
– Hybrid solution
• Focus on an integrated platform that gives you all the functionality
• Drive the platform for your information need

Page 32: Text Mining with Oracle - Text Mining Summit