Why Search Engines are used increasingly to Offload Queries from Databases

Bjørn Olstad
CTO, FAST Search & Transfer
Adjunct Prof., The Norwegian University of Science & Technology
Email: [email protected]
Cell: +47 48011157


The Typo Problem...

Talent Offloading ....

The Web Search Experience

”You are viewing 5 random jobs out of 2461 jobs in total....”

High input barrier

The RDBMS Experience

CareerBuilder use scenario, part 1

Step 1: 30956 jobs

CareerBuilder use scenario, part 2

Step 2: 1084 jobs

CareerBuilder use scenario, part 3

Step 3: 30 jobs

CareerBuilder use scenario, part 4

5 jobs

30956 → 5 targeted jobs in 3 steps

Challenger Shuttle Launch

Fax to NASA from contractor with O-ring concern

Presentation Matters …

IYP: A Disruptive Change

ESP: Cleansing, Mining, Relevance and Discovery

• Company name
• Business category
• Telephone number
• Address
• 20 key terms

Products & Services

Blogs++

Company web site

What is the phone number to Will’s Barber shop?

Taylor or Gibson guitar? Good local offers?

Compare offerings. Phone / Directions

BTW: I’m using my iPAQ

ISVs: A Disruptive Change

Siebel 2000 Siebel 2005

“my” CRM Application “my” CRM Application

Information Access Layer

3rd party content

Search is a strategic enabler

Search is a tactical afterthought

Search

Revisit the Assumptions …

[Figure: information growth over time, axis labelled GIGABYTES. Data labels: 1999; 2000: 3B; 2001: 6B; 2002: 12B; 2003: 24B. Annotation: 80% unstructured.]

[Timeline: Cave paintings, bone tools (40,000 BCE); Writing (3500 BCE); 0 C.E.; Paper (105); Printing (1450); Electricity, Telephone (1870); Transistor (1947); Computing (1950); Internet (DARPA) (late 1960s); The Web (1993).]

[Database milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03.]

Relational algebra: large – but “finite” data sets; structured data

Search & Explore focused: “infinite” data sets; unstructured & structured

• Feeding/streaming, transaction, retrieval or analytics centric?

• Content size: M, L, VL, VVVL or Vn∞ L?

• Schema centric, Semi-structured XML, Text, Agnostic?

• Fuzzy & Value vs. Binary & Completeness?

• Discovery primitives?

• User interaction part of design target?

Extreme Capabilities?

Query Latency: RDBMS vs ESP

[Figure: histogram of # queries (axis 0 to 20) by query latency in [sec.] (buckets 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16, 32). Series: ESP, RDBMS.]

Test Data:

• Structured data: 5 million records; 13 fields per record
• Structured queries: 22 SQL queries (representative in ERP)

The Result:

• #1: FAST ESP w/ disk: Mean = 99 [ms], St.dev. = 36 [ms]
• #2: Oracle w/ memory mapping: Mean = 4 057 [ms], St.dev. = 9 368 [ms]
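The figures on this slide (mean, standard deviation, and a histogram over power-of-two latency buckets) can be reproduced from raw timings along the following lines. This is only a sketch: the sample latencies are hypothetical, the function names are illustrative, and the clamping to the 1/16 s to 32 s range simply mirrors the chart axis. Only the two mean values in the final ratio come from the slide.

```python
import math
import statistics
from collections import Counter

def summarize(latencies_ms):
    """Return (mean, sample standard deviation) in milliseconds."""
    return statistics.mean(latencies_ms), statistics.stdev(latencies_ms)

def log2_bucket(latency_ms):
    """Smallest power-of-two bucket (in seconds) that holds the latency.

    Assumes latency_ms > 0. Buckets are clamped to 1/16 s .. 32 s,
    the range shown on the chart axis.
    """
    seconds = latency_ms / 1000.0
    exponent = math.ceil(math.log2(seconds))
    exponent = max(-4, min(5, exponent))
    return 2.0 ** exponent

def histogram(latencies_ms):
    """Count queries per latency bucket."""
    return Counter(log2_bucket(ms) for ms in latencies_ms)

# Hypothetical per-query latencies in milliseconds (not from the slide).
sample_ms = [60, 80, 100, 120, 500, 4000]
mean_ms, stdev_ms = summarize(sample_ms)
buckets = histogram(sample_ms)

# Ratio of the two mean latencies reported on the slide (Oracle vs FAST ESP).
mean_ratio = 4057 / 99  # roughly 41x
```

A long-tailed distribution like the hypothetical one here has a standard deviation larger than its mean, which is the same pattern the slide reports for the RDBMS run.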

Queries Per Second: RDBMS vs ESP

[Figure: bar chart of QPS (axis 0 to 900) for FAST and ORA at 20 users, 50 users and 100 users.]

• Identical HW: single node, 2 CPU, 4GB RAM, 3 SCSI disks
• Identical data: auction data from eBay, 3.6 million docs
• Identical queries: 200 queries defined by Oracle

Disruptive Change

Relational Model

Queries that fit The Model

Queries that don’t fit The Model

Alternative I

• Star, snowflake schemas++
• Cubes / datamarts ++

Incremental fixes to painful shortcomings

Adds complexity

Alternative II

• Schema agnostic
• Scalable ad-hoc querying
• BLOBS Contextual Insight
• Real-time fusion of disparate data models
• Massive fault tolerant scalability


Extreme Capabilities: ESP Design Targets

Contextual Insight

Value/Noise SNR

User Interaction

Contextual Refinement

Game Changer driven by Extreme Retrieval and on-the-fly Analytics

Powering Search Derivative Applications (SDAs)

Database Query Offloading. Example: AutoTrader.com

Car Dealers - Product Supply

RDBMS:

• HW-cost: $320K (32 CPU on 4 Sun servers)
• 90% sub-second query response. Average = 12 s for the rest ….
• Relevance = Sorting
• 5 FTE to maintain

ESP:

• HW-cost: $90K
• 100% sub-second query response
• Flexible relevance and discovery
• 0.5 FTE to maintain

Content Scalability: RDBMS vs ESP

Examples of ESP deployments

• Compliance case:
  – 50B documents @ 80k average = 4 PB (around 100 web indexes)

• Storage:
  – Intelligent content addressable storage
  – XML metadata and full content
  – EMC Centera: N * 256TB (N=1..400)

• Webmining – Webfountain:
  – 60.000 : 1 in query capacity (ESP : DB)
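The sizing on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "80k" means 80 kB per document and that the units are decimal (1 PB = 10^15 bytes); neither is stated on the slide. The per-index figure is only what "around 100 web indexes" would imply under those assumptions.

```python
# Assumed units (not stated on the slide): decimal kB, TB and PB.
KILOBYTE = 10**3
TERABYTE = 10**12
PETABYTE = 10**15

docs = 50 * 10**9              # 50B documents (from the slide)
avg_doc_bytes = 80 * KILOBYTE  # assumption: "80k" is 80 kB per document

total_pb = docs * avg_doc_bytes / PETABYTE   # compliance case volume in PB

# "Around 100 web indexes" then implies roughly this much per web index.
implied_index_tb = docs * avg_doc_bytes / 100 / TERABYTE

# Storage line: EMC Centera, N * 256TB with N = 1..400.
max_centera_pb = 400 * 256 * TERABYTE / PETABYTE

print(total_pb, implied_index_tb, max_centera_pb)
```

Under these assumptions the compliance case comes out at 4 PB, matching the slide, and the largest Centera configuration listed (N = 400) would hold about 102 PB.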

Intelligent Storage: Storage and Search Unite

Simple

Scalable

Secure

Discover

Contextual Search

Contextual Relevance

• “Best of Web”: Recommender / Authority
• “Best of Enterprise”: Linguistic / Statistic

Contextual Navigation

• Contextual fact discovery
• On-the-fly meta-data analysis

From ACCESS To INSIGHT

Where is the email from Peter about ROI analysis?

Any new suspicious financial transaction patterns?

FIND EXPLORE

STRUCTURED

FAST ESP

Single Field Search

Querying

SQL LIB

WWW (HTML, XML, WML, JavaScript)

DB DB DB DB DB

Turning around the Pyramid: HBZ.de – Leading German Library Service Center

Researchers

Librarians

From:

To:

ESP @ SCOPUS

• >200M articles / 180M citations
• 180TB capacity / 14000 journals

David Goodman standing up and declaring in public, that Scopus is the best-designed database he's ever seen …

Search Reduces Clicks to Purchase and Browsing…

• Reduced # of clicks to buy content from > 4 to < 2
• 50% reduction in ringtone browsing page views per sale

[Figure: weekly chart from Week 1 to Week 10 with the point "Launched search" marked. Series label: Browsing. Axes: 0.00 to 4.50 and -60% to 140%.]

… and Drives Revenue

• 100% increase in search
• 20% increase in ringtone revenue

[Figure: weekly chart from Week 1 to Week 10 with the point "Launched search" marked. Labels: Search, Revenue, Clicks to Purchase. Axis: -60% to 140%.]

Relevance Drives Revenue

ØKOKRIM

Firewall

Real-time Registration

Transaction Log

Message Queue

Data Validation

Queries

Results

Example: Norwegian Customs Foreign Exchange Transaction Monitoring

SECURITY ACCESS MODULE: ACL Monitor, User Monitor

Database connector

Firewall

Alerts

Business Analytics: Processing of real-time streams

Technology Maturity... RDBMS vs ESP

Business Intelligence: ESP vs. RDBMS Technology

OBSERVATION: The Enterprise Search Platform (ESP), a relatively new concept, integrating advanced technologies typically associated with search engines, database tools, and analytical systems, is fast becoming able to solve modern business intelligence problems (using both structured and unstructured data) in a way that is fundamentally different from, and ultimately superior to, that of other currently available analytical or database software.

PREDICTION: Enterprise Search Platform and search centric application technology represents a true paradigm shift in the way data will be stored, analyzed and reported on in the future. Resulting realignments in the marketplace may be both rapid and tumultuous.

- Chief strategist leading BI vendor

If your only tool is a hammer ....

... every problem looks like a nail

UIMA: Architecture

<Company>Dynegy Inc</Company>

<Person>Roger Hamilton</Person>

<Company>John Hancock Advisers Inc. </Company>

<PersonPositionCompany>  <OFFLEN OFFSET="3576" LENGTH="63" />  <Person>Roger Hamilton</Person> <Position>money manager</Position> <Company>John Hancock Advisers Inc.</Company> </PersonPositionCompany>

<Company>Enron Corp</Company>

<CreditRating>  <OFFLEN OFFSET="3814" LENGTH="61" />  <Company_Source>Moody's Investors Service</Company_Source>   <Company_Rated>Enron Corp</Company_Rated> <Trend>downgraded</Trend>   <Rank_New>Baa2</Rank_New>   <__Type>bonds</__Type>   </CreditRating>

<Company>Moody's Investors Service</Company>

…….

``Dynegy has to act fast,'' said Roger Hamilton, a money manager with John Hancock Advisers Inc., which sold its Enron shares in recent weeks. ``If Enron can't get financing and its bonds go to junk, they lose counterparties and their marvelous business vanishes.''

Moody's Investors Service lowered its rating on Enron's bonds to ``Baa2'' and Standard & Poor's cut the debt to ``BBB.'' in the past two weeks.

……

Fact

Fact

Event

Event

<Author>George Stein</Author>

BC-dynegy-enron-offer-update5
Dynegy May Offer at Least $8 Bln to Acquire Enron (Update5)
By George Stein
SOURCE
c.2001 Bloomberg News
BODY

<Category>FINANCIAL</Category>

Text Structure
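As an illustration of how an annotation like the `PersonPositionCompany` fragment above could be consumed downstream, the sketch below parses it with Python's standard XML library and turns it into a flat record. The function name and the record layout are illustrative assumptions, not part of UIMA or ESP. The full article is not reproduced on the slide, so the `OFFSET` value cannot be applied directly; the sketch instead locates the span inside the quoted sentence and uses only the `LENGTH` attribute.

```python
import xml.etree.ElementTree as ET

# Annotation fragment as shown on the slide, re-indented for readability.
ANNOTATION = """
<PersonPositionCompany>
  <OFFLEN OFFSET="3576" LENGTH="63" />
  <Person>Roger Hamilton</Person>
  <Position>money manager</Position>
  <Company>John Hancock Advisers Inc.</Company>
</PersonPositionCompany>
"""

# Sentence from the article excerpt on the slide.
SENTENCE = (
    "``Dynegy has to act fast,'' said Roger Hamilton, a money manager with "
    "John Hancock Advisers Inc., which sold its Enron shares in recent weeks."
)

def parse_annotation(xml_text):
    """Flatten one annotation element into a dict (illustrative layout)."""
    root = ET.fromstring(xml_text.strip())
    span = root.find("OFFLEN")
    record = {
        "type": root.tag,
        "offset": int(span.get("OFFSET")),
        "length": int(span.get("LENGTH")),
    }
    for child in root:
        if child.tag != "OFFLEN":
            record[child.tag] = (child.text or "").strip()
    return record

fact = parse_annotation(ANNOTATION)

# OFFSET refers to the full article, which is not available here, so the
# start is found by searching the excerpt; LENGTH then delimits the span.
start = SENTENCE.find(fact["Person"])
covered_text = SENTENCE[start:start + fact["length"]]
print(fact)
print(covered_text)
```

The covered text is the stretch of the sentence from the person's name through the company name, which is the sub-document unit the fact is attached to.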

The BI “hammer” Approach

Antibiotics, Peptidyl, Eubacteria, RNA, Mg, …

Document Vector

SVD Analysis

{ λ1, λ2, ..., λn, Structured attributes }

( λ1, λ2, ..., λn )

XML

Contextual Refinement: ETL and Semantic understanding unite

Direct access to RDBMSs for info from some telcos

XML feed from other telcos

Flat files (CSV or fixed) from the ’laggards’

Master database for persistent storage

ESP lookup

Ordered hits (by quality)

Logic for cleansing

Cleansed data to ESP

clean data

’Error’ database for manual inspection, correction, storage/learning

Ambiguous data (close hits or unidentified)
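The cleansing flow on this slide (look up each incoming record, order the hits by quality, then route it either to the clean path or to the error database) can be sketched as follows. Everything here is an assumption made for illustration: the master list, the thresholds, and the use of Python's `difflib.SequenceMatcher` as a stand-in for the ESP lookup, which the slide does not describe.

```python
from difflib import SequenceMatcher

# Hypothetical master database entries.
MASTER = ["Will's Barber Shop", "Wilson Bakery", "Gibson Guitar Store"]

def ranked_hits(name, master):
    """Stand-in for the lookup step: master entries ordered by match quality."""
    scored = [
        (SequenceMatcher(None, name.lower(), entry.lower()).ratio(), entry)
        for entry in master
    ]
    return sorted(scored, reverse=True)

def cleanse(name, master, accept=0.85, margin=0.10):
    """Route a record to ("clean", canonical_name) or ("error", raw_name).

    A record is clean only if the best hit is good enough (accept) and
    clearly ahead of the runner-up (margin). Close hits and unidentified
    names go to the error path for manual inspection.
    """
    hits = ranked_hits(name, master)
    if not hits:
        return ("error", name)
    best_score, best_entry = hits[0]
    runner_up = hits[1][0] if len(hits) > 1 else 0.0
    if best_score >= accept and best_score - runner_up >= margin:
        return ("clean", best_entry)
    return ("error", name)

print(cleanse("Wills Barber Shop", MASTER))
print(cleanse("Zzz Plumbing", MASTER))
```

The margin test is one simple way to express the slide's "close hits" case: two candidates with nearly equal scores are treated as ambiguous even when both score well.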

Contextual Insight: Query-time fact analysis @ sub-document level

“…entry probe carried to [Saturn]’s moon Titan as part of the…”

Concepts

Intent

Automated visitor ratings

Contextual Navigation: ThisIsTravel

Revisit the Assumptions …


Scalable Search