23/06/22 1 SPARK: Top-k Keyword Query in Relational Database Wei Wang University of New South Wales Australia


SPARK: Top-k Keyword Query in Relational Database

Wei Wang

University of New South Wales Australia


Outline

• Demo & Introduction
• Ranking
• Query Evaluation
• Conclusions


Demo


Demo …


SPARK I

Searching, Probing & Ranking Top-k Results

• Thesis project (2004 – 2005) with Nino Svonja
• Taste of Research Summer Scholarship (2005)
• Finally, CISRA prize winner
• http://www.computing.unsw.edu.au/softwareengineering.php


SPARK II

Continued as a research project with PhD student Yi Luo

• 2005 – 2006

• SIGMOD 2007 paper

• Still under active development


A Motivating Example


A Motivating Example …

Top-3 results in our system

1) Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)

2) Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)
   ActorPlay: Character = Himself
   Actors: Hanks, Tom

3) Actors: John Hanks
   ActorPlay: Character = Alexander Kerst
   Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)


Improving the Effectiveness

Three factors are considered to contribute to the final score of a search result (joined tuple tree):

• (modified) IR ranking score.

• the completeness factor.

• the size normalization factor.


Preliminaries

Data Model
• Relation-based

Query Model
• Joined tuple trees (JTTs)
• Sophisticated ranking
  • address one flaw in previous approaches
  • unify AND and OR semantics
  • alternative size normalization


Problems with DISCOVER2

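            score(ci)   score(pj)   score
c1 ⋈ p1     1.0         1.0         2.0
c2 ⋈ p2     1.0         1.0         2.0

$$\sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N + 1}{df}$$

$$\sum_{t \in Q \cap D} \bigl(1 + \ln(1 + \ln(tf))\bigr) \cdot \ln\frac{N + 1}{df}$$

signature   SPARK
(1, 1)      0.98
(0, 2)      0.44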


Virtual Document

Combine tf contributions before tf normalization / attenuation.

ci ⋈ pj     score(maxtor)   score(netvista)   scorea*
c1 ⋈ p1     1.00            1.00              2.00
c2 ⋈ p2     0.00            1.53              1.53

$$\sum_{t \in Q \cap D} \bigl(1 + \ln(1 + \ln(tf))\bigr) \cdot \ln\frac{N + 1}{df}$$
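The effect of combining tf before attenuation can be checked numerically. The sketch below is illustrative only: it assumes every idf weight equals 1 and ignores length normalization, and the function names and the assignment of keywords to individual tuples are hypothetical.

```python
import math
from collections import Counter

def attenuate(tf):
    """Dampened term frequency 1 + ln(1 + ln(tf)); 0 when the term is absent."""
    return 1.0 + math.log(1.0 + math.log(tf)) if tf > 0 else 0.0

def per_tuple_score(tuples, query):
    """Attenuate tf inside each tuple first, then add up the tuple scores."""
    return sum(attenuate(t.get(w, 0)) for t in tuples for w in query)

def virtual_document_score(tuples, query):
    """Add raw tf across the joined tuples first, then attenuate once per keyword."""
    combined = Counter()
    for t in tuples:
        combined.update(t)
    return sum(attenuate(combined.get(w, 0)) for w in query)

query = ["maxtor", "netvista"]
c1_p1 = [{"maxtor": 1}, {"netvista": 1}]    # signature (1, 1), assumed split
c2_p2 = [{"netvista": 1}, {"netvista": 1}]  # signature (0, 2), assumed split

print(per_tuple_score(c1_p1, query), per_tuple_score(c2_p2, query))
print(round(virtual_document_score(c1_p1, query), 2),
      round(virtual_document_score(c2_p2, query), 2))
```

Under these assumptions the per-tuple sum gives both results the same score, while the virtual document score separates them in the way the table above shows.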


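Virtual Document Collection

$$\sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N + 1}{df}$$

Collection: 3 results
• idf_netvista = ln(4/3)
• idf_maxtor = ln(4/2)

Estimate idf (idf_netvista, idf_maxtor):

$$\ln\frac{1}{1 - (1 - \frac{1}{3})(1 - \frac{1}{3})} = \ln\frac{9}{5}$$

Estimate avdl = avdlC + avdlP

            scorea
c1 ⋈ p1     0.98
c2 ⋈ p2     0.44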
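The worked idf estimate on this slide is consistent with computing, per keyword, the probability that a joined result contains the keyword from the fraction of matching tuples in each relation. The following is a minimal sketch under that reading; the function name, the independence assumption, and the choice of 1/3 as the match fraction for both relations are assumptions made for illustration.

```python
import math

def estimate_idf(match_fractions):
    """Estimate ln(1/p), where p is the chance that a joined result contains the keyword.

    match_fractions: df/N for each relation taking part in the join
    (the relations are assumed to match independently).
    """
    miss_all = 1.0
    for fraction in match_fractions:
        miss_all *= 1.0 - fraction   # probability that no joined tuple matches
    p = 1.0 - miss_all
    return math.log(1.0 / p)

idf = estimate_idf([1 / 3, 1 / 3])
print(round(idf, 4))  # equals ln(9/5) rounded to four places
```

The sketch does not guard against a keyword that matches nothing (p = 0); a real system would have to treat that case separately.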


Completeness Factor

For “short queries”
• Users prefer results matching more keywords

Derive completeness factor based on extended Boolean model
• Measure Lp distance to the ideal position

[Figure: results plotted on the netvista and maxtor axes; ideal position at (1,1); L2 distance d = 0.5 for c1 ⋈ p1, d = 1 for c2 ⋈ p2, and maximum distance d = 1.41]

            scoreb
c1 ⋈ p1     (1.41-0.5)/1.41 = 0.65
c2 ⋈ p2     (1.41-1)/1.41 = 0.29
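The scoreb arithmetic above can be reproduced with a few lines of code. This is a sketch rather than the system's implementation: the coordinates are hypothetical points chosen only so that their L2 distances to (1,1) are 0.5 and 1, and the function name is an assumption.

```python
def completeness(position, p=2.0):
    """(max distance - Lp distance to the ideal point) / max distance.

    position: one coordinate in [0, 1] per query keyword; the ideal point is all ones.
    """
    distance = sum(abs(1.0 - x) ** p for x in position) ** (1.0 / p)
    max_distance = len(position) ** (1.0 / p)   # distance from the origin to the ideal point
    return (max_distance - distance) / max_distance

c1_p1 = (0.5, 1.0)  # hypothetical point at L2 distance 0.5 from (1, 1)
c2_p2 = (1.0, 0.0)  # hypothetical point at L2 distance 1 from (1, 1)

print(round(completeness(c1_p1), 2), round(completeness(c2_p2), 2))
```

Changing `p` changes how strongly a missing keyword is penalized relative to a weakly matched one.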


Size Normalization

Results in large CNs tend to have more matches to the keywords

$$score_c = (1 + s_1 - s_1 \cdot |CN|) \cdot (1 + s_2 - s_2 \cdot |CN^{nf}|)$$

• Empirically, $s_1 = 0.15$, $s_2 = 1 / (|Q| + 1)$ works well
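A small sketch of the size normalization factor follows, using the empirical parameters from the slide. It assumes that |CN| counts all tuple sets in the candidate network and that |CNnf| counts only the non-free ones (those that must match a keyword); the function and parameter names are hypothetical.

```python
def size_normalization(cn_size, cn_nonfree_size, num_keywords, s1=0.15):
    """score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CN^nf|) with s2 = 1/(|Q| + 1)."""
    s2 = 1.0 / (num_keywords + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nonfree_size)

# Two-keyword query: a 2-node CN versus a 3-node CN with the same two non-free tuple sets.
small = size_normalization(cn_size=2, cn_nonfree_size=2, num_keywords=2)
large = size_normalization(cn_size=3, cn_nonfree_size=2, num_keywords=2)
print(round(small, 3), round(large, 3))
```

With these parameters a single-node CN gets a factor of 1, and every extra node lowers the factor, which counteracts the tendency of large CNs to collect more keyword matches.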


Putting ‘em Together

score(JTT) = scorea * scoreb * scorec

• a: IR-score of the virtual document
• b: completeness factor
• c: size normalization factor

            scorea * scoreb
c1 ⋈ p1     0.98 * 0.65 = 0.64
c2 ⋈ p2     0.44 * 0.29 = 0.13


Comparing Top-1 Results

DBLP; Query = “nikos clique”


#Rel and R-Rank Results

DBLP; 18 queries; Union of top-20 results

          DISCOVER2   [Liu et al, SIGMOD06]   p = 1.0   p = 1.4   p = 2.0
#Rel      2           2                       16        16        18
R-Rank    0.243       0.333                   0.926     0.935     1.000

Mondial; 35 queries; Union of top-20 results

          DISCOVER2   [Liu et al, SIGMOD06]   p = 1.0   p = 1.4   p = 2.0
#Rel      2           10                      27        29        34
R-Rank    0.276       0.491                   0.881     0.909     0.986


Query Processing


3 Steps

1) Generate candidate tuples in every relation in the schema (using full-text indexes)
2) Enumerate all possible Candidate Networks (CNs)
3) Execute the CNs
   • Most algorithms differ here.
   • The key is how to optimize for top-k retrieval


Monotonic Scoring Function

Execute a CN

CN: P^Q ⋈ C^Q

[Figure: candidate grid with C1, C2 on the C axis and P1, P2 on the P axis]

DISCOVER2

Assume: idf_netvista > idf_maxtor and k = 1

            score(ci)   score(pj)   score
c1 ⋈ p1     1.06        0.97        2.03
c2 ⋈ p2     1.06        1.06        2.12

c1 ⋈ p1 < c2 ⋈ p2


Non-Monotonic Scoring Function

Execute a CN

CN: P^Q ⋈ C^Q

[Figure: candidate grid with C1, C2 on the C axis and P1, P2 on the P axis]

SPARK

Assume: idf_netvista > idf_maxtor and k = 1

            score(ci)   score(pj)   scorea
c1 ⋈ p1     1.06        0.97        0.98
c2 ⋈ p2     1.06        1.06        0.44

c1 ⋈ p1 < c2 ⋈ p2 ?

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order


Upper Bounding Function

Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function

Details
• $sumidf = \sum_w idf_w$
• $watf(t) = \frac{1}{sumidf} \cdot \sum_w \bigl(tf_w(t) \cdot idf_w\bigr)$
• $A = sumidf \cdot \bigl(1 + \ln(1 + \ln(\sum_t watf(t)))\bigr)$
• $B = sumidf \cdot \sum_t watf(t)$
• then, $score_a \le uscore_a = \frac{1}{1 - s} \cdot \min(A, B)$

$score_b$ and $score_c$ are constants given the CN, so $score \le uscore$

$uscore$ is monotonic wrt. $watf(t)$
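The bound can be sketched directly from the definitions of sumidf, watf, A and B. The code below is an illustration under stated assumptions, not the system's implementation: the idf weights and the value s = 0.2 are made up, the function name is hypothetical, and where the slide states min(A, B) the sketch uses B alone whenever the summed watf is below 1, because A's nested logarithm can be undefined or negative there.

```python
import math

def uscore_a(tuples, idf, s=0.2):
    """Monotonic upper bound on score_a for a set of joined tuples.

    tuples: list of {keyword: tf} dicts; idf: {keyword: idf weight}.
    """
    sumidf = sum(idf.values())
    # watf(t): idf-weighted average term frequency of tuple t, summed over the tuples
    total_watf = sum(
        sum(tf * idf[w] for w, tf in t.items() if w in idf) / sumidf
        for t in tuples
    )
    if total_watf < 1.0:
        # Assumption: fall back to B, which stays valid and monotonic in this range.
        bound = sumidf * total_watf
    else:
        a = sumidf * (1.0 + math.log(1.0 + math.log(total_watf)))
        b = sumidf * total_watf
        bound = min(a, b)
    return bound / (1.0 - s)

idf = {"netvista": 1.2, "maxtor": 1.0}            # hypothetical weights
u_11 = uscore_a([{"maxtor": 1}, {"netvista": 1}], idf)
u_02 = uscore_a([{"netvista": 1}, {"netvista": 1}], idf)
print(round(u_11, 2), round(u_02, 2))
```

The second result has the larger summed watf and therefore the larger bound, even though its real scorea may be lower; that gap is what the early stopping criterion has to tolerate.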


Early Stopping Criterion

Execute a CN

CN: P^Q ⋈ C^Q

[Figure: candidate grid with C1, C2 on the C axis and P1, P2 on the P axis]

SPARK

Assume: idf_netvista > idf_maxtor and k = 1

            uscore   scorea
c1 ⋈ p1     1.13     0.98
c2 ⋈ p2     1.76     0.44

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order

score( ) < uscore( ): keep checking; score( ) ≥ uscore( ): stop!
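The stopping rule can be written as a short loop over candidates ordered by their upper bound. This is a minimal sketch: the first two candidates reuse the uscore and scorea values from the table, the third candidate is hypothetical and added only so that the loop has something to skip, and the function name is an assumption.

```python
import heapq

def top_k_with_upper_bounds(candidates, k, uscore, score):
    """Check candidates in decreasing uscore order; stop once none can enter the top-k."""
    ordered = sorted(candidates, key=uscore, reverse=True)
    top = []        # min-heap of (real score, candidate)
    checked = 0
    for cand in ordered:
        if len(top) == k and top[0][0] >= uscore(cand):
            break   # every remaining candidate has uscore <= the current k-th best score
        checked += 1
        heapq.heappush(top, (score(cand), cand))
        if len(top) > k:
            heapq.heappop(top)
    return sorted(top, reverse=True), checked

bounds = {"c1 p1": 1.13, "c2 p2": 1.76, "c3 p3": 0.90}   # c3 p3 is hypothetical
scores = {"c1 p1": 0.98, "c2 p2": 0.44, "c3 p3": 0.50}
result, checked = top_k_with_upper_bounds(bounds, 1, bounds.get, scores.get)
print(result, checked)
```

Here the candidate with the highest bound is checked first and turns out to score low, the second candidate takes the lead, and the third is never checked because its bound is already below the best real score.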


Query Processing …

Execute the CNs [VLDB 03]

CN: P^Q ⋈ C^Q

[Figure: candidate grid with C1, C2, C3 on the C axis and P1, P2, P3 on the P axis]

Operations:
[P1, P1] ⋈ [C1, C1]
C.get_next() → C2: [P1, P1] ⋈ C2
P.get_next() → P2: P2 ⋈ [C1, C2]
P.get_next() → P3: P3 ⋈ [C1, C2]
…

// a parametric SQL query is sent to the dbms

• {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores.
• Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj)


Skyline Sweeping Algorithm

Execute the CNs

CN: P^Q ⋈ C^Q

[Figure: candidate grid with C1, C2, C3 on the C axis and P1, P2, P3 on the P axis]

Skyline Sweep

Operations:
P1 ⋈ C1
P2 ⋈ C1
P3 ⋈ C1

Priority Queue:
⟨P1, C1⟩
⟨P2, C1⟩, ⟨P1, C2⟩
⟨P3, C1⟩, ⟨P1, C2⟩, ⟨P2, C2⟩
⟨P1, C2⟩, ⟨P2, C2⟩, ⟨P4, C1⟩, ⟨P3, C2⟩

Dominance: uscore(⟨Pi, Cj⟩) > uscore(⟨Pi+1, Cj⟩) and uscore(⟨Pi, Cj⟩) > uscore(⟨Pi, Cj+1⟩)

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order (sort of)
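The priority-queue trace above can be reproduced with a small generator that only ever enqueues the two candidates directly dominated by the one just dequeued. This is a sketch under assumptions: the relevance lists are made-up numbers, uscore is taken to be a plain sum for illustration, and the function name is hypothetical.

```python
import heapq

def skyline_sweep(num_p, num_c, uscore):
    """Yield (i, j) index pairs in non-increasing uscore order.

    Assumes uscore(i, j) never increases when i or j grows,
    i.e. both input lists are sorted by decreasing relevance.
    """
    heap = [(-uscore(0, 0), 0, 0)]
    seen = {(0, 0)}
    while heap:
        _, i, j = heapq.heappop(heap)
        yield i, j
        # Only the two candidates directly dominated by (i, j) can be next in line.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < num_p and nj < num_c and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(ni, nj), ni, nj))

p_scores = [3.0, 2.0, 1.0]   # hypothetical, sorted by decreasing relevance
c_scores = [2.5, 0.4, 0.2]   # hypothetical, sorted by decreasing relevance
order = list(skyline_sweep(3, 3, lambda i, j: p_scores[i] + c_scores[j]))
print(order[:3])
```

With these numbers the first three pairs correspond to P1 ⋈ C1, P2 ⋈ C1 and P3 ⋈ C1, as in the operations listed above. A full query processor would combine this enumeration order with the early stopping test instead of exhausting the grid.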


Block Pipeline Algorithm

Inherent deficiency of bounding a non-monotonic function with (a few) monotonic upper bounding functions (draw an example)
• Lots of candidates with high uscores return a much lower (real) score
  • unnecessary (expensive) checking
  • cannot stop earlier

Idea
• Partition the space (into blocks) and derive tighter upper bounds for each partition
• “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore)


Block Pipeline Algorithm …

Execute a CN

CN: P^Q ⋈ C^Q

[Figure: C and P axes, each partitioned into the signatures (n:1, m:0) and (n:0, m:1)]

Block Pipeline

Assume: idf_n > idf_m and k = 1

Block   uscore   bscore   scorea
        2.74     1.05
        2.63     2.63
        2.63     2.63
        2.50     0.95

[Figure: priority queue snapshots with the values 2.74, 2.63, 2.63, 1.05, 2.41 and 2.38, ending with "stop!"]

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
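The lazy behaviour described for the block pipeline can be sketched as a two-stage priority queue: a block is first queued by its loose uscore, re-queued by its tighter bscore, and only executed once the bscore itself reaches the head of the queue. This is a simplified illustration, not the published algorithm: each block is reduced to a single real score, the block names and real scores are hypothetical, and only the uscore and bscore pairs follow the table above.

```python
import heapq

def block_pipeline(blocks, k, uscore, bscore, execute):
    """Return the top-k (score, block) pairs and the list of blocks actually executed."""
    heap = [(-uscore(b), 0, b) for b in blocks]   # stage 0: only the loose bound is known
    heapq.heapify(heap)
    top = []          # min-heap of (real score, block)
    executed = []
    while heap:
        bound = -heap[0][0]
        if len(top) == k and top[0][0] >= bound:
            break                                 # nothing left can beat the k-th best
        _, stage, block = heapq.heappop(heap)
        if stage == 0:
            # Delay the expensive check: re-queue with the tighter block bound.
            heapq.heappush(heap, (-bscore(block), 1, block))
        else:
            executed.append(block)
            heapq.heappush(top, (execute(block), block))
            if len(top) > k:
                heapq.heappop(top)
    return sorted(top, reverse=True), executed

bounds = {"A": (2.74, 1.05), "B": (2.63, 2.63), "C": (2.63, 2.63), "D": (2.50, 0.95)}
real = {"A": 1.00, "B": 2.41, "C": 2.38, "D": 0.90}   # hypothetical real scores
best, executed = block_pipeline(
    bounds, 1, lambda b: bounds[b][0], lambda b: bounds[b][1], real.get
)
print(best, executed)
```

In this run the block with the highest uscore is never executed, because its bscore shows that it cannot beat the results already found.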


Efficiency

DBLP
• ~0.9M tuples in total
• k = 10
• PC: 1.8 GHz CPU, 512 MB RAM


Efficiency …

DBLP, DQ13


Conclusions

A system that can perform effective & efficient keyword search on relational databases
• Meaningful query results with appropriate rankings
• second-level response time for ~10M tuple DB (imdb data) on a commodity PC


Q&A

Thank you.