13/04/2015 1 SPARK: Top-k Keyword Query in Relational Database Wei Wang University of New South Wales Australia

21/04/23 1

SPARK: Top-k Keyword Query in Relational Database

Wei Wang

University of New South Wales Australia


Outline

• Demo & Introduction
• Ranking
• Query Evaluation
• Conclusions


Demo


Demo …


SPARK I

Searching, Probing & Ranking Top-k Results
• Thesis project (2004 – 2005) with Nino Svonja
• Taste of Research Summer Scholarship (2005)
• Finally, CISRA prize winner
• http://www.computing.unsw.edu.au/softwareengineering.php


SPARK II

Continued as a research project with PhD student Yi Luo

• 2005 – 2006

• SIGMOD 2007 paper

• Still under active development


A Motivating Example


A Motivating Example …

Top-3 results in our system

1. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)

2. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1); ActorPlay: Character = Himself; Actors: Hanks, Tom

3. Actors: John Hanks; ActorPlay: Character = Alexander Kerst; Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)


Improving the Effectiveness

Three factors are considered to contribute to the final score of a search result (joined tuple tree):

• (modified) IR ranking score.

• the completeness factor.

• the size normalization factor.


Preliminaries

Data Model
• Relation-based

Query Model
• Joined tuple trees (JTTs)
• Sophisticated ranking
  • address one flaw in previous approaches
  • unify AND and OR semantics
  • alternative size normalization


Problems with DISCOVER2

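        score(ci)   score(pj)   score
c1 p1   1.0         1.0         2.0
c2 p2   1.0         1.0         2.0

$$\sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N + 1}{df}$$

$$\sum_{t \in Q \cap D} (1 + \ln(1 + \ln(tf))) \cdot \ln\frac{N + 1}{df}$$

signature   SPARK
(1, 1)      0.98
(0, 2)      0.44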


Virtual Document

Combine tf contributions before tf normalization / attenuation.

ci   pj   score(maxtor)   score(netvista)   scorea*
c1   p1   1.00            1.00              2.00
c2   p2   0.00            1.53              1.53

$$\sum_{t \in Q \cap D} (1 + \ln(1 + \ln(tf))) \cdot \ln\frac{N + 1}{df}$$
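The two score tables can be reproduced with a few lines of Python. This is a minimal sketch, not the system's code: it keeps only the tf attenuation term 1 + ln(1 + ln(tf)) and ignores idf and length normalization. The per-tuple term frequencies are assumptions chosen to match the signatures (1, 1) and (0, 2), and all function names are hypothetical.

```python
import math

def attenuate(tf):
    """Damped term frequency 1 + ln(1 + ln(tf)); 0 when the keyword is absent."""
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

def per_tuple_score(tuples, query):
    """Attenuate tf inside every tuple separately, then sum over the tuples."""
    return sum(attenuate(t.get(w, 0)) for t in tuples for w in query)

def virtual_document_score(tuples, query):
    """Sum tf over all joined tuples first, then attenuate once per keyword."""
    return sum(attenuate(sum(t.get(w, 0) for t in tuples)) for w in query)

query = ["maxtor", "netvista"]
c1_p1 = [{"netvista": 1}, {"maxtor": 1}]    # assumed tf values, signature (1, 1)
c2_p2 = [{"netvista": 1}, {"netvista": 1}]  # assumed tf values, signature (0, 2)

for name, jtt in (("c1 p1", c1_p1), ("c2 p2", c2_p2)):
    print(name,
          round(per_tuple_score(jtt, query), 2),
          round(virtual_document_score(jtt, query), 2))
```

Under these assumptions the per-tuple sum gives 2.0 for both results, while the virtual document gives 2.0 for c1 p1 and 1.53 for c2 p2, so the result that matches both keywords is ranked first.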


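Virtual Document Collection

$$\sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot \frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N + 1}{df}$$

Collection: 3 results
• idf_netvista = ln(4/3)
• idf_maxtor = ln(4/2)

Estimate idf:
• idf_netvista =
• idf_maxtor =

$$\ln\frac{1}{1 - (1 - \frac{1}{3})(1 - \frac{1}{3})} = \ln\frac{9}{5}$$

Estimate avdl = avdlC + avdlP

        scorea
c1 p1   0.98
c2 p2   0.44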
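The two idf computations on this slide can be sketched as follows. The first function is the ln((N + 1) / df) term from the scoring formula. The second generalizes the slide's ln(1 / (1 - (1 - 1/3)(1 - 1/3))) to a product over the joined relations; that generalization, the function names, and the reading of each 1/3 as the fraction of tuples in one relation that contain the keyword are assumptions of this sketch.

```python
import math

def idf_from_counts(n_docs, df):
    """idf = ln((N + 1) / df), computed on a materialized collection."""
    return math.log((n_docs + 1) / df)

def estimated_idf(match_fractions):
    """Estimate idf without materializing the joined results.

    match_fractions holds, per joined relation, the fraction of tuples that
    contain the keyword. A joined result misses the keyword only if every
    one of its tuples misses it, so p = 1 - prod(1 - f) and idf = ln(1 / p).
    """
    miss = 1.0
    for f in match_fractions:
        miss *= 1.0 - f
    if miss >= 1.0:
        raise ValueError("keyword does not occur in any relation")
    return math.log(1.0 / (1.0 - miss))

print(idf_from_counts(3, 3))        # ln(4/3)
print(idf_from_counts(3, 2))        # ln(4/2)
print(estimated_idf([1 / 3, 1 / 3]))  # ln(9/5)
```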


Completeness Factor

For “short queries”
• Users prefer results matching more keywords

Derive completeness factor based on extended Boolean model
• Measure Lp distance to the ideal position

[Figure: results plotted on netvista / maxtor axes with the ideal position at (1,1); L2 distances to the ideal position: d = 0.5 for (c1 p1), d = 1 for (c2 p2), d = 1.41 from the origin]

        scoreb
c1 p1   (1.41-0.5)/1.41 = 0.65
c2 p2   (1.41-1)/1.41 = 0.29
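The arithmetic in the scoreb table can be written as a small function. This is a sketch under assumptions: each result is a point with one coordinate per keyword, normalized to [0, 1], and the factor is one minus the Lp distance to the ideal point (1, …, 1) divided by the largest possible distance. The function name and the concrete coordinates are hypothetical; only the distances 0.5 and 1 come from the slide.

```python
def completeness(coords, p=2.0):
    """1 - (Lp distance to the ideal point) / (Lp distance from the origin to it)."""
    dist = sum((1.0 - x) ** p for x in coords) ** (1.0 / p)
    max_dist = len(coords) ** (1.0 / p)   # sqrt(2) = 1.41 for two keywords, p = 2
    return 1.0 - dist / max_dist

# Hypothetical coordinates that reproduce the slide's distances.
c1_p1 = (1.0, 0.5)   # d = 0.5
c2_p2 = (0.0, 1.0)   # d = 1, one keyword missing entirely

print(round(completeness(c1_p1), 2))
print(round(completeness(c2_p2), 2))
```

With p = 2 this gives 0.65 and 0.29, matching the table. Smaller values of p move the factor toward OR semantics and larger values toward AND semantics, which is the usual behaviour of the extended Boolean model.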


Size Normalization

Results in large CNs tend to have more matches to the keywords

scorec = (1 + s1 - s1 * |CN|) * (1 + s2 - s2 * |CN^nf|)

• Empirically, s1 = 0.15, s2 = 1 / (|Q| + 1) works well
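A direct transcription of the size normalization factor, as a minimal sketch. The function and parameter names are hypothetical, and reading |CN^nf| as the number of non-free tuple sets in the CN is an assumption of this sketch.

```python
def size_normalization(cn_size, cn_nonfree_size, num_keywords, s1=0.15):
    """scorec = (1 + s1 - s1 * |CN|) * (1 + s2 - s2 * |CN^nf|), with s2 = 1 / (|Q| + 1)."""
    s2 = 1.0 / (num_keywords + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nonfree_size)

# A single-relation CN is not penalized; larger CNs are scaled down.
for cn_size, nonfree in ((1, 1), (2, 2), (3, 2)):
    print(cn_size, nonfree, round(size_normalization(cn_size, nonfree, num_keywords=2), 3))
```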


Putting ‘em Together

score(JTT) = scorea * scoreb * scorec

• a: IR-score of the virtual document

• b: completeness factor

• c: size normalization factor

        scorea * scoreb
c1 p1   0.98 * 0.65 = 0.64
c2 p2   0.44 * 0.29 = 0.13


Comparing Top-1 Results

DBLP; Query = “nikos clique”


#Rel and R-Rank Results

DBLP; 18 queries; Union of top-20 results

         DISCOVER2   [Liu et al, SIGMOD06]   p = 1.0   p = 1.4   p = 2.0
#Rel     2           2                       16        16        18
R-Rank   0.243       0.333                   0.926     0.935     1.000

Mondial; 35 queries; Union of top-20 results

         DISCOVER2   [Liu et al, SIGMOD06]   p = 1.0   p = 1.4   p = 2.0
#Rel     2           10                      27        29        34
R-Rank   0.276       0.491                   0.881     0.909     0.986


Query Processing

3 Steps

1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CN)
3. Execute the CNs
  • Most algorithms differ here.
  • The key is how to optimize for top-k retrieval


Monotonic Scoring Function

Execute a CN

CN: P^Q ⋈ C^Q

DISCOVER2

Assume: idf_netvista > idf_maxtor and k = 1

        score(ci)   score(pj)   score
c1 p1   1.06        0.97        2.03
c2 p2   1.06        1.06        2.12

[Figure: tuple lists C (C2, C1) and P (P2, P1); c1 p1 < c2 p2]


Non-Monotonic Scoring Function

Execute a CN

CN: P^Q ⋈ C^Q

SPARK

        score(ci)   score(pj)   scorea
c1 p1   1.06        0.97        0.98
c2 p2   1.06        1.06        0.44

[Figure: tuple lists C (C2, C1) and P (P2, P1); c1 p1 <? c2 p2]

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order


Upper Bounding Function

Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function

Details
• sumidf = Σ_w idf_w
• watf(t) = (1/sumidf) * Σ_w (tf_w(t) * idf_w)
• A = sumidf * (1 + ln(1 + ln(Σ_t watf(t))))
• B = sumidf * Σ_t watf(t)
• then, scorea ≤ uscorea = (1/(1-s)) * min(A, B)
• scoreb and scorec are constants given the CN ⇒ score ≤ uscore
• uscore is monotonic wrt. watf(t)
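The bound can be sketched in Python as follows. This is an illustration of the formulas as listed, not the system's implementation; the function name and data layout are hypothetical. One assumption is added and should be read as such: A is only evaluated when the summed watf is at least 1, where both logarithms are non-negative, and the sketch falls back to B below that.

```python
import math

def uscore_a(tuples, idf, s):
    """Upper bound (1 / (1 - s)) * min(A, B) on scorea.

    tuples: one dict per joined tuple, mapping query keyword -> tf
    idf:    dict mapping every query keyword -> idf
    s:      the length-normalization parameter from the scoring formula
    """
    sumidf = sum(idf.values())
    # watf(t) is the idf-weighted average tf of tuple t; sum it over the JTT.
    total_watf = sum(
        sum(tf * idf[w] for w, tf in t.items()) / sumidf for t in tuples
    )
    b = sumidf * total_watf
    if total_watf >= 1.0:
        a = sumidf * (1 + math.log(1 + math.log(total_watf)))
    else:
        a = math.inf  # assumption of this sketch: A is skipped below 1
    return min(a, b) / (1 - s)

idf = {"a": 2.0, "b": 1.0}   # made-up idf values
print(uscore_a([{"a": 1}, {"b": 1}], idf, s=0.2))
print(uscore_a([{"a": 2}, {"b": 1}], idf, s=0.2))
```

Because the bound depends on the tuples only through the sum of their watf values and grows with that sum, sorting each tuple list by watf is enough to enumerate candidates in an order that respects the bound.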


Early Stopping Criterion

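Execute a CN

CN: P^Q ⋈ C^Q

SPARK

        uscore   scorea
c1 p1   1.13     0.98
c2 p2   1.76     0.44

[Figure: tuple lists C (C2, C1) and P (P2, P1); the real score of the best checked result is compared with the uscore of the next candidate, and once score ≥ uscore: stop!]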



Execute the CNs

CN: P^Q ⋈ C^Q

[Figure: sorted tuple lists C (C1, C2, C3) and P (P1, P2, P3)]

[VLDB 03]

Operations:
                 [P1, P1] ⋈ [C1, C1]
C.get_next()     [P1, P1] ⋈ C2
P.get_next()     P2 ⋈ [C1, C2]
P.get_next()     P3 ⋈ [C1, C2]
…

• {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores.
• Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj)

// a parametric SQL query is sent to the dbms
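The operations above can be sketched for a two-relation CN. This is a simplified illustration of pipelined top-k evaluation under a monotonic (additive) score, not the published algorithm verbatim. The function name is hypothetical, and `joins(p_id, c_id)` is a hypothetical stand-in for the parametric SQL query sent to the DBMS.

```python
import math

def pipeline_top_k(P, C, joins, k):
    """P and C are lists of (tuple_id, score) sorted by score, descending."""
    if k <= 0 or not P or not C:
        return []
    results = []   # (score, p_id, c_id) for every joining pair found so far
    i = j = 0      # P[:i] and C[:j] have been fetched
    while True:
        # Best score that any pair containing an unfetched tuple could reach.
        bound_p = P[i][1] + C[0][1] if i < len(P) else -math.inf
        bound_c = P[0][1] + C[j][1] if j < len(C) else -math.inf
        bound = max(bound_p, bound_c)
        results.sort(reverse=True)
        # Early stop: the k-th result already beats everything unseen.
        if bound == -math.inf or (len(results) >= k and results[k - 1][0] >= bound):
            return results[:k]
        if bound_p >= bound_c:
            p_id, p_score = P[i]   # P.get_next()
            i += 1
            results += [(p_score + c_score, p_id, c_id)
                        for c_id, c_score in C[:j] if joins(p_id, c_id)]
        else:
            c_id, c_score = C[j]   # C.get_next()
            j += 1
            results += [(p_score + c_score, p_id, c_id)
                        for p_id, p_score in P[:i] if joins(p_id, c_id)]

# Made-up scores and join pairs for illustration.
P = [("p1", 3.0), ("p2", 2.0), ("p3", 1.0)]
C = [("c1", 3.0), ("c2", 2.5), ("c3", 0.5)]
pairs = {("p1", "c2"), ("p2", "c1"), ("p3", "c3")}
print(pipeline_top_k(P, C, lambda p, c: (p, c) in pairs, k=1))
```

The stopping test is only valid because the score is a sum of per-tuple scores: any pair that uses an unfetched tuple is bounded by that tuple's score plus the best score in the other list.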


Skyline Sweeping Algorithm

Execute the CNs

CN: P^Q ⋈ C^Q

Skyline Sweep

[Figure: grid of candidates over sorted tuple lists C (C1, C2, C3) and P (P1, P2, P3)]

Dominance:
uscore(<Pi, Cj>) > uscore(<Pi+1, Cj>) and uscore(<Pi, Cj>) > uscore(<Pi, Cj+1>)

Operations:   Priority Queue:
              <P1, C1>
P1 C1         <P2, C1>, <P1, C2>
P2 C1         <P3, C1>, <P1, C2>, <P2, C2>
P3 C1         <P1, C2>, <P2, C2>, <P4, C1>, <P3, C2>
…


sort of
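The queue trace above can be sketched with a binary heap. This is a minimal illustration for a two-relation CN, not the system's code. It assumes `uscore(i, j)` never increases when i or j grows (the dominance property) and that the real score never exceeds uscore. `score(i, j)` is a hypothetical probe that returns None when the two tuples do not join, and a seen-set is used here to avoid pushing the same candidate twice.

```python
import heapq

def skyline_sweep(n_p, n_c, uscore, score, k):
    """Return the top-k (real_score, i, j) triples, best first."""
    if k <= 0 or n_p == 0 or n_c == 0:
        return []
    queue = [(-uscore(0, 0), 0, 0)]   # max-heap on uscore via negation
    seen = {(0, 0)}
    top = []                          # min-heap holding the k best real scores
    while queue:
        neg_u, i, j = heapq.heappop(queue)
        # Every unchecked candidate is dominated by something in the queue,
        # so the popped uscore bounds all remaining real scores.
        if len(top) == k and top[0][0] >= -neg_u:
            break
        real = score(i, j)            # the expensive check
        if real is not None:
            heapq.heappush(top, (real, i, j))
            if len(top) > k:
                heapq.heappop(top)
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n_p and nj < n_c and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(queue, (-uscore(ni, nj), ni, nj))
    return sorted(top, reverse=True)

# Made-up numbers: uscore is additive, real scores are lower and non-monotonic.
p_scores, c_scores = [3.0, 2.0, 1.0], [3.0, 2.0, 1.0]
real_scores = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 4.5}
print(skyline_sweep(3, 3,
                    lambda i, j: p_scores[i] + c_scores[j],
                    lambda i, j: real_scores.get((i, j)),
                    k=1))
```

In this example only three of the nine candidates are checked before the best real score (4.5) reaches the largest uscore left in the queue (4.0).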


Block Pipeline Algorithm

Inherent deficiency of bounding a non-monotonic function with (a few) monotonic upper bounding functions (draw an example)
• Lots of candidates with high uscores return much lower (real) scores
  • unnecessary (expensive) checking
  • cannot stop earlier

Idea
• Partition the space (into blocks) and derive tighter upper bounds for each partition
• “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore)


Block Pipeline Algorithm …

Execute a CN

CN: P^Q ⋈ C^Q

Block Pipeline

Assume: idf_n > idf_m and k = 1

[Figure: worked example over blocks with signatures (n:1, m:0) and (n:0, m:1); a table with columns Block, uscore, bscore, scorea; and a priority queue trace with values between 2.74 and 0.95 that ends in “stop!”]


Efficiency

DBLP
• ~ 0.9M tuples in total
• k = 10
• PC: 1.8 GHz CPU, 512 MB RAM


Efficiency … DBLP, DQ13


Conclusions

A system that can perform effective & efficient keyword search on relational databases
• Meaningful query results with appropriate rankings
• second-level response time for ~10M tuple DB (imdb data) on a commodity PC


Q&A

Thank you.
