21/04/23 1
SPARK: Top-k Keyword Query in Relational Database
Wei Wang
University of New South Wales Australia
21/04/23 2
Outline
• Demo & Introduction
• Ranking
• Query Evaluation
• Conclusions
21/04/23 3
Demo
21/04/23 4
Demo …
21/04/23 5
SPARK I
Searching, Probing & Ranking Top-k Results
• Thesis project (2004 – 2005) with Nino Svonja
• Taste of Research Summer Scholarship (2005)
• Finally, CISRA prize winner
• http://www.computing.unsw.edu.au/softwareengineering.php
21/04/23 6
SPARK II
Continued as a research project with PhD student Yi Luo
• 2005 – 2006
• SIGMOD 2007 paper
• Still under active development
21/04/23 7
A Motivating Example
21/04/23 8
A Motivating Example …
Top-3 results in our system
1 Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)
2 Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1) ActorPlay: Character = Himself Actors: Hanks, Tom
3 Actors: John Hanks ActorPlay: Character = Alexander Kerst Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)
21/04/23 9
Improving the Effectiveness
Three factors contribute to the final score of a search result (joined tuple tree):
• the (modified) IR ranking score
• the completeness factor
• the size normalization factor
21/04/23 10
Preliminaries
Data Model
• Relation-based
Query Model
• Joined tuple trees (JTTs)
• Sophisticated ranking
  • address one flaw in previous approaches
  • unify AND and OR semantics
  • alternative size normalization
21/04/23 11
Problems with DISCOVER2
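            score(ci)  score(pj)  score
c1 ⋈ p1     1.0        1.0        2.0
c2 ⋈ p2     1.0        1.0        2.0

\[ \sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot qtf \cdot \ln\frac{N + 1}{df} \]

With s = 0 and qtf = 1 this reduces to:

\[ \sum_{t \in Q \cap D} (1 + \ln(1 + \ln(tf))) \cdot \ln\frac{N + 1}{df} \]

signature   SPARK
(1, 1)      0.98
(0, 2)      0.44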
21/04/23 12
Virtual Document
Combine tf contributions before tf normalization / attenuation.

            score(maxtor)  score(netvista)  score_a*
c1 ⋈ p1     1.00           1.00             2.00
c2 ⋈ p2     0.00           1.53             1.53

\[ \sum_{t \in Q \cap D} (1 + \ln(1 + \ln(tf))) \cdot \ln\frac{N + 1}{df} \]
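The difference between the two tables can be reproduced with a short sketch. It assumes equal idf weights of 1.0 for both keywords and a made-up assignment of keywords to tuples that matches the signatures (1, 1) and (0, 2); the function names are illustrative and not taken from SPARK.

```python
import math

def attenuate(tf):
    """Sublinear tf damping 1 + ln(1 + ln(tf)); an absent term contributes 0."""
    return 1.0 + math.log(1.0 + math.log(tf)) if tf > 0 else 0.0

def per_tuple_score(tuples, idf):
    """DISCOVER2 style: damp tf inside each tuple, then add the tuple scores."""
    return sum(attenuate(tf) * idf[w] for t in tuples for w, tf in t.items())

def virtual_document_score(tuples, idf):
    """Virtual-document style: add tf across the joined tuples first, then damp once."""
    total = {}
    for t in tuples:
        for w, tf in t.items():
            total[w] = total.get(w, 0) + tf
    return sum(attenuate(tf) * idf[w] for w, tf in total.items())

# Assumptions for illustration only: equal idf, and this keyword placement.
idf = {"maxtor": 1.0, "netvista": 1.0}
c1_p1 = [{"maxtor": 1}, {"netvista": 1}]      # signature (1, 1)
c2_p2 = [{"netvista": 1}, {"netvista": 1}]    # signature (0, 2)

for name, jtt in (("c1-p1", c1_p1), ("c2-p2", c2_p2)):
    print(name,
          round(per_tuple_score(jtt, idf), 2),
          round(virtual_document_score(jtt, idf), 2))
```

Under these assumptions the per-tuple sum cannot tell the two results apart (2.0 for both), while damping the combined tf gives 2.0 for the result that covers both keywords and 1.53 for the one that repeats a single keyword.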
21/04/23 13
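\[ \sum_{t \in Q \cap D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot qtf \cdot \ln\frac{N + 1}{df} \]

Virtual Document Collection
Collection: 3 results
• idf_netvista = ln(4/3)
• idf_maxtor = ln(4/2)
Estimate idf (idf_netvista, idf_maxtor):

\[ \ln\frac{1}{1 - (1 - \frac{1}{3})(1 - \frac{1}{3})} = \ln\frac{9}{5} \]

Estimate avdl = avdl_C + avdl_P

            score_a
c1 ⋈ p1     0.98
c2 ⋈ p2     0.44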
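The estimate ln(1 / (1 - (1 - 1/3)(1 - 1/3))) can be read as follows: if a fraction f_i of the tuples in each joined relation contains the keyword, and matches are treated as independent, a joined result misses the keyword with probability (1 - f_1)(1 - f_2). The sketch below follows that reading, which is an interpretation of the slide; `estimated_idf` is a hypothetical name.

```python
import math

def estimated_idf(match_fractions):
    """ln(1 / p) with p = 1 - prod(1 - f_i), the estimated share of joined
    results containing the keyword (independence between relations assumed)."""
    miss = 1.0
    for f in match_fractions:
        miss *= 1.0 - f
    return math.log(1.0 / (1.0 - miss))

# One third of the tuples in each of the two relations match the keyword.
print(round(estimated_idf([1 / 3, 1 / 3]), 4))   # ln(9/5)
```

The point of estimating is that the virtual document collection never has to be materialized: only per-relation statistics are needed.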
21/04/23 14
Completeness Factor
For “short queries”
• Users prefer results matching more keywords
Derive completeness factor based on extended Boolean model
• Measure Lp distance to the ideal position

[Figure: results plotted on netvista and maxtor axes; ideal position at (1, 1); (c1 ⋈ p1) lies at d = 0.5 and (c2 ⋈ p2) at d = 1 from the ideal position; the maximum distance is d = 1.41]

L2 distance

            score_b
c1 ⋈ p1     (1.41 - 0.5) / 1.41 = 0.65
c2 ⋈ p2     (1.41 - 1) / 1.41 = 0.29
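The score_b values above follow from (d_max - d) / d_max, where d_max is the distance from the origin to the ideal position. A minimal sketch, assuming each result is a point with coordinates normalized to [0, 1]; the coordinates (0, 1) used for the last call are an assumption chosen to give d = 1, since the slide only states the distances.

```python
def completeness(point, p=2.0):
    """(d_max - d) / d_max, with d the Lp distance from `point` to (1, ..., 1).
    Coordinates are assumed to lie in [0, 1]."""
    dist = sum((1.0 - x) ** p for x in point) ** (1.0 / p)
    max_dist = len(point) ** (1.0 / p)   # distance from the origin to the ideal position
    return (max_dist - dist) / max_dist

def completeness_from_distance(d, m=2, p=2.0):
    """Same factor when only the distance d is known (m keywords)."""
    max_dist = m ** (1.0 / p)
    return (max_dist - d) / max_dist

print(round(completeness_from_distance(0.5), 2))   # c1-p1 on the slide
print(round(completeness_from_distance(1.0), 2))   # c2-p2 on the slide
print(round(completeness((0.0, 1.0)), 2))          # assumed coordinates with d = 1
```

Varying p moves the factor between the two Boolean extremes: a small p behaves more like OR, a large p penalizes any missing keyword more heavily and behaves more like AND.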
21/04/23 15
Size Normalization
Results in large CNs tend to have more matches to the keywords
score_c = (1 + s1 - s1 * |CN|) * (1 + s2 - s2 * |CNnf|)
• Empirically, s1 = 0.15 and s2 = 1 / (|Q| + 1) work well
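A small sketch of the size normalization factor with the empirical parameters above. It assumes |CN| is the number of tuple sets in the candidate network and |CNnf| the number of those that match keywords (non-free); the slide does not spell out the second definition, so treat it as an assumption.

```python
def size_normalization(cn_size, cn_nonfree, num_keywords, s1=0.15):
    """score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CNnf|), with s2 = 1 / (|Q| + 1)."""
    s2 = 1.0 / (num_keywords + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nonfree)

# Two-keyword query: a single tuple, a two-way join, and a three-way join
# that passes through one free (non-matching) relation.
print(round(size_normalization(1, 1, 2), 3))
print(round(size_normalization(2, 2, 2), 3))
print(round(size_normalization(3, 2, 2), 3))
```

The factor is 1 for a single-tuple result and shrinks linearly in both sizes, so larger networks need a proportionally better IR score to rank first.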
21/04/23 16
Putting ‘em Together
score(JTT) = score_a * score_b * score_c
• a: IR-score of the virtual document
• b: completeness factor
• c: size normalization factor

            score_a * score_b
c1 ⋈ p1     0.98 * 0.65 = 0.64
c2 ⋈ p2     0.44 * 0.29 = 0.13
21/04/23 17
Comparing Top-1 Results
DBLP; Query = “nikos clique”
21/04/23 18
#Rel and R-Rank Results
DBLP; 18 queries; Union of top-20 results

          DISCOVER2  [Liu et al, SIGMOD06]  p = 1.0  p = 1.4  p = 2.0
#Rel      2          2                      16       16       18
R-Rank    0.243      0.333                  0.926    0.935    1.000

Mondial; 35 queries; Union of top-20 results

          DISCOVER2  [Liu et al, SIGMOD06]  p = 1.0  p = 1.4  p = 2.0
#Rel      2          10                     27       29       34
R-Rank    0.276      0.491                  0.881    0.909    0.986
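The slide does not define the two measures. The sketch below assumes #Rel counts the queries whose top-1 result is relevant and R-Rank is the mean reciprocal rank of the first relevant result; both readings are assumptions, and the toy queries are invented.

```python
def first_relevant_rank(ranked, relevant):
    """1-based rank of the first relevant result, or None if there is none."""
    for rank, result in enumerate(ranked, start=1):
        if result in relevant:
            return rank
    return None

def evaluate(runs):
    """runs: list of (ranked_results, relevant_set), one entry per query.
    Returns (#Rel, R-Rank) under the assumed definitions."""
    ranks = [first_relevant_rank(ranked, rel) for ranked, rel in runs]
    num_rel = sum(1 for r in ranks if r == 1)
    r_rank = sum(1.0 / r for r in ranks if r is not None) / len(runs)
    return num_rel, r_rank

toy_runs = [
    (["a", "b"], {"a"}),        # relevant at rank 1
    (["x", "y", "z"], {"y"}),   # relevant at rank 2
    (["q"], {"r"}),             # nothing relevant returned
]
print(evaluate(toy_runs))
```

These definitions are at least consistent with the Mondial column for p = 2.0: 34 of 35 queries with a relevant top-1 result and one query with the first relevant result at rank 2 give (34 + 1/2) / 35 ≈ 0.986.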
21/04/23 19
Query Processing
3 Steps
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
21/04/23 21
Query Processing …
3 Steps
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
3. Execute the CNs
• Most algorithms differ here.
• The key is how to optimize for top-k retrieval
21/04/23 22
Monotonic Scoring Function
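Execute a CN
CN: P^Q ⋈ C^Q
[Figure: tuples P1, P2 of P and C1, C2 of C laid out along two axes]
DISCOVER2
Assume: idf_netvista > idf_maxtor and k = 1

            score(ci)  score(pj)  score
c1 ⋈ p1     1.06       0.97       2.03
c2 ⋈ p2     1.06       1.06       2.12

score(c1) ≤ score(c2) and score(p1) < score(p2), so score(c1 ⋈ p1) < score(c2 ⋈ p2)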
21/04/23 23
Non-Monotonic Scoring Function
Execute a CN
CN: P^Q ⋈ C^Q
[Figure: tuples P1, P2 of P and C1, C2 of C laid out along two axes]
SPARK
Assume: idf_netvista > idf_maxtor and k = 1

            score(ci)  score(pj)  score_a
c1 ⋈ p1     1.06       0.97       0.98
c2 ⋈ p2     1.06       1.06       0.44

score(c1) ≤ score(c2) and score(p1) < score(p2), yet score_a(c1 ⋈ p1) > score_a(c2 ⋈ p2)
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
21/04/23 24
Upper Bounding Function
Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function
Details
• \( sumidf = \sum_w idf_w \)
• \( watf(t) = (1 / sumidf) \cdot \sum_w (tf_w(t) \cdot idf_w) \)
• \( A = sumidf \cdot (1 + \ln(1 + \ln(\sum_t watf(t)))) \)
• \( B = sumidf \cdot \sum_t watf(t) \)
• then, \( score_a \le uscore_a = (1 / (1 - s)) \cdot \min(A, B) \)
• score_b and score_c are constants given the CN
• score ≤ uscore, and uscore is monotonic wrt. watf(t)
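The bound can be checked numerically with a minimal sketch. It assumes score_a is the pivoted-normalization score of the virtual document, with tf summed over the tuples and a single dl/avdl ratio. The branch for Σ watf(t) < 1 is a guard added in this sketch (the nested logarithm in A is either undefined there or falls below the real score, so B is used alone); it is not stated on the slide. All names and the example idf values are illustrative.

```python
import math

def damp(x):
    """1 + ln(1 + ln(x)), defined for x >= 1."""
    return 1.0 + math.log(1.0 + math.log(x))

def score_a(tuples, idf, s=0.2, dl=1.0, avdl=1.0):
    """IR score of the virtual document formed by the joined tuples."""
    norm = (1.0 - s) + s * dl / avdl
    total = 0.0
    for w, idf_w in idf.items():
        tf = sum(t.get(w, 0) for t in tuples)
        if tf > 0:
            total += damp(tf) / norm * idf_w
    return total

def uscore_a(tuples, idf, s=0.2):
    """(1 / (1 - s)) * min(A, B), computed from the per-tuple watf values."""
    sumidf = sum(idf.values())
    watf = [sum(t.get(w, 0) * idf_w for w, idf_w in idf.items()) / sumidf
            for t in tuples]
    total = sum(watf)
    b = sumidf * total
    if total < 1.0:
        return b / (1.0 - s)        # guard (assumption of this sketch): use B alone
    a = sumidf * damp(total)
    return min(a, b) / (1.0 - s)

idf = {"netvista": 1.2, "maxtor": 0.7}   # invented weights
cases = [
    [{"netvista": 1}, {"maxtor": 1}],
    [{"netvista": 1}, {"netvista": 1}],
    [{"netvista": 1}, {}],
    [{"maxtor": 3, "netvista": 2}, {"netvista": 5}],
]
for jtt in cases:
    print(round(score_a(jtt, idf), 3), "<=", round(uscore_a(jtt, idf), 3))
```

Because uscore_a depends on the tuples only through the sum of their watf values, sorting each tuple set by watf is enough to enumerate candidates in non-increasing order of the bound.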
21/04/23 25
Early Stopping Criterion
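Execute a CN
CN: P^Q ⋈ C^Q
[Figure: tuples P1, P2 of P and C1, C2 of C laid out along two axes]
SPARK
Assume: idf_netvista > idf_maxtor and k = 1

            uscore  score_a
c1 ⋈ p1     1.13    0.98
c2 ⋈ p2     1.76    0.44

1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
Compare the best real score found so far with the largest uscore among the unchecked candidates; once score(·) ≥ uscore(·), stop!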
21/04/23 26
Query Processing …
Execute the CNs [VLDB 03]
CN: P^Q ⋈ C^Q
[Figure: sorted tuples P1, P2, P3 of P and C1, C2, C3 of C laid out along two axes]
• {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores.
• Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj)
Operations:
• [P1, P1] ⋈ [C1, C1]
• C.get_next()
• [P1, P1] ⋈ C2
• P.get_next()
• P2 ⋈ [C1, C2]
• P.get_next()
• P3 ⋈ [C1, C2]
• …
// a parametric SQL query is sent to the dbms
21/04/23 27
Skyline Sweeping Algorithm
Execute the CNs
CN: P^Q ⋈ C^Q
[Figure: sorted tuples P1, P2, P3 of P and C1, C2, C3 of C laid out along two axes]
Skyline Sweep

Operations    Priority Queue
              <P1, C1>
P1 ⋈ C1       <P2, C1>, <P1, C2>
P2 ⋈ C1       <P3, C1>, <P1, C2>, <P2, C2>
P3 ⋈ C1       <P1, C2>, <P2, C2>, <P4, C1>, <P3, C2>
…             …

Dominance: uscore(<Pi, Cj>) > uscore(<Pi+1, Cj>) and uscore(<Pi, Cj>) > uscore(<Pi, Cj+1>)
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order (sort of)
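The queue trace above can be sketched with a binary heap. This is a toy version under stated assumptions: candidates are grid positions (i, j) over the two sorted tuple lists, `uscore` never increases as i or j grows and never underestimates `score`, and `score` stands in for the probe that a real system would send to the DBMS. The lists P and C and the penalty table are invented for illustration.

```python
import heapq

def skyline_sweep(n_p, n_c, uscore, score, k):
    """Return (k best real scores, number of candidates checked)."""
    queue = [(-uscore(0, 0), 0, 0)]   # max-heap on uscore via negation
    enqueued = {(0, 0)}
    top = []                          # min-heap holding the k best real scores
    checked = 0
    while queue:
        neg_u, i, j = heapq.heappop(queue)
        if len(top) == k and top[0] >= -neg_u:
            break                     # no unchecked candidate can enter the top-k
        checked += 1
        s = score(i, j)               # real system: check this pair in the DBMS
        if len(top) < k:
            heapq.heappush(top, s)
        elif s > top[0]:
            heapq.heapreplace(top, s)
        # Only the two candidates directly dominated by (i, j) become eligible.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n_p and nj < n_c and (ni, nj) not in enqueued:
                enqueued.add((ni, nj))
                heapq.heappush(queue, (-uscore(ni, nj), ni, nj))
    return sorted(top, reverse=True), checked

P = [3.0, 2.0, 1.0, 0.5]              # invented upper-bound contributions, sorted
C = [2.5, 1.5, 0.5]
penalty = {(0, 0): 0.3, (1, 0): 0.9, (0, 1): 0.5}   # makes the real score non-monotonic

def uscore(i, j):
    return P[i] + C[j]

def score(i, j):
    return uscore(i, j) * penalty.get((i, j), 0.8)

best, checked = skyline_sweep(len(P), len(C), uscore, score, k=2)
print(best, checked)
```

The stopping test is safe because every unchecked candidate is dominated by some entry still in the queue, so the head of the queue bounds all of them. The sweep still has to check candidates whose uscore is high but whose real score is low, which is the weakness the block pipeline algorithm targets.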
21/04/23 28
Block Pipeline Algorithm
Inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions (draw an example)
• Lots of candidates with high uscores return much lower (real) scores
  • unnecessary (expensive) checking
  • cannot stop earlier
Idea
• Partition the space (into blocks) and derive tighter upper bounds for each partition
• “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore)
21/04/23 29
Block Pipeline Algorithm …
Execute a CN
CN: P^Q ⋈ C^Q
Block Pipeline
Assume: idf_n > idf_m and k = 1
[Figure: the space of C and P tuples partitioned into blocks by signature, (n:1, m:0) and (n:0, m:1) on each axis; a table with columns Block, uscore, bscore, score_a lists the value pairs 2.74 / 1.05, 2.63 / 2.63, 2.63 / 2.63 and 2.50 / 0.95; the walk-through ends with the values 2.41 and 2.38 and “stop!”]
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
21/04/23 30
Efficiency DBLP
• ~ 0.9M tuples in total
• k = 10
• PC: 1.8 GHz CPU, 512 MB RAM
21/04/23 31
Efficiency … DBLP, DQ13
21/04/23 32
Conclusions
A system that can perform effective & efficient keyword search on relational databases
• Meaningful query results with appropriate rankings
• Response times on the order of seconds for a ~10M tuple DB (imdb data) on a commodity PC
21/04/23 33
Q&A
Thank you.