SPARK: Top-k Keyword Query in Relational Databases
Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou
Univ. of New South Wales, Univ. of Queensland
SIGMOD 2007
2009. 02. 05.
Summarized by Jaehui Park, IDS Lab., Seoul National University
Presented by Jaehui Park, IDS Lab., Seoul National University
Copyright 2009 by CEBT
Introduction
Demand for RDBs to support effective and efficient IR-style keyword queries
Features
– Assembling data collectively
– Supporting casual users
– Revealing unexpected relationships among entities
– More flexible search for back-end databases than pre-built template querying
Issues
Search results contradictory to human perception (in previous work)
Technical challenges
– Aggregating the final score of an answer
Relying on monotonicity of the rank aggregation function
Contributions
New ranking function
– Non-monotonic nature of ranking methods
Techniques for avoiding unnecessary DB accesses
– Skyline sweeping algorithm
– Block pipeline algorithm
Preliminaries
Keyword queries on a set of relations
Joined Tuple Tree (JTT)
Tree of tuples
– Top-k results
Foreign key to primary key relationships
Candidate Network (CN)
Relevance score
– How relevant the JTT is to the query
Example query : “maxtor netvista”
[Figure: top-3 JTTs for the example query; JTTs shown: c3, c3->p2, c1->p1, c2->p2, c2->p2<-c3]
Preliminaries: existing solutions (DISCOVER 2002, 2003)
Enumerating (Union) all possible CNs
CQ->PQ : valid
CQ->U : not valid
CQ->U<-CQ : may be valid
Example (cont.)
Pruning rules
Prune duplicate CNs
Prune non-minimal CNs
Prune CNs of type: RQ<-S->RQ
DISCOVER (2003)
Preliminaries: existing solutions (DISCOVER 2002, 2003)
Upper bounding functions
Bound the scores of potential answers from each CN
– Stop query execution earlier
– Ex) Sparse algorithm
Global pipeline algorithm
Focus of this paper
How to score a JTT : Ranking Function
How to generate and order the SQL queries for the CNs : Top-k Join query
– Minimal DB accesses are required before top-k results are returned.
id  score      id  score
t1  50         I1  70
t2  40         I2  60
t3  30         I3  40
t4  20         I4  20
[Figure: the two lists are combined by an aggregate function]
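The idea of bounding the scores of answers not yet produced can be sketched in Python with the two score lists from the slide. This is only an illustration under stated assumptions: the aggregate is taken to be SUM, every pair is assumed to be a potential join result, and the helper name `unseen_upper_bound` is hypothetical rather than taken from DISCOVER or SPARK.

```python
def unseen_upper_bound(a, b, i, j):
    """Upper bound on SUM for any pair outside a[:i] x b[:j].

    a and b are non-empty score lists sorted in descending order;
    i and j count how many entries of each list have been retrieved so far.
    """
    candidates = []
    if i < len(a):
        # Best pair that uses an unretrieved entry of a.
        candidates.append(a[i] + b[0])
    if j < len(b):
        # Best pair that uses an unretrieved entry of b.
        candidates.append(a[0] + b[j])
    # Nothing left to retrieve means no unseen pair exists.
    return max(candidates) if candidates else float("-inf")


t_scores = [50, 40, 30, 20]  # first list on the slide
i_scores = [70, 60, 40, 20]  # second list on the slide

# After retrieving one entry from each list, compare the best seen
# score with the bound on everything still unseen.
best_seen = t_scores[0] + i_scores[0]
bound = unseen_upper_bound(t_scores, i_scores, 1, 1)
can_stop_for_top1 = best_seen >= bound
```

Because SUM is monotonic, no unseen pair can exceed the bound, so a top-1 query could stop as soon as the best seen score reaches it. This early-stopping argument is exactly what breaks down once the scoring function is non-monotonic.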
Ranking Function
Problems with existing ranking functions
Monotonic aggregation functions have been considered.
– SUM
Discordance with human perception
Side Effect : Overly rewarding contributions of the same keyword in different tuples in the same JTT
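A small Python sketch of this side effect, using the example query "maxtor netvista". The per-tuple score here is a deliberately crude stand-in (a raw count of query keyword occurrences), and the tuple texts are made up for illustration; neither comes from the paper.

```python
QUERY = ["maxtor", "netvista"]


def tuple_score(text, query=QUERY):
    # Stand-in per-tuple score: number of query keyword occurrences.
    words = text.lower().split()
    return sum(words.count(w) for w in query)


def sum_score(jtt):
    # Monotonic SUM aggregation over the tuples of a JTT.
    return sum(tuple_score(t) for t in jtt)


def keywords_covered(jtt, query=QUERY):
    # How many distinct query keywords appear anywhere in the JTT.
    words = set(" ".join(jtt).lower().split())
    return sum(1 for w in query if w in words)


# Hypothetical JTTs, each given as a list of tuple texts.
jtt_a = ["maxtor disk failed", "replaced maxtor with another maxtor"]
jtt_b = ["netvista desktop", "maxtor disk"]

scores = {"a": sum_score(jtt_a), "b": sum_score(jtt_b)}
coverage = {"a": keywords_covered(jtt_a), "b": keywords_covered(jtt_b)}
```

Under SUM, `jtt_a` outranks `jtt_b` even though it matches only one of the two keywords, because the repeated "maxtor" is counted again in each tuple. This is the kind of result that contradicts human perception.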
CQ->PQ
Ranking Function
Modeling a JTT as a virtual document
attenuating : same keyword in different relations
Technical issues
Expensive cost to compute
Completeness score and Size normalization score
[Figure: tuples C(t1) and P(t1) with keywords K1 and K2]
Top-k Join algorithm
None of the existing top-k query processing methods deals with non-monotonic scoring functions
c[i]->p[i]
max(score(p[1], c[i+1]), score(p[j+1], c[1]))
Monotonic, upper bounding function to the actual function
Lemma 1. score(T,Q) can be bounded by a function uscore(T,Q)=1/(1-s) * min(A,B)
max(uscore(c[i+1], p[1]), uscore(c[1],p[j+1]))
[Figure: tuples C(t1) and P(t2) with keywords K1 and K2, marked X]
Top-k Join algorithm
Skyline Sweeping Algorithm
Avoid unnecessary join checking
-> minimal number of accesses to the database
Dominance relationship among candidates
– Checking candidate of higher upper bound first
– Priority queue
Descending order of the upper bound scores
Technical point
– Duplicate checking
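The points above (priority queue ordered by upper bound, dominance between candidates, duplicate checking) can be sketched in Python for a two-relation CN c->p. This is a minimal sketch, not the authors' implementation: the function names, the `joins` set standing in for the database, and the toy scoring functions are all assumptions. It also assumes that `uscore` never underestimates the real score and does not increase as either index grows.

```python
import heapq


def skyline_sweep(cs, ps, uscore, real_score, k):
    """Return up to k (score, c_id, p_id) results.

    cs and ps are lists of (id, score) sorted by descending score.
    real_score returns None when the two tuples do not join.
    """
    results = []
    heap = [(-uscore(cs[0], ps[0]), 0, 0)]  # max-heap via negated bounds
    seen = {(0, 0)}                         # duplicate checking
    while heap:
        neg_bound, i, j = heapq.heappop(heap)
        # Stop once the k-th real score matches the best remaining bound.
        if len(results) >= k and results[k - 1][0] >= -neg_bound:
            break
        score = real_score(cs[i], ps[j])    # the only "database access"
        if score is not None:
            results.append((score, cs[i][0], ps[j][0]))
            results.sort(reverse=True)
        # Only the two candidates directly dominated by (i, j) are added.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(cs) and nj < len(ps) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(cs[ni], ps[nj]), ni, nj))
    return results[:k]


# Hypothetical data: which tuples actually join is made up.
cs = [("c1", 50), ("c2", 40), ("c3", 30), ("c4", 20)]
ps = [("p1", 70), ("p2", 60), ("p3", 40), ("p4", 20)]
joins = {("c1", "p2"), ("c2", "p1"), ("c3", "p3")}
calls = []  # records every join check


def toy_uscore(c, p):
    return c[1] + p[1]


def toy_real_score(c, p):
    calls.append((c[0], p[0]))
    if (c[0], p[0]) not in joins:
        return None
    return c[1] + p[1] - 10  # real score stays below the upper bound


top2 = skyline_sweep(cs, ps, toy_uscore, toy_real_score, k=2)
```

In this toy run only a few of the 16 possible pairs are ever checked, which is the point of the sweep: a candidate enters the queue only after a candidate that dominates it has been processed.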
Top-k Join algorithm
Large gaps between the upper bound scores and the corresponding real scores
Harder to stop early
– upper bound of un-processed >> real score
Block Pipeline Algorithm
Employing local non-monotonic upper bounding function that bounds the real score of JTTs more accurately
Tighter upper bounding: bscore < uscore
signature
– An ordered sequence of term frequencies for all the query keywords
<tf_w1(t), …, tf_wm(t)>
– Signature of the block
< >
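As a rough illustration of the signature idea, the Python sketch below computes the term-frequency signature of a tuple and groups tuples that share a signature. Grouping by identical signature is an assumption about how blocks are formed, and the tuple texts and helper names are hypothetical.

```python
from collections import defaultdict


def signature(text, query):
    # Ordered sequence of term frequencies, one entry per query keyword.
    words = text.lower().split()
    return tuple(words.count(w) for w in query)


def group_by_signature(tuples, query):
    # Assumed blocking scheme: tuples with equal signatures share a block.
    blocks = defaultdict(list)
    for tuple_id, text in tuples.items():
        blocks[signature(text, query)].append(tuple_id)
    return dict(blocks)


query = ["maxtor", "netvista"]
tuples = {
    "c1": "maxtor disk on netvista",
    "c2": "netvista maxtor upgrade",
    "c3": "maxtor maxtor",
}
blocks = group_by_signature(tuples, query)
```

Since every tuple in such a block has the same term frequencies, a score bound computed from the signature applies to the whole block at once, which is what lets a block-level bound be tighter than a global one.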
Experiments
Dataset: IMDB, DBLP and Mondial
Oracle 10g, MySQL 5.00.18, JDK 1.5
Implementations: Sparse, Global pipeline (GP), Skyline sweep (SS), Block pipeline (BP)
Metrics
Number of top-1 answers that are relevant (#Rel)
Reciprocal rank (R-Rank)
Relevant answer
It must match all the search keywords
Its size must be the smallest
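The two relevance conditions and the reciprocal rank metric can be sketched in Python as follows. The answer representation (a dict with `keywords` and `size`) is hypothetical, and "smallest size" is interpreted here as smallest among the returned answers that match every keyword, which is an assumption.

```python
def matches_all(answer, query):
    return all(w in answer["keywords"] for w in query)


def reciprocal_rank(ranked, query):
    """1 / rank of the first relevant answer, or 0.0 if there is none."""
    sizes = [a["size"] for a in ranked if matches_all(a, query)]
    if not sizes:
        return 0.0
    min_size = min(sizes)
    for rank, answer in enumerate(ranked, start=1):
        # Relevant: matches all keywords and has the smallest size.
        if matches_all(answer, query) and answer["size"] == min_size:
            return 1.0 / rank
    return 0.0


query = ["maxtor", "netvista"]
ranked = [
    {"keywords": {"maxtor"}, "size": 1},              # misses a keyword
    {"keywords": {"maxtor", "netvista"}, "size": 2},  # relevant
    {"keywords": {"maxtor", "netvista"}, "size": 3},  # not the smallest
]
rr = reciprocal_rank(ranked, query)
top1_is_relevant = reciprocal_rank(ranked[:1], query) == 1.0
```

Here the first relevant answer sits at rank 2, so the reciprocal rank is 0.5, and the top-1 answer would not count toward #Rel.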
Experiments
Effectiveness
Efficiency
Observations
– Fastest : BP
– SS outperforms Sparse and GP
– Sparse and GP perform comparably (GP is better than Sparse for small k or easy queries)
– All algorithms are more responsive for smaller k values
Conclusion
New ranking method
Adapts state-of-the-art IR ranking functions and principles
Query processing method
Tailored for our non-monotonic ranking functions
Extensive experiments on large scale real databases
High precision with high efficiency