SPARK: Top-k Keyword Query in Relational Databases Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou Univ. of New South Wales, Univ. of Queensland SIGMOD 2007

SPARK: Top-k Keyword Query in Relational Databases

Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou

Univ. of New South Wales, Univ. of Queensland

SIGMOD 2007

2009. 02. 05.

Summarized by Jaehui Park, IDS Lab., Seoul National University

Presented by Jaehui Park, IDS Lab., Seoul National University

Copyright 2009 by CEBT

Introduction

Demand for RDB to support effective and efficient IR-style keyword queries

Features

– Assembling data collectively

– Supporting casual users

– Revealing unexpected relationships among entities

– More flexible search for back-end databases than pre-built template querying

Issues

Search results contradictory to human perception (in previous work)

Technical challenges

– Aggregating final score of an answer

Relying on monotonicity of the rank aggregation function

Contributions

New ranking function

– Non-monotonic nature of ranking methods

Techniques for avoiding unnecessary DB accesses

– Skyline sweeping algorithm

– Block pipeline algorithm



Preliminaries

Keyword queries on a set of relations

Joined Tuple Tree (JTT)

Tree of tuples

– Top-k results

Foreign key to primary key relationships

Candidate Network (CN)

Relevance score

– How relevant the JTT is to the query

Example query: “maxtor netvista”


Top-3 JTTs (figure labels: c3, c3->p2, c1->p1, c2->p2, c2->p2<-c3)


Preliminaries: existing solutions (DISCOVER 2002, 2003)

Enumerating (Union) all possible CNs

CQ->PQ : valid

CQ->U : not valid

CQ->U<-CQ : may be valid

Example (cont.)


Pruning rules

Prune duplicate CNs

Prune non-minimal CNs

Prune CNs of type: RQ<-S->RQ

DISCOVER (2003)
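The third pruning rule can be sketched in code. This is a minimal illustration, not DISCOVER's implementation: the node/edge representation and the function name `violates_fk_rule` are hypothetical, and it assumes at most one foreign key between any two relations. An edge (src, dst) means the tuple at src holds a foreign key referencing dst, matching the arrow direction used above.

```python
from collections import Counter

def violates_fk_rule(nodes, edges):
    """Return True if the CN contains the pattern R <- S -> R.

    nodes: dict mapping node id -> relation name (e.g. "CQ", "PQ").
    edges: list of (src, dst) pairs; src holds the foreign key referencing dst.
    Assumption: at most one foreign key between any two relations, so a
    single S tuple can reference only one R tuple.
    """
    references = Counter()
    for src, dst in edges:
        # Count how often node `src` points at a node of the same relation.
        references[(src, nodes[dst])] += 1
    return any(count > 1 for count in references.values())

# CQ -> PQ <- CQ: two complaints referencing one product, kept.
kept = violates_fk_rule({1: "CQ", 2: "PQ", 3: "CQ"}, [(1, 2), (3, 2)])
# PQ <- CQ -> PQ: one complaint referencing two products, pruned.
pruned = violates_fk_rule({1: "PQ", 2: "CQ", 3: "PQ"}, [(2, 1), (2, 3)])
```

The pattern is pruned because both RQ nodes would have to be the same tuple, so the resulting tree could never be a valid JTT.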


Preliminaries: existing solutions (DISCOVER 2002, 2003)

Upper bounding functions

Bound the scores of potential answers from each CN

– Stop query execution earlier

– Ex) Sparse algorithm

Global pipeline algorithm

Focus of this paper

How to score a JTT : Ranking Function

How to generate and order the SQL queries for the CNs : Top-k Join query

– Minimal DB accesses are required before top-k results are returned.


Two score-sorted lists combined by an aggregate function:

id   score      id   score
t1   50         I1   70
t2   40         I2   60
t3   30         I3   40
t4   20         i4   20
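With a monotonic aggregate such as SUM, execution can stop as soon as the current k-th result is at least as good as anything still unseen. The sketch below illustrates that idea on the two lists above; it is not the Sparse or Global pipeline algorithm itself. The function name `topk_sum_join` and the `joins` set (which pairs actually join) are hypothetical, and both lists are assumed non-empty.

```python
import heapq

def topk_sum_join(left, right, joins, k):
    """Top-k joined pairs under SUM with early stopping.

    left, right: lists of (id, score), sorted by descending score.
    joins: set of (left_id, right_id) pairs that join (hypothetical input).
    """
    results = []            # min-heap of (total, left_id, right_id), size <= k
    seen_l = seen_r = 0     # number of tuples consumed from each list
    while seen_l < len(left) or seen_r < len(right):
        # Alternate between the lists; pair each new tuple with all tuples
        # already seen on the other side.
        if seen_r >= len(right) or (seen_l <= seen_r and seen_l < len(left)):
            lid, ls = left[seen_l]
            seen_l += 1
            pairs = [(ls + rs, lid, rid) for rid, rs in right[:seen_r]]
        else:
            rid, rs = right[seen_r]
            seen_r += 1
            pairs = [(ls + rs, lid, rid) for lid, ls in left[:seen_l]]
        for total, lid, rid in pairs:
            if (lid, rid) in joins:
                heapq.heappush(results, (total, lid, rid))
                if len(results) > k:
                    heapq.heappop(results)
        # Upper bound on any pair that involves at least one unseen tuple.
        bound = float("-inf")
        if seen_l < len(left):
            bound = max(bound, left[seen_l][1] + right[0][1])
        if seen_r < len(right):
            bound = max(bound, left[0][1] + right[seen_r][1])
        if len(results) == k and results[0][0] >= bound:
            break           # no unseen pair can enter the top-k
    return sorted(results, reverse=True)

left = [("t1", 50), ("t2", 40), ("t3", 30), ("t4", 20)]
right = [("I1", 70), ("I2", 60), ("I3", 40), ("i4", 20)]
top1 = topk_sum_join(left, right, {("t2", "I1"), ("t3", "I3")}, k=1)
```

The stopping test is only sound because SUM is monotonic: a lower-ranked input tuple can never produce a higher total.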


Ranking Function

Problems with existing ranking functions

Monotonic aggregation functions have been considered.

– SUM

Discordance with human perception

Side effect: overly rewarding contributions of the same keyword in different tuples in the same JTT


CQ->PQ


Ranking Function

Modeling a JTT as a virtual document

Attenuating the contribution of the same keyword appearing in different relations

Technical issues

Expensive cost to compute

Completeness score and Size normalization score
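The effect of the virtual-document model on a repeated keyword can be illustrated with a small sketch. It assumes the common sub-linear term-frequency damping 1 + ln(1 + ln(tf)) and leaves out idf, length normalization, the completeness score and the size normalization score, so it is not the full ranking function. All function names are hypothetical.

```python
import math

def damped_tf(tf):
    """Sub-linear term-frequency weight 1 + ln(1 + ln(tf)); 0 if the term is absent."""
    return 1.0 + math.log(1.0 + math.log(tf)) if tf > 0 else 0.0

def sum_of_tuple_scores(tuple_tfs):
    """SUM aggregation: damp each tuple's tf separately, then add."""
    return sum(damped_tf(tf) for tf in tuple_tfs)

def virtual_document_score(tuple_tfs):
    """Virtual document: add the tfs over the whole JTT first, then damp once."""
    return damped_tf(sum(tuple_tfs))

# One keyword occurring once in each of two tuples of the same JTT.
tfs = [1, 1]
by_sum = sum_of_tuple_scores(tfs)         # each occurrence rewarded in full
by_virtual = virtual_document_score(tfs)  # second occurrence is attenuated
```

Because the damping is applied after the tuples are combined, the score of a JTT is no longer a monotonic function of its tuples' individual scores, which is why the existing top-k techniques do not apply directly.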


[Figure labels: C(t1), P(t1), K1, K2]


Top-k Join algorithm

None of the existing top-k query processing methods deals with non-monotonic scoring functions

c[i]->p[i]

max(score(p[1], c[i+1]), score(p[j+1], c[1]))

Monotonic, upper bounding function to the actual function

Lemma 1. score(T,Q) can be bounded by a function uscore(T,Q)=1/(1-s) * min(A,B)

max(uscore(c[i+1], p[1]), uscore(c[1],p[j+1]))


[Figure labels: C(t1), P(t2), K1, K2, X]



Skyline Sweeping Algorithm

Avoid unnecessary join checking

-> minimal number of accesses to the database

Dominance relationship among candidates

– Checking candidate of higher upper bound first

– Priority queue

Descending order of the upper bound scores

Technical point

– Duplicate checking
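The sweep over candidates can be sketched as follows for a CN with two tuple lists. This is a simplified illustration rather than the paper's exact procedure: the function names and the `real_score` callback are hypothetical, duplicate checking is done with a plain set, both lists are assumed non-empty, and `uscore` is assumed to be a monotonic upper bound on `real_score`.

```python
import heapq

def skyline_sweep(c_list, p_list, uscore, real_score, k):
    """Probe candidate pairs in descending order of their upper bound.

    c_list, p_list: tuples ordered so that uscore never increases with index.
    uscore(c, p): monotonic upper bound on the real score.
    real_score(c, p): actual score, or None if c and p do not join.
    Returns (top-k as (score, i, j) triples, number of join probes).
    """
    queue = [(-uscore(c_list[0], p_list[0]), 0, 0)]  # max-heap via negation
    pushed = {(0, 0)}                                # duplicate checking
    top = []                                         # min-heap of results
    probes = 0
    while queue:
        neg_bound, i, j = heapq.heappop(queue)
        if len(top) == k and top[0][0] >= -neg_bound:
            break  # no remaining candidate can beat the current k-th result
        probes += 1                                  # one database access
        score = real_score(c_list[i], p_list[j])
        if score is not None:
            heapq.heappush(top, (score, i, j))
            if len(top) > k:
                heapq.heappop(top)
        # (i, j) dominates (i+1, j) and (i, j+1); enqueue them only now.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(c_list) and nj < len(p_list) and (ni, nj) not in pushed:
                pushed.add((ni, nj))
                heapq.heappush(queue, (-uscore(c_list[ni], p_list[nj]), ni, nj))
    return sorted(top, reverse=True), probes

# Toy data: tuples are represented by a single number (hypothetical).
c_list, p_list = [5, 4, 3, 1], [6, 4, 2]
joinable = {(4, 6), (3, 4), (5, 2)}
bound = lambda c, p: c + p
real = lambda c, p: (c + p) - 0.5 * abs(c - p) if (c, p) in joinable else None
top, probes = skyline_sweep(c_list, p_list, bound, real, k=2)
```

A candidate enters the queue only after a candidate that dominates it has been probed, so the queue stays small and the sweep can stop without touching the remaining pairs.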




Large gaps between the upper bound scores and the corresponding real scores

Harder to stop early

– Upper bound of unprocessed candidates >> real score

Block Pipeline Algorithm

Employing local non-monotonic upper bounding function that bounds the real score of JTTs more accurately

Tighter upper bounding: bscore < uscore

Signature

– An ordered sequence of term frequencies for all the query keywords

<tf_w1(t), …, tf_wm(t)>, for query keywords w1, …, wm

– Signature of the block

< >
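A signature can be computed per tuple and used to group tuples. The sketch below assumes that a block collects the tuples sharing one signature and that tuples matching no keyword are left out; the whitespace tokenization, the sample tuple texts and the function names are hypothetical.

```python
from collections import defaultdict

def signature(text, keywords):
    """Ordered sequence of term frequencies, one entry per query keyword."""
    tokens = text.lower().split()
    return tuple(tokens.count(w) for w in keywords)

def partition_into_blocks(tuples, keywords):
    """Group tuple ids by signature (assumed definition of a block)."""
    blocks = defaultdict(list)
    for tid, text in tuples:
        sig = signature(text, keywords)
        if any(sig):                 # skip tuples that match no keyword
            blocks[sig].append(tid)
    return dict(blocks)

keywords = ["maxtor", "netvista"]
tuples = [
    ("p1", "maxtor drive"),
    ("c1", "netvista crashed with maxtor disk maxtor"),
    ("c2", "netvista slow"),
    ("c3", "maxtor noisy"),
    ("p2", "ibm thinkpad"),
]
blocks = partition_into_blocks(tuples, keywords)
```

Since every tuple in a block has the same term frequencies, one bound computed from the signatures can cover all combinations of tuples drawn from those blocks.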



Experiments

Dataset: IMDB, DBLP and Mondial

Oracle 10g, MySQL 5.00.18, JDK 1.5

Implementation: Sparse, Global pipeline (GP), Skyline sweep (SS), Block pipeline (BP)

Metrics

Number of top-1 answers that are relevant (#Rel)

Reciprocal rank (R-Rank)

Relevant answer

It must match all the search keywords

Its size must be the smallest
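Both metrics follow directly from the two relevance criteria. In the sketch below each answer is reduced to a pair (number of matched keywords, size), and "smallest" is taken among the returned answers that match all keywords; both choices are assumptions made for illustration, and the function names are hypothetical.

```python
def relevance_flags(answers, num_keywords):
    """Mark answers that match all keywords and have the smallest such size.

    answers: ranked list of (matched_keywords, size) pairs.
    """
    complete = [size for matched, size in answers if matched == num_keywords]
    if not complete:
        return [False] * len(answers)
    smallest = min(complete)
    return [matched == num_keywords and size == smallest
            for matched, size in answers]

def reciprocal_rank(flags):
    """1 / rank of the first relevant answer, or 0.0 if there is none."""
    for rank, relevant in enumerate(flags, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

# Two queries with two keywords each; answers are listed in ranked order.
runs = [
    [(1, 1), (2, 2), (2, 3)],   # first relevant answer at rank 2
    [(2, 2), (2, 2)],           # relevant answer at rank 1
]
rranks = [reciprocal_rank(relevance_flags(run, 2)) for run in runs]
num_rel = sum(1 for rr in rranks if rr == 1.0)   # queries with a relevant top-1
```

#Rel counts the queries whose top-1 answer is relevant, while R-Rank also gives partial credit when the first relevant answer appears lower in the list.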




Effectiveness

Efficiency

Observations

– Fastest : BP

– SS outperforms Sparse and GP

– Sparse == GP (GP > Sparse for small k or easy query)

– All algorithms are more responsive for smaller k values



Conclusion

New ranking method

Adapts state-of-the-art IR ranking functions and principles

Query processing method

Tailored for our non-monotonic ranking functions

Extensive experiments on large scale real databases

High precision with high efficiency



Reviews

Good

Detailed explanation of background and existing approach

Good paper organization and good examples

Short of rationale for new algorithms

Non-monotonicity of Block pipeline algorithm
