Top-K Query Evaluation on Probabilistic Data
Christopher Ré, Nilesh Dalvi and Dan Suciu
University of Washington
Evaluating Complex SQL on PDBs, 12/8/2006
High Level Overview
• DBMS: precise answers over clean data
• Data are often imprecise
• Information integration
• Information extraction
• Probabilistic DBs (PDBs) handle imprecision
• Many low-quality answers
• Top-K ranked by probability
This talk: Compute Top-K Efficiently
Overview
• Motivating Example
• Query Processing Background
• Multisimulation
• Experimental Results
Example Application
IMDB
• Lots of interesting data about movies (e.g. actors, directors)
• Well maintained and clean
• But no reviews!
On the web there are lots of reviews
How will I know which movie they are about?
Alice needs to do information extraction and object reconciliation.
Is a movie good or bad?
Alice wants to do sentiment analysis.
A probabilistic database can help Alice store and query her uncertain data.
Find all years where ‘Anthony Hopkins’ starred in a good movie
Imprecision is out there…
Object Reconciliation
Data extracted from reviews:
RID   Title
r124  12 Monkeys
r155  Twelve Monkeys
r175  2 Monkey
r194  Monk
Clean IMDB data:
MID   Title
m232  12 Monkeys
m143  Monkey Love
Output: (RID, MID) pairs
Fellegi-Sunter approach: score (s) each (RID, MID) pair
[Figure: score axis with thresholds t’ and t, with regions labeled “No Match” and “Match”]
Our approach: convert scores to probabilities
RID MID Prob
r175 m232 0.8
r175 m143 0.2
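The slides say that match scores are converted to probabilities but do not give the formula. As a minimal sketch, assuming a plain per-review normalization of the scores that exceed a lower threshold, the conversion could look like the following. The function name, the threshold handling, and the raw scores are hypothetical; the scores were chosen only so that the output mirrors the table above.

```python
from collections import defaultdict

def scores_to_probabilities(scored_pairs, lower_threshold):
    """Normalize match scores per review (RID) into probabilities.

    scored_pairs: iterable of (rid, mid, score) with non-negative scores.
    Pairs scoring at or below lower_threshold are treated as non-matches.
    Assumption: simple per-RID normalization; the slides give no formula.
    """
    by_rid = defaultdict(list)
    for rid, mid, score in scored_pairs:
        if score > lower_threshold:
            by_rid[rid].append((mid, score))
    rows = []
    for rid, candidates in by_rid.items():
        total = sum(score for _, score in candidates)
        for mid, score in candidates:
            rows.append((rid, mid, score / total))
    return rows

# Hypothetical raw scores, not taken from the talk.
pairs = [("r175", "m232", 4.0), ("r175", "m143", 1.0), ("r194", "m143", 0.1)]
table = scores_to_probabilities(pairs, lower_threshold=0.5)
```

With this normalization the candidate matches of one review are mutually exclusive alternatives whose probabilities sum to 1, as in the r175 rows.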
Query Processing Background
RID MID Prob
r175 m232 0.8
r175 m143 0.2
Query Processing builds event expression
• Intensional Query Processing [FR97]
• Associate an event with each tuple
• Probability that the event is satisfied = query value
Technical Point: Projection as last operator implies result is a DNF
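To make the event-expression idea concrete, here is a minimal sketch that computes the probability of a DNF exactly by enumerating all truth assignments. It assumes the tuple events are independent, which is a simplification (the two r175 rows above are alternatives, not independent events), and the event names and probabilities in the example are hypothetical. The enumeration is exponential in the number of events, so it only works for tiny expressions.

```python
from itertools import product

def dnf_probability(dnf, prob):
    """Exact probability that a DNF over independent events is true.

    dnf: list of clauses; each clause is a set of event names (a conjunction).
    prob: dict mapping each event name to its marginal probability.
    Enumerates every truth assignment, so it is only usable for tiny inputs.
    """
    events = sorted({e for clause in dnf for e in clause})
    total = 0.0
    for values in product([False, True], repeat=len(events)):
        world = dict(zip(events, values))
        if any(all(world[e] for e in clause) for clause in dnf):
            weight = 1.0
            for e in events:
                weight *= prob[e] if world[e] else 1.0 - prob[e]
            total += weight
    return total

# Hypothetical event expression (x1 AND y1) OR (x2 AND y1) for one output tuple.
p = dnf_probability(
    [{"x1", "y1"}, {"x2", "y1"}],
    {"x1": 0.8, "x2": 0.5, "y1": 0.5},
)
```

Here p equals P(y1) · (1 − (1 − 0.8)(1 − 0.5)) = 0.45, which the enumeration reproduces.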
DNF Sampling at a High Level
• Estimate p(t), the probability that the DNF is satisfied
• Do this for each output tuple t
• #P-hard [Valiant79], even for conjunctive queries only [RDS06, DS04]
• Randomized approximation [LK84]
Simulation reduces uncertainty
[Figure: an interval on the 0.0 to 1.0 axis showing the uncertainty about p(t)]
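The slides cite [LK84] for the randomized approximation without further detail. The sketch below is a coverage-style Monte Carlo estimator in the spirit of Karp and Luby; whether it matches the exact variant used in the talk is an assumption, as are the independence of events and the example values.

```python
import random

def karp_luby_estimate(dnf, prob, num_samples, rng):
    """Estimate P(DNF is true) for independent events by coverage sampling."""
    clauses = [sorted(c) for c in dnf]
    events = sorted({e for c in clauses for e in c})
    # Weight of a clause = probability that all of its events hold.
    weights = []
    for c in clauses:
        w = 1.0
        for e in c:
            w *= prob[e]
        weights.append(w)
    total_weight = sum(weights)
    hits = 0
    for _ in range(num_samples):
        # Pick a clause proportionally to its weight ...
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # ... then sample a world conditioned on that clause being true.
        world = {e: rng.random() < prob[e] for e in events}
        for e in clauses[i]:
            world[e] = True
        # Count the sample only if i is the first satisfied clause,
        # so each satisfying world is counted exactly once.
        first = next(j for j, c in enumerate(clauses) if all(world[e] for e in c))
        if first == i:
            hits += 1
    return total_weight * hits / num_samples

# Hypothetical DNF (x1 AND y1) OR (x2 AND y1); its exact probability is 0.45.
estimate = karp_luby_estimate(
    [{"x1", "y1"}, {"x2", "y1"}],
    {"x1": 0.8, "x2": 0.5, "y1": 0.5},
    num_samples=20000,
    rng=random.Random(0),
)
```

Each additional sample tightens the confidence interval around p(t), which is the sense in which simulation reduces uncertainty.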
Naïve Query Processing
Naïve algorithm (PTIME): simulate until all intervals are ε-small
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, carrying the labels 1, 3, 4 and 2]
Can we do better?
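As a rough illustration of the naïve strategy, the sketch below simulates every candidate until its confidence interval is narrower than ε and only then picks the top k. It is a simplification under stated assumptions: each simulation step is a Bernoulli draw standing in for a DNF sampling step, the intervals come from a Hoeffding bound, and the candidate names and hidden probabilities are hypothetical.

```python
import math
import random

def hoeffding_interval(successes, n, delta):
    """Two-sided Hoeffding confidence interval for a Bernoulli mean, clipped to [0, 1]."""
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    mean = successes / n
    return max(0.0, mean - half_width), min(1.0, mean + half_width)

def naive_top_k(samplers, k, epsilon, delta, batch=100):
    """Shrink every interval below epsilon, then rank by interval midpoint."""
    intervals = {}
    total_draws = 0
    for name, draw in samplers.items():
        successes, n = 0, 0
        low, high = 0.0, 1.0
        while high - low > epsilon:
            successes += sum(draw() for _ in range(batch))
            n += batch
            low, high = hoeffding_interval(successes, n, delta)
        intervals[name] = (low, high)
        total_draws += n
    ranked = sorted(intervals, key=lambda t: sum(intervals[t]), reverse=True)
    return ranked[:k], total_draws

# Hypothetical candidates with hidden probabilities unknown to the algorithm.
rng = random.Random(0)
hidden = {"a": 0.9, "b": 0.5, "c": 0.3, "d": 0.7}
samplers = {name: (lambda p=p: rng.random() < p) for name, p in hidden.items()}
top, draws = naive_top_k(samplers, k=2, epsilon=0.1, delta=0.05)
```

Every candidate pays for the full precision here, including those that are clearly outside the top k, which is the cost the question on the slide points at.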
A Better Method: Multisimulation
• Separate Top-K with few simulations
• Concentrate on intervals in Top-K
• Asymptotically, confidence intervals are nested
• Compare against OPT, which “knows” which intervals to simulate
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, carrying the labels 1, 3, 4 and 2]
The Critical Region
The critical region is the interval (k-th highest min, (k+1)-st highest max)
[Figure: example critical region for k = 2 on the 0.0 to 1.0 axis]
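The definition above translates directly into code. In this minimal sketch, "min" and "max" are read as the lower and upper endpoints of each confidence interval; the function name and the example intervals are hypothetical.

```python
def critical_region(intervals, k):
    """Return (c, d) for a dict of name -> (low, high) confidence intervals.

    c is the k-th highest lower endpoint and d is the (k+1)-st highest
    upper endpoint. Requires more than k intervals.
    """
    lows = sorted((low for low, _ in intervals.values()), reverse=True)
    highs = sorted((high for _, high in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]

# Hypothetical intervals for four output tuples.
intervals = {"a": (0.6, 0.9), "b": (0.3, 0.7), "c": (0.1, 0.4), "d": (0.5, 0.8)}
region = critical_region(intervals, k=2)  # (0.5, 0.7)
```

On this reading, once c ≥ d the region is empty: k intervals lie entirely above all the others, so the top-k set is separated and no further simulation is needed.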
Three Simple Rules: Rule 1
• Pick a “double crosser”
• OPT must pick this too
Three Simple Rules: Rule 2
• If all crossers are lower crossers, or all are upper crossers, pick the maximal one
• OPT must pick this too
Three Simple Rules: Rule 3
• Pick an upper and a lower crosser
• OPT may pick only one of these two
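The three rules can be sketched as a single selection step. The slides name the cases but do not define them, so the definitions below are assumptions: a double crosser extends past both ends of the critical region (c, d), a lower crosser extends only below c, an upper crosser extends only above d, and "maximal" means the lowest lower endpoint for lower crossers and the highest upper endpoint for upper crossers. All function names and example intervals are hypothetical.

```python
def critical_region(intervals, k):
    """(k-th highest lower endpoint, (k+1)-st highest upper endpoint)."""
    lows = sorted((low for low, _ in intervals.values()), reverse=True)
    highs = sorted((high for _, high in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]

def choose_to_simulate(intervals, k):
    """Return the names of the intervals to simulate next (possibly none)."""
    c, d = critical_region(intervals, k)
    if c >= d:
        return []  # empty critical region: the top k are already separated
    double, lower, upper = [], [], []
    for name, (a, b) in intervals.items():
        if a < c and b > d:
            double.append(name)   # sticks out on both sides
        elif a < c < b:
            lower.append(name)    # sticks out below c only
        elif a < d < b:
            upper.append(name)    # sticks out above d only
    if double:                    # Rule 1
        return [double[0]]
    if lower and not upper:       # Rule 2, only lower crossers
        return [min(lower, key=lambda n: intervals[n][0])]
    if upper and not lower:       # Rule 2, only upper crossers
        return [max(upper, key=lambda n: intervals[n][1])]
    if lower and upper:           # Rule 3, one of each
        return [max(upper, key=lambda n: intervals[n][1]),
                min(lower, key=lambda n: intervals[n][0])]
    return []  # no crossers at all: not covered by the three rules on the slides

# Hypothetical intervals; for k = 2 the critical region is (0.5, 0.7).
intervals = {"a": (0.6, 0.9), "b": (0.3, 0.7), "c": (0.1, 0.4), "d": (0.5, 0.8)}
picked = choose_to_simulate(intervals, k=2)  # Rule 3 applies: ["a", "b"]
```

Rule 3 is the only case where two intervals are simulated in one step while OPT may need only one of them, which is consistent with the factor of two in the approximation result.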
Multisimulation is a 2-Approximation
Thm: Multisimulation performs at most twice as many simulations as OPT.
And no deterministic algorithm can do better on every instance.
Extensions
• Top-K Set (shown)
• Anytime (produce from 1 to k)
• Rank (produce top k ranked)
• All (rank all intervals)
Experiment Details: Uncertain tuples
Table # Tuples
StringMatch 339k
ActorMatch 6,758k
DirectorMatch 18k
Table # Tuples
Reviews 292k
Running Time
“Find all years in which Anthony Hopkins was in a highly rated movie” (SS)
Small Number of Tuples Output (33)
Small DNFs per Output (Avg. 20.4, Max. 63)
Running Time
“Find all directors who have a highly rated drama but low rated comedy” (LL)
Large #Tuples Output (1415)
Large DNFs per Output (Avg. 234.8, Max. 9088)
Conclusions
Mystiq is a general-purpose probabilistic database
Multisimulation and logical optimization are key to performance on large data sets
Advert: Demo on my laptop
Running Time
“Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction” (SL)
Small Number of Tuples Output (33)
Large DNFs per Output (Avg. 117.7, Max. 685)
Running Time
“Find all directors in the 80s who had a highly rated movie” (LS)
Large #Tuples Output (3259)
Small DNFs per Output (Avg. 3.03, Max. 30)