All right reserved by Xuehua Shen [email protected] 1
Optimal Aggregation Algorithms for Middleware
Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
Problem: Rank Aggregation
Each object is scored under m different criteria, giving m sorted lists, one per criterion
The combined score is calculated by an aggregation function
Problem: find the top-k objects with the highest combined scores
Example

carID  MileageScore
c      1.0
a      0.8
e      0.6
b      0.5
d      0.5

carID  YearScore
a      0.9
b      0.7
c      0.7
d      0.7
e      0.5

carID  PriceScore
d      1.0
e      0.9
b      0.8
c      0.7
a      0.6

Rank aggregation, e.g. weighted sum:
Combined score = 0.2 * mileage score + 0.3 * year score + 0.5 * price score

Top 2 cars
carID  score
d      0.81
c      0.76

Do we need to access all entries of all sorted lists?
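The top-2 result of the example can be checked by scoring every car directly. This is a minimal sketch of that exhaustive calculation, not one of the algorithms discussed later; the names `combined_score` and `top2` are made up for illustration.

```python
# Scores copied from the three example lists.
mileage = {"c": 1.0, "a": 0.8, "e": 0.6, "b": 0.5, "d": 0.5}
year = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.5}
price = {"d": 1.0, "e": 0.9, "b": 0.8, "c": 0.7, "a": 0.6}

def combined_score(car):
    """Weighted sum from the example: 0.2*mileage + 0.3*year + 0.5*price."""
    return 0.2 * mileage[car] + 0.3 * year[car] + 0.5 * price[car]

# Score every car, then sort by combined score (highest first).
ranking = sorted(mileage, key=combined_score, reverse=True)
top2 = [(car, round(combined_score(car), 2)) for car in ranking[:2]]
print(top2)  # [('d', 0.81), ('c', 0.76)]
```

This reads all 15 entries, which is exactly the cost the later algorithms try to avoid.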
Applications
Multimedia database system
Web search query
[Figure, from Zhang2002 talk: the query Color=‘red’ and Shape=‘round’ is sent to a rank aggregation engine, which reads one sorted list for Color = ‘red’ and one sorted list for Shape = ‘round’ and returns the top k.]
Outline
Assumptions
Fagin Algorithm
Threshold Algorithm
Summary & Comments
Assumption 1: Modes of Access
Sequential Access: obtain the score of the next object in one sorted list, proceeding from the current position
Random Access: obtain the score of a given object in one sorted list directly, in a single access

carID  YearScore
a      0.8
c      0.8
e      0.7
…

Assumption: both access modes are available
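The two access modes can be pictured as two methods on one list. The class below is a hypothetical helper written for illustration (the names `SortedList`, `sequential_access` and `random_access` are assumptions, not an interface from the paper).

```python
class SortedList:
    """One criterion's list, kept in descending score order (illustrative only)."""

    def __init__(self, scores):
        # scores: dict mapping object id -> score in [0, 1]
        self._by_id = dict(scores)
        self._entries = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._pos = 0  # current position for sequential access

    def sequential_access(self):
        """Return the next (object id, score) pair, or None when exhausted."""
        if self._pos >= len(self._entries):
            return None
        entry = self._entries[self._pos]
        self._pos += 1
        return entry

    def random_access(self, obj_id):
        """Return the score of obj_id directly, wherever it sits in the list."""
        return self._by_id[obj_id]

year = SortedList({"a": 0.8, "c": 0.8, "e": 0.7})
first = year.sequential_access()    # top of the list
second = year.sequential_access()   # next entry from the current position
e_score = year.random_access("e")   # jump straight to object 'e'
```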
Assumption 2: Aggregation Function
Each object gets scores from the different subsystems, each in the interval [0,1]
An aggregation function combines them into a combined score, e.g. min, avg
Monotone: $f(x_1, x_2, \ldots, x_m) \le f(y_1, y_2, \ldots, y_m)$ if $x_i \le y_i$ for every $i$
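The monotonicity condition can be spot-checked numerically. The sketch below tests it on a small grid of score vectors; a grid check is only a sanity test, not a proof, and `is_monotone_on_grid` and `spread` are names invented for this illustration.

```python
from itertools import product

def is_monotone_on_grid(f, m=2, steps=5):
    """Check f(x) <= f(y) whenever x_i <= y_i for every i, on a grid in [0, 1]."""
    grid = [i / (steps - 1) for i in range(steps)]
    points = list(product(grid, repeat=m))
    for x in points:
        for y in points:
            if all(xi <= yi for xi, yi in zip(x, y)) and f(x) > f(y):
                return False
    return True

def avg(scores):
    return sum(scores) / len(scores)

def spread(scores):
    # Counterexample: raising the smallest score can lower the result.
    return max(scores) - min(scores)

print(is_monotone_on_grid(min), is_monotone_on_grid(avg), is_monotone_on_grid(spread))
```

Here `min` and `avg` pass the check, while `spread` fails it, so `spread` would not be a valid aggregation function under this assumption.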
Intuition of Algorithms
Objects near the top of the individual sorted lists are good candidates for the correct answers
Do some accesses, and think “Can we stop now?”
Fagin Algorithm

carID  PriceScore
a      0.9
c      0.8
e      0.7
…

carID  MileageScore
b      1.0
e      0.8
f      0.7
…

carID  YearScore
a      0.8
c      0.8
e      0.7
…

‘e’ appears in all of them, so the top-1 object must be in {a, b, c, e, f}. Why?
Monotone function: object ‘e’ blocks all objects below it, because any object not yet seen scores no higher than ‘e’ in every list
Do random access for these 5 objects to get their scores and pick the top-1.
We can’t say ‘e’ must be the top-1; the other objects seen can still have higher combined scores
Drawbacks of Fagin Algorithm
Uses only the order of the sorted lists and the monotone property; the scores themselves play no part in deciding when to stop
Has to remember many objects: large buffer size
Threshold Algorithm (TA)
Intuition: the combined scores calculated by the aggregation function provide extra information, namely an upper bound (or threshold) on the combined score of unseen objects
When object R is seen under sequential access, immediately do random access to get all other scores of object R and compute its combined score
At the same time, keep track of the upper bound for the unseen objects
Halt when at least k objects have combined scores no less than the upper bound
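The steps of TA can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: every list is a dict holding the same objects, the lists are read in parallel one row at a time, and the names `threshold_algorithm` and `weighted` are invented here. The data is the five-car example with the weighted-sum aggregation.

```python
import heapq

def threshold_algorithm(lists, aggregate, k):
    """Return (top-k as (combined score, id) pairs, rounds of sequential access)."""
    # Sorted views give sequential access; the dicts themselves give random access.
    ordered = [sorted(lst.items(), key=lambda kv: kv[1], reverse=True) for lst in lists]
    top = []    # min-heap of (combined score, id); never holds more than k entries
    depth = 0
    for row in zip(*ordered):   # one sequential access per list in each round
        depth += 1
        for obj, _ in row:
            if any(obj == kept for _, kept in top):
                continue        # already buffered, nothing to do
            combined = aggregate([lst[obj] for lst in lists])   # random accesses
            heapq.heappush(top, (combined, obj))
            if len(top) > k:
                heapq.heappop(top)   # evict the lowest-scoring candidate
        # Upper bound for unseen objects: aggregate of the last scores seen.
        threshold = aggregate([score for _, score in row])
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), depth

mileage = {"c": 1.0, "a": 0.8, "e": 0.6, "b": 0.5, "d": 0.5}
year = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.5}
price = {"d": 1.0, "e": 0.9, "b": 0.8, "c": 0.7, "a": 0.6}

def weighted(scores):
    # Weighted sum from the car example; order is (mileage, year, price).
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]

result, depth = threshold_algorithm([mileage, year, price], weighted, k=2)
print([obj for _, obj in result], depth)  # ['d', 'c'] 3
```

On this data the sketch halts after three of the five rows. The buffer holds only k entries, at the price of occasionally recomputing the score of an object that was seen earlier and evicted.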
TA: Example (K=1, AVG aggregation)

carID  PriceScore
a      0.9
c      0.8
e      0.7
…

carID  YearScore
a      0.8
c      0.8
e      0.7
…

carID  MileageScore
b      1.0
e      0.8
f      0.7
…

Step 1: sequential access to ‘a’ price score (0.9), then random access to ‘a’ mileage score (0.6) and year score (0.8); avg is 0.77
Step 2: sequential access to ‘b’ mileage score (1.0), then random access to ‘b’ price score (0.7) and year score (0.7); avg is 0.8

Upper bound: 0.9, then 0.8
Constant-size buffer: 0.77, then 0.8
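The two upper bounds shown in the example can be recomputed from the lists, assuming the threshold is the AVG of the last scores seen under sequential access and the three lists are read in parallel one row at a time. The symbols $\tau_1$ and $\tau_2$ (threshold after the first and second row) are introduced here for illustration.

```latex
\tau_1 = \frac{0.9 + 1.0 + 0.8}{3} = 0.9, \qquad
\tau_2 = \frac{0.8 + 0.8 + 0.8}{3} = 0.8
```

After the first row the best combined score seen is 0.8 (object ‘b’), which is below $\tau_1$, so TA continues. After the second row ‘b’ has a combined score no less than $\tau_2$, which satisfies the halting condition for K=1.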
Evaluation of TA
TA never stops later than FA
TA requires only a small, constant-size buffer (k objects)
However, TA may perform more random accesses
Summary
FA and TA assume both sequential access and random access
Extend TA to other situations
  Approximate algorithm
  No random access
Comments
Relies on universal identification of objects across the different lists
The assumptions are not always valid, e.g. not every sorted list exists beforehand
Schedule sequential accesses wisely to speed up TA on skewed data
Backup Slides
Middleware
Middleware: functions as a translation layer that handles all incoming requests (such as top-k queries) and replies, interacting with the disparate back-office systems to gather the information it needs.
Application developers do not need to know that there are several heterogeneous systems behind the middleware.
Boolean Query Vs. Fuzzy Query
Semantics: get all the results that satisfy the conditions vs. get the best possible answers to the query
Size of result: constant vs. variable
Processing the query: for a Boolean query, whether a tuple belongs to the result can be determined from the tuple itself, so each tuple can be handled individually. For a fuzzy query this is not possible, because whether a tuple is in the result depends on how it ranks against the other tuples
Fuzzy Query Processor (from Zhang02)
[Figure: a Boolean query processor answers Title=‘database’ and Price <100 over a traditional database and returns a set; a fuzzy query processor answers Color=‘red’ and Shape=‘round’ over a database with fuzzy data, reading a sorted list for Color = ‘red’ and a sorted list for Shape = ‘round’, and returns the top k.]
Cost
Reduce the number of sequential accesses (Cs)
The number of random accesses is bounded by the number of sequential accesses times a factor of m-1
The overall cost is bounded by Cs times a constant factor
Really optimal?
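The constant factor can be written out explicitly. In the sketch below, $s$ and $r$ are the numbers of sequential and random accesses, and $c_S$ and $c_R$ are per-access costs; these symbols are assumptions introduced for the illustration. Since each sequential access triggers at most $m-1$ random accesses, $r \le (m-1)\,s$ and:

```latex
\text{cost} = s\,c_S + r\,c_R \;\le\; s\,c_S + (m-1)\,s\,c_R = s\,c_S \left(1 + (m-1)\,\frac{c_R}{c_S}\right)
```

For a fixed number of lists $m$ and a fixed cost ratio $c_R / c_S$, the bracketed term is a constant, so reducing the number of sequential accesses bounds the overall cost.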
Approximation Algorithm
Approximate top-k answers are acceptable or even desirable
θ-approximation (θ>1)
  For any object y in the answer and any object z not in the answer, θt(y) >= t(z)
Turning TA into an approximation algorithm
  Halt when the top k objects seen so far satisfy the inequality with the threshold in place of t(z)
Non Random Access (NRA)
Similar to TA, except that random access is not available, so for the objects seen:
  No exact scores
  No sorted order
  Lower and upper bounds on the combined scores of such objects are kept instead
Do sequential access until there are k objects whose lower bounds are no less than the upper bounds of all other objects
NRA cont.
Lower bound: use 0 for scores not yet seen
Upper bound: use the last score seen in that list

carID  PriceScore
a      0.9
c      0.8
e      0.7
…

carID  MileageScore
b      1.0
e      0.8
f      0.7
…

carID  YearScore
a      0.8
c      0.8
e      0.7
…
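The bound bookkeeping can be sketched as follows. This is an illustrative reading of NRA under stated assumptions (every list is a dict over the same objects, lists are read in parallel, and the names `nra`, `bound` and `weighted` are invented here). Because the lists above are truncated, the sketch runs on the complete five-car example with the weighted-sum aggregation.

```python
def nra(lists, aggregate, k):
    """Return (ids of the top k, rounds of sequential access); no random access."""
    ordered = [sorted(lst.items(), key=lambda kv: kv[1], reverse=True) for lst in lists]
    m = len(lists)
    known = {}            # object id -> {list index: score seen so far}
    zeros = [0.0] * m

    def bound(obj, fill):
        # Combined score with every missing field replaced by fill[i].
        return aggregate([known[obj].get(i, fill[i]) for i in range(m)])

    top, depth = [], 0
    for row in zip(*ordered):
        depth += 1
        last = [score for _, score in row]   # last score seen in each list
        for i, (obj, score) in enumerate(row):
            known.setdefault(obj, {})[i] = score
        # Rank by lower bound (missing scores count as 0).
        ranked = sorted(known, key=lambda o: bound(o, zeros), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        # Best any other object could still reach, including unseen objects.
        best_other = max([bound(o, last) for o in rest] + [aggregate(last)])
        if len(top) == k and all(bound(o, zeros) >= best_other for o in top):
            break
    return top, depth

mileage = {"c": 1.0, "a": 0.8, "e": 0.6, "b": 0.5, "d": 0.5}
year = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.5}
price = {"d": 1.0, "e": 0.9, "b": 0.8, "c": 0.7, "a": 0.6}

def weighted(scores):
    # Weighted sum from the car example; order is (mileage, year, price).
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]

top, depth = nra([mileage, year, price], weighted, k=2)
print(top, depth)  # ['d', 'c'] 5
```

On this small data set the bounds only separate at the last row, so the sketch reads all five rows. The `known` dictionary grows with the number of objects seen, which is the bookkeeping cost of doing without random access.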
NRA example
Advantage: R1(1,0), others(1/3,1/3) Top 1
Top 2 vs. Top 1: R1(1,0),R2(1,1/4),others(1/3,1/3) Top 2
Lots of Bookkeeping
Optimality of FA
Assumption
  t is monotone
Cost
  $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability
Optimality
  Every algorithm that correctly finds the top k answers for a strictly monotone query $F_t(A_1, A_2, \ldots, A_m)$, where $A_1, A_2, \ldots, A_m$ are independent, and that makes no wild guesses has cost $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability
  FA is optimal among all such algorithms in the high-probability sense
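For a rough sense of scale, the cost expression can be evaluated for illustrative values. The numbers $N = 10^6$, $m = 2$ and $k = 100$ are assumptions chosen for easy arithmetic, and $\Theta$ hides constant factors, so only the order of magnitude is meaningful.

```latex
N^{(m-1)/m}\, k^{1/m} = (10^{6})^{1/2} \cdot 100^{1/2} = 10^{3} \cdot 10 = 10^{4}
```

Under these assumptions the cost is on the order of $10^4$ accesses, compared with a scan of all $10^6$ objects.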
Optimality of TA
Assumption
  t is monotone
Instance Optimality
  For any algorithm C that correctly finds the top k answers for a monotone query $F_t(A_1, A_2, \ldots, A_m)$ without wild guesses, and any database D: Cost(TA,D) = O(Cost(C,D))
  TA is instance optimal among all such algorithms
Optimality of NRA
Assumption
  t is monotone
Instance Optimality
  NRA is instance optimal over all algorithms that correctly find the top k objects for a monotone query t on every database and do not make random accesses
Algorithm Comparison (from Zhang2002 talk)

Algorithm  Assumption  Access Model    Termination (Worst Case)  Termination (Expected)     Buffer Space
FA         Monotone    Sorted, Random  n(m-1)/m + k/m            $N^{(m-1)/m} k^{1/m}$      N
TA         Monotone    Sorted, Random  Bounded by FA             Depends on distribution    k
NRA        Monotone    Sorted          N                         Depends on distribution    N
Worst Case
O_1       1.0  0.0
O_2       1.0  0.0
...
O_{n+1}   1.0  1.0
O_{n+2}   0.0  1.0
O_{n+3}   0.0  1.0
...
O_{2n+1}  0.0  1.0

Aggregation function: min
n(m-1)/m + k/m
Naïve algorithm
Algorithm:
  For each criterion, do sequential access to retrieve all objects and their scores
  Calculate the combined scores for all objects
  Pick the top k
Comments:
  Accesses the entire database
  Cost is linear in the database size
  Does NOT use the fact that each list is sorted
Fagin Algorithm
Algorithm:
Do sequential access in parallel to all sorted lists Li until there are k “matches”. A “match” is an object that has been seen in all sorted lists Li.
Then, for each object that has been seen, do random access to get all its scores.
Compute the combined scores and pick the top k
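The two phases above can be sketched as follows. This is a minimal illustration under stated assumptions (every list is a dict over the same objects, and the names `fagin_algorithm` and `weighted` are invented here), run on the five-car example with the weighted-sum aggregation.

```python
def fagin_algorithm(lists, aggregate, k):
    """Return (top-k (score, id) pairs, rounds of sequential access, objects buffered)."""
    ordered = [sorted(lst.items(), key=lambda kv: kv[1], reverse=True) for lst in lists]
    m = len(lists)
    seen_in = {}    # object id -> set of list indexes where it has been seen
    matches = 0
    depth = 0
    # Phase 1: sequential access in parallel until k objects are seen in all lists.
    for row in zip(*ordered):
        depth += 1
        for i, (obj, _) in enumerate(row):
            where = seen_in.setdefault(obj, set())
            where.add(i)
            if len(where) == m:
                matches += 1    # obj has now appeared in every list
        if matches >= k:
            break
    # Phase 2: random access for every object seen, then pick the top k.
    scored = [(aggregate([lst[obj] for lst in lists]), obj) for obj in seen_in]
    return sorted(scored, reverse=True)[:k], depth, len(seen_in)

mileage = {"c": 1.0, "a": 0.8, "e": 0.6, "b": 0.5, "d": 0.5}
year = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.5}
price = {"d": 1.0, "e": 0.9, "b": 0.8, "c": 0.7, "a": 0.6}

def weighted(scores):
    # Weighted sum from the car example; order is (mileage, year, price).
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]

top, depth, buffered = fagin_algorithm([mileage, year, price], weighted, k=2)
print([obj for _, obj in top], depth, buffered)  # ['d', 'c'] 4 5
```

On this data the sketch reads four rows and buffers all five cars before it can stop, which illustrates the large-buffer drawback of FA.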