Optimal Aggregation Algorithms for Middleware, Ronald Fagin, Amnon Lotem, Moni Naor (PODS01). All rights reserved by Xuehua Shen (xshen@uiuc.edu).


Optimal Aggregation Algorithms for Middleware

Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)


Problem: Rank Aggregation

Each object is scored using m different criteria

There are m sorted lists, one for each criterion

The combined score is calculated by an aggregation function

Problem: find the top-k objects with the highest combined scores


Example: Rank Aggregation

carID  MileageScore
c      1.0
a      0.8
e      0.6
b      0.5
d      0.5

carID  YearScore
a      0.9
b      0.7
c      0.7
d      0.7
e      0.5

carID  PriceScore
d      1.0
e      0.9
b      0.8
c      0.7
a      0.6

Aggregation function, e.g. weighted sum: combined score = 0.2 * mileage score + 0.3 * year score + 0.5 * price score

Top 2 cars:

carID  score
d      0.81
c      0.76

Do we need to access all entries of all sorted lists?
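The top-2 result on this slide can be checked by direct arithmetic. The sketch below copies the three score lists and the weights from the slide; the function and variable names are illustrative only.

```python
# Per-criterion scores copied from the three lists on this slide.
mileage = {"c": 1.0, "a": 0.8, "e": 0.6, "b": 0.5, "d": 0.5}
year = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.5}
price = {"d": 1.0, "e": 0.9, "b": 0.8, "c": 0.7, "a": 0.6}


def combined_score(car):
    """Weighted sum from the slide: 0.2 * mileage + 0.3 * year + 0.5 * price."""
    return 0.2 * mileage[car] + 0.3 * year[car] + 0.5 * price[car]


# Score every car (rounded to two decimals), then sort highest first.
scores = {car: round(combined_score(car), 2) for car in mileage}
top2 = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:2]
print(top2)
```

This reads every entry of every list, which is exactly the cost the question on the slide asks whether we can avoid.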


Applications

Multimedia database systems

Web search queries

[Figure, from the Zhang2002 talk: the query Color=‘red’ and Shape=‘round’ goes to a rank aggregation engine, which combines one sorted list for Color = ‘red’ and one sorted list for Shape = ‘round’ and returns the top k.]


Outline

Assumptions

Fagin Algorithm

Threshold Algorithm

Summary & Comments


Assumption 1: Modes of Access

Sequential access: obtain the score of the next object in one sorted list, reading sequentially from the current position

Random access: obtain the score of a given object in one sorted list with a single random access

carID  YearScore
a      0.8
c      0.8
e      0.7
…

Assumption: both access modes are available
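A minimal sketch of the two access modes, assuming each sorted list is held in memory. The class `SortedList` and its method names are hypothetical helpers for illustration, not part of the original talk.

```python
class SortedList:
    """One criterion's list, sorted by score in descending order (hypothetical helper)."""

    def __init__(self, entries):
        # entries: iterable of (object_id, score) pairs
        self._entries = sorted(entries, key=lambda e: e[1], reverse=True)
        self._index = {obj: score for obj, score in self._entries}
        self._pos = 0  # current position for sequential access

    def sequential_access(self):
        """Return the next (object_id, score) pair, or None when the list is exhausted."""
        if self._pos >= len(self._entries):
            return None
        entry = self._entries[self._pos]
        self._pos += 1
        return entry

    def random_access(self, obj):
        """Return the score of obj in this list with a single lookup."""
        return self._index[obj]


# The first three rows of the year-score list shown on this slide.
year = SortedList([("a", 0.8), ("c", 0.8), ("e", 0.7)])
first = year.sequential_access()
second = year.sequential_access()
e_score = year.random_access("e")
```

Sequential access advances a cursor, so its cost depends on how deep the algorithm reads; random access jumps straight to a named object.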


Assumption 2: Aggregation Function

Each object gets scores in the interval [0,1] from the different subsystems

The aggregation function combines them into a combined score, e.g. min, avg

Monotone: $f(x_1, x_2, \ldots, x_m) \le f(y_1, y_2, \ldots, y_m)$ if $x_i \le y_i$ for every $i$
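The monotone property can be spot-checked numerically. The sketch below tests the definition on a small grid of score vectors; passing the grid check is only evidence, not a proof, and `spread` is a made-up counterexample that is not from the talk.

```python
from itertools import product


def is_monotone_on_grid(f, m, grid):
    """Check f(x) <= f(y) for all grid points x, y with x_i <= y_i for every i."""
    points = list(product(grid, repeat=m))
    for x in points:
        for y in points:
            if all(xi <= yi for xi, yi in zip(x, y)) and f(x) > f(y):
                return False
    return True


def avg(scores):
    return sum(scores) / len(scores)


def spread(scores):
    # Not monotone: raising the smallest score can lower the result.
    return max(scores) - min(scores)


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
min_ok = is_monotone_on_grid(min, 2, grid)
avg_ok = is_monotone_on_grid(avg, 2, grid)
spread_ok = is_monotone_on_grid(spread, 2, grid)
```

Here `min` and `avg` pass, while `spread` fails, for example on x = (0, 1) and y = (1, 1).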


Intuition of Algorithms

Objects near the top of the individual sorted lists are likely to be among the correct answers

Do some accesses, and think “Can we stop now?”


Fagin Algorithm

carID  PriceScore
a      0.9
c      0.8
e      0.7
…

carID  MileageScore
b      1.0
e      0.8
f      0.7
…

carID  YearScore
a      0.8
c      0.8
e      0.7
…

‘e’ appears in all three lists. The top-1 object must be in {a, b, c, e, f}. Why?

The function is monotone, so object ‘e’ blocks all objects below it: an object that has not been seen yet scores no higher than ‘e’ in every list

We can’t say that ‘e’ must be top-1: the other seen objects can still have a higher combined score

Do random access for these 5 objects to get their scores and pick the top-1.


Drawbacks of Fagin Algorithm

Uses only the information provided by the sorted lists and the monotone property

Has to remember every object seen so far: large buffer size


Threshold Algorithm (TA)

Intuition: the combined score calculated by the aggregation function provides extra information: an upper bound (or threshold) on the combined score of the unseen objects

When an object R is seen under sequential access, immediately do random access to get all the other scores of R and compute its combined score

At the same time, keep track of the upper bound on the combined score of the unseen objects

Halt when at least k objects have combined scores no less than the upper bound
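The steps on this slide can be sketched as follows. This is a minimal in-memory sketch, assuming every object appears in every list and that the upper bound is the aggregation function applied to the last score read from each list; the function name, the dict-based random access, and the `seen` set are illustrative choices rather than the authors' implementation. The data in the usage lines is the car example with its weighted sum.

```python
import heapq


def threshold_algorithm(lists, agg, k):
    """Return the top-k (score, object) pairs, best first, plus the depth reached.

    lists: one list per criterion of (object, score) pairs, sorted by score
           in descending order; every object appears in every list.
    agg:   monotone aggregation function over a list of m scores.
    """
    lookup = [dict(lst) for lst in lists]  # simulates random access
    seen = set()  # only avoids repeating random accesses for the same object
    top = []      # min-heap holding at most k (score, object) pairs
    depth = 0
    for depth, row in enumerate(zip(*lists), start=1):
        for obj, _ in row:  # sequential access, one entry per list
            if obj in seen:
                continue
            seen.add(obj)
            score = agg([table[obj] for table in lookup])  # random accesses
            if len(top) < k:
                heapq.heappush(top, (score, obj))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, obj))
        # Upper bound on the combined score of any object not yet seen.
        threshold = agg([score for _, score in row])
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), depth


mileage = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
year = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
price = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]


def weighted(scores):
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]


result, depth = threshold_algorithm([mileage, year, price], weighted, k=2)
```

On this data the sketch halts after reading three of the five rows, because the upper bound has then dropped to 0.73, below the second-best combined score of 0.76.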


TA: Example (K=1, AVG aggregation)

carID  PriceScore
a      0.9
c      0.8
e      0.7
…

carID  MileageScore
b      1.0
e      0.8
f      0.7
…

carID  YearScore
a      0.8
c      0.8
e      0.7
…

Step 1: sequential access gives the price score of ‘a’ (0.9); random access then gives the mileage score (0.6) and year score (0.8) of ‘a’; the avg is 0.77

Upper bound: 0.9; buffer: 0.77

Step 2: sequential access gives the mileage score of ‘b’ (1.0); random access then gives the price score (0.7) and year score (0.7) of ‘b’; the avg is 0.8

Upper bound: 0.8; buffer: 0.8 (constant-size buffer)
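The numbers in this example can be reproduced with a few lines of arithmetic. The sketch assumes that each upper bound is the average of the scores at the current depth of the three lists (depth 1 for 0.9, depth 2 for 0.8); the slide does not spell this out, so treat that reading as an assumption.

```python
def avg(scores):
    return sum(scores) / len(scores)


# Scores at each depth of the price, mileage, and year lists on this slide.
rows = [
    (0.9, 1.0, 0.8),  # depth 1: a (price), b (mileage), a (year)
    (0.8, 0.8, 0.8),  # depth 2: c (price), e (mileage), c (year)
]
upper_bounds = [round(avg(row), 2) for row in rows]

# Combined scores from steps 1 and 2, in the order (price, mileage, year).
score_a = round(avg((0.9, 0.6, 0.8)), 2)
score_b = round(avg((0.7, 1.0, 0.7)), 2)

# With K=1, TA can halt once the best score reaches the upper bound.
can_halt = score_b >= upper_bounds[1]
```

Under that reading, ‘b’ with 0.8 meets the upper bound of 0.8, so a single buffered object is enough for K=1.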


Evaluation of TA

TA never stops later than FA

TA requires only a small, constant-size buffer (K objects)

However, TA may perform more random accesses than FA


Summary

FA and TA work with both sequential access and random access

TA can be extended to other situations:

Approximate algorithm

No random access


Comments

Relies on a universal identifier for the same object across the different lists

The assumptions do not always hold, e.g. not every sorted list exists beforehand

Choosing wisely which list to access sequentially can speed up TA on skewed data


Backup Slides


Middleware

Middleware functions as a translation layer: it handles all incoming requests (such as top-k queries) and replies, interacting with the disparate back-office systems to gather the information it needs.

Application developers do not need to know that there are several heterogeneous systems behind the middleware.


Boolean Query Vs. Fuzzy Query

Semantics: get all the results that satisfy the conditions vs. get the best possible answers to the query

Size of result: constant vs. variable

Processing the query: for a Boolean query, whether a tuple belongs to the result can be determined from the tuple itself, so each tuple can be handled individually. For a fuzzy query, whether a tuple is in the result cannot be determined from the tuple alone.


Fuzzy Query Processor (from Zhang02)

[Figure: a Boolean query processor answers the query Title=‘database’ and Price < 100 against a traditional database and returns a set. A fuzzy query processor answers the query Color=‘red’ and Shape=‘round’ against a database with fuzzy data, using one sorted list for Color = ‘red’ and one for Shape = ‘round’, and returns the top k.]


Cost

Reduce the number of sequential accesses (Cs)

The number of random accesses is bounded by the number of sequential accesses times a factor of m-1

The overall cost is bounded by Cs times a constant factor

Really optimal?


Approximation Algorithm

Approximate top k answers are acceptable or even desirable

θ-approximation (θ > 1): for any object y in the answer and any object z in the database outside the answer, $\theta \cdot t(y) \ge t(z)$

Turning TA into an approximation algorithm: halt when the top k objects seen so far satisfy the inequality
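One way to read the modified halting rule is sketched below: compare each buffered score, scaled by θ, against TA's upper bound on the unseen objects. Unseen objects score at most the upper bound, and seen objects outside the buffer score no more than the buffered ones, so the inequality then holds for every z. The function name and signature are hypothetical, and the sample numbers are taken from the TA example (best score 0.77, upper bound 0.9).

```python
def can_halt(top_scores, k, threshold, theta=1.0):
    """Approximate halting test for TA.

    Halts when k objects are buffered and theta * score >= threshold for each.
    theta = 1.0 gives the exact TA test; theta > 1 allows earlier halting.
    """
    if theta < 1.0:
        raise ValueError("theta must be at least 1")
    return len(top_scores) == k and all(
        theta * score >= threshold for score in top_scores
    )


exact = can_halt([0.77], k=1, threshold=0.9)              # 0.77 < 0.9, keep going
approx = can_halt([0.77], k=1, threshold=0.9, theta=1.2)  # 1.2 * 0.77 = 0.924
```

A larger θ trades answer quality for fewer accesses.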


Non Random Access (NRA)

Similar to TA, except that there is:

No exact score

No sorted order

Only a lower bound and an upper bound on the combined score of each object

Do sequential access until there are k objects whose lower bounds are no less than the upper bounds of all other objects


NRA cont.

Lower bound: use 0 for each score not yet seen

Upper bound: use the last score seen in the list for each score not yet seen

carID  PriceScore
a      0.9
c      0.8
e      0.7
…

carID  MileageScore
b      1.0
e      0.8
f      0.7
…

carID  YearScore
a      0.8
c      0.8
e      0.7
…
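The bound computation can be sketched on the three list prefixes shown on this slide. The sketch assumes avg as the aggregation function (the slide does not name one) and reads three entries from each list; `nra_bounds` is a hypothetical helper name.

```python
def nra_bounds(lists, depth, agg):
    """Lower and upper bounds on each seen object's combined score.

    lists: one list per criterion of (object, score) pairs, sorted descending.
    depth: number of entries read from each list by sequential access (>= 1).
    """
    known = [dict(lst[:depth]) for lst in lists]      # scores seen so far
    last_seen = [lst[depth - 1][1] for lst in lists]  # last score read per list
    objects = set().union(*known)
    bounds = {}
    for obj in objects:
        # A missing score is replaced by 0 for the lower bound ...
        lower = agg([scores.get(obj, 0.0) for scores in known])
        # ... and by the last score seen in that list for the upper bound.
        upper = agg([scores.get(obj, last) for scores, last in zip(known, last_seen)])
        bounds[obj] = (lower, upper)
    return bounds


def avg(scores):
    return sum(scores) / len(scores)


price = [("a", 0.9), ("c", 0.8), ("e", 0.7)]
mileage = [("b", 1.0), ("e", 0.8), ("f", 0.7)]
year = [("a", 0.8), ("c", 0.8), ("e", 0.7)]
bounds = nra_bounds([price, mileage, year], depth=3, agg=avg)
```

After three rows, ‘e’ has been seen in every list, so its two bounds coincide, while ‘a’ is still bracketed between roughly 0.57 and 0.8.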


NRA example

Advantage: R1(1,0), others(1/3,1/3) Top 1

Top 2 vs. Top 1: R1(1,0),R2(1,1/4),others(1/3,1/3) Top 2

Lots of Bookkeeping


Optimality of FA

Assumption: t is monotone

Cost: $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability

Optimality: every algorithm that correctly finds the top k answers for a strict monotone query $F_t(A_1, A_2, \ldots, A_m)$, where $A_1, A_2, \ldots, A_m$ are independent, and that makes no wild guesses has cost $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability

FA is optimal among all such algorithms in the high-probability sense
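To get a feel for the growth rate, the term inside the Θ can be evaluated for sample sizes. The sketch ignores the constant factors that Θ hides, and the values of N, m, and k are arbitrary illustrations, not figures from the talk.

```python
def fa_cost_order(n_objects, m, k):
    """Order-of-growth term N^((m-1)/m) * k^(1/m), ignoring constant factors."""
    return n_objects ** ((m - 1) / m) * k ** (1 / m)


# One million objects, top-1 query.
two_lists = fa_cost_order(1_000_000, m=2, k=1)    # square root of N
three_lists = fa_cost_order(1_000_000, m=3, k=1)  # N to the power 2/3
```

With two lists the term is about 1,000, and with three lists about 10,000, so the cost is sublinear in N but grows quickly with the number of lists.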


Optimality of TA

Assumption: t is monotone

Instance optimality: for any algorithm C that correctly finds the top k answers for a monotone query $F_t(A_1, A_2, \ldots, A_m)$ without wild guesses, and for any database D, cost(TA, D) = O(cost(C, D))

TA is instance optimal among all such algorithms


Optimality of NRA

Assumption: t is monotone

Instance optimality: NRA is instance optimal among all algorithms that correctly find the top k objects for a monotone query t on every database and do not make random accesses


Algorithm Comparison (from Zhang2002 talk)

Algorithm  Assumption  Access Model    Termination (Worst Case)  Termination (Expected)   Buffer Space
FA         Monotone    Sorted, Random  n(m-1)/m + k/m            $N^{(m-1)/m} k^{1/m}$    N
TA         Monotone    Sorted, Random  Bounded by FA             Depends on distribution  k
NRA        Monotone    Sorted          N                         Depends on distribution  N


Worst Case

O1      1.0  0.0
O2      1.0  0.0
...
On+1    1.0  1.0
On+2    0.0  1.0
On+3    0.0  1.0
...
O2n+1   0.0  1.0

Aggregation function: min

n(m-1)/m + k/m
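The database on this slide can be built for a small n to see how deep parallel sequential access must go before any object has been seen in both lists. This sketch assumes two lists (the two score columns) and adversarial tie-breaking that places On+1 last among the objects scoring 1.0 in each list; both the tie order and the helper names are assumptions.

```python
def worst_case_lists(n):
    """Build the two sorted lists for objects O_1 .. O_(2n+1) on this slide."""
    ids = range(1, 2 * n + 2)
    first = {i: 1.0 if i <= n + 1 else 0.0 for i in ids}   # first score column
    second = {i: 1.0 if i >= n + 1 else 0.0 for i in ids}  # second score column
    # Assumed adversarial tie-breaking: O_(n+1) comes last among the 1.0 scores.
    list1 = sorted(ids, key=lambda i: (-first[i], i))
    list2 = sorted(ids, key=lambda i: (-second[i], -i))
    return list1, list2


def depth_of_first_match(list1, list2):
    """Depth of parallel sequential access at which an object is seen in both lists."""
    seen1, seen2 = set(), set()
    for depth, (a, b) in enumerate(zip(list1, list2), start=1):
        seen1.add(a)
        seen2.add(b)
        if seen1 & seen2:
            return depth
    return None


list1, list2 = worst_case_lists(5)
depth = depth_of_first_match(list1, list2)
```

For n = 5 the only object with min score 1.0, O6, sits at position 6 of 11 in both lists, so about half of each list is read before it can be confirmed.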


Naïve algorithm

Algorithm:

For each criterion, do sequential access to retrieve all objects and their scores

Calculate the combined scores for all objects

Pick the top K

Comments:

Accesses the entire database

Cost is linear in the database size

Does NOT use the fact that each list is sorted
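A minimal sketch of the naïve algorithm, with an access counter to make the linear cost visible. The function name is illustrative, the data is the car example from earlier in the talk, and min is used as the aggregation function purely as an example.

```python
import heapq


def naive_top_k(lists, agg, k):
    """Read every entry of every list, then pick the k best combined scores."""
    scores = {}   # object -> its scores, in list order
    accesses = 0  # number of sequential accesses performed
    for lst in lists:
        for obj, score in lst:  # sequential access over the whole list
            scores.setdefault(obj, []).append(score)
            accesses += 1
    combined = {obj: agg(vals) for obj, vals in scores.items()}
    top = heapq.nlargest(k, combined.items(), key=lambda item: item[1])
    return top, accesses


mileage = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
year = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
price = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]
top, accesses = naive_top_k([mileage, year, price], min, k=1)
```

With 5 objects and 3 lists, all 15 entries are read regardless of k, which is the baseline that FA and TA improve on.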


Fagin Algorithm

Algorithm:

Do sequential access in parallel to all sorted lists Li until there are k “matches”. A “match” is an object that has been seen in all sorted lists Li.

Then, for each object that has been seen, do random access to get all of its scores.

Compute the combined scores and pick the top k.
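The three steps can be sketched as follows. This is a minimal in-memory sketch, assuming every object appears in every list; the function name and the dict-based random access are illustrative choices, and the data is the car example with its weighted sum.

```python
def fagin_algorithm(lists, agg, k):
    """Return the top-k (object, score) pairs and the number of objects buffered."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # simulates random access
    seen_in = {}                           # object -> indexes of lists where seen
    # Step 1: sequential access in parallel until k objects are seen in all lists.
    for row in zip(*lists):
        for i, (obj, _) in enumerate(row):
            seen_in.setdefault(obj, set()).add(i)
        matches = sum(1 for where in seen_in.values() if len(where) == m)
        if matches >= k:
            break
    # Step 2: random access for every seen object to get all of its scores.
    scored = [(obj, agg([table[obj] for table in lookup])) for obj in seen_in]
    # Step 3: rank by combined score and keep the top k.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k], len(seen_in)


mileage = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
year = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
price = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]


def weighted(scores):
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]


top, buffered = fagin_algorithm([mileage, year, price], weighted, k=2)
```

On this small data set FA reads four rows and ends up buffering all five cars before it can rank them, which illustrates the large-buffer drawback.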