All right reserved by Xuehua Shen xshen@uiuc.edu 1

Optimal Aggregation Algorithms for Middleware

Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)


Problem: Rank Aggregation

Each object is scored under m different criteria

There are m sorted lists, one per criterion

The combined score is calculated by an aggregation function

Problem: find the top-k objects with the highest combined scores


Example

carID  MileageScore
c      1.0
a      0.8
e      0.6
b      0.5
d      0.5

carID  YearScore
a      0.9
b      0.7
c      0.7
d      0.7
e      0.5

carID  PriceScore
d      1.0
e      0.9
b      0.8
c      0.7
a      0.6

Rank aggregation, e.g. weighted sum:

Combined score = 0.2 * mileage score + 0.3 * year score + 0.5 * price score

Top 2 cars:

carID  score
d      0.81
c      0.76

Do we need to access all entries of all sorted lists?
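As a check on this example, the two winning scores can be worked out by hand from the three score tables and the weighted sum given on the slide. The symbol t for the combined score is borrowed from the later slides; the scores of the remaining cars are computed the same way.

```latex
\begin{align*}
t(d) &= 0.2 \cdot 0.5 + 0.3 \cdot 0.7 + 0.5 \cdot 1.0 = 0.10 + 0.21 + 0.50 = 0.81 \\
t(c) &= 0.2 \cdot 1.0 + 0.3 \cdot 0.7 + 0.5 \cdot 0.7 = 0.20 + 0.21 + 0.35 = 0.76 \\
t(a) &= 0.2 \cdot 0.8 + 0.3 \cdot 0.9 + 0.5 \cdot 0.6 = 0.16 + 0.27 + 0.30 = 0.73 \\
t(e) &= 0.2 \cdot 0.6 + 0.3 \cdot 0.5 + 0.5 \cdot 0.9 = 0.12 + 0.15 + 0.45 = 0.72 \\
t(b) &= 0.2 \cdot 0.5 + 0.3 \cdot 0.7 + 0.5 \cdot 0.8 = 0.10 + 0.21 + 0.40 = 0.71
\end{align*}
```

So d and c are the top 2, matching the slide.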


Applications

Multimedia database systems

Web search queries

[Figure, from the Zhang2002 talk: the query Color=‘red’ and Shape=‘round’ is sent to a rank aggregation engine, which combines one sorted list for Color = ‘red’ and one sorted list for Shape = ‘round’ and returns the top k]


Outline

Assumptions

Fagin Algorithm

Threshold Algorithm

Summary & Comments


Assumption 1: Modes of Access

Sequential Access: read the next object and its score from one sorted list, proceeding from the current position

Random Access: obtain the score of a given object in one sorted list with a single lookup

carID  Yearscore
a      0.8
c      0.8
e      0.7
…

Assumption: both access modes are available
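The two access modes can be pictured as two methods on one list object. The sketch below is only an illustration: the class name `ScoredList` and its method names are hypothetical, and the data is the YearScore table from the Example slide.

```python
class ScoredList:
    """One subsystem's list of (object, score) pairs, sorted by descending score."""

    def __init__(self, scores):
        # scores: dict mapping object id -> score in [0, 1]
        self._by_id = dict(scores)
        self._sorted = sorted(scores.items(), key=lambda kv: -kv[1])
        self._pos = 0  # current position for sequential access

    def sequential_access(self):
        """Return the next (object, score) pair, or None when the list is exhausted."""
        if self._pos >= len(self._sorted):
            return None
        entry = self._sorted[self._pos]
        self._pos += 1
        return entry

    def random_access(self, obj):
        """Return the score of a given object with a single lookup."""
        return self._by_id[obj]


year = ScoredList({"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.5})
first = year.sequential_access()    # top of the list
second = year.sequential_access()   # next entry from the current position
e_score = year.random_access("e")   # direct lookup, independent of position
```

Sequential access is stateful (it advances a cursor), while random access is not; this is why the two are counted separately when the cost of an algorithm is measured.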


Assumption 2: Aggregation Function

Each object gets a score in the interval [0,1] from each subsystem

An aggregation function combines these scores into a single combined score, e.g. min, avg

Monotone: $f(x_1, x_2, \ldots, x_m) \le f(y_1, y_2, \ldots, y_m)$ if $x_i \le y_i$ for every $i$


Intuition of Algorithms

Objects near the top of the individual sorted lists are likely to be among the correct answers

Do some accesses, then ask: “Can we stop now?”


Fagin Algorithm

carID  Pricescore
a      0.9
c      0.8
e      0.7
…

carID  Mileagescore
b      1.0
e      0.8
f      0.7
…

carID  Yearscore
a      0.8
c      0.8
e      0.7
…

‘e’ appears in all of them. The top-1 object must be in {a, b, c, e, f}. Why?

The aggregation function is monotone, so object ‘e’ blocks all objects below it

Do random access for these 5 objects to get their scores and pick the top-1.

We can’t say ‘e’ must be the top-1; other objects can still have a higher combined score


Drawbacks of Fagin Algorithm

Uses only the information provided by the sorted lists and the monotone property

Has to remember many objects: large buffer size


Threshold Algorithm (TA)

Intuition: the combined score calculated by the aggregation function provides some extra information: an upper bound (or threshold) on the combined score of unseen objects!

When object R is seen under sequential access, immediately do random access to get all other scores of object R and compute its combined score

At the same time, keep track of the upper bound for the unseen objects: the aggregation function applied to the last score seen in each list

Halt when at least k objects have combined scores no less than the upper bound
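The steps of TA can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pseudocode: the function name `threshold_algorithm`, the dict-per-list representation, and the assumption that every list contains every object are all choices made for the sketch. The data and the weighted sum come from the Example slide.

```python
import heapq

def threshold_algorithm(lists, aggregate, k):
    """Sketch of TA. lists: one dict {object: score} per criterion (assumed complete)."""
    # Sorted view of each list, used for sequential access.
    ranked = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    top = []    # min-heap of (combined score, object); never holds more than k + 1 entries
    depth = 0
    while depth < len(ranked[0]):
        last_seen = []
        for r in ranked:
            obj, score = r[depth]          # sequential access
            last_seen.append(score)
            if any(o == obj for _, o in top):
                continue                   # already buffered
            # Random access: fetch the object's score in every list.
            combined = aggregate([l[obj] for l in lists])
            heapq.heappush(top, (combined, obj))
            if len(top) > k:
                heapq.heappop(top)         # drop the current worst
        depth += 1
        # Upper bound on the combined score of any unseen object.
        threshold = aggregate(last_seen)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), depth


def weighted_sum(s):
    return 0.2 * s[0] + 0.3 * s[1] + 0.5 * s[2]

mileage = {"c": 1.0, "a": 0.8, "e": 0.6, "b": 0.5, "d": 0.5}
year = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.5}
price = {"d": 1.0, "e": 0.9, "b": 0.8, "c": 0.7, "a": 0.6}

result, depth = threshold_algorithm([mileage, year, price], weighted_sum, k=2)
```

On this data the sketch halts after reading three of the five rows in each list, once the threshold (0.73) has dropped below the second-best combined score (0.76), and returns d and c. An object that was seen earlier but is no longer buffered may be re-scored; that extra random access is the price of keeping the buffer at size k.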


TA: Example (k=1, AVG aggregation)

carID  Pricescore
a      0.9
c      0.8
e      0.7

carID  Mileagescore
b      1.0
e      0.8
f      0.7

carID  Yearscore
a      0.8
c      0.8
e      0.7

…

Step 1: sequential access to the price score of ‘a’ (0.9), then random access to the mileage score (0.6) and year score (0.8) of ‘a’; avg is 0.77

Step 2: sequential access to the mileage score of ‘b’ (1.0), then random access to the price score (0.7) and year score (0.7) of ‘b’; avg is 0.8

Upper bound: 0.9, buffer: 0.77

Upper bound: 0.8, buffer: 0.8

Const-size buffer


Evaluation of TA

TA never stops later than FA

TA requires only a small constant-size (k) buffer

However, TA may perform more random accesses than FA


Summary

FA and TA work with both sequential access and random access

TA can be extended to other situations:

Approximate algorithm

No random access


Comments

The algorithms rely on universal identifiers for objects across the different lists

The assumptions do not always hold, e.g. not every sorted list exists beforehand

Choosing sequential accesses wisely can speed up TA on skewed data


Backup Slides


Middleware

Middleware functions as a translation layer: it handles all incoming requests (such as top-k queries) and replies, interacting with the disparate back-office systems to gather the information it needs.

Application developers don’t need to know that there are several heterogeneous systems behind the middleware.


Boolean Query vs. Fuzzy Query

Semantics

Get all the results that satisfy the conditions vs. get the best possible answers to the query

Size of result: constant vs. variable

Processing the query

For a Boolean query, whether a tuple belongs to the result can be determined from the tuple itself, so each tuple can be handled individually

For a fuzzy query, whether a tuple is in the result cannot be determined from the tuple alone


Fuzzy Query Processor (from Zhang02)

[Figure: a Boolean query processor answers the query Title=‘database’ and Price <100 over a traditional database and returns a set. A fuzzy query processor answers the query Color=‘red’ and Shape=‘round’ over a database with fuzzy data, using one sorted list for Color = ‘red’ and one sorted list for Shape = ‘round’, and returns the top k]


Cost

Reduce the number of sequential accesses (Cs)

The number of random accesses is bounded by the number of sequential accesses times a factor of m-1

The overall cost is bounded by Cs times a constant factor

Really optimal?


Approximation Algorithm

Approximately top-k answers are acceptable or even desirable

θ-approximation (θ>1)

For any object y in the answer and any object z not in the answer: θt(y) >= t(z)

Turning TA into an approximation algorithm

Halt when the top k objects seen so far satisfy the inequality
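The θ-approximation condition can be checked mechanically. The sketch below assumes the condition compares each answer object y with every object z outside the answer; the function name `is_theta_approximation` is hypothetical, and the combined scores are the weighted-sum scores of the five cars from the Example slide.

```python
def is_theta_approximation(answer, scores, theta):
    """Check theta * t(y) >= t(z) for every y in the answer and every z outside it.

    answer: set of object ids; scores: dict object id -> combined score t.
    """
    worst_in = min(scores[y] for y in answer)
    best_out = max((scores[z] for z in scores if z not in answer), default=0.0)
    # Comparing the worst answer score with the best outside score covers all pairs.
    return theta * worst_in >= best_out


t = {"d": 0.81, "c": 0.76, "a": 0.73, "e": 0.72, "b": 0.71}

loose = is_theta_approximation({"a", "e"}, t, theta=1.2)   # 1.2 * 0.72 = 0.864 >= 0.81
tight = is_theta_approximation({"a", "e"}, t, theta=1.1)   # 1.1 * 0.72 = 0.792 <  0.81
```

A larger θ accepts more answers, which is what allows an approximate TA to halt earlier than the exact one.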


Non Random Access (NRA)

Similar to TA, except that there are:

No exact scores

No sorted order

Only lower and upper bounds on the scores of such objects

Do sequential access until there are k objects whose lower bounds are no less than the upper bounds of all other objects


NRA cont.

Lower bound: use 0 for each score not yet seen

Upper bound: use the last score seen in that list for each score not yet seen

carID  Pricescore
a      0.9
c      0.8
e      0.7
…

carID  Mileagescore
b      1.0
e      0.8
f      0.7
…

carID  Yearscore
a      0.8
c      0.8
e      0.7
…
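The two bounds can be computed from the visible prefixes of the three lists alone. The sketch below is an illustration under assumptions: the function name `nra_bounds` is hypothetical, avg is used as the aggregation function (as in the TA example), and only the three rows shown per list are treated as seen.

```python
def nra_bounds(prefixes, aggregate):
    """prefixes: for each list, the (object, score) pairs read so far, in sorted order."""
    m = len(prefixes)
    # The last score seen in each list bounds every score not yet seen in it.
    last_seen = [prefix[-1][1] for prefix in prefixes]
    known = {}
    for i, prefix in enumerate(prefixes):
        for obj, score in prefix:
            known.setdefault(obj, {})[i] = score
    bounds = {}
    for obj, seen in known.items():
        lower = aggregate([seen.get(i, 0.0) for i in range(m)])           # missing -> 0
        upper = aggregate([seen.get(i, last_seen[i]) for i in range(m)])  # missing -> last seen
        bounds[obj] = (lower, upper)
    return bounds


def avg(s):
    return sum(s) / len(s)

price = [("a", 0.9), ("c", 0.8), ("e", 0.7)]
mileage = [("b", 1.0), ("e", 0.8), ("f", 0.7)]
year = [("a", 0.8), ("c", 0.8), ("e", 0.7)]

bounds = nra_bounds([price, mileage, year], avg)
```

Here ‘e’ has been seen in all three lists, so its two bounds coincide, while ‘a’ is missing only its mileage score and so lies between avg(0.9, 0, 0.8) and avg(0.9, 0.7, 0.8).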


NRA example

Advantage: R1(1,0), others(1/3,1/3) Top 1

Top 2 vs. Top 1: R1(1,0),R2(1,1/4),others(1/3,1/3) Top 2

Lots of Bookkeeping


Optimality of FA

Assumption: t is monotone

Cost: $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability

Optimality: every algorithm that correctly finds the top k answers for a strict monotone query $F_t(A_1, A_2, \ldots, A_m)$, where $A_1, A_2, \ldots, A_m$ are independent, and that makes no wild guesses has cost $\Omega(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability

FA is optimal among all such algorithms in the high-probability sense


Optimality of TA

Assumption: t is monotone

Instance Optimality: for any algorithm C that correctly finds the top k answers for a monotone query $F_t(A_1, A_2, \ldots, A_m)$ without wild guesses, and for any database D,

Cost(TA, D) = O(Cost(C, D))

TA is instance optimal among all such algorithms


Optimality of NRA

Assumption: t is monotone

Instance Optimality: NRA is instance optimal among all algorithms that correctly find the top k objects for a monotone query t on every database and make no random accesses


Algorithm Comparison (from Zhang2002 talk)

Algorithm  Assumption  Access Model    Termination (Worst Case)  Termination (Expected)    Buffer Space
FA         Monotone    Sorted, Random  n(m-1)/m + k/m            $N^{(m-1)/m} k^{1/m}$     N
TA         Monotone    Sorted, Random  Bounded by FA             Depends on distribution   k
NRA        Monotone    Sorted          N                         Depends on distribution   N


Worst Case

O1      1.0  0.0
O2      1.0  0.0
...
On+1    1.0  1.0
On+2    0.0  1.0
On+3    0.0  1.0
...
O2n+1   0.0  1.0

Aggregation Function: min

n(m-1)/m + k/m
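The worst-case instance can be built and scanned to see how deep the parallel sequential scan must go before the first match. The sketch makes several assumptions that are not on the slide: the two columns are read as two sorted lists (m = 2), ties are broken adversarially so that On+1 comes last among the objects scoring 1.0, and the helper names are hypothetical. Only the order of objects in each list matters for counting matches, so scores are omitted.

```python
def depth_until_k_matches(orders, k):
    """Scan all lists in parallel; return the depth at which k objects have been seen in every list."""
    m = len(orders)
    counts = {}
    matches = 0
    for depth in range(len(orders[0])):
        for order in orders:
            obj = order[depth]
            counts[obj] = counts.get(obj, 0) + 1
            if counts[obj] == m:
                matches += 1
        if matches >= k:
            return depth + 1
    return len(orders[0])


def worst_case_orders(n):
    """Object orderings of the two sorted lists for the 2n+1 objects O1 .. O2n+1."""
    high_in_first = [f"O{i}" for i in range(1, n + 1)]            # scores (1.0, 0.0)
    high_in_both = [f"O{n + 1}"]                                  # scores (1.0, 1.0)
    high_in_second = [f"O{i}" for i in range(n + 2, 2 * n + 2)]   # scores (0.0, 1.0)
    list1 = high_in_first + high_in_both + high_in_second
    list2 = high_in_second + high_in_both + high_in_first
    return [list1, list2]


depth_small = depth_until_k_matches(worst_case_orders(4), k=1)
depth_large = depth_until_k_matches(worst_case_orders(10), k=1)
```

With 2n+1 objects, m = 2 and k = 1, the first match appears only at depth n+1. That equals (2n+1)(m-1)/m + k/m, which is consistent with the bound on the slide if its n is read as the total number of objects; that reading is an assumption.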


Naïve algorithm

Algorithm:

For each criterion, do sequential access to retrieve all objects and their scores

Calculate combined scores for all objects

Pick the top k

Comments:

Accesses the entire database

Cost is linear in the database size

Does NOT use the fact that each list is sorted
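The naïve algorithm is short enough to write out in full. The sketch below is one possible rendering, with a hypothetical function name `naive_top_k`; it counts sequential accesses to make the linear cost visible and uses the three lists and the weighted sum from the Example slide.

```python
import heapq

def naive_top_k(lists, aggregate, k):
    """lists: for each criterion, all (object, score) pairs. Reads every entry of every list."""
    m = len(lists)
    scores = {}
    accesses = 0
    for i, entries in enumerate(lists):
        for obj, score in entries:          # sequential access over the whole list
            scores.setdefault(obj, [0.0] * m)[i] = score
            accesses += 1
    combined = {obj: aggregate(s) for obj, s in scores.items()}
    top = heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])
    return top, accesses


def weighted_sum(s):
    return 0.2 * s[0] + 0.3 * s[1] + 0.5 * s[2]

mileage = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
year = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
price = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]

top2, accesses = naive_top_k([mileage, year, price], weighted_sum, k=2)
```

All 15 entries are read to return d and c, and the order of the entries within each list plays no role in the result.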


Fagin Algorithm

Algorithm:

Do sequential access in parallel to all sorted lists Li until there are k “matches”. A “match” is an object that has been seen in all sorted lists Li.

Then, for each object that has been seen, do random access to get all its scores.

Compute the combined scores and pick the top k
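The two phases of FA can be sketched as follows. This is an illustration under assumptions, not the paper's pseudocode: the function name `fagin_algorithm` and the dict-per-list representation are hypothetical, every list is assumed to contain every object, and the data and weighted sum come from the Example slide.

```python
def fagin_algorithm(lists, aggregate, k):
    """Sketch of FA. lists: one dict {object: score} per criterion (assumed complete)."""
    ranked = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    m = len(lists)
    seen_in = {}    # object -> set of list indices in which it has been seen
    matches = 0
    depth = 0
    # Phase 1: parallel sequential access until k objects have been seen in all lists.
    while matches < k and depth < len(ranked[0]):
        for i, r in enumerate(ranked):
            obj, _ = r[depth]
            seen_in.setdefault(obj, set()).add(i)
            if len(seen_in[obj]) == m:
                matches += 1
        depth += 1
    # Phase 2: random access for every seen object, then pick the top k.
    scored = [(aggregate([l[obj] for l in lists]), obj) for obj in seen_in]
    scored.sort(reverse=True)
    return scored[:k], depth, len(seen_in)


def weighted_sum(s):
    return 0.2 * s[0] + 0.3 * s[1] + 0.5 * s[2]

mileage = {"c": 1.0, "a": 0.8, "e": 0.6, "b": 0.5, "d": 0.5}
year = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.5}
price = {"d": 1.0, "e": 0.9, "b": 0.8, "c": 0.7, "a": 0.6}

top2, depth, buffered = fagin_algorithm([mileage, year, price], weighted_sum, k=2)
```

On this small example FA reads four of the five rows in each list and has to remember all five objects before it can answer, which illustrates the buffer-size drawback noted earlier.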
