31
1 An Overview of Distributed Top-K Ranking Algorithms 30-min presentation by Demetris Zeinalipour Lecturer School of Pure and Applied Sciences Open University of Cyprus Friday, December 12 th , 2008, 16:00-16:30 Communication Systems Group (CSG), ETH Zurich, Switzerland http://www.cs.ucy.ac.cy/~dzeina/

An Overview of Distributed Top-K Ranking Algorithms

  • Upload
    bracha

  • View
    54

  • Download
    0

Embed Size (px)

DESCRIPTION

An Overview of Distributed Top-K Ranking Algorithms. 30-min presentation by Demetris Zeinalipour Lecturer School of Pure and Applied Sciences Open University of Cyprus. Friday , December 12 th , 200 8, 16:00-16:30 Communication Systems Group (CSG), ETH Zurich, Switzerland. - PowerPoint PPT Presentation

Citation preview

Page 1: An Overview of Distributed Top-K Ranking Algorithms

1

An Overview of Distributed Top-K Ranking Algorithms

30-min presentation by

Demetris ZeinalipourLecturer

School of Pure and Applied SciencesOpen University of Cyprus

Friday, December 12th, 2008, 16:00-16:30Communication Systems Group (CSG), ETH Zurich, Switzerland

http://www.cs.ucy.ac.cy/~dzeina/

Page 2: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)2

Top-k Queries: Introduction• Top-K Queries are a long studied topic in the

database and information retrieval communities

• The main objective has been to return the K highest-ranked answers quickly and efficiently.

• A Top-K query returns the subset of most relevant answers, in place of ALL answers, for two reasons:

– i) to minimize the cost metric that is associated with the retrieval of all answers (e.g., disk, network, etc.)

– ii) to maximize the quality of the answer set, such that the user is not overwhelmed with irrelevant results

Page 4: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)4

Top-k Queries: Now

A few motivating queries:• Snapshot Query: Find the K nodes with the highest

temperature values• Continuous Query: For the next one hour continuously

report the K rooms with the highest average temperature• Historic Query (nodes store all data locally): Find the

K nodes with the highest average temperature during the last 6 months

Base Station

In-Network Top-k Query Processing

Page 5: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)5

Top-k Queries: Now • Assume a cluster of n=5 Web-servers• Each server maintains locally a replica of the

same m=5 static Web-pages• When a web page is accessed by a client, the

respective server increases a local hit counter by one

v1 v2 v3 v4 v5

http://www.amazon.com/

TOP-1 Query: “Find the webpage with the highest number of hits across all servers”

client

Hits++

5

v1 v2 v3 v4 v5o1,.91o3,.90o0,.61o4,.07o2,.01

o1,.92o3,.75o4,.70o2,.16o0,.01

o3,.74o1,.56o2,.56o0,.28o4,.19

o3,.67o4,.67o1,.58o2,.54o0,.35

TOP-1

o3,4.05/5=.81o1,3.63/5=.73o4,2.07/5=.41o0,1.88/5=.32o2,1.75/5=.29

o3,4.05/5=.81o3,.99o1,.66o0,.63o2,.48o4,.44

{(N) Web-servers

HitsPageID

(M)

Tim

esta

mp

s

Scoring Table

Page 6: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)6

Presentation OutlineA. IntroductionB. Centralized Top-K Query Processing

• The Threshold Algorithm (TA)C. Distributed Top-K Query Processing

• The Threshold Join Algorithm (TJA)• Experimentation using 75 workstations

D. Other Applications of Top-K Queries• Distributed Spatio-temporal Trajectory Retrieval• In-Network Top-K Views (MINT Views)

Page 7: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)7

Centralized Top-K Query ProcessingFagin’s* Threshold Algorithm (TA): (In ACM PODS’02) * Concurrently developed by 3 groupsThe most widely recognized algorithm for Top-K Query Processing in database systems

ΤΑ Algorithm1) Access the n lists in parallel.2) While some object oi is seen, perform a random access to the other lists to find the complete score for oi. 3) Do the same for all objects in the current row.4) Now compute the threshold τ as the sum of scores in the current row.5)The algorithm stops after K objects have been found with a score above τ.

v1 v2 v3 v4 v5o1, 91o3, 90o0, 61o4, 07o2, 01

o1, 92o3, 75o4, 70o2, 16o0, 01

o3, 74o1, 56o2, 56o0, 28o4, 19

o3, 67o4, 67o1, 58o2, 54o0, 35

o3, 99o1, 66o0, 63o2, 48o4, 44

Page 8: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)8

Centralized Top-K: The TA Algorithm (Example)

o3,4.05/5=.81

v1 v2 v3 v4 v5o3, 99o1, 66o0, 63o2, 48o4, 44

o1, 91o3, 90o0, 61o4, 07o2, 01

o1, 92o3, 75o4, 70o2, 16o0, 01

o3, 74o1, 56o2, 56o0, 28o4, 19

o3, 67o4, 67o1, 58o2, 54o0, 35

TOP-K

Have we found K=1 objects with a score above τ? => ΝΟ

Have we found K=1 objects with a score above τ? => YES!

Iteration 1 Thresholdτ = 99 + 91 + 92 + 74 + 67 => τ = 423

Iteration 2 Thresholdτ (2nd row)= 66 + 90 + 75 + 56 + 67 => τ = 354

O3, 405O1, 363O4, 207

Why is the threshold correct? It gives us the maximum score for the objects we have not seen yet (<= τ)

Page 9: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)9

Presentation OutlineA. Top-K Algorithms: DefinitionsB. Centralized Top-K Query Processing

• The Threshold Algorithm (TA)

C. Distributed Top-K Query Processing• The Threshold Join Algorithm (TJA)• Experimentation using 75 workstations

D. Other Applications of Top-K Queries• Distributed Spatio-temporal Trajectory

Retrieval• In-Network Top-K Views (MINT Views

Page 10: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)10

The Centralized Join Algorithm (CJA)

• Naive solution:– Perform the computation in

one phase: each node sends its complete list of scores

– Each intermediate node forwards all received lists

v1

v3

v2

v4

v5

TOP-1

5:4:

5:

3:

5:4:

3:2:

5:4:3:2:1:

1,2,3,4,5

o3, 67o4, 67o1, 58o2, 54o0, 35

• Problem: To overcome the arbitrary phases of the Threshold Algorithm?

• Disadvantage– Overwhelming amount of

messages.– Huge Query Response Time

Page 11: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)11

The Staged Join Algorithm (SJA)

• Improved Solution: Aggregate the lists before these are forwarded to the parent:

• This is the In-network aggregation approach

• Advantage: Only O(n) messages• Disadvantage: The size of each

message is still very large in size (i.e., the complete list)

v1

v3

v2

v4

v5

5:

3:

2,3,4,5:

4,5:

TOP-1

1,2,3,4,51,2,3,4,5

2,3 4,5

4 5

o3, 67o4, 67o1, 58o2, 54o0, 35

o3, 74o1, 56o2, 56o0, 28o4, 19

Page 12: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)12

Threshold Join Algorithm (TJA)• TJA is our 3-phase algorithm that

optimizes top-k query execution in distributed (hierarchical) environments.

• Advantage:– It usually completes in 2 phases.– It never completes in more than 3 phases

(LB Phase, HJ Phase and CL Phase)– It is therefore highly appropriate for distributed

environments• “The Threshold Join Algorithm for Top-k Queries in Distributed Sensor Networks", D. Zeinalipour-Yazti et. al, In VLDB’s DMSN’05.• “Finding the K Highest-Ranked Answers in a Distributed Network”, D. Zeinalipour-Yazti et. al, Computer Networks, Elsevier, 2008.

Page 13: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)13

Step 1 - LB (Lower Bound) Phase• Recursively send the K

highest objectIDs of each node to the sink.

• Each intermediate node performs a union of the received results (defined as τ)

v1

v3

v2

v4

v5

5:

3:

2,3,4,5:

TJA1) LB Phase

4,5:

4U5

2,3U4,5

U1

1,2,3,4,5Ltotal{1,3}

Occupied Oij

Empty Oij

v1 v2 v3 v4 v5o3, 99o1, 66o0, 63o2, 48o4, 44

o1, 91o3, 90o0, 61o4, 07o2, 01

o1, 92o3, 75o4, 70o2, 16o0, 01

o3, 74o1, 56o2, 56o0, 28o4, 19

o3, 67o4, 67o1, 58o2, 54o0, 35

LB

{o3, o1}

Query: TOP-1

Τ=

Page 14: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)14

Step 2 – HJ (Hierarchical Join) Phase• Disseminate τ to all nodes • Each node sends back all

objects with score above the objectIDs in τ

• Before sending the objects, each node tags as incomplete, scores that could not be computed exactly

TJA2) HJ Phase

v1

v3

v2

v4

v5

5:

3:

2,3,4,5:

4,5:

4 5

2,3 4,5

1,2,3,4,5Rtotal{1,3,4}

Occupied Oij

Empty Oij

Incomplete Oij

U+

U+

U+

o3, 405o1, 363o4',354

v1 v2 v3 v4 v5o3, 99o1, 66o0, 63o2, 48o4, 44

o1, 91o3, 90o0, 61o4, 07o2, 01

o1, 92o3, 75o4, 70o2, 16o0, 01

o3, 74o1, 56o2, 56o0, 28o4,19

o3, 67o4, 67o1, 58o2, 54o0, 35

HJ

} Complete

Incomplete

Page 15: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)15

Step 3 – CL (Cleanup) Phase

• Have we found K objects with a complete score that is above all incomplete scores?

– Yes: The answer has been found!– No: Find the complete score for each

incomplete object (all in a single batch phase)

• CL ensures correctness

• This phase is rarely required in practice!

Page 16: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)16

Experimental Evaluation• We have implemented a P2P middleware in

JAVA (sockets + binary transfer protocol).• We tested our implementation with a

network of 1000 real nodes using 75 Linux workstations.

• We use a trace driven experimentation methodology with data from an Environmental Monitoring Facility in Washington / OregonSummary of FindingsBytes: CJA = 10xTJA; SJA = 3xTJATime: TJA:3.7s [LB:1.0s,HJ:2.7s,CL:0.08s]; SJA: 8.2s; CJA:18.6sMessages:TJA:259, SJA:183, CJA:246

Page 17: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)17

Presentation OutlineA. Top-K Algorithms: DefinitionsB. Centralized Top-K Query Processing

• The Threshold Algorithm (TA)

C. Distributed Top-K Query Processing• The Threshold Join Algorithm (TJA)• Experimentation using 75 workstations

D. Other Applications of Top-K Queries• Distributed Spatio-temporal Trajectory

Retrieval (UB-K and UBLB-K Algorithms)• In-Network Top-K Views (MINT Views)

Page 18: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)18

Application 2: SpatioTemporal Similarity Search• Similarity Search: Given a query Q, find the degree of

similarity (Euclidean distance, DTW, LCSS) between Q and a set of m target trajectories {A1,A2,…,Am}.

• Each Αi (i<=m) is segmented into a number of non-overlapping cells {C1,C2,…,Cn} that maintain the local subsequences.

• Challenge: How can we find the K most similar trajectories to Q without pulling together all subsequences

G

trajectoriesA2

A1

x

ycell

Access Pointmoving object

Q "Distributed Spatio-Temporal Similarity Search”, D. Zeinalipour-Yazti, S. Lin, D. Gunopulos, ACM 15th Conference on Information and Knowledge Management, (ACM CIKM 2006), November 6-11, Arlington, VA, USA, pp.14-23, August 2006.

Page 19: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)19

Application 2: Spatiotemporal Query ProcessingSolution Outline• Each cell computes a lower bound and an upper bound on

the matching of Q to its local subsequences.• The distributed scoring table now contains score bounds

(lower,upper) rather than exact scores.

A2,3,6A0,4,8A4,5,10A7,7,9A3,8,11A9,8,9

....

A4,10,18A2,13,19A0,15,25A3,20,27A9,22,26A7,30,35

....

m

A4,4,5A2,5,6A0,5,7A3,5,6A9,8,10A7,12,13

....

A4,1,3A0,6,10A2,5,7A9,6,7A3,7,10A7,11,13

....

id,lb,ubv3

id,lb,ubv2

id,lb,ubv1

id,lb,ubMETADATA

nG

trajectoriesA2

A1

x

ycell

Access Pointmoving object

Q

• We have proposed two iterative algorithms: UB-K and UBLB-K, which combine these score bounds.

• UB-K and UBLB-K find the K most similar trajectories to Q without pulling together the distributed subsequences.

Page 20: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)20

Application 3: ΜΙΝT• ΜΙΝΤ : a framework for optimizing the execution of

continuous monitoring queries in sensor networks. • "MINT Views: Materialized In-Network Top-k Views in Sensor

Networks" D. Zeinalipour-Yazti, P. Andreou, P. Chrysanthis and G. Samaras, In IEEE 8th International Conference on Mobile Data Management, Mannheim, Germany, May 7 – 11, 2007

Query: Find the K=1 rooms with the highest average temperature

Page 21: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)21

ΜΙΝΤ Views: ProblemObjective: To prune away tuples locally at each sensor such that messaging is minimized.

Naïve Solution: Each node eliminates any tuple with a score lower than its top-1 result.

D,76.5C,75B,41

(B,40)Problem:

We received a incorrect answer i.e., (D,76.5) instead of (C,75).

Page 22: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)22

ΜΙΝΤ Views: Main Idea• Bound above each tuple with its maximum possible value.• K-covered Bound-set : Includes all the objects which

have an upper bound (vub) greater or equal to the kth highest lower bound (τ), i.e., vub > τ

vubvlbτ sum

Page 23: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)23

ΜΙΝΤ Views: Main Idea• Bound above each tuple with its maximum possible value.• K-covered Bound-set : Includes all the objects which

have an upper bound (vub) greater or equal to the kth highest lower bound (τ), i.e., vub > τ

vubvlbτ sum

Page 24: An Overview of Distributed Top-K Ranking Algorithms

24

An Overview of Distributed Top-K Ranking Algorithms

Thank you!Demetris Zeinalipour

This presentation is available at:http://www2.cs.ucy.ac.cy/~dzeina/talks.html

Related Publications available at:http://www2.cs.ucy.ac.cy/~dzeina/publications.htm

Page 25: An Overview of Distributed Top-K Ranking Algorithms

Backup Slides

Main Findings: Dataset: Environmental Measurements from atmospheric monitoring stations in Washington & Oregon. (2003-2004) Query: Find the K timestamps on which the average temperature across all stations was maximum. Network: Random Graph (degree=4, diameter 10) Evaluation Criterions: i) Bytes, ii) Time, iii) Messages

Page 26: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)26

Experimental Results

TJA requires one order of magnitude less bytes than CJAs!

Page 27: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)27

Experimental Results

TJA: 3.7sec [ LB:1.0sec, HJ:2.7sec, CL:0.08sec ] SJA: 8.2sec CJA:18.6sec

Page 28: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)28

Experimental Results

259

183

246

Although TJA consumes more messages than SJA these are small-size messages

Page 29: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)29

The TPUT Algorithmv1 v2 v3 v4 v5

o3, 99o1, 66o0, 63o2, 48o4, 44

o1, 91o3, 90o0, 61o4, 07o2, 01

o1, 92o3, 75o4, 70o2, 16o0, 01

o3, 74o1, 56o2, 56o0, 28o4, 19

o3, 67o4, 67o1, 58o2, 54o0, 35

P1 P2 P3

TOP-1

Phase 1 : o1 = 91+92 = 183, o3 = 99+67+74 = 240

τ = (Kth highest score (partial) / n) => 240 / 5 => τ = 48 Phase 2 : Have we computed K exact scores ?

Computed Exactly: [o3, o1] Incompletely Computed: [o4,o2,o0] Drawback: The threshold is uniform (too coarse)

Q: TOP-1

o1=183, o3=240o3=405o1=363o2’=158o4’=137o0’=124

Page 30: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)30

TJA vs. TPUT

Page 31: An Overview of Distributed Top-K Ranking Algorithms

Demetris Zeinalipour (Open University of Cyprus)31

ΜΙΝΤ Views: Experimentation• We obtained a real trace of atmospheric data collected by

UC-Berkeley on the Great Duck Island (Maine) in 2002.• We then performed a trace-driven experimentation using

XBows TELOSB sensor.• Our query was as follows:

– SELECT TOP-K area, Avg(temp)– FROM sensors– GROUP BY area

0%

39%

77%

34%12%