Bounds for Overlapping Interval Join on MapReduce
Foto N. Afrati¹, Shlomi Dolev², Shantanu Sharma², and Jeffrey D. Ullman³
¹ National Technical University of Athens, Greece
² Ben-Gurion University of the Negev, Israel
³ Stanford University, USA
2nd Algorithms and Systems for MapReduce and Beyond (BeyondMR), Brussels, Belgium (27 March 2015)
Outline
• Introduction
  – Interval and Overlapping Intervals
  – Interval Join
  – Reducer capacity and Mapping Schema
• Goal of Mapping Schema and Our Contribution
• Unit-Length and Equally-Spaced Intervals
• Variable-Length and Equally-Spaced Intervals
• Conclusion
Introduction
• Interval
  – A pair [starting time, ending time]
  – A (time) interval $i$ is represented by a pair of times $[T_s^i, T_e^i]$, $T_s^i < T_e^i$, where $T_s^i$ and $T_e^i$ are the starting point and the ending point of interval $i$, respectively
  – Examples: my talk, a phase of a project, a class of a professor
  – For my talk: $T_s$ = 10am and $T_e$ = 10:30am
Introduction
• Overlapping intervals
  – Two intervals, say interval i and interval j, are called overlapping intervals if their intersection is nonempty
  – Example: a talk from 10am to 10:35am and a coffee break from 10:30am to 11am are overlapping intervals
[Figure: a pair of non-overlapping intervals and a pair of overlapping intervals i and j]
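The overlap definition can be sketched as a small predicate. This is a minimal illustration, assuming closed intervals and times encoded as minutes after midnight; the `Interval` type, the `overlaps` function, and the `lunch` interval are hypothetical names introduced here, not part of the slides.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float  # starting point
    end: float    # ending point, with start < end


def overlaps(i: Interval, j: Interval) -> bool:
    """Return True if the closed intervals i and j share at least one point."""
    return i.start <= j.end and j.start <= i.end


# Times are minutes after midnight (an illustrative encoding).
talk = Interval(10 * 60, 10 * 60 + 35)          # 10am to 10:35am
coffee_break = Interval(10 * 60 + 30, 11 * 60)  # 10:30am to 11am
lunch = Interval(12 * 60, 13 * 60)              # hypothetical, not on the slide

print(overlaps(talk, coffee_break))  # True
print(overlaps(talk, lunch))         # False
```

With closed intervals, two intervals that only touch at an endpoint still count as overlapping; a half-open convention would use strict comparisons instead.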
Introduction
• Overlapping Interval Join: an example

Project
Phase                       Duration
Requirement Analysis (RA)   1-Mar – 1-May
Design (D)                  1-Apr – 1-June
Coding (C)                  1-May – 1-Aug

Employee
EmpID/Name   Duration
U            1-Apr – 1-June
V            1-May – 1-July
W            1-Apr – 1-July
X            1-Mar – 1-June
Y            1-Mar – 1-Aug

[Figure: timeline from 1-Mar to 1-Aug showing the Project phases RA, D, and C and the Employee intervals]

Find all the employees that are involved in the RA phase of the project
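The example query can be sketched on a single machine as follows. This is only an illustration of the join condition, not a MapReduce implementation; the year 2015 is an assumption (the slides give only day and month), and `overlaps` and `employees_in_phase` are hypothetical helper names.

```python
from datetime import date

YEAR = 2015  # assumption: the slides do not state a year

project = {
    "RA": (date(YEAR, 3, 1), date(YEAR, 5, 1)),
    "D": (date(YEAR, 4, 1), date(YEAR, 6, 1)),
    "C": (date(YEAR, 5, 1), date(YEAR, 8, 1)),
}
employee = {
    "U": (date(YEAR, 4, 1), date(YEAR, 6, 1)),
    "V": (date(YEAR, 5, 1), date(YEAR, 7, 1)),
    "W": (date(YEAR, 4, 1), date(YEAR, 7, 1)),
    "X": (date(YEAR, 3, 1), date(YEAR, 6, 1)),
    "Y": (date(YEAR, 3, 1), date(YEAR, 8, 1)),
}


def overlaps(a, b, closed=True):
    """Overlap test; closed=True counts intervals that only touch at an endpoint."""
    if closed:
        return a[0] <= b[1] and b[0] <= a[1]
    return a[0] < b[1] and b[0] < a[1]


def employees_in_phase(phase, closed=True):
    """Names of employees whose interval overlaps the given project phase."""
    return sorted(
        name for name, iv in employee.items()
        if overlaps(iv, project[phase], closed)
    )


print(employees_in_phase("RA"))                # ['U', 'V', 'W', 'X', 'Y']
print(employees_in_phase("RA", closed=False))  # ['U', 'W', 'X', 'Y']
```

Employee V starts on 1-May, exactly when RA ends, so whether V appears in the answer depends on whether the endpoints are treated as part of the intervals.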
Introduction
• Reducer capacity
  – An upper bound on the total number of intervals that are assigned to a reducer
  – Example: the reducer capacity may be the size of the main memory of the processors on which the reducers run
• Communication cost
  – The total amount of data to be transferred from the map phase to the reduce phase
  – There is a tradeoff between the reducer capacity and the communication cost
Introduction
Mapping schema for interval join
An assignment of the set of intervals to some given reducers, such that
  – The reducer capacity is respected
    • The total number of intervals assigned to a reducer must be less than or equal to the reducer capacity
  – The inputs are assigned so that every output can be produced
    • For every output, the two corresponding overlapping intervals must be assigned to at least one reducer in common
[Figure: intervals I1, I2, and I3 assigned to reducers]
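The two conditions of a mapping schema can be checked mechanically. The sketch below is a simplified illustration that treats all intervals as one set (rather than two relations X and Y); the endpoints of I1, I2, and I3 and the function names are assumptions made for the example. The replication rate is computed as the average number of reducers an interval is sent to.

```python
from itertools import combinations


def overlaps(a, b):
    """Closed-interval overlap test on (start, end) pairs."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_valid_schema(intervals, reducers, q):
    """intervals: name -> (start, end); reducers: list of sets of names; q: capacity."""
    # Condition 1: no reducer holds more than q intervals.
    if any(len(r) > q for r in reducers):
        return False
    # Condition 2: every overlapping pair meets at some reducer.
    for a, b in combinations(intervals, 2):
        if overlaps(intervals[a], intervals[b]):
            if not any(a in r and b in r for r in reducers):
                return False
    return True


def replication_rate(intervals, reducers):
    """Average number of reducers each interval is assigned to."""
    return sum(len(r) for r in reducers) / len(intervals)


# Hypothetical endpoints: I1 overlaps I2, I2 overlaps I3, I1 and I3 are disjoint.
intervals = {"I1": (0, 2), "I2": (1, 3), "I3": (2.5, 4)}

good = [{"I1", "I2"}, {"I2", "I3"}]
print(is_valid_schema(intervals, good, q=2))                    # True
print(is_valid_schema(intervals, [{"I1", "I2"}, {"I3"}], q=2))  # False
print(is_valid_schema(intervals, [{"I1", "I2", "I3"}], q=2))    # False
```

The second schema fails because I2 and I3 never meet, and the third fails because a single reducer would exceed the capacity q = 2.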
State-of-the-Art
• B. Chawda, H. Gupta, S. Negi, T.A. Faruquie, L.V. Subramaniam, and M.K. Mohania, “Processing Interval Joins On Map-Reduce,” EDBT, 2014.
  – MapReduce-based 2-way and multiway interval join algorithms for overlapping intervals
  – Does not take the reducer capacity into account
  – No analysis of a lower bound on the replication of individual intervals
  – No analysis of the replication rate of the algorithms offered therein
Goal of Mapping Schema
• Interval join problem
  – Assign all the intervals that share at least one common point of time to at least one reducer in common, so that the outputs can be found
Our Contribution
• An algorithm for variable-length intervals that can start at any time
  – Before this, we consider two simpler cases and provide an algorithm for each:
    • Unit-length and equally-spaced intervals
    • Variable-length and equally-spaced intervals
• For all the algorithms, the upper bound on the replication rate almost matches the lower bound
Unit-Length and Equally-Spaced Intervals
• Relations X and Y of n intervals each
• No interval begins before 0 or after k
• Hence, the spacing between the starting points of two successive intervals is $k/n < 1$
[Figure: relations X and Y on a time axis from 0 to 2.25; n = 9 and k = 2.25, so spacing = 0.25]
Unit-Length and Equally-Spaced Intervals: Algorithm
• Divide the time range from 0 to k into equal-sized partitions of length w (say P partitions are created)
• Arrange P reducers
• Assign all intervals of X that exist in a partition $p_i$ to the ith reducer
• Assign all intervals of Y that have their starting point or ending point in partition $p_i$ to the ith reducer
[Figure: relations X and Y with n = 9 and k = 2.25, with the time axis divided into 5 partitions]
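The assignment steps above can be sketched as follows. This is a minimal single-machine illustration under stated assumptions: intervals are closed, partition i covers [i·w, (i+1)·w), starting points are taken as i·k/n for i = 0, ..., n-1, Y has the same layout as X, and w = 0.5 is chosen for convenience rather than taken from the slides. The function name `assign_to_reducers` is hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def assign_to_reducers(x_intervals, y_intervals, w):
    """Map reducer index -> {"X": set of intervals, "Y": set of intervals}."""
    reducers = defaultdict(lambda: {"X": set(), "Y": set()})
    for s, e in x_intervals:
        # An X interval goes to every partition it exists in.
        for p in range(math.floor(s / w), math.floor(e / w) + 1):
            reducers[p]["X"].add((s, e))
    for s, e in y_intervals:
        # A Y interval goes to the partitions of its two endpoints (at most two).
        for p in {math.floor(s / w), math.floor(e / w)}:
            reducers[p]["Y"].add((s, e))
    return reducers


def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


n, k, w = 9, 2.25, 0.5       # w is an assumed partition length
spacing = k / n              # 0.25
X = [(i * spacing, i * spacing + 1) for i in range(n)]  # unit-length intervals
Y = list(X)                  # assumption: Y mirrors X

reducers = assign_to_reducers(X, Y, w)

# Every overlapping (x, y) pair should be found at some reducer.
expected = {(x, y) for x, y in product(X, Y) if overlaps(x, y)}
found = {
    (x, y)
    for r in reducers.values()
    for x in r["X"]
    for y in r["Y"]
    if overlaps(x, y)
}
print(found == expected)  # True
```

For unit-length intervals, an overlapping Y interval always has its starting point or its ending point inside the X interval, which is why sending Y only to its two endpoint partitions loses no pairs here.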
Unit-Length and Equally-Spaced Intervals
• Does the algorithm work?
• Consider $q = \frac{3nw}{k} + \frac{n}{k} + 2$
  – q: the reducer capacity
  – w: the length of a partition
  – n: the total number of intervals in a relation
  – k: the last starting point of an interval
• Count how many intervals lie in a partition; if the count is less than or equal to q, then we have a solution and the algorithm works
Unit-Length and Equally-Spaced Intervals
• Does the algorithm work?
  – Count 1: How many intervals of Y overlap with an interval of X in a partition of length w?
    • The spacing is k/n, that is, n/k intervals start per unit of time, so at most 2wn/k intervals of Y can overlap with an interval of X
  – Count 2: How many intervals of X can have their starting point after the starting point of $x_i$, or before it while still reaching $x_i$?
    • Intervals of X after the starting point of $x_i$ = wn/k
    • Intervals of X before the starting point of $x_i$ = n/k
  – Count 3: Do not forget to count $x_i$ itself and an identical interval of Y, i.e., $y_i$
Unit-Length and Equally-Spaced Intervals
• Does the algorithm work?
  – Total number of intervals in a partition = Count 1 + Count 2 + Count 3
    $= \frac{2nw}{k} + \left(\frac{nw}{k} + \frac{n}{k}\right) + 2 = \frac{3nw}{k} + \frac{n}{k} + 2 = q$
  – OK. The algorithm works
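The counting argument can be checked numerically on the running example. The sketch below assumes that q is the sum of the three counts, q = 3nw/k + n/k + 2, that starting points are i·k/n for i = 0, ..., n-1, that Y has the same layout as X, and that w = 0.5 (an assumed value). It counts the intervals each reducer receives and compares the largest load with q.

```python
import math
from collections import Counter

n, k, w = 9, 2.25, 0.5                   # w is an assumed partition length
spacing = k / n                          # 0.25
starts = [i * spacing for i in range(n)]

load = Counter()                         # reducer index -> intervals received
for s in starts:                         # relation X: every partition it exists in
    for p in range(math.floor(s / w), math.floor((s + 1) / w) + 1):
        load[p] += 1
for s in starts:                         # relation Y: partitions of both endpoints
    for p in {math.floor(s / w), math.floor((s + 1) / w)}:
        load[p] += 1

q = 3 * n * w / k + n / k + 2            # assumed form of the reducer capacity
max_load = max(load.values())
print(max_load, q)                       # 10 12.0
```

On this instance the busiest reducer receives 10 intervals, which is within the capacity q = 12.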
Variable-Length and Equally-Spaced Intervals
• Two types of intervals
  – Big and small intervals
  – Different-length intervals
Variable-Length and Equally-Spaced Intervals
• Big and small intervals
  – All the intervals of X are of length $l_{min}$
  – All the intervals of Y are of length $l_{max}$
  – The previous algorithm works here too
  – Note that an interval of X will be replicated to several reducers, while an interval of Y will be replicated to at most two reducers
[Figure: relations X and Y on a time axis from 0 to 4.2; n = 6 and spacing = 0.7]
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
  – All the restrictions regarding the length of an interval and the spacing between two intervals are removed
  – Intervals can begin at some time greater than or equal to 0 and end by time T
  – S: the total length of the intervals in one relation
[Figure: relations X and Y on a time axis marked 0, s, s+1, s+2, s+3, T]
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
  – Algorithm
    • Divide the time range into equal-sized partitions
    • Arrange reducers, one per partition
    • Follow the same procedure as in the previous algorithm, i.e., assign all the intervals of X that belong to the ith partition to the ith reducer, and assign each interval of Y to the reducers corresponding to its starting point and ending point (at most two reducers)
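The assignment rule for the general case can be sketched per interval. This is a minimal illustration that follows the rule as stated on the slide and only shows which reducers each interval is sent to; it does not verify that every overlapping pair meets at a reducer. The values T = 10 and w = 2, the sample intervals, and the function names are assumptions made for the example.

```python
import math


def partition_of(t, w, num_reducers):
    """Index of the partition [i*w, (i+1)*w) holding time t; T maps to the last one."""
    return min(math.floor(t / w), num_reducers - 1)


def reducers_for_x(interval, w, num_reducers):
    """An X interval goes to every partition it exists in."""
    s, e = interval
    first = partition_of(s, w, num_reducers)
    last = partition_of(e, w, num_reducers)
    return list(range(first, last + 1))


def reducers_for_y(interval, w, num_reducers):
    """A Y interval goes to the partitions of its starting and ending points."""
    s, e = interval
    return sorted({partition_of(s, w, num_reducers),
                   partition_of(e, w, num_reducers)})


T, w = 10, 2                  # assumed time range and partition length
P = T // w                    # 5 reducers, one per partition
X = [(0, 3), (2.5, 4), (5, 9), (7, 7.5)]    # hypothetical sample data
Y = [(1, 1.5), (3, 6.5), (4, 5), (6, 9.5)]  # hypothetical sample data

for x in X:
    print("X", x, reducers_for_x(x, w, P))
for y in Y:
    print("Y", y, reducers_for_y(y, w, P))
```

In this sample the X interval (5, 9) is sent to three reducers, while no Y interval is sent to more than two, matching the replication pattern described on the slide.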
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
  – Does the algorithm work?
  – Consider $q = \frac{3nw + S}{T}$
  – Count the average number of intervals of X and Y sent to a reducer; if it is less than or equal to the reducer capacity, then the algorithm works
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
  – Count 1: Average number of intervals of Y received by a reducer
  – An interval of Y is sent to at most 2 reducers (replication)
  – There are $T/w$ reducers and n intervals in Y
    • Average number of intervals of Y received by a reducer $= \frac{2n}{T/w} = \frac{2nw}{T}$
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
  – Count 2: Average number of intervals of X received by a reducer
  – The average length of an interval is S/n
  – On average, an interval of X is sent to at most $1 + \frac{S}{nw}$ reducers (the average length divided by how much length a reducer can hold, plus one)
  – There are $T/w$ reducers and n intervals in X
    • Average number of intervals of X received by a reducer $= \frac{n\left(1 + \frac{S}{nw}\right)}{T/w} = \frac{nw + S}{T}$
Variable-Length and Equally-Spaced Intervals
• Variable-length intervals: A general case
  – Does the algorithm work?
  – Total number of intervals that a reducer receives = Count 1 + Count 2
    $= \frac{2nw}{T} + \frac{nw + S}{T} = \frac{3nw + S}{T} = q$
  – The algorithm works
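The average-load argument can be checked on a small instance. The sketch below assumes T/w reducers, takes S as the total length of the X intervals, and assumes the capacity has the form q = (3nw + S)/T; the values T = 10 and w = 2 and the sample intervals are hypothetical. It illustrates the bound on one data set and is not a proof.

```python
import math

T, w = 10, 2                  # assumed time range and partition length
P = T // w                    # number of reducers, one per partition
X = [(0, 3), (2.5, 4), (5, 9), (7, 7.5)]    # hypothetical sample data
Y = [(1, 1.5), (3, 6.5), (4, 5), (6, 9.5)]  # hypothetical sample data
n = len(X)
S = sum(e - s for s, e in X)  # total length of the intervals of X: 9.0


def part(t):
    """Index of the partition holding time t; T maps to the last partition."""
    return min(math.floor(t / w), P - 1)


# X: one copy per partition the interval exists in.
x_copies = sum(part(e) - part(s) + 1 for s, e in X)
# Y: one copy per distinct endpoint partition (at most two).
y_copies = sum(len({part(s), part(e)}) for s, e in Y)

average_load = (x_copies + y_copies) / P
q = (3 * n * w + S) / T       # assumed form of the reducer capacity

print(x_copies, y_copies)     # 8 6
print(average_load, q)        # 2.8 3.3
```

Here the reducers receive 2.8 intervals on average, below q = 3.3. Because the count is an average, an individual reducer can still receive more than the average on skewed data.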
Conclusion
• An investigation of good MapReduce algorithms for the problem of finding pairs of overlapping intervals
• Algorithms for:
  – Unit-sized and equally-spaced intervals
    • Lower bounds on the replication rate = 2 or 2q
    • Upper bounds on the replication rate (given in the paper)
  – Big-small and equally-spaced intervals
    • Lower bounds on the replication rate = 2 or 2q
    • Upper bounds on the replication rate (given in the paper)
  – A general case for variable-length intervals
    • Upper bounds on the replication rate (given in the paper)
• Proofs of lower and upper bounds on the replication rate are given in the paper
Foto Afrati¹, Shlomi Dolev², Shantanu Sharma², and Jeffrey D. Ullman³
¹ School of Electrical and Computing Engineering, National Technical University of Athens, Greece
² Department of Computer Science, Ben-Gurion University of the Negev, Israel, {dolev,sharmas}@cs.bgu.ac.il
³ Department of Computer Science, Stanford University, USA
Presentation is available at http://www.cs.bgu.ac.il/~sharmas/publication.html