MRShare: Sharing Across Multiple Queries in MapReduce
By Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
Presented by Xiaolan Wang and Pengfei Tang
Motivation
• Reducing the execution time
• Reducing energy consumption
• Monetary savings
*http://aws.amazon.com/ec2/#pricing
MRShare – a sharing framework for Map Reduce
• MRShare framework:
– Inspired by sharing primitives from relational domain
– Introduces a cost model for Map Reduce jobs
– Searches for the optimal sharing strategies
– Does not change the Map Reduce computational model
Outline
• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
• Cost model for MapReduce
• MRShare – Grouping algorithms
• MRShare Implementation and Evaluation
• Summary
Map Reduce recap.
[Figure: MapReduce data flow. Labels: four input blocks "I", Map, network, Reduce, Output, HDFS.]
Outline
• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
– Sharing scans
– Sharing intermediate data
• Cost model for MapReduce
• MRShare – Grouping algorithms
• MRShare Implementation and Evaluation
• Summary
Sharing opportunities – sharing scans
• SELECT COUNT(*) FROM user GROUP BY hometown
• SELECT AVG(age) FROM user GROUP BY hometown
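Input table: user(User_id, Hometown, Occupation, Age). Each SQL query runs as its own Map and Reduce over this input.
[Figure: the two jobs side by side.]
• COUNT job: Map turns the input record (id1, student, Toronto) into (Toronto, 1). Reduce receives Toronto 1, Toronto 1, Toronto 1, Ottawa 1, Ottawa 1 and outputs Toronto 3, Ottawa 2.
• AVG job: Map turns the same record into (Toronto, 17). Reduce receives Toronto 17, Toronto 19, Montreal 20, Ottawa 23, Ottawa 25 and outputs Toronto 18, Montreal 20, Ottawa 24.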
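MRShare – sharing scans (map)
[Figure: a single meta-map reads the input and feeds each record to Map 1, Map 2, Map 3 and Map 4, producing one combined map output.]
MRShare – sharing scans (reduce)
[Figure: the meta-reduce receives the combined map output with columns J1, J2, J3, J4, key, value (for example Toronto 1, Toronto 1, Toronto 1, Toronto 17, Toronto 19, Toronto 2, Toronto 5) and passes the values to Reduce 1, Reduce 2, Reduce 3 and Reduce 4.]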
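The shared scan can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not MRShare's Hadoop code: the names `meta_map`, `meta_reduce` and `run_shared` are hypothetical, the shuffle is simulated with an in-memory dictionary, and each emitted pair carries an integer tag naming the job it belongs to.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Assumed record layout: (user_id, hometown, occupation, age).
Record = Tuple[str, str, str, int]
Mapper = Callable[[Record], Iterator[Tuple[str, float]]]
Reducer = Callable[[str, List[float]], Tuple[str, float]]


def map_count(rec: Record) -> Iterator[Tuple[str, float]]:
    # SELECT COUNT(*) ... GROUP BY hometown
    yield rec[1], 1


def map_avg(rec: Record) -> Iterator[Tuple[str, float]]:
    # SELECT AVG(age) ... GROUP BY hometown
    yield rec[1], rec[3]


def reduce_count(key: str, values: List[float]) -> Tuple[str, float]:
    return key, sum(values)


def reduce_avg(key: str, values: List[float]) -> Tuple[str, float]:
    return key, sum(values) / len(values)


def meta_map(rec: Record, mappers: List[Mapper]) -> Iterator[Tuple[str, Tuple[int, float]]]:
    # Apply every job's map function to the record and tag each output
    # pair with the 1-based index of the job that produced it.
    for tag, mapper in enumerate(mappers, start=1):
        for key, value in mapper(rec):
            yield key, (tag, value)


def meta_reduce(key: str, tagged: List[Tuple[int, float]], reducers: List[Reducer]) -> Dict[int, Tuple[str, float]]:
    # Separate the values by tag, then call each job's own reduce function.
    by_job: Dict[int, List[float]] = defaultdict(list)
    for tag, value in tagged:
        by_job[tag].append(value)
    return {tag: reducers[tag - 1](key, vals) for tag, vals in sorted(by_job.items())}


def run_shared(records: Iterable[Record], mappers: List[Mapper], reducers: List[Reducer]) -> Dict[int, Dict[str, float]]:
    shuffled: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for rec in records:  # the input is scanned once for all jobs
        for key, tagged_value in meta_map(rec, mappers):
            shuffled[key].append(tagged_value)
    results: Dict[int, Dict[str, float]] = defaultdict(dict)
    for key in sorted(shuffled):
        for tag, (out_key, out_value) in meta_reduce(key, shuffled[key], reducers).items():
            results[tag][out_key] = out_value
    return dict(results)
```

Calling `run_shared(users, [map_count, map_avg], [reduce_count, reduce_avg])` returns one result dictionary per job tag, while `users` is iterated only once.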
Sharing Map Output
• SELECT T.a, sum(T.b) FROM T WHERE T.a>10 AND T.a<20 GROUP BY T.a
• SELECT T.a, avg(T.b) FROM T WHERE T.b>10 AND T.c<100 GROUP BY T.a
Sharing Map
• SELECT T.c, sum(T.b) FROM T WHERE T.c > 10 GROUP BY T.c
• SELECT T.a, avg(T.b) FROM T WHERE T.c > 10 GROUP BY T.a
Same reducing.
Sharing Parts of Map
• SELECT T.a, sum(T.b) FROM T WHERE T.c>10 AND T.a<20 GROUP BY T.a
• SELECT T.a, avg(T.b) FROM T WHERE T.c>10 AND T.c<100 GROUP BY T.a
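For the two "Sharing Map Output" queries, both jobs emit the pair (T.a, T.b), so a row that satisfies both WHERE clauses could be emitted once with a tag listing both jobs instead of twice. The Python sketch below illustrates that idea only; the function names, the dictionary row layout and the tuple-of-job-ids tag format are assumptions made for the example, not the representation MRShare uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, float]
Tagged = Tuple[Tuple[int, ...], float]  # (ids of the jobs that want the value, T.b)


def in_job1(row: Row) -> bool:
    # WHERE T.a > 10 AND T.a < 20
    return 10 < row["a"] < 20


def in_job2(row: Row) -> bool:
    # WHERE T.b > 10 AND T.c < 100
    return row["b"] > 10 and row["c"] < 100


def shared_map(row: Row) -> Iterator[Tuple[float, Tagged]]:
    # Emit (T.a, T.b) at most once per row, tagged with every job whose
    # predicate the row satisfies.
    tags = tuple(job for job, pred in ((1, in_job1), (2, in_job2)) if pred(row))
    if tags:
        yield row["a"], (tags, row["b"])


def shared_reduce(key: float, tagged: List[Tagged]) -> Dict[str, float]:
    # Job 1 computes sum(T.b), job 2 computes avg(T.b); each job only
    # sees the values whose tag contains its id.
    vals1 = [v for tags, v in tagged if 1 in tags]
    vals2 = [v for tags, v in tagged if 2 in tags]
    result: Dict[str, float] = {}
    if vals1:
        result["sum"] = sum(vals1)
    if vals2:
        result["avg"] = sum(vals2) / len(vals2)
    return result


def run(rows: Iterable[Row]) -> Tuple[Dict[float, Dict[str, float]], int]:
    groups: Dict[float, List[Tagged]] = defaultdict(list)
    emitted = 0
    for row in rows:
        for key, tagged_value in shared_map(row):
            groups[key].append(tagged_value)
            emitted += 1
    return {k: shared_reduce(k, v) for k, v in groups.items()}, emitted
```

The second return value of `run` counts the intermediate pairs, which makes the saving visible: a row wanted by both jobs contributes one pair rather than two.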
Cost model for Map Reduce (single job)
• Reading – f(input size)
• Sorting – f(intermediate data size)
• Transferring – f(intermediate data size)
• Writing – f(output size)
[Figure: job timeline with the phases Reading input, Sorting int. data, Transferring, Writing output.]
T(J) = T_read(J) + T_sort(J) + T_tr(J)
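As a rough illustration of how the three terms of T(J) combine, the Python sketch below evaluates the formula for one job. The slides only say that each term is some function f of a data size; treating every f as linear, and the coefficient names `c_read`, `c_sort` and `c_tr`, are assumptions made here for the example and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    # Hypothetical per-unit costs; a linear cost per term is an assumption.
    c_read: float  # cost of reading one unit of input
    c_sort: float  # cost of sorting one unit of intermediate data
    c_tr: float    # cost of transferring one unit of intermediate data


def job_cost(input_size: float, intermediate_size: float, p: CostParams) -> float:
    """T(J) = T_read(J) + T_sort(J) + T_tr(J) under the linear assumption."""
    t_read = p.c_read * input_size
    t_sort = p.c_sort * intermediate_size
    t_tr = p.c_tr * intermediate_size
    return t_read + t_sort + t_tr
```

With `CostParams(c_read=1.0, c_sort=2.0, c_tr=0.5)`, a job reading 100 units and producing 30 units of intermediate data costs 100 + 60 + 15 = 175 units.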
Cost of executing a group of jobs
[Figure: J1, J2 and J3 executed separately each go through Read, Sort, Transfer and Write; the merged job J1+J2+J3 goes through a single Read, Sort, Transfer and Write. Annotations: "Savings", "Potential savings", "Potential costs".]
Cost without grouping
n MapReduce jobs, J = {J1, . . . , Jn}, read from the same input file F.
• n – number of jobs
• m – number of map tasks
• r – number of reduce tasks
• |Mi| – the average output size of a map task
• |Ri| – the average input size of a reduce task
• |Di| – the size of the intermediate data of job Ji
|Di| = |Mi| · m = |Ri| · r
Cost with grouping
A single group G contains all n jobs and is executed as a single job JG.
• m – number of map tasks
• r – number of reduce tasks
• |Xm| – the average size of the combined output of map tasks
• |Xr| – the average size of the combined input of reduce tasks
• |XG| – the size of the intermediate data
|XG| = |Xm| · m = |Xr| · r
Beneficial conditions
n <= B
Finding the optimal sharing strategy
• An optimization problem
[Figure: three alternative groupings of the jobs J1, J2, J3, J4 and J5, including the “NoShare” and “GreedyShare” strategies.]
Sharing scans - cost based optimization
• Savings come from reduced number of scans
• The sorting cost might change
• The costs of copying and writing the output do not change
[Figure: J1, J2 and J3 executed separately each pay Read and Sort; the merged job J1+J2+J3 pays a single Read and Sort. Annotations: "Savings", "Potential costs".]
Outline
• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
• Cost model for MapReduce
• MRShare – Grouping algorithms
– SplitJobs – cost based algorithm for sharing scans
– MultiSplitJobs – an improvement of SplitJobs
• MRShare Evaluation
• Summary
SplitJobs – a DP solution for sharing scans.
• We reduce the problem of grouping to the problem of splitting a sorted list of jobs – by approximating the cost of sorting.
• Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points.
[Figure: SplitJobs takes the sorted list J1 J2 J3 J4 J5 J6 and splits it into the groups G1, G2, G3.]
SplitJobs (cont.)
GS(i, l) = GAIN(i, l) − f
c(l) is the savings of the optimal grouping of jobs J1, …, Jl.
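A split-point dynamic program over these definitions can be sketched as follows. The slide does not show the recurrence, so the sketch assumes the usual form c(l) = max over 1 ≤ i ≤ l of c(i−1) + GS(i, l), with c(0) = 0. It also assumes that a group containing a single job saves nothing (GS(i, i) = 0) and that GAIN(i, l) is supplied by the caller as a hypothetical function `gain`. The name `split_jobs` is illustrative; this is not the authors' implementation.

```python
from typing import Callable, List, Tuple


def split_jobs(
    n: int,
    gain: Callable[[int, int], float],
    f: float,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Split the sorted jobs J1..Jn into consecutive groups.

    Returns (c(n), groups), where each group is a 1-based inclusive
    range (i, l) meaning jobs Ji..Jl are executed together.
    """

    def gs(i: int, l: int) -> float:
        # GS(i, l) = GAIN(i, l) - f; a single-job group is assumed to save nothing.
        return 0.0 if i == l else gain(i, l) - f

    c = [0.0] * (n + 1)    # c[l]: best savings for jobs J1..Jl, c[0] = 0
    start = [0] * (n + 1)  # start[l]: first job of the last group in that optimum

    for l in range(1, n + 1):
        best, best_i = float("-inf"), l
        for i in range(1, l + 1):
            candidate = c[i - 1] + gs(i, l)
            if candidate > best:
                best, best_i = candidate, i
        c[l], start[l] = best, best_i

    # Walk the split points backwards to recover the groups.
    groups: List[Tuple[int, int]] = []
    l = n
    while l > 0:
        i = start[l]
        groups.append((i, l))
        l = i - 1
    groups.reverse()
    return c[n], groups
```

The two nested loops evaluate GS for every (i, l) pair once, so the sketch makes on the order of n² calls to `gain`.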
MultiSplitJobs – an improvement of SplitJobs
[Figure: MultiSplitJobs invokes SplitJobs several times on the jobs J1 to J8, forming the groups G1, G2, G3 and G4.]
MultiSplitJobs (cont.)
Outline
• Introduction
• Map Reduce recap.
• MRShare – Sharing primitives in Map-Reduce
• MRShare – Cost based approach to sharing
• MRShare Implementation and Evaluation
• Summary
Implementing MRShare
• MRShare is implemented on Hadoop
• First, a batch of jobs is acquired from the queries that arrive within a short time interval T
• Second, MultiSplitJobs is called to compute the optimal grouping of the jobs
• Third, the groups are rewritten using a meta-map and a meta-reduce function. These are MRShare-specific containers, and their functionality relies on tagging.
• Finally, the new jobs are submitted for execution
Tagging for Sharing Only Scans
Tagging for Sharing Map Output
Evaluation setup
• 40 EC2 small instance virtual machines
• Modified Hadoop engine
• 30 GB text dataset consisting of blogs
• Multiple grep-wordcount queries
– Counts words matching a regular expression
– Allows for variable intermediate data sizes
– Generic aggregation Map Reduce job
Validation of the Cost Model
Evaluation goals
• Sharing is not always beneficial.
– ‘GreedyShare’ policy
• How much can we save on sharing scans?
– MRShare - MultiSplitJobs evaluation
• How much can we save on sharing intermediate data?
– MRShare - γ-MultiSplitJobs evaluation
Is sharing always beneficial? – ‘GreedyShare’ policy
Group of jobs    Group size    d = |intermediate data| / |input data|
H1               16            0.3 < d < 0.7
H2               16            0.7 < d
H3               16            0.9 < d
How much do we save by sharing scans? – MRShare MultiSplitJobs
Group of jobs    Group size    d = |intermediate data| / |input data|
G1               16            0.7 < d
G2               16            0.2 < d < 0.7
G3               16            0.0 < d < 0.2
G4               16            0.0 < d < max
G5               64            0.0 < d < max
How much do we save by sharing map output? – MRShare MultiSplitJobs
How much do we save by sharing intermediate data? – MRShare γ-MultiSplitJobs
Group of jobs    Group size    d = |intermediate data| / |input data|
G1               16            0.7 < d
G2               16            0.2 < d < 0.7
G3               16            0.0 < d < 0.2
Summary
• We introduced MRShare – a framework for automatic work sharing in Map Reduce.
• We identified sharing primitives and demonstrated the implementation thereof in a Map-Reduce engine.
• We established a cost model and solved several work sharing optimization problems.
• We demonstrated vast savings when using MRShare.
Thank you!!!
Questions?