Cloud Computing Workshop 2013, ITU
Outline
1 Introduction
2 MR+
3 Implementation
4 Code-base
Zubair Nabi 9: MR+ April 19, 2013 2 / 26
Implicit MapReduce Assumptions
The input data has no structure
The distribution of intermediate data is balanced
Results materialize when all the map and reduce tasks complete
The number of values of each key is small enough to be processed by a single reduce task
Processing the data at the reduce stage is usually a simple aggregation function
Zipf distributions are everywhere
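The skew this slide alludes to can be made concrete with a quick sketch (illustrative numbers, not from the talk): under a Zipf law the most frequent key dominates, so in plain MapReduce the reduce task that receives it becomes the straggler.

```python
# Illustrative only: generate Zipf-distributed key counts and observe
# that the top-ranked key alone carries a large share of all values.
from collections import Counter  # not needed below, shown for context


def zipf_counts(n_keys, total, s=1.0):
    # frequency of the rank-k key is proportional to 1/k^s
    weights = [1.0 / (k ** s) for k in range(1, n_keys + 1)]
    norm = sum(weights)
    return {f"k{k}": round(total * w / norm)
            for k, w in zip(range(1, n_keys + 1), weights)}


counts = zipf_counts(10, 100_000)
# with s = 1 and 10 keys, the top key carries roughly a third of all values,
# so the single reducer assigned to it does a third of the reduce work
```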
Reduce-intensive applications
Image and speech correlation
Backpropagation in neural networks
Co-clustering
Tree learning
Computation of node diameter and radii in Tera-scale graphs
. . .
Design Goals
Negate skew in intermediate data
Exploit structure in input data
Estimate results
Favour commodity clusters
Maintain original functional model of MapReduce
Design
Maintains the simple MapReduce programming model
Instead of implementing MapReduce as a sequential two-stage architecture, MR+ allows map and reduce stages to interleave and iterate over intermediate results
Leading to a multi-level inverted tree of reduce workers
Architecture
Figure: Architectural comparison of MapReduce and MR+. (a) MapReduce: a map phase followed by a reduce phase, with results materializing only at the brick-wall at the end of the job. (b) MR+: a 5%-10% estimation cycle prioritizes data between the start and end of the job.
Architectural Flexibility
1 Instead of waiting for all maps to finish before scheduling a reduce task, MR+ permits a model where a reduce task can be scheduled for every n invocations of the map function
2 A densely populated key can be recursively reduced by repeated invocation of the reduce function at multiple reduce workers
Advantages
Resilient to TCP Incast, by amortizing data copying over the course of the job
Early materialization of partial results for queries with thresholds or confidence intervals
Finds structure in the data by running a sample cycle to learn the distribution of information, and prioritizes input data with respect to the user query
Programming Model
Retains the 2-stage MapReduce API
MR+ reducers can be likened to distributed combiners
Repeated invocation of the reducer by default rules out non-associative functions
But reducers can be designed in such a way that the non-associative operation is applied only at the very last reduce
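The deferred non-associative step can be sketched as follows; the (sum, count) encoding and the function names are illustrative, not taken from the MR+ code-base. The partial reducer is associative, so it may run at any tree level any number of times; only the final reduce applies the non-associative operation (here, averaging).

```python
# Sketch: an associative partial reduce usable at every tree level,
# with the non-associative step deferred to the final reduce.

def partial_reduce(key, values):
    # values are (sum, count) pairs; addition is associative, so this
    # reducer behaves like a distributed combiner
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return key, (total, count)


def final_reduce(key, values):
    # only here do we apply the non-associative operation (the mean)
    _, (total, count) = partial_reduce(key, values)
    return key, total / count


# two interleaved level-1 partial reductions, then the final reduce
level1_a = partial_reduce("temp", [(10, 1), (20, 1)])  # ("temp", (30, 2))
level1_b = partial_reduce("temp", [(30, 1), (40, 1)])  # ("temp", (70, 2))
key, avg = final_reduce("temp", [level1_a[1], level1_b[1]])
# avg == 25.0
```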
Outline
1 Introduction
2 MR+
3 Implementation
4 Code-base
Zubair Nabi 9: MR+ April 19, 2013 14 / 26
Scheduling
Tasks are scheduled according to a configurable map_to_reduce_schedule_ratio parameter
For every map_to_reduce_schedule_ratio completed map tasks, 1 reduce task is scheduled
For instance, if map_to_reduce_schedule_ratio is 4, then the first reduce task is scheduled when 4 map tasks complete
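A minimal sketch of this policy, assuming a simplified scheduler loop: the parameter name map_to_reduce_schedule_ratio comes from the slides, the rest is illustrative.

```python
# Sketch: launch one level-1 reduce task for every
# map_to_reduce_schedule_ratio completed map tasks.

def schedule(completed_map_tasks, map_to_reduce_schedule_ratio=4):
    reduce_tasks = []
    batch = []
    for m in completed_map_tasks:
        batch.append(m)
        if len(batch) == map_to_reduce_schedule_ratio:
            # a reduce task over the outputs of these map tasks
            reduce_tasks.append(tuple(batch))
            batch = []
    return reduce_tasks


# with the ratio at 4, the first reduce is scheduled once maps 0-3 complete
print(schedule(range(8)))  # [(0, 1, 2, 3), (4, 5, 6, 7)]
```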
Level-1 reducers
Each reduce is assigned the output of map_to_reduce_ratio number of maps
The location of their inputs is communicated by the JobTracker
Each reduce task pulls its input via HTTP
After the reduce logic has been applied to all keys, the output is earmarked for level > 1 reducers
Level > 1 reducers
Assigned the input of reduce_input_ratio number of reduce tasks
Eventually all key/value pairs make their way to the final level, which has a single worker
This final reduce can also be used to apply any non-associative operation
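The multi-level inverted tree can be sketched as below: each level combines the outputs of reduce_input_ratio tasks from the level beneath it until a single worker remains. The fan-in parameter name is from the slides; the helper functions are illustrative.

```python
# Sketch: a multi-level reduce tree with a fixed fan-in per level.

def reduce_level(partials, combine, reduce_input_ratio=2):
    # group partial results in chunks of reduce_input_ratio and
    # apply the (associative) combine to each chunk
    return [combine(partials[i:i + reduce_input_ratio])
            for i in range(0, len(partials), reduce_input_ratio)]


def tree_reduce(partials, combine, reduce_input_ratio=2):
    # keep applying reduce levels until the final level's single worker
    while len(partials) > 1:
        partials = reduce_level(partials, combine, reduce_input_ratio)
    return partials[0]


# summing five level-1 outputs through the tree
print(tree_reduce([1, 2, 3, 4, 5], sum))  # 15
```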
Structural comparison
Figure: Structural comparison of MapReduce and MR+. (a) MapReduce: ω map tasks emit key/value lists that pass through the shuffler, across the brick-wall, to θ reduce tasks. (b) MR+: ω map tasks feed α = ω/mr level-1 reducers, which feed β = α/rr level-2 reducers, and so on (ϒ = β/rr, . . . ) down to a single final reducer.
Reduce Locality
MR+ does not rely on key/values for input assignment
Reduce inputs are assigned on the basis of locality:
1 Node-local
2 Rack-local
3 Any
Fault Tolerance
Deterministic input assignment simplifies failure recovery in MapReduce
In case of MR+, if a map task or a level-1 reduce fails, it is simply re-executed
For level > 1 reduce tasks, MR+ implements three strategies, which expose the trade-off between computation and storage:
1 Chain re-execution: The entire chain is re-executed
2 Local replication: The output of each reduce is replicated on the local file system of a rack-local neighbour
3 Distributed replication: The output of each reduce is replicated on the distributed file system
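The three strategies can be contrasted in a small sketch. The strategy names come from the slides; the recovery mechanics below are purely illustrative (chain re-execution pays in CPU, the two replication strategies pay in storage and simply read a saved copy back).

```python
# Sketch: three ways to recover a failed level > 1 reduce task.

def recover(strategy, chain=(), local_replica=None, dfs_replica=None):
    if strategy == "chain":
        # chain re-execution: recompute the whole reduce chain from its
        # original input (no extra storage, extra computation)
        out = chain[0]
        for step in chain[1:]:
            out = step(out)
        return out
    if strategy == "local":
        # local replication: read the copy kept on a rack-local neighbour
        return local_replica
    if strategy == "distributed":
        # distributed replication: read the copy from the distributed FS
        return dfs_replica
    raise ValueError(f"unknown strategy: {strategy}")


# re-running a two-step chain over the original input value 2
result = recover("chain", chain=[2, lambda x: x + 1, lambda x: x * 3])
# result == 9
```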
Input Prioritization
User-defined map and reduce functions are applied to a sample_percentage amount of the input, taken at random
This sampling cycle yields a representative distribution of the data
Used to exploit structure: data with semantic grouping or clusters of relevant information
The distribution is used to generate a priority queue to assign to map tasks
A full-fledged MR+ job is then run, in which map tasks read input from the priority queue
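The sampling cycle and priority queue can be sketched as follows. The parameter name sample_percentage is from the slides; the helper names and the scoring scheme are illustrative, not from the MR+ code-base.

```python
# Sketch: learn the data distribution from a random sample, then order
# input splits so map tasks see the most relevant ones first.
import random


def sample_cycle(records, sample_percentage, learn, seed=0):
    # apply the user-defined map/reduce logic (abstracted here as `learn`)
    # to a random sample_percentage of the input
    rng = random.Random(seed)
    k = max(1, len(records) * sample_percentage // 100)
    return learn(rng.sample(records, k))


def prioritize(splits, split_score):
    # the priority queue: splits holding the most relevant clusters of
    # information are handed to map tasks first, so a thresholded query
    # can stop mid-way once enough results have materialized
    return sorted(splits, key=split_score, reverse=True)


scores = {"split-a": 0.9, "split-b": 0.1, "split-c": 0.5}
queue = prioritize(scores, scores.get)
# queue == ["split-a", "split-c", "split-b"]
```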
Input Prioritization (2)
Due to this prioritization, relevant clusters of information are processed first
As a result, the computation can be stopped mid-way if a threshold condition is satisfied
Code-base
Around 15,000 lines of Python code
Code implements both vanilla MapReduce and MR+
Written over the course of roughly 5 years at LUMS
Publicly available at: https://code.google.com/p/mrplus/source/browse/?name=BRANCH_VER_0_0_0_4_PY2x
Storage
Abstracts away the underlying storage system
Currently supports HDFS and Amazon’s S3
Also supports the local OS file system (for unit testing)
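Such an abstraction typically boils down to a small common interface that each backend implements; the class and method names below are hypothetical, not taken from the MR+ code-base.

```python
# Sketch: a storage abstraction with a local-FS backend for unit testing.
import os
import tempfile


class Storage:
    # common interface so job code never touches the storage system directly
    def read(self, path):
        raise NotImplementedError

    def write(self, path, data):
        raise NotImplementedError


class LocalStorage(Storage):
    # local OS file-system backend; HDFS and S3 backends would
    # implement the same two methods
    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)


store = LocalStorage()
path = os.path.join(tempfile.mkdtemp(), "part-00000")
store.write(path, b"k1\t42\n")
```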
Structure
Modular structure, so most of the code is re-used across MapReduce and MR+
Google Protobufs and JSON used for serialization
All configuration options live in two files: siteconf.xml (site-wide) and jobconf.xml (job-specific)