
Topic 9: MR+


Cloud Computing Workshop 2013, ITU


Zubair Nabi

[email protected]

April 19, 2013


Outline

1 Introduction

2 MR+

3 Implementation

4 Code-base

1 Introduction

Implicit MapReduce Assumptions

The input data has no structure

The distribution of intermediate data is balanced

Results materialize only when all the map and reduce tasks complete

The number of values of each key is small enough to be processed by a single reduce task

Processing the data at the reduce stage is usually a simple aggregation function


Zipf distributions are everywhere
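
As a quick illustration (mine, not from the slides; it assumes NumPy is available), the skew a Zipf law implies for intermediate keys is easy to see by sampling one:

    import numpy as np

    # Sample one million intermediate keys from a Zipf(a=2) distribution
    # and measure how much of the load the single hottest key attracts.
    rng = np.random.default_rng(0)
    keys = rng.zipf(a=2.0, size=1_000_000)

    unique, counts = np.unique(keys, return_counts=True)
    share = counts.max() / counts.sum()
    print(f"hottest key receives {share:.1%} of all values")  # roughly 60%

Under such a distribution, the reduce task that owns the hottest key does more work than most of the others combined.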


Reduce-intensive applications

Image and speech correlation

Backpropagation in neural networks

Co-clustering

Tree learning

Computation of node diameter and radii in Tera-scale graphs

. . .

2 MR+

Design Goals

Negate skew in intermediate data

Exploit structure in input data

Estimate results

Favour commodity clusters

Maintain original functional model of MapReduce


Design

Maintains the simple MapReduce programming model

Instead of implementing MapReduce as a sequential two-stage architecture, MR+ allows map and reduce stages to interleave and iterate over intermediate results

This leads to a multi-level inverted tree of reduce workers


Architecture

Figure: Architectural comparison of MapReduce and MR+. (a) MapReduce: a map phase followed by a reduce phase, with a brick-wall barrier between the two, running from MR start to MR end. (b) MR+: map and reduce interleave between MR+ start and MR+ end, with a 5%-10% estimation cycle that prioritizes data.

Architectural Flexibility

1 Instead of waiting for all maps to finish before scheduling a reduce task, MR+ permits a model where a reduce task can be scheduled for every n invocations of the map function

2 A densely populated key can be recursively reduced by repeated invocation of the reduce function at multiple reduce workers (see the sketch below)
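
A minimal sketch of the second point (mine; it assumes an associative reducer, with sum standing in for the user's reduce function): a hot key's values can be reduced in chunks at several workers and the partial results reduced again.

    from functools import reduce

    def reduce_fn(a, b):
        # Any associative operation works here; sum is the classic example.
        return a + b

    def tree_reduce(values, fanout=4):
        # Repeatedly reduce fanout-sized chunks, mimicking successive
        # levels of MR+ reduce workers, until one value remains.
        while len(values) > 1:
            values = [reduce(reduce_fn, values[i:i + fanout])
                      for i in range(0, len(values), fanout)]
        return values[0]

    print(tree_reduce(list(range(100))))  # 4950, same as sum(range(100))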


Advantages

Resilient to TCP Incast, because data copying is amortized over the course of the job

Early materialization of partial results for queries with thresholds or confidence intervals

Finds structure in the data by running a sample cycle to learn the distribution of information, and prioritizes input data with respect to the user query


Programming Model

Retains the two-stage MapReduce API

MR+ reducers can be likened to distributed combiners

Repeated invocation of the reducer by default rules out non-associative functions

But reducers can be designed so that the non-associative operation is applied only at the very last reduce
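
A sketch of that last point (mine, not from the code-base): a non-associative function such as the arithmetic mean can be carried through intermediate reduces as an associative (sum, count) pair and finalized only at the last reduce.

    def partial_reduce(pairs):
        # Intermediate reduce: combine (sum, count) pairs associatively.
        s = sum(p[0] for p in pairs)
        c = sum(p[1] for p in pairs)
        return (s, c)

    def final_reduce(pairs):
        # Final reduce: apply the non-associative step (the division) once.
        s, c = partial_reduce(pairs)
        return s / c

    # The map side emits each value as a (v, 1) pair.
    values = [(v, 1) for v in [2, 4, 6, 8]]
    left, right = partial_reduce(values[:2]), partial_reduce(values[2:])
    print(final_reduce([left, right]))  # 5.0, the true mean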


3 Implementation

Scheduling

Tasks are scheduled according to a configurable map_to_reduce_schedule_ratio parameter

For every map_to_reduce_schedule_ratio map tasks, 1 reduce task is scheduled

For instance, if map_to_reduce_schedule_ratio is 4, then the first reduce task is scheduled when 4 map tasks complete
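
A toy version of this policy (a sketch; map_to_reduce_schedule_ratio is the only name taken from the slides):

    map_to_reduce_schedule_ratio = 4

    def on_map_completed(state):
        # Called each time a map task finishes; schedules one reduce
        # task per full ratio's worth of completed maps.
        state["completed_maps"] += 1
        if state["completed_maps"] % map_to_reduce_schedule_ratio == 0:
            state["scheduled_reduces"] += 1
            print(f"scheduling reduce #{state['scheduled_reduces']} "
                  f"after {state['completed_maps']} maps")

    state = {"completed_maps": 0, "scheduled_reduces": 0}
    for _ in range(12):          # simulate 12 map completions
        on_map_completed(state)  # reduces #1, #2, #3 fire at maps 4, 8, 12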


Level-1 reducers

Each reduce is assigned the output of map_to_reduce_ratio number of maps

The location of their inputs is communicated by the JobTracker

Each reduce task pulls its input via HTTP

After the reduce logic has been applied to all keys, the output is earmarked for level > 1 reducers
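
A sketch of the pull model (the URL list is hypothetical; only the HTTP pull and the JobTracker's role come from the slides):

    import urllib.request

    def fetch_map_outputs(map_output_urls):
        # map_output_urls: locations of this reduce task's assigned map
        # outputs, as communicated by the JobTracker. Pull each over HTTP.
        partitions = []
        for url in map_output_urls:
            with urllib.request.urlopen(url) as resp:  # plain HTTP GET
                partitions.append(resp.read())
        return partitions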


Level > 1 reducers

Assigned the input of reduce_input_ratio number of reduce tasks

Eventually all key/value pairs make their way to the final level, which has a single worker

This final reduce can also be used to apply any non-associative operation


Structural comparison

Figure: Structural comparison of MapReduce and MR+. (a) MapReduce: maps Map1 to Mapω feed a shuffler, which routes each key ki and its value list v1, v2, ... to one of the reduce workers Reduce1 to Reduceθ behind the brick-wall. (b) MR+: the same maps feed a multi-level inverted tree of reduce workers, with α = ω/mr reducers at level 1, β = α/rr at level 2, and so on until a single reducer remains at the final level.
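
The tree arithmetic in the figure can be sketched directly (the ceiling division is my assumption; the slides do not spell out how remainders are assigned):

    from math import ceil

    def reducer_levels(num_maps, mr=4, rr=4):
        # Number of reduce workers at each level of the MR+ inverted
        # tree: alpha = omega/mr, then beta = alpha/rr, ... down to 1.
        levels = [ceil(num_maps / mr)]
        while levels[-1] > 1:
            levels.append(ceil(levels[-1] / rr))
        return levels

    print(reducer_levels(256))  # [64, 16, 4, 1]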


Reduce Locality

MR+ does not rely on key/values for input assignment

Reduce inputs are assigned on the basis of locality (sketched below):

1 Node-local

2 Rack-local

3 Any
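
A sketch of the three-tier preference (the helper's shape and names are invented): pick a node-local input if one exists, else a rack-local one, else any.

    def pick_input(pending, worker_node, worker_rack):
        # pending: (node, rack, input_id) tuples for unclaimed reduce inputs.
        for node, rack, input_id in pending:
            if node == worker_node:
                return input_id                    # 1 node-local
        for node, rack, input_id in pending:
            if rack == worker_rack:
                return input_id                    # 2 rack-local
        return pending[0][2] if pending else None  # 3 any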


Fault Tolerance

Deterministic input assignment simplifies failure recovery in MapReduce

In the case of MR+, if a map task or a level-1 reduce fails, it is simply re-executed

For level > 1 reduce tasks, MR+ implements three strategies, which expose the trade-off between computation and storage:

1 Chain re-execution: The entire chain is re-executed

2 Local replication: The output of each reduce is replicated on the local file system of a rack-local neighbour

3 Distributed replication: The output of each reduce is replicated on the distributed file system


Input Prioritization

User-defined map and reduce functions are applied to a sample_percentage amount of input, taken at random

This sampling cycle yields a representative distribution of the data

Used to exploit structure: data with semantic grouping or clusters of relevant information

The distribution is used to generate a priority queue to assign to map tasks

A full-fledged MR+ job is then run, in which map tasks read input from the priority queue


Input Prioritization (2)

Due to this prioritization, relevant clusters of information are processed first

As a result, the computation can be stopped mid-way if a threshold condition is satisfied
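
A compressed sketch of the whole prioritize-then-stop-early flow (names other than sample_percentage are invented): score a random sample of each input split, process splits best-first, and stop once a threshold is met.

    import heapq
    import random

    def prioritized_run(splits, score, threshold, sample_percentage=10):
        # Sampling cycle: estimate each split's relevance from a sample.
        queue = []
        for i, split in enumerate(splits):
            k = max(1, len(split) * sample_percentage // 100)
            estimate = sum(map(score, random.sample(split, k))) / k
            heapq.heappush(queue, (-estimate, i, split))  # max-heap via negation

        # Full job: consume splits in priority order, stopping mid-way.
        total = 0.0
        while queue and total < threshold:
            _, _, split = heapq.heappop(queue)
            total += sum(map(score, split))  # stand-in for map + reduce work
        return total

    splits = [[random.random() for _ in range(100)] for _ in range(20)]
    print(prioritized_run(splits, score=lambda v: v, threshold=200.0))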

4 Code-base

Around 15,000 lines of Python code

Code implements both vanilla MapReduce and MR+

Written over the course of roughly 5 years at LUMS

Publicly available at: https://code.google.com/p/mrplus/source/browse/?name=BRANCH_VER_0_0_0_4_PY2x


Storage

Abstracts away the underlying storage system

Currently supports HDFS and Amazon’s S3

Also supports the local OS file system (for unit testing)
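
A minimal storage-abstraction sketch (class and method names are hypothetical; the slides only state which backends exist): job code programs against one interface, and a backend is chosen by configuration.

    from abc import ABC, abstractmethod

    class Storage(ABC):
        # Interface the MapReduce/MR+ job code programs against.
        @abstractmethod
        def read(self, path: str) -> bytes: ...

        @abstractmethod
        def write(self, path: str, data: bytes) -> None: ...

    class LocalStorage(Storage):
        # Local OS file system backend, handy for unit testing.
        def read(self, path):
            with open(path, "rb") as f:
                return f.read()

        def write(self, path, data):
            with open(path, "wb") as f:
                f.write(data)

    def get_storage(scheme):
        # An HDFS or S3 backend would be registered here as well.
        backends = {"file": LocalStorage}
        return backends[scheme]()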


Structure

Modular structure, so most of the code is re-used across MapReduce and MR+

Google Protobufs and JSON are used for serialization

All configuration options live within two files: siteconf.xml (site-wide) and jobconf.xml (job-specific)
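
A sketch of how the two files might be resolved (the Hadoop-style property layout is my assumption; only the two file names and the option names come from the slides), with job-specific values overriding site-wide ones:

    import xml.etree.ElementTree as ET

    SITECONF = """<configuration>
      <property><name>map_to_reduce_schedule_ratio</name><value>4</value></property>
      <property><name>sample_percentage</name><value>10</value></property>
    </configuration>"""

    JOBCONF = """<configuration>
      <property><name>sample_percentage</name><value>5</value></property>
    </configuration>"""

    def parse_conf(xml_text):
        # Parse a flat <property><name/><value/> configuration document.
        root = ET.fromstring(xml_text)
        return {p.findtext("name"): p.findtext("value")
                for p in root.iter("property")}

    # Job-specific options override site-wide defaults.
    conf = {**parse_conf(SITECONF), **parse_conf(JOBCONF)}
    print(conf)  # sample_percentage resolves to "5"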
