Cloud Computing Workshop 2013, ITU
Outline
1 Introduction
2 MR+
3 Implementation
4 Code-base
Zubair Nabi 9: MR+ April 19, 2013 2 / 26
Implicit MapReduce Assumptions
The input data has no structure
The distribution of intermediate data is balanced
Results materialize when all the map and reduce tasks complete
The number of values of each key is small enough to be processed by a single reduce task
Processing the data at the reduce stage is usually a simple aggregation function
Zipf distributions are everywhere
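The skew this slide alludes to can be made concrete with a quick sketch (illustrative numbers, not from the talk): under a Zipf law the most frequent key dominates, so in plain MapReduce the reduce task that receives it becomes the straggler.

```python
# Illustrative only: generate Zipf-distributed key counts and observe
# that the top-ranked key alone carries a large share of all values.
from collections import Counter  # not needed below, shown for context


def zipf_counts(n_keys, total, s=1.0):
    # frequency of the rank-k key is proportional to 1/k^s
    weights = [1.0 / (k ** s) for k in range(1, n_keys + 1)]
    norm = sum(weights)
    return {f"k{k}": round(total * w / norm)
            for k, w in zip(range(1, n_keys + 1), weights)}


counts = zipf_counts(10, 100_000)
# with s = 1 and 10 keys, the top key carries roughly a third of all values,
# so the single reducer assigned to it does a third of the reduce work
```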
Reduce-intensive applications
Image and speech correlation
Backpropagation in neural networks
Co-clustering
Tree learning
Computation of node diameter and radii in Tera-scale graphs
. . .
Design Goals
Negate skew in intermediate data
Exploit structure in input data
Estimate results
Favour commodity clusters
Maintain original functional model of MapReduce
Design
Maintains the simple MapReduce programming model
Instead of implementing MapReduce as a sequential two-stage architecture, MR+ allows map and reduce stages to interleave and iterate over intermediate results
Leading to a multi-level inverted tree of reduce workers
Architecture
Figure: Architectural comparison of MapReduce and MR+. (a) MapReduce: a map phase followed by a reduce phase, with results materializing only at the brick-wall at the end of the job. (b) MR+: a 5%-10% estimation cycle prioritizes data between the start and end of the job.
Architectural Flexibility
1 Instead of waiting for all maps to finish before scheduling a reduce task, MR+ permits a model where a reduce task can be scheduled for every n invocations of the map function
2 A densely populated key can be recursively reduced by repeated invocation of the reduce function at multiple reduce workers
Advantages
Resilient to TCP Incast, by amortizing data copying over the course of the job
Early materialization of partial results for queries with thresholds or confidence intervals
Finds structure in the data by running a sample cycle to learn the distribution of information, and prioritizes input data with respect to the user query
Programming Model
Retains the 2-stage MapReduce API
MR+ reducers can be likened to distributed combiners
Repeated invocation of the reducer by default rules out non-associative functions
But reducers can be designed in such a way that the non-associative operation is applied only at the very last reduce
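The deferred non-associative step can be sketched as follows; the (sum, count) encoding and the function names are illustrative, not taken from the MR+ code-base. The partial reducer is associative, so it may run at any tree level any number of times; only the final reduce applies the non-associative operation (here, averaging).

```python
# Sketch: an associative partial reduce usable at every tree level,
# with the non-associative step deferred to the final reduce.

def partial_reduce(key, values):
    # values are (sum, count) pairs; addition is associative, so this
    # reducer behaves like a distributed combiner
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return key, (total, count)


def final_reduce(key, values):
    # only here do we apply the non-associative operation (the mean)
    _, (total, count) = partial_reduce(key, values)
    return key, total / count


# two interleaved level-1 partial reductions, then the final reduce
level1_a = partial_reduce("temp", [(10, 1), (20, 1)])  # ("temp", (30, 2))
level1_b = partial_reduce("temp", [(30, 1), (40, 1)])  # ("temp", (70, 2))
key, avg = final_reduce("temp", [level1_a[1], level1_b[1]])
# avg == 25.0
```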
Outline
1 Introduction
2 MR+
3 Implementation
4 Code-base
Zubair Nabi 9: MR+ April 19, 2013 14 / 26
Scheduling
Tasks are scheduled according to a configurable map_to_reduce_schedule_ratio parameter
For every map_to_reduce_schedule_ratio completed map tasks, 1 reduce task is scheduled
For instance, if map_to_reduce_schedule_ratio is 4, then the first reduce task is scheduled when 4 map tasks complete
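A minimal sketch of this policy, assuming a simplified scheduler loop: the parameter name map_to_reduce_schedule_ratio comes from the slides, the rest is illustrative.

```python
# Sketch: launch one level-1 reduce task for every
# map_to_reduce_schedule_ratio completed map tasks.

def schedule(completed_map_tasks, map_to_reduce_schedule_ratio=4):
    reduce_tasks = []
    batch = []
    for m in completed_map_tasks:
        batch.append(m)
        if len(batch) == map_to_reduce_schedule_ratio:
            # a reduce task over the outputs of these map tasks
            reduce_tasks.append(tuple(batch))
            batch = []
    return reduce_tasks


# with the ratio at 4, the first reduce is scheduled once maps 0-3 complete
print(schedule(range(8)))  # [(0, 1, 2, 3), (4, 5, 6, 7)]
```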
Level-1 reducers
Each reduce is assigned the output of map_to_reduce_ratio number of maps
The location of their inputs is communicated by the JobTracker
Each reduce task pulls its input via HTTP
After the reduce logic has been applied to all keys, the output is earmarked for level > 1 reducers
Level > 1 reducers
Assigned the input of reduce_input_ratio number of reduce tasks
Eventually all key/value pairs make their way to the final level, which has a single worker
This final reduce can also be used to apply any non-associative operation
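The multi-level inverted tree can be sketched as below: each level combines the outputs of reduce_input_ratio tasks from the level beneath it until a single worker remains. The fan-in parameter name is from the slides; the helper functions are illustrative.

```python
# Sketch: a multi-level reduce tree with a fixed fan-in per level.

def reduce_level(partials, combine, reduce_input_ratio=2):
    # group partial results in chunks of reduce_input_ratio and
    # apply the (associative) combine to each chunk
    return [combine(partials[i:i + reduce_input_ratio])
            for i in range(0, len(partials), reduce_input_ratio)]


def tree_reduce(partials, combine, reduce_input_ratio=2):
    # keep applying reduce levels until the final level's single worker
    while len(partials) > 1:
        partials = reduce_level(partials, combine, reduce_input_ratio)
    return partials[0]


# summing five level-1 outputs through the tree
print(tree_reduce([1, 2, 3, 4, 5], sum))  # 15
```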
Structural comparison
Figure: Structural comparison of MapReduce and MR+. (a) MapReduce: ω map tasks emit key/value lists that pass through the shuffler, across the brick-wall, to θ reduce tasks. (b) MR+: ω map tasks feed α = ω/mr level-1 reducers, which feed β = α/rr level-2 reducers, and so on (ϒ = β/rr, . . . ) down to a single final reducer.
Reduce Locality
MR+ does not rely on key/values for input assignment
Reduce inputs are assigned on the basis of locality:
1 Node-local
2 Rack-local
3 Any
Fault Tolerance
Deterministic input assignment simplifies failure recovery in MapReduce
In case of MR+, if a map task or a level-1 reduce fails, it is simply re-executed
For level > 1 reduce tasks, MR+ implements three strategies, which expose the trade-off between computation and storage:
1 Chain re-execution: The entire chain is re-executed
2 Local replication: The output of each reduce is replicated on the local file system of a rack-local neighbour
3 Distributed replication: The output of each reduce is replicated on the distributed file system
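The three strategies can be contrasted in a small sketch. The strategy names come from the slides; the recovery mechanics below are purely illustrative (chain re-execution pays in CPU, the two replication strategies pay in storage and simply read a saved copy back).

```python
# Sketch: three ways to recover a failed level > 1 reduce task.

def recover(strategy, chain=(), local_replica=None, dfs_replica=None):
    if strategy == "chain":
        # chain re-execution: recompute the whole reduce chain from its
        # original input (no extra storage, extra computation)
        out = chain[0]
        for step in chain[1:]:
            out = step(out)
        return out
    if strategy == "local":
        # local replication: read the copy kept on a rack-local neighbour
        return local_replica
    if strategy == "distributed":
        # distributed replication: read the copy from the distributed FS
        return dfs_replica
    raise ValueError(f"unknown strategy: {strategy}")


# re-running a two-step chain over the original input value 2
result = recover("chain", chain=[2, lambda x: x + 1, lambda x: x * 3])
# result == 9
```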
Input Prioritization
User-defined map and reduce functions are applied to a sample_percentage amount of the input, taken at random
This sampling cycle yields a representative distribution of the data
Used to exploit structure: data with semantic grouping or clusters of relevant information
The distribution is used to generate a priority queue to assign to map tasks
A full-fledged MR+ job is then run, in which map tasks read input from the priority queue
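The sampling cycle and priority queue can be sketched as follows. The parameter name sample_percentage is from the slides; the helper names and the scoring scheme are illustrative, not from the MR+ code-base.

```python
# Sketch: learn the data distribution from a random sample, then order
# input splits so map tasks see the most relevant ones first.
import random


def sample_cycle(records, sample_percentage, learn, seed=0):
    # apply the user-defined map/reduce logic (abstracted here as `learn`)
    # to a random sample_percentage of the input
    rng = random.Random(seed)
    k = max(1, len(records) * sample_percentage // 100)
    return learn(rng.sample(records, k))


def prioritize(splits, split_score):
    # the priority queue: splits holding the most relevant clusters of
    # information are handed to map tasks first, so a thresholded query
    # can stop mid-way once enough results have materialized
    return sorted(splits, key=split_score, reverse=True)


scores = {"split-a": 0.9, "split-b": 0.1, "split-c": 0.5}
queue = prioritize(scores, scores.get)
# queue == ["split-a", "split-c", "split-b"]
```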
Input Prioritization (2)
Due to this prioritization, relevant clusters of information are processed first
As a result, the computation can be stopped mid-way if a threshold condition is satisfied
Code-base
Around 15,000 lines of Python code
Code implements both vanilla MapReduce and MR+
Written over the course of roughly 5 years at LUMS
Publicly available at: https://code.google.com/p/mrplus/source/browse/?name=BRANCH_VER_0_0_0_4_PY2x
Storage
Abstracts away the underlying storage system
Currently supports HDFS and Amazon’s S3
Also supports the local OS file system (for unit testing)
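Such an abstraction typically boils down to a small common interface that each backend implements; the class and method names below are hypothetical, not taken from the MR+ code-base.

```python
# Sketch: a storage abstraction with a local-FS backend for unit testing.
import os
import tempfile


class Storage:
    # common interface so job code never touches the storage system directly
    def read(self, path):
        raise NotImplementedError

    def write(self, path, data):
        raise NotImplementedError


class LocalStorage(Storage):
    # local OS file-system backend; HDFS and S3 backends would
    # implement the same two methods
    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)


store = LocalStorage()
path = os.path.join(tempfile.mkdtemp(), "part-00000")
store.write(path, b"k1\t42\n")
```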
Structure
Modular structure, so most of the code is re-used across MapReduce and MR+
Google Protobufs and JSON used for serialization
All configuration options live in two files: siteconf.xml (site-wide) and jobconf.xml (job-specific)