MapReduce Design Patterns
with Evgeny Benediktov, EIS Architecture
MapReduce
Scalable
Flexible
No overhead
(K1, V1) -> Map -> (K2, V2)
Shuffle & Sort
(K2, List[V2]) -> Reduce -> (K3, V3)
How does MapReduce work?
Line 1: How many cookies could
Line 2: a good cook cook if a
Line 3: good cook could cook cookies?
WordCount
(tracking only "could", "cook", "if")
IN: Offset, Line1   OUT: could, 1
IN: Offset, Line2   OUT: cook, 1; cook, 1; if, 1
IN: Offset, Line3   OUT: cook, 1; cook, 1; could, 1
IN: could, <1, 1>       OUT: could, 2
IN: cook, <1, 1, 1, 1>  OUT: cook, 4
IN: if, <1>             OUT: if, 1
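The WordCount flow above can be sketched in plain Java with no Hadoop dependency: map emits (word, 1) pairs, a sorted map stands in for shuffle & sort, and reduce sums each group. A simulation only; the class and method names are mine, not the Hadoop API.

```java
import java.util.*;

// Plain-Java simulation of the WordCount MapReduce flow:
// map -> shuffle & sort (grouping) -> reduce.
public class WordCountSim {
    // Map phase: emit one (word, 1) pair per token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // Shuffle & sort groups values by key; reduce sums each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((k, vs) ->
            counts.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = run(List.of(
            "How many cookies could",
            "a good cook cook if a",
            "good cook could cook cookies?"));
        // {a=2, cook=4, cookies=2, could=2, good=2, how=1, if=1, many=1}
        System.out.println(c);
    }
}
```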
Shuffle & Sort
Buffer in RAM
Partition, Sort & Spill to disk
Pulled by Reducers
Merge
MongoDB
Spark
Hadoop
Where is MapReduce implemented?
Distributions
HDFS
MapReduce
Everything Else
What is inside?
NameNode
DataNode DataNode DataNode
Append-only; 64-256 MB blocks; replicated
HDFS
NameNode
TaskTracker + DataNode
TaskTracker + DataNode
TaskTracker + DataNode
JobTracker
HDFS + MapReduce 1
NameNode
Container + NodeManager
DataNode
Container + NodeManager
DataNode
AppMaster + NodeManager
DataNode
ResourceManager
HDFS + MapReduce 2 (YARN)
Mapper
Reducer
Partitioner
Combiner
InputFormat
OutputFormat
RecordReader
RecordWriter
Classes
(K2, V2)->(K2, List(V2))
setPartitionerClass
setGroupingComparatorClass
setSortComparatorClass
SecondarySort
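What these three settings accomplish can be simulated in plain Java: the sort comparator orders composite (natural key, secondary key) pairs, while partitioning and grouping use only the natural key, so each reduce group arrives with its values already in secondary-key order. A sketch under those assumptions; the names are mine.

```java
import java.util.*;

// Plain-Java sketch of Hadoop secondary sort: sort by the full
// composite key, but group reduce inputs by the natural key only.
public class SecondarySortSim {
    record CompositeKey(String natural, int secondary) {}

    // Sort comparator: natural key first, then secondary key.
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing(CompositeKey::natural)
                  .thenComparingInt(CompositeKey::secondary);

    // After sorting, records with the same natural key form one group,
    // and their secondary values are already in order.
    static Map<String, List<Integer>> group(List<CompositeKey> keys) {
        keys.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey k : keys)
            groups.computeIfAbsent(k.natural(), x -> new ArrayList<>())
                  .add(k.secondary());
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>(List.of(
            new CompositeKey("b", 2), new CompositeKey("a", 3),
            new CompositeKey("a", 1), new CompositeKey("b", 1)));
        System.out.println(group(keys)); // {a=[1, 3], b=[1, 2]}
    }
}
```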
MetaData
Client -> HDFS -> Local FS
DistributedCache
Summarization
Numerical Summarizations
Inverted Index Summarizations
Counting with Counter
Filtering
Filtering
Bloom Filtering
Top Ten
Distinct
Data Organization
Structured to Hierarchical
Partitioning
Binning
Total Order Sorting
Shuffling
Input and Output
Generating Data
External Source Output
External Source Input
Partition Pruning
Metapatterns
Job Chaining
Job Merging
Joins
Reduce Side Join
Replicated Join
Composite Join
Cartesian Product
Summarizations
Summarization with Counters
No Reducer
Up to 100
Named
Filtering
map(key, record):
if (keep record) emit key,value
Identity Reducer or None
Output file per mapper
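The map-only filtering pattern can be sketched in plain Java: each mapper keeps only the records that pass a predicate, and with no reducer, every mapper writes its own output file. The predicate here is a hypothetical example.

```java
import java.util.*;
import java.util.function.Predicate;

// Map-only filtering sketch: keep records that pass the predicate.
public class FilterSim {
    static List<String> mapFilter(List<String> split, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String record : split)
            if (keep.test(record)) out.add(record); // emit key, value
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical predicate: keep log lines containing "ERROR".
        System.out.println(mapFilter(
            List.of("INFO start", "ERROR disk full", "INFO done"),
            r -> r.contains("ERROR"))); // [ERROR disk full]
    }
}
```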
Bloom Filtering
Training: Records -> BloomFilter file
Mapper.setup:
DistributedCache -> BloomFilter
Mapper.map:
filter.membershipTest
Emit value, null
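A minimal Bloom filter can stand in for Hadoop's `org.apache.hadoop.util.bloom.BloomFilter`: a "training" pass sets bits for the hot set, and mappers drop any record whose key fails the membership test. False positives are possible, false negatives are not. The hash scheme below is an illustrative assumption, not Hadoop's.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash functions over an m-bit set.
public class BloomSim {
    private final BitSet bits;
    private final int m, k;

    BloomSim(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // Simple derived hash family (illustrative, not Hadoop's).
    private int hash(String s, int i) {
        return Math.floorMod(s.hashCode() * 31 + i * 0x9E3779B9, m);
    }

    void add(String s) { for (int i = 0; i < k; i++) bits.set(hash(s, i)); }

    boolean membershipTest(String s) {
        for (int i = 0; i < k; i++) if (!bits.get(hash(s, i))) return false;
        return true; // "possibly in set"; a clear bit means "definitely not"
    }

    public static void main(String[] args) {
        BloomSim filter = new BloomSim(1024, 3);
        filter.add("hot-user-1");
        filter.add("hot-user-2");
        System.out.println(filter.membershipTest("hot-user-1")); // true
        // Unseen keys are almost always rejected (rare false positives).
    }
}
```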
Filtering: Top Ten
Mapper.setup(): initialize a sorted list
Mapper.map(key, record):
insert record into list
truncate list to 10
Mapper.cleanup():
for records in the list: emit null, record
Reducer.reduce(key, records):
as in mappers
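The bounded-list trick above, in plain Java: each mapper trims its local candidates to ten, and a single reducer applies the same trim to all candidates. A sketch; for brevity it uses a TreeSet, which assumes distinct values (a real implementation would keep duplicates).

```java
import java.util.*;

// Top-N sketch: keep a bounded sorted set, dropping the smallest
// element whenever the set grows past n.
public class TopTenSim {
    static List<Integer> topN(List<Integer> records, int n) {
        TreeSet<Integer> top = new TreeSet<>();  // sorted ascending
        for (int r : records) {
            top.add(r);
            if (top.size() > n) top.pollFirst(); // truncate list to n
        }
        return new ArrayList<>(top.descendingSet());
    }

    public static void main(String[] args) {
        // Reducer phase: merge per-mapper top lists, then trim again.
        List<Integer> merged = new ArrayList<>();
        merged.addAll(topN(List.of(5, 1, 9, 3), 3)); // mapper 1
        merged.addAll(topN(List.of(7, 2, 8, 6), 3)); // mapper 2
        System.out.println(topN(merged, 3)); // [9, 8, 7]
    }
}
```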
Filtering: Distinct Values
map(key, record):
emit record,null
reduce(key, records):
emit key
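In plain Java, the distinct pattern reduces to: duplicate records become duplicate keys, the shuffle groups them onto one reduce call, and the reducer emits each key once. A sketch; the names are mine.

```java
import java.util.*;

// Distinct sketch: mappers emit (record, null); grouping by key
// collapses duplicates; the reducer emits each key once.
public class DistinctSim {
    static List<String> distinct(List<String> records) {
        TreeMap<String, Object> groups = new TreeMap<>();
        for (String r : records) groups.put(r, null); // emit record, null
        return new ArrayList<>(groups.keySet());      // reducer: emit key
    }

    public static void main(String[] args) {
        System.out.println(distinct(List.of("b", "a", "b", "c", "a"))); // [a, b, c]
    }
}
```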
Structured to Hierarchical
Mappers on dataset1 send to Reducers:
Ids, Records of Type1
Mappers on dataset2 send to Reducers:
Parent Ids, Records of Type 2
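The reducer's job in this pattern is to nest the type-2 records under their type-1 parent. A plain-Java sketch under the assumption of a posts/comments-style dataset; all names are mine.

```java
import java.util.*;

// Structured-to-hierarchical sketch: dataset 1 mappers emit (id, record),
// dataset 2 mappers emit (parentId, record); the reduce call for each id
// nests the children under the single parent record.
public class HierarchySim {
    // parents rows: {id, record}; children rows: {parentId, record}
    static Map<String, List<String>> nest(List<String[]> parents,
                                          List<String[]> children) {
        Map<String, String> byId = new HashMap<>();
        Map<String, List<String>> tree = new LinkedHashMap<>();
        for (String[] p : parents) {
            byId.put(p[0], p[1]);
            tree.put(p[1], new ArrayList<>());
        }
        for (String[] c : children) {
            String parent = byId.get(c[0]);
            if (parent != null) tree.get(parent).add(c[1]);
        }
        return tree;
    }

    public static void main(String[] args) {
        Map<String, List<String>> t = nest(
            List.<String[]>of(new String[]{"p1", "post1"}),
            List.of(new String[]{"p1", "comment1"},
                    new String[]{"p1", "comment2"}));
        System.out.println(t); // {post1=[comment1, comment2]}
    }
}
```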
Partitioning
Identity Mapper
Identity Reducer
Smart Partitioner:
public int getPartition(IntWritable key, Text value, int numPartitions) {
  return key.get() /* year */ - minLastAccessDateYear;
}
Binning
setup:
mos = new MultipleOutputs(context)
map:
if (…) {
  mos.write(key, value, BINNAME)
  // output files are named BINNAME-m-NNNNN
} else …
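The binning pattern can be simulated without MultipleOutputs: a map-only pass routes each record to a named bin, so every bin becomes its own output. The bin rule below is a hypothetical example.

```java
import java.util.*;
import java.util.function.Function;

// Binning sketch: map-only routing of records into named outputs.
public class BinningSim {
    static Map<String, List<String>> bin(List<String> records,
                                         Function<String, String> binName) {
        Map<String, List<String>> bins = new TreeMap<>();
        for (String r : records)
            bins.computeIfAbsent(binName.apply(r), b -> new ArrayList<>())
                .add(r); // stands in for mos.write(key, value, binName)
        return bins;
    }

    public static void main(String[] args) {
        // Hypothetical rule: bin log lines by their level prefix.
        Map<String, List<String>> bins = bin(
            List.of("ERROR x", "INFO y", "ERROR z"),
            r -> r.split(" ")[0]);
        System.out.println(bins.keySet()); // [ERROR, INFO]
    }
}
```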
Shuffling
Mapper.map:
Emit random, record
Reducer.reduce:
Emit record, null
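Emitting a random key makes the framework's own sort do the shuffling: records are scattered across reducers in random order, and the record itself rides along as the value. A plain-Java sketch with a seeded generator for reproducibility.

```java
import java.util.*;

// Shuffling sketch: key each record with a random number, sort by key,
// then emit the records; the order is now randomized.
public class ShuffleSim {
    static List<String> shuffle(List<String> records, long seed) {
        Random rnd = new Random(seed);
        List<Map.Entry<Double, String>> keyed = new ArrayList<>();
        for (String r : records)
            keyed.add(Map.entry(rnd.nextDouble(), r)); // emit random, record
        keyed.sort(Map.Entry.comparingByKey());        // framework sort
        List<String> out = new ArrayList<>();
        for (var kv : keyed) out.add(kv.getValue());   // emit record, null
        return out;
    }

    public static void main(String[] args) {
        List<String> shuffled = shuffle(List.of("a", "b", "c", "d"), 42L);
        System.out.println(shuffled); // same records, randomized order
    }
}
```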
Map-side Join
Mapper.setup:
DistributedCache -> Map (Right Table)
Mapper.map:
Read split of Left Table, Join
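The replicated (map-side) join in plain Java: the small right table, broadcast via DistributedCache, is loaded into a hash map in setup(), and each mapper streams its split of the large left table against it with no reduce phase at all. A sketch; the row layout is an assumption.

```java
import java.util.*;

// Replicated join sketch: hash-join a large left split against a
// broadcast right table, entirely inside the mapper.
public class ReplicatedJoinSim {
    // leftSplit rows: {key, leftValue}; rightTable: key -> rightValue
    static List<String> join(List<String[]> leftSplit,
                             Map<String, String> rightTable) {
        List<String> out = new ArrayList<>();
        for (String[] row : leftSplit) {
            String right = rightTable.get(row[0]);
            if (right != null) out.add(row[0] + "," + row[1] + "," + right);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> right = Map.of("u1", "Alice", "u2", "Bob");
        List<String> joined = join(List.of(
            new String[]{"u1", "click"}, new String[]{"u3", "view"}), right);
        System.out.println(joined); // [u1,click,Alice]
    }
}
```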
Reduce-Side Joins with Secondary Sort
TableAMapper.map:
Emit primary key + 'A', record + 'A'
TableBMapper.map:
Emit foreign key + 'B', record + 'B'
SortComparator:
'A' records before 'B' records
Reducer:
Emit A record + B record, null
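The payoff of the secondary sort is visible in a plain-Java reducer: because 'A' records sort ahead of 'B' records within each key, the reducer can cache the few A records and stream every B record against them. A sketch of one reduce group; the names are mine.

```java
import java.util.*;

// Reduce-side join sketch: within one key's group, A records arrive
// first (secondary sort), get cached, and B records join against them.
public class ReduceSideJoinSim {
    record Tagged(String key, char tag, String record) {}

    static List<String> reduce(List<Tagged> groupForOneKey) {
        // Sort comparator: 'A' records before 'B' records (stable sort).
        groupForOneKey.sort(Comparator.comparingInt(Tagged::tag));
        List<String> aRecords = new ArrayList<>(), out = new ArrayList<>();
        for (Tagged t : groupForOneKey) {
            if (t.tag() == 'A') aRecords.add(t.record());
            else for (String a : aRecords) out.add(a + "+" + t.record());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> group = new ArrayList<>(List.of(
            new Tagged("u1", 'B', "order1"),
            new Tagged("u1", 'A', "Alice"),
            new Tagged("u1", 'B', "order2")));
        System.out.println(reduce(group)); // [Alice+order1, Alice+order2]
    }
}
```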
Composite (Merge) Join
Data sets pre-sorted
Data sets partitioned on the same key
CompositeInputFormat in Mappers
Total Order Sorting
Job 1:
Data -> Mappers -> SequenceFile (key, value)
Job 2:
InputSampler
TotalOrderPartitioner(InputSampler)
Identity mapper, reducers
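The sampler/partitioner pair can be simulated in plain Java: split points are drawn from a sorted sample, each key is routed to the partition whose range contains it, and concatenating the sorted reducer outputs then yields one globally sorted file. A sketch of the idea, not Hadoop's InputSampler/TotalOrderPartitioner API.

```java
import java.util.*;

// Total order sorting sketch: sample-derived split points route keys
// so that partition boundaries respect the global order.
public class TotalOrderSim {
    // Pick numPartitions-1 evenly spaced split points from a sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] s = sample.clone();
        Arrays.sort(s);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++)
            points[i - 1] = s[i * s.length / numPartitions];
        return points;
    }

    // Route a key to the partition whose range contains it.
    static int partition(int key, int[] points) {
        int p = 0;
        while (p < points.length && key >= points[p]) p++;
        return p;
    }

    public static void main(String[] args) {
        int[] points = splitPoints(new int[]{5, 1, 9, 3, 7, 2, 8, 6}, 2);
        System.out.println(Arrays.toString(points)); // [6]
        System.out.println(partition(2, points) + " " + partition(8, points)); // 0 1
    }
}
```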
Input:
Site1 tag1
Site1 tag2
Site3 tag3
Output - top 10 similar sites per site, (secondary) sorted
Site1 Similar1 count-of-common-tags
Site1 Similar2 count-of-common-tags
Site2 Similar1 count-of-common-tags
Millions of sites
Some tags appear in thousands of sites
What is input/output of each mapper/reducer?
Hint: chain jobs