MapReduce DesignPatterns with Evgeny Benediktov, EIS Architecture

Page 1: MapReduce DesignPatterns

MapReduce Design Patterns

with

Evgeny Benediktov, EIS Architecture

Page 2: MapReduce DesignPatterns

MapReduce

Scalable

Flexible

No overhead

Page 3: MapReduce DesignPatterns

How does MapReduce work?

(K1, V1) -> Map -> (K2, V2)

Shuffle & Sort

(K2, List[V2]) -> Reduce -> (K3, V3)

Page 4: MapReduce DesignPatterns

WordCount

Line 1: How many cookies could

Line 2: a good cook cook if a

Line 3: good cook could cook cookies?

Page 5: MapReduce DesignPatterns

IN: Offset, Line 1
OUT: could, 1

IN: Offset, Line 3
OUT: cook, 1
OUT: could, 1

IN: Offset, Line 2
OUT: cook, 1
OUT: cook, 1
OUT: if, 1

IN: could, <1, 1>
OUT: could, 2

IN: cook, <1, 1, 1>
OUT: cook, 3

IN: if, <1>
OUT: if, 1
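The WordCount flow can be sketched in plain Java, without any Hadoop dependency. This is only an in-memory illustration of the map, shuffle and reduce steps; the class name `WordCountSketch` and its methods are hypothetical, and the tokenization rule (lowercase, split on non-letters) is an assumption the slides do not specify. The slide lists only some of the emitted pairs, while the sketch counts every word.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Map phase: one input line in, one (word, 1) pair out per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle & sort: group values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of ones collected for a single word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Drives the three phases in sequence on an in-memory input.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
            "How many cookies could",
            "a good cook cook if a",
            "good cook could cook cookies?")));
    }
}
```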

Page 6: MapReduce DesignPatterns

Shuffle & Sort

Buffer in RAM

Partition, Sort & Spill to disk

Pulled by Reducers

Merge

Page 7: MapReduce DesignPatterns

Where is MapReduce implemented?

MongoDB

Spark

Hadoop

Page 8: MapReduce DesignPatterns

Distributions

Page 9: MapReduce DesignPatterns

What is inside

HDFS

MapReduce

Everything Else

Page 10: MapReduce DesignPatterns
Page 11: MapReduce DesignPatterns

HDFS

NameNode

DataNode, DataNode, DataNode

Append only

64-256 MB blocks

Replicated

Page 12: MapReduce DesignPatterns

HDFS + MapReduce 1

NameNode

JobTracker

TaskTracker + DataNode

TaskTracker + DataNode

TaskTracker + DataNode

Page 13: MapReduce DesignPatterns
Page 14: MapReduce DesignPatterns

HDFS + MapReduce 2

NameNode

ResourceManager

Container + NodeManager + DataNode

Container + NodeManager + DataNode

AppMaster + NodeManager + DataNode

Page 15: MapReduce DesignPatterns

Classes

Mapper

Reducer

Partitioner

Combiner

InputFormat

OutputFormat

RecordReader

RecordWriter

Page 16: MapReduce DesignPatterns

SecondarySort

(K2, V2) -> (K2, List(V2))

setPartitionerClass

setGroupingComparatorClass

setSortComparatorClass
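How the three hooks cooperate can be illustrated with a small in-memory sketch. It is not Hadoop code: the class `SecondarySortSketch` and its methods are hypothetical stand-ins for the partitioner, sort comparator and grouping comparator roles, and the composite key is modelled as a `Map.Entry` of (natural key, secondary field).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondarySortSketch {
    // Partitioner role: route by natural key only, so that all records
    // of one natural key meet in the same reducer.
    static int partition(Map.Entry<String, Integer> key, int numPartitions) {
        return Math.floorMod(key.getKey().hashCode(), numPartitions);
    }

    // Sort comparator role: natural key first, then the secondary field.
    static final Comparator<Map.Entry<String, Integer>> SORT =
        Map.Entry.<String, Integer>comparingByKey()
            .thenComparing(Map.Entry.<String, Integer>comparingByValue());

    // Grouping comparator role: composite keys that share a natural key
    // are handed to one reduce() call, values already in sorted order.
    static Map<String, List<Integer>> sortAndGroup(List<Map.Entry<String, Integer>> keys) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> k : sorted) {
            groups.computeIfAbsent(k.getKey(), n -> new ArrayList<>()).add(k.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(sortAndGroup(List.of(
            Map.entry("b", 2), Map.entry("a", 3), Map.entry("a", 1))));
    }
}
```

The point of the design is that the framework's sort does the ordering of values, so the reducer never has to buffer and sort them itself.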

Page 17: MapReduce DesignPatterns

DistributedCache

MetaData

Client -> HDFS -> Local FS

Page 18: MapReduce DesignPatterns

Summarization: Numerical Summarizations, Inverted Index Summarizations, Counting with Counters

Filtering: Filtering, Bloom Filtering, Top Ten, Distinct

Data Organization: Structured to Hierarchical, Partitioning, Binning, Total Order Sorting, Shuffling

Input and Output: Generating Data, External Source Output, External Source Input, Partition Pruning

Metapatterns: Job Chaining, Job Merging

Joins: Reduce Side Join, Replicated Join, Composite Join, Cartesian Product

Page 19: MapReduce DesignPatterns

Summarizations

Page 20: MapReduce DesignPatterns

Summarization with Counters

No Reducer

Up to 100

Named

Page 21: MapReduce DesignPatterns

Filtering

map(key, record):

    if (keep record) emit key, record

Identity Reducer or None

Output file per mapper

Page 22: MapReduce DesignPatterns

Bloom Filtering

Training: Records → BloomFilter File

Mapper.setup:

DistributedCache→BloomFilter

Mapper.map:

filter.membershipTest

Emit value, null
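A minimal sketch of the idea, assuming a hand-rolled Bloom filter over a `BitSet` rather than Hadoop's own filter class. The class `BloomFilterSketch`, its constructor parameters and the double-hashing scheme are all illustrative assumptions; only the method name `membershipTest` is taken from the slide.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    BloomFilterSketch(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd step so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    // Training step: set every probe bit for a known-good item.
    void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    // True for every trained item; may also be true for others (false positive).
    boolean membershipTest(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    // Mapper role: keep only the records that pass the filter.
    static List<String> filter(BloomFilterSketch trained, List<String> records) {
        List<String> kept = new ArrayList<>();
        for (String r : records) {
            if (trained.membershipTest(r)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch f = new BloomFilterSketch(1024, 3);
        f.add("alice");
        f.add("bob");
        System.out.println(filter(f, List.of("alice", "carol", "bob")));
    }
}
```

Because false positives are possible, the output is a superset of the true matches; a later exact check is needed if that matters.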

Page 23: MapReduce DesignPatterns

Filtering Top Ten

Mapper.setup(): initialize a sorted list

Mapper.map(key, record):

insert record into list

truncate list to 10

Mapper.cleanup():

for records in the list: emit null, record

Reducer.reduce(key, records):

same as in the mappers: keep the top 10 of the received records, then emit them
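The two-level truncation can be sketched in plain Java as follows. The class `TopTenSketch` is hypothetical, records are reduced to plain integers for brevity, and a min-heap stands in for the sorted list mentioned on the slide; `n` is a parameter so the behaviour is easy to check on small inputs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopTenSketch {
    // Shared by mapper and reducer: keep only the n largest values seen.
    static List<Integer> topN(Iterable<Integer> values, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int v : values) {
            heap.add(v);
            if (heap.size() > n) {
                heap.poll(); // drop the current smallest
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder());
        return result;
    }

    // Each split plays one mapper; all local winners go to a single reducer.
    static List<Integer> run(List<List<Integer>> splits, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(topN(split, n)); // what cleanup() would emit
        }
        return topN(candidates, n);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(List.of(5, 1, 9), List.of(7, 3, 8)), 2));
    }
}
```

Emitting under a single null key forces all candidates into one reducer, which is cheap here because each mapper sends at most `n` records.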

Page 24: MapReduce DesignPatterns

Filtering Distinct Values

map(key, record):

emit record,null

reduce(key, records):

emit key

Page 25: MapReduce DesignPatterns

Structured to Hierarchical

Mappers on dataset1 send to Reducers:

Ids, Records of Type1

Mappers on dataset2 send to Reducers:

Parent Ids, Records of Type 2

Page 26: MapReduce DesignPatterns

Partitioning

Identity Mapper

Identity Reducer

Smart Partitioner:

public int getPartition(IntWritable key, Text value, int numPartitions) {
    // The key is the year; one partition (reducer) per year, counted from minLastAccessDateYear.
    return key.get() - minLastAccessDateYear;
}
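The effect of such a partitioner can be tried out without Hadoop. In the sketch below the class `YearPartitionSketch`, the constant `MIN_YEAR` (a stand-in for `minLastAccessDateYear`, with an arbitrary value) and the range check are assumptions added for illustration; the slide's version returns the difference unchecked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class YearPartitionSketch {
    static final int MIN_YEAR = 2008; // hypothetical value for minLastAccessDateYear

    // One partition per year; years outside the configured range are rejected.
    static int getPartition(int year, int numPartitions) {
        int p = year - MIN_YEAR;
        if (p < 0 || p >= numPartitions) {
            throw new IllegalArgumentException("year out of range: " + year);
        }
        return p;
    }

    // Simulates the shuffle: each (year, record) pair lands in its year's bucket.
    static List<List<String>> partition(List<Map.Entry<Integer, String>> records, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (Map.Entry<Integer, String> r : records) {
            parts.get(getPartition(r.getKey(), numPartitions)).add(r.getValue());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(partition(List.of(
            Map.entry(2008, "u1"), Map.entry(2010, "u2"), Map.entry(2008, "u3")), 3));
    }
}
```

The number of reducers has to match the number of years covered, otherwise some keys map to a partition that does not exist.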

Page 27: MapReduce DesignPatterns

Binning

setup:

    mos = new MultipleOutputs

map:

    if (…) {
        mos.write(key, value, BINNAME)   // output file: BINNAME-mNNNNN
    } else..

Page 28: MapReduce DesignPatterns

Shuffling

Mapper.map:

Emit random, record

Reducer.reduce:

Emit record, null

Page 29: MapReduce DesignPatterns

Map-side Join

Mapper.setup:

DistributedCache → Map (Right Table)

Mapper.map:

Read split of Left Table, Join
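As a rough in-memory illustration of the two steps: the small right table is loaded into a hash map once, then every left record is joined by lookup. The class `MapSideJoinSketch`, the comma-joined output format and the choice of an inner join are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapSideJoinSketch {
    // setup() role: the right table is small enough to sit in memory,
    // keyed by the join key.
    static Map<String, String> setup(List<Map.Entry<String, String>> rightTable) {
        Map<String, String> lookup = new HashMap<>();
        for (Map.Entry<String, String> row : rightTable) {
            lookup.put(row.getKey(), row.getValue());
        }
        return lookup;
    }

    // map() role: probe the in-memory table for each left record.
    // Inner join: left records without a match are dropped.
    static List<String> map(List<Map.Entry<String, String>> leftSplit, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> row : leftSplit) {
            String match = lookup.get(row.getKey());
            if (match != null) {
                out.add(row.getKey() + "," + row.getValue() + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> right = setup(List.of(Map.entry("1", "NY")));
        System.out.println(map(List.of(Map.entry("1", "alice"), Map.entry("2", "bob")), right));
    }
}
```

No shuffle or reducer is involved, which is why this join only works when one side fits in each mapper's memory.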

Page 30: MapReduce DesignPatterns

Reduce-Side Joins with Secondary Sort

TableAMapper.map:

Emit primary key+'A', record+'A'

TableBMapper.map:

Emit foreign key+'B', record+'B'

SortComparator:

Records 'A' before Records 'B'

Reducer:

emits A's record + B's record, null
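The tagging and ordering trick can be sketched in memory like this. The class `ReduceSideJoinSketch` is hypothetical, records are plain strings, and the sketch assumes the key is unique in table A and performs an inner join; none of that is stated on the slide.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class ReduceSideJoinSketch {
    static List<String> join(List<Map.Entry<String, String>> tableA,
                             List<Map.Entry<String, String>> tableB) {
        // Mapper role: tag each record with its source table.
        // Each element is {joinKey, tag, record}.
        List<String[]> tagged = new ArrayList<>();
        for (Map.Entry<String, String> e : tableA) {
            tagged.add(new String[] {e.getKey(), "A", e.getValue()});
        }
        for (Map.Entry<String, String> e : tableB) {
            tagged.add(new String[] {e.getKey(), "B", e.getValue()});
        }

        // Sort comparator role: by join key, then by tag, so that within
        // one key the 'A' record arrives before all 'B' records.
        tagged.sort(Comparator.<String[], String>comparing(t -> t[0]).thenComparing(t -> t[1]));

        // Reducer role: remember the 'A' record, then stream the 'B' records
        // past it; nothing but the single 'A' record is buffered.
        List<String> out = new ArrayList<>();
        String currentKey = null;
        String aRecord = null;
        for (String[] t : tagged) {
            if (!t[0].equals(currentKey)) {
                currentKey = t[0];
                aRecord = null;
            }
            if (t[1].equals("A")) {
                aRecord = t[2];
            } else if (aRecord != null) {
                out.add(aRecord + "," + t[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(
            List.of(Map.entry("1", "alice"), Map.entry("2", "bob")),
            List.of(Map.entry("1", "order1"), Map.entry("1", "order2"), Map.entry("3", "order3"))));
    }
}
```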

Page 31: MapReduce DesignPatterns

Composite (Merge) Join

Data sets pre-sorted

Data sets partitioned on the same key

CompositeInputFormat in Mappers

Page 32: MapReduce DesignPatterns

Total Order Sorting

Job 1:

Data → Mappers -> SequenceFile (key, value)

Job 2:

InputSampler

TotalOrderPartitioner(InputSampler)

Identity mapper, reducers
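The mechanism behind the second job can be illustrated with a small sketch: split points are taken from a sample, keys are range-partitioned by those points, and each partition is sorted on its own, so concatenating the partitions in order yields a globally sorted result. The class `TotalOrderSketch` is hypothetical and, as a simplification, it uses the whole input as the sample, which a real sampler would not do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TotalOrderSketch {
    // Sampler role: pick numPartitions - 1 split points from a sorted sample.
    static int[] splitPoints(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return points;
    }

    // Range partitioner role: partition p holds keys below split point p
    // and at or above split point p - 1.
    static int getPartition(int key, int[] points) {
        int p = 0;
        while (p < points.length && key >= points[p]) {
            p++;
        }
        return p;
    }

    // Each partition is sorted independently (one reducer each), then the
    // partitions are read back in partition order.
    static List<Integer> sort(List<Integer> keys, int numPartitions) {
        int[] points = splitPoints(keys, numPartitions); // assumption: sample = whole input
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            parts.get(getPartition(key, points)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of(5, 3, 9, 1, 7, 2), 3));
    }
}
```

Good split points matter mainly for balance: the result is sorted for any ascending split points, but skewed ones leave some reducers with most of the data.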

Page 33: MapReduce DesignPatterns

Input:

Site1 tag1

Site1 tag2

Site3 tag3

Output - top 10 similar sites per site, (secondary) sorted

Site1 Similar1 count-of-common-tags

Site1 Similar2 count-of-common-tags

Site2 Similar1 count-of-common-tags

Millions of sites

Some tags appear on thousands of sites

What is input/output of each mapper/reducer?

Hint – chain jobs