MapReduce Design Patterns
with Evgeny Benediktov, EIS Architecture
MapReduce
Scalable
Flexible
No overhead
(K1, V1) -> Map -> (K2, V2)
Shuffle & Sort
(K2, List[V2]) -> Reduce -> (K3, V3)
How does MapReduce work?
Line 1: How many cookies could
Line 2: a good cook cook if a
Line 3: good cook could cook cookies?
WordCount
(tracking only "could", "cook", "if")
IN: Offset, Line1   OUT: could, 1
IN: Offset, Line2   OUT: cook, 1; cook, 1; if, 1
IN: Offset, Line3   OUT: cook, 1; cook, 1; could, 1
IN: could, <1, 1>       OUT: could, 2
IN: cook, <1, 1, 1, 1>  OUT: cook, 4
IN: if, <1>             OUT: if, 1
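The WordCount flow above can be sketched in plain Java with no Hadoop dependency: map emits (word, 1) pairs, a sorted map stands in for shuffle & sort, and reduce sums each group. A simulation only; the class and method names are mine, not the Hadoop API.

```java
import java.util.*;

// Plain-Java simulation of the WordCount MapReduce flow:
// map -> shuffle & sort (grouping) -> reduce.
public class WordCountSim {
    // Map phase: emit one (word, 1) pair per token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        return out;
    }

    // Shuffle & sort groups values by key; reduce sums each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((k, vs) ->
            counts.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = run(List.of(
            "How many cookies could",
            "a good cook cook if a",
            "good cook could cook cookies?"));
        // {a=2, cook=4, cookies=2, could=2, good=2, how=1, if=1, many=1}
        System.out.println(c);
    }
}
```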
Shuffle & Sort
Buffer in RAM
Partition, Sort & Spill to disk
Pulled by Reducers
Merge
MongoDB
Spark
Hadoop
Where is MapReduce implemented?
Distributions
HDFS
MapReduce
Everything Else
What is inside?
NameNode
DataNode DataNode DataNode
Append-only; 64-256 MB blocks; replicated
HDFS
NameNode
TaskTracker + DataNode
TaskTracker + DataNode
TaskTracker + DataNode
JobTracker
HDFS + MapReduce 1
NameNode
Container + NodeManager
DataNode
Container + NodeManager
DataNode
AppMaster + NodeManager
DataNode
ResourceManager
HDFS + MapReduce 2 (YARN)
Mapper
Reducer
Partitioner
Combiner
InputFormat
OutputFormat
RecordReader
RecordWriter
Classes
(K2, V2)->(K2, List(V2))
setPartitionerClass
setGroupingComparatorClass
setSortComparatorClass
SecondarySort
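What these three settings accomplish can be simulated in plain Java: the sort comparator orders composite (natural key, secondary key) pairs, while partitioning and grouping use only the natural key, so each reduce group arrives with its values already in secondary-key order. A sketch under those assumptions; the names are mine.

```java
import java.util.*;

// Plain-Java sketch of Hadoop secondary sort: sort by the full
// composite key, but group reduce inputs by the natural key only.
public class SecondarySortSim {
    record CompositeKey(String natural, int secondary) {}

    // Sort comparator: natural key first, then secondary key.
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing(CompositeKey::natural)
                  .thenComparingInt(CompositeKey::secondary);

    // After sorting, records with the same natural key form one group,
    // and their secondary values are already in order.
    static Map<String, List<Integer>> group(List<CompositeKey> keys) {
        keys.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey k : keys)
            groups.computeIfAbsent(k.natural(), x -> new ArrayList<>())
                  .add(k.secondary());
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>(List.of(
            new CompositeKey("b", 2), new CompositeKey("a", 3),
            new CompositeKey("a", 1), new CompositeKey("b", 1)));
        System.out.println(group(keys)); // {a=[1, 3], b=[1, 2]}
    }
}
```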
MetaData
Client -> HDFS -> Local FS
DistributedCache
Summarization
Numerical Summarizations
Inverted Index Summarizations
Counting with Counter
Filtering
Filtering
Bloom Filtering
Top Ten
Distinct
Data Organization
Structured to Hierarchical
Partitioning
Binning
Total Order Sorting
Shuffling
Input and Output
Generating Data
External Source Output
External Source Input
Partition Pruning
Metapatterns
Job Chaining
Job Merging
Joins
Reduce Side Join
Replicated Join
Composite Join
Cartesian Product
Summarizations
Summarization with Counters
No Reducer
Up to 100
Named
Filtering
map(key, record):
if (keep record) emit key,value
Identity Reducer or None
Output file per mapper
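The map-only filtering pattern can be sketched in plain Java: each mapper keeps only the records that pass a predicate, and with no reducer, every mapper writes its own output file. The predicate here is a hypothetical example.

```java
import java.util.*;
import java.util.function.Predicate;

// Map-only filtering sketch: keep records that pass the predicate.
public class FilterSim {
    static List<String> mapFilter(List<String> split, Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String record : split)
            if (keep.test(record)) out.add(record); // emit key, value
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical predicate: keep log lines containing "ERROR".
        System.out.println(mapFilter(
            List.of("INFO start", "ERROR disk full", "INFO done"),
            r -> r.contains("ERROR"))); // [ERROR disk full]
    }
}
```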
Bloom Filtering
Training: Records -> BloomFilter file
Mapper.setup:
DistributedCache -> BloomFilter
Mapper.map:
filter.membershipTest
Emit value, null
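A minimal Bloom filter can stand in for Hadoop's `org.apache.hadoop.util.bloom.BloomFilter`: a "training" pass sets bits for the hot set, and mappers drop any record whose key fails the membership test. False positives are possible, false negatives are not. The hash scheme below is an illustrative assumption, not Hadoop's.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash functions over an m-bit set.
public class BloomSim {
    private final BitSet bits;
    private final int m, k;

    BloomSim(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // Simple derived hash family (illustrative, not Hadoop's).
    private int hash(String s, int i) {
        return Math.floorMod(s.hashCode() * 31 + i * 0x9E3779B9, m);
    }

    void add(String s) { for (int i = 0; i < k; i++) bits.set(hash(s, i)); }

    boolean membershipTest(String s) {
        for (int i = 0; i < k; i++) if (!bits.get(hash(s, i))) return false;
        return true; // "possibly in set"; a clear bit means "definitely not"
    }

    public static void main(String[] args) {
        BloomSim filter = new BloomSim(1024, 3);
        filter.add("hot-user-1");
        filter.add("hot-user-2");
        System.out.println(filter.membershipTest("hot-user-1")); // true
        // Unseen keys are almost always rejected (rare false positives).
    }
}
```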
Filtering: Top Ten
Mapper.setup(): initialize a sorted list
Mapper.map(key, record):
insert record into list
truncate list to 10
Mapper.cleanup():
for records in the list: emit null, record
Reducer.reduce(key, records):
as in mappers
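The bounded-list trick above, in plain Java: each mapper trims its local candidates to ten, and a single reducer applies the same trim to all candidates. A sketch; for brevity it uses a TreeSet, which assumes distinct values (a real implementation would keep duplicates).

```java
import java.util.*;

// Top-N sketch: keep a bounded sorted set, dropping the smallest
// element whenever the set grows past n.
public class TopTenSim {
    static List<Integer> topN(List<Integer> records, int n) {
        TreeSet<Integer> top = new TreeSet<>();  // sorted ascending
        for (int r : records) {
            top.add(r);
            if (top.size() > n) top.pollFirst(); // truncate list to n
        }
        return new ArrayList<>(top.descendingSet());
    }

    public static void main(String[] args) {
        // Reducer phase: merge per-mapper top lists, then trim again.
        List<Integer> merged = new ArrayList<>();
        merged.addAll(topN(List.of(5, 1, 9, 3), 3)); // mapper 1
        merged.addAll(topN(List.of(7, 2, 8, 6), 3)); // mapper 2
        System.out.println(topN(merged, 3)); // [9, 8, 7]
    }
}
```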
Filtering: Distinct Values
map(key, record):
emit record,null
reduce(key, records):
emit key
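In plain Java, the distinct pattern reduces to: duplicate records become duplicate keys, the shuffle groups them onto one reduce call, and the reducer emits each key once. A sketch; the names are mine.

```java
import java.util.*;

// Distinct sketch: mappers emit (record, null); grouping by key
// collapses duplicates; the reducer emits each key once.
public class DistinctSim {
    static List<String> distinct(List<String> records) {
        TreeMap<String, Object> groups = new TreeMap<>();
        for (String r : records) groups.put(r, null); // emit record, null
        return new ArrayList<>(groups.keySet());      // reducer: emit key
    }

    public static void main(String[] args) {
        System.out.println(distinct(List.of("b", "a", "b", "c", "a"))); // [a, b, c]
    }
}
```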
Structured to Hierarchical
Mappers on dataset1 send to Reducers:
Ids, Records of Type1
Mappers on dataset2 send to Reducers:
Parent Ids, Records of Type 2
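The reducer's job in this pattern is to nest the type-2 records under their type-1 parent. A plain-Java sketch under the assumption of a posts/comments-style dataset; all names are mine.

```java
import java.util.*;

// Structured-to-hierarchical sketch: dataset 1 mappers emit (id, record),
// dataset 2 mappers emit (parentId, record); the reduce call for each id
// nests the children under the single parent record.
public class HierarchySim {
    // parents rows: {id, record}; children rows: {parentId, record}
    static Map<String, List<String>> nest(List<String[]> parents,
                                          List<String[]> children) {
        Map<String, String> byId = new HashMap<>();
        Map<String, List<String>> tree = new LinkedHashMap<>();
        for (String[] p : parents) {
            byId.put(p[0], p[1]);
            tree.put(p[1], new ArrayList<>());
        }
        for (String[] c : children) {
            String parent = byId.get(c[0]);
            if (parent != null) tree.get(parent).add(c[1]);
        }
        return tree;
    }

    public static void main(String[] args) {
        Map<String, List<String>> t = nest(
            List.<String[]>of(new String[]{"p1", "post1"}),
            List.of(new String[]{"p1", "comment1"},
                    new String[]{"p1", "comment2"}));
        System.out.println(t); // {post1=[comment1, comment2]}
    }
}
```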
Partitioning
Identity Mapper
Identity Reducer
Smart Partitioner:
public int getPartition(IntWritable key, Text value, int numPartitions) {
  return key.get() /* year */ - minLastAccessDateYear;
}
Binning
setup:
mos = new MultipleOutputs(context)
map:
if (…) {
  mos.write(key, value, BINNAME)
  // output files are named BINNAME-m-NNNNN
} else …
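The binning pattern can be simulated without MultipleOutputs: a map-only pass routes each record to a named bin, so every bin becomes its own output. The bin rule below is a hypothetical example.

```java
import java.util.*;
import java.util.function.Function;

// Binning sketch: map-only routing of records into named outputs.
public class BinningSim {
    static Map<String, List<String>> bin(List<String> records,
                                         Function<String, String> binName) {
        Map<String, List<String>> bins = new TreeMap<>();
        for (String r : records)
            bins.computeIfAbsent(binName.apply(r), b -> new ArrayList<>())
                .add(r); // stands in for mos.write(key, value, binName)
        return bins;
    }

    public static void main(String[] args) {
        // Hypothetical rule: bin log lines by their level prefix.
        Map<String, List<String>> bins = bin(
            List.of("ERROR x", "INFO y", "ERROR z"),
            r -> r.split(" ")[0]);
        System.out.println(bins.keySet()); // [ERROR, INFO]
    }
}
```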
Shuffling
Mapper.map:
Emit random, record
Reducer.reduce:
Emit record, null
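Emitting a random key makes the framework's own sort do the shuffling: records are scattered across reducers in random order, and the record itself rides along as the value. A plain-Java sketch with a seeded generator for reproducibility.

```java
import java.util.*;

// Shuffling sketch: key each record with a random number, sort by key,
// then emit the records; the order is now randomized.
public class ShuffleSim {
    static List<String> shuffle(List<String> records, long seed) {
        Random rnd = new Random(seed);
        List<Map.Entry<Double, String>> keyed = new ArrayList<>();
        for (String r : records)
            keyed.add(Map.entry(rnd.nextDouble(), r)); // emit random, record
        keyed.sort(Map.Entry.comparingByKey());        // framework sort
        List<String> out = new ArrayList<>();
        for (var kv : keyed) out.add(kv.getValue());   // emit record, null
        return out;
    }

    public static void main(String[] args) {
        List<String> shuffled = shuffle(List.of("a", "b", "c", "d"), 42L);
        System.out.println(shuffled); // same records, randomized order
    }
}
```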
Map-side Join
Mapper.setup:
DistributedCache -> Map (Right Table)
Mapper.map:
Read split of Left Table, Join
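The replicated (map-side) join in plain Java: the small right table, broadcast via DistributedCache, is loaded into a hash map in setup(), and each mapper streams its split of the large left table against it with no reduce phase at all. A sketch; the row layout is an assumption.

```java
import java.util.*;

// Replicated join sketch: hash-join a large left split against a
// broadcast right table, entirely inside the mapper.
public class ReplicatedJoinSim {
    // leftSplit rows: {key, leftValue}; rightTable: key -> rightValue
    static List<String> join(List<String[]> leftSplit,
                             Map<String, String> rightTable) {
        List<String> out = new ArrayList<>();
        for (String[] row : leftSplit) {
            String right = rightTable.get(row[0]);
            if (right != null) out.add(row[0] + "," + row[1] + "," + right);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> right = Map.of("u1", "Alice", "u2", "Bob");
        List<String> joined = join(List.of(
            new String[]{"u1", "click"}, new String[]{"u3", "view"}), right);
        System.out.println(joined); // [u1,click,Alice]
    }
}
```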
Reduce-Side Joins with Secondary Sort
TableAMapper.map:
Emit primary key + 'A', record + 'A'
TableBMapper.map:
Emit foreign key + 'B', record + 'B'
SortComparator:
'A' records before 'B' records
Reducer:
Emit A record + B record, null
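The payoff of the secondary sort is visible in a plain-Java reducer: because 'A' records sort ahead of 'B' records within each key, the reducer can cache the few A records and stream every B record against them. A sketch of one reduce group; the names are mine.

```java
import java.util.*;

// Reduce-side join sketch: within one key's group, A records arrive
// first (secondary sort), get cached, and B records join against them.
public class ReduceSideJoinSim {
    record Tagged(String key, char tag, String record) {}

    static List<String> reduce(List<Tagged> groupForOneKey) {
        // Sort comparator: 'A' records before 'B' records (stable sort).
        groupForOneKey.sort(Comparator.comparingInt(Tagged::tag));
        List<String> aRecords = new ArrayList<>(), out = new ArrayList<>();
        for (Tagged t : groupForOneKey) {
            if (t.tag() == 'A') aRecords.add(t.record());
            else for (String a : aRecords) out.add(a + "+" + t.record());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> group = new ArrayList<>(List.of(
            new Tagged("u1", 'B', "order1"),
            new Tagged("u1", 'A', "Alice"),
            new Tagged("u1", 'B', "order2")));
        System.out.println(reduce(group)); // [Alice+order1, Alice+order2]
    }
}
```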
Composite (Merge) Join
Data sets pre-sorted
Data sets partitioned on the same key
CompositeInputFormat in Mappers
Total Order Sorting
Job 1:
Data -> Mappers -> SequenceFile (key, value)
Job 2:
InputSampler
TotalOrderPartitioner(InputSampler)
Identity mapper, reducers
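The sampler/partitioner pair can be simulated in plain Java: split points are drawn from a sorted sample, each key is routed to the partition whose range contains it, and concatenating the sorted reducer outputs then yields one globally sorted file. A sketch of the idea, not Hadoop's InputSampler/TotalOrderPartitioner API.

```java
import java.util.*;

// Total order sorting sketch: sample-derived split points route keys
// so that partition boundaries respect the global order.
public class TotalOrderSim {
    // Pick numPartitions-1 evenly spaced split points from a sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] s = sample.clone();
        Arrays.sort(s);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++)
            points[i - 1] = s[i * s.length / numPartitions];
        return points;
    }

    // Route a key to the partition whose range contains it.
    static int partition(int key, int[] points) {
        int p = 0;
        while (p < points.length && key >= points[p]) p++;
        return p;
    }

    public static void main(String[] args) {
        int[] points = splitPoints(new int[]{5, 1, 9, 3, 7, 2, 8, 6}, 2);
        System.out.println(Arrays.toString(points)); // [6]
        System.out.println(partition(2, points) + " " + partition(8, points)); // 0 1
    }
}
```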
Input:
Site1 tag1
Site1 tag2
Site3 tag3
Output - top 10 similar sites per site, (secondary) sorted
Site1 Similar1 count-of-common-tags
Site1 Similar2 count-of-common-tags
Site2 Similar1 count-of-common-tags
Millions of sites
Some tags appear in thousands of sites
What is input/output of each mapper/reducer?
Hint: chain jobs