Motivation for Map-Reduce
• Distribution makes simple computations complex
  – Communication
  – Load balancing
  – Fault tolerance
  – …
• What if we could write “simple” programs that were automatically parallelized?
Motivation for Map-Reduce
Recall one of our sort strategies:
[diagram: fragments R1, R2, R3 are each locally sorted and partitioned on key values k0, k1 (“process data & partition”); the resulting runs R’1, R’2, R’3 receive additional processing and are merged into the final Result]
Another example: Asymmetric fragment + replicate join
[diagram: fragments Ra, Rb of R are repartitioned (“process data & partition”) into R1, R2, R3; the fragments Sa, Sb of S are combined (union) and S is replicated to every site (“f partition”); each site computes a local join, and additional processing combines the local results into the final Result]
From our point of view…
• What if we didn’t have to think too hard about the number of sites or fragments?
• MapReduce goal: a library that hides the distribution from the developer, who can then focus on the fundamental “nuggets” of their computation
Building Text Index - Part I
[diagram: a page stream from Disk feeds pages 1-3 (“rat dog”, “dog cat”, “rat dog”) through Loading and Tokenizing, producing (word, page) pairs (rat, 1), (dog, 1), (dog, 2), (cat, 2), (rat, 3), (dog, 3); Sorting and Flushing then write sorted intermediate runs such as (cat, 2), (dog, 1), (dog, 2), (dog, 3), (rat, 1), (rat, 3)]
(Building a text index was the original Map-Reduce application.)
Building Text Index - Part II
[diagram: sorted intermediate runs, e.g. (cat, 2), (dog, 1), (dog, 2), (dog, 3), (rat, 1), (rat, 3) and (ant, 5), (cat, 4), (dog, 4), (dog, 5), (eel, 6), are merged into a single sorted stream and collapsed into the final index: (ant: 5), (cat: 2,4), (dog: 1,2,3,4,5), (eel: 6), (rat: 1,3)]
Generalizing: Map-Reduce
[same diagram as Part I: the Loading, Tokenizing, Sorting, and Flushing steps that turn raw pages into sorted intermediate runs of (word, page) pairs are labeled “Map”]
Generalizing: Map-Reduce
[same diagram as Part II: the Merge step that turns sorted intermediate runs into the final index is labeled “Reduce”]
Map Reduce
• Input: R = {r1, r2, …, rn}; functions M, R
  – M(ri) → { [k1, v1], [k2, v2], … }
  – R(ki, valSet) → [ki, valSet′]
• Let S = { [k, v] | [k, v] ∈ M(r) for some r ∈ R }
• Let K = { k | [k, v] ∈ S, for any v }
• Let G(k) = { v | [k, v] ∈ S }
• Output = { [k, T] | k ∈ K, T = R(k, G(k)) }
(Note: S and G(k) are bags, so duplicate values are kept.)
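As a sanity check on the definition, here is a minimal single-machine sketch in Python (not from the slides; map_fn and reduce_fn stand in for M and R):

    from collections import defaultdict

    def map_reduce(records, map_fn, reduce_fn):
        # Build S, the bag of all [k, v] pairs emitted by map_fn,
        # grouped by key: groups[k] is G(k), kept as a bag (a list)
        groups = defaultdict(list)
        for r in records:
            for k, v in map_fn(r):
                groups[k].append(v)
        # Output: one [k, T] per distinct key k, where T = R(k, G(k))
        return {k: reduce_fn(k, vals) for k, vals in groups.items()}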
References
• MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. Available at http://labs.google.com/papers/mapreduce-osdi04.pdf
Example: Counting Word Occurrences
• map(String doc, String value):
    // doc is document name
    // value is document content
    for each word w in value:
      EmitIntermediate(w, “1”);
• Example:
  – map(doc, “cat dog cat bat dog”) emits
    [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]
Example: Counting Word Occurrences
• reduce(String key, Iterator values):
    // key is a word
    // values is a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
• Example:
  – reduce(“dog”, “1 1 1 1”) emits “4”, which becomes (“dog”, 4)
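For reference, the whole word count runs end to end with the map_reduce sketch above (my illustration; wc_map and wc_reduce are invented names):

    def wc_map(doc):
        # doc is a (name, contents) pair; emit (word, 1) per occurrence
        name, contents = doc
        return [(w, 1) for w in contents.split()]

    def wc_reduce(word, counts):
        return sum(counts)

    docs = [("doc1", "cat dog cat bat dog")]
    print(map_reduce(docs, wc_map, wc_reduce))
    # {'cat': 2, 'dog': 2, 'bat': 1}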
Another way to think about it:
• Mapper: (some query)
• Shuffle: GROUP BY
• Reducer: SELECT Aggregate()
Doesn’t have to be relational
• Mapper: Parse data into K,V pairs
• Shuffle: Repartition by K
• Reducer: Transform the V’s for a K into a V_final (see the sketch below)
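For instance (an invented example, not from the deck, reusing the earlier map_reduce helper):

    def log_map(line):
        # Parse a raw "user action" log line into a (user, action) pair
        user, action = line.split(" ", 1)
        return [(user, action)]

    def log_reduce(user, actions):
        # V_final here is the deduplicated, sorted list of actions
        return sorted(set(actions))

    print(map_reduce(["bob login", "bob logout", "eve login"],
                     log_map, log_reduce))
    # {'bob': ['login', 'logout'], 'eve': ['login']}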
Process model
• Can vary the number of mappers to tune performance
• Reduce tasks are bounded by the number of reduce shards
Implementation Issues
• Combine function
• File system
• Partition of input, keys
• Failures
• Backup tasks
• Ordering of results
Combine Function
Combine is like a local reduce applied before distribution:
[diagram: a map worker that would send [cat 1], [cat 1], [cat 1]… and [dog 1], [dog 1]… across the network instead sends the pre-aggregated [cat 3]… and [dog 2]… to the reduce workers]
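A combiner can be bolted onto the earlier sketch by running a local reduce over each map task's own output (illustrative only; reuses docs, wc_map, and wc_reduce from the word-count example):

    from collections import defaultdict

    def map_with_combine(records, map_fn, combine_fn):
        # Group this map task's output by key and pre-aggregate,
        # so far less data crosses the network during the shuffle
        local = defaultdict(list)
        for r in records:
            for k, v in map_fn(r):
                local[k].append(v)
        return [(k, combine_fn(k, vals)) for k, vals in local.items()]

    # For word count, the combiner is just the reducer itself:
    print(map_with_combine(docs, wc_map, wc_reduce))
    # [('cat', 2), ('dog', 2), ('bat', 1)]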
Data flow
• A map worker must be able to read any part of the input file, so the input lives on a distributed file system
• A reduce worker must be able to read the local disks of the map workers
• Any worker must be able to write its part of the answer; the answer is left as a distributed file
• A high-throughput network is essential
Partition of input, keys
• How many workers? How many splits (partitions) of the input file?
[diagram: input splits 1, 2, 3, …, 9 assigned to three workers]
• Best to have many splits per worker: this improves load balance, and if a worker fails its tasks are easier to spread over the remaining workers
• Should workers be assigned to splits “near” them?
• Similar questions arise for reduce workers
What takes time?
• Reading input data
• Mapping data
• Shuffling data
• Reducing data
• Map and reduce are separate phases
  – Latency is determined by the slowest task
  – Reduce shard skew can increase latency
• Map and shuffle can be overlapped
  – But with lots of intermediate data, the shuffle may be slow
Failures
• The distributed implementation should produce the same output as would have been produced by a non-faulty sequential execution of the program.
• General strategy: the master detects worker failures and has the work re-done by another worker.
[diagram: the master pings a worker (“ok?”); when the worker processing split j fails, the master tells another worker to redo split j]
Backup Tasks
• A straggler is a machine that takes unusually long to finish its work (e.g., because of a bad disk).
• A straggler can delay final completion.
• When the job is close to finishing, the master schedules backup executions for the remaining tasks.
  – Must be able to eliminate redundant results
Ordering of Results
• Final result (at each node) is in key order:
  [k1, T1] [k2, T2] [k3, T3] [k4, T4]
• The input pairs delivered to each reducer, e.g. [k1, v1] [k3, v3], are also in key order
Example: Sorting Records
Map: extract k, output [k, record]
Reduce: Do nothing!
[diagram: records (5 e), (3 c), (6 f) on map worker W1; (2 b), (1 a), (6 f*) on W2; (8 h), (4 d), (9 i), (7 g) on W3; after the shuffle, reduce worker W5 holds (1 a), (3 c), (5 e), (7 g), (9 i) and W6 holds (2 b), (4 d), (6 f), (6 f*), (8 h), each in sorted order]
Question: one or two records for k = 6?
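With the toy map_reduce helper the sort looks like this (my illustration), and the output suggests the answer: both records with k = 6 survive, because G(k) is a bag:

    records = [(5, "e"), (3, "c"), (6, "f"), (2, "b"), (1, "a"), (6, "f*")]
    by_key = map_reduce(records,
                        lambda rec: [(rec[0], rec)],  # Map: extract the key
                        lambda k, recs: recs)         # Reduce: do nothing
    print(sorted(by_key.items()))
    # [(1, [(1, 'a')]), (2, [(2, 'b')]), (3, [(3, 'c')]), (5, [(5, 'e')]),
    #  (6, [(6, 'f'), (6, 'f*')])]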
MR Claimed Advantages
• Model easy to use, hides details of parallelization, fault recovery
• Many problems expressible in MR framework
• Scales to thousands of machines
MR Possible Disadvantages
• The 1-input, 2-stage data flow is rigid and hard to adapt to other scenarios
• Custom code needs to be written even for the most common operations, e.g., projection and filtering
• The opaque nature of the map and reduce functions impedes optimization
Hadoop
• Open-source Map-Reduce system
• Also, a toolkit
  – HDFS – filesystem for Hadoop
  – HBase – database for Hadoop
• Also, an ecosystem
  – Tools
  – Recipes
  – Developer community
MapReduce pipelines
• Output of one MapReduce becomes input for another
• Example (see the sketch after this list):
  – Stage 1: Translate data to canonical form
  – Stage 2: Term count
  – Stage 3: Sort by frequency
  – Stage 4: Extract top-k
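A toy rendering of stages 2-4 with the single-machine map_reduce helper (the stage wiring is my assumption, not the slides'):

    # Stage 2: term count (reusing docs, wc_map, wc_reduce from earlier)
    counts = map_reduce(docs, wc_map, wc_reduce)

    # Stage 3: sort by frequency -- re-key each (word, n) pair by n
    by_freq = map_reduce(counts.items(),
                         lambda kv: [(kv[1], kv[0])],  # (word, n) -> (n, word)
                         lambda n, words: words)       # identity reduce

    # Stage 4: extract top-k by taking the largest counts
    print(sorted(by_freq.items(), reverse=True)[:2])
    # [(2, ['cat', 'dog']), (1, ['bat'])]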
Make it database-y?
• Simple idea: each operator is a MapReduce
• How to do:
  – Select
  – Project
  – Group by, aggregate
  – Join
Reduce-side join
• Shuffle puts all values for the same key at the same reducer
• Mapper
  – Input: tuples from R; tuples from S
  – Output: (join value, (R|S, tuple))
• Reducer
  – Local join of all R tuples with all S tuples (see the sketch below)
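A minimal sketch of a reduce-side join using the earlier map_reduce helper (the tagging scheme and sample relations are my assumptions; the join key is taken to be field 0):

    def join_map(tagged):
        # tagged is ("R", tuple) or ("S", tuple)
        tag, tup = tagged
        return [(tup[0], (tag, tup))]

    def join_reduce(key, tagged_tuples):
        # Local join: pair every R tuple with every S tuple for this key
        r = [t for tag, t in tagged_tuples if tag == "R"]
        s = [t for tag, t in tagged_tuples if tag == "S"]
        return [(rt, st) for rt in r for st in s]

    R = [("a", 1), ("b", 2)]
    S = [("a", "x"), ("a", "y")]
    tagged = [("R", t) for t in R] + [("S", t) for t in S]
    print(map_reduce(tagged, join_map, join_reduce))
    # {'a': [(('a', 1), ('a', 'x')), (('a', 1), ('a', 'y'))], 'b': []}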
Map-side join
• Like a hash-join, but every mapper has a copy of the hash table
• Mapper (see the sketch below):
  – Read in hash table of R
  – Input: tuples of S
  – Output: tuples of S joined with tuples of R
• Reducer
  – Pass through
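Sketched with the same toy helper and the R, S relations from the reduce-side example (again my own illustrative code): every mapper holds the hash table of R, and the reducer just passes results through.

    # Hash table of R, keyed on the join attribute (field 0); in a real
    # system each mapper would load its own copy of this table.
    r_hash = {}
    for t in R:
        r_hash.setdefault(t[0], []).append(t)

    def mapside_map(s_tuple):
        # Join each S tuple against the in-memory R table
        return [(s_tuple[0], (rt, s_tuple))
                for rt in r_hash.get(s_tuple[0], [])]

    def identity_reduce(k, vals):
        # Reducer: pass through
        return vals

    print(map_reduce(S, mapside_map, identity_reduce))
    # {'a': [(('a', 1), ('a', 'x')), (('a', 1), ('a', 'y'))]}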
Comparison
• Reduce-side join shuffles all the data
• Map-side join requires one table to be small
Semi-join?
• One idea:
  – MapReduce 1: Extract the join keys of R; reduce-side join with S
  – MapReduce 2: Map-side join the result of MapReduce 1 with R
Why not just use a DBMS?
• Many DBMSs exist and are highly optimized
• One reason not to: loading data into a DBMS is hard

A comparison of approaches to large-scale data analysis. Pavlo et al., SIGMOD 2009.
Why not just use a DBMS?
• Other possible reasons:
  – MapReduce is more scalable
  – MapReduce is more easily deployed
  – MapReduce is more easily extended
  – MapReduce is more easily optimized
  – MapReduce is free (that is, Hadoop)
  – I already know Java
  – MapReduce is exciting and new
Data store
• Instead of HDFS, data could be stored in a database
• Example: HadoopDB/Hadapt
  – Data store is PostgreSQL
  – Allows for indexing, fast local query processing

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.
Batch processing?
• MapReduce materializes intermediate results
  – Map output is written to disk before reduce starts
• What if we pipelined the data from map to reduce?
  – Reduce could quickly produce approximate answers
  – MapReduce could implement continuous queries
MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joe Hellerstein, Khaled Elmeleegy, Russell Sears. NSDI 2010