Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.1
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
CS 555: DISTRIBUTED SYSTEMS
[MAPREDUCE]
Shrideep PallickaraComputer Science
Colorado State University
September 26, 2019 L10.1 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.2
Professor: SHRIDEEP PALLICKARA
Frequently asked questions from the previous class survey
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.3Professor: SHRIDEEP PALLICKARA
Topics covered in this lecture
¨ MapReduce
September 26, 2019CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
MAPREDUCE
September 26, 2019 L10.4
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.5Professor: SHRIDEEP PALLICKARA
MapReduce: Topics that we will cover
¨ Why?¨ What it is and what it is not?
¨ The core framework and original Google paper¨ Development of simple programs using Hadoop
¤ The dominant MapReduce implementation
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.6
Professor: SHRIDEEP PALLICKARA
MapReduce
¨ It’s a framework for processing data residing on a large number of computers
¨ Very powerful framework¤ Excellent for some problems
¤ Challenging or not applicable in other classes of problems
September 26, 2019
SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.2
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.7Professor: SHRIDEEP PALLICKARA
What is MapReduce?
¨ More a framework than a tool
¨ You are required to fit (some folks shoehorn it) your solution into the MapReduce framework
¨ MapReduce is not a feature, but rather a constraint
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.8
Professor: SHRIDEEP PALLICKARA
What does this constraint mean?
¨ It makes problem solving easier and harder
¨ Clear boundaries for what you can and cannot do¤ You actually need to consider fewer options than what you are used to
¨ But solving problems with constraints requires planning and a change in your thinking
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.9Professor: SHRIDEEP PALLICKARA
But what does this get us?
¨ Tradeoff of being confined to the MapReduce framework?¤ Ability to process data on a large number of computers
¤ But, more importantly, without having to worry about concurrency, scale, fault tolerance, and robustness
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.10
Professor: SHRIDEEP PALLICKARA
A challenge in writing MapReduce programs
¨ Design!¤ Good programmers can produce bad software due to poor design
¤ Good programmers can produce bad MapReduce algorithms
¨ Only in this case your mistakes will be amplified
¤ Your job may be distributed on 100s or 1000s of machines and operating on a Petabyte of data
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.11Professor: SHRIDEEP PALLICKARA
MapReduce: Origins of the design
¨ Process crawled data and logs of web requests
¨ Several computations work on this raw data to compute derived data¤ Inverted indices¤ Representation of graph structure of web documents
¤ Pages crawled per host
¤ Most frequent queries in a day …
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.12
Professor: SHRIDEEP PALLICKARA
Most computations are conceptually straightforward
¨ But data is large
¨ Computations must be scalable¤ Distributed across thousands of machines¤ To complete in a reasonable amount of time
September 26, 2019
SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.3
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.13Professor: SHRIDEEP PALLICKARA
Complexity of managing distributed computations can …
¨ Obscure simplicity of original computation
¨ Contributing factors:
① How to parallelize computation
② Distribute the data
③ Handle failures
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.14
Professor: SHRIDEEP PALLICKARA
MapReduce was developed to cope with this complexity
¨ Express simple computations
¨ Hide messy details of ¤ Parallelization
¤ Data distribution¤ Fault tolerance
¤ Load balancing
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.15Professor: SHRIDEEP PALLICKARA
MapReduce
¨ Programming model
¨ Associated implementation for ¤ Processing & Generating large data sets
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.16
Professor: SHRIDEEP PALLICKARA
Programming model
¨ Computation takes a set of input key/value pairs
¨ Produces a set of output key/value pairs
¨ Express the computation as two functions:¤ Map
¤ Reduce
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.17Professor: SHRIDEEP PALLICKARA
Map
¨ Takes an input pair¨ Produces a set of intermediate key/value pairs
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.18
Professor: SHRIDEEP PALLICKARA
MapReduce library
¨ Groups all intermediate values with the same intermediate key
¨ Passes them to the Reduce function
September 26, 2019
SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.4
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.19Professor: SHRIDEEP PALLICKARA
Reduce function
¨ Accepts intermediate key I and ¤ Set of values for that key
¨ Merge these values together to get¤ Smaller set of values
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.20
Professor: SHRIDEEP PALLICKARA
Counting number occurrences of each word in a large collection of documentsmap (String key, String value)
//key: document name//value: document contents
for each word w in valueEmitIntermediate(w, “1”)
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.21Professor: SHRIDEEP PALLICKARA
Counting number occurrences of each word in a large collection of documents
reduce (String key, Iterator values)//key: a word//value: a list of counts
int result = 0; for each v in values
result += ParseInt(v);Emit(AsString(result)); Sums together all counts
emitted for a particular word
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.22
Professor: SHRIDEEP PALLICKARA
MapReduce specification object contains
¨ Names of¤ Input
¤ Output
¨ Tuning parameters
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.23Professor: SHRIDEEP PALLICKARA
Map and reduce functions have associated types drawn from different domains
map(k1, v1) à list(k2, v2)
reduce(k2, list(v2)) à list(v2)
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.24
Professor: SHRIDEEP PALLICKARA
What’s passed to-and-from user-defined functions
¨ Strings
¨ User code converts between¤ String¤ Appropriate types
September 26, 2019
SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.5
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.25Professor: SHRIDEEP PALLICKARA
Programs expressed as MapReduce computations: Distributed Grep
¨ Map¤ Emit line if it matches specified pattern
¨ Reduce¤ Just copy intermediate data to the output
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.26
Professor: SHRIDEEP PALLICKARA
Term-Vector per Host
¨ Summarizes important terms that occur in a set of documents <word, frequency>
¨ Map¤ Emit <hostname, term vector>¤ For each input document
¨ Reduce function¤ Has all per-document vectors for a given host¤ Add term vectors; discard away infrequent terms
n <hostname, term vector>
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
IMPLEMENTATION OF THE RUNTIME
September 26, 2019 L10.27 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.28
Professor: SHRIDEEP PALLICKARA
Implementation
¨ Machines are commodity machines¨ GFS is used to manage the data stored on the disks
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.29Professor: SHRIDEEP PALLICKARA
Execution Overview – Part I
¨ Maps distributed across multiple machines
¨ Automatic partitioning of data into M splits
¨ Splits processed concurrently on different machines
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.30
Professor: SHRIDEEP PALLICKARA
Execution Overview – Part II
¨ Partition intermediate key space into R pieces
¨ E.g. hash(key) mod R
¨ User specified parameters¤ Partitioning function
¤ Number of partitions (R)
September 26, 2019
SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.6
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.31
Professor: SHRIDEEP PALLICKARA
Execution Overview
Split 0Split 1Split 2Split 3Split 4
User
Program
Master
Worker
Worker
Worker
Worker
Worker
Output
file 0
Output
file 1
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.32
Professor: SHRIDEEP PALLICKARA
Execution Overview: Step IThe MapReduce library
¨ Splits input files into M pieces¤ 16-64 MB per piece
¨ Starts up copies of the program on a cluster of machines
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.33Professor: SHRIDEEP PALLICKARA
Execution Overview: Step IIProgram copies
¨ One of the copies is a Master
¨ There are M map tasks and R reduce tasks to assign
¨ Master¤ Picks idle workers
¤ Assigns each worker a map or reduce task
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.34
Professor: SHRIDEEP PALLICKARA
Execution Overview: Step IIIWorkers that are assigned a map task
¨ Read contents of their input split
¨ Parses <key, value> pairs out of input data
¨ Pass each pair to user-defined Map function
¨ Intermediate <key, value> pairs from Maps¤ Buffered in Memory
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.35
Professor: SHRIDEEP PALLICKARA
Execution Overview: Step IVWriting to disk
¨ Periodically, buffered pairs are written to disk
¨ These writes are partitioned¤ By the partitioning function
¨ Locations of buffered pairs on local disk¤ Reported to back to Master¤ Master forwards these locations to reduce workers
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.36
Professor: SHRIDEEP PALLICKARA
Execution Overview: Step VReading Intermediate data
¨ Master notifies Reduce worker about locations
¨ Reduce worker reads buffered data from the local disks of Maps
¨ Read all intermediate data; sort by intermediate key¤ All occurrences of same key grouped together
¤ Many different keys map to the same Reduce task
September 26, 2019
SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.7
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.37Professor: SHRIDEEP PALLICKARA
Execution Overview: Step VIProcessing data at the Reduce worker
¨ Iterate over sorted intermediate data
¨ For each unique key pass¤ Key + set of intermediate values to Reduce function
¨ Output of Reduce function is appended¤ To output file of reduce partition
September 26, 2019 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.38
Professor: SHRIDEEP PALLICKARA
Execution Overview: Step VIIWaking up the user
¨ After all Map & Reduce tasks have been completed
¨ Control returns to the user code
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
TASK GRANULARITY
September 26, 2019 L10.39 CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.40
Professor: SHRIDEEP PALLICKARA
Task Granularity
¨ Subdivide map phase into M pieces
¨ Subdivide reduce phase into R pieces
¨ M, R >> number of worker machines
¨ Each worker performing many different tasks¤ Improves dynamic load balancing
¤ Speeds up recovery during failures
September 26, 2019
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.41Professor: SHRIDEEP PALLICKARA
Master Data Structures
September 26, 2019
¨ For each Map and Reduce task¤ State: {idle, in-progress, completed}
¤ Worker machine identity
¨ For each completed Map task store ¤ Location and sizes of R intermediate file regions
¨ Information pushed incrementally to in-progress Reduce tasks
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.42
Professor: SHRIDEEP PALLICKARA
Practical bounds on how large M and R can be
¨ Master must make O(M + R) scheduling decisions
¨ Keep O(MR) state in memory
September 26, 2019
SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.8
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019]
Dept. Of Computer Science , Colorado State University
L10.43Professor: SHRIDEEP PALLICKARA
The contents of this slide-set are based on the following references¨ JEFFREY DEAN and SANJAY GHEMAWAT: MapReduce: Simplified Data Processing on
Large Clusters. OSDI 2004: 137-150
¨ MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoopand Other Systems. 1st Edition. Donald Miner and Adam Shook. O'Reilly Media ISBN: 978-1449327170. [Chapter 1]
September 26, 2019