8
SLIDES CREATED BY : SHRIDEEP P ALLICKARA L10.1 CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science, Colorado State University CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University September 26, 2019 L10.1 CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science, Colorado State University L10.2 Professor: SHRIDEEP PALLICKARA Frequently asked questions from the previous class survey September 26, 2019 CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science, Colorado State University L10.3 Professor: SHRIDEEP PALLICKARA Topics covered in this lecture ¨ MapReduce September 26, 2019 CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science, Colorado State University MAPREDUCE September 26, 2019 L10.4 CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science, Colorado State University L10.5 Professor: SHRIDEEP PALLICKARA MapReduce: Topics that we will cover ¨ Why? ¨ What it is and what it is not? ¨ The core framework and original Google paper ¨ Development of simple programs using Hadoop ¤ The dominant MapReduce implementation September 26, 2019 CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science, Colorado State University L10.6 Professor: SHRIDEEP PALLICKARA MapReduce ¨ It’s a framework for processing data residing on a large number of computers ¨ Very powerful framework ¤ Excellent for some problems ¤ Challenging or not applicable in other classes of problems September 26, 2019

CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.1

CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

CS 555: DISTRIBUTED SYSTEMS

[MAPREDUCE]

Shrideep PallickaraComputer Science

Colorado State University

September 26, 2019 L10.1 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.2

Professor: SHRIDEEP PALLICKARA

Frequently asked questions from the previous class survey

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.3Professor: SHRIDEEP PALLICKARA

Topics covered in this lecture

¨ MapReduce

September 26, 2019CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

MAPREDUCE

September 26, 2019 L10.4

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.5Professor: SHRIDEEP PALLICKARA

MapReduce: Topics that we will cover

¨ Why?¨ What it is and what it is not?

¨ The core framework and original Google paper¨ Development of simple programs using Hadoop

¤ The dominant MapReduce implementation

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.6

Professor: SHRIDEEP PALLICKARA

MapReduce

¨ It’s a framework for processing data residing on a large number of computers

¨ Very powerful framework¤ Excellent for some problems

¤ Challenging or not applicable in other classes of problems

September 26, 2019

Page 2: CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.2

CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.7Professor: SHRIDEEP PALLICKARA

What is MapReduce?

¨ More a framework than a tool

¨ You are required to fit (some folks shoehorn it) your solution into the MapReduce framework

¨ MapReduce is not a feature, but rather a constraint

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.8

Professor: SHRIDEEP PALLICKARA

What does this constraint mean?

¨ It makes problem solving easier and harder

¨ Clear boundaries for what you can and cannot do¤ You actually need to consider fewer options than what you are used to

¨ But solving problems with constraints requires planning and a change in your thinking

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.9Professor: SHRIDEEP PALLICKARA

But what does this get us?

¨ Tradeoff of being confined to the MapReduce framework?¤ Ability to process data on a large number of computers

¤ But, more importantly, without having to worry about concurrency, scale, fault tolerance, and robustness

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.10

Professor: SHRIDEEP PALLICKARA

A challenge in writing MapReduce programs

¨ Design!¤ Good programmers can produce bad software due to poor design

¤ Good programmers can produce bad MapReduce algorithms

¨ Only in this case your mistakes will be amplified

¤ Your job may be distributed on 100s or 1000s of machines and operating on a Petabyte of data

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.11Professor: SHRIDEEP PALLICKARA

MapReduce: Origins of the design

¨ Process crawled data and logs of web requests

¨ Several computations work on this raw data to compute derived data¤ Inverted indices¤ Representation of graph structure of web documents

¤ Pages crawled per host

¤ Most frequent queries in a day …

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.12

Professor: SHRIDEEP PALLICKARA

Most computations are conceptually straightforward

¨ But data is large

¨ Computations must be scalable¤ Distributed across thousands of machines¤ To complete in a reasonable amount of time

September 26, 2019

Page 3: CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.3

CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.13Professor: SHRIDEEP PALLICKARA

Complexity of managing distributed computations can …

¨ Obscure simplicity of original computation

¨ Contributing factors:

① How to parallelize computation

② Distribute the data

③ Handle failures

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.14

Professor: SHRIDEEP PALLICKARA

MapReduce was developed to cope with this complexity

¨ Express simple computations

¨ Hide messy details of ¤ Parallelization

¤ Data distribution¤ Fault tolerance

¤ Load balancing

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.15Professor: SHRIDEEP PALLICKARA

MapReduce

¨ Programming model

¨ Associated implementation for ¤ Processing & Generating large data sets

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.16

Professor: SHRIDEEP PALLICKARA

Programming model

¨ Computation takes a set of input key/value pairs

¨ Produces a set of output key/value pairs

¨ Express the computation as two functions:¤ Map

¤ Reduce

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.17Professor: SHRIDEEP PALLICKARA

Map

¨ Takes an input pair¨ Produces a set of intermediate key/value pairs

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.18

Professor: SHRIDEEP PALLICKARA

MapReduce library

¨ Groups all intermediate values with the same intermediate key

¨ Passes them to the Reduce function

September 26, 2019

Page 4: CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.4

CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.19Professor: SHRIDEEP PALLICKARA

Reduce function

¨ Accepts intermediate key I and ¤ Set of values for that key

¨ Merge these values together to get¤ Smaller set of values

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.20

Professor: SHRIDEEP PALLICKARA

Counting number occurrences of each word in a large collection of documentsmap (String key, String value)

//key: document name//value: document contents

for each word w in valueEmitIntermediate(w, “1”)

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.21Professor: SHRIDEEP PALLICKARA

Counting number occurrences of each word in a large collection of documents

reduce (String key, Iterator values)//key: a word//value: a list of counts

int result = 0; for each v in values

result += ParseInt(v);Emit(AsString(result)); Sums together all counts

emitted for a particular word

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.22

Professor: SHRIDEEP PALLICKARA

MapReduce specification object contains

¨ Names of¤ Input

¤ Output

¨ Tuning parameters

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.23Professor: SHRIDEEP PALLICKARA

Map and reduce functions have associated types drawn from different domains

map(k1, v1) à list(k2, v2)

reduce(k2, list(v2)) à list(v2)

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.24

Professor: SHRIDEEP PALLICKARA

What’s passed to-and-from user-defined functions

¨ Strings

¨ User code converts between¤ String¤ Appropriate types

September 26, 2019

Page 5: CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.5

CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.25Professor: SHRIDEEP PALLICKARA

Programs expressed as MapReduce computations: Distributed Grep

¨ Map¤ Emit line if it matches specified pattern

¨ Reduce¤ Just copy intermediate data to the output

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.26

Professor: SHRIDEEP PALLICKARA

Term-Vector per Host

¨ Summarizes important terms that occur in a set of documents <word, frequency>

¨ Map¤ Emit <hostname, term vector>¤ For each input document

¨ Reduce function¤ Has all per-document vectors for a given host¤ Add term vectors; discard away infrequent terms

n <hostname, term vector>

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

IMPLEMENTATION OF THE RUNTIME

September 26, 2019 L10.27 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.28

Professor: SHRIDEEP PALLICKARA

Implementation

¨ Machines are commodity machines¨ GFS is used to manage the data stored on the disks

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.29Professor: SHRIDEEP PALLICKARA

Execution Overview – Part I

¨ Maps distributed across multiple machines

¨ Automatic partitioning of data into M splits

¨ Splits processed concurrently on different machines

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.30

Professor: SHRIDEEP PALLICKARA

Execution Overview – Part II

¨ Partition intermediate key space into R pieces

¨ E.g. hash(key) mod R

¨ User specified parameters¤ Partitioning function

¤ Number of partitions (R)

September 26, 2019

Page 6: CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.6

CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.31

Professor: SHRIDEEP PALLICKARA

Execution Overview

Split 0Split 1Split 2Split 3Split 4

User

Program

Master

Worker

Worker

Worker

Worker

Worker

Output

file 0

Output

file 1

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.32

Professor: SHRIDEEP PALLICKARA

Execution Overview: Step IThe MapReduce library

¨ Splits input files into M pieces¤ 16-64 MB per piece

¨ Starts up copies of the program on a cluster of machines

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.33Professor: SHRIDEEP PALLICKARA

Execution Overview: Step IIProgram copies

¨ One of the copies is a Master

¨ There are M map tasks and R reduce tasks to assign

¨ Master¤ Picks idle workers

¤ Assigns each worker a map or reduce task

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.34

Professor: SHRIDEEP PALLICKARA

Execution Overview: Step IIIWorkers that are assigned a map task

¨ Read contents of their input split

¨ Parses <key, value> pairs out of input data

¨ Pass each pair to user-defined Map function

¨ Intermediate <key, value> pairs from Maps¤ Buffered in Memory

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.35

Professor: SHRIDEEP PALLICKARA

Execution Overview: Step IVWriting to disk

¨ Periodically, buffered pairs are written to disk

¨ These writes are partitioned¤ By the partitioning function

¨ Locations of buffered pairs on local disk¤ Reported to back to Master¤ Master forwards these locations to reduce workers

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.36

Professor: SHRIDEEP PALLICKARA

Execution Overview: Step VReading Intermediate data

¨ Master notifies Reduce worker about locations

¨ Reduce worker reads buffered data from the local disks of Maps

¨ Read all intermediate data; sort by intermediate key¤ All occurrences of same key grouped together

¤ Many different keys map to the same Reduce task

September 26, 2019

Page 7: CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.7

CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.37Professor: SHRIDEEP PALLICKARA

Execution Overview: Step VIProcessing data at the Reduce worker

¨ Iterate over sorted intermediate data

¨ For each unique key pass¤ Key + set of intermediate values to Reduce function

¨ Output of Reduce function is appended¤ To output file of reduce partition

September 26, 2019 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.38

Professor: SHRIDEEP PALLICKARA

Execution Overview: Step VIIWaking up the user

¨ After all Map & Reduce tasks have been completed

¨ Control returns to the user code

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

TASK GRANULARITY

September 26, 2019 L10.39 CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.40

Professor: SHRIDEEP PALLICKARA

Task Granularity

¨ Subdivide map phase into M pieces

¨ Subdivide reduce phase into R pieces

¨ M, R >> number of worker machines

¨ Each worker performing many different tasks¤ Improves dynamic load balancing

¤ Speeds up recovery during failures

September 26, 2019

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.41Professor: SHRIDEEP PALLICKARA

Master Data Structures

September 26, 2019

¨ For each Map and Reduce task¤ State: {idle, in-progress, completed}

¤ Worker machine identity

¨ For each completed Map task store ¤ Location and sizes of R intermediate file regions

¨ Information pushed incrementally to in-progress Reduce tasks

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.42

Professor: SHRIDEEP PALLICKARA

Practical bounds on how large M and R can be

¨ Master must make O(M + R) scheduling decisions

¨ Keep O(MR) state in memory

September 26, 2019

Page 8: CS555: Distributed Systems [Fall 2019] Dept. Of Computer ...cs555/lectures/CS555-L10-MapReduc… · ¤Inverted indices ¤Representation of graph structure of web documents ¤Pages

SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.8

CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2019]

Dept. Of Computer Science , Colorado State University

L10.43Professor: SHRIDEEP PALLICKARA

The contents of this slide-set are based on the following references¨ JEFFREY DEAN and SANJAY GHEMAWAT: MapReduce: Simplified Data Processing on

Large Clusters. OSDI 2004: 137-150

¨ MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoopand Other Systems. 1st Edition. Donald Miner and Adam Shook. O'Reilly Media ISBN: 978-1449327170. [Chapter 1]

September 26, 2019