MapReduce
Based on: "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat; "Mining of Massive Datasets" by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman; "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer.


Slide 1
MapReduce
Based on: "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat; "Mining of Massive Datasets" by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman; "Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer.

Slide 2: Contents
Why do we need distributed computing for big data? What is MapReduce? Functional programming review. The MapReduce concept. First example: word counting. Fault tolerance. Optimizations. More examples. Complexity. A real-world example.

Slide 3: Why do we need distributed computing for big data?
A single computer does not have enough:
- RAM
- disk capacity and IOPS
- network bandwidth
- CPU

Slide 4: What is MapReduce?
MapReduce is a software framework introduced by Google to support distributed computing on large data sets across clusters of computers. Many other MapReduce frameworks have been built for different environments (Hadoop is the leading open-source implementation). Why not another framework (like MPI)?

Slide 5: Functional programming review
Functional operations do not modify data structures: they always create new ones, and the original data still exists in unmodified form. There are no side effects (reading input from the user, networking, etc.). Data flows are explicit in the program design, so the order of operations does not matter:
fun foo (l : int list) = sum(l) + mul(l) + length(l)
Functions can be passed as arguments.

Slide 6: Map
Map creates a new list by applying f to each element of the input list, and returns the output in order:
map f [] = []
map f (a::as) = f(a) :: map f as
Example: upper(x) : char -> char
Input: lst = [a, b, c]
Operation: map upper lst
Output: [A, B, C]
(Google's video slides: Cluster Computing and MapReduce)

Slide 7: Fold
Fold moves across a list, applying f to each element plus an accumulator; f returns the next accumulator value, which is combined with the next element of the list.
fun foldl f z [] = z
  | foldl f z (x::xs) = foldl f (f(z,x)) xs;
Example: we wish to write a sum function that receives an int list and returns its sum:
fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst
(Google's video slides: Cluster Computing and MapReduce)

Slide 8
(figure; "Data-Intensive Text Processing with MapReduce", Jimmy Lin and Chris Dyer)

Slide 9: Google's map
Map(in_key, in_value) -> (out_key, intermediate_value) list
Example: Map(play.txt, "to be or not to be") will emit: (to,1), (be,1), (or,1), (not,1), (to,1), (be,1)
(Lin and Dyer)

Slide 10: Google's reduce
Reduce(out_key, intermediate_value list) -> (key, out_value) list
Example: reduce(to, [1,1,1]) will emit: [(to,3)]
(Lin and Dyer)

Slide 11: Partition and combine functions
Partition: a simple hash function, hash(key) mod R. The key used may differ, e.g. hash(hostname(url)) mod R. With hash(key) mod 2:
(to,1), (be,1), (or,1), (not,1), (to,1), (be,1) ->
partition 0: (to,1), (be,1), (to,1), (be,1)
partition 1: (or,1), (not,1)
Combine: similar to the reduce function, applied locally on each worker (more details will follow):
(to,1), (be,1), (or,1), (not,1), (to,1), (be,1) -> (to,2), (be,2), (or,1), (not,1)

Slide 12: The MapReduce concept
(figure; "MapReduce: Simplified Data Processing on Large Clusters", Dean and Ghemawat)

Slide 13: The MapReduce concept
Split the work into pieces. Start running code on the workers. (Dean and Ghemawat)

Slide 14: The MapReduce concept
(figure; Lin and Dyer)

Slide 15: The MapReduce concept
Assign mappers. Assign reducers. (Dean and Ghemawat)

Slide 16: The MapReduce concept
Mappers read the input. (Dean and Ghemawat)
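The word-count map, reduce, partition, and combine functions sketched on the slides above can be written out in plain Python. This is an illustrative sketch, not the real framework API; the function names and the small driver at the bottom are assumptions of this sketch:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Map(in_key, in_value) -> list of (out_key, intermediate_value)."""
    return [(word, 1) for word in text.split()]

def reduce_fn(key, values):
    """Reduce(out_key, intermediate_value list) -> list of (key, out_value)."""
    return [(key, sum(values))]

def partition(key, R):
    """hash(key) mod R, as on the partitioning slide."""
    return hash(key) % R

def combine(pairs):
    """Local combiner: like reduce, applied over one mapper's own output."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

pairs = map_fn("play.txt", "to be or not to be")
# pairs == [('to',1), ('be',1), ('or',1), ('not',1), ('to',1), ('be',1)]
print(combine(pairs))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In a real run it is the framework, not driver code like this, that applies the partition function to route each combined pair to one of the R reducers.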
Slide 17: The MapReduce concept
When a worker finishes, it writes its map output into R regions according to the partitioning function and registers the results with the master. (Dean and Ghemawat)

Slide 18: The MapReduce concept
(figure; Lin and Dyer)

Slide 19: The MapReduce concept
Reducers read their input, sort it, and start reducing. (Dean and Ghemawat)

Slide 20: The MapReduce concept
(figure; Lin and Dyer)

Slide 21: The MapReduce concept
Each reducer stores its output on GFS and informs the master. (Dean and Ghemawat)

Slide 22: The MapReduce concept
(figure; Lin and Dyer)

Slides 23-29: First example, word counting
(figure walkthrough of the word-count job; Lin and Dyer)

Slide 30: Example 2, reverse a list of links
Mapper input: (url, web page content)
Mapper function: ?
Reduce function: ?

Slide 31: Example 2, reverse a list of links
Mapper input: (url, web page content), e.g.:
(themarker.com, href="ynet.com"...)
(calcalist.com, href="ynet.com"...)
Mapper function: (url, web page content) -> (target, source) list
[(ynet.com, themarker.com)]
[(ynet.com, calcalist.com)]
Reduce function: (target, source) list -> (target, source list)
(ynet.com, [themarker.com, calcalist.com])

Slide 32: Example 3, distributed grep
Given a pattern and a list of text files, return the files and lines in which the pattern appears.
Mapper input: (docId, docContent)
Mapper function: ?
Reduce function: ?

Slide 33: Example 3, distributed grep
Mapper input: (docId, docContent)
Mapper function: (docId, docContent) -> (docId, line that matches the pattern)
Reduce function: the identity function

Slide 34: Example 4, BFS
Given a source node, return the nodes in the graph, each annotated with its distance from the source.
Mapper input: (nodeId, N), where N.distance is the distance from the source node and N.AdjacencyList is the node's adjacency list.
Mapper function: ?
Reduce function: ?

Slides 35-36: Example 4, BFS
(figure walkthrough; Lin and Dyer)

Slide 37: Example 5, matrix multiplication

Slide 38: Example 5, matrix-vector multiplication

Slide 39: Fault tolerance
During a MapReduce job, the master pings all workers. (Dean and Ghemawat)

Slide 40: Fault tolerance
1) An in-progress map or reduce task is restarted on another machine.
2) A completed map task is restarted on another machine.
3) A completed reduce task is not restarted, since its result is stored on GFS.
4) If a few mappers fail on the same input, that input is marked as invalid.
(Dean and Ghemawat)

Slide 41: Fault tolerance
The master is a single point of failure. (Dean and Ghemawat)

Slide 42: Optimizations
The master tries to assign each map task to the worker closest to the machine that stores its input file.
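Slide 34 above leaves the BFS mapper and reducer open. Following Lin and Dyer's iterative BFS, one iteration can be sketched in Python; encoding a node as a (distance, adjacency_list) tuple, and the function names, are assumptions of this sketch:

```python
def map_bfs(node_id, node):
    """node = (distance, adjacency_list). Re-emit the node record itself,
    plus a tentative distance of distance + 1 for each neighbor."""
    distance, adjacency = node
    out = [(node_id, node)]                # pass the graph structure along
    for neighbor in adjacency:
        out.append((neighbor, (distance + 1, None)))
    return out

def reduce_bfs(node_id, values):
    """Keep the minimum distance seen for this node, and recover its
    adjacency list from the re-emitted node record."""
    best = float("inf")
    adjacency = []
    for distance, adj in values:
        if adj is not None:
            adjacency = adj                # the original node record
        best = min(best, distance)
    return (node_id, (best, adjacency))
```

One MapReduce iteration expands the search frontier by a single hop, so the job is re-run, feeding reducer output back in as mapper input, until no node's distance changes.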
Combine is used to reduce network bandwidth consumption: e.g., it is better to transmit (pig,3) than (pig,1), (pig,1), (pig,1).
Some mappers may lag behind; near the end of the map phase, the master allocates backup workers for the remaining tasks.

Slide 43: Optimizations
Some mappers may lag behind; near the end of the map phase, the master allocates backup workers. (Dean and Ghemawat)

Slide 44: Bandwidth over time
(figure; Dean and Ghemawat)

Slide 45: Google MapReduce usage
(figure; Dean and Ghemawat)

Slide 46: Complexity theory for MapReduce
We wish to shrink the wall-clock time and to execute each reducer in main memory. We will look at two parameters of an algorithm:
- Reducer size (q): an upper bound on the number of values allowed to appear in the list associated with a single key.
- Replication rate (r): the number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs.

Slide 47: Complexity, example
Similarity joins: given a large set of elements X and a similarity measure s(x, y) that tells how similar two elements x and y of X are. For instance: 1M images, 1 MB each.

Slide 48: Complexity, example

Slide 49: Complexity, example (fixed)

Slide 50: Real-world example
A graphical model is a probabilistic model in which a graph denotes the conditional dependence structure between random variables.

Slide 51: Real-world example
(Distributed Message Passing for Large Scale Graphical Models, Alexander Schwing)

Slide 52: Real-world example
Iteration 1: the input entry for a single map task will be as follows:

Slide 53: Real-world example
Iteration 1:

Slide 54: Real-world example
Iteration 1: mapper output

Slide 55: Real-world example

Slide 56: Iteration 2

Slide 57: Real-world example

Slide 58: Hands on

Slide 59: Conclusions
The good:
1) Simple.
2) Proven.
3) Many implementations exist for different platforms and languages.
The bad:
1) Performance improvements enabled by conventional databases are unavailable.
2) MapReduce algorithms are not always easy to design.
3) Not all algorithms can be converted to work efficiently on MapReduce.

Slide 60: References
- "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat.
- "Data-Intensive Text Processing with MapReduce", Jimmy Lin and Chris Dyer.
- Pro Hadoop, Jason Venner.
- Google's video series, Cluster Computing and MapReduce: http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html

Slide 61: Partial implementations list
- The Google MapReduce framework is implemented in C++, with interfaces in Python and Java.
- Hadoop is a free, open-source Java MapReduce implementation.
- Twister is an open-source Java MapReduce implementation that supports iterative MapReduce computations efficiently.
- Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL, and other languages.
- Aster Data Systems nCluster In-Database MapReduce supports Java, C, C++, Perl, and Python algorithms integrated into ANSI SQL.
- GridGain is a free, open-source Java MapReduce implementation.
- Phoenix is a shared-memory implementation of MapReduce, written in C.
- FileMap is an open version of the framework that operates on files using existing file-processing tools rather than tuples.
- MapReduce has also been implemented for the Cell Broadband Engine, also in C.
- Mars is a MapReduce implementation on NVIDIA GPUs, using CUDA.
- Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
- CouchDB uses a MapReduce framework, implemented in Erlang, for defining views over distributed documents.
- Skynet is an open-source Ruby implementation of Google's MapReduce framework.
- Disco is an open-source MapReduce implementation by Nokia; its core is written in Erlang, and jobs are normally written in Python.
- Misco is an open-source MapReduce designed for mobile devices, implemented in Python.
- Qizmt is an open-source MapReduce framework from MySpace, written in C#.
- Hive is an open-source framework from Facebook that provides an SQL-like language over files, layered on the open-source Hadoop MapReduce engine.
- The Holumbus framework (Holumbus-MapReduce): distributed computing with MapReduce in Haskell.
- BashReduce is MapReduce written as a Bash script, by Erik Frey of Last.fm.
- MapReduce for Go.
- Meguro is a JavaScript MapReduce framework.
- MongoDB is a scalable, high-performance, open-source, schema-free, document-oriented database, written in C++, that features MapReduce.
- Parallel::MapReduce is a CPAN module providing experimental MapReduce functionality for Perl.
- MapReduce on volunteer computing.
- Secure MapReduce.
- MapReduce with an MPI implementation.