MapReduce

Hadoop Seminar, TUT, 2014-10-22

Antti Nieminen

MapReduce

• MapReduce is a programming model for distributed processing of large data sets

• Scales ~linearly
  – Twice as many nodes -> roughly twice as fast
  – Achieved by exploiting data locality: the processing runs where the data is

• Simple programming model
  – The programmer only needs to write two functions: Map and Reduce

Map & Reduce

• The programmer writes two functions:
  – Map maps input data to key/value pairs
  – Reduce processes the list of values for a given key

• The MapReduce framework (such as Hadoop) does the rest:
  – Distributes the job among nodes
  – Moves the data to/from nodes
  – Handles node failures
  – etc.

MapReduce

[Figure: worked example. Map reads input records such as "TRE-1, 2°C" or "HKI-2, 9°C" and emits (city, temperature) pairs; the shuffle groups the pairs by city; Reduce then outputs the maximum temperature per city: TRE 9, TKU 8, HKI 9.]

MapReduce

[Figure: the same example shown on a cluster. Map tasks run on Node 1, Node 2 and Node 3, where the input data resides; after the shuffle, the reduce tasks for the keys TRE, TKU and HKI run on Node X, Node Y and Node Z, each outputting the maximum temperature for its key.]

MapReduce

MAP REDUCE

Map(k1,v1) → list(k2,v2) Reduce(k2, list(v2)) → list(v3)

Map & Reduce in Hadoop

• In Hadoop, Map and Reduce functions can be written in:
  – Java
    • org.apache.hadoop.mapreduce.lib
  – C++, using Hadoop Pipes
  – any language, using Hadoop Streaming

• There are also a number of third-party programming frameworks for Hadoop MapReduce
  – For Java, Scala, Python, Ruby, PHP, …
  – See e.g. this blog post

Mapper Java example

public class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // cityFromValue and tempFromValue are helpers that parse the input line
    String city = cityFromValue(value);
    int temp = tempFromValue(value);
    context.write(new Text(city), new IntWritable(temp));
  }
}

• In Mapper<LongWritable, Text, Text, IntWritable>, the first two type parameters are the input key and value types and the last two are the output key and value types

• The Mapper input types depend on the defined InputFormat
  – By default TextInputFormat:
    • Key (LongWritable): the byte offset of the line in the file
    • Value (Text): the line itself
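As a side note not on the original slide, the InputFormat can be changed in the driver; for example, KeyValueTextInputFormat splits each line at the first tab into a Text key and a Text value, so the Mapper input types would become Text, Text:

// import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
job.setInputFormatClass(KeyValueTextInputFormat.class);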

Reducer Java example

public class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Find the maximum of all values seen for this key
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

• As in the Mapper, the first two type parameters are the input key and value types and the last two are the output key and value types

Run MapReduce example

Job job = new Job();
job.setJarByClass(MyClass.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path("~/input"));
FileOutputFormat.setOutputPath(job, new Path("~/output"));
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.waitForCompletion(true);
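A note not on the original slide: after the job has been packaged into a jar (myjob.jar below is just an example name), it would typically be launched with the hadoop jar command:

hadoop jar myjob.jar MyClass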

Hadoop Streaming

• Map and Reduce functions can be implemented in any language with the Hadoop Streaming API

• Input is read from standard input

• Output is written to standard output

• Input/output items are lines of the form key\tvalue
  – \t is the tab character

• Reducer input lines are grouped by key – One reducer instance may receive multiple keys
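As an illustration (not on the original slides), a reducer in the temperature example would see its input already sorted by key, for instance:

HKI\t5
HKI\t7
HKI\t9
TKU\t5
TKU\t8
TRE\t2
TRE\t9

A key change from one line to the next therefore signals that all values for the previous key have been seen, which is what the reducer below relies on.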

Python example

• mapper.py

import sys

for line in sys.stdin:
    # city_temp_from_line is a helper that parses a city and a temperature from the line
    city, temp = city_temp_from_line(line)
    print('%s\t%s' % (city, temp))

• reducer.py

import sys

last_key = None
max_val = None

for line in sys.stdin:
    key, val = line.strip().split('\t')
    if key != last_key:
        # the key changed: emit the result for the previous key, if any
        if last_key is not None:
            print('%s\t%s' % (last_key, max_val))
        last_key = key
        max_val = None
    val = int(val)
    max_val = val if max_val is None else max(val, max_val)

if last_key is not None:
    print('%s\t%s' % (last_key, max_val))

Run Hadoop Streaming

• Debug using Unix pipes:

cat sample.txt | ./mapper.py | sort | ./reducer.py

• On Hadoop:

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input sample.txt \
    -output output \
    -mapper ./mapper.py \
    -reducer ./reducer.py

Combiners

[Figure: without a combiner, Map node 1 sends each of its pairs for key A – (A, 1), (A, 5), (A, 3) – individually over the network to the reduce node for key A. The node also holds a pair (B, 2) for a different key.]

Combiners

[Figure: with a combiner, Map node 1 first combines its local values for key A – (A, 1 5 3) → (A, 5) – and sends only the single combined pair to the reduce node for key A.]

Combiners

• A Combiner can "compress" the data on a mapper node before it is sent forward

• Combiner input/output types must equal the mapper output types

• In Hadoop Java, Combiners use the Reducer interface

job.setCombinerClass(MyReducer.class);

Reducer as a Combiner

• A Reducer can be used as a Combiner if it is commutative and associative
  – E.g. max is:
    • max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3)) = 5
    • true for any grouping and order of the applications
  – E.g. avg is not:
    • avg(1, 2, avg(3, 4, 5)) = 2.3333… ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3

• Note: even if the Reducer is not commutative and associative, Combiners can still be used
  – The Combiner just has to be different from the Reducer and designed for the specific case (see the sketch below)
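To make the last point concrete, here is a minimal sketch, not from the original slides, of an average-temperature job whose Combiner differs from its Reducer. It assumes the Mapper emits (city, "temp,1") pairs as Text values; the class names AvgCombiner and AvgReducer are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: pre-aggregates "sum,count" pairs on the mapper node.
// Its input and output types equal the Mapper output types, as required.
class AvgCombiner extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (Text value : values) {
      String[] parts = value.toString().split(",");
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    context.write(key, new Text(sum + "," + count));
  }
}

// Reducer: computes the final average from the partial sums and counts.
class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (Text value : values) {
      String[] parts = value.toString().split(",");
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    context.write(key, new DoubleWritable((double) sum / count));
  }
}

The driver would then use job.setCombinerClass(AvgCombiner.class) instead of reusing the Reducer.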

MapReduce example

• Find the number of unique purchasing locations for each product

• Data:
  – Users
    • (user id, name, location, …)
  – Transactions
    • (transaction id, product id, product name, user id, …)

• The example is stolen from here
  – Here are some more examples…
  – A rough code sketch of the two jobs is given after the figure below

Users:
u1, Antti, FI
u2, Bob, US
u3, Carola, SE

Transactions:
p1, Apple, u1
p1, Apple, u2
p2, Banana, u2

[Figure: MAP 1 / REDUCE 1 join Users and Transactions on the user id, producing one (product, location) pair per purchase: (p1, FI), (p1, US), (p2, US). MAP 2 / REDUCE 2 then group these pairs by product and count the distinct locations: p1 → 2, p2 → 1.]
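Here is a rough sketch of the two jobs in Hadoop Java, not from the original slides. The way the mapper tells the two record types apart (by the prefix of the first field) and all class names are assumptions made for this toy data.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Job 1: reduce-side join of Users and Transactions on the user id.
class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",\\s*");
    if (fields[0].startsWith("u")) {
      // Users record, e.g. "u1, Antti, FI": emit (user id, "L:" + location)
      context.write(new Text(fields[0]), new Text("L:" + fields[2]));
    } else {
      // Transactions record, e.g. "p1, Apple, u1": emit (user id, "P:" + product id)
      context.write(new Text(fields[2]), new Text("P:" + fields[0]));
    }
  }
}

class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text userId, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String location = null;
    List<String> products = new ArrayList<String>();
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("L:")) {
        location = s.substring(2);
      } else {
        products.add(s.substring(2));
      }
    }
    // Emit one (product, location) pair per purchase made by this user
    for (String product : products) {
      context.write(new Text(product), new Text(location));
    }
  }
}

// Job 2: the map phase is the identity (assuming the (product, location) pairs from
// Job 1 are read back as key/value Text pairs, e.g. with KeyValueTextInputFormat);
// the reducer counts the distinct locations per product.
class LocationCountReducer extends Reducer<Text, Text, Text, IntWritable> {
  @Override
  public void reduce(Text product, Iterable<Text> locations, Context context)
      throws IOException, InterruptedException {
    Set<String> distinct = new HashSet<String>();
    for (Text loc : locations) {
      distinct.add(loc.toString());
    }
    context.write(product, new IntWritable(distinct.size()));
  }
}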

The End

Questions? Comments?
