Data-intensive programming: MapReduce
Timo Aaltonen, Department of Pervasive Computing
Outline
• MapReduce
Original Map and Reduce
• map(foo, i) – Apply function foo to every item of i and return a list of the results
  – Example: map(square, [1, 2, 3, 4]) = [1, 4, 9, 16]
• reduce(bar, i) – Apply two-argument function bar
  • cumulatively to the items of i,
  • from left to right,
  • to reduce the values in i to a single value.
  – Example: reduce(sum, [1, 4, 9, 16]) = (((1 + 4) + 9) + 16) = 30
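The same built-ins exist in Python (reduce lives in functools since Python 3); a minimal sketch of the two examples above:

```python
from functools import reduce

def square(x):
    return x * x

def add(a, b):
    return a + b

# map applies square to every item and returns the results
squares = list(map(square, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# reduce folds add over the list from left to right:
# (((1 + 4) + 9) + 16) = 30
total = reduce(add, squares)

print(squares, total)
```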
MapReduce
• MapReduce is a programming model for distributed processing of large datasets
• Scales nearly linearly
  – Twice as many nodes -> twice as fast
  – Achieved by exploiting data locality
    • Computation is moved close to the data
• Simple programming model
  – Programmer only needs to write two functions: Map and Reduce
Map & Reduce
• Map maps input data to (key, value) pairs
• Reduce processes the list of values for a given key
• The MapReduce framework (such as Hadoop) takes care of the rest
  – Distributes the job among nodes
  – Moves the data to and from the nodes
  – Handles node failures
  – etc.
[Figure: WordCount as a MapReduce job. The input lines "Sheena is a punk rocker / Sheena is a punk rocker now / Well she's a punk punk" are split across map nodes; MAP emits a (word, 1) pair for every word, SHUFFLE groups the pairs by key onto reduce nodes, and REDUCE sums the counts for each word.]
MapReduce
• MAP: Map(k1, v1) → list(k2, v2)
• REDUCE: Reduce(k2, list(v2)) → list(v3)
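The whole flow can be simulated in plain Python; this is an illustrative sketch of the model, not Hadoop API code, and the names map_fn, shuffle, and reduce_fn are invented for the example:

```python
from collections import defaultdict

def map_fn(line):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between Map and Reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce(k2, list(v2)) -> (k2, v3): sum the counts for one word
    return (key, sum(values))

lines = ["Sheena is a punk rocker",
         "Sheena is a punk rocker now",
         "Well she's a punk punk"]

mapped = [pair for line in lines for pair in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # e.g. Sheena -> 2, a -> 3, punk -> 4
```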
Map & Reduce in Hadoop
• In Hadoop, Map and Reduce functions can be written in
  – Java (org.apache.hadoop.mapreduce.lib)
  – C++, using Hadoop Pipes
  – any language, using Hadoop Streaming
• Also a number of third-party programming frameworks for Hadoop MapReduce
  – For Java, Scala, Python, Ruby, PHP, …
Mapper Java example

public class WordCount {
  public static class MyMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String txt = value.toString();
      String[] items = txt.split("\\s");
      for (int i = 0; i < items.length; i++) {
        Text word = new Text(items[i]);
        IntWritable one = new IntWritable(1);
        context.write(word, one);
      }
    }
  }
• Mapper<LongWritable, Text, Text, IntWritable>: input key and value types first, then output key and value types
• The Mapper input types depend on the defined InputFormat
  – By default TextInputFormat
    • Key (LongWritable): position in the file
    • Value (Text): the line
Reducer Java example

public static class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum = sum + value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
• Reducer<Text, IntWritable, Text, IntWritable>: input key and value types first, then output key and value types
Job of the MapReduce example
• The main driver program configures the MapReduce job and submits it to the Hadoop YARN cluster:

public static void main(String[] args) throws Exception {
  …
  Job job = new Job();
  job.setJarByClass(WordCount.class);
  job.setJobName("Word Count");
  job.setMapperClass(MyMapper.class);
  job.setReducerClass(MyReducer.class);
• Input and output key and value classes are defined:

  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(IntWritable.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
Job of the MapReduce example
• FileInputFormat feeds the input splits to map instances from the wc-files directory:

  FileInputFormat.addInputPath(job, new Path("wc-files"));

• The result is written to the wc-output directory:

  FileOutputFormat.setOutputPath(job, new Path("wc-output"));

• The job waits until the Hadoop runtime environment has executed the program:

  job.waitForCompletion(true);
} // main
} // class
MapReduce WordCount Demo
• Building the program:

  % mkdir wordcount_classes
  % javac -classpath `hadoop classpath` \
      -d wordcount_classes WordCount.java
  % jar -cvf wordcount.jar -C wordcount_classes/ .

• Send WordCount to YARN:

  % hadoop jar wordcount.jar fi.tut.WordCount /data output
MapReduce WordCount Demo
• The result:

  % hdfs dfs -ls output
  Found 2 items
  -rw-r--r--   1 hduser supergroup        0 2016-09-08 11:46 output/_SUCCESS
  -rw-r--r--   1 hduser supergroup   470923 2016-09-08 11:46 output/part-r-00000
  % hdfs dfs -get output/part-r-00000
  % tail -10 part-r-00000
  zoology, 2
  zu 2
  À 2
  à 4
  çela, 1
  …
Fault Tolerance and Speculative Execution
• Faults are handled by restarting tasks
• All managed in the background
• No need to manage side-effects or process state
• Speculative execution helps prevent bottlenecks
Combiners
[Figure: on map node 1, Map emits (A, 1), (A, 1), (A, 1) and (B, 1); the Combiner merges the three A pairs into (A, 3) before they are sent to the reduce node for key A.]
Combiners
• A Combiner can "compress" data on a mapper node before sending it forward
• Combiner input/output types must equal the mapper output types
• In Hadoop Java, Combiners use the Reducer interface:

  job.setCombinerClass(MyReducer.class);
Reducer as a Combiner
• A Reducer can be used as a Combiner if it is commutative and associative
  – E.g. max is:
    • max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3))
    • true for any order of function applications
  – E.g. avg is not:
    • avg(1, 2, avg(3, 4, 5)) = 2.33333 ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
• Note: if the Reducer is not commutative and associative, Combiners can still be used
  – The Combiner just has to be different from the Reducer and designed for the specific case
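A quick check in Python of why max is safe as a Combiner but avg is not, using the numbers from the slide (illustrative only):

```python
def avg(*xs):
    # arithmetic mean of the arguments
    return sum(xs) / len(xs)

# max is commutative and associative: any grouping gives the same result
assert max(1, 2, max(3, 4, 5)) == max(max(2, 4), max(1, 5, 3)) == 5

# avg is not: pre-averaging a subset (as a Combiner would) skews the result
a = avg(1, 2, avg(3, 4, 5))       # avg(1, 2, 4) = 2.333...
b = avg(avg(2, 4), avg(1, 5, 3))  # avg(3, 3)    = 3.0
print(a, b)
```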
Adding a Combiner to WordCount
[Figure: Map emits (walk, 1), (run, 1), (walk, 1); the Combiner merges them into (run, 1), (walk, 2) before the Shuffle phase.]
Hadoop Streaming
• Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
• Input is read from standard input
• Output is written to standard output
• Input/output items are lines of the form key\tvalue
  – \t is the tabulator character
• Reducer input lines are grouped by key
  – One reducer instance may receive multiple keys
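A minimal streaming word count might look as follows; this is a sketch (the actual mapper.py and reducer.py of the demo are not shown in these slides), with both roles combined into one file for brevity. The reducer relies on the framework having sorted its input lines by key:

```python
#!/usr/bin/env python3
# Word count for Hadoop Streaming: mapper and reducer in one file (sketch).
import sys
from itertools import groupby

def mapper(lines):
    # Emit "word\t1" for every word on the input
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Input lines arrive sorted by key; sum the counts per key
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__":
    # Run as "map" or "reduce" depending on the first argument
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if role == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Between the map and reduce steps, a local debug run needs an explicit sort, which mirrors the Unix-pipe debugging shown on the next slide.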
Run Hadoop Streaming
• Debug using Unix pipes:

  cat sample.txt | ./mapper.py | sort | ./reducer.py

• On Hadoop:

  hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input sample.txt \
    -output output \
    -mapper ./mapper.py \
    -reducer ./reducer.py
MapReduce Examples
• These are from https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
• Counting and Summing
  – WordCount
• Filtering ("Grepping"), Parsing, and Validation
  – Problem: There is a set of records, and it is required to collect all records that meet some condition, or to transform each record (independently from other records) into another representation
  – Solution: Mapper takes records one by one and emits accepted items or their transformed versions
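A sketch of the filtering pattern in plain Python (the predicate and sample records are invented for illustration); the mapper alone does the work, and the reduce phase can be the identity or omitted entirely:

```python
def grep_mapper(records, predicate):
    # Emit only the records that satisfy the condition; no reducer needed
    for record in records:
        if predicate(record):
            yield record

logs = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
errors = list(grep_mapper(logs, lambda r: r.startswith("ERROR")))
print(errors)
```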
MapReduce Examples
• Collating
  – Problem statement: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes.
  – Solution: Mapper computes the given function for each item and emits the value of the function as a key and the item itself as a value. Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function is the document ID where the term was found.
Index

Page A: "This page contains text"
Page B: "My page contains too"

Map output: (This, A), (page, A), (contains, A), (text, A), (My, B), (page, B), (contains, B), (too, B)

Reduced output:
This: A
page: A, B
contains: A, B
text: A
too: B
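The inverted-index build for the two pages can be sketched in plain Python (illustrative, not Hadoop code; function names are invented):

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    # Emit (term, doc_id) for every word in the document
    return [(word, doc_id) for word in text.split()]

def index_reducer(mapped):
    # Group document IDs by term, as the shuffle + reduce phases would
    index = defaultdict(list)
    for term, doc_id in mapped:
        if doc_id not in index[term]:
            index[term].append(doc_id)
    return dict(index)

pairs = (index_mapper("A", "This page contains text")
         + index_mapper("B", "My page contains too"))
index = index_reducer(pairs)
print(index)  # e.g. page -> [A, B], contains -> [A, B]
```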
Conclusions
• Map tasks spread out the load (the logic)
  – may have hundreds or millions of mappers
  – fast for large amounts of data