MapReduce Programming

Yue-Shan Chang

split 0split 1split 2split 3split 4

worker

Master

UserProgram

outputfile 0

outputfile 1

(1) fork (1) fork (1) fork

(2) assign map(2) assign reduce

(3) read(4) local write

(5) remote read(6) write

Inputfiles

Mapphase

Intermediate files(on local disk)

Reducephase

Outputfiles

MapReduce Program Structure

Class MapReduceClass Mapper hellip Map程式碼Class Reduer hellip Reduce程式碼Main() 主程式設定區JobConf Conf=new JobConf(ldquoMRClassrdquo)其他設定參數程式碼

package orgmyorg import javaioIOException import javautil import orgapachehadoopfsPath import orgapachehadoopconf import orgapachehadoopio import orgapachehadoopmapred import orgapachehadooputil public class WordCount public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1) private Text word = new Text() public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException String line = valuetoString() StringTokenizer tokenizer = new StringTokenizer(line) while (tokenizerhasMoreTokens()) wordset(tokenizernextToken()) outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritablegt public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException int sum = 0 while (valueshasNext()) sum += valuesnext()get() outputcollect(key new IntWritable(sum)) public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass) confsetJobName(wordcount) confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass) confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass) confsetReducerClass(Reduceclass) confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass) FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1])) JobClientrunJob(conf)

MapReduce Job

Handled parts

Configuration of a Jobbull JobConf objectndash JobConf is the primary interface for a user to describe

a map-reduce job to the Hadoop framework for execution

ndash JobConf typically specifies the Mapper combiner (if any) Partitioner Reducer InputFormat and OutputFormat implementations to be used

ndash Indicates the set of input files (setInputPaths(JobConf Path) addInputPath(JobConf Path)) and (setInputPaths(JobConf String) addInputPaths(JobConf String)) and where the output files should be written (setOutputPath(Path))

Configuration of a Job

Input Splitting

bull An input split will normally be a contiguous group of records from a single input filendash If the number of requested map tasks is larger

than number of files ndash the individual files are larger than the suggested

fragment size there may be multiple input splits constructed of each input file

bull The user has considerable control over the number of input splits

Specifying Input Formatsbull The Hadoop framework provides a large variety of

input formatsndash KeyValueTextInputFormat Keyvalue pairs one per linendash TextInputFormant The key is the line number and the

value is the linendash NLineInputFormat Similar to KeyValueTextInputFormat

but the splits are based on N lines of input rather than Y bytes of input

ndash MultiFileInputFormat An abstract class that lets the user implement an input format that aggregates multiple files into one split

ndash SequenceFIleInputFormat The input file is a Hadoop sequence file containing serialized keyvalue pairs

Specifying Input Formats

Setting the Output Parameters

bull The framework requires that the output parameters be configured even if the job will not produce any output

bull The framework will collect the output from the specified tasks and place them into the configured output directory

A Simple Map Function IdentityMapper

A Simple Reduce Function IdentityReducer

Configuring the Reduce Phase

bull the user must supply the framework with five pieces of informationndash The number of reduce tasks if zero no reduce

phase is runndash The class supplying the reduce methodndash The input key and value types for the reduce task

by default the same as the reduce outputndash The output key and value types for the reduce

taskndash The output file type for the reduce task output

How Many Maps

bull The number of maps is usually driven by the total size of the inputs that is the total number of blocks of the input files

bull The right level of parallelism for maps seems to be around 10-100 maps per-node

bull it is best if the maps take at least a minute to execute

bull setNumMapTasks(int)

Reducerbull Reducer reduces a set of intermediate values which

share a key to a smaller set of valuesbull Reducer has 3 primary phases shuffle sort and

reducebull Shufflendash Input to the Reducer is the sorted output of the mappers

In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP

bull Sortndash The framework groups Reducer inputs by keys (since

different mappers may have output the same key) in this stage

ndash The shuffle and sort phases occur simultaneously while map-outputs are being fetched they are merged

How Many Reduces

bull The right number of reduces seems to be 095 or 175 multiplied by (ltno of nodesgt mapredtasktrackerreducetasksmaximum)

bull With 095 all of the reduces can launch immediately and start transferring map outputs as the maps finish

bull With 175 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing

How Many Reducesbull Increasing the number of reduces increases the

framework overhead but increases load balancing and lowers the cost of failures

bull Reducer NONEndash It is legal to set the number of reduce-tasks to zero if

no reduction is desiredndash In this case the outputs of the map-tasks go directly

to the FileSystem into the output path set by setOutputPath(Path)

ndash The framework does not sort the map-outputs before writing them out to the FileSystem

Reporter

bull Reporter is a facility for MapReduce applications to report progress set application-level status messages and update Counters

bull Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive

JobTracker

bull JobTracker is the central location for submitting and tracking MR jobs in a network environment

bull JobClient is the primary interface by which user-job interacts with the JobTrackerndash provides facilities to submit jobs track their

progress access component-tasks reports and logs get the MapReduce clusters status information and so on

Job Submission and Monitoring

bull The job submission process involvesndash Checking the input and output specifications of

the jobndash Computing the InputSplit values for the jobndash Setting up the requisite accounting information

for the DistributedCache of the job if necessary ndash Copying the jobs jar and configuration to the

MapReduce system directory on the FileSystem ndash Submitting the job to the JobTracker and

optionally monitoring its status

MapReduce Details forMultimachine Clusters

Introduction

bull Whyndash datasets that canrsquot fit on a single machine ndash have time constraints that are impossible to

satisfy with a small number of machines ndash need to rapidly scale the computing power

applied to a problem due to varying input set sizes

Requirements for Successful MapReduce Jobs

bull Mapperndash ingest the input and process the input record sending

forward the records that can be passed to the reduce task or to the final output directly

bull Reducerndash Accept the key and value groups that passed through the

mapper and generate the final output

bull job must be configured with the location and type of the input data the mapper class to use the number of reduce tasks required and the reducer class and IO types

bull The TaskTracker service will actually run your map and reduce tasks and the JobTracker service will distribute the tasks and their input split to the various trackers

bull The cluster must be configured with the nodes that will run the TaskTrackers and with the number of TaskTrackers to run per node

bull Three levels of configuration to address to configure MapReduce on your clusterndash configure the machines ndash the Hadoop MapReduce framework ndash the jobs themselves

Launching MapReduce Jobs

bull launch the preceding example from the command linegt binhadoop [-libjars jar1jarjar2jarjar3jar] jar myjarjar MyClass

MapReduce-Specific Configuration for Each Machine in a Cluster

bull install any standard JARs that your application usesbull It is probable that your applications will have a

runtime environment that is deployed from a configuration management application which you will also need to deploy to each machine

bull The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks

bull The confslaves file should have the set of machines to serve as TaskTracker nodes

DistributedCache

bull distributes application-specific large read-only files efficiently

bull a facility provided by the MapReduce framework to cache files (text archives jars and so on) needed by applications

bull The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node

Adding Resources to the Task Classpath

bull Methodsndash JobConfsetJar(String jar) Sets the user JAR for the

MapReduce jobndash JobConfsetJarByClass(Class cls) Determines the

JAR that contains the class cls and calls JobConfsetJar(jar) with that JAR

ndash DistributedCacheaddArchiveToClassPath(Path archive Configuration conf) Adds an archive path to the current set of classpath entries

Configuring the Hadoop Core Cluster Information

bull Setting the Default File System URI

bull You can also use the JobConf object to set the default file systemndash confset( fsdefaultname

hdfsNamenodeHostnamePORT)

bull Setting the JobTracker Location

bull use the JobConf object to set the JobTracker informationndash confset( mapredjobtracker

JobtrackerHostnamePORT)

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

MapReduce Details for Multimachine Clusters

Introduction

DistributedCache

split 0split 1split 2split 3split 4

worker

Master

UserProgram

outputfile 0

outputfile 1

(1) fork (1) fork (1) fork

(2) assign map(2) assign reduce

(3) read(4) local write

(5) remote read(6) write

Inputfiles

Mapphase

Intermediate files(on local disk)

Reducephase

Outputfiles

MapReduce Job

Handled parts

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Handled parts

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Input Splitting

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Maps

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

MapReduce Job

Handled parts

Input Splitting

How Many Maps

Reducer

How Many Reduces

Reporter

JobTracker

Introduction

DistributedCache

Documents

MapReduce Programming