
Parleys: Tom White Hadoop Training Material


  • 03-1Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Hadoop Fundamentals: HDFS, MapReduce, Hive, Pig

Tom White, Devoxx, Antwerp, 2010

    1010.01

    03-2Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    HDFS and MapReduce

    Three parts

    Hadoop: Basic Concepts

    Writing a MapReduce Program

    Advanced MapReduce Programming

    Training VM

    http://www.cloudera.com/hadoop-training

    Hello Everyone!

In this talk I'm going to cover Hadoop: half on HDFS and MapReduce, half on Hive and Pig.

    How many here have used Hadoop?

  • 03-3Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Introduction

    Apache Hadoop Committer, PMC Member, Apache Member

    Engineer at Cloudera working on core Hadoop

    Founder of Apache Whirr, currently in the Incubator

    Author of Hadoop: The Definitive Guide

    http://hadoopbook.com

    03-4Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop: Basic Concepts

    1010.01

  • 03-5Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop: Basic Concepts

    In this chapter you will learn

    What Hadoop is

    What features the Hadoop Distributed File System (HDFS) provides

    The concepts behind MapReduce

    03-6Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop: Basic Concepts

    What Is Hadoop?

    The Hadoop Distributed File System (HDFS)

    How MapReduce works

  • 03-7Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Hadoop Project

    Hadoop is an open-source project overseen by the Apache Software Foundation

    Originally based on papers published by Google in 2003 and 2004

Hadoop committers work at several different organizations, including Facebook, Yahoo! and Cloudera

    03-8Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop Components

Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and MapReduce

There are many other projects based around core Hadoop, often referred to as the Hadoop Ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.

    A set of machines running HDFS and MapReduce is known as a Hadoop Cluster

Individual machines are known as nodes. A cluster can have as few as one node, or as many as several thousand. More nodes = better performance!

  • 03-9Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop Components: HDFS

    HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster

    Data is split into blocks and distributed across multiple nodes in the cluster

Each block is typically 64MB or 128MB in size

Each block is replicated multiple times. The default is to replicate each block three times. Replicas are stored on different nodes. This ensures both reliability and availability.

    03-10Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop Components: MapReduce

    MapReduce is the system used to process data in the Hadoop cluster

Consists of two phases: Map, and then Reduce. Between the two is a stage known as the sort and shuffle.

    Each Map task operates on a discrete portion of the overall dataset

    Typically one HDFS block of data

    After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase

    Much more on this later!

  • 03-11Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop: Basic Concepts

    What Is Hadoop?

    The Hadoop Distributed File System (HDFS)

    How MapReduce works

    03-12Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    HDFS Basic Concepts

HDFS is a filesystem written in Java, based on Google's GFS

Sits on top of a native filesystem (ext3, xfs, etc.)

Provides redundant storage for massive amounts of data, using cheap, unreliable computers

  • 03-13Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    HDFS Basic Concepts (contd)

HDFS performs best with a modest number of large files: millions, rather than billions, of files, each typically 100MB or more

HDFS is optimized for large, streaming reads of files, rather than random reads

Files in HDFS are "write once". No random writes to files are allowed. Append support is available in Cloudera's Distribution for Hadoop (CDH) and in Hadoop 0.21.

    03-14Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    How Files Are Stored

Files are split into blocks. Each block is usually 64MB or 128MB.

Data is distributed across many machines at load time. Different blocks from the same file will be stored on different machines. This provides for efficient MapReduce processing (see later).

Blocks are replicated across multiple machines, known as DataNodes

Default replication is three-fold, i.e., each block exists on three different machines

    A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located

    Known as the metadata

Hosting, project governance, community. Contributions from hundreds of individuals.

  • 03-15Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    How Files Are Stored: Example

    NameNode holds metadata for the two files (Foo.txt and Bar.txt)

    DataNodes hold the actual blocks

Each block will be 64MB or 128MB in size

    Each block is replicated three times on the cluster

    03-16Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    More On The HDFS NameNode

The NameNode daemon must be running at all times. If the NameNode stops, the cluster becomes inaccessible. Your system administrator will take care to ensure that the NameNode hardware is reliable!

The NameNode holds all of its metadata in RAM for fast access. It keeps a record of changes on disk for crash recovery.

    A separate daemon known as the Secondary NameNode takes care of some housekeeping tasks for the NameNode

    Be careful: The Secondary NameNode is not a backup NameNode!

MR depends on features of HDFS to work well. Talk on Wednesday. Linear scaling.

  • 03-17Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    HDFS: Points To Note

Although files are split into 64MB or 128MB blocks, if a file is smaller than this the full 64MB/128MB will not be used

Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files

    This will be set by the system administrator

    Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster

When a client application wants to read a file: it communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on; it then communicates directly with the DataNodes to read the data. The NameNode will not be a bottleneck.

    03-18Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Accessing HDFS

Applications can read and write HDFS files directly via the Java API (see the sketch below)

    Typically, files are created on a local filesystem and must be moved into HDFS

Likewise, files created in HDFS may need to be moved to a machine's local filesystem

    Access to HDFS from the command line is achieved with the hadoop fs command

Storage: compare to local file systems. Replication is the main way to get fault tolerance.
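The slides mention the Java API without showing it. As a hedged illustration (not part of the original deck), here is a minimal sketch of writing and reading an HDFS file through org.apache.hadoop.fs.FileSystem; the path and the class name HdfsIoExample are invented for this example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIoExample {
  public static void main(String[] args) throws Exception {
    // Picks up cluster settings (e.g. fs.default.name) from the Hadoop configuration files
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file (illustrative path, not from the slides)
    Path file = new Path("/user/training/hello.txt");
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello, HDFS");
    out.close();

    // Read it back
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
  }
}

The hadoop fs shell shown on the next slides is built on this same FileSystem API.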

  • 03-19Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    hadoop fs Examples

Copy file foo.txt from local disk to the user's home directory in HDFS (this will copy the file to /user/username/foo.txt):

hadoop fs -copyFromLocal foo.txt foo.txt

Get a directory listing of the user's home directory in HDFS:

hadoop fs -ls

Get a directory listing of the HDFS root directory:

hadoop fs -ls /

    03-20Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    hadoop fs Examples (contd)

Display the contents of the HDFS file /user/fred/bar.txt:

hadoop fs -cat /user/fred/bar.txt

Copy that file to the local disk, naming it baz.txt:

hadoop fs -copyToLocal /user/fred/bar.txt baz.txt

Create a directory called input under the user's home directory:

hadoop fs -mkdir input

    MR is a programming model

  • 03-21Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    hadoop fs Examples (contd)

Delete the directory input_old and all its contents:

hadoop fs -rmr input_old

    03-22Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Hadoop: Basic Concepts

    What Is Hadoop?

    The Hadoop Distributed File System (HDFS)

    How MapReduce works

  • 03-23Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    What Is MapReduce?

    MapReduce is a method for distributing a task across multiple nodes

Each node processes data stored on that node, where possible

Consists of two phases: Map and Reduce

    03-24Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Features of MapReduce

    Automatic parallelization and distribution

    Fault-tolerance

    Status and monitoring tools

A clean abstraction for programmers. MapReduce programs are usually written in Java.

Can be written in any scripting language using Hadoop Streaming

    All of Hadoop is written in Java

    MapReduce abstracts all the housekeeping away from the developer

    Developer can concentrate simply on writing the Map and Reduce functions

FS with semantics that are looser than POSIX (e.g. read consistency). Uses the native FS on each storage node (DN). Scales to very large sizes: terabytes the norm, petabytes possible.

  • 03-25Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: The Big Picture

    03-26Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: Terminology

A job is a full program: a complete execution of Mappers and Reducers over a dataset

    A task is the execution of a single Mapper or Reducer over a slice of data

    A task attempt is a particular instance of an attempt to execute a task

There will be at least as many task attempts as there are tasks. If a task attempt fails, another will be started by the JobTracker. Speculative execution (see later) can also result in more task attempts than completed tasks.

Geared towards large files. Takes advantage of disk transfer speed being faster than random reads. Can't update a file at a random position; the application pattern is to read then rewrite the entire file. Appends are now possible in the latest releases, but are an advanced feature. HBase uses sync (to ensure the file is persisted).

  • 03-27Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: The JobTracker

    MapReduce jobs are controlled by a software daemon known as the JobTracker

The JobTracker resides on a master node. Clients submit MapReduce jobs to the JobTracker. The JobTracker assigns Map and Reduce tasks to other nodes on the cluster. These nodes each run a software daemon known as the TaskTracker.

    The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker

    03-28Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: The Mapper

    Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic

    Multiple Mappers run in parallel, each processing a portion of the input data

    The Mapper reads data in the form of key/value pairs

    It outputs zero or more key/value pairs

map(in_key, in_value) -> (inter_key, inter_value) list

Block size is configurable: a per-cluster default, overridable per file. Distribution: we'll see this in a moment. HDFS does not dynamically re-balance, but if a block is lost it will repair it. Metadata is stored in memory and on disk.

  • 03-29Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: The Mapper (contd)

The Mapper may use or completely ignore the input key. For example, a standard pattern is to read a line of a file at a time: the key is the byte offset into the file at which the line starts, and the value is the contents of the line itself. Typically the key is considered irrelevant.

    If it writes anything at all out, the output must be in the form of key/value pairs

    03-30Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example Mapper: Upper Case Mapper

    Turn input into upper case (pseudo-code):

    let map(k, v) =

    emit(k.toUpper(), v.toUpper())

    ('foo', 'bar') -> ('FOO', 'BAR')

    ('foo', 'other') -> ('FOO', 'OTHER')

    ('baz', 'more data') -> ('BAZ', 'MORE DATA')
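To connect the pseudo-code to the Java API covered later in this deck, here is a hedged sketch of what the Upper Case Mapper might look like with the old-style org.apache.hadoop.mapred API; the class name UpperCaseMapper is invented for illustration:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UpperCaseMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  // Reused output objects (see the efficiency discussion later in this chapter)
  private Text outKey = new Text();
  private Text outValue = new Text();

  public void map(Text key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Emit (k.toUpper(), v.toUpper()) for each input pair
    outKey.set(key.toString().toUpperCase());
    outValue.set(value.toString().toUpperCase());
    output.collect(outKey, outValue);
  }
}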

Two files, 5 blocks. Block one is stored on the first three machines; block 2 on machines 1, 4 and 5. Clients can read any block: load balancing. Not shown here, but blocks are spread across 2 racks so the cluster can survive the failure of a rack.

  • 03-31Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example Mapper: Explode Mapper

    Output each input character separately (pseudo-code):

let map(k, v) =
    foreach char c in v:
        emit(k, c)

    ('foo', 'bar') -> ('foo', 'b'), ('foo', 'a'), ('foo', 'r')

    ('baz', 'other') -> ('baz', 'o'), ('baz', 't'),('baz', 'h'), ('baz', 'e'),('baz', 'r')

    03-32Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example Mapper: Filter Mapper

    Only output key/value pairs where the input value is a prime number (pseudo-code):

    let map(k, v) = if (isPrime(v)) then emit(k, v)

    ('foo', 7) -> ('foo', 7)

    ('baz', 10) -> nothing

Very quick to look up block locations, for example (although these are cached by clients too). The SNN performs edit log rolling, so the transaction log is merged with the FS namespace image. The SNN is a stale replica.

  • 03-33Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example Mapper: Changing Keyspaces

    The key output by the Mapper does not need to be identical to the input key

Output the word length as the key (pseudo-code):

    let map(k, v) = emit(v.length(), v)

    ('foo', 'bar') -> (3, 'bar')

    ('baz', 'other') -> (5, 'other')

    ('foo', 'abracadabra') -> (11, 'abracadabra')

    03-34Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: The Reducer

    After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list

This list is given to a Reducer. There may be a single Reducer, or multiple Reducers; this is specified as part of the job configuration (see later). All values associated with a particular intermediate key are guaranteed to go to the same Reducer. The intermediate keys, and their value lists, are passed to the Reducer in sorted key order. This step is known as the shuffle and sort.

The Reducer outputs zero or more final key/value pairs. These are written to HDFS. In practice, the Reducer usually emits a single key/value pair for each input key.

Differs from normal local FS blocks. You can see the files in a directory on the DN. DNs don't know which files their blocks are a part of. Data does *not* flow through the NN.

  • 03-35Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example Reducer: Sum Reducer

    Add up all the values associated with each intermediate key (pseudo-code):

let reduce(k, vals) =
    sum = 0
    foreach int i in vals:
        sum += i
    emit(k, sum)

    ('foo', [9, 3, -17, 44]) -> ('foo', 39)

    ('bar', [123, 100, 77]) -> ('bar', 300)

    03-36Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Example Reducer: Identity Reducer

    The Identity Reducer is very common (pseudo-code):

let reduce(k, vals) =
    foreach v in vals:
        emit(k, v)

    ('foo', [9, 3, -17, 44]) -> ('foo', 9), ('foo', 3),

    ('foo', -17), ('foo', 44)

    ('bar', [123, 100, 77]) -> ('bar', 123), ('bar', 100),('bar', 77)

The central point is the FileSystem class. There are other ways of getting data into HDFS: Flume for streaming data, or Sqoop from a DB.

  • 03-37Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce Execution

    03-38Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: Data Localization

Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block of data stored locally on that node via HDFS

    If this is not possible, the Map task will have to transfer the data across the network as it processes that data

    Once the Map tasks have finished, data is then transferred across the network to the Reducers

    Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers

    All Mappers will, in general, have to communicate with all Reducers

Copy file foo.txt in the local directory into the home directory in HDFS. These are like the regular Unix filesystem commands, except prefixed by "hadoop fs".

  • 03-39Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: Is Shuffle and Sort a Bottleneck?

It appears that the shuffle and sort phase is a bottleneck: no Reducer can start until all Mappers have finished

    In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work

    This mitigates against a huge amount of data transfer starting as soon as the last Mapper finishes

    03-40Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: Is a Slow Mapper a Bottleneck?

It is possible for one Map task to run more slowly than the others, perhaps due to faulty hardware, or just a very slow machine

It would appear that this would create a bottleneck: no Reducer can start until every Mapper has finished

Hadoop uses speculative execution to mitigate against this. If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data. The results of the first Mapper to finish will be used; Hadoop will kill off the Mapper which is still running.

  • 03-41Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce: The Combiner

Often, Mappers produce large amounts of intermediate data. That data must be passed to the Reducers. This can result in a lot of network traffic.

It is often possible to specify a Combiner: like a mini-Reduce, it runs locally on a single Mapper's output, and output from the Combiner is sent to the Reducers.

Combiner and Reducer code are often identical. Technically, this is possible if the operation performed is commutative and associative.

    03-42Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce Example: Word Count

    Count the number of occurrences of each word in a large amount of input data

This is the "hello world" of MapReduce programming

map(String input_key, String input_value):
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator intermediate_vals):
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

  • 03-43Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce Example: Word Count (contd)

Input to the Mapper:

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

Output from the Mapper:

('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

    03-44Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    MapReduce Example: Word Count (contd)

Intermediate data sent to the Reducer:

('aardvark', [1]), ('cat', [1]), ('mat', [1]), ('on', [1, 1]), ('sat', [1, 1]), ('sofa', [1]), ('the', [1, 1, 1, 1])

Final Reducer output:

('aardvark', 1), ('cat', 1), ('mat', 1), ('on', 2), ('sat', 2), ('sofa', 1), ('the', 4)

  • 03-45Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Word Count With Combiner

    A Combiner would reduce the amount of data sent to the Reducer

Intermediate data sent to the Reducer after a Combiner using the same code as the Reducer:

('aardvark', [1]), ('cat', [1]), ('mat', [1]), ('on', [2]), ('sat', [2]), ('sofa', [1]), ('the', [4])

Combiners decrease the amount of network traffic required during the shuffle and sort phase

They often also decrease the amount of work needed to be done by the Reducer

    03-46Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Writing a MapReduce Program

    1010.01

    MR strives to minimize data movement since bandwidth is the scarcest resource on the cluster

  • 03-47Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Writing a MapReduce Program

    In this chapter you will learn

    How to use the Hadoop API to write a MapReduce program in Java

    03-48Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Writing a MapReduce Program

    Examining a Sample MapReduce program

    The Driver Code

    The Mapper

    The Reducer

    The Combiner

MR takes care of dividing up the work to be done and deciding where to execute it, using a work scheduler. Users don't need to worry about units of work (tasks) failing, since they will be re-run. We'll see how to monitor a job. Very simple abstraction, but very powerful: a wide range of data processing problems can be solved. Unlike other models (e.g. MPI), it is high-level.

  • 03-49Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Demo

Before we discuss how to write a MapReduce program in Java, let's run and examine a simple MapReduce job

    03-50Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Components of a MapReduce Program

MapReduce programs generally consist of three portions: the Mapper, the Reducer, and the driver code

    We will look at each element in turn

Note: your MapReduce program may also contain other elements: a Combiner (often the same code as the Reducer), a custom Partitioner, etc.

The map function produces key-value pairs. The MR system aggregates map output by key and feeds it to the reducers. Reducers process all the values for a given key.

We'll see more details on keys and values in a moment.

  • 03-51Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Writing a MapReduce Program

    Examining our Sample MapReduce program

    The Driver Code

    The Mapper

    The Reducer

    The Combiner

    03-52Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Driver: Complete Code

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

  • 03-53Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Driver: Import Statements

    import org.apache.hadoop.fs.Path;import org.apache.hadoop.io.IntWritable;import org.apache.hadoop.io.Text;import org.apache.hadoop.mapred.FileInputFormat;import org.apache.hadoop.mapred.FileOutputFormat;import org.apache.hadoop.mapred.JobClient;import org.apache.hadoop.mapred.JobConf;

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

You will typically import these classes into every MapReduce job you write. We will omit the import statements in future slides for brevity.

    03-54Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Driver: Main Code (contd)

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

Jobs are submitted to the JT. It coordinates the job, scheduling tasks on nodes on the cluster (TTs). TTs reside on the same nodes as DNs. TTs run the code you write; the JT doesn't.

  • 03-55Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Driver: Main Code

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

You usually configure your MapReduce job in the main method of your driver code. Here, we first check to ensure that the user has specified the HDFS directories to use for input and output on the command line.

    03-56Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Configuring The Job With JobConf

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

    To configure your job, create a new JobConf object and specify the class which will be called to run the job.

The Map phase is embarrassingly parallel. Input data is k-v pairs. The map may produce multiple (0 or more) output records for each input record. Intermediate data is not in HDFS.

  • 03-57Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Creating a New JobConf Object

    The JobConf class allows you to set configuration options for your MapReduce job

The classes to be used for your Mapper and Reducer, the input and output directories, and many other options

    Any options not explicitly set in your driver code will be read from your Hadoop configuration files

    Usually located in /etc/hadoop/conf

Any options not specified in your configuration files will receive Hadoop's default values
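As a hedged illustration of "many other options", a couple of extra settings that could appear in the driver; the values are arbitrary, not from the slides:

// Assuming conf is the JobConf created in the driver code above
conf.setNumReduceTasks(2);                        // how many Reducers to run for this job
conf.set("mapred.compress.map.output", "true");   // generic override of a configuration-file property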

    03-58Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Naming The Job

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

    First, we give our job a meaningful name.

What the input k-v pairs are depends on the input format (text files here). You write k-v pairs, but they may be of different types, e.g. key string, value integer.

  • 03-59Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Specifying Input and Output Directories

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

    Next, we specify the input directory from which data will be read, and the output directory to which our final output will be written.

    03-60Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Determining Which Files To Read

By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers

    Exceptions: items whose names begin with a period (.) or underscore (_)

    Globs can be specified to restrict input For example, /2010/*/01/*

    Alternatively, FileInputFormat.addInputPath() can be called multiple times, specifying a single file or directory each time

    More advanced filtering can be performed by implementing a PathFilter (see later)
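A brief hedged sketch of the two approaches just described, as they might appear in the driver code; the paths are invented for illustration:

// Glob: everything under any month's "01" directory in 2010
FileInputFormat.setInputPaths(conf, new Path("/2010/*/01/*"));

// Or add files and directories one at a time
FileInputFormat.addInputPath(conf, new Path("/logs/2010"));
FileInputFormat.addInputPath(conf, new Path("/data/extra-file.txt"));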

  • 03-61Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Getting Data to the Mapper

The data passed to the Mapper is specified by an InputFormat, specified in the driver code. It defines the location of the input data (a file or directory, for example) and determines how to split the input data into input splits. Each Mapper deals with a single input split. The InputFormat is a factory for RecordReader objects to extract (key, value) records from the input source.

    03-62Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Getting Data to the Mapper (contd)

    Multiple emits

  • 03-63Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Some Standard InputFormats

FileInputFormat: the base class used for all file-based InputFormats

TextInputFormat: the default. Treats each \n-terminated line of a file as a value; the key is the byte offset within the file of that line.

KeyValueTextInputFormat: maps \n-terminated lines as "key SEP value". By default, the separator is a tab.

SequenceFileInputFormat: a binary file of (key, value) pairs with some additional metadata

SequenceFileAsTextInputFormat: similar, but maps (key.toString(), value.toString())
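The InputFormat is selected in the driver. TextInputFormat is the default, so a line like the following hedged sketch is only needed for the other formats (KeyValueTextInputFormat lives in org.apache.hadoop.mapred):

// e.g. read each line as key<TAB>value instead of byte-offset/line
conf.setInputFormat(KeyValueTextInputFormat.class);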

    03-64Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Specifying Final Output With OutputFormat

    FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output

The driver can also specify the format of the output data. The default is a plain text file. This could be explicitly written as:

    conf.setOutputFormat(TextOutputFormat.class);

    We will discuss OutputFormats in more depth in a later chapter

    Nothing emitted for second record

  • 03-65Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Specify The Classes for Mapper and Reducer

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

    Give the JobConf object information about which classes are to be instantiated as the Mapper and Reducer. You also specify the classes for the intermediate and final keys and values

    03-66Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Running The Job

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

    Finally, run the job by calling the runJob method.

Type can be different. Value type can be different too.

  • 03-67Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Running The Job (contd)

There are two ways to run your MapReduce job:

JobClient.runJob(conf) blocks (waits for the job to complete before continuing)

JobClient.submitJob(conf) does not block (driver code continues as the job is running)

    JobClient determines the proper division of input data into InputSplits

JobClient then sends the job information to the JobTracker daemon on the cluster
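A hedged sketch of the non-blocking form, using the RunningJob handle returned by submitJob; the polling interval and messages are arbitrary, and org.apache.hadoop.mapred.RunningJob must also be imported:

// In the driver, instead of JobClient.runJob(conf):
JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);   // returns immediately

while (!job.isComplete()) {                // driver code keeps running while the job executes
  System.out.println("map " + (job.mapProgress() * 100) + "%, reduce "
      + (job.reduceProgress() * 100) + "%");
  Thread.sleep(5000);
}
System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");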

    03-68Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Reprise: Driver Code

    public class WordCount {

    public static void main(String[] args) throws Exception {

    if (args.length != 2) {System.out.println("usage: [input] [output]");System.exit(-1);

    }

    JobConf conf = new JobConf(WordCount.class);conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);conf.setMapOutputKeyClass(Text.class);conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);}

    }

Not materialized as a list; given to the user as an Iterator. A Reducer may process multiple keys. Each key has *all* the values associated with it; it is not the case that some go to reducer A and some to reducer B. This is a very useful property.

  • 03-69Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Writing a MapReduce Program

    Examining our Sample MapReduce program

    The Driver Code

    The Mapper

    The Reducer

    The Combiner

    03-70Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Mapper: Complete Code

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}

Sums the values for a given key. If there was one reducer it would process both foo's and bar's data. Actually it would process bar first, since MR guarantees that the keys given to a particular reducer are in order.

  • 03-71Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Mapper: import Statements

    import java.io.IOException;import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;import org.apache.hadoop.io.Text;import org.apache.hadoop.mapred.Mapper;import org.apache.hadoop.mapred.MapReduceBase;import org.apache.hadoop.mapred.OutputCollector;import org.apache.hadoop.mapred.Reporter;

    public class WordMapper extends MapReduceBaseimplements Mapper {

    private Text word = new Text();private final static IntWritable one = new IntWritable(1);

    public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {

    // Break line into words for processingStringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {word.set(wordList.nextToken());output.collect(word, one);

    }}

    }

You will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Mapper you write. We have also imported java.util.StringTokenizer as we will need this for our particular Mapper. We will omit the import statements in future slides for brevity.

    03-72Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Mapper: Main Code

    }

    public class WordMapper extends MapReduceBaseimplements Mapper {

    private Text word = new Text();private final static IntWritable one = new

    IntWritable(1);

    public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {

    // Break line into words for processingStringTokenizer wordList = new

    StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {word.set(wordList.nextToken());output.collect(word, one);

    }}

    }

  • 03-73Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Mapper: Main Code (contd)

    }

    public class WordMapper extends MapReduceBaseimplements Mapper {

    private Text word = new Text();private final static IntWritable one = new

    IntWritable(1);

    public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {

    // Break line into words for processingStringTokenizer wordList = new

    StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {word.set(wordList.nextToken());output.collect(word, one);

    }}

    }

Your Mapper class should extend MapReduceBase, and will implement the Mapper interface. The Mapper interface expects four parameters, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types. In our example, because we are going to ignore the input key we can just specify that it will be an Object; we do not need to be more specific than that.

    03-74Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Mapper Parameters

The Mapper's parameters define the input and output key/value types

    Keys are always WritableComparable

    Values are always Writable

Another picture of the process. In general, each reducer is fed by a part of the output from each mapper.

  • 03-75Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    What is Writable?

    Hadoop defines its own box classes for strings, integers and so on

IntWritable for ints, LongWritable for longs, FloatWritable for floats, DoubleWritable for doubles, Text for strings, etc.

    The Writable interface makes serialization quick and easy for Hadoop

Any value type must implement the Writable interface
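A small hedged illustration of the box classes (the class name WritableBoxes is invented); note that unlike Java's own wrapper classes they are mutable:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableBoxes {
  public static void main(String[] args) {
    IntWritable count = new IntWritable(1);
    count.set(42);                 // box classes are mutable
    int plain = count.get();       // unwrap to a plain Java int

    Text word = new Text("hadoop");
    word.set("mapreduce");         // reuse the same object with new contents
    System.out.println(word + " / " + plain);
  }
}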

    03-76Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    What is WritableComparable?

    A WritableComparable is a Writable which is also Comparable

    Two WritableComparables can be compared against each other to determine their order

    Keys must be WritableComparables because they are passed to the Reducer in sorted order

    We will talk more about WritableComparable later

    Note that despite their names, all Hadoop box classes implement both Writable and WritableComparable

    For example, IntWritable is actually a WritableComparable

Will try to schedule map tasks on nodes with local data. If not possible, then rack-local. Otherwise off-rack, normally a small proportion of tasks. Reducers are scheduled anywhere.

  • 03-77Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Creating Objects: Efficiency

    }

    public class WordMapper extends MapReduceBaseimplements Mapper {

    private Text word = new Text();private final static IntWritable one = new

    IntWritable(1);

    public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {

    // Break line into words for processingStringTokenizer wordList = new

    StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {word.set(wordList.nextToken());output.collect(word, one);

    }}

    }

    We create two objects outside of the map function we are about to write. One is a Text object, the other an IntWritable. We do this here for efficiency, as we will show.

    03-78Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Using Objects Efficiently in MapReduce

    A typical way to write Java code might look something like this:

    Problem: this creates a new object each time around the loop

    Your map function will probably be called many thousands or millions of times for each Map task

    Resulting in millions of objects being created

This is very inefficient: running the garbage collector takes time

while (more input exists) {
    myIntermediate = new intermediate(input);
    myIntermediate.doSomethingUseful();
    export outputs;
}

  • 03-79Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Using Objects Efficiently in MapReduce (contd)

    A more efficient way to code:

Only one object is created. It is populated with different data each time around the loop. Note that for this to work, the intermediate class must be mutable. All relevant Hadoop classes are mutable.

myIntermediate = new intermediate(junk);
while (more input exists) {
    myIntermediate.setupState(input);
    myIntermediate.doSomethingUseful();
    export outputs;
}

    03-80Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Using Objects Efficiently in MapReduce (contd)

Reusing objects allows for much better cache usage, providing a significant performance benefit (up to 2x)

    All keys and values given to you by Hadoop use this model

    Caution! You must take this into account when you write your code

    For example, if you create a list of all the objects passed to your map method you must be sure to do a deep copy
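A hedged sketch of that caution, assuming the Mapper keeps a List<Text> field called cache (the field is invented for illustration):

// Wrong: Hadoop reuses 'value', so every element ends up referring to the same object
cache.add(value);

// Right: take a deep copy before storing the reference (Text has a copy constructor)
cache.add(new Text(value));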

  • 03-81Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The map Method

    }

    public class WordMapper extends MapReduceBaseimplements Mapper {

    private Text word = new Text();private final static IntWritable one = new

    IntWritable(1);

    public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {

    // Break line into words for processingStringTokenizer wordList = new

    StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {word.set(wordList.nextToken());output.collect(word, one);

    }}

    }

The map method's signature looks like this. It will be passed a key, a value, an OutputCollector object and a Reporter object. You specify the data types that the OutputCollector object will write.

    03-82Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The map Method (contd)

One instance of your Mapper is instantiated per task attempt. This exists in a separate process from any other Mapper, usually on a different physical machine. No data sharing between Mappers is allowed!

Commutative and associative: roughly, order doesn't matter. E.g. calculating the sum of numbers. Calculating the mean is *not*.

  • 03-83Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The map Method: Processing The Line

    }

    public class WordMapper extends MapReduceBaseimplements Mapper {

    private Text word = new Text();private final static IntWritable one = new

    IntWritable(1);

    public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {

    // Break line into words for processingStringTokenizer wordList = new

    StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {word.set(wordList.nextToken());output.collect(word, one);

    }}

    }

Within the map method, we split each line into separate words using Java's StringTokenizer class.

    03-84Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Outputting Intermediate Data

    }

    public class WordMapper extends MapReduceBaseimplements Mapper {

    private Text word = new Text();private final static IntWritable one = new

    IntWritable(1);

    public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {

    // Break line into words for processingStringTokenizer wordList = new

    StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {word.set(wordList.nextToken());output.collect(word, one);

    }}

    }

To emit a (key, value) pair, we call the collect method of our OutputCollector object. The key will be the word itself, the value will be the number 1. Recall that the output key must be of type WritableComparable, and the value must be a Writable. We have created a Text object called word, so we can simply put a new value into that object. We have also created an IntWritable object called one.

  • 03-85Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Reprise: The Map Method

    }

    public class WordMapper extends MapReduceBaseimplements Mapper {

    private Text word = new Text();private final static IntWritable one = new

    IntWritable(1);

    public void map(Object key, Text value, OutputCollector output, Reporter reporter) throws IOException {

    // Break line into words for processingStringTokenizer wordList = new

    StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {word.set(wordList.nextToken());output.collect(word, one);

    }}

    }

    03-86Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Reporter Object

Notice that in this example we have not used the Reporter object passed into the Mapper

    The Reporter object can be used to pass some information back to the driver code

    Also allows the Mapper to retrieve information about the input split it is working on

    We will investigate the Reporter later in the course
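Although unused in this example, a hedged sketch of what the Reporter could be used for inside the map method; the counter group and names are made up:

// Inside map(...):
reporter.setStatus("processing offset " + key);            // status string visible in the web UI
reporter.incrCounter("WordCount", "Lines processed", 1);   // custom counter
InputSplit split = reporter.getInputSplit();               // info about this task's split (org.apache.hadoop.mapred.InputSplit)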

  • 03-87Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Writing a MapReduce Program

    Examining our Sample MapReduce program

    The Driver Code

    The Mapper

    The Reducer

    The Combiner

    03-88Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Reducer: Complete Code

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}

  • 03-89Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Reducer: Import Statements

    import java.io.IOException;import java.util.Iterator;import org.apache.hadoop.io.IntWritable;import org.apache.hadoop.io.Text;import org.apache.hadoop.mapred.OutputCollector;import org.apache.hadoop.mapred.MapReduceBase;import org.apache.hadoop.mapred.Reducer;import org.apache.hadoop.mapred.Reporter;

    public class SumReducer extends MapReduceBaseimplements Reducer {

    private IntWritable totalWordCount = new IntWritable();

    public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter)throws IOException {

    int wordCount = 0;while (values.hasNext()) {

    wordCount += values.next().get();}

    totalWordCount.set(wordCount);output.collect(key, totalWordCount);

    }}

As with the Mapper, you will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Reducer you write. You will also import java.util.Iterator, which will be used to step through the values provided to the Reducer for each key. We will omit the import statements in future slides for brevity.

    03-90Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Reducer: Main Code

    }

    public class SumReducer extends MapReduceBaseimplements Reducer {

    private IntWritable totalWordCount = new IntWritable();

    public void reduce(Text key, Iterator values, OutputCollector output, Reporter

    reporter)throws IOException {

    int wordCount = 0;while (values.hasNext()) {

    wordCount += values.next().get();}

    totalWordCount.set(wordCount);output.collect(key, totalWordCount);

    }}

  • 03-91Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The Reducer: Main Code (contd)

    }

    public class SumReducer extends MapReduceBaseimplements Reducer {

    private IntWritable totalWordCount = new IntWritable();

    public void reduce(Text key, Iterator values, OutputCollector output, Reporter

    reporter)throws IOException {

    int wordCount = 0;while (values.hasNext()) {

    wordCount += values.next().get();}

    totalWordCount.set(wordCount);output.collect(key, totalWordCount);

    }}

Your Reducer class should extend MapReduceBase and implement Reducer. The Reducer interface expects four parameters, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types. The keys are WritableComparables, the values are Writables.

    03-92Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Object Reuse for Efficiency

    }

    public class SumReducer extends MapReduceBaseimplements Reducer {

    private IntWritable totalWordCount = new IntWritable();

    public void reduce(Text key, Iterator values, OutputCollector output, Reporter

    reporter)throws IOException {

    int wordCount = 0;while (values.hasNext()) {

    wordCount += values.next().get();}

    totalWordCount.set(wordCount);output.collect(key, totalWordCount);

    }}

    As before, we create an IntWritable object outside of the actual reduce method for efficiency.

  • 03-93Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The reduce Method


    The reduce method receives a key and an Iterator of values; it also receives an OutputCollector object and a Reporter object.

    03-94Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Processing The Values


    We use hasNext() and next() on values to step through all the elements in the iterator; next() returns an IntWritable, and get() extracts its int value. In our example, we are simply adding all the values together.

  • 03-95Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Writing The Final Output


    Finally, we place the total into our totalWordCount object and write the output (key, value) pair using the collect method of our OutputCollector object.

    03-96Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Reprise: The Reduce Method

    public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      private IntWritable totalWordCount = new IntWritable();

      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {

        int wordCount = 0;
        while (values.hasNext()) {
          wordCount += values.next().get();
        }

        totalWordCount.set(wordCount);
        output.collect(key, totalWordCount);
      }
    }

  • 03-97Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Writing a MapReduce Program

    Examining our Sample MapReduce program

    The Driver Code

    The Mapper

    The Reducer

    The Combiner

    03-98Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Better Efficiency With a Combiner

    Often, Mappers produce large amounts of intermediate data
      That data must be passed to the Reducers
      This can result in a lot of network traffic

    It is often possible to specify a Combiner
      Like a 'mini-Reduce'
      Runs locally on a single Mapper's output
      Output from the Combiner is sent to the Reducers

    Combiner and Reducer code are often identical
      Technically, this is possible if the operation performed is commutative and associative
      If this is the case, you do not have to re-write the code
        Just specify the Combiner as being the same class as the Reducer

    Shows progress on the console. A much better view is available from the JobTracker UI: the number of tasks, click-through to the logs for each task attempt, and counters, e.g. data-local map tasks, number of bytes written, and record counts (map input, spilled, map output, combine input, reduce input, reduce output). If you understand these, then you have a great understanding of MapReduce; they are a synthesis of all the important concepts. Output is written to files in HDFS: a number of part files, one for each Reducer. Let's look at it.

  • 03-99Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Specifying The Combiner Class

    Specify the Combiner class in your driver code as:

    conf.setCombinerClass(MyCombiner.class);
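    To make this concrete for the word count job: since summing counts is commutative and associative, the SumReducer class itself can be specified as the Combiner. The driver below is a minimal sketch rather than the course's exact driver; the class name WordCountWithCombiner, the Mapper name WordMapper, and the argument handling are assumptions for illustration.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountWithCombiner {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountWithCombiner.class);
        conf.setJobName("wordcount-with-combiner");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(WordMapper.class);    // the Mapper from earlier in the course (name assumed)
        conf.setReducerClass(SumReducer.class);
        conf.setCombinerClass(SumReducer.class);  // reuse the Reducer as the Combiner

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
      }
    }

    Because the Combiner may be run zero, one, or many times on a Mapper's output, reusing the Reducer this way only works when the operation is commutative and associative, as noted above.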

    03-100Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Advanced MapReduce Programming

    1010.01

  • 03-101Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Advanced MapReduce Programming

    SequenceFiles

    Counters

    Using The DistributedCache

    03-102Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    What Are SequenceFiles?

    SequenceFiles are files containing binary-encoded key-value pairs

    Work naturally with Hadoop data types
      SequenceFiles include metadata which identifies the data types of the key and value

    Actually, three file types in one
      Uncompressed
      Record-compressed
      Block-compressed

    Often used in MapReduce
      Especially when the output of one job will be used as the input for another
      SequenceFileInputFormat
      SequenceFileOutputFormat (see the driver sketch below)
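    As an illustration of how those two classes are used, the fragment below configures a job (old mapred API) to read and write SequenceFiles. This is a sketch under assumptions: the helper class name is invented, and Text/IntWritable are assumed as the output key/value types.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class SequenceFileJobConfig {
      // Configure an existing JobConf so the job reads its input from
      // SequenceFiles and writes its output as SequenceFiles.
      public static void configure(JobConf conf) {
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        // Assumed output key/value classes; they must match what the
        // Reducer emits (Text/IntWritable in the word count example).
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
      }
    }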

  • 03-103Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Directly Accessing SequenceFiles

    Possible to directly read and write SequenceFiles from your code

    Example:

    Configuration config = new Configuration();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(config), path, config);

    Writable key = (Writable) reader.getKeyClass().newInstance();
    Writable value = (Writable) reader.getValueClass().newInstance();

    while (reader.next(key, value)) {
      // do something here
    }
    reader.close();
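    Writing a SequenceFile directly is symmetrical. The sketch below is illustrative rather than part of the course material; the output path and the Text/IntWritable key/value types are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        Path path = new Path("counts.seq");   // hypothetical output path

        // Create a writer; the file's metadata records the key and value classes.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            FileSystem.get(config), config, path, Text.class, IntWritable.class);

        try {
          writer.append(new Text("hadoop"), new IntWritable(1));
          writer.append(new Text("mapreduce"), new IntWritable(2));
        } finally {
          writer.close();
        }
      }
    }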

    03-104Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Advanced MapReduce Programming

    SequenceFiles

    Counters

    Using The DistributedCache

  • 03-105Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    What Are Counters?

    Counters provide a way for Mappers or Reducers to pass aggregate values back to the driver after the job has completed

    Their values are also visible from the JobTracker's Web UI
      And are reported on the console when the job ends

    Counters can be modified via the method

    Reporter.incrCounter(enum key, long val);

    Can be retrieved in the driver code as

    RunningJob job = JobClient.runJob(conf);
    Counters c = job.getCounters();

    03-106Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Counters: An Example

    static enum RecType { A, B }
    ...
    void map(KeyType key, ValueType value,
             OutputCollector output, Reporter reporter) {
      if (isTypeA(value)) {
        reporter.incrCounter(RecType.A, 1);
      } else {
        reporter.incrCounter(RecType.B, 1);
      }
      // actually process (k, v)
    }
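    On the driver side, the counter values can then be read back once the job has completed. The following is a minimal sketch, assuming the RecType enum above is declared in a Mapper class called RecordMapper (that class name is hypothetical).

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterReport {
      public static void report(JobConf conf) throws Exception {
        // runJob() blocks until the job completes, so the counters are final here.
        RunningJob job = JobClient.runJob(conf);
        Counters c = job.getCounters();

        // RecordMapper.RecType is the enum declared in the Mapper (name assumed).
        long typeA = c.getCounter(RecordMapper.RecType.A);
        long typeB = c.getCounter(RecordMapper.RecType.B);
        System.out.println("Type A records: " + typeA);
        System.out.println("Type B records: " + typeB);
      }
    }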

  • 03-107Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Counters: Caution

    Do not rely on a counter's value from the Web UI while a job is running

    Due to possible speculative execution, a counter's value could appear larger than the actual final value

    03-108Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    Advanced MapReduce Programming

    SequenceFiles

    Counters

    Using The DistributedCache

  • 03-109Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The DistributedCache: Motivation

    A common requirement is for a Mapper or Reducer to need access to some side data

      Lookup tables
      Dictionaries
      Standard configuration values
      Etc

    One option: read directly from HDFS in the configure method
      Works, but is not scalable

    The DistributedCache provides an API to push data to all slave nodes (see the sketch below)
      Transfer happens behind the scenes before any task is executed
      Note: the DistributedCache is read-only
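    For reference, one way to use that API directly from the driver is sketched below. This is an illustrative fragment only (the HDFS path is hypothetical); the -files mechanism on the next slide is usually the more convenient route.

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
      public static void addLookupTable(JobConf conf) throws Exception {
        // The file must already exist in HDFS; the path here is a hypothetical example.
        // Each slave node copies it locally before any task of this job runs.
        DistributedCache.addCacheFile(new URI("/user/training/lookup.txt"), conf);
      }
    }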

    03-110Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

    The DistributedCache: Mechanics

    Main class should implement Tool

    Invoke the ToolRunner

    Specify -files as an argument to hadoop jar ...
      Local files will be copied to your tasks automatically

    Use the local filesystem interface to read the files in the task (see the sketch below)
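    Putting the mechanics together, the sketch below shows a driver that implements Tool and a Mapper that reads the cached file in its configure method. It is an illustration under assumptions: the class names, jar name, and lookup.txt file are hypothetical, and error handling is kept minimal.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Run with:  hadoop jar myjob.jar SideDataJob -files lookup.txt in out
    // ToolRunner's GenericOptionsParser handles the -files option and arranges
    // for lookup.txt to be copied to every node that runs a task.
    public class SideDataJob extends Configured implements Tool {

      public static class LookupMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        private Set<String> lookup = new HashSet<String>();

        @Override
        public void configure(JobConf conf) {
          try {
            // The cached file appears in the task's working directory under its
            // original name, so it can be read with the local filesystem interface.
            BufferedReader in = new BufferedReader(new FileReader("lookup.txt"));
            String line;
            while ((line = in.readLine()) != null) {
              lookup.add(line.trim());
            }
            in.close();
          } catch (IOException e) {
            throw new RuntimeException("Could not read side data", e);
          }
        }

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          // Emit only records whose text appears in the lookup table.
          if (lookup.contains(value.toString())) {
            output.collect(value, new IntWritable(1));
          }
        }
      }

      public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), SideDataJob.class);
        conf.setJobName("side-data-example");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(LookupMapper.class);
        conf.setNumReduceTasks(0);            // map-only job, for simplicity
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new SideDataJob(), args));
      }
    }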