Copyright 2010 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Hadoop Fundamentals: HDFS, MapReduce, Hive, Pig
Tom White, Devoxx, Antwerp, 2010
HDFS and MapReduce
Three parts
Hadoop: Basic Concepts
Writing a MapReduce Program
Advanced MapReduce Programming
Training VM
http://www.cloudera.com/hadoop-training
Hello everyone!
In this talk I'm going to cover Hadoop: half on HDFS and MapReduce, half on Hive and Pig.
How many here have used Hadoop?
Introduction
Apache Hadoop Committer, PMC Member, Apache Member
Engineer at Cloudera working on core Hadoop
Founder of Apache Whirr, currently in the Incubator
Author of Hadoop: The Definitive Guide
http://hadoopbook.com
Hadoop: Basic Concepts
Hadoop: Basic Concepts
In this chapter you will learn
What Hadoop is
What features the Hadoop Distributed File System (HDFS) provides
The concepts behind MapReduce
Hadoop: Basic Concepts
What Is Hadoop?
The Hadoop Distributed File System (HDFS)
How MapReduce works
The Hadoop Project
Hadoop is an open-source project overseen by the Apache Software Foundation
Originally based on papers published by Google in 2003 and 2004
Hadoop committers work at several different organizations, including Facebook, Yahoo!, and Cloudera
Hadoop Components
Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and MapReduce
There are many other projects based around core Hadoop, often referred to as the Hadoop Ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
A set of machines running HDFS and MapReduce is known as a Hadoop Cluster
Individual machines are known as nodes. A cluster can have as few as one node, or as many as several thousand
More nodes = better performance!
Hadoop Components: HDFS
HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
Data is split into blocks and distributed across multiple nodes in the cluster
Each block is typically 64 MB or 128 MB in size
Each block is replicated multiple times. The default is to replicate each block three times. Replicas are stored on different nodes; this ensures both reliability and availability
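The arithmetic behind splitting and replication can be sketched in a few lines of plain Java. This is purely illustrative, not HDFS code; the class and method names are our own.

```java
// Illustrative only: how a file of a given size is cut into fixed-size
// blocks, and how many block copies the cluster stores under 3x
// replication. The last block of a file may be partial.
public class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB
    static final int REPLICATION = 3;

    // Number of blocks needed for a file of the given size
    public static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Total block replicas stored across the cluster for that file
    public static long totalReplicas(long fileSize) {
        return blockCount(fileSize) * REPLICATION;
    }

    public static void main(String[] args) {
        long fileSize = 150L * 1024 * 1024; // a 150 MB file
        System.out.println(blockCount(fileSize));    // 3 blocks: 64 + 64 + 22 MB
        System.out.println(totalReplicas(fileSize)); // 9 replicas cluster-wide
    }
}
```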
Hadoop Components: MapReduce
MapReduce is the system used to process data in the Hadoop cluster
Consists of two phases: Map, and then Reduce. Between the two is a stage known as the shuffle and sort
Each Map task operates on a discrete portion of the overall dataset
Typically one HDFS block of data
After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
Much more on this later!
Hadoop: Basic Concepts
What Is Hadoop?
The Hadoop Distributed File System (HDFS)
How MapReduce works
HDFS Basic Concepts
HDFS is a filesystem written in Java, based on Google's GFS
Sits on top of a native filesystem: ext3, xfs, etc.
Provides redundant storage for massive amounts of data Using cheap, unreliable computers
HDFS Basic Concepts (contd)
HDFS performs best with a modest number of large files: millions, rather than billions, of files, each typically 100 MB or more
HDFS is optimized for large, streaming reads of files Rather than random reads
Files in HDFS are write once. No random writes to files are allowed. Append support is available in Cloudera's Distribution for Hadoop (CDH) and in Hadoop 0.21
How Files Are Stored
Files are split into blocks. Each block is usually 64 MB or 128 MB
Data is distributed across many machines at load time. Different blocks from the same file will be stored on different machines. This provides for efficient MapReduce processing (see later)
Blocks are replicated across multiple machines, known as DataNodes
Default replication is three-fold, i.e., each block exists on three different machines
A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located
Known as the metadata
Hosting, project governance, community. Contributions from hundreds of individuals.
How Files Are Stored: Example
NameNode holds metadata for the two files (Foo.txt and Bar.txt)
DataNodes hold the actual blocks
Each block will be 64 MB or 128 MB in size
Each block is replicated three times on the cluster
More On The HDFS NameNode
The NameNode daemon must be running at all times. If the NameNode stops, the cluster becomes inaccessible. Your system administrator will take care to ensure that the NameNode hardware is reliable!
The NameNode holds all of its metadata in RAM for fast access It keeps a record of changes on disk for crash recovery
A separate daemon known as the Secondary NameNode takes care of some housekeeping tasks for the NameNode
Be careful: The Secondary NameNode is not a backup NameNode!
MapReduce depends on features of HDFS to work well. Talk on Wednesday. Linear scaling.
HDFS: Points To Note
Although files are split into 64 MB or 128 MB blocks, if a file is smaller than this, the full 64 MB/128 MB will not be used
Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's configuration files
This will be set by the system administrator
Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster
When a client application wants to read a file: it communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on. It then communicates directly with the DataNodes to read the data
The NameNode will not be a bottleneck
Accessing HDFS
Applications can read and write HDFS files directly via the Java API
Typically, files are created on a local filesystem and must be moved into HDFS
Likewise, files created in HDFS may need to be moved to a machine's local filesystem
Access to HDFS from the command line is achieved with the hadoop fs command
Storage: compare to local filesystems. Replication is the main way to get fault tolerance.
hadoop fs Examples
Copy file foo.txt from local disk to the user's home directory in HDFS
This will copy the file to /user/username/foo.txt
Get a directory listing of the user's home directory in HDFS
Get a directory listing of the HDFS root directory
hadoop fs -copyFromLocal foo.txt foo.txt
hadoop fs -ls
hadoop fs -ls /
hadoop fs Examples (contd)
Display the contents of the HDFS file /user/fred/bar.txt
Copy that file to the local disk, naming it baz.txt
Create a directory called input under the user's home directory
hadoop fs -cat /user/fred/bar.txt
hadoop fs -copyToLocal /user/fred/bar.txt baz.txt
hadoop fs -mkdir input
MapReduce is a programming model.
hadoop fs Examples (contd)
Delete the directory input_old and all its contents
hadoop fs -rmr input_old
Hadoop: Basic Concepts
What Is Hadoop?
The Hadoop Distributed File System (HDFS)
How MapReduce works
What Is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Each node processes data stored on that node Where possible
Consists of two phases: Map Reduce
Features of MapReduce
Automatic parallelization and distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers. MapReduce programs are usually written in Java
Can be written in any scripting language using Hadoop Streaming
All of Hadoop is written in Java
MapReduce abstracts all the housekeeping away from the developer
Developer can concentrate simply on writing the Map and Reduce functions
A filesystem with semantics that are looser than POSIX (e.g., read consistency). Uses the native filesystem on each storage node (DataNode). Scales to very large sizes: terabytes are the norm, petabytes are possible.
MapReduce: The Big Picture
MapReduce: Terminology
A job is a full program: a complete execution of Mappers and Reducers over a dataset
A task is the execution of a single Mapper or Reducer over a slice of data
A task attempt is a particular instance of an attempt to execute a task
There will be at least as many task attempts as there are tasks. If a task attempt fails, another will be started by the JobTracker. Speculative execution (see later) can also result in more task attempts than completed tasks
Geared towards large files. Takes advantage of disk transfer speed being faster than random reads. Can't update a file at a random position; the application pattern is to read, then rewrite the entire file. Appends are now possible in the latest releases, but are an advanced feature. HBase uses sync (to ensure the file is persisted).
MapReduce: The JobTracker
MapReduce jobs are controlled by a software daemon known as the JobTracker
The JobTracker resides on a master node. Clients submit MapReduce jobs to the JobTracker. The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
These nodes each run a software daemon known as the TaskTracker. The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
MapReduce: The Mapper
Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic
Multiple Mappers run in parallel, each processing a portion of the input data
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs
map(in_key, in_value) -> (inter_key, inter_value) list
Block size is configurable: a per-cluster default, overridable per file. Distribution: we'll see this in a moment. HDFS does not dynamically re-balance, but if a block is lost it will repair it. Metadata is stored in memory and on disk.
MapReduce: The Mapper (contd)
The Mapper may use or completely ignore the input key. For example, a standard pattern is to read a line of a file at a time: the key is the byte offset into the file at which the line starts; the value is the contents of the line itself. Typically the key is considered irrelevant
If it writes anything at all out, the output must be in the form of key/value pairs
Example Mapper: Upper Case Mapper
Turn input into upper case (pseudo-code):
let map(k, v) =
emit(k.toUpper(), v.toUpper())
('foo', 'bar') -> ('FOO', 'BAR')
('foo', 'other') -> ('FOO', 'OTHER')
('baz', 'more data') -> ('BAZ', 'MORE DATA')
Two files, 5 blocks. Block 1 is stored on the first three machines; block 2 on machines 1, 4 and 5. Clients can read any replica: load balancing.
Not shown here, but the blocks are spread across 2 racks, so the cluster can survive the failure of a rack.
Example Mapper: Explode Mapper
Output each input character separately (pseudo-code):
let map(k, v) =
  foreach char c in v:
    emit(k, c)
('foo', 'bar') -> ('foo', 'b'), ('foo', 'a'), ('foo', 'r')
('baz', 'other') -> ('baz', 'o'), ('baz', 't'), ('baz', 'h'), ('baz', 'e'), ('baz', 'r')
Example Mapper: Filter Mapper
Only output key/value pairs where the input value is a prime number (pseudo-code):
let map(k, v) = if (isPrime(v)) then emit(k, v)
('foo', 7) -> ('foo', 7)
('baz', 10) -> nothing
Very quick to look up block locations, for example. (Although these are cached by clients too.) The Secondary NameNode performs edit log rolling, so the transaction log is merged with the filesystem namespace image. The Secondary NameNode is a stale replica.
Example Mapper: Changing Keyspaces
The key output by the Mapper does not need to be identical to the input key
Output the word length as the key (pseudo-code):
let map(k, v) = emit(v.length(), v)
('foo', 'bar') -> (3, 'bar')
('baz', 'other') -> (5, 'other')
('foo', 'abracadabra') -> (11, 'abracadabra')
MapReduce: The Reducer
After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
This list is given to a Reducer. There may be a single Reducer, or multiple Reducers; this is specified as part of the job configuration (see later)
All values associated with a particular intermediate key are guaranteed to go to the same Reducer
The intermediate keys, and their value lists, are passed to the Reducer in sorted key order. This step is known as the shuffle and sort
The Reducer outputs zero or more final key/value pairs. These are written to HDFS. In practice, the Reducer usually emits a single key/value pair for each input key
Differs from normal local filesystem blocks. You can see the files in a directory on the DataNode, but DataNodes don't know which files their blocks are a part of. Data does *not* flow through the NameNode.
Example Reducer: Sum Reducer
Add up all the values associated with each intermediate key (pseudo-code):
let reduce(k, vals) =
  sum = 0
  foreach int i in vals:
    sum += i
  emit(k, sum)
('foo', [9, 3, -17, 44]) -> ('foo', 39)
('bar', [123, 100, 77]) -> ('bar', 300)
Example Reducer: Identity Reducer
The Identity Reducer is very common (pseudo-code):
let reduce(k, vals) =
  foreach v in vals:
    emit(k, v)
('foo', [9, 3, -17, 44]) -> ('foo', 9), ('foo', 3),
('foo', -17), ('foo', 44)
('bar', [123, 100, 77]) -> ('bar', 123), ('bar', 100), ('bar', 77)
The central point is the FileSystem class. There are other ways of getting data into HDFS: Flume for streaming data, or Sqoop for data from a database.
MapReduce Execution
MapReduce: Data Localization
Whenever possible, Hadoop will attempt to ensure that a Map task on a node is working on a block of data stored locally on that node via HDFS
If this is not possible, the Map task will have to transfer the data across the network as it processes that data
Once the Map tasks have finished, data is then transferred across the network to the Reducers
Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers
All Mappers will, in general, have to communicate with all Reducers
Copy the file foo.txt from the local directory into the home directory in HDFS. These are like the regular Unix filesystem commands, except prefixed by hadoop fs.
MapReduce: Is Shuffle and Sort a Bottleneck?
It appears that the shuffle and sort phase is a bottleneck: no Reducer can start until all Mappers have finished
In practice, Hadoop will start to transfer data from Mappers to Reducers as the Mappers finish work
This avoids a huge amount of data transfer starting as soon as the last Mapper finishes
MapReduce: Is a Slow Mapper a Bottleneck?
It is possible for one Map task to run more slowly than the others, perhaps due to faulty hardware, or just a very slow machine
It would appear that this would create a bottleneck: no Reducer can start until every Mapper has finished
Hadoop uses speculative execution to mitigate this: if a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data
The results of the first Mapper to finish will be used; Hadoop will kill off the Mapper which is still running
MapReduce: The Combiner
Often, Mappers produce large amounts of intermediate data. That data must be passed to the Reducers, and this can result in a lot of network traffic
It is often possible to specify a Combiner: like a mini-Reduce, it runs locally on a single Mapper's output, and output from the Combiner is sent to the Reducers
Combiner and Reducer code are often identical. Technically, this is possible if the operation performed is commutative and associative
MapReduce Example: Word Count
Count the number of occurrences of each word in a large amount of input data
This is the 'hello world' of MapReduce programming
map(String input_key, String input_value)
  foreach word w in input_value:
    emit(w, 1)

reduce(String output_key, Iterator intermediate_vals)
  set count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)
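To see the flow end to end, here is a purely illustrative in-memory simulation in plain Java. It is not the Hadoop API, and the class and method names are our own: map emits (word, 1) pairs, the shuffle and sort groups values by key in sorted order, and reduce sums each value list.

```java
import java.util.*;

// Illustrative simulation of the word count MapReduce flow.
public class WordCountSim {

    public static SortedMap<String, Integer> run(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every line
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        // Shuffle and sort: group values by key, keys in sorted order
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : intermediate) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // Reduce phase: sum the value list for each key
        SortedMap<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> counts = run(Arrays.asList(
            "the cat sat on the mat",
            "the aardvark sat on the sofa"));
        System.out.println(counts); // {aardvark=1, cat=1, mat=1, on=2, sat=2, sofa=1, the=4}
    }
}
```

The output matches the Reducer output shown on the next slides.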
MapReduce Example: Word Count (contd)
Input to the Mapper:
Output from the Mapper:
(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
MapReduce Example: Word Count (contd)
Intermediate data sent to the Reducer:
Final Reducer Output:
('aardvark', [1]), ('cat', [1]), ('mat', [1]), ('on', [1, 1]), ('sat', [1, 1]), ('sofa', [1]), ('the', [1, 1, 1, 1])

('aardvark', 1), ('cat', 1), ('mat', 1), ('on', 2), ('sat', 2), ('sofa', 1), ('the', 4)
Word Count With Combiner
A Combiner would reduce the amount of data sent to the Reducer
Intermediate data sent to the Reducer after a Combiner using the same code as the Reducer:
Combiners decrease the amount of network traffic required during the shuffle and sort phase
They often also decrease the amount of work that needs to be done by the Reducer
('aardvark', [1]), ('cat', [1]), ('mat', [1]), ('on', [2]), ('sat', [2]), ('sofa', [1]), ('the', [4])
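The claim that a sum Combiner is safe can be checked with a small sketch in plain Java (illustrative only, not Hadoop code; names are our own): summing each Mapper's values first, then summing the partial sums, gives the same answer as a single global sum, because addition is commutative and associative. An operation like a mean would not survive this two-step application.

```java
import java.util.*;

// Illustrative check that a sum Combiner does not change the result.
public class CombinerCheck {

    static int sum(List<Integer> vals) {
        int s = 0;
        for (int v : vals) s += v;
        return s;
    }

    // Reduce applied directly to all values for one key
    public static int reduceDirect(List<List<Integer>> perMapper) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> part : perMapper) all.addAll(part);
        return sum(all);
    }

    // Combiner on each Mapper's output, then Reduce over the partial sums
    public static int combineThenReduce(List<List<Integer>> perMapper) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> part : perMapper) partials.add(sum(part));
        return sum(partials);
    }

    public static void main(String[] args) {
        // Values for the key 'the' as emitted by two separate Mappers
        List<List<Integer>> perMapper = Arrays.asList(
            Arrays.asList(1, 1, 1),  // Mapper 1 saw 'the' three times
            Arrays.asList(1));       // Mapper 2 saw 'the' once
        System.out.println(reduceDirect(perMapper));      // 4
        System.out.println(combineThenReduce(perMapper)); // 4
    }
}
```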
Writing a MapReduce Program
MapReduce strives to minimize data movement, since bandwidth is the scarcest resource on the cluster.
Writing a MapReduce Program
In this chapter you will learn
How to use the Hadoop API to write a MapReduce program in Java
Writing a MapReduce Program
Examining a Sample MapReduce program
The Driver Code
The Mapper
The Reducer
The Combiner
MapReduce takes care of dividing up the work to be done, and deciding where to execute it, using a work scheduler. Users don't need to worry about units of work (tasks) failing, since they will be re-run. We'll see how to monitor a job. A very simple abstraction, but a very powerful one: a wide range of data processing problems can be solved. Unlike other models (e.g. MPI), it is high-level.
Demo
Before we discuss how to write a MapReduce program in Java, let's run and examine a simple MapReduce job
Components of a MapReduce Program
MapReduce programs generally consist of three portions: the Mapper, the Reducer, and the driver code
We will look at each element in turn
Note: Your MapReduce program may also contain other elements: a Combiner (often the same code as the Reducer), a custom partitioner, etc.
The map function produces key-value pairs. The MapReduce system aggregates the map output by key and feeds it to the reducers. Reducers process all the values for a given key.
We'll see more details on keys and values in a moment.
Writing a MapReduce Program
Examining our Sample MapReduce program
The Driver Code
The Mapper
The Reducer
The Combiner
The Driver: Complete Code
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
The Driver: Import Statements
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
You will typically import these classes into every MapReduce job you write. We will omit the import statements in future slides for brevity.
The Driver: Main Code (contd)
public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
Jobs are submitted to the JobTracker. It coordinates the job, scheduling tasks on nodes on the cluster (the TaskTrackers). TaskTrackers reside on the same nodes as DataNodes. TaskTrackers run the code you write; the JobTracker doesn't.
The Driver: Main Code
public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
You usually configure your MapReduce job in the main method of your driver code. Here, we first check to ensure that the user has specified the HDFS directories to use for input and output on the command line.
Configuring The Job With JobConf
public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
To configure your job, create a new JobConf object and specify the class which will be called to run the job.
The Map phase is embarrassingly parallel. The input data is key-value pairs. The map may produce multiple (0 or more) output records for each input record. Intermediate data is not stored in HDFS.
Creating a New JobConf Object
The JobConf class allows you to set configuration options for your MapReduce job: the classes to be used for your Mapper and Reducer, the input and output directories, and many other options
Any options not explicitly set in your driver code will be read from your Hadoop configuration files, usually located in /etc/hadoop/conf
Any options not specified in your configuration files will receive Hadoops default values
Naming The Job
public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
First, we give our job a meaningful name.
What the input key-values are depends on the input format, e.g. text files. You write key-values, but they may be of different types, e.g. the key a string, the value an integer.
Specifying Input and Output Directories
public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
Next, we specify the input directory from which data will be read, and the output directory to which our final output will be written.
Determining Which Files To Read
By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers
Exceptions: items whose names begin with a period (.) or underscore (_)
Globs can be specified to restrict input For example, /2010/*/01/*
Alternatively, FileInputFormat.addInputPath() can be called multiple times, specifying a single file or directory each time
More advanced filtering can be performed by implementing a PathFilter (see later)
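The default hidden-file rule above (names beginning with a period or underscore are skipped) can be mimicked in a few lines of plain Java. This is an illustrative sketch, not the actual PathFilter implementation; the class and file names are our own.

```java
import java.util.*;

// Illustrative sketch of the default input filtering rule:
// names beginning with '.' or '_' are not sent to Mappers.
public class HiddenFileRule {

    public static boolean accepted(String name) {
        return !name.startsWith(".") && !name.startsWith("_");
    }

    public static List<String> filterInputs(List<String> names) {
        List<String> out = new ArrayList<>();
        for (String n : names) {
            if (accepted(n)) out.add(n);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("part-00000", "_SUCCESS", ".part-00001.crc");
        System.out.println(filterInputs(files)); // [part-00000]
    }
}
```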
Getting Data to the Mapper
The data passed to the Mapper is specified by an InputFormat: specified in the driver code, it defines the location of the input data (a file or directory, for example), and determines how to split the input data into input splits
Each Mapper deals with a single input split
InputFormat is a factory for RecordReader objects to extract (key, value) records from the input source
Getting Data to the Mapper (contd)
Multiple emits
Some Standard InputFormats
FileInputFormat The base class used for all file-based InputFormats
TextInputFormat The default. Treats each \n-terminated line of a file as a value; the key is the byte offset within the file of that line
KeyValueTextInputFormat Maps \n-terminated lines as key SEP value; by default, the separator is a tab
SequenceFileInputFormat Binary file of (key, value) pairs with some additional metadata
SequenceFileAsTextInputFormat Similar, but maps (key.toString(), value.toString())
Specifying Final Output With OutputFormat
FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
The driver can also specify the format of the output data Default is a plain text file Could be explicitly written as
conf.setOutputFormat(TextOutputFormat.class);
We will discuss OutputFormats in more depth in a later chapter
(Diagram note: a Mapper may also emit nothing for a given input record.)
Specify The Classes for Mapper and Reducer
public class WordCount {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
Give the JobConf object information about which classes are to be instantiated as the Mapper and Reducer. You also specify the classes for the intermediate and final keys and values
Running The Job
public class WordCount {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
Finally, run the job by calling the runJob method.
The intermediate (map output) key type can differ from the final output key type. The value type can be different too.
Running The Job (contd)
There are two ways to run your MapReduce job:
JobClient.runJob(conf) blocks (waits for the job to complete before continuing)
JobClient.submitJob(conf) does not block (driver code continues as the job is running)
JobClient determines the proper division of input data into InputSplits
JobClient then sends the job information to the JobTracker daemon on the cluster
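A sketch of the non-blocking variant, assuming the 0.20-era API used throughout this chapter (it needs the Hadoop jars and a live cluster to actually run, so treat it as illustrative only):

```java
JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);   // returns immediately

// The driver can poll the job (or do other work) while it runs
while (!job.isComplete()) {
  System.out.printf("map %.0f%%, reduce %.0f%%%n",
      job.mapProgress() * 100, job.reduceProgress() * 100);
  Thread.sleep(5000);
}
```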
Reprise: Driver Code
public class WordCount {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(WordMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);

    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
The values are not materialized as a list; they are given to the user as an Iterator. A Reducer may process multiple keys. Each key arrives with *all* the values associated with it; it is not the case that some go to reducer A and some to reducer B. This is a very useful property.
Writing a MapReduce Program
Examining our Sample MapReduce program
The Driver Code
The Mapper
The Reducer
The Combiner
The Mapper: Complete Code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
Sums the values for a given key. If there were one reducer, it would process both foo's and bar's data. It would actually process bar first, since MapReduce guarantees that the keys given to a particular reducer arrive in sorted order.
The Mapper: import Statements
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
You will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Mapper you write. We have also imported java.util.StringTokenizer, as we will need this for our particular Mapper. We will omit the import statements in future slides for brevity.
The Mapper: Main Code
public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
The Mapper: Main Code (contd)
public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
Your Mapper class should extend MapReduceBase, and will implement the Mapper interface. The Mapper interface expects four type parameters, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types. In our example, because we are going to ignore the input key, we can just specify that it will be an Object; we do not need to be more specific than that.
Mapper Parameters
The Mapper's type parameters define the input and output key/value types
Keys are always WritableComparable
Values are always Writable
Another picture of the process: in general, each reducer is fed by a part of the output from each mapper.
What is Writable?
Hadoop defines its own box classes for strings, integers and so on
IntWritable for ints
LongWritable for longs
FloatWritable for floats
DoubleWritable for doubles
Text for strings
Etc.
The Writable interface makes serialization quick and easy for Hadoop
Any value type must implement the Writable interface
What is WritableComparable?
A WritableComparable is a Writable which is also Comparable
Two WritableComparables can be compared against each other to determine their order
Keys must be WritableComparables because they are passed to the Reducer in sorted order
We will talk more about WritableComparable later
Note that despite their names, all Hadoop box classes implement both Writable and WritableComparable
For example, IntWritable is actually a WritableComparable
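As a concrete sketch, a custom key type must implement both halves of the contract: write()/readFields() for serialization, and compareTo() for ordering. (YearMonthWritable is a hypothetical example, not a class Hadoop provides; it needs the Hadoop jars to compile.)

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class YearMonthWritable
    implements WritableComparable<YearMonthWritable> {

  private int year;
  private int month;

  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeInt(month);
  }

  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    month = in.readInt();
  }

  // Defines the sort order seen by the Reducer: by year, then by month
  public int compareTo(YearMonthWritable other) {
    if (year != other.year) {
      return year < other.year ? -1 : 1;
    }
    return month < other.month ? -1 : (month == other.month ? 0 : 1);
  }
}
```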
Hadoop will try to schedule map tasks on nodes with local data. If that is not possible, then rack-local; otherwise off-rack, normally a small proportion of tasks. Reducers are scheduled anywhere.
Creating Objects: Efficiency
public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
We create two objects outside of the map function we are about to write. One is a Text object, the other an IntWritable. We do this here for efficiency, as we will show.
Using Objects Efficiently in MapReduce
A typical way to write Java code might look something like this:
Problem: this creates a new object each time around the loop
Your map function will probably be called many thousands or millions of times for each Map task
Resulting in millions of objects being created
This is very inefficient: running the garbage collector takes time
while (more input exists) {
  myIntermediate = new intermediate(input);
  myIntermediate.doSomethingUseful();
  export outputs;
}
Using Objects Efficiently in MapReduce (contd)
A more efficient way to code:
Only one object is created
It is populated with different data each time around the loop
Note that for this to work, the intermediate class must be mutable
All relevant Hadoop classes are mutable

myIntermediate = new intermediate(junk);
while (more input exists) {
  myIntermediate.setupState(input);
  myIntermediate.doSomethingUseful();
  export outputs;
}
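The same pattern can be demonstrated with plain Java. In this minimal sketch, IntHolder is a stand-in for a mutable Hadoop class such as IntWritable; only one object is allocated no matter how many inputs are processed:

```java
public class ReuseDemo {

    // A minimal mutable holder, standing in for Hadoop's IntWritable
    static class IntHolder {
        private int value;
        void set(int value) { this.value = value; }
        int get() { return value; }
    }

    // Sums the inputs while allocating only one holder object
    public static int sumWithReuse(int[] inputs) {
        IntHolder holder = new IntHolder();  // created once, outside the loop
        int total = 0;
        for (int input : inputs) {
            holder.set(input);               // repopulated each time around
            total += holder.get();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumWithReuse(new int[] {1, 2, 3, 4})); // 10
    }
}
```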
Using Objects Efficiently in MapReduce (contd)
Reusing objects allows for much better cache usage. This provides a significant performance benefit (up to 2x)
All keys and values given to you by Hadoop use this model
Caution! You must take this into account when you write your code
For example, if you create a list of all the objects passed to your map method you must be sure to do a deep copy
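This trap can be reproduced in plain Java. Here Holder stands in for a Hadoop value such as Text; the framework-style loop reuses one object, so storing references (rather than copies) leaves the list full of the final value:

```java
import java.util.ArrayList;
import java.util.List;

public class DeepCopyDemo {

    // A minimal mutable value, standing in for Hadoop's Text
    static class Holder {
        String value;
        Holder(String value) { this.value = value; }
    }

    // Simulates a framework that reuses one object across calls,
    // as Hadoop does with the values passed to map() and reduce()
    public static List<String> collect(String[] inputs, boolean deepCopy) {
        Holder reused = new Holder("");
        List<Holder> kept = new ArrayList<Holder>();
        for (String input : inputs) {
            reused.value = input;                          // object repopulated
            kept.add(deepCopy ? new Holder(reused.value)   // safe: a copy
                              : reused);                   // unsafe: a reference
        }
        List<String> out = new ArrayList<String>();
        for (Holder h : kept) {
            out.add(h.value);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "c"};
        System.out.println(collect(inputs, false)); // [c, c, c]
        System.out.println(collect(inputs, true));  // [a, b, c]
    }
}
```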
The map Method
public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
The map method's signature looks like this. It will be passed a key, a value, an OutputCollector object and a Reporter object. You specify the data types that the OutputCollector object will write.
The map Method (contd)
One instance of your Mapper is instantiated per task attempt
This exists in a separate process from any other Mapper, usually on a different physical machine
No data sharing between Mappers is allowed!
Commutative and associative roughly means that order doesn't matter, e.g. calculating the sum of a set of numbers. Calculating the mean is *not* commutative and associative.
The map Method: Processing The Line
public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
Within the map method, we split each line into separate words using Java's StringTokenizer class.
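In isolation, that splitting step behaves as follows (plain Java; no Hadoop required):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {

    // Splits a line on whitespace, just as the Mapper does
    public static List<String> words(String line) {
        List<String> result = new ArrayList<String>();
        StringTokenizer wordList = new StringTokenizer(line);
        while (wordList.hasMoreTokens()) {
            result.add(wordList.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("the cat sat on the mat"));
        // [the, cat, sat, on, the, mat]
    }
}
```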
Outputting Intermediate Data
public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
To emit a (key, value) pair, we call the collect method of our OutputCollector object. The key will be the word itself, the value will be the number 1. Recall that the output key must be of type WritableComparable, and the value must be a Writable. We have created a Text object called word, so we can simply put a new value into that object. We have also created an IntWritable object called one.
Reprise: The Map Method
public class WordMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, IntWritable> {

  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);

  public void map(Object key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    // Break line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());

    while (wordList.hasMoreTokens()) {
      word.set(wordList.nextToken());
      output.collect(word, one);
    }
  }
}
The Reporter Object
Notice that in this example we have not used the Reporter object passed into the Mapper
The Reporter object can be used to pass some information back to the driver code
Also allows the Mapper to retrieve information about the input split it is working on
We will investigate the Reporter later in the course
Writing a MapReduce Program
Examining our Sample MapReduce program
The Driver Code
The Mapper
The Reducer
The Combiner
The Reducer: Complete Code
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
The Reducer: Import Statements
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
As with the Mapper, you will typically import java.io.IOException, and the org.apache.hadoop classes shown, in every Reducer you write. You will also import java.util.Iterator, which will be used to step through the values provided to the Reducer for each key. We will omit the import statements in future slides for brevity.
The Reducer: Main Code
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
The Reducer: Main Code (contd)
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
Your Reducer class should extend MapReduceBase and implement Reducer. The Reducer interface expects four type parameters, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types. The keys are WritableComparables, the values are Writables.
Object Reuse for Efficiency
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
As before, we create an IntWritable object outside of the actual reduce method for efficiency.
The reduce Method
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
The reduce method receives a key and an Iterator of values; it also receives an OutputCollector object and a Reporter object.
Processing The Values
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
We use the hasNext() and next() methods on values to step through all the elements in the iterator, calling get() on each IntWritable to retrieve its int value. In our example, we are merely adding all the values together.
Writing The Final Output
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
Finally, we place the total into our totalWordCount object and write the output (key, value) pair using the collect method of our OutputCollector object.
Reprise: The Reduce Method
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable totalWordCount = new IntWritable();

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {

    int wordCount = 0;
    while (values.hasNext()) {
      wordCount += values.next().get();
    }

    totalWordCount.set(wordCount);
    output.collect(key, totalWordCount);
  }
}
Writing a MapReduce Program
Examining our Sample MapReduce program
The Driver Code
The Mapper
The Reducer
The Combiner
Better Efficiency With a Combiner
Often, Mappers produce large amounts of intermediate data
That data must be passed to the Reducers
This can result in a lot of network traffic

It is often possible to specify a Combiner
Like a mini-Reduce
Runs locally on a single Mapper's output
Output from the Combiner is sent to the Reducers

Combiner and Reducer code are often identical
Technically, this is possible if the operation performed is commutative and associative
If this is the case, you do not have to re-write the code
Just specify the Combiner as being the same class as the Reducer
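A quick way to check whether an operation is combiner-safe is to compare combining partial results against computing over all the data at once. In this plain-Java sketch, summing passes the check but taking the mean does not:

```java
public class CombinerCheck {

    public static int sum(int[] xs) {
        int s = 0;
        for (int x : xs) s += x;
        return s;
    }

    public static double mean(int[] xs) {
        return (double) sum(xs) / xs.length;
    }

    public static void main(String[] args) {
        int[] a = {1, 2};              // first mapper's output for some key
        int[] b = {3, 4, 5};           // second mapper's output for the same key
        int[] all = {1, 2, 3, 4, 5};   // everything, as one reducer would see it

        // Sum of partial sums equals the overall sum: combiner-safe
        System.out.println(sum(all) == sum(new int[] {sum(a), sum(b)})); // true

        // Mean of partial means differs from the overall mean: NOT combiner-safe
        System.out.println(mean(all) == (mean(a) + mean(b)) / 2);        // false
    }
}
```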
The job shows progress on the console, but there is a much better view from the JobTracker UI: the number of tasks, click-through to the logs for each task attempt, and counters, e.g. data-local map tasks, number of bytes written, and record counts (map input, spilled, map output, combine input, reduce input, reduce output). If you understand these, then you will have a great understanding of MapReduce; they are a synthesis of all the important concepts. Output is written to files in HDFS, as a number of part files, one for each reducer. Let's look at it.
Specifying The Combiner Class
Specify the Combiner class in your driver code as:
conf.setCombinerClass(MyCombiner.class)
Advanced MapReduce Programming
Advanced MapReduce Programming
SequenceFiles
Counters
Using The DistributedCache
What Are SequenceFiles?
SequenceFiles are files containing binary-encoded key-value pairs
Work naturally with Hadoop data types
SequenceFiles include metadata which identifies the data types of the key and value

Actually, three file types in one:
Uncompressed
Record-compressed
Block-compressed

Often used in MapReduce, especially when the output of one job will be used as the input for another
SequenceFileInputFormat
SequenceFileOutputFormat
Directly Accessing SequenceFiles
Possible to directly read and write SequenceFiles from your code
Example:
Configuration config = new Configuration();
SequenceFile.Reader reader =
    new SequenceFile.Reader(FileSystem.get(config), path, config);

Writable key = (Writable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();

while (reader.next(key, value)) {
  // do something here
}
reader.close();
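Writing a SequenceFile directly is similar. This sketch uses the same era's API (SequenceFile.createWriter); it needs the Hadoop jars and a valid path to run, so treat it as illustrative:

```java
Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);

SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, config, path, Text.class, IntWritable.class);
try {
  writer.append(new Text("key"), new IntWritable(1));
} finally {
  writer.close();
}
```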
Advanced MapReduce Programming
SequenceFiles
Counters
Using The DistributedCache
What Are Counters?
Counters provide a way for Mappers or Reducers to pass aggregate values back to the driver after the job has completed
Their values are also visible from the JobTracker's Web UI, and are reported on the console when the job ends
Counters can be modified via the method:

Reporter.incrCounter(enum key, long val);

They can be retrieved in the driver code as:

RunningJob job = JobClient.runJob(conf);
Counters c = job.getCounters();
Counters: An Example
static enum RecType { A, B }
...
void map(KeyType key, ValueType value,
    OutputCollector output, Reporter reporter) {
  if (isTypeA(value)) {
    reporter.incrCounter(RecType.A, 1);
  } else {
    reporter.incrCounter(RecType.B, 1);
  }
  // actually process (k, v)
}
Counters: Caution
Do not rely on a counter's value from the Web UI while a job is running
Due to possible speculative execution, a counter's value could appear larger than the actual final value
Advanced MapReduce Programming
SequenceFiles
Counters
Using The DistributedCache
The DistributedCache: Motivation
A common requirement is for a Mapper or Reducer to need access to some side data
Lookup tables, dictionaries, standard configuration values, etc.
One option: read directly from HDFS in the configure method. This works, but is not scalable
The DistributedCache provides an API to push data to all slave nodes
Transfer happens behind the scenes before any task is executed
Note: the DistributedCache is read only
The DistributedCache: Mechanics
Main class should implement Tool
Invoke the ToolRunner
Specify -files as an argument to hadoop jar ..., listing local files
The local files will be copied to your tasks automatically
Use local filesystem interface to read the files in the task
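Putting the pieces together, a driver that supports -files might look like the sketch below. MyJob and lookup.txt are hypothetical names, and the code needs the Hadoop jars to compile:

```java
public class MyJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), MyJob.class);
    // ... set input/output paths, Mapper and Reducer classes as usual ...
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic options such as -files before calling run()
    System.exit(ToolRunner.run(new MyJob(), args));
  }
}

// Invocation:
//   hadoop jar myjob.jar MyJob -files lookup.txt input output
// lookup.txt is then available in each task's working directory and can be
// opened with ordinary java.io file APIs.
```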