78
www.helsinki.fi Hadoop and MapReduce Guest Lecturer: Jiaheng Lu Homepage: https://www.cs.helsinki.fi/u/jilu / Autumn 2017 17.9.2017 1 Big Data Framework “Introduction to Data Science”

Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Hadoop and MapReduce

Guest Lecturer: Jiaheng Lu

Homepage: https://www.cs.helsinki.fi/u/jilu/

Autumn 2017

17.9.2017 1

Big Data Framework

“Introduction to Data Science”

Page 2: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Outline

• Big data and Google File System (GFS)

• Hadoop and HDFS

• MapReduce and examples

• Hands-on exercise on table join

• Questions and answers for quiz

Page 3: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

• One Big challenge in the era of Big Data:

• How to efficiently handle big data?

• Make big data divided

• Hadoop, GFS, MapReduce

• Make big data small

• FM Sketch, Count Sketch, Count Min Sketch

17.9.2017 3

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Two ways to handle big data

Page 4: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

• One Big challenge in the era of Big Data:

• How to efficiently handle big data?

• Make big data divided

• Hadoop, GFS, MapReduce

• Make big data small

• FM Sketch, Count Sketch, Count Min Sketch

17.9.2017 4

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Two ways to handle big data

this lecture

To appear in

“Introduction to Big

´Data Management”

Page 5: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

The Google File System(GFS)

A scalable distributed file system for large

distributed data intensive applications

MapReduce Bigtable

Google File System

Page 6: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

GFS: Introduction

Shares many same goals as previous

distributed file systems

performance, scalability, reliability, etc

GFS design has been driven by four key

observations of Google application

workloads and technological environment

Page 7: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Intro: Observations

•1. Component failures are the norm

constant monitoring, error detection, fault tolerance

and automatic recovery are integral to the system

•2. Huge files (by traditional standards)

Multi GB files are common

I/O operations and blocks sizes must be revisited

Page 8: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Intro: Observations (Contd)

• 3. Most files are mutated by appending new data

This is the focus of performance optimization and atomicity

guarantees

• 4. Co-designing the applications and APIs

benefits overall system by increasing flexibility

Page 9: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

The Design

Cluster consists of a single master and multiple

chunkservers and is accessed by multiple clients

Page 10: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

The Master

Maintains all file system metadata.

names space, access control info, file to chunk mappings, chunk

(including replicas) location, etc.

Periodically communicates with chunkservers in HeartBeat

messages to give instructions and check state

Page 11: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Chunkservers

Files are broken into chunks. Each chunk has

a globally unique 64-bit chunk-handle.

handle is assigned by the master at chunk creation

Chunk size is 64 MB

Each chunk is replicated on 3 (default)

servers

Page 12: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

GFS paper

• More information on data update and performance of

GFS, read the original paper:

• http://static.googleusercontent.com/media/research.g

oogle.com/en//archive/bigtable-osdi06.pdf

2017/9/17 12

Page 13: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Outline

• Google File System (GFS)

• Hadoop and HDFS

• MapReduce and examples

• Hands-on exercise on table join

• Questions and answers for quiz

Page 14: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

What is Hadoop?

• Apache top level project, open-source

implementation of frameworks for reliable, scalable,

distributed computing and data storage.

Page 15: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Hadoop’s Developers

2005: Doug Cutting and Michael J. Cafarella developed

Hadoop to support distribution for the Nutch search

engine project.

The project was funded by Yahoo.

2006: Yahoo gave the project to Apache Software

Foundation.

Page 16: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Some Hadoop Milestones

• 2008 - Hadoop Wins Terabyte Sort Benchmark (sorted 1 terabyte of

data in 209 seconds, compared to previous record of 297 seconds)

• 2010 - Hadoop's Hbase, Hive and Pig subprojects completed,

adding more computational power to Hadoop framework

• 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha.

- Ambari, Cassandra, Mahout have been added

• 2016 - Hadoop 3.0.0 Alpha-1

Page 17: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Google Origins

2003

2004

2006

Page 18: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi 17.9.2017 18

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Page 19: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

• Hadoop Common - libraries and utilities

• Hadoop Distributed File System (HDFS) – a distributed

file-system

• Hadoop YARN – a resource-management platform,

scheduling

• Hadoop MapReduce – a programming model for large

scale data processing

17.9.2017 19

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

The Basic Hadoop Components

Page 20: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

• Single NameNode - a master server that manages the file

system namespace and regulates access to files by clients.

• Multiple DataNodes – typically one per node in the cluster.

Functions: Manage storage, serving read/write requests from

clients

17.9.2017 20

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Original HDFS Design

Page 21: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Unique features of HDFS

• Failure tolerant - data is duplicated across multiple

DataNodes to protect against machine failures.

• Scalability - data transfers happen directly with the

DataNodes so your read/write capacity scales fairly well

with the number of DataNodes

21

Page 22: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

HDFS Architecture

22

Page 23: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

• Watch two videos on Hadoop

• https://www.youtube.com/watch?v=9s-vSeWej1U

• https://www.youtube.com/watch?v=4DgTLaFNQq0

17.9.2017 23

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Page 24: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Outline

• Google File System (GFS)

• Hadoop and HDFS

• MapReduce and examples

• Hands-on exercise on table join

• Questions and answers for quiz

Page 25: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

MapReduce: Insight

Consider the problem of counting the number of frequency of each word in a large collection of documents

( Trump)

( Donald Trump)

(Trump Clinton)

(USA President)

(Donald Trump)

(President election)

( Donald, 2)

(election, 1)

(Clinton, 1)

( Trump, 4)

( USA, 1)

Page 26: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Simple example: Word count

Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

Each mapper

receives some

of documents

as input

1

( Trump)

( Donald Trump)

(Trump Clinton)

(USA President)

(Donald Trump)

(President election)

Page 27: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Simple example: Word count

Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

( Trump)

( Donald Trump)

(Trump Clinton)

(President election)

(USA President)

(Donald Trump)

Each mapper

receives some

of documents

as input

Mappers

process the

KV-pairs.

1 2

( Trump, 1)

( Donald, 1), (Trump, 1)

( President,1),(election, 1)

( Trump, 1), (Clinton, 1)

( Donald,1),(Trump, 1)

( USA, 1), (President, 1)

Page 28: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Simple example: Word count

Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

( Trump)

( Donald Trump)

(Trump Clinton)

(President election)

(USA President)

(Donald Trump)

Each mapper

receives some

of documents

as input

Mappers

process the

KV-pairs.

1 2

( Trump, 1)

( Donald, 1)

(election, 1)

(Clinton, 1)

(Trump, 1)

(President, 1)

Each KV-pair output

by the mapper is sent

to the reducer

3

( Trump, 1)

( Trump, 1)

( President,1)

( USA, 1)

( Donald,1)

Page 29: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Simple example: Word count

Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

( Trump)

( Donald Trump)

(Trump Clinton)

(President election)

(USA President)

(Donald Trump)

Each mapper

receives some

of documents

as input

Mappers

process the

KV-pairs.

1 2

( Trump, 1)

( Donald, 1)

(election, 1)

(Clinton, 1)

(Trump, 1)

(President, 1)

Each KV-pair output

by the mapper is sent

to the reducer

3

( Trump, 1) ( Trump, 1)

( President,1)

( USA, 1)

( Donald,1)

The reducers

sort their input

by key

4

Page 30: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Simple example: Word count

Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

( Trump)

( Donald Trump)

(Trump Clinton)

(President election)

(USA President)

(Donald Trump)

Each mapper

receives some

of documents

as input

Mappers

process the

KV-pairs.

1 2

( Donald, 2)

(election, 1)

(Clinton, 1)

(President, 2)

Each KV-pair output

by the mapper is sent

to the reducer

3

( Trump, 4)

( USA, 1)

The reducers

sort their input

by key

4 The reducers

process their

input

5

Page 31: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

MapReduce dataflow

31

Mapper

Mapper

Mapper

Mapper

Reducer

Reducer

Reducer

Reducer

Input data

Outp

ut

data

"The Shuffle"

Intermediate

(key,value) pairs

Page 32: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

Pseudo-code

map(String input_key, String input_value):

// input_key: document name

// input_value: document contents

for each word w in input_value:

EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):

// output_key: a word

// output_values: a list of counts

int result = 0;

for each v in intermediate_values:

result += ParseInt(v);

Emit(AsString(result));

Page 33: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

MapReduce: Example

Page 34: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

MapReduce in Parallel: Example

Page 35: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Common mistakes:

Use static variables

• Don't use static shared variables for mappers

• After map + reduce return, they should remember nothing about

the processed data!

35University of Pennsylvania

HashMap h = new HashMap();

map(key, value) {

if (h.contains(key)) {

h.add(key,value);

emit(key, "X");

}

}

Wrong!

Page 36: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Common mistakes:

Do your own I/O

• Don't try to do your own I/O!

• Don't try to read from, or write to, files in the file system

• The MapReduce framework does all the I/O for you:

‒ All the incoming data will be fed as arguments to map and reduce

‒ Any data your functions produce should be output via emit

36University of Pennsylvania

map(key, value) {

File foo =

new File("xyz.txt");

while (true) {

s = foo.readLine();

...

}

} Wrong!

Page 37: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Common mistakes:

Too much data on the same key

• Mapper must not map too much data to the same key

• In particular, don't map everything to the same key!!

• Otherwise the reduce worker will be overwhelmed!

• It's okay if some reduce workers have more work than others

37University of Pennsylvania

map(key, value) {

emit("FOO", key + " " + value);

}

Wrong!

Page 38: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Designing MapReduce

algorithms

• Key decision: What should be done by map, and what by

reduce?

• map can do something to each individual key-value pair, but

it can't look at other key-value pairs

• reduce can aggregate data; it can look at multiple values, as long

as map has mapped them to the same (intermediate) key

‒ Example: Count the number of words, add up the total cost, ...

38University of Pennsylvania

Page 39: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

More details on the MapReduce

data flow

39

Data partitions

by key

Map computation

partitions

Reduce

computation

partitions

Redistribution

by output’s key("shuffle")

Coordinator

Page 40: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Some additional details

• To make this work, we need a few more parts in Hadoop

HDFS system

• The file system (distributed across all nodes):

• Stores the inputs, outputs, and temporary results

• The driver program (executes on one node):

• Specifies where to find the inputs, the outputs

• Specifies what mapper and reducer to use

• Can customize behavior of the execution

• The runtime system (controls nodes):

• Supervises the execution of tasks

40

Page 41: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

Java MapReduce code

on Apache Hadoop 2.7.2

Page 42: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

MapReduce Program

• A MapReduce program consists of the following 3

parts :

• Driver → main (would trigger the map and reduce

methods)

• Mapper

• Reducer

• It is better to include the map reduce and main

methods in 3 different classes

2017/9/17 42

Page 43: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Mapper

• public static class TokenizerMapper

• extends Mapper<Object, Text, Text, IntWritable>{

• private final static IntWritable one = new IntWritable(1);

• private Text word = new Text();

• public void map(Object key, Text value, Context context

• ) throws IOException, InterruptedException {

• StringTokenizer itr = new StringTokenizer(value.toString());

• while (itr.hasMoreTokens()) {

• word.set(itr.nextToken());

• context.write(word, one);

• }

• }

• }

2017/9/17 43

Page 44: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Mapper

• public static class TokenizerMapper

• extends Mapper<Object, Text, Text, IntWritable>{

• private final static IntWritable one = new IntWritable(1);

• private Text word = new Text();

• public void map(Object key, Text value, Context context

• ) throws IOException, InterruptedException {

• StringTokenizer itr = new StringTokenizer(value.toString());

• while (itr.hasMoreTokens()) {

• word.set(itr.nextToken());

• context.write(word, one);

• }

• }

• }

2017/9/17 44

Interface

Mapper<K1,V1,K2,V2> , the

first pair is the input key/value

pair, the second is the output

key/value pair

Page 45: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Mapper

• public static class TokenizerMapper

• extends Mapper<Object, Text, Text, IntWritable>{

• private final static IntWritable one = new IntWritable(1);

• private Text word = new Text();

• public void map (Object key, Text value, Context context) throws IOException, InterruptedException {

• StringTokenizer itr = new StringTokenizer(value.toString());

• while (itr.hasMoreTokens()) {

• word.set(itr.nextToken());

• context.write(word, one);

• }

• }

• }

2017/9/17 45

Keys are the position in the file,

and values are the line of text.

Context emits the output.

Page 46: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Reducer

• public static class IntSumReducer

• extends Reducer<Text,IntWritable,Text,IntWritable> {

• private IntWritable result = new IntWritable();

• public void reduce(Text key, Iterable<IntWritable> values,

• Context context

• ) throws IOException, InterruptedException {

• int sum = 0;

• for (IntWritable val : values) {

• sum += val.get();

• }

• result.set(sum);

• context.write(key, result);

• }

• }

2017/9/17 46

Page 47: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Main function

• public static void main(String[] args) throws Exception {

• Configuration conf = new Configuration();

• Job job = Job.getInstance(conf, "word count");

• job.setJarByClass(WordCount.class);

• job.setMapperClass(TokenizerMapper.class);

• job.setCombinerClass(IntSumReducer.class);

• job.setReducerClass(IntSumReducer.class);

• job.setOutputKeyClass(Text.class);

• job.setOutputValueClass(IntWritable.class);

• FileInputFormat.addInputPath(job, new Path(args[0]));

• FileOutputFormat.setOutputPath(job, new Path(args[1]));

• System.exit(job.waitForCompletion(true) ? 0 : 1);

• }

2017/9/17 47

Given the Mapper and Reducer

code, the main() starts the

MapReduce running.

Page 48: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Main function

• public static void main(String[] args) throws Exception {

• Configuration conf = new Configuration();

• Job job = Job.getInstance(conf, "word count");

• job.setJarByClass(WordCount.class);

• job.setMapperClass(TokenizerMapper.class);

• job.setCombinerClass(IntSumReducer.class);

• job.setReducerClass(IntSumReducer.class);

• job.setOutputKeyClass(Text.class);

• job.setOutputValueClass(IntWritable.class);

• FileInputFormat.addInputPath(job, new Path(args[0]));

• FileOutputFormat.setOutputPath(job, new Path(args[1]));

• System.exit(job.waitForCompletion(true) ? 0 : 1);

• }

2017/9/17 48

Configurations are specified by

resources. A resource contains

a set of name/value pairs as

XML data.

Page 49: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Main function

• public static void main(String[] args) throws Exception {

• Configuration conf = new Configuration();

• Job job = Job.getInstance(conf, "word count");

• job.setJarByClass(WordCount.class);

• job.setMapperClass(TokenizerMapper.class);

• job.setCombinerClass(IntSumReducer.class);

• job.setReducerClass(IntSumReducer.class);

• job.setOutputKeyClass(Text.class);

• job.setOutputValueClass(IntWritable.class);

• FileInputFormat.addInputPath(job, new Path(args[0]));

• FileOutputFormat.setOutputPath(job, new Path(args[1]));

• System.exit(job.waitForCompletion(true) ? 0 : 1);

• }

2017/9/17 49

Normally the user creates the

application, describes various

facets of the job via Job and

then submits the job and

monitor its progress.

Page 50: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Main function

• public static void main(String[] args) throws Exception {

• Configuration conf = new Configuration();

• Job job = Job.getInstance(conf, "word count");

• job.setJarByClass(WordCount.class);

• job.setMapperClass(TokenizerMapper.class);

• job.setCombinerClass(IntSumReducer.class);

• job.setReducerClass(IntSumReducer.class);

• job.setOutputKeyClass(Text.class);

• job.setOutputValueClass(IntWritable.class);

• FileInputFormat.addInputPath(job, new Path(args[0]));

• FileOutputFormat.setOutputPath(job, new Path(args[1]));

• System.exit(job.waitForCompletion(true) ? 0 : 1);

• }

2017/9/17 50

CombinerClass is a

mini reducer running in

a single Mapper node.

Page 51: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Combiner class

• Combiner class "mini-reduce"

• machine A emits <the, 1>, <the, 1>

• machine B emits <the, 1>.

• a Combiner on machine A emits <the, 2>. This value,

along with the <the, 1> from machine B will both go

to the Reducer node

• We have now saved bandwidth, but preserved the

computation.

2017/9/17 51

Page 52: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

• Watch a video

• https://www.youtube.com/watch?v=bcjSe0xCHbE

2017/9/17 52

Page 53: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Outline

• Google File System (GFS)

• Hadoop Eco-system

• MapReduce and examples (with a video)

• Hands-on exercise on table join

• Questions and answers for quiz

Page 54: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Hands-on exercise on

MapReduce

• Write one executable MapReduce programs to perform

the table inner-join in the exercise

A B

1 ab

1 cd

4 ef

A C

1 b

2 d

4 c

Table x Table y

A B C

1 ab b

1 cd b

4 ef c

Output

Page 55: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Hands-on exercise on

MapReduce

• Download the instructions of the exercise at

• https://www.cs.helsinki.fi/u/jilu/dataset/HadoopExerci

ses.pdf

• Read the instruction to install Hadoop on Ukko

• Download the dataset

Page 56: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Reduce-side join

• Map

• output <key, value>

• key>>join key, value>>tagged with data source

• Reduce

• do a full cross-product of values

• output the combination results

Page 57: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Example

A B

1 ab

1 cd

4 ef

A C

1 b

2 d

4 c

table x

table y

map()

map()

1

4

key

x ab

x cd

x ef

value

1

2

4

key

y b

y d

y c

valuetag

join key

shuffle()1

key

x ab

x cd

y b

valuelist

2 y d

4x ef

y c

reduce()

A B C

1 ab b

1 cd b

4 ef c

output

1

Page 58: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Outline

• Google File System (GFS)

• Hadoop Eco-system

• MapReduce and examples (with a video)

• Hands-on exercise on table join

• Questions and answers for quiz

Page 59: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Google File System is scalable, distributed file system

on inexpensive commodity hardware that provides:

A. Fault Tolerance

B. High Aggregate Performance

C. ACID transaction model

D. Failure detection on replicas

17.9.2017 59

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 1

Page 60: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Google File System is scalable, distributed file system

on inexpensive commodity hardware that provides:

A. Fault Tolerance

B. High Aggregate Performance

C. ACID transaction model

D. Failure detection on replicas

17.9.2017 60

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 1

Page 61: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

What are the assumptions in designing Google File Systems?

A. The system is built from many inexpensive commodity

components.

B. The workloads have very frequent updating operations.

C. The stringent response time requirements for an individual

read or write are not the primary designing goal.

D. The workload consists of both large streaming reads and

small random reads.

17.9.2017 61

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 2

Page 62: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

What are the assumptions in designing Google File Systems?

A. The system is built from many inexpensive commodity

components.

B. The workloads have very frequent updating operations.

C. The stringent response time requirements for an individual

read or write are not the primary designing goal.

D. The workload consists of both large streaming reads and

small random reads.

17.9.2017 62

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 2

Page 63: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

3. What is the chunk size in GFS ?

A. 16MB

B. 32MB

C. 64 MB

D. 128MB

17.9.2017 63

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 3

Page 64: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

3. What is the chunk size in GFS ?

A. 16MB

B. 32MB

C. 64 MB

D. 128MB

17.9.2017 64

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 3

Page 65: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

4. Which are the mistakes on MapReduce programs?

A. Using the static shared variables for mappers

B. Map too much data to the same key

C. Write the own I/O codes

D. Always map all data to the same key

17.9.2017 65

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 4

Page 66: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Which are the mistakes on MapReduce programs?

A. Using the static shared variables for mappers

B. Map too much data to the same key

C. Write the own I/O codes

D. Always map all data to the same key

17.9.2017 66

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 4

Page 67: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Which are the typical application scenarios for a

MapReduce program?

A. Perform the matrix multiplication and other

complicated computing operations

B. Run machine learning algorithms with many

iterations

C. Compute the inverted indices

D. Summarize the number of pages crawled per host

on Internet

17.9.2017 67

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 5

Page 68: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Which are the typical application scenarios for a

MapReduce program?

A. Perform the matrix multiplication and other

complicated computing operations

B. Run machine learning algorithms with many

iterations

C. Compute the inverted indices

D. Summarize the number of pages crawled per host

on Internet

17.9.2017 68

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 5

Page 69: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

MapReduce is an abstraction to hide the following

messy details of parallelization, including:

A. fault-tolerance

B. data distribution

C. high performance

D. load balancing

17.9.2017 69

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 6

Page 70: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

MapReduce is an abstraction to hide the following

messy details of parallelization, including:

A. fault-tolerance

B. data distribution

C. high performance

D. load balancing

17.9.2017 70

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 6

Page 71: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Which are the correct statements on the workflow of MapReduce program?

A. The intermediate key/value pairs produced by the Map function are buffered in memory and periodically, these buffered pairs are written to local disk.

B. Master node is responsible for forwarding the location of the buffered pairs on local disk to the reduce works.

C. A reduce worker uses remote procedure calls to read the buffered data from the local disks of map workers.

D. When a reduce worker read partial of intermediate data, it start to sort it by the intermediate keys so that the same keys are grouped together.

17.9.2017 71

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 7

Page 72: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Which are the correct statements on the workflow of MapReduce program?

A. The intermediate key/value pairs produced by the Map function are buffered in memory and periodically, these buffered pairs are written to local disk.

B. Master is responsible for forwarding the location of the buffered pairs on local disk to the reduce workers.

C. A reduce worker uses remote procedure calls to read the buffered data from the local disks of map workers.

D. When a reduce worker read partial of intermediate data, it start to sort it by the intermediate keys so that the same keys are grouped together.

17.9.2017 72

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 7

Page 73: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Which are the correct statements on the functions of

Mapper and Reducer?

A. Each Mapper can do something to each individual

key-value pair.

B. Each Mapper can look at key-value pairs of other

mappers.

C. Each Reducer can aggregate data.

D. Each Reduce can look at multiple values from other

reducers.

17.9.2017 73

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 8

Page 74: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Which are the correct statements on the functions of

Mapper and Reducer?

A. Each Mapper can do something to each individual

key-value pair.

B. Each Mapper can look at key-value pairs of other

mappers.

C. Each Reducer can aggregate data.

D. Each Reduce can look at multiple values from other

reducers.

17.9.2017 74

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 8

Page 75: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

What are the purposes of Combine Function?

A. The Combine function is executed on each

machine that performs a reduce task.

B. Typically the same code is used to implement both

the combine and the reduce functions.

C. The output of a combiner function is written to an

intermediate file that will be sent to a reduce task.

D. Partial combining can significantly speed up certain

of MapReduce operations.

17.9.2017 75

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 9

Page 76: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

What are the purposes of Combine Function?

A. The Combine function is executed on each

machine that performs a reduce task.

B. Typically the same code is used to implement both

the combine and the reduce functions.

C. The output of a combiner function is written to an

intermediate file that will be sent to a reduce task.

D. Partial combining can significantly speed up certain

of MapReduce operations.

17.9.2017 76

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Question 9

Page 77: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

Limitations of Hadoop

• Latency, slow processing speed

• No Real-time Data Processing

• Not fit for small files

2017/9/17 77

Page 78: Hadoop and MapReduce · “Introduction to Data Science” Outline • Big data and Google File System (GFS) • Hadoop and HDFS • MapReduce and examples • Hands-on exercise on

www.helsinki.fi

• Hadoop is an open-source platform for big data processing

• MapReduce is a programming framework to process big

data

• More information on big data management, join the course

“Introduction to big data management”:

• https://courses.helsinki.fi/DATA14002/119122647

17.9.2017 78

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Summary