
MapReduce: Design Patterns A.A. 2017/18

Matteo Nardelli

Laurea Magistrale in Ingegneria Informatica - II anno

Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

The reference Big Data stack [figure]: High-level Interfaces, Data Processing, Data Storage, Resource Management, plus Support / Integration.

Main reference for this lecture:
D. Miner and A. Shook, "MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems", O'Reilly Media, 2012.


MapReduce is a Framework
• Fit your solution into the framework of map and reduce

• In some situations this might be challenging
  – MapReduce can be a constraint
  – but it provides clear boundaries for what you can and cannot do

• Figuring out how to solve a problem within these constraints requires
  – cleverness
  – a change in thinking!


MapReduce Design Pattern
What is a MapReduce design pattern?
• It is a template for solving a common and general data manipulation problem with MapReduce
• Inspired by "Design Patterns: Elements of Reusable Object-Oriented Software" by the Gang of Four

A pattern:
• is a general approach for solving a problem
• is not specific to a domain (e.g., text processing, graph analysis)

A design pattern allows you:
• to use tried and true design principles
• to build better software


MapReduce Design Pattern
• MapReduce is a framework, not a tool
  – Fit your solution into the framework of map and reduce
  – Can be challenging in some situations

• Take the algorithm and break it into filter/aggregate steps
  – Filter becomes part of the map function
  – Aggregate becomes part of the reduce function

• Sometimes we need multiple MapReduce stages

• MapReduce is not a solution to every problem, not even to every parallel problem

• It makes sense when:
  – Files are very large and are rarely updated
  – We need to iterate over all the files to generate some interesting property of the data in those files


Hands-on Hadoop (our pre-configured Docker image)


HDFS with Docker

• Create a small network named hadoop_network with one namenode (master) and three datanodes (slaves)

• We will interact with the master node, exchanging files through the volume mounted in /data

$ docker network create --driver bridge hadoop_network

$ docker run -t -i -p 50075:50075 -p 50061:50060 -d --network=hadoop_network --name=slave1 matnar/hadoop

$ docker run -t -i -p 50076:50075 -p 50062:50060 -d --network=hadoop_network --name=slave2 matnar/hadoop

$ docker run -t -i -p 50077:50075 -p 50063:50060 -d --network=hadoop_network --name=slave3 matnar/hadoop

$ docker run -t -i -p 50070:50070 -p 50060:50060 -p 50030:50030 -p 8088:8088 -p 19888:19888 --network=hadoop_network --name=master -v $PWD/hddata:/data matnar/hadoop


HDFS with Docker

• Before we start, we need to initialize our environment

• On the master node:

$ hdfs namenode -format

$ $HADOOP_HOME/sbin/start-dfs.sh

$ $HADOOP_HOME/sbin/start-yarn.sh

• The WebUI tells us if everything is working properly:
  – HDFS: http://localhost:50070/
  – MapReduce Master: http://localhost:8088/


HDFS with Docker

How to remove the containers
• stop and delete the namenode and the datanodes

$ docker kill master slave1 slave2 slave3

$ docker rm master slave1 slave2 slave3

• remove the network

$ docker network rm hadoop_network


A simplified view of MapReduce

• Mappers are applied to all input key-value pairs, to generate an arbitrary number of intermediate pairs

• Reducers are applied to all intermediate values associated with the same intermediate key

• Between the map and reduce phase lies a barrier that involves a large distributed sort and group by
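A minimal sketch of the canonical word count job, written with the Hadoop Java API, illustrates these two roles; class names here are illustrative and not part of the lecture's code.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Mapper: invoked on every input key/value pair; emits an arbitrary
    // number of intermediate (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: invoked once per intermediate key, after the framework has
    // sorted and grouped all values associated with that key
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}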


A more detailed view of MapReduce

• Combiner: an optimization that performs the reduce function ahead of time, on the map node
  – Hadoop provides no guarantee on how many times (if any) the combiner will be called

• Partitioner: when there are multiple reducers, it divides the keys into partitions, one assigned to each reducer
  – A custom partitioner can be used to control how keys are assigned to the reducers, e.g., to balance load or to guarantee properties such as total ordering
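A minimal driver sketch of where the combiner and a custom partitioner are plugged into a job; it reuses the word count classes from the earlier sketch, and the first-letter partitioner is purely illustrative (none of this is the lecture's code).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerPartitionerDriver {

    // Illustrative custom partitioner: route each key by its first character,
    // so that a given reducer always handles the same slice of the key space
    public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String k = key.toString();
            char first = k.isEmpty() ? 'a' : k.charAt(0);
            return first % numPartitions;
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(CombinerPartitionerDriver.class);
        job.setMapperClass(WordCountSketch.TokenizerMapper.class);
        // The combiner may be invoked zero, one, or several times on map output;
        // reusing the reducer is safe here because sum is associative and commutative
        job.setCombinerClass(WordCountSketch.SumReducer.class);
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setReducerClass(WordCountSketch.SumReducer.class);
        job.setNumReduceTasks(3);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}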


Design Pattern: Number Summarizations
• Goal: compute some numerical aggregate value (count, maximum, average, ...) over a set of values

• Structure:
  – Mapper: it outputs keys consisting of the fields to group by, and values consisting of any pertinent numerical items
  – Combiner: (optional) it can greatly reduce the number of intermediate key/value pairs to be sent across the network, but it works well only with associative and commutative operations
  – Partitioner: (optional) it can better distribute key/value pairs across the reduce tasks
  – Reducer: it receives a set of numerical values and applies the aggregation function


Design Pattern: Number Summarizations [figure]

Design Pattern: Number Summarizations
Examples:
• Word count, record count
  – Count the number of requests to *.dicii.uniroma2.it/*
• Min/Max
  – Compute the max temperature per region
• Average/Median/Standard Deviation
  – Average the number of requests per page per Web site
• Inverted Index Summarization
  – The inverted index pattern is commonly used to generate an index from a data set, to allow for faster searches or data enrichment capabilities (a sketch follows below)
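A possible sketch of the inverted index summarization (not taken from the lecture's code repository): the mapper emits a (term, documentId) pair for every term, and the reducer concatenates, per term, the ids of the documents containing it. The assumed input format "documentId<TAB>document text" is hypothetical.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexSketch {

    // Map: emit (term, documentId) for every term of every document
    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
        private final Text term = new Text();
        private final Text docId = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // assumed input format: "documentId<TAB>document text"
            String[] parts = value.toString().split("\t", 2);
            if (parts.length != 2) return;
            docId.set(parts[0]);
            StringTokenizer itr = new StringTokenizer(parts[1].toLowerCase());
            while (itr.hasMoreTokens()) {
                term.set(itr.nextToken());
                context.write(term, docId);
            }
        }
    }

    // Reduce: build the posting list (comma-separated document ids) for each term
    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text docId : values) {
                if (postings.length() > 0) postings.append(",");
                postings.append(docId.toString());
            }
            context.write(key, new Text(postings.toString()));
        }
    }
}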


Summarization: Example


• Goal: compute the average word length by initial letter

• Check: AverageWordLengthByInitialLetter.java

public void map(Object key, Text value, Context context) {
    String line = value.toString().toLowerCase();
    /* Emit length by initial letter */
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        initialLetter.set(word.substring(0, 1));
        length.set(word.length());
        context.write(initialLetter, length);
    }
}

This is only an excerpt


Summarization: Example


• Goal: compute the average word length by initial letter

• Check: AverageWordLengthByInitialLetter.java

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable val : values) {
        sum += val.get();
        count++;
    }
    average.set((float) sum / (float) count);
    context.write(key, average);
}

This is only an excerpt
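Note that this reducer cannot be reused directly as a combiner: an average of partial averages is not, in general, the overall average, so the operation is not associative. A common workaround, sketched below with assumed class names and value encoding (this is not the lecture's code), is to let the mapper emit partial "sum,count" pairs (e.g., "5,1" for one word of length 5) and compute the average only in the final reducer.

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageWithCombinerSketch {

    // Combiner: merges partial "sum,count" pairs; its output is still a
    // partial pair, NOT an average, so the operation stays associative
    public static class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (Text v : values) {
                String[] parts = v.toString().split(",");
                sum += Long.parseLong(parts[0]);
                count += Long.parseLong(parts[1]);
            }
            context.write(key, new Text(sum + "," + count));
        }
    }

    // Reducer: merges the remaining pairs and only now computes the average
    public static class AverageReducer extends Reducer<Text, Text, Text, FloatWritable> {
        private final FloatWritable average = new FloatWritable();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (Text v : values) {
                String[] parts = v.toString().split(",");
                sum += Long.parseLong(parts[0]);
                count += Long.parseLong(parts[1]);
            }
            average.set((float) sum / (float) count);
            context.write(key, average);
        }
    }
}

In the driver, the combiner would be registered with job.setCombinerClass(SumCountCombiner.class).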


Design Pattern: Filtering
• Goal: filter out records that are not of interest and keep the others

• An application of filtering is sampling
  – Sampling can be used to get a smaller, yet representative, data set

• Structure:
  – Mapper: filters data (it does most of the work)
  – Reducer: may simply be the identity, if the job does not produce an aggregation on filtered data

• Examples:
  – Grep Web logs for requests to *dicii.uniroma2.it/*
  – Find in the Web logs the URLs accessed by 160.80.85.34
  – Locate all the files that contain the words 'Apple' and 'Jobs'


Design Pattern: Filtering [figure]

Design Pattern: Filtering
Use cases:
• Closer view of data: extract records that have something in common or something of interest (e.g., same event date, same user id)

• Tracking a thread of events: extract a thread of consecutive events as a case study from a larger data set

• Distributed grep

• Simple random sampling: take a simple random sample of the data set
  – use a filter with an evaluation function that randomly returns true or false (see the sketch after this list)

• Remove low-scoring data
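A hypothetical map-only implementation of such an evaluation function (the class name and the configuration key are assumptions, not the lecture's code): each record is kept with a configurable probability and no reducer is needed.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RandomSamplingMapper extends Mapper<Object, Text, NullWritable, Text> {

    private final Random random = new Random();
    private float keepProbability;

    @Override
    public void setup(Context context) {
        // "sampling.probability" is an assumed configuration key set by the driver
        keepProbability = context.getConfiguration().getFloat("sampling.probability", 0.01f);
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // keep the record only if the random draw falls below the threshold
        if (random.nextFloat() < keepProbability) {
            context.write(NullWritable.get(), value);
        }
    }
}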


Filtering: Example


• Goal: implement a distributed version of grep
• Check: DistributedGrep.java

public static class GrepMapper
        extends Mapper<Object, Text, NullWritable, Text> {

    private Pattern pattern = null;

    public void setup(Context context) throws ... {
        pattern = Pattern.compile( ... );
    }

    public void map(Object key, Text value, Context context) ... {
        Matcher matcher = pattern.matcher(value.toString());
        if (matcher.find()) {
            context.write(NullWritable.get(), value);
        }
    }
}

This is only an excerpt

Design Pattern: Distinct
• Special case of the filtering pattern
• Goal: filter out records that look like another record in the data set

• Structure:
  – Mapper: it takes each record and extracts the data fields for which we want unique values. The mapper outputs the record as the key, and null as the value
  – Reducer: it groups the nulls together by key, and we then simply output the key. Because each key is grouped together, the output data set is guaranteed to be unique

• Examples:
  – Retrieve the list of unique visitors of a website


Distinct: Example


• Goal: retrieve the list of words, with no repetitions, in a document

• Check: DistinctWords.java

public void map(Object key, Text value, Context context) ... {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, NullWritable.get());
    }
}
...
public void reduce(Text key, Iterable<NullWritable> values, Context context) ... {
    context.write(key, NullWritable.get());
}

This is only an excerpt

Design Pattern: Data Organization
• Goal: combine and organize data into a more complex data structure

• This pattern includes several sub-categories:
  – structure-to-hierarchical pattern (e.g., denormalization)
  – partitioning and binning patterns
  – total order sorting patterns
  – shuffling patterns


Design Pattern: Structure to Hierarchical
• Goal: create new records from data stored in very different structures
  – This pattern follows the denormalization principles of big data stores

• Structure:
  – We might need to combine data from multiple data sources (use MultipleInputs; see the driver sketch below)
  – Map: it associates the data to be aggregated with the same key (e.g., the root of the hierarchical record). Each datum can be enriched with a label identifying its source
  – Reduce: it creates the hierarchical structure from the list of received data items
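A possible driver-side sketch, using Hadoop's MultipleInputs to attach a different mapper to each data source; the mapper, reducer, and path names are placeholders and not necessarily those used by TopicItemsHierarchy.java.

/* Inside the driver (class and path names are placeholders) */
// each input path gets its own mapper, which labels records with their source
MultipleInputs.addInputPath(job, new Path("topics"),
        TextInputFormat.class, TopicMapper.class);
MultipleInputs.addInputPath(job, new Path("items"),
        TextInputFormat.class, ItemMapper.class);
// a single reducer class rebuilds the hierarchical record per key
job.setReducerClass(HierarchyReducer.class);
FileOutputFormat.setOutputPath(job, new Path("output"));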


Design Pattern: Structure to Hierarchical [figure]

Structure to Hierarchical: Example
• Goal: create a JSON structure for a topic, which contains the list of its items
  – Two inputs are provided: the list of topics and the list of items
• Check: TopicItemsHierarchy.java


public void map(Object key, Text value, Context context) ... {
    String line = value.toString();
    String[] parts = line.split("::");
    if (parts.length != 2)
        return;
    String id = parts[0];
    String content = parts[1];
    outKey.set(id);
    outValue.set(valuePrefix + content);
    context.write(outKey, outValue);
}

This is only an excerpt

Structure to Hierarchical: Example
• Check: TopicItemsHierarchy.java


public void reduce(Text key, Iterable<Text> values, Context context) ... {
    Topic topic = new Topic();
    for (Text t : values) {
        String value = t.toString();
        if (ValueType.TOPIC.equals(discriminate(value))) {
            topic.setTopic(getContent(value));
        } else if (ValueType.ITEM.equals(discriminate(value))) {
            topic.addItem(getContent(value));
        }
    }
    /* Serialize topic */
    String serializedTopic = gson.toJson(topic);
    context.write(new Text(serializedTopic), NullWritable.get());
}

This is only an excerpt

Design Pattern: Partitioning
• Goal: move the records into categories (i.e., shards, partitions, or bins) without caring about the order of records

• Structure:
  – Map: in most cases, the identity mapper can be used
  – Partitioner: it determines which reducer each record is sent to; each reducer corresponds to a particular partition
  – Reduce: in most cases, the identity reducer can be used
  – All you have to define is the function that determines which partition a record goes to


Design Pattern: Partitioning [figure]

Design Pattern: Partitioning
• Examples:
  – Discretize continuous values: partitioning data into bins allows you to group continuous values in the same partition
  – Partition pruning by category
  – Sharding


Partitioning: Example
• Goal: group dates by year; in this case a year represents a partition
• Check: PartitionDatesByYear.java


public static class DatePartitioner extends Partitioner<IntWritable, Text> {
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        return (key.get() - CONFIG_INITIAL_YEAR) % numPartitions;
    }
}

This is only an excerpt
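A possible driver-side configuration to pair with this partitioner; the values and type choices are illustrative and not necessarily what PartitionDatesByYear.java does.

/* Inside the driver (illustrative values) */
job.setMapOutputKeyClass(IntWritable.class);   // the year
job.setMapOutputValueClass(Text.class);        // the full record
job.setPartitionerClass(DatePartitioner.class);
// one reduce task per expected year, so that each output file holds exactly
// one partition; e.g., years 2010-2017 would require 8 partitions
job.setNumReduceTasks(8);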


Two-stage MapReduce
• As MapReduce calculations get more complex, break them down into stages
  – Output of one stage = input to the next stage

• Intermediate output may be useful for different outputs too, so you can get some reuse
  – Intermediate records can be saved in the data store, forming a materialized view

• Early stages of MapReduce operations often represent the heaviest amount of data access, so building and saving them once as a basis for many downstream uses saves a lot of work


Example of Two-stage MapReduce
• Compare the sales of products for each month of 2011 with the prior year

• 1st stage: produce records showing the sales for a single product in a single month of the year

• 2nd stage: produce the result for a single product by comparing one month's results with the same month in the prior year (a driver sketch follows)
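A driver sketch of this two-stage job; all class and path names are hypothetical, not taken from the lecture's material. The point is the chaining: the second job starts only after the first completes successfully, and it reads the first job's output directory as its input.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageSalesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path rawSales = new Path(args[0]);
        Path monthlySales = new Path(args[1]);   // intermediate, reusable output
        Path comparison = new Path(args[2]);

        /* Stage 1: sales per product per month (mapper/reducer classes are placeholders) */
        Job stage1 = Job.getInstance(conf, "monthly-sales");
        stage1.setJarByClass(TwoStageSalesDriver.class);
        stage1.setMapperClass(MonthlySalesMapper.class);
        stage1.setReducerClass(MonthlySalesReducer.class);
        FileInputFormat.addInputPath(stage1, rawSales);
        FileOutputFormat.setOutputPath(stage1, monthlySales);
        if (!stage1.waitForCompletion(true)) {
            System.exit(1);   // do not start stage 2 if stage 1 failed
        }

        /* Stage 2: compare each month with the same month of the prior year */
        Job stage2 = Job.getInstance(conf, "year-over-year-comparison");
        stage2.setJarByClass(TwoStageSalesDriver.class);
        stage2.setMapperClass(YearComparisonMapper.class);
        stage2.setReducerClass(YearComparisonReducer.class);
        FileInputFormat.addInputPath(stage2, monthlySales);   // output of stage 1
        FileOutputFormat.setOutputPath(stage2, comparison);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}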


Design Pattern: Total Order Sorting
• Sort all the records of the data set
  – Sorting in a parallel manner is not easy

• Observe: each individual reducer will sort its data by key, but unfortunately this sorting is not global across all data

• Goal: we want a total order sorting where, if you concatenate the output files, the records are sorted

• Sorted data has a number of useful properties:
  – Sorted by time, it can provide a timeline view on the data
  – Finding things in a sorted data set can be done with binary search
  – Some databases can bulk load data faster if the data is sorted on the primary key or index column

Design Pattern: Total Order Sorting
• This pattern has two phases (jobs): an analyze phase that determines the ranges, and an order phase that actually sorts the data

Analyze phase: identify the data set slices
• Map: it does a simple random sampling
• Reduce: only one reducer is used; it collects the sort keys and slices them into the data range boundaries

Order phase: order the data set
• Map: similar to the mapper of the analyze phase, but the record itself is stored as the value
• Partitioner: it loads the partition file and routes data according to the partitions
  – Hadoop provides an implementation: TotalOrderPartitioner
• Reduce: it is the identity function; the number of reducers needs to be equal to the number of partitions


Total Order Sorting: Example
• Goal: order the data set
  – We rely on the TotalOrderPartitioner class
  – Slightly different implementation of the Analyze and Order phases
• Check: TotalOrdering.java
• Observe the driver, which defines the chain of MapReduce jobs


/* **** Job #1: Analyze phase **** */
Job sampleJob = Job.getInstance(conf, "TotalOrdering");

/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);
...
if (isCompletedCorrecty(sampleJob)) {

This is only an excerpt


Total Order Sorting: Example


/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");

/* Map: identity function; Reduce: emits only the key */
orderJob.setMapperClass(Mapper.class);
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);

/* Partitioner */
orderJob.setPartitionerClass(TotalOrderPartitioner.class);

/* Define the dataset sampling strategy to identify partition bounds */
InputSampler.writePartitionFile(orderJob,
        new InputSampler.RandomSampler(.3, 10));
}

This is only an excerpt


Design Pattern: Shuffling
• Goal: we want to shuffle our data set, to randomize the order of our records (e.g., to improve anonymity)

• Structure (a sketch follows):
  – Map: it emits the record as the value, along with a random key
  – Reduce: the reducer sorts the random keys, further randomizing the data
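A minimal sketch of this structure; the class names are illustrative and not the lecture's code.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ShuffleSketch {

    // Map: emit each record unchanged, under a random key
    public static class ShuffleMapper extends Mapper<Object, Text, IntWritable, Text> {
        private final Random random = new Random();
        private final IntWritable randomKey = new IntWritable();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            randomKey.set(random.nextInt(Integer.MAX_VALUE));
            context.write(randomKey, value);
        }
    }

    // Reduce: drop the random key and emit the records in their (randomized) order
    public static class ShuffleReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(value, NullWritable.get());
            }
        }
    }
}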
