MapReduce Design Patterns

DESCRIPTION

This was a presentation on my book, MapReduce Design Patterns, given to the Twin Cities Hadoop Users Group. Check it out if you are interested in seeing what my book is about.

© Copyright 2012 EMC Corporation. All rights reserved.

MapReduce Design Patterns

Donald Miner, Greenplum Hadoop Solutions Architect

@donaldpminer

Book was made available December 2012

Inspiration for my book

What are design patterns? (in general)

Reusable solutions to problems

Domain independent

Not a cookbook, but not a guide

Not a finished solution

Why design patterns? (in general)

Makes the intent of code easier to understand

Provides a common language for solutions

Be able to reuse code

Known performance profiles and limitations of solutions

Why MapReduce design patterns?

Recurring patterns in data-related problem solving

Groups are building patterns independently

Lots of new users every day

MapReduce is a new way of thinking

Foundation for higher-level tools (Pig, Hive, …)

Community is reaching the right level of maturity

Pattern Template

Intent

Motivation

Applicability

Structure

Consequences

Resemblances

Performance analysis

Examples

Pattern Categories

Summarization

Filtering

Data Organization

Joins

Metapatterns

Input and output

Filtering patterns: extract interesting subsets

“I only want some of my data!”

Filtering

Bloom filtering

Top ten

Distinct

Summarization patterns: top-down summaries

“I only want a top-level view of my data!”

Numerical summarizations

Inverted index

Counting with counters

Data organization patterns: reorganize, restructure

“I want to change the way my data is organized!”

Structured to hierarchical

Partitioning

Binning

Total order sorting

Shuffling

Join patterns: bringing data sets together

“I want to mash my different data sources together!”

Reduce-side join

Replicated join

Composite join

Cartesian product

Metapatterns: patterns of patterns

“I want to solve a complex problem with multiple patterns!”

Job chaining

Chain folding

Job merging

Input and output patterns: custom input and output

“I want to get data or put data in an unusual place!”

Generating data

External source output

External source input

Partition pruning

Pattern: “Top Ten” (filtering)

Intent: Retrieve a relatively small number (K) of top records, according to a ranking scheme, from your data set, no matter how large the data.

Motivation:
– Finding outliers
– Top ten lists are fun
– Building dashboards
– Sort-then-limit isn’t going to work here (a total sort is expensive in MapReduce)

Applicability:
– Rank-able records
– Limited number of output records

Consequences: The top K records are returned.

Structure

class mapper:
    setup():
        initialize top ten sorted list
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10:
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record

Resemblances

SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10;

Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;

Performance analysis:
– Pretty quick: map-heavy, low network usage
– Pay attention to how many records the reducer is getting: [number of input splits] x K
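To make the reducer-input estimate concrete, here is a minimal plain-Java sketch with no Hadoop dependency. It simulates the two phases of the pattern: each "split" keeps only its local top K, and a single "reducer" merges those candidates. The class name `TopKSketch`, the helper names, and the synthetic integer scores are assumptions made for illustration; they are not part of the book's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class TopKSketch {
    static final int K = 10;

    // Keep only the K largest values, returned largest first.
    // The same routine serves the map side and the reduce side.
    // Note: equal values collapse in a TreeSet, so ties are not preserved.
    static List<Integer> topK(Iterable<Integer> values) {
        TreeSet<Integer> top = new TreeSet<>();
        for (int v : values) {
            top.add(v);
            if (top.size() > K) {
                top.pollFirst(); // evict the current smallest
            }
        }
        return new ArrayList<>(top.descendingSet());
    }

    // Simulate `splits` mappers, each seeing `recordsPerSplit` distinct scores,
    // and collect everything that would be sent to the single reducer.
    static List<Integer> reducerInput(int splits, int recordsPerSplit) {
        List<Integer> candidates = new ArrayList<>();
        for (int s = 0; s < splits; s++) {
            List<Integer> split = new ArrayList<>();
            for (int i = 0; i < recordsPerSplit; i++) {
                split.add(s * recordsPerSplit + i); // synthetic, distinct scores
            }
            candidates.addAll(topK(split)); // map side: local top K only
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<Integer> candidates = reducerInput(4, 1000);
        // 4 splits x K = 40 records cross the network, not 4000.
        System.out.println("records reaching the reducer: " + candidates.size());
        // Reduce side: merge the candidates into the global top K.
        System.out.println("global top " + K + ": " + topK(candidates));
    }
}
```

With 4 splits of 1000 records each, only 40 candidates reach the reducer. The same arithmetic is why a job with many thousands of splits can still overload a single reducer.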

Example: Top ten StackOverflow users by reputation

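Top Ten Mapper

// Nested in the job's driver class. That file needs imports for java.io.IOException,
// java.util.Map, java.util.TreeMap, org.apache.hadoop.io.NullWritable,
// org.apache.hadoop.io.Text and org.apache.hadoop.mapreduce.Mapper.
public static class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {

    // Reputation -> user record, sorted ascending by reputation.
    // Note: records with equal reputation overwrite each other.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context) {
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
        String reputation = parsed.get("Reputation");
        if (reputation == null) {
            return; // skip records without a Reputation attribute
        }
        // Copy the value: Hadoop reuses the Text object between map() calls.
        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
        // Keep only the ten highest reputations seen so far.
        if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top ten once all of its input has been processed.
        for (Text t : repToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}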

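Top Ten Reducer

// Nested in the same driver class; additionally needs org.apache.hadoop.mapreduce.Reducer.
public static class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {

    // Reputation -> user record, sorted ascending by reputation.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // All mappers emit under the same null key, so this single call sees every candidate.
        for (Text value : values) {
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
            // Copy the value: Hadoop reuses the Text object while iterating.
            repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")), new Text(value));
            if (repToRecordMap.size() > 10) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
        // Emit the global top ten, highest reputation first.
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }
}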

Pattern: “Bloom Filtering” (filtering)

Intent: Keep records that are members of some predefined set of values. It is not a problem if the output is a bit inaccurate.

Motivation:
– Similar to normal Boolean filtering, but we are filtering on set membership
– Set membership is evaluated with a Bloom filter

Applicability:
– A feature can be extracted and tested for set membership
– Predetermined set is available
– Some false positives are acceptable

Consequences: Records that pass the Bloom filter membership test are returned

Known uses:
– Keep all records in a watch list (and a few records that aren’t)
– Pre-filtering records before an expensive membership test

Structure

class mapper:
    setup():
        load bloom filter into memory
    map(key, record):
        if record in bloom filter:
            emit (record, null)
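The structure above can be sketched in plain Java without Hadoop. This is a minimal illustration under stated assumptions, not the book's implementation: the class `BloomFilterSketch`, its bit-array size, its hash scheme (double hashing over `Arrays.hashCode` and FNV-1a) and the word-splitting rule are all hypothetical choices made for the example. A real job would more likely load a prebuilt filter in `setup()`; Hadoop ships one in `org.apache.hadoop.util.bloom.BloomFilter`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a, used as the second base hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // i-th bit position, derived from two base hashes (double hashing).
    private int position(String item, int i) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(data) + i * fnv1a(data), numBits);
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(item, i));
        }
    }

    // False means "definitely not in the set"; true means "probably in the set".
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    // Map-only step: keep a record if any extracted feature (here, a word) passes the test.
    static List<String> keep(List<String> records, BloomFilterSketch filter) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            for (String word : record.toLowerCase().split("\\W+")) {
                if (!word.isEmpty() && filter.mightContain(word)) {
                    kept.add(record);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch hotWords = new BloomFilterSketch(1024, 3);
        hotWords.add("hadoop");
        hotWords.add("mapreduce");
        List<String> comments = Arrays.asList(
                "Try Hadoop for this", "Nice weather today", "MapReduce fits well");
        // Comments mentioning a hot word are always kept; others are dropped
        // unless they happen to hit a false positive.
        System.out.println(keep(comments, hotWords));
    }
}
```

Each membership test touches a fixed number of bits, which is why the cost per record stays constant no matter how large the predefined set is.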

Resemblances

UDFs?

Performance analysis:
– Map-only
– Slight overhead in moving Bloom filter into memory
– Bloom filter membership tests are constant time

Examples:
– Filter out StackOverflow comments that do not contain a keyword
– Distributed HBase query using a Bloom filter

Candidate new patterns

Link graph processing patterns (new category)
– Shortest path, diameter, graph stats, connected components, etc.
– Too domain specific?
– Has its own distinct patterns

Projection (filtering)
– Remove “columns” of data

Transformation (data organization?)
– Take a data set but transform it into something else

Future and call to action

Contributing your own patterns

Trends in the nature of data
– Images, audio, video, biomedical, social …

Libraries, abstractions, and tools

Ecosystem patterns: YARN, HBase, ZooKeeper, …
