Automatic optimization of MapReduce Programs Michael Cafarella, Eaman Jahani, Christopher Re August 2011

MapReduce is victorious

• Google statistics:

                          Aug 04   Mar 06   Sept 07   May 10
Number of jobs            29K      171K     2127K     4474K
Machine years used        217      2002     11081     39121
Input data (TB)           3,288    52,254   403,152   946,460
Output data (TB)          193      2,970    14,018    45,720
Average worker machines   157      268      394       368

• Hadoop statistics: 7 PB+ Vertica clusters vs. 22 PB+ Cloudera Hadoop clusters [1]

1. Omer Trajman, Cloudera VP, http://www.dbms2.com/

MapReduce in relational land

• Designers' original intention: free-form data
  o Web-scale indexing/log processing

• But many workloads are relational [1]
  o Complex queries/data analysis

• Caveat: MR performance lags RDBMS performance

1. Karmasphere corporation: A study of hadoop developers, http://karmasphere.com, 2010

Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

Selection is Slower with MapReduce

Pavlo et al., A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009

Join is Even Slower

MR Lags in Relational Land

• Stonebraker, DeWitt: "MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism." [1]

• Query processing tasks
  o No metadata, semantics, indices
  o Free-form input is a double-edged sword

1. MapReduce: a major step backwards, http://databasecolumn.vertica.com/, 2008

Manimal

• Manimal is a hybrid system, combining the MapReduce programming model with well-known execution techniques

• These techniques are found today only in RDBMSs, but should be in MapReduce, too.

Manimal Approach

[Diagram: user bytecode (*.class) feeds the Static Analyzer, which reports optimization opportunities to the Optimizer logic; the optimizer passes an execution path to the Execution Framework, which runs on the MR Engine]

void map(Text key, WebPage w) {
  if (w.rank > 10)
    emit(w.url, w.rank);
}

• Challenges:
  o Safely detect query semantic optimization
  o How much performance gain?

SELECTION from B+Tree index on W.RANK
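A minimal sketch of why replacing the scan with an index lookup can be safe, assuming a hypothetical `Page` class in place of `WebPage` and an in-memory `TreeMap` standing in for the on-disk B+Tree (neither is taken from Manimal itself). The full scan evaluates the predicate on every record; the indexed scan visits only the matching key range and should produce the same records, possibly in a different order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for the WebPage record on the slide.
final class Page {
    final String url;
    final int rank;
    Page(String url, int rank) { this.url = url; this.rank = rank; }
}

final class SelectionSketch {
    // Baseline: what an unoptimized job does, testing every record.
    static List<String> fullScan(List<Page> pages, int threshold) {
        List<String> out = new ArrayList<>();
        for (Page p : pages) {
            if (p.rank > threshold) out.add(p.url + "\t" + p.rank);
        }
        return out;
    }

    // Build a sorted index on rank; TreeMap is an assumption standing in
    // for a B+Tree file.
    static TreeMap<Integer, List<Page>> buildIndex(List<Page> pages) {
        TreeMap<Integer, List<Page>> index = new TreeMap<>();
        for (Page p : pages) {
            index.computeIfAbsent(p.rank, r -> new ArrayList<>()).add(p);
        }
        return index;
    }

    // Indexed selection: visit only keys strictly greater than threshold.
    static List<String> indexedScan(TreeMap<Integer, List<Page>> index, int threshold) {
        List<String> out = new ArrayList<>();
        for (List<Page> bucket : index.tailMap(threshold, false).values()) {
            for (Page p : bucket) out.add(p.url + "\t" + p.rank);
        }
        return out;
    }
}
```

Comparing the two outputs as unordered collections is the right check here, since the index returns records in rank order rather than input order.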

Manimal Contributions

• Our Manimal system:
  o Detects safe relational optimizations in users' compiled MapReduce programs

• Our results:
  o Runs with unmodified MapReduce code
  o Runs up to 11x faster on same code
  o Provides framework for more optimizations

Outline

• Introduction
• Execution Framework
• Optimization/Analyzer Examples
• Experiments
  o Analyzer recall
  o Performance gain
• Related Work and Conclusion

Execution framework

public void map(Text key, WebPage w, OutputCollector<Text, LongWritable> out) {
  if (w.rank > 10)
    emit(w.url, w.rank);
}

Execution Framework

varload ‘value’
invokevirtual
astore ‘text’
…
ifeq …

Analyzer → Optimizer → Execution

Execution Framework

Analyzer in: user program

varload ‘value’
invokevirtual
astore ‘text’
…
ifeq …

Analyzer out: optimization descriptor, index-generation program

(SELECT f, w.rank>10)

void map(k, w) { out.set(indexedOutputFormat); emit(w.rank, (k,w)) }

Analyzer → Optimizer → Execution

Execution Framework

Optimizer in: optimization descriptor, catalog

(SELECT f, w.rank>10)

/logs/log.1   /logs/log.1.idx   select src…
/logs/log.2   /logs/log.2.idx   select src…

Optimizer out: execution descriptor

(SELECT,“log.1.idx”,w.rank>10)

User program (bytecode):

varload ‘value’
invokevirtual
astore ‘text’
…
ifeq …

Analyzer → Optimizer → Execution
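The optimizer step can be pictured as a catalog lookup. The sketch below is a hedged illustration only: the class names, the field layout of the two descriptors, and the string key format are all assumptions made for this example, since the slides show the descriptors only as tuples.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class OptimizerSketch {
    // Hypothetical shape for the analyzer's output, e.g. (SELECT, w.rank>10).
    static final class OptimizationDescriptor {
        final String op;
        final String predicate;
        OptimizationDescriptor(String op, String predicate) {
            this.op = op;
            this.predicate = predicate;
        }
    }

    // Hypothetical shape for the optimizer's output,
    // e.g. (SELECT, "log.1.idx", w.rank>10).
    static final class ExecutionDescriptor {
        final String op;
        final String indexPath;
        final String predicate;
        ExecutionDescriptor(String op, String indexPath, String predicate) {
            this.op = op;
            this.indexPath = indexPath;
            this.predicate = predicate;
        }
    }

    // Catalog: (input path, op, predicate) -> index file built earlier.
    private final Map<String, String> catalog = new HashMap<>();

    private static String key(String inputPath, OptimizationDescriptor d) {
        return inputPath + "|" + d.op + "|" + d.predicate;
    }

    void register(String inputPath, OptimizationDescriptor d, String indexPath) {
        catalog.put(key(inputPath, d), indexPath);
    }

    // Returns an execution descriptor when a matching index exists; an
    // empty result means the job runs unmodified on the original input.
    Optional<ExecutionDescriptor> plan(String inputPath, OptimizationDescriptor d) {
        String indexPath = catalog.get(key(inputPath, d));
        if (indexPath == null) return Optional.empty();
        return Optional.of(new ExecutionDescriptor(d.op, indexPath, d.predicate));
    }
}
```

Returning an empty result rather than failing keeps the fallback explicit: a missing index only costs the speedup, never correctness.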

Execution Framework

Execution in: execution descriptor, user program

(SELECT,“log.1.idx”,w.rank>10)

varload ‘value’
invokevirtual
astore ‘text’
…
ifeq …

Execution out: program output

numwords 19519

Analyzer → Optimizer → Execution

An Optimization Example

// webpage.java: SCHEMA!
class WebPage { String url; int rank; String content; }

// mapper.java
void map(Text key, WebPage w) {
  if (w.url.equals("teaparty.fr"))
    emit(w.url, 1);
}

• Data-centric programming idioms == relational ops

PROJECTED view: (url, null, null)
DIRECT-OP on compressed WebPage
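A rough sketch of the projected view idea, under the assumption that the mapper reads only `url`. The `WebPage` class here mirrors the slide's schema but uses `Integer` so a dropped field can be null; `projectUrl`, `emittedCount`, and `storedChars` are hypothetical helpers invented for illustration, and character counts are only a crude proxy for bytes on disk.

```java
import java.util.ArrayList;
import java.util.List;

final class ProjectionSketch {
    // Mirrors the slide's schema; Integer (not int) so rank can be null.
    static final class WebPage {
        final String url;
        final Integer rank;
        final String content;
        WebPage(String url, Integer rank, String content) {
            this.url = url;
            this.rank = rank;
            this.content = content;
        }
    }

    // Projected view: keep only the field the mapper reads, (url, null, null).
    static List<WebPage> projectUrl(List<WebPage> pages) {
        List<WebPage> view = new ArrayList<>();
        for (WebPage p : pages) view.add(new WebPage(p.url, null, null));
        return view;
    }

    // The slide's mapper, reduced to counting the (url, 1) pairs it emits.
    static int emittedCount(List<WebPage> pages, String target) {
        int n = 0;
        for (WebPage p : pages) {
            if (p.url.equals(target)) n++;
        }
        return n;
    }

    // Crude size proxy: characters stored across url and content.
    static long storedChars(List<WebPage> pages) {
        long total = 0;
        for (WebPage p : pages) {
            total += p.url.length();
            if (p.content != null) total += p.content.length();
        }
        return total;
    }
}
```

Running the mapper logic on the view gives the same count as on the full records while reading far less data, which is the property that makes the projection safe for this mapper.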

Semantic Extraction

• Query semantics are obvious to human readers, but not explicit in the code for the framework

• EXTRACT IT!
  o Static code analysis
  o Control-flow graph and data-flow graph
  o Find opportunities: selection, projection, direct op
  o Safe optimizations: same output

Analyzer: An Example

// webpage.java
class WebPage { String url; int rank; String content; }

// mapper.java
void map(Text key, WebPage w) {
  if (w.rank > 10)
    emit(w.url, w.rank);
}

[Control-flow graph produced by the Analyzer: Fn Entry → w.rank > 10 → Fn Exit, with emit(url,rank) on the branch where the test holds]
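One way to picture the selection check on a control-flow graph is sketched below. This is a toy under stated assumptions, not the analyzer from the talk: the graph is hand-built rather than derived from bytecode, it must be acyclic, guards are plain strings, and the sketch does not verify that the guard is side-effect free or depends only on the input record. A guard counts as a candidate selection predicate when it lies on every path from the entry to the emit node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class AnalyzerSketch {
    // Edge in a toy control-flow graph; guard is the branch condition that
    // must hold to take the edge, or null if the edge is unconditional.
    static final class Edge {
        final String to;
        final String guard;
        Edge(String to, String guard) { this.to = to; this.guard = guard; }
    }

    private final Map<String, List<Edge>> succ = new HashMap<>();

    void edge(String from, String to, String guard) {
        succ.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(to, guard));
    }

    // Depth-first walk (assumes no cycles): record the guards seen on each
    // path from node to target.
    private void walk(String node, String target, List<String> guards,
                      List<List<String>> found) {
        if (node.equals(target)) {
            found.add(new ArrayList<>(guards));
            return;
        }
        for (Edge e : succ.getOrDefault(node, List.of())) {
            if (e.guard != null) guards.add(e.guard);
            walk(e.to, target, guards, found);
            if (e.guard != null) guards.remove(guards.size() - 1);
        }
    }

    // Returns a guard that appears on every path reaching emit, if any.
    Optional<String> selectionPredicate(String entry, String emit) {
        List<List<String>> paths = new ArrayList<>();
        walk(entry, emit, new ArrayList<>(), paths);
        if (paths.isEmpty()) return Optional.empty();
        List<String> common = new ArrayList<>(paths.get(0));
        for (List<String> p : paths) common.retainAll(p);
        return common.isEmpty() ? Optional.empty() : Optional.of(common.get(0));
    }
}
```

For the mapper on this slide, the graph has an unconditional edge from the entry to the test, a guarded edge from the test to `emit`, and the only path to `emit` carries the guard `w.rank > 10`, so that guard is reported.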

Current Optimizations

• B+-Tree for selections
• Projected views
• Delta compression on numerics
• Direct operation on compressed data

• Hadoop compression is not semantics-aware
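The last two optimizations can be illustrated with a short sketch. Delta encoding stores the first value and then successive differences. For direct operation on compressed data, the slides do not name a compression scheme, so dictionary encoding is used here purely as an assumption: an equality test then compares small integer codes and never decompresses the strings. All class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CompressionSketch {
    // Delta encoding: keep the first value, then successive differences.
    static long[] deltaEncode(long[] values) {
        long[] out = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (i == 0) ? values[0] : values[i] - values[i - 1];
        }
        return out;
    }

    // Decoding is a running sum over the deltas.
    static long[] deltaDecode(long[] deltas) {
        long[] out = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            out[i] = running;
        }
        return out;
    }

    // Dictionary-encoded string column (an assumed scheme, see lead-in).
    static final class DictColumn {
        private final Map<String, Integer> codes = new HashMap<>();
        private final List<Integer> data = new ArrayList<>();

        void add(String value) {
            Integer code = codes.get(value);
            if (code == null) {
                code = codes.size();
                codes.put(value, code);
            }
            data.add(code);
        }

        // Direct operation: translate the constant once, then compare codes.
        int countEquals(String target) {
            Integer code = codes.get(target);
            if (code == null) return 0;
            int n = 0;
            for (int c : data) {
                if (c == code) n++;
            }
            return n;
        }
    }
}
```

Both tricks rely on knowing field types and how the program uses them, which is the sense in which generic block compression is not semantics-aware.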

Experiments: Analyzer

• Test MapReduce programs from Pavlo, SIGMOD '09:

• Detected 5 out of 8 opportunities:
  o Two misses due to custom serialization class
  o Another miss requires knowledge of java.util.Hashtable semantics

Experiments: Performance

• Optimize four Web page handling tasks:
  o Selection (filtering)
  o Projection (aggregation on subfield of page)
  o Join (pages to user visits)
  o User Defined Functions (aggregation)

• 5 cluster nodes, 123GB of data

Experiments: Performance

• Up to 11x speedup over original Hadoop
• Performance comparable to DBMS-X from Pavlo
• UDF not detected: running time identical

Description   Hadoop   Manimal   Speedup   Space Overhead
Selection     430 s    38 s      11.2      0.1%
Projection    5496 s   1856 s    2.96      20%
Join          6078 s   904 s     6.73      11.7%

Related Work

• Lots of recent MapReduce activity
  o Quincy: Task scheduling (Isard et al, SOSP 2009)
  o HadoopDB (Abouzeid et al, PVLDB 2009)
  o Hadoop++ (Dittrich et al, PVLDB 2010)
  o HaLoop (Bu et al, PVLDB 2010)
  o Twister (Ekanayake et al, HPDC 2010)
  o Starfish (Herodotou et al, CIDR 2011)

• Manimal does not introduce new optimizations. It detects and applies existing optimizations to code

Lessons Learned

• The Good: We can recognize data processing idioms in real code. Relational operations still exist even in the NoSQL world

• The Ugly: When we started this project in 2009, we underestimated interest in writing in higher level languages (e.g., Pig Latin)

Conclusion

• Manimal provides a framework for applying well-known optimization techniques to MapReduce
  o Automatic optimization of user code
  o Up to 11x speed increase
  o Provides framework for more optimizations