Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai (@daijy), Thejas Nair (@thejasn)
© Hortonworks Inc. 2011 (Architecting the Future of Big Data)





What is Apache Pig?

• Pig Latin, a high-level data processing language
• An engine that executes Pig Latin locally or on a Hadoop cluster

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Pig-latin example

• Query : Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;

Why Pig?

• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel
• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Pig optimizations

• Ideally the user should not have to bother
• Reality
  – Pig is still young and immature
  – Pig does not have the whole picture
    – Cluster configuration
    – Data histogram
  – Pig philosophy: Pig is docile

Pig optimizations

• What Pig does for you
  – Safe transformations of the query to optimize it
  – Optimized operations (join, sort)
• What you do
  – Organize input in an optimal way
  – Optimize the Pig Latin query
  – Tell Pig which join/group algorithm to use

Rule based optimizer

• Column pruner
• Push up filter
• Push down flatten
• Push up limit
• Partition pruning
• Global optimizer

Column Pruner

• Pig will do column pruning automatically

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
Store C into 'output';

Pig will prune a2 automatically.

• Cases where Pig will not do column pruning automatically
  – No schema specified in the load statement

A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
Store C into 'output';

DIY:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
Store C into 'output';

Column Pruner

• Another case where Pig does not do column pruning
  – Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
Store C into 'output';

DIY:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
Store C into 'output';

Push up filter

• Pig splits the filter condition before pushing it up

[Diagram] Original query: Filter (a0>0 && b0>10) sits above Join(A, B). After splitting the filter condition: Filter a0>0 and Filter b0>10, still above the Join. After pushing up: Filter a0>0 is applied to A and Filter b0>10 to B, before the Join.
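The splitting step can be sketched as follows (a conceptual illustration, not Pig's actual optimizer code): each conjunct is classified by which join input's columns it references, and mixed conjuncts must stay above the join.

```python
# Conceptual sketch: split a conjunctive filter over a join into per-input
# conjuncts that can be pushed up above (before) the join.

def split_filter(conjuncts, columns_of_a, columns_of_b):
    """Partition conjuncts by which join input they reference.

    conjuncts: list of (columns_read, predicate_text) pairs.
    """
    push_to_a, push_to_b, keep_after_join = [], [], []
    for cols, pred in conjuncts:
        if cols <= columns_of_a:
            push_to_a.append(pred)        # references only A: filter A first
        elif cols <= columns_of_b:
            push_to_b.append(pred)        # references only B: filter B first
        else:
            keep_after_join.append(pred)  # mixes both inputs: stays after join
    return push_to_a, push_to_b, keep_after_join

# The slide's example condition: a0>0 && b0>10
conjuncts = [({"a0"}, "a0 > 0"), ({"b0"}, "b0 > 10")]
a, b, rest = split_filter(conjuncts, {"a0", "a1"}, {"b0", "b1"})
```

Here `a` receives `a0 > 0`, `b` receives `b0 > 10`, and nothing remains above the join.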

Other push up/down

• Push down flatten

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
Store C into 'output';

[Diagram] Load → Flatten → Order is rewritten as Load → Order → Flatten: the flatten moves below the order, so the sort handles fewer, smaller records.

• Push up limit

[Diagram] Load → Foreach → Limit is rewritten as Load → Limit → Foreach, and further as Load (limited) → Foreach; likewise Load → Order → Limit becomes Load → Order (limited).

Partition pruning

• Prune unnecessary partitions entirely
  – HCatLoader

[Diagram] Partitions 2010, 2011, 2012: HCatLoader followed by Filter (year>=2011) reads all three partitions; HCatLoader (year>=2011) reads only the 2011 and 2012 partitions.
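The idea can be sketched as follows (a conceptual illustration, not HCatLoader's API; the partition values and predicate are hypothetical): the filter on the partition key is evaluated against partition metadata before any data is read.

```python
# Conceptual sketch of partition pruning: apply the partition-key predicate
# to partition metadata, so unselected partitions are never scanned.

def prune_partitions(partitions, predicate):
    """Keep only partitions whose key satisfies the filter predicate."""
    return [p for p in partitions if predicate(p)]

partitions = [2010, 2011, 2012]                      # one directory per year
selected = prune_partitions(partitions, lambda year: year >= 2011)
# Only the 2011 and 2012 partitions are ever read.
```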

Intermediate file compression

[Diagram] A Pig script compiles into a chain of MapReduce jobs (map 1 / reduce 1, map 2 / reduce 2, map 3 / reduce 3), with Pig temp files written between consecutive jobs.

• Intermediate file between map and reduce
  – Snappy
• Temp file between MapReduce jobs
  – No compression by default

Enable temp file compression

• Pig temp files are not compressed by default
  – Issues with snappy (HADOOP-7990)
  – LZO: not an Apache license
• Enable LZO compression
  – Install LZO for Hadoop
  – In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

  – With LZO, up to >90% disk saving and 4x query speedup

Multiquery

• Combine two or more map/reduce jobs into one
  – Happens automatically
  – Cases where we want to control multiquery: it combines too many jobs

[Diagram] A single Load feeds three branches (Group by $0, Group by $1, Group by $2), each followed by a Foreach and a Store.

Control multiquery

• Disable multiquery
  – Command line option: -M
• Use "exec" to mark the boundary

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
Store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
Store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
Store C2 into 'output2';

Implement the right UDF

• Algebraic UDF
  – Initial
  – Intermediate
  – Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
Store C0 into 'output0';

Map: Initial
Combiner: Intermediate
Reduce: Final
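The three-phase contract can be illustrated with a toy SUM (actual Pig UDFs are written in Java; the function names here only illustrate the algebra): Initial runs per tuple on the map side, Intermediate merges partials in the combiner, and Final merges the remaining partials on the reduce side.

```python
# Conceptual sketch of an algebraic SUM split into Initial/Intermediate/Final.

def initial(tuple_value):
    # Map side: one input tuple -> a partial sum
    return tuple_value

def intermediate(partials):
    # Combiner: merge partial sums into a new partial
    return sum(partials)

def final(partials):
    # Reduce side: merge the remaining partials into the result
    return sum(partials)

# Simulate map -> combiner -> reduce for the values of one group
values = [1, 2, 3, 4]
map_out = [initial(v) for v in values]
combined = [intermediate(map_out[:2]), intermediate(map_out[2:])]
result = final(combined)   # 10
```

Because Intermediate runs in the combiner, the reducer receives a few partials per group instead of every raw value.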

Implement the right UDF

• Accumulator UDF
  – Reduce side UDF
  – Normally takes a bag
• Benefit
  – Big bags are passed in batches
  – Avoids using too much memory
  – Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
Store C0 into 'output0';

class my_accum implements Accumulator {
  public void accumulate(Tuple bag) {
    // take a bag chunk
  }
  public Object getValue() {
    // called after all bag chunks are processed
  }
}
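The accumulate/getValue protocol can be sketched in Python (a conceptual illustration, not Pig's Java Accumulator interface; the class name and batch size are hypothetical): the framework feeds the bag in chunks, so the whole bag never has to be in memory at once.

```python
# Conceptual sketch of the Accumulator contract: the big bag arrives in
# batches (batch size ~ pig.accumulative.batchsize), and only the running
# state is kept in memory.

class MySumAccumulator:
    def __init__(self):
        self.total = 0

    def accumulate(self, batch):
        # Called once per bag chunk
        self.total += sum(batch)

    def get_value(self):
        # Called after all chunks of the bag have been processed
        return self.total

acc = MySumAccumulator()
bag = list(range(100))
batch_size = 25
for i in range(0, len(bag), batch_size):
    acc.accumulate(bag[i:i + batch_size])
result = acc.get_value()   # 4950
```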

Memory optimization

• Control bag size on the reduce side
  – If the bag size exceeds a threshold, spill to disk
  – Control the bag size to fit the bag in memory if possible

MapReduce:
reduce(Text key, Iterator<Writable> values, ...)

[Diagram] The reduce-side Iterator is materialized into one bag per input: Bag of Input 1, Bag of Input 2, Bag of Input 3.

pig.cachedbag.memusage=0.2
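The spill behavior can be sketched as follows (a simplified illustration, not Pig's actual spillable-bag implementation; a tuple-count limit stands in for the pig.cachedbag.memusage memory fraction):

```python
# Conceptual sketch: a bag that keeps at most `limit` tuples in memory and
# spills the rest to a temp file, replaying the spilled tuples on iteration.

import tempfile

class SpillableBag:
    def __init__(self, limit):
        self.limit = limit
        self.in_memory = []
        self.spill_file = None

    def add(self, tup):
        if len(self.in_memory) < self.limit:
            self.in_memory.append(tup)
        else:
            if self.spill_file is None:
                self.spill_file = tempfile.TemporaryFile(mode="w+")
            self.spill_file.write(repr(tup) + "\n")   # spill to disk

    def __iter__(self):
        yield from self.in_memory
        if self.spill_file is not None:
            self.spill_file.flush()
            self.spill_file.seek(0)
            for line in self.spill_file:
                yield eval(line)                      # replay spilled tuples

bag = SpillableBag(limit=2)
for t in [(1,), (2,), (3,)]:
    bag.add(t)
tuples = list(bag)   # [(1,), (2,), (3,)]; the third tuple came back from disk
```

Raising the limit (in Pig, the memusage fraction) avoids disk round trips; lowering it bounds reducer memory.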

Optimization starts before pig

• Input format
• Serialization format
• Compression

Input format - Test Query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, …)

Input formats

[Chart] RunTime (sec, 0–140) comparing input formats: PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types).

Columnar format

• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns
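The benefit can be sketched like this (a toy in-memory model of a row group, not the RCFile format itself): because each column of a row group is stored contiguously, projecting one column touches only that column's data.

```python
# Conceptual sketch of a columnar row group: column name -> column values.
# Projecting a subset of columns never touches the other columns' bytes.

def project_column(row_group, column_name):
    """Return one column of a row group without reading the others."""
    return row_group[column_name]

row_group = {
    "url": ["a.html", "b.html"],
    "uid": [1, 2],
    "timestamp": [100, 200],
}
uids = project_column(row_group, "uid")   # only the uid column is read
```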

Tests with RCFile

• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns

RCFile test results

[Chart] Runtime (sec, 0–140) for "Project 1" and "Project all", comparing Plain Text vs RCFile.

Cost based optimizations

• Optimization decisions based on your query/data
• Often an iterative process: Run query → Measure → Tune

Cost based optimization - Aggregation

• Hash Based Agg
  – Use pig.exec.mapPartAgg=true to enable

[Diagram] In the map task, the map logic feeds a Hash Based Agg (HBA) step; the HBA output, rather than raw map output, is shipped to the reduce task.
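A minimal sketch of what HBA does inside a map task (illustrative Python, not Pig's implementation): partial aggregates are kept in a hash table keyed by group, so the map task emits one record per group instead of one per input row.

```python
# Conceptual sketch of map-side hash-based aggregation for a SUM.

def map_side_hash_agg(rows):
    """rows: (group_key, value) pairs seen by one map task."""
    table = {}
    for key, value in rows:
        table[key] = table.get(key, 0) + value   # aggregate in the hash table
    return table   # one entry per group is shipped to the reducers

rows = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
partials = map_side_hash_agg(rows)   # {"a": 8, "b": 2}
```

Four input rows become two map-output records; the reducers merge these partials as usual.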

Cost based optimization – Hash Agg.

• Auto off feature
  – Switches off HBA if the output reduction is not good enough
• Configuring Hash Agg
  – Configure the auto off feature: pig.exec.mapPartAgg.minReduction
  – Configure memory used: pig.cachedbag.memusage
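Collected into conf/pig.properties, the knobs above might look like this (the numeric values are illustrative, not recommendations):

```properties
# Enable map-side hash-based aggregation
pig.exec.mapPartAgg = true
# Auto-off: disable HBA unless map output shrinks by at least this factor
pig.exec.mapPartAgg.minReduction = 10
# Fraction of memory the cached bags / in-map hash table may use
pig.cachedbag.memusage = 0.2
```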

Cost based optimization - Join

• Use the appropriate join algorithm
  – Skew on the join key: skew join
  – Small input fits in memory: FR join
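An FR (fragment-replicate) join can be sketched like this (a conceptual illustration, not Pig's implementation; the relations are hypothetical): the small input is replicated into every map task's memory as a hash table, and the big input is streamed against it, so no reduce phase is needed.

```python
# Conceptual sketch of a fragment-replicate join.

def fr_join(big_rows, small_rows):
    """big_rows/small_rows: (key, value) pairs; the small side fits in memory."""
    small_table = {}
    for key, value in small_rows:               # replicated, built once per map
        small_table.setdefault(key, []).append(value)
    out = []
    for key, value in big_rows:                 # streamed, one row at a time
        for small_value in small_table.get(key, []):
            out.append((key, value, small_value))
    return out

pages = [(1, "a.html"), (2, "b.html"), (1, "c.html")]
users = [(1, "alice")]                          # small relation, held in memory
joined = fr_join(pages, users)
# [(1, "a.html", "alice"), (1, "c.html", "alice")]
```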

Cost based optimization – MR tuning

• Tune MR parameters to reduce IO
  – Control spills using map-side sort parameters
  – Tune reduce-side shuffle/sort-merge parameters

Parallelism of reduce tasks

[Chart] Runtime vs. number of reduce tasks (4, 6, 8, 24, 48, 256); runtimes range roughly from 0:14:24 to 0:25:55.

• Number of reduce slots = 6
• Factors affecting runtime
  – Cores simultaneously used / skew
  – Cost of having additional reduce tasks

Cost based optimization – keep data sorted

• Frequent join operations on the same keys
  – Keep data sorted on the keys
  – Use merge join
  – Optimized group on sorted keys
  – Works with only a few load functions: needs an additional interface implementation
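A merge join over pre-sorted inputs can be sketched like this (illustrative Python with unique keys; Pig's actual merge join also builds an index on the right input and handles duplicate keys): a single forward pass over both inputs, with no shuffle or sort.

```python
# Conceptual sketch of a merge join over two inputs already sorted on the key.

def merge_join(left, right):
    """left/right: (key, value) pairs, both sorted by key; keys unique here."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lk < rk:
            i += 1      # advance the side with the smaller key
        else:
            j += 1
    return out

left = [(1, "x"), (3, "y"), (5, "z")]
right = [(3, "p"), (4, "q"), (5, "r")]
joined = merge_join(left, right)        # [(3, "y", "p"), (5, "z", "r")]
```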

Optimizations for sorted data

[Chart] Runtime (sec, 0–90) comparing sort+sort+join+join against join+join on already-sorted data; stacked components: Sort1, Sort2, Join 1, Join 2.

Future Directions

• Optimize using stats
  – Use historical stats with HCatalog
  – Sampling

Questions

?
