Making Pig Fly: Optimizing Data Processing on Hadoop
Daniel Dai (@daijy), Thejas Nair (@thejasn)
© Hortonworks Inc. 2011 (Architecting the Future of Big Data)





What is Apache Pig?

• Pig Latin, a high-level data processing language
• An engine that executes Pig Latin locally or on a Hadoop cluster

Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

Pig-latin example

• Query : Get the list of web pages visited by users whose age is between 20 and 29 years.

USERS = load 'users' as (uid, age);
USERS_20s = filter USERS by age >= 20 and age <= 29;
PVs = load 'pages' as (url, uid, timestamp);
PVs_u20s = join USERS_20s by uid, PVs by uid;

Why Pig?

• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel
• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming

Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Pig optimizations

• Ideally the user should not have to bother
• Reality
  – Pig is still young and immature
  – Pig does not have the whole picture
    – Cluster configuration
    – Data histogram
  – Pig philosophy: Pig is docile

Pig optimizations

• What Pig does for you
  – Safe transformations of the query to optimize it
  – Optimized operations (join, sort)
• What you do
  – Organize input in an optimal way
  – Optimize the Pig Latin query
  – Tell Pig which join/group algorithm to use

Rule based optimizer

• Column pruner
• Push up filter
• Push down flatten
• Push up limit
• Partition pruning
• Global optimizer

Column Pruner

• Pig will do column pruning automatically

A = load 'input' as (a0, a1, a2);
B = foreach A generate a0+a1;
C = order B by $0;
Store C into 'output';

Pig will prune a2 automatically.

• Cases where Pig will not do column pruning automatically
  – No schema specified in the load statement

A = load 'input';
B = order A by $0;
C = foreach B generate $0+$1;
Store C into 'output';

DIY:

A = load 'input';
A1 = foreach A generate $0, $1;
B = order A1 by $0;
C = foreach B generate $0+$1;
Store C into 'output';

Column Pruner

• Another case where Pig does not do column pruning
  – Pig does not keep track of unused columns after grouping

A = load 'input' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate SUM(A.a1);
Store C into 'output';

DIY:

A = load 'input' as (a0, a1, a2);
A1 = foreach A generate $0, $1;
B = group A1 by a0;
C = foreach B generate SUM(A1.a1);
Store C into 'output';

Push up filter

• Pig splits the filter condition before pushing it up

[Diagram] Original query: Filter (a0>0 && b0>10) sits above Join(A, B). After splitting the filter condition: Filter a0>0 and Filter b0>10, still above the Join. After pushing up: Filter a0>0 is applied to A and Filter b0>10 to B, before the Join.
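The splitting step can be sketched as follows (a conceptual illustration, not Pig's actual optimizer code): each conjunct is classified by which join input's columns it references, and mixed conjuncts must stay above the join.

```python
# Conceptual sketch: split a conjunctive filter over a join into per-input
# conjuncts that can be pushed up above (before) the join.

def split_filter(conjuncts, columns_of_a, columns_of_b):
    """Partition conjuncts by which join input they reference.

    conjuncts: list of (columns_read, predicate_text) pairs.
    """
    push_to_a, push_to_b, keep_after_join = [], [], []
    for cols, pred in conjuncts:
        if cols <= columns_of_a:
            push_to_a.append(pred)        # references only A: filter A first
        elif cols <= columns_of_b:
            push_to_b.append(pred)        # references only B: filter B first
        else:
            keep_after_join.append(pred)  # mixes both inputs: stays after join
    return push_to_a, push_to_b, keep_after_join

# The slide's example condition: a0>0 && b0>10
conjuncts = [({"a0"}, "a0 > 0"), ({"b0"}, "b0 > 10")]
a, b, rest = split_filter(conjuncts, {"a0", "a1"}, {"b0", "b1"})
```

Here `a` receives `a0 > 0`, `b` receives `b0 > 10`, and nothing remains above the join.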

Other push up/down

• Push down flatten

A = load 'input' as (a0:bag, a1);
B = foreach A generate flatten(a0), a1;
C = order B by a1;
Store C into 'output';

[Diagram] Load → Flatten → Order is rewritten as Load → Order → Flatten: the flatten moves below the order, so the sort handles fewer, smaller records.

• Push up limit

[Diagram] Load → Foreach → Limit is rewritten as Load → Limit → Foreach, and further as Load (limited) → Foreach; likewise Load → Order → Limit becomes Load → Order (limited).

Partition pruning

• Prune unnecessary partitions entirely
  – HCatLoader

[Diagram] Partitions 2010, 2011, 2012: HCatLoader followed by Filter (year>=2011) reads all three partitions; HCatLoader (year>=2011) reads only the 2011 and 2012 partitions.
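The idea can be sketched as follows (a conceptual illustration, not HCatLoader's API; the partition values and predicate are hypothetical): the filter on the partition key is evaluated against partition metadata before any data is read.

```python
# Conceptual sketch of partition pruning: apply the partition-key predicate
# to partition metadata, so unselected partitions are never scanned.

def prune_partitions(partitions, predicate):
    """Keep only partitions whose key satisfies the filter predicate."""
    return [p for p in partitions if predicate(p)]

partitions = [2010, 2011, 2012]                      # one directory per year
selected = prune_partitions(partitions, lambda year: year >= 2011)
# Only the 2011 and 2012 partitions are ever read.
```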

Intermediate file compression

[Diagram] A Pig script compiles into a chain of MapReduce jobs (map 1 / reduce 1, map 2 / reduce 2, map 3 / reduce 3), with Pig temp files written between consecutive jobs.

• Intermediate file between map and reduce
  – Snappy
• Temp file between MapReduce jobs
  – No compression by default

Enable temp file compression

• Pig temp files are not compressed by default
  – Issues with snappy (HADOOP-7990)
  – LZO: not an Apache license
• Enable LZO compression
  – Install LZO for Hadoop
  – In conf/pig.properties:

pig.tmpfilecompression = true
pig.tmpfilecompression.codec = lzo

  – With LZO, up to >90% disk saving and 4x query speedup

Multiquery

• Combine two or more map/reduce jobs into one
  – Happens automatically
  – Cases where we want to control multiquery: it combines too many jobs

[Diagram] A single Load feeds three branches (Group by $0, Group by $1, Group by $2), each followed by a Foreach and a Store.

Control multiquery

• Disable multiquery
  – Command line option: -M
• Use "exec" to mark the boundary

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, COUNT(A);
Store C0 into 'output0';
B1 = group A by $1;
C1 = foreach B1 generate group, COUNT(A);
Store C1 into 'output1';
exec
B2 = group A by $2;
C2 = foreach B2 generate group, COUNT(A);
Store C2 into 'output2';

Implement the right UDF

• Algebraic UDF
  – Initial
  – Intermediate
  – Final

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, SUM(A);
Store C0 into 'output0';

Map: Initial
Combiner: Intermediate
Reduce: Final
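The three-phase contract can be illustrated with a toy SUM (actual Pig UDFs are written in Java; the function names here only illustrate the algebra): Initial runs per tuple on the map side, Intermediate merges partials in the combiner, and Final merges the remaining partials on the reduce side.

```python
# Conceptual sketch of an algebraic SUM split into Initial/Intermediate/Final.

def initial(tuple_value):
    # Map side: one input tuple -> a partial sum
    return tuple_value

def intermediate(partials):
    # Combiner: merge partial sums into a new partial
    return sum(partials)

def final(partials):
    # Reduce side: merge the remaining partials into the result
    return sum(partials)

# Simulate map -> combiner -> reduce for the values of one group
values = [1, 2, 3, 4]
map_out = [initial(v) for v in values]
combined = [intermediate(map_out[:2]), intermediate(map_out[2:])]
result = final(combined)   # 10
```

Because Intermediate runs in the combiner, the reducer receives a few partials per group instead of every raw value.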

Implement the right UDF

• Accumulator UDF
  – Reduce side UDF
  – Normally takes a bag
• Benefit
  – Big bags are passed in batches
  – Avoids using too much memory
  – Batch size: pig.accumulative.batchsize=20000

A = load 'input';
B0 = group A by $0;
C0 = foreach B0 generate group, my_accum(A);
Store C0 into 'output0';

class my_accum implements Accumulator {
  public void accumulate(Tuple bag) {
    // take a bag chunk
  }
  public Object getValue() {
    // called after all bag chunks are processed
  }
}
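The accumulate/getValue protocol can be sketched in Python (a conceptual illustration, not Pig's Java Accumulator interface; the class name and batch size are hypothetical): the framework feeds the bag in chunks, so the whole bag never has to be in memory at once.

```python
# Conceptual sketch of the Accumulator contract: the big bag arrives in
# batches (batch size ~ pig.accumulative.batchsize), and only the running
# state is kept in memory.

class MySumAccumulator:
    def __init__(self):
        self.total = 0

    def accumulate(self, batch):
        # Called once per bag chunk
        self.total += sum(batch)

    def get_value(self):
        # Called after all chunks of the bag have been processed
        return self.total

acc = MySumAccumulator()
bag = list(range(100))
batch_size = 25
for i in range(0, len(bag), batch_size):
    acc.accumulate(bag[i:i + batch_size])
result = acc.get_value()   # 4950
```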

Memory optimization

• Control bag size on the reduce side
  – If the bag size exceeds a threshold, spill to disk
  – Control the bag size to fit the bag in memory if possible

MapReduce:
reduce(Text key, Iterator<Writable> values, ...)

[Diagram] The reduce-side Iterator is materialized into one bag per input: Bag of Input 1, Bag of Input 2, Bag of Input 3.

pig.cachedbag.memusage=0.2
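The spill behavior can be sketched as follows (a simplified illustration, not Pig's actual spillable-bag implementation; a tuple-count limit stands in for the pig.cachedbag.memusage memory fraction):

```python
# Conceptual sketch: a bag that keeps at most `limit` tuples in memory and
# spills the rest to a temp file, replaying the spilled tuples on iteration.

import tempfile

class SpillableBag:
    def __init__(self, limit):
        self.limit = limit
        self.in_memory = []
        self.spill_file = None

    def add(self, tup):
        if len(self.in_memory) < self.limit:
            self.in_memory.append(tup)
        else:
            if self.spill_file is None:
                self.spill_file = tempfile.TemporaryFile(mode="w+")
            self.spill_file.write(repr(tup) + "\n")   # spill to disk

    def __iter__(self):
        yield from self.in_memory
        if self.spill_file is not None:
            self.spill_file.flush()
            self.spill_file.seek(0)
            for line in self.spill_file:
                yield eval(line)                      # replay spilled tuples

bag = SpillableBag(limit=2)
for t in [(1,), (2,), (3,)]:
    bag.add(t)
tuples = list(bag)   # [(1,), (2,), (3,)]; the third tuple came back from disk
```

Raising the limit (in Pig, the memusage fraction) avoids disk round trips; lowering it bounds reducer memory.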

Optimization starts before pig

• Input format
• Serialization format
• Compression

Input format - Test Query

> searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, …);
> search_thejas = filter searches by Query matches '.*thejas.*';
> dump search_thejas;
(1568578, thejasminesupperclub, …)

Input formats

[Chart] RunTime (sec, 0–140) comparing input formats: PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types).

Columnar format

• RCFile
• Columnar format for a group of rows
• More efficient if you query a subset of columns
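The benefit can be sketched like this (a toy in-memory model of a row group, not the RCFile format itself): because each column of a row group is stored contiguously, projecting one column touches only that column's data.

```python
# Conceptual sketch of a columnar row group: column name -> column values.
# Projecting a subset of columns never touches the other columns' bytes.

def project_column(row_group, column_name):
    """Return one column of a row group without reading the others."""
    return row_group[column_name]

row_group = {
    "url": ["a.html", "b.html"],
    "uid": [1, 2],
    "timestamp": [100, 200],
}
uids = project_column(row_group, "uid")   # only the uid column is read
```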

Tests with RCFile

• Tests with load + project + filter out all records
• Using HCatalog, with compression and types
• Test 1: project 1 out of 5 columns
• Test 2: project all 5 columns

RCFile test results

[Chart] Runtime (sec, 0–140) for "Project 1" and "Project all", comparing Plain Text vs RCFile.

Cost based optimizations

• Optimization decisions based on your query/data
• Often an iterative process: Run query → Measure → Tune

Cost based optimization - Aggregation

• Hash Based Agg
  – Use pig.exec.mapPartAgg=true to enable

[Diagram] In the map task, the map logic feeds a Hash Based Agg (HBA) step; the HBA output, rather than raw map output, is shipped to the reduce task.
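A minimal sketch of what HBA does inside a map task (illustrative Python, not Pig's implementation): partial aggregates are kept in a hash table keyed by group, so the map task emits one record per group instead of one per input row.

```python
# Conceptual sketch of map-side hash-based aggregation for a SUM.

def map_side_hash_agg(rows):
    """rows: (group_key, value) pairs seen by one map task."""
    table = {}
    for key, value in rows:
        table[key] = table.get(key, 0) + value   # aggregate in the hash table
    return table   # one entry per group is shipped to the reducers

rows = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
partials = map_side_hash_agg(rows)   # {"a": 8, "b": 2}
```

Four input rows become two map-output records; the reducers merge these partials as usual.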

Cost based optimization – Hash Agg.

• Auto off feature
  – Switches off HBA if the output reduction is not good enough
• Configuring Hash Agg
  – Configure the auto off feature: pig.exec.mapPartAgg.minReduction
  – Configure memory used: pig.cachedbag.memusage
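Collected into conf/pig.properties, the knobs above might look like this (the numeric values are illustrative, not recommendations):

```properties
# Enable map-side hash-based aggregation
pig.exec.mapPartAgg = true
# Auto-off: disable HBA unless map output shrinks by at least this factor
pig.exec.mapPartAgg.minReduction = 10
# Fraction of memory the cached bags / in-map hash table may use
pig.cachedbag.memusage = 0.2
```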

Cost based optimization - Join

• Use the appropriate join algorithm
  – Skew on the join key: skew join
  – Small input fits in memory: FR join
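An FR (fragment-replicate) join can be sketched like this (a conceptual illustration, not Pig's implementation; the relations are hypothetical): the small input is replicated into every map task's memory as a hash table, and the big input is streamed against it, so no reduce phase is needed.

```python
# Conceptual sketch of a fragment-replicate join.

def fr_join(big_rows, small_rows):
    """big_rows/small_rows: (key, value) pairs; the small side fits in memory."""
    small_table = {}
    for key, value in small_rows:               # replicated, built once per map
        small_table.setdefault(key, []).append(value)
    out = []
    for key, value in big_rows:                 # streamed, one row at a time
        for small_value in small_table.get(key, []):
            out.append((key, value, small_value))
    return out

pages = [(1, "a.html"), (2, "b.html"), (1, "c.html")]
users = [(1, "alice")]                          # small relation, held in memory
joined = fr_join(pages, users)
# [(1, "a.html", "alice"), (1, "c.html", "alice")]
```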

Cost based optimization – MR tuning

• Tune MR parameters to reduce IO
  – Control spills using map-side sort parameters
  – Tune reduce-side shuffle/sort-merge parameters

Parallelism of reduce tasks

[Chart] Runtime vs. number of reduce tasks (4, 6, 8, 24, 48, 256); runtimes range roughly from 0:14:24 to 0:25:55.

• Number of reduce slots = 6
• Factors affecting runtime
  – Cores simultaneously used / skew
  – Cost of having additional reduce tasks

Cost based optimization – keep data sorted

• Frequent join operations on the same keys
  – Keep data sorted on the keys
  – Use merge join
  – Optimized group on sorted keys
  – Works with only a few load functions: needs an additional interface implementation
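A merge join over pre-sorted inputs can be sketched like this (illustrative Python with unique keys; Pig's actual merge join also builds an index on the right input and handles duplicate keys): a single forward pass over both inputs, with no shuffle or sort.

```python
# Conceptual sketch of a merge join over two inputs already sorted on the key.

def merge_join(left, right):
    """left/right: (key, value) pairs, both sorted by key; keys unique here."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lk < rk:
            i += 1      # advance the side with the smaller key
        else:
            j += 1
    return out

left = [(1, "x"), (3, "y"), (5, "z")]
right = [(3, "p"), (4, "q"), (5, "r")]
joined = merge_join(left, right)        # [(3, "y", "p"), (5, "z", "r")]
```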

Optimizations for sorted data

[Chart] Runtime (sec, 0–90) comparing sort+sort+join+join against join+join on already-sorted data; stacked components: Sort1, Sort2, Join 1, Join 2.

Future Directions

• Optimize using stats
  – Use historical stats with HCatalog
  – Sampling

Questions

?
