Practical Pig + PigUnit Michael G. Noll, Verisign July 2012

This talk was given at the second meeting of the Swiss Big Data User Group on July 16, 2012, at ETH Zürich. http://www.bigdata-usergroup.ch/item/296477

Page 1: Practical Pig and PigUnit (Michael Noll, Verisign)

Practical Pig + PigUnit

Michael G. Noll, Verisign

July 2012

2 Verisign Public

This talk is about Apache Pig

• High-level data flow language (think: DSL) for writing Hadoop MapReduce jobs
• Why and when should you care about Pig?
  • You are a Hadoop beginner
    • … and want to implement a JOIN, for instance
  • You are a Hadoop expert
  • You only scratch your head when you see public static void main(String args...)
  • You think Java is not the best tool for this job [pun!]
    • Think: too low-level, too many lines of code, no interactive mode for exploratory analysis, readability > performance, et cetera

Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation. Java is a trademark of Oracle Corporation.

A basic Pig script

• Example: sorting user records by users’ age

records = LOAD '/path/to/input' AS (user:chararray, age:int);

sorted_records = ORDER records BY age DESC;

STORE sorted_records INTO '/path/to/output';

• Popular alternatives to Pig
  • Hive: ~ SQL for Hadoop
  • Hadoop Streaming: use any programming language for MR
    • Even though you still write code in a “real” programming language, Streaming provides an environment that makes it more convenient than native Hadoop Java code.

Preliminaries

• Talk is based on Pig 0.10.0, released in April ’12
• Some notable 0.10.0 improvements
  • Hadoop 2.0 support
  • Loading and storing JSON
  • Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
  • Amazon S3 support

Testing Pig – a primer

“Testing” Pig scripts – some examples

$ pig -x local

$ pig [-debug | -dryrun]

$ pig -param input=/path/to/small-sample.txt

DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP

“Testing” Pig scripts (cont.)

• JobTracker UI
• PigStats, JobStats, HadoopJobHistoryLoader

Also: inspecting Hadoop log files, …

Now what have you been using?

However…

• Previous approaches are primarily useful (and used) for creating the Pig script in the first place
  • Like ILLUSTRATE
• None of them are really geared towards unit testing

#!/bin/bash
pig -param date=$1 -param output=$2 myscript.pig
hadoop fs -copyToLocal $2 /tmp/jobresult
if [ ARGH!!! ] ...

• Difficult to automate (think: production environment)

$ mvn clean test ??

• Difficult to integrate into a typical development workflow, e.g. backed by Maven, Java and a CI server

Apache Maven is a trademark of the Apache Software Foundation.

PigUnit

PigUnit

• Available in Pig since version 0.8

“PigUnit provides a unit-testing framework that plugs into JUnit to help you write unit tests that can be run on a regular basis.”
-- Alan F. Gates, Programming Pig

• Easy way to add Pig unit testing to your dev workflow iff you are a Java developer
  • See “Tips and Tricks” later for working around this constraint
• Works with both JUnit and TestNG
• PigUnit docs have “potential”
  • Some basic examples; beyond that you end up reading the source code of both PigUnit and Pig (but it’s manageable)
  • http://pig.apache.org/docs/r0.10.0/test.html#pigunit

Getting PigUnit up and running

• PigUnit is not included in current Pig releases :(
  • You must manually build the PigUnit jar file

$ cd /path/to/pig-sources # can be a release tarball
$ ant jar pigunit-jar
...
$ ls -l pig*jar
-rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
-rw-r--r-- 1 mnoll mnoll 285627 ... pigunit.jar

• Add these jar(s) to your CLASSPATH, done!

PigUnit and Maven

• Unfortunately the Apache Pig project does not yet publish an official Maven artifact for PigUnit

WILL NOT WORK IN pom.xml :(
<dependency>
  <groupId>org.apache.pig</groupId>
  <artifactId>pigunit</artifactId>
  <version>0.10.0</version>
</dependency>

• Alternatives:
  • Publish to your local Artifactory instance
  • Use a local file-based <repository>
  • Use <scope>system</scope> in pom.xml (not recommended)
  • Use trusted third-party repos like Cloudera’s

Artifactory is a trademark of JFrog ltd.

A simple PigUnit test

A simple PigUnit test

• Here, we provide input + output data in the Java code
• Pig script is read from file wordcount.pig

// needs org.junit.Test and org.apache.pig.pigunit.PigTest on the classpath
@Test
public void testSimpleExample() throws Exception {
    // the PigTest constructor and assertOutput() declare checked exceptions
    PigTest simpleTest = new PigTest("wordcount.pig");

    String[] input = { "foo", "bar", "foo" };
    String[] expectedOutput = { "(foo,2)", "(bar,1)" };

    simpleTest.assertOutput(
        "aliasInput", input,
        "aliasOutput", expectedOutput);
}

A simple PigUnit test (cont.)

• wordcount.pig

-- PigUnit populates the alias 'aliasInput'
-- with the test input data
aliasInput = LOAD '<tmpLoc>' AS <schema>;

-- ...here comes your actual code...

-- PigUnit will treat the contents of the alias
-- 'aliasOutput' as the actual output data in
-- the assert statement
aliasOutput = <your_final_statement>;

-- Note: PigUnit ignores STORE operations by default
STORE aliasOutput INTO 'output';

A simple PigUnit test (cont.)

simpleTest.assertOutput(
    "aliasInput", input,              // (1)
    "aliasOutput", expectedOutput);   // (2)

(1) PigUnit injects input[] = { "foo", "bar", "foo" } into the alias named aliasInput in the Pig script. For this purpose it creates a temporary file, writes the equivalent of StringUtils.join(input, "\n") to the file, and finally makes its location available to the LOAD operation.

(2) PigUnit opens an iterator on the content of aliasOutput, and runs assertEquals() based on StringUtils.join(..., "\n") with expectedOutput and the actual content.

See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.
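The comparison step described on this slide can be imitated with the Java standard library alone. The sketch below is not PigUnit's code: the class and method names (JoinedOutputComparison, joinLines, matches) are made up for illustration, and it assumes only what the slide states, namely that both sides are joined with "\n" and compared as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Stdlib-only imitation of the comparison step; hypothetical, not PigUnit's actual code. */
final class JoinedOutputComparison {

    /** Joins the tuples' string forms with newlines, one tuple per line. */
    static String joinLines(Iterator<String> tuples) {
        List<String> lines = new ArrayList<>();
        while (tuples.hasNext()) {
            lines.add(tuples.next());
        }
        return String.join("\n", lines);
    }

    /** Returns true if expected and actual output are equal once joined into single strings. */
    static boolean matches(String[] expected, Iterator<String> actual) {
        String expectedText = String.join("\n", expected);
        return expectedText.equals(joinLines(actual));
    }

    public static void main(String[] args) {
        String[] expected = { "(foo,2)", "(bar,1)" };
        List<String> sameOrder = Arrays.asList("(foo,2)", "(bar,1)");
        List<String> otherOrder = Arrays.asList("(bar,1)", "(foo,2)");

        System.out.println(matches(expected, sameOrder.iterator()));  // prints "true"
        System.out.println(matches(expected, otherOrder.iterator())); // prints "false"
    }
}
```

If the comparison really works on the joined string, it is sensitive to tuple order, so it may be worth sorting the output alias in the Pig script before asserting on it.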

PigUnit drawbacks

• How to divide your “main” Pig script into testable units?
  • Only run a single end-to-end test for the full script?
  • Extract testable snippets from the main script?
    • Argh, code duplication!
  • Split the main script into logical units = smaller scripts; then run individual tests and include the smaller scripts in the main script
    • Ok-ish but splitting too much makes the Pig code hard to understand (too many trees, no forest).
• PigUnit is a nice tool but batteries are not included
  • It does work but it is not as convenient or powerful as you’d like.
    • Notably you still need to know and write Java to use it. But one compelling reason for Pig is that you can do without Java.
  • You may end up writing your own wrapper/helper lib around it.
    • Consider contributing this back to the Apache Pig project!

Tips and tricks

Connecting to a real cluster (default: local mode)

// this is not enough to enable cluster mode in PigUnit
pigServer = new PigServer(ExecType.MAPREDUCE);
// ...do PigUnit stuff...

// rather:
Properties props = System.getProperties();
if (clusterMode)
    props.setProperty("pigunit.exectype.cluster", "true");
else
    // java.util.Properties has no removeProperty(); remove() is inherited from Hashtable
    props.remove("pigunit.exectype.cluster");

• $HADOOP_CONF_DIR must be in CLASSPATH
• Similar approach for enabling LZO support
  • mapred.output.compress => “true”
  • mapred.output.compression.codec => “c.h.c.lzo.LzopCodec”
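One way to keep the system-property switch from leaking between tests is to set it only around a block of code and restore the previous value afterwards. The following is a minimal sketch using only the Java standard library; the class PigUnitExecMode and the method withClusterMode are hypothetical helpers, not part of PigUnit. The slides do not say when PigUnit reads the property, so treat it as an assumption that the property has to be in place before PigUnit first sets up its Pig server.

```java
import java.util.function.Supplier;

/** Hypothetical helper around the "pigunit.exectype.cluster" switch shown on the slide. */
final class PigUnitExecMode {

    static final String CLUSTER_PROPERTY = "pigunit.exectype.cluster";

    /** Runs body with the property set (cluster) or cleared (local), then restores the old value. */
    static <T> T withClusterMode(boolean clusterMode, Supplier<T> body) {
        String previous = System.getProperty(CLUSTER_PROPERTY);
        try {
            if (clusterMode) {
                System.setProperty(CLUSTER_PROPERTY, "true");
            } else {
                System.clearProperty(CLUSTER_PROPERTY);
            }
            return body.get();
        } finally {
            // Put back whatever was configured before the call.
            if (previous == null) {
                System.clearProperty(CLUSTER_PROPERTY);
            } else {
                System.setProperty(CLUSTER_PROPERTY, previous);
            }
        }
    }

    public static void main(String[] args) {
        String seenInside = withClusterMode(true, () -> System.getProperty(CLUSTER_PROPERTY));
        System.out.println("inside the block: " + seenInside);
        System.out.println("after the block: " + System.getProperty(CLUSTER_PROPERTY));
    }
}
```

In a real test suite the body passed to withClusterMode would be the code that creates and runs the PigUnit test.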

Write a convenient PigUnit runner for your users

• Pig user != Java developer
• Pig users should only need to provide three files:
  • pig/myscript.pig
  • input/testdata.txt
  • output/expected.txt
• PigUnit runner discovers and runs tests for users
  • PigTest#assertOutput() can also handle files
  • But you must manage file uploads and similar “glue” yourself

pigUnitRunner.runPigTest(
    new Path(scriptFile),
    new Path(inputFile),
    new Path(expectedOutputFile));
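The "discovers tests" part of such a runner could start with a small class that locates the three files of a test case and reports which ones are missing. This is a sketch under assumptions: the class PigTestCaseFiles, its methods, and the idea of one base directory per test case are hypothetical, and it uses java.nio.file.Path rather than the Path type in the slide's snippet. It deliberately stops before running Pig, so it needs nothing beyond the Java standard library.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Locates the three files of one test case; hypothetical helper that does not run Pig itself. */
final class PigTestCaseFiles {

    final Path script;
    final Path input;
    final Path expected;

    private PigTestCaseFiles(Path script, Path input, Path expected) {
        this.script = script;
        this.input = input;
        this.expected = expected;
    }

    /** Resolves the pig/, input/ and output/ layout from the slide under a base directory. */
    static PigTestCaseFiles under(Path baseDir, String scriptName) {
        return new PigTestCaseFiles(
            baseDir.resolve("pig").resolve(scriptName),
            baseDir.resolve("input").resolve("testdata.txt"),
            baseDir.resolve("output").resolve("expected.txt"));
    }

    /** Lists the files that are absent, so a runner can fail early with a clear message. */
    List<Path> missingFiles() {
        List<Path> missing = new ArrayList<>();
        for (Path file : new Path[] { script, input, expected }) {
            if (!Files.isRegularFile(file)) {
                missing.add(file);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Path baseDir = Paths.get(args.length > 0 ? args[0] : ".");
        PigTestCaseFiles testCase = under(baseDir, "myscript.pig");
        List<Path> missing = testCase.missingFiles();
        if (missing.isEmpty()) {
            System.out.println("Ready to hand to the runner: " + testCase.script);
        } else {
            System.out.println("Missing test files: " + missing);
        }
    }
}
```

A runner built on top of this would pass the three resolved paths to whatever executes the PigUnit test, for example a method like the runPigTest call shown on the slide.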

Slightly off-topic: Java/Pig combo

• Pig API provides nifty features to control Pig workflows through Java
  • Similar to how working with PigUnit feels
• Definitely worth a look!

// 'pigParams' is the main glue between Java and Pig here,
// e.g. to specify the location of input data
pigServer.registerScript(scriptInputStream, pigParams);

ExecJob job = pigServer.store(
    "aliasOutput", "/path/to/output", "PigStorage()");

if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
    System.out.println("Happy world!");

Thank You

© 2012 VeriSign, Inc. All rights reserved.  VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries.  All other trademarks are property of their respective owners.