14
Pig and Pig Latin

Pig and Pig Latin - Module 5

Embed Size (px)

Citation preview

Page 1: Pig and Pig Latin - Module 5

Pig and Pig Latin

Page 2: Pig and Pig Latin - Module 5

What is Pig?• Apache Pig is a Hadoop platform for creating MapReduce jobs. Pig uses a high-level, SQL-like

programming language named Pig Latin. The benefits of Pig include:

• Run a MapReduce job with a few simple lines of code.

• Process structured data with a schema, or Pig can process unstructured data without a schema. (Pigs eat anything!)

• Pig Latin uses a familiar SQL-like syntax.

• Pig scripts read and write data from HDFS.

• Pig Latin is a data flow language, a logical solution for many MapReduce algorithms.

Page 3: Pig and Pig Latin - Module 5

Pig Latin• Pig Latin is a high-level data flow scripting language. Pig Latin scripts can be executed one of three ways:

• Pig script: write a Pig Latin program in a text file and execute it using the pig executable.

• Grunt shell: enter Pig statements manually one-at-a-time from a CLI tool known as the Grunt interactive shell.

• Embedded in Java: use the PigServer class to execute a Pig query from within Java code.

Page 4: Pig and Pig Latin - Module 5

The Grunt Shell• Grunt is an interactive shell that enables users to enter Pig Latin statements and

also interact with HDFS. • To enter the Grunt shell, run the pig executable in the PIG_HOME\bin folder:

Page 5: Pig and Pig Latin - Module 5

Pig Latin Types

Page 6: Pig and Pig Latin - Module 5

Functions• Functions in Pig come in four types:

• Eval function : A function that takes one or more expressions and returns another expression.

• Filter function : A special type of eval function that returns a logical Boolean result.

• Load function: A function that specifies how to load data into a relation from external storage.

• Store function : A function that specifies how to save the contents of a relation to external storage.

Page 7: Pig and Pig Latin - Module 5

Eval Function

Page 8: Pig and Pig Latin - Module 5

Filter, Load, Store Functions

Page 9: Pig and Pig Latin - Module 5

Data Processing Operators• Loading and Storing Data

• Filtering Data

Page 10: Pig and Pig Latin - Module 5

• Grouping and Joining Data

Page 11: Pig and Pig Latin - Module 5

• Combining Data

Page 12: Pig and Pig Latin - Module 5

User-Defined Functions : Filter UDF

• Filter UDFs are all subclasses of FilterFunc, which itself is a subclass of EvalFunc

• Override EvalFunc’s only abstract method, exec(),

Page 13: Pig and Pig Latin - Module 5

Filter UDF Contd.. public class IsGoodQuality extends FilterFunc {@Overridepublic Boolean exec(Tuple tuple) throws IOException {if (tuple == null || tuple.size() == 0) {return false;}try {Object object = tuple.get(0);if (object == null) {return false;}int i = (Integer) object;return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;} catch (ExecException e) {throw new IOException(e);}}}

Page 14: Pig and Pig Latin - Module 5

Exploring Data with Apache Pig from the Grunt shell

LAB