Hadoop Module 5 Release 3.0


Slide 1

Slide 2

Course Topics

Module 1: Understanding Big Data, Hadoop Architecture

Module 2: Hadoop Cluster Configuration, Data Loading Techniques, Hadoop Project Environment

Module 3: Hadoop MapReduce Framework, Programming in MapReduce

Module 4: Advance MapReduce, MRUnit Testing Framework

Module 5: Analytics using Pig, Understanding Pig Latin

Module 6: Analytics using Hive, Understanding Hive QL

Module 7: Advance Hive, NoSQL Databases and HBase

Module 8: Advance HBase, Zookeeper Service

Module 9: Hadoop 2.0 New Features, Programming in MRv2

Module 10: Apache Oozie, Real-world Datasets and Analysis, Project Discussion

Slide 3

    Topics for Today

    Need of PIG

    Why PIG was created?

    Why go for PIG when MapReduce is there?

    Use Cases where Pig is used

    Use Case in Healthcare

    Where not to use PIG

    Weather data with PIG

Let's start with PIG

    PIG Components

    PIG Data Types

    PIG UDF

    PIG vs Hive

Slide 4

Let's Revise Advance MapReduce

Combiner and Partitioner functions

    MapReduce Joins

    Hadoop Data Types

    Custom Data Types

    Input and Output Formats

    Sequence Files

    Distributed Cache

    MRUnit testing framework

    Hadoop Counters: Reporting

Slide 5

    Need Of Pig

Do you know Java? 10 lines of PIG = 200 lines of Java

Plus built-in operations: Join, Group, Filter, Sort and more.
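As a hedged illustration of that brevity claim (file names and fields here are assumptions, not from the deck), a "find the most visited pages among young users" task takes only a handful of Pig Latin lines, each using one of those built-in operators:

users   = LOAD 'users'  AS (name:chararray, age:int);
pages   = LOAD 'pages'  AS (user:chararray, url:chararray);
young   = FILTER users BY age < 25;              -- built-in Filter
joined  = JOIN young BY name, pages BY user;     -- built-in Join
grouped = GROUP joined BY url;                   -- built-in Group
counts  = FOREACH grouped GENERATE group AS url, COUNT(joined) AS visits;
sorted  = ORDER counts BY visits DESC;           -- built-in Sort
top5    = LIMIT sorted 5;
STORE top5 INTO 'top_pages';

The equivalent hand-written MapReduce program in Java would need mapper, reducer, and driver classes for each of the join, group, and sort stages.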

Slide 6

    Why Was Pig Created?

    Rapid Development

    No Java is required

    Developed by Yahoo!

    An ad-hoc way of creating and executing map-reduce jobs on very large data sets

Slide 7

1/20 the lines of code, 1/16 the development time

[Bar charts: lines of code and development time in minutes, native Hadoop vs. Pig]

    Performance On Par With Raw Hadoop

Slide 8

    Why Should I Go For Pig When There Is MR?

MapReduce:
A powerful model for parallelism, based on a rigid procedural structure. It provides a good opportunity to parallelize algorithms.

Pig:
It is desirable to have a higher-level declarative language, similar to an SQL query, where the user specifies the "what" and leaves the "how" to the underlying processing engine.

Slide 9

    Where Should I use Pig?

    Pig is a data flow language.

It sits on top of Hadoop and makes it possible to create code to process large volumes of data quickly and efficiently.

    Case 1 Time Sensitive Data Loads

    Case 2 Processing Many Data Sources

    Case 3 Analytic Insight Through Sampling

Slide 10

    Where not to use Pig?

    Really nasty data formats or completely unstructured data

    (video, audio, raw human-readable text).

Pig is definitely slower than hand-tuned MapReduce jobs.

    When you would like more power to optimize your code.

Slide 11

Annie's Question

Hello there! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

Apache Pig is a platform for analysing:
- Small Data
- Data less than 10 GB
- Large Data
- All of them

Slide 12

Annie's Answer

Large Data.

Slide 13

What is Pig?

Pig is an open-source, high-level dataflow system.

It provides a simple language for queries and data manipulation, Pig Latin, that is compiled into MapReduce jobs that are run on Hadoop.

Why Is It Important?

Companies like Yahoo, Google and Microsoft are collecting enormous data sets in the form of click streams, search logs and web crawls. Some form of ad-hoc processing and analysis of all of this information is required.

Slide 14

    Data processing for search platforms

    Quick Prototyping of algorithms for processing large datasets.

    Processing of Web Logs

Support for Ad Hoc queries across large datasets.

    Use Cases Where Pig Is Used

Slide 15

    Examples Of Data Analysis Task

    Find users who tend to visit good pages:

VISITS
User   URL                 Time
Amy    www.cnn.com         8:00
Amy    www.crap.com        8:05
Amy    www.myblog.com      10:00
Amy    www.flickr.com      10:05
Fred   cnn.com/index.htm   12:00

PAGES
URL                 Pagerank
www.cnn.com         0.9
www.flickr.com      0.9
www.myblog.com      0.7
www.crap.com        0.2
Slide 16

    Conceptual Data Flow

Load Visits (user, url, time)
Load Pages (url, pagerank)
Join on url = url
Group by user
Compute Average Pagerank
Filter avgPR > 0.5
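A minimal Pig Latin sketch of this flow (the file names are assumptions; each statement maps onto one step above):

visits     = LOAD 'visits' AS (user:chararray, url:chararray, time:chararray);
pages      = LOAD 'pages'  AS (url:chararray, pagerank:double);
joined     = JOIN visits BY url, pages BY url;
grouped    = GROUP joined BY user;
avg_pr     = FOREACH grouped GENERATE group AS user, AVG(joined.pagerank) AS avgPR;
good_users = FILTER avg_pr BY avgPR > 0.5;
DUMP good_users;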

Slide 17

    How Yahoo Uses Pig?

    Pig is best suited for the data factory.

The Data Factory contains:

Pipelines: Pipelines bring logs from Yahoo!'s web servers. These logs undergo a cleaning step where bots, company internal views, and clicks are removed.

Research: Researchers want to quickly write a script to test a theory. Pig's integration with streaming makes it easy for researchers to take a Perl or Python script and run it against a huge data set.
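A hedged sketch of that streaming pattern (the script name, schema and file names are assumptions): an existing Python script can be shipped to the cluster and run over every record with the STREAM operator:

DEFINE analyze `python analyze.py` SHIP('analyze.py');
logs    = LOAD 'weblogs' AS (line:chararray);
results = STREAM logs THROUGH analyze AS (user:chararray, score:double);
STORE results INTO 'analysis_out';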

Slide 18

    Use Case In HealthCare

Problem Statement: De-identify personal health information.

Challenges: A huge amount of data flows into the systems daily, and there are multiple data sources that we need to aggregate data from.

Crunching this huge data and de-identifying it in a traditional way had problems.

Slide 19

    Use Case In HealthCare

    Pig Script

1. Take a DB dump in CSV format and ingest it into HDFS.
2. Read the CSV file from HDFS.
3. De-identify columns based on configurations and store the data back in a CSV file.
4. Store the de-identified CSV file into HDFS.
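A hedged Pig sketch of this pipeline (the file names, the piggybank CSVExcelStorage loader, and the DeIdentify UDF are assumptions, not part of the original case study):

REGISTER piggybank.jar;
REGISTER deidentify.jar;
DEFINE DeIdentify com.example.udf.DeIdentify();   -- hypothetical UDF that masks a field

raw = LOAD '/data/patients.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage()
      AS (id:chararray, name:chararray, ssn:chararray, diagnosis:chararray);

-- replace the sensitive columns with masked values
masked = FOREACH raw GENERATE id, DeIdentify(name), DeIdentify(ssn), diagnosis;

STORE masked INTO '/data/patients_deidentified'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage();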

Slide 20

    Weather Data With Pig

    ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/

Slide 21

    Pig - Basic Program Structure

Grunt: Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec (execute).

Script: Pig can run a script file that contains Pig commands. Example: pig script.pig runs the commands in the local file script.pig.

Embedded: You can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java.

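For example, the Script and Grunt modes might be launched like this (the script name is an assumption; -x selects the execution mode):

pig -x mapreduce script.pig      runs script.pig against the cluster (Script mode)
pig -x local                     starts the Grunt shell in local mode
grunt> run script.pig            runs a script from within Grunt
grunt> exec script.pig           runs a script in a separate Grunt context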

Slide 22

Pig Is Made Up Of Two Components

1. Pig Latin: Pig Latin is used to express Data Flows.

2. Execution Environments: distributed execution on a Hadoop cluster, or local execution in a single JVM.

Slide 23

    Pig Execution

Pig resides on the user machine; the job executes on the Hadoop cluster.

There is no need to install anything extra on your Hadoop cluster.

Slide 24

Pig Latin Program

It is made up of a series of operations or transformations that are applied to the input data to produce output.

Pig turns the transformations into a series of MapReduce jobs.

Slide 25

Four Basic Types Of Data Models

Data Model Types: Atom, Tuple, Bag, Map

Slide 26

    Data Model

    Data Models can be defined as follows:

    A bag is a collection of tuples.

    A tuple is an ordered set of fields.

    A field is a piece of data.

A Data Map is a map from keys that are string literals to values that can be any data type.

Example:

t = < 1, {<2,3>, <4,6>, <5,7>}, ['apache':'search'] >

Here the tuple t has three fields: an atom (1), a bag of tuples, and a map.
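These four types appear directly in a Pig Latin LOAD schema; a minimal sketch (the file name and field names are assumptions):

A = LOAD 'data' AS (id:int,                                              -- atom
                    location:tuple(lat:double, lon:double),              -- tuple
                    visits:bag{t:tuple(url:chararray, time:chararray)},  -- bag of tuples
                    attrs:map[chararray]);                               -- map with chararray values
DESCRIBE A;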

Slide 27

    Pig Data Types

Pig Data Type    Implementing Class

    Bag org.apache.pig.data.DataBag

    Tuple org.apache.pig.data.Tuple

    Map java.util.Map

    Integer java.lang.Integer

    Long java.lang.Long

    Float java.lang.Float

    Double java.lang.Double

    Chararray java.lang.String

    Bytearray byte[]

Slide 28

    Pig Latin Relational Operators

Category: Loading and Storing
LOAD - Loads data from the file system or other storage into a relation.
STORE - Saves a relation to the file system or other storage.
DUMP - Prints a relation to the console.

Category: Filtering
FILTER - Removes unwanted rows from a relation.
DISTINCT - Removes duplicate rows from a relation.
FOREACH...GENERATE - Adds or removes fields from a relation.
STREAM - Transforms a relation using an external program.

Category: Grouping and Joining
JOIN - Joins two or more relations.
COGROUP - Groups the data in two or more relations.
GROUP - Groups the data in a single relation.
CROSS - Creates the cross product of two or more relations.

Category: Sorting
ORDER - Sorts a relation by one or more fields.
LIMIT - Limits the size of a relation to a maximum number of tuples.

Category: Combining and Splitting
UNION - Combines two or more relations into one.
SPLIT - Splits a relation into two or more relations.
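A short script touching several of these operators (the 'student' file and schema are the ones used in the examples on the later slides):

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY gpa > 2.5;                        -- Filtering
C = GROUP B BY name;                              -- Grouping
D = FOREACH C GENERATE group AS name, AVG(B.gpa) AS avg_gpa;
E = ORDER D BY avg_gpa DESC;                      -- Sorting
F = LIMIT E 3;
STORE F INTO 'top_students';                      -- Storing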

Slide 29

    Pig Latin - Nulls

Data of any type can be NULL. Pig includes the concept of a data element being NULL.

In Pig, when an element is NULL, it means the value is unknown.
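For example, using the student data shown on the next slide, null fields can be filtered out explicitly, since comparisons against a null value yield null rather than true or false:

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY age IS NOT NULL;   -- drops (sam,,3.0), whose age is null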

Slide 30

    Data

File: Student
Name    Age   GPA
Joe     18    2.5
Sam           3.0
Angel   21    7.9
John    17    9.0
Joe     19    2.9

File: Student Roll
Name    Roll No.
Joe     45
Sam     24
Angel   1
John    12
Joe     19

Slide 31

    Pig Latin Group Operator

    Example of GROUP Operator:

A = load 'student' as (name:chararray, age:int, gpa:float);
dump A;
(joe,18,2.5)
(sam,,3.0)
(angel,21,7.9)
(john,17,9.0)
(joe,19,2.9)

X = group A by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)})
(sam,{(sam,,3.0)})
(john,{(john,17,9.0)})
(angel,{(angel,21,7.9)})

Slide 32

Pig Latin COGROUP Operator

Example of COGROUP Operator:

A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'studentRoll' as (name:chararray, rollno:int);

X = cogroup A by name, B by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)},{(joe,45),(joe,19)})
(sam,{(sam,,3.0)},{(sam,24)})
(john,{(john,17,9.0)},{(john,12)})
(angel,{(angel,21,7.9)},{(angel,1)})

Slide 33

Joins And COGROUP

JOIN and COGROUP operators perform similar functions: JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
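Using the same student relations A and B from the COGROUP example, the contrast looks like this:

J = join A by name, B by name;
dump J;
-- each output tuple is flat, e.g. (joe,18,2.5,joe,45), rather than a pair of nested bags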

Slide 34

    Union

    UNION: To merge the contents of two or more relations.
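A minimal sketch with two hypothetical relations that share a schema:

A = load 'logs_2013' as (user:chararray, url:chararray);
B = load 'logs_2014' as (user:chararray, url:chararray);
C = union A, B;   -- does not preserve order and does not remove duplicates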

Slide 35

    Further Reading

    Review the following Blogs on Pig Scripts:

    http://www.edureka.in/blog/operators-in-apache-pig/

    http://www.edureka.in/blog/diagnostic-operators-apachepig/

    Diagnostic Operators & UDF Statements

Slide 36

    Diagnostic Operators & UDF Statements

Types of Pig Latin Diagnostic Operators:

DESCRIBE - Prints a relation's schema.
EXPLAIN - Prints the logical and physical plans.
ILLUSTRATE - Shows a sample execution of the logical plan, using a generated subset of the input.

Types of Pig Latin UDF Statements:

REGISTER - Registers a JAR file with the Pig runtime.
DEFINE - Creates an alias for a UDF, streaming script, or a command specification.
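Put together, a session might use them like this (the jar name and UDF package are assumptions):

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
DESCRIBE A;                                -- prints the schema of A
EXPLAIN A;                                 -- prints the logical, physical and MapReduce plans
ILLUSTRATE A;                              -- shows a sample run on generated example data
REGISTER myudf.jar;                        -- make the UDF jar available to Pig
DEFINE IsOfAge com.example.pig.IsOfAge();  -- alias for the UDF (hypothetical package)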


Slide 37

    Describe

    Use the DESCRIBE operator to review the fields and data-types.

Slide 38

    EXPLAIN : Logical Plan

Use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans that are used to compute the specified relationship.

The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early on) also apply.

Slide 39

    EXPLAIN : Physical Plan

The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.

Slide 40

    EXPLAIN : MapReduce Plan

The MapReduce plan shows how the physical operators are grouped into MapReduce jobs.

Slide 41

    Illustrate

The ILLUSTRATE command is used to demonstrate a "good" example of input data, judged by three measurements:

1. Completeness  2. Conciseness  3. Degree of realism

Slide 42

    Pig Latin File Loaders


    BinStorage - "binary" storage

    PigStorage - loads and stores data that is delimited by something

    TextLoader - loads data line by line (delimited by the newline character)

    CSVLoader - Loads CSV files

    XML Loader - Loads XML files
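Typical load statements for these loaders (file names are assumptions; CSVLoader and XMLLoader ship in the piggybank contrib jar rather than core Pig):

A = LOAD 'data.tsv' USING PigStorage('\t') AS (name:chararray, age:int);
B = LOAD 'raw.log'  USING TextLoader()     AS (line:chararray);
C = LOAD 'data.bin' USING BinStorage();
REGISTER piggybank.jar;
D = LOAD 'data.csv' USING org.apache.pig.piggybank.storage.CSVLoader();
E = LOAD 'data.xml' USING org.apache.pig.piggybank.storage.XMLLoader('record') AS (doc:chararray);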

Slide 43

    Pig Latin Creating UDF

A program to create a UDF:

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsOfAge extends FilterFunc {

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // An empty or missing tuple cannot satisfy the filter.
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            // Accept only the listed ages.
            int i = (Integer) object;
            if (i == 18 || i == 19 || i == 21 || i == 23 || i == 27) {
                return true;
            } else {
                return false;
            }
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}

Slide 44

    Pig Latin Calling A UDF

    How to call a UDF?

    register myudf.jar;

    X = filter A by IsOfAge(age);
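A complete calling sequence, reusing the student file and the IsOfAge class from the previous slide (which was declared without a package, so the short name resolves after register):

register myudf.jar;
A = load 'student' as (name:chararray, age:int, gpa:float);
X = filter A by IsOfAge(age);
dump X;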

Slide 45

    Further Reading

    Review the following Blogs on Pig Scripts:

http://www.edureka.in/blog/pig-programming-apache-pig-script-with-udf-in-hdfs-mode/

http://www.edureka.in/blog/apache-pig-udf-part-1-eval-aggregate-filter-functions/

http://www.edureka.in/blog/apache-pig-udf-part-2-load-functions/

http://www.edureka.in/blog/apache-pig-udf-store-functions/

Slide 46

Pig And Hive?

Slide 47

Module 5 Pre-work

Attempt the following assignment using the document present in the LMS:

    Pig Set-up on Cloudera

    Execute Pig Weather Example

    Attempt Assignment Week 4


Review the following blogs:

    http://www.edureka.in/blog/apache-hive-installation-on-ubuntu/

    http://www.edureka.in/blog/apache-hadoop-hive-script/

Slide 48

Thank You! See You in Class Next Week.