5/19/2018 Hadoop Module 5 Release 3.0
Slide 2
Course Topics
Module 1: Understanding Big Data; Hadoop Architecture
Module 2: Hadoop Cluster Configuration; Data loading Techniques; Hadoop Project Environment
Module 3: Hadoop MapReduce framework; Programming in Map Reduce
Module 4: Advanced MapReduce; MRUnit testing framework
Module 5: Analytics using Pig; Understanding Pig Latin
Module 6: Analytics using Hive; Understanding HIVE QL
Module 7: Advanced Hive; NoSQL Databases and HBASE
Module 8: Advanced HBASE; Zookeeper Service
Module 9: Hadoop 2.0 New Features; Programming in MRv2
Module 10: Apache Oozie; Real world Datasets and Analysis; Project Discussion
Slide 3
Topics for Today
Need of PIG
Why PIG was created?
Why go for PIG when MapReduce is there?
Use Cases where Pig is used
Use Case in Healthcare
Where not to use PIG
Weather data with PIG
Let's start with PIG
PIG Components
PIG Data Types
PIG UDF
PIG vs Hive
Slide 4
Let's Revise Advanced MR
Combiner and Partition functions
MapReduce Joins
Hadoop Data Types
Custom Data Types
Input and Output Formats
Sequence Files
Distributed Cache
MRUnit testing framework
Hadoop Counters: Reporting
Slide 5
Need Of Pig
Do you know Java? 10 lines of PIG = 200 lines of Java
+ Built-in operations: Join, Group, Filter, Sort, and more
Slide 6
Why Was Pig Created?
Rapid Development
No Java is required
Developed by Yahoo!
An ad-hoc way of creating and executing map-reduce jobs on very large data sets
Slide 7
Why Should I Go For Pig When There Is MR?
1/20 the lines of code
1/16 the development time
[Figure: two bar charts comparing Hadoop and Pig, one for lines of code and one for development time in minutes]
Performance on par with raw Hadoop
Slide 8
Why Should I Go For Pig When There Is MR?
MapReduce
Powerful model for parallelism.
Based on a rigid procedural structure.
Provides a good opportunity to parallelize algorithms.
PIG
It is desirable to have a higher-level declarative language.
Similar to a SQL query, where the user specifies the "what" and leaves the "how" to the underlying processing engine.
Slide 9
Where Should I use Pig?
Pig is a data flow language.
It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.
Case 1 Time Sensitive Data Loads
Case 2 Processing Many Data Sources
Case 3 Analytic Insight Through Sampling
Slide 10
Where not to use Pig?
Really nasty data formats or completely unstructured data (video, audio, raw human-readable text).
Pig is typically slower than hand-written MapReduce jobs.
When you would like more power to optimize your code.
Slide 11
Annie's Question
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.
Apache Pig is a platform for analysing?
- Small Data
- Data less than 10 GB
- Large Data
- All of them
Slide 12
Annie's Answer
Large Data.
Slide 13
What is Pig?
Pig is an open-source, high-level dataflow system.
It provides a simple language for queries and data manipulation, Pig Latin, that is compiled into map-reduce jobs that are run on Hadoop.
Why Is It Important?
Companies like Yahoo, Google and Microsoft are collecting enormous data sets in the form of click streams, search logs, and web crawls.
Some form of ad-hoc processing and analysis of all this information is required.
Slide 14
Use Cases Where Pig Is Used
Data processing for search platforms
Quick prototyping of algorithms for processing large datasets
Processing of web logs
Support for ad hoc queries across large datasets
Slide 15
Examples Of Data Analysis Task
Find users who tend to visit good pages:

VISITS
User    URL                  Time
Amy     www.cnn.com          8:00
Amy     www.crap.com         8:05
Amy     www.myblog.com       10:00
Amy     www.flickr.com       10:05
Fred    cnn.com/index.htm    12:00

PAGES
URL               Pagerank
www.cnn.com       0.9
www.flickr.com    0.9
www.myblog.com    0.7
www.crap.com      0.2
Slide 16
Conceptual Data Flow
Load Visits (user, url, time)
Load Pages (url, pagerank)
Join on url = url
Group by user
Compute average pagerank
Filter avgPR > 0.5
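The data flow above can be modelled in a few lines of plain Java. This is only a sketch of the logic, not how Pig executes it: the class and method names are made up for illustration, and the rows are the sample VISITS and PAGES rows from the previous slide. Fred's URL has no exact match in PAGES, so an inner join on the raw string drops his visit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the slide's data flow (hypothetical names).
// Pig would express these steps in Pig Latin and run them as MapReduce jobs.
class GoodPageVisitors {

    // Sample rows from the VISITS table: {user, url, time}.
    static List<String[]> sampleVisits() {
        return Arrays.asList(
            new String[] {"Amy", "www.cnn.com", "8:00"},
            new String[] {"Amy", "www.crap.com", "8:05"},
            new String[] {"Amy", "www.myblog.com", "10:00"},
            new String[] {"Amy", "www.flickr.com", "10:05"},
            new String[] {"Fred", "cnn.com/index.htm", "12:00"});
    }

    // Sample rows from the PAGES table: url -> pagerank.
    static Map<String, Double> samplePages() {
        Map<String, Double> pages = new HashMap<>();
        pages.put("www.cnn.com", 0.9);
        pages.put("www.flickr.com", 0.9);
        pages.put("www.myblog.com", 0.7);
        pages.put("www.crap.com", 0.2);
        return pages;
    }

    // Join on url, group by user, average the pagerank, keep averages above the threshold.
    static Map<String, Double> usersAboveAverage(List<String[]> visits,
                                                 Map<String, Double> pages,
                                                 double threshold) {
        Map<String, List<Double>> ranksByUser = new TreeMap<>();
        for (String[] visit : visits) {
            Double rank = pages.get(visit[1]);
            if (rank == null) {
                continue; // inner join: a visit with no matching page is dropped
            }
            ranksByUser.computeIfAbsent(visit[0], k -> new ArrayList<>()).add(rank);
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> entry : ranksByUser.entrySet()) {
            double sum = 0.0;
            for (double rank : entry.getValue()) {
                sum += rank;
            }
            double average = sum / entry.getValue().size();
            if (average > threshold) {
                result.put(entry.getKey(), average);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(usersAboveAverage(sampleVisits(), samplePages(), 0.5));
    }
}
```

With the sample rows, only Amy remains: her four visited pages average roughly 0.675, which is above the 0.5 threshold.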
Slide 17
How Yahoo Uses Pig?
Pig is best suited for the data factory.
Data Factory contains:
Pipelines: Pipelines bring logs from Yahoo!'s web servers. These logs undergo a cleaning step where bots, company internal views, and clicks are removed.
Research: Researchers want to quickly write a script to test a theory. Pig integration with streaming makes it easy for researchers to take a Perl or Python script and run it against a huge data set.
Slide 18
Use Case In HealthCare
Problem Statement: De-identify personal health information.
Challenges:
A huge amount of data flows into the systems daily, and there are multiple data sources that we need to aggregate data from.
Crunching this huge data and de-identifying it in a traditional way had problems.
Slide 19
Use Case In HealthCare
Pig script flow:
1. Take the DB dump in CSV format and ingest it into HDFS.
2. Read the CSV file from HDFS.
3. De-identify columns based on configurations and store the data back in a CSV file.
4. Store the de-identified CSV file into HDFS.
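The de-identification step in this use case could look roughly like the plain-Java sketch below. Everything in it is an assumption made for illustration: the class name, the choice of a SHA-256 digest as the masking method, the idea that the "configuration" is a set of column indexes, and the made-up sample row. The slides do not say how the real system masked its columns, and a bare hash is not a complete anonymization scheme.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: replaces configured CSV columns with a SHA-256 digest.
// Assumes simple comma-separated rows with no quoted or embedded commas.
class CsvDeidentifier {

    // Hex-encoded SHA-256 of a value, so the same input always maps to the same token.
    static String hash(String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Returns the row with every column index listed in sensitiveColumns hashed.
    static String deidentify(String csvRow, Set<Integer> sensitiveColumns) {
        String[] fields = csvRow.split(",", -1); // -1 keeps trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            if (sensitiveColumns.contains(i)) {
                fields[i] = hash(fields[i]);
            }
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // Made-up sample row: name, phone, diagnosis. Columns 0 and 1 are masked.
        Set<Integer> config = new HashSet<>(Arrays.asList(0, 1));
        System.out.println(deidentify("Jane Doe,555-0100,flu", config));
    }
}
```

In a Pig pipeline, logic like this would more likely live inside a UDF applied in a FOREACH...GENERATE step than in a standalone program.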
Slide 20
Weather Data With Pig
ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/
Slide 21
Pig - Basic Program Structure
Grunt: Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec (execute).
Script: Pig can run a script file that contains Pig commands.
Example: pig script.pig runs the commands in the local file script.pig.
Embedded: Pig programs can be run from Java, much like you can use JDBC to run SQL programs from Java.
Slide 22
Pig Is Made Up Of Two Components
1. Data flows: Pig Latin is used to express data flows.
2. Execution environments:
   Distributed execution on a Hadoop cluster
   Local execution in a single JVM
Slide 23
Pig Execution
Pig resides on the user machine.
The job executes on the Hadoop cluster.
No need to install anything extra on your Hadoop cluster.
Slide 24
Pig Latin Program
It is made up of a series of operations or transformations that are applied to the input data to produce output.
Pig turns the transformations into a series of MapReduce jobs.
Slide 25
Four Basic Types Of Data Models
Atom
Tuple
Bag
Map
Slide 26
Data Model
Data Models can be defined as follows:
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.
A Data Map is a map from keys that are string literals to values that can be any data type.
Example:
t = < 1, {,,}, ['apache':'search']>
Slide 27
Pig Data Types
Pig Data Type    Implementing Class
Bag              org.apache.pig.data.DataBag
Tuple            org.apache.pig.data.Tuple
Map              java.util.Map
Integer          java.lang.Integer
Long             java.lang.Long
Float            java.lang.Float
Double           java.lang.Double
Chararray        java.lang.String
Bytearray        byte[]
Slide 28
Pig Latin Relational Operators
Category                   Operator              Description
Loading and Storing        LOAD                  Loads data from the file system or other storage into a relation.
                           STORE                 Saves a relation to the file system or other storage.
                           DUMP                  Prints a relation to the console.
Filtering                  FILTER                Removes unwanted rows from a relation.
                           DISTINCT              Removes duplicate rows from a relation.
                           FOREACH...GENERATE    Adds or removes fields from a relation.
                           STREAM                Transforms a relation using an external program.
Grouping and Joining       JOIN                  Joins two or more relations.
                           COGROUP               Groups the data in two or more relations.
                           GROUP                 Groups the data in a single relation.
                           CROSS                 Creates the cross product of two or more relations.
Sorting                    ORDER                 Sorts a relation by one or more fields.
                           LIMIT                 Limits the size of a relation to a maximum number of tuples.
Combining and Splitting    UNION                 Combines two or more relations into one.
                           SPLIT                 Splits a relation into two or more relations.
Slide 29
Pig Latin - Nulls
Pig includes the concept of a data element being NULL.
Data of any type can be NULL.
In Pig, when a data element is NULL, it means the value is unknown.
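Because NULL means "unknown", a comparison against a NULL value is itself unknown, and a FILTER keeps only the rows whose condition is true. The plain-Java sketch below models that behaviour with nullable Integer values; the class and method names are made up for illustration, and this is a model of the semantics rather than Pig's own code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of how a FILTER treats NULL (hypothetical names).
class NullFilterSketch {

    // Three-valued "age > limit": returns null (unknown) when age is null.
    static Boolean greaterThan(Integer age, int limit) {
        if (age == null) {
            return null;
        }
        return age > limit;
    }

    // Keeps only the ages for which the condition is definitely true.
    static List<Integer> filterOlderThan(List<Integer> ages, int limit) {
        List<Integer> kept = new ArrayList<>();
        for (Integer age : ages) {
            if (Boolean.TRUE.equals(greaterThan(age, limit))) {
                kept.add(age);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The null stands for a missing age.
        System.out.println(filterOlderThan(Arrays.asList(18, null, 21, 17, 19), 18)); // prints [21, 19]
    }
}
```

The row with the missing age is dropped even though nothing says the age is 18 or less; the filter simply cannot confirm the condition.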
Slide 30
Data
File: Student
Name     Age    GPA
Joe      18     2.5
Sam             3.0
Angel    21     7.9
John     17     9.0
Joe      19     2.9

File: Student Roll
Name     Roll No.
Joe      45
Sam      24
Angel    1
John     12
Joe      19
Slide 31
Pig Latin Group Operator
Example of GROUP Operator:
A = load 'student' as (name:chararray, age:int, gpa:float);
dump A;
(joe,18,2.5)
(sam,,3.0)
(angel,21,7.9)
(john,17,9.0)
(joe,19,2.9)
X = group A by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)})
(sam,{(sam,,3.0)})
(john,{(john,17,9.0)})
(angel,{(angel,21,7.9)})
Slide 32
Pig Latin COGroup Operator
Example of COGROUP Operator:
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'studentRoll' as (name:chararray, rollno:int);
X = cogroup A by name, B by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)},{(joe,45),(joe,19)})
(sam,{(sam,,3.0)},{(sam,24)})
(john,{(john,17,9.0)},{(john,12)})
(angel,{(angel,21,7.9)},{(angel,1)})
Slide 33
Joins And COGROUP
JOIN and COGROUP operators perform similar functions.
JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
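The difference between flat and nested output is easier to see with data. The plain-Java sketch below models both operators on the student and roll-number rows used in the earlier examples. The class and method names are hypothetical, rows are plain string arrays keyed on their first field, and the code only imitates the shape of the results, not Pig's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of the output shapes of JOIN and COGROUP (hypothetical names).
class JoinVsCogroup {

    static List<String[]> students() {
        return Arrays.asList(
            new String[] {"joe", "18", "2.5"},
            new String[] {"sam", "", "3.0"},
            new String[] {"angel", "21", "7.9"},
            new String[] {"john", "17", "9.0"},
            new String[] {"joe", "19", "2.9"});
    }

    static List<String[]> rolls() {
        return Arrays.asList(
            new String[] {"joe", "45"},
            new String[] {"sam", "24"},
            new String[] {"angel", "1"},
            new String[] {"john", "12"},
            new String[] {"joe", "19"});
    }

    // Groups rows by their first field, which plays the role of the key.
    static Map<String, List<String[]>> groupByKey(List<String[]> rows) {
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    // Inner JOIN: one flat output row per matching pair of input rows.
    static List<String> join(List<String[]> left, List<String[]> right) {
        Map<String, List<String[]>> rightByKey = groupByKey(right);
        List<String> out = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : rightByKey.getOrDefault(l[0], Collections.emptyList())) {
                out.add(String.join(",", l) + "," + String.join(",", r));
            }
        }
        return out;
    }

    // COGROUP: one output entry per key, holding one bag of rows from each input.
    static Map<String, List<List<String[]>>> cogroup(List<String[]> left, List<String[]> right) {
        Map<String, List<String[]>> leftByKey = groupByKey(left);
        Map<String, List<String[]>> rightByKey = groupByKey(right);
        Set<String> keys = new TreeSet<>(leftByKey.keySet());
        keys.addAll(rightByKey.keySet());
        Map<String, List<List<String[]>>> out = new TreeMap<>();
        for (String key : keys) {
            out.put(key, Arrays.asList(
                leftByKey.getOrDefault(key, Collections.emptyList()),
                rightByKey.getOrDefault(key, Collections.emptyList())));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(students(), rolls()).size());    // prints 7
        System.out.println(cogroup(students(), rolls()).size()); // prints 4
    }
}
```

The join yields seven flat rows because the two joe rows on each side pair up four ways, while the cogroup yields four entries, one per name, each carrying two nested bags.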
Slide 34
Union
UNION: To merge the contents of two or more relations.
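One detail worth knowing about UNION is that it does not remove duplicate tuples. The small plain-Java model below makes that visible; the class name and the sample tuples are illustrative only, and unlike this list-based sketch, Pig gives no guarantee about the order of the merged output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of UNION (hypothetical names): the output holds every
// tuple from both inputs, and duplicates are kept.
class UnionSketch {

    static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("(joe,45)", "(sam,24)");
        List<String> b = Arrays.asList("(sam,24)", "(john,12)");
        System.out.println(union(a, b).size()); // prints 4: the duplicate (sam,24) is kept
    }
}
```

To drop the repeated tuple, a script would apply DISTINCT to the result of the UNION.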
Slide 35
Further Reading
Review the following Blogs on Pig Scripts:
http://www.edureka.in/blog/operators-in-apache-pig/
http://www.edureka.in/blog/diagnostic-operators-apachepig/
Slide 36
Diagnostic Operators & UDF Statements
Types of Pig Latin Diagnostic Operators:
DESCRIBE - Prints a relation's schema.
EXPLAIN - Prints the logical and physical plans.
ILLUSTRATE - Shows a sample execution of the logical plan, using a generated subset of the input.
Types of Pig Latin UDF Statements:
REGISTER - Registers a JAR file with the Pig runtime.
DEFINE - Creates an alias for a UDF, streaming script, or a command specification.
Slide 37
Describe
Use the DESCRIBE operator to review the fields and data types.
Slide 38
EXPLAIN : Logical Plan
Use the EXPLAIN operator to review the logical, physical, and map reduce execution plans that are used to compute the specified relationship.
The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early on) also apply.
Slide 39
EXPLAIN : Physical Plan
The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.
Slide 40
EXPLAIN : MapReduce Plan
The mapreduce plan shows how the physical operators are grouped into map reduce jobs.
Slide 41
Illustrate
The ILLUSTRATE command is used to demonstrate a script on "good" example input data.
The example data is judged by three measurements:
1. Completeness
2. Conciseness
3. Degree of realism
Slide 42
Pig Latin File Loaders
BinStorage - "binary" storage
PigStorage - loads and stores data that is delimited by something
TextLoader - loads data line by line (delimited by the newline character)
CSVLoader - loads CSV files
XML Loader - loads XML files
Slide 43
Pig Latin Creating UDF
A Program to create UDF:
import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Filter UDF: returns true when the first field is one of the accepted ages.
public class IsOfAge extends FilterFunc {

    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // An empty or missing tuple never passes the filter.
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            // A NULL age is treated as not matching.
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
Slide 44
Pig Latin Calling A UDF
How to call a UDF?
register myudf.jar;
X = filter A by IsOfAge(age);
Slide 45
Further Reading
Review the following Blogs on Pig Scripts:
http://www.edureka.in/blog/pig-programming-apache-pig-script-with-udf-in-hdfs-mode/
http://www.edureka.in/blog/apache-pig-udf-part-1-eval-aggregate-filter-functions/
http://www.edureka.in/blog/apache-pig-udf-part-2-load-functions/
http://www.edureka.in/blog/apache-pig-udf-store-functions/
Slide 46
Pig And Hive
?
Slide 47
Module-5 Pre-work
Attempt the following assignment using the document present in the LMS under the
Pig Set-up on Cloudera
Execute Pig Weather Example
Attempt Assignment Week 4
Review the following Blogs on Pig Scripts:
http://www.edureka.in/blog/apache-hive-installation-on-ubuntu/
http://www.edureka.in/blog/apache-hadoop-hive-script/
Slide 48
Thank You
See You in Class Next Week