My presentation about Hadoop and Pig during the FOSDEM Datadevroom 2011.
Hadoop Pig: MapReduce the easy way.
Nathan Bijnens
http://nathan.gs
@nathan_gs
We live in a world of data.
● Data analysis becomes more and more important
● Increasing complexity of analysis
● Meanwhile the data we analyze grows big, fast!
s: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron
Hadoop is an open source Java framework aimed at data intensive distributed applications.
It enables applications to work with thousands of nodes and petabytes of data.
Hadoop: Intro
Hadoop was inspired by Google's MapReduce and the Google File System.
http://labs.google.com/papers/mapreduce.html
HDFS is a distributed, scalable filesystem designed to store large files.
In combination with the Hadoop JobTracker it provides data locality.
By default it automatically replicates every block to 3 DataNodes; preferably, two copies are stored on
two DataNodes within the same rack and one in another rack.
Hadoop: HDFS
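The rack preference described above can be sketched in a few lines of Python. This is only an illustration of "two copies in one rack, one in another", not the actual HDFS placement code; the `racks` layout, the `place_replicas` name and the `writer_rack` parameter are assumptions made for the example.

```python
import random

def place_replicas(racks, writer_rack=None, rng=random):
    """Choose three DataNodes for one block: two in one rack, one in another.

    racks: dict mapping a rack name to a list of DataNode names (hypothetical layout).
    """
    # The rack holding the pair of replicas needs at least two nodes.
    eligible = [rack for rack, nodes in racks.items() if len(nodes) >= 2]
    if not eligible:
        raise ValueError("no rack has two DataNodes")
    shared_rack = writer_rack if writer_rack in eligible else rng.choice(eligible)
    # The third replica goes to any non-empty rack other than the shared one.
    other_racks = [rack for rack, nodes in racks.items() if rack != shared_rack and nodes]
    if not other_racks:
        raise ValueError("a second non-empty rack is required")
    pair = rng.sample(racks[shared_rack], 2)
    remote = rng.choice(racks[rng.choice(other_racks)])
    return pair + [remote]
```

With this layout a whole rack can fail and at least one copy of the block still survives elsewhere.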
● NameNode
  ● Keeps track of what is stored where
  ● In memory
  ● Single Point of Failure
● DataNodes
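A toy model may help picture what "keeps track of what is stored where, in memory" means. The class below is a hypothetical sketch, not the real NameNode data structures: all names (`TinyNameNode`, `add_block`, `locate`, `datanode_lost`) are invented for illustration.

```python
from collections import defaultdict

class TinyNameNode:
    """Toy in-memory index: which blocks make up a file, and where each block lives."""

    def __init__(self):
        self.file_blocks = defaultdict(list)      # path -> ordered block ids
        self.block_locations = defaultdict(set)   # block id -> DataNode names

    def add_block(self, path, block_id, datanodes):
        self.file_blocks[path].append(block_id)
        self.block_locations[block_id].update(datanodes)

    def locate(self, path):
        """Return (block id, sorted DataNode names) for every block of the file."""
        return [(block, sorted(self.block_locations[block]))
                for block in self.file_blocks.get(path, [])]

    def datanode_lost(self, datanode):
        """Forget a failed DataNode; its blocks stay readable from other replicas."""
        for nodes in self.block_locations.values():
            nodes.discard(datanode)
```

Because both dictionaries live in the memory of a single process, losing that process loses the whole index, which is the single point of failure the slide mentions.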
s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
MapReduce works by breaking processing into two phases: a map phase and a reduce phase, each driven by a user-supplied function.
MapReduce
● Input
● Map
● Shuffle
● Reduce
● Output
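The five steps can be imitated in a single Python process. The sketch below is a word count, chosen only as a familiar example; `map_words`, `reduce_counts` and `run_job` are hypothetical names, and a real Hadoop job would spread each step over many machines.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: collapse all values seen for one key into a single result."""
    return word, sum(counts)

def run_job(lines, mapper, reducer):
    # Input + Map: feed every input record to the mapper.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle: bring all values that share a key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce + Output: one reducer call per key, returned in key order.
    return [reducer(key, values) for key, values in sorted(groups.items())]
```

For example, `run_job(["pig pig hadoop", "Hadoop hdfs"], map_words, reduce_counts)` counts "pig" and "hadoop" twice each and "hdfs" once.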
s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
MassiveMedia / Netlog
● Cases
  ● Traffic analysis
  ● User actions
  ● ...
● On a 7 node cluster.
Use Cases: Who & how it's used
Yahoo!
● Cases
  ● Ad Systems
  ● Web Search
  ● ...
● More than 36000 nodes!
s: http://wiki.apache.org/hadoop/PoweredBy
SETI@home
● Highly CPU oriented
● Data locality is unimportant!
Use Cases: When not to use
Pig is a high level data flow language.
Hadoop Pig: Intro
Pig Latin
Grunt
PigServer
Hadoop Pig: 3 components
-- PigStorage() splits on tabs by default; ',' is assumed here because the file is a CSV.
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray, last_name:chararray, age:int, wage:float, department:chararray
);
grouped_by_department = GROUP data BY department;
-- After GROUP, the fields of each record live inside the bag named "data".
total_wage_by_department = FOREACH grouped_by_department GENERATE
    group AS department,
    COUNT(data) AS employee_count,
    SUM(data.wage) AS total_wage;
total_ordered = ORDER total_wage_by_department BY total_wage;
total_limited = LIMIT total_ordered 10;
DUMP total_limited;
Hadoop Pig
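To make the data flow of the script above concrete, here is roughly the same computation in plain Python on an in-memory list. It is a sketch under assumptions: rows are tuples in the column order of the `LOAD` schema, and the function name `top_departments_by_wage` is invented. Pig would run these steps as MapReduce jobs instead.

```python
from collections import defaultdict

def top_departments_by_wage(rows, limit=10):
    """rows: iterable of (first_name, last_name, age, wage, department) tuples."""
    # GROUP data BY department
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[4]].append(row)
    # FOREACH ... GENERATE group, COUNT(...), SUM(wage)
    totals = [
        (department, len(members), sum(member[3] for member in members))
        for department, members in grouped.items()
    ]
    # ORDER ... BY total_wage (ascending, like Pig's default)
    totals.sort(key=lambda total: total[2])
    # LIMIT
    return totals[:limit]
```

Note that the ordering is ascending, so the result holds the departments with the lowest total wage; in Pig, adding `DESC` to the `ORDER` statement would return the highest ones instead.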
-- PigStorage() splits on tabs by default; ',' is assumed here because the files are CSVs.
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
    book_id:int, book_name:chararray, author_name:chararray
);
book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
    book_id:int, price:float, country:chararray
);
-- Pig has no LIKE operator; MATCHES takes a regular expression.
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
grouped_by_book = GROUP data BY books::book_name;
total_sales_by_book = FOREACH grouped_by_book GENERATE
    group AS book,
    COUNT(data) AS sales_volume,
    SUM(data.book_sales::price) AS total_sales;
STORE total_sales_by_book INTO 'book_sale_results';
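The join-then-aggregate flow of this script can be mimicked in plain Python as a small hash join. This is a sketch, not how Pig executes it: the tuples are assumed to follow the two `LOAD` schemas, and `sales_by_book` is an invented name.

```python
from collections import defaultdict

def sales_by_book(books, book_sales):
    """books: (book_id, book_name, author_name); book_sales: (book_id, price, country)."""
    # Build the lookup side of the join: book_id -> book names.
    names_by_id = defaultdict(list)
    for book_id, book_name, _author in books:
        names_by_id[book_id].append(book_name)
    # Probe with every sale; unmatched rows on either side are dropped (inner join).
    totals = defaultdict(lambda: [0, 0.0])   # book_name -> [sales_volume, total_sales]
    for book_id, price, _country in book_sales:
        for book_name in names_by_id.get(book_id, []):
            totals[book_name][0] += 1
            totals[book_name][1] += price
    return {name: (volume, total) for name, (volume, total) in totals.items()}
```

In the Pig script, `PARALLEL 12` asks for 12 reduce tasks for the join, so the matching is split across reducers by `book_id` rather than done in one dictionary.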
● Custom Load and Store classes.
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog
● Custom extraction, e.g. date, ...
Take a look at the PiggyBank.
UDF
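As an illustration of the "custom extraction" kind of UDF, the function below pulls the date out of a web server log line. It only shows the extraction logic in Python; the log layout (a `[dd/Mon/yyyy:hh:mm:ss zone]` timestamp, as in the combined log format) and the name `extract_date` are assumptions, and packaging it as an actual Pig UDF is outside this sketch.

```python
import re
from datetime import date

# English month abbreviations as they appear in the log timestamp.
_MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}
_TIMESTAMP = re.compile(r"\[(\d{2})/([A-Za-z]{3})/(\d{4}):")

def extract_date(log_line):
    """Return the request date as 'YYYY-MM-DD', or None when no valid date is found."""
    match = _TIMESTAMP.search(log_line or "")
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = _MONTHS.get(month_name)
    if month is None:
        return None
    try:
        return date(int(year), month, int(day)).isoformat()
    except ValueError:
        # For example day 31 in a 30-day month.
        return None
```

Returning `None` for malformed input mirrors the usual UDF habit of yielding a null instead of failing the whole job on one bad record.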
● Hive
● Streaming
● Native Java MapReduce
Some alternatives
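Assuming "Streaming" refers to Hadoop Streaming, where the mapper and reducer are ordinary programs exchanging tab-separated text lines, the total wage per department could be sketched as below. The functions take and return lines so they can be tried without a cluster; the comma-separated input layout and the names `mapper` and `reducer` are assumptions for this example.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'department<TAB>wage' for each comma-separated employee line."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) == 5:          # skip malformed records
            yield f"{fields[4]}\t{fields[3]}"

def reducer(sorted_lines):
    """Sum wages per department; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for department, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(float(wage) for _, wage in group)
        yield f"{department}\t{total}"
```

In a real Streaming job each function would sit in its own script, reading `sys.stdin` and printing to standard output, with the framework sorting the mapper output by key in between. Compared with the Pig script, the grouping logic has to be written by hand.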
Questions?
Thank you for listening!