Hadoop Pig: MapReduce the easy way. Nathan Bijnens http://nathan.gs @nathan_gs


DESCRIPTION

My presentation about Hadoop and Pig during the Fosdem Datadevroom 2011.


Page 1: Hadoop Pig: MapReduce the easy way!

Hadoop Pig: MapReduce the easy way.

Nathan Bijnens
http://nathan.gs
@nathan_gs

Page 2: Hadoop Pig: MapReduce the easy way!

We live in a world of data.

Page 3: Hadoop Pig: MapReduce the easy way!

● Data analysis becomes more and more important

● Increasing complexity of analysis

● Meanwhile the data we analyze grows big, fast!

s: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron

Page 4: Hadoop Pig: MapReduce the easy way!
Page 5: Hadoop Pig: MapReduce the easy way!

Hadoop is an open source Java framework aimed at data-intensive distributed applications.

It enables applications to work with thousands of nodes and petabytes of data.

Hadoop: Intro

Page 6: Hadoop Pig: MapReduce the easy way!

Hadoop was inspired by Google's MapReduce and Google File System papers.

http://labs.google.com/papers/mapreduce.html

Hadoop: Intro

Page 7: Hadoop Pig: MapReduce the easy way!

HDFS is a distributed, scalable filesystem designed to store large files.

In combination with the Hadoop JobTracker it provides data locality: tasks are scheduled on the nodes that already hold the data they process.

It automatically replicates every block to 3 data nodes, preferably with 2 copies on different data nodes within the same rack and 1 copy in another rack.

Hadoop: HDFS
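
The placement rule described on this slide can be sketched in a few lines of Python. This is only an illustration of the "2 copies in one rack, 1 copy in another" preference, not HDFS's actual placement code; the function name `place_replicas` and the rack and node names are hypothetical.

```python
import random


def place_replicas(nodes_by_rack, rng=random):
    """Pick 3 data nodes for one block: 2 in one rack, 1 in a different rack.

    nodes_by_rack maps a (hypothetical) rack name to a list of node names.
    """
    # The rack holding two copies needs at least two data nodes.
    racks_with_two = [rack for rack, nodes in nodes_by_rack.items() if len(nodes) >= 2]
    if not racks_with_two:
        raise ValueError("need a rack with at least two data nodes")
    primary_rack = rng.choice(racks_with_two)

    # The third copy goes to any non-empty rack other than the primary one.
    other_racks = [
        rack for rack, nodes in nodes_by_rack.items()
        if rack != primary_rack and nodes
    ]
    if not other_racks:
        raise ValueError("need a second rack with at least one data node")

    local_pair = rng.sample(nodes_by_rack[primary_rack], 2)
    remote_node = rng.choice(nodes_by_rack[rng.choice(other_racks)])
    return local_pair + [remote_node]


if __name__ == "__main__":
    cluster = {"rack-a": ["a1", "a2", "a3"], "rack-b": ["b1", "b2"]}
    print(place_replicas(cluster))
```

Losing a single node or even a whole rack then still leaves at least one copy of the block available.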

Page 8: Hadoop Pig: MapReduce the easy way!

● NameNode
  ● Keeps track of what is stored where
  ● In memory
  ● Single Point of Failure

● DataNodes

Hadoop: HDFS

Page 9: Hadoop Pig: MapReduce the easy way!

Hadoop: HDFS

s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

Page 10: Hadoop Pig: MapReduce the easy way!

MapReduce works by breaking processing into two phases, a map phase and a reduce phase, each defined by a function you supply.

MapReduce

Page 11: Hadoop Pig: MapReduce the easy way!

● Input
● Map
● Shuffle
● Reduce
● Output

MapReduce
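
To make the five stages concrete, here is a minimal single-process sketch in plain Python, using word counting as the job. It only mimics the data flow; it is not the Hadoop API, and the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Input -> Map -> Shuffle -> Reduce -> Output, all in one process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["we live in a world of data", "the data grows big fast"]))
```

On a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data over the network; the programmer still only writes the two functions.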

s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

Page 12: Hadoop Pig: MapReduce the easy way!

MassiveMedia / Netlog

● Cases
  ● Traffic analysis
  ● User actions
  ● ...

● On a 7 node cluster.

Use Cases: Who & how it's used

Page 13: Hadoop Pig: MapReduce the easy way!

Yahoo!

● Cases
  ● Ad Systems
  ● Web Search
  ● ...

● More than 36000 nodes!

Use Cases: Who & how it's used

s: http://wiki.apache.org/hadoop/PoweredBy

Page 14: Hadoop Pig: MapReduce the easy way!

SETI@home

● Highly CPU oriented
● Data locality is unimportant!

Use Cases: When not to use

Page 15: Hadoop Pig: MapReduce the easy way!
Page 16: Hadoop Pig: MapReduce the easy way!

Pig is a high-level data flow language whose scripts are compiled into MapReduce jobs.

Hadoop Pig: Intro

Page 17: Hadoop Pig: MapReduce the easy way!

Pig Latin (the language)

Grunt (the interactive shell)

PigServer (the Java class for embedding Pig)

Hadoop Pig: 3 components

Page 18: Hadoop Pig: MapReduce the easy way!

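-- PigStorage splits on tabs by default; ',' is assumed here because the input is a .csv file.
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);

-- One row per department, each holding a bag named "data" with that department's rows.
grouped_by_department = GROUP data BY department;

total_wage_by_department = FOREACH grouped_by_department GENERATE
    group AS department,
    COUNT(data) AS employee_count,
    SUM(data.wage) AS total_wage;

-- ORDER sorts ascending unless DESC is given.
total_ordered = ORDER total_wage_by_department BY total_wage;

total_limited = LIMIT total_ordered 10;

DUMP total_limited;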

Hadoop Pig
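
For readers new to Pig Latin, the following plain Python sketch computes the same result as the script above on a small in-memory input: group by department, count and sum, sort by total wage, keep the first 10. It assumes comma-separated rows in the column order of the LOAD schema; the function name `top_departments_by_wage` is hypothetical, and nothing here runs on Hadoop.

```python
import csv
import io
from collections import defaultdict


def top_departments_by_wage(csv_text, limit=10):
    """Return (department, employee_count, total_wage) rows, lowest total first."""
    totals = defaultdict(lambda: [0, 0.0])  # department -> [count, wage sum]

    # LOAD + GROUP + FOREACH: accumulate a count and a wage sum per department.
    for first_name, last_name, age, wage, department in csv.reader(io.StringIO(csv_text)):
        totals[department][0] += 1
        totals[department][1] += float(wage)

    rows = [(dept, count, total) for dept, (count, total) in totals.items()]
    rows.sort(key=lambda row: row[2])  # ORDER ... BY total_wage (ascending)
    return rows[:limit]                # LIMIT


if __name__ == "__main__":
    sample = "Ann,Lee,30,100.0,eng\nBob,Ray,40,50.0,ops\nCy,Poe,25,25.0,eng\n"
    for row in top_departments_by_wage(sample):
        print(row)
```

The Pig version expresses the same steps as named relations, and Pig turns them into MapReduce jobs that scale beyond a single machine's memory.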

Page 19: Hadoop Pig: MapReduce the easy way!

-- PigStorage splits on tabs by default; ',' is assumed here because the inputs are .csv files.
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    book_name:chararray,
    author_name:chararray
);

book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    price:float,
    country:chararray
);

-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');

data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;

grouped_by_book = GROUP data BY books::book_name;

total_sales_by_book = FOREACH grouped_by_book GENERATE
    group AS book,
    COUNT(data) AS sales_volume,
    SUM(data.price) AS total_sales;

STORE total_sales_by_book INTO 'book_sale_results';
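
The join-then-group flow of this script can be sketched in plain Python as well. The sketch assumes `book_id` is unique within the books input and performs an inner join, so sales without a matching book are dropped; the function name `total_sales_by_book` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def total_sales_by_book(books, book_sales):
    """Map each book name to (sales_volume, total_sales).

    books:      iterable of (book_id, book_name, author_name)
    book_sales: iterable of (book_id, price, country)
    """
    # Build a lookup for the join key (assumes one row per book_id).
    name_by_id = {book_id: name for book_id, name, _author in books}

    totals = defaultdict(lambda: [0, 0.0])  # book name -> [count, price sum]
    for book_id, price, _country in book_sales:
        if book_id in name_by_id:  # inner join: skip sales of unknown books
            entry = totals[name_by_id[book_id]]
            entry[0] += 1
            entry[1] += price

    return {name: (count, total) for name, (count, total) in totals.items()}


if __name__ == "__main__":
    books = [(1, "Snow", "Pamuk"), (2, "Dune", "Herbert")]
    sales = [(1, 10.0, "BE"), (1, 12.5, "NL"), (3, 9.0, "BE")]
    print(total_sales_by_book(books, sales))
```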

Page 20: Hadoop Pig: MapReduce the easy way!

● Custom Load and Store classes
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog

● Custom extraction, e.g. date, ...

Take a look at the PiggyBank.

UDF
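
As an example of the "custom extraction" case, here is a hedged sketch of a date-extraction UDF written in Python. It assumes your Pig version supports Python UDFs and ships the `outputSchema` decorator in `pig_util`; when that module is missing, a stand-in decorator lets the function run and be tested outside Pig. The function name `extract_year` and the timestamp format are assumptions, not something taken from the PiggyBank.

```python
from datetime import datetime

try:
    # Assumed to be provided by Pig when the file is registered as a Python UDF.
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """Stand-in decorator so the UDF can be exercised outside Pig."""
        def decorator(func):
            return func
        return decorator


@outputSchema("year:int")
def extract_year(timestamp):
    """Return the year of a 'YYYY-MM-DD HH:MM:SS' string, or None if it does not parse."""
    if timestamp is None:
        return None
    try:
        return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").year
    except ValueError:
        return None


if __name__ == "__main__":
    print(extract_year("2011-02-05 14:00:00"))
```

Returning None for unparseable input keeps one bad record from failing the whole job; Pig treats it as a null value.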

Page 21: Hadoop Pig: MapReduce the easy way!

● Hive

● Streaming

● Native Java MapReduce

Some alternatives

Page 22: Hadoop Pig: MapReduce the easy way!

Questions?

Page 23: Hadoop Pig: MapReduce the easy way!

Thank you for listening!