Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins


Pig Latin: A Not-So-Foreign Language For Data Processing

Research

Data Processing Renaissance

Internet companies swimming in data
• E.g. TBs/day at Yahoo!

Data analysis is “inner loop” of product innovation

Data analysts are skilled programmers

Data Warehousing …?

Scale: Often not scalable enough

$ $ $ $: Prohibitively expensive at web scale
• Up to $200K/TB

SQL
• Little control over execution method
• Query optimization is hard
• Parallel environment
• Little or no statistics
• Lots of UDFs

New Systems For Data Analysis

Map-Reduce

Apache Hadoop

Map-Reduce

[Figure: map-reduce data flow, from input records through reduce to output records]

Just a group-by-aggregate?
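One way to read the question above: a map-reduce job is roughly a group-by on the key emitted by map, followed by an aggregate computed by reduce. The sketch below simulates that in a single Python process. The names `map_reduce`, `emit_url`, and `count_values` are hypothetical and chosen for illustration; this is not Hadoop's API, and the sample records simply mirror the Visits table used later in the talk.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Single-process simulation of one map-reduce job (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each input record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # stands in for the shuffle: group by key
    # Reduce phase: each key and its grouped values yield output records.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# Sample input records (user, url, time).
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]

def emit_url(record):
    """Map: key each visit by its url."""
    _user, url, _time = record
    yield url, 1

def count_values(url, ones):
    """Reduce: aggregate the grouped values into a count."""
    yield url, sum(ones)

visit_counts = map_reduce(visits, emit_url, count_values)
print(visit_counts)
```

In this reading, the group-by key is whatever map emits and the aggregate is whatever reduce computes, which is why the model covers more than a plain SQL group-by while keeping the same shape.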

The Map-Reduce Appeal

Scale: Scalable due to simpler design
• Only parallelizable operations
• No transactions

$: Runs on cheap commodity hardware

Procedural control: a processing “pipe” rather than SQL

Disadvantages

1. Extremely rigid data flow

Other flows constantly hacked in
• Join, union
• Split
• Chains (e.g. M M R M)

2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct

3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
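To make the second and third points concrete, here is a minimal sketch of a join written by hand in map-reduce style, simulated in plain Python. The function names (`tag_visit`, `tag_url_info`, `join_reduce`, `hand_coded_join`) and the sample records are assumptions made for illustration, not code from the talk or from Hadoop. Note that nothing outside `join_reduce` reveals that the job is a join, which is the sense in which the semantics are hidden.

```python
from collections import defaultdict

def tag_visit(record):
    """Map function for (url, count) records: key by url, tag the source."""
    url, count = record
    return url, ("visit", count)

def tag_url_info(record):
    """Map function for (url, category, page_rank) records: key by url, tag the source."""
    url, category, page_rank = record
    return url, ("info", (category, page_rank))

def join_reduce(url, tagged_values):
    """Reduce function: pair every visit value with every info value for one url."""
    counts = [value for tag, value in tagged_values if tag == "visit"]
    infos = [value for tag, value in tagged_values if tag == "info"]
    for count in counts:
        for category, page_rank in infos:
            yield url, count, category, page_rank

def hand_coded_join(visit_counts, url_info):
    """Simulate the map, shuffle, and reduce steps of a reduce-side join."""
    groups = defaultdict(list)
    for record in visit_counts:
        key, value = tag_visit(record)
        groups[key].append(value)
    for record in url_info:
        key, value = tag_url_info(record)
        groups[key].append(value)
    joined = []
    for url in sorted(groups):
        joined.extend(join_reduce(url, groups[url]))
    return joined

visit_counts = [("cnn.com", 2), ("bbc.com", 1)]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("espn.com", "Sports", 0.9),
]
joined = hand_coded_join(visit_counts, url_info)
print(joined)
```

The url with no visit record produces no output row, so this behaves like an inner join; every team that needs a join ends up rewriting some variant of this tagging logic.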

Pros And Cons

Need a high-level, general data flow language

Enter Pig Latin


Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin example

• Salient features

• Implementation

Example Data Analysis Task

Find the top 10 most visited pages in each category

Visits

User    Url           Time
Amy     cnn.com       8:00
Amy     bbc.com       10:00
Amy     flickr.com    10:05
Fred    cnn.com       12:00

Url Info

Url           Category    PageRank
cnn.com       News        0.9
bbc.com       News        0.8
flickr.com    Photos      0.7
espn.com      Sports      0.9

Data Flow

Load Visits

Group by url

Foreach url generate count

Load Url Info

Join on url

Group by category

Foreach category generate top10 urls

In Pig Latin

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
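For readers who want to check what the program computes, the same data flow can be traced in plain Python on the sample Visits and Url Info rows. This is only a sketch of the semantics under two assumptions: the join is an inner join, and `top(visitCounts,10)` ranks urls by visit count. It says nothing about how Pig executes the program.

```python
from collections import Counter, defaultdict

# Sample rows from the Visits and Url Info tables.
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]

# group visits by url; foreach url generate count
visit_counts = Counter(url for _user, url, _time in visits)

# join visitCounts by url, urlInfo by url (assumed to be an inner join)
joined = [
    (url, visit_counts[url], category, p_rank)
    for url, category, p_rank in url_info
    if url in visit_counts
]

# group visitCounts by category
by_category = defaultdict(list)
for url, count, category, _p_rank in joined:
    by_category[category].append((url, count))

# foreach category generate the top 10 urls (assumed: ranked by visit count)
top_urls = {
    category: sorted(rows, key=lambda row: row[1], reverse=True)[:10]
    for category, rows in by_category.items()
}
print(top_urls)
```

Each Python step corresponds to one Pig Latin statement, which is the step-by-step style the talk argues for; espn.com drops out because it has no visits to join against.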

Outline

• Salient features

• Implementation

Step-by-step Procedural Control

Target users are entrenched procedural programmers

The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.

Jasmine Novak, Engineer, Yahoo!

• Automatic query optimization is hard
• Pig Latin does not preclude optimization

With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.

David Ciemiewicz, Search Excellence, Yahoo!

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

Quick Start and Interoperability

Operates directly over files


Schemas optional; can be assigned dynamically

User-Code as a First-Class Citizen

User-defined functions (UDFs) can be used in every construct
• Load, Store
• Group, Filter, Foreach

Nested Data Model

• Pig Latin has a fully-nestable data model with:
– Atomic values, tuples, bags (lists), and maps

• More natural to programmers than flat tuples
• Avoids expensive joins
• See paper

[Figure: a nested tuple pairing "yahoo" with the bag {finance, email, news}]
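A rough way to picture the four kinds of values is to model them with Python built-ins, as sketched below. Reading the figure as a tuple that pairs "yahoo" with a bag of finance, email, and news is an assumption, and the flat tables `sites` and `site_sections` are hypothetical names added only for comparison.

```python
# The four kinds of values, modeled with Python built-ins (illustrative only).
atom = "yahoo"                                 # atomic value
bag = [("finance",), ("email",), ("news",)]    # bag: a collection of tuples
nested_tuple = (atom, bag)                     # tuple whose second field is a bag
data_map = {"name": atom, "sections": bag}     # map from keys to nested values

# Nested model: related values travel with their parent tuple.
sections_nested = [section for (section,) in nested_tuple[1]]

# Flat model: the same information needs two tables and a join on site name.
sites = [("yahoo",)]
site_sections = [("yahoo", "finance"), ("yahoo", "email"), ("yahoo", "news")]
sections_flat = [
    section
    for (site,) in sites
    for (section_site, section) in site_sections
    if section_site == site
]

print(sections_nested == sections_flat)
```

Both forms carry the same information; the nested form reads it with a single traversal, while the flat form has to match rows across two tables first.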

Outline

• Salient features

• Implementation

Implementation

[Figure: Pig (automatic rewrite + optimize) → Hadoop Map-Reduce → cluster]

Pig is open-source.
http://incubator.apache.org/pig

Compilation into Map-Reduce

Load Visits

Group by url

Foreach url generate count

Load Url Info

Join on url

Group by category

Foreach category generate top10(urls)

Reduce1

Map2

Reduce2

Reduce3

Every group or join operation forms a map-reduce boundary

Other operations pipelined into map and reduce phases
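The two rules above can be sketched as a small planner over a linearized version of the example data flow. This is a simplified illustration, not Pig's actual compiler: `compile_to_jobs` and `SHUFFLE_OPS` are hypothetical names, the plan is flattened into a list, and the Url Info load is folded into the join step to keep the plan linear.

```python
SHUFFLE_OPS = {"group", "join"}  # operators that need data regrouped by key

def compile_to_jobs(plan):
    """Split a linear plan of (kind, label) steps into map-reduce jobs."""
    jobs = []
    current = {"map": [], "reduce": []}
    phase = "map"
    for kind, label in plan:
        if kind in SHUFFLE_OPS:
            if phase == "reduce":
                # A second shuffle cannot fit in the same job: start a new one.
                jobs.append(current)
                current = {"map": [], "reduce": []}
            current["map"].append(label)  # keys are assigned on the map side
            phase = "reduce"
        else:
            current[phase].append(label)  # pipelined into the current phase
    jobs.append(current)
    return jobs

plan = [
    ("load", "Load Visits"),
    ("group", "Group by url"),
    ("foreach", "Foreach url generate count"),
    ("join", "Join on url (with Url Info)"),
    ("group", "Group by category"),
    ("foreach", "Foreach category generate top10(urls)"),
]

jobs = compile_to_jobs(plan)
for number, job in enumerate(jobs, start=1):
    print(f"Map{number}: {job['map']}  Reduce{number}: {job['reduce']}")
```

On this plan the sketch yields three jobs, matching the three reduce stages named above. The second job's reduce list is empty here only because the sketch records each group or join on the map side; the regrouped data is still consumed in that job's reduce phase.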

• First production release about a year ago

• 150+ early adopters within Yahoo!

• Over 25% of the Yahoo! map-reduce user base

Related Work

• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation

• DryadLINQ
– SQL-like language on top of Dryad

• Nested data models
– Object-oriented databases

Future Work

• Optional “safe” query optimizer
– Performs only high-confidence rewrites

• User interface
– Boxes and arrows UI
– Promote collaboration, sharing code fragments and

• Tight integration with a scripting language
– Use loops, conditionals of host language

Credits

Arun Murthy, Pi Song, Santhosh Srinivasan, Amir Youssefi

Shubham Chopra, Alan Gates, Shravan Narayanamurthy, Olga Natkovich

Summary

• Big demand for parallel data processing
– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files

• Hence the excitement about Map-Reduce

• But, Map-Reduce is too low-level and rigid

Pig Latin: Sweet spot between map-reduce and SQL
