Jeremy Karn - co-founder, Mortar MongoDB + Pig

MongoDB + Pig on Hadoop (MongoSV 2012)


DESCRIPTION

Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process Mongo data with Hadoop—specifically with Apache Pig. Jeremy's presentation covered the steps needed to read JSON from Mongo into Pig, parallel process it on Hadoop with sophisticated functions, and write back to Mongo. This talk will demonstrate its concepts with Mortar, which has contributed to the Mongo Hadoop connector, extending it to work with Pig.

Citation preview

Page 1: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB + Pig
Jeremy Karn - co-founder, Mortar

Page 2: MongoDB + Pig on Hadoop (MongoSV 2012)

Overview of This Session

Intro to Hadoop
Intro to Pig
Why MongoDB + Pig?
Demo: load Pig
Demo: processing data with Pig
Demo: store data from Pig to MongoDB

Page 4: MongoDB + Pig on Hadoop (MongoSV 2012)

Hadoop: Rapid Overview

Page 5: MongoDB + Pig on Hadoop (MongoSV 2012)

Hadoop: Rapid Overview

Hadoop implements MapReduce (Java)
Created by Doug Cutting
Incubated at Yahoo
Used for indexing, spam detection, and more
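
The MapReduce model that Hadoop implements can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop's Java API; the function names (`run_job`, `count_words_mapper`, `sum_reducer`) are made up for this example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run the three phases in order on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


def count_words_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["pig on hadoop", "mongo and pig"]
    print(run_job(lines, count_words_mapper, sum_reducer))
```

On a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data across the network; the sketch keeps only the data flow.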

Page 6: MongoDB + Pig on Hadoop (MongoSV 2012)

Hadoop: Strengths

Scalable
Open source
Lots of momentum
Very broadly applicable

Page 7: MongoDB + Pig on Hadoop (MongoSV 2012)

Social Graph

Page 8: MongoDB + Pig on Hadoop (MongoSV 2012)

Predict

Page 9: MongoDB + Pig on Hadoop (MongoSV 2012)

Detect

Page 10: MongoDB + Pig on Hadoop (MongoSV 2012)

Genetics

Page 11: MongoDB + Pig on Hadoop (MongoSV 2012)

Hadoop: Problems

Difficult
Batch only (...or it was)

Page 12: MongoDB + Pig on Hadoop (MongoSV 2012)

Hadoop: Future

YARN
MapReduce optional
Generic management + distributed apps
Impala

Page 13: MongoDB + Pig on Hadoop (MongoSV 2012)

Alternatives to Hadoop: MongoDB Native MapReduce

Write MapReduce in JavaScript
• JavaScript is not fast
• Has limited data types
• Hard to use complex analytic libs
Adds load to data store

Page 14: MongoDB + Pig on Hadoop (MongoSV 2012)

Alternatives to Hadoop: MongoDB Native MapReduce

Hadoop has libs for
• Machine learning
• ETL
• Can access any JVM analytic libs
And many organizations already use Hadoop

Page 15: MongoDB + Pig on Hadoop (MongoSV 2012)

Alternatives to Hadoop: MongoDB Aggregation Framework

Great when
• Doing SQL-style aggregation
• Do not require external data libs
• Users will learn framework

Page 16: MongoDB + Pig on Hadoop (MongoSV 2012)

Alternatives to Hadoop: MongoDB Aggregation Framework

But you may want Hadoop when
• Doing sophisticated aggregation
• Require external data libs
• Users unwilling to learn framework
• Need to transfer workload off datastore

Page 17: MongoDB + Pig on Hadoop (MongoSV 2012)

Pig: On Hadoop

Less code
Expressive code

Page 18: MongoDB + Pig on Hadoop (MongoSV 2012)

Pig: Brief, Expressive, Like Procedural SQL

(thanks: Twitter Hadoop World presentation)

Page 19: MongoDB + Pig on Hadoop (MongoSV 2012)

The Same Script, in MapReduce: For Serious

Page 20: MongoDB + Pig on Hadoop (MongoSV 2012)

Pig: On Hadoop

Less code
Expressive code
Compiles to MR
Insulates from API
Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford)

Page 21: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB + Pig: Motivations

Data storage and data processing are often separate concerns

Hadoop is built for scalable processing of large datasets

Page 22: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB, Pig: Similar Stance

Poly-structured data
• MongoDB: stores data, regardless of structure
• Pig: reads data, regardless of structure (got its name because pigs are omnivorous)

Page 23: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB, Pig: JSON-Pig Data Type Mapping

JSON      Pig
string    chararray
integer   int
boolean   boolean
double    double
array     bag
object    map/tuple
null      null
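
The JSON-to-Pig mapping above can be expressed as a small lookup over parsed JSON values. The sketch below is illustrative only: `pig_type_of` and `describe` are hypothetical helper names, not part of Pig or the Mongo Hadoop connector, and the sketch always reports `map` for JSON objects even though the slide lists map/tuple.

```python
import json


def pig_type_of(value):
    """Return the Pig type name the slide pairs with a parsed JSON value."""
    if value is None:
        return "null"
    # bool must be tested before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, dict):
        return "map"  # assumption: this sketch never chooses tuple
    raise TypeError(f"not a JSON value: {value!r}")


def describe(json_text):
    """Map each top-level field of a JSON object to its Pig type name."""
    document = json.loads(json_text)
    return {name: pig_type_of(value) for name, value in document.items()}


if __name__ == "__main__":
    print(describe('{"mailbox": "bass-e", "size": 450, "headers": {}}'))
```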

Page 24: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB, Pig: MongoDB-Pig Data Type Mapping

MongoDB      Pig
date         datetime
object id    chararray
binary data  bytearray
regexp       chararray
code         chararray

Page 25: MongoDB + Pig on Hadoop (MongoSV 2012)

Mortar: Fast Intro

Open-source code-based dev framework for data, built on Hadoop and Pig

Inspired by Rails

Self-contained, organized, executable projects

Page 26: MongoDB + Pig on Hadoop (MongoSV 2012)

> gem install mortar

Page 27: MongoDB + Pig on Hadoop (MongoSV 2012)

> mortar new my_project

Page 28: MongoDB + Pig on Hadoop (MongoSV 2012)
Page 29: MongoDB + Pig on Hadoop (MongoSV 2012)

Mortar: Fast Intro

Our service hosts and executes mortar projects

Page 30: MongoDB + Pig on Hadoop (MongoSV 2012)

> mortar jobs:run your_pigscript --clustersize 5

Page 31: MongoDB + Pig on Hadoop (MongoSV 2012)
Page 32: MongoDB + Pig on Hadoop (MongoSV 2012)

Mortar: Fast Intro

Browser-only interface, great for demonstrating Hadoop

Page 33: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB, Pig: Loading Data

One requirement:
• Must specify top-level fields to load from the MongoDB collection.

Optional:
• Specify a subset of embedded fields
• Data type for any/all fields

Page 34: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB, Pig: Loading Data - Enron Data

{
    "body": "the ... person...",
    "subFolder": "notes_inbox",
    "mailbox": "bass-e",
    "filename": "450.",
    "headers": {
        "From": "[email protected]",
        "To": "[email protected]",
        "Subject": "Subject",
        "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
    }
}
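
The loading rule (top-level fields are required, a subset of embedded fields is optional) amounts to a projection over each document. The Python sketch below mimics that projection on an Enron-style record. It is not the connector's loader: `project` is a hypothetical helper, and the email addresses are placeholders because the originals are not recoverable from the slide.

```python
def project(document, top_level_fields, embedded_fields=None):
    """Keep only the named top-level fields, optionally narrowing embedded documents.

    top_level_fields: list of field names to load (required, must be non-empty).
    embedded_fields: optional dict mapping a top-level field to the list of
        embedded keys to keep from it.
    """
    if not top_level_fields:
        raise ValueError("at least one top-level field must be specified")
    embedded_fields = embedded_fields or {}
    row = {}
    for name in top_level_fields:
        value = document.get(name)  # fields absent from the document load as None
        wanted = embedded_fields.get(name)
        if wanted is not None and isinstance(value, dict):
            value = {key: value.get(key) for key in wanted}
        row[name] = value
    return row


if __name__ == "__main__":
    email = {
        "body": "the ... person...",
        "mailbox": "bass-e",
        "headers": {
            "From": "sender@example.com",   # placeholder address
            "To": "recipient@example.com",  # placeholder address
            "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
        },
    }
    print(project(email, ["mailbox", "headers"], {"headers": ["From", "Date"]}))
```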

Page 35: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB, Pig: Script Demo

Page 36: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB, Pig: Store Statement

The MongoStorage function takes an optional list of arguments of two types:
• A single set of keys to base updating on. This has three options: None, update, or multi.
• Multiple indexes to ensure, in the same format as db.col.ensureIndex().
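
One plausible reading of the three update options can be simulated on an in-memory list of documents. Everything here is an assumption made for illustration: the slide does not define the options, so the sketch borrows MongoDB's usual meanings (no keys means plain insert, update changes the first match, multi changes every match) and additionally assumes a row with no match is inserted. `store` is a hypothetical function, not the MongoStorage API.

```python
def store(collection, rows, mode=None, keys=()):
    """Write rows into an in-memory collection (a list of dicts).

    mode=None     : append every row (plain insert).
    mode="update" : overwrite the first document whose `keys` match the row.
    mode="multi"  : overwrite every document whose `keys` match the row.
    Rows with no matching document are appended (assumed upsert behaviour).
    """
    if mode is None:
        collection.extend(dict(row) for row in rows)
        return collection
    if mode not in ("update", "multi"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not keys:
        raise ValueError("update modes need at least one key")
    for row in rows:
        matches = [
            doc for doc in collection
            if all(doc.get(key) == row.get(key) for key in keys)
        ]
        if not matches:
            collection.append(dict(row))
        elif mode == "update":
            matches[0].update(row)
        else:
            for doc in matches:
                doc.update(row)
    return collection


if __name__ == "__main__":
    docs = [{"mailbox": "bass-e", "count": 1}, {"mailbox": "bass-e", "count": 2}]
    print(store(docs, [{"mailbox": "bass-e", "count": 9}], mode="multi", keys=("mailbox",)))
```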

Page 37: MongoDB + Pig on Hadoop (MongoSV 2012)

Pig: Illustrate

Auto-select dataset

Exercise every execution path

Step-by-step execution

Page 38: MongoDB + Pig on Hadoop (MongoSV 2012)

Pig: Why Illustrate

Write correct code quickly

Understand others’ code

Test every execution path, every step

Page 39: MongoDB + Pig on Hadoop (MongoSV 2012)

Pig: User-Defined Functions (UDF)

Pig is like procedural SQL

UDFs for rich data manipulation

UDFs: Java-based language

We made Pig work with CPython (NumPy, etc)
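
A Python UDF is typically an ordinary function annotated with the schema of its return value. The sketch below shows that shape with a made-up function, `email_domain`. The `pig_util` import is an assumption about the runtime: when the module is not available (for example, when running the file directly), a stand-in decorator that only records the schema string is defined instead, so the file runs outside Pig too.

```python
try:
    # Assumption: available when the file is registered as a Python UDF in Pig.
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """Stand-in decorator: record the schema string on the function."""
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lowercased domain of an email address, or None if there is none."""
    if address is None or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()


if __name__ == "__main__":
    print(email_domain("Someone@Example.COM"))
```

Returning None for missing or malformed input matters in practice, since poly-structured data often lacks the field a UDF expects.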

Page 40: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB + Pig: Without Mortar

Get the mongo-hadoop connector: http://github.com/mongodb/mongo-hadoop

Page 41: MongoDB + Pig on Hadoop (MongoSV 2012)

MongoDB + Pig: Summary

Hadoop and friends are maturing
MongoDB and Pig are philosophically aligned
Reading from MongoDB into Pig and writing back is straightforward
Once in Pig (Hadoop)
• massive batch calcs / analytics possible
• work is offloaded
• external libraries available