MongoDB and Hadoop are often used together to deliver a powerful "Big Data" solution for complex analytics and data processing. With the MongoDB-Hadoop connector, you can easily take your data from MongoDB and process it using Hadoop. In this webinar, we will introduce new features in Hadoop integration, including using MongoDB as the input/output for Hadoop MapReduce jobs, running MapReduce jobs against static backup files, and using Pig to build data workflows with MongoDB. This webinar covers versions 1.0 through 1.2.
Mongo-Hadoop Integration
Mike O'Brien, Software Engineer @ 10gen
Thursday, August 8, 13
We will cover:
- A quick briefing on what Mongo and Hadoop are all about
- The Mongo-Hadoop connector: what it is, how it works, and a tour of what it can do
(Q+A at the end)
Upcoming Webinar: Choosing the Right Tool for the Task
MongoDB and Hadoop - Essential Tools for Your Big Data Playbook
August 21st, 2013; 10am PDT, 1pm EDT, 6pm BST
Register at 10gen.com/events/biz-hadoop
document-oriented database with dynamic schema

stores data in JSON-like documents:

{ _id : "mike",
  age : 21,
  location : {
    state : "NY",
    zip : "11222"
  },
  favorite_colors : ["red", "green"]
}
mongoDB scales horizontally with sharding to handle lots of data and load
Java-based framework for Map/Reduce

Excels at batch processing on large data sets by taking advantage of parallelism
Mongo-Hadoop Connector - Why

Lots of people using Hadoop and Mongo separately, but need integration

Custom code, or slow and hacky import/export scripts, are often used to get data in and out

Scalability and flexibility with changes in Hadoop or MongoDB configurations

Need to process data across multiple sources
Mongo-Hadoop Connector
Turn MongoDB into a Hadoop-enabled filesystem: use as the input or output for Hadoop

New Feature: As of v1.1, also works with MongoDB backup files (.bson)

(diagram: input data, from MongoDB or .BSON → Hadoop Cluster → output results, to MongoDB or .BSON)
Mongo-Hadoop Connector
Benefits + Features

Takes advantage of full multi-core parallelism to process data in Mongo

Full integration with Hadoop and JVM ecosystems

Can be used with Amazon Elastic MapReduce

Can read and write backup files from local filesystem, HDFS, or S3
Mongo-Hadoop Connector
Benefits + Features

Vanilla Java MapReduce

or, if you don't want to use Java, support for Hadoop Streaming: write MapReduce code in ruby or python
Mongo-Hadoop Connector
Benefits + Features

Support for Pig: high-level scripting language for data analysis and building map/reduce workflows

Support for Hive: SQL-like language for ad-hoc queries + analysis of data sets on Hadoop-compatible file systems
Mongo-Hadoop Connector
How it works:

Adapter examines the MongoDB input collection and calculates a set of splits from the data

Each split gets assigned to a node in the Hadoop cluster

In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally

Hadoop merges results and streams output back to MongoDB or BSON
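The split-calculation step can be illustrated with a toy model. The real connector computes splits from the collection itself (for example via chunk or key-range boundaries); this hypothetical sketch only shows the core idea of cutting the input into pieces that Hadoop nodes can process independently:

```python
def calculate_splits(sorted_ids, max_docs_per_split):
    """Partition a sorted list of document _ids into contiguous splits."""
    return [sorted_ids[i:i + max_docs_per_split]
            for i in range(0, len(sorted_ids), max_docs_per_split)]

# ten documents, at most 4 per split -> 3 splits, each assignable to a node
splits = calculate_splits(list(range(10)), 4)
```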
Tour of Mongo-Hadoop, by Example

- Using Java MapReduce with Mongo-Hadoop
- Using Hadoop Streaming
- Pig and Hive with Mongo-Hadoop
- Elastic MapReduce + BSON
Input Data: Enron e-mail corpus (501k records, 1.75 GB); each document is one email. The "From" header is the sender; the "To" header holds the recipients.

{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n ",
  "filename" : "1.",
  "headers" : {
    "From" : "[email protected]",
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "[email protected]",
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair

(diagram: nodes bob, alice, eve, and charlie, with edges labeled 1499, 9, 48, and 20 messages)

{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 14}
{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 9}
{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 99}
{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 48}
{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 20}
. . .
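The job itself is small enough to sketch in a few lines of plain Python first (an in-memory illustration with made-up addresses, not connector code):

```python
from collections import Counter

# two hypothetical email documents in the Enron shape
docs = [
    {'headers': {'From': 'bob@enron', 'To': 'alice@enron, eve@enron'}},
    {'headers': {'From': 'bob@enron', 'To': 'alice@enron'}},
]

# map: emit a ((from, to), 1) pair for every recipient of every message
pairs = []
for doc in docs:
    h = doc['headers']
    if 'From' in h and 'To' in h:
        for recip in (r.strip() for r in h['To'].split(',')):
            pairs.append((h['From'], recip))

# reduce: sum the 1s under each (from, to) key
counts = Counter(pairs)
```

Hadoop performs exactly this grouping and summing, but distributed across the splits described earlier.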
Example 1 - Java MapReduce

Map phase - each input doc (a mongoDB document passed into Hadoop MapReduce as a BSONObject) gets passed through a Mapper function:

@Override
public void map(NullWritable key, BSONObject val, final Context context){
    BSONObject headers = (BSONObject)val.get("headers");
    if(headers.containsKey("From") && headers.containsKey("To")){
        String from = (String)headers.get("From");
        String to = (String)headers.get("To");
        String[] recips = to.split(",");
        for(int i = 0; i < recips.length; i++){
            String recip = recips[i].trim();
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
Example 1 - Java MapReduce (cont)

Reduce phase - outputs of Map are grouped together by key and passed to the Reducer. pKey is the {to, from} key, pValues is the list of all the values collected under the key, and the output is written back to MongoDB:

public void reduce( final MailPair pKey,
                    final Iterable<IntWritable> pValues,
                    final Context pContext ){
    int sum = 0;
    for ( final IntWritable value : pValues ){
        sum += value.get();
    }
    BSONObject outDoc = new BasicDBObjectBuilder().start()
        .add( "f" , pKey.from )
        .add( "t" , pKey.to )
        .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write( pkeyOut, new IntWritable(sum) );
}
Example 1 - Java MapReduce (cont)

Read from MongoDB:
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

Read from BSON:
mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
  (also hdfs:///tmp/messages.bson or s3:///tmp/messages.bson)
Example 1 - Java MapReduce (cont)

Write output to MongoDB:
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

Write output to BSON:
mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
  (also hdfs:///tmp/results.bson or s3:///tmp/results.bson)
Results : Output Data

mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 2 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 4 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
{ "_id" : { "t" : "[email protected]", "f" : "[email protected]" }, "count" : 1 }
...has more
Example 2 - Hadoop Streaming
Let’s do the same Enron Map/Reduce job with Python instead of Java
$ pip install pymongo_hadoop
Example 2 - Hadoop Streaming (cont)

Hadoop passes data to an external process via STDOUT/STDIN

(diagram: the hadoop JVM pipes records over STDIN to a Python / Ruby / JS interpreter running map(), and reads results back over STDOUT)

def mapper(documents):
    . . .
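For comparison, classic text-mode Hadoop Streaming moves newline-delimited records over the same pipes; pymongo_hadoop swaps the text framing for BSON documents, but the plumbing is the same. A minimal, hypothetical text-mode mapper:

```python
import io

def run_mapper(stdin, stdout):
    # one record per input line; emit tab-separated key/value pairs
    for line in stdin:
        for word in line.split():
            stdout.write('%s\t1\n' % word)

# in a real job this would be run_mapper(sys.stdin, sys.stdout);
# simulated here with in-memory streams
out = io.StringIO()
run_mapper(io.StringIO('a b a\n'), out)
```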
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
Surviving Hadoop: making MapReduce easier with Pig + Hive
Example 3 - Mongo-Hadoop and Pig

Let's do the same thing yet again, but this time using Pig

Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts

Can perform JOIN, GROUP, and execute user-defined functions (UDFs)
Example 3 - Mongo-Hadoop and Pig (cont)

Pig directives for loading data: BSONLoader and MongoLoader

data = LOAD 'mongodb://localhost:27017/db.collection'
    using com.mongodb.hadoop.pig.MongoLoader;

Writing data out: BSONStorage and MongoInsertStorage

STORE records INTO 'file:///output.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
Example 3 - Mongo-Hadoop and Pig (cont)

Pig has its own special datatypes: Bags, Maps, and Tuples

Mongo-Hadoop Connector intelligently converts between Pig datatypes and MongoDB datatypes
Example 3 - Mongo-Hadoop and Pig (cont)

raw = LOAD 'hdfs:///messages.bson'
    using com.mongodb.hadoop.pig.BSONLoader('','headers:[]');

send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;

send_recip_filtered = FILTER send_recip BY to IS NOT NULL;

send_recip_split = FOREACH send_recip_filtered GENERATE from as from,
    TRIM(FLATTEN(TOKENIZE(to))) as to;

send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE group, COUNT($1) as count;

STORE send_recip_counted INTO 'file:///enron_results.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
Hive with Mongo-Hadoop

Similar idea to Pig - process your data without needing to write Map/Reduce code from scratch

...but with SQL as the language of choice
Hive with Mongo-Hadoop

first, declare the collection to be accessible in Hive:

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users" );

Sample Data (db.users.find()):
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
...
Hive with Mongo-Hadoop

. . . then you can run SQL on it, like a table:

SELECT name, age FROM mongo_users WHERE id > 100;

you can use GROUP BY:

SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;

or JOIN multiple tables/collections together:

SELECT * FROM mongo_users T1 JOIN user_emails T2 WHERE T1.id = T2.id;
Hive with Mongo-Hadoop

Write the output of queries back into new tables:

INSERT OVERWRITE TABLE old_users SELECT id, name, age FROM mongo_users WHERE age > 100;

Drop a table in Hive to delete the underlying collection in MongoDB:

DROP TABLE mongo_users;
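As a sanity check, the first SELECT above can be reproduced against the sample db.users data in plain Python (illustrative only; Hive runs this as a distributed job):

```python
users = [
    {'_id': 1, 'name': 'Tom', 'age': 28},
    {'_id': 2, 'name': 'Alice', 'age': 18},
    {'_id': 3, 'name': 'Bob', 'age': 29},
    {'_id': 101, 'name': 'Scott', 'age': 10},
    {'_id': 104, 'name': 'Jesse', 'age': 52},
    {'_id': 110, 'name': 'Mike', 'age': 32},
]

# SELECT name, age FROM mongo_users WHERE id > 100
rows = [(u['name'], u['age']) for u in users if u['_id'] > 100]
# -> [('Scott', 10), ('Jesse', 52), ('Mike', 32)]
```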
Usage with Amazon Elastic MapReduce

Run mongo-hadoop jobs without needing to set up or manage your own Hadoop cluster.
Usage with Amazon Elastic MapReduce

First, make a "bootstrap" script that fetches dependencies (mongo-hadoop jar and java drivers):

#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoop-code/mongo-hadoop-core_1.1.2-1.1.0.jar

this will get executed on each node in the cluster that EMR builds for us.
Example 4 - Usage with Amazon Elastic MapReduce

Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it.

s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read
s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read
Example 4 - Usage with Amazon Elastic MapReduce

. . . then launch the job from the command line, pointing to your S3 locations. The --instance-type and --num-instances flags control the type and number of instances in the cluster.

$ elastic-mapreduce --create --jobflow ENRON000 \
    --instance-type m1.xlarge \
    --num-instances 5 \
    --bootstrap-action s3://$S3_BUCKET/bootstrap.sh \
    --log-uri s3://$S3_BUCKET/enron_logs \
    --jar s3://$S3_BUCKET/enron-example.jar \
    --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat \
    --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson \
    --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT \
    --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    # (any additional parameters here)
Example 4 - Usage with Amazon Elastic MapReduce

Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster

Turn up the "num-instances" knob to make jobs complete faster

Logs get captured into S3 files

(Pig, Hive, and streaming work on EMR, too!)
Example 5 - new feature: MongoUpdateWritable

In previous examples, we wrote job output data by inserting into a new collection

. . . but we can also modify an existing output collection

Works by applying mongoDB update modifiers: $push, $pull, $addToSet, $inc, $set, etc.

Can be used to do incremental Map/Reduce or "join" two collections
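To make the modifier semantics concrete, here is a tiny plain-Python model of how $inc and $addToSet change a single document; it mimics MongoDB's documented update behavior and is not connector code:

```python
def apply_update(doc, update):
    # $inc: add an amount to a numeric field, treating a missing field as 0
    for field, amount in update.get('$inc', {}).items():
        doc[field] = doc.get(field, 0) + amount
    # $addToSet: append to an array only if the value is not already present
    for field, value in update.get('$addToSet', {}).items():
        arr = doc.setdefault(field, [])
        if value not in arr:
            arr.append(value)
    return doc

doc = apply_update({'logs_count': 2}, {'$inc': {'logs_count': 1}})
doc = apply_update(doc, {'$addToSet': {'tags': 'pressure'}})
doc = apply_update(doc, {'$addToSet': {'tags': 'pressure'}})  # no duplicate added
```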
Example 5 - MongoUpdateWritable

Let's say we have two collections.

sensors:
{
  "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
  "name": "730LsRkX",
  "type": "pressure",
  "owner": "steve"
}

log events (sensor_id refers to which sensor logged the event):
{
  "_id": ObjectId("51b792d381c3e67b0a18d678"),
  "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),
  "value": 3328.5895416489802,
  "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
  "loc": [-175.13, 51.658]
}

For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.
Plain english:

Bob's sensors for temperature have stored 1300 readings
Bob's sensors for pressure have stored 400 readings
Alice's sensors for humidity have stored 600 readings
Alice's sensors for temperature have stored 700 readings
etc...
Stage 1 - Map/Reduce on sensors collection

read from mongoDB (sensors collection) → map/reduce → insert() new records into mongoDB (Results collection)

for each sensor, emit: {key: owner+type, value: _id}
group data from map() under each key, output: {key: owner+type, val: [list of _ids]}
After stage one, the output docs look like:

{
  "_id": "alice pressure",        ← the sensor's owner and type
  "sensors": [                    ← list of IDs of sensors with this owner and type
    ObjectId("51b792d381c3e67b0a18d475"),
    ObjectId("51b792d381c3e67b0a18d16d"),
    ObjectId("51b792d381c3e67b0a18d2bf"),
    …
  ]
}

Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
Stage 2 - Map/Reduce on log events collection

read from mongoDB (log events collection) → map/reduce → update() existing records in mongoDB (Results collection)

for each log event, emit: {key: sensor_id, value: 1}
group data from map() under each key; for each value in that key:
    update({sensors: key}, {$inc: {logs_count: 1}})

context.write(null, new MongoUpdateWritable(
    query,    // which documents to modify
    update,   // how to modify ($inc)
    true,     // upsert
    false));  // multi
Example 5 - MongoUpdateWritable

Result after stage 2, now populated with the correct count:

{
  "_id": "1UoTcvnCTz temp",
  "sensors": [
    ObjectId("51b792d381c3e67b0a18d475"),
    ObjectId("51b792d381c3e67b0a18d16d"),
    ObjectId("51b792d381c3e67b0a18d2bf"),
    …
  ],
  "logs_count": 1050616
}
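The whole two-stage flow can be simulated end-to-end in plain Python (hypothetical in-memory data standing in for the sensors and log events collections):

```python
from collections import defaultdict

sensors = [
    {'_id': 's1', 'owner': 'alice', 'type': 'pressure'},
    {'_id': 's2', 'owner': 'alice', 'type': 'pressure'},
    {'_id': 's3', 'owner': 'bob',   'type': 'temp'},
]
log_events = [{'sensor_id': 's1'}, {'sensor_id': 's1'}, {'sensor_id': 's3'}]

# Stage 1: group sensor _ids under an "owner type" key and insert() result docs
groups = defaultdict(list)
for s in sensors:
    groups[s['owner'] + ' ' + s['type']].append(s['_id'])
results = {k: {'_id': k, 'sensors': ids, 'logs_count': 0}
           for k, ids in groups.items()}

# Stage 2: for each log event, the equivalent of
# update({sensors: sensor_id}, {$inc: {logs_count: 1}})
for ev in log_events:
    for doc in results.values():
        if ev['sensor_id'] in doc['sensors']:
            doc['logs_count'] += 1
```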
Upcoming Features (v1.2 and beyond)
Full-featured Hive support
Performance Improvements - Lazy BSON
Support for multi-collection input sources
API for adding custom splitter implementations
and more
Recap
Mongo-Hadoop - use Hadoop to do massive computations on big data sets stored in Mongo/BSON
Tools and APIs make it easier: Streaming, Pig, Hive, EMR, etc.
MongoDB becomes a Hadoop-enabled filesystem
Questions?

Examples can be found on github:
https://github.com/mongodb/mongo-hadoop/tree/master/examples