MongoDB Hadoop Connector
Luke LovettMaintainer, mongo-hadoop
https://github.com/mongodb/mongo-hadoop
Overview
• Hadoop Overview
• Why MongoDB and Hadoop
• Connector Overview
• Technical look into new features
• What’s on the horizon?
• Wrap-up
Hadoop Overview
• Distributed data processing
• Fulfills analytical requirements
• Jobs are infrequent, batch processes
• Churn Analysis • Recommendation • Warehouse/ETL • Risk Modeling
• Trade Surveillance • Predictive Analysis • Ad Targeting • Sentiment Analysis
MongoDB + Hadoop
• MongoDB backs the application
• Satisfies queries in real time
• MongoDB + Hadoop = application data analytics
Connector Overview
• Brings operational data into the analytical lifecycle
• Supports an evolving Hadoop ecosystem
  – Apache Spark has made a huge entrance
• Makes MongoDB interaction seamless and natural
Connector Examples
MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat
Pig
-- Load a MongoDB collection as a Pig relation
data = LOAD 'mongodb://myhost/db.collection'
       USING com.mongodb.hadoop.pig.MongoLoader();
Connector Examples
MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat
Hive
CREATE EXTERNAL TABLE mongo (
  title STRING,
  address STRUCT<from:STRING, to:STRING>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler';
Connector Examples
MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat
Spark (Python)
import pymongo_spark
pymongo_spark.activate()
rdd = sc.mongoRDD('mongodb://host/db.coll')
New Features
• Hive predicate pushdown
• Pig projection
• Compression support for BSON
• PySpark support
• MongoSplitter improvements
PySpark
• Python shell
• Submit jobs written in Python
• Problem: how do we provide a natural Python syntax for accessing the connector inside the JVM?
• What we want:
  – Support for PyMongo’s objects
  – A natural API for working with MongoDB inside Spark’s Python shell
PySpark
We need to understand:
• How do the JVM and Python work together in Spark?
• What does the data look like between these processes?
• How does the MongoDB Hadoop Connector fit into this?
We need to take a look inside PySpark.
What’s Inside PySpark?
• Uses py4j to connect to the JVM running Spark
• Communicates objects to/from the JVM using Python’s pickle protocol
• org.apache.spark.api.python.Converter converts Writables to Java objects and vice versa
• A special PythonRDD type encapsulates the JVM gateway and the Converters, Picklers, and Constructors necessary for unpickling
What’s Inside PySpark? – JVM Gateway
[Slide showed code screenshots of the Python and Java sides of the gateway, not recoverable from the transcript]
What’s Inside PySpark? – PythonRDD
• Python: keeps a reference to the SparkContext and the JVM gateway
• Java: simply wraps a JavaRDD and performs some conversions
What’s Inside PySpark? – Pickler/Unpickler: what is a pickle, anyway?
• A pickle is a Python object serialized into a byte stream, which can be saved to a file
• The pickle protocol defines a set of opcodes that operate as in a stack machine
• Pickling turns a Python object into a stream of opcodes
• Unpickling executes those opcodes, producing a Python object
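The opcode stream is easy to inspect from Python itself. A minimal sketch (the `FakeObjectId` class is a hypothetical stand-in for a driver type like ObjectId, not part of PyMongo):

```python
import pickle
import pickletools

# Hypothetical stand-in for a driver type such as ObjectId, used only to show
# how a custom object travels through the pickle protocol.
class FakeObjectId(object):
    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __eq__(self, other):
        return isinstance(other, FakeObjectId) and self.hex_id == other.hex_id

doc = {'_id': FakeObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}

# Pickling turns the dict into a stream of stack-machine opcodes...
data = pickle.dumps(doc, protocol=2)
pickletools.dis(pickletools.optimize(data))  # prints MARK, DICT, ... STOP

# ...and unpickling executes those opcodes, rebuilding an equivalent object.
assert pickle.loads(data) == doc
```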
Example (pickle version 2)
>>> pickletools.dis(pickletools.optimize(pickle.dumps(doc)))
    0: (    MARK
    1: d    DICT       (MARK at 0)
    2: S    STRING     '_id'
    9: c    GLOBAL     'copy_reg _reconstructor'
   34: (    MARK
   35: c    GLOBAL     'bson.objectid ObjectId'
   59: c    GLOBAL     '__builtin__ object'
   79: N    NONE
   80: t    TUPLE      (MARK at 34)
   81: R    REDUCE
   82: S    STRING     'VK\xc7ln2\xab`\x8fS\x14\xea'
  113: b    BUILD
  114: s    SETITEM
  115: S    STRING     'hello'
  124: S    STRING     'world'
  133: s    SETITEM
  134: .    STOP
{'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}
What’s Inside PySpark? – Pickle, implemented by the Pyrolite library
Pyrolite – Python Remote Objects "light" and Pickle for Java/.NET
https://github.com/irmen/Pyrolite
• The Pyrolite library allows Spark to use Python’s pickle protocol to serialize/deserialize Python objects across the gateway.
• Hooks are available for handling custom types in each direction:
  – registerCustomPickler – define how to turn a Java object into a Python pickle byte stream
  – registerConstructor – define how to construct a Java object for a given Python type
What’s Inside PySpark? – BSONPickler: translates Java -> PyMongo
PyMongo – the MongoDB Python driver
https://github.com/mongodb/mongo-python-driver
Special handling for:
• Binary
• BSONTimestamp
• Code
• DBRef
• ObjectId
• Regex
• Min/MaxKey
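Python’s standard library exposes the same hook shape: `copyreg` (spelled `copy_reg` in Python 2) registers how a custom type reduces to a constructor plus arguments, which is the same idea as Pyrolite’s registerCustomPickler/registerConstructor pair. A sketch with a hypothetical `Timestamp` class standing in for a BSON type:

```python
import copyreg
import pickle

# Hypothetical stand-in for a BSON type like bson.timestamp.Timestamp.
class Timestamp(object):
    def __init__(self, time, inc):
        self.time, self.inc = time, inc

    def __eq__(self, other):
        return (self.time, self.inc) == (other.time, other.inc)

# "Custom pickler": reduce a Timestamp to (constructor, args), so the byte
# stream carries only the recipe for rebuilding it on the other side.
def pickle_timestamp(ts):
    return Timestamp, (ts.time, ts.inc)

copyreg.pickle(Timestamp, pickle_timestamp)

ts = Timestamp(1421872408, 1)
assert pickle.loads(pickle.dumps(ts)) == ts
```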
“PySpark” – Before Picture

>>> config = {'mongo.input.uri': 'mongodb://host/db.input',
...           'mongo.output.uri': 'mongodb://host/db.output'}
>>> rdd = sc.newAPIHadoopRDD(
...     'com.mongodb.hadoop.MongoInputFormat',
...     'org.apache.hadoop.io.TextWritable',
...     'org.apache.hadoop.io.MapWritable',
...     None, None, config)
>>> rdd.first()
({u'timeSecond': 1421872408, u'timestamp': 1421872408,
  u'__class__': u'org.bson.types.ObjectId', u'machine': 374500293,
  u'time': 1421872408000,
  u'date': datetime.datetime(2015, 1, 21, 12, 33, 28),
  u'new': False, u'inc': -1652246148},
 {u'Hello': u'World'})
>>> # do some processing with the RDD
>>> processed_rdd = ...
>>> processed_rdd.saveAsNewAPIHadoopFile(
...     'file:///unused',
...     'com.mongodb.hadoop.MongoOutputFormat',
...     None, None, None, None, config)
PySpark – After Picture
>>> import pymongo_spark
>>> pymongo_spark.activate()
>>> rdd = sc.mongoRDD('mongodb://host/db.input')
>>> rdd.first()
{u'_id': ObjectId('562e64ea6e32ab169586f9cc'), u'Hello': u'World'}
>>> processed_rdd = ...
>>> processed_rdd.saveToMongoDB('mongodb://host/db.output')
MongoSplitter
• Splitting – cutting up data to distribute among worker nodes
• Hadoop InputSplits / Spark Partitions
• Getting splitting right is very important for optimum performance
• mongo-hadoop includes improvements in splitting
MongoSplitter – Splitting Algorithms
• Split per shard chunk
• Split per shard
• Split using the splitVector command
[Architecture diagram: connector, mongos, config servers, shard 0, shard 1]
MongoSplitter – Split per Shard Chunk

shards:
  { "_id" : "shard01", "host" : "shard01/llp:27018,llp:27019,llp:27020" }
  { "_id" : "shard02", "host" : "shard02/llp:27021,llp:27022,llp:27023" }
  { "_id" : "shard03", "host" : "shard03/llp:27024,llp:27025,llp:27026" }
databases:
  { "_id" : "customer", "partitioned" : true, "primary" : "shard01" }
  customer.emails
    shard key: { "headers.From" : 1 }
    chunks:
      shard01  21
      shard02  21
      shard03  20
    { "headers.From" : { "$minKey" : 1 } } -->> { "headers.From" : "[email protected]" } on : shard01 Timestamp(42, 1)
    { "headers.From" : "[email protected]" } -->> { "headers.From" : "[email protected]" } on : shard02 Timestamp(42, 1)
    { "headers.From" : "[email protected]" } -->> { "headers.From" : { "$maxKey" : 1 } } on : shard01 Timestamp(41, 1)
MongoSplitter – Split Using the splitVector Command
Index: _id_1

{ "splitVector" : "db.collection", "keyPattern" : { "_id" : 1 }, "maxChunkSize" : 42 }

Split points: _id: 0, _id: 25, _id: 50, _id: 75, _id: 100
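To make the splitVector step concrete, here is a rough Python sketch (names and shapes are illustrative assumptions, not the connector's actual code) of turning the returned split points into per-split bounds, where `None` stands for $minKey/$maxKey:

```python
# Turn the split points returned by splitVector into per-split query bounds,
# roughly as the connector does when it builds InputSplits.

def ranges_from_split_points(split_points, min_key=None, max_key=None):
    """Yield (lower, upper) bounds; None means unbounded ($minKey/$maxKey)."""
    bounds = [min_key] + list(split_points) + [max_key]
    return list(zip(bounds[:-1], bounds[1:]))

# splitVector on index {_id: 1} returned split points at 25, 50 and 75:
splits = ranges_from_split_points([25, 50, 75])
# -> [(None, 25), (25, 50), (50, 75), (75, None)]
```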
MongoSplitter
Problem: empty/unbalanced splits

Query:
{ "createdOn": { "$lte": ISODate("2015-10-26T23:51:05.787Z") } }

• The query can use an index on "createdOn"
• splitVector can’t split on a subset of the index
• Some splits might be empty
MongoSplitter
Problem: empty/unbalanced splits

Query:
{ "createdOn": { "$lte": ISODate("2015-10-26T23:51:05.787Z") } }

Solutions:
• Create a new collection with a subset of the data
• Create an index over the relevant documents only
• Learn to live with empty splits
MongoSplitter – Alternatives

Filtering out empty splits:
mongo.input.split.filter_empty=true

• Create a cursor for each split and check whether it is empty
• Empty splits are thrown out of the final list
• Saves the resources of a task processing an empty split
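The filtering step above can be sketched in a few lines of Python. This is a pure-Python stand-in: `count_one` is a hypothetical callable representing "run the split's query with limit(1) against MongoDB", not a real connector API.

```python
# Drop splits that match no documents before handing them to tasks, in the
# spirit of mongo.input.split.filter_empty=true.

def filter_empty_splits(splits, count_one):
    kept = []
    for split in splits:
        if count_one(split) > 0:   # the probe cursor found at least one doc
            kept.append(split)
        # otherwise the split is thrown out, so no task wastes time on it
    return kept

splits = [('a', 'm'), ('m', 'z')]
counts = {('a', 'm'): 1, ('m', 'z'): 0}   # pretend probe results per split
assert filter_empty_splits(splits, counts.get) == [('a', 'm')]
```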
MongoSplitter
Problem: empty/unbalanced splits

Query:
{ "published": true }

• No index on "published" means splits are more likely to be unbalanced
• The query selects documents throughout the index used for the split pattern
MongoSplitter
Solution: PaginatingMongoSplitter

mongo.splitter.class=com.mongodb.hadoop.splitter.MongoPaginatingSplitter

• One-time collection scan, but splits have efficient queries
• No empty splits
• Splits of equal size (except for the last)
MongoSplitter
• Choose the right splitting algorithm
• Achieve more efficient splitting with an input query
Future Work – Data Locality
• Processing happens where the data lives
• Hadoop:
  – The namenode (NN) knows the locations of blocks
  – An InputFormat can specify split locations
  – The jobtracker collaborates with the NN to schedule tasks to take advantage of data locality
• Spark:
  – RDD.getPreferredLocations
Future Work – Data Locality
https://jira.mongodb.org/browse/HADOOP-202
Idea:
• Run a data node/executor on the same machine as a shard
• The connector assigns work based on local chunks
Future Work – Data Locality
• Set up Spark executors or Hadoop data nodes on the machines where shards are running
• Mark each InputSplit or Partition with the shard host that contains it
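The two steps above can be sketched with a small locality lookup. Everything here is an illustrative assumption (this feature was future work at the time of the talk, tracked in HADOOP-202), showing how a getPreferredLocations-style hook could steer tasks to the shard that holds a split's chunk:

```python
# Map each split to the host of the shard holding its chunk, so the
# scheduler can prefer an executor on that machine.

def preferred_locations(split, chunk_to_shard_host):
    """Return hosts where this split's data lives (empty = no preference)."""
    host = chunk_to_shard_host.get(split['chunk'])
    return [host] if host else []

chunk_to_shard_host = {'chunk-0': 'shard0.example.com',
                       'chunk-1': 'shard1.example.com'}
split = {'chunk': 'chunk-1', 'bounds': (25, 50)}
assert preferred_locations(split, chunk_to_shard_host) == ['shard1.example.com']
```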
Wrapping Up
• Investigating Python in Spark
• Understanding the splitting algorithms
• Data locality with MongoDB
Thank You!
Questions?
GitHub: https://github.com/mongodb/mongo-hadoop
Issue Tracker: https://jira.mongodb.org/browse/HADOOP
#MDBDays – mongodb.com