MongoDB Hadoop Connector
Luke LovettMaintainer, mongo-hadoop
https://github.com/mongodb/mongo-hadoop
Overview
• Hadoop Overview
• Why MongoDB and Hadoop
• Connector Overview
• Technical look into new features
• What’s on the horizon?
• Wrap-up
Hadoop Overview
• Distributed data processing
• Fulfills analytical requirements
• Jobs are infrequent, batch processes
• Churn Analysis • Recommendation • Warehouse/ETL • Risk Modeling
• Trade Surveillance • Predictive Analysis • Ad Targeting • Sentiment Analysis
MongoDB + Hadoop
• MongoDB backs the application
• Satisfies queries in real time
• MongoDB + Hadoop = application data analytics
Connector Overview
• Brings operational data into the analytical lifecycle
• Supports an evolving Hadoop ecosystem
  – Apache Spark has made a huge entrance
• Makes MongoDB interaction seamless and natural
Connector Examples
MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat
Pig
-- Load a MongoDB collection as a Pig relation
data = LOAD 'mongodb://myhost/db.collection'
       USING com.mongodb.hadoop.pig.MongoLoader();
Connector Examples
MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat
Hive
CREATE EXTERNAL TABLE mongo (
  title STRING,
  address STRUCT<from:STRING, to:STRING>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler';
Connector Examples
MongoInputFormat / MongoOutputFormat / BSONFileInputFormat / BSONFileOutputFormat
Spark (Python)
import pymongo_spark
pymongo_spark.activate()
rdd = sc.mongoRDD('mongodb://host/db.coll')
New Features
• Hive predicate pushdown
• Pig projection
• Compression support for BSON
• PySpark support
• MongoSplitter improvements
PySpark
• Python shell
• Submit jobs written in Python
• Problem: how do we provide a natural Python syntax for accessing the connector inside the JVM?
• What we want:
  – Support for PyMongo’s objects
  – A natural API for working with MongoDB inside Spark’s Python shell
PySpark
We need to understand:
• How do the JVM and Python work together in Spark?
• What does the data look like between these processes?
• How does the MongoDB Hadoop Connector fit into this?
We need to take a look inside PySpark.
What’s Inside PySpark?
• Uses py4j to connect to the JVM running Spark
• Communicates objects to/from the JVM using Python’s pickle protocol
• org.apache.spark.api.python.Converter converts Writables to Java objects and vice versa
• A special PythonRDD type encapsulates the JVM gateway and the Converters, Picklers, and Constructors necessary for unpickling
What’s Inside PySpark? – JVM Gateway
[Slide showed code screenshots of the Python and Java sides of the gateway, not recoverable from the transcript]
What’s Inside PySpark? – PythonRDD
• Python: keeps a reference to the SparkContext and the JVM gateway
• Java: simply wraps a JavaRDD and performs some conversions
What’s Inside PySpark? – Pickler/Unpickler: what is a pickle, anyway?
• A pickle is a Python object serialized into a byte stream, which can be saved to a file
• The pickle protocol defines a set of opcodes that operate as in a stack machine
• Pickling turns a Python object into a stream of opcodes
• Unpickling executes those opcodes, producing a Python object
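The opcode stream is easy to inspect from Python itself. A minimal sketch (the `FakeObjectId` class is a hypothetical stand-in for a driver type like ObjectId, not part of PyMongo):

```python
import pickle
import pickletools

# Hypothetical stand-in for a driver type such as ObjectId, used only to show
# how a custom object travels through the pickle protocol.
class FakeObjectId(object):
    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __eq__(self, other):
        return isinstance(other, FakeObjectId) and self.hex_id == other.hex_id

doc = {'_id': FakeObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}

# Pickling turns the dict into a stream of stack-machine opcodes...
data = pickle.dumps(doc, protocol=2)
pickletools.dis(pickletools.optimize(data))  # prints MARK, DICT, ... STOP

# ...and unpickling executes those opcodes, rebuilding an equivalent object.
assert pickle.loads(data) == doc
```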
Example (pickle version 2)
>>> pickletools.dis(pickletools.optimize(pickle.dumps(doc)))
    0: (    MARK
    1: d    DICT       (MARK at 0)
    2: S    STRING     '_id'
    9: c    GLOBAL     'copy_reg _reconstructor'
   34: (    MARK
   35: c    GLOBAL     'bson.objectid ObjectId'
   59: c    GLOBAL     '__builtin__ object'
   79: N    NONE
   80: t    TUPLE      (MARK at 34)
   81: R    REDUCE
   82: S    STRING     'VK\xc7ln2\xab`\x8fS\x14\xea'
  113: b    BUILD
  114: s    SETITEM
  115: S    STRING     'hello'
  124: S    STRING     'world'
  133: s    SETITEM
  134: .    STOP
{'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}
What’s Inside PySpark? – Pickle, implemented by the Pyrolite library
Pyrolite – Python Remote Objects "light" and Pickle for Java/.NET
https://github.com/irmen/Pyrolite
• The Pyrolite library allows Spark to use Python’s pickle protocol to serialize/deserialize Python objects across the gateway.
• Hooks are available for handling custom types in each direction:
  – registerCustomPickler – define how to turn a Java object into a Python pickle byte stream
  – registerConstructor – define how to construct a Java object for a given Python type
What’s Inside PySpark? – BSONPickler: translates Java -> PyMongo
PyMongo – the MongoDB Python driver
https://github.com/mongodb/mongo-python-driver
Special handling for:
• Binary
• BSONTimestamp
• Code
• DBRef
• ObjectId
• Regex
• Min/MaxKey
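Python’s standard library exposes the same hook shape: `copyreg` (spelled `copy_reg` in Python 2) registers how a custom type reduces to a constructor plus arguments, which is the same idea as Pyrolite’s registerCustomPickler/registerConstructor pair. A sketch with a hypothetical `Timestamp` class standing in for a BSON type:

```python
import copyreg
import pickle

# Hypothetical stand-in for a BSON type like bson.timestamp.Timestamp.
class Timestamp(object):
    def __init__(self, time, inc):
        self.time, self.inc = time, inc

    def __eq__(self, other):
        return (self.time, self.inc) == (other.time, other.inc)

# "Custom pickler": reduce a Timestamp to (constructor, args), so the byte
# stream carries only the recipe for rebuilding it on the other side.
def pickle_timestamp(ts):
    return Timestamp, (ts.time, ts.inc)

copyreg.pickle(Timestamp, pickle_timestamp)

ts = Timestamp(1421872408, 1)
assert pickle.loads(pickle.dumps(ts)) == ts
```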
“PySpark” – Before Picture

>>> config = {'mongo.input.uri': 'mongodb://host/db.input',
...           'mongo.output.uri': 'mongodb://host/db.output'}
>>> rdd = sc.newAPIHadoopRDD(
...     'com.mongodb.hadoop.MongoInputFormat',
...     'org.apache.hadoop.io.TextWritable',
...     'org.apache.hadoop.io.MapWritable',
...     None, None, config)
>>> rdd.first()
({u'timeSecond': 1421872408, u'timestamp': 1421872408,
  u'__class__': u'org.bson.types.ObjectId', u'machine': 374500293,
  u'time': 1421872408000,
  u'date': datetime.datetime(2015, 1, 21, 12, 33, 28),
  u'new': False, u'inc': -1652246148},
 {u'Hello': u'World'})
>>> # do some processing with the RDD
>>> processed_rdd = ...
>>> processed_rdd.saveAsNewAPIHadoopFile(
...     'file:///unused',
...     'com.mongodb.hadoop.MongoOutputFormat',
...     None, None, None, None, config)
PySpark – After Picture
>>> import pymongo_spark
>>> pymongo_spark.activate()
>>> rdd = sc.mongoRDD('mongodb://host/db.input')
>>> rdd.first()
{u'_id': ObjectId('562e64ea6e32ab169586f9cc'), u'Hello': u'World'}
>>> processed_rdd = ...
>>> processed_rdd.saveToMongoDB('mongodb://host/db.output')
MongoSplitter
• Splitting – cutting up data to distribute among worker nodes
• Hadoop InputSplits / Spark Partitions
• Getting splitting right is very important for optimum performance
• mongo-hadoop includes improvements in splitting
MongoSplitter – Splitting Algorithms
• Split per shard chunk
• Split per shard
• Split using the splitVector command
[Architecture diagram: connector, mongos, config servers, shard 0, shard 1]
MongoSplitter – Split per Shard Chunk

shards:
  { "_id" : "shard01", "host" : "shard01/llp:27018,llp:27019,llp:27020" }
  { "_id" : "shard02", "host" : "shard02/llp:27021,llp:27022,llp:27023" }
  { "_id" : "shard03", "host" : "shard03/llp:27024,llp:27025,llp:27026" }
databases:
  { "_id" : "customer", "partitioned" : true, "primary" : "shard01" }
  customer.emails
    shard key: { "headers.From" : 1 }
    chunks:
      shard01  21
      shard02  21
      shard03  20
    { "headers.From" : { "$minKey" : 1 } } -->> { "headers.From" : "[email protected]" } on : shard01 Timestamp(42, 1)
    { "headers.From" : "[email protected]" } -->> { "headers.From" : "[email protected]" } on : shard02 Timestamp(42, 1)
    { "headers.From" : "[email protected]" } -->> { "headers.From" : { "$maxKey" : 1 } } on : shard01 Timestamp(41, 1)
MongoSplitter – Split Using the splitVector Command
Index: _id_1

{ "splitVector" : "db.collection", "keyPattern" : { "_id" : 1 }, "maxChunkSize" : 42 }

Split points: _id: 0, _id: 25, _id: 50, _id: 75, _id: 100
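To make the splitVector step concrete, here is a rough Python sketch (names and shapes are illustrative assumptions, not the connector's actual code) of turning the returned split points into per-split bounds, where `None` stands for $minKey/$maxKey:

```python
# Turn the split points returned by splitVector into per-split query bounds,
# roughly as the connector does when it builds InputSplits.

def ranges_from_split_points(split_points, min_key=None, max_key=None):
    """Yield (lower, upper) bounds; None means unbounded ($minKey/$maxKey)."""
    bounds = [min_key] + list(split_points) + [max_key]
    return list(zip(bounds[:-1], bounds[1:]))

# splitVector on index {_id: 1} returned split points at 25, 50 and 75:
splits = ranges_from_split_points([25, 50, 75])
# -> [(None, 25), (25, 50), (50, 75), (75, None)]
```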
MongoSplitter
Problem: empty/unbalanced splits

Query:
{ "createdOn": { "$lte": ISODate("2015-10-26T23:51:05.787Z") } }

• The query can use an index on "createdOn"
• splitVector can’t split on a subset of the index
• Some splits might be empty
MongoSplitter
Problem: empty/unbalanced splits

Query:
{ "createdOn": { "$lte": ISODate("2015-10-26T23:51:05.787Z") } }

Solutions:
• Create a new collection with a subset of the data
• Create an index over the relevant documents only
• Learn to live with empty splits
MongoSplitter – Alternatives

Filtering out empty splits:
mongo.input.split.filter_empty=true

• Create a cursor for each split and check whether it is empty
• Empty splits are thrown out of the final list
• Saves the resources of a task processing an empty split
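The filtering step above can be sketched in a few lines of Python. This is a pure-Python stand-in: `count_one` is a hypothetical callable representing "run the split's query with limit(1) against MongoDB", not a real connector API.

```python
# Drop splits that match no documents before handing them to tasks, in the
# spirit of mongo.input.split.filter_empty=true.

def filter_empty_splits(splits, count_one):
    kept = []
    for split in splits:
        if count_one(split) > 0:   # the probe cursor found at least one doc
            kept.append(split)
        # otherwise the split is thrown out, so no task wastes time on it
    return kept

splits = [('a', 'm'), ('m', 'z')]
counts = {('a', 'm'): 1, ('m', 'z'): 0}   # pretend probe results per split
assert filter_empty_splits(splits, counts.get) == [('a', 'm')]
```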
MongoSplitter
Problem: empty/unbalanced splits

Query:
{ "published": true }

• No index on "published" means splits are more likely to be unbalanced
• The query selects documents throughout the index used for the split pattern
MongoSplitter
Solution: PaginatingMongoSplitter

mongo.splitter.class=com.mongodb.hadoop.splitter.MongoPaginatingSplitter

• One-time collection scan, but splits have efficient queries
• No empty splits
• Splits of equal size (except for the last)
MongoSplitter
• Choose the right splitting algorithm
• Achieve more efficient splitting with an input query
Future Work – Data Locality
• Processing happens where the data lives
• Hadoop:
  – The namenode (NN) knows the locations of blocks
  – An InputFormat can specify split locations
  – The jobtracker collaborates with the NN to schedule tasks to take advantage of data locality
• Spark:
  – RDD.getPreferredLocations
Future Work – Data Locality
https://jira.mongodb.org/browse/HADOOP-202
Idea:
• Run a data node/executor on the same machine as a shard
• The connector assigns work based on local chunks
Future Work – Data Locality
• Set up Spark executors or Hadoop data nodes on the machines where shards are running
• Mark each InputSplit or Partition with the shard host that contains it
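The two steps above can be sketched with a small locality lookup. Everything here is an illustrative assumption (this feature was future work at the time of the talk, tracked in HADOOP-202), showing how a getPreferredLocations-style hook could steer tasks to the shard that holds a split's chunk:

```python
# Map each split to the host of the shard holding its chunk, so the
# scheduler can prefer an executor on that machine.

def preferred_locations(split, chunk_to_shard_host):
    """Return hosts where this split's data lives (empty = no preference)."""
    host = chunk_to_shard_host.get(split['chunk'])
    return [host] if host else []

chunk_to_shard_host = {'chunk-0': 'shard0.example.com',
                       'chunk-1': 'shard1.example.com'}
split = {'chunk': 'chunk-1', 'bounds': (25, 50)}
assert preferred_locations(split, chunk_to_shard_host) == ['shard1.example.com']
```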
Wrapping Up
• Investigating Python in Spark
• Understanding the splitting algorithms
• Data locality with MongoDB
Thank You!
Questions?
GitHub: https://github.com/mongodb/mongo-hadoop
Issue Tracker: https://jira.mongodb.org/browse/HADOOP
#MDBDays – mongodb.com