Sarang Shravagi Python Developer,ScaleArc @_sarangs

MongoDB Basics

DESCRIPTION

MongoDB workshop given by me at MIT, Pune. This PDF includes an example of how to design a MongoDB schema according to application usage.

Page 1: MongoDB Basics

Sarang Shravagi, Python Developer, ScaleArc

@_sarangs

Page 2: MongoDB Basics

Let’s Know Each Other

• Why are you attending?

• Do you code?

• OS?

• Programming Language?

• JSON?

• MongoDB?

Page 3: MongoDB Basics

Agenda

• SQL and NoSQL Database

• What is MongoDB?

• Hands-On and Assignment

• Design Models

• MongoDB Language Driver

• Disaster Recovery

• Handling BigData

Page 4: MongoDB Basics

Data Patterns & Storage Needs

• Product Information

• User Information

• Purchase Information

• Product Reviews

• Site Interactions

• Social Graph

• Search Index

Page 5: MongoDB Basics

SQL to NoSQL

Design Paradigm Shift

Page 6: MongoDB Basics

Database Evolution

Page 7: MongoDB Basics

SQL Storage

• Was designed when

– Storage and data transfer were costly

– Processing was slow

– Applications were oriented more towards data collection

• Initial adopters were financial institutions

Page 8: MongoDB Basics

SQL Storage

• Structured – schema

• Relational – foreign keys, constraints

• Transactional – Atomicity, Consistency, Isolation, Durability

• High Availability through robustness – Minimize failures

• Optimized for Writes

• Typically Scale Up

Page 9: MongoDB Basics

NoSQL Storage

• Is designed when

– Storage is cheap

– Data transfer is fast

– Much more processing power is available

– Clustering of machines is also possible

– Applications are oriented towards consumption of user-generated content

– Better on-screen user experience is in demand

Page 10: MongoDB Basics

NoSQL Storage

• Semi-structured – Schemaless

• Consistency, Availability, Partition Tolerance

• High Availability through clustering – expect failures

• Optimized for Reads

• Typically Scale Out

Page 11: MongoDB Basics

Different Databases

Half Level Deep

Page 12: MongoDB Basics

SQL: RDBMS

• MySQL, PostgreSQL, Oracle etc.

• Stores data in tables having columns

– Basic (number, text) data types

• Strong query language

• Transparent values

– Query language can read and filter on them

– Relationship between tables based on values

• Suited for user info and transactions

Page 13: MongoDB Basics

NoSQL Data Model

Page 14: MongoDB Basics

NoSQL: Key/Value

Page 15: MongoDB Basics

NoSQL: Document

• MongoDB, CouchDB etc.

• Object-oriented data models

– Stores data in document objects having fields

– Basic and compound (list, dict) data types

• SQL-like queries

• Transparent values

– Can be part of query

• Suited for product info and its reviews

Page 16: MongoDB Basics

NoSQL: Document

Page 17: MongoDB Basics

NoSQL: Column Family

• Cassandra, Bigtable etc.

• Stores data in columns

• Transparent values – Can be part of query

• SQL like queries

• Suited for search

Page 18: MongoDB Basics

NoSQL: Graph

• Neo4j

• Stores data in form of nodes and relationships

• Query is in form of traversal

• In-memory

• Suited for social graph

Page 19: MongoDB Basics

NoSQL: Graph

Page 20: MongoDB Basics

What is MongoDB?

Page 21: MongoDB Basics

MongoDB is a ___________ database

1. Document

2. Open source

3. High performance

4. Horizontally scalable

5. Full featured

Page 22: MongoDB Basics

1. Document Database

• Not for .PDF & .DOC files

• A document is essentially an associative array

• Document = JSON object

• Document = PHP Array

• Document = Python Dict

• Document = Ruby Hash

• etc
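To make the "document = associative array" point concrete, here is a minimal Python sketch. The field names and values are made up for illustration. It only shows that a nested dict round-trips through JSON text; MongoDB itself stores BSON, which is similar but not identical.

```python
import json

# A hypothetical document: a plain dict with a list and a nested dict.
doc = {
    "title": "Welcome to MongoDB",
    "tags": ["intro", "nosql"],
    "author": {"name": "sarangs", "city": "Pune"},
}

# Serialise to JSON text and parse it back; the structure survives intact.
as_json = json.dumps(doc, sort_keys=True)
restored = json.loads(as_json)

print(as_json)
print(restored["author"]["city"])
```

The same dict could be handed to a driver unchanged, which is why no separate mapping layer is needed for simple cases.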

Page 23: MongoDB Basics

Database Landscape

Page 24: MongoDB Basics

2. Open Source

• MongoDB is an open source project

• On GitHub

• Licensed under the AGPL

• Started & sponsored by MongoDB Inc (formerly known as 10gen)

• Commercial licenses available

• Contributions welcome

Page 25: MongoDB Basics

7,000,000+ MongoDB Downloads

150,000+ Online Education Registrants

35,000+ MongoDB Management Service (MMS) Users

30,000+ MongoDB User Group Members

20,000+ MongoDB Days Attendees

Global Community

Page 26: MongoDB Basics

3. High Performance

• Written in C++

• Extensive use of memory-mapped files, i.e. read-through write-through memory caching

• Runs nearly everywhere

• Data serialized as BSON (fast parsing)

• Full support for primary & secondary indexes

• Document model = less work

Page 27: MongoDB Basics

Performance

• Better Data Locality

• In-Memory Caching

• In-Place Updates

Page 28: MongoDB Basics

4. Scalability

Auto-Sharding

• Increase capacity as you go

• Commodity and cloud architectures

• Improved operational simplicity and cost visibility

Page 29: MongoDB Basics

High Availability

• Automated replication and failover

• Multi-data center support

• Improved operational simplicity (e.g., HW swaps)

• Data durability and consistency

Page 30: MongoDB Basics

Scalability: MongoDB Architecture

Page 31: MongoDB Basics

5. Full Featured

• Ad Hoc queries

• Real time aggregation

• Rich query capabilities

• Strongly consistent

• Geospatial features

• Support for most programming languages

• Flexible schema

Page 32: MongoDB Basics

MongoDB is Fully Featured

Page 33: MongoDB Basics

MongoDB Architecture

Page 34: MongoDB Basics

Terminology

Page 35: MongoDB Basics

Do More With Your Data

Rich Queries

• Find Paul’s cars

• Find everybody in London with a car built between 1970 and 1980

Geospatial

• Find all of the car owners within 5km of Trafalgar Sq.

Text Search

• Find all the cars described as having leather seats

Aggregation

• Calculate the average value of Paul’s car collection

Map Reduce

• What is the ownership pattern of colors by geography over time? (Is purple trending up in China?)

MongoDB

{
    first_name: 'Paul',
    surname: 'Miller',
    city: 'London',
    location: [45.123,47.232],
    cars: [
        { model: 'Bentley',
          year: 1973,
          value: 100000, … },
        { model: 'Rolls Royce',
          year: 1965,
          value: 330000, … }
    ]
}
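A rough sketch of how two of these questions could look from Python. The filter dicts use the standard `$elemMatch`, `$gte` and `$lte` query operators and are what one would hand to a driver's `find()`; the helper functions are hypothetical pure-Python equivalents, included only so the example runs without a server.

```python
# Sample document from the slide (elided fields left out).
paul = {
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "location": [45.123, 47.232],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# Filter documents as they could be passed to a driver's find() call.
pauls_cars_filter = {"first_name": "Paul"}
london_70s_filter = {
    "city": "London",
    "cars": {"$elemMatch": {"year": {"$gte": 1970, "$lte": 1980}}},
}

def matches_london_70s(doc):
    """Pure-Python equivalent of london_70s_filter, for illustration only."""
    return doc.get("city") == "London" and any(
        1970 <= car["year"] <= 1980 for car in doc.get("cars", [])
    )

def average_car_value(doc):
    """What the aggregation example computes for a single owner."""
    values = [car["value"] for car in doc["cars"]]
    return sum(values) / len(values)

print(matches_london_70s(paul))
print(average_car_value(paul))
```

`$elemMatch` matters here: it requires a single car to satisfy both year bounds, rather than one car matching the lower bound and another the upper.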

Page 36: MongoDB Basics

Hands-On & Assignment

Page 37: MongoDB Basics

mongodb.org/downloads

Page 38: MongoDB Basics

$ tar -zxvf mongodb-osx-x86_64-2.6.0.tgz

$ cd mongodb-osx-x86_64-2.6.0/bin

$ mkdir -p /data/db

$ ./mongod

Running MongoDB

Page 39: MongoDB Basics

MongoDB: Core Binaries

• mongod – Database server

• mongo – Database client shell

• mongos – Router for Sharding

Page 40: MongoDB Basics

Getting Help

• For mongo shell – mongo --help

• Shows options available for running the shell

• Inside mongo shell – db.help()

• Shows commands available on the object

Page 41: MongoDB Basics

Database Operations

• Database creation

• Creating/changing collection

• Data insertion

• Data read

• Data update

• Creating indices

• Data deletion

• Dropping collection

Page 42: MongoDB Basics

MacBook-Pro-:~ $ mongo

MongoDB shell version: 2.6.0

connecting to: test

> db.cms.insert({text: 'Welcome to MongoDB'})

> db.cms.find().pretty()

{

"_id" : ObjectId("51c34130fbd5d7261b4cdb55"),

"text" : "Welcome to MongoDB"

}

Mongo Shell

Page 43: MongoDB Basics

Diagnostic Tools

• mongostat

• mongoperf

• mongosniff

• mongotop

Page 44: MongoDB Basics

Import Export Tools

• For objects

– mongodump

– mongorestore

– bsondump

– mongooplog

• For data items

– mongoimport

– mongoexport

Page 45: MongoDB Basics

Assignment

• Tasks – assignments.txt

• Data – students.json

Page 46: MongoDB Basics

Questions?

Page 47: MongoDB Basics

Sarang Shravagi

@_sarangs

Thank You

Page 48: MongoDB Basics

Design Models

Page 49: MongoDB Basics

First step in any application is

Determine your entities

Page 50: MongoDB Basics

Entities in our Blogging System

• Users (post authors)

• Article

• Comments

• Tags, Category

• Interactions (views, clicks)

Page 51: MongoDB Basics

In a relational database app

We would start by doing schema design

Page 52: MongoDB Basics

Typical (relational) ERD

Page 53: MongoDB Basics

In a MongoDB based app

We start building our app and let the schema evolve

Page 54: MongoDB Basics

MongoDB ERD

Page 55: MongoDB Basics

Seek = 5+ ms Read = really really fast

Post

Author Comment

Disk seeks and data locality

Page 56: MongoDB Basics

Post

Author

Comment Comment Comment Comment Comment

Disk seeks and data locality

Page 57: MongoDB Basics

MongoDB Language Driver

Page 58: MongoDB Basics

Real applications are not

built in the shell

Page 59: MongoDB Basics

MongoDB has native

bindings for over 12

languages

Page 60: MongoDB Basics

Drivers & Ecosystem

Drivers

Support for the most popular

languages and frameworks

Frameworks

Morphia MEAN Stack

Java

Python

Perl

Ruby

Page 61: MongoDB Basics

Working With MongoDB

Page 62: MongoDB Basics

# Python dictionary (or object)
# insert() is the PyMongo 2.x call used in this deck; PyMongo 3+ uses insert_one()

>>> from datetime import datetime

>>> article = { 'title' : 'Schema design in MongoDB',
                'author' : 'sarangs',
                'section' : 'schema',
                'slug' : 'schema-design-in-mongodb',
                'text' : ("Data in MongoDB has a flexible schema. "
                          "So, 2 documents needn't have same structure. "
                          "It allows implicit schema to evolve."),
                'date' : datetime.utcnow(),
                'tags' : ['MongoDB', 'schema'] }

>>> db['articles'].insert(article)

Design schema.. In application code

Page 63: MongoDB Basics

# open the image in binary mode so the raw bytes are stored unchanged
>>> img_data = Binary(open('article_img.jpg', 'rb').read())

>>> article = { 'title' : 'Schema evolution in MongoDB',
                'author' : 'mattbates',
                'section' : 'schema',
                'slug' : 'schema-evolution-in-mongodb',
                'text' : ('MongoDB has dynamic schema. For good '
                          'performance, you would need an implicit '
                          'structure and indexes'),
                'date' : datetime.utcnow(),
                'tags' : ['MongoDB', 'schema', 'migration'],
                'headline_img' : {
                    'img' : img_data,
                    'caption' : 'A sample document at the shell'
                }}

>>> db['articles'].insert(article)

Let’s add a headline image

Page 64: MongoDB Basics

>>> article = { 'title' : 'Favourite web application framework',
                'author' : 'sarangs',
                'section' : 'web-dev',
                'slug' : 'web-app-frameworks',
                'gallery' : [
                    { 'img_url' : 'http://x.com/45rty', 'caption' : 'Flask', ..},
                    ..
                ],
                'date' : datetime.utcnow(),
                'tags' : ['Python', 'web'],
              }

>>> db['articles'].insert(article)

And different types of article

Page 65: MongoDB Basics

>>> user = {
        'user' : 'sarangs',
        'email' : '[email protected]',
        'password' : 'sarang',
        'joined' : datetime.utcnow(),
        'location' : { 'city' : 'Mumbai' },
    }

>>> db['users'].insert(user)

Users and profiles

Page 66: MongoDB Basics

Modelling comments (1)

• Two collections – articles and comments

• Use a reference (i.e. foreign key) to link together

• But.. N+1 queries to retrieve article and comments

{
    '_id' : ObjectId(..),
    'title' : 'Schema design in MongoDB',
    'author' : 'mattbates',
    'date' : ISODate(..),
    'tags' : ['MongoDB', 'schema'],
    'section' : 'schema',
    'slug' : 'schema-design-in-mongodb',
    'comments' : [ ObjectId(..), …]
}

{
    '_id' : ObjectId(..),
    'article_id' : ObjectId(..),
    'text' : 'A great article, helped me understand schema design',
    'date' : ISODate(..),
    'author' : 'johnsmith'
}

Page 67: MongoDB Basics

Modelling comments (2)

• Single articles collection – embed comments in article documents

• Pros

– Single query, document designed for the access pattern

– Locality (disk, shard)

• Cons

– Comments array is unbounded; documents will grow in size (remember 16MB document limit)

{
    '_id' : ObjectId(..),
    'title' : 'Schema design in MongoDB',
    'author' : 'mattbates',
    'date' : ISODate(..),
    'tags' : ['MongoDB', 'schema'],
    'comments' : [
        {
            'text' : 'A great article, helped me understand schema design',
            'date' : ISODate(..),
            'author' : 'johnsmith'
        },
    ]
}

Page 68: MongoDB Basics

Modelling comments (3)

• Another option: hybrid of (1) and (2); embed the top x comments (e.g. by date, popularity) into the article document

• Fixed-size (2.4 feature) comments array

• All other comments ‘overflow’ into a comments collection (double write) in buckets

• Pros

– Document size is more fixed – fewer moves

– Single query built

– Full comment history with rich query/aggregation

Page 69: MongoDB Basics

Modelling comments (3)

{
    '_id' : ObjectId(..),
    'title' : 'Schema design in MongoDB',
    'author' : 'mattbates',
    'date' : ISODate(..),
    'tags' : ['MongoDB', 'schema'],
    'comments_count': 45,
    'comments_pages' : 1,
    'comments' : [
        {
            'text' : 'A great article, helped me understand schema design',
            'date' : ISODate(..),
            'author' : 'johnsmith'
        },
    ]
}

Total number of comments

• Integer counter updated by update operation as comments are added/removed

Number of pages

• Page is a bucket of 100 comments (see next slide..)

Fixed-size comments array

• 10 most recent

• Sorted by date on insertion

Page 70: MongoDB Basics

Modelling comments (3)

{
    '_id' : ObjectId(..),
    'article_id' : ObjectId(..),
    'page' : 1,
    'count' : 42,
    'comments' : [
        {
            'text' : 'A great article, helped me understand schema design',
            'date' : ISODate(..),
            'author' : 'johnsmith'
        },
    ]
}

One comment bucket (page) document containing up to about 100 comments

Array of up to 100 comment sub-documents
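The double write behind this hybrid model can be sketched in plain Python, with dicts standing in for the articles and comments collections. The helper name `add_comment` and the 1-based page numbering are assumptions for illustration; the bucket size of 100 and the 10 embedded comments follow the slides.

```python
BUCKET_SIZE = 100   # comments per bucket (page), as on the slide
RECENT_LIMIT = 10   # comments embedded in the article itself

def add_comment(article, buckets, comment):
    """Double write: append to the right bucket and refresh the embedded top 10."""
    page = article["comments_count"] // BUCKET_SIZE + 1   # pages start at 1
    bucket = buckets.setdefault(page, {"page": page, "count": 0, "comments": []})
    bucket["comments"].append(comment)
    bucket["count"] += 1

    article["comments_count"] += 1
    article["comments_pages"] = max(article["comments_pages"], page)
    # Keep only the most recent comments inside the article document.
    article["comments"] = (article["comments"] + [comment])[-RECENT_LIMIT:]

article = {"comments_count": 0, "comments_pages": 0, "comments": []}
buckets = {}
for i in range(250):
    add_comment(article, buckets, {"text": "comment %d" % i})

print(article["comments_pages"], buckets[3]["count"], len(article["comments"]))
```

After 250 comments the article still holds only 10 embedded comments, while the full history sits in three bucket documents, so the article document stays roughly fixed in size.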

Page 71: MongoDB Basics

Modelling interactions

• Interactions

– Article views

– Comments

– (Social media sharing)

• Requirements

– Time series

– Pre-aggregated in preparation for analytics

Page 72: MongoDB Basics

Modelling interactions

• Document per article per day – ‘bucketing’

• Daily counter and hourly sub-document counters for interactions

• Bounded array (24 hours)

• Single query to retrieve daily article interactions; ready-made for graphing and further aggregation

{
    '_id' : ObjectId(..),
    'article_id' : ObjectId(..),
    'section' : 'schema',
    'date' : ISODate(..),
    'daily': { 'views' : 45, 'comments' : 150 },
    'hourly' : {
        0 : { 'views' : 10 },
        1 : { 'views' : 2 },
        23 : { 'comments' : 14, 'views' : 10 }
    }
}
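A minimal in-memory sketch of the per-article, per-day bucket. The function name `record_interaction` is hypothetical, and a dict keyed by (article, day) stands in for the interactions collection. The counter sub-document is called `hourly` here, matching the `add_interaction` code shown later in the deck.

```python
from datetime import datetime

def record_interaction(buckets, article_id, kind, ts):
    """Increment daily and hourly counters in the per-article, per-day bucket."""
    day = datetime(ts.year, ts.month, ts.day)
    key = (article_id, day)
    doc = buckets.setdefault(
        key, {"article_id": article_id, "date": day, "daily": {}, "hourly": {}}
    )
    doc["daily"][kind] = doc["daily"].get(kind, 0) + 1
    hour = doc["hourly"].setdefault(str(ts.hour), {})
    hour[kind] = hour.get(kind, 0) + 1
    return doc

buckets = {}
record_interaction(buckets, "a1", "views", datetime(2014, 4, 10, 9, 15))
record_interaction(buckets, "a1", "views", datetime(2014, 4, 10, 9, 40))
doc = record_interaction(buckets, "a1", "comments", datetime(2014, 4, 10, 23, 5))
print(doc["daily"], doc["hourly"])
```

All three interactions land in one document, so a single read returns a day's worth of pre-aggregated counters.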

Page 73: MongoDB Basics

JSON and RESTful API

Real applications are not built at a shell – let’s build a RESTful API.

[Diagram: client-side JSON app (e.g. AngularJS) – HTTP(S) REST – Python web app – Pymongo driver (BSON)]

Examples to follow: Python RESTful API using Flask microframework

Page 74: MongoDB Basics

myCMS REST endpoints

Method  URI                                  Action

GET     /articles                            Retrieve all articles

GET     /articles-by-tag/[tag]               Retrieve all articles by tag

GET     /articles/[article_id]               Retrieve a specific article by article_id

POST    /articles                            Add a new article

GET     /articles/[article_id]/comments      Retrieve all article comments by article_id

POST    /articles/[article_id]/comments      Add a new comment to an article

POST    /users                               Register a user

GET     /users/[username]                    Retrieve user’s profile

PUT     /users/[username]                    Update a user’s profile

Page 75: MongoDB Basics

$ git clone http://www.github.com/mattbates/mycms_mongodb

$ cd mycms_mongodb

$ virtualenv venv

$ source venv/bin/activate

$ pip install -r requirements.txt

$ mkdir -p data/db

$ mongod --dbpath=data/db --fork --logpath=mongod.log

$ python web.py

[$ deactivate]

Getting started with the skeleton code

Page 76: MongoDB Basics

# app (Flask) and db (PyMongo database) come from the skeleton code;
# needs: import json; from flask import abort, jsonify; from bson import json_util
@app.route('/cms/api/v1.0/articles', methods=['GET'])
def get_articles():
    """Retrieves all articles in the collection
    sorted by date
    """
    # query all articles and return a cursor sorted by date
    cur = db['articles'].find().sort('date')
    if not cur:
        abort(400)
    # iterate the cursor and add docs to a list
    articles = [article for article in cur]
    return jsonify({'articles' : json.dumps(articles, default=json_util.default)})

RESTful API methods in Python + Flask

Page 77: MongoDB Basics

@app.route('/cms/api/v1.0/articles/<string:article_id>/comments', methods = ['POST'])
def add_comment(article_id):
    """Adds a comment to the specified article and a
    bucket, as well as updating a view counter
    """
    # `article` and `comment` are not defined on this slide; the full
    # function is assumed to load them before this point
    page_id = article['last_comment_id'] // 100
    # push the comment to the latest bucket and $inc the count
    page = db['comments'].find_and_modify(
        { 'article_id' : ObjectId(article_id),
          'page' : page_id},
        { '$inc' : { 'count' : 1 },
          '$push' : {
              'comments' : comment } },
        fields= {'count' : 1},
        upsert=True,
        new=True)

RESTful API methods in Python + Flask

Page 78: MongoDB Basics

    # $inc the page count if bucket size (100) is exceeded
    if page['count'] > 100:
        db.articles.update(
            { '_id' : ObjectId(article_id),
              'comments_pages': article['comments_pages'] },
            { '$inc': { 'comments_pages': 1 } } )
    # let's also add to the article itself
    # most recent 10 comments only
    res = db['articles'].update(
        {'_id' : ObjectId(article_id)},
        {'$push' : {'comments' : { '$each' : [comment],
                                   '$sort' : {'date' : 1 },
                                   '$slice' : -10}},
         '$inc' : {'comments_count' : 1}})

RESTful API methods in Python + Flask
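What the `$push` with `$each`, `$sort` and `$slice` modifiers does to the embedded array can be mimicked on a plain list. This is only an illustration of the semantics; the function name `push_sorted_slice` is made up, and integers stand in for comment dates.

```python
def push_sorted_slice(comments, new_comment, limit=10):
    """Mimic $push with $each, $sort {'date': 1} and $slice -limit on a list."""
    updated = sorted(comments + [new_comment], key=lambda c: c["date"])
    return updated[-limit:]   # negative slice keeps the newest `limit` entries

recent = []
for day in range(1, 13):          # twelve comments dated 1..12
    recent = push_sorted_slice(recent, {"date": day, "text": "comment %d" % day})

print(len(recent), recent[0]["date"], recent[-1]["date"])
```

Sorting ascending and slicing from the end is what keeps the most recent comments; a positive `$slice` would keep the oldest ones instead.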

Page 79: MongoDB Basics

def add_interaction(article_id, type):
    """Record the interaction (view/comment) for the
    specified article into the daily bucket and
    update an hourly counter
    """
    ts = datetime.datetime.utcnow()
    # $inc daily and hourly view counters in day/article stats bucket
    # note the unacknowledged w=0 write concern for performance
    db['interactions'].update(
        { 'article_id' : ObjectId(article_id),
          'date' : datetime.datetime(ts.year, ts.month, ts.day)},
        { '$inc' : {
            'daily.{}'.format(type) : 1,
            'hourly.{}.{}'.format(ts.hour, type) : 1
        }},
        upsert=True,
        w=0)

RESTful API methods in Python + Flask

Page 80: MongoDB Basics

$ curl -i http://localhost:5000/cms/api/v1.0/articles

HTTP/1.0 200 OK

Content-Type: application/json

Content-Length: 335

Server: Werkzeug/0.9.4 Python/2.7.5

Date: Thu, 10 Apr 2014 16:00:51 GMT

{

"articles": "[{\"title\": \"Schema design in MongoDB\", \"text\": \"Data in MongoDB

has a flexible schema..\", \"section\": \"schema\", \"author\": \"sarangs\", \"date\":

{\"$date\": 1397145312505}, \"_id\": {\"$oid\": \"5346bef5f2610c064a36a793\"},

\"slug\": \"schema-design-in-mongodb\", \"tags\": [\"MongoDB\", \"schema\"]}]"}

Testing the API – retrieve articles

Page 81: MongoDB Basics

$ curl -H "Content-Type: application/json" -X POST -d '{"text":"An interesting article and a great read."}' http://localhost:5000/cms/api/v1.0/articles/52ed73a30bd031362b3c6bb3/comments

{
  "comment": "{\"date\": {\"$date\": 1391639269724}, \"text\": \"An interesting article and a great read.\"}"
}

Testing the API – comment on an article

Page 82: MongoDB Basics

Disaster Recovery

Introduction to Replica Sets and

High Availability

Page 83: MongoDB Basics

Disasters

• Physical Failure

– Hardware

– Network

• Solution: Replica Sets

– Provide redundant storage for High Availability

– Real-time data synchronization

– Automatic failover for minimal downtime

Page 84: MongoDB Basics

Replication

Page 85: MongoDB Basics

Multi Replication

• Data can be replicated to multiple places simultaneously

• Use an odd number of voting members in a replica set, so that an election can always produce a majority

Page 86: MongoDB Basics

Single Replication

• If you want only one secondary (or any odd number of secondaries), the set has an even number of members, so you need to set up an arbiter to break ties

Page 87: MongoDB Basics

Failover

• When primary fails, remaining machines vote for electing new primary
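Why odd member counts and arbiters matter comes down to majority arithmetic. The sketch below is a simplification: it only counts votes and ignores priorities, timeouts and the rest of the election protocol. The function names are made up for illustration.

```python
def majority(members):
    """Votes needed to elect a primary in a set with `members` voting members."""
    return members // 2 + 1

def can_elect(members, reachable):
    """True if the reachable members form a strict majority of the set."""
    return reachable >= majority(members)

# Primary + one secondary: losing either node leaves 1 of 2 votes, no majority.
print(can_elect(2, 1))
# Adding an arbiter makes three voters: 2 of 3 is a majority.
print(can_elect(3, 2))
# Four members tolerate only one failure, the same as three.
print(can_elect(4, 2))
```

An even-sized set pays for an extra machine without tolerating an extra failure, which is the reasoning behind adding an arbiter instead.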

Page 88: MongoDB Basics

Handling Big Data

Introduction to Map/Reduce

and Sharding

Page 89: MongoDB Basics

Large Data Sets

• Problem 1 – Performance

• Queries go slow

• Solution – Map/Reduce

Page 90: MongoDB Basics

Aggregation

Page 91: MongoDB Basics

Map Reduce

• A way to divide large query computation into smaller chunks

• May run in multiple processes across multiple machines

• Think of it as GROUP BY of SQL

Page 92: MongoDB Basics

Map/Reduce Example

• Map function scans the data and emits the required values

Page 93: MongoDB Basics

Map/Reduce Example

• Reduce function uses the output of Map function and generates aggregated value
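The two phases can be sketched in plain Python, counting articles per tag much like a SQL GROUP BY. This runs in a single process and only illustrates the idea; MongoDB's own map/reduce takes JavaScript functions, and the names `map_article`, `reduce_counts` and `map_reduce` are made up for this example.

```python
from collections import defaultdict

articles = [
    {"title": "Schema design in MongoDB", "tags": ["MongoDB", "schema"]},
    {"title": "Schema evolution in MongoDB", "tags": ["MongoDB", "schema", "migration"]},
    {"title": "Favourite web application framework", "tags": ["Python", "web"]},
]

def map_article(doc):
    """Map: emit one (key, value) pair per tag."""
    for tag in doc["tags"]:
        yield tag, 1

def reduce_counts(key, values):
    """Reduce: fold all values emitted for one key into a single result."""
    return sum(values)

def map_reduce(docs, mapper, reducer):
    """Group the mapper's output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

tag_counts = map_reduce(articles, map_article, reduce_counts)
print(tag_counts)
```

Because each key is reduced independently, the grouping step is what lets the work be split across processes or machines.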

Page 94: MongoDB Basics

Large Data Sets

• Problem 2 – Vertical Scaling of Hardware

• Can’t increase machine size beyond a limit

• Solution – Sharding

Page 95: MongoDB Basics

Sharding

• A method for storing data across multiple machines

• Data is partitioned using Shard Keys

Page 96: MongoDB Basics

Data Partitioning: Range Based

• Each chunk holds a contiguous range of Shard Key values

Page 97: MongoDB Basics

Data Partitioning: Hash Based

• A hash function on Shard Keys decides the chunk
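The difference between the two partitioning schemes can be sketched as two lookup functions. The chunk boundaries, the chunk count and the use of MD5 via `hashlib` are illustrative assumptions, not a description of MongoDB internals.

```python
import bisect
import hashlib

# Range-based: each chunk covers shard-key values below its upper bound.
# The bounds are illustrative, not MongoDB defaults.
UPPER_BOUNDS = [100, 200, 300]            # chunk 0: <100, chunk 1: <200, ...

def range_chunk(key):
    """Index of the chunk whose range contains `key` (last chunk is open-ended)."""
    return bisect.bisect_right(UPPER_BOUNDS, key)

def hash_chunk(key, chunks=4):
    """Chunk chosen from a hash of the key; MD5 is a stand-in hash function."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % chunks

print([range_chunk(k) for k in (5, 99, 100, 250, 999)])
print([hash_chunk(k) for k in (1, 2, 3, 4)])
```

Range-based placement keeps neighbouring keys together, which helps range queries but can concentrate inserts of increasing keys on one chunk. Hashing spreads neighbouring keys apart at the cost of efficient range scans.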

Page 98: MongoDB Basics

Sharded Cluster

Page 99: MongoDB Basics

Optimizing Shards: Splitting

• In a shard, when size of a chunk increases, the chunk is divided into two

Page 100: MongoDB Basics

Optimizing Shards: Balancing

• When the number of chunks in a shard grows too large, a few chunks are migrated to another shard
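A toy version of balancing, assuming a simple rule: move one chunk at a time from the fullest shard to the emptiest until their chunk counts differ by less than a threshold. The real balancer uses its own thresholds and migration protocol; the names and numbers here are made up.

```python
def balance(shards, threshold=2):
    """Move chunks from the fullest to the emptiest shard until counts are close.

    `threshold` should be at least 2, otherwise a single chunk could be
    passed back and forth forever.
    """
    moves = 0
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) < threshold:
            return moves
        shards[emptiest].append(shards[fullest].pop())
        moves += 1

# Hypothetical cluster: chunk ids are arbitrary integers.
shards = {"shard0": list(range(8)), "shard1": [], "shard2": [100]}
moves = balance(shards)
print(moves, sorted(len(chunks) for chunks in shards.values()))
```

Each move here stands for a chunk migration, which copies data between machines, so keeping the number of moves small is part of the design.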

Page 101: MongoDB Basics

Schema iteration

New feature in the backlog?

Documents have dynamic schema so we just iterate the object schema.

>>> user = { 'username' : 'matt',
             'first' : 'Matt',
             'last' : 'Bates',
             'preferences' : { 'opt_out' : True } }

>>> db['users'].save(user)

Page 102: MongoDB Basics

docs.mongodb.org

Page 103: MongoDB Basics

Online Training at MongoDB University

Page 104: MongoDB Basics

For More Information

Resource               Location

MongoDB Downloads      mongodb.com/download

Free Online Training   education.mongodb.com

Webinars and Events    mongodb.com/events

White Papers           mongodb.com/white-papers

Case Studies           mongodb.com/customers

Presentations          mongodb.com/presentations

Documentation          docs.mongodb.org

Additional Info        [email protected]

Page 105: MongoDB Basics

We've introduced a lot of

concepts here

Page 106: MongoDB Basics

Schema Design @

Page 107: MongoDB Basics

Replication @

Page 108: MongoDB Basics

Indexing @

Page 109: MongoDB Basics

Sharding @

Page 110: MongoDB Basics

Questions?

Page 111: MongoDB Basics

Sarang Shravagi

@_sarangs

Thank You