Finding the right NoSQL DB for the job
The path to a non-RDBMS solution at
Who we are
• A search engine
• A people search engine
• An influencer search engine
• Subscription-based
George Stathis
VP Engineering
14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.
What’s this talk about?
• Why we picked a NoSQL database
• How we picked a NoSQL database
• My NoSQL does not do the job! What now?!
• Nirvana = the right tool for the job
Why did we pick a NoSQL DB?
There are some misconceptions around NoSQL only being appropriate when one needs to achieve
“Web Scale”
I need web scale!
http://www.youtube.com/watch?v=b2F-DItXtZs
Traackr picked NoSQL; are we “Web Scale”?
• In terms of users/traffic?
Do we fit the “Web scale” profile?
Source: compete.com
Source: highscalability.com
• In terms of users/traffic?
• In terms of the amount of data?
Do we fit the “Web scale” profile?
PRIMARY> use traackr
switched to db traackr
PRIMARY> db.stats()
{
    "db" : "traackr",
    "collections" : 12,
    "objects" : 68226121,
    "avgObjSize" : 2972.0800625760330,
    "dataSize" : 202773493971,
    "storageSize" : 221491429671,
    "numExtents" : 199,
    "indexes" : 33,
    "indexSize" : 27472394891,
    "fileSize" : 266623699968,
    "nsSizeMB" : 16,
    "ok" : 1
}
That’s a quarter of a terabyte …
Wait! What? My Synology NAS at home can hold 2TB!
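As a quick sanity check on the numbers above, the sketch below recomputes the average object size and converts the byte counts to decimal gigabytes. The figures are copied from the db.stats() output; the helper name toGB is only illustrative and not part of any MongoDB API.

```javascript
// Figures copied from the db.stats() output above (all sizes in bytes).
const stats = {
  objects: 68226121,
  dataSize: 202773493971,
  storageSize: 221491429671,
  indexSize: 27472394891,
  fileSize: 266623699968,
};

// Hypothetical helper: bytes to decimal gigabytes (1 GB = 1e9 bytes).
function toGB(bytes) {
  return bytes / 1e9;
}

// avgObjSize is dataSize divided by the number of objects.
const avgObjSize = stats.dataSize / stats.objects;

console.log("avgObjSize (bytes):", avgObjSize.toFixed(2));
console.log("dataSize (GB):", toGB(stats.dataSize).toFixed(1));
console.log("fileSize (GB):", toGB(stats.fileSize).toFixed(1));
```

The on-disk file size comes out at a bit more than a quarter of a decimal terabyte, which is where the rounded figure on the slide comes from.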
No need for us to track the entire web
Web Content
Influencer Content
Not at scale :-)
Alternate view of “Web Scale”
Web data is:
Heterogeneous
Unstructured (text)
Source: http://www.opte.org/
Visualization of the Internet, Nov. 23rd 2003
Data sources are isolated islands of rich data with loose links to one another
How do we build a database that models all possible entities found on the web?
Modeling the web: the RDBMS way
Source: socialbutterflyclt.com
or
{
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    {
      "siteUrl": "http://twitter.com/dchancogne",
      "metrics": [
        { "value": 216, "name": "twitter_followers_count" },
        { "value": 2107, "name": "twitter_statuses_count" }
      ]
    },
    {
      "siteUrl": "http://traackr.com/blog/author/david",
      "metrics": [
        { "value": 21, "name": "google_inbound_links" }
      ]
    }
  ]
}
Influencer data as JSON
“In the old world of data analysis you knew exactly which questions you wanted to ask, which drove a very predictable collection and storage model. In the new world of data analysis your questions are going to evolve and change over time and as such you need to be able to collect, store and analyze data without being constrained by resources.”
- Werner Vogels, CTO/VP Amazon.com
NoSQL = schema flexibility
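To make the schema flexibility point concrete, here is a minimal sketch in plain JavaScript, with no database involved. It assumes two influencer records of different shapes living side by side; the second record and the helper name metricValues are made up for illustration.

```javascript
// Two influencer documents with different shapes, as a document store allows.
const influencers = [
  {
    realName: "David Chancogne",
    title: "CTO",
    siteReferences: [
      {
        siteUrl: "http://twitter.com/dchancogne",
        metrics: [
          { name: "twitter_followers_count", value: 216 },
          { name: "twitter_statuses_count", value: 2107 },
        ],
      },
      {
        siteUrl: "http://traackr.com/blog/author/david",
        metrics: [{ name: "google_inbound_links", value: 21 }],
      },
    ],
  },
  // Hypothetical second record: no title and no site references at all.
  { realName: "Example Person", location: "Somewhere" },
];

// Collect the values of one named metric, tolerating missing fields.
function metricValues(doc, metricName) {
  const values = [];
  for (const ref of doc.siteReferences || []) {
    for (const metric of ref.metrics || []) {
      if (metric.name === metricName) values.push(metric.value);
    }
  }
  return values;
}

console.log(metricValues(influencers[0], "twitter_followers_count"));
console.log(metricValues(influencers[1], "twitter_followers_count"));
```

New metric names or whole new site types can be added to individual records without a schema migration; the reading code only has to tolerate absent fields.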
• In terms of users/traffic?
• In terms of the amount of data?
• In terms of the variety of the data
Do we fit the “Web scale” profile?
✓
Traackr’s Datastore Requirements
• Schema flexibility
• Good at storing lots of variable length text
• Batch processing options
✓
Requirement: text storage
Variable text length:
<140 character tweets  <-- big variance -->  multi-page blog posts
Requirement: text storage
RDBMS’ answer to variable text length:
Plan ahead for largest value
CLOB/BLOB
Requirement: text storage
Issues with CLOB/BLOB for us:
No clue what largest value is
CLOB/BLOB for tweets = wasted space
Requirement: text storage
NoSQL solutions are great for text:
No length requirements (automated chunking)
Limited space overhead
Traackr’s Datastore Requirements
• Schema flexibility
• Good at storing lots of variable length text
• Batch processing options
✓
✓
Requirement: batch processing
Some NoSQL solutions come with MapReduce
Source: http://code.google.com/
Requirement: batch processing
MapReduce + RDBMS:
Possible, but only through proprietary solutions
Usually involves exporting data from the RDBMS into a NoSQL system anyway
Defeats the data locality benefit of MapReduce
Traackr’s Datastore Requirements
• Schema flexibility
• Good at storing lots of variable length text
• Batch processing options
✓
✓
A NoSQL option is the right fit
✓
How did we pick a NoSQL DB?
Bewildering number of options
Key/Value Databases
• Distributed hashtables
• Designed for high load
• In-memory or on-disk
• Eventually consistent
Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families
Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema
Graph Databases
• Graph Theory G=(E,V)
• Great for modeling networks
• Great for graph-based query algorithms
Trimming options
Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.
Memcache: memory-based; we need true persistence.
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
CouchDB: no ad-hoc queries; its maturity in early 2010 made us shy away, although we did try early prototypes.
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (these came later on).
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
Riak: very close but in early 2010, we had adoption questions.
HBase: came across as the most mature at the time, with several deployments, a healthy community, out-of-the-box secondary indexes through a contrib module, and support for batch processing using Hadoop/MR.
Climbing the learning curve
When Big-Data = Big Architectures
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
Must have a Hadoop HDFS cluster with at least 2x the replication factor in nodes
Must have an odd number of ZooKeeper quorum nodes
Then you can run your HBase nodes, but it’s recommended to co-locate region servers with Hadoop datanodes, so you have to manage resources.
Master/slave architecture means a single point of failure, so you need to protect your master.
And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
Source: socialbutterflyclt.com
Jokes aside, no one said open source was easy to use
To be expected
• Hadoop/HBase are designed to move mountains
• If you want to move big stuff, be prepared to sometimes use big equipment
What it means to a startup
Development capacity before
Development capacity after
Congrats, you are now a sysadmin…
Whatever, we can do it!
Source: http://knowyourmeme.com/memes/honey-badger
Mapping an A-List to a column store
A-List fields: name, ranks, references to influencer records
Row: unique key
“attributes” column family for general attributes
“influencerId” column family for influencer ranks and foreign keys
Qualifiers (basically attribute names): the “name” attribute; influencer ranks can be attribute names as well
Values: the A-List name value; influencer id values assigned to each rank (basically foreign keys to an influencer table)
The list of ranks can get pretty long, so it needs indexing and pagination
Problem: no out-of-the-box row-based indexing and pagination
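The row layout described above can be sketched as a nested object. This is only an in-memory analogy in plain JavaScript, not HBase client code; the row key, list name, influencer ids, and the helper pageOfRanks are hypothetical.

```javascript
// Hypothetical A-List row, shaped like the column-store layout described above.
const alistRow = {
  rowKey: "alist_0001",              // unique key (made-up value)
  attributes: { name: "My A-List" }, // "attributes" column family
  influencerId: {                    // "influencerId" column family
    "1": "influencer_id_1234",       // qualifier = rank, value = influencer id
    "2": "influencer_id_5678",
    "3": "influencer_id_9012",
  },
};

// Application-level pagination over the ranked qualifiers.
// pageNumber is 1-based.
function pageOfRanks(row, pageNumber, pageSize) {
  const ranks = Object.keys(row.influencerId)
    .map(Number)
    .sort((a, b) => a - b);
  const start = (pageNumber - 1) * pageSize;
  return ranks.slice(start, start + pageSize).map((rank) => ({
    rank,
    influencerId: row.influencerId[String(rank)],
  }));
}

console.log(pageOfRanks(alistRow, 1, 2));
console.log(pageOfRanks(alistRow, 2, 2));
```

The point of the sketch is that sorting and slicing happen in application code, which is the kind of work the slide says was not available out of the box.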
Whatever, it’s open-source!
Source: http://knowyourmeme.com/memes/honey-badger
Jumping right into the code
MapReduce for batch scoring
• Need to re-score our influencer database once a week
• M/R cranked through it in 15 mins
Source: http://www.charliesheentshirts.info/
a few months later…
Need to upgrade to Hbase 0.90
• Making sure to remain on recent code base
• Performance improvements
• Mostly to get the latest bug fixes
No thanks!
Looks like something is missing
Our DB indexes depend on this!
Let’s get this straight
• HBase no longer comes with secondary indexing out-of-the-box
• It’s been moved out of the trunk to GitHub
• Where only one other company besides us seems to care about it
Only one other maintainer besides us
What it means to a startup
Development capacity
Congrats, you are now an hbase maintainer…
Source: socialbutterflyclt.com
Whatever, we’ll roll our own indexing!
Source: http://knowyourmeme.com/memes/honey-badger
Homegrown HBase Indexes
Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW scan boundaries
Row ids for Posts
Find posts for influencer_id_1234
Find posts for influencer_id_5678
Homegrown Hbase Indexes
• No longer depending on unmaintained code
• Works with an out-of-the-box HBase installation
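The prefix-scan idea behind these homegrown indexes can be sketched without HBase. The example below assumes post row ids of the form "<influencer id>_<post id>" kept in sorted order, and derives a start row and stop row from a non-empty prefix. The id format and the function names are assumptions for illustration, not the actual Traackr key design.

```javascript
// Row ids sorted lexicographically, as a sorted key store keeps them.
// The "<influencer id>_<post id>" format is an assumption for illustration.
const postRowIds = [
  "influencer_id_1234_post_001",
  "influencer_id_1234_post_002",
  "influencer_id_5678_post_001",
  "influencer_id_5678_post_002",
  "influencer_id_9012_post_001",
].sort();

// Binary search: first index whose key is >= target.
function lowerBound(keys, target) {
  let lo = 0;
  let hi = keys.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (keys[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Scan the half-open range [startRow, stopRow).
// startRow is the prefix itself; stopRow is the prefix with its last
// character incremented, so every key sharing the prefix sorts before it.
// The prefix must be non-empty.
function scanPrefix(keys, prefix) {
  const lastCode = prefix.charCodeAt(prefix.length - 1);
  const startRow = prefix;
  const stopRow = prefix.slice(0, -1) + String.fromCharCode(lastCode + 1);
  return keys.slice(lowerBound(keys, startRow), lowerBound(keys, stopRow));
}

console.log(scanPrefix(postRowIds, "influencer_id_1234_"));
console.log(scanPrefix(postRowIds, "influencer_id_5678_"));
```

Because the keys are sorted, the scan touches only the contiguous run of rows for one influencer. The cost is that every new access pattern needs its own key layout, maintained by the application.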
What it means to a startup
Development capacity
You are back but you still need to maintain indexing logic
Source: http://www.charliesheentshirts.info/
Application layer indexes are slow and brittle. The DB should be doing this, not us.
Sort of…
a few months later…
Cracks in the data model
Sites:
huffingtonpost.com
huffingtonpost.com
Posts:
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html
Relationships (per author): writes for, authored by, published under
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
Exacerbated when we started tracking people’s content on a daily basis in mid-2011
Fixing the cracks in the data model
Site:
huffingtonpost.com
Posts:
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html
Relationships (per author): writes for, authored by, published under
Normalize the sites
Fixing the cracks in the data model
• Normalization requires stronger secondary indexing
• Our application layer indexing would need revisiting…again!
What it means to a startup
Development capacity
Psych! You are back to writing indexing code.
Source: socialbutterflyclt.com
Whatever, we’ll change our NoSQL!
Source: http://knowyourmeme.com/memes/honey-badger
Traackr’s Datastore Requirements (Revisited)
• Schema flexibility
• Good at storing lots of variable length text
• Batch processing options (maybe)
• Out-of-the-box SECONDARY INDEX support!
• Simple to use and administer
NoSQL picking – Round 2
Nope!
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
Memcache: still no
Amazon SimpleDB: still no.
Redis and LinkedIn’s Project Voldemort: still no
CouchDB: more mature but still no ad-hoc queries.
Cassandra: matured quite a bit and added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
Riak: strong contender still but adoption questions remained.
MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Immediate Benefits
• No more maintaining custom application-layer secondary indexing code
What it means to a startup
Development capacity
Yay! I’m back!
Immediate Benefits
• No more maintaining custom application-layer secondary indexing code
• Single binary installation greatly simplifies administration
What it means to a startup
Development capacity
Honestly, I thought I’d never see you guys again!
Immediate Benefits
• No more maintaining custom application-layer secondary indexing code
• Single binary installation greatly simplifies administration
• Our NoSQL could now support our domain model
many-to-many relationship
Modeling an influencer
Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)
{
  "_id": "770cf5c54492344ad5e45fb791ae5d52",
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 },
    { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 }
  ]
}
Modeling an influencer
siteId indexed for “find influencers connected to site X”
> db.influencers.ensureIndex({"siteReferences.siteId": 1});
> db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
Embedded list of influencer references augmented with “usernames” (useful for content attribution)
{
  "_id": "0001e86f73cc3975a29e6a98a41a4280",
  "url": "http://traackr.com/blog/",
  "metrics": [
    { "name": "google_inbound_links", "value": 5432 }
  ],
  "authors": [
    { "username": "dchancogne", "influencerId": "770cf5c54492344ad5e45fb791ae5d52" },
    { "username": "gstathis", "influencerId": "0001e86f73cc3975a29e6a98a41a4280" }
  ]
}
Modeling a site
Modeling a site
Indexed for “find sites associated to influencer X”
> db.sites.ensureIndex({"authors.influencerId": 1});
> db.sites.find({"authors.influencerId": "0001e86f73cc3975a29e6a98a41a4280"});
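To show what an index on an embedded array field buys for the many-to-many lookup, here is an in-memory analogy in plain JavaScript. It is not how MongoDB implements its indexes; it only mimics the lookup "find influencers connected to site X" using a Map. The names buildSiteIndex and findInfluencerIdsBySite are hypothetical.

```javascript
// Trimmed copy of the influencer document shown above.
const influencerDocs = [
  {
    _id: "770cf5c54492344ad5e45fb791ae5d52",
    realName: "David Chancogne",
    siteReferences: [
      { siteId: "b31236da306270dc2b5db34e943af88d", contribution: 0.25 },
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 1.0 },
    ],
  },
];

// Map each siteId to the _ids of the influencers that reference it.
// One document contributes one entry per element of its embedded array.
function buildSiteIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const ref of doc.siteReferences || []) {
      if (!index.has(ref.siteId)) index.set(ref.siteId, []);
      index.get(ref.siteId).push(doc._id);
    }
  }
  return index;
}

const siteIndex = buildSiteIndex(influencerDocs);

// Lookup without scanning every influencer document.
function findInfluencerIdsBySite(siteId) {
  return siteIndex.get(siteId) || [];
}

console.log(findInfluencerIdsBySite("602dc370945d3b3480fff4f2a541227c"));
```

The same shape works in the other direction for sites and their embedded authors, which is how both sides of the relationship stay queryable without application-maintained index tables.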
Other index uses
Support for alternate site URLs (a.k.a. URL aliases):
{
  "_id": "0001e86f73cc3975a29e6a98a41a4280",
  "url_hash_list": [
    { "url": "http://traackr.com/blog", "hash": "770cf5c54492344ad5e45fb791ae5d52" },
    { "url": "http://blog.traackr.com/", "hash": "0001e86f73cc3975a29e6a98a41a4280" }
  ]
}
Index on MD5 hash of URL
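A minimal sketch of the alias lookup, assuming Node.js and its built-in crypto module. The site id "site_1" and the function names are made up, and the hashes are computed here rather than copied from the slide, whose hash values may be placeholders.

```javascript
const crypto = require("crypto");

// Hex MD5 digest of a URL string.
function urlHash(url) {
  return crypto.createHash("md5").update(url, "utf8").digest("hex");
}

// Hypothetical site document with two URL aliases.
const site = {
  _id: "site_1",
  url_hash_list: ["http://traackr.com/blog", "http://blog.traackr.com/"].map(
    (url) => ({ url, hash: urlHash(url) })
  ),
};

// In-memory stand-in for an index on "url_hash_list.hash".
const hashIndex = new Map();
for (const entry of site.url_hash_list) {
  hashIndex.set(entry.hash, site._id);
}

// Resolve any known alias to the site id, or null if unknown.
function findSiteIdByUrl(url) {
  const siteId = hashIndex.get(urlHash(url));
  return siteId === undefined ? null : siteId;
}

console.log(findSiteIdByUrl("http://blog.traackr.com/"));
console.log(findSiteIdByUrl("http://example.com/"));
```

Indexing a fixed-length hash instead of the raw URL keeps index keys short and uniform no matter how long the URL is.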
Other Benefits
• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
Ad hoc report example
// File Name: retweetTotal.js
// Purpose: report the count of twitter URLs for which we have
// computed the number of total retweets
print( "NUMBER OF TWITTER URLS where retweetTotal IS SET:" );
print( db.sites.find( { platformName: "twitter.com", retweetTotal: { $exists: true } } ).count() );
• Easy to execute JS report script remotely:
> mongo <hostname>:<port>/traackr --quiet retweetTotal.js
• Run as a cron job, pipe the output to a file and email it out
• Also, more complex MR-based reports are easily accessible to someone with some JavaScript knowledge
Other Benefits (cont.)
• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
Same binary can be deployed several times for replication & backups
• Different Availability Zones for better SPOF tolerance
• Priority 0 for the backup server so that it never gets elected as primary
• Using xfs_freeze before taking backups
• EBS snapshots as backups are portable to new instances (e.g. QA)
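The replica set layout above might be described by a configuration document along these lines. The set name, host names, and the helper electableHosts are hypothetical; only the field names `_id`, `members`, `host` and `priority` follow MongoDB's replica set configuration format.

```javascript
// Hypothetical replica set configuration document.
// Host names and the set name are made up for illustration.
const replicaSetConfig = {
  _id: "traackr-rs",
  members: [
    { _id: 0, host: "db-a.example.com:27017" }, // availability zone A
    { _id: 1, host: "db-b.example.com:27017" }, // availability zone B
    // Backup node: priority 0 means it can never be elected primary.
    { _id: 2, host: "db-backup.example.com:27017", priority: 0 },
  ],
};

// Hosts that are eligible to become primary.
// A member without an explicit priority uses the default of 1.
function electableHosts(config) {
  return config.members
    .filter((m) => (m.priority === undefined ? 1 : m.priority) > 0)
    .map((m) => m.host);
}

console.log(electableHosts(replicaSetConfig));
```

In the mongo shell, a document of this shape is what `rs.initiate()` takes as its argument. Keeping the backup member at priority 0 means freezing its filesystem for a snapshot never interrupts a primary.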
Other Benefits (cont.)
• Great documentation
• Great adoption and community
Mongo cursors for batch scoring
• Mongo is fast enough for our data size to be able to serially score the DB faster than the MapReduce jobs did in parallel.
• When we grow larger, MapReduce is still available as an option
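Serial scoring over a cursor can be sketched as a single pass that handles one document at a time. The generator below stands in for a database cursor, and the scoring function, which just sums metric values, is a made-up placeholder; the talk does not describe the real scoring logic.

```javascript
// Stand-in for a database cursor: yields one document at a time.
function* cursorOver(docs) {
  for (const doc of docs) {
    yield doc;
  }
}

// Hypothetical scoring function: sums every metric value on the document.
function score(doc) {
  let total = 0;
  for (const ref of doc.siteReferences || []) {
    for (const metric of ref.metrics || []) {
      total += metric.value;
    }
  }
  return total;
}

// Serial pass: pull documents from the cursor and score them one by one.
function scoreAll(cursor) {
  const scores = new Map();
  for (const doc of cursor) {
    scores.set(doc._id, score(doc));
  }
  return scores;
}

const sampleDocs = [
  {
    _id: "a",
    siteReferences: [
      { metrics: [{ name: "twitter_followers_count", value: 216 },
                  { name: "twitter_statuses_count", value: 2107 }] },
      { metrics: [{ name: "google_inbound_links", value: 21 }] },
    ],
  },
  { _id: "b" }, // no site references: scores 0
];

const scores = scoreAll(cursorOver(sampleDocs));
console.log(scores.get("a"), scores.get("b"));
```

A loop like this has no job scheduling or cluster coordination overhead, which is one plausible reason a serial pass can beat a parallel MapReduce job at modest data sizes.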
Looks like we found the right fit!
We have more of this
Development capacity
And less of this
Source: socialbutterflyclt.com
Source: http://www.charliesheentshirts.info/
for now…
Additional takeaways
• Fearless refactoring
• Ease of use and administration cannot be overstated for a small startup
Q&A