Shift: Real World Migration from MongoDB to Cassandra

SHIFT.com: Migrating from MongoDB to Cassandra, by Blake Eggleston & Jon Haddad

What is SHIFT.com?

Shift is a platform that enables marketers to communicate across organizations and departments in one single place.

It’s also an open application platform with a set of applications built on top of it that can communicate with one another.

Initial Stack

●  Python
    ○  Flask
    ○  Celery
●  MongoDB
    ○  mongoengine
●  Neo4j / Titan
    ○  Bulbs
    ○  thunderdome
●  Redis
●  AWS
    ○  m1.xlarge for mongo

Current Stack

●  Python
    ○  still flask
    ○  still celery
    ○  gevent (it rocks)
●  Cassandra
    ○  1.2.6
    ○  cqlengine
●  ElasticSearch
●  Redis
    ○  jondis
●  AWS
    ○  m1.xlarge

Why did we move to Cassandra?

●  Operational Benefits
    ○  Adding and removing nodes is much easier than with Mongo’s shards
●  Control over our Data on Disk (LSMT)
●  Love CQL3
●  Long term scalability
    ○  Scales Linearly
    ○  Multi DC Support Baked in

Migration Goals

●  Zero downtime
    ○  We wanted to roll out Cassandra without any service interruptions
●  No loss of performance
    ○  By carefully structuring our schema we were able to match MongoDB’s performance.

Migration Strategy

Benefits of CQL3

●  Easy to understand if you’re coming from RDBMS
●  Collections
    ○  sets, lists, maps
●  Batch Queries
●  Clustering Keys
    ○  Handles ordering of logical rows
    ○  Saved us from a column name management scheme and allowed us to focus on our data

Physical vs Logical Row

Single Row

Clustered Row
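
The physical vs logical row distinction can be sketched in plain Python. This is a toy in-memory model, not Cassandra's storage engine: the `Table` class and its method names are hypothetical. The partition key selects one physical row; inside it, each clustering key value identifies one logical row, and logical rows are kept sorted by clustering key.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model: partition key -> sorted list of (clustering key, columns)."""

    def __init__(self):
        # Each partition (physical row) holds its logical rows in clustering order.
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, columns):
        rows = self._partitions[partition_key]
        keys = [key for key, _ in rows]
        pos = bisect.bisect_left(keys, clustering_key)
        if pos < len(rows) and rows[pos][0] == clustering_key:
            # Same full primary key: overwrite the existing logical row.
            rows[pos] = (clustering_key, columns)
        else:
            rows.insert(pos, (clustering_key, columns))

    def logical_rows(self, partition_key):
        """Return every logical row in one physical row, in clustering order."""
        return list(self._partitions.get(partition_key, []))


table = Table()
table.insert("user-1", 3, {"message": "third"})
table.insert("user-1", 1, {"message": "first"})
table.insert("user-1", 2, {"message": "second"})
table.insert("user-2", 1, {"message": "other user"})

print([key for key, _ in table.logical_rows("user-1")])
```

Reading "user-1" touches a single partition and returns its logical rows already ordered, which is the property the clustering key provides.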

Data Modelling Patterns

●  considerations: working with Mongo’s dbrefs and optimizing layout on disk

●  structured tables as materialized views of the queries we planned on using

●  moving multiple documents into a single physical row

●  creating supporting index tables for looking up logical rows
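
As a rough illustration of the last two patterns, the sketch below keeps many documents in one partition per owner and maintains a supporting index table to find a document by its id. The dicts stand in for tables, and every name here (`documents_by_owner`, `owner_by_document`, `save_document`) is hypothetical rather than taken from Shift's schema.

```python
import uuid

# Hypothetical "tables", modelled as dicts keyed by partition key.
documents_by_owner = {}  # owner_id -> {document_id: document}, one physical row per owner
owner_by_document = {}   # document_id -> owner_id, the supporting index table


def save_document(owner_id, document_id, body):
    """Write the document and its index entry together, as a batch would."""
    documents_by_owner.setdefault(owner_id, {})[document_id] = {"body": body}
    owner_by_document[document_id] = owner_id


def documents_for_owner(owner_id):
    """Query 1: all documents for an owner, served from a single partition."""
    return documents_by_owner.get(owner_id, {})


def document_by_id(document_id):
    """Query 2: one document by id, using the index table to find its partition."""
    owner_id = owner_by_document.get(document_id)
    if owner_id is None:
        return None
    return documents_by_owner[owner_id][document_id]


owner = uuid.uuid4()
doc_a, doc_b = uuid.uuid4(), uuid.uuid4()
save_document(owner, doc_a, "draft")
save_document(owner, doc_b, "final")

print(len(documents_for_owner(owner)), document_by_id(doc_b))
```

Each read path gets its own table shaped for that query, at the cost of writing the same fact in two places.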

Time Series: Message Stream

●  Users have tens of thousands of messages
●  Each user’s message stream is specific to them, like a Twitter feed
●  This is Cassandra’s strength: time series
●  Considered Redis, but it is a poor fit for multi-DC

create table news_feed (
    user_id uuid,
    message_id timeuuid,
    message text,  -- type missing on the slide; text is assumed here
    primary key (user_id, message_id)
);
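
The access pattern behind the news_feed table can be approximated in plain Python. This is a sketch under assumptions: the dict stands in for the table, `post` and `latest` are hypothetical helpers, and Python's `uuid.uuid1()` plays the role of a timeuuid. Cassandra orders timeuuid values by their embedded timestamp, whereas Python compares UUIDs by integer value, so the sketch sorts on the `time` attribute explicitly.

```python
import time
import uuid

# In-memory stand-in for news_feed: user_id -> {message_id: message}.
news_feed = {}


def post(user_id, message):
    """Store a message under a fresh version 1 (time-based) UUID."""
    message_id = uuid.uuid1()
    news_feed.setdefault(user_id, {})[message_id] = message
    return message_id


def latest(user_id, limit=10):
    """Return the newest messages first, ordered by the UUID's embedded timestamp."""
    feed = news_feed.get(user_id, {})
    newest_ids = sorted(feed, key=lambda mid: mid.time, reverse=True)[:limit]
    return [feed[mid] for mid in newest_ids]


user = uuid.uuid4()
for text in ("first", "second", "third"):
    post(user, text)
    time.sleep(0.001)  # keep the demo timestamps clearly distinct

print(latest(user, limit=2))
```

In Cassandra the sort happens at write time inside the partition, so a "latest N" read is a slice of one physical row rather than a sort at query time as done here.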

cqlengine

●  cqlengine.org
●  the Python CQL3 object-row mapper
●  exposes CQL3 tables as Python classes
●  maps columns to properties
●  builds CQL queries

from cqlengine import columns
from cqlengine.models import Model


# model definition
class ExampleModel(Model):
    example_id = columns.UUID(primary_key=True)
    example_type = columns.Integer(index=True)
    created_at = columns.DateTime()
    description = columns.Text(required=False)


# example query
ExampleModel.objects(example_type=1)
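
To make "maps columns to properties, builds CQL queries" concrete, here is a deliberately tiny mapper written from scratch. It is not cqlengine's implementation or API: `Column`, `ToyModel`, `create_table_cql` and `select_cql` are hypothetical names, and the sketch only produces CQL strings without talking to a cluster.

```python
class Column:
    """Minimal column descriptor: remembers its CQL type and key role."""

    def __init__(self, cql_type, primary_key=False):
        self.cql_type = cql_type
        self.primary_key = primary_key


class ToyModel:
    """Hypothetical base class that turns class attributes into CQL text."""

    @classmethod
    def _columns(cls):
        # Class attributes keep their definition order.
        return {name: value for name, value in vars(cls).items()
                if isinstance(value, Column)}

    @classmethod
    def create_table_cql(cls):
        cols = cls._columns()
        defs = ", ".join(f"{name} {col.cql_type}" for name, col in cols.items())
        keys = ", ".join(name for name, col in cols.items() if col.primary_key)
        return f"CREATE TABLE {cls.__name__.lower()} ({defs}, PRIMARY KEY ({keys}))"

    @classmethod
    def select_cql(cls, **filters):
        unknown = set(filters) - set(cls._columns())
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        where = " AND ".join(f"{name} = ?" for name in filters)
        # Values are returned separately so they can be bound as parameters.
        return f"SELECT * FROM {cls.__name__.lower()} WHERE {where}", list(filters.values())


class Example(ToyModel):
    example_id = Column("uuid", primary_key=True)
    example_type = Column("int")


statement, params = Example.select_cql(example_type=1)
print(Example.create_table_cql())
print(statement, params)
```

A real mapper adds validation, type conversion, and query execution on top of this idea.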

Improvements from moving to C*

●  Operationally we’ve had zero problems
●  Outstanding Performance
●  Easy to build new features
●  Community has been amazing (mailing list and #cassandra)

misc tips

●  leveled compaction - good for read heavy workloads

●  use secondary indexes sparingly, understand how they work and when to use them

●  to reiterate, think about how you’re going to query your data

●  use ElasticSearch / Solr for ad hoc queries

Contact Info

Jon Haddad @rustyrazorblade

Blake Eggleston @blakeeggleston

….we’re hiring!
