Massively Scalable NoSQL with Apache Cassandra

Massively scalable NoSQL with Apache Cassandra!Jonathan Ellis Project Chair, Apache Cassandra CTO, DataStax@spyced

©2012 DataStax

Big data

Analytics(Hadoop)

Realtime(“NoSQL”)

?

©2012 DataStax

Some Casandra users

©2012 DataStax

eBay

Application/Use Case• Social Signals: like/want/own features for

eBay product and item pages• Hunch taste graph for eBay users and items• Many time series use cases

Why Cassandra? • Multi-datacenter• Scalable• Write performance• Distributed counters• Hadoop support

ACE

©2012 DataStax

Time series data

©2012 DataStax

Multi-datacenter support

©2012 DataStax

Distributed counters

©2012 DataStax

Hadoop support

©2012 DataStax

Disney

Application/Use Case• Meet the data management needs of user

facing applications across The Walt Disney Company with a single platform

Why Cassandra? • DataStax Enterprise can tackle real-time

and search functions in the same cluster• Scalability• 24x7 uptime

NDI

©2012 DataStax

Multitenancy

©2012 DataStax

Multitenancy

©2012 DataStax

Enterprise search

©2012 DataStax

SimpleReachApplication/Use Case• SimpleReach tracks social actions for

content creators, from Twitter and Facebook to Pinterest and Reddit, to deliver detailed insights and clear metrics around social behavior.

Why Cassandra? • Very high velocity data ingest rate and

large data volumes• Workload separation between realtime

and batch applications

NDE

©2012 DataStax

SourceNinjaApplication/Use Case• SourceNinja notifies you to performance,

security, and bug fixes for the software you depend on

Why Cassandra? • Previous database system could not

handle load; HBase has too many points of failure and was too slow

• Fast real time capabilities, batch analytics on that data, and enterprise search

RDE

©2012 DataStax

NetflixApplication/Use Case• General purpose backend for large scale

highly available cloud based web services supporting Netflix Streaming

Why Cassandra? • Highly available, highly robust and no

schema change downtime• Highly scalable, optimized for SSD• Much lower cost than previous Oracle and

SimpleDB implementations• Flexible data model• Ability to directly influence/implement

OSS feature set• Supports local and wide area distributed

operations, spanning US and Europe

RCE

©2012 DataStax

Optimized for SSD

©2012 DataStax

Open source

©2012 DataStax

• Massively scalable

• High performance

• Reliable/Available

Use case patterns

©2012 DataStax

©2012 DataStax

0

5000

10000

15000

20000

25000

30000

35000

Cassandra 0.6

Cassandra 1.0

reads/s writes/s

©2012 DataStax

©2012 DataStax

Classic partitioning with SPOFpartition 1 partition 2 partition 3 partition 4

router

client

©2012 DataStax

Availability• “High availability implies that a single fault will not bring

down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston: DataStax

• “The biggest problem with failover is that you're almost never using it until it really hurts. It's like backups that you never test.” -- Rick Branson: Instagram

©2012 DataStax

Fully distributed, no SPOFclient

p1

p1

p1p3

p6

©2012 DataStax

Partitioning

jim

carol

johnny

suzy

age: 36 car: camaro gender: M

age: 37 car: subaru gender: F

age:12 gender: M

age:10 gender: F

©2012 DataStax

Primary key determines placement*

Partitioning

jim

carol

johnny

suzy

age: 36 car: camaro gender: M

age: 37 car: subaru gender: F

age:12 gender: M

age:10 gender: F

©2012 DataStax

jim

carol

johnny

suzy

PK

5e02739678...

a9a0198010...

f4eb27cea7...

78b421309e...

MD5 Hash

MD5* hash operation yields a 128-bit number

for keysof any size.

©2012 DataStax

Node A

Node D Node C

Node B

The “token ring”

©2012 DataStax

jim 5e02739678...

carol a9a0198010...

johnny f4eb27cea7...

suzy 78b421309e...

Start EndA 0xc000000000..1 0x0000000000..0

B 0x0000000000..1 0x4000000000..0

C 0x4000000000..1 0x8000000000..0

D 0x8000000000..1 0xc000000000..0

©2012 DataStax

jim 5e02739678...

carol a9a0198010...


suzy 78b421309e...

Start EndA 0xc000000000..1 0x0000000000..0

B 0x0000000000..1 0x4000000000..0

C 0x4000000000..1 0x8000000000..0

D 0x8000000000..1 0xc000000000..0

©2012 DataStax

jim 5e02739678...

carol a9a0198010...


suzy 78b421309e...

Start EndA 0xc000000000..1 0x0000000000..0

B 0x0000000000..1 0x4000000000..0

C 0x4000000000..1 0x8000000000..0

D 0x8000000000..1 0xc000000000..0

©2012 DataStax

jim 5e02739678...

carol a9a0198010...


suzy 78b421309e...

Start EndA 0xc000000000..1 0x0000000000..0

B 0x0000000000..1 0x4000000000..0

C 0x4000000000..1 0x8000000000..0

D 0x8000000000..1 0xc000000000..0

©2012 DataStax

jim 5e02739678...

carol a9a0198010...


suzy 78b421309e...

Start EndA 0xc000000000..1 0x0000000000..0

B 0x0000000000..1 0x4000000000..0

C 0x4000000000..1 0x8000000000..0

D 0x8000000000..1 0xc000000000..0

©2012 DataStax

Node A

Node D Node C

Node B

carol a9a0198010...

Replication

©2012 DataStax

Node A

Node D Node C

Node B

carol a9a0198010...

©2012 DataStax

Node A

Node D Node C

Node B

carol a9a0198010...

©2012 DataStax

Highlights• Adding capacity is application-transparent and requires

no downtime

• No SPOF, not even temporarily• No “primary” replica

• Con!gurable synchronous/asynchronous

• Tolerates node failure; never have to restart replication “from scratch”

• “Smart” replication avoids correlated failures

©2012 DataStax

CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date int);

CREATE INDEX ON users(state);

SELECT * FROM users WHERE state=‘Texas’ AND birth_date > 1950;

CQL: You got SQL in my NoSQL!

©2012 DataStax

Strictly “realtime” focused• No joins

• No subqueries

• No aggregation functions* or GROUP BY

• ORDER BY?

©2012 DataStax

Clustered data in in CFS

©2012 DataStax

Clustered data in in CFS

©2012 DataStax

CREATE TABLE sblocks ( block_id uuid, subblock_id uuid, data blob, PRIMARY KEY (block_id, subblock_id));

Clustering in CQL3

block_id subblock_id data

Block1 subblock A data ABlock1 subblock B data B

... ... ...

Block2 subblock C data CBlock2 subblock D data D

... ... ...

Block3 subblock E data EBlock3 subblock F data F

... ... ...

©2012 DataStax


CREATE TABLE users_addresses ( user_id uuid REFERENCES users, email text);

SELECT *FROM users NATURAL JOIN users_addresses;

Collections

©2012 DataStax


CREATE TABLE users_addresses ( user_id uuid REFERENCES users, email text);

SELECT *FROM users NATURAL JOIN users_addresses;

Collections

X

©2012 DataStax

CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date int, email_addresses set<text>);

UPDATE usersSET email_addresses = email_addresses + {‘[email protected]’, ‘[email protected]’};

Collections

mailto:[email protected]




©2012 DataStax

Big data

Analytics(Hadoop)

Realtime(“NoSQL”)

?

©2012 DataStax

The evolution of Analytics

Analytics + Realtime

©2012 DataStax


Analytics Realtime

replication

©2012 DataStax


ETL

©2012 DataStax

Big data

Analytics(Hadoop)

Realtime(Cassandra)

DatastaxEnterprise

©2012 DataStax

©2012 DataStax

Better Hadoop than Hadoop• “Vanilla” Hadoop

• 8+ services to setup, monitor, backup, and recover(NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server,...)

• Single points of failure

• Can't separate online and offline processing

• DataStax Enterprise• Single, simpli!ed component

• Self-organizes based on workload

• Peer to peer

• JobTracker failover

©2012 DataStax

SELECT title FROM solr WHERE solr_query='title:natio*';

title-------------------------------------------------------------------------- Bolivia national football team 2002 List of French born footballers who have played for other national teams Lithuania national basketball team at Eurobasket 2009 Bolivia national football team 2000 Kenya national under-20 football team Bolivia national football team 1999 Israel men's national inline hockey team Bolivia national football team 2001

Enterprise search with Solr

©2012 DataStax

Managing & Monitoring Big Data

©2012 DataStax

Questions?• http://www.datastax.com/docs

• http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1

• http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

• http://www.datastax.com/products/enterprise

http://www.datastax.com/docs

http://www.datastax.com/docs

http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1




http://www.datastax.com/dev/blog/schema-in-cassandra-1-1




http://www.datastax.com/products/enterprise

http://www.datastax.com/products/enterprise

Technology

Massively Scalable NoSQL with Apache Cassandra