©2013 DataStax Confidential. Do not distribute without consent.
@PatrickMcFadin
Patrick McFadin Chief Evangelist/Solution Architect - DataStax
Cassandra : Introduction
Who I am
• Patrick McFadin
• Solution Architect at DataStax
• Cassandra MVP
• User for years
• Follow me for more:
I talk about Cassandra and building scalable, resilient apps ALL THE TIME!
@PatrickMcFadin
Dude. Uptime == $$
Five Years of Cassandra
[Figure: Cassandra release timeline, years 0 to 5 starting Jul-08; releases 0.1, 0.3, 0.6, 0.7, 1.0, 1.2..., 2.0, plus DSE]
Why Cassandra?
The Best Persistence Tier For Your Application
Cassandra - An introduction
Cassandra - Roots
• Based on the Amazon Dynamo and Google BigTable papers
• Shared nothing
• Data as safe as possible
• Predictable scaling
Dynamo
BigTable
Cassandra - More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server
Each node owns 25% of the data
[Figure: four-node ring, each node labeled 25%]
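The even split can be sketched as a small token ring. This is only an illustration: the `TokenRing` class, the 1000-token space, and the use of `String.hashCode()` are assumptions made for the example; Cassandra itself uses a partitioner with a much larger token space.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a token ring. Not Cassandra's implementation.
class TokenRing {
    static final int RING_SIZE = 1000; // illustrative token space: 0..999
    private final TreeMap<Integer, String> tokenToNode = new TreeMap<>();

    // Place nodeCount nodes at evenly spaced tokens around the ring.
    TokenRing(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            tokenToNode.put(i * RING_SIZE / nodeCount, "node" + (i + 1));
        }
    }

    // A token belongs to the first node whose token is >= it,
    // wrapping around to the first node at the end of the ring.
    String ownerOfToken(int token) {
        Map.Entry<Integer, String> entry = tokenToNode.ceilingEntry(token);
        return (entry != null ? entry : tokenToNode.firstEntry()).getValue();
    }

    // Stand-in hash: map the key onto the ring, then look up its owner.
    String ownerOf(String key) {
        return ownerOfToken(Math.floorMod(key.hashCode(), RING_SIZE));
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing(4);
        System.out.println("pmcfadin is owned by " + ring.ownerOf("pmcfadin"));
    }
}
```

With four evenly spaced nodes, each node ends up responsible for a quarter of the token space, and adding a node shrinks every range.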
Core Concepts - Write Path
Compacted later
<row,column>
Core Concepts - Read Path
Real user story
• New app
• SSDs
• 2.5 m requests
• Client P99: 3.17ms!
Cassandra - Locally Distributed
• Client writes to any node
• Node coordinates with others
• Data replicated in parallel
• Replication factor: How many copies of your data?
• RF = 3 here
Cassandra - Consistency
• Consistency Level (CL)
• Client specifies per read or write
• ALL = All replicas ack
• QUORUM = a majority (more than 50%) of replicas ack
• LOCAL_QUORUM = a majority (more than 50%) of replicas in the local DC ack
• ONE = Only one replica acks
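As a rough sketch, the number of acks each level waits for can be computed from the replication factor. The `ConsistencyMath` class and its method names are made up for this example and are not part of any driver; for LOCAL_QUORUM the caller is assumed to pass the replication factor of the local DC.

```java
// Hypothetical helper, not a driver API.
class ConsistencyMath {
    // A quorum is a strict majority of replicas: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Acks required for a given consistency level and replication factor.
    // For LOCAL_QUORUM, pass the replication factor of the local data center.
    static int requiredAcks(String level, int replicationFactor) {
        switch (level) {
            case "ALL":
                return replicationFactor;
            case "QUORUM":
            case "LOCAL_QUORUM":
                return quorum(replicationFactor);
            case "ONE":
                return 1;
            default:
                throw new IllegalArgumentException("Unknown level: " + level);
        }
    }

    public static void main(String[] args) {
        System.out.println("QUORUM at RF=3 needs " + requiredAcks("QUORUM", 3) + " acks");
    }
}
```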
Cassandra - Transparent to the application
• A single node failure shouldn't cause an application failure
• Replication Factor + Consistency Level = Success
• This example:
• RF = 3
• CL = QUORUM
A majority (2 of 3) acked, so we are good!
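The RF = 3, CL = QUORUM case can be checked with a few lines of arithmetic. This is a sketch under the assumption that a request succeeds whenever enough replicas are up to form a majority; the `FailureTolerance` class is hypothetical.

```java
// Hypothetical helper illustrating why RF = 3 with QUORUM survives one node failure.
class FailureTolerance {
    // Majority of replicas: floor(RF / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // How many replicas may be down while QUORUM requests still succeed.
    static int tolerableFailures(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }

    // True if the replicas still up are enough to form a quorum.
    static boolean quorumSucceeds(int replicationFactor, int replicasDown) {
        return replicationFactor - replicasDown >= quorum(replicationFactor);
    }

    public static void main(String[] args) {
        System.out.println("RF=3 tolerates " + tolerableFailures(3) + " replica(s) down at QUORUM");
    }
}
```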
My favorite feature.
Ever!
Cassandra - Geographically Distributed
• Client writes local
• Data syncs across WAN
• Replication Factor per DC
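A per-DC replication factor is set on the keyspace. The sketch below only builds the CQL statement as a string; the `KeyspaceCql` class and the data center names `dc1` and `dc2` are assumptions for illustration, and real data center names come from the cluster's configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that builds a CREATE KEYSPACE statement with one
// replication factor per data center (NetworkTopologyStrategy).
class KeyspaceCql {
    static String createKeyspace(String keyspace, Map<String, Integer> rfPerDc) {
        StringBuilder cql = new StringBuilder("CREATE KEYSPACE ")
            .append(keyspace)
            .append(" WITH replication = {'class': 'NetworkTopologyStrategy'");
        // One entry per data center: 'dc_name': replication_factor
        for (Map.Entry<String, Integer> dc : rfPerDc.entrySet()) {
            cql.append(", '").append(dc.getKey()).append("': ").append(dc.getValue());
        }
        return cql.append("};").toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> rfPerDc = new LinkedHashMap<>();
        rfPerDc.put("dc1", 3); // illustrative data center names
        rfPerDc.put("dc2", 3);
        System.out.println(createKeyspace("videodb", rfPerDc));
    }
}
```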
Cassandra Applications - Drivers
• DataStax Drivers for Cassandra
• Java
• C#
• Python
• More on the way
Cassandra Applications - Connecting
• Create a pool of local servers
• Client just uses session to interact with Cassandra
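// Example values: contactPoints = ["10.0.0.1", "10.0.0.2"], keyspace = "videodb"
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.Policies;
import java.util.List;

public class VideoDbBasicImpl {
    private final Cluster cluster;
    private final Session session;

    public VideoDbBasicImpl(List<String> contactPoints, String keyspace) {
        // Seed the driver with a few local nodes as contact points.
        cluster = Cluster.builder()
            .addContactPoints(contactPoints.toArray(new String[contactPoints.size()]))
            .withLoadBalancingPolicy(Policies.defaultLoadBalancingPolicy())
            .withRetryPolicy(Policies.defaultRetryPolicy())
            .build();
        // The client uses this session for all interaction with Cassandra.
        session = cluster.connect(keyspace);
    }
}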
CQL Intro
• Cassandra Query Language
• SQL–like language to query Cassandra
• Limited predicates. Attempts to prevent bad queries
• But still offers enough leeway to get into trouble
Data Model - Logical Containers
Cluster - Contains all nodes. Even across WAN
Keyspace - Contains all tables. Specifies replication
Table (Column Family) - Contains rows
CQL Intro
• CREATE / DROP / ALTER TABLE
• SELECT
• BUT
• INSERT AND UPDATE are similar to each other
• If a row doesn’t exist, UPDATE will insert it, and if it exists, INSERT will replace it.
• Think of it as an UPSERT
• Therefore we never get a key violation
• For updates, Cassandra never reads (no col = col + 1)
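The upsert behavior can be modeled with a plain in-memory map. This is a toy model, not how Cassandra stores data: the `UpsertTable` class is hypothetical, and it assumes a write overwrites only the columns it names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model of upsert semantics: INSERT and UPDATE follow the same code path.
class UpsertTable {
    // primary key -> (column name -> value)
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    // Write the given columns for the key. Creates the row if it is missing,
    // overwrites the named columns if it exists. Never raises a key violation.
    void write(String key, Map<String, String> columns) {
        rows.computeIfAbsent(key, k -> new HashMap<>()).putAll(columns);
    }

    Map<String, String> read(String key) {
        return rows.get(key);
    }

    public static void main(String[] args) {
        UpsertTable users = new UpsertTable();
        // "UPDATE" on a row that does not exist yet simply creates it.
        users.write("pmcfadin", Collections.singletonMap("firstname", "Patrick"));
        // "INSERT" on an existing row overwrites the column.
        users.write("pmcfadin", Collections.singletonMap("firstname", "Pat"));
        System.out.println(users.read("pmcfadin"));
    }
}
```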
Data Modeling - Creating Tables
CREATE TABLE shopping_cart (
  username varchar,
  cart_name text,
  item_id int,
  item_name varchar,
  description varchar,
  price float,
  item_detail map<varchar,varchar>,
  -- Creates compound partition row key
  PRIMARY KEY ((username,cart_name),item_id)
);
CREATE TABLE user (
  username varchar,
  firstname varchar,
  lastname varchar,
  shopping_carts set<varchar>,  -- Collection!
  PRIMARY KEY (username)
);
CQL Inserts
• Insert will always overwrite
INSERT INTO users (username, firstname, lastname,
  email, password, created_date)
VALUES ('pmcfadin','Patrick','McFadin',
  ['[email protected]'],'ba27e03fd95e507daf2937c937d499ab',
  '2011-06-20 13:50:00');
CQL Selects
• No joins
• Data is returned in row/column format
SELECT username, firstname, lastname,
  email, password, created_date
FROM users
WHERE username = 'pmcfadin';

 username | firstname | lastname | email                 | password                         | created_date
----------+-----------+----------+-----------------------+----------------------------------+--------------------------
 pmcfadin | Patrick   | McFadin  | ['[email protected]'] | ba27e03fd95e507daf2937c937d499ab | 2011-06-20 13:50:00-0700
Cassandra and Time Series
Time Series - Taming the beast
• Peter Higgs and François Englert. Nobel Prize for Physics
• Theorized the existence of the Higgs boson
• Found using ATLAS
• Data stored in P-BEAST
• Time series running on Cassandra
Use Cassandra for time series
Get a Nobel Prize
Time Series - Why
• Storage model from BigTable is perfect
• One row key and tons of (variable) columns
• Single layout on disk

Row Key | Column Name  | Column Name
        | Column Value | Column Value
Time Series - Example
• Storing weather data
• One weather station
• Temperature measurements every minute

WeatherStation ID | 2013-10-09 10:00 AM | 2013-10-09 10:00 AM | 2013-10-10 11:00 AM
                  | 72 Degrees          | 72 Degrees          | 65 Degrees
Time Series - Example
• Query data
• Weather Station ID = Locality of single node

WeatherStation ID 100 | 2013-10-09 10:00 AM | 2013-10-09 10:00 AM | 2013-10-10 11:00 AM
                      | 72 Degrees          | 72 Degrees          | 65 Degrees

Date query:
weatherStationID = 100 AND
date = 2013-10-09 10:00 AM

OR

Date range:
weatherStationID = 100 AND
date > 2013-10-09 10:00 AM AND
date < 2013-10-10 11:01 AM
Time Series - How
• CQL expresses this well
• Data partitioned by weather station ID and time

CREATE TABLE temperature (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id,event_time)
);

• Easy to insert data

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:01:00','72F');

• Easy to query

SELECT temperature
FROM temperature
WHERE weatherstation_id='1234ABCD'
AND event_time > '2013-04-03 07:01:00'
AND event_time < '2013-04-03 07:04:00';
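One way to picture why the range query is cheap: within a single weather station's row, readings are kept sorted by event_time, so a time range is a contiguous slice. The sketch below models one such row with a sorted map; the `TemperaturePartition` class is a hypothetical illustration, not Cassandra's storage engine.

```java
import java.time.LocalDateTime;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of one weather station's row: readings sorted by event_time.
class TemperaturePartition {
    private final TreeMap<LocalDateTime, String> readings = new TreeMap<>();

    // Mirrors the INSERT: one column per event_time.
    void insert(LocalDateTime eventTime, String temperature) {
        readings.put(eventTime, temperature);
    }

    // Mirrors: event_time > from AND event_time < to (both bounds exclusive).
    SortedMap<LocalDateTime, String> between(LocalDateTime from, LocalDateTime to) {
        return readings.subMap(from, false, to, false);
    }

    public static void main(String[] args) {
        TemperaturePartition station = new TemperaturePartition();
        for (int minute = 1; minute <= 4; minute++) {
            station.insert(LocalDateTime.of(2013, 4, 3, 7, minute), "72F");
        }
        System.out.println(station.between(
            LocalDateTime.of(2013, 4, 3, 7, 1),
            LocalDateTime.of(2013, 4, 3, 7, 4)));
    }
}
```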
Time Series - Further Partitioning
• With a reading every minute, a single row will eventually run out of columns
• Limit of 2 billion columns per storage row
• Data partitioned by weather station ID and time
• Use the partition key to split things up

CREATE TABLE temperature_by_day (
  weatherstation_id text,
  date text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY ((weatherstation_id,date),event_time)
);
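On the application side, the extra `date` value has to be derived from the event time before each write. A minimal sketch, assuming the date is stored as a `yyyy-MM-dd` string as in the examples on these slides; the `DayBucket` class is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for the temperature_by_day layout.
class DayBucket {
    // Derive the 'date' part of the partition key from the event time.
    static String dateBucket(LocalDateTime eventTime) {
        return eventTime.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // One reading per minute gives this many columns per daily row.
    static int readingsPerDay() {
        return 24 * 60;
    }

    public static void main(String[] args) {
        LocalDateTime eventTime = LocalDateTime.of(2013, 4, 3, 7, 1);
        System.out.println("date bucket: " + dateBucket(eventTime));
        System.out.println("columns per daily row: " + readingsPerDay());
    }
}
```

Each (weatherstation_id, date) pair becomes its own storage row, so the size of a row is bounded by one day of readings instead of growing forever.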
Time Series - Further Partitioning
• Still easy to insert

INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES ('1234ABCD','2013-04-03','2013-04-03 07:01:00','72F');

• Still easy to query

SELECT temperature
FROM temperature_by_day
WHERE weatherstation_id='1234ABCD'
AND date='2013-04-03'
AND event_time > '2013-04-03 07:01:00'
AND event_time < '2013-04-03 07:04:00';
Time Series - Use Cases
• Logging
• Thing Tracking (IoT)
• Sensor Data
• User Tracking
• Fraud Detection
• Nobel prizes!
Application Example - Layout
• Active-Active
• Service based DNS routing
Cassandra Replication
Application Example - Uptime
• Normal server maintenance
• Application is unaware
Cassandra Replication
Application Example - Failure
• Data center failure
• Data is safe. Route traffic.
Another happy user!
Cassandra Users and Use Cases
Netflix
• If you haven't heard their story… where have you been?
• $18B market cap, and it runs on Cassandra
• User accounts
• Playlists
• Payments
• Statistics
Spotify
• Millions of songs. Millions of users.
• Playlists
• 1 billion playlists
• 30+ Cassandra clusters
• 50+ TB of data
• 40k req/sec peak
http://www.slideshare.net/noaresare/cassandra-nyc
Instagram (Facebook)
• Loads and loads of photos. (Probably yours)
• All in AWS
• Security audits
• News feed
• 20k writes/sec. 15k reads/sec.
DataStax Ac*demy for Apache Cassandra
Goals
• 100,000 Registrations by the end of 2014
• 25,000 Certifications by the end of 2014
Content
• First four sessions available, with weekly roll-out of 7 sessions total
• Based on DataStax Community Edition
• CQL, Schema Design and Data Modeling
• Introduction to Cassandra Objects
• First Java, then Python, C# and .NET
https://datastaxacademy.elogiclearning.com/