Cassandra for impatients

Carlos Alonso (@calonso)

MADRID · NOV 27-28 · 2015


• Introductions

• Cassandra Core concepts

• CQL

• Data modelling and examples


CREATE TABLE users ( id INT PRIMARY KEY, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP);

INSERT INTO users (id, name, surname, birthdate) VALUES (1, 'Carlos', 'Alonso', '1985-03-19');

SELECT * FROM users WHERE id = 1;

ALTER TABLE users ADD city VARCHAR;

UPDATE users SET city = 'London' WHERE id = 1;


IMPATIENT LEVEL…

Aha!! That’s SQL, I already know all this stuff. I’ll start using Cassandra as soon as I get home.


IMPATIENT LEVEL…

Why is this taking so long? Have they gone killing the cow for my BigMac?


IMPATIENT LEVEL…

As soon as I see the loading spinner…


CARLOS ALONSO

• Spanish Londoner

• MSc Salamanca University, Spain

• Software Engineer @MyDrive Solutions

• Cassandra certified developer

• Cassandra MVP 2015

• @calonso / http://mrcalonso.com


MYDRIVE SOLUTIONS

• World leading driver profiling company

• Using technology and data to understand how to improve driving behaviour

• Recently acquired by the Generali Group

• @_MyDrive / http://mydrivesolutions.com

• We are hiring!!



CASSANDRA

• Open Source distributed database management system

• Initially developed at Facebook

• Inspired by Amazon’s Dynamo and Google BigTable papers

• Became an Apache top-level project in February 2010

• Nowadays developed by the Apache community, with DataStax as a major contributor

Cassandra Core Concepts

Technical introduction to Apache Cassandra


THE CAP THEOREM


CASSANDRA

• Fast Distributed NoSQL Database

• High Availability

• Linear Scalability => Predictability

• No SPOF

• Multi-DC

• Horizontally scalable => $$$

• Not a drop-in replacement for an RDBMS


CLUSTER COMPONENTS




CONSISTENT HASHING

Hash function: “Carlos” → 185664

[Figure: token ring; sample tokens shown: 1773456738847666528349, -894763734895827651234]
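As a rough illustration of the idea on this slide, the sketch below hashes a partition key to a token and walks a sorted ring to find the node that owns it. The class `Ring`, the helper `token_for`, the node names and the token values are all hypothetical. MD5 is used only because it ships with Python; Cassandra's default partitioner is Murmur3-based, so real tokens will differ.

```python
import bisect
import hashlib


def token_for(key):
    """Hash a partition key to an integer token (illustrative hash, not Murmur3)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """A toy token ring: each node owns the tokens up to and including its own."""

    def __init__(self, node_tokens):
        # node_tokens maps a (hypothetical) node name to the token it was assigned.
        self._entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def owner(self, token):
        # First node whose token is >= the key's token, wrapping around the ring.
        index = bisect.bisect_left(self._tokens, token) % len(self._tokens)
        return self._entries[index][1]

    def replicas(self, token, replication_factor):
        # Simple placement: the owner plus the next nodes clockwise on the ring.
        start = bisect.bisect_left(self._tokens, token)
        count = min(replication_factor, len(self._entries))
        return [self._entries[(start + i) % len(self._entries)][1] for i in range(count)]


ring = Ring({"node-a": 100, "node-b": 200, "node-c": 300})
print(ring.owner(150))        # node that owns token 150
print(ring.replicas(250, 2))  # owner plus one more replica, clockwise
```

Because only the neighbouring ranges change when a node joins or leaves, most keys keep their owner, which is what makes this scheme attractive for a cluster that grows over time.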


REQUEST COORDINATION

• Coordinator: The node designated to coordinate a particular query.

• ANY node can coordinate ANY request.

• No SPOF: One of Cassandra’s main principles.

• The driver chooses which node will coordinate


REPLICATION FACTOR

How many copies (replicas) of your data are stored


WHY REPLICATION?

• Disaster recovery

• Bring data closer to users (to reduce latencies)

• Workload segregation (analytical vs transactional)


REPLICATION

Defined at keyspace level

CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 2 };

CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'dc-east': 2, 'dc-west': 3 };


CONSISTENCY LEVEL

How many replicas of your data must respond ok?


[Figure: the consistency level as a trade-off between Availability / Partition tolerance and Consistency]


[Figure: Client, Driver and Partitioner; the partitioner maps f81d4fae-… to 834]

• RF = 3

• CL = QUORUM

• SELECT * … WHERE id = f81d4fae-…
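To make the RF = 3 / CL = QUORUM case concrete, here is a small sketch of the arithmetic involved. The function names are hypothetical; the quorum formula (half the replicas, rounded down, plus one) is the standard one.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM request."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replicas that can be down while QUORUM requests still succeed."""
    return replication_factor - quorum(replication_factor)


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set overlaps every write set (W + R > RF)."""
    return write_replicas + read_replicas > replication_factor


rf = 3
print(quorum(rf))              # replicas needed for QUORUM
print(tolerated_failures(rf))  # replicas that may be unavailable
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))
```

With RF = 3, a QUORUM read and a QUORUM write each touch two replicas, so at least one replica is common to both and one node can be down without failing the request.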


COMPACTIONS

• Merges most recent partition keys and columns

• Evicts tombstoned columns

• Creates new SSTable

• Rebuilds partition indexes and summaries

• Deletes old SSTables
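The steps above can be sketched in a few lines, under heavy simplification. Here an SSTable is modelled as a plain dictionary from a (partition key, column) pair to a (write timestamp, value) pair, and `TOMBSTONE` is a hypothetical marker for deleted cells. Real compaction works on sorted on-disk files and only evicts tombstones once gc_grace_seconds has passed; none of that is modelled here.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted cell


def compact(sstables):
    """Merge several toy SSTables into one new SSTable.

    Each SSTable maps (partition_key, column) -> (write_timestamp, value).
    """
    merged = {}
    for sstable in sstables:
        for cell, (timestamp, value) in sstable.items():
            # Keep only the most recently written version of each cell.
            if cell not in merged or timestamp > merged[cell][0]:
                merged[cell] = (timestamp, value)
    # Evict tombstoned cells and write the survivors, sorted, to the new table.
    return {
        cell: versioned
        for cell, versioned in sorted(merged.items())
        if versioned[1] is not TOMBSTONE
    }


older = {
    ("carlos", "city"): (1, "Madrid"),
    ("carlos", "name"): (1, "Carlos"),
    ("ana", "city"): (1, "Paris"),
}
newer = {
    ("carlos", "city"): (2, "London"),
    ("ana", "city"): (3, TOMBSTONE),
}
compacted = compact([older, newer])
print(compacted)
```

After the merge, the old tables can be discarded: the new one holds the latest value of every live cell and nothing else.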


COMPACTIONS

• SizeTieredCompactionStrategy

• LeveledCompactionStrategy

• DateTieredCompactionStrategy

CREATE TABLE user ( id UUID PRIMARY KEY, email TEXT, name TEXT, preferences SET<TEXT> ) WITH COMPACTION = { 'class': '<strategy>', <params> };

CQL

Cassandra Query Language


CQL

CREATE TABLE users ( id UUID, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP, PRIMARY KEY(id));

Familiar row-column SQL-like approach.

INSERT INTO users (id, name, surname, birthdate) VALUES (uuid(), 'Carlos', 'Alonso', '1985-03-19');

SELECT * FROM users WHERE id = f81d4fae-7dec-11d0-a765-00a0c91e6bf6;

ALTER TABLE users ADD address VARCHAR;


WHERE

• Equality matches:

  • SELECT … WHERE album_title = 'Revolver';

• IN:

  • Only applicable to the last restricted column of the WHERE clause

  • SELECT … WHERE album_title = 'Revolver' AND number IN (2, 3, 4);

• Range search:

  • Only on clustering columns.

  • SELECT … WHERE album_title = 'Revolver' AND number >= 2 AND number < 6;


COLLECTIONS

• Set: Uniqueness

• List: Order

• Map: Key-Value pairs

• Tuple: Fixed length

email_addresses SET<VARCHAR>

email_addresses LIST<VARCHAR>

email_addresses MAP<VARCHAR, VARCHAR>

email_addresses TUPLE<VARCHAR, VARCHAR>


MORE FEATURES

• COUNTERS

• LWT (lightweight transactions)

• TTL (time to live)

• UDT (user-defined types)

• BATCHES

• BATCH + LWT

• STATIC COLUMNS

Data Modelling

Basic concepts for a Cassandra model


PHYSICAL DATA STRUCTURE


DATA MODELLING

• Understand your data

• Decide how you’ll query the data

• Define column families to satisfy those queries

• Implement and optimize


DATA MODELLING

Conceptual Model → (Query-Driven Methodology) → Logical Model → (Analysis & Validation) → Physical Model


PRIMARY KEY

PARTITION KEY + CLUSTERING COLUMN(S)


QUERY DRIVEN METHODOLOGY

• Spread data evenly around the cluster

• Minimise the number of partitions read

• Follow the mapping rules:

• Entities and relationships: map to tables

• Equality search attributes: must be at the beginning of the primary key

• Inequality search attributes: become clustering columns

• Ordering attributes: become clustering columns

• Key attributes: map to primary key columns


ANALYSIS & VALIDATION

• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• N_cells = N_rows × (N_cols - N_pk - N_static) + N_static < 1M

• How much data duplication? (batches)

• Client side joins or new table?
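The partition size check in this list is simple arithmetic, so it can be sketched as a helper. The function names and the sample column counts below are illustrative assumptions; only the formula and the 1M limit come from the slide.

```python
MAX_CELLS = 1_000_000  # upper bound on cells per partition used in this talk


def cells_per_partition(n_rows, n_cols, n_pk, n_static=0):
    """N_cells = N_rows x (N_cols - N_pk - N_static) + N_static."""
    return n_rows * (n_cols - n_pk - n_static) + n_static


def max_rows(n_cols, n_pk, n_static=0, limit=MAX_CELLS):
    """Largest number of rows that keeps a partition strictly under the limit."""
    regular_cols = n_cols - n_pk - n_static
    if regular_cols <= 0:
        raise ValueError("the table needs at least one regular column")
    return (limit - 1 - n_static) // regular_cols


# A single-row partition in a table with 5 columns, 1 of them a key column.
print(cells_per_partition(n_rows=1, n_cols=5, n_pk=1))
# How many rows fit in one partition of a 7-column table with 1 key column.
print(max_rows(n_cols=7, n_pk=1))
```

Running the check early shows whether a partition key needs an extra component (for example a time bucket) before the table ever reaches production.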


An E-Library project.


EXAMPLE 1

Books can be uniquely identified and accessed by ISBN; we also need a title, genre, author and publisher.

QDM

Q1: Find books by ISBN

Book: ISBN (K), Title, Author, Genre, Publisher






• How large are partitions? 1 × (5 - 1 - 0) + 0 = 4 cells < 1M



• How much data duplication? N/A

• Client side joins or new table? N/A


CREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR);

SELECT * FROM books WHERE ISBN = '…';


EXAMPLE 2

Users register into the system uniquely identified by an email and a password. We also want their full name. They will be accessed by email and password or by internal unique ID.

QDM

Q1: Find users by ID

Users_by_ID: ID (K), full_name

Q2: Find users by login info

Users_by_login_info: email (K), password (K), full_name, ID






Users_by_ID (Q1):

• How large are partitions? 1 × (2 - 1 - 0) + 0 = 1 cell < 1M

• How much data duplication? N/A

• Client side joins or new table? N/A



CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR);

SELECT * FROM users_by_id WHERE ID = …;

Users_by_login_info (Q2):

• How large are partitions? 1 × (4 - 2 - 0) + 0 = 2 cells < 1M

• Client side joins or new table? N/A


CREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY KEY ((email, password)));

SELECT * FROM users_by_login_info WHERE email = '…' AND password = '…';




BEGIN BATCH
  INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS;
  INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);
APPLY BATCH;


EXAMPLE 3

Users read books. We want to know which books a user has read.

QDM

Q1: Find all books a logged-in user has read

Books_read_by_user: user_ID (K), title (C), author (C), full_name, ISBN, genre, publisher






• How large are partitions? Books × (7 - 1 - 0) + 0 < 1M => 166,666 books per user





New table

N/A


CREATE TABLE books_read_by_user ( user_ID TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR, PRIMARY KEY (user_ID, title, author));

SELECT * FROM books_read_by_user WHERE user_ID = …;




EXAMPLE 4

In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have with our site.

QDM

Q1: Find all actions a user does in a time range

Actions_by_user: user_ID (K), 2_weeks (K), time (C), element, type






• How large are partitions? Actions × (5 - 2 - 0) + 0 < 1M => 333,333 actions per user every 2 weeks



• How much data duplication? N/A

• Client side joins or new table? N/A


CREATE TABLE actions_by_user ( user_ID TIMEUUID, "2_weeks" INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, "2_weeks"), time));

SELECT * FROM actions_by_user WHERE user_ID = … AND "2_weeks" = … AND time < … AND time > …;
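The slides do not say how the 2_weeks bucket value is computed, so the following is only one possible convention: count whole 14-day periods since the Unix epoch. The function names are hypothetical, and timezone-aware datetimes are assumed (naive ones would be read as local time). A query spanning several buckets has to be issued once per bucket, because the bucket is part of the partition key.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 14 * 24 * 60 * 60  # two weeks, expressed in seconds


def two_week_bucket(moment):
    """Bucket number of a timestamp: whole 14-day periods since the epoch (assumed convention)."""
    return int(moment.timestamp()) // BUCKET_SECONDS


def buckets_for_range(start, end):
    """Every bucket that a [start, end] time range touches, in order."""
    return list(range(two_week_bucket(start), two_week_bucket(end) + 1))


start = datetime(2015, 11, 1, tzinfo=timezone.utc)
end = datetime(2015, 11, 28, tzinfo=timezone.utc)
for bucket in buckets_for_range(start, end):
    # One SELECT per bucket, each restricted to a single partition.
    print(bucket)
```

Keeping the bucket a pure function of the timestamp means writers and readers agree on the partition without any lookup.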



THANKS!
