55
Cassandra for impatients Carlos Alonso (@calonso) MADRID · NOV 27-28 · 2015

Cassandra for impatients

Embed Size (px)

Citation preview

Page 1: Cassandra for impatients

Cassandra for impatientsCarlos Alonso (@calonso)MADRID · NOV 27-28 · 2015

Page 2: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

• Introductions

• Cassandra Core concepts

• CQL

• Data modelling and examples

Page 3: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE users ( id INT PRIMARY KEY, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP);

INSERT INTO users (id, name, surname, birthdate) VALUES (1, ‘Carlos’, ‘Alonso’, ’1985-03-19’);

SELECT * FROM users WHERE id = 1;

ALTER TABLE users ADD city VARCHAR;

UPDATE users SET city = ‘London’ WHERE id = 1;

Page 4: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

IMPATIENT LEVEL…

Aha!! That’s SQL, I already know all this stuff. I’ll start using Cassandra as soon as I get home.

Page 5: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

IMPATIENT LEVEL…

Why is this taking so long? Have they gone killing the cow for my BigMac?

Page 6: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

IMPATIENT LEVEL…As soon as I see the loading spinner…

Page 7: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CARLOS ALONSO•Spanish Londoner

•MSc Salamanca University, Spain

•Software Engineer @MyDrive Solutions

•Cassandra certified developer

•Cassandra MVP 2015

•@calonso / http://mrcalonso.com

Page 8: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

MYDRIVE SOLUTIONS

• World leading driver profiling company

• Using technology and data to understand how to improve driving behaviour

• Recently acquired by the Generali Group

• @_MyDrive / http://mydrivesolutions.com

• We are hiring!!

Page 9: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CASSANDRA• Open Source distributed database

management system

• Initially developed at Facebook

• Inspired by Amazon’s Dynamo and Google BigTable papers

• Became Apache top-level project in Feb, 2010

• Nowadays developed by DataStax

Page 10: Cassandra for impatients

Cassandra Core ConceptsTechnical introduction to Apache CassandraMADRID · NOV 27-28 · 2015

Page 11: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

THE CAP THEOREM

Page 12: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CASSANDRA• Fast Distributed NoSQL Database

• High Availability

• Linear Scalability => Predictability

• No SPOF

• Multi-DC

• Horizontally scalable => $$$

• Not a drop in replacement for RDBMS

Page 13: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CLUSTER COMPONENTS

Page 14: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CLUSTER COMPONENTS

Page 15: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CONSISTENT HASHING

Hash function

“Carlos” 185664

1773456738847666528349

-894763734895827651234

Page 16: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

REQUEST COORDINATION

• Coordinator: The node designated to coordinate a particular query.

• ANY node can coordinate ANY request.

• No SPOF: One of the main Cassandra’s principles.

• The driver chooses which node will coordinate

Page 17: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

REPLICATION FACTOR

How many copies (replicas) for your data

Page 18: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

WHY REPLICATION?

• Disaster recovery

• Bring data closer to users (to reduce latencies)

• Workload segregation (analytical vs transactional)

Page 19: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

REPLICATIONDefined at keyspace level

CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { “class”: “SimpleStrategy”, “replication_factor”: 2 };

CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { “class”: “NetworkTopologyStrategy”,

“dc-east”: 2, “dc-west”: 3 };

Page 20: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CONSISTENCY LEVEL

How many replicas of your data must respond ok?

Page 21: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CONSISTENCY LEVEL

Availability /Partition tolerance Consistency

Page 22: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

DriverClient

Partitioner

f81d4fae-…834

• RF = 3• CL = QUORUM• SELECT * … WHERE id = f81d4fae-…

Page 23: Cassandra for impatients
Page 24: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

COMPACTIONS• Merges most recent partition keys and columns

• Evicts tombstoned columns

• Creates new SSTable

• Rebuilds partition indexes and summaries

• Deletes old SSTables

Page 25: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

COMPACTIONS

• SizeTieredCompactionStrategy

• LeveledCompactionStrategy

• DateTieredCompactionStrategy

CREATE TABLE user (id UUID PRIMARY KEY,email TEXT,name TEXT,preferences SET<TEXT>,

) WITH COMPACTION = { “class”: “<strategy>”, <params> };

Page 26: Cassandra for impatients
Page 27: Cassandra for impatients

CQLCassandra Query LanguageMADRID · NOV 27-28 · 2015

Page 28: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CQL

CREATE TABLE users ( id UUID, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP, PRIMARY KEY(id));

Familiar row-column SQL-like approach.

INSERT INTO users (id, name, surname, birthdate) VALUES (uuid(), ‘Carlos’, ‘Alonso’, ’1985-03-19’);

SELECT * FROM users WHERE id = ‘f81d4fae-7dec-11d0-a765-00a0c91e6bf6’;

ALTER TABLE users ADD address VARCHAR;

Page 29: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

WHERE• Equality matches:

• SELECT … WHERE album_title = ‘Revolver’;

• IN:

• Only applicable in the last WHERE clause

• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);

• Range search:

• Only on clustering columns.

• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;

Page 30: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

COLLECTIONS

• Set: Uniqueness

• List: Order

• Map: Key-Value pairs

• Tuple: Fixed length

email_addresses SET<VARCHAR>

email_addresses LIST<VARCHAR>

email_addresses MAP<VARCHAR, VARCHAR>

email_addresses TUPLE<VARCHAR, VARCHAR>

Page 31: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

MORE FEATURES

• COUNTERS

• LWT

• TTL

• UDT

• BATCHES

• BATCH + LWT

• STATIC COLUMNS

Page 32: Cassandra for impatients

Data ModellingBasic concepts for a Cassandra modelMADRID · NOV 27-28 · 2015

Page 33: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

PHYSICAL DATA STRUCTURE

Page 34: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

DATA MODELLING

• Understand your data

• Decide how you’ll query the data

• Define column families to satisfy those queries

• Implement and optimize

Page 35: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

DATA MODELLINGConceptual

Model

Logical Model

Physical Model

Query-DrivenMethodology

Analysis &Validation

Page 36: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

PRIMARY KEY

PARTITION KEY +CLUSTERING COLUMN(S)

Page 37: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

QUERY DRIVEN METHODOLOGY• Spread data evenly around the cluster

• Minimise the number of partitions read

• Follow the mapping rules:

• Entities and relationships: map to tables

• Equality search attributes: must be at the beginning of the primary key

• Inequality search attributes: become clustering columns

• Ordering attributes: become clustering columns

• Key attributes: map to primary key columns

Page 38: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• How much data duplication? (batches)

• Client side joins or new table?

Page 39: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

An E-Library project.

Page 40: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

EXAMPLE 1Books can be uniquely identified and accessed by ISBN,

we also need a title, genre, author and publisher.

QDMBook

ISBN KTitle

AuthorGenre

Publisher

Q1

Q1: Find books by ISBN

Page 41: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• 1 X (5 - 1 - 0) + 0 < 1M

• How much data duplication? (batches)

• Client side joins or new table?

BookISBN KTitle

AuthorGenre

Publisher

Q1

Q1: Find books by ISBN

N/A

N/A

Page 42: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

BookISBN KTitle

AuthorGenre

Publisher

Q1

Q1: Find books by ISBNCREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR);

SELECT * FROM books WHERE ISBN = ‘…’;

Page 43: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

Users register into the system uniquely identified by an email and a password. We also want their full name. They will be

accessed by email and password or internal unique ID.

QDM KUsers_by_IDID

full_name

Q1

Q1: Find users by ID

Users_by_login_infoemail K

password

Q2

Q2: Find users by login info

K

full_name

ID

Page 44: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• 1 X (2 - 1 - 0) + 0 < 1M

• How much data duplication? (batches)

• Client side joins or new table?

N/A

N/A

KUsers_by_IDID

full_name

Q1

Q1: Find users by ID

Page 45: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR);

SELECT * FROM users_by_id WHERE ID = …;

KUsers_by_IDID

full_name

Q1

Q1: Find users by ID

Page 46: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• 1 X (4 - 2 - 0) + 0 < 1M

• How much data duplication? (batches)

• Client side joins or new table? N/A

Users_by_login_infoemail K

password

Q2

Q2: Find users by login info

K

full_name

ID

Page 47: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY_KEY ((email, password)));

SELECT * FROM users_by_login_info WHERE email = ‘…’ AND password = ‘…’;

Users_by_login_infoemail K

password

Q2

Q2: Find users by login info

K

full_name

ID

Page 48: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);APPLY BATCH;

Page 49: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

Users read books. We want to know which books has a user read.

QDM

KBooks_read_by_user

user_IDtitle

Q1

full_nameISBNgenre

publisher

authorCC

Q1: Find all books a loggeduser has read

Page 50: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• Books X (7 - 1 - 0) + 0 < 1M => 166.666 books per user

• How much data duplication? (batches)

• Client side joins or new table?

Q1: Find all books a loggeduser has read

KBooks_read_by_user

user_IDtitle

Q1

full_nameISBNgenre

publisher

authorCC

New table

N/A

Page 51: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE books_read_by_user ( user_ID TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR PRIMARY_KEY (user_id, title, author));

SELECT * FROM books_read_by_user WHERE user_ID = …;

Q1: Find all books a loggeduser has read

KBooks_read_by_user

user_IDtitle

Q1

full_nameISBNgenre

publisher

authorCC

Page 52: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

In order to improve our site’s usability we need to understand how our users use it by tracking every interaction

they have with our site.

QDM

user_ID2_weeks

timeelement

type

KActions_by_user

Q1

Q1: Find all actions a user does in a time range

KC

Page 53: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks

• How much data duplication? (batches)

• Client side joins or new table?

N/A

KActions_by_user

user_ID2_weeks

Q1

Q1: Find all actions a user does in a time range

Ktime

elementtype

C

N/A

Page 54: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR PRIMARY_KEY ((user_ID, 2_weeks), time));

SELECT * FROM actions_by_user WHERE user_ID = … AND 2_weeks = … AND

time < … AND time > …;

KActions_by_user

user_ID2_weeks

Q1

Q1: Find all actions a user does in a time range

Ktime

elementtype

C

Page 55: Cassandra for impatients

MADRID · NOV 27-28 · 2015@calonso

THANKS!