Upload
carlos-alonso-perez
View
925
Download
0
Embed Size (px)
Citation preview
Cassandra for impatientsCarlos Alonso (@calonso)MADRID · NOV 27-28 · 2015
MADRID · NOV 27-28 · 2015@calonso
• Introductions
• Cassandra Core concepts
• CQL
• Data modelling and examples
MADRID · NOV 27-28 · 2015@calonso
CREATE TABLE users ( id INT PRIMARY KEY, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP);
INSERT INTO users (id, name, surname, birthdate) VALUES (1, ‘Carlos’, ‘Alonso’, ’1985-03-19’);
SELECT * FROM users WHERE id = 1;
ALTER TABLE users ADD city VARCHAR;
UPDATE users SET city = ‘London’ WHERE id = 1;
MADRID · NOV 27-28 · 2015@calonso
IMPATIENT LEVEL…
Aha!! That’s SQL, I already know all this stuff. I’ll start using Cassandra as soon as I get home.
MADRID · NOV 27-28 · 2015@calonso
IMPATIENT LEVEL…
Why is this taking so long? Have they gone killing the cow for my BigMac?
MADRID · NOV 27-28 · 2015@calonso
IMPATIENT LEVEL…As soon as I see the loading spinner…
MADRID · NOV 27-28 · 2015@calonso
CARLOS ALONSO•Spanish Londoner
•MSc Salamanca University, Spain
•Software Engineer @MyDrive Solutions
•Cassandra certified developer
•Cassandra MVP 2015
•@calonso / http://mrcalonso.com
MADRID · NOV 27-28 · 2015@calonso
MYDRIVE SOLUTIONS
• World leading driver profiling company
• Using technology and data to understand how to improve driving behaviour
• Recently acquired by the Generali Group
• @_MyDrive / http://mydrivesolutions.com
• We are hiring!!
MADRID · NOV 27-28 · 2015@calonso
CASSANDRA• Open Source distributed database
management system
• Initially developed at Facebook
• Inspired by Amazon’s Dynamo and Google BigTable papers
• Became Apache top-level project in Feb, 2010
• Nowadays developed by DataStax
Cassandra Core ConceptsTechnical introduction to Apache CassandraMADRID · NOV 27-28 · 2015
MADRID · NOV 27-28 · 2015@calonso
THE CAP THEOREM
MADRID · NOV 27-28 · 2015@calonso
CASSANDRA• Fast Distributed NoSQL Database
• High Availability
• Linear Scalability => Predictability
• No SPOF
• Multi-DC
• Horizontally scalable => $$$
• Not a drop in replacement for RDBMS
MADRID · NOV 27-28 · 2015@calonso
CLUSTER COMPONENTS
MADRID · NOV 27-28 · 2015@calonso
CLUSTER COMPONENTS
MADRID · NOV 27-28 · 2015@calonso
CONSISTENT HASHING
Hash function
“Carlos” 185664
1773456738847666528349
-894763734895827651234
MADRID · NOV 27-28 · 2015@calonso
REQUEST COORDINATION
• Coordinator: The node designated to coordinate a particular query.
• ANY node can coordinate ANY request.
• No SPOF: One of the main Cassandra’s principles.
• The driver chooses which node will coordinate
MADRID · NOV 27-28 · 2015@calonso
REPLICATION FACTOR
How many copies (replicas) for your data
MADRID · NOV 27-28 · 2015@calonso
WHY REPLICATION?
• Disaster recovery
• Bring data closer to users (to reduce latencies)
• Workload segregation (analytical vs transactional)
MADRID · NOV 27-28 · 2015@calonso
REPLICATIONDefined at keyspace level
CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { “class”: “SimpleStrategy”, “replication_factor”: 2 };
CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { “class”: “NetworkTopologyStrategy”,
“dc-east”: 2, “dc-west”: 3 };
MADRID · NOV 27-28 · 2015@calonso
CONSISTENCY LEVEL
How many replicas of your data must respond ok?
MADRID · NOV 27-28 · 2015@calonso
CONSISTENCY LEVEL
Availability /Partition tolerance Consistency
MADRID · NOV 27-28 · 2015@calonso
DriverClient
Partitioner
f81d4fae-…834
• RF = 3• CL = QUORUM• SELECT * … WHERE id = f81d4fae-…
MADRID · NOV 27-28 · 2015@calonso
COMPACTIONS• Merges most recent partition keys and columns
• Evicts tombstoned columns
• Creates new SSTable
• Rebuilds partition indexes and summaries
• Deletes old SSTables
MADRID · NOV 27-28 · 2015@calonso
COMPACTIONS
• SizeTieredCompactionStrategy
• LeveledCompactionStrategy
• DateTieredCompactionStrategy
CREATE TABLE user (id UUID PRIMARY KEY,email TEXT,name TEXT,preferences SET<TEXT>,
) WITH COMPACTION = { “class”: “<strategy>”, <params> };
CQLCassandra Query LanguageMADRID · NOV 27-28 · 2015
MADRID · NOV 27-28 · 2015@calonso
CQL
CREATE TABLE users ( id UUID, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP, PRIMARY KEY(id));
Familiar row-column SQL-like approach.
INSERT INTO users (id, name, surname, birthdate) VALUES (uuid(), ‘Carlos’, ‘Alonso’, ’1985-03-19’);
SELECT * FROM users WHERE id = ‘f81d4fae-7dec-11d0-a765-00a0c91e6bf6’;
ALTER TABLE users ADD address VARCHAR;
MADRID · NOV 27-28 · 2015@calonso
WHERE• Equality matches:
• SELECT … WHERE album_title = ‘Revolver’;
• IN:
• Only applicable in the last WHERE clause
• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);
• Range search:
• Only on clustering columns.
• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;
MADRID · NOV 27-28 · 2015@calonso
COLLECTIONS
• Set: Uniqueness
• List: Order
• Map: Key-Value pairs
• Tuple: Fixed length
email_addresses SET<VARCHAR>
email_addresses LIST<VARCHAR>
email_addresses MAP<VARCHAR, VARCHAR>
email_addresses TUPLE<VARCHAR, VARCHAR>
MADRID · NOV 27-28 · 2015@calonso
MORE FEATURES
• COUNTERS
• LWT
• TTL
• UDT
• BATCHES
• BATCH + LWT
• STATIC COLUMNS
Data ModellingBasic concepts for a Cassandra modelMADRID · NOV 27-28 · 2015
MADRID · NOV 27-28 · 2015@calonso
PHYSICAL DATA STRUCTURE
MADRID · NOV 27-28 · 2015@calonso
DATA MODELLING
• Understand your data
• Decide how you’ll query the data
• Define column families to satisfy those queries
• Implement and optimize
MADRID · NOV 27-28 · 2015@calonso
DATA MODELLINGConceptual
Model
Logical Model
Physical Model
Query-DrivenMethodology
Analysis &Validation
MADRID · NOV 27-28 · 2015@calonso
PRIMARY KEY
PARTITION KEY +CLUSTERING COLUMN(S)
MADRID · NOV 27-28 · 2015@calonso
QUERY DRIVEN METHODOLOGY• Spread data evenly around the cluster
• Minimise the number of partitions read
• Follow the mapping rules:
• Entities and relationships: map to tables
• Equality search attributes: must be at the beginning of the primary key
• Inequality search attributes: become clustering columns
• Ordering attributes: become clustering columns
• Key attributes: map to primary key columns
MADRID · NOV 27-28 · 2015@calonso
ANALYSIS & VALIDATION• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• How much data duplication? (batches)
• Client side joins or new table?
MADRID · NOV 27-28 · 2015@calonso
An E-Library project.
MADRID · NOV 27-28 · 2015@calonso
EXAMPLE 1Books can be uniquely identified and accessed by ISBN,
we also need a title, genre, author and publisher.
QDMBook
ISBN KTitle
AuthorGenre
Publisher
Q1
Q1: Find books by ISBN
MADRID · NOV 27-28 · 2015@calonso
ANALYSIS & VALIDATION• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• 1 X (5 - 1 - 0) + 0 < 1M
• How much data duplication? (batches)
• Client side joins or new table?
BookISBN KTitle
AuthorGenre
Publisher
Q1
Q1: Find books by ISBN
N/A
N/A
MADRID · NOV 27-28 · 2015@calonso
BookISBN KTitle
AuthorGenre
Publisher
Q1
Q1: Find books by ISBNCREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR);
SELECT * FROM books WHERE ISBN = ‘…’;
MADRID · NOV 27-28 · 2015@calonso
Users register into the system uniquely identified by an email and a password. We also want their full name. They will be
accessed by email and password or internal unique ID.
QDM KUsers_by_IDID
full_name
Q1
Q1: Find users by ID
Users_by_login_infoemail K
password
Q2
Q2: Find users by login info
K
full_name
ID
MADRID · NOV 27-28 · 2015@calonso
ANALYSIS & VALIDATION• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• 1 X (2 - 1 - 0) + 0 < 1M
• How much data duplication? (batches)
• Client side joins or new table?
N/A
N/A
KUsers_by_IDID
full_name
Q1
Q1: Find users by ID
MADRID · NOV 27-28 · 2015@calonso
CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR);
SELECT * FROM users_by_id WHERE ID = …;
KUsers_by_IDID
full_name
Q1
Q1: Find users by ID
MADRID · NOV 27-28 · 2015@calonso
ANALYSIS & VALIDATION• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• 1 X (4 - 2 - 0) + 0 < 1M
• How much data duplication? (batches)
• Client side joins or new table? N/A
Users_by_login_infoemail K
password
Q2
Q2: Find users by login info
K
full_name
ID
MADRID · NOV 27-28 · 2015@calonso
CREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY_KEY ((email, password)));
SELECT * FROM users_by_login_info WHERE email = ‘…’ AND password = ‘…’;
Users_by_login_infoemail K
password
Q2
Q2: Find users by login info
K
full_name
ID
MADRID · NOV 27-28 · 2015@calonso
BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);APPLY BATCH;
MADRID · NOV 27-28 · 2015@calonso
Users read books. We want to know which books has a user read.
QDM
KBooks_read_by_user
user_IDtitle
Q1
full_nameISBNgenre
publisher
authorCC
Q1: Find all books a loggeduser has read
MADRID · NOV 27-28 · 2015@calonso
ANALYSIS & VALIDATION• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• Books X (7 - 1 - 0) + 0 < 1M => 166.666 books per user
• How much data duplication? (batches)
• Client side joins or new table?
Q1: Find all books a loggeduser has read
KBooks_read_by_user
user_IDtitle
Q1
full_nameISBNgenre
publisher
authorCC
New table
N/A
MADRID · NOV 27-28 · 2015@calonso
CREATE TABLE books_read_by_user ( user_ID TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR PRIMARY_KEY (user_id, title, author));
SELECT * FROM books_read_by_user WHERE user_ID = …;
Q1: Find all books a loggeduser has read
KBooks_read_by_user
user_IDtitle
Q1
full_nameISBNgenre
publisher
authorCC
MADRID · NOV 27-28 · 2015@calonso
In order to improve our site’s usability we need to understand how our users use it by tracking every interaction
they have with our site.
QDM
user_ID2_weeks
timeelement
type
KActions_by_user
Q1
Q1: Find all actions a user does in a time range
KC
MADRID · NOV 27-28 · 2015@calonso
ANALYSIS & VALIDATION• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks
• How much data duplication? (batches)
• Client side joins or new table?
N/A
KActions_by_user
user_ID2_weeks
Q1
Q1: Find all actions a user does in a time range
Ktime
elementtype
C
N/A
MADRID · NOV 27-28 · 2015@calonso
CREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR PRIMARY_KEY ((user_ID, 2_weeks), time));
SELECT * FROM actions_by_user WHERE user_ID = … AND 2_weeks = … AND
time < … AND time > …;
KActions_by_user
user_ID2_weeks
Q1
Q1: Find all actions a user does in a time range
Ktime
elementtype
C
MADRID · NOV 27-28 · 2015@calonso
THANKS!