Cassandra Data Modelling with CQL (OSCON 2015)

DATA MODELING CASSANDRA USING CQL3

MIKE BIGLAN & ELIJAH HAMOVITZ

ABOUT US

Mike Biglan, M.S.

• Twenty Ideas, Inc – twentyideas.com

• Analytic Spot – spo.td

Elijah Hamovitz

• code.org

• Analytic Spot – spo.td

SIMPLIFIED REPRESENTATION OF REALITY

A DATA MODEL MODELS DATA

• Framework to store and organize data

• Models things, their differences, and relationships between them

• Things can be real or virtual

IDEAL DATA MODEL PROPERTIES

• Easy to create

• Easy to interface with

• Quick, flexible querying

• Writing is direct and simple

• Easily understandable

• Scalable: can read, write, and store huge amounts of data safely

SCALE AWAY, SCALE AWAY, SCALE AWAY

• Availability: fault tolerance, redundancy, supports multiple data centers

• Consistency: strong or tunable

• Huge amounts of data (that won’t fit on single server)

• High speed of incoming and/or accessed data

COLUMNAR DATABASE

• Document-Store (e.g. MongoDB) and Columnar (e.g. Cassandra, HBase, Dynamo) are both “NoSQL”

• But modeling in Document-Store is quite different from modeling in Columnar

• Atomic unit of data storage

• Document-Store: document

• Relational Database: row

• Columnar: column

CASSANDRA COLUMN:

• Name

• Value (or tombstone)

• Timestamp

• TTL
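The four parts of a column listed above can be pictured with a small Python sketch. This is only an illustration: the `Column` class and the `newest` helper are hypothetical names, not Cassandra's internals, and the tie-breaking rule for equal timestamps is simplified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    """Illustrative stand-in for one stored column; not Cassandra's real class."""
    name: str
    value: Optional[str]       # None marks a tombstone (a deleted value)
    timestamp: int             # write time in microseconds
    ttl: Optional[int] = None  # seconds until expiry; None means no expiry

    def is_tombstone(self) -> bool:
        return self.value is None

    def is_expired(self, now_micros: int) -> bool:
        if self.ttl is None:
            return False
        return now_micros >= self.timestamp + self.ttl * 1_000_000


def newest(a: Column, b: Column) -> Column:
    """Last write wins: keep whichever column has the higher timestamp."""
    return a if a.timestamp >= b.timestamp else b


old = Column("Fred:role", "coder", timestamp=1_000_000)
deleted = Column("Fred:role", None, timestamp=2_000_000)
winner = newest(old, deleted)  # the later tombstone shadows the older value
```

Because a delete is just a newer column with no value, no read is needed before the write.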

CASSANDRA

Highly Available Distributed Columnar Datastore that’s:

• Near-linearly scalable

• Fault tolerant, no master

• Tunable consistency

• Performant, especially for writes – don’t read before write

GETTING ON THE SCALE

Phase 1 Install Cassandra

Phase 2

Phase 3 Scale!

CQL != SQL

SQL / CQL STATEMENT FLOW

• SQL execution is complex

• CQL execution is relatively simple, hence the tiny subset of syntax

• Much of CQL query complexity lies in deciding which node(s) to read from, write to, or confirm with

• So denormalize!

Relational Database: SQL Statement → Syntax/Semantic Check → Query Plan & Optimization → Query Execution → Data Store → Result

Cassandra: CQL Statement → Syntax/Semantic Check → Query Execution → Data Store → Result

CQL != SQL

SELECT <col1>, <col2>, …
FROM <table>
LEFT JOIN <table2>…
WHERE <where-clause>
GROUP BY <colx>
HAVING …
ORDER BY <order-clause>

SEVERELY LIMITED

• CQL syntax is a small subset of SQL

mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/

SO WHY THE LIMITATIONS?

If you think of Cassandra as a relational database, it’s hard to understand:

• what is easy

• what is hard

• what is impossible

“Language serves not only to express thought but to make possible thoughts which could not exist without it.”

— Bertrand Russell

THE DISTORTION OF CQL

• Broken mental model hinders optimal modeling

• CQL falsely implies a relational data model and the design patterns that go with it

• To model Cassandra well, know the underlying data structure

DATA MODELING IN SQL (NO SHARDING)

1. What are the data?

2. What is the normalized data model?

… months pass …

3. How are the data going to be queried?

4. Optimize any slow areas and/or bottlenecks

• Add indexes, memcached/redis, sphinx/solr/elasticsearch, etc

DATA MODELING IN CQL

1. What are the data?

2. What read-queries are needed?

3. How to denormalize during writes?

• on initial write, or use external tools to make this sane

(Some) “premature” optimization is inherent and unavoidable
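Step 3 above (denormalizing during writes) can be sketched in Python with plain dictionaries standing in for tables. The table names `employees_by_company` and `employees_by_role` and the `write_employee` helper are hypothetical; the point is only that each read query gets its own table, and every write goes to all of them.

```python
from collections import defaultdict

# One in-memory "table" per read query (names are hypothetical).
employees_by_company = defaultdict(dict)  # query: all employees of a company
employees_by_role = defaultdict(dict)     # query: all employees with a role


def write_employee(company: str, name: str, age: int, role: str) -> None:
    """Write the same record into every table that serves a read query."""
    record = {"company": company, "name": name, "age": age, "role": role}
    employees_by_company[company][name] = record
    employees_by_role[role][(company, name)] = record


write_employee("Foo, inc", "Fred", 31, "coder")
write_employee("Foo, inc", "Sara", 39, "boss")

# Each read is now a single lookup in the table built for it.
bosses = sorted(employees_by_role["boss"])
foo_staff = sorted(employees_by_company["Foo, inc"])
```

The cost is paid at write time (two writes per employee here), which is the trade the slide describes.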

DATA ECOSYSTEM

To fully (and efficiently) enable everything you are used to in SQL, you must rely on the big(ish) data ecosystem:

• ElasticSearch, Solr, Sphinx

• Redis, Memcached

• Spark

• Spark Streaming or Storm (and Kafka)

COMPLEXITY OF INITIAL DATA MODEL

• Modeling with Relational DB

• Items & their relationships

• Modeling with Cassandra

• Items & their relationships

• How/Where they are stored (sharding and hot spots)

• What data we want to read

• How (and how often) we write data into those models

CAN OPTIMIZE LATER

TO MODEL, OPEN THE BLACK BOX

• The goal of a good black box is that you can do a lot without knowing much about what’s inside

• CQL DOES NOT allow you to ignore what’s inside Cassandra

IN THE BEGINNING: THRIFT

• Around Cassandra version 0.8, Thrift started getting replaced with CQL

• Thrift was too low-level, but its interface mapped closely to the underlying Cassandra data structure

THRIFT & CQL TERMINOLOGY

THRIFT                                          | CQL
------------------------------------------------------------
COLUMN FAMILY                                   | TABLE
ROW                                             | PARTITION
COLUMN                                          | CELL
[CELL NAME COMPONENT OR VALUE]                  | COLUMN
[GROUP OF CELLS WITH SHARED COMPONENT PREFIXES] | ROW

www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

CQL TABLE

CREATE TABLE employees (
  company text,
  name text,
  age int,
  role text,
  PRIMARY KEY ((company), name)
);

AND CQL SELECT

> SELECT * FROM employees;

VS HOW IT’S STORED

employees = {
  "Foo, inc" : {
    "Fred:age" : 31,
    "Fred:role" : "coder",
    "Sara:age" : 39,
    "Sara:role" : "boss"
  },
  "BarCo" : {
    "Bill:age" : 50,
    "Bill:role" : "SQL guru",
    "Jane:age" : 20,
    "Jane:role" : "hotshot"
  }
}
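To make the mapping from the CQL view to the stored layout concrete, here is a minimal Python sketch that rebuilds the dictionary above from the rows a SELECT would return. The `to_storage` helper is a hypothetical name, and the string-joined cell names are a simplification of the real composite encoding.

```python
from collections import defaultdict

# Rows as a CQL SELECT would show them: (company, name, age, role).
rows = [
    ("Foo, inc", "Fred", 31, "coder"),
    ("Foo, inc", "Sara", 39, "boss"),
    ("BarCo", "Bill", 50, "SQL guru"),
    ("BarCo", "Jane", 20, "hotshot"),
]


def to_storage(rows):
    """Group rows by partition key and prefix each cell with the clustering key."""
    stored = defaultdict(dict)
    for company, name, age, role in rows:
        partition = stored[company]     # partition key: company
        partition[f"{name}:age"] = age  # clustering key: name
        partition[f"{name}:role"] = role
    # Cells inside a partition are kept sorted by cell name.
    return {key: dict(sorted(cells.items())) for key, cells in stored.items()}


employees = to_storage(rows)
```

Four CQL rows become two partitions with four cells each, which is why the partition key choice matters so much.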

WIDE PARTITIONS (FORMERLY WIDE “ROWS”)

www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 51)

www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

• Based on Thrift “rows”, so actually wide “partitions”

• Stored column names are the clustering key values with the CQL column name as a suffix (e.g. "Fred:age")

• Up to 2 billion cells per partition (but don’t do this)

SETS, MAPS, AND LISTS: OH MY

• Sets, maps, and lists are still stored at the column level

• This enables schemaless data, but can result in long column names

www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 52-71)
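As a rough illustration of why collections lead to long column names, the Python sketch below flattens a set and a map into individual cells. It assumes the layout described in the slides referenced above (element or key folded into the cell name); the `collection_cells` helper and the `skills` and `phones` columns are hypothetical, and lists are left out.

```python
def collection_cells(row_key: str, column: str, value) -> dict:
    """Flatten a set or map into individual cells, one per element."""
    prefix = f"{row_key}:{column}"
    if isinstance(value, (set, frozenset)):
        # Set elements live in the cell name; the cell value is empty.
        return {f"{prefix}:{element}": "" for element in sorted(value)}
    if isinstance(value, dict):
        # Map keys live in the cell name; map values are the cell values.
        return {f"{prefix}:{key}": value[key] for key in sorted(value)}
    raise TypeError("only sets and maps are sketched here")


skills = collection_cells("Fred", "skills", {"cql", "python"})
phones = collection_cells("Fred", "phones", {"home": "555-0100"})
```

Every element adds one more cell to the partition, so large collections widen the partition just like extra clustering rows do.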

PARTITION AND CLUSTERING KEY

CQL Where-Clause variants

• None (kind of)

• key1 & key2

• key1 & key2 & key3

• key1 & key2 & key3 & key4

company  | year | month | day | employee | reason
-------------------------------------------------------
Foo, Inc | 2014 |    11 |  27 | Fred     | Thanksgiving
Foo, Inc | 2014 |    12 |  25 | Fred     | Christmas
Foo, Inc | 2014 |    12 |  25 | Sara     | Christmas
Foo, Inc | 2014 |    12 |  26 | Sara     | Boxing day

CREATE TABLE breaks (
  company text,
  year int,
  month int,
  day int,
  employee text,
  reason text,
  PRIMARY KEY ((company, year), month, day, employee)
);
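The where-clause variants listed above follow one rule: name the whole partition key, then clustering columns from left to right without gaps. A minimal Python sketch of that rule for the `breaks` table follows; `is_supported_where` is a hypothetical helper that only considers equality restrictions and ignores secondary indexes and ALLOW FILTERING.

```python
PARTITION_KEY = ("company", "year")
CLUSTERING_KEY = ("month", "day", "employee")


def is_supported_where(columns) -> bool:
    """Return True if equality restrictions on `columns` form a valid key prefix."""
    columns = set(columns)
    if not columns:
        return True  # no WHERE clause: a full scan, the "kind of" case
    if not set(PARTITION_KEY) <= columns:
        return False  # every partition key column is required
    rest = columns - set(PARTITION_KEY)
    # Clustering columns must be used left to right without gaps.
    return rest == set(CLUSTERING_KEY[: len(rest)])


ok = is_supported_where(["company", "year", "month"])
skips_month = is_supported_where(["company", "year", "day"])
```

Restricting `day` without `month` fails because there is no contiguous run of cells to read inside the partition.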

COMPOSITE KEYS

CREATE TABLE breaks (
  company text,
  year int,
  month int,
  day int,
  employee text,
  reason text,
  PRIMARY KEY ((company, year), month, day, employee)
);

breaks = {
  "Foo, Inc:2014" : {
    "11" : {
      "27" : { "Fred:reason" : "Thanksgiving" }
    },
    "12" : {
      "25" : { "Fred:reason" : "Christmas", "Sara:reason" : "Christmas" },
      "26" : { "Sara:reason" : "Boxing day" }
    }
  }
}

THEN WHAT IS EASY?

With a dictionary of ordered dictionaries:

• Grabbing the data (or subset) from a partition-key

• Getting a slice of data (uses linear search) based on a partition-key

breaks = {
  "Foo, Inc:2014" : {
    "11" : {
      "27" : { "Fred:reason" : "Thanksgiving" }
    },
    "12" : {
      "25" : { "Fred:reason" : "Christmas", "Sara:reason" : "Christmas" },
      "26" : { "Sara:reason" : "Boxing day" }
    }
  }
}
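The two easy operations can be sketched in Python against the same data. Here the nested dictionary is flattened to one ordered dictionary per partition, keyed by the clustering tuple (month, day, employee); `read_slice` is a hypothetical helper, not a driver call.

```python
# One partition, keyed by (month, day, employee) and kept in sorted order.
partition = {
    (12, 25, "Sara"): "Christmas",
    (11, 27, "Fred"): "Thanksgiving",
    (12, 26, "Sara"): "Boxing day",
    (12, 25, "Fred"): "Christmas",
}
breaks = {("Foo, Inc", 2014): dict(sorted(partition.items()))}


def read_slice(table, partition_key, prefix=()):
    """Return the cells of one partition whose clustering key starts with `prefix`."""
    cells = table.get(partition_key, {})  # dictionary lookup by partition key
    return [
        (clustering, reason)
        for clustering, reason in cells.items()  # linear walk over ordered cells
        if clustering[: len(prefix)] == prefix
    ]


december = read_slice(breaks, ("Foo, Inc", 2014), prefix=(12,))
christmas = read_slice(breaks, ("Foo, Inc", 2014), prefix=(12, 25))
```

Both reads touch a single partition, so they stay cheap no matter how many other partitions exist.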

WHAT IS HARD? (I.E. COMMON SQL PATTERNS)

• Unique and Group by

• Ordered

• Inverted Index

GROUP-BY & COUNTERS

• Often group-by is used for counting

• Use counter columns or other tools (e.g. elasticsearch)

CREATE TABLE employee_break_counts (
  company text,
  employee text,
  break_counts counter,
  PRIMARY KEY ((company), employee)
);
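The idea behind the counter table is to count at write time rather than group at read time. A minimal Python sketch of that pattern, with `collections.Counter` standing in for the counter column and `record_break` as a hypothetical write hook:

```python
from collections import Counter

# (company, employee) -> break count, maintained at write time.
employee_break_counts = Counter()


def record_break(company: str, employee: str) -> None:
    """Increment the counter when a break is written, instead of grouping on read."""
    employee_break_counts[(company, employee)] += 1


# The four breaks from the example table: two for Fred, two for Sara.
for employee in ["Fred", "Fred", "Sara", "Sara"]:
    record_break("Foo, Inc", employee)

fred_breaks = employee_break_counts[("Foo, Inc", "Fred")]
```

Reading a count is then a single key lookup, with no scan over the underlying break records.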

ORDERING OR INVERTED-INDEX

• Redundant table, but ordered by new “column”

• Depending on needs, this can store the order-field and lookup key OR some/all of the other data in that table

• If a read will generate more than a few subsequent child reads, then some/all of the other data should be included

CREATE TABLE employees_by_age (
  company text,
  id int,
  age int,
  name text,
  role text,
  PRIMARY KEY ((company), age, id)
);
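A redundant table ordered by age can be pictured as a sorted list per partition. The Python sketch below keeps each company's entries sorted by (age, id) and answers an age-range query by slicing; the helper names and the employee ids are hypothetical.

```python
import bisect

# company -> list of (age, id, name, role), kept sorted by (age, id).
employees_by_age = {}


def insert_employee(company, emp_id, age, name, role):
    """Insert into the redundant table, keeping each partition ordered by age."""
    partition = employees_by_age.setdefault(company, [])
    bisect.insort(partition, (age, emp_id, name, role))


def employees_between(company, min_age, max_age):
    """Range read over the clustering order: min_age <= age <= max_age."""
    partition = employees_by_age.get(company, [])
    lo = bisect.bisect_left(partition, (min_age,))
    hi = bisect.bisect_left(partition, (max_age + 1,))
    return [name for _age, _id, name, _role in partition[lo:hi]]


insert_employee("Foo, inc", 2, 39, "Sara", "boss")
insert_employee("Foo, inc", 1, 31, "Fred", "coder")

thirties = employees_between("Foo, inc", 30, 39)
```

Storing name and role alongside the key avoids a second lookup per result, which is the trade-off the bullets describe.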

C* MODELING ANTI-PATTERNS

C* GUIDELINES           | SQL GUIDELINES
-------------------------------------------------------
WRITES ARE CHEAP/FAST   | MINIMIZE WRITES
STORAGE IS CHEAP        | MINIMIZE DUPLICATION OF DATA
PARTITIONS ARE INHERENT | SHARD AT YOUR OWN RISK
STRICT COMPOSITE KEYS   | FLEXIBLE SECONDARY INDEXES
SIMPLE QUERIES          | COMPLEX QUERIES

C* MODELING PATTERNS

C* GUIDELINES           | C* PATTERNS
-------------------------------------------------------
WRITES ARE CHEAP/FAST   | DUPLICATE YOUR DATA
STORAGE IS CHEAP        | DUPLICATE YOUR DATA
PARTITIONS ARE INHERENT | AVOID HOT SPOTS
STRICT COMPOSITE KEYS   | DESIGN TABLES AROUND QUERIES
SIMPLE QUERIES          | DESIGN TABLES AROUND QUERIES
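To illustrate the hot-spot guideline, here is a small Python sketch of how a partition key decides where data lands. It assumes a hypothetical four-node cluster and uses md5 purely as a stand-in for Cassandra's partitioner; `node_for` is not a real driver function.

```python
import hashlib

NODES = 4  # hypothetical cluster size


def node_for(partition_key: tuple) -> int:
    """Map a partition key to a node; md5 stands in for Cassandra's partitioner."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


# Partitioning by company alone: every write for one company hits one partition.
by_company = {node_for(("Foo, Inc",)) for _year in range(2000, 2016)}

# Partitioning by (company, year): the same writes go to 16 separate partitions,
# which the hash can place on different nodes.
by_company_year = {node_for(("Foo, Inc", year)) for year in range(2000, 2016)}
```

This is the reasoning behind the `(company, year)` partition key in the `breaks` table: a compound key keeps any single partition from absorbing all of a busy company's traffic.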

Questions?

Mike Biglan mike@twentyideas.com

@twentyideas

Elijah Hamovitz elijah@code.org
