DATA MODELING CASSANDRA USING CQL3
MIKE BIGLAN & ELIJAH HAMOVITZ
ABOUT US
Mike Biglan, M.S.
• Twenty Ideas, Inc – twentyideas.com
• Analytic Spot – spo.td
Elijah Hamovitz
• code.org
• Analytic Spot – spo.td
SIMPLIFIED REPRESENTATION OF REALITY
A DATA MODEL MODELS DATA
• Framework to store and organize data
• Models things, their differences, and relationships between them
• Things can be real or virtual
IDEAL DATA MODEL PROPERTIES
• Easy to create
• Easy to interface with
• Quick, flexible querying
• Writing is direct and simple
• Easily understandable
• Scalable: can read, write, and store huge amounts of data safely
SCALE AWAY, SCALE AWAY, SCALE AWAY
• Availability: fault tolerance, redundancy, supports multiple data centers
• Consistency: strong or tunable
• Huge amounts of data (that won’t fit on single server)
• High velocity of incoming and/or accessed data
COLUMNAR DATABASE
• Document-Store (e.g. MongoDB) and Columnar (e.g. Cassandra, HBase, Dynamo) are both “NoSQL”
• But modeling in a Document-Store is quite different from modeling in a Columnar store
• Atomic unit of data storage
• Document-Store: document
• Relational Database: row
• Columnar: column
CASSANDRA COLUMN:
• Name
• Value (or tombstone)
• Timestamp
• TTL
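The four parts of a column listed above can be sketched as a small Python record. This is a simplified illustration, not Cassandra's implementation: the names `Cell`, `is_live`, and `reconcile` are hypothetical, `None` stands in for a tombstone, and the last-write-wins rule (with ties going to the first argument) is a simplification of how timestamps are used to pick a winner.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical model of one stored column: name, value, timestamp, TTL."""
    name: str
    value: Optional[object]      # None stands in for a tombstone (a delete marker)
    timestamp: int               # write time in microseconds
    ttl: Optional[int] = None    # seconds to live; None means no expiry

    def is_tombstone(self) -> bool:
        return self.value is None

    def is_live(self, now: int) -> bool:
        """A cell is readable if it is not a tombstone and has not expired."""
        if self.is_tombstone():
            return False
        if self.ttl is None:
            return True
        return now < self.timestamp + self.ttl * 1_000_000


def reconcile(a: Cell, b: Cell) -> Cell:
    """Last write wins: keep the version with the newer timestamp.

    Ties go to the first argument, which is a simplification for this sketch.
    """
    return a if a.timestamp >= b.timestamp else b
```

A delete is just a newer write whose value is a tombstone, so `reconcile(Cell("Fred:age", 31, 1_000_000), Cell("Fred:age", None, 2_000_000))` returns the tombstone.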
CASSANDRA
Highly Available Distributed Columnar Datastore that’s:
• Near-linearly scalable
• Fault tolerant, no master
• Tunable consistency
• Performant, especially for writes – don’t read before write
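"Tunable consistency" is often reasoned about with a simple overlap rule: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps at least one replica holding the latest write. The sketch below only encodes that arithmetic; the function names are hypothetical and it does not call any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def reads_overlap_writes(replication_factor: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of 3, quorum writes and quorum reads (2 + 2 > 3) overlap, while writing to one replica and reading from one (1 + 1) does not.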
GETTING ON THE SCALE
Phase 1 Install Cassandra
Phase 2
Phase 3 Scale!
CQL != SQL
SQL / CQL STATEMENT FLOW
• SQL execution is complex
• CQL execution is relatively simple, hence only a tiny subset of SQL syntax
• Much of the complexity of a CQL query lies in which node(s) to read from, write to, or confirm with
• So denormalize!
Relational Database: SQL Statement → Syntax/Semantic Check → Query Plan & Optimization → Query Execution (against the Data Store) → Result
Cassandra: CQL Statement → Syntax/Semantic Check → Query Execution (against the Data Store) → Result
CQL != SQL
SELECT <col1>, <col2>, … FROM <table> LEFT JOIN <table2>… WHERE <where-clause> GROUP BY <colx> HAVING … ORDER BY <order-clause>
SEVERELY LIMITED
• CQL syntax is a small subset of SQL
mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/
SO WHY THE LIMITATIONS?
If you think of Cassandra as a relational database, it’s hard to understand:
• what is easy
• what is hard
• what is impossible
“Language serves not only to express thought but to make possible thoughts which could not exist without it.”
— Bertrand Russell
THE DISTORTION OF CQL
• Broken mental model hinders optimal modeling
• CQL falsely implies a relational data model and the design patterns that go with it
• To model Cassandra well, know the underlying data structure
DATA MODELING IN SQL (NO SHARDING)
1. What are the data?
2. What is the normalized data model?
… months pass …
3. How are the data going to be queried?
4. Optimize any slow areas and/or bottlenecks
• Add indexes, memcached/redis, sphinx/solr/elasticsearch, etc
DATA MODELING IN CQL
1. What are the data?
2. What read-queries are needed?
3. How to denormalize during writes?
• on initial write, or use external tools to make this sane
(Some) “premature” optimization is inherent and unavoidable
DATA ECOSYSTEM
To fully (and efficiently) enable everything you are used to in SQL, you must rely on the big(ish) data ecosystem:
• ElasticSearch, Solr, Sphinx
• Redis, Memcached
• Spark
• Spark Streaming or Storm (and Kafka)
COMPLEXITY OF INITIAL DATA MODEL
• Modeling with Relational DB
• Items & their relationships
• Modeling with Cassandra
• Items & their relationships
• How/Where they are stored (sharding and hot spots)
• What data we want to read
• How (and how often) we write data into those models
CAN OPTIMIZE LATER
TO MODEL, OPEN THE BLACK BOX
• The goal of a good black box is that you can do a lot without knowing much about what’s inside
• CQL DOES NOT allow you to ignore what’s inside Cassandra
IN THE BEGINNING: THRIFT
• Around Cassandra version 0.8, Thrift started getting replaced with CQL
• Thrift was too low-level, but its interface mapped closely to the underlying Cassandra data structure
THRIFT & CQL TERMINOLOGY
THRIFT | CQL
COLUMN FAMILY | TABLE
ROW | PARTITION
COLUMN | CELL
[CELL NAME COMPONENT OR VALUE] | COLUMN
[GROUP OF CELLS WITH SHARED COMPONENT PREFIXES] | ROW
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
CQL TABLE
CREATE TABLE employees (
    company text,
    name text,
    age int,
    role text,
    PRIMARY KEY ((company), name)
);
AND CQL SELECT
> SELECT * from employees;
 company  | name | age | role
----------+------+-----+----------
 Foo, inc | Fred |  31 | coder
 Foo, inc | Sara |  39 | boss
 BarCo    | Bill |  50 | SQL guru
 BarCo    | Jane |  20 | hotshot
VS HOW IT’S STORED
employees = {
    "Foo, inc" : {
        "Fred:age"  : 31,
        "Fred:role" : "coder",
        "Sara:age"  : 39,
        "Sara:role" : "boss"
    },
    "BarCo" : {
        "Bill:age"  : 50,
        "Bill:role" : "SQL guru",
        "Jane:age"  : 20,
        "Jane:role" : "hotshot"
    }
}
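The mapping from CQL rows to the stored layout above can be sketched in a few lines of Python. This is an illustrative model only: `to_storage` is a hypothetical helper, it handles a single partition-key column and a single clustering column, and it does not reproduce Cassandra's on-disk ordering or encoding.

```python
def to_storage(rows, partition_key, clustering_key):
    """Group rows by partition key and flatten each row into
    "<clustering value>:<column name>" entries, as in the slide's layout."""
    storage = {}
    for row in rows:
        partition = storage.setdefault(row[partition_key], {})
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue  # key columns become part of the names, not separate values
            partition[f"{row[clustering_key]}:{column}"] = value
    return storage


rows = [
    {"company": "Foo, inc", "name": "Fred", "age": 31, "role": "coder"},
    {"company": "Foo, inc", "name": "Sara", "age": 39, "role": "boss"},
    {"company": "BarCo", "name": "Bill", "age": 50, "role": "SQL guru"},
    {"company": "BarCo", "name": "Jane", "age": 20, "role": "hotshot"},
]

employees = to_storage(rows, partition_key="company", clustering_key="name")
```

Each CQL row disappears as a unit: what remains is one dictionary per partition whose keys carry the clustering value as a prefix.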
WIDE PARTITIONS (FORMERLY WIDE “ROWS”)
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 51)
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
• Based on Thrift “rows”, so actually wide “partitions”
• Stored column names are the clustering key values with the CQL column name as a suffix
• Up to 2 billion cells per partition (but don’t do this)
SETS, MAPS, AND LISTS: OH MY
• Sets, maps, and lists are still stored at the column level
• This enables schemaless-style data, but can result in long column names
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 52-71)
PARTITION AND CLUSTERING KEY
CQL Where-Clause variants
• None (kind of)
• key1 & key2
• key1 & key2 & key3
• key1 & key2 & key3 & key4
 company  | year | month | day | employee | reason
----------+------+-------+-----+----------+--------------
 Foo, Inc | 2014 |    11 |  27 | Fred     | Thanksgiving
 Foo, Inc | 2014 |    12 |  25 | Fred     | Christmas
 Foo, Inc | 2014 |    12 |  25 | Sara     | Christmas
 Foo, Inc | 2014 |    12 |  26 | Sara     | Boxing day
CREATE TABLE breaks (
    company text,
    year int,
    month int,
    day int,
    employee text,
    reason text,
    PRIMARY KEY ((company, year), month, day, employee)
);
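The where-clause variants listed for this table follow one pattern: restrict nothing, or restrict every partition-key column plus a leading prefix of the clustering columns. The Python sketch below checks that pattern. It is a simplification under stated assumptions: `is_direct_lookup` is a hypothetical name, only equality restrictions are considered, and secondary indexes and `ALLOW FILTERING` are ignored.

```python
def is_direct_lookup(restricted, partition_keys, clustering_keys):
    """Return True if equality restrictions on `restricted` columns can be
    answered straight from the primary-key layout."""
    restricted = set(restricted)
    if not restricted:
        return True  # no where-clause: allowed, but it scans everything ("kind of")
    if not set(partition_keys) <= restricted:
        return False  # the full partition key is needed to locate the partition
    remaining = restricted - set(partition_keys)
    # Count how many clustering columns are restricted as an unbroken prefix.
    prefix_len = 0
    for column in clustering_keys:
        if column not in remaining:
            break
        prefix_len += 1
    # Anything left over is a skipped clustering column or a non-key column.
    return len(remaining) == prefix_len


PARTITION_KEYS = ["company", "year"]
CLUSTERING_KEYS = ["month", "day", "employee"]
```

For the `breaks` table, `["company", "year", "month"]` passes, while `["company", "year", "day"]` fails because it skips `month`.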
COMPOSITE KEYS
breaks = {
    "Foo, Inc:2014" : {
        "11" : {
            "27" : { "Fred:reason" : "Thanksgiving" }
        },
        "12" : {
            "25" : {
                "Fred:reason" : "Christmas",
                "Sara:reason" : "Christmas"
            },
            "26" : { "Sara:reason" : "Boxing day" }
        }
    }
}
THEN WHAT IS EASY?
With a dictionary of ordered dictionaries:
• Grabbing the data (or a subset) for a partition key
• Getting a slice of data (uses linear search) based on a partition key
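The two easy operations can be sketched against a toy version of the `breaks` data. This is an assumption-laden model, not Cassandra's storage engine: each partition is a Python list kept sorted by its clustering tuple `(month, day, employee)`, and the hypothetical `read_slice` walks it linearly, stopping as soon as it has passed the requested prefix.

```python
breaks = {
    ("Foo, Inc", 2014): [
        ((11, 27, "Fred"), "Thanksgiving"),
        ((12, 25, "Fred"), "Christmas"),
        ((12, 25, "Sara"), "Christmas"),
        ((12, 26, "Sara"), "Boxing day"),
    ],
}


def read_slice(table, partition_key, prefix=()):
    """Return entries of one partition whose clustering key starts with `prefix`.

    An empty prefix returns the whole partition.
    """
    results = []
    for clustering, value in table.get(partition_key, []):
        head = clustering[:len(prefix)]
        if head == prefix:
            results.append((clustering, value))
        elif head > prefix:
            break  # entries are ordered, so nothing later can match
    return results
```

`read_slice(breaks, ("Foo, Inc", 2014), (12, 25))` returns the two Christmas entries; anything that cannot be phrased as "one partition, one contiguous range" has no equally cheap path.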
WHAT IS HARD? (I.E. COMMON SQL PATTERNS)
• Unique and Group by
• Ordered
• Inverted Index
GROUP-BY & COUNTERS
• Often group-by is used for counting
• Use counter columns or other tools (e.g. elasticsearch)
CREATE TABLE employee_break_counts (
    company text,
    employee text,
    break_counts counter,
    PRIMARY KEY ((company), employee)
);
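The idea behind the counter table is to maintain the count at write time instead of computing it with GROUP BY at read time. A minimal in-memory sketch of that pattern follows; `record_break` and `break_counts_for` are hypothetical helpers, and a Python `Counter` stands in for the counter column.

```python
from collections import Counter

# (company, employee) -> number of breaks, mirroring the table's primary key
employee_break_counts = Counter()


def record_break(company, employee):
    """Bump the counter whenever a break is written.

    In CQL this would be a counter update along the lines of:
    UPDATE employee_break_counts SET break_counts = break_counts + 1
    WHERE company = ? AND employee = ?;
    """
    employee_break_counts[(company, employee)] += 1


def break_counts_for(company):
    """Read all counters in one company's partition."""
    return {
        employee: count
        for (co, employee), count in employee_break_counts.items()
        if co == company
    }


for company, employee in [
    ("Foo, Inc", "Fred"),
    ("Foo, Inc", "Fred"),
    ("Foo, Inc", "Sara"),
    ("Foo, Inc", "Sara"),
]:
    record_break(company, employee)
```

Reading the per-employee totals is then a single-partition lookup rather than an aggregation over every break row.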
ORDERING OR INVERTED-INDEX
• Redundant table, but ordered by new “column”
• Depending on needs, this can store the order-field and lookup key OR some/all of the other data in that table
• If a read will generate more than a few subsequent child reads then some/all the other data should be included
CREATE TABLE employees_by_age (
    company text,
    id int,
    age int,
    name text,
    role text,
    PRIMARY KEY ((company), age, id)
);
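The redundant-table pattern amounts to writing each record twice: once keyed for lookup, once keyed in the order you want to read it. The sketch below models that with two in-memory structures. The names `add_employee` and `by_age` are hypothetical, the employee ids are made up for illustration, and the second "table" carries the full record so that an ordered read needs no follow-up lookups.

```python
import bisect

employees = {}         # (company, id) -> record, for direct lookup
employees_by_age = {}  # company -> list of (age, id, name, role), kept sorted


def add_employee(company, emp_id, name, age, role):
    """Denormalize on write: store the record in both layouts."""
    employees[(company, emp_id)] = {"name": name, "age": age, "role": role}
    partition = employees_by_age.setdefault(company, [])
    bisect.insort(partition, (age, emp_id, name, role))  # clustering order: age, then id


def by_age(company, min_age=None):
    """Read one company's employees in age order, optionally from min_age upward."""
    partition = employees_by_age.get(company, [])
    start = 0 if min_age is None else bisect.bisect_left(partition, (min_age,))
    return partition[start:]


add_employee("Foo, inc", 1, "Fred", 31, "coder")
add_employee("Foo, inc", 2, "Sara", 39, "boss")
add_employee("BarCo", 3, "Bill", 50, "SQL guru")
add_employee("BarCo", 4, "Jane", 20, "hotshot")
```

The cost is an extra write per record, which fits the guideline that writes and storage are cheap while reads should be simple.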
C* MODELING ANTI-PATTERNS
C* GUIDELINES | SQL GUIDELINES
WRITES ARE CHEAP/FAST | MINIMIZE WRITES
STORAGE IS CHEAP | MINIMIZE DUPLICATION OF DATA
PARTITIONS ARE INHERENT | SHARD AT YOUR OWN RISK
STRICT COMPOSITE KEYS | FLEXIBLE SECONDARY INDEXES
SIMPLE QUERIES | COMPLEX QUERIES
C* MODELING PATTERNS
C* GUIDELINES | C* PATTERNS
WRITES ARE CHEAP/FAST, STORAGE IS CHEAP | DUPLICATE YOUR DATA
PARTITIONS ARE INHERENT | AVOID HOT SPOTS
STRICT COMPOSITE KEYS, SIMPLE QUERIES | DESIGN TABLES AROUND QUERIES
Questions?
Mike Biglan mike@twentyideas.com
@twentyideas
Elijah Hamovitz elijah@code.org