8/17/2019 CassandraTraining v3.3.4
Developer track
By Nirmallya Mukherjee
Index
• Part 1 - Warmup
• Part 2 - Introducing C*
• Part 3 - Setup & Installation
• Part 4 - Prelude to modelling
• Part 5 - Write and Read paths
• Part 6 - Modelling
• Part 7 - Applications with C*
• Part 8 - Administration
• Part 9 - Future releases
Part 1
Warm up!
What's the challenge?
• Consumer oriented applications
• Users in millions
• Data produced in hundreds/thousands of TB
• Hardware failure happens no matter what
• Is consumer growth predictable? Then how to do capacity planning for infrastructure?
Databases are doing just fine!
What can it do?
• Using 330 Google Compute Engine virtual machines, 300 1TB Persistent Disk volumes, Debian Linux, and Datastax Cassandra 2.2, we were able to construct a setup that can:
– sustain one million writes per second to Cassandra with a median latency of 10.3 ms and 95% completing under 23 ms
– sustain a loss of ⅓ of the instances and volumes and still maintain the 1 million writes per second (though with higher latency)
• Read more here
– http://googlecloudplatform.blogspot.in/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html
Who uses it?
• eBay
• Netflix
• Comcast
• Safeway
• Sky
• CERN
• Travelocity
• Spotify
• Intuit
• GE
Gartner magic quadrant
Types of NoSQL
• Wide Row Store - also known as wide-column stores, these databases store data in rows and users are able to perform some query operations via column-based access. A wide-row store offers very high performance and a highly scalable architecture. Examples include: Cassandra, HBase, and Google BigTable.
• Key/Value - these NoSQL databases are some of the least complex, as all of the data within consists of an indexed key and a value. Examples include Amazon DynamoDB, Riak, and Oracle NoSQL Database.
• Document - expands on the basic idea of key-value stores where "documents" are more complex, in that they contain data and each document is assigned a unique key, which is used to retrieve the document. These are designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Examples include MongoDB and CouchDB.
• Graph - designed for data whose relationships are well represented as a graph structure and has elements that are interconnected, with an undetermined number of relationships between them. Examples include: Neo4J and TitanDB.
SQL and NoSQL
SQL (RDBMS):
1. Database; relational, strict models
2. Data in rows, pre-defined schema, SQL supports joins
3. Vertically scalable
4. Random access pattern support
5. Good fit for online transactional systems
6. Master-slave model
7. Periodic data replication as read-only copies in slaves
8. Availability model includes a slight downtime in case of outages

NoSQL:
1. Datastore; distributed & non-relational, flexible schema
2. Data in key-value pairs, flexible schema, no joins
3. Horizontally scalable
4. Designed for access patterns
5. Good for optimized read-based systems or high-availability write systems
6. C* is a masterless model
7. Seamless data replication
8. Masterless allows for no downtime
Applicability of SQL / NoSQL
• No one single rule; very use case specific
– High read (not write) oriented
– High write (not read) oriented
– Document based storage
– KV based storage
• Important to know the access patterns upfront
– Focus on modeling specific to the use case
• Very difficult to fix an improper model later on, unlike a database
CAP Theorem
• ACID
– Atomicity - atomicity requires that each transaction be "all or nothing"
– Consistency - the consistency property ensures that any transaction will bring the database from one valid state to another
– Isolation - the isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially
– Durability - durability means that once a transaction has been committed, it will remain so under all circumstances
• What is it? Can I have all?
– Consistency - all nodes have the same data at all times
– Availability - every request must receive a response
– Partition tolerance - should run even if there is a partial failure
• CAP leads to BASE theory
– Basically Available, Soft state, Eventual consistency
AP systems, what about C?
Consistency / Availability / Partition tolerance - pick any two:
• RDBMS - consistency + availability
• Cassandra - availability + partition tolerance
• Google Big Table, HBase, MongoDB - consistency + partition tolerance

How does this impact architecture?
1. Eventual consistency
2. Some application logic is needed
Good fit use cases
• Store high volume stream data
– Sensor data
– Logs
– Events
– Online usage impressions
• Scenarios where the data can not be recreated; it is lost if not saved
• No need of rollups, aggregations
– Another system needs to do this
Let's brainstorm your use case
Part 2
Introducing Cassandra
Database or Datastore
• Where did the name "Cassandra" come from?
– Some interesting reference from Greek mythology
– What's the significance of the logo?
• Semantics - no real definition, both are ok
• I would like to call it "Datastore"
• A bit of un-learning is required to get a hold of the "different" ways
• Inspiration from (think of it as object persistence)
– Google BigTable
– Dynamo from Amazon
• A few interesting observations about a datastore
– Put = insert/update (Upsert?)
– Get = select
– Delete = delete
• Think HashMaps, Sorted Maps ...
• Storage mechanism
– B-tree vs LSM tree
• Row oriented datastore
• Biggest of all - no SPOF (masterless)
Masterless architecture
Client 2
R1
R2
R3
Client 1
Client 3
Why large malls have more than one entrance?
Seed node(s)
• It is like the recruitment department
• Helps a new node come on board
• More than one seed is a very good practice
• Per rack based could be good
• Starts the gossip process in new nodes
• If you have more than one DC then include seed nodes from each DC
• No other purpose
Virtual node, Ring architecture
• C* operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring
• In the early days the nodes had a range of tokens
• This had to be done at the time of setting up C*
[Diagram: ring without VNodes - six nodes, each owning one token range (A-F) and holding replicas of the preceding ranges]
Few peers helping to bring a downed node up; increased load on the selected peers.
What will happen if a node with elevated utilization goes down?
Gossip
• I think we all know what this means in English! ☺
• It is a communication mechanism among nodes
• Runs every second
• Passing state information
– Available / down / bootstrapping
– Load it is under
– Helps in detecting failed nodes
• Up to 3 other nodes talk to one another
• A way any node learns about the cluster and other nodes
• Seed nodes play a critical role to start the gossip among nodes
• It is versioned; older states are erased
• Once gossip has started on a node, the stats about other nodes are stored locally so that a re-start can be fast
• Local gossip info can be cleaned if needed
Partitioner
• How do you all organize your cubicle/office space? Where will your stuff be? Assume the floor you are on is about 25,000 sq ft.
• A partitioner determines how to distribute the data across the nodes in the cluster
• Murmur3Partitioner - recommended for most purposes (default strategy as well)
• Once a partitioner is set for a cluster, it cannot be changed without a data reload
• The primary key is hashed using the Murmur3 hash to determine which node it needs to go to
• There is nothing called the "master replica"; all replicas are identical
• The token range it can produce is -2^63 to 2^63-1
• Wikipedia details
– MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup. It was created by Austin Appleby in 2008, and exists in a number of variants, all of which have been released into the public domain. When compared to other popular hash functions, MurmurHash performed well in a random distribution of regular keys.
Replication
• How many of you have multiple copies of the most critical files - pwd file, IT return confirmation etc - on external drives?
• Replication determines how many copies of the data will be maintained in the cluster across nodes
– Replication factor
• There is no single magic formula to determine the correct number
• The widely accepted number is 3, but your use case can be different
• This has an impact on the number of nodes in the cluster - cannot have a high replication factor with few nodes
Partitioner and Replication
• Position of the first copy of the data is determined by the partitioner, and copies are placed by walking the cluster in a clockwise direction
• For example, consider a simple 4 node cluster where all of the partition keys managed by the cluster were numbers in the range of 0 to 100.
• Each node is assigned a token that represents a point in this range. In this simple example, the token values are 0, 25, 50, and 75.
• The first node, the one with token 0, is responsible for the wrapping range (76-0).
• The node with the lowest token also accepts partition keys less than the lowest token and more than the highest token.
• Once the first node is determined, the replicas will be placed on the nodes in clockwise order.

Tokens: 0, 25, 50, 75
Data Range 1 (1-25), Data Range 2 (26-50), Data Range 3 (51-75), Data Range 4 (76 + wrapping range)
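The clockwise walk above can be sketched as a toy model. This is an illustration only - the class and method names are made up, and real Cassandra hashes keys with Murmur3 rather than comparing raw key tokens:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of token-range ownership and clockwise replica placement.
public class RingPlacement {
    // Sorted tokens owned by nodes 1..4, as in the example: 0, 25, 50, 75.
    static final int[] TOKENS = {0, 25, 50, 75};

    // Index of the node responsible for a key token: the node with the
    // smallest token >= key; keys above the highest token wrap around
    // to the node with the lowest token (the "wrapping range").
    static int primaryNode(int keyToken) {
        for (int i = 0; i < TOKENS.length; i++) {
            if (keyToken <= TOKENS[i]) return i;
        }
        return 0; // wrapping range: 76..100 land on the token-0 node
    }

    // Replicas = primary node plus the next (rf - 1) nodes clockwise.
    static List<Integer> replicaNodes(int keyToken, int rf) {
        List<Integer> replicas = new ArrayList<>();
        int start = primaryNode(keyToken);
        for (int i = 0; i < rf; i++) {
            replicas.add((start + i) % TOKENS.length);
        }
        return replicas;
    }
}
```

For example, a key token of 80 falls in the wrapping range owned by the token-0 node, and with RF=3 the copies land on that node and its two clockwise neighbours.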
Snitch
• What can you typically find at the entrance of a very large mall? What's the need for a "Can I help you?" desk?
• Objective is to group machines into racks and data centers to set up a DC/DR
• Informs the partitioner about the rack and DC locations - determines which nodes the replicas need to go to
• Identifies which DC and rack a node belongs to
• Routes requests efficiently (replication)
• Tries not to have more than one replica in the same rack (may not be a physical grouping)
• Allows for a truly distributed fault tolerant cluster
• Seamless synchronization of data across DC/DR
• To be selected at the time of C* installation in the cassandra.yaml file
• Changing a snitch is a long drawn out process, especially if data exists in the keyspace - you have to run a full repair in your cluster
Snitch
In addition a "Google cloud snitch" is also available for GCE
Mixed cloud = cloud + local nodes
8/17/2019 CassandraTraining v3.3.4
27/183
27
Snitch - property file
A bit more about the property file snitch
• All nodes in the system must have the same cassandra-topology.properties file
• It is very useful if you cannot ensure the IP octets (this prevents the use of the Rack inferring snitch, where every octet means something - 10...)
• Will give you full control of your cluster and the flexibility of assigning any IP to a node (once assigned remains assigned)
• It can be hard to maintain information about a large cluster across multiple DCs, but it is worth it given the benefits
Snitch - property file example
# Data Center One
19.82.20.3 DC1:RAC1
19.83.123.233 DC1:RAC1
19.84.193.101 DC1:RAC1
19.85.13.6 DC1:RAC1
19.23.20.87 DC1:RAC2
19.15.16.200 DC1:RAC2
19.24.102.103 DC1:RAC2

# Data Center Two
29.51.8.2 DC2:RAC1
29.50.10.21 DC2:RAC1
29.50.29.14 DC2:RAC1
53.25.29.124 DC2:RAC2
53.34.20.223 DC2:RAC2
53.14.14.209 DC2:RAC2
The names like "DC1" and "DC2" are very important as we will see ...
Also, in production use GossipingPropertyFileSnitch (non cloud; for cloud use the appropriate snitch for that cloud), where you keep the entries for the rack+DC (cassandra-rackdc.properties) and C* will gossip this across the cluster automatically. The cassandra-topology.properties file is used as a fallback.
Commodity vs Specialized hardware
Specialized hardware (vertical scalability):
1. Large CAPEX
2. Wasted/idle resource
3. Failure takes out a large chunk
4. Expensive redundancy model
5. One shoe fits all model
6. Too much co-existence

vs commodity hardware (horizontal scalability):
1. Low CAPEX (rent on IaaS)
2. Maximum resource utilization
3. Failure takes out a small chunk
4. Inexpensive redundancy
5. Specific h/w for specific tasks
6. Very little to no co-existence

This is what Google, LinkedIn, Facebook do. The norm is now being adopted by large corporations as well.
"Just in time" expansion; stay in tune with the load. No need to build ahead of time in anticipation.
Elastic Linear Scalability
n1
n2
n3
n4
n5
n6
n7
n8
8 nodes
n1
n2
n3
n4
n5
n6
n7
n8
n9
n10
10 nodes
1. Need more capacity? Add servers
2. Auto re-distribution of the cluster
3. Results in linear performance increase
Debate - Failover design strategy
Scenario
• Maintain a small failover pool in the data center, and then simply add multiple nodes if a failure occurs
• Is this "Just in time" strategy correct?

Analysis
• The additional overhead of bootstrapping the new nodes will actually reduce capacity at a critical point when the capacity is needed most, perhaps creating a domino effect leading to a severe outage
Debate - what if all machines are not similar?
n1
n2
n3
n4
n5
n6
State your observations based on typical machine characteristics
1. CPU, memory
2. Storage space (consider externally mounted space as local)
3. Network connectivity
4. Query performance and "predictability" of queries
Deployment - 4 dimensions
1. Node - one C* instance
2. Rack - logical set of nodes
3. DC-DR - logical set of racks
4. Cluster - nodes that map to a single token ring (can be across DCs), next slide ...
Distributed deployment
//These are the C* nodes that the DAO will look to connect to
public static final String[] cassandraNodes = { "10.24.37.1", "10.24.37.2", "10.24.37.3" };

public enum clusterName { events, counter }

//You can build with policies like withRetryPolicy(), .withLoadBalancingPolicy(Round Robin)
cluster = Cluster.builder().addContactPoints(Config.cassandraNodes).build();
session1 = cluster.connect(Constants.clusterName.events.toString());
RACK 1, RACK 2, RACK 3
RACK A, RACK B
Region 1 - DC Region 2 - DR
Cluster: SKL-Platform
A bit more about multi DC setup
• When using NetworkTopologyStrategy, you set the number of replicas per data center
• For example, if you set the replicas as 4 (2 in each DC) then this is what you get ...

ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 2};

Then run nodetool repair on each affected node, wait for it to come up, and move on to the next node.
R1
R3
R2
R4
Data Center1
Data Center2Rack1
Rack2
Node1
Node3
Node5
Node2
Node4
Node6
R1
R2
Data Center1
Rack1
Rack2
Node7
Node9
Node11
Node8
Node10
Node12
R3
R4
Data Center2
Debate - replication setting
• Let's assume we have the cluster as follows
– DC1 having 5 nodes
– DC2 having 5 nodes
• What will happen if the following statements are executed (assume no syntax errors)?
– create keyspace if not exists test with replication = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 10 };
– create table test ( col1 int, col2 varchar, primary key(col1));
– insert into test (col1, col2) values (1, 'Sir Dennis Ritchie');
– Will this succeed?
Regions and Zones
• Many IaaS providers (Google / Amazon) have multiple data centers around the world called "Regions"
• Each region is further divided into availability "zones"
• Depending on your IaaS, or even your own DC, you must plan the topology in such a way that
– A client request should not travel over the network too long to reach C*
– An outage should not take out a significant chunk of your cluster
– Replication across DCs should not add to the latency for the client
Part 3
Setup & Installation
Let's get going! Installation
• Acquiring and installing C*
– Understand the key components of cassandra.yaml
• Configuration and installation structure
– Data directory
– Commit log directory
– Cache directory
• System log configuration
– cassandra-env.ps1 (change the log file position)
– Default is $logdir = "$env:CASSANDRA_HOME\logs"
– In logback.xml change from .zip to .gz
• ${cassandra.logdir}/system.log.%i.gz
• Creating a cluster
• Adding nodes to a cluster
• Multiple seed nodes, ring management
• Sizing is specific to the application - discussion
• Clock synchronization and its implications
– Network Time Protocol (NTP)
Firewall and ports
• Cassandra primarily operates using these ports
– 7000 (or 7001 if SSL) - inter-node cluster communication
– 7199 - JMX port
– 9160 / 9042 - client connections
– 22 - for SSH
• The firewall must have these ports open for each node in the cluster/DC
Cassandra.yaml - an introduction
There are more configuration parameters, but we will cover those as we move along ...
C d h
8/17/2019 CassandraTraining v3.3.4
42/183
42
Cassandra-env.sh
CQL shell - cqlsh
• Command line tool
• Specify the host and port (none means localhost)
• Provides tab completion
• Supports two different command sets
– DDL
– Shell commands (describe, source, tracing, copy etc)
• Try some commands
– show version
– show host
– describe keyspaces
– describe keyspace
– describe tables
– describe table

CQL shell "source" command - loading a few million rows
"sstableloader", which is another tool - beyond a few million rows
Part 4
Prelude to Modeling
Keyspace aka Schema
Cassandra:
create keyspace [IF NOT EXISTS] meterdata with replication strategy (optionally with DC/DR setup), durable writes (True/False)

Relational database:
create database meterdata;

It is like the database (MySQL) or schema/user (Oracle).
durable writes = false bypasses the commit log. You can lose data.
Admin/System keyspaces
• SELECT * FROM
– system.schema_keyspaces
– system.peers
– system.schema_columnfamilies
– system.schema_columns
– system.local (info about itself)
– system.hints
– system.
• Local contains information about nodes (superset of gossip)
Column Family (CF) / Table
Cassandra:
create table meterdata.bill_data (...) with primary key (compound key), compaction, clustering order - a PK is mandatory in C*

Relational database:
create table meterdata.bill_data (...) with pk, references, engine, charset etc

Standard SQL DML statements:
Insert into bill_data () values ();
Update bill_data set .. = .. where ..
Delete from bill_data where ..

What happens if we insert with a PK that already exists? Hold on to your thoughts ...
PK aka Partition key aka Row key
• Partition key determines which node the data will reside on
– Simple key
– Composite key
• It is the unit of replication in a cluster
– All data for a given PK replicates around in the cluster
– All data for a given PK is co-located on a node
• The PK is hashed
– Murmur3 hashing is preferred
– RandomPartitioner is based on MD5 (not considered secure)
– Others are for legacy support only
• A partition is fetched from the disk in one disk seek; every partition requires a seek
PK → Hash
meter_id_001 → 558a666df55
meter_id_002 → 88yt543edfhj
meter_id_003 → aad543rgf54l

More of PK during modeling ...
CREATE TABLE all_events (
    event_type int,
    date int, // format as yyyymmdd; an int is more efficient
    created_hh int,
    created_min int,
    created_sec int,
    created_nn int,
    data text,
    PRIMARY KEY ((event_type, date), created_hh, created_min)
);
Visualizing the Primary Key

Partition key | Clustering column(s) | Unique

What are the components of the Primary key?
Primary key = Partition key(s) + Clustering column(s)
Naturally, by definition the primary key must be unique
Visualizing partition key based storage
• Question - how does C* keep the data based on the primary key definition of a given table?
[Diagram: partition key Meter_id_001 stored as one contiguous chunk of data on disk; columns for 26-Jan-2014, 27-Jan-2014, 28-Jan-2014, ..., 15-Dec-2014 are appended as data grows per partition key]
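One way to picture this layout is a nested sorted map: partition key → (clustering key → value), with each partition contiguous and its rows kept in clustering order. The sketch below is a toy model with made-up names, not C* internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model: each partition key maps to a contiguous, clustering-sorted map
// of rows - reading one partition is one seek plus a sequential scan.
public class PartitionModel {
    // partition key -> (clustering key, e.g. date as yyyymmdd -> reading)
    static final TreeMap<String, TreeMap<Integer, Integer>> store = new TreeMap<>();

    static void put(String pk, int clusteringKey, int value) {
        store.computeIfAbsent(pk, k -> new TreeMap<>()).put(clusteringKey, value);
    }

    // All rows of one partition come back in clustering order (default ASC),
    // regardless of insertion order.
    static List<Integer> readPartition(String pk) {
        return new ArrayList<>(store.getOrDefault(pk, new TreeMap<>()).values());
    }
}
```

For example, inserting readings for meter_id_001 out of date order still yields them sorted by date on read, mirroring the default ASC clustering order.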
Failure Tolerance By Replication
Remember the CAP theorem? C* gives availability and partition tolerance
Replication Factor (RF) = 3
Insert a record with PK=B
[Ring diagram: 8 nodes A-H; with RF=3 the record replicates to the replica groups shown: {B, A, H}, {C, B, A}, {D, C, B}]
private static final String allEventCql = "insert into all_events (event_type, date, created_hh, created_min, created_sec, created_nn, data) values (?, ?, ?, ?, ?, ?, ?)";
Session session = CassandraDAO.getEventSession();
BoundStatement boundStatement = getWritableStatement(session, allEventCql);
session.execute(boundStatement.bind(.., .., ..));
PK - data partitioning
• Every node in the cluster is an owner of a range of tokens that are calculated from the PK
• After the client sends the write request, the coordinator uses the configured partitioner to determine the token value
• Then it looks for the node that has the token range (primary range) and puts the first replica there
• A node can have other ranges as well but has one primary range
• Subsequent replicas are placed with respect to the node of the first copy/replica
• Data with the same partition key will reside on the same physical node
• Clustering columns (in case of a compound PK) do not impact the choice of node
– Default sort based on clustering columns is ASC
Coordinator Node
Incoming requests (read/write)
The coordinator handles the request
An interesting aspect to note is that the data (each column) is timestamped using the coordinator node's time
Every node can be coordinator -> masterless
n3
n4
n5
n6n7
n8
n1
n2
1
2
3
Clientrequest
coordinator
Hinted handoff a bit later ...
Consistency
Tunable at runtime
• ONE (default)
• QUORUM (strict majority w.r.t. RF)
• ALL
Applies to both read and write
protected static BoundStatement getWritableStatement(Session session, String cql, boolean setAnyConsistencyLevel) {
    PreparedStatement statement = session.prepare(cql);
    if (setAnyConsistencyLevel) {
        statement.setConsistencyLevel(ConsistencyLevel.ONE);
    }
    BoundStatement boundStatement = new BoundStatement(statement);
    return boundStatement;
}
Other consistency levels are ANY, LOCAL_QUORUM, TWO, THREE,LOCAL_ONE (see documentation on the web)
What is a Quorum?
quorum = RoundDown(sum_of_replication_factors / 2) + 1
sum_of_replication_factors = datacenter1_RF + datacenter2_RF + . . .+ datacentern_RF
• Using a replication factor of 3, a quorum is 2 nodes. The cluster can tolerate 1 replica down.
• Using a replication factor of 6, a quorum is 4. The cluster can tolerate 2 replicas down.
• In a two data center cluster where each data center has a replication factor of 3, a quorum is 4 nodes. The cluster can tolerate 2 replica nodes down.
• In a five data center cluster where two data centers have a replication factor of 3 and three data centers have a replication factor of 2, the sum of replication factors is 12, so a quorum is 7 nodes.
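The formula above can be exercised directly. A minimal sketch (the class name is made up):

```java
// Quorum per the formula above:
// quorum = RoundDown(sum_of_replication_factors / 2) + 1
public class Quorum {
    static int quorum(int... dcReplicationFactors) {
        int sum = 0;
        for (int rf : dcReplicationFactors) sum += rf;
        return sum / 2 + 1; // Java integer division rounds down
    }
}
```

Passing one RF per data center reproduces the worked examples: RF 3 gives 2, RF 6 gives 4, and two DCs with RF 3 each give 4.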
Write Consistency
Write ONE
Send requests to all replicas in the cluster applicable to the PK
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Write Consistency
Write ONE
Send requests to all replicas in the cluster applicable to the PK
Wait for ONE ack before returning to the client
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
5 μs
Write Consistency
Write ONE
Send requests to all replicas in the cluster applicable to the PK
Wait for ONE ack before returning to the client
Other acks arrive later, asynchronously
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
5 μs
10 μs
120 μs
Write Consistency
Write QUORUM
Send requests to all replicas
Wait for QUORUM ack before returning to client
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
5 μs
10 μs
120 μs
Read Consistency
Read ONE
Read from one node among all replicas
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Read Consistency
Read ONE
Read from one node among all replicas
Contact the faster node (stats)
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Read Consistency
Read QUORUM
Read from the fastest node
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Read Consistency
Read QUORUM
Read from the fastest node
AND request digests from other replicas to reach QUORUM
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Read Consistency
Read QUORUM
Read from the fastest node
AND request digests from other replicas to reach QUORUM
Return most up-to-date data to the client
Read repair a bit later ...
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
How to maintain consistency across nodes?
• C* allows for tunable consistency and hence we have eventual consistency
• For whatever reason there can be inconsistency of data among nodes
• Inconsistencies are fixed using
– Read repair - update data partitions during reads (aka anti-entropy ops)
– Nodetool - regular maintenance
• Let's see consistency in action ...
Consistency in Action
RF = 3, Write ONE Read ONE
B A A
B A A
Write ONE: B
Read ONE: A
Data replication in progress..
A write must be written to the commit log and memtable of at least one replica node.

Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
Consistency in Action
RF = 3, Write ONE Read QUORUM
B A A
B A A
Write ONE: B
Read QUORUM: A
Data replication in progress..
A write must be written to the commit log and memtable of at least one replica node.

Returns the record after a quorum of replicas has responded from any data center.
Consistency in Action
RF = 3, Write ONE Read ALL
B A A
B A A
Write ONE: B
Read ALL: B
Data replication in progress..
A write must be written to the commit log and memtable of at least one replica node.

Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
Consistency in Action
RF = 3, Write QUORUM Read ONE
B B A
B B A
Write QUORUM: B
Read ONE: A
Data replication in progress..
A write must be written to the commit log and memtable on a quorum of replica nodes.

Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
Consistency in Action
RF = 3, Write QUORUM Read QUORUM
B B A
B B A
Write QUORUM: B
Read QUORUM: B
Data replication in progress..
Consistency Level
ONE
Fast write; may not read the latest written value

QUORUM / LOCAL_QUORUM
Strict majority w.r.t. replication factor
Good balance

ALL
Not the best choice
Slow, no high availability
if (nodes_written + nodes_read) > replication_factor then you can get immediate consistency
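The rule above is a one-line predicate: if the write set and read set must overlap in at least one replica, a read is guaranteed to see the latest write. A minimal sketch (names are illustrative):

```java
// Immediate consistency holds when the write and read replica sets must
// overlap: nodes_written + nodes_read > replication_factor guarantees at
// least one replica in the read set already has the latest write.
public class ConsistencyCheck {
    static boolean immediatelyConsistent(int nodesWritten, int nodesRead, int rf) {
        return nodesWritten + nodesRead > rf;
    }
}
```

With RF=3 this confirms the usual combinations: QUORUM+QUORUM (2+2) and ALL+ONE (3+1) are immediately consistent, while ONE+ONE (1+1) is only eventually consistent.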
Debate - Immediate consistency
• The following consistency levels give immediate consistency
– Write CL = ALL, Read CL = ONE
– Write CL = ONE, Read CL = ALL
– Write CL = QUORUM, Read CL = QUORUM
• Debate: which CL combination would you consider in which scenarios?
• The combinations are
– Few writes but read heavy
– Heavy writes but few reads
– Balanced
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
Coordinator
Hinted Handoff
A write request is received by the coordinator

The coordinator finds a node is down/offline

It stores "hints" on behalf of the offline node if the coordinator knows in advance that the node is down. The handoff is not taken if the node goes down at the time of the write.

Read repair / nodetool repair will fix inconsistencies.
Hints
✗
A hinted handoff is stored in the system.hints table as {location of the replica, version meta, actual data}. It is taken only when the write consistency level can be satisfied and there is a guarantee that any read consistency level can read the record.
Hinted Handoff
When the offline node comes up, the coordinator forwards the stored hints

The node syncs up its state with the hints

The coordinator cannot perpetually hold the hints

Configure an appropriate duration for any node to hold hints - default 3 hrs
n3n4
n5
n6
n7
n8
n1
n2
1
2
3
CoordinatorHints
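The hint-window bookkeeping described above can be sketched as follows. The class and method names are hypothetical (real hints live in the system.hints table, and the window is the max_hint_window setting), but the shape is the same: store a hint per offline replica, and on recovery replay only hints still inside the window:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Toy model of hinted handoff bookkeeping on a coordinator.
public class HintQueue {
    // Default hint window from the slide: 3 hours.
    static final Duration MAX_HINT_WINDOW = Duration.ofHours(3);

    record Hint(String replica, String mutation, Instant createdAt) {}

    static final List<Hint> hints = new ArrayList<>();

    // Coordinator stores a hint on behalf of a known-down replica.
    static void storeHint(String replica, String mutation, Instant now) {
        hints.add(new Hint(replica, mutation, now));
    }

    // On node recovery: replay hints still inside the window; older hints
    // are dropped and left for read repair / nodetool repair to fix.
    static List<String> replayable(String replica, Instant now) {
        List<String> out = new ArrayList<>();
        for (Hint h : hints) {
            if (h.replica.equals(replica)
                    && Duration.between(h.createdAt, now).compareTo(MAX_HINT_WINDOW) <= 0) {
                out.add(h.mutation);
            }
        }
        return out;
    }
}
```

A hint older than the window is simply not replayed, which is why a long outage still needs a repair afterwards.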
Consistency level - ANY
• This is a special consistency level where even if the applicable nodes are down the write will succeed using a hinted handoff
• Use it very cautiously
What is an Anti-entropy op?
• During reads the full data is requested from ONE replica by the coordinator
• Others send a digest, which is a hash of the data, to the coordinator
• The coordinator checks if the data and the digests match
• If they do not, then the coordinator takes the latest (n4) and merges it into the result
• Then the stale nodes (n2, n3) are updated automatically in the background
• Happens on a configurable % of reads and not for all, because there is an overhead
• read_repair_chance is the configuration (defaults to 10% or 0.1) = a 10% chance that a read repair will happen when a read comes in
• This is defined per table / column family
Repair with nodetool, a bit later ...
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Anderson (ts = 521)
Neo (ts = 851)
Anderson (ts = 268)
Part 5
Write and Read path
Why C* writes fast
RDBMS: seeks and writes values to various pre-set locations

Cassandra: continuously appends, merging and compacting when appropriate

Try reading a book that is organized as follows:
P1, P23, P45, P2, P6, P99, P125, P8 ...

How about a book that looks like:
P1, P2, P3, P4, P5, P6, ...
▶
Storage
• Writes are harder than reads to scale
• Spinning disks aren't good with random I/O
• Goal: minimize random I/O
– Log Structured Merge Tree (LSM)
– Push changes from a memory-based component to one or more disk components
Some more on LSM
• An LSM-tree is composed of two or more tree-like component data structures
• The smaller component C0 sits in memory - aka Memtables
• The larger component C1 sits on disk - aka SSTables (Sorted String Tables) - an immutable component
• Writes are inserted into C0 and the logs
• In time, C0 is flushed to C1
• C1 is optimized for sequential access
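The C0/C1 split above can be sketched in a few lines of Python — writes land in an in-memory memtable; when it fills, it is flushed as a sorted, immutable run; reads check the memtable first, then the runs newest-first (a deliberately tiny sketch, not C*'s actual data structures):

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}          # C0: in-memory, mutable
        self.sstables = []          # C1: sorted, immutable runs ("SSTables")
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # no random disk I/O on the write path
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # flush C0 to a new immutable, sorted run; C1 is sequential on disk
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest SSTable first
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # hits the limit -> flush -> SSTable generation 1
db.put("a", 9)   # the newer value for "a" lives only in the memtable
print(db.get("a"), db.get("b"), len(db.sstables))  # 9 2 1
```

A real implementation also merges runs (compaction) so reads do not have to visit ever more SSTables.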
Key components for Write
Each node implements the following key components (3 storage mechanisms and one process) to handle writes:
1. Memtables - the in-memory tables corresponding to the CQL tables, along with the related indexes
2. CommitLog - an append-only log, replayed to restore node Memtables
3. SSTables - Memtable snapshots periodically flushed (written out) to free up heap
4. Compaction - a periodic process to merge and streamline multiple generations of SSTables
Cassandra Write Path
[Diagram: the write path on a single node in the cluster - the client's write arrives via the coordinator, which (1) appends to the commit log on disk (durable_writes = TRUE) and (2) appends to the in-memory Memtable corresponding to the CQL table]

Related configurations are commitlog_sync (batch/periodic), commitlog_sync_batch_window_in_ms, commitlog_sync_period_in_ms
Cassandra Write Path
[Diagram: when a Memtable exceeds a threshold it is flushed to a new SSTable on disk, the commit log is cleaned, and the heap corresponding to the Memtable is cleared]

A new SSTable is created based on the oldest commit log - a very fast sequential write operation.
These are multiple generations of SSTables, which are compacted into one SSTable (select * from system.sstable_activity;).
Writes are completely isolated from reads - no updated columns are visible until the entire row is finished (technically, the entire partition).
Column family data state
A table's data = its Memtable + all of its SSTables that have been flushed
Memtables & SSTables
• Memtables and SSTables are maintained per table
– SSTables are immutable, not written to again after the Memtable is flushed
– A partition is typically stored across multiple SSTable files
– What is sorted? - Partitions are sorted and stored
• For each SSTable, C* creates these structures
– Partition index - a list of partition keys and the start position of rows in the data file (on disk)
– Partition index summary (an in-memory sample to speed up reads)
– Bloom filter (optional and depends on the % false positive setting)
Offheap memtables
• As of C* 2.1 a new parameter has been introduced (memtable_allocation_type). This capability is still under active development and performance over a period of time is expected to get better
• Heap buffers - the current default behavior
• Off heap buffers - moves the cell/column name and value to DirectBuffer objects. This has the lowest impact on reads — the values are still “live” Java buffers — but only reduces heap significantly when you are storing large strings or blobs
• Off heap objects - moves the entire cell off heap, leaving only the NativeCell reference containing a pointer to the native (off-heap) data. This makes it effective for small values like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when reading from it (likely to become the default in C* 3.0)
– Writes are about 5% faster with offheap_objects enabled, primarily because Cassandra doesn’t need to flush as frequently. Bigger SSTables mean less compaction is needed
– Reads are more or less the same
• For more information http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1
Commit log
• It is replayed in case a node went down and wants to come back up
• This replay recreates the Memtables for that node
• The commit log comprises pieces (files) and their size can be controlled (commitlog_segment_size_in_mb)
• The total commit log space is a controlled parameter as well (commitlog_total_space_in_mb)
• The commit log itself is also accrued in memory. It is then written to disk in two ways
– Batch - all acks to requests wait until the commit log is flushed to disk
– Periodic - the request is immediately acked, but the commit log is flushed after some time. If a node were to go down in this period then data can be lost if RF=1

Debate - the "periodic" setting gives better performance and we should use it. How to avoid the chances of data loss?
When does the flush trigger?
• Memtable total space in MB reached
– Default is 25% of the JVM heap
– Can be more in case of off-heap
• OR commit log total space in MB reached
• OR nodetool flush (manual)
• The default settings are usually good and are there for a reason
Data file name structure
• The data directories are created based on keyspaces and tables
– .../data/keyspace/table
• The actual DB file is named as
– keyspace-table-format-generationNum-component
• Component can be
– CompressionInfo - compression metadata
– Data - PK, data size, column index, row-level tombstone info, column count, column list in sorted order by name
– Filter - Bloom filter
– Index - index, also includes Bloom filter info, tombstones
– Statistics - histograms for row size, generation numbers of the files from which this SSTable was compacted
– Summary - index summary (that is loaded in memory for read optimizations)
– TOC - list of files
– Digest - a text file with a digest
• Format - internal C* format, eg "jb" is the C* 2.0 format, 2.1 may have "ka"
Cassandra Read Path
SELECT…FROM…WHERE #partition=….;

[Diagram: step 1 - the row cache sits in front of SSTable1/SSTable2/SSTable3]

The row cache will be checked if it is enabled.
Cassandra Read Path
SELECT…FROM…WHERE #partition=….;

[Diagram: step 2, on a row cache MISS - each of SSTable1/SSTable2/SSTable3 is asked "?" via its Bloom filter]

• One Bloom filter exists per SSTable and Memtable that the node is serving
• A probabilistic filter that tells "maybe the partition key exists" in its corresponding SSTable, or gives a "definite NO"
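A minimal Bloom filter sketch shows where the "maybe / definite NO" answer comes from (illustrative only; C* sizes its filters from the configured false-positive chance):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers "maybe present" or "definitely not"."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        # derive several bit positions from independent-ish hashes of the key
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # any unset bit -> definite NO (skip this SSTable entirely);
        # all bits set -> maybe present (go look, could be a false positive)
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for k in ("partition-001", "partition-002"):
    bf.add(k)
print(bf.might_contain("partition-001"))  # True
print(bf.might_contain("partition-999"))  # almost certainly False
```

The value for reads is the "definite NO": most SSTables are skipped without any disk access.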
Cassandra Read Path
SELECT…FROM…WHERE #partition=….;

[Diagram: step 3 - two Bloom filters answer F, one answers T; only that SSTable's key cache is consulted]

• If any of the Bloom filters returns a possible "YES" then the partition key may exist in one or more of the SSTables
• Proceed to look at the partition key cache for the SSTables for which there is a probability of finding the partition key; ignore the other SSTables' key caches
• Physically a single key cache is maintained ... next, more on the partition key cache
Partition Key Cache
Partition key cache (pointing into SSTable1):

#Partition001 → offset 0x0
#Partition002 → offset 0x153
#Partition350 → offset 0x5464321

The partition key cache stores the offset positions of the partition keys that have been read recently.
Cassandra Read Path
SELECT…FROM…WHERE #partition=….;

[Diagram: step 4 - row cache, Bloom filter, partition key cache, then the key index sample]

If the partition key is not found in the key cache then the read proceeds to check the "key index sample", or "partition summary" (in memory). It is a subset of the "partition index", which is the full index of the SSTable.

Next, a bit more on the key index sample ...
Key Index Sample
Key index sample (one entry per 128 keys):

#Partition001 → offset 0x0
#Partition128 → offset 0x4500
#Partition256 → offset 0x851513
#Partition512 → offset 0x5464321

Think of the offset as an absolute disk address that fseek in C can use.
The ratio of the key index sample is 1 per 128 keys.
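The summary lookup amounts to a binary search over the sampled keys: find the greatest sampled key at or before the target, then scan forward from that offset. A sketch using the sample values above (offsets and key names are the slide's illustration, not real data):

```python
import bisect

# One sampled entry per 128 keys: (partition key, disk offset)
index_sample = [
    ("Partition001", 0x0),
    ("Partition128", 0x4500),
    ("Partition256", 0x851513),
    ("Partition512", 0x5464321),
]

def nearest_offset(sample, key):
    """Return the offset to start scanning from: the greatest sampled
    key <= the target key (think fseek() to that absolute address)."""
    keys = [k for k, _ in sample]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before every sampled key
    return sample[i][1]

print(hex(nearest_offset(index_sample, "Partition128")))  # 0x4500 - exact hit
print(hex(nearest_offset(index_sample, "Partition300")))  # 0x851513 - scan forward
```

At most 128 entries of the full partition index need to be scanned after the seek, which bounds the disk work per read.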
Cassandra Read Path - complete flow
SELECT…FROM…WHERE #partition=….;

[Diagram: the complete flow - (1) row cache, (2) Bloom filter, (3) partition key cache, (4) key index sample, (5) partition index, (6) Memtable, then the merge ∑]

Merge based on the most recent timestamp of the columns:
1. Memtable
2. One or more SSTables
(more on the merge process a bit later, in LWW ...)

Also update the row cache (if enabled) and the key cache.
The coordinator will merge based on TS as well, based on the consistency level.
Bloom filters in action
A setting change takes effect when the SSTables are regenerated.
In a development env you can use nodetool/CCM scrub, BUT it is time intensive.
Row caching
Caches are also written to disk so that they come alive after a restart.
Global settings are also possible in the C* yaml file.
Key caching
• Stored per SSTable even though it is a common store for all SSTables
• Stores only the key
• Reduces the seek to just one read per replica (does not have to look into the disk version of the index, which is the full index)
• It is the default = true
• This also periodically gets saved on disk as well

• Changing index_interval increases/decreases the gap
• Increasing the gap means more disk hits to get the partition key index
• Reducing the gap means more memory, but you get to the partition key without looking into the disk-based partition index
Eager retry
• If a node is slow in responding to a request, the coordinator forwards it to another node holding a replica of the requested partition
• Valid if RF>1
• A C* 2.0+ feature

[Diagram: the client driver reads key 91; Node1 is slow, so the read is retried against Node4, which also holds 91]

ALTER TABLE users WITH speculative_retry = '10ms';
Or,
ALTER TABLE users WITH speculative_retry = '99percentile';
Last Write Win (LWW)
INSERT INTO users(login, name, age) VALUES ('jdoe', 'John DOE', 33);

#Partition jdoe: { Age: 33, Name: John DOE }
Last Write Win (LWW)
UPDATE users SET age = 34 WHERE login = 'jdoe';

SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE }
SSTable2: jdoe → { Age(t2): 34 }

Remember that SSTables are immutable; once written they cannot be updated. Assume a flush occurs - this creates a new SSTable.
Last Write Win (LWW)
DELETE age FROM users WHERE login = 'jdoe';

SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE }
SSTable2: jdoe → { Age(t2): 34 }
SSTable3: jdoe → { Age(t3): X } ← a tombstone (RIP)
Last Write Win (LWW)
SELECT age FROM users WHERE login = 'jdoe';

SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE }
SSTable2: jdoe → { Age(t2): 34 }
SSTable3: jdoe → { Age(t3): X }

Where to read from? How to construct the response?
Last Write Win (LWW)
SELECT age FROM users WHERE login = 'jdoe';

SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE } ✗
SSTable2: jdoe → { Age(t2): 34 } ✗
SSTable3: jdoe → { Age(t3): X } ✔

The column with the latest timestamp (t3, the tombstone) wins - the last write wins.
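The last-write-wins merge can be sketched as: for each column, keep the value with the highest timestamp; if the winner is a tombstone, the column is absent from the response (a Python illustration of the rule, not C*'s internals):

```python
TOMBSTONE = object()  # marker for a deleted column

def merge_partition(sstables):
    """Last-write-wins merge across SSTables: per column, the
    (timestamp, value) pair with the newest timestamp wins."""
    merged = {}
    for sstable in sstables:
        for column, (ts, value) in sstable.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    # drop columns whose winning value is a tombstone
    return {c: v for c, (ts, v) in merged.items() if v is not TOMBSTONE}

sstable1 = {"age": (1, 33), "name": (1, "John DOE")}
sstable2 = {"age": (2, 34)}
sstable3 = {"age": (3, TOMBSTONE)}   # DELETE age ... at t3

print(merge_partition([sstable1, sstable2, sstable3]))  # {'name': 'John DOE'}
```

This is also why clock skew is dangerous: the comparison is purely on timestamps (see the NTP slide later).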
Compaction
SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE }
SSTable2: jdoe → { Age(t2): 34 }
SSTable3: jdoe → { Age(t3): X }
→ New SSTable: jdoe → { Name(t1): John DOE }

• Related SSTables are merged
• The most recent version of each column is compiled into one partition in one new SSTable
• Columns marked for eviction/deletion are removed
• Old-generation SSTables are deleted
• A new-generation SSTable is created (hence the generation number keeps increasing over time)

Disk space freed = sizeof(SST1 + SST2 + .. + SSTn) - sizeof(new compact SST)
Commit logs (containing segments) are versioned and reused as well.
Newly freed space becomes available for reuse.
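Compaction performs the same timestamp merge, writing one new-generation SSTable and purging tombstones whose grace period has passed. A sketch with gc_grace handling heavily simplified (None stands in for a tombstone; units are arbitrary):

```python
def compact(sstables, now, gc_grace=10):
    """Merge generations into one new SSTable. The newest version of each
    column survives; tombstones older than gc_grace are purged entirely."""
    merged = {}
    for sstable in sstables:
        for column, (ts, value) in sstable.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return {
        c: (ts, v) for c, (ts, v) in merged.items()
        if not (v is None and now - ts > gc_grace)  # None == tombstone
    }

gen1 = {"age": (1, 33), "name": (1, "John DOE")}
gen2 = {"age": (2, 34)}
gen3 = {"age": (3, None)}  # tombstone for age, written at t3

new_sstable = compact([gen1, gen2, gen3], now=20)
print(new_sstable)  # {'name': (1, 'John DOE')} - tombstone purged, space freed
```

Purging the tombstone only after gc_grace is what makes the repair schedule on the "Coming back to life" slide so important.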
Compaction
• As per Datastax documentation
– Cassandra 2.1 improves read performance after compaction by performing an incremental replacement of compacted SSTables
– Instead of waiting for the entire compaction to finish and then throwing away the old SSTable (and cache), Cassandra can read data directly from the new SSTable even before it finishes writing
• Three strategies
– SizeTiered (default)
– Leveled (needs about 50% more I/O than size-tiered, but the number of SSTables visited for data will be less)
– DateTiered
• Choice depends on the disk
– Mechanical disk = SizeTiered
– SSD = Leveled
• Choice depends on the use case too
– Size - it is best to use this strategy when you have insert-heavy, read-light workloads
– Level - it is best suited for column families with read-heavy workloads that have frequent updates to existing rows
– Date - for timeseries data along with TTL
SizeTiered compaction - a brief
• With size-tiered compaction, similarly sized SSTables are compacted into larger SSTables once a certain number are accumulated (default 4)
• A row can exist in multiple SSTables, which can result in a degradation of read performance. This is especially true if you perform many updates or deletes with high reads
• It can be a good strategy for write-heavy scenarios
Leveled compaction
• Leveled compaction creates SSTables of a fixed, relatively small size (5MB by default in Cassandra’s implementation) that are grouped into “levels”
• Within each level, SSTables are guaranteed to be non-overlapping. Each level is ten times as large as the previous
• Leveled compaction guarantees that 90% of all reads will be satisfied from a single SSTable (assuming nearly-uniform row size). The worst case is bounded at the total number of levels — e.g. 7 for 10TB of data
DateTiered compaction
• This particularly applies to time-series data where the data lives for a specific time
• Use the DateTiered compaction strategy
• C* looks at the min and max timestamps of the SSTable and finds out if anything is really live
• If not, that SSTable is just unlinked and the compaction with another is fully avoided
• Set the TTL in the CF definition so that if bad code inserts without a TTL it does not cause unnecessary compaction
"Coming back to life"
• Let's assume multiple replicas exist for a given column
• One node was NOT able to record a tombstone because it was down
• If this node remains down past gc_grace_seconds without a repair then it will contain the old record
• The other nodes meanwhile have evicted the tombstoned column
• When the downed node does come up it will bring the old column back to life on the other nodes!
• To ensure that deleted data never resurfaces, make sure to run repair at least once every gc_grace_seconds, and never let a node stay down longer than this time period
NTP is very important
• A microsecond timestamp is associated with the columns
• The timestamp used is that of the machine on which C* is running (the coordinator)
• If the clocks of the machines in the cluster are off then C* will be unable to determine which record is latest!
• This also means that compaction may not function
• You can have a scenario where a deleted record appears again, or get an unpredictable number of records for your query
Lightweight transactions
• Transactions are a bit different
– a "Compare and Set" model
• Two requests can agree on a single value creation
– One will succeed (uses the Paxos algorithm; Wikipedia has a good text on this subject)
– The other will fail

INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', '[email protected]')
IF NOT EXISTS;

UPDATE customer_account
SET customer_email = '[email protected]'
IF customerID = 'LauraS';

Cassandra 2.1.1 and later supports non-equal conditions for lightweight transactions. You can use =, != and IN operators in WHERE clauses to query lightweight tables.
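The compare-and-set semantics can be sketched as: the write is applied only if the stated condition holds at that moment, and the response reports whether it was applied (the Paxos coordination between replicas is entirely omitted; keys and values here are hypothetical):

```python
def insert_if_not_exists(table, key, row):
    """INSERT ... IF NOT EXISTS: of two racing requests, only one wins.
    Returns (applied, current_row), mirroring the [applied] column C*
    returns for a conditional statement."""
    if key in table:
        return False, table[key]  # not applied; the existing row is reported
    table[key] = row
    return True, row

accounts = {}
first = insert_if_not_exists(accounts, "LauraS", {"email": "laura@example"})
second = insert_if_not_exists(accounts, "LauraS", {"email": "other@example"})
print(first[0], second[0])  # True False - the second request fails
```

This "check then act as one atomic step" is exactly what a plain INSERT (an upsert) does not give you.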
Triggers
• The actual trigger logic resides outside, in a Java POJO
• The feature may still be experimental
• Has a write performance impact
• Better to have application logic do any pre-processing

CREATE TRIGGER myTrigger ON myTable USING 'Java class name';
Debate - Locking in C*
[Diagram: instances 1-3 coordinate through Memcached/Redis, which holds lock keys k=_0_0 ... k=_z_z.
Step 1: acquire a lock; if that FAILS, look for the next KEY.
Step 2: get an available ID that has flag=0 and starts with 0, then delete this key from C*.
Step 3: release the lock.
The other instances work exactly like instance 1 - they acquire locks on other keys so that no two instances step on each other.]

Primary key = ((flag, starts_with), gen_id)
Before you model ...
[Diagram: the C* way models from (1) the user - an end user's business question, report or dashboard needing a fast response - down to (2) the structure (storage, data types, keys, data); the RDBMS way starts from the structure instead]
QDD
C* modelling is "Query Driven Design" ✍

Give me the query and I will organize the data via a model to get you performance for that query.
Denormalization
• Unlike an RDBMS you cannot use any foreign key relationships
• FKs and joins are not supported anyway!
• Embed details; it will cause duplication but that is alright
• Helps in getting the complete data in a single read
• More efficient, less random I/O
• May have to create more than one denormalized table to serve different queries
• UDTs are a great way (UDTs a bit later...)
Datatypes
• int, decimal, float, double
• varchar, text
• Collections - map (K=V), list (ordered V), set (unordered unique V)
– Designed to store a small amount of data (in the hundreds)
– Limit is 64K items
– Max size per element is 64KB
– Cannot be part of the primary key
– Nesting is not allowed
• timestamp, timeuuid
• counter
• User defined types (UDT)
• There are other datatypes as well
– frozen types (eg a position UDT) - useful in 3D coordinate geometry
– blob - good practice is to gzip and then store
• Static modifier - the value is shared across the rows of a partition, saving storage space, BUT a very specific use case
Choice of partition key
• If a key is not unique then C* will upsert with no warning!
• Good choices
– A natural key like an email address or meter id
– Surrogate keys like uuid, timeuuid (a uuid with a time component)
• CQL provides a number of timeuuid functions
– now()
– dateof(tuuid) - extracts the timestamp as a date
– unixtimestampof(tuuid) - the raw 64-bit integer
Wide row? Use composite PK
CREATE TABLE all_events (
  event_type int,
  date int, // format as yyyymmdd
  created_hh int,
  created_min int,
  created_sec int,
  created_nn int,
  data text,
  PRIMARY KEY((event_type, date), created_hh));

Example of the compound PK ((...), ...) syntax

CREATE TABLE trunc_events (
  event_type int,
  date text,
  created_on timestamp,
  data text,
  PRIMARY KEY((event_type, date), created_on))
WITH CLUSTERING ORDER BY (created_on DESC);

insert into trunc_events (event_type, date, created_on, data) values (1, '2014-06-24', '2014-06-26 12:47:54', '{"event" : "details of the event will go here"}') USING TTL 300;

Only event_type as the PK would store all events ever generated in a single partition. That can result in a huge partition and may breach C* limits.
Partition key IN clause debate
[Diagram: partition key = (meter_id, date); partitions Meter_id_001 / 26-Jan-2014 through Meter_id_001 / 15-Dec-2014 each hold that day's readings]

where meter_id = 'Meter_id_001' and date IN ( ... )

Each partition fetch takes its own time t1, t2, t3, ..., tn, and in general t1 ≠ t2 ≠ t3 ≠ tn, so the total fetch time T = f(t1, t2, t3, ..., tn).

Is T predictable? What are the factors responsible for determining the overall fetch time T?

This is also called a "Multi get slice".
Partition key IN clause (Slicing)
If you want to search across two partition keys then use "IN" in the where clause. You can have a further limiting condition based on clustering columns.

In a composite PK, IN works only on the last partition key column. Eg for PK ((key1, key2), clust1, clust2), IN will work only on key2 (the last column only).
Modeling notation
• Chen's notation for conceptual modeling helps in design
Sample model contd ...
Bitmap type of index example
• Multiple parts to a key
• Create a truth table of the various combinations
• However, inserts will increase to the number of combinations
• Eg find a car (vehicle id) in a car park by variable combinations
Bitmap type of index example
YES, there will be more writes and more data, BUT that is an acceptable fact in big data modeling, which looks to optimize the query and user experience more than data normalization.

select * from skldb.car_model_index where make='Ford' and model='' and color='Blue';

select * from skldb.car_model_index where make='' and model='' and color='Blue';
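The truth-table approach can be sketched as: on every insert, write one index row per attribute combination, with '' standing for "not specified", so that any subset of attributes can be looked up exactly. A Python sketch of the write amplification involved (table and attribute names follow the slide's car-park example; the vehicle id is made up):

```python
from itertools import combinations

def index_rows(vehicle_id, make, model, color):
    """Generate one index row per attribute combination; '' stands
    for 'not specified' in the query, as on the slide."""
    attrs = {"make": make, "model": model, "color": color}
    rows = []
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            key = tuple(attrs[a] if a in subset else ""
                        for a in ("make", "model", "color"))
            rows.append((key, vehicle_id))
    return rows

rows = dict(index_rows("KA-01-1234", "Ford", "Focus", "Blue"))
print(len(rows))                   # 7 rows: 2^3 - 1 combinations of 3 attributes
print(rows[("Ford", "", "Blue")])  # lookup by make + color only -> KA-01-1234
```

Seven writes per car instead of one - the cost the slide says is acceptable in exchange for exact-match reads.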
TTL - Time To Live

CREATE TABLE users (
  user_id text PRIMARY KEY,
  first_name text,
  last_name text,
  emails set<text>,
  todo map<timestamp, text>);

INSERT INTO users (user_id, first_name, last_name, emails, todo) VALUES ('nm', 'Nirmallya', 'Mukherjee', {'[email protected]', '[email protected]'}, {'2014-12-25' : 'Christmas party', '2014-12-31' : 'At Gokarna'});

UPDATE users USING TTL 300 SET todo['2012-10-1'] = 'find water' WHERE user_id = 'nm';

Rate limiting is a good idea in many cases. Eg allow password resets at only 3 per day per user. Use a TTL sliding window.
A promotion is generated for a customer with a TTL.
A temporary login is generated and is valid for a given TTL period.
A TTL specified on a write overrides the table-level TTL.
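The password-reset example can be sketched as a sliding window of entries that expire after their TTL — an in-memory Python analogue (in C* the rows written with USING TTL would simply vanish on their own, so a count of the live rows is the whole check):

```python
import time

class TtlRateLimiter:
    """Allow at most `limit` actions per `ttl` seconds per user,
    mimicking rows written with USING TTL that expire on their own."""
    def __init__(self, limit=3, ttl=24 * 3600):
        self.limit, self.ttl = limit, ttl
        self.entries = {}  # user -> list of write timestamps

    def allow(self, user, now=None):
        now = now if now is not None else time.time()
        # keep only the entries whose TTL has not yet lapsed
        live = [t for t in self.entries.get(user, []) if now - t < self.ttl]
        if len(live) >= self.limit:
            self.entries[user] = live
            return False
        live.append(now)
        self.entries[user] = live
        return True

rl = TtlRateLimiter(limit=3, ttl=86400)
print([rl.allow("jdoe", now=t) for t in (0, 10, 20, 30)])  # [True, True, True, False]
print(rl.allow("jdoe", now=90000))  # True - the old entries have expired
```

No cleanup job is needed in the C* version: expiry is the delete.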
UDT - User Defined Types

create keyspace address_book with replication = { 'class': 'SimpleStrategy', 'replication_factor': 1 };

CREATE TYPE address_book.address (
  street text,
  city text,
  zip_code int,
  phones set<text>);

CREATE TYPE address_book.fullname (
  firstname text,
  lastname text);

CREATE TABLE address_book.users (
  id int PRIMARY KEY,
  name frozen<fullname>,
  addresses map<text, frozen<address>>); // a collection map

INSERT INTO address_book.users (id, name) VALUES (1, {firstname: 'Dennis', lastname: 'Ritchie'});

UPDATE address_book.users SET addresses = addresses + {'home': { street: '9779 Forest Lane', city: 'Dallas', zip_code: 75015, phones: {'001 972 555 6666'}}} WHERE id=1;

SELECT name.firstname FROM users WHERE id = 1;

CREATE INDEX on address_book.users (name);
SELECT id FROM address_book.users WHERE name = {firstname: 'Dennis', lastname: 'Ritchie'};
Queries
CREATE TABLE mailbox (
  login text,
  message_id timeuuid,
  interlocutor text,
  message text,
  PRIMARY KEY(login, message_id));

Get a message by user and message_id (date):
SELECT * FROM mailbox WHERE login = 'jdoe' and message_id = '2014-07-12 16:00:00';

Get messages by user and date interval:
SELECT * FROM mailbox WHERE login = 'jdoe' and message_id >= '2014-01-12 16:00:00';
Debate - are these correct?
Get a message by message_id:
SELECT * FROM mailbox WHERE message_id = '2014-07-12 16:00:00'; ✗

Get messages by date interval:
SELECT * FROM mailbox WHERE message_id >= '2014-01-12 16:00:00'; ✗

Neither query provides the partition key (login), so both are invalid.
WHERE clause restrictions
• All queries (INSERT/UPDATE/DELETE/SELECT) must provide the #partition (where can filter in contiguous data)
• "ALLOW FILTERING" can override the default behaviour of cross-partition access, but it is not a good practice at all (an incorrect data model)
– select .. from meter_reading where record_date > '2014-12-15' (assuming there is a secondary index on record_date)
• Only an exact match (=) predicate is allowed on the #partition; range predicates (<, >, <=, >=) apply only to clustering columns
• If the primary key is (industry_id, exchange_id, stock_symbol) then the following sort orders are valid
– order by exchange_id desc, stock_symbol desc
– order by exchange_id asc, stock_symbol asc
• The following is an example of an invalid sort order
– order by exchange_id desc, stock_symbol asc
Secondary Index
• Consider the earlier example of the table all_events; what will happen if we try to get the records based on minute?
– Create a secondary index based on the minute
• Can be created on any column except counter, static and collection columns
• It is actually another table with key = the indexed value and columns = the partition keys that contain that value
• Do not use secondary indexes
– On high-cardinality columns, because you then query a huge volume of records for a small number of results
– In tables that use a counter column
– On a frequently updated or deleted column (too many tombstones)
– To look for a row in a large partition unless narrowly queried
– For high-volume queries
• Secondary indexes are for searching convenience
– Use with low-cardinality columns
– Columns that may contain a relatively small set of distinct values
– Use when prototyping, ad-hoc querying or with smaller datasets
• See this link for more details (http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html)
Secondary Index - rebuilding
• While Cassandra’s performance will always be best using partition-key lookups, secondary indexes provide a nice way to query data on small or offline datasets
• Occasionally, secondary indexes can get out of sync
• Unfortunately, there is no great way to verify secondary indexes other than rebuilding them
• You can verify by a query using indexes, and look for significant inconsistencies on different nodes
• Also note that each node keeps its own index (they are not distributed), so you’ll need to run this on each node
• Must run nodetool repair to ensure the index gets built on current data

nodetool rebuild_index my_keyspace my_column_family
CQL Limits
Limits (as per Datastax documentation)
• Clustering column value, length of: 65535
• Collection item, value of: 2GB (Cassandra 2.1 v3 protocol), 64K (Cassandra 2.0.x and earlier)
• Collection items, number of: 2B (Cassandra 2.1 v3 protocol), 64K (Cassandra 2.0.x and earlier)
• Columns in a partition: 2B
• Fields in a tuple: 32768 (try not to have 1000s of fields)
• Key length: 65535
• Query parameters in a query: 65535
• Single column, value of: 2GB (xMB recommended)
• Statements in a batch: 65535
C* batch - Is it always a good idea?
• Leverages token-aware policies
• Cluster load is evenly balanced
• Means if one fails we re-try ONLY that query
• Not an all-or-none success model

Tip: batches work well ONLY if all records have the same partition key. In this case all records go to the same node.
Batch - use case
• Keep two tables in sync - the same event being stored in two tables, one by customer ID and the other by staff ID
• Emulating a classical database transaction - all or nothing!
Antipatterns - things to watch out!
◆ Updating columns as nulls in CQL - a tombstone for each null
◆ Singleton messages - an add, consume, remove loop
◆ Intense updates on a single column
◆ Dynamic schema changes (frequent changes) and topology changes in production are not a good idea
◆ Using C* as a queue of singletons (add/consume loop)
But... I need a queue!
• Break the partition key to carry some time buckets, eg Queue1 in the last hour
• At the time of select you can use the clustering column, eg Queue1 with timestamp clustering column > SOME time. This helps in not going through all those tombstones
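The bucketing idea can be sketched as: the partition key carries an hour bucket, and consumers read only the current bucket past a watermark timestamp, so tombstones piling up in older buckets are never scanned (a Python sketch; the dict stands in for the table):

```python
def bucket_key(queue, epoch_seconds):
    # partition key = (queue name, hour bucket), eg 'Queue1' in the last hour
    return (queue, epoch_seconds // 3600)

# rows: {(queue, hour_bucket): [(timestamp, payload), ...]}
rows = {}

def enqueue(queue, ts, payload):
    rows.setdefault(bucket_key(queue, ts), []).append((ts, payload))

def consume(queue, now, after_ts):
    """Read only the current bucket, and only rows newer than the
    clustering-column watermark - already-consumed (tombstoned)
    older rows are never touched."""
    bucket = rows.get(bucket_key(queue, now), [])
    return [p for ts, p in bucket if ts > after_ts]

enqueue("Queue1", 3000, "old message")   # lands in hour bucket 0
enqueue("Queue1", 7300, "msg-a")         # hour bucket 2
enqueue("Queue1", 7400, "msg-b")
print(consume("Queue1", now=7500, after_ts=7350))  # ['msg-b']
```

Old buckets age out wholesale (with DateTiered compaction and a TTL, entire SSTables are simply unlinked), which is why this beats a single ever-growing queue partition.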
Part 7
Applications with Cassandra
The Client DAO and the Driver
• Session instances are thread-safe and usually a single instance is all you need per application. However, a given session can only be set to one keyspace at a time, so one instance per keyspace is necessary
• Use policies
– Retry
– Reconnect
– LoadBalancing
• http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
C* client sample
C* client sample
http://www.datastax.com/drivers/java/2.1/index.html
C* client sample - the entity
C* client sample
See this to work with UDTs - http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/udtApi.html
Part 8
Administration
C* != (Fire and forget)
C* is not a fire-and-forget system. It does require administration and monitoring to ensure the infrastructure performs optimally at all times.
Monitoring - Opscenter
Workload modeling
• Workload characterization
• Performance characteristics of the cluster
• Latency analysis of the cluster
• Performance of the OS
• Disk utilization
• Read / write operations per second
• OS memory utilization
• Heap utilization
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Dropped MUTATION messages mean that the mutation was not applied to all the replicas it was sent to. The inconsistency will be repaired by read repair or anti-entropy repair (perhaps, because of load, C* is defending itself by dropping messages).
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Node coming up
Node down
Node tool
• A very important tool to manage C* in production for day-to-day admin
• Nodetool has over 40 commands
• Can be found in the $CASSANDRA_HOME/bin folder
• Try a few commands
– ./nodetool -h 10.21.24.11 -p 7199 status
– ./nodetool -h 10.21.24.11 -p 7199 info
Node repair
How to compare two large data sets?
8/17/2019 CassandraTraining v3.3.4
173/183
173
• Assume two arrays
• Each containing 1 million numbers
• Further assume the order of storage is fixed
• How do we compare the two arrays?

Potential solution
• Loop over one and check against the other?
• How long will it take to loop over 1 million elements?
• What happens if the data gets into the billions?
• Lots of inefficiencies!
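As a quick illustration (plain Python, not C* code): hashing each array collapses the comparison to a single digest check, which answers whether the data sets differ, but not where. Locating the difference is exactly the gap a hash tree fills.

```python
import hashlib

def digest(nums):
    """Collapse a whole sequence into one fixed-size fingerprint."""
    m = hashlib.sha256()
    for n in nums:
        m.update(str(n).encode())
    return m.hexdigest()

a = list(range(1_000_000))
b = list(range(1_000_000))
b[123_456] += 1                # one hidden difference

# One 64-character comparison instead of a million element comparisons...
print(digest(a) == digest(b))  # prints: False -- the arrays differ
# ...but a flat hash cannot tell WHERE they differ.
```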
Wikipedia definition of Merkle tree
• A hash tree is a tree of hashes in which the leaves are hashes of data blocks in, for instance, a file or set of files. Nodes further up in the tree are the hashes of their respective children.
• For example, in the original figure, hash 0 is the result of hashing the concatenation of hash 0-0 and hash 0-1. That is, hash 0 = hash( hash 0-0 + hash 0-1 ), where + denotes concatenation.
• A Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels of its children nodes. Hash trees are useful because they allow efficient and secure verification of the contents of large data structures.
• Currently the main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered, and even to check that the other peers do not lie and send fake blocks.
Node repair using Nodetool
• The prime objective of node repair is to detect and fix inconsistencies among nodes
• A per-node Merkle tree is created
• It provides an efficient way to find differences in data blocks
• Reduces the amount of data transferred to compare the data blocks
• C* uses a fixed-depth tree of 15 levels, having 32K leaf nodes
• Leaf nodes can represent a range of data because the depth is fixed
Node repair process
• When the nodetool repair command is executed, the target node, specified with the -h option in the command, coordinates the repair of each column family in each keyspace
• The repair coordinator node requests a Merkle tree from each replica for a specific token range in order to compare them
• Each replica builds a Merkle tree by scanning the data stored locally in the requested token range
– This can be IO intensive but can be controlled by providing a partition range, eg the primary range, though you will still be building the Merkle tree in parallel
– You can specify the start token, which will repair the range for that token
– Finally, you can repair one DC at a time by passing the "local" parameter
– Set the compaction and streaming thresholds (next slide)
• The repair coordinator node compares the Merkle trees, finds all the sub token ranges that differ between the replicas, and repairs data in those ranges
Nodetool repair - potential strategy
• The system.local table has all the tokens that the node is responsible for
• Create a script that runs a repair using -st and -et in a loop for all tokens for the given node
• Care should be taken to connect to the node using -h and to use parallel execution (-par)
• This will allow the repair to run on small ranges and in short time increments
• Spool the output and analyze whether everything was correct, else send an email or any other form of notification to alert the admin
• It is possible to get an exception like the following if the requested start and end tokens do not fall within a single local range
– ERROR [Thread-1099288] 2015-03-14 02:48:08,822 StorageService.java (line 2517) Repair session failed: java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
– Does not mean there is a problem
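A hypothetical sketch of such a driver script in Python (the host, keyspace, and token values are made up; in practice the tokens come from `SELECT tokens FROM system.local`, and a real script would run each command via a shell or subprocess and spool its output):

```python
def repair_commands(host, tokens, keyspace):
    """Yield one subrange 'nodetool repair' command per adjacent token pair.

    Note: the wrap-around range (last token back to the first) is omitted
    here for brevity; a real script must cover it too.
    """
    tokens = sorted(int(t) for t in tokens)
    for st, et in zip(tokens, tokens[1:]):
        yield ("./nodetool -h %s repair -par -st %d -et %d %s"
               % (host, st, et, keyspace))

for cmd in repair_commands("10.21.24.11", ["-9000", "0", "9000"], "demo_ks"):
    print(cmd)
# prints:
# ./nodetool -h 10.21.24.11 repair -par -st -9000 -et 0 demo_ks
# ./nodetool -h 10.21.24.11 repair -par -st 0 -et 9000 demo_ks
```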
Schema disagreements
• Perform schema changes one at a time, at a steady pace, and from the same node
• Do not make multiple schema changes at the same time
• If NTP is not in place it is possible that the schemas may not be in sync (the usual cause in most cases)
• Check if the schema is in agreement
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_handle_schema_disagree_t.html
• ./nodetool describecluster
Nodetool cfhistograms
• Will tell how many SSTables were looked at to satisfy a read
– With level compaction it should never go > 3
– With size-tiered compaction it should never go > 12
– If the above are not in place then compaction is falling behind; check with ./nodetool compactionstats, which should say "pending tasks: 0"
• Read and write latency (no n/w latency)
– Write should not be > 150μs
– Read should not be > 130μs (from mem)
– SSD should be single digit μs
• ./nodetool cfhistograms
Slow queries
• Use the DataStax Enterprise Performance Service to automatically capture long-running queries (based on response time thresholds you specify)
• Then query the performance table that holds those CQL statements
• cqlsh:dse_perf> select * from node_slow_log;
C* timeout, concurrency, consistency
• Cassandra.yaml has multiple timeouts (these change with versions)
– read_request_timeout_in_ms
– write_request_timeout_in_ms
– cross_node_timeout
• Concurrency
– concurrent_reads
– concurrent_writes
• Application
– Consistency levels affect read/write
– etc (see the yaml file)
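As a sketch, the relevant cassandra.yaml knobs look like the fragment below. The values shown are illustrative (roughly the common 2.x defaults), not recommendations; always check the yaml shipped with your version.

```yaml
# cassandra.yaml -- timeout and concurrency knobs (illustrative values)
read_request_timeout_in_ms: 5000    # how long a coordinator waits for a read
write_request_timeout_in_ms: 2000   # how long a coordinator waits for a write
cross_node_timeout: false           # enable only when NTP keeps clocks in sync
concurrent_reads: 32                # rule of thumb: 16 x number of data disks
concurrent_writes: 32               # often sized relative to CPU core count
```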
JNA
• Java Native Access allows for more efficient communication between the JVM and the OS
• Ensure JNA is enabled, and if not do the following
– sudo apt-get install libjna-java
• But, ideally, JNA should be automatically enabled
About me
www.linkedin.com/in/nirmallya
That's it!