8/17/2019 CassandraTraining v3.3.4
Developer track
By Nirmallya Mukherjee
Index
• Part 1 - Warmup
• Part 2 - Introducing C*
• Part 3 - Setup & Installation
• Part 4 - Prelude to modelling
• Part 5 - Write and Read paths
• Part 6 - Modelling
• Part 7 - Applications with C*
• Part 8 - Administration
• Part 9 - Future releases
Part 1
Warm up!
What's the challenge?
• Consumer oriented applications
• Users in millions
• Data produced in hundreds/thousands of TB
• Hardware failure happens no matter what
• Is consumer growth predictable? Then how to do capacity planning for infrastructure?
Databases are doing just fine!
What can it do?
• Using 330 Google Compute Engine virtual machines, 300 1TB Persistent Disk volumes, Debian Linux, and Datastax Cassandra 2.2, we were able to construct a setup that can:
– sustain one million writes per second to Cassandra with a median latency of 10.3 ms and 95% completing under 23 ms
– sustain a loss of ⅓ of the instances and volumes and still maintain the 1 million writes per second (though with higher latency)
• Read more here
– http://googlecloudplatform.blogspot.in/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html
Who uses it?
• eBay
• Netflix
• Comcast
• Safeway
• Sky
• CERN
• Travelocity
• Spotify
• Intuit
• GE
Gartner magic quadrant
Types of NoSQL
• Wide Row Store - also known as wide-column stores, these databases store data in rows and users are able to perform some query operations via column-based access. A wide-row store offers very high performance and a highly scalable architecture. Examples include: Cassandra, HBase, and Google BigTable.
• Key/Value - these NoSQL databases are some of the least complex, as all of the data within consists of an indexed key and a value. Examples include Amazon DynamoDB, Riak, and Oracle NoSQL Database.
• Document - expands on the basic idea of key-value stores where "documents" are more complex, in that they contain data and each document is assigned a unique key, which is used to retrieve the document. These are designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Examples include MongoDB and CouchDB.
• Graph - designed for data whose relationships are well represented as a graph structure and has elements that are interconnected, with an undetermined number of relationships between them. Examples include: Neo4J and TitanDB.
SQL and NoSQL
SQL (RDBMS):
1. Database; relational, strict models
2. Data in rows, pre-defined schema, SQL supports joins
3. Vertically scalable
4. Random access pattern support
5. Good fit for online transactional systems
6. Master-slave model
7. Periodic data replication as read-only copies in slaves
8. Availability model includes a slight downtime in case of outages

NoSQL:
1. Datastore; distributed & non-relational, flexible schema
2. Data in key-value pairs, flexible schema, no joins
3. Horizontally scalable
4. Designed for access patterns
5. Good for optimized read-based systems or high-availability write systems
6. C* is a masterless model
7. Seamless data replication
8. Masterless allows for no downtime
Applicability of SQL / NoSQL
• No one single rule; very use case specific
– High read (not write) oriented
– High write (not read) oriented
– Document based storage
– KV based storage
• Important to know the access patterns upfront
– Focus on modeling specific to the use case
• Very difficult to fix an improper model later on, unlike a database
CAP Theorem
• ACID
– Atomicity - atomicity requires that each transaction be "all or nothing"
– Consistency - the consistency property ensures that any transaction will bring the database from one valid state to another
– Isolation - the isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially
– Durability - durability means that once a transaction has been committed, it will remain so under all circumstances
• What is it? Can I have all?
– Consistency - all nodes have the same data at all times
– Availability - every request must receive a response
– Partition tolerance - should run even if there is a partial failure
• CAP leads to BASE theory
– Basically Available, Soft state, Eventual consistency
AP systems, what about C?
Consistency / Availability / Partition tolerance - pick any two:
• RDBMS - consistency + availability
• Cassandra - availability + partition tolerance
• Google Big Table, HBase, MongoDB - consistency + partition tolerance

How does this impact architecture?
1. Eventual consistency
2. Some application logic is needed
Good fit use cases
• Store high volume stream data
– Sensor data
– Logs
– Events
– Online usage impressions
• Scenarios where the data can not be recreated; it is lost if not saved
• No need of rollups, aggregations
– Another system needs to do this
Let's brainstorm your use case
Part 2
Introducing Cassandra
Database or Datastore
• Where did the name "Cassandra" come from?
– Some interesting reference from Greek mythology
– What's the significance of the logo?
• Semantics - no real definition, both are ok
• I would like to call it "Datastore"
• A bit of un-learning is required to get a hold of the "different" ways
• Inspiration from (think of it as object persistence)
– Google BigTable
– Dynamo from Amazon
• A few interesting observations about a datastore
– Put = insert/update (Upsert?)
– Get = select
– Delete = delete
• Think HashMaps, Sorted Maps ...
• Storage mechanism
– B-tree vs LSM tree
• Row oriented datastore
• Biggest of all - no SPOF (masterless)
Masterless architecture
Client 2
R1
R2
R3
Client 1
Client 3
Why large malls have more than one entrance?
Seed node(s)
• It is like the recruitment department
• Helps a new node come on board
• More than one seed is a very good practice
• Per rack based could be good
• Starts the gossip process in new nodes
• If you have more than one DC then include seed nodes from each DC
• No other purpose
Virtual node, Ring architecture
• C* operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring
• In the early days the nodes had a range of tokens
• This had to be done at the time of setting up C*
[Diagram: ring without VNodes - six nodes, each owning one token range (A-F) and holding replicas of the preceding ranges]
Few peers helping to bring a downed node up; increased load on the selected peers.
What will happen if a node with elevated utilization goes down?
Gossip
• I think we all know what this means in English! ☺
• It is a communication mechanism among nodes
• Runs every second
• Passing state information
– Available / down / bootstrapping
– Load it is under
– Helps in detecting failed nodes
• Up to 3 other nodes talk to one another
• A way any node learns about the cluster and other nodes
• Seed nodes play a critical role to start the gossip among nodes
• It is versioned; older states are erased
• Once gossip has started on a node, the stats about other nodes are stored locally so that a re-start can be fast
• Local gossip info can be cleaned if needed
Partitioner
• How do you all organize your cubicle/office space? Where will your stuff be? Assume the floor you are on is about 25,000 sq ft.
• A partitioner determines how to distribute the data across the nodes in the cluster
• Murmur3Partitioner - recommended for most purposes (default strategy as well)
• Once a partitioner is set for a cluster, it cannot be changed without a data reload
• The primary key is hashed using the Murmur3 hash to determine which node it needs to go to
• There is nothing called the "master replica"; all replicas are identical
• The token range it can produce is -2^63 to 2^63-1
• Wikipedia details
– MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup. It was created by Austin Appleby in 2008, and exists in a number of variants, all of which have been released into the public domain. When compared to other popular hash functions, MurmurHash performed well in a random distribution of regular keys.
Replication
• How many of you have multiple copies of the most critical files - pwd file, IT return confirmation etc - on external drives?
• Replication determines how many copies of the data will be maintained in the cluster across nodes
– Replication factor
• There is no single magic formula to determine the correct number
• The widely accepted number is 3, but your use case can be different
• This has an impact on the number of nodes in the cluster - cannot have a high replication factor with few nodes
Partitioner and Replication
• Position of the first copy of the data is determined by the partitioner, and copies are placed by walking the cluster in a clockwise direction
• For example, consider a simple 4 node cluster where all of the partition keys managed by the cluster were numbers in the range of 0 to 100.
• Each node is assigned a token that represents a point in this range. In this simple example, the token values are 0, 25, 50, and 75.
• The first node, the one with token 0, is responsible for the wrapping range (76-0).
• The node with the lowest token also accepts partition keys less than the lowest token and more than the highest token.
• Once the first node is determined, the replicas will be placed on the nodes in clockwise order.

Tokens: 0, 25, 50, 75
Data Range 1 (1-25), Data Range 2 (26-50), Data Range 3 (51-75), Data Range 4 (76 + wrapping range)
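The clockwise walk above can be sketched as a toy model. This is an illustration only - the class and method names are made up, and real Cassandra hashes keys with Murmur3 rather than comparing raw key tokens:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of token-range ownership and clockwise replica placement.
public class RingPlacement {
    // Sorted tokens owned by nodes 1..4, as in the example: 0, 25, 50, 75.
    static final int[] TOKENS = {0, 25, 50, 75};

    // Index of the node responsible for a key token: the node with the
    // smallest token >= key; keys above the highest token wrap around
    // to the node with the lowest token (the "wrapping range").
    static int primaryNode(int keyToken) {
        for (int i = 0; i < TOKENS.length; i++) {
            if (keyToken <= TOKENS[i]) return i;
        }
        return 0; // wrapping range: 76..100 land on the token-0 node
    }

    // Replicas = primary node plus the next (rf - 1) nodes clockwise.
    static List<Integer> replicaNodes(int keyToken, int rf) {
        List<Integer> replicas = new ArrayList<>();
        int start = primaryNode(keyToken);
        for (int i = 0; i < rf; i++) {
            replicas.add((start + i) % TOKENS.length);
        }
        return replicas;
    }
}
```

For example, a key token of 80 falls in the wrapping range owned by the token-0 node, and with RF=3 the copies land on that node and its two clockwise neighbours.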
Snitch
• What can you typically find at the entrance of a very large mall? What's the need for a "Can I help you?" desk?
• Objective is to group machines into racks and data centers to set up a DC/DR
• Informs the partitioner about the rack and DC locations - determines which nodes the replicas need to go to
• Identifies which DC and rack a node belongs to
• Routes requests efficiently (replication)
• Tries not to have more than one replica in the same rack (may not be a physical grouping)
• Allows for a truly distributed fault tolerant cluster
• Seamless synchronization of data across DC/DR
• To be selected at the time of C* installation in the cassandra.yaml file
• Changing a snitch is a long drawn out process, especially if data exists in the keyspace - you have to run a full repair in your cluster
Snitch
In addition a "Google cloud snitch" is also available for GCE
Mixed cloud = cloud + local nodes
8/17/2019 CassandraTraining v3.3.4
27/183
27
Snitch - property file
A bit more about the property file snitch
• All nodes in the system must have the same cassandra-topology.properties file
• It is very useful if you cannot ensure the IP octets (this prevents the use of the Rack inferring snitch, where every octet means something - 10...)
• Will give you full control of your cluster and the flexibility of assigning any IP to a node (once assigned remains assigned)
• It can be hard to maintain information about a large cluster across multiple DCs, but it is worth it given the benefits
Snitch - property file example
# Data Center One
19.82.20.3 DC1:RAC1
19.83.123.233 DC1:RAC1
19.84.193.101 DC1:RAC1
19.85.13.6 DC1:RAC1
19.23.20.87 DC1:RAC2
19.15.16.200 DC1:RAC2
19.24.102.103 DC1:RAC2

# Data Center Two
29.51.8.2 DC2:RAC1
29.50.10.21 DC2:RAC1
29.50.29.14 DC2:RAC1
53.25.29.124 DC2:RAC2
53.34.20.223 DC2:RAC2
53.14.14.209 DC2:RAC2
The names like "DC1" and "DC2" are very important as we will see ...
Also, in production use GossipingPropertyFileSnitch (non cloud; for cloud use the appropriate snitch for that cloud), where you keep the entries for the rack+DC (cassandra-rackdc.properties) and C* will gossip this across the cluster automatically. The cassandra-topology.properties file is used as a fallback.
Commodity vs Specialized hardware
Specialized hardware (vertical scalability):
1. Large CAPEX
2. Wasted/idle resource
3. Failure takes out a large chunk
4. Expensive redundancy model
5. One shoe fits all model
6. Too much co-existence

vs commodity hardware (horizontal scalability):
1. Low CAPEX (rent on IaaS)
2. Maximum resource utilization
3. Failure takes out a small chunk
4. Inexpensive redundancy
5. Specific h/w for specific tasks
6. Very little to no co-existence

This is what Google, LinkedIn, Facebook do. The norm is now being adopted by large corporations as well.
"Just in time" expansion; stay in tune with the load. No need to build ahead of time in anticipation.
Elastic Linear Scalability
n1
n2
n3
n4
n5
n6
n7
n8
8 nodes
n1
n2
n3
n4
n5
n6
n7
n8
n9
n10
10 nodes
1. Need more capacity? Add servers
2. Auto re-distribution of the cluster
3. Results in linear performance increase
Debate - Failover design strategy
Scenario
• Maintain a small failover pool in the data center, and then simply add multiple nodes if a failure occurs
• Is this "Just in time" strategy correct?

Analysis
• The additional overhead of bootstrapping the new nodes will actually reduce capacity at a critical point when the capacity is needed most, perhaps creating a domino effect leading to a severe outage
Debate - what if all machines are not similar?
n1
n2
n3
n4
n5
n6
State your observations based on typical machine characteristics
1. CPU, memory
2. Storage space (consider externally mounted space as local)
3. Network connectivity
4. Query performance and "predictability" of queries
Deployment - 4 dimensions
1. Node - one C* instance
2. Rack - logical set of nodes
3. DC-DR - logical set of racks
4. Cluster - nodes that map to a single token ring (can be across DCs), next slide ...
Distributed deployment
//These are the C* nodes that the DAO will look to connect to
public static final String[] cassandraNodes = { "10.24.37.1", "10.24.37.2", "10.24.37.3" };

public enum clusterName { events, counter }

//You can build with policies like withRetryPolicy(), .withLoadBalancingPolicy(Round Robin)
cluster = Cluster.builder().addContactPoints(Config.cassandraNodes).build();
session1 = cluster.connect(Constants.clusterName.events.toString());
RACK 1, RACK 2, RACK 3
RACK A, RACK B
Region 1 - DC Region 2 - DR
Cluster: SKL-Platform
A bit more about multi DC setup
• When using NetworkTopologyStrategy, you set the number of replicas per data center
• For example, if you set the replicas as 4 (2 in each DC) then this is what you get ...

ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 2};

Then run nodetool repair on each affected node, wait for it to come up, and move on to the next node.
R1
R3
R2
R4
Data Center1
Data Center2Rack1
Rack2
Node1
Node3
Node5
Node2
Node4
Node6
R1
R2
Data Center1
Rack1
Rack2
Node7
Node9
Node11
Node8
Node10
Node12
R3
R4
Data Center2
Debate - replication setting
• Let's assume we have the cluster as follows
– DC1 having 5 nodes
– DC2 having 5 nodes
• What will happen if the following statements are executed (assume no syntax errors)?
– create keyspace if not exists test with replication = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 10 };
– create table test ( col1 int, col2 varchar, primary key(col1));
– insert into test (col1, col2) values (1, 'Sir Dennis Ritchie');
– Will this succeed?
Regions and Zones
• Many IaaS providers (Google / Amazon) have multiple data centers around the world called "Regions"
• Each region is further divided into availability "zones"
• Depending on your IaaS, or even your own DC, you must plan the topology in such a way that
– A client request should not travel over the network too long to reach C*
– An outage should not take out a significant chunk of your cluster
– Replication across DCs should not add to the latency for the client
Part 3
Setup & Installation
Let's get going! Installation
• Acquiring and installing C*
– Understand the key components of cassandra.yaml
• Configuration and installation structure
– Data directory
– Commit log directory
– Cache directory
• System log configuration
– cassandra-env.ps1 (change the log file position)
– Default is $logdir = "$env:CASSANDRA_HOME\logs"
– In logback.xml change from .zip to .gz
• ${cassandra.logdir}/system.log.%i.gz
• Creating a cluster
• Adding nodes to a cluster
• Multiple seed nodes, ring management
• Sizing is specific to the application - discussion
• Clock synchronization and its implications
– Network Time Protocol (NTP)
Firewall and ports
• Cassandra primarily operates using these ports
– 7000 (or 7001 if SSL) - inter-node cluster communication
– 7199 - JMX port
– 9160 / 9042 - client connections
– 22 - for SSH
• The firewall must have these ports open for each node in the cluster/DC
Cassandra.yaml - an introduction
There are more configuration parameters, but we will cover those as we move along ...
C d h
8/17/2019 CassandraTraining v3.3.4
42/183
42
Cassandra-env.sh
CQL shell - cqlsh
• Command line tool
• Specify the host and port (none means localhost)
• Provides tab completion
• Supports two different command sets
– DDL
– Shell commands (describe, source, tracing, copy etc)
• Try some commands
– show version
– show host
– describe keyspaces
– describe keyspace
– describe tables
– describe table

CQL shell "source" command - loading a few million rows
"sstableloader", which is another tool - beyond a few million rows
Part 4
Prelude to Modeling
Keyspace aka Schema
Cassandra:
create keyspace [IF NOT EXISTS] meterdata with replication strategy (optionally with DC/DR setup), durable writes (True/False)

Relational database:
create database meterdata;

It is like the database (MySQL) or schema/user (Oracle).
durable writes = false bypasses the commit log. You can lose data.
Admin/System keyspaces
• SELECT * FROM
– system.schema_keyspaces
– system.peers
– system.schema_columnfamilies
– system.schema_columns
– system.local (info about itself)
– system.hints
– system.
• Local contains information about nodes (superset of gossip)
Column Family (CF) / Table
Cassandra:
create table meterdata.bill_data (...) with primary key (compound key), compaction, clustering order - a PK is mandatory in C*

Relational database:
create table meterdata.bill_data (...) with pk, references, engine, charset etc

Standard SQL DML statements:
Insert into bill_data () values ();
Update bill_data set .. = .. where ..
Delete from bill_data where ..

What happens if we insert with a PK that already exists? Hold on to your thoughts ...
PK aka Partition key aka Row key
• Partition key determines which node the data will reside on
– Simple key
– Composite key
• It is the unit of replication in a cluster
– All data for a given PK replicates around in the cluster
– All data for a given PK is co-located on a node
• The PK is hashed
– Murmur3 hashing is preferred
– RandomPartitioner is based on MD5 (not considered secure)
– Others are for legacy support only
• A partition is fetched from the disk in one disk seek; every partition requires a seek
PK → Hash
meter_id_001 → 558a666df55
meter_id_002 → 88yt543edfhj
meter_id_003 → aad543rgf54l

More of PK during modeling ...
CREATE TABLE all_events (
    event_type int,
    date int, // format as yyyymmdd; an int is more efficient
    created_hh int,
    created_min int,
    created_sec int,
    created_nn int,
    data text,
    PRIMARY KEY ((event_type, date), created_hh, created_min)
);
Visualizing the Primary Key

Partition key | Clustering column(s) | Unique

What are the components of the Primary key?
Primary key = Partition key(s) + Clustering column(s)
Naturally, by definition the primary key must be unique
Visualizing partition key based storage
• Question - how does C* keep the data based on the primary key definition of a given table?
[Diagram: partition key Meter_id_001 stored as one contiguous chunk of data on disk; columns for 26-Jan-2014, 27-Jan-2014, 28-Jan-2014, ..., 15-Dec-2014 are appended as data grows per partition key]
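One way to picture this layout is a nested sorted map: partition key → (clustering key → value), with each partition contiguous and its rows kept in clustering order. The sketch below is a toy model with made-up names, not C* internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model: each partition key maps to a contiguous, clustering-sorted map
// of rows - reading one partition is one seek plus a sequential scan.
public class PartitionModel {
    // partition key -> (clustering key, e.g. date as yyyymmdd -> reading)
    static final TreeMap<String, TreeMap<Integer, Integer>> store = new TreeMap<>();

    static void put(String pk, int clusteringKey, int value) {
        store.computeIfAbsent(pk, k -> new TreeMap<>()).put(clusteringKey, value);
    }

    // All rows of one partition come back in clustering order (default ASC),
    // regardless of insertion order.
    static List<Integer> readPartition(String pk) {
        return new ArrayList<>(store.getOrDefault(pk, new TreeMap<>()).values());
    }
}
```

For example, inserting readings for meter_id_001 out of date order still yields them sorted by date on read, mirroring the default ASC clustering order.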
Failure Tolerance By Replication
Remember the CAP theorem? C* gives availability and partition tolerance
Replication Factor (RF) = 3
Insert a record with PK=B
[Ring diagram: 8 nodes A-H; with RF=3 the record replicates to the replica groups shown: {B, A, H}, {C, B, A}, {D, C, B}]
private static final String allEventCql = "insert into all_events (event_type, date, created_hh, created_min, created_sec, created_nn, data) values (?, ?, ?, ?, ?, ?, ?)";
Session session = CassandraDAO.getEventSession();
BoundStatement boundStatement = getWritableStatement(session, allEventCql);
session.execute(boundStatement.bind(.., .., ..));
PK - data partitioning
• Every node in the cluster is an owner of a range of tokens that are calculated from the PK
• After the client sends the write request, the coordinator uses the configured partitioner to determine the token value
• Then it looks for the node that has the token range (primary range) and puts the first replica there
• A node can have other ranges as well but has one primary range
• Subsequent replicas are placed with respect to the node of the first copy/replica
• Data with the same partition key will reside on the same physical node
• Clustering columns (in case of a compound PK) do not impact the choice of node
– Default sort based on clustering columns is ASC
Coordinator Node
Incoming requests (read/write)
The coordinator handles the request
An interesting aspect to note is that the data (each column) is timestamped using the coordinator node's time
Every node can be coordinator -> masterless
n3
n4
n5
n6n7
n8
n1
n2
1
2
3
Clientrequest
coordinator
Hinted handoff a bit later ...
Consistency
Tunable at runtime
• ONE (default)
• QUORUM (strict majority w.r.t. RF)
• ALL
Applies to both read and write
protected static BoundStatement getWritableStatement(Session session, String cql, boolean setAnyConsistencyLevel) {
    PreparedStatement statement = session.prepare(cql);
    if (setAnyConsistencyLevel) {
        statement.setConsistencyLevel(ConsistencyLevel.ONE);
    }
    BoundStatement boundStatement = new BoundStatement(statement);
    return boundStatement;
}
Other consistency levels are ANY, LOCAL_QUORUM, TWO, THREE,LOCAL_ONE (see documentation on the web)
What is a Quorum?
quorum = RoundDown(sum_of_replication_factors / 2) + 1
sum_of_replication_factors = datacenter1_RF + datacenter2_RF + . . .+ datacentern_RF
• Using a replication factor of 3, a quorum is 2 nodes. The cluster can tolerate 1 replica down.
• Using a replication factor of 6, a quorum is 4. The cluster can tolerate 2 replicas down.
• In a two data center cluster where each data center has a replication factor of 3, a quorum is 4 nodes. The cluster can tolerate 2 replica nodes down.
• In a five data center cluster where two data centers have a replication factor of 3 and three data centers have a replication factor of 2, the sum of replication factors is 12, so a quorum is 7 nodes.
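The formula above can be exercised directly. A minimal sketch (the class name is made up):

```java
// Quorum per the formula above:
// quorum = RoundDown(sum_of_replication_factors / 2) + 1
public class Quorum {
    static int quorum(int... dcReplicationFactors) {
        int sum = 0;
        for (int rf : dcReplicationFactors) sum += rf;
        return sum / 2 + 1; // Java integer division rounds down
    }
}
```

Passing one RF per data center reproduces the worked examples: RF 3 gives 2, RF 6 gives 4, and two DCs with RF 3 each give 4.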
Write Consistency
Write ONE
Send requests to all replicas in the cluster applicable to the PK
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Write Consistency
Write ONE
Send requests to all replicas in the cluster applicable to the PK
Wait for ONE ack before returning to the client
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
5 μs
Write Consistency
Write ONE
Send requests to all replicas in the cluster applicable to the PK
Wait for ONE ack before returning to the client
Other acks arrive later, asynchronously
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
5 μs
10 μs
120 μs
Write Consistency
Write QUORUM
Send requests to all replicas
Wait for QUORUM ack before returning to client
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
5 μs
10 μs
120 μs
Read Consistency
Read ONE
Read from one node among all replicas
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Read Consistency
Read ONE
Read from one node among all replicas
Contact the faster node (stats)
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Read Consistency
Read QUORUM
Read from the fastest node
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Read Consistency
Read QUORUM
Read from the fastest node
AND request digests from other replicas to reach QUORUM
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Read Consistency
Read QUORUM
Read from the fastest node
AND request digests from other replicas to reach QUORUM
Return most up-to-date data to the client
Read repair a bit later ...
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
How to maintain consistency across nodes?
• C* allows for tunable consistency and hence we have eventual consistency
• For whatever reason there can be inconsistency of data among nodes
• Inconsistencies are fixed using
– Read repair - update data partitions during reads (aka anti-entropy ops)
– Nodetool - regular maintenance
• Let's see consistency in action ...
Consistency in Action
RF = 3, Write ONE Read ONE
B A A
B A A
Write ONE: B
Read ONE: A
Data replication in progress..
A write must be written to the commit log and memtable of at least one replica node.

Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
Consistency in Action
RF = 3, Write ONE Read QUORUM
B A A
B A A
Write ONE: B
Read QUORUM: A
Data replication in progress..
A write must be written to the commit log and memtable of at least one replica node.

Returns the record after a quorum of replicas has responded from any data center.
Consistency in Action
RF = 3, Write ONE Read ALL
B A A
B A A
Write ONE: B
Read ALL: B
Data replication in progress..
A write must be written to the commit log and memtable of at least one replica node.

Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
Consistency in Action
RF = 3, Write QUORUM Read ONE
B B A
B B A
Write QUORUM: B
Read ONE: A
Data replication in progress..
A write must be written to the commit log and memtable on a quorum of replica nodes.

Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
Consistency in Action
RF = 3, Write QUORUM Read QUORUM
B B A
B B A
Write QUORUM: B
Read QUORUM: B
Data replication in progress..
Consistency Level
ONE
Fast write; may not read the latest written value

QUORUM / LOCAL_QUORUM
Strict majority w.r.t. replication factor
Good balance

ALL
Not the best choice
Slow, no high availability
if (nodes_written + nodes_read) > replication_factor then you can get immediate consistency
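The rule above is a one-line predicate: if the write set and read set must overlap in at least one replica, a read is guaranteed to see the latest write. A minimal sketch (names are illustrative):

```java
// Immediate consistency holds when the write and read replica sets must
// overlap: nodes_written + nodes_read > replication_factor guarantees at
// least one replica in the read set already has the latest write.
public class ConsistencyCheck {
    static boolean immediatelyConsistent(int nodesWritten, int nodesRead, int rf) {
        return nodesWritten + nodesRead > rf;
    }
}
```

With RF=3 this confirms the usual combinations: QUORUM+QUORUM (2+2) and ALL+ONE (3+1) are immediately consistent, while ONE+ONE (1+1) is only eventually consistent.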
Debate - Immediate consistency
• The following consistency levels give immediate consistency
– Write CL = ALL, Read CL = ONE
– Write CL = ONE, Read CL = ALL
– Write CL = QUORUM, Read CL = QUORUM
• Debate: which CL combination would you consider in which scenarios?
• The combinations are
– Few writes but read heavy
– Heavy writes but few reads
– Balanced
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
Coordinator
Hinted Handoff
A write request is received by the coordinator

The coordinator finds a node is down/offline

It stores "hints" on behalf of the offline node if the coordinator knows in advance that the node is down. The handoff is not taken if the node goes down at the time of the write.

Read repair / nodetool repair will fix inconsistencies.
Hints
✗
A hinted handoff is stored in the system.hints table as {location of the replica, version meta, actual data}. It is taken only when the write consistency level can be satisfied and there is a guarantee that any read consistency level can read the record.
Hinted Handoff
When the offline node comes up, the coordinator forwards the stored hints

The node syncs up its state with the hints

The coordinator cannot perpetually hold the hints

Configure an appropriate duration for any node to hold hints - default 3 hrs
n3n4
n5
n6
n7
n8
n1
n2
1
2
3
CoordinatorHints
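The hint-window bookkeeping described above can be sketched as follows. The class and method names are hypothetical (real hints live in the system.hints table, and the window is the max_hint_window setting), but the shape is the same: store a hint per offline replica, and on recovery replay only hints still inside the window:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Toy model of hinted handoff bookkeeping on a coordinator.
public class HintQueue {
    // Default hint window from the slide: 3 hours.
    static final Duration MAX_HINT_WINDOW = Duration.ofHours(3);

    record Hint(String replica, String mutation, Instant createdAt) {}

    static final List<Hint> hints = new ArrayList<>();

    // Coordinator stores a hint on behalf of a known-down replica.
    static void storeHint(String replica, String mutation, Instant now) {
        hints.add(new Hint(replica, mutation, now));
    }

    // On node recovery: replay hints still inside the window; older hints
    // are dropped and left for read repair / nodetool repair to fix.
    static List<String> replayable(String replica, Instant now) {
        List<String> out = new ArrayList<>();
        for (Hint h : hints) {
            if (h.replica.equals(replica)
                    && Duration.between(h.createdAt, now).compareTo(MAX_HINT_WINDOW) <= 0) {
                out.add(h.mutation);
            }
        }
        return out;
    }
}
```

A hint older than the window is simply not replayed, which is why a long outage still needs a repair afterwards.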
Consistency level - ANY
• This is a special consistency level where even if the applicable nodes are down the write will succeed using a hinted handoff
• Use it very cautiously
What is an Anti-entropy op?
• During reads the full data is requested from ONE replica by the coordinator
• Others send a digest, which is a hash of the data, to the coordinator
• The coordinator checks if the data and the digests match
• If they do not, then the coordinator takes the latest (n4) and merges it into the result
• Then the stale nodes (n2, n3) are updated automatically in the background
• Happens on a configurable % of reads and not for all, because there is an overhead
• read_repair_chance is the configuration (defaults to 10% or 0.1) = a 10% chance that a read repair will happen when a read comes in
• This is defined per table / column family
Repair with nodetool, a bit later ...
n3
n4
n5
n6
n7
n8
n1
n2
1
2
3
coordinator
Anderson (ts = 521)
Neo (ts = 851)
Anderson (ts = 268)
Part 5
Write and Read path
Why C* writes fast
RDBMS: seeks and writes values to various pre-set locations

Cassandra: continuously appends, merging and compacting when appropriate

Try reading a book that is organized as follows:
P1, P23, P45, P2, P6, P99, P125, P8 ...

How about a book that looks like:
P1, P2, P3, P4, P5, P6, ...
▶
Storage
• Writes are harder than reads to scale
• Spinning disks aren't good with random I/O
• Goal: minimize random I/O
– Log Structured Merge Tree (LSM)
– Push changes from a memory-based component to one or more disk components
Some more on LSM
• An LSM-tree is composed of two or more tree-like component data structures
• The smaller component C0 sits in memory - aka Memtables
• The larger component C1 sits on disk - aka SSTables (Sorted String Tables) - an immutable component
• Writes are inserted into C0 and the logs
• In time, C0 is flushed to C1
• C1 is optimized for sequential access
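The C0/C1 split above can be sketched in a few lines of Python — writes land in an in-memory memtable; when it fills, it is flushed as a sorted, immutable run; reads check the memtable first, then the runs newest-first (a deliberately tiny sketch, not C*'s actual data structures):

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}          # C0: in-memory, mutable
        self.sstables = []          # C1: sorted, immutable runs ("SSTables")
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # no random disk I/O on the write path
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # flush C0 to a new immutable, sorted run; C1 is sequential on disk
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest SSTable first
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # hits the limit -> flush -> SSTable generation 1
db.put("a", 9)   # the newer value for "a" lives only in the memtable
print(db.get("a"), db.get("b"), len(db.sstables))  # 9 2 1
```

A real implementation also merges runs (compaction) so reads do not have to visit ever more SSTables.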
Key components for Write
Each node implements the following key components (3 storage mechanisms and one process) to handle writes:
1. Memtables - the in-memory tables corresponding to the CQL tables, along with the related indexes
2. CommitLog - an append-only log, replayed to restore node Memtables
3. SSTables - Memtable snapshots periodically flushed (written out) to free up heap
4. Compaction - a periodic process to merge and streamline multiple generations of SSTables
Cassandra Write Path
[Diagram: the write path on a single node in the cluster - the client's write arrives via the coordinator, which (1) appends to the commit log on disk (durable_writes = TRUE) and (2) appends to the in-memory Memtable corresponding to the CQL table]

Related configurations are commitlog_sync (batch/periodic), commitlog_sync_batch_window_in_ms, commitlog_sync_period_in_ms
Cassandra Write Path
[Diagram: when a Memtable exceeds a threshold it is flushed to a new SSTable on disk, the commit log is cleaned, and the heap corresponding to the Memtable is cleared]

A new SSTable is created based on the oldest commit log - a very fast sequential write operation.
These are multiple generations of SSTables, which are compacted into one SSTable (select * from system.sstable_activity;).
Writes are completely isolated from reads - no updated columns are visible until the entire row is finished (technically, the entire partition).
Column family data state
A table's data = its Memtable + all of its SSTables that have been flushed
Memtables & SSTables
• Memtables and SSTables are maintained per table
– SSTables are immutable, not written to again after the Memtable is flushed
– A partition is typically stored across multiple SSTable files
– What is sorted? - Partitions are sorted and stored
• For each SSTable, C* creates these structures
– Partition index - a list of partition keys and the start position of rows in the data file (on disk)
– Partition index summary (an in-memory sample to speed up reads)
– Bloom filter (optional and depends on the % false positive setting)
Offheap memtables
• As of C* 2.1 a new parameter has been introduced (memtable_allocation_type). This capability is still under active development and performance over a period of time is expected to get better
• Heap buffers - the current default behavior
• Off heap buffers - moves the cell/column name and value to DirectBuffer objects. This has the lowest impact on reads — the values are still “live” Java buffers — but only reduces heap significantly when you are storing large strings or blobs
• Off heap objects - moves the entire cell off heap, leaving only the NativeCell reference containing a pointer to the native (off-heap) data. This makes it effective for small values like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when reading from it (likely to become the default in C* 3.0)
– Writes are about 5% faster with offheap_objects enabled, primarily because Cassandra doesn’t need to flush as frequently. Bigger SSTables mean less compaction is needed
– Reads are more or less the same
• For more information http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1
Commit log
• It is replayed in case a node went down and wants to come back up
• This replay recreates the Memtables for that node
• The commit log comprises pieces (files) and their size can be controlled (commitlog_segment_size_in_mb)
• The total commit log space is a controlled parameter as well (commitlog_total_space_in_mb)
• The commit log itself is also accrued in memory. It is then written to disk in two ways
– Batch - all acks to requests wait until the commit log is flushed to disk
– Periodic - the request is immediately acked, but the commit log is flushed after some time. If a node were to go down in this period then data can be lost if RF=1

Debate - the "periodic" setting gives better performance and we should use it. How to avoid the chances of data loss?
When does the flush trigger?
• Memtable total space in MB reached
– Default is 25% of the JVM heap
– Can be more in case of off-heap
• OR commit log total space in MB reached
• OR nodetool flush (manual)
• The default settings are usually good and are there for a reason
Data file name structure
• The data directories are created based on keyspaces and tables
– .../data/keyspace/table
• The actual DB file is named as
– keyspace-table-format-generationNum-component
• Component can be
– CompressionInfo - compression metadata
– Data - PK, data size, column index, row-level tombstone info, column count, column list in sorted order by name
– Filter - Bloom filter
– Index - index, also includes Bloom filter info, tombstones
– Statistics - histograms for row size, generation numbers of the files from which this SSTable was compacted
– Summary - index summary (that is loaded in memory for read optimizations)
– TOC - list of files
– Digest - a text file with a digest
• Format - internal C* format, eg "jb" is the C* 2.0 format, 2.1 may have "ka"
Cassandra Read Path
SELECT…FROM…WHERE #partition=….;

[Diagram: step 1 - the row cache sits in front of SSTable1/SSTable2/SSTable3]

The row cache will be checked if it is enabled.
Cassandra Read Path
SELECT…FROM…WHERE #partition=….;

[Diagram: step 2, on a row cache MISS - each of SSTable1/SSTable2/SSTable3 is asked "?" via its Bloom filter]

• One Bloom filter exists per SSTable and Memtable that the node is serving
• A probabilistic filter that tells "maybe the partition key exists" in its corresponding SSTable, or gives a "definite NO"
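A minimal Bloom filter sketch shows where the "maybe / definite NO" answer comes from (illustrative only; C* sizes its filters from the configured false-positive chance):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers "maybe present" or "definitely not"."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        # derive several bit positions from independent-ish hashes of the key
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # any unset bit -> definite NO (skip this SSTable entirely);
        # all bits set -> maybe present (go look, could be a false positive)
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for k in ("partition-001", "partition-002"):
    bf.add(k)
print(bf.might_contain("partition-001"))  # True
print(bf.might_contain("partition-999"))  # almost certainly False
```

The value for reads is the "definite NO": most SSTables are skipped without any disk access.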
Cassandra Read Path
SELECT…FROM…WHERE #partition=….;

[Diagram: step 3 - two Bloom filters answer F, one answers T; only that SSTable's key cache is consulted]

• If any of the Bloom filters returns a possible "YES" then the partition key may exist in one or more of the SSTables
• Proceed to look at the partition key cache for the SSTables for which there is a probability of finding the partition key; ignore the other SSTables' key caches
• Physically a single key cache is maintained ... next, more on the partition key cache
Partition Key Cache
Partition key cache (pointing into SSTable1):

#Partition001 → offset 0x0
#Partition002 → offset 0x153
#Partition350 → offset 0x5464321

The partition key cache stores the offset positions of the partition keys that have been read recently.
Cassandra Read Path
SELECT…FROM…WHERE #partition=….;

[Diagram: step 4 - row cache, Bloom filter, partition key cache, then the key index sample]

If the partition key is not found in the key cache then the read proceeds to check the "key index sample", or "partition summary" (in memory). It is a subset of the "partition index", which is the full index of the SSTable.

Next, a bit more on the key index sample ...
Key Index Sample
Key index sample (one entry per 128 keys):

#Partition001 → offset 0x0
#Partition128 → offset 0x4500
#Partition256 → offset 0x851513
#Partition512 → offset 0x5464321

Think of the offset as an absolute disk address that fseek in C can use.
The ratio of the key index sample is 1 per 128 keys.
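The summary lookup amounts to a binary search over the sampled keys: find the greatest sampled key at or before the target, then scan forward from that offset. A sketch using the sample values above (offsets and key names are the slide's illustration, not real data):

```python
import bisect

# One sampled entry per 128 keys: (partition key, disk offset)
index_sample = [
    ("Partition001", 0x0),
    ("Partition128", 0x4500),
    ("Partition256", 0x851513),
    ("Partition512", 0x5464321),
]

def nearest_offset(sample, key):
    """Return the offset to start scanning from: the greatest sampled
    key <= the target key (think fseek() to that absolute address)."""
    keys = [k for k, _ in sample]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before every sampled key
    return sample[i][1]

print(hex(nearest_offset(index_sample, "Partition128")))  # 0x4500 - exact hit
print(hex(nearest_offset(index_sample, "Partition300")))  # 0x851513 - scan forward
```

At most 128 entries of the full partition index need to be scanned after the seek, which bounds the disk work per read.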
Cassandra Read Path - complete flow
SELECT…FROM…WHERE #partition=….;

[Diagram: the complete flow - (1) row cache, (2) Bloom filter, (3) partition key cache, (4) key index sample, (5) partition index, (6) Memtable, then the merge ∑]

Merge based on the most recent timestamp of the columns:
1. Memtable
2. One or more SSTables
(more on the merge process a bit later, in LWW ...)

Also update the row cache (if enabled) and the key cache.
The coordinator will merge based on TS as well, based on the consistency level.
Bloom filters in action
A setting change takes effect when the SSTables are regenerated.
In a development env you can use nodetool/CCM scrub, BUT it is time intensive.
Row caching
Caches are also written to disk so that they come alive after a restart.
Global settings are also possible in the C* yaml file.
Key caching
• Stored per SSTable even though it is a common store for all SSTables
• Stores only the key
• Reduces the seek to just one read per replica (does not have to look into the disk version of the index, which is the full index)
• It is the default = true
• This also periodically gets saved on disk as well

• Changing index_interval increases/decreases the gap
• Increasing the gap means more disk hits to get the partition key index
• Reducing the gap means more memory, but you get to the partition key without looking into the disk-based partition index
Eager retry
• If a node is slow in responding to a request, the coordinator forwards it to another node holding a replica of the requested partition
• Valid if RF>1
• A C* 2.0+ feature

[Diagram: the client driver reads key 91; Node1 is slow, so the read is retried against Node4, which also holds 91]

ALTER TABLE users WITH speculative_retry = '10ms';
Or,
ALTER TABLE users WITH speculative_retry = '99percentile';
Last Write Win (LWW)
INSERT INTO users(login, name, age) VALUES ('jdoe', 'John DOE', 33);

#Partition jdoe: { Age: 33, Name: John DOE }
Last Write Win (LWW)
UPDATE users SET age = 34 WHERE login = 'jdoe';

SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE }
SSTable2: jdoe → { Age(t2): 34 }

Remember that SSTables are immutable; once written they cannot be updated. Assume a flush occurs - this creates a new SSTable.
Last Write Win (LWW)
DELETE age FROM users WHERE login = 'jdoe';

SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE }
SSTable2: jdoe → { Age(t2): 34 }
SSTable3: jdoe → { Age(t3): X } ← a tombstone (RIP)
Last Write Win (LWW)
SELECT age FROM users WHERE login = 'jdoe';

SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE }
SSTable2: jdoe → { Age(t2): 34 }
SSTable3: jdoe → { Age(t3): X }

Where to read from? How to construct the response?
Last Write Win (LWW)
SELECT age FROM users WHERE login = 'jdoe';

SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE } ✗
SSTable2: jdoe → { Age(t2): 34 } ✗
SSTable3: jdoe → { Age(t3): X } ✔

The column with the latest timestamp (t3, the tombstone) wins - the last write wins.
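The last-write-wins merge can be sketched as: for each column, keep the value with the highest timestamp; if the winner is a tombstone, the column is absent from the response (a Python illustration of the rule, not C*'s internals):

```python
TOMBSTONE = object()  # marker for a deleted column

def merge_partition(sstables):
    """Last-write-wins merge across SSTables: per column, the
    (timestamp, value) pair with the newest timestamp wins."""
    merged = {}
    for sstable in sstables:
        for column, (ts, value) in sstable.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    # drop columns whose winning value is a tombstone
    return {c: v for c, (ts, v) in merged.items() if v is not TOMBSTONE}

sstable1 = {"age": (1, 33), "name": (1, "John DOE")}
sstable2 = {"age": (2, 34)}
sstable3 = {"age": (3, TOMBSTONE)}   # DELETE age ... at t3

print(merge_partition([sstable1, sstable2, sstable3]))  # {'name': 'John DOE'}
```

This is also why clock skew is dangerous: the comparison is purely on timestamps (see the NTP slide later).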
Compaction
SSTable1: jdoe → { Age(t1): 33, Name(t1): John DOE }
SSTable2: jdoe → { Age(t2): 34 }
SSTable3: jdoe → { Age(t3): X }
→ New SSTable: jdoe → { Name(t1): John DOE }

• Related SSTables are merged
• The most recent version of each column is compiled into one partition in one new SSTable
• Columns marked for eviction/deletion are removed
• Old-generation SSTables are deleted
• A new-generation SSTable is created (hence the generation number keeps increasing over time)

Disk space freed = sizeof(SST1 + SST2 + .. + SSTn) - sizeof(new compact SST)
Commit logs (containing segments) are versioned and reused as well.
Newly freed space becomes available for reuse.
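Compaction performs the same timestamp merge, writing one new-generation SSTable and purging tombstones whose grace period has passed. A sketch with gc_grace handling heavily simplified (None stands in for a tombstone; units are arbitrary):

```python
def compact(sstables, now, gc_grace=10):
    """Merge generations into one new SSTable. The newest version of each
    column survives; tombstones older than gc_grace are purged entirely."""
    merged = {}
    for sstable in sstables:
        for column, (ts, value) in sstable.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return {
        c: (ts, v) for c, (ts, v) in merged.items()
        if not (v is None and now - ts > gc_grace)  # None == tombstone
    }

gen1 = {"age": (1, 33), "name": (1, "John DOE")}
gen2 = {"age": (2, 34)}
gen3 = {"age": (3, None)}  # tombstone for age, written at t3

new_sstable = compact([gen1, gen2, gen3], now=20)
print(new_sstable)  # {'name': (1, 'John DOE')} - tombstone purged, space freed
```

Purging the tombstone only after gc_grace is what makes the repair schedule on the "Coming back to life" slide so important.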
Compaction
• As per Datastax documentation
– Cassandra 2.1 improves read performance after compaction by performing an incremental replacement of compacted SSTables
– Instead of waiting for the entire compaction to finish and then throwing away the old SSTable (and cache), Cassandra can read data directly from the new SSTable even before it finishes writing
• Three strategies
– SizeTiered (default)
– Leveled (needs about 50% more I/O than size-tiered, but the number of SSTables visited for data will be less)
– DateTiered
• Choice depends on the disk
– Mechanical disk = SizeTiered
– SSD = Leveled
• Choice depends on the use case too
– Size - it is best to use this strategy when you have insert-heavy, read-light workloads
– Level - it is best suited for column families with read-heavy workloads that have frequent updates to existing rows
– Date - for timeseries data along with TTL
SizeTiered compaction - a brief
• With size-tiered compaction, similarly sized SSTables are compacted into larger SSTables once a certain number are accumulated (default 4)
• A row can exist in multiple SSTables, which can result in a degradation of read performance. This is especially true if you perform many updates or deletes with high reads
• It can be a good strategy for write-heavy scenarios
Leveled compaction
• Leveled compaction creates SSTables of a fixed, relatively small size (5MB by default in Cassandra’s implementation) that are grouped into “levels”
• Within each level, SSTables are guaranteed to be non-overlapping. Each level is ten times as large as the previous
• Leveled compaction guarantees that 90% of all reads will be satisfied from a single SSTable (assuming nearly-uniform row size). The worst case is bounded at the total number of levels — e.g. 7 for 10TB of data
DateTiered compaction
• This particularly applies to time-series data where the data lives for a specific time
• Use the DateTiered compaction strategy
• C* looks at the min and max timestamps of the SSTable and finds out if anything is really live
• If not, that SSTable is just unlinked and the compaction with another is fully avoided
• Set the TTL in the CF definition so that if bad code inserts without a TTL it does not cause unnecessary compaction
"Coming back to life"
• Let's assume multiple replicas exist for a given column
• One node was NOT able to record a tombstone because it was down
• If this node remains down past gc_grace_seconds without a repair then it will contain the old record
• The other nodes meanwhile have evicted the tombstoned column
• When the downed node does come up it will bring the old column back to life on the other nodes!
• To ensure that deleted data never resurfaces, make sure to run repair at least once every gc_grace_seconds, and never let a node stay down longer than this time period
NTP is very important
• A microsecond timestamp is associated with the columns
• The timestamp used is that of the machine on which C* is running (the coordinator)
• If the clocks of the machines in the cluster are off then C* will be unable to determine which record is latest!
• This also means that compaction may not function
• You can have a scenario where a deleted record appears again, or get an unpredictable number of records for your query
Lightweight transactions
• Transactions are a bit different
– a "Compare and Set" model
• Two requests can agree on a single value creation
– One will succeed (uses the Paxos algorithm; Wikipedia has a good text on this subject)
– The other will fail

INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', '[email protected]')
IF NOT EXISTS;

UPDATE customer_account
SET customer_email = '[email protected]'
IF customerID = 'LauraS';

Cassandra 2.1.1 and later supports non-equal conditions for lightweight transactions. You can use =, != and IN operators in WHERE clauses to query lightweight tables.
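The compare-and-set semantics can be sketched as: the write is applied only if the stated condition holds at that moment, and the response reports whether it was applied (the Paxos coordination between replicas is entirely omitted; keys and values here are hypothetical):

```python
def insert_if_not_exists(table, key, row):
    """INSERT ... IF NOT EXISTS: of two racing requests, only one wins.
    Returns (applied, current_row), mirroring the [applied] column C*
    returns for a conditional statement."""
    if key in table:
        return False, table[key]  # not applied; the existing row is reported
    table[key] = row
    return True, row

accounts = {}
first = insert_if_not_exists(accounts, "LauraS", {"email": "laura@example"})
second = insert_if_not_exists(accounts, "LauraS", {"email": "other@example"})
print(first[0], second[0])  # True False - the second request fails
```

This "check then act as one atomic step" is exactly what a plain INSERT (an upsert) does not give you.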
Triggers
• The actual trigger logic resides outside, in a Java POJO
• The feature may still be experimental
• Has a write performance impact
• Better to have application logic do any pre-processing

CREATE TRIGGER myTrigger ON myTable USING 'Java class name';
Debate - Locking in C*
[Diagram: instances 1-3 coordinate through Memcached/Redis, which holds lock keys k=_0_0 ... k=_z_z.
Step 1: acquire a lock; if that FAILS, look for the next KEY.
Step 2: get an available ID that has flag=0 and starts with 0, then delete this key from C*.
Step 3: release the lock.
The other instances work exactly like instance 1 - they acquire locks on other keys so that no two instances step on each other.]

Primary key = ((flag, starts_with), gen_id)
Before you model ...
[Diagram: the C* way models from (1) the user - an end user's business question, report or dashboard needing a fast response - down to (2) the structure (storage, data types, keys, data); the RDBMS way starts from the structure instead]
QDD
C* modelling is "Query Driven Design" ✍

Give me the query and I will organize the data via a model to get you performance for that query.
Denormalization
• Unlike an RDBMS you cannot use any foreign key relationships
• FKs and joins are not supported anyway!
• Embed details; it will cause duplication but that is alright
• Helps in getting the complete data in a single read
• More efficient, less random I/O
• May have to create more than one denormalized table to serve different queries
• UDTs are a great way (UDTs a bit later...)
Datatypes
• int, decimal, float, double
• varchar, text
• Collections - map (K=V), list (ordered V), set (unordered unique V)
– Designed to store a small amount of data (in the hundreds)
– Limit is 64K items
– Max size per element is 64KB
– Cannot be part of the primary key
– Nesting is not allowed
• timestamp, timeuuid
• counter
• User defined types (UDT)
• There are other datatypes as well
– frozen types (eg a position UDT) - useful in 3D coordinate geometry
– blob - good practice is to gzip and then store
• Static modifier - the value is shared across the rows of a partition, saving storage space, BUT a very specific use case
Choice of partition key
• If a key is not unique then C* will upsert with no warning!
• Good choices
– A natural key like an email address or meter id
– Surrogate keys like uuid, timeuuid (a uuid with a time component)
• CQL provides a number of timeuuid functions
– now()
– dateof(tuuid) - extracts the timestamp as a date
– unixtimestampof(tuuid) - the raw 64-bit integer
Wide row? Use composite PK
CREATE TABLE all_events (
  event_type int,
  date int, // format as yyyymmdd
  created_hh int,
  created_min int,
  created_sec int,
  created_nn int,
  data text,
  PRIMARY KEY((event_type, date), created_hh));

Example of the compound PK ((...), ...) syntax

CREATE TABLE trunc_events (
  event_type int,
  date text,
  created_on timestamp,
  data text,
  PRIMARY KEY((event_type, date), created_on))
WITH CLUSTERING ORDER BY (created_on DESC);

insert into trunc_events (event_type, date, created_on, data) values (1, '2014-06-24', '2014-06-26 12:47:54', '{"event" : "details of the event will go here"}') USING TTL 300;

Only event_type as the PK would store all events ever generated in a single partition. That can result in a huge partition and may breach C* limits.
Partition key IN clause debate
[Diagram: partition key = (meter_id, date); partitions Meter_id_001 / 26-Jan-2014 through Meter_id_001 / 15-Dec-2014 each hold that day's readings]

where meter_id = 'Meter_id_001' and date IN ( ... )

Each partition fetch takes its own time t1, t2, t3, ..., tn, and in general t1 ≠ t2 ≠ t3 ≠ tn, so the total fetch time T = f(t1, t2, t3, ..., tn).

Is T predictable? What are the factors responsible for determining the overall fetch time T?

This is also called a "Multi get slice".
Partition key IN clause (Slicing)
If you want to search across two partition keys then use "IN" in the where clause. You can have a further limiting condition based on clustering columns.

In a composite PK, IN works only on the last partition key column. Eg for PK ((key1, key2), clust1, clust2), IN will work only on key2 (the last column only).
Modeling notation
• Chen's notation for conceptual modeling helps in design
Sample model contd ...
Bitmap type of index example
• Multiple parts to a key
• Create a truth table of the various combinations
• However, inserts will increase to the number of combinations
• Eg find a car (vehicle id) in a car park by variable combinations
Bitmap type of index example
YES, there will be more writes and more data, BUT that is an acceptable fact in big data modeling, which looks to optimize the query and user experience more than data normalization.

select * from skldb.car_model_index where make='Ford' and model='' and color='Blue';

select * from skldb.car_model_index where make='' and model='' and color='Blue';
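The truth-table approach can be sketched as: on every insert, write one index row per attribute combination, with '' standing for "not specified", so that any subset of attributes can be looked up exactly. A Python sketch of the write amplification involved (table and attribute names follow the slide's car-park example; the vehicle id is made up):

```python
from itertools import combinations

def index_rows(vehicle_id, make, model, color):
    """Generate one index row per attribute combination; '' stands
    for 'not specified' in the query, as on the slide."""
    attrs = {"make": make, "model": model, "color": color}
    rows = []
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            key = tuple(attrs[a] if a in subset else ""
                        for a in ("make", "model", "color"))
            rows.append((key, vehicle_id))
    return rows

rows = dict(index_rows("KA-01-1234", "Ford", "Focus", "Blue"))
print(len(rows))                   # 7 rows: 2^3 - 1 combinations of 3 attributes
print(rows[("Ford", "", "Blue")])  # lookup by make + color only -> KA-01-1234
```

Seven writes per car instead of one - the cost the slide says is acceptable in exchange for exact-match reads.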
TTL - Time To Live

CREATE TABLE users (
  user_id text PRIMARY KEY,
  first_name text,
  last_name text,
  emails set<text>,
  todo map<timestamp, text>);

INSERT INTO users (user_id, first_name, last_name, emails, todo) VALUES ('nm', 'Nirmallya', 'Mukherjee', {'[email protected]', '[email protected]'}, {'2014-12-25' : 'Christmas party', '2014-12-31' : 'At Gokarna'});

UPDATE users USING TTL 300 SET todo['2012-10-1'] = 'find water' WHERE user_id = 'nm';

Rate limiting is a good idea in many cases. Eg allow password resets at only 3 per day per user. Use a TTL sliding window.
A promotion is generated for a customer with a TTL.
A temporary login is generated and is valid for a given TTL period.
A TTL specified on a write overrides the table-level TTL.
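The password-reset example can be sketched as a sliding window of entries that expire after their TTL — an in-memory Python analogue (in C* the rows written with USING TTL would simply vanish on their own, so a count of the live rows is the whole check):

```python
import time

class TtlRateLimiter:
    """Allow at most `limit` actions per `ttl` seconds per user,
    mimicking rows written with USING TTL that expire on their own."""
    def __init__(self, limit=3, ttl=24 * 3600):
        self.limit, self.ttl = limit, ttl
        self.entries = {}  # user -> list of write timestamps

    def allow(self, user, now=None):
        now = now if now is not None else time.time()
        # keep only the entries whose TTL has not yet lapsed
        live = [t for t in self.entries.get(user, []) if now - t < self.ttl]
        if len(live) >= self.limit:
            self.entries[user] = live
            return False
        live.append(now)
        self.entries[user] = live
        return True

rl = TtlRateLimiter(limit=3, ttl=86400)
print([rl.allow("jdoe", now=t) for t in (0, 10, 20, 30)])  # [True, True, True, False]
print(rl.allow("jdoe", now=90000))  # True - the old entries have expired
```

No cleanup job is needed in the C* version: expiry is the delete.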
UDT - User Defined Types

create keyspace address_book with replication = { 'class': 'SimpleStrategy', 'replication_factor': 1 };

CREATE TYPE address_book.address (
  street text,
  city text,
  zip_code int,
  phones set<text>);

CREATE TYPE address_book.fullname (
  firstname text,
  lastname text);

CREATE TABLE address_book.users (
  id int PRIMARY KEY,
  name frozen<fullname>,
  addresses map<text, frozen<address>>); // a collection map

INSERT INTO address_book.users (id, name) VALUES (1, {firstname: 'Dennis', lastname: 'Ritchie'});

UPDATE address_book.users SET addresses = addresses + {'home': { street: '9779 Forest Lane', city: 'Dallas', zip_code: 75015, phones: {'001 972 555 6666'}}} WHERE id=1;

SELECT name.firstname FROM users WHERE id = 1;

CREATE INDEX on address_book.users (name);
SELECT id FROM address_book.users WHERE name = {firstname: 'Dennis', lastname: 'Ritchie'};
Queries
CREATE TABLE mailbox (
  login text,
  message_id timeuuid,
  interlocutor text,
  message text,
  PRIMARY KEY(login, message_id));

Get a message by user and message_id (date):
SELECT * FROM mailbox WHERE login = 'jdoe' and message_id = '2014-07-12 16:00:00';

Get messages by user and date interval:
SELECT * FROM mailbox WHERE login = 'jdoe' and message_id >= '2014-01-12 16:00:00';
Debate - are these correct?
Get a message by message_id:
SELECT * FROM mailbox WHERE message_id = '2014-07-12 16:00:00'; ✗

Get messages by date interval:
SELECT * FROM mailbox WHERE message_id >= '2014-01-12 16:00:00'; ✗

Neither query provides the partition key (login), so both are invalid.
WHERE clause restrictions
• All queries (INSERT/UPDATE/DELETE/SELECT) must provide the #partition (where can filter in contiguous data)
• "ALLOW FILTERING" can override the default behaviour of cross-partition access, but it is not a good practice at all (an incorrect data model)
– select .. from meter_reading where record_date > '2014-12-15' (assuming there is a secondary index on record_date)
• Only an exact match (=) predicate is allowed on the #partition; range predicates (<, >, <=, >=) apply only to clustering columns
• If the primary key is (industry_id, exchange_id, stock_symbol) then the following sort orders are valid
– order by exchange_id desc, stock_symbol desc
– order by exchange_id asc, stock_symbol asc
• The following is an example of an invalid sort order
– order by exchange_id desc, stock_symbol asc
Secondary Index
• Consider the earlier example of the table all_events; what will happen if we try to get the records based on minute?
– Create a secondary index based on the minute
• Can be created on any column except counter, static and collection columns
• It is actually another table with key = the indexed value and columns = the partition keys that contain that value
• Do not use secondary indexes
– On high-cardinality columns, because you then query a huge volume of records for a small number of results
– In tables that use a counter column
– On a frequently updated or deleted column (too many tombstones)
– To look for a row in a large partition unless narrowly queried
– For high-volume queries
• Secondary indexes are for searching convenience
– Use with low-cardinality columns
– Columns that may contain a relatively small set of distinct values
– Use when prototyping, ad-hoc querying or with smaller datasets
• See this link for more details (http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html)
Secondary Index - rebuilding
• While Cassandra’s performance will always be best using partition-key lookups, secondary indexes provide a nice way to query data on small or offline datasets
• Occasionally, secondary indexes can get out of sync
• Unfortunately, there is no great way to verify secondary indexes other than rebuilding them
• You can verify by a query using indexes, and look for significant inconsistencies on different nodes
• Also note that each node keeps its own index (they are not distributed), so you’ll need to run this on each node
• Must run nodetool repair to ensure the index gets built on current data

nodetool rebuild_index my_keyspace my_column_family
CQL Limits
Limits (as per Datastax documentation)
• Clustering column value, length of: 65535
• Collection item, value of: 2GB (Cassandra 2.1 v3 protocol), 64K (Cassandra 2.0.x and earlier)
• Collection items, number of: 2B (Cassandra 2.1 v3 protocol), 64K (Cassandra 2.0.x and earlier)
• Columns in a partition: 2B
• Fields in a tuple: 32768 (try not to have 1000s of fields)
• Key length: 65535
• Query parameters in a query: 65535
• Single column, value of: 2GB (xMB recommended)
• Statements in a batch: 65535
C* batch - Is it always a good idea?
• Leverages token-aware policies
• Cluster load is evenly balanced
• Means if one fails we re-try ONLY that query
• Not an all-or-none success model

Tip: batches work well ONLY if all records have the same partition key. In this case all records go to the same node.
Batch - use case
• Keep two tables in sync - the same event being stored in two tables, one by customer ID and the other by staff ID
• Emulating a classical database transaction - all or nothing!
Antipatterns - things to watch out!
◆ Updating columns as nulls in CQL - a tombstone for each null
◆ Singleton messages - an add, consume, remove loop
◆ Intense updates on a single column
◆ Dynamic schema changes (frequent changes) and topology changes in production are not a good idea
◆ Using C* as a queue of singletons (add/consume loop)
But... I need a queue!
• Break the partition key to carry some time buckets, eg Queue1 in the last hour
• At the time of select you can use the clustering column, eg Queue1 with timestamp clustering column > SOME time. This helps in not going through all those tombstones
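The bucketing idea can be sketched as: the partition key carries an hour bucket, and consumers read only the current bucket past a watermark timestamp, so tombstones piling up in older buckets are never scanned (a Python sketch; the dict stands in for the table):

```python
def bucket_key(queue, epoch_seconds):
    # partition key = (queue name, hour bucket), eg 'Queue1' in the last hour
    return (queue, epoch_seconds // 3600)

# rows: {(queue, hour_bucket): [(timestamp, payload), ...]}
rows = {}

def enqueue(queue, ts, payload):
    rows.setdefault(bucket_key(queue, ts), []).append((ts, payload))

def consume(queue, now, after_ts):
    """Read only the current bucket, and only rows newer than the
    clustering-column watermark - already-consumed (tombstoned)
    older rows are never touched."""
    bucket = rows.get(bucket_key(queue, now), [])
    return [p for ts, p in bucket if ts > after_ts]

enqueue("Queue1", 3000, "old message")   # lands in hour bucket 0
enqueue("Queue1", 7300, "msg-a")         # hour bucket 2
enqueue("Queue1", 7400, "msg-b")
print(consume("Queue1", now=7500, after_ts=7350))  # ['msg-b']
```

Old buckets age out wholesale (with DateTiered compaction and a TTL, entire SSTables are simply unlinked), which is why this beats a single ever-growing queue partition.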
Part 7
Applications with Cassandra
The Client DAO and the Driver
• Session instances are thread-safe and usually a single instance is all you need per application. However, a given session can only be set to one keyspace at a time, so one instance per keyspace is necessary
• Use policies
– Retry
– Reconnect
– LoadBalancing
• http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
C* client sample
C* client sample
http://www.datastax.com/drivers/java/2.1/index.html
C* client sample - the entity
C* client sample
See this to work with UDTs - http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/udtApi.html
Part 8
Administration
C* != (Fire and forget)
C* is not a fire-and-forget system. It does require administration and monitoring to ensure the infrastructure performs optimally at all times.
Monitoring - Opscenter
Workload modeling
• Workload characterization
• Performance characteristics of the cluster
• Latency analysis of the cluster
• Performance of the OS
• Disk utilization
• Read / write operations per second
• OS memory utilization
• Heap utilization
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Dropped MUTATION messages mean that the mutation was not applied to all the replicas it was sent to. The inconsistency will be repaired by read repair or anti-entropy repair (perhaps, because of load, C* is defending itself by dropping messages).
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Datastax Opscenter
Monitoring - Node coming up
Node down
Node tool
• A very important tool to manage C* in production for day-to-day admin
• Nodetool has over 40 commands
• Can be found in the $CASSANDRA_HOME/bin folder
• Try a few commands
– ./nodetool -h 10.21.24.11 -p 7199 status
– ./nodetool -h 10.21.24.11 -p 7199 info
Node repair
How to compare two large data sets?
8/17/2019 CassandraTraining v3.3.4
173/183
173
• Assume two arrays
• Each containing 1 million numbers
• Further assume the order of storage is fixed
• How do we compare the two arrays?

Potential solution
• Loop over one and check against the other?
• How long will it take to loop over 1 million elements?
• What happens if the data gets into the billions?
• Lots of inefficiencies!
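As a quick illustration (plain Python, not C* code): hashing each array collapses the comparison to a single digest check, which answers whether the data sets differ, but not where. Locating the difference is exactly the gap a hash tree fills.

```python
import hashlib

def digest(nums):
    """Collapse a whole sequence into one fixed-size fingerprint."""
    m = hashlib.sha256()
    for n in nums:
        m.update(str(n).encode())
    return m.hexdigest()

a = list(range(1_000_000))
b = list(range(1_000_000))
b[123_456] += 1                # one hidden difference

# One 64-character comparison instead of a million element comparisons...
print(digest(a) == digest(b))  # prints: False -- the arrays differ
# ...but a flat hash cannot tell WHERE they differ.
```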
Wikipedia definition of Merkle tree
• A hash tree is a tree of hashes in which the leaves are hashes of data blocks in, for instance, a file or set of files. Nodes further up in the tree are the hashes of their respective children.
• For example, in the original figure, hash 0 is the result of hashing the concatenation of hash 0-0 and hash 0-1. That is, hash 0 = hash( hash 0-0 + hash 0-1 ), where + denotes concatenation.
• A Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels of its children nodes. Hash trees are useful because they allow efficient and secure verification of the contents of large data structures.
• Currently the main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered, and even to check that the other peers do not lie and send fake blocks.
Node repair using Nodetool
• The prime objective of node repair is to detect and fix inconsistencies among nodes
• A per-node Merkle tree is created
• It provides an efficient way to find differences in data blocks
• Reduces the amount of data transferred to compare the data blocks
• C* uses a fixed-depth tree of 15 levels, having 32K leaf nodes
• Leaf nodes can represent a range of data because the depth is fixed
Node repair process
• When the nodetool repair command is executed, the target node, specified with the -h option in the command, coordinates the repair of each column family in each keyspace
• The repair coordinator node requests a Merkle tree from each replica for a specific token range in order to compare them
• Each replica builds a Merkle tree by scanning the data stored locally in the requested token range
– This can be IO intensive but can be controlled by providing a partition range, eg the primary range, though you will still be building the Merkle tree in parallel
– You can specify the start token, which will repair the range for that token
– Finally, you can repair one DC at a time by passing the "local" parameter
– Set the compaction and streaming thresholds (next slide)
• The repair coordinator node compares the Merkle trees, finds all the sub token ranges that differ between the replicas, and repairs data in those ranges
Nodetool repair - potential strategy
• The system.local table has all the tokens that the node is responsible for
• Create a script that runs a repair using -st and -et in a loop for all tokens for the given node
• Care should be taken to connect to the node using -h and to use parallel execution (-par)
• This will allow the repair to run on small ranges and in short time increments
• Spool the output and analyze whether everything was correct, else send an email or any other form of notification to alert the admin
• It is possible to get an exception like the following if the requested start and end tokens do not fall within a single local range
– ERROR [Thread-1099288] 2015-03-14 02:48:08,822 StorageService.java (line 2517) Repair session failed: java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
– Does not mean there is a problem
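A hypothetical sketch of such a driver script in Python (the host, keyspace, and token values are made up; in practice the tokens come from `SELECT tokens FROM system.local`, and a real script would run each command via a shell or subprocess and spool its output):

```python
def repair_commands(host, tokens, keyspace):
    """Yield one subrange 'nodetool repair' command per adjacent token pair.

    Note: the wrap-around range (last token back to the first) is omitted
    here for brevity; a real script must cover it too.
    """
    tokens = sorted(int(t) for t in tokens)
    for st, et in zip(tokens, tokens[1:]):
        yield ("./nodetool -h %s repair -par -st %d -et %d %s"
               % (host, st, et, keyspace))

for cmd in repair_commands("10.21.24.11", ["-9000", "0", "9000"], "demo_ks"):
    print(cmd)
# prints:
# ./nodetool -h 10.21.24.11 repair -par -st -9000 -et 0 demo_ks
# ./nodetool -h 10.21.24.11 repair -par -st 0 -et 9000 demo_ks
```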
Schema disagreements
• Perform schema changes one at a time, at a steady pace, and from the same node
• Do not make multiple schema changes at the same time
• If NTP is not in place it is possible that the schemas may not be in sync (the usual cause in most cases)
• Check if the schema is in agreement
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_handle_schema_disagree_t.html
• ./nodetool describecluster
Nodetool cfhistograms
• Will tell how many SSTables were looked at to satisfy a read
– With level compaction it should never go > 3
– With size-tiered compaction it should never go > 12
– If the above are not in place then compaction is falling behind; check with ./nodetool compactionstats, which should say "pending tasks: 0"
• Read and write latency (no n/w latency)
– Write should not be > 150μs
– Read should not be > 130μs (from mem)
– SSD should be single digit μs
• ./nodetool cfhistograms
Slow queries
• Use the DataStax Enterprise Performance Service to automatically capture long-running queries (based on response time thresholds you specify)
• Then query the performance table that holds those CQL statements
• cqlsh:dse_perf> select * from node_slow_log;
C* timeout, concurrency, consistency
• Cassandra.yaml has multiple timeouts (these change with versions)
– read_request_timeout_in_ms
– write_request_timeout_in_ms
– cross_node_timeout
• Concurrency
– concurrent_reads
– concurrent_writes
• Application
– Consistency levels affect read/write
– etc (see the yaml file)
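As a sketch, the relevant cassandra.yaml knobs look like the fragment below. The values shown are illustrative (roughly the common 2.x defaults), not recommendations; always check the yaml shipped with your version.

```yaml
# cassandra.yaml -- timeout and concurrency knobs (illustrative values)
read_request_timeout_in_ms: 5000    # how long a coordinator waits for a read
write_request_timeout_in_ms: 2000   # how long a coordinator waits for a write
cross_node_timeout: false           # enable only when NTP keeps clocks in sync
concurrent_reads: 32                # rule of thumb: 16 x number of data disks
concurrent_writes: 32               # often sized relative to CPU core count
```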
JNA
• Java Native Access allows for more efficient communication between the JVM and the OS
• Ensure JNA is enabled, and if not do the following
– sudo apt-get install libjna-java
• But, ideally, JNA should be automatically enabled
About me
www.linkedin.com/in/nirmallya
That's it!