CassandraTraining v3.3.4



    Developer track

By Nirmallya Mukherjee


    Index

• Part 1 - Warmup
• Part 2 - Introducing C*
• Part 3 - Setup & Installation
• Part 4 - Prelude to modelling
• Part 5 - Write and Read paths
• Part 6 - Modelling
• Part 7 - Applications with C*
• Part 8 - Administration
• Part 9 - Future releases


    Part 1

    Warm up!


What's the challenge?

• Consumer oriented applications
• Users in the millions
• Data produced in hundreds/thousands of TB
• Hardware failure happens no matter what
• Is consumer growth predictable? Then how do you do capacity planning for infrastructure?

Databases are doing just fine!


    What can it do?

• Using 330 Google Compute Engine virtual machines, 300 1TB Persistent Disk volumes, Debian Linux, and Datastax Cassandra 2.2, we were able to construct a setup that can:
– sustain one million writes per second to Cassandra with a median latency of 10.3 ms and 95% completing under 23 ms
– sustain a loss of ⅓ of the instances and volumes and still maintain the 1 million writes per second (though with higher latency)
• Read more here
– http://googlecloudplatform.blogspot.in/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html


    Who uses it?

    • eBay

    • Netflix

    • Instagram

    • Comcast

    • Safeway

    • Sky

    • CERN

    •  Travelocity

    • Spotify

    • Intuit

    • GE


    Gartner magic quadrant


     Types of NoSQL

•   Wide Row Store - Also known as wide-column stores, these databases store data in rows, and users are able to perform some query operations via column-based access. A wide-row store offers very high performance and a highly scalable architecture. Examples include: Cassandra, HBase, and Google BigTable.
•   Key/Value - These NoSQL databases are some of the least complex, as all of the data within consists of an indexed key and a value. Examples include Amazon DynamoDB, Riak, and Oracle NoSQL Database.
•   Document - Expands on the basic idea of key-value stores; "documents" are more complex, in that they contain data and each document is assigned a unique key, which is used to retrieve the document. These are designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Examples include MongoDB and CouchDB.
•   Graph - Designed for data whose relationships are well represented as a graph structure and whose elements are interconnected, with an undetermined number of relationships between them. Examples include: Neo4J and TitanDB.


    SQL and NoSQL

SQL:
1. Database; relational; strict models
2. Data in rows; pre-defined schema; SQL supports joins
3. Vertically scalable
4. Random access pattern support
5. Good fit for online transactional systems
6. Master-slave model
7. Periodic data replication as read-only copies in the slave
8. Availability model includes a slight downtime in case of outages

NoSQL:
1. Datastore; distributed & non-relational; flexible schema
2. Data in key-value pairs; flexible schema; no joins
3. Horizontally scalable
4. Designed for access patterns
5. Good for optimized read-based systems or high-availability write systems
6. C* is a masterless model
7. Seamless data replication
8. Masterless allows for no downtime


    Applicability of SQL / NoSQL

• No one single rule; it is very use-case specific
– High read (not write) oriented
– High write (not read) oriented
– Document based storage
– KV based storage
• Important to know the access patterns upfront
– Focus on modeling specific to the use case
• Very difficult to fix an improper model later on, unlike a database


    CAP Theorem

• ACID
–   Atomicity - requires that each transaction be "all or nothing"
–   Consistency - ensures that any transaction will bring the database from one valid state to another
–   Isolation - ensures that the concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially
–   Durability - means that once a transaction has been committed, it will remain so under all circumstances
• What is it? Can I have all?
–   Consistency - all nodes have the same data at all times
–   Availability - every request must receive a response
–   Partition tolerance - the system should run even if there is a partial failure
• CAP leads to BASE theory
–   Basically Available, Soft state, Eventual consistency


    AP systems, what about C?

[CAP triangle: RDBMS sits on the Consistency + Availability side; Google Big Table, HBase and MongoDB on the Consistency + Partition tolerance side; Cassandra on the Availability + Partition tolerance side]

How does this impact architecture?
1. Eventual consistency
2. Some application logic is needed


    Good fit use cases

• Store high volume stream data
– Sensor data
– Logs
– Events
– Online usage impressions
• Scenarios where the data cannot be recreated - it is lost if not saved
• No need for rollups or aggregations
– Another system needs to do this


Let's brainstorm your use case


    Part 2

    Introducing Cassandra


    Database or Datastore

• Where did the name "Cassandra" come from?
– Some interesting reference from Greek mythology
– What's the significance of the logo?
• Semantics - no real definition; both are OK
• I would like to call it a "Datastore"
• A bit of un-learning is required to get a hold of the "different" ways
• Inspiration from (think of it as object persistence)
– Google BigTable
– Dynamo DB from Amazon
• A few interesting observations about a datastore
– Put = insert/update (upsert?)
– Get = select
– Delete = delete
• Think HashMaps, Sorted Maps ...
• Storage mechanism
– B-tree vs LSM tree
• Row oriented datastore
• Biggest of all - no SPOF (masterless)


    Masterless architecture

[Diagram: clients 1, 2 and 3 each connect directly to different replicas R1, R2 and R3 - any node can serve any client]

Why do large malls have more than one entrance?


    Seed node(s)

• It is like the recruitment department
• Helps a new node come on board
• More than one seed is a very good practice
• One per rack can be a good approach
• Starts the gossip process in new nodes
• If you have more than one DC then include seed nodes from each DC
• No other purpose (see the sketch below)
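As a sketch of where seeds live (the IPs are placeholders, not from this deck), every node's cassandra.yaml carries a seed_provider section:

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # comma-separated list; include seeds from each DC
          - seeds: "10.0.0.1,10.0.0.2"

A new node contacts one of the listed seeds to start gossiping and learn about the ring.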


    Virtual node, Ring architecture

• C* operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring
• In the early days the nodes had a range of tokens
• This had to be done at the time of setting up C*

[Diagram: a six-node ring without vnodes - each node owns one contiguous token range, e.g. Node 1 owns (F, A], Node 2 owns (A, B], and so on around the ring]

Only a few peers help bring a downed node back up, increasing the load on those selected peers.
What will happen if a node with elevated utilization goes down? ▶


    Gossip

• I think we all know what this means in English! ☺
• It is a communication mechanism among nodes
• Runs every second
• Passes state information
– Available / down / bootstrapping
– The load it is under
– Helps in detecting failed nodes
• Up to 3 other nodes talk to one another
• A way any node learns about the cluster and other nodes
• Seed nodes play a critical role to start the gossip among nodes
• It is versioned; older states are erased
• Once gossip has started on a node, the stats about other nodes are stored locally so that a restart can be fast
• Local gossip info can be cleaned if needed


    Partitioner

• How do you all organize your cubicle/office space? Where will your stuff be? Assume the floor you are on is about 25,000 sq ft.
• A partitioner determines how to distribute the data across the nodes in the cluster
• Murmur3Partitioner - recommended for most purposes (the default strategy as well; see the token() sketch below)
• Once a partitioner is set for a cluster, it cannot be changed without a data reload
• The primary key is hashed using the Murmur hash to determine which node it needs to go to
• There is nothing called the "master replica"; all replicas are identical
• The token range it can produce is -2^63 to 2^63 - 1
• Wikipedia details
– MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup. It was created by Austin Appleby in 2008, and exists in a number of variants, all of which have been released into the public domain. When compared to other popular hash functions, MurmurHash performed well in a random distribution of regular keys.
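To see the partitioner at work, CQL exposes a token() function; a minimal sketch against the meterdata.bill_data table used elsewhere in this deck (the column name is an assumption):

-- shows the Murmur3 token that decides which node owns each row
SELECT meter_id, token(meter_id) FROM meterdata.bill_data LIMIT 3;

The values returned fall in the -2^63 to 2^63 - 1 range noted above.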


    Replication

• How many of you keep multiple copies of your most critical files - pwd file, IT return confirmation etc. - on external drives?
• Replication determines how many copies of the data will be maintained in the cluster across nodes - the replication factor
• There is no single magic formula to determine the correct number
• The widely accepted number is 3, but your use case can be different
• This has an impact on the number of nodes in the cluster - you cannot have high replication with few nodes


    Partitioner and Replication

• The position of the first copy of the data is determined by the partitioner, and copies are placed by walking the cluster in a clockwise direction
• For example, consider a simple 4 node cluster where all of the partition keys managed by the cluster were numbers in the range of 0 to 100
• Each node is assigned a token that represents a point in this range. In this simple example, the token values are 0, 25, 50, and 75
• The first node, the one with token 0, is responsible for the wrapping range (76-0)
• The node with the lowest token also accepts partition keys less than the lowest token and more than the highest token
• Once the first node is determined, the replicas will be placed on the following nodes in clockwise order

[Diagram: four-node ring with tokens 0, 25, 50 and 75 owning Data Range 1 (1-25), Data Range 2 (26-50), Data Range 3 (51-75) and Data Range 4 (76 + wrapping range)]


    Snitch

• What can you typically find at the entrance of a very large mall? What's the need for a "Can I help you?" desk?
• The objective is to group machines into racks and data centers to set up a DC/DR
• Informs the partitioner about the rack and DC locations - determines which nodes the replicas need to go to
• Identifies which DC and rack a node belongs to
• Routes requests efficiently (replication)
• Tries not to have more than one replica in the same rack (may not be a physical grouping)
• Allows for a truly distributed, fault-tolerant cluster
• Seamless synchronization of data across DC/DR
• To be selected at the time of C* installation in the C* yaml file
• Changing a snitch is a long drawn-out process, especially if data exists in the keyspace - you have to run a full repair in your cluster


    Snitch

In addition, a "Google cloud snitch" is also available for GCE

    Mixed cloud = cloud + local nodes   ▶


    Snitch - property file

A bit more about the property file snitch:
• All nodes in the system must have the same cassandra-topology.properties file
• It is very useful if you cannot ensure the IP octets (this prevents the use of the rack-inferring snitch, where every octet means something: 10. ...)
• Will give you full control of your cluster and the flexibility of assigning any IP to a node (once assigned, remains assigned)
• It can be hard to maintain the information for a large cluster across multiple DCs, but it is worth it given the benefits


    Snitch - property file example

# Data Center One
19.82.20.3 DC1:RAC1
19.83.123.233 DC1:RAC1
19.84.193.101 DC1:RAC1
19.85.13.6 DC1:RAC1
19.23.20.87 DC1:RAC2
19.15.16.200 DC1:RAC2
19.24.102.103 DC1:RAC2

# Data Center Two
29.51.8.2 DC2:RAC1
29.50.10.21 DC2:RAC1
29.50.29.14 DC2:RAC1
53.25.29.124 DC2:RAC2
53.34.20.223 DC2:RAC2
53.14.14.209 DC2:RAC2

     The names like "DC1" and "DC2" are very important as we will see ...

Also, in production use GossipingPropertyFileSnitch (non-cloud; for cloud use the appropriate snitch for that cloud), where you keep the entries for the rack+DC in cassandra-rackdc.properties and C* will gossip this across the cluster automatically. cassandra-topology.properties is used as a fallback.


    Commodity vs Specialized hardware

Specialized hardware (vertical scalability):
1. Large CAPEX
2. Wasted/idle resources
3. Failure takes out a large chunk
4. Expensive redundancy model
5. One-shoe-fits-all model
6. Too much co-existence

vs commodity hardware (horizontal scalability):
1. Low CAPEX (rent on IaaS)
2. Maximum resource utilization
3. Failure takes out a small chunk
4. Inexpensive redundancy
5. Specific h/w for specific tasks
6. Very little to no co-existence

This is what Google, LinkedIn and Facebook do. The norm is now being adopted by large corporations as well.
"Just in time" expansion - stay in tune with the load. No need to build ahead of time in anticipation.


    Elastic Linear Scalability

[Diagram: an 8-node cluster (n1-n8) grows into a 10-node cluster (n1-n10)]

1. Need more capacity? Add servers
2. Auto re-distribution of the cluster
3. Results in linear performance increase


    Debate - Failover design strategy

Scenario
• Maintain a small failover pool in the data center, and then simply add multiple nodes if a failure occurs
• Is this "just in time" strategy correct?

Analysis
• The additional overhead of bootstrapping the new nodes will actually reduce capacity at the critical point when capacity is needed most, perhaps creating a domino effect leading to a severe outage


Debate - what if all machines are not similar?

[Diagram: six nodes n1-n6 with differing hardware in one cluster]

State your observations based on typical machine characteristics:
1. CPU, memory
2. Storage space (consider externally mounted space as local)
3. Network connectivity
4. Query performance and "predictability" of queries


    Deployment - 4 dimensions

1. Node - one C* instance
2. Rack - a logical set of nodes
3. DC-DR - a logical set of racks
4. Cluster - nodes that map to a single token ring (can be across DCs), next slide ...


    Distributed deployment

// These are the C* nodes that the DAO will look to connect to
public static final String[] cassandraNodes = { "10.24.37.1", "10.24.37.2", "10.24.37.3" };

public enum clusterName { events, counter }

// You can build with policies like withRetryPolicy(), withLoadBalancingPolicy(round robin)
cluster = Cluster.builder().addContactPoints(Config.cassandraNodes).build();

// connect to the "events" keyspace
session1 = cluster.connect(Constants.clusterName.events.toString());

[Diagram: cluster "SKL-Platform" spans Region 1 (the DC, with racks 1-3) and Region 2 (the DR, with racks A and B)]


    A bit more about multi DC setup

• When using NetworkTopologyStrategy, you set the number of replicas per data center
• For example, if you set the replicas as 4 (2 in each DC) then this is what you get ...

ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 2};

Then run nodetool repair on each affected node, wait for it to come up and move on to the next node.

[Diagram: replicas R1-R4 placed across Data Center 1 (nodes 1-6 in Rack1/Rack2) and Data Center 2 (nodes 7-12 in Rack1/Rack2)]


    Debate - replication setting

• Let's assume we have the cluster as follows
– DC1 having 5 nodes
– DC2 having 5 nodes
• What will happen if the following statements are executed (assume no syntax errors)?
– create keyspace if not exists test with replication = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 10 };
– create table test ( col1 int, col2 varchar, primary key(col1));
– insert into test (col1, col2) values (1, 'Sir Dennis Ritchie');
– Will this succeed?


    Regions and Zones

• Many IaaS providers (Google / Amazon) have multiple data centers around the world called "regions"
• Each region is further divided into availability "zones"
• Depending on your IaaS, or even your own DC, you must plan the topology in such a way that
– A client request should not travel over the network too long to reach C*
– An outage should not take out a significant chunk of your cluster
– Replication across DCs should not add to the latency for the client


    Part 3

    Setup & Installation


    Let's get going! Installation

• Acquiring and installing C*
– Understand the key components of cassandra.yaml
• Configuration and installation structure
– Data directory
– Commit log directory
– Cache directory
• System log configuration
– cassandra-env.ps1 (change the log file position)
– Default is $logdir = "$env:CASSANDRA_HOME\logs"
– In logback.xml change from .zip to .gz
• ${cassandra.logdir}/system.log.%i.gz
• Creating a cluster
• Adding nodes to a cluster
• Multiple seed nodes, ring management
• Sizing is specific to the application - discussion
• Clock synchronization and its implications
– Network Time Protocol (NTP)


    Firewall and ports

• Cassandra primarily operates using the following ports
– 7000 (or 7001 if SSL) - internode cluster communication
– 7199 - JMX port
– 9160 / 9042 - client connections
– 22 - for SSH
• The firewall must have these ports open for each node in the cluster/DC


    Cassandra.yaml - an introduction

     There are more configuration parameters, but we will cover those as we move along ... ▶


    Cassandra-env.sh


    CQL shell - cqlsh

• Command line tool
• Specify the host and port (none means localhost)
• Provides tab completion
• Supports two different command sets
– DDL
– Shell commands (describe, source, tracing, copy etc)
• Try some commands
– show version
– show host
– describe keyspaces
– describe keyspace
– describe tables
– describe table

The CQL shell "source" command - for loading a few million rows.
"sstableloader", which is another tool - for beyond a few million rows.


    Part 4

    Prelude to Modeling


    Keyspace aka Schema

Cassandra:
create keyspace [IF NOT EXISTS] meterdata with a replication strategy (optionally with a DC/DR setup) and durable writes (true/false)

Database (MySQL):
create database meterdata;

It is like the database (MySQL) or the schema/user (Oracle).
durable_writes = false bypasses the commit log. You can lose data. A sketch follows.
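A hedged CQL sketch of the keyspace creation described above (the DC names and RF values are illustrative):

CREATE KEYSPACE IF NOT EXISTS meterdata
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}
  AND durable_writes = true;  -- false would bypass the commit log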


    Admin/System keyspaces

• SELECT * FROM
– system.schema_keyspaces
– system.peers
– system.schema_columnfamilies
– system.schema_columns
– system.local (info about itself)
– system.hints
– system. ...
• system.local contains information about nodes (a superset of gossip)


    Column Family (CF) / Table

Cassandra:
create table meterdata.bill_data (...) with a primary key (compound key), compaction settings and clustering order. A PK is mandatory in C*.

Database:
create table meterdata.bill_data (...) with pk, references, engine, charset etc.

Standard SQL DML statements work:
Insert into bill_data () values ();
Update bill_data set = .. where ..;
Delete from bill_data where ..;

What happens if we insert with a PK that already exists? Hold on to your thoughts ...


    PK aka Partition key aka Row key

• The partition key determines which node the data will reside on
– Simple key
– Composite key
• It is the unit of replication in a cluster
– All data for a given PK replicates around the cluster
– All data for a given PK is co-located on a node
• The PK is hashed
– Murmur3 hashing is preferred
– Random is based on MD5 (not considered secure)
– Others are for legacy support only
• A partition is fetched from the disk in one disk seek; every partition requires a seek

PK -> Hash
meter_id_001 -> 558a666df55
meter_id_002 -> 88yt543edfhj
meter_id_003 -> aad543rgf54l

More on the PK during modeling ...


Visualizing the Primary Key

CREATE TABLE all_events (
  event_type int,
  date int,            // format as a yyyymmdd integer (more efficient)
  created_hh int,
  created_min int,
  created_sec int,
  created_nn int,
  data text,
  PRIMARY KEY ((event_type, date), created_hh, created_min)
);

(event_type, date) is the partition key; created_hh and created_min are the clustering columns; together they must be unique.

What are the components of the primary key?
Primary key = partition key(s) + clustering column(s)
Naturally, by definition, the primary key must be unique. A query sketch follows.
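A sketch of how the key parts constrain queries (the values are illustrative): the full partition key must be supplied, and clustering columns can then be filtered left to right:

SELECT data FROM all_events
WHERE event_type = 1 AND date = 20140126   -- full partition key
  AND created_hh = 9 AND created_min >= 30; -- clustering columns, in declared order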


    Visualizing partition key based storage

• Question - how does C* keep the data, based on the primary key definition of a given table?

[Diagram: partition key Meter_id_001 with clustering columns for each date (26-Jan-2014, 27-Jan-2014, 28-Jan-2014, ... 15-Dec-2014) and their readings stored as one contiguous chunk of data on disk - the data grows per partition key]


    Failure Tolerance By Replication

Remember the CAP theorem? C* gives availability and partition tolerance.

Replication Factor (RF) = 3
Insert a record with PK = B

[Diagram: 8-node ring A-H; the record lands on three replicas - the nodes holding ranges {B, A, H}, {C, B, A} and {D, C, B}]

private static final String allEventCql = "insert into all_events (event_type, date, created_hh, created_min, created_sec, created_nn, data) values (?, ?, ?, ?, ?, ?, ?)";
Session session = CassandraDAO.getEventSession();
BoundStatement boundStatement = getWritableStatement(session, allEventCql);
session.execute(boundStatement.bind(.., .., ..)); ▶


    PK - data partitioning

• Every node in the cluster is an owner of a range of tokens that are calculated from the PK
• After the client sends the write request, the coordinator uses the configured partitioner to determine the token value
• Then it looks for the node that has the token range (primary range) and puts the first replica there
• A node can have other ranges as well, but has one primary range
• Subsequent replicas are placed with respect to the node of the first copy/replica
• Data with the same partition key will reside on the same physical node
• Clustering columns (in case of a compound PK) do not impact the choice of node
– The default sort based on clustering columns is ASC


    Coordinator Node

Incoming requests (read/write): the coordinator handles the request.

An interesting aspect to note is that the data (each column) is timestamped using the coordinator node's time.

Every node can be a coordinator -> masterless.

[Diagram: a client request lands on one node of the 8-node ring, which coordinates the operation against replicas 1, 2 and 3]

Hinted handoff a bit later ...


    Consistency

Tunable at runtime:
• ONE (default)
• QUORUM (strict majority w.r.t. RF)
• ALL
Applies to both reads and writes.

protected static BoundStatement getWritableStatement(Session session, String cql, boolean setAnyConsistencyLevel) {
  PreparedStatement statement = session.prepare(cql);
  if (setAnyConsistencyLevel) {
    statement.setConsistencyLevel(ConsistencyLevel.ONE);
  }
  BoundStatement boundStatement = new BoundStatement(statement);
  return boundStatement;
}

Other consistency levels are ANY, LOCAL_QUORUM, TWO, THREE, LOCAL_ONE (see the documentation on the web). A cqlsh sketch follows.
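In cqlsh the same knob can be set per session with the CONSISTENCY shell command:

CONSISTENCY QUORUM;  -- applies to subsequent reads and writes in this session
CONSISTENCY;         -- shows the current level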


    What is a Quorum?

quorum = RoundDown(sum_of_replication_factors / 2) + 1
sum_of_replication_factors = datacenter1_RF + datacenter2_RF + ... + datacenterN_RF

• Using a replication factor of 3, a quorum is 2 nodes. The cluster can tolerate 1 replica down.
• Using a replication factor of 6, a quorum is 4. The cluster can tolerate 2 replicas down.
• In a two data center cluster where each data center has a replication factor of 3, a quorum is 4 nodes. The cluster can tolerate 2 replica nodes down.
• In a five data center cluster where two data centers have a replication factor of 3 and three data centers have a replication factor of 2, a quorum is 6 nodes.


    Write Consistency

Write ONE

The coordinator sends the request to all replicas in the cluster applicable to the PK.

[Diagram: the coordinator fans the write out to replicas 1, 2 and 3 on the 8-node ring]


    Write Consistency

Write ONE

Send requests to all replicas in the cluster applicable to the PK; wait for ONE ack before returning to the client.

[Diagram: replica 1 acks in 5 μs and the coordinator returns to the client]


    Write Consistency

Write ONE

Send requests to all replicas in the cluster applicable to the PK; wait for ONE ack before returning to the client. The other acks arrive later, asynchronously.

[Diagram: the replicas ack in 5 μs, 10 μs and 120 μs]


    Write Consistency

Write QUORUM

Send requests to all replicas; wait for QUORUM acks before returning to the client.

[Diagram: the replicas ack in 5 μs, 10 μs and 120 μs; the coordinator returns after a quorum of acks]


    Read Consistency

Read ONE

Read from one node among all replicas.

[Diagram: the coordinator reads from a single replica on the 8-node ring]


    Read Consistency

Read ONE

Read from one node among all replicas; contact the fastest node (based on stats).

[Diagram: the coordinator picks the best-performing replica]


    Read Consistency

Read QUORUM

Read the full data from the one fastest node.

[Diagram: the coordinator requests the data from one replica]


    Read Consistency

Read QUORUM

Read from the one fastest node AND request digests from the other replicas to reach QUORUM.

[Diagram: the coordinator fetches the data from one replica and digests from the others]


    Read Consistency

Read QUORUM

Read from the one fastest node AND request digests from the other replicas to reach QUORUM. Return the most up-to-date data to the client.

Read repair a bit later ...

[Diagram: the coordinator merges the responses and answers the client]

How to maintain consistency across nodes?

• C* allows for tunable consistency, and hence we have eventual consistency
• For whatever reason there can be inconsistency of data among nodes
• Inconsistencies are fixed using
– Read repair - update data partitions during reads (aka anti-entropy ops)
– Nodetool - regular maintenance
• Let's see consistency in action ...


    Consistency in Action

RF = 3, Write ONE, Read ONE

Write ONE: B - a write must be written to the commit log and memtable of at least one replica node.
Read ONE: A - returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.

[Diagram: while replication is in progress the three replicas hold B, A, A - a Read ONE may return the stale A]


    Consistency in Action

RF = 3, Write ONE, Read QUORUM

Write ONE: B - a write must be written to the commit log and memtable of at least one replica node.
Read QUORUM: A - returns the record after a quorum of replicas has responded from any data center.

[Diagram: the replicas hold B, A, A while replication is in progress]


    Consistency in Action

RF = 3, Write ONE, Read ALL

Write ONE: B - a write must be written to the commit log and memtable of at least one replica node.
Read ALL: B - returns the record after all replicas have responded. The read operation will fail if a replica does not respond.

[Diagram: the replicas hold B, A, A while replication is in progress]


    Consistency in Action

RF = 3, Write QUORUM, Read ONE

Write QUORUM: B - a write must be written to the commit log and memtable on a quorum of replica nodes.
Read ONE: A - returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.

[Diagram: the replicas hold B, B, A while replication is in progress]


    Consistency in Action

RF = 3, Write QUORUM, Read QUORUM

Write QUORUM: B
Read QUORUM: B

[Diagram: the replicas hold B, B, A while replication is in progress - a quorum read still sees the latest B]


    Consistency Level

ONE - fast writes; may not read the latest written value
QUORUM / LOCAL_QUORUM - strict majority w.r.t. the replication factor; a good balance
ALL - not the best choice; slow, no high availability

if (nodes_written + nodes_read) > replication factor
then you can get immediate consistency


Debate - Immediate consistency

• The following consistency levels give immediate consistency
– Write CL = ALL, Read CL = ONE
– Write CL = ONE, Read CL = ALL
– Write CL = QUORUM, Read CL = QUORUM
• Debate: which CL combination will you consider in which scenarios?
• The combinations are
– Few writes but read heavy
– Heavy writes but few reads
– Balanced


Hinted Handoff

[Diagram: the coordinator on the 8-node ring stores a hint for offline replica 3]

• A write request is received by the coordinator
• The coordinator finds that a node is down/offline
• It stores "hints" on behalf of the offline node if the coordinator knows in advance that the node is down. A handoff is not taken if the node goes down at the time of the write.
• Read repair / nodetool repair will fix inconsistencies.

A hinted handoff is stored in the system.hints table: {location of the replica, version meta, actual data}. It is taken only when the write consistency level can be satisfied and there is a guarantee that any read consistency level can read the record.


    Hinted Handoff 

• When the offline node comes up, the coordinator forwards the stored hints
• The node syncs up its state with the hints
• The coordinator cannot perpetually hold the hints
• Configure an appropriate duration for any node to hold hints - the default is 3 hrs (a sketch follows)

[Diagram: the coordinator replays its stored hints to node 3 once it is back online]
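That hold duration maps to a cassandra.yaml setting; a sketch with its default value:

# how long a coordinator keeps hints for a dead node (3 hours)
max_hint_window_in_ms: 10800000

After this window the node must be repaired rather than caught up from hints.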


Consistency level - ANY

• This is a special consistency level where even if a node is down the write will succeed using a hinted handoff
• Use it very cautiously

What is an Anti-entropy op?


• During reads the full data is requested from ONE replica by the coordinator
• The others send a digest, which is a hash of the data, to the coordinator
• The coordinator checks whether the data and the digests match
• If they do not, then the coordinator takes the latest (n4) and merges it into the result
• Then the stale nodes (n2, n3) are updated automatically in the background
• Happens on a configurable % of reads and not for all, because there is an overhead
• read_repair_chance is the configuration (defaults to 10% or 0.1) = a 10% chance that a read repair will happen when a read comes in
• This is defined per table / column family (see the ALTER TABLE sketch below)

Repair with nodetool, a bit later ...

[Diagram: on the 8-node ring the coordinator receives "Anderson ts=268" from one stale replica and "Anderson ts=521" plus "Neo ts=851" from the others; the columns with the latest timestamps win and the stale replica is repaired]
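The per-table knob mentioned above is set in CQL; a minimal sketch (the table name is assumed):

ALTER TABLE meterdata.bill_data WITH read_repair_chance = 0.1;  -- 10% of reads trigger a background repair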


    Part 5

    Write and Read path

    Why C* writes fast


RDBMS: seeks and writes values to various pre-set locations.
Cassandra: continuously appends, merging and compacting when appropriate.

Try reading a book that is organized as follows: P1, P23, P45, P2, P6, P99, P125, P8 ...
How about a book that looks like: P1, P2, P3, P4, P5, P6, ...?

    Storage


• Writes are harder than reads to scale
• Spinning disks aren't good with random I/O
• Goal: minimize random I/O
– Log-Structured Merge tree (LSM)
– Push changes from a memory-based component to one or more disk components

    Some more on LSM


• An LSM tree is composed of two or more tree-like component data structures
• The smaller component C0 sits in memory - aka memtables
• The larger component C1 sits on disk
– aka SSTables (Sorted String Tables)
– Immutable components
• Writes are inserted into C0 and the logs
• In time, C0 is flushed to C1
• C1 is optimized for sequential access

    Key components for Write


Each node implements the following key components (3 storage mechanisms and one process) to handle writes:

1. Memtables - the in-memory tables corresponding to the CQL tables, along with the related indexes
2. CommitLog - an append-only log, replayed to restore node memtables
3. SSTables - memtable snapshots periodically flushed (written out) to free up heap
4. Compaction - a periodic process to merge and streamline multiple generations of SSTables


    Cassandra Write Path

[Diagram: on a single node, a client write arrives at the coordinator and, with durable_writes = TRUE, is (1) appended to a commit log on disk and (2) appended to the memtable corresponding to the CQL table (MemTable 1..n); memtables are later written out as SSTables (SSTable1 for Table1, SSTable2 for Table2)]

Related configurations are commitlog_sync (batch/periodic), commitlog_sync_batch_window_in_ms and commitlog_sync_period_in_ms.


    Cassandra Write Path

[Diagram: (1) a memtable exceeds a threshold and is flushed to a new SSTable on disk; (2) the corresponding commit log segments are cleaned and the heap for that memtable is cleared]

• A new SSTable is created based on the oldest commit log - a very fast sequential write operation
• Multiple generations of SSTables accumulate and are compacted into one SSTable (select * from system.sstable_activity;)
• Writes are completely isolated from reads
• No updated columns are visible until the entire row is finished (technically, the entire partition)

    Column family data state


A table's data = its memtable + all of its SSTables that have been flushed

    Memtables & SSTables


• Memtables and SSTables are maintained per table
– SSTables are immutable; they are not written to again after the memtable is flushed
– A partition is typically stored across multiple SSTable files
– What is sorted? Partitions are sorted and stored
• For each SSTable, C* creates these structures
– Partition index - a list of partition keys and the start position of rows in the data file (on disk)
– Partition index summary (an in-memory sample to speed up reads)
– Bloom filter (optional; depends on the % false positive setting)

    Offheap memtables


• As of C* 2.1 a new parameter has been introduced (memtable_allocation_type). This capability is still under active development and performance is expected to get better over time
• Heap buffers - the current default behavior
• Off-heap buffers - moves the cell/column name and value to DirectBuffer objects. This has the lowest impact on reads - the values are still "live" Java buffers - but only reduces heap significantly when you are storing large strings or blobs
• Off-heap objects - moves the entire cell off heap, leaving only the NativeCell reference containing a pointer to the native (off-heap) data. This makes it effective for small values like ints or uuids as well, at the cost of having to copy it back on-heap temporarily when reading from it (likely to become the default in C* 3.0)
– Writes are about 5% faster with offheap_objects enabled, primarily because Cassandra doesn't need to flush as frequently. Bigger SSTables mean less compaction is needed.
– Reads are more or less the same
• For more information http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1

    Commit log


• It is replayed in case a node went down and wants to come back up
• This replay recreates the memtables for that node
• The commit log comprises pieces (files) whose size can be controlled (commitlog_segment_size_in_mb)
• Total commit log space is a controlled parameter as well (commitlog_total_space_in_mb)
• The commit log itself is also accrued in memory. It is then written to disk in two ways
– Batch - all acks to requests wait until the commit log is flushed to disk
– Periodic - the request is immediately acked but the commit log is flushed after some time. If a node were to go down in this period then data can be lost if RF=1

Debate - the "periodic" setting gives better performance and we should use it. How to avoid the chances of data loss? (The yaml sketch below shows both modes.)
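A cassandra.yaml sketch of the two modes debated above (the values shown are the usual defaults, for illustration):

commitlog_sync: periodic             # ack immediately, flush every period
commitlog_sync_period_in_ms: 10000
# or, trading write latency for durability:
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2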

    When does the flush trigger?


• Memtable total space in MB is reached
– The default is 25% of the JVM heap
– Can be more in case of off-heap
• OR commit log total space in MB is reached
• OR a manual flush is issued (nodetool flush)
• The default settings are usually good and are there for a reason

    Data file name structure


• The data directories are created based on keyspaces and tables
– .../data/keyspace/table
• The actual DB file is named as
– keyspace-table-format-generationNum-component
• The component can be
– CompressionInfo - compression metadata
– Data - PK, data size, column index, row-level tombstone info, column count, column list in sorted order by name
– Filter - Bloom filter
– Index - the index; also includes Bloom filter info, tombstones
– Statistics - histograms for row size, generation numbers of the files from which this SSTable was compacted
– Summary - index summary (loaded in memory for read optimization)
– TOC - list of files
– Digest - a text file with a digest
• Format - internal C* format, e.g. "jb" is the C* 2.0 format; 2.1 may have "ka"


    Cassandra Read Path

SELECT ... FROM ... WHERE #partition = ...;

1. The row cache will be checked if it is enabled.

[Diagram: the query consults the RowCache before touching SSTable1-3]

    Cassandra Read Path


SELECT ... FROM ... WHERE #partition = ...;

2. Row cache MISS - check the Bloom filters.

• One Bloom filter exists per SSTable and memtable that the node is serving
• A probabilistic filter that answers "maybe the partition key exists" in its corresponding SSTable, or a definite "NO"

    Cassandra Read Path


SELECT ... FROM ... WHERE #partition = ...;

3. If any of the Bloom filters return a possible "YES", the partition key may exist in one or more of the SSTables.

• Proceed to look at the partition key cache for the SSTables in which there is a probability of finding the partition key; ignore the other SSTables' key caches
• Physically a single key cache is maintained ... next, more on the partition key cache ▶

    Partition Key Cache


[Diagram: the partition key cache maps recently read partition keys to their offsets in the SSTable, e.g. #Partition001 -> 0x0, #Partition002 -> 0x153, #Partition350 -> 0x5464321]

The partition key cache stores the offset position of the partition keys that have been read recently.

    Cassandra Read Path


SELECT ... FROM ... WHERE #partition = ...;

4. If the partition key is not found in the key cache, then the read proceeds to check the "key index sample" or "partition summary" (in memory). It is a subset of the "partition index", which is the full index of the SSTable.

Next, a bit more on the key index sample ...

    Key Index Sample


[Diagram: the in-memory key index sample maps every 128th partition key to its offset, e.g. #Partition001 -> 0x0, #Partition128 -> 0x4500, #Partition256 -> 0x851513, #Partition512 -> 0x5464321]

Think of the offset as an absolute disk address that fseek in C can use.
The ratio of the key index sample is 1 per 128 keys.

    Cassandra Read Path - complete flow


SELECT ... FROM ... WHERE #partition = ...;

[Diagram: 1 row cache -> 2 Bloom filter -> 3 partition key cache -> 4 key index sample -> 5 partition index -> 6 SSTable data, merged with the memtable]

Merge based on the most recent timestamp of the columns:
1. The memtable
2. One or more SSTables
(more on the merge process a bit later, in LWW ...)

Also update the row cache (if enabled) and the key cache.
The coordinator will merge based on timestamps as well, according to the consistency level.

    Bloom filters in action


A setting change takes effect when the SSTables are regenerated. In a development env you can use nodetool/CCM scrub, BUT it is time intensive.

    Row caching

Caches are also written to disk so that they come alive after a restart. Global settings are also possible in the C* yaml file.

    Key caching


• Stored per SSTable even though physically a single common cache serves all SSTables
• Stores only the key
• Reduces the seek to just one read per replica (it does not have to look into the disk version of the index, which is the full index)
• It is on by default
• This also periodically gets saved on disk
• Changing index_interval increases/decreases the sampling gap
– Increasing the gap means more disk hits to get to the partition key index
– Reducing the gap means more memory, but you get to the partition key without looking into the disk-based partition index

    Eager retry


• If a node is slow in responding to a request, the coordinator forwards it to another node holding a replica of the requested partition
• Valid if RF > 1
• A C* 2.0+ feature

[Diagram: the client driver reads from Node1; when it is slow, the request is retried against another replica]

ALTER TABLE users WITH speculative_retry = '10ms';
Or,
ALTER TABLE users WITH speculative_retry = '99percentile';

    Last Write Win (LWW)


INSERT INTO users (login, name, age) VALUES ('jdoe', 'John DOE', 33);

[Partition #jdoe: Age = 33, Name = John DOE]


    Last Write Win (LWW)


UPDATE users SET age = 34 WHERE login = 'jdoe';

[SSTable1: #jdoe Age(t1) = 33, Name(t1) = John DOE]   [SSTable2: #jdoe Age(t2) = 34]

Remember that SSTables are immutable; once written they cannot be updated. Assuming a flush occurs, the update creates a new SSTable.

    Last Write Win (LWW)


DELETE age FROM users WHERE login = 'jdoe';

[SSTable1: #jdoe Age(t1) = 33, Name(t1) = John DOE]   [SSTable2: #jdoe Age(t2) = 34]   [SSTable3: #jdoe Age(t3) = X (tombstone, RIP)]

    Last Write Win (LWW)


SELECT age FROM users WHERE login = 'jdoe';

[SSTable1: Age(t1) = 33 ?]   [SSTable2: Age(t2) = 34 ?]   [SSTable3: Age(t3) = X ?]

Where to read from? How to construct the response?

    Last Write Win (LWW)


SELECT age FROM users WHERE login = 'jdoe';

[SSTable1: Age(t1) = 33 ✗]   [SSTable2: Age(t2) = 34 ✗]   [SSTable3: Age(t3) = X ✔ - the newest timestamp wins]

    Compaction


[Diagram: SSTable1 (#jdoe Age(t1) = 33, Name(t1) = John DOE), SSTable2 (#jdoe Age(t2) = 34) and SSTable3 (#jdoe Age(t3) = tombstone) are compacted into a new SSTable holding only #jdoe Name(t1) = John DOE]

• Related SSTables are merged
• The most recent version of each column is compiled into one partition in one new SSTable
• Columns marked for eviction/deletion are removed
• Old-generation SSTables are deleted and the newly freed space becomes available for reuse
– Disk space freed = sizeof(SST1 + SST2 + .. + SSTn) - sizeof(new compact SST)
• A new-generation SSTable is created (hence the generation number keeps increasing over time)
• Commit logs (containing segments) are versioned and reused as well

    Compaction


• As per the Datastax documentation
– Cassandra 2.1 improves read performance after compaction by performing an incremental replacement of compacted SSTables.
– Instead of waiting for the entire compaction to finish and then throwing away the old SSTable (and cache), Cassandra can read data directly from the new SSTable even before it finishes writing.
• Three strategies (a sketch follows)
– SizeTiered (default)
– Leveled (needs about 50% more I/O than size-tiered, but the number of SSTables visited for data will be less)
– DateTiered
• The choice depends on the disk
– Mechanical disk = size-tiered
– SSD = leveled
• The choice depends on the use case too
– Size - best when you have insert-heavy, read-light workloads
– Level - best suited for column families with read-heavy workloads that have frequent updates to existing rows
– Date - for time series data along with TTL
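The strategy is a per-table option; a hedged CQL sketch (the table name is assumed):

ALTER TABLE meterdata.bill_data
  WITH compaction = {'class': 'LeveledCompactionStrategy'};

SizeTieredCompactionStrategy and DateTieredCompactionStrategy are selected the same way via the 'class' key.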

    SizeTiered compaction - a brief 


• With size-tiered compaction, similarly sized SSTables are compacted into larger tables once a certain number accumulate (default 4)
• A row can exist in multiple SSTables, which can result in a degradation of read performance. This is especially true if you perform many updates or deletes alongside heavy reads
• It can be a good strategy for write-heavy scenarios

    LeveledTiered compaction


• Leveled compaction creates SSTables of a fixed, relatively small size (5MB by default in Cassandra's implementation) that are grouped into "levels"
• Within each level, SSTables are guaranteed to be non-overlapping. Each level is ten times as large as the previous
• Leveled compaction guarantees that 90% of all reads will be satisfied from a single SSTable (assuming nearly uniform row size). The worst case is bounded at the total number of levels - e.g. 7 for 10TB of data

    DateTiered compaction


• This particularly applies to time series data where the data lives for a specific time
• Use the DateTiered compaction strategy
• C* looks at the min and max timestamp of the SSTable and finds out if anything is really live
• If not, then that SSTable is just unlinked and compaction with another is fully avoided
• Set the TTL in the CF definition so that if bad code inserts without a TTL it does not cause unnecessary compaction (a sketch follows)
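The table-level TTL guard from the last bullet is a CQL option; a sketch using the trunc_events table from later in this deck:

ALTER TABLE trunc_events WITH default_time_to_live = 300;  -- seconds, applied when a write carries no explicit TTL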

    "Coming back to life"


• Let's assume multiple replicas exist for a given column
• One node was NOT able to record a tombstone because it was down
• If this node remains down past gc_grace_seconds without a repair, then it will contain the old record
• Other nodes meanwhile have evicted the tombstoned column
• When the downed node does come up, it will bring the old column back to life on the other nodes!
• To ensure that deleted data never resurfaces, make sure to run repair at least once every gc_grace_seconds, and never let a node stay down longer than this time period (a sketch follows)
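gc_grace_seconds is itself a per-table CQL option; a sketch (the table name is assumed; 864000 is the default):

ALTER TABLE meterdata.bill_data WITH gc_grace_seconds = 864000;  -- 10 days

The repair cadence then has to be shorter than this value.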

    NTP is very important


• A microsecond timestamp is associated with the columns
• The timestamp used is that of the machine on which C* is running (the coordinator)
• If the clocks of the machines in the cluster are off, then C* will be unable to determine which record is latest!
• This also means that compaction may not function
• You can have a scenario where a deleted record appears again, or you get an unpredictable number of records for your query

    Lightweight transactions


• Transactions are a bit different
– A "compare and set" model
• Two requests can agree on a single value creation
– One will succeed (uses the Paxos algorithm; Wikipedia has good text on this subject)
– The other will fail

INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', '[email protected]')
IF NOT EXISTS;

UPDATE customer_account
SET customer_email = '[email protected]'
IF customerID = 'LauraS';

Cassandra 2.1.1 and later supports non-equal conditions for lightweight transactions. You can use =, != and IN operators in WHERE clauses to query lightweight tables.

     Triggers


• The actual trigger logic resides outside, in a Java POJO
• The feature may still be experimental
• Has a write performance impact
• Better to have application logic do any pre-processing

CREATE TRIGGER myTrigger
  ON myTable
  USING 'Java class name';

    Debate - Locking in C*


[Diagram: instances 1-3 coordinate through Memcached/Redis. Step 1: an instance acquires a lock on a key (k=_0_0, k=_0_a, ... k=_z_z); if that FAILs it looks for the next key. Step 2: it gets an available ID from the C* ring that has flag=0 and starts with 0, then deletes this key from C*. Step 3: it releases the lock. The other instances work exactly like instance 1 - they acquire locks on other keys so that no two instances step on each other.]

Primary key = ((flag, starts_with), gen_id)


    Before you model ...


[Diagram] The C* way (QDD) starts from the end user - report, dashboard, business question, fast response - and works down to the model: storage, data types, keys. The RDBMS way starts from the data - users and structures - and builds the model up from there.


C* modelling is "Query Driven Design"

Give me the query and I will organize the data via a model to get you performance for that query.

    Denormalization


• Unlike RDBMS you cannot use any foreign key relationships
• FK and joins are not supported anyway!
• Embed details; it will cause duplication but that is alright
• Helps in getting the complete data in a single read
• More efficient, less random I/O
• May have to create more than one denormalized table to serve different queries (see the sketch below)
• UDT is a great way (UDT a bit later...)
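
A minimal sketch of the "more than one table" point (table names assumed): the same order is written to both tables, each keyed for a different query:

CREATE TABLE orders_by_customer (
  customer_id text,
  order_time  timestamp,
  order_id    uuid,
  total       decimal,
  PRIMARY KEY (customer_id, order_time)
);

CREATE TABLE orders_by_store (
  store_id   text,
  order_time timestamp,
  order_id   uuid,
  total      decimal,
  PRIMARY KEY (store_id, order_time)
);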

    Datatypes


• Int, decimal, float, double
• Varchar, text
• Collections - Map (K=V), List (ordered V), Set (unordered unique V) - see the sketch after this list
  – Designed to store a small amount of data (in hundreds)
  – Limit is 64K items
  – Max size per element is 64KB
  – Cannot be part of the primary key
  – Nesting is not allowed
• Timestamp, timeuuid
• Counter
• User defined types (UDT)
• There are other datatypes as well
  – frozen - e.g. a frozen tuple holding a position in 3D coordinate geometry
  – blob - good practice is to gzip and then store
• Static modifier - the value is shared by all rows of a partition, saves storage space BUT very specific usecase
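
A minimal collections sketch (table and column names assumed):

CREATE TABLE user_profile (
  user_id text PRIMARY KEY,
  emails  set<text>,
  tags    list<text>,
  prefs   map<text, text>
);

UPDATE user_profile SET emails = emails + {'[email protected]'} WHERE user_id = 'nm';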

    Choice of partition key


• If a key is not unique then C* will upsert with no warning!
• Good choices
  – Natural keys like an email address or meter id
  – Surrogate keys like uuid, timeuuid (uuid with a time component)
• CQL provides a number of timeuuid functions (see the sketch below)
  – now()
  – dateOf(tuuid) - extracts the timestamp as a date
  – unixTimestampOf(tuuid) - raw 64 bit integer
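
A sketch of these functions on an assumed readings table keyed by a timeuuid surrogate key:

CREATE TABLE readings (
  reading_id timeuuid PRIMARY KEY,
  value      double
);

INSERT INTO readings (reading_id, value) VALUES (now(), 42.5);

SELECT dateOf(reading_id), unixTimestampOf(reading_id) FROM readings;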

    Wide row? Use composite PK 


CREATE TABLE all_events (
  event_type  int,
  date        int,  // format as yyyymmdd
  created_hh  int,
  created_min int,
  created_sec int,
  created_nn  int,
  data        text,
  PRIMARY KEY ((event_type, date), created_hh)
);

Example of a compound PK ((...), ...) syntax

CREATE TABLE trunc_events (
  event_type int,
  date       int,  // format as yyyymmdd
  created_on timestamp,
  data       text,
  PRIMARY KEY ((event_type, date), created_on)
) WITH CLUSTERING ORDER BY (created_on DESC);

insert into trunc_events (event_type, date, created_on, data) values
(1, 20140624, '2014-06-26 12:47:54', '{"event" : "details of the event will go here"}') USING TTL 300;

Only event_type as the PK would store all events ever generated in a single partition. That can result in a huge partition and may breach C* limits.

    Partition key IN clause debate


[Diagram] Partition key = (meter_id, date). Each day is its own partition - Meter_id_001 / 26-Jan-2014, Meter_id_001 / 27-Jan-2014, Meter_id_001 / 28-Jan-2014, ... Meter_id_001 / 15-Dec-2014 - each holding that day's readings.

where meter_id = 'Meter_id_001' and date IN ( ... )

Each partition takes its own time t1, t2, t3, ... tn to fetch, with t1 ≠ t2 ≠ t3 ≠ ... ≠ tn, and the overall fetch time is the sum T = t1 + t2 + t3 + ... + tn.

Is T predictable? What are the factors responsible for determining the overall fetch time T?

This is also called a "multi get slice".

    Partition key IN clause (Slicing)


If you want to search across two partition keys then use "IN" in the where clause. You can have a further limiting condition based on clustering columns.

In a composite PK, IN works only on the last partition key column. E.g. for PK ((key1, key2), clust1, clust2), IN will work only on key2 (the last column only). A sketch follows below.
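
A sketch against the meter example above (table name assumed, PK ((meter_id, date), ...)):

SELECT * FROM meter_reading_by_day
WHERE meter_id = 'Meter_id_001'
  AND date IN (20140126, 20140127, 20140128);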

    Modeling notation


• Chen's notation for conceptual modeling helps in design

    Sample model contd ...


    Bitmap type of index example


• Multiple parts to a key
• Create a truth table of the various combinations
• However, inserts will increase to the number of combinations
• E.g. find a car (vehicle id) in a car park by variable combinations


YES, there will be more writes and more data BUT that is an acceptable fact in big data modeling that looks to optimize the query and user experience more than data normalization.

select * from skldb.car_model_index
where make='Ford'
  and model=''
  and color='Blue';

select * from skldb.car_model_index
where make=''
  and model=''
  and color='Blue';
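
The column family behind those queries is not shown here; a plausible sketch (layout assumed), where one row is written per truth-table combination and unused parts of the key are stored as empty strings:

CREATE TABLE skldb.car_model_index (
  make       text,
  model      text,
  color      text,
  vehicle_id text,
  level      int,   -- where the car is parked
  PRIMARY KEY ((make, model, color), vehicle_id)
);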


TTL - Time To Live


CREATE TABLE users (
  user_id    text PRIMARY KEY,
  first_name text,
  last_name  text,
  emails     set<text>,
  todo       map<timestamp, text>
);

INSERT INTO users (user_id, first_name, last_name, emails, todo)
VALUES ('nm', 'Nirmallya', 'Mukherjee',
  {'[email protected]', '[email protected]'},
  {'2014-12-25' : 'Christmas party', '2014-12-31' : 'At Gokarna'});

UPDATE users USING TTL 86400  -- TTL in seconds (example value)
  SET todo['2012-10-1'] = 'find water' WHERE user_id = 'nm';

Rate limiting is a good idea in many cases. E.g. allow password resets to only 3 per day per user. Use a TTL sliding window (see the sketch below).

A promotion is generated for a customer with a TTL.

A temporary login is generated and is valid for a given TTL period.

A TTL set on an individual statement overrides the table-level default TTL.
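
A minimal sketch of the sliding-window idea (table name assumed): each attempt expires on its own after 24 hours, so counting live rows gives the per-day rate.

CREATE TABLE password_resets (
  user_id      text,
  requested_at timeuuid,
  PRIMARY KEY (user_id, requested_at)
);

INSERT INTO password_resets (user_id, requested_at)
VALUES ('nm', now()) USING TTL 86400;

-- The application allows a reset only if the count is below 3:
SELECT COUNT(*) FROM password_resets WHERE user_id = 'nm';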

UDT - User Defined Types

create keyspace address_book with replication = { 'class': 'SimpleStrategy', 'replication_factor': 1 };


CREATE TYPE address_book.address (
  street   text,
  city     text,
  zip_code int,
  phones   set<text>
);

CREATE TYPE address_book.fullname (
  firstname text,
  lastname  text
);

CREATE TABLE address_book.users (
  id        int PRIMARY KEY,
  name      frozen <fullname>,
  addresses map<text, frozen <address>>  // a collection map
);

INSERT INTO address_book.users (id, name) VALUES (1, {firstname: 'Dennis', lastname: 'Ritchie'});

UPDATE address_book.users SET addresses = addresses + {'home': { street: '9779 Forest Lane', city: 'Dallas', zip_code: 75015, phones: {'001 972 555 6666'}}} WHERE id = 1;

SELECT name.firstname from users where id = 1;

CREATE INDEX on address_book.users (name);
SELECT id FROM address_book.users WHERE name = {firstname: 'Dennis', lastname: 'Ritchie'};


    Queries


Get message by user and message_id (date):

SELECT * FROM mailbox
WHERE login = 'jdoe'
  AND message_id = '2014-07-12 16:00:00';

Get messages by user and date interval:

SELECT * FROM mailbox
WHERE login = 'jdoe'
  AND message_id > minTimeuuid('2014-01-12 16:00:00')
  AND message_id < maxTimeuuid('2014-07-12 16:00:00');

CREATE TABLE mailbox (
  login        text,
  message_id   timeuuid,
  interlocutor text,
  message      text,
  PRIMARY KEY (login, message_id)
);

    Debate - are these correct?


Get message by message_id:

SELECT * FROM mailbox WHERE message_id = '2014-07-12 16:00:00';

Get messages by date interval:

SELECT * FROM mailbox WHERE message_id = '2014-01-12 16:00:00';


    WHERE clause restrictions


• All queries (INSERT/UPDATE/DELETE/SELECT) must provide the partition key (the WHERE clause can then filter within contiguous data)
• "ALLOW FILTERING" can override the default behaviour of cross partition access but it is not a good practice at all (it signals an incorrect data model)
  – select .. from meter_reading where record_date > '2014-12-15'
    (assuming there is a secondary index on record_date)
• Only exact match (=) predicates are allowed on the partition key; range predicates (<, >) apply to clustering columns only


• If the primary key is (industry_id, exchange_id, stock_symbol) then the following sort orders are valid
  – order by exchange_id desc, stock_symbol desc
  – order by exchange_id asc, stock_symbol asc
• The following is an example of an invalid sort order
  – order by exchange_id desc, stock_symbol asc
• ORDER BY must use the clustering columns in their defined order, either all in the defined direction or all reversed (see the sketch below)
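
A sketch with the key above (column types assumed):

CREATE TABLE stocks (
  industry_id  int,
  exchange_id  text,
  stock_symbol text,
  price        decimal,
  PRIMARY KEY (industry_id, exchange_id, stock_symbol)
);

SELECT * FROM stocks
WHERE industry_id = 1
ORDER BY exchange_id DESC, stock_symbol DESC;  -- valid: full reversal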

    Secondary Index


• Consider the earlier example of the table all_events; what will happen if we try to get the records based on minute?
  – Create a secondary index based on the minute (see the sketch below)
• Can be created on any column except counter, static and collection columns
• It is actually another, hidden table whose key is the indexed value and whose columns are the keys of the rows that contain that value
• Do not use secondary indexes
  – On high-cardinality columns, because you then query a huge volume of records for a small number of results
  – In tables that use a counter column
  – On a frequently updated or deleted column (too many tombstones)
  – To look for a row in a large partition unless narrowly queried
  – For high volume queries
• Secondary indexes are for searching convenience
  – Use with low-cardinality columns
  – Columns that may contain a relatively small set of distinct values
  – Use when prototyping, ad-hoc querying or with smaller datasets
• See this link for more details (http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html)
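
A minimal sketch against the all_events table (index name assumed):

CREATE INDEX events_by_minute ON all_events (created_min);

SELECT * FROM all_events WHERE created_min = 30;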

    Secondary Index - rebuilding


• While Cassandra's performance will always be best using partition-key lookups, secondary indexes provide a nice way to query data on small or offline datasets
• Occasionally, secondary indexes can get out-of-sync
• Unfortunately, there is no great way to verify secondary indexes, other than rebuilding them
• You can verify by a query using indexes, and look for significant inconsistencies on different nodes
• Also note that each node keeps its own index (they are not distributed), so you'll need to run this on each node
• Must run nodetool repair to ensure the index gets built on current data

nodetool rebuild_index my_keyspace my_column_family

    CQL Limits


Limits (as per Datastax documentation)
• Clustering column value, length of: 65535
• Collection item, value of: 2GB (Cassandra 2.1 v3 protocol), 64KB (Cassandra 2.0.x and earlier)
• Collection items, number of: 2B (Cassandra 2.1 v3 protocol), 64K (Cassandra 2.0.x and earlier)
• Columns in a partition: 2B
• Fields in a tuple: 32768, though try not to have 1000s of fields
• Key length: 65535
• Query parameters in a query: 65535
• Single column, value of: 2GB, though only a few MB are recommended
• Statements in a batch: 65535

    C* batch - Is it always a good idea?


• Leverages token aware policies
• Cluster load is evenly balanced
• Means if one fails we re-try ONLY that query
• Not an all or none success model

Tip: batches work well ONLY if all records have the same partition key. In this case all records go to the same node.

    Batch - use case


• Keep two tables in sync - the same event being stored in two tables, one by customer ID and the other by staff ID (see the sketch below)
• Emulating a classical database transaction - all or nothing!
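
A minimal sketch (table names assumed): a logged batch keeps the two denormalized views of one event in step, at the cost of the all-or-nothing coordination overhead:

BEGIN BATCH
  INSERT INTO events_by_customer (customer_id, event_time, data)
  VALUES ('cust42', '2014-06-26 12:47:54', 'ticket opened');
  INSERT INTO events_by_staff (staff_id, event_time, data)
  VALUES ('staff7', '2014-06-26 12:47:54', 'ticket opened');
APPLY BATCH;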

    Antipatterns - things to watch out!


◆ Updating columns as nulls in CQL - a tombstone is written for each null
◆ Singleton messages - an add, consume, remove loop
◆ Intense updates on a single column
◆ Dynamic schema changes (frequent changes) and topology changes in production are not a good idea
◆ Using C* as a queue of singletons (add/consume loop)

    But... I need a queue!


• Break the partition key to carry time buckets, e.g. Queue1 in the last hour
• At the time of select you can use the clustering column, e.g. Queue1 with a timestamp clustering column > SOME time. This helps in not going through all those tombstones (see the sketch below)
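
A minimal sketch of the bucketed-queue idea (names and bucket format assumed):

CREATE TABLE queue_events (
  queue_name  text,
  hour_bucket int,      -- e.g. yyyymmddhh
  created_at  timeuuid,
  payload     text,
  PRIMARY KEY ((queue_name, hour_bucket), created_at)
);

-- Readers scan only past their last-seen position, skipping the
-- tombstones left by already-consumed items:
SELECT * FROM queue_events
WHERE queue_name = 'Queue1'
  AND hour_bucket = 2014062612
  AND created_at > minTimeuuid('2014-06-26 12:30:00');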


    Part 7

    Applications with Cassandra


     The Client DAO and the Driver


• Session instances are thread-safe and usually a single instance is all you need per application. However, a given session can only be set to one keyspace at a time, so one instance per keyspace is necessary
• Use policies
  – Retry
  – Reconnect
  – LoadBalancing
• http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html

    C* client sample


http://www.datastax.com/drivers/java/2.1/index.html

    C* client sample - the entity



See this to work with UDT - http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/udtApi.html


    Part 8

    Administration

    C* != (Fire and forget)


C* is not a fire-and-forget system. It does require administration and monitoring to ensure the infrastructure performs optimally at all times.

    Monitoring - Opscenter


    Workload modeling

• Workload characterization
• Performance characteristics of the cluster
• Latency analysis of the cluster
• Performance of the OS
• Disk utilization
• Read / Write operations per second
• OS memory utilization
• Heap utilization

    Monitoring - Datastax Opscenter


Dropped MUTATION messages mean that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti-Entropy Repair (perhaps because of load, C* is defending itself by dropping messages).


    Monitoring - Node coming up


    Node down


    Node tool


• Very important tool to manage C* in production for day to day admin
• Nodetool has over 40 commands
• Can be found in the $CASSANDRA_HOME/bin folder
• Try a few commands
  – ./nodetool -h 10.21.24.11 -p 7199 status
  – ./nodetool -h 10.21.24.11 -p 7199 info


    Node repair


How to compare two large data sets?
• Assume two arrays
• Each containing 1 million numbers
• Further assume the order of storage is fixed
• How to compare the two arrays?

Potential solution
• Loop over one and check the other?
• How long will it take to loop over 1 million?
• What happens if the data gets into billions?
• Lots of inefficiencies!

Wikipedia definition of Merkle Tree


• A hash tree is a tree of hashes in which the leaves are hashes of data blocks in, for instance, a file or set of files. Nodes further up in the tree are the hashes of their respective children
• For example, hash 0 is the result of hashing the concatenation of hash 0-0 and hash 0-1. That is, hash 0 = hash( hash 0-0 + hash 0-1 ), where + denotes concatenation
• A Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels of its children nodes. Hash trees are useful because they allow efficient and secure verification of the contents of large data structures
• Currently the main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered, and even to check that the other peers do not lie and send fake blocks

    Node repair using Nodetool

• Prime objective of node repair is to detect and fix inconsistencies among nodes

• A per-node Merkle tree is created
• It provides an efficient way to find differences in data blocks
• Reduces the amount of data transferred to compare the data blocks
• C* uses a fixed depth tree of 15 levels having 32K leaf nodes
• Leaf nodes can represent a range of data because the depth is fixed

    Node repair process

• When the nodetool repair command is executed, the target node, specified with the -h option in the command, coordinates the repair of each column family in each keyspace

• A repair coordinator node requests a Merkle tree from each replica for a specific token range to compare them
• Each replica builds a Merkle tree by scanning the data stored locally in the requested token range
  – This can be IO intensive but can be controlled by providing a partition range, e.g. the primary range, though you will still be building the Merkle tree in parallel
  – You can specify the start token, which will repair the range for this token
  – Finally, you can repair one DC at a time by passing the "local" parameter
  – Set the compaction and streaming thresholds (next slide)
• The repair coordinator node compares the Merkle trees, finds all the sub token ranges that differ between the replicas and repairs data in those ranges

    Nodetool repair - potential strategy

• The system.local table has all the tokens that the node is responsible for

• Create a script that runs a repair using -st <start token> -et <end token> in a loop for all tokens for the given node
• Care should be taken to connect to the node using -h and to use parallel execution (-par)
• This will allow the repair to run on small ranges and in short time increments
• Spool the output and analyze if everything was correct, else send an email or any form of notification to alert the admin
• It may be possible to get an exception like the following if the start token and end tokens are not the same
  – ERROR [Thread-1099288] 2015-03-14 02:48:08,822 StorageService.java (line 2517) Repair session failed: java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
  – Does not mean there is a problem

    Schema disagreements

• Perform schema changes one at a time, at a steady pace, and from the same node

• Do not make multiple schema changes at the same time
• If NTP is not in place it is possible that the schemas may not be in synch (the usual problem in most cases)
• Check if the schema is in agreement: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_handle_schema_disagree_t.html
• ./nodetool describecluster

    Nodetool cfhistograms

• Will tell how many SSTables were looked at to satisfy a read

  – With leveled compaction this should never go > 3
  – With size tiered compaction this should never go > 12
  – If the above are not in place then compaction is falling behind; check with ./nodetool compactionstats, it should say "pending tasks: 0"
• Read and write latency (no n/w latency)
  – Writes should not be > 150μs
  – Reads should not be > 130μs (from mem)
  – SSD should be single digit μs
• ./nodetool cfhistograms <keyspace> <column family>

    Slow queries

• Use the DataStax Enterprise Performance Service to automatically capture long-running queries

  (based on response time thresholds you specify)
• Then query the performance table that holds those CQL statements
• cqlsh:dse_perf> select * from node_slow_log;

    C* timeout, concurrency, consistency

• cassandra.yaml has multiple timeouts (these change with versions)
  – read_request_timeout_in_ms
  – write_request_timeout_in_ms

  – cross_node_timeout
• Concurrency
  – concurrent_reads
  – concurrent_writes
• Application
  – Consistency levels affect read/write
• etc. (see the yaml file)

     JNA


• Java Native Access allows for more efficient communication between the JVM and the OS
• Ensure JNA is enabled and if not do the following
  – sudo apt-get install libjna-java
• But, ideally JNA should be automatically enabled

    About me


    www.linkedin.com/in/nirmallya 

    [email protected]

     That's it!