Cassandra hands on

Cassandra Hands On Niall Milton, CTO, DigBigData

Examples courtesy of Patrick Callaghan, DataStax


Introduction

•  We will be walking through Cassandra use cases from Patrick Callaghan on GitHub.

•  https://github.com/PatrickCallaghan/

•  Patrick sends his apologies, but due to the Aer Lingus strike on Friday he couldn’t get a flight back to the UK

•  This presentation will cover the important points from each sample application

Agenda

•  Transactions Example

•  Paging Example

•  Analytics Example

•  Risk Sensitivity Example

Transactions Example

Scenario

•  We want to add products, each with a quantity, to an order

•  Orders come in concurrently from random buyers

•  Products that have sold out will return “OUT OF STOCK”

•  We want to use lightweight transactions to guarantee that we do not allow orders to complete when no stock is available

Lightweight Transactions

•  Guarantee a serial isolation level, as in ACID-compliant databases

•  Use the Paxos consensus algorithm to achieve this in a distributed system. See:
    •  http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf

•  Every node is still equal; there is no master and there are no locks

•  Allow for conditional inserts & updates

•  The cost of linearizable consistency is higher latency, so they are not suitable for high-volume writes where low latency is required

Retrieve & Run the Code

1.  git clone https://github.com/PatrickCallaghan/datastax-transaction-demo.git

2.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.demo.SchemaSetup"

3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.transactions.Main" -Dload=true -DcontactPoints=127.0.0.1 -DnoOfThreads=10

Schema

1.  create keyspace if not exists datastax_transactions_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1' };

2.  create table if not exists products(productId text, capacityleft int, orderIds set<text>, PRIMARY KEY (productId));

3.  create table if not exists buyers_orders(buyerId text, orderId text, productId text, PRIMARY KEY(buyerId, orderId));

Model

public class Order {

private String orderId;

private String productId;

private String buyerId;

…

}

Method

•  Find the current product quantity at CL.SERIAL

•  This allows us to execute a Paxos read without proposing an update, i.e. read the current value

SELECT capacityleft FROM products WHERE productId = '1234';

e.g. capacityleft = 5

Method Contd.

•  Do a conditional update using the IF clause to make sure the product quantity has not changed since the last quantity check
    •  Note the use of the set collection type here.
    •  This statement will only succeed if the IF condition is met

UPDATE products SET orderIds = orderIds + {'3'}, capacityleft = 4 WHERE productId = '1234' IF capacityleft = 5;

Method Contd.

•  If the last query succeeds, simply insert the order.

INSERT INTO buyers_orders (buyerId, orderId, productId) VALUES ('1', '3', '1234');

•  This guarantees that no order will be placed where there is insufficient quantity to fulfill it.
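The read, conditional update, retry flow above can be sketched without a cluster. The class below is an in-memory stand-in, not the Cassandra driver API: `StockSketch`, `reserveOne` and `remaining` are hypothetical names, and `ConcurrentMap.replace(key, expected, updated)` plays the role of `UPDATE ... IF capacityleft = <value read>`. It only illustrates the compare-and-set idea; it says nothing about Paxos itself.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for the products table; NOT the Cassandra driver API. */
final class StockSketch {
    // productId -> capacity left (hypothetical stand-in for products.capacityleft)
    private final ConcurrentMap<String, Integer> capacityLeft = new ConcurrentHashMap<>();

    void addProduct(String productId, int capacity) {
        capacityLeft.put(productId, capacity);
    }

    /**
     * Reserves one unit. The write only applies if the value is unchanged
     * since it was read, mirroring the IF condition on the UPDATE.
     * Returns false when the product is unknown or sold out ("OUT OF STOCK").
     */
    boolean reserveOne(String productId) {
        while (true) {
            Integer current = capacityLeft.get(productId);   // the "SELECT"
            if (current == null || current <= 0) {
                return false;                                 // out of stock
            }
            // The conditional "UPDATE": succeeds only if nobody changed the value.
            if (capacityLeft.replace(productId, current, current - 1)) {
                return true;                                  // now safe to insert the order
            }
            // Another buyer won the race; re-read and try again.
        }
    }

    int remaining(String productId) {
        return capacityLeft.getOrDefault(productId, 0);
    }
}
```

With a product added at capacity 2, two calls to `reserveOne` succeed and a third returns false, which is the point at which the demo would report “OUT OF STOCK”.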

Comments

•  Using LWT incurs a cost of higher latency because a quorum of replicas must be consulted, over several round trips, before a value is committed / returned.

•  CL.SERIAL does not propose a new value but is used to read the possibly uncommitted Paxos state

•  The IF clause can also be used as IF NOT EXISTS, which is useful for user creation, for example

Paging Example

Scenario

•  We have 1000s of products in our product catalogue

•  We want to browse these using a simple select

•  We don’t want to retrieve all at once!

Cursors

•  We are often dealing with wide rows in Cassandra

•  Reading entire rows or multiple rows at once could lead to OOM errors

•  Traditionally this meant using range queries to retrieve content

•  Cassandra 2.0 (and the Java driver) introduce cursors, i.e. automatic paging of results

•  This makes row-based queries more efficient (no need to use the token() function)

•  This will simplify client code


Retrieve & Run the Code

1.  git clone https://github.com/PatrickCallaghan/datastax-paging-demo.git

2.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.demo.SchemaSetup"

3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.paging.Main"

Schema

create table if not exists products(productId text, capacityleft int, orderIds set<text>, PRIMARY KEY (productId));

•  N.B. With the default partitioner, products will be ordered by their Murmur3 hash value. Previously, we would have needed to use the token() function to page through them in that order

Model

public class Product {

private String productId;

private int capacityLeft;

private Set<String> orderIds;

…

}

Method

1.  Create a simple select query for the products table.

2.  Set the fetch size parameter

3.  Execute the statement

Statement stmt = new SimpleStatement("SELECT * FROM products");
stmt.setFetchSize(100);
ResultSet resultSet = this.session.execute(stmt);

Method Contd.

1.  Get an iterator for the result set

2.  Use a while loop to iterate over the result set

Iterator<Row> iterator = resultSet.iterator();

while (iterator.hasNext()){

Row row = iterator.next();

// do stuff with the row

}
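To make the effect of the fetch size concrete, here is a minimal sketch of what a paging iterator does on the client side. It is not the driver's implementation: `PagedIterator` is a hypothetical class, and a plain `List` stands in for the remote table. The point is that only one page of at most `fetchSize` rows is held at a time, and the next page is requested only when the current one is exhausted.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch of a paging iterator; a List stands in for the remote table. */
final class PagedIterator<T> implements Iterator<T> {
    private final List<T> source;                 // stand-in for the server-side result
    private final int fetchSize;                  // rows requested per page
    private final Deque<T> page = new ArrayDeque<>();
    private int nextOffset = 0;                   // the "cursor state" kept on the client
    private int pagesFetched = 0;

    PagedIterator(List<T> source, int fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        this.source = source;
        this.fetchSize = fetchSize;
    }

    @Override
    public boolean hasNext() {
        // Fetch lazily: only go back to the source when the current page is used up.
        if (page.isEmpty() && nextOffset < source.size()) {
            fetchNextPage();
        }
        return !page.isEmpty();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return page.removeFirst();
    }

    private void fetchNextPage() {
        int end = Math.min(nextOffset + fetchSize, source.size());
        page.addAll(source.subList(nextOffset, end));
        nextOffset = end;
        pagesFetched++;
    }

    int pagesFetched() {
        return pagesFetched;
    }
}
```

Iterating 250 rows with a fetch size of 100 takes three page fetches in this sketch, while the calling code still sees a single `while (iterator.hasNext())` loop.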

Comments

•  Very easy to transparently iterate in a memory-efficient way over a large result set

•  Cursor state is maintained by the driver.

•  Allows for failover between different page responses, i.e. the state is not lost if a page fails to load from a node in the replica set; the page will be requested from another node

•  See: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

Analytics Example

Scenario

•  We don’t have Hadoop but want to run some Hive-type analytics on our large dataset

•  Example: Get the top 10 financial transactions ordered by monetary value for each user

•  We may want to add more complex filtering later (where value > 1000) or even do mathematical groupings, percentiles, means, min, max

Cassandra for Analytics

•  Useful for many scenarios when no other analytics solution is available

•  Using cursors, queries are bounded & memory efficient, depending on the operation

•  Can be applied anywhere we can do iterative or recursive processing: SUM, AVG, MIN, MAX etc.

•  NB: The example code also includes a CQLSSTableWriter, which is fast & convenient if we want to manually create SSTables of large datasets rather than send millions of insert queries to Cassandra
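As a small illustration of the iterative style of aggregation mentioned above, the sketch below folds values into SUM, AVG, MIN and MAX one at a time, so memory use stays constant however many rows the cursor returns. `RunningStats` is a hypothetical helper, not part of the example repository, and the iterator of amounts stands in for values read from a paged result set.

```java
import java.util.DoubleSummaryStatistics;
import java.util.Iterator;

/** Hypothetical helper: single-pass aggregates over a stream of amounts. */
final class RunningStats {
    /** Consumes the iterator once and keeps only count, sum, min and max. */
    static DoubleSummaryStatistics summarize(Iterator<Double> amounts) {
        DoubleSummaryStatistics stats = new DoubleSummaryStatistics();
        while (amounts.hasNext()) {
            stats.accept(amounts.next());
        }
        return stats;
    }
}
```

The returned object exposes `getSum()`, `getAverage()`, `getMin()`, `getMax()` and `getCount()`. Percentiles do not fit this pattern as neatly, since they generally need either the sorted values or an approximation structure.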


Retrieve & Run the Code

1.  git clone https://github.com/PatrickCallaghan/datastax-analytics-example.git

2.  export MAVEN_OPTS=-Xmx512M (up the memory)

3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.bulkloader.Main"

4.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.analytics.TopTransactionsByAmountForUserRunner"

Schema

create table IF NOT EXISTS transactions (
    accid text, txtnid uuid, txtntime timestamp, amount double,
    type text, reason text, PRIMARY KEY(accid, txtntime)
);

Model

public class Transaction {

private String txtnId;

private String acountId;

private double amount;

private Date txtnDate;

private String reason;

private String type;

…

}

Method

•  Pass a blocking queue into the DAO method that pages through the data; this lets a consumer take items off the queue as they are added
    •  NB: Could also use a callback here to update the queue

public void getAllProducts(BlockingQueue<Transaction> processorQueue)

Statement stmt = new SimpleStatement("SELECT * FROM transactions");

stmt.setFetchSize(2500);

ResultSet resultSet = this.session.execute(stmt);

Method Contd.

1.  Get an iterator for the result set

2.  Use a while loop to iterate over the result set, adding each row to the queue

Iterator<Row> iterator = resultSet.iterator();
while (iterator.hasNext()) {
    Row row = iterator.next();
    Transaction transaction = createTransactionFromRow(row); // convenience method
    processorQueue.offer(transaction);
}
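The hand-off between the code that reads rows and the code that processes them can be sketched with plain JDK classes. This is an illustration under assumptions, not the repository's code: `QueueHandOff` and its `END` marker are hypothetical, strings stand in for `Transaction` objects, and the sketch uses `put`/`take` (which block) rather than `offer`, so that a small bounded queue applies back-pressure to the producer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical producer/consumer sketch; strings stand in for Transaction rows. */
final class QueueHandOff {
    // Hypothetical end-of-stream marker; compared by identity, never by value.
    static final String END = new String("END");

    /** Producer side: pushes every row into the queue, then the end marker. */
    static void produce(Iterator<String> rows, BlockingQueue<String> queue)
            throws InterruptedException {
        while (rows.hasNext()) {
            queue.put(rows.next());      // blocks while the queue is full
        }
        queue.put(END);
    }

    /** Consumer side: takes items as they arrive until the end marker shows up. */
    static List<String> consume(BlockingQueue<String> queue) throws InterruptedException {
        List<String> seen = new ArrayList<>();
        while (true) {
            String item = queue.take();  // blocks until an item is available
            if (item == END) {
                return seen;
            }
            seen.add(item);
        }
    }

    /** Runs the producer on its own thread with a deliberately small queue. */
    static List<String> run(List<String> rows) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(2);
        Thread producer = new Thread(() -> {
            try {
                produce(rows.iterator(), queue);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            List<String> result = consume(queue);
            producer.join();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
    }
}
```

Because the queue holds at most two items here, the producer never runs far ahead of the consumer, which keeps memory bounded in the same spirit as the fetch size does for the query.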

Method Contd.

1.  Use Java Collections & a Transaction comparator to track the top results

private Set<Transaction> orderedSet = new BoundedTreeSet<Transaction>(10, new TransactionAmountComparator());
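The slides do not show `BoundedTreeSet`, so the following is only a guess at the idea behind it: a sorted set that evicts its worst element once it grows past a limit. `TopN` is a hypothetical helper written for illustration and should not be read as the repository's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper; not the BoundedTreeSet class from the example repository. */
final class TopN<T> {
    private final int limit;
    private final TreeSet<T> best;   // sorted so that first() is the best element

    /** bestFirst must order "better" elements before "worse" ones. */
    TopN(int limit, Comparator<? super T> bestFirst) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
        this.best = new TreeSet<>(bestFirst);
    }

    /** Adds the item, then drops the worst element if the set is over its limit. */
    void offer(T item) {
        best.add(item);
        if (best.size() > limit) {
            best.pollLast();
        }
    }

    /** Returns the retained elements, best first. */
    List<T> asList() {
        return new ArrayList<>(best);
    }
}
```

One design point to keep in mind: a `TreeSet` treats elements that compare as equal as duplicates, so a comparator on amount alone would silently drop a second transaction with the same amount. Adding a tie-breaker (for example the transaction id) to the comparator avoids that.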

Comments

•  Entirely possible, but probably not to be thought of as a complete replacement for dedicated analytics solutions

•  Issues are token distribution across replicas and mixed write and read patterns

•  Running analytics or MapReduce operations can be read heavy (as well as memory and I/O intensive)

•  Transaction logging tends to be write heavy

•  Cassandra can handle it, but in practice it is better to split workloads, except for smaller cases where latency doesn’t matter or where the cluster is not generally under significant load

•  Consider DSE Hadoop, Spark, Storm as alternatives

Risk Sensitivity Example

Scenario

•  In financial risk systems, positions have sensitivity to certain variables

•  Positions are hierarchical and are associated with a trader at a desk, which is part of an asset type in a certain location.

•  E.g. Frankfurt/FX/desk10/trader7/position23

•  Sensitivity values are inserted for each position. We need to aggregate them for each level in the hierarchy

•  The sum of all sensitivities over time is the new sensitivity, as they are represented by deltas.

Scenario

•  E.g. Aggregations for:
    •  Frankfurt/FX/desk10/trader7
    •  Frankfurt/FX/desk10
    •  Frankfurt/FX

•  As new positions are entered, the risk sensitivities will change and will need to be aggregated for each level for the new value to be available

Queries

select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX';

select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX/desk4' and sub_hier_path='trader3';

select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX/desk4' and sub_hier_path='trader3' and risk_sens_name='irDelta';


Retrieve & Run the Code

1.  git clone https://github.com/PatrickCallaghan/datastax-analytics-example.git

2.  export MAVEN_OPTS=-Xmx512M (up the memory)

3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.bulkloader.Main"

4.  mvn clean compile exec:java -Dexec.mainClass="com.heb.finance.analytics.Main" -DstopSize=1000000

Schema

create table if not exists risk_sensitivities_hierarchy (

hier_path text,

sub_hier_path text,

risk_sens_name text,

value double,

PRIMARY KEY (hier_path, sub_hier_path, risk_sens_name)

) WITH compaction={'class': 'LeveledCompactionStrategy'};

NB: Note the use of LeveledCompactionStrategy (LCS), as we also want the table to be efficient for reads

Model

public class RiskSensitivity {

public final String name;

public final String path;

public final String position;

public final BigDecimal value;

…

}

Method

•  Write a service to write new sensitivities to Cassandra periodically.

insert into risk_sensitivities_hierarchy (hier_path, sub_hier_path, risk_sens_name, value) VALUES (?, ?, ?, ?)

Method Contd.

•  In our aggregator do the following periodically
    •  Select data for the hierarchies we wish to aggregate

select * from risk_sensitivities_hierarchy where hier_path = 'Frankfurt/FX/desk10/trader4';

    •  This returns all positions related to this hierarchy
    •  Add the values (represented as deltas) to each other to get the new sensitivity
    •  E.g. S1 = -3, S2 = 2, S3 = -1 gives a new sensitivity of -2
    •  Write it back for 'Frankfurt/FX/desk10/trader4'
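The roll-up step can be sketched in memory as follows. This is an assumption-laden illustration rather than the aggregator from the example code: `HierarchyRollUp` is a hypothetical class, the input map stands in for rows read from Cassandra, and it handles a single sensitivity name (for instance only irDelta) to keep the example short.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: sums position deltas into every level above them. */
final class HierarchyRollUp {
    /**
     * Input:  full position path -> sensitivity delta,
     *         e.g. "Frankfurt/FX/desk10/trader4/position1" -> -3
     * Output: every ancestor path -> sum of the deltas beneath it.
     */
    static Map<String, BigDecimal> rollUp(Map<String, BigDecimal> deltasByPosition) {
        Map<String, BigDecimal> totals = new TreeMap<>();
        for (Map.Entry<String, BigDecimal> entry : deltasByPosition.entrySet()) {
            String path = entry.getKey();
            int slash = path.lastIndexOf('/');
            while (slash > 0) {
                path = path.substring(0, slash);   // step up one level
                totals.merge(path, entry.getValue(), BigDecimal::add);
                slash = path.lastIndexOf('/');
            }
        }
        return totals;
    }
}
```

For three positions under Frankfurt/FX/desk10/trader4 with deltas -3, 2 and -1, the total for that trader is -2, and the same -2 contributes to Frankfurt/FX/desk10, Frankfurt/FX and Frankfurt. In the real system each total would then be written back to the table for its hierarchy path.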

Comments

•  Simple way to maintain an up-to-date risk sensitivity on an ongoing basis, based on previous data

•  Will mean (number of hierarchies) * (number of variables) queries are executed periodically (keep an eye on this)

•  Cursors, blocking queues and bounded collections help us achieve the same result without reading entire rows

•  Has other applications, such as roll-ups for stream data, provided you have a reasonably low cardinality in terms of (time resolution) * (number of variables).

•  Thanks to Patrick Callaghan for the hard work coding the examples!

•  Questions?
