Cassandra Hands On Niall Milton, CTO, DigBigData Examples courtesy of Patrick Callaghan, DataStax Sponsored By

Cassandra hands on

DESCRIPTION

A presentation delivered to the Dublin Cassandra User Group on 29 May 2014. It covers use cases written by Patrick Callaghan of DataStax, interpreted by Niall Milton of DigBigData.

Page 1: Cassandra hands on

Cassandra Hands On Niall Milton, CTO, DigBigData

Examples courtesy of Patrick Callaghan, DataStax

Sponsored By

Page 2: Cassandra hands on

Introduction

• We will be walking through Cassandra use cases from Patrick Callaghan on GitHub.

• https://github.com/PatrickCallaghan/

• Patrick sends his apologies, but due to the Aer Lingus strike on Friday he couldn't get a flight back to the UK

• This presentation will cover the important points from each sample application

Page 3: Cassandra hands on

Agenda

• Transactions Example

• Paging Example

• Analytics Example

• Risk Sensitivity Example

Page 4: Cassandra hands on

Transactions Example

Page 5: Cassandra hands on

Scenario

• We want to add products, each with a quantity, to an order

• Orders come in concurrently from random buyers

• Products that have sold out will return “OUT OF STOCK”

• We want to use lightweight transactions to guarantee that we do not allow orders to complete when no stock is available

Page 6: Cassandra hands on

Lightweight Transactions

• Guarantee a serial isolation level, ACID

• Use the Paxos consensus algorithm to achieve this in a distributed system. See: http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf

• Every node is still equal, no master or locks

• Allow for conditional inserts & updates

• The cost of linearizable consistency is higher latency, so they are not suitable for high-volume writes where low latency is required

Page 7: Cassandra hands on

Retrieve & Run the Code

1.  git clone https://github.com/PatrickCallaghan/datastax-transaction-demo.git

2.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.demo.SchemaSetup"

3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.transactions.Main" -Dload=true -DcontactPoints=127.0.0.1 -DnoOfThreads=10

Page 8: Cassandra hands on

Schema

1.  create keyspace if not exists datastax_transactions_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1' };

2.  create table if not exists products(productId text, capacityleft int, orderIds set<text>, PRIMARY KEY (productId));

3.  create table if not exists buyers_orders(buyerId text, orderId text, productId text, PRIMARY KEY(buyerId, orderId));

Page 9: Cassandra hands on

Model

public class Order {

private String orderId;

private String productId;

private String buyerId;

}

Page 10: Cassandra hands on

Method

• Find current product quantity at CL.SERIAL

• This allows us to execute a Paxos query without proposing an update, i.e. read the current value

SELECT capacityLeft from products WHERE productId = '1234'

e.g. capacityLeft = 5

Page 11: Cassandra hands on

Method Contd.

• Do a conditional update using the IF operator to make sure the product quantity has not changed since the last quantity check

  • Note the use of the set collection type here.

  • This statement will only succeed if the IF condition is met

UPDATE products SET orderIds=orderIds + {'3'}, capacityleft = 4 WHERE productId = '1234' IF capacityleft = 5;

Page 12: Cassandra hands on

Method Contd.

• If the last query succeeds, simply insert the order.

INSERT into buyers_orders (buyerId, orderId, productId) values ('1','3','1234');

• This guarantees that no order will be placed where there is insufficient quantity to fulfill it.
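The read, conditional update, insert sequence is an optimistic compare-and-set loop. A minimal in-memory sketch of that pattern follows; it is an analogy only, not Cassandra driver code. The class `StockReservation`, its methods, and the retry-on-conflict behaviour are assumptions made for illustration, with `ConcurrentMap.replace` standing in for `UPDATE ... IF capacityleft = <value>`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for the products table (hypothetical, not driver code). */
final class StockReservation {
    // productId -> capacity left, mirroring products.capacityleft
    private final ConcurrentMap<String, Integer> capacityLeft = new ConcurrentHashMap<>();

    void addProduct(String productId, int capacity) {
        capacityLeft.put(productId, capacity);
    }

    /** Returns true if one unit was reserved, false if the product is out of stock. */
    boolean reserveOne(String productId) {
        while (true) {
            // Step 1: read the current value.
            Integer current = capacityLeft.get(productId);
            if (current == null || current <= 0) {
                return false; // "OUT OF STOCK"
            }
            // Step 2: conditional update, succeeds only if the value is unchanged
            // since step 1 (the role played by the IF clause in CQL).
            if (capacityLeft.replace(productId, current, current - 1)) {
                return true; // Step 3: now safe to insert the order.
            }
            // Another buyer changed the value first: re-read and try again.
        }
    }

    int remaining(String productId) {
        return capacityLeft.getOrDefault(productId, 0);
    }
}
```

With a capacity of 2, two calls to `reserveOne("1234")` succeed and a third returns false, however many threads compete.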

Page 13: Cassandra hands on

Comments

• Using LWT incurs a cost of higher latency because a quorum of replicas must be consulted, over several round trips, before a value is committed / returned.

• CL.SERIAL does not propose a new value but is used to read the possibly uncommitted Paxos state

• The IF operator can also be used as IF NOT EXISTS, which is useful for user creation, for example

Page 14: Cassandra hands on

Paging Example

Page 15: Cassandra hands on

Scenario

• We have 1000s of products in our product catalogue

• We want to browse these using a simple select

• We don’t want to retrieve all at once!

Page 16: Cassandra hands on

Cursors

• We are often dealing with wide rows in Cassandra

• Reading entire rows or multiple rows at once could lead to OOM errors

• Traditionally this meant using range queries to retrieve content

• Cassandra 2.0 (and Java driver) introduces cursors

• Makes row based queries more efficient (no need to use the token() function)

• This will simplify client code

Page 17: Cassandra hands on

Retrieve & Run the Code

1.  git clone https://github.com/PatrickCallaghan/datastax-paging-demo.git

2.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.demo.SchemaSetup"

3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.paging.Main"

Page 18: Cassandra hands on

Schema

create table if not exists products(productId text, capacityleft int, orderIds set<text>, PRIMARY KEY (productId));

• N.B. With the default partitioner, products will be ordered by their Murmur3 hash value. Previously we would need to use the token() function to retrieve them in order

Page 19: Cassandra hands on

Model

public class Product {

private String productId;

private int capacityLeft;

private Set<String> orderIds;

}

Page 20: Cassandra hands on

Method

1.  Create a simple select query for the products table.

2.  Set the fetch size parameter

3.  Execute the statement

Statement stmt = new SimpleStatement("Select * from products");

stmt.setFetchSize(100);

ResultSet resultSet = this.session.execute(stmt);

Page 21: Cassandra hands on

Method Contd.

1.  Get an iterator for the result set

2.  Use a while loop to iterate over the result set

Iterator<Row> iterator = resultSet.iterator();

while (iterator.hasNext()){

Row row = iterator.next();

// do stuff with the row

}

Page 22: Cassandra hands on

Comments

• Very easy to transparently iterate over a large result set in a memory-efficient way

• Cursor state is maintained by the driver.

• Allows for failover between page responses: the state is not lost if a page fails to load from one node in the replica set, because the page will be requested from another node

• See: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

Page 23: Cassandra hands on

Analytics Example

Page 24: Cassandra hands on

Scenario

• Don’t have Hadoop but want to run some Hive-type analytics on our large dataset

• Example: Get the top 10 financial transactions ordered by monetary value for each user

• May want to add more complex filtering later (where value > 1000) or even do mathematical groupings, percentiles, means, min, max

Page 25: Cassandra hands on

Cassandra for Analytics

• Useful for many scenarios when no other analytics solution is available

• Using cursors, queries are bounded & memory efficient depending on the operation

• Can be applied anywhere we can do iterative or recursive processing, SUM, AVG, MIN, MAX etc.

• NB: The example code also includes a CQLSSTableWriter, which is fast & convenient if we want to manually create SSTables of large datasets rather than send millions of insert queries to Cassandra

Page 26: Cassandra hands on

Retrieve & Run the Code

1.  git clone https://github.com/PatrickCallaghan/datastax-analytics-example.git

2.  export MAVEN_OPTS=-Xmx512M (up the memory)

3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.bulkloader.Main"

4.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.analytics.TopTransactionsByAmountForUserRunner"

Page 27: Cassandra hands on

Schema

create table IF NOT EXISTS transactions (

accid text, txtnid uuid, txtntime timestamp, amount double,

type text, reason text, PRIMARY KEY(accid, txtntime)

);

Page 28: Cassandra hands on

Model

public class Transaction {

private String txtnId;

private String acountId;

private double amount;

private Date txtnDate;

private String reason;

private String type;

…}

Page 29: Cassandra hands on

Method

• Pass a blocking queue into the DAO method which cursors the data; this allows us to pop items off as they are added

  • NB: Could also use a callback here to update the queue

public void getAllProducts(BlockingQueue<Transaction> processorQueue)

Statement stmt = new SimpleStatement("SELECT * FROM transactions");

stmt.setFetchSize(2500);

ResultSet resultSet = this.session.execute(stmt);

Page 30: Cassandra hands on

Method Contd.

1.  Get an iterator for the result set

2.  Use a while loop to iterate over the result set, add each row into the queue

Iterator<Row> iterator = resultSet.iterator();

while (iterator.hasNext()) {

Row row = iterator.next();

Transaction transaction = createTransactionFromRow(row); // convenience

queue.offer(transaction);

}
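The hand-off between the code that iterates the result set and the code that processes it can be sketched with the standard library alone. In this minimal sketch a plain `Iterator<Double>` stands in for the paged result set and each value for a transaction amount. The class `QueueHandoff`, its method names, and the empty `Optional` used as an end-of-data marker are assumptions for illustration, not taken from the example repository.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueueHandoff {

    /** Producer: walks the iterator and enqueues each amount as it is read. */
    static void produce(Iterator<Double> rows, BlockingQueue<Optional<Double>> queue) {
        try {
            while (rows.hasNext()) {
                queue.put(Optional.of(rows.next())); // blocks while the queue is full
            }
            queue.put(Optional.empty());             // hypothetical end-of-data marker
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Consumer: takes amounts as they arrive and returns their total. */
    static double consume(BlockingQueue<Optional<Double>> queue) {
        double total = 0;
        try {
            for (Optional<Double> next = queue.take(); next.isPresent(); next = queue.take()) {
                total += next.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total;
    }

    /** Runs the producer on its own thread and consumes on the calling thread. */
    static double sumViaQueue(List<Double> amounts, int queueCapacity) {
        BlockingQueue<Optional<Double>> queue = new ArrayBlockingQueue<>(queueCapacity);
        Thread producer = new Thread(() -> produce(amounts.iterator(), queue));
        producer.start();
        return consume(queue);
    }
}
```

A bounded queue keeps memory use flat: the producer blocks in `put` whenever the consumer falls behind, so at most `queueCapacity` items are held at once.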

Page 31: Cassandra hands on

Method Contd.

1.  Use Java Collections & Transaction comparator to track Top results

private Set<Transaction> orderedSet = new BoundedTreeSet<Transaction>(10, new TransactionAmountComparator());
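A bounded sorted set keeps only the best N items seen so far, so memory stays constant however many rows are streamed through it. The sketch below is one way such a collection could work; the class `TopN` is hypothetical, and the `BoundedTreeSet` in the example code may be implemented differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical bounded sorted set that retains the first `limit` items in comparator order. */
final class TopN<T> {
    private final int limit;
    private final TreeSet<T> items;

    /** @param bestFirst comparator that orders the most wanted item first */
    TopN(int limit, Comparator<? super T> bestFirst) {
        this.limit = limit;
        this.items = new TreeSet<>(bestFirst);
    }

    /** Adds an item, then evicts the lowest-ranked one if the set is over its limit. */
    void offer(T item) {
        items.add(item);
        if (items.size() > limit) {
            items.pollLast();
        }
    }

    /** Returns the retained items, best first. */
    List<T> snapshot() {
        return new ArrayList<>(items);
    }
}
```

Because a `TreeSet` treats items that compare as equal as duplicates, a comparator on amount alone would keep only one of two transactions with the same amount; adding a tie-breaker such as the transaction id avoids that.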

Page 32: Cassandra hands on

Comments

• Entirely possible, but probably not to be thought of as a complete replacement for dedicated analytics solutions

• Issues are token distribution across replicas and mixed write and read patterns

• Running analytics or MapReduce (MR) operations can be read heavy (as well as memory and I/O intensive)

• Transaction logging tends to be write heavy

• Cassandra can handle it, but in practice it is better to split workloads, except for smaller cases where latency doesn’t matter or where the cluster is not generally under significant load

• Consider DSE Hadoop, Spark, Storm as alternatives

Page 33: Cassandra hands on

Risk Sensitivity Example

Page 34: Cassandra hands on

Scenario

• In financial risk systems, positions have sensitivity to certain variables

• Positions are hierarchical: each is associated with a trader at a desk, which is part of an asset type in a certain location.

• E.g. Frankfurt/FX/desk10/trader7/position23

• Sensitivity values are inserted for each position. We need to aggregate them for each level in the hierarchy

• Sensitivities are represented as deltas, so the sum of all sensitivities over time is the new sensitivity.

Page 35: Cassandra hands on

Scenario

• E.g. Aggregations for:

  • Frankfurt/FX/desk10/trader7

  • Frankfurt/FX/desk10

  • Frankfurt/FX

• As new positions are entered, the risk sensitivities will change and will need to be aggregated for each level for the new value to be available
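The levels that need re-aggregating when a position changes can be derived from the position's path by repeatedly dropping the last segment. A minimal sketch follows; the class `HierarchyPaths` and its method are hypothetical names. It also returns the top-level location (e.g. Frankfurt), which the slide does not list.

```java
import java.util.ArrayList;
import java.util.List;

final class HierarchyPaths {

    /** Returns every ancestor level of a '/'-separated path, nearest level first. */
    static List<String> ancestors(String path) {
        List<String> levels = new ArrayList<>();
        int slash = path.lastIndexOf('/');
        while (slash > 0) {
            path = path.substring(0, slash); // drop the last segment
            levels.add(path);
            slash = path.lastIndexOf('/');
        }
        return levels;
    }
}
```

For `Frankfurt/FX/desk10/trader7/position23` this yields the trader, desk, asset type and location levels, in that order.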

Page 36: Cassandra hands on

Queries

select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX';

select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX/desk4' and sub_hier_path='trader3';

select * from risk_sensitivities_hierarchy where hier_path = 'Paris/FX/desk4' and sub_hier_path='trader3' and risk_sens_name='irDelta';

Page 37: Cassandra hands on

Retrieve & Run the Code

1.  git clone https://github.com/PatrickCallaghan/datastax-analytics-example.git

2.  export MAVEN_OPTS=-Xmx512M (up the memory)

3.  mvn clean compile exec:java -Dexec.mainClass="com.datastax.bulkloader.Main"

4.  mvn clean compile exec:java -Dexec.mainClass="com.heb.finance.analytics.Main" -DstopSize=1000000

Page 38: Cassandra hands on

Schema

create table if not exists risk_sensitivities_hierarchy (

hier_path text,

sub_hier_path text,

risk_sens_name text,

value double,

PRIMARY KEY (hier_path, sub_hier_path, risk_sens_name)

) WITH compaction={'class': 'LeveledCompactionStrategy'};

NB: Note the use of LeveledCompactionStrategy (LCS), as we also want the table to be efficient for reads

Page 39: Cassandra hands on

Model

public class RiskSensitivity {

public final String name;

public final String path;

public final String position;

public final BigDecimal value;

}

Page 40: Cassandra hands on

Method

• Write a service to write new sensitivities to Cassandra periodically.

insert into risk_sensitivities_hierarchy (hier_path, sub_hier_path, risk_sens_name, value) VALUES (?, ?, ?, ?)

Page 41: Cassandra hands on

Method Contd.

• In our aggregator do the following periodically

• Select data for hierarchies we wish to aggregate

select * from risk_sensitivities_hierarchy where hier_path = 'Frankfurt/FX/desk10/trader4'

• Will get all positions related to this hierarchy

• Add the values (represented as deltas) to each other to get the new sensitivity

• E.g. S1 = -3, S2 = 2, S3 = -1 gives a new sensitivity of -2

• Write it back for 'Frankfurt/FX/desk10/trader4'
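The aggregation step itself is a sum of deltas per sensitivity name. A minimal sketch follows, using only the standard library; the class `SensitivityAggregator` and its nested `Delta` type are hypothetical and merely mirror the `risk_sens_name` and `value` columns. Reading the rows and writing the total back to Cassandra are left out.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SensitivityAggregator {

    /** One stored delta for a named sensitivity (mirrors risk_sens_name and value). */
    static final class Delta {
        final String name;
        final BigDecimal value;

        Delta(String name, String value) {
            this.name = name;
            this.value = new BigDecimal(value);
        }
    }

    /** Sums the deltas per sensitivity name, giving the values to write back. */
    static Map<String, BigDecimal> aggregate(List<Delta> deltas) {
        Map<String, BigDecimal> totals = new TreeMap<>();
        for (Delta d : deltas) {
            totals.merge(d.name, d.value, BigDecimal::add);
        }
        return totals;
    }
}
```

With deltas of -3, 2 and -1 for `irDelta`, the aggregated value is -2.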

Page 42: Cassandra hands on

Comments

• Simple way to maintain up-to-date risk sensitivity on an ongoing basis, based on previous data

• Will mean (N hierarchies) * (N variables) queries are executed periodically (keep an eye on this)

• Cursors, blocking queue and bounded collections help us achieve the same result without reading entire rows

• Has other applications, such as roll-ups for stream data, provided you have a reasonably low cardinality in terms of (time resolutions) * (variables).

Page 43: Cassandra hands on

• Thanks Patrick Callaghan for the hard work coding the examples!

• Questions?