Introduction to Cassandra and CQL for Java developers

This talk provides a high-level overview of Cassandra, the Cassandra Query Language (CQL) and, more specifically, the DataStax CQL Java driver. It aims to introduce Java developers to the tools, techniques and best practices for building Java applications that leverage the Cassandra database using CQL3.

Julien Anguenot (@anguenot)
Houston Java User Group

July 30th, 2014

Agenda

• C* overview
• C* key features
• C* key concepts
• Getting started with C*
• CQL
• DataStax CQL Java driver

C* overview

© 2014 iland internet solutions

What is C*?

• Open source distributed storage system
• Essentially a partitioned row store
• A cross between Google’s BigTable (data model) and Amazon’s Dynamo (architecture)
• Runs off commodity hardware
• Optimized for non-relational models
• Cassandra Query Language (CQL)
• Written in Java
• Apache Licence v2.0
• An open source community

History

• Developed by Facebook for its inbox search
• Open sourced in 2008
• Apache Incubator project in 2009, top-level Apache project in 2010
• 1.0 released in 2011
• 2.0 released in 2013
• 2.1 to be released this year

C* is today

• One of the most popular “NoSQL” databases
• Used by many (and large) organizations (Netflix, Instagram, Twitter, eBay, etc.)
• Contributors include Facebook, IBM, Twitter, Rackspace, etc.
• Cassandra 2.0+ and CQL 3.1
• Drivers and client libs available for various languages: Python, Java, C++, C#, etc.

When to consider C*?

• Performance: write is great, read is good on very large datasets (hundreds of TB)
• Application running across multiple data centers in different geographic locations
• Application requiring HA w/ no SPOF (hundreds of nodes)
• Elastic scalability is critical
• Application running off commodity servers on premises or VMs at your favorite IaaS
• Looking for simplicity over other solutions such as Hadoop / HBase

Cassandra vs HBase vs MongoDB

Let’s just get this out of the way

MongoDB to be considered if / when?

• (much) smaller datasets
• your application does not need to run across multiple data centers
• it is ok for your application to have a SPOF
• you do not need to scale out your application elastically
• write performance decreasing with the amount of data is not a big deal

HBase to be considered if / when?

• You do analytics: HBase running off Hadoop is a good option
• Your application has a very low transaction rate
• Your application does not need to run in multiple data centers
• You are not scared of moving parts
• Increasing the overall complexity of your application architecture is fine

C* key features

Scalability

• reads and writes scale linearly with the number of nodes: application throughput is proportional to the # of nodes
• hundreds of nodes supported
• no downtime adding nodes
• no application level interruption
• multi-datacenter native replication support

High Availability

• fault tolerant with tunable consistency (more on this later)
• data replicated to multiple nodes
• continuous availability: no SPOF (vs master / slave)

Performance

• low latency
• write is great
• read is good
• can handle hundreds of TB

Transaction Support

• commit log: provides the atomicity, isolation and durability parts of ACID
• consistency is tunable (more on this later)

Simplicity

• all nodes in the cluster are the same
• configuration is simple
• operation is simple

Cassandra Query Language (CQL)

• SQL-like query language
• data is in tables containing rows of columns
• v3 replaces the Thrift API and CQL v2

C* key concepts

Tunable consistency

• RDBMS: consistency and availability => transactions
• NoSQL: partition tolerance over consistency?
• Cassandra tunable consistency: tradeoff between performance and accuracy on a per-query basis
• Write requests: all nodes, quorum of nodes or any available node
• Read requests: all nodes (“strong consistency”), quorum of nodes or any node
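
The tradeoff above can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the slides: it assumes the usual definition of a quorum as (RF / 2) + 1 replicas, rounded down, and the rule of thumb that a read sees the latest write when (replicas written) + (replicas read) > RF. The class and method names (ConsistencyMath, quorum, isStronglyConsistent) are made up for illustration.

```java
// Hypothetical helper, for illustration only: not part of Cassandra or the driver.
final class ConsistencyMath {

    private ConsistencyMath() {
    }

    // Number of replicas forming a quorum for a given replication factor (RF).
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // A read is guaranteed to overlap the latest write when W + R > RF.
    static boolean isStronglyConsistent(int writeReplicas, int readReplicas, int replicationFactor) {
        return writeReplicas + readReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("RF=" + rf + " quorum=" + q);
        System.out.println("QUORUM write + QUORUM read: " + isStronglyConsistent(q, q, rf));
        System.out.println("ONE write + ONE read: " + isStronglyConsistent(1, 1, rf));
    }
}
```

With RF = 3, quorum writes combined with quorum reads overlap on at least one replica, while single-replica writes and reads trade that guarantee for lower latency.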

Data model

• Flexible data storage: structured, semi-structured, unstructured
• Change to data structures is dynamic
• strict minimum: essentially a distributed hash map
• low-level: requires the application to have extensive knowledge about the dataset
• Does not support a fully relational model: application responsibility
• No foreign keys, no JOIN

Partitioned row store

• keyspace (KS) is the primary container of data (like an RDBMS database)
• KS contains column families (CF) (like relational tables)
• CF contains rows and rows contain columns
• CF requires a primary key: the partition key (PK) is the first part of the primary key
• PK determines on which nodes the data is stored
• SELECT should restrict the PK (full scans and secondary indexes aside)
• remaining columns of the primary key are clustering columns (think ordering)
• INSERT / UPDATE / DELETE operations on rows w/ the same PK for a CF are atomic and isolated
• partitioning: C* transparently distributes data across multiple nodes (nodes can be added and removed)
• Secondary indexes possible
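
To picture the partition key / clustering column split, here is a toy in-memory model. It is only a sketch under simplifying assumptions (a single node, string values, one clustering column) and says nothing about how C* actually lays data out on disk; the class name ToyPartitionedStore and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model, for illustration only: real C* storage is far more involved.
final class ToyPartitionedStore {

    // partition key -> (clustering value -> row); rows are kept sorted by clustering value
    private final Map<String, NavigableMap<String, Map<String, String>>> partitions = new HashMap<>();

    // Create the row if it is missing, otherwise overwrite the given column.
    void upsert(String partitionKey, String clusteringValue, String column, String value) {
        partitions
                .computeIfAbsent(partitionKey, k -> new TreeMap<>())
                .computeIfAbsent(clusteringValue, k -> new HashMap<>())
                .put(column, value);
    }

    // Single-partition read: rows come back ordered by clustering value.
    NavigableMap<String, Map<String, String>> selectPartition(String partitionKey) {
        return partitions.getOrDefault(partitionKey, new TreeMap<>());
    }
}
```

A read that names the partition key touches exactly one inner map, which is the intuition behind single-partition queries being the cheap ones.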

Getting started

Where to get started?

• http://cassandra.apache.org/
  Apache Foundation project Web site
• http://planetcassandra.org/
  Community Web site
• http://www.datastax.com/
  company providing Cassandra support and solutions to enterprises; lots of great documentation

Requirements

• Java >= 1.7 (prefer Oracle JVM)
• Python 2.7 (cqlsh only)

Downloading

• stable releases available from the Apache Foundation Web site
  • binary distributions
  • Debian / Ubuntu packages
• DataStax provides RPMs
• you can build C* from source (testing patches etc.)

Getting started with tarball distribution

$ wget http://www.apache.org/dyn/closer.cgi?path=/cassandra/2.0.9/apache-cassandra-2.0.9-bin.tar.gz
$ sudo mkdir -p /var/log/cassandra
$ sudo chown -R `whoami` /var/log/cassandra
$ sudo mkdir -p /var/lib/cassandra
$ sudo chown -R `whoami` /var/lib/cassandra
$ tar -xzf apache-cassandra-2.0.9-bin.tar.gz
$ cd apache-cassandra-2.0.9
$ bin/cassandra -f

Getting started with Debian / Ubuntu (1/2)

$ sudo vim /etc/apt/sources.list.d/java.list
deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main
deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main

$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
$ sudo apt-get install oracle-java7-set-default

Getting started with Debian / Ubuntu (2/2)

$ sudo vim /etc/apt/sources.list.d/cassandra.list
deb http://www.apache.org/dist/cassandra/debian 20x main
deb-src http://www.apache.org/dist/cassandra/debian 20x main

$ sudo apt-get update
$ sudo apt-get install cassandra

Running the CQL shell

$ (bin/)cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.9 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh>

Cassandra Query Language (CQL)

Using CQL

• cqlsh
• DataStax driver
• simpler than the Thrift API
• hides C* internal implementation details
• native transport port: 9042

CQL basics

• usual statements
  • CREATE / DROP / ALTER
  • SELECT
• INSERT and UPDATE are the same (create or replace)

Keyspace (KS)

• “like” an RDBMS database but…
• replication strategy
  • SimpleStrategy: simple single DC cluster
  • NetworkTopologyStrategy: multi-DC cluster
• replication factor: total number of replicas across the cluster
  • A replication factor of 1 means that there is only one copy of each row in the DC
  • A replication factor of 2 means two copies of each row, where each copy is on a different node in every DC
• if RF > # nodes: writes are rejected and reads depend on the consistency level

Creating KS: single node in a single DC

cqlsh> CREATE KEYSPACE HJUG WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

1 node == 1 copy

Creating KS: 4 nodes cluster in a single DC (1/2)

cqlsh> CREATE KEYSPACE HJUG WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

3 copies of data across 4 nodes

Creating KS: 4 nodes cluster in a single DC (2/2)

• first replica on a node determined by the partitioner
• additional replicas placed on the next nodes clockwise in the ring
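
The placement rule above can be sketched with a sorted map standing in for the ring. This is a deliberately simplified model and not Cassandra's implementation: it assumes one token per node (no virtual nodes), ignores racks and data centers, and takes the row token as an input instead of computing it with a partitioner. ToyRing and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified ring, for illustration only: one token per node, no virtual nodes, no racks.
final class ToyRing {

    // token -> node name, sorted by token so that walking upwards is "clockwise"
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String node) {
        ring.put(token, node);
    }

    // First replica: the node owning the first token >= the row token (wrapping around).
    // Additional replicas: the next nodes clockwise on the ring.
    List<String> replicasFor(long rowToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        // Never place more replicas than there are nodes in this toy model.
        int wanted = Math.min(replicationFactor, ring.size());
        Long token = ring.ceilingKey(rowToken);
        if (token == null) {
            token = ring.firstKey(); // wrap around the ring
        }
        while (replicas.size() < wanted) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) {
                token = ring.firstKey(); // wrap around the ring
            }
        }
        return replicas;
    }
}
```

For a 4-node ring with tokens 0, 100, 200 and 300, a row whose token is 150 and a replication factor of 3 lands on the nodes owning 200, 300 and then 0.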

Multi-DC (NetworkTopologyStrategy)

• cluster deployed across multiple data centers
• specify how many replicas in each data center
• what to consider:
  • local reads with low network latency
  • failure scenarios
  • disk space
• example:
  1. 2 replicas in each DC: 1 node can be down per DC and local reads are still possible at a consistency level of ONE (1 replica).
  2. 3 replicas in each DC: 1 node can be down per DC at a strong consistency level of LOCAL_QUORUM (2 replicas), or more depending on the query consistency level.

Creating KS: 2 DC of 3 nodes and RF 3

cqlsh> CREATE KEYSPACE HJUG WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'us-east' : 3, 'us-west' : 3 };

3 copies of data across 3 nodes in each DC (6 in total)

nodetool status <KS>

$ bin/nodetool status HJUG

Datacenter: us-east
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.241.206.82  989.91 GB  256     100.0%            1aeb620e-f22d-485b-b755-323f8e20388a  206
UN  10.241.206.80  989.14 GB  256     100.0%            aefbe1fc-3436-48ac-a07f-ac664c2b823f  206
UN  10.241.206.81  989.7 GB   256     100.0%            acd7b4db-7a3f-4dac-96ef-9389a2f807ba  206

Datacenter: us-west
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.243.206.80  989.7 GB   256     100.0%            3d8ea269-3e59-400c-9f77-727da2bcf8a6  206
UN  10.243.206.81  988.49 GB  256     100.0%            5832b870-fcfc-4046-a2d5-eff65fa53f4c  206
UN  10.243.206.82  987.92 GB  256     100.0%            b8d0792a-b5fb-433f-a9f6-ce1110a3420b  206

ALTER KEYSPACE <KS>

cqlsh> ALTER KEYSPACE HJUG WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'us-east' : 3, 'us-west' : 2 };

You then need to run a repair

DROP KEYSPACE <KS>

cqlsh> drop keyspace HJUG;
cqlsh> drop keyspace if exists HJUG;

Immediate and irreversible removal

Using KS

cqlsh> use HJUG;
cqlsh> describe keyspace HJUG;

To go further

• partitioner
• snitch
• rack
• seeds
• nodetool
• read the configuration file

Creating table with a single primary key

cqlsh:HJUG> CREATE TABLE users (
  username varchar,
  password varchar,
  […],
  PRIMARY KEY (username));

Creating table with a compound primary key

cqlsh:HJUG> CREATE TABLE users (
  username varchar,
  location_id varchar,
  […],
  PRIMARY KEY (username, location_id));

partition key: username
clustering column (ordering): location_id

Creating table with a composite primary key

cqlsh:HJUG> CREATE TABLE users (
  username varchar,
  location_id varchar,
  […],
  PRIMARY KEY ((username, location_id)));

each (username, location_id) combination is stored in a partition of its own

ALTER TABLE <T>

cqlsh:HJUG> ALTER TABLE users ADD last_login varchar;
cqlsh:HJUG> ALTER TABLE users ALTER last_login TYPE timestamp;
cqlsh:HJUG> ALTER TABLE users DROP last_login;

cqlsh:HJUG> ALTER TABLE users with COMPRESSION = {'sstable_compression': ''};

DESCRIBE TABLE <T>

cqlsh> use HJUG;
cqlsh:HJUG> DESCRIBE TABLE users;

CREATE TABLE users (
  username varchar,
  location_id varchar,
  […],
  PRIMARY KEY (username, location_id)
) WITH
[…]
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};

INSERT

cqlsh> INSERT INTO HJUG.users (username, location_id) VALUES ('janguenot', 'Houston');

cqlsh> use HJUG;
cqlsh:HJUG> INSERT INTO users (username, location_id) VALUES ('janguenot', 'Houston');

UPDATE

cqlsh:HJUG> UPDATE USERS SET X='Y' WHERE username='janguenot' AND location_id='Houston';

SELECT

cqlsh:HJUG> SELECT * FROM USERS;
cqlsh:HJUG> SELECT * FROM USERS WHERE username = 'janguenot' ORDER BY location_id ASC;
cqlsh:HJUG> SELECT * FROM USERS WHERE username = 'janguenot';

Remember: ORDER BY can ONLY be used with clustering columns (part of the primary key), and the partition key must be restricted in the WHERE clause.

CQL predicates

• on partition keys: =, IN
• on the clustering columns: <, <=, =, >=, >, IN

Performance considerations

• queries against a single partition are fast
  • pk = <whatever>
• queries spanning multiple partitions are slow
  • new disk seek for each partition
• queries spanning multiple clustering columns are fast

GROUP BY?

• use the partition key and clustering columns for grouping
• no GROUP BY statement

DELETE

cqlsh:HJUG> DELETE FROM USERS WHERE username = 'janguenot' AND location_id = 'Houston';

Deleted values are first marked with tombstones and are permanently removed by a later compaction (once gc_grace_seconds has elapsed)

TRUNCATE TABLE <T>

cqlsh:HJUG> truncate table users;

DROP TABLE <T>

cqlsh:HJUG> drop table users;
cqlsh:HJUG> drop table if exists users;

CQL Types

CQL Collections

cqlsh:HJUG> CREATE TABLE users (
  username varchar,
  password varchar,
  emails set<text>,
  PRIMARY KEY (username));

• Set, List and Map are supported
• 1 to many relationship
• they get serialized: keep them small or use an extra table
• lists, which are ordered, are less performant: use a set if possible, or consider additional tables for large collections

Secondary Indexes

• Query against a column outside the primary key
• CREATE INDEX <index_name> ON <T>(<column>);
• SELECT * FROM T WHERE column='x';
• Performance is good but not great, though it keeps getting better

Final remarks about CQL

• no sequences: you manage UUIDs at the app level (timeuuid types might be used for time series though)
• remember the partition key alone is not the primary key: beware of UPDATE (it creates or replaces)
• When in doubt, write: C* is good at it. Create tables and store data (one-to-one, one-to-many)
• Your application will drive your data model
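
Since there are no sequences, identifiers are typically generated in the application. A minimal sketch using only the JDK follows; the UserIds class is a made-up name, and java.util.UUID only produces random (version 4) UUIDs here, not the time-based ones needed for timeuuid columns.

```java
import java.util.UUID;

// Hypothetical helper, for illustration only.
final class UserIds {

    private UserIds() {
    }

    // Random (version 4) UUID generated by the application, in place of a database sequence.
    static UUID newId() {
        return UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println("new user id: " + newId());
    }
}
```

Because the identifier is created client side, no round trip to the cluster is needed before the row can be written.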

To go further

• TTL
• Counters
• Static columns
• Lightweight transactions (IF, IF NOT EXISTS)

DataStax native CQL Java driver

Main features

• Provides CQL3 access to C* using Java
• Uses the C* CQL native protocol
• Tunable policies (including consistency)
• Load balancing / reconnection / failover / routing of requests
• Prepared statements and batches
• Sync and async queries supported
• Query tracing supported (for debugging purposes)
• Drivers available for Python, C++ and C# as well (similar API)

Driver modules

• driver-core: the core layer
• driver-examples: example applications using the other modules, which are only meant for demonstration purposes

Maven dependency

<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>cassandra-driver-core</artifactId>
  <version>2.0.3</version>
</dependency>

Optional dependencies for compression

<dependency>
  <groupId>net.jpountz.lz4</groupId>
  <artifactId>lz4</artifactId>
  <version>1.2.0</version>
  <scope>runtime</scope>
</dependency>
<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.0.5</version>
  <scope>runtime</scope>
</dependency>

Driver documentation

• Docs: http://www.datastax.com/documentation/developer/java-driver/2.0/index.html
• API: http://www.datastax.com/drivers/java/2.0
• Jira: https://datastax-oss.atlassian.net/browse/JAVA
• Mailing list: https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user

Open Source

• Apache v2 licence
• https://github.com/datastax/java-driver

Examples

Step 1: connection to the cluster

import com.datastax.driver.core.Cluster;

Cluster.Builder clusterBuilder = Cluster.builder();

// Connect to one (1) node
clusterBuilder.addContactPoint("10.10.10.2");

// Connect to several nodes
clusterBuilder.addContactPoints("10.10.10.2", "10.10.10.3");

// Build the cluster
Cluster cluster = clusterBuilder.build();

// … do work with the cluster …

// Shut down the cluster (in driver 2.0, close() replaces the 1.0 shutdown() method)
cluster.close();

Step 2: connection to a keyspace

import com.datastax.driver.core.Session;

// Create a session against the keyspace you want to interact with
Session session = cluster.connect("HJUG");

// Close the session (in driver 2.0, close() replaces the 1.0 shutdown() method)
session.close();

Example 1: search queries and result set

// TODO catch exceptions

// Execute a query using the session and iterate over the results
ResultSet result = session.execute("SELECT * FROM users;");

// Option 1: iterate over the results
// (use one option or the other: a ResultSet can only be consumed once)
Iterator<Row> iter = result.iterator();
while (iter.hasNext()) {
    Row row = iter.next();
    log.info(String.format("Found user w/ username=%s", row.getString("username")));
}

// Option 2: get all rows and iterate
List<Row> rows = result.all();
for (Row row : rows) {
    log.info(String.format("Found user w/ username=%s", row.getString("username")));
}

Example 2: inserting data

// TODO catch exceptions

// INSERT a new user (TODO: escape parameters when used this way)
// String values must be single-quoted in the generated CQL
session.execute(String.format(
    "INSERT INTO users (username, location_id) VALUES ('%s', '%s');",
    "Jim", "Houston"));
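
The TODO about escaping parameters can be illustrated with a small helper. This is a sketch and not part of the driver: it relies on the CQL rule that string literals are delimited by single quotes and that an embedded single quote is escaped by doubling it. CqlLiterals and quote are hypothetical names, and binding values through prepared statements remains the safer route.

```java
// Hypothetical helper, for illustration only: prefer prepared statements with bound values.
final class CqlLiterals {

    private CqlLiterals() {
    }

    // Wraps a value in single quotes and doubles any embedded single quote.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        String cql = String.format(
                "INSERT INTO users (username, location_id) VALUES (%s, %s);",
                quote("Jim"), quote("O'Hare"));
        System.out.println(cql);
    }
}
```

Here the format string uses bare %s placeholders because quote() already adds the surrounding single quotes.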

Example 3: prepared statements

// TODO catch exceptions

// Create a prepared statement that can be reused throughout the application.
// You only need to create it once.
// USER is assumed to be a String constant holding the table name.
PreparedStatement usersByLocationStatement = session.prepare(String.format(
    "SELECT * FROM %s WHERE %s = ?;", USER, "location_id"));

// Create a bound statement
BoundStatement boundStatement = new BoundStatement(usersByLocationStatement);

// You can override the default consistency level defined at the cluster level
// on a per-query basis
boundStatement.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);

// Bind parameters
boundStatement.bind("Houston");

// Execute the bound statement and get results
ResultSet resultSet = session.execute(boundStatement);

Example 4: Batch Statement

// TODO catch exceptions

// Create a batch statement
// Type LOGGED ensures atomicity
BatchStatement batchStatement = new BatchStatement(BatchStatement.Type.LOGGED);

// Batches only accept data modification statements (INSERT, UPDATE, DELETE).
// insertUserStatement is assumed to be a PreparedStatement created once elsewhere, e.g.
// session.prepare("INSERT INTO users (username, location_id) VALUES (?, ?);")
BoundStatement boundStatement = new BoundStatement(insertUserStatement);
boundStatement.bind("Jim", "Houston");

// Add the bound statement to the batch
batchStatement.add(boundStatement);

// ... you can add several bound statements to the batch ...

// Execute the batch
session.execute(batchStatement);

Example 5: Synchronous vs Asynchronous

// TODO catch exceptions

// INSERT synchronously a new user (TODO: escape parameters when used this way)
session.execute(String.format(
    "INSERT INTO users (username, location_id) VALUES ('%s', '%s');",
    "Jim", "Houston"));

// INSERT asynchronously a new user (TODO: escape parameters when used this way)
session.executeAsync(String.format(
    "INSERT INTO users (username, location_id) VALUES ('%s', '%s');",
    "Jim", "Houston"));

Example 6: batching result sets

// We will get <limit> items at offset <x>
// offset = x;
// limit = y;

// Create bound statement and bind query parameters
BoundStatement boundStatement = new BoundStatement(usersByLocationStatement);
boundStatement.setFetchSize(limit);
boundStatement.bind("Houston");

// Execute bound statement and get results
ResultSet resultSet = session.execute(boundStatement);

for (int i = 0; i < (offset / limit); i++) {
    // Fetch the number of pages needed
    resultSet.fetchMoreResults();
}

Iterator<Row> iter = resultSet.iterator();
for (int i = 0; i < offset; i++) {
    // Throw away results from earlier pages
    if (iter.hasNext()) {
        iter.next();
    }
}

final List<Row> rows = new ArrayList<>();
for (int i = 0; i < limit; i++) {
    // Keep results from desired page
    if (iter.hasNext()) {
        rows.add(iter.next());
    }
}
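
The skip-then-collect loops at the end of Example 6 do not depend on the driver, so they can be isolated and tested on their own. The sketch below is an assumption about how one might factor them out; IteratorPaging and page are hypothetical names, and any Iterator (including the one returned by a ResultSet) could be passed in.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, for illustration only.
final class IteratorPaging {

    private IteratorPaging() {
    }

    // Skips `offset` items, then collects at most `limit` items from the iterator.
    static <T> List<T> page(Iterator<T> iter, int offset, int limit) {
        // Throw away results from earlier pages
        for (int i = 0; i < offset && iter.hasNext(); i++) {
            iter.next();
        }
        // Keep results from the desired page
        List<T> rows = new ArrayList<>();
        while (rows.size() < limit && iter.hasNext()) {
            rows.add(iter.next());
        }
        return rows;
    }
}
```

Skipping rows still means reading them, so deep offsets stay expensive with this approach.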

DataStax CQL driver rule #1

Use one Cluster instance per (physical) cluster (per application lifetime)

Cluster

• handles queries, connections and their policies
• share the Cluster instance at the application level
• must be tuned according to C* nodes / cluster configuration (timeouts, retries etc.)
• Consistency

Example of a more complex Cluster setup

// Initialize the cluster builder like in step 1.
// You can customize policies before build()
clusterBuilder
    .withQueryOptions(
        new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
    .withCompression(Compression.LZ4)
    .withSocketOptions(
        // Setting a value of 0 disables read timeouts: we let Cassandra time out
        // before the driver does.
        new SocketOptions().setConnectTimeoutMillis(1500)
            .setReadTimeoutMillis(0))
    .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("us-east"));

DataStax CQL driver rule #2

Use at most one Session per keyspace, or use a single Session and explicitly specify the keyspace in your queries

Session

• API centered around query execution
• manages per-node connection pools
• avoid a large # of sessions: major impact on server resources (C* side)
• share the Session instance at the application level
• one session per keyspace at most
• if large number of keyspaces: pre-defined number of sessions

DataStax CQL driver rule #3

if you execute a statement more than once, consider using a PreparedStatement

Prepared statements

• prepare once, bind and execute multiple times
• parsed and prepared on the Cassandra nodes
• cache prepared statements at the application level
• only the bound parameters and the prepared query id are sent to nodes
• performance gains are significant
• prepared statements should rarely receive null values when binding parameters
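
One way to cache prepared statements at the application level is a small map keyed by the CQL string. The following is a sketch under stated assumptions rather than a driver feature: StatementCache is a made-up class, and it is generic over the statement type P so that it compiles without the driver on the classpath. In an application, P would be the driver's PreparedStatement and the preparer function would wrap session.prepare(cql).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical application-level cache, for illustration only.
final class StatementCache<P> {

    // CQL string -> prepared statement
    private final ConcurrentMap<String, P> cache = new ConcurrentHashMap<>();

    // Function that actually prepares a query, e.g. a wrapper around session.prepare(cql).
    private final Function<String, P> preparer;

    StatementCache(Function<String, P> preparer) {
        this.preparer = preparer;
    }

    // Prepares the query on first use, then returns the cached instance.
    P get(String cql) {
        return cache.computeIfAbsent(cql, preparer);
    }
}
```

A single StatementCache would be shared application-wide, next to the shared Session, so that each distinct query is prepared only once.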

DataStax CQL driver rule #4

You can reduce the number of network roundtrips and also have atomic operations by using Batches

Batch operations

• single request
• combines multiple data modification statements into a single logical operation
• atomic operation: all statements pass or fail
• can use combinations of batch and prepared statements
• keep batch statements below the size specified in the conf file: batch_size_warn_threshold_in_kb (5 kb by default)

Thanks!

Slides available @ http://www.slideshare.net/anguenot/cassandra-cql-javahjug20140730

@anguenot / ja@iland.com

iland: http://www.iland.com
We are hiring in Houston!

https://www.linkedin.com/company/iland-internet-solutions/careers
