An Overview for Architects, Developers, and IT Managers - DataStax

What’s New in Apache Cassandra™ 1.2?An Overview for Architects, Developers, and IT Managers

White PaperBY DATASTAX CORPORATION

DECEMBER 2012

ContentsIntroduction 3

Why Cassandra? 3

What’s New in Cassandra 1.2? 5

Manageability Enhancements 5

Virtual Nodes (Vnodes) 5

Parallel Leveled Compaction 7

Off-Heap Bloom Filters and Compression Metadata 7

Improved JBOD Functionality 7

Performance Enhancements 8

Query Profiling/Tracing 8

Faster Node Bootup/Startup 9

Murmur3Partitoner 9

Miscellaneous Performance Enhancements 9

Development Enhancements 9

Collections 9

Atomic Batches 11

Flat File Load/Export Utility 11

Native/Binary CQL Transport 12

Concurrent Schema Changes 12

CQL Enhancements 12

Additional Metadata Information 12

Getting Started with Cassandra 1.2 13

Cassandra for Production Environments 13

Conclusion 13

About DataStax 13

IntroductionApache Cassandra, an Apache Software Foundation project, is a massively scalable NoSQL database. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance.

This paper discusses the new features contained within the 1.2 version of Cassandra. For more general information on NoSQL and NoSQL use cases, as well as an introduction to Cassandra, please see the “Why NoSQL?” and “Introduction to Apache Cassandra” white papers on DataStax.com.

Why Cassandra? Many modern businesses use Cassandra to power the applications that transform their business. Some companies using Cassandra today include the following:

Core features in Cassandra that cause many to choose the database for their big data, modern business systems include the following:

Massively scalable architecture – Cassandra’s masterless, peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability. Cassandra is the acknowledged NoSQL leader1 when it comes to comfortably scaling to terabytes or petabytes of data, while maintaining industry-leading write and read performance.

Linear scale performance – Nodes added to a Cassandra cluster (all done online) increase the throughput of a database in a predictable, linear fashion2 for both read and write operations, even in the cloud where such predictability can be difficult to ensure.

Continuous availability – Data is replicated to multiple nodes in a Cassandra database cluster to protect from loss during node failure and provide continuous availability with no downtime.

Abstract Many modern businesses use Cassandra to power the applica-tions that transform their business, with widely varying use cases: healthcare management, online gaming, e-commerce, media streaming, social media, and many more. Cassandra users enjoy massive scalability, continu-ous availability, fault detection and recovery, data consistency, and simplicity of installation that remain unmatched. The newest version of Cassandra offers many new enhancements including virtual nodes and parallel leveled compaction, query profiling/trac-ing, faster node bootup/startup, collections and atomic batches.

Figure 1:

Sample of companies currently using Cassandra

1 http://wikibon.org/wiki/v/Cassandra_Continues_to_Win_Real-Time_Big_Data_Converts

2 http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Transparent fault detection and recovery – Cassandra clusters can grow into the hundreds or thousands of nodes. Because Cassandra was designed for commodity servers, machine failure is expected. Cassandra utilizes gossip protocols to detect machine failure and recover when a machine is brought back into the cluster – all without the application noticing.

Flexible, dynamic schema data modeling – Cassandra offers the organization of a traditional RDBMS table layout combined with the flexibility and power of no stringent structure requirements. This allows data to be dynamically stored as needed without performance penalty for changes that occur. In addition, Cassandra can store structured, semi-structured, and unstructured data.

Guaranteed data safety – Cassandra far exceeds other systems on write performance due to its append-only commit log while always ensuring durability. Users must no longer trade off durability to keep up with immense write streams. Data is absolutely safe in Cassandra; data loss is not possible.

Distributed, location independence design – Cassandra’s architecture avoids the hot spots and read/write issues found in master-slave designs. Users can have a highly distributed database (e.g., multiple geographies, multiple data centers) and read or write to any node in a cluster without concern over what node is being accessed.

Tunable data consistency – Cassandra offers flexible data consistency on a cluster, data center, or individual I/O operation basis. Very strong or eventual data consistency among all participating nodes can be set globally and also controlled on a per-operation basis (e.g., per INSERT, per UPDATE) in Cassandra’s drivers and client libraries.

Multiple data center replication – Whether it’s keeping data in multiple locations for disaster recovery scenarios or locating data physically near its end users for fast performance, Cassandra offers support for multiple data centers. Administrators simply configure how many copies of the data they want in each data center, and Cassandra handles the rest – replicating the data automatically. Cassandra is also rack-aware and can keep replicas of data stored on different physical racks, which helps ensure uptime in the case of single rack failures.

Cloud-enabled – Cassandra’s architecture maximizes the benefits of running in the cloud. Also, Cassandra allows for hybrid data distribution where some data can be kept on-premise and some in the cloud.

Data compression – Cassandra supplies built-in data compression, with up to an 80 percent reduction in raw data footprint. More importantly, Cassandra’s compression results in no performance penalty, with some use cases showing actual read/write operations speeding up due to less physical I/O being required.

CQL (Cassandra Query Language) – Cassandra provides a SQL-like language called CQL that mirrors SQL’s DDL, DML, and SELECT syntax. CQL greatly decreases the learning curve for those coming from RDBMS systems because they can use familiar syntax for all object creation and data access operations.

No caching layer required – Cassandra offers caching on each of its nodes. Coupled with Cassandra’s scalability characteristics, nodes can be incrementally added to the cluster to keep as much data in memory as needed. The result is that there is no need for a separate caching layer.

No special hardware needed – Cassandra runs on commodity machines and requires no expensive or special hardware.

Incremental and elastic expansion – The Cassandra ring allows online node additions. Because of Cassandra’s fully distributed architecture, every node type is the same, which means clusters can grow as needed without any complex architecture decisions.

Simple install and setup – Cassandra can be downloaded and installed in minutes, even for multi-cluster installs.

Ready for developers – Cassandra has drivers and client libraries for all the popular development languages (e.g., Java, Python)

Given these technical features and benefits, the following are typical big data use cases handled well by Cassandra in the enterprise:

Real-time, big data workloads

Time series data management

High-velocity device data ingestion and analysis

Healthcare system input and analysis

Media streaming management (e.g., music, movies)

Social media (i.e., unstructured data) input and analysis

Online web retail (e.g., shopping carts, user transactions)

Real-time data analytics

Online gaming (e.g., real-time messaging)

Software as a Service (SaaS) applications that utilize web services

Write-intensive systems

What’s New in Cassandra 1.2? Cassandra 1.2 includes few features in the areas of manageability, performance, and developer functionality.

Manageability Enhancements Virtual Nodes (Vnodes)Those who have worked with Cassandra in the past know how the database distributes data across a cluster of nodes. A numerical token is assigned each node, which makes it responsible for one range of data in the cluster. While this paradigm has worked very well for scaling out massive databases, it has a few limitations.

First, when a new node is added to an existing cluster, anywhere from one to a handful of existing nodes will participate in bootstrapping the new node with its data. The same is true if a node goes down and needs to be replaced. If the amount of data is large, then the process could be one that is very resource intensive on the nodes participating and time consuming overall. Because of this, a rule-of-thumb recommendation for Cassandra has been to deploy ‘thin nodes’, which have equated to keeping about ½ TB of data on each node.

Second, when new nodes are added to an existing cluster, the cluster becomes unbalanced where its data distribution is concerned (i.e. some nodes having more/less data than others). From a performance perspective, an even distribution of data is desired so the newly modified cluster must go through a rebalance operation, which if done manually can be an error prone and potentially long process.

In Cassandra 1.2, virtual nodes – or ‘vnodes’ – have been implemented to overcome these issues and provide easier manageability. Vnodes change the previous Cassandra paradigm from one token or range per node, to many per node. Within a cluster these can be randomly selected and be non-contigu-ous, resulting in smaller ranges that belong to each node:

Vnodes provide the following core benefits:

Rather than just one or a couple nodes participating in bootstrapping new nodes, all nodes participate in the operation, thus parallelizing the task with the end result being much faster performance for node addition operations.

The need to adhere to the ‘thin node’ recommendation for Cassandra no longer applies.

Vnodes automatically maintain the data distribution / balance of a cluster so there is no need to perform any rebalance operation after a cluster has been modified.

Enabling vnodes for a Cassandra cluster is easy and straightforward. Rather than assign each node a token, a new configuration parameter – num_tokens – is used to specify the number of vnodes tokens to use for a cluster (a good default is 256).

For more information on vnodes, including instructions on upgrading an existing cluster to use vnodes, please see the DataStax online documentation.

3

Core features in Cassandra that cause many to choose the database for their big data, modern

Apache Cassandra, an Apache Software Foundation project, is a massively scalable NoSQL database. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance.












































4













































5













































6

Figure 2:

Comparison between non-vnodes and vnodes implementation

Parallel Leveled CompactionLeveled compaction has proven to be effective for update-intensive workloads, but has been limited by allowing only one leveled compaction at a time to run at a time per table. This has been true no matter how many hard disks or SSDs that data was spread across.

Parallel leveled compaction in Cassandra 1.2 is aimed at providing more efficient and faster compac-tion operations to deployments that especially occur on SSD hardware. Whereas the general idea for compaction processes is to mitigate its impact on the overall operation of nodes (which typically results in longer compaction times but less resource intensive operations), SSD implementations lend themselves to speeding up compaction tasks. Parallel leveled compaction allows for this to be the case on clusters deployed with SSD’s by allowing the LCS to run up to concurrent_compactors compac-tions across different SSTable ranges (including multiple compactions within the same level).

Off-Heap Bloom Filters and Compression MetadataCassandra 1.2 helps reduce the Java heap requirements for large datasets by moving the memory used for bloom filters and compression metadata into native memory. Java heap sizes that exceed 8GB tend to cause garbage collection operations to impact performance, so the off-heap enhancement for bloom filters helps reduce that possibility.

Improved JBOD Functionality Before Cassandra 1.2, a single disk going down in a JBOD (just a bunch of disks) configuration had the potential to make an entire node unavailable for I/O operations. Version 1.2 introduces a new disk_-failure_policy configuration setting that allows you to choose from two policies that deal with disk failure:

stop is the default setting for new 1.2 installations. Upon encountering a file system error Cassan-dra will shut down gossip and Thrift services, leaving the node effectively unavailable, but still reachable via JMX for troubleshooting.

best_effort This new Cassandra option will allow the database to do its best in the event of a disk error. If Cassandra can’t write to a disk, the disk will become blacklisted for writes and the node will continue writing elsewhere. If Cassandra can’t read from a disk, it will be marked as unreadable, and the node will continue serving data from readable sstables only. This implies that it’s possible, if the consistency level is ONE, for stale data to be served when the most recent version of data is on the unreadable disk, so choose this option with care.

ignore This policy exists for users upgrading from prior versions of Cassandra. In this mode, the database will behave in the exact same manner as older versions – all file system errors will be logged but otherwise ignored. DataStax recommends using either the stop or best_effort policies instead.

For more information on version 1.2’s improved JBOD support, see DataStax’s online documentation.

7

Performance Enhancements Query Profiling/Tracing Version 1.2 of Cassandra provides new performance diagnostic utilities aimed at helping you under-stand, diagnose, and troubleshoot CQL statements that are sent to a Cassandra cluster. You can interrogate individual CQL statements in an ad-hoc manner, or perform a system-wide collection of all queries/commands that are sent to a cluster.

For example, to understand how a Cassandra cluster will satisfy a single CQL INSERT statement, you would enable the trace utility from the CQL command prompt, issue your query, and review the diagnostic information provided: cqlsh> tracing on;

Now tracing requests.

cqlsh:foo> INSERT INTO test (a, b) VALUES (1, 'example');

Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

activity | timestamp | source | source_elapsed

-------------------------------------+-----------------+-------------+----------------

execute_cql3_query | 00:02:37,015 | 127.0.0.1 | 0

Parsing statement | 00:02:37,015 | 127.0.0.1 | 81

Preparing statement | 00:02:37,015 | 127.0.0.1 | 273

Determining replicas for mutation | 00:02:37,015 | 127.0.0.1 | 540

Sending message to /127.0.0.2 | 00:02:37,015 | 127.0.0.1 | 779

Messsage received from /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 63

Applying mutation | 00:02:37,016 | 127.0.0.2 | 220

Acquiring switchLock | 00:02:37,016 | 127.0.0.2 | 250

Appending to commitlog | 00:02:37,016 | 127.0.0.2 | 277

Adding to memtable | 00:02:37,016 | 127.0.0.2 | 378

Enqueuing response to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 710

Sending message to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 888

Messsage received from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2334

Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2550

Request complete | 00:02:37,017 | 127.0.0.1 | 2581

Cassandra provides a description of each step it takes to satisfy the request along with what node(s) are affected, the time for each step, and the total time for the request.

In addition to individual query analysis, database administrators and system admins oftentimes need to collect all statements that are sent to a database to understand what the most resource intensive statements are and locate queries that need to be tuned. Cassandra 1.2 allows you to set a new node tool option – settraceprobability – to trace some or all statements sent to a cluster. A probability of 1.0 will trace everything whereas lesser amounts (e.g. 0.10) only sample a certain percentage of statements. Care should be taken on large and active systems, as system-wide tracing will have a performance impact.

The trace information is stored in a new systems_traces keyspace that holds two tables – sessions and events, which can be easily queried to answer questions such as what the most time-consuming query has been since a trace was started, and much more.

For more information on tracing and troubleshooting CQL statements in version 1.2, see DataStax’s online documentation.

8

Faster Node Bootup/StartupCassandra 1.2 provides faster startup/bootup times for each node in a cluster, with internal tests performed at DataStax showing up to 80% less time needed to start a Cassandra node. The startup reductions were realized through more efficient sampling and loading of SSTable indexes into memory caches.

Murmur3Partitoner Cassandra 1.2 supplies a new default partitioner: the Murmur3Partitioner, which based on the Murmur3 hash. The Murmur3 hash is 3x-5x faster than the prior MD5 has used in earlier versions of Cassandra; this translates into performance gains of over 10% for index-heavy workloads.

Note that the new Murmur3Partitioner is not backwards compatible with the previously used Random-Partitioner. Any upgrades from earlier versions of Cassandra necessitate that the partitioner being used is the RandomPartitioner.

Miscellaneous Performance EnhancementsVersion 1.2 of Cassandra 1.2 supplies a number of other performance enhancements including:

A new approach to index maintenance, which improves the speed at which indexes are updated.

More efficient and faster streaming of data during bootstrap or repair operations.

Faster replica recovery via a new concurrent hint delivery mechanism.

Development Enhancements Collections Version 1.2 of Cassandra introduces a new mechanism for storing data called collections. The general idea behind collections is to provide easier methods for inserting and manipulating data that consists of multiple items that you want to store in a single column; for example, multiple email addresses for a single employee. There are three different types of collections you can select from: (1) sets; (2) lists; (3) maps.

Sets

A set allows for the storage of a group of elements that are returned in sorted order when queried. For example, if you wanted to store multiple emails for the employees of a company, you might create the following table: cqlsh> CREATE TABLE emp (

emp_id int PRIMARY KEY,

first_name text,

last_name text,

emails set<text>

);

cqlsh> INSERT INTO emp (emp_id, first_name, last_name, emails)

VALUES(1, 'Laura', 'Jung', {'[email protected]',

'[email protected]'});

Sets may be added to:cqlsh> UPDATE emp

SET emails = emails + {[email protected]'}

WHERE emp_id = 1;

Sets may be queried:

cqlsh> SELECT emp_id, emails

FROM emp

WHERE emp_id = 1;

emp_id | emails

---------+-------------------------------------------------------------

1 | {'[email protected] ,"[email protected]","[email protected]"}

9

Sets may be deleted from on an individual item basis:cqlsh> UPDATE emp

SET emails = emails - {'[email protected]'} WHERE emp_id = 1;

Or sets may be deleted from in total, in one of two ways:cqlsh> UPDATE emp SET emails = {} WHERE emp_id = 1;

cqlsh> DELETE emails FROM emp WHERE emp_id = 1;

Lists

A set allows for the storage of a group of elements that are returned in sorted order when queried. For example, if you wanted to store multiple emails for the employees of a company, you might create the following table: cqlsh> ALTER TABLE emp ADD depts_mngd list<text>;

cqlsh> UPDATE emp

SET depts_mngd = [ 'engineering’, 'support' ] WHERE emp_id = 1;

With lists, you can prepend and append new items:cqlsh> UPDATE emp

SET depts_mngd = ['QA' ] + depts_mngd WHERE emp_id = 1;

cqlsh> UPDATE emp

SET depts_mngd = depts_mngd + [ 'doc' ] WHERE emp_id = 1;

You can also manipulate an item by its index:cqlsh> UPDATE emp SET depts_mngd [4] = 'docs' WHERE emp_id = 1;

cqlsh> DELETE depts_mngd [4] FROM emp WHERE emp_id = 1;

Lastly, you can remove list items by value (note that all instances of the value will be removed from the list):cqlsh> UPDATE emp

SET depts_mngd = depts_mngd - ['QA'] WHERE emp_id = 1;

Maps

As its name implies, a map maps one thing to another. For example, you might want to record the dates of performance reviews along with the end score of each review for each employee in your employee table: cqlsh> ALTER TABLE emp ADD perf_reviews map<timestamp, int>;

cqlsh> UPDATE users

SET perf_reviews = { '2012-04-01' : 95,

'2012-07-01' : 97 }

WHERE emp_id = 1;

Maps can be added to and items can be manipulated:cqlsh> UPDATE emp

SET pref_reviews['2012-10-01'] = 90

WHERE emp_id = 1;

cqlsh> DELETE perf_reviews['2012-10-01']

FROM emp

WHERE emp_id = 1;

For more information on collections, see DataStax’s online documentation.

Faster Node Bootup/StartupCassandra 1.2 provides faster startup/bootup times for each node in a cluster, with internal tests performed at DataStax showing up to 80% less time needed to start a Cassandra node. The startup reductions were realized through more efficient sampling and loading of SSTable indexes into memory caches.

Murmur3Partitoner Cassandra 1.2 supplies a new default partitioner: the Murmur3Partitioner, which based on the Murmur3 hash. The Murmur3 hash is 3x-5x faster than the prior MD5 has used in earlier versions of Cassandra; this translates into performance gains of over 10% for index-heavy workloads.

Note that the new Murmur3Partitioner is not backwards compatible with the previously used Random-Partitioner. Any upgrades from earlier versions of Cassandra necessitate that the partitioner being used is the RandomPartitioner.

Miscellaneous Performance EnhancementsVersion 1.2 of Cassandra 1.2 supplies a number of other performance enhancements including:

A new approach to index maintenance, which improves the speed at which indexes are updated.

More efficient and faster streaming of data during bootstrap or repair operations.

Faster replica recovery via a new concurrent hint delivery mechanism.

Development Enhancements Collections Version 1.2 of Cassandra introduces a new mechanism for storing data called collections. The general idea behind collections is to provide easier methods for inserting and manipulating data that consists of multiple items that you want to store in a single column; for example, multiple email addresses for a single employee. There are three different types of collections you can select from: (1) sets; (2) lists; (3) maps.

Sets

A set allows for the storage of a group of elements that are returned in sorted order when queried. For example, if you wanted to store multiple emails for the employees of a company, you might create the following table: cqlsh> CREATE TABLE emp (

emp_id int PRIMARY KEY,

first_name text,

last_name text,

emails set<text>

);

cqlsh> INSERT INTO emp (emp_id, first_name, last_name, emails)

VALUES(1, 'Laura', 'Jung', {'[email protected]',

'[email protected]'});

Sets may be added to:cqlsh> UPDATE emp

SET emails = emails + {[email protected]'}

WHERE emp_id = 1;

Sets may be queried:

cqlsh> SELECT emp_id, emails

FROM emp

WHERE emp_id = 1;

emp_id | emails

---------+-------------------------------------------------------------

1 | {'[email protected] ,"[email protected]","[email protected]"}

Sets may be deleted from on an individual item basis:cqlsh> UPDATE emp

SET emails = emails - {'[email protected]'} WHERE emp_id = 1;

Or sets may be deleted from in total, in one of two ways:cqlsh> UPDATE emp SET emails = {} WHERE emp_id = 1;

cqlsh> DELETE emails FROM emp WHERE emp_id = 1;

Lists

A set allows for the storage of a group of elements that are returned in sorted order when queried. For example, if you wanted to store multiple emails for the employees of a company, you might create the following table: cqlsh> ALTER TABLE emp ADD depts_mngd list<text>;

cqlsh> UPDATE emp

SET depts_mngd = [ 'engineering’, 'support' ] WHERE emp_id = 1;

With lists, you can prepend and append new items:cqlsh> UPDATE emp

SET depts_mngd = ['QA' ] + depts_mngd WHERE emp_id = 1;

cqlsh> UPDATE emp

SET depts_mngd = depts_mngd + [ 'doc' ] WHERE emp_id = 1;

You can also manipulate an item by its index:cqlsh> UPDATE emp SET depts_mngd [4] = 'docs' WHERE emp_id = 1;

cqlsh> DELETE depts_mngd [4] FROM emp WHERE emp_id = 1;

Lastly, you can remove list items by value (note that all instances of the value will be removed from the list):cqlsh> UPDATE emp

SET depts_mngd = depts_mngd - ['QA'] WHERE emp_id = 1;

Maps

As its name implies, a map maps one thing to another. For example, you might want to record the dates of performance reviews along with the end score of each review for each employee in your employee table: cqlsh> ALTER TABLE emp ADD perf_reviews map<timestamp, int>;

cqlsh> UPDATE users

SET perf_reviews = { '2012-04-01' : 95,

'2012-07-01' : 97 }

WHERE emp_id = 1;

Maps can be added to and items can be manipulated:cqlsh> UPDATE emp

SET pref_reviews['2012-10-01'] = 90

WHERE emp_id = 1;

cqlsh> DELETE perf_reviews['2012-10-01']

FROM emp

WHERE emp_id = 1;

For more information on collections, see DataStax’s online documentation.

10

Atomic Batches Prior versions of Cassandra allowed for batch operations, which allowed you to group related updates into a single statement. If some of the replicas for the batch failed mid-operation, the coordinator would hint those rows automatically. However, if the coordinator itself failed in mid operation, you could end up with partially applied batches.

In version 1.2 of Cassandra, batch operations are guaranteed by default to be atomic, and are handled differently than in earlier versions of the database. When a batch is written in 1.2, it is first written to a new system table that consumes the serialized batch as blob data. After the rows in the batch have been successfully written and persisted (or hinted), the system table entry is removed.

Again, the default functionality for batches in version 1.2 is for any batch to be atomic (i.e. all or nothing). It should be noted that there is a performance penalty for using atomic batches, so for use cases that necessi-tate batch operations, but either have client side workarounds or other methods for ensuring batch atomicity, a BEGIN UNLOGGED BATCH command is supplied for cases when performance is more important than atomicity guarantees. This is akin to using unlogged statements in many RDBMS’s.

In addition, version 1.2 also introduces a new BEGIN COUNTER BATCH command for batched counter updates. Unlike other writes in Cassandra, counter updates are not idempotent, so replaying them automati-cally from the new system table is not safe. Counter batches are thus strictly for improved performance when updating multiple counters in the same partition.

Lastly, it should be understood that although an atomic batch guarantees that if any part of the batch succeeds, all of it will, no other transactional enforcement is done at the batch level. For example, there is no batch isolation – other clients will be able to read the first updated rows from the batch, while other rows are in progress. However, transactional row updates within a single row are isolated (i.e. a partial row update cannot be read).

Flat File Load/Export Utility Cassandra 1.2 contains a utility that makes it easy to import and export flat file data to/from Cassandra tables. Although it was initially introduced in Cassandra 1.1.3, the new load utility wasn’t formally announced with that version, so an explanation of it is warranted in this document.

The utility mirrors the COPY command from the PostgreSQL RDBMS and is used in Cassandra’s CQL shell. A variety of file formats and delimiters are supported including comma-separated value (CSV), tabs, and more, with CSV being the default.

The syntax for the COPY command is the following: COPY <column family / table name> [ ( column [, ...] ) ] FROM ( '<filename>' | STDIN ) [

WITH <option>='value' [AND ...] ];

COPY <column family / table name> [ ( column [, ...] ) ] TO ( '<filename>' | STDOUT ) [

WITH <option>='value' [AND ...] ];

Below are simple examples of the COPY command in action: cqlsh> SELECT * FROM airplanes;

name | mach | manufacturer | year

--------------+------+--------------+------

P38-Lightning | 0.7 | Lockheed | 1937

cqlsh> COPY airplanes (name, mach, year, manufacturer) TO 'temp.csv'

1 rows exported in 0.004 seconds.

cqlsh> TRUNCATE airplanes;

cqlsh> COPY airplanes (name, manufacturer, year, mach) FROM 'temp.csv'; 1 rows imported in

0.087 seconds.

For more information about the COPY command, see DataStax’s online documentation.

11

Native/Binary CQL Transport Prior to Cassandra 1.2, the Cassandra Query Language (CQL) API had been using Thrift as a network transport, but now with version 1.2 and above, a new binary protocol is available for CQL that does not require Thrift.

There are a number of benefits that the new 1.2 native CQL transport provides:

Thrift is a synchronous transport meaning only one request can be active at a time for a connec-tion. By contrast, the new native CQL transport allows each connection to handle more than one active request at the same time. This translates into client libraries only needing to maintain a relatively low number of open connections to a Cassandra node in order to maximize performance, and helps scale large clusters.

Thrift is an RPC mechanism, which means you cannot have a Cassandra server push information to a client. However the new native CQL protocol allows clients to register for certain types of event notifications from a server. As of 1.2, currently supported events include [1] cluster topology changes (e.g. a node joins the cluster, is removed, is moved, etc.); [2] status changes (e.g. a node is detected up/down); [3] schema changes (e.g. a table has been modified, etc.). These new capabilities allow clients to stay up to date with the state of the Cassandra cluster without having to poll the cluster regularly.

The new native protocol allows for messages to be compressed if desired.

Thrift is still the default transport in 1.2, so if you want to use the new binary protocol, you will need to change the start_native_transport option to true in the cassandra.yaml file (you can also turn start_rpc to false if you’re not going to use the thrift interface). You will also need a client driver that supports this new binary protocol such as the new DataStax Java and .NET drivers.

Concurrent Schema Changes While Cassandra 1.1 introduced the ability to modify objects in a concurrent fashion across a cluster, it did not include support for programmatically creating and dropping column families / tables (either permanent or temporary) in a concurrent manner. Version 1.2 supplies this functionality, which means multiple users may add/drop tables at the same time in the same cluster.

CQL Enhancements There have been numerous enhancements to CQL made in Cassandra 1.2. Changes include a new ALTER KEYSPACE statement, syntax additions to understand how long a TTL column has remaining, support for conditional operators, and much more. For a full list of all CQL additions in version 1.2, please see the DataStax online documentation.

Additional Metadata Information Cassandra 1.2 delivers new data dictionary objects that can be queried to find out cluster demographic information and more. The three new metadata tables in the Cassandra system keyspace are:

schema_keyspaces – provides quick access to keyspace metadata.

local – supplies demographic data for the local node that is currently connected to.

peers – provides information for peer nodes in a cluster.

Example output from the schema_keyspaces data dictionary object might be:

SELECT * from system.schema_keyspaces;

keyspace | durable_writes | name | strategy_class | strategy_options

----------+----------------+---------+----------------+--------------------

history | True | history | SimpleStrategy | {"replication_factor":"1"}

ks_info | True | ks_info | SimpleStrategy | {"replication_factor":"1"}

12

About DataStax

DataStax provides a massively scalable big data platform to run mission-critical business applications for some of the world’s most innovative and data-inten-sive enterprises. Powered by the open source Apache Cassandra™ database, DataStax delivers a fully distributed, continuously available platform that is faster to deploy and less expensive to maintain than other database platforms.

DataStax has more than 250 customers including leaders such as Netflix, Rackspace, Pearson Education, and Constant Contact, and spans verticals including web, financial services, telecommunications, logistics, and government. Based in San Mateo, Calif., DataStax is backed by industry-leading investors including Lightspeed Venture Partners, Meritech Capital, and Cross-link Capital.

For more information, visit www.datastax.com.

Getting Started with Cassandra 1.2 The easiest way to get started with Cassandra 1.2 is by downloading and using DataStax Community Edition that bundles the latest version of Apache Cassandra, sample database and applications, and a free version of DataStax OpsCenter, which is a visual management and monitoring solution for Cassandra and other big data platforms.

You can find out more about DataStax Community Edition and obtain downloads by visiting: http://www.planetcassanra.org.

Cassandra for Production Environments DataStax Enterprise Edition is a big data platform that provides a production-ready version of Cassandra, which is integrated with Hadoop for analytics and Apache Solr for enterprise search. DataStax Enterprise Edition is completely free to use in development environments, however production deployments do require a subscription be purchased from DataStax.

You can find out more about DataStax Enterprise Edition and find downloads at: http://www.datastax.com/products/enterprise.

Conclusion To find out more about Apache Cassandra and DataStax, and to obtain downloads of Cassandra and DataStax Enterprise software, please visit www.datastax.com or send an email to [email protected]. Note that DataStax Enterprise Edition is completely free to use in development environments, while production deployments require a software subscription to be purchased.

DataStax powers the big data apps that transform business for more than 200 customers, including startups and 20 of the Fortune 100. DataStax delivers a massively scalable, flexible and continuously available big data platform built on Apache Cassandra™. DataStax integrates enterprise-ready Cassandra, Apache Hadoop™ for analytics and Apache Solr™ for search across multi-datacenters and in the cloud.

Companies such as Adobe, Healthcare Anytime, eBay and Netflix rely on DataStax to transform their businesses. Based in San Mateo, Calif., DataStax is backed by industry-leading investors: Lightspeed Venture Partners, Crosslink Capital and Meritech Capital Partners. For moreinformation, visit DataStax.com or follow us on Twitter @DataStax.

777 Mariners Island Blvd #510 San Mateo, CA 94404 650-389-6000

Documents

An Overview for Architects, Developers, and IT Managers - DataStax