Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Cassandra 1.0 and the future of big data

Jonathan Ellis

Tuesday, October 4, 2011

About me

✤ Project chair, Apache Cassandra
  ✤ Active since Dec 2008
  ✤ First non-Facebook committer
  ✤ Wrote ~30% of committed patches, reviewed ~40% of the rest
✤ Distributed systems background
  ✤ At Mozy, built a multi-petabyte, scalable storage system based on Reed-Solomon encoding
✤ Founder and CTO, DataStax

About DataStax

✤ Founded in April 2010
✤ Commercial leader in Apache Cassandra
✤ 100+ customers
✤ 30+ employees
✤ Home to Apache Cassandra Chair & most committers
✤ Headquartered in San Francisco Bay area, California
✤ Secured $11M in Series B funding in Sep 2011

Job Trends (indeed.com)

“Big Data” trend

Big data

Analytics (Hadoop)

Realtime (“NoSQL”)?

✤ Financial
✤ Social Media
✤ Advertising
✤ Entertainment
✤ Energy
✤ E-tail
✤ Health care
✤ Government

Some Cassandra users

Common use cases

✤ Time series data
✤ Messaging
✤ Ad tracking
✤ Data mining
✤ User activity streams
✤ User sessions
✤ Anything requiring: scalable + performant + highly available

Why people choose Cassandra

✤ Multi-master, multi-DC
✤ Linearly scalable
✤ Larger-than-memory datasets
✤ Best-in-class performance (not just writes!)
✤ Fully durable
✤ Integrated caching
✤ Tuneable consistency
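The "tuneable consistency" bullet can be illustrated with the usual replica-overlap rule: with N replicas, a read that contacts R of them is guaranteed to see a write acknowledged by W of them when R + W > N. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and not part of any Cassandra API.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def replicas_overlap(n: int, r: int, w: int) -> bool:
    """True when every set of r read replicas must intersect every set of
    w write replicas, i.e. when R + W > N."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


if __name__ == "__main__":
    n = 3  # replication factor (assumed for the example)
    # QUORUM reads with QUORUM writes always overlap.
    print(replicas_overlap(n, quorum(n), quorum(n)))
    # ONE read with ONE write may miss the latest value.
    print(replicas_overlap(n, 1, 1))
```

Lower R or W trades that guarantee for latency and availability, which is the knob the bullet refers to.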

✤ CREATE COLUMN FAMILY
✤ Expiring columns (TTL)
✤ Secondary (column) indexes
✤ Efficient streaming
✤ Efficient cross-datacenter writes

✤ CQL
✤ Counters
✤ Automatic memtable tuning
✤ New bulk load interface

✤ Compression
✤ Read performance
✤ LeveledCompactionStrategy
✤ CQL 2.0

Compression

✤ Rows-per-block or blocks-per-row

Classic size-tiered compaction

Level-based Compaction

✤ SSTables are non-overlapping within a level
✤ Bounds the number that can contain a given row

L2: 1000 MB

L1: 100 MB

L0: newly flushed
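A minimal sketch of why non-overlapping levels bound the number of SSTables per row, assuming each level holds ten times the data of the one above it (as the 100 MB and 1000 MB figures suggest). SSTables are modelled here as hypothetical (first_key, last_key) ranges; this is not Cassandra's implementation.

```python
from typing import List, Tuple

KeyRange = Tuple[str, str]  # (first_key, last_key), inclusive; hypothetical model


def level_capacity_mb(level: int, base_mb: int = 100, fanout: int = 10) -> int:
    """Capacity of level >= 1, assuming L1 = base_mb and a 10x growth per level."""
    if level < 1:
        raise ValueError("L0 has no fixed capacity in this sketch")
    return base_mb * fanout ** (level - 1)


def sstables_for_key(l0: List[KeyRange],
                     levels: List[List[KeyRange]],
                     key: str) -> List[KeyRange]:
    """Return every SSTable range that may contain `key`.

    L0 tables may overlap, so all of them are checked. Within L1 and below,
    ranges do not overlap, so each level contributes at most one table.
    """
    hits = [r for r in l0 if r[0] <= key <= r[1]]
    for ranges in levels:
        matches = [r for r in ranges if r[0] <= key <= r[1]]
        if len(matches) > 1:
            raise ValueError("ranges within a level must not overlap")
        hits.extend(matches)
    return hits


if __name__ == "__main__":
    l0 = [("a", "z"), ("c", "m")]
    levels = [[("a", "f"), ("g", "p"), ("q", "z")],  # L1
              [("a", "c"), ("d", "h")]]              # L2
    found = sstables_for_key(l0, levels, "e")
    # The bound is len(l0) + number of levels.
    print(len(found), "<=", len(l0) + len(levels))
```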

Read performance: maxtimestamp

✤ Sort sstables by maximum (client-provided) timestamp
✤ Only merge sstables until we have the columns requested
✤ Allows pre-merging highly fragmented rows without waiting for compaction
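The idea in the bullets above can be sketched as follows. This is a simplified illustration, not Cassandra's code: the `SSTable` class and `read_columns` function are hypothetical, each sstable holds the columns of a single row and is assumed non-empty, and deletions (tombstones) are ignored.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Column = Tuple[str, int]  # (value, client-provided timestamp)


@dataclass
class SSTable:
    name: str
    columns: Dict[str, Column]  # column name -> (value, timestamp) for one row

    @property
    def max_timestamp(self) -> int:
        return max(ts for _, ts in self.columns.values())


def read_columns(sstables: Iterable[SSTable],
                 wanted: Set[str]) -> Tuple[Dict[str, Column], List[str]]:
    """Merge newest-first and stop once no remaining sstable can change the result."""
    result: Dict[str, Column] = {}
    touched: List[str] = []
    remaining = set(wanted)
    for table in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        # A column already found with a timestamp newer than everything in this
        # sstable cannot be superseded here or in any later (older) sstable.
        remaining = {c for c in remaining
                     if c not in result or result[c][1] <= table.max_timestamp}
        if not remaining:
            break
        touched.append(table.name)
        for name in remaining:
            col = table.columns.get(name)
            if col is not None and (name not in result or col[1] > result[name][1]):
                result[name] = col
    return result, touched


if __name__ == "__main__":
    tables = [
        SSTable("old", {"email": ("old@example.com", 5), "name": ("A", 1)}),
        SSTable("new", {"email": ("a@example.com", 30), "name": ("Al", 25)}),
        SSTable("mid", {"name": ("Alan", 20)}),
    ]
    values, read = read_columns(tables, {"email", "name"})
    print(values, read)  # only the "new" sstable needs to be read
```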

Results

cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;

CQL 2.0

✤ ALTER
✤ Counter support
✤ TTL support
✤ SELECT count(*)

Post-1.0 features

✤ Ease Of Use
✤ CQL
  ✤ “Native” transport
  ✤ Composite columns
  ✤ Prepared statements
✤ Triggers
✤ Entity groups
✤ Smarter range queries
  ✤ Enables more-efficient analytics

The evolution of Analytics

Analytics + Realtime

Analytics Realtime

replication

Big data

Analytics (Hadoop)

Realtime (Cassandra)

DataStax Enterprise

DataStax Enterprise re-unifies realtime and analytics

Data model: Realtime

Portfolios

            GOOG  LNKD  P   AMZN  AAPL
Portfolio1  80    20    40  100   20

StockHist

2011-01-01  2011-01-02  2011-01-03
$79.85      $75.23      $82.11

LiveStocks

Row keys: AAPL, AMZN
last: $95.52, $186.10, $112.98

Data model: Analytics

HistLoss

            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93

10dayreturns

ticker  rdate       return
GOOG    2011-07-25  $8.23
GOOG    2011-07-24  $6.14
GOOG    2011-07-23  $7.78
AAPL    2011-07-25  $15.32
AAPL    2011-07-24  $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker, b.column_name rdate, b.value - a.value
FROM StockHist a
JOIN StockHist b ON (a.row_key = b.row_key
                     AND date_add(a.column_name,10) = b.column_name);
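For readers less familiar with Hive, the self-join above pairs each price with the price of the same ticker ten days later and subtracts the two. A rough Python equivalent, assuming StockHist is available as an in-memory dict (the function name and the sample prices are hypothetical, chosen only for illustration):

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def ten_day_returns(stock_hist: Dict[str, Dict[date, float]],
                    days: int = 10) -> List[Tuple[str, date, float]]:
    """Mirror of the Hive self-join: one (ticker, rdate, return) row for every
    pair of prices of the same ticker that lie exactly `days` apart."""
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=days)
            if end in prices:  # join condition: date_add(a.column_name, 10) = b.column_name
                rows.append((ticker, end, prices[end] - start_price))
    return sorted(rows)


if __name__ == "__main__":
    # Hypothetical prices, not taken from the slides.
    hist = {"GOOG": {date(2011, 1, 1): 100.0,
                     date(2011, 1, 2): 101.5,
                     date(2011, 1, 11): 108.25}}
    print(ten_day_returns(hist))
```

Dates without a counterpart ten days later simply produce no row, which matches the inner-join semantics of the query.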

2011-01-01  2011-01-02  2011-01-03
$79.85      $75.23      $82.11

row_key  column_name  value
GOOG     2011-01-01   $8.23
GOOG     2011-01-02   $6.14
GOOG     2011-01-03   $7.78

portfolio_returns

portfolio   rdate       preturn
Portfolio1  2011-07-25  $118.21
Portfolio1  2011-07-24  $60.78
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-07-25  $2143.92
Portfolio3  2011-07-24  -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio, rdate, SUM(b.return)
FROM portfolios a
JOIN 10dayreturns b ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) as minp
  FROM portfolio_returns
  GROUP BY portfolio
) a
JOIN portfolio_returns b ON (a.portfolio = b.portfolio and a.minp = b.preturn);
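The two aggregation steps can be sketched in Python as well. This is an illustration under assumptions: the function names are hypothetical, tables are plain in-memory structures, and, like the Hive query, the sum is not weighted by the number of shares held. Unlike the Hive join, ties for the minimum keep only the first row seen.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def portfolio_returns(portfolios: Dict[str, Dict[str, int]],
                      ten_day: List[Tuple[str, str, float]]
                      ) -> Dict[Tuple[str, str], float]:
    """Sum the 10-day returns of every ticker a portfolio holds, per date."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for portfolio, holdings in portfolios.items():
        for ticker, rdate, ret in ten_day:
            if ticker in holdings:  # join on a.column_name = b.ticker
                totals[(portfolio, rdate)] += ret
    return dict(totals)


def hist_loss(returns: Dict[Tuple[str, str], float]
              ) -> Dict[str, Tuple[str, float]]:
    """For each portfolio, keep the date and value of its smallest return."""
    worst: Dict[str, Tuple[str, float]] = {}
    for (portfolio, rdate), preturn in returns.items():
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


if __name__ == "__main__":
    # Portfolio1 rows as shown in the portfolio_returns table.
    returns = {("Portfolio1", "2011-07-25"): 118.21,
               ("Portfolio1", "2011-07-24"): 60.78,
               ("Portfolio1", "2011-07-23"): -34.81}
    print(hist_loss(returns))  # {'Portfolio1': ('2011-07-23', -34.81)}
```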

HistLoss

            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93

Portfolio Demo dataflow

Portfolios

Historical Prices

Intermediate Results

Largest loss

Portfolios

Live Prices for today

Largest loss

Operations

✤ “Vanilla” Hadoop
  ✤ 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server,...)
  ✤ Single points of failure
  ✤ Can't separate online and offline processing
✤ DataStax Enterprise
  ✤ Single, simplified component
  ✤ Self-organizes based on workload
  ✤ Peer to peer
  ✤ JobTracker failover
  ✤ No additional Cassandra config

OpsCenter

Questions?

✤ http://datastax.com/dev/blog
✤ jonathan@datastax.com
