Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Cassandra 1.0 and the future of big data

Jonathan Ellis

Tuesday, October 4, 2011

About me

✤ Project chair, Apache Cassandra
    ✤ Active since Dec 2008
    ✤ First non-Facebook committer
    ✤ Wrote ~30% of committed patches, reviewed ~40% of the rest

✤ Distributed systems background
    ✤ At Mozy, built a multi-petabyte, scalable storage system based on Reed-Solomon encoding

✤ Founder and CTO, DataStax

About DataStax

✤ Founded in April 2010
✤ Commercial leader in Apache Cassandra
✤ 100+ customers
✤ 30+ employees
✤ Home to Apache Cassandra Chair & most committers
✤ Headquartered in San Francisco Bay area, California
✤ Secured $11M in Series B funding in Sep 2011

Job Trends (indeed.com)

“Big Data” trend

Big data

Analytics (Hadoop)

Realtime (“NoSQL”)?

Some Cassandra users

✤ Financial
✤ Social Media
✤ Advertising
✤ Entertainment
✤ Energy
✤ E-tail
✤ Health care
✤ Government

Common use cases

✤ Time series data
✤ Messaging
✤ Ad tracking
✤ Data mining
✤ User activity streams
✤ User sessions
✤ Anything requiring: Scalable + performant + highly available

Why people choose Cassandra

✤ Multi-master, multi-DC
✤ Linearly scalable
✤ Larger-than-memory datasets
✤ Best-in-class performance (not just writes!)
✤ Fully durable
✤ Integrated caching
✤ Tuneable consistency

0.7

✤ CREATE COLUMN FAMILY
✤ Expiring columns (TTL)
✤ Secondary (column) indexes
✤ Efficient streaming
✤ Efficient cross-datacenter writes

0.8

✤ CQL
✤ Counters
✤ Automatic memtable tuning
✤ New bulk load interface

1.0

✤ Compression
✤ Read performance
✤ LeveledCompactionStrategy
✤ CQL 2.0

Compression

✤ Rows-per-block or blocks-per-row

Classic size-tiered compaction

Level-based Compaction

✤ SSTables are non-overlapping within a level
✤ Bounds the number that can contain a given row

L2: 1000 MB

L1: 100 MB

L0: newly flushed
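
The bound described above can be illustrated with a minimal sketch. It assumes each SSTable is represented only by the key range it covers, that L1 and higher levels are sorted and non-overlapping, and that L0 tables may overlap; the names `find_in_level` and `candidates` and the sample ranges are hypothetical, not Cassandra's actual code.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Each SSTable is modelled only by the (first_key, last_key) range it covers.
KeyRange = Tuple[str, str]


def find_in_level(level: List[KeyRange], key: str) -> Optional[KeyRange]:
    """Return the one SSTable in a sorted, non-overlapping level that may hold key."""
    starts = [first for first, _ in level]
    i = bisect_right(starts, key) - 1  # last table whose first key is <= key
    if i >= 0 and level[i][0] <= key <= level[i][1]:
        return level[i]
    return None


def candidates(l0: List[KeyRange], levels: List[List[KeyRange]], key: str) -> List[KeyRange]:
    """SSTables a read must consult: every matching L0 table, plus at most one per level."""
    found = [r for r in l0 if r[0] <= key <= r[1]]
    for level in levels:
        hit = find_in_level(level, key)
        if hit is not None:
            found.append(hit)
    return found


# Hypothetical layout: one freshly flushed L0 table and two higher levels.
l0 = [("a", "z")]
l1 = [("a", "f"), ("g", "m"), ("n", "z")]
l2 = [("a", "c"), ("d", "h"), ("i", "p"), ("q", "z")]
result = candidates(l0, [l1, l2], "k")
```

However many SSTables a level holds, a given row key can fall inside only one of them, so the read touches at most one table per level beyond L0.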

Read performance: maxtimestamp

✤ Sort sstables by maximum (client-provided) timestamp
✤ Only merge sstables until we have the columns requested
✤ Allows pre-merging highly fragmented rows without waiting for compaction
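
A rough sketch of the idea follows. It assumes each sstable's view of a row is a plain dict of column name to (timestamp, value), and it assumes the stop rule is "every requested column is already newer than anything the remaining sstables could hold"; `read_columns` and the sample data are hypothetical and only illustrate the technique.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for one sstable's fragment of a row:
# column name -> (client-provided timestamp, value).
Row = Dict[str, Tuple[int, str]]


def max_ts(row: Row) -> int:
    """Largest timestamp stored in this fragment (-1 if it is empty)."""
    return max((ts for ts, _ in row.values()), default=-1)


def read_columns(sstables: List[Row], wanted: Set[str]) -> Tuple[Dict[str, str], int]:
    """Merge newest-first and stop once no unread sstable can supersede the result."""
    if not wanted:
        return {}, 0
    ordered = sorted(sstables, key=max_ts, reverse=True)
    best: Dict[str, Tuple[int, str]] = {}
    touched = 0
    for row in ordered:
        have_all = all(name in best for name in wanted)
        # Assumed stop rule: all requested columns found, each strictly newer
        # than the newest timestamp this (and every later) sstable can contain.
        if have_all and min(best[name][0] for name in wanted) > max_ts(row):
            break
        touched += 1
        for name in wanted:
            if name in row and (name not in best or row[name][0] > best[name][0]):
                best[name] = row[name]
    return {name: value for name, (_, value) in best.items()}, touched


s_old = {"email": (10, "old@example.com"), "name": (10, "Ann")}
s_mid = {"email": (20, "mid@example.com")}
s_new = {"email": (30, "new@example.com"), "name": (25, "Anne")}
values, touched = read_columns([s_old, s_mid, s_new], {"email", "name"})
```

In this example only the newest sstable is read: both requested columns are found there with timestamps above anything the other two fragments can contain.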

Results

CQL

cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;

        KEY | birth_date |         full_name | state |
 bsanderson |       1975 | Brandon Sanderson |    UT |

CQL 2.0

✤ ALTER
✤ Counter support
✤ TTL support
✤ SELECT count(*)

Post-1.0 features

✤ Ease of use
    ✤ CQL
        ✤ “Native” transport
        ✤ Composite columns
        ✤ Prepared statements
✤ Triggers
✤ Entity groups
✤ Smarter range queries
    ✤ Enables more-efficient analytics

The evolution of Analytics

Analytics + Realtime

The evolution of Analytics

Analytics Realtime

replication

The evolution of Analytics

ETL

Big data

Analytics (Hadoop)

Realtime (Cassandra)

DataStax Enterprise

DataStax Enterprise re-unifies realtime and analytics

Data model: Realtime

Portfolios

             GOOG  LNKD  P   AMZN  AAPL
Portfolio1   80    20    40  100   20

StockHist

        2011-01-01  2011-01-02  2011-01-03
GOOG    $79.85      $75.23      $82.11

LiveStocks

        last
GOOG    $95.52
AAPL    $186.10
AMZN    $112.98

Data model: Analytics

HistLoss

             worst_date  loss
Portfolio1   2011-07-23  -$34.81
Portfolio2   2011-03-11  -$11432.24
Portfolio3   2011-05-21  -$1476.93

Data model: Analytics

10dayreturns

ticker  rdate       return
GOOG    2011-07-25  $8.23
GOOG    2011-07-24  $6.14
GOOG    2011-07-23  $7.78
AAPL    2011-07-25  $15.32
AAPL    2011-07-24  $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker, b.column_name rdate, b.value - a.value
FROM StockHist a JOIN StockHist b
  ON (a.row_key = b.row_key AND date_add(a.column_name, 10) = b.column_name);
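
The self-join in this query pairs each price with the price ten days later for the same ticker. A minimal Python sketch of the same computation, assuming one ticker's history is held as a dict of date to price; `ten_day_returns` and the sample prices are hypothetical and not taken from the slides:

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def ten_day_returns(
    ticker: str, prices: Dict[date, float], days: int = 10
) -> List[Tuple[str, date, float]]:
    """Emit (ticker, rdate, return) for every date that has a price `days` earlier."""
    out = []
    for start, start_price in sorted(prices.items()):
        end = start + timedelta(days=days)  # mirrors date_add(a.column_name, 10)
        if end in prices:                   # mirrors the join condition on b.column_name
            out.append((ticker, end, round(prices[end] - start_price, 2)))
    return out


# Hypothetical price history for one row of StockHist.
goog = {
    date(2011, 1, 1): 100.0,
    date(2011, 1, 11): 108.5,
    date(2011, 1, 21): 105.0,
}
rows = ten_day_returns("GOOG", goog)
```

Dates with no price exactly ten days earlier produce no output row, just as unmatched rows drop out of the inner join.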

Data model: Analytics

        2011-01-01  2011-01-02  2011-01-03
GOOG    $79.85      $75.23      $82.11

row_key  column_name  value
GOOG     2011-01-01   $8.23
GOOG     2011-01-02   $6.14
GOOG     2011-01-03   $7.78
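
The mapping on this slide turns one wide row into one (row_key, column_name, value) record per column. A small sketch of that flattening, assuming the column family is modelled as a nested dict; `flatten` is a hypothetical helper, and the prices are the ones from the wide GOOG row:

```python
from typing import Dict, Iterator, Tuple


def flatten(column_family: Dict[str, Dict[str, float]]) -> Iterator[Tuple[str, str, float]]:
    """Yield one (row_key, column_name, value) record per column of every row."""
    for row_key, columns in column_family.items():
        for column_name, value in sorted(columns.items()):
            yield row_key, column_name, value


stock_hist = {
    "GOOG": {"2011-01-01": 79.85, "2011-01-02": 75.23, "2011-01-03": 82.11},
}
records = list(flatten(stock_hist))
```

Once the data is in this three-column shape, ordinary SQL-style joins and aggregates over `row_key` and `column_name` become possible.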

Data model: Analytics

portfolio_returns

portfolio   rdate       preturn
Portfolio1  2011-07-25  $118.21
Portfolio1  2011-07-24  $60.78
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-07-25  $2143.92
Portfolio3  2011-07-24  -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio, rdate, SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
  ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;

Data model: Analytics

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) as minp
  FROM portfolio_returns
  GROUP BY portfolio
) a
JOIN portfolio_returns b
  ON (a.portfolio = b.portfolio and a.minp = b.preturn);

HistLoss

             worst_date  loss
Portfolio1   2011-07-23  -$34.81
Portfolio2   2011-03-11  -$11432.24
Portfolio3   2011-05-21  -$1476.93
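
The HistLoss query finds each portfolio's minimum return and joins back to recover the date on which it occurred. The same result can be sketched in one pass in Python; `hist_loss` is a hypothetical helper, the input rows are the five sample portfolio_returns rows, and unlike the Hive join this sketch keeps only the first date when several dates tie for the minimum.

```python
from typing import Dict, Iterable, Tuple

ReturnRow = Tuple[str, str, float]  # (portfolio, rdate, preturn)


def hist_loss(returns: Iterable[ReturnRow]) -> Dict[str, Tuple[str, float]]:
    """Map each portfolio to (worst_date, loss), the row with its smallest preturn."""
    worst: Dict[str, Tuple[str, float]] = {}
    for portfolio, rdate, preturn in returns:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


portfolio_returns = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
    ("Portfolio2", "2011-07-25", 2143.92),
    ("Portfolio3", "2011-07-24", -10.19),
]
losses = hist_loss(portfolio_returns)
```

With only these five sample rows, Portfolio2 and Portfolio3 each have a single candidate date, so their results differ from the fuller HistLoss table, which reflects the complete history.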

Portfolio Demo dataflow

[Dataflow diagram. Labels: Portfolios, Historical Prices, Intermediate Results, Largest loss; Portfolios, Live Prices for today, Largest loss]

Operations

✤ “Vanilla” Hadoop
    ✤ 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server, ...)
    ✤ Single points of failure
    ✤ Can't separate online and offline processing

✤ DataStax Enterprise
    ✤ Single, simplified component
    ✤ Self-organizes based on workload
    ✤ Peer to peer
    ✤ JobTracker failover
    ✤ No additional Cassandra config

OpsCenter

Questions?

✤ http://datastax.com/dev/blog
✤ jonathan@datastax.com
