Cassandra 1.0 and the future of big data Jonathan Ellis Tuesday, October 4, 2011

Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)


Page 1: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Cassandra 1.0 and the future of big data

Jonathan Ellis

Tuesday, October 4, 2011

Page 2: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

About me

✤ Project chair, Apache Cassandra
    ✤ Active since Dec 2008
    ✤ First non-Facebook committer
    ✤ Wrote ~30% of committed patches, reviewed ~40% of the rest
✤ Distributed systems background
    ✤ At Mozy, built a multi-petabyte, scalable storage system based on Reed-Solomon encoding
✤ Founder and CTO, DataStax

Page 3: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

About DataStax

✤ Founded in April 2010
✤ Commercial leader in Apache Cassandra
✤ 100+ customers
✤ 30+ employees
✤ Home to Apache Cassandra Chair & most committers
✤ Headquartered in San Francisco Bay area, California
✤ Secured $11M in Series B funding in Sep 2011

Page 4: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Job Trends (indeed.com)

Page 5: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

“Big Data” trend

Page 6: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Big data

Analytics (Hadoop)

Realtime (“NoSQL”)?

Page 7: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Some Cassandra users

✤ Financial
✤ Social Media
✤ Advertising
✤ Entertainment
✤ Energy
✤ E-tail
✤ Health care
✤ Government

Page 8: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Common use cases

✤ Time series data
✤ Messaging
✤ Ad tracking
✤ Data mining
✤ User activity streams
✤ User sessions
✤ Anything requiring: Scalable + performant + highly available

Page 9: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Why people choose Cassandra

✤ Multi-master, multi-DC
✤ Linearly scalable
✤ Larger-than-memory datasets
✤ Best-in-class performance (not just writes!)
✤ Fully durable
✤ Integrated caching
✤ Tuneable consistency

Page 10: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

0.7

✤ CREATE COLUMN FAMILY
✤ Expiring columns (TTL)
✤ Secondary (column) indexes
✤ Efficient streaming
✤ Efficient cross-datacenter writes

Page 11: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

0.8

✤ CQL
✤ Counters
✤ Automatic memtable tuning
✤ New bulk load interface

Page 12: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

1.0

✤ Compression
✤ Read performance
✤ LeveledCompactionStrategy
✤ CQL 2.0

Page 13: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Compression

✤ Rows-per-block or blocks-per-row
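
One way to picture "rows-per-block or blocks-per-row" is block-level compression: data is cut into fixed-size blocks that are compressed independently, so several small rows can share one block while a large row spans several. The sketch below is only an illustration under assumptions: `zlib` stands in for whatever codec is actually used, and the 64 KiB block size, `compress_blocks`, and `read_range` are hypothetical names and values, not Cassandra's on-disk format.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # uncompressed bytes per block (assumed value)


def compress_blocks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Compress each fixed-size slice of `data` independently."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_range(blocks: list, offset: int, length: int,
               chunk_size: int = CHUNK_SIZE) -> bytes:
    """Read `length` (>= 1) bytes at `offset`, inflating only the blocks touched."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    raw = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    start = offset - first * chunk_size
    return raw[start:start + length]


data = bytes(range(256)) * 1024                  # 256 KiB of sample data, four blocks
blocks = compress_blocks(data)
small_row = read_range(blocks, 70_000, 10)       # fits inside one block
large_row = read_range(blocks, 60_000, 100_000)  # spans three blocks
```

Reading the small row inflates a single block; reading the large row inflates only the blocks it overlaps, never the whole file.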

Page 14: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Classic size-tiered compaction

Page 15: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Level-based Compaction

✤ SSTables are non-overlapping within a level
✤ Bounds the number that can contain a given row

L2: 1000 MB
L1: 100 MB
L0: newly flushed
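
The bound stated above can be sketched in a few lines. Because the key ranges inside a level do not overlap, a binary search finds at most one candidate SSTable per level. This is a minimal illustration, not Cassandra's implementation: the `SSTable` class and `candidates` function are hypothetical, and treating L0 as possibly overlapping (since it holds freshly flushed tables) is an assumption of the sketch.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    first_key: str
    last_key: str

    def may_contain(self, key: str) -> bool:
        return self.first_key <= key <= self.last_key


def candidates(level0, sorted_levels, key):
    """Return the sstables that could hold `key`.

    level0: sstables that may overlap each other (assumption of this sketch).
    sorted_levels: one list per level, sorted by first_key, ranges non-overlapping.
    """
    found = [t for t in level0 if t.may_contain(key)]
    for level in sorted_levels:
        starts = [t.first_key for t in level]
        i = bisect_right(starts, key) - 1  # last table starting at or before key
        if i >= 0 and level[i].may_contain(key):
            found.append(level[i])         # at most one hit per level
    return found


l0 = [SSTable("a", "z")]
l1 = [SSTable("a", "f"), SSTable("g", "m"), SSTable("n", "z")]
l2 = [SSTable("a", "c"), SSTable("d", "h"), SSTable("i", "p"), SSTable("q", "z")]
hits = candidates(l0, [l1, l2], "k")
print(len(hits))  # 3: one from L0, one from L1, one from L2
```

However many SSTables a level holds, a read consults at most one of them, so the total is bounded by the L0 count plus the number of levels.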

Page 16: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Read performance: maxtimestamp

✤ Sort sstables by maximum (client-provided) timestamp
✤ Only merge sstables until we have the columns requested
✤ Allows pre-merging highly fragmented rows without waiting for compaction
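
A rough sketch of the idea follows. It is an illustration only: `SSTable` and `read_columns` are hypothetical names, tombstones are ignored, and the stopping rule is an assumption of the sketch (a column counts as settled once the version already found is newer than the maximum timestamp of every sstable still unread).

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    max_timestamp: int
    # column name -> (write timestamp, value) for a single row
    columns: dict = field(default_factory=dict)


def read_columns(sstables, wanted):
    """Return ({name: (timestamp, value)}, number of sstables actually read)."""
    result = {}
    tables_read = 0
    for table in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        # Skip columns whose found version is newer than anything this
        # sstable (and every later, older one) could contain.
        pending = [c for c in wanted
                   if c not in result or result[c][0] <= table.max_timestamp]
        if not pending:
            break
        tables_read += 1
        for name in pending:
            version = table.columns.get(name)
            if version is not None and (name not in result
                                        or version[0] > result[name][0]):
                result[name] = version
    return result, tables_read


old = SSTable(10, {"name": (10, "Ann"), "city": (5, "Oslo")})
mid = SSTable(20, {"city": (20, "Rome")})
new = SSTable(30, {"name": (30, "Anna"), "city": (25, "Paris")})
row, tables_read = read_columns([old, mid, new], ["name", "city"])
print(row, tables_read)
```

In this sample the newest sstable already holds both columns with timestamps above 20, so the two older sstables are never opened.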

Page 17: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Results

Page 18: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

CQL

cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;

        KEY | birth_date |         full_name | state |
 bsanderson |       1975 | Brandon Sanderson |    UT |

Page 19: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

CQL 2.0

✤ ALTER
✤ Counter support
✤ TTL support
✤ SELECT count(*)

Page 20: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Post-1.0 features

✤ Ease Of Use
✤ CQL

✤ “Native” transport
✤ Composite columns
✤ Prepared statements

✤ Triggers
✤ Entity groups
✤ Smarter range queries

✤ Enables more-efficient analytics

Page 21: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

The evolution of Analytics

Analytics + Realtime

Page 22: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

The evolution of Analytics

Analytics Realtime

replication

Page 23: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

The evolution of Analytics

ETL

Page 24: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Big data

Analytics (Hadoop)

Realtime (Cassandra)

DataStax Enterprise

Page 25: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

DataStax Enterprise re-unifies realtime and analytics

Page 26: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Page 27: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Data model: Realtime

Portfolios

            GOOG  LNKD  P   AMZN  AAPL
Portfolio1  80    20    40  100   20

StockHist

      2011-01-01  2011-01-02  2011-01-03
GOOG  $79.85      $75.23      $82.11

LiveStocks

      last
GOOG  $95.52
AAPL  $186.10
AMZN  $112.98

Page 28: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Data model: Analytics

HistLoss

            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93

Page 29: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Data model: Analytics

10dayreturns

ticker  rdate       return
GOOG    2011-07-25  $8.23
GOOG    2011-07-24  $6.14
GOOG    2011-07-23  $7.78
AAPL    2011-07-25  $15.32
AAPL    2011-07-24  $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker, b.column_name rdate, b.value - a.value
FROM StockHist a JOIN StockHist b
  ON (a.row_key = b.row_key AND date_add(a.column_name, 10) = b.column_name);
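
For readers less familiar with Hive, the self-join above can be restated procedurally: for every ticker, pair each price with the price ten days later and subtract. The Python below is only a sketch of that logic. `ten_day_returns` is a hypothetical helper, prices are held as integer cents, and the 2011-01-11 price is made up so that the sample contains one matching pair of dates.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """Self-join each ticker's price row on dates that lie `days` apart."""
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=days)
            if end in prices:  # mirrors date_add(a.column_name, 10) = b.column_name
                rows.append((ticker, end, prices[end] - start_price))
    return rows


# Prices in integer cents. The first three GOOG values come from the StockHist
# slide; the 2011-01-11 price is invented for this example.
stock_hist = {
    "GOOG": {
        date(2011, 1, 1): 7985,
        date(2011, 1, 2): 7523,
        date(2011, 1, 3): 8211,
        date(2011, 1, 11): 8808,
    }
}
returns = ten_day_returns(stock_hist)
print(returns)
```

Only the pair 2011-01-01 and 2011-01-11 is ten days apart, so the result holds a single row dated 2011-01-11.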

Page 30: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Data model: Analytics

      2011-01-01  2011-01-02  2011-01-03
GOOG  $79.85      $75.23      $82.11

row_key  column_name  value
GOOG     2011-01-01   $8.23
GOOG     2011-01-02   $6.14
GOOG     2011-01-03   $7.78

Page 31: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Data model: Analytics

portfolio_returns

portfolio   rdate       preturn
Portfolio1  2011-07-25  $118.21
Portfolio1  2011-07-24  $60.78
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-07-25  $2143.92
Portfolio3  2011-07-24  -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio, rdate, SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
  ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;
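
The join and GROUP BY above amount to: for each portfolio, add up the returns of the tickers it holds, per date. A minimal Python sketch of that aggregation follows. `portfolio_returns` is a hypothetical helper, returns are integer cents taken from the 10dayreturns sample rows, and the holdings are a subset of the Portfolios slide. Like the query as written, the sketch sums per-ticker returns without weighting by the number of shares held.

```python
from collections import defaultdict


def portfolio_returns(portfolios, returns):
    """Sum ticker returns per (portfolio, rdate), as the JOIN + GROUP BY does."""
    totals = defaultdict(int)
    for portfolio, holdings in portfolios.items():
        for ticker, rdate, ret in returns:
            if ticker in holdings:  # mirrors a.column_name = b.ticker
                totals[(portfolio, rdate)] += ret
    return dict(totals)


portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}}  # ticker -> shares
returns = [                                             # cents
    ("GOOG", "2011-07-25", 823),
    ("GOOG", "2011-07-24", 614),
    ("GOOG", "2011-07-23", 778),
    ("AAPL", "2011-07-25", 1532),
    ("AAPL", "2011-07-24", 1268),
]
totals = portfolio_returns(portfolios, returns)
print(totals)
```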

Page 32: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Data model: Analytics

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) as minp
  FROM portfolio_returns
  GROUP BY portfolio
) a JOIN portfolio_returns b
  ON (a.portfolio = b.portfolio and a.minp = b.preturn);

HistLoss

            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93
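
The HistLoss query finds each portfolio's minimum return and joins back to recover the date on which it occurred. A single pass can express the same idea, as in this sketch. `hist_loss` is a hypothetical helper and the input rows are the portfolio_returns sample rows from the earlier slide. One difference is worth noting: on ties the Hive join would emit every matching date, while this sketch keeps only the first.

```python
def hist_loss(rows):
    """Return {portfolio: (worst_date, loss)} from (portfolio, rdate, preturn) rows."""
    worst = {}
    for portfolio, rdate, preturn in rows:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
    ("Portfolio2", "2011-07-25", 2143.92),
    ("Portfolio3", "2011-07-24", -10.19),
]
worst = hist_loss(rows)
print(worst["Portfolio1"])  # ('2011-07-23', -34.81)
```

For Portfolio1 the sample rows reproduce the HistLoss entry shown on the slide; the other two portfolios have only one sample row each, so their slide values cannot be reproduced from this data.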

Page 33: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Portfolio Demo dataflow

Portfolios

Historical Prices

Intermediate Results

Largest loss

Portfolios

Live Prices for today

Largest loss

Page 34: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Operations

✤ “Vanilla” Hadoop
    ✤ 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server, ...)
    ✤ Single points of failure
    ✤ Can't separate online and offline processing

✤ DataStax Enterprise
    ✤ Single, simplified component
    ✤ Self-organizes based on workload
    ✤ Peer to peer
    ✤ JobTracker failover
    ✤ No additional Cassandra config

Page 35: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

OpsCenter

Page 36: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Questions?

✤ http://datastax.com/dev/blog
✤ [email protected]

Page 37: Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
