Add a bit of ACID to Cassandra. My talk from Cassandra Summit EU 2014

Add a Bit of ACID to Cassandra

Oleg Anastasyev, Lead Platform Developer, ok.ru

http://ok.ru

ok.ru

* 45M daily, 80M monthly audience

* Top 4 social networking site

* Top 7 in the world by total time on site*

* ~ 500,000 HTTP reqs/sec

* > 400 Gbps out

* > 8000 bare-metal servers in 5 DCs, ~1 ms ping

(*) comScore data for July 2014, desktops, users aged 15+


Cassandra at ok.ru

* Since 2010: versions 0.6-ok, 1.2, 2.0

* In 2014: 33 clusters, > 600 storage nodes, 330 TB

* Fastest: 1.5M ops (48 nodes)

* Largest: 130 TB (96 nodes)

SQL Server 2005

* Consistent (ACID) OLTP data

* 200 servers, 50 TB of data

* Sharding
  - F(Entity_Id) -> Token -> SQL Server Node
  - F(Master_Id) === F(Detail_Id)

* Local node commit only
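
The sharding rule on this slide can be sketched as follows. This is a minimal illustration, not the ok.ru code: the class name, the method names and the hash-modulo token function are all assumptions, since the slides do not show what F actually is. The point it shows is that a detail row (for example a photo) is placed by its master's id (for example the album), so both land on the same node and a transaction over them can commit locally.

```java
// Hypothetical sketch: names and the hash function are assumptions, not the original F.
final class ShardRouter {
    private final int nodeCount;

    ShardRouter(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        this.nodeCount = nodeCount;
    }

    /** F(Entity_Id) -> Token: here simply a non-negative hash of the id. */
    static long token(long entityId) {
        return Long.hashCode(entityId) & 0x7fffffffL;
    }

    /** Token -> node index in [0, nodeCount). */
    int nodeFor(long shardingId) {
        return (int) (token(shardingId) % nodeCount);
    }

    /** Detail rows are placed by the master's id, so F(Master_Id) === F(Detail_Id). */
    int nodeForDetail(long masterId, long detailId) {
        return nodeFor(masterId); // detailId is deliberately ignored for placement
    }
}
```

With 200 nodes, `new ShardRouter(200).nodeFor(albumId)` and `nodeForDetail(albumId, photoId)` always return the same node index.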

Fast SQL Server 2005

* DB JOIN

* Foreign key constraints

* Stored Procs, Triggers

* Read uncommitted (noTx)

* Short lived transactions <100ms

* No massive UPDATEs, DELETEs

* Always query on indexed data

Usual SQL shortcomings

* Manual “scale out” with downtime

* Downtime on maintenance

* Write performance

* BSoD, swap outs, magic

* Expensive HA hardware (10x 1U server price)

* Fragile failover - ~ 10% failovers fail

* Downtime on DC failure or partition

Simple transaction in SQL Server

TX.start("Albums", id);
Album album = albums.lock(id);
Photo photo = photos.create(…);
if (photo.status == PUBLIC) {
    album.incPublicPhotosCount();
}
TX.commit();

* Read - modify - write

* Involves a few records, different tables

* Possibility of concurrent transactions on 1 key

Usual NoSQL problems

* Learning curve

* Sophisticated development
  - Often rewrite from scratch, data model and UI
  - Often with omission of functionality

* Distributed programming means
  - (A lot of) app-specific code around consistency, conflict resolution, retries and rollbacks

* Ad-hoc, fragile and buggy ACID implementation

We need a New Storage

* Fast to learn and develop
  - ACID
  - SQL

* Easy to operate and maintain
  - Read and modify on DC failure
  - Automatic scale out w/o downtime
  - Commodity hardware

* Fixable codebase (open source, Java)

TODO: SQL

* Scale out

* Availability - Cluster - Conflict resolution - SQL

* ACID

* SQL

* Cassandra 2 CQL

NoSQL ?

- OR -

Cassandra 2.0

* Implements out of the box
  - CQL
  - Automatic scale out
  - Good write perf ( https://github.com/jbellis/YCSB )
  - Quorums, speculative retry ( see also CASSANDRA-6866 )
  - Logged Batch
  - “Lightweight” transactions ?
  - Secondary indexes ?

* “Lightweight” transactions vs. what we need
  - Read - modify - write
  - Possibility of concurrent transactions on 1 key
  - Involves a few records, different tables
  - “3 phase commit” -> slow

C*One

* ACID transactions
  - No SPOF, DC failure resistant
  - Across multiple tables and partitions
  - Commits and rollbacks

* First class indexes
  - No additional coding
  - Online build on existing data

[Diagram: C*One architecture. Labels: clients, C* storage nodes, C*One update services, Cassandra gossip & messaging (schema, partitioner, “heartbeat”, cluster topology)]

Clients

* > 800 clients (all Java)

* Fat client mode

* Client is its own coordinator

* Faster

* -1 point of failure -> more reliable

[Diagram: Clients. Two request paths shown: NoTx and In Tx]

C*One Update Srvs

* Manages pessimistic locks

* Generates monotonic timestamp for cells

* Manages transactions

* Failure management

Lamport Timestamp http://en.wikipedia.org/wiki/Lamport_timestamps
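
For readers unfamiliar with the reference, here is a textbook Lamport clock. It is a sketch of the general technique only; the slides do not show how the update services actually derive cell timestamps, and the class and method names are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Textbook Lamport clock; not taken from the C*One code base.
final class LamportClock {
    private final AtomicLong counter = new AtomicLong();

    /** Local event or message send: advance the clock and return the new value. */
    long tick() {
        return counter.incrementAndGet();
    }

    /** Message receive: move past the sender's timestamp, then advance by one. */
    long onReceive(long remoteTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }
}
```

Every value returned by one clock is strictly greater than the previous one, and a receiver always ends up ahead of the timestamp it was sent, which is what makes the ordering usable for "last write wins" cells.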



Locks mgmt

[Diagram: nodes labeled 00, 10, 20, 30, 40, 50]

* Transaction Group Masters

* Simple in-memory locking
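
The "simple in-memory locking" on this slide could look roughly like the sketch below. Everything here is an assumption for illustration: the class name, the string keys and the fail-fast behaviour are invented, and the real service may well queue waiters instead of refusing the lock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory lock table kept by the master of a transaction group.
final class LockTable {
    // key -> id of the transaction that currently owns the lock
    private final ConcurrentMap<String, Long> owners = new ConcurrentHashMap<>();

    /** Returns true if txId now holds (or already held) the lock on key. */
    boolean tryLock(String key, long txId) {
        Long current = owners.putIfAbsent(key, txId);
        return current == null || current == txId;
    }

    /** Releases every lock held by txId, for example on commit or rollback. */
    void releaseAll(long txId) {
        owners.values().removeIf(owner -> owner == txId);
    }

    boolean isLocked(String key) {
        return owners.containsKey(key);
    }
}
```

Because the table lives only in the master's memory, it is cheap to update but is lost when the master dies, which is why the deck goes on to failure detection and master election.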

Failure detection

[Diagram: nodes labeled 00, 10, 20, 30, 40, 50 across DC-1, DC-2, DC-3; heartbeat quorum]

* Each-to-every heartbeat

* Quorum cluster view (I am dead if Q say so)

* 50 ms tick

* G1 GC

* 200 ms till failure detection
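
The quorum view described on this slide can be sketched like this. Only the 50 ms tick and the 200 ms detection time come from the slide; the class, its methods and the way votes are counted are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "a node is dead if a quorum says so".
final class QuorumFailureDetector {
    static final long TICK_MS = 50;     // heartbeat period (from the slide)
    static final long TIMEOUT_MS = 200; // silence after which a peer is suspected (from the slide)

    private final int clusterSize;
    // lastSeen.get(observer).get(target) = time at which observer last heard from target
    private final Map<String, Map<String, Long>> lastSeen = new HashMap<>();

    QuorumFailureDetector(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** Records that observer received a heartbeat from target at nowMs. */
    void heartbeat(String observer, String target, long nowMs) {
        lastSeen.computeIfAbsent(observer, k -> new HashMap<>()).put(target, nowMs);
    }

    /** One observer's local opinion: silent for TIMEOUT_MS or never heard from. */
    boolean suspects(String observer, String target, long nowMs) {
        Long seen = lastSeen.getOrDefault(observer, Collections.<String, Long>emptyMap()).get(target);
        return seen == null || nowMs - seen >= TIMEOUT_MS;
    }

    /** The cluster's opinion: dead only if a majority of the cluster suspects the target. */
    boolean isDead(String target, List<String> nodes, long nowMs) {
        int quorum = clusterSize / 2 + 1;
        int votes = 0;
        for (String observer : nodes) {
            if (!observer.equals(target) && suspects(observer, target, nowMs)) {
                votes++;
            }
        }
        return votes >= quorum;
    }
}
```

A single observer that loses its link to a healthy node cannot get that node declared dead on its own; a majority has to agree.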

Failure management

[Diagram: clients (> 800) sending start Tx; nodes labeled 50, 50’, 50”]

* Master election protocol

* Speculative transaction start

Unborn transactions

* Transaction start requests queue
  - (in substitute’s memory)
  - Thrown away after timeout

* On range master failure
  - queue is being processed
  - send started replies to clients (declines if already opened)
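
One way to read the "unborn transactions" queue is sketched below. It is an interpretation, not the original code: the class name, the timeout parameter and the reading of "declines if already opened" as "skip transactions the failed master had already opened" are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical queue of transaction start requests held in a substitute's memory.
final class UnbornTxQueue {
    private static final class Pending {
        final long txId;
        final long enqueuedAtMs;

        Pending(long txId, long enqueuedAtMs) {
            this.txId = txId;
            this.enqueuedAtMs = enqueuedAtMs;
        }
    }

    private final long timeoutMs;
    private final Deque<Pending> queue = new ArrayDeque<>();

    UnbornTxQueue(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Remembers a speculative start request; calls are assumed to arrive in time order. */
    void enqueue(long txId, long nowMs) {
        queue.addLast(new Pending(txId, nowMs));
    }

    /** Throws away requests older than the timeout. */
    void expire(long nowMs) {
        while (!queue.isEmpty() && nowMs - queue.peekFirst().enqueuedAtMs >= timeoutMs) {
            queue.removeFirst();
        }
    }

    /** On master failure: returns the ids to start here, skipping those already opened. */
    List<Long> takeOver(Set<Long> alreadyOpened, long nowMs) {
        expire(nowMs);
        List<Long> started = new ArrayList<>();
        for (Pending p : queue) {
            if (!alreadyOpened.contains(p.txId)) {
                started.add(p.txId);
            }
        }
        queue.clear();
        return started;
    }

    int size() {
        return queue.size();
    }
}
```

In normal operation the queue only grows and expires; it is drained once, at the moment the substitute takes over the failed master's range.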

Tx start

[Diagram: clients, locks table, transaction state (RAM). Steps: 1. StartTx; 2. Lock; 3. Read (id=1, a=1, b=1); 4. Cache]

Tx write

[Diagram: clients, locks table, transaction state (RAM). Steps: 1. UPDATE; 2. File. Rows shown: id=1, a=1, b=1 and id=1, a=2, b=2]

Tx read

[Diagram: clients, locks table, transaction state (RAM). Steps: 1. Read; 2. Read ?; 3. resolve(). Rows shown: id=1, a=1, b=1 and id=1, a=2, b=2]

Tx commit

[Diagram: clients, locks table, transaction state (RAM). Steps: 1. Commit; 2, 3. LOGGED BATCH; 4. Ack. Row shown: id=1, a=2, b=2]

Tx rollback

[Diagram: clients, locks table, transaction state (RAM). Step: 1. Rollback. Row shown: id=1, a=2, b=2]
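
The start, write, read, commit and rollback slides can be summarized with a small sketch of the in-RAM transaction state. It is a simplified illustration under stated assumptions: the class and method names are hypothetical, column values are plain integers, and `commit()` merely returns the buffered writes that the caller would send to Cassandra as one logged batch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-transaction state kept in the update service's RAM.
final class TxState {
    // rowKey -> (column -> new value); nothing here is visible in storage yet
    private final Map<String, Map<String, Integer>> pending = new HashMap<>();

    /** Tx write: the UPDATE is filed in memory only. */
    void update(String rowKey, String column, int value) {
        pending.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
    }

    /** Tx read: overlay this transaction's own writes on the committed row. */
    Map<String, Integer> resolve(String rowKey, Map<String, Integer> committedRow) {
        Map<String, Integer> merged = new HashMap<>(committedRow);
        merged.putAll(pending.getOrDefault(rowKey, Collections.<String, Integer>emptyMap()));
        return merged;
    }

    /** Tx commit: hand all buffered writes to the caller as a single batch and reset. */
    Map<String, Map<String, Integer>> commit() {
        Map<String, Map<String, Integer>> batch = new HashMap<>(pending);
        pending.clear();
        return batch;
    }

    /** Tx rollback: storage was never touched, so dropping the buffer is enough. */
    void rollback() {
        pending.clear();
    }
}
```

With the row from the slides, updating `a` and `b` to 2 makes `resolve` return `a=2, b=2` inside the transaction while the stored row still reads `a=1, b=1` until the batch is applied.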

ACID

* Atomicity - logged batch or nothing

* Consistency - application, rollback

* Isolation - Locks - Read Committed

* Durability - quorum reads and writes to Cassandra
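
The quorum arithmetic behind the last bullet is the standard one and can be checked with a tiny sketch. The class name is invented; the rule itself (a read set and a write set must overlap in at least one replica, that is R + W > RF) is general replication math and is not specific to C*One.

```java
// Standard quorum arithmetic; the helper names are illustrative only.
final class Quorum {
    /** Majority of rf replicas. */
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    /** True if every read set must share at least one replica with every write set. */
    static boolean overlaps(int readReplicas, int writeReplicas, int rf) {
        return readReplicas + writeReplicas > rf;
    }
}
```

For RF=3 the quorum is 2, and 2 + 2 > 3, so a quorum read always reaches at least one replica that acknowledged the latest quorum write.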

Indexes in Cassandra 2

CREATE TABLE photos (
    id bigint primary key,
    owner bigint,
    modified timestamp,
    …
);

SELECT * FROM photos WHERE owner=? AND modified>?

* CREATE INDEX (owner, modified) ?
  - No composite index support
  - High cardinality
  - Don’t scale (synchronous full cluster scan on read)
  - Max 100K tombstones per index

Global Indexes in C*One

Table photos (Primary Key: id)

id   owner   modified    caption       access   …
1    111     9.10.2014   “kitty cat”   PUB      …

INDEX i1 ON photos (owner, modified) VALUES (caption, access, …);

Index table i1_photo (Primary Key = Partition Key + Clustering Key)

owner   modified    id   caption       access   …
111     9.10.2014   1    “kitty cat”   PUB      …

The query

SELECT * FROM photos WHERE owner=? AND modified>?

is served as

SELECT * FROM i1_photo WHERE owner=? AND modified>?

[Diagram: UPDATE inside a transaction. Labels: clients, transaction state (RAM) with rows id=1, a=1, b=1 and id=1, a=2, b=2; schema; 2. idxwrites(); index entry idx: a=2, b=2, id=1]

* Indexes “a la SQL”
  - Consistent
  - On more than 1 column
  - Scalable and fast
  - Built into CQL
  - No additional coding required
  - Very little penalty (+1 write)
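
A rough sketch of what an `idxwrites()` step might compute is given below. It is an illustration only: the class, the string form of the mutations and the decision to delete the stale index entry in the same step are assumptions, not details taken from the slides. The key layout follows the slide's example, with indexed columns first and the base table's primary key last.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical derivation of index mutations from a base-row update.
final class IndexWrites {
    /** Index entry key, e.g. "a=2, b=2, id=1": indexed columns, then the base primary key. */
    static String indexKey(List<String> indexedColumns, String pkColumn, Map<String, Object> row) {
        StringBuilder sb = new StringBuilder();
        for (String column : indexedColumns) {
            sb.append(column).append('=').append(row.get(column)).append(", ");
        }
        return sb.append(pkColumn).append('=').append(row.get(pkColumn)).toString();
    }

    /** Mutations to add to the transaction when a row changes from before to after. */
    static List<String> idxWrites(List<String> indexedColumns, String pkColumn,
                                  Map<String, Object> before, Map<String, Object> after) {
        String oldKey = indexKey(indexedColumns, pkColumn, before);
        String newKey = indexKey(indexedColumns, pkColumn, after);
        List<String> mutations = new ArrayList<>();
        if (!oldKey.equals(newKey)) {
            mutations.add("DELETE " + oldKey); // drop the stale entry (assumption)
            mutations.add("INSERT " + newKey);
        }
        return mutations;
    }
}
```

Since the old row is already cached in the transaction state, the index mutations can be computed without an extra read and can travel in the same batch as the base-table write, which keeps the index consistent with the data.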

ACID

Production: Photos

* 11 billion photos

* 80k reads/sec, 2k-8k tx/sec

* SQL
  - RF=1 (+1 on RAID 10, +3 in backups)
  - 32 MS SQL + 16 standby + 10 backup = 58
  - load = 100%

* C*One
  - RF=3 (in each DC)
  - 63 C* + 6 upd = 69, 1/3 price
  - load = 30%

Photos: numbers

* Tx failures: 8500/day -> 85/day

* Avg Tx timespan: < 40 ms

* Commit latency avg: < 2 ms

* Read, write: avg < 2 ms, 99% ~ 3 ms

C*

* 22 patches to issues.apache.org
  - range tombstone and query fixes, optimizations, etc.

* Commit log on-the-fly compression (CASSANDRA-7994)

* Reliable always-retry policy (CASSANDRA-6866)

* Night of the Living Dead (CASSANDRA-7872)

Oleg Anastasyev, ok.ru/oa, @m0nstermind

slideshare.net/m0nstermind

http://v.ok.ru

THANK YOU!
