Add a Bit of ACID to Cassandra Oleg Anastasyev Lead Platform Developer ok.ru

Add a bit of ACID to Cassandra. My talk from Cassandra Summit EU 2014


DESCRIPTION

OK.ru is one of the largest social networks for Russian-speaking audiences, with more than 80 million unique visitors per month. OK.ru has used Cassandra since 2010 and has made a number of improvements to the C* 2.0 and 2.1 codebase. Until recently, more than 50 TB of data in OK.ru’s OLTP systems was managed by Microsoft SQL Server. It is very expensive, hard to scale, and cannot save us from an outage if one of our data centers fails. We wanted a new, fast, scalable and reliable storage for this data. The data requires ACID transactions, and we did not want to rewrite all the application code from scratch. C* does not support such transactions, only lightweight ones, so we implemented a new storage with ACID and selected features of the SQL world ourselves. Still, it has C* at its heart. We’ll discuss the internals of the new storage, which features of C* we had to alter and which we had to rewrite from scratch. We’ll also talk about our operational experience with it in production.


Page 1

Add a Bit of ACID to Cassandra

Oleg Anastasyev Lead Platform Developer ok.ru

Page 2

ok.ru

* 45M daily, 80M monthly audience

* Top 4 social networking site

* Top 7 on total time on site in the world*

* ~ 500,000 http reqs/sec

* > 400 Gbps out

* > 8000 iron servers in 5 DCs, ~1ms ping

* comScore data for July 2014, desktops, users aged 15+

Page 3

Cassandra at ok.ru

* Since 2010 - 0.6-ok, 1.2, 2.0

* In 2014
  - 33 clusters
  - > 600 storage nodes
  - 330 TB

* Fastest: 1.5M ops (48 nodes)

* Largest: 130 TB (96 nodes)

Page 4

SQL Server 2005

* Consistent (ACID) OLTP data

* 200 servers, 50 TB of data

* Sharding
  - F(Entity_Id) -> Token -> SQL Server Node
  - F(Master_Id) === F(Detail_Id)

* Local node commit only
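
The sharding rule on this slide can be illustrated with a minimal sketch. The class name `ShardRouter`, the hash function and the modulo placement are assumptions made for illustration; the talk does not say how F or the token-to-node mapping was implemented. The point it shows is that a detail row is routed by its master's id, so a master and its details always land on the same node and can be committed locally.

```java
// Hypothetical router: names and the hash function are illustrative, not from the talk.
final class ShardRouter {
    private final int nodeCount;

    ShardRouter(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        this.nodeCount = nodeCount;
    }

    // F(Entity_Id) -> Token: any stable hash would do; this one just mixes the bits.
    static long token(long entityId) {
        long h = entityId * 0x9E3779B97F4A7C15L;
        return h ^ (h >>> 32);
    }

    // Token -> node index in [0, nodeCount).
    int nodeFor(long token) {
        return (int) Math.floorMod(token, (long) nodeCount);
    }

    // A master row is placed by its own id.
    int nodeForMaster(long masterId) {
        return nodeFor(token(masterId));
    }

    // A detail row is placed by its master's id, so F(Master_Id) == F(Detail_Id).
    // detailId is deliberately ignored for placement.
    int nodeForDetail(long masterId, long detailId) {
        return nodeFor(token(masterId));
    }
}
```

With this placement, a transaction touching an album and its photos never spans two nodes, which is what makes "local node commit only" sufficient.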

Page 5

Fast SQL Server 2005

* DB JOIN

* Foreign key constraints

* Stored Procs, Triggers

* Read uncommitted (noTx)

* Short lived transactions <100ms

* No massive UPDATEs, DELETEs

* Always query on indexed data

Page 6

Usual SQL shortcomings

* Manual “scale out” with downtime

* Downtime on maintenance

* Write performance

* BSoD, swap outs, magic

* Expensive HA hardware (10x 1U server price)

* Fragile failover: ~10% of failovers fail

* Downtime on DC failure or partition

Page 7

Simple transaction in SQL Server

TX.start("Albums", id);
Album album = albums.lock(id);
Photo photo = photos.create(…);

if (photo.status == PUBLIC) {
    album.incPublicPhotosCount();
}

TX.commit();

* Read - modify - write

* Involves a few records, different tables

* Possibility of concurrent transactions on 1 key

Page 8

Usual NoSQL problems

* Learning curve

* Sophisticated development
  - Often a rewrite from scratch, data model and UI
  - Often with omission of functionality

* Distributed programming means
  - (A lot of) app-specific code around consistency, conflict resolution, retries and rollbacks

* Ad-hoc, fragile and buggy ACID implementation

Page 9

We need a New Storage

* Fast to learn and develop
  - ACID
  - SQL

* Easy to operate and maintain:
  - Read and modify on DC failure
  - Automatic scale out w/o downtime
  - Commodity hardware

* Fixable codebase (open source, Java)

Page 10

TODO:

SQL

* Scale out

* Availability
  - Cluster
  - Conflict resolution

- OR -

NoSQL ?

* ACID

* SQL

* Cassandra 2 CQL

Page 11

Cassandra 2.0

* Implements out of the box
  - CQL
  - Automatic scale out
  - Good write perf
  - Quorums, speculative retry (see also CASSANDRA-6866)
  - Logged Batch
  - “Lightweight” transactions ?

* Read - modify - write

* Possibility of concurrent transactions on 1 key

* Involves a few records, different tables

“3 phase commit” -> slow

Page 12

Cassandra 2.0

* Implements out of the box
  - CQL
  - Automatic scale out
  - Good write perf (https://github.com/jbellis/YCSB)
  - Quorums, speculative retry (see also CASSANDRA-6866)
  - Logged Batch
  - “Lightweight” transactions
  - Secondary indexes ?

Page 13

C*One

* ACID transactions
  - No SPOF, DC failure resistant
  - Across multiple tables and partitions
  - Commits and rollbacks

* First class indexes
  - No additional coding
  - Online build on existing data

Page 14

C*One

[Architecture diagram. Labels: clients; C* storage nodes; C*One update services; Cassandra gossip & messaging; schema; partitioner; “heartbeat”; cluster topology]

Page 15

Clients

clients > 800 (all Java)

* Fat client mode

* Client is its own coordinator

* Faster

* -1 point of failure -> more reliable

Page 16

Clients

[Diagram. Labels: clients; “NoTx” path; “In Tx” path; C*One update services]

Page 17

C*One Update Srvs

* Manages pessimistic locks

* Generates monotonic timestamp for cells

* Manages transactions

* Failure management

Lamport Timestamp http://en.wikipedia.org/wiki/Lamport_timestamps
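
A minimal sketch of a monotonic timestamp generator in the spirit of the Lamport timestamps referenced above. The class `MonotonicClock` and its method names are hypothetical; the talk does not describe the actual generator. It assumes timestamps are plain `long` values and shows the two rules that matter: never hand out a value that is not greater than the previous one, and never fall behind a timestamp seen from another node.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical generator: illustrates monotonic cell timestamps, not the talk's actual code.
final class MonotonicClock {
    private final AtomicLong last = new AtomicLong();

    // Returns the wall clock value if it moved past the last timestamp,
    // otherwise last + 1, so results are strictly increasing.
    long next(long wallClock) {
        return last.updateAndGet(prev -> Math.max(prev + 1, wallClock));
    }

    // Lamport rule: after seeing a remote timestamp, never issue a smaller one.
    void observe(long remoteTimestamp) {
        last.accumulateAndGet(remoteTimestamp, Math::max);
    }
}
```

Passing the wall clock as a parameter keeps the sketch deterministic; a real service would presumably read it from the system clock.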

Page 18

Locks mgmt

[Diagram: C*One update services, labelled 00, 10, 20, 30, 40, 50]

* Transaction Group Masters

* Simple in-memory locking
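
"Simple in-memory locking" on a transaction group master could look roughly like the sketch below. `LockTable` and its methods are hypothetical names, and keys are assumed to be strings; the slides only say that pessimistic locks are kept in the master's memory.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical in-memory lock table held by one transaction group master.
final class LockTable {
    private final Map<String, Long> ownerByKey = new HashMap<>();

    // Returns true if txId holds the lock after the call (re-entrant for the same tx).
    synchronized boolean tryLock(String key, long txId) {
        Long owner = ownerByKey.putIfAbsent(key, txId);
        return owner == null || owner == txId;
    }

    // Releases every lock held by txId, e.g. on commit or rollback.
    // Returns the number of locks released.
    synchronized int releaseAll(long txId) {
        int released = 0;
        Iterator<Map.Entry<String, Long>> it = ownerByKey.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() == txId) {
                it.remove();
                released++;
            }
        }
        return released;
    }
}
```

Because the table lives only in memory, its contents are lost when the master dies, which is why the failure handling on the following slides matters.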

Page 19

Failure detection

[Diagram: update services 00, 10, 20, 30, 40, 50 in DC-1, DC-2, DC-3; heartbeat quorum]

* Each-to-every heartbeat

* Quorum cluster view (I am dead if Q say so)

* 50ms tick

* G1 GC

* 200ms till failure detection
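
The quorum-based failure detection above can be sketched as two small rules. `HeartbeatView` is a hypothetical class, and the strict-majority rule is an assumption about what "quorum" means here; the 200 ms timeout is taken from the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical view of one observer node; not the talk's actual implementation.
final class HeartbeatView {
    static final long TIMEOUT_MS = 200; // from the slide: 200ms till failure detection

    private final Map<String, Long> lastSeenMs = new HashMap<>();

    // Record a heartbeat received from a peer.
    void onHeartbeat(String node, long nowMs) {
        lastSeenMs.put(node, nowMs);
    }

    // Local opinion: suspect a peer if no heartbeat arrived within the timeout.
    boolean suspects(String node, long nowMs) {
        Long seen = lastSeenMs.get(node);
        return seen == null || nowMs - seen > TIMEOUT_MS;
    }

    // Cluster opinion (assumed rule): dead only if a strict majority of observers suspect it.
    static boolean deadByQuorum(int suspectingObservers, int totalObservers) {
        return suspectingObservers > totalObservers / 2;
    }
}
```

The same rule applied to itself gives "I am dead if Q say so": a node that learns a majority considers it dead should stop acting as a master.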

Page 20

Failure management

[Diagram. Labels: clients > 800; start Tx; update services 50, 50’, 50”]

* Master election protocol

* Speculative transaction start

Page 21

Unborn transactions

* Transaction start requests queue
  - (in substitute’s memory)
  - Thrown away after timeout

* On range master failure
  - queue is processed
  - “started” replies are sent to clients (declines if already opened)
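
One possible reading of the "unborn transactions" queue, as a sketch. `UnbornQueue`, its method names, and the way already-opened transactions are passed in are all assumptions; the slides only state that start requests are queued in the substitute's memory, dropped after a timeout, and processed when the range master fails.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical queue kept by a substitute of a range master.
final class UnbornQueue {
    private static final class Request {
        final long txId;
        final long receivedMs;

        Request(long txId, long receivedMs) {
            this.txId = txId;
            this.receivedMs = receivedMs;
        }
    }

    private final Deque<Request> pending = new ArrayDeque<>();
    private final long timeoutMs;

    UnbornQueue(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // A speculative start request is only remembered, not acted upon.
    void onSpeculativeStart(long txId, long nowMs) {
        expire(nowMs);
        pending.addLast(new Request(txId, nowMs));
    }

    // Requests older than the timeout are thrown away.
    private void expire(long nowMs) {
        while (!pending.isEmpty() && nowMs - pending.peekFirst().receivedMs > timeoutMs) {
            pending.removeFirst();
        }
    }

    // On master failure: returns the tx ids to answer with "started";
    // transactions the failed master had already opened are declined.
    List<Long> takeOver(Set<Long> alreadyOpened, long nowMs) {
        expire(nowMs);
        List<Long> started = new ArrayList<>();
        for (Request r : pending) {
            if (!alreadyOpened.contains(r.txId)) {
                started.add(r.txId);
            }
        }
        pending.clear();
        return started;
    }
}
```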

Page 22

Tx start

1. StartTx

2. Lock

3. Read

4. Cache

[Diagram: clients; RAM holding the locks table and transaction state; row id=1, a=1, b=1]

Page 23

Tx write

1. UPDATE

2. File

[Diagram: clients; RAM holding the locks table and transaction state; rows id=1, a=1, b=1 and id=1, a=2, b=2]

Page 24

Tx read

1. Read

2. Read ?

3. resolve()

[Diagram: clients; RAM holding the locks table and transaction state; rows id=1, a=1, b=1 and id=1, a=2, b=2]
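
The read path inside a transaction can be pictured as an overlay: the stored row is read, then the transaction's own uncommitted writes are laid on top of it. The sketch below is an assumption about what `resolve()` does; `TxState`, the string row keys and the column maps are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-transaction state kept in RAM.
final class TxState {
    // Pending writes: row key -> (column -> new value).
    private final Map<String, Map<String, Object>> pending = new HashMap<>();

    // An UPDATE inside the transaction only changes the in-memory state.
    void write(String rowKey, String column, Object value) {
        pending.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
    }

    // resolve(): overlay this transaction's pending columns on the stored row.
    // The stored row itself is left untouched.
    Map<String, Object> resolve(String rowKey, Map<String, Object> stored) {
        Map<String, Object> merged = new HashMap<>(stored);
        Map<String, Object> mine = pending.get(rowKey);
        if (mine != null) {
            merged.putAll(mine);
        }
        return merged;
    }
}
```

In the slide's example, the stored row is id=1, a=1, b=1 and the transaction has written a=2, b=2, so a read inside the transaction sees the new values while other readers still see the old ones.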

Page 25

Tx commit

1. Commit

4. Ack

[Diagram: clients; RAM holding the locks table and transaction state; row id=1, a=2, b=2; LOGGED BATCH; unlabelled steps 2 and 3]

Page 26: Add a bit of ACID to Cassandra. My talk from Cassandra Summit EU 2014

1. Rollback

RAM

Transaction state

Locks table

id=1, a=2, b=2

Tx rollback

clients
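
Commit and rollback differ only in what happens to the in-memory state. A minimal sketch, assuming pending changes are kept as a list of statements and flushed in one logged batch: `BatchSink` is a hypothetical stand-in for the Cassandra write, not a real driver interface, and `Tx` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sink standing in for a logged-batch write to Cassandra.
interface BatchSink {
    void writeLoggedBatch(List<String> statements);
}

// Hypothetical transaction object; pending changes live in RAM until commit.
final class Tx {
    private final List<String> pending = new ArrayList<>();
    private final BatchSink sink;
    private boolean open = true;

    Tx(BatchSink sink) {
        this.sink = sink;
    }

    void update(String statement) {
        ensureOpen();
        pending.add(statement); // nothing reaches storage yet
    }

    // Commit: all pending changes go out as one logged batch, or none do.
    void commit() {
        ensureOpen();
        sink.writeLoggedBatch(new ArrayList<>(pending));
        close();
    }

    // Rollback: the in-memory state is dropped and nothing is written.
    void rollback() {
        ensureOpen();
        close();
    }

    private void close() {
        pending.clear();
        open = false;
    }

    private void ensureOpen() {
        if (!open) {
            throw new IllegalStateException("transaction is closed");
        }
    }
}
```

Lock release is left out of the sketch; presumably it happens in the same step that closes the transaction.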

Page 27

ACID

* Atomicity - logged batch or nothing

* Consistency - application, rollback

* Isolation - Locks - Read Committed

* Durability - quorum reads and writes to Cassandra

Page 28

Indexes in Cassandra 2

* CREATE INDEX (owner, modified) ?
  - No composite index support
  - High cardinality
  - Don’t scale (synchronous full cluster scan on read)
  - Max 100K tombstones per index

CREATE TABLE photos (
    id bigint PRIMARY KEY,
    owner bigint,
    modified timestamp
);

SELECT * FROM photos WHERE owner=? AND modified>?

Page 29

Global Indexes in C*One

photos (Primary Key: id)

id | owner | modified  | caption     | access | …
1  | 111   | 9.10.2014 | “kitty cat” | PUB    | …

INDEX i1 ON photos (owner, modified) VALUES (caption, access, …);

i1_photo (Primary Key: Partition Key owner; Clustering Key modified, id)

owner | modified  | id | caption     | access | …
111   | 9.10.2014 | 1  | “kitty cat” | PUB    | …

SELECT * WHERE owner=? AND modified>?
->
SELECT * FROM i1_photo WHERE owner=? AND modified>?
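
Why the index table makes the query cheap can be shown with an in-memory model: rows are grouped by the partition key (owner) and kept sorted by the clustering key (modified, then id), so the query touches one partition and one contiguous range. `PhotoIndex` is a hypothetical illustration using plain `long` values for `modified`; it is not how C*One stores data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the index table layout.
final class PhotoIndex {
    // owner (partition key) -> modified (clustering) -> id (clustering) -> caption
    private final Map<Long, TreeMap<Long, TreeMap<Long, String>>> byOwner = new HashMap<>();

    void put(long owner, long modified, long id, String caption) {
        byOwner.computeIfAbsent(owner, k -> new TreeMap<>())
               .computeIfAbsent(modified, k -> new TreeMap<>())
               .put(id, caption);
    }

    // Models: SELECT * ... WHERE owner=? AND modified>?
    // One partition lookup, then one sorted range scan.
    List<String> captionsModifiedAfter(long owner, long modifiedAfter) {
        List<String> out = new ArrayList<>();
        TreeMap<Long, TreeMap<Long, String>> partition = byOwner.get(owner);
        if (partition == null) {
            return out;
        }
        for (TreeMap<Long, String> rows : partition.tailMap(modifiedAfter, false).values()) {
            out.addAll(rows.values());
        }
        return out;
    }
}
```

Contrast this with a per-node secondary index, where the same query has to ask every node in the cluster.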

Page 30

Index

[Diagram: clients send UPDATE; RAM holding transaction state with rows id=1, a=1, b=1 and id=1, a=2, b=2; Schema; 2. idxwrites(); idx: a=2, b=2, id=1]
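
A sketch of what an `idxwrites()` step might derive for the example in the diagram, where row id=1 changes from a=1, b=1 to a=2, b=2 and the index is on (a, b). Everything here is hypothetical: the class `IndexWrites`, the table name `idx`, and the statement strings are illustrative. In particular, removing the stale index entry explicitly is an assumption; the slides do not show how old entries are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives index writes from an update of an indexed row.
final class IndexWrites {
    // Index on (a, b) of a table keyed by id; the index row key is (a, b, id).
    static List<String> idxWrites(long id, long oldA, long oldB, long newA, long newB) {
        List<String> writes = new ArrayList<>();
        boolean keyChanged = oldA != newA || oldB != newB;
        if (keyChanged) {
            // Assumption: the entry under the old key is deleted explicitly.
            writes.add("DELETE FROM idx WHERE a=" + oldA + " AND b=" + oldB + " AND id=" + id);
        }
        // The entry under the new key is always written.
        writes.add("INSERT INTO idx (a, b, id) VALUES (" + newA + ", " + newB + ", " + id + ")");
        return writes;
    }
}
```

Since these writes are derived inside the transaction, they can travel in the same commit as the base-table change, which is what would keep the index consistent with the data.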

Page 31

ACID

* Indexes “a la SQL”
  - Consistent
  - On more than 1 column
  - Scalable and fast
  - Built into CQL
  - No additional coding required
  - Very little penalty (+1 write)

Page 32

Production: Photos

* 11 billion photos

* 80k reads/sec, 2k-8k tx/sec

* SQL
  - RF=1 (+1 on RAID 10, +3 in backups)
  - 32 MS SQL + 16 standby + 10 backup = 58
  - load = 100%

* C*One
  - RF=3 (in each DC)
  - 63 C* + 6 upd = 69, 1/3 price
  - load = 30%

Page 33

Photos: numbers

* Tx failures: 8500/day -> 85/day

* Avg Tx timespan: <40ms

* Commit latency avg: <2ms

* Read, write: avg <2ms, 99% ~3ms

Page 34

C*

* 22 patches to issues.apache.org
  - range tombstone and query fixes, optimizations, etc.

* Commit log on-the-fly compression (CASSANDRA-7994)

* Reliable always-retry policy (CASSANDRA-6866)

* Night of the Living Dead (CASSANDRA-7872)

Page 35

Oleg Anastasyev
ok.ru/oa
@m0nstermind

slideshare.net/m0nstermind

http://v.ok.ru

THANK YOU!