OK.ru is one of the largest social networks for Russian-speaking audiences, with 80+ million unique visitors monthly. OK.ru has used Cassandra since 2010 and has made a number of improvements to the C* 2.0 and 2.1 codebase. Until recently, more than 50 TB of data in OK.ru OLTP systems was managed by Microsoft SQL Server. It is very expensive, hard to scale, and cannot save us from an outage if one of our data centers fails. We wanted new, fast, scalable and reliable storage for this data. The data requires ACID transaction support, so that we don't have to rewrite all application code from scratch. C* does not support such transactions, only lightweight ones, so we implemented a new storage with ACID transactions and selected features of the SQL world ourselves. Still, it has C* at its heart. We'll discuss the internals of the new storage, which features of C* we had to alter and which we had to rewrite from scratch. We'll also talk about our operational experience with it in production.
ok.ru
* 45M daily, 80M monthly audience
* Top 4 social networking site
* Top 7 on total time on site in the world*
* ~ 500,000 http reqs/sec
* > 400 Gbps out
* > 8000 physical servers in 5 DCs, ~1ms ping
* comScore data for July 2014, desktops, users aged 15+
Cassandra at ok.ru
* Since 2010: versions 0.6-ok, 1.2, 2.0
* In 2014
  - 33 clusters
  - > 600 storage nodes
  - 330 TB
* Fastest: 1.5M ops (48 nodes)
* Largest: 130 TB (96 nodes)
SQL Server 2005
* Consistent (ACID) OLTP data
* 200 servers, 50 TB of data
* Sharding
  - F(Entity_Id) -> Token -> SQL Server Node
  - F(Master_Id) === F(Detail_Id)
* Local node commit only
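The sharding rule on this slide can be sketched as follows. This is a minimal illustration, not the production code: the class name `ShardRouter`, the token space size and the id allocation scheme in `newDetailId` are assumptions made for the example. The point is that a detail row (a photo) gets an id that maps to the same token as its master row (an album), so both live on one node and a local commit is enough.

```java
import java.util.List;

// Hypothetical sketch of F(Entity_Id) -> Token -> SQL Server Node.
final class ShardRouter {
    static final int TOKENS = 1024; // assumed size of the token space

    private final List<String> nodes;

    ShardRouter(List<String> nodes) {
        if (nodes.isEmpty()) throw new IllegalArgumentException("no nodes");
        this.nodes = List.copyOf(nodes);
    }

    // F(id): floorMod keeps the token non-negative even for negative ids.
    static int token(long entityId) {
        return (int) Math.floorMod(entityId, (long) TOKENS);
    }

    // Assumed allocation scheme: the low part of a detail id repeats the
    // master's token, so F(Master_Id) === F(Detail_Id) holds by construction.
    static long newDetailId(long masterId, long sequence) {
        return sequence * TOKENS + token(masterId);
    }

    // Token -> node; a real deployment would use a token-range table instead.
    String nodeFor(long entityId) {
        return nodes.get(token(entityId) % nodes.size());
    }
}
```

With this scheme, `nodeFor(albumId)` and `nodeFor(newDetailId(albumId, n))` always return the same node.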
Fast SQL Server 2005
* DB JOIN
* Foreign key constraints
* Stored Procs, Triggers
* Read uncommitted (noTx)
* Short lived transactions <100ms
* No massive UPDATEs, DELETEs
* Always query on indexed data
Usual SQL shortcomings
* Manual “scale out” with downtime
* Downtime on maintenance
* Write performance
* BSoD, swap outs, magic
* Expensive HA hardware (10x 1U server price)
* Fragile failover: ~10% of failovers fail
* Downtime on DC failure or partition
Simple transaction in SQL Server
TX.start("Albums", id);
Album album = albums.lock(id);
Photo photo = photos.create(…);
if (photo.status == PUBLIC) {
    album.incPublicPhotosCount();
}
TX.commit();
* Read - modify - write
* Involves a few records, different tables
* Possibility of concurrent transactions on 1 key
Usual NoSQL problems
* Learning curve
* Complicated development
  - Often a rewrite from scratch, data model and UI
  - Often with omission of functionality
* Distributed programming means
  - (A lot of) app-specific code around consistency, conflict resolution, retries and rollbacks
* Ad-hoc, fragile and buggy ACID implementation
We need a New Storage
* Fast to learn and develop
  - ACID
  - SQL
* Easy to operate and maintain
  - Read and modify on DC failure
  - Automatic scale out w/o downtime
  - Commodity hardware
* Fixable codebase (open source, Java)
TODO:
SQL
* Scale out
* Availability
  - Cluster
  - Conflict resolution
- OR -
NoSQL ?
* ACID
* SQL
* Cassandra 2 CQL
Cassandra 2.0
* Implements out of the box
  - CQL
  - Automatic scale out
  - Good write perf
  - Quorums, speculative retry (see also CASSANDRA-6866)
  - Logged Batch
  - “Lightweight” transactions ?
* Read - modify - write
* Possibility of concurrent transactions on 1 key
* Involves a few records, different tables
* “3 phase commit” -> slow
Cassandra 2.0
* Implements out of the box
  - CQL
  - Automatic scale out
  - Good write perf (https://github.com/jbellis/YCSB)
  - Quorums, speculative retry (see also CASSANDRA-6866)
  - Logged Batch
  - “Lightweight” transactions
  - Secondary indexes ?
C*One
* ACID transactions
  - No SPoF, DC failure resistant
  - Across multiple tables and partitions
  - Commits and rollbacks
* First class indexes
  - No additional coding
  - Online build on existing data
C*One
[Diagram: clients (> 800, all Java), C*One update services and C* storage nodes, all connected through Cassandra gossip & messaging: schema, partitioner, “heartbeat”, cluster topology.]
Clients
* Fat client mode
* Client is its own coordinator
* Faster
* One less point of failure -> more reliable
[Diagram: clients reach the storage directly for NoTx operations and go through the C*One update services when in Tx.]
C*One Update Srvs
* Manages pessimistic locks
* Generates monotonic timestamp for cells
* Manages transactions
* Failure management
Lamport Timestamp http://en.wikipedia.org/wiki/Lamport_timestamps
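A monotonic cell timestamp generator in the spirit of a Lamport clock could look like the sketch below. The class and method names are hypothetical, and the use of wall-clock microseconds as the base is an assumption. It shows the two rules: a local tick never goes backwards, and a timestamp observed from another node pushes the local clock past it.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: timestamps that never repeat or go back on one node.
final class MonotonicClock {
    private final AtomicLong last = new AtomicLong();

    // Local event: follow the wall clock, but always move forward by at least 1.
    long next(long wallClockMicros) {
        return last.updateAndGet(prev -> Math.max(prev + 1, wallClockMicros));
    }

    // Lamport receive rule: jump past a timestamp seen from another node.
    long observe(long remoteTimestamp) {
        return last.updateAndGet(prev -> Math.max(prev, remoteTimestamp) + 1);
    }
}
```

Because each new value is strictly greater than the previous one, a later write by the same update service always wins over an earlier one on the same cell, even if the wall clock steps back.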
Locks mgmt
* Transaction Group Masters
* Simple in-memory locking
[Diagram: C*One update services 00, 10, 20, 30, 40, 50.]
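Simple in-memory locking on a transaction group master can be sketched as a map from key to owning transaction. `LockTable` and its methods are hypothetical names. For brevity, a conflicting request is refused here rather than queued, which is an assumption of the sketch and not something the slides specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an in-memory pessimistic lock table.
final class LockTable {
    private final Map<String, Long> ownerByKey = new HashMap<>();

    // Returns true if txId now holds the lock (including re-entrant requests).
    synchronized boolean tryLock(String key, long txId) {
        Long owner = ownerByKey.putIfAbsent(key, txId);
        return owner == null || owner == txId;
    }

    // Called on commit or rollback: drop every lock held by the transaction.
    synchronized void releaseAll(long txId) {
        ownerByKey.values().removeIf(owner -> owner == txId);
    }

    synchronized boolean isLocked(String key) {
        return ownerByKey.containsKey(key);
    }
}
```

Since the table lives only in the master's memory, the locks vanish when the master dies, which is why failure detection and master election matter for this design.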
Failure detection
* Each-to-every heartbeat
* Quorum cluster view (I am dead if Q say so)
* 50ms tick
* G1 GC
* 200ms till failure detection
[Diagram: Heartbeat Quorum among update services 00, 10, 20, 30, 40, 50 across DC-1, DC-2, DC-3.]
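The quorum view can be illustrated with a small sketch: every node records when it last heard from every other node, and a node is declared dead only when a quorum of its peers has not heard from it for 200 ms. The class name, the data layout and the way observer views are collected in one place are all assumptions for illustration. In a real cluster each node would exchange these views over the network.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "I am dead if Q say so".
final class QuorumFailureDetector {
    static final long TIMEOUT_MS = 200; // failure detection time from the slide

    private final int clusterSize;
    // observer -> (target -> time of the last heartbeat received, in ms)
    private final Map<String, Map<String, Long>> lastSeen = new HashMap<>();

    QuorumFailureDetector(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    // Observer has just received a heartbeat from target.
    void heartbeat(String observer, String target, long nowMs) {
        lastSeen.computeIfAbsent(observer, k -> new HashMap<>()).put(target, nowMs);
    }

    // Dead only if a majority of the cluster has lost contact with the target.
    boolean isDead(String target, long nowMs) {
        int suspects = 0;
        for (Map.Entry<String, Map<String, Long>> view : lastSeen.entrySet()) {
            if (view.getKey().equals(target)) continue; // own opinion does not count
            Long seen = view.getValue().get(target);
            if (seen == null || nowMs - seen > TIMEOUT_MS) suspects++;
        }
        return suspects >= clusterSize / 2 + 1;
    }
}
```

A single observer losing contact is not enough, so a partial network problem between two nodes does not trigger a failover on its own.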
Failure management
* Master election protocol
* Speculative transaction start
[Diagram: clients (> 800) send start Tx to update service 50 and to 50’ and 50”.]
Unborn transactions
* Transaction start requests queue
  - Kept in the substitute's memory
  - Thrown away after timeout
* On range master failure
  - The queue is processed
  - "Started" replies are sent to clients (declined if already opened)
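One possible reading of the unborn transactions queue is sketched below. The names, the timeout handling and the shape of the takeover call are assumptions. The substitute keeps speculative start requests in memory, drops the expired ones, and on master failure starts every queued transaction that the failed master had not already opened.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the start-request queue held by a substitute.
final class UnbornTransactions {
    private static final class Request {
        final long txId;
        final long receivedMs;
        Request(long txId, long receivedMs) {
            this.txId = txId;
            this.receivedMs = receivedMs;
        }
    }

    private final long timeoutMs;
    private final Deque<Request> queue = new ArrayDeque<>();

    UnbornTransactions(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // A speculative start request arrives; expired requests are thrown away.
    void enqueue(long txId, long nowMs) {
        expire(nowMs);
        queue.addLast(new Request(txId, nowMs));
    }

    // Master failed: returns the ids that get a "started" reply from us.
    List<Long> takeOver(Set<Long> alreadyOpened, long nowMs) {
        expire(nowMs);
        List<Long> started = new ArrayList<>();
        while (!queue.isEmpty()) {
            long txId = queue.removeFirst().txId;
            if (!alreadyOpened.contains(txId)) started.add(txId); // else decline
        }
        return started;
    }

    int pending() {
        return queue.size();
    }

    private void expire(long nowMs) {
        while (!queue.isEmpty() && nowMs - queue.peekFirst().receivedMs > timeoutMs) {
            queue.removeFirst();
        }
    }
}
```

The client has already sent its start request to the substitute, so after a master failure it does not need to discover the new master and retry.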
Tx start
1. StartTx
2. Lock
3. Read
4. Cache
[Diagram: clients, locks table and in-RAM transaction state; the row id=1, a=1, b=1 is read and cached.]
Tx write
1. UPDATE
2. File
[Diagram: clients, locks table and in-RAM transaction state; the row id=1, a=1, b=1 becomes id=1, a=2, b=2.]
Tx read
1. Read
2. Read ?
3. resolve()
[Diagram: clients, locks table and in-RAM transaction state holding id=1, a=1, b=1 and id=1, a=2, b=2.]
Tx commit
1. Commit
4. Ack
[Diagram: clients, locks table and in-RAM transaction state; the row id=1, a=2, b=2 is written as a LOGGED BATCH (steps 2 and 3) before the Ack.]
Tx rollback
1. Rollback
[Diagram: clients, locks table and in-RAM transaction state holding id=1, a=2, b=2.]
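The start, write, read, commit and rollback slides can be tied together with a small in-memory sketch. `TxSketch` and `Storage` are hypothetical names; the `Storage` map merely stands in for quorum reads and writes to the C* storage nodes, and `commit()` stands in for a single logged batch. Uncommitted changes live only in the transaction state in RAM, a read inside the transaction resolves the stored row against those pending changes, and a rollback simply forgets them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the transaction state kept in RAM by an update service.
final class TxSketch {
    // Stand-in for committed data in the C* storage nodes.
    static final class Storage {
        final Map<Long, Map<String, Integer>> rows = new HashMap<>();
    }

    private final Storage storage;
    // Pending, uncommitted column values per row id.
    private final Map<Long, Map<String, Integer>> pending = new HashMap<>();

    TxSketch(Storage storage) {
        this.storage = storage;
    }

    // Tx write: the change is filed in RAM only.
    void update(long id, String column, int value) {
        pending.computeIfAbsent(id, k -> new HashMap<>()).put(column, value);
    }

    // Tx read: resolve() overlays this transaction's writes on the stored row.
    Map<String, Integer> read(long id) {
        Map<String, Integer> row = new HashMap<>(storage.rows.getOrDefault(id, Map.of()));
        row.putAll(pending.getOrDefault(id, Map.of()));
        return row;
    }

    // Tx commit: all pending changes are applied together (one batch).
    void commit() {
        pending.forEach((id, cols) ->
                storage.rows.computeIfAbsent(id, k -> new HashMap<>()).putAll(cols));
        pending.clear();
    }

    // Tx rollback: nothing was written to storage, so dropping RAM state is enough.
    void rollback() {
        pending.clear();
    }
}
```

Readers outside the transaction go straight to storage and therefore see only committed values.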
ACID
* Atomicity - logged batch or nothing
* Consistency - application, rollback
* Isolation - Locks - Read Committed
* Durability - quorum reads and writes to Cassandra
Indexes in Cassandra 2
* CREATE INDEX (owner, modified) ?
  - No composite index support
  - High cardinality
  - Don't scale (synchronous full cluster scan on read)
  - Max 100K tombstones per index
CREATE TABLE photos (
    id bigint PRIMARY KEY,
    owner bigint,
    modified timestamp,
    …
);
SELECT * FROM photos WHERE owner=? AND modified>?
Primary Key: id
id   owner   modified    caption       access   …
1    111     9.10.2014   “kitty cat”   PUB      …

INDEX i1 ON photos (owner, modified) VALUES (caption,access,…);

Primary Key: Partition Key (owner), Clustering Key (modified, id)
owner   modified    id   caption       access   …
111     9.10.2014   1    “kitty cat”   PUB      …

SELECT * WHERE owner=? AND modified>?
SELECT * FROM i1_photo WHERE owner=? AND modified>?
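Why an index stored as its own table answers `owner=? AND modified>?` cheaply can be shown with a sketch. `PhotoIndex` is a hypothetical in-memory model, with dates encoded as plain numbers for brevity. The partition key (owner) selects one partition, and the clustering key (modified) keeps rows sorted inside it, so the query is a single ordered range scan rather than a cluster-wide scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of an index table keyed by (owner, modified, id).
final class PhotoIndex {
    // partition key: owner -> clustering key: modified -> photo ids
    private final Map<Long, TreeMap<Long, List<Long>>> byOwner = new HashMap<>();

    void put(long owner, long modified, long id) {
        byOwner.computeIfAbsent(owner, k -> new TreeMap<>())
               .computeIfAbsent(modified, k -> new ArrayList<>())
               .add(id);
    }

    // owner=? AND modified>? : one partition, one sorted range scan.
    List<Long> newerThan(long owner, long modified) {
        List<Long> ids = new ArrayList<>();
        TreeMap<Long, List<Long>> partition = byOwner.get(owner);
        if (partition != null) {
            partition.tailMap(modified, false).values().forEach(ids::addAll);
        }
        return ids;
    }
}
```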
Global Indexes in C*One
[Diagram: on UPDATE, the in-RAM transaction state holds id=1, a=1, b=1 and id=1, a=2, b=2; using the schema, step 2. idxwrites() produces the index row idx: a=2, b=2, id=1; clients.]
* Indexes “a la SQL”
  - Consistent
  - On more than 1 column
  - Scalable and fast
  - Built into CQL
  - No additional coding required
  - Very little penalty (+1 write)
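How index writes might be derived from a row update is sketched below. `IndexWrites.idxWrites` is a hypothetical helper that only produces readable descriptions of mutations. Removing the stale index entry is an assumption of this sketch; the slides only state that an index costs one extra write. Because old and new row versions are both available in the transaction state, the index rows can be computed without an extra read and committed in the same batch as the base row, which is what keeps the index consistent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: derive index mutations from the old and new row versions.
final class IndexWrites {
    static List<String> idxWrites(long id,
                                  Map<String, Integer> before,
                                  Map<String, Integer> after,
                                  List<String> indexedColumns) {
        String oldKey = key(before, indexedColumns);
        String newKey = key(after, indexedColumns);
        List<String> batch = new ArrayList<>();
        if (oldKey.equals(newKey)) {
            return batch; // indexed columns unchanged: no index write at all
        }
        batch.add("DELETE idx: " + oldKey + ", id=" + id); // assumed cleanup of the stale entry
        batch.add("INSERT idx: " + newKey + ", id=" + id);
        return batch;
    }

    // Index key: indexed column values in the order the index declares them.
    private static String key(Map<String, Integer> row, List<String> indexedColumns) {
        StringBuilder sb = new StringBuilder();
        for (String column : indexedColumns) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(column).append('=').append(row.get(column));
        }
        return sb.toString();
    }
}
```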
ACID
Production: Photos
* 11 billion photos
* 80k reads/sec, 2k-8k tx/sec
* SQL
  - RF=1 (+1 on RAID 10, +3 in backups)
  - 32 MS SQL + 16 standby + 10 backup = 58
  - load = 100%
* C*One
  - RF=3 (in each DC)
  - 63 C* + 6 upd = 69, 1/3 price
  - load = 30%
Photos: numbers
* Tx failures: 8500/day -> 85/day
* Avg Tx timespan: <40ms
* Commit latency avg: <2ms
* Read, write: avg <2ms, 99% ~ 3ms
C*
* 22 patches to issues.apache.org
  - Range tombstone and query fixes, optimizations, etc.
* Commit log on-the-fly compression (CASSANDRA-7994)
* Reliable always-retry policy (CASSANDRA-6866)
* Night of the Living Dead (CASSANDRA-7872)
Oleg Anastasyev ok.ru/oa @m0nstermind
slideshare.net/m0nstermind
http://v.ok.ru
THANK YOU!