Apache Cassandra Durability, Durability, Durability ... Matthew F. Dennis // @mdennis CassandraSF August 8, 2012

DESCRIPTION

Talk from CassandraSF 2012 showing the importance of real durability. It gives examples of using row-level isolation in Cassandra and of implementing a transaction log pattern. The example is a banking system on top of Cassandra that supports crediting/debiting an account, viewing an account balance and transferring money between accounts.

Page 1: durability, durability, durability

Apache Cassandra
Durability, Durability, Durability ...

Matthew F. Dennis // @mdennis
CassandraSF
August 8, 2012

Page 2: durability, durability, durability

Cassandra Data Model

Keyspaces
Column Families
Rows
Columns (tuples)
Name, Value, Timestamp, TTL

Page 3: durability, durability, durability

Yeah Matt, you told us ...

Page 4: durability, durability, durability

A Banking Application (the canonical example)

credit_account(acct, delta)
get_account_balance(acct)
xfer_funds(from, to, delta)

Page 5: durability, durability, durability

“Work Backwards From Your Queries” --everyone

The only “query” we really have is get_account_balance

Page 6: durability, durability, durability

accounts column family

row key: acctX

column name     all_balls    xact_id0     xact_id1     ...
column value    $123.45      “details”    “details”    ...

● all_balls (00:00:00 etc): the current “base” total
● xact_id0: hash(“details”) - a unique, one-to-one, idempotent id for xact0
● “details”: from, to, delta, sessionId, timestamp, amount, check number, order number, et cetera, as a JSON/ProtoBuf/XML blob (i.e. everything about a “change” to the acct)
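
A minimal sketch of deriving the column name from hash(“details”), so that retrying the same write lands on the same column instead of adding a second delta. The function name `xact_id`, the choice of SHA-256 and the canonical-JSON encoding are assumptions for illustration; the slides only say that the id is a hash of the details.

```python
import hashlib
import json


def xact_id(details: dict) -> str:
    """Hypothetical helper: hash a canonical encoding of the details blob."""
    # sort_keys makes the encoding independent of dict ordering, so the
    # same logical transaction always hashes to the same id.
    canonical = json.dumps(details, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


details = {
    "from": "acctA",
    "to": "acctB",
    "delta": "25.00",
    "sessionId": "s-1",
    "timestamp": "2012-08-08T12:00:00Z",
}

# In-memory stand-in for one account row: column name -> column value.
row = {}
row[xact_id(details)] = details
# A retry of the same transaction (even with keys in another order)
# overwrites the same column rather than creating a new one.
retry = dict(reversed(list(details.items())))
row[xact_id(retry)] = retry
print(len(row))
```

Because the id is a pure function of the details, a client that times out can simply resend the write.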

Page 7: durability, durability, durability

get_account_balance(acct)

● read the entire row

● apply deltas to “base”
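
The two steps above can be sketched against an in-memory row. This assumes the row has already been read into a dict, that the base lives under a column named `all_balls`, and that each delta column holds a JSON blob with a `delta` field; these names and the storage format are illustrative, not taken from a real schema.

```python
import json
from decimal import Decimal

BASE_COLUMN = "all_balls"  # assumed name of the column holding the base total


def get_account_balance(row: dict) -> Decimal:
    """Apply every delta column in the row to the base total."""
    balance = Decimal(row.get(BASE_COLUMN, "0"))
    for name, value in row.items():
        if name == BASE_COLUMN:
            continue
        # Every other column is a "details" blob describing one change.
        balance += Decimal(json.loads(value)["delta"])
    return balance


row = {
    "all_balls": "123.45",
    "xact_id0": json.dumps({"delta": "10.00"}),
    "xact_id1": json.dumps({"delta": "-3.45"}),
}
print(get_account_balance(row))  # 130.00
```

Decimal is used rather than float so that currency amounts add exactly.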

Page 8: durability, durability, durability

credit_account(acct, delta)

● write “details” to accounts CF

Page 9: durability, durability, durability

get_account_balance(acct)

● obviously the row for each account grows unbounded

● need to safely recalculate the “base” to avoid unbounded growth

Page 10: durability, durability, durability

accounts column family

● a standard single master setup for consolidating works well here (because the system is slower, not broken, while the master is down)

● let's be clear on that; the master being up or down is independent of the correctness of the system (otherwise master => SPOF => bad)

Page 11: durability, durability, durability

accounts column family (consolidation, WOT form)

● pick number of processors, hash acctId mod num_consolidators (clearly other options exist to assign accounts to consolidators)

● only the assigned processor can update the base

● read row for account, calculate new base, write new base + delete columns that went into the base

● read at CL.ALL the first time an account is seen by the processor after boot (in memory, BDB, etc)

● on failure of a write at CL.Q, do not continue processing for that account until a read at CL.ALL for that account has completed

● adding consolidators is easy and requires no down time; shut them down, reconfigure with a new number, start them up

● essentially checkpointing
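
A rough sketch of the assignment and of one consolidation pass, using an in-memory dict in place of the accounts CF. The names `consolidator_for`, `consolidate` and `apply_mutation` are hypothetical, CRC32 is just one stable hash that could be used, and deltas are stored as plain decimal strings to keep the example short. The CL.Q / CL.ALL handling described above is not modelled here.

```python
import zlib
from decimal import Decimal

BASE_COLUMN = "all_balls"  # assumed name of the base column


def consolidator_for(acct_id: str, num_consolidators: int) -> int:
    """hash(acctId) mod num_consolidators, with a hash that is stable
    across processes (Python's built-in hash() of str is salted)."""
    return zlib.crc32(acct_id.encode("utf-8")) % num_consolidators


def consolidate(row: dict) -> dict:
    """Build one row mutation: the new base plus deletes for exactly the
    deltas that were folded into it."""
    base = Decimal(row.get(BASE_COLUMN, "0"))
    folded = [name for name in row if name != BASE_COLUMN]
    for name in folded:
        base += Decimal(row[name])
    return {"set": {BASE_COLUMN: str(base)}, "delete": folded}


def apply_mutation(row: dict, mutation: dict) -> None:
    """Stand-in for writing the mutation to the row in a single operation."""
    row.update(mutation["set"])
    for name in mutation["delete"]:
        row.pop(name, None)


row = {"all_balls": "100.00", "xact_id0": "5.00", "xact_id1": "-2.50"}
mutation = consolidate(row)
apply_mutation(row, mutation)
print(row)  # {'all_balls': '102.50'}
```

Only the columns that were actually read are deleted, so a delta that arrives while the pass is running stays in the row for the next pass.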

Page 12: durability, durability, durability

accounts column family (consolidation)

row key: acctX

column name     all_balls    xact_id0    xact_id1    ...
column value    {x}          {y}         {z}         ...

● all_balls: current “base” total
● xact_id0, xact_id1: xact0, xact1

if “base” was written at CL.Q, then a read at CL.Q will return the most current version.

Page 13: durability, durability, durability

accounts column family (consolidation)

row key: acctX

column name     all_balls    xact_id0     xact_id1     ...
column value    {x, y, z}    tombstone    tombstone    ...

● all_balls: new “base” total
● xact_id0, xact_id1: xact0 tombstone, xact1 tombstone

processor calculates new base and deletes corresponding deltas

Page 14: durability, durability, durability

accounts column family (consolidation)

row key: acctX

column name     all_balls    xact_id0     xact_id1     ...
column value    {x, y, z}    tombstone    tombstone    ...

● all_balls: new “base” total
● xact_id0, xact_id1: xact0 tombstone, xact1 tombstone

row level isolation guarantees that if the base includes the delta then the delta is absent from the delta list. Likewise, if the base does not include the delta then the delta is in the delta list

Page 15: durability, durability, durability

accounts column family (durability, consistency)

● writes can be at any consistency level that meets your durability, consistency and availability requirements (CL.Q?)

● base updates and the queries to calculate them must be at CL.Q with the initial read at CL.ALL

Page 16: durability, durability, durability

why the CL.ALL read?

each node is shown as two sets: left set is base, right is delta list

● initial state
  node0: {} {}     node1: {} {}     node2: {} {}

● concurrent write for x
  node0: {} {x}    node1: {} {}     node2: {} {}

● CL.Q response from node0 and node1; calculate base as {x}, CL.Q write fails
  node0: {x} {}    node1: {} {}     node2: {} {}

● concurrent write for y
  node0: {x} {}    node1: {} {}     node2: {} {y}

● CL.Q response from node1 and node2; calculate base as {y}
  node0: {x} {}    node1: {} {}     node2: {y} {}

● node2 propagates base={y}, node0 propagates deltas={}, resulting in x missing from base *and* from deltas
  node0: {y} {}    node1: {y} {}    node2: {y} {}
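
The trace above can be replayed with a small simulation. This is a deliberately simplified model, not Cassandra's reconciliation code: each replica is a dict of column -> (timestamp, value), a value of None stands for a tombstone, and replicas are merged with "highest timestamp wins". All function names are made up for the example.

```python
def merge(*replicas):
    """Merge replicas column by column, keeping the newest cell."""
    merged = {}
    for replica in replicas:
        for column, cell in replica.items():
            if column not in merged or cell[0] > merged[column][0]:
                merged[column] = cell
    return merged


def consolidate(view, ts):
    """Fold the live deltas in `view` into a new base; tombstone them."""
    base = set(view["base"][1]) if "base" in view else set()
    mutation = {}
    for column, (_, value) in view.items():
        if column != "base" and value is not None:
            base.add(value)
            mutation[column] = (ts, None)  # delete the folded delta
    mutation["base"] = (ts, frozenset(base))
    return mutation


def run(read_all_after_failure):
    node0, node1, node2 = {}, {}, {}
    node0["x"] = (1, "x")  # write for x has only reached node0
    # Read from node0 + node1, compute base {x}; the write "fails" at
    # CL.Q but still lands on node0.
    node0.update(consolidate(merge(node0, node1), 2))
    node2["y"] = (3, "y")  # write for y has only reached node2
    readers = (node0, node1, node2) if read_all_after_failure else (node1, node2)
    mutation = consolidate(merge(*readers), 4)
    node1.update(mutation)
    node2.update(mutation)
    # Replicas eventually converge on the newest cell for each column.
    return set(merge(node0, node1, node2)["base"][1])


print(sorted(run(read_all_after_failure=False)))  # ['y']
print(sorted(run(read_all_after_failure=True)))   # ['x', 'y']
```

With the read that skips node0, the second pass never sees base={x} and x is lost; reading all replicas after the failed write picks it up.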

Page 17: durability, durability, durability

accounts column family (xfers)

● clearly source account and destination account can be on different nodes

● so, how do you maintain consistency across them when doing transfers?

Page 18: durability, durability, durability

accounts column family (xfers)

● the common approach is to use a transaction log (go go wikipedia)

● Oracle uses one
● PGSQL uses one
● C* uses one
● we should have one too!

Page 19: durability, durability, durability

the xact_log column family

row key: node_token

column name     timeuuid(xact0)    timeuuid(xact1)    timeuuid(xact2)
column value    “details”          “details”          “details”

● node_token: randomly chosen from set of nodes (or from a known range, e.g. 0-100)
● timeuuid(xact0): timeuuid of when xact0 occurred
● “details”: same “complete” details as previously

Page 20: durability, durability, durability

the xact_log column family

● a durable (e.g. multi node) place to write changes

● a write to xact_log CF ~= “commit”

● each node runs a cron job that periodically (e.g. every minute +/- 15 seconds) queries a slice of its corresponding row(s) and those of its neighbor (could improve on polling)

● node replays any messages found in their entirety and deletes the column

● normally, the query returns no results
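
A sketch of the log write and of what the periodic job might do for one row, with plain dicts standing in for the xact_log and accounts CFs. The function names, the `from`/`to`/`delta` fields and the SHA-256 transaction id are assumptions for illustration; `uuid.uuid1()` is Python's time-based UUID and plays the role of the timeuuid column name.

```python
import hashlib
import json
import random
import uuid

xact_log = {}  # row key (node token) -> {timeuuid: "details" blob}
accounts = {}  # account id -> {xact id: delta string}


def log_xact(details, node_tokens):
    """Write "details" under a randomly chosen node token (the "commit")."""
    token = random.choice(node_tokens)
    column = uuid.uuid1()
    xact_log.setdefault(token, {})[column] = json.dumps(details, sort_keys=True)
    return token, column


def replay_row(token):
    """Replay every entry in the row in full, oldest first, then delete it."""
    entries = sorted(xact_log.get(token, {}).items(), key=lambda kv: kv[0].time)
    for column, blob in entries:
        details = json.loads(blob)
        xid = hashlib.sha256(blob.encode("utf-8")).hexdigest()
        # Same column name on every replay, so replaying twice is harmless.
        accounts.setdefault(details["from"], {})[xid] = "-" + details["delta"]
        accounts.setdefault(details["to"], {})[xid] = details["delta"]
        del xact_log[token][column]


tokens = [0, 1, 2]
log_xact({"from": "acctA", "to": "acctB", "delta": "25.00"}, tokens)
for _ in range(2):  # two polling rounds; the second finds nothing
    for token in tokens:
        replay_row(token)
print(accounts)
```

In the normal case the client deletes its own log entry, so the job only does work for transactions whose client died part-way through.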

Page 21: durability, durability, durability

xfer_funds(from, to, delta) (the interesting one)

● write “details” to xact_log CF

● in parallel, write “details” for from and to account rows

● delete “details” from xact_log CF (could be done after client response)

● failures?

Page 22: durability, durability, durability

xfer_funds(from, to, delta) (failures)

● before insert
● after insert
● after from xor to is applied
● after from and to are applied
● after delete from xact_log CF
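
One way to reason about the failure points listed above is to inject a crash before each step and then run the log replay. The sketch below does that with in-memory structures; the step names, the `Crash` exception and the helpers are hypothetical, and the two account writes run sequentially here rather than in parallel.

```python
import hashlib
import json


class Crash(Exception):
    """Simulated client failure."""


def write_side(accounts, blob, side):
    """Idempotently write the delta for one side of the transfer."""
    details = json.loads(blob)
    xid = hashlib.sha256(blob.encode("utf-8")).hexdigest()
    sign = "-" if side == "from" else ""
    accounts.setdefault(details[side], {})[xid] = sign + details["delta"]


def xfer_funds(log, accounts, details, crash_before=None):
    blob = json.dumps(details, sort_keys=True)
    steps = [
        ("insert", lambda: log.append(blob)),
        ("from", lambda: write_side(accounts, blob, "from")),
        ("to", lambda: write_side(accounts, blob, "to")),
        ("delete", lambda: log.remove(blob)),
    ]
    for name, step in steps:
        if name == crash_before:
            raise Crash(name)
        step()


def replay(log, accounts):
    """What the periodic job does: apply each entry in full, then delete it."""
    for blob in list(log):
        write_side(accounts, blob, "from")
        write_side(accounts, blob, "to")
        log.remove(blob)


def outcome(crash_before):
    log, accounts = [], {}
    try:
        xfer_funds(log, accounts, {"from": "a", "to": "b", "delta": "25.00"}, crash_before)
    except Crash:
        pass
    replay(log, accounts)
    return {acct: sorted(cols.values()) for acct, cols in accounts.items()}, log


for point in ("insert", "from", "to", "delete", None):
    print(point, outcome(point))
```

A crash before the insert leaves no trace (the transfer never happened); every later crash point converges on the same final rows once the log has been replayed.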

Page 23: durability, durability, durability

consistency? (eventually)

● partitions between data centers?
● failures for xacts in flight?
● maintenance?
● upgrades?
● you have requirements, be honest about what they are …
● do not page your ops team at 4am unless required (which *should* be rare)

Page 24: durability, durability, durability

accounts column family settings

● normal gc_grace_seconds
● row cache friendly*
● key cache friendly (~everything is)
● leveled compaction strategy (IO “now” or IO “later”?)
● should probably use commitlog_sync=batch (not a per CF setting)

* in general you should probably just avoid the row cache altogether

Page 25: durability, durability, durability

xact_log column family settings

● gc_grace_seconds = 0
● row cache unfriendly
● key cache friendly, but not needed
● leveled compaction strategy (or size-tiered with min_threshold=2)

Page 26: durability, durability, durability

other uses

● “base” and “deltas” need not represent money

● character inventory/trading
● portfolios
● escrow exchanges
● anything combinable (you control the consolidate code)

Page 27: durability, durability, durability

is this the best way?

● not always, of course not, depends on your requirements and goals

● could use C* for xact_log, Oracle for balances

● could use zookeeper instead of CL.Q and CL.ALL for consolidators

● C* solutions favor availability, scalability and durability over other desirable traits

Page 28: durability, durability, durability

Q?

Matthew F. Dennis // @mdennis

Page 29: durability, durability, durability

Thank You! (now go prep for your lightning talk)