Outline: Closing the CAP Gap
• Just-Right Consistency
Available as possible, and consistent when necessary
• AntidoteDB
The first database that provides transactions with strong semantics, targeted at the JRC approach
• Moving forward
Antidote’s path forward from research to company and product
[Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
[Figure: world map with a single replica, A]
Centralized database.
Clients read and write against the primary copy.
[Figure: world map with three replicas, A, B, and C]
Geo-replicated for both fault tolerance and high availability.
Clients read and write locally for low latency.
What happens if C can’t communicate with the other replicas?
Choice 1: Consistent-Under-Partition (CP)
• Synchronize each operation
Maintains a “single system image”
• Spanner/F1, serializability model
Coordination is expensive; Spanner typically has to wait 100 ms to commit an update transaction
Over-conservative, but easy to program!
Choice 2: Available-Under-Partition (AP)
• Riak, Cassandra, Dynamo
Operations issued against the local copy, and across the cluster in parallel
• Local operation only, asynchronous propagation
Stale reads and write conflicts will occur without synchronization
Available, but difficult to program!
CAP Theorem
CP: high cost, low availability, synchronization
AP: low cost, high availability, anomalies
False dichotomy!
• No “one-size-fits-all” consistency model
Choosing either model alone will be either over-conservative or risk anomalies
• Application-level invariants
Instead, tailor consistency choices to the application-level invariants of each operation
Just Right Consistency
• Preserve sequential patterns
Applications that are correct when run sequentially should remain correct under concurrency
• AP-compatible invariants
Strongest AP model; invariants that only require “one-way” communication
• CAP-sensitive invariants
Transactions that require coordination; invariants that need “two-way” communication
• Tools for analysis and verification
Identify and verify that the application has sufficient synchronization to ensure its invariants
Fælles Medicinkort
• FMK [production] / FMKe [synthetic workload]
Danish National Joint Medicine Card; operating 24x7 since 2013 for 6 million Danish citizens
• Lifecycle management for prescriptions
Involves patient, pharmacy, and doctor management around active prescriptions in Denmark
• Assumed correct in isolation
“Correct-Individually”, the C in ACID: each operation ensures application-level invariants
• create-prescription
Create prescription for patient, doctor, pharmacy
• update-prescription-medication
Add or increase medication on a prescription
• process-prescription
Delivery of a medication by a pharmacy
• get-*-prescriptions
Query functions that return information about prescriptions
FMKe Invariants
• Relative order [referential integrity]
Create a prescription, then reference it from the patient
• Joint update [atomicity]
Create prescription, then update doctor, patient, and pharmacy
• Precondition check [if, then]
Medication should not be over-delivered
AP-compatible
• No synchronization
Updates occur locally without blocking; no synchronization in the critical path
• Asynchronous operation
Updates are fast, available, and exploit concurrency
• Compatible invariants
Relative order and joint update invariants can be preserved
Can we make non-commutative updates commutative?
Can we find a suitable data model for AP systems?
[Figure: replicas RA and RB both hold 1 after set(1); set(2) and set(3) are then applied concurrently, and after exchanging updates each replica's value is marked “?”]
How do we deterministically pick a value to keep?
Do we use a timestamp (like Cassandra) and drop a value?
Timestamps make concurrent operations commute but fail to capture intent.
[Figure: the same execution, with both replicas resolving the conflict as max(2,3) and converging on 3]
Deterministic conflict resolution function.
CRDTs generalize this framework.
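The max(2,3) resolution above can be sketched in a few lines of Erlang. This is only an illustration, not AntidoteDB code: the module name merge_sketch and its functions are hypothetical, and the timestamped variant assumes writes are tagged as {Timestamp, Value} pairs.

```erlang
-module(merge_sketch).
-export([merge_max/2, merge_lww/2, demo/0]).

%% Keep the larger of two concurrent integer writes. max/2 is commutative,
%% associative, and idempotent, so replicas converge whatever the delivery order.
merge_max(A, B) -> max(A, B).

%% Last-writer-wins over {Timestamp, Value} pairs (an assumed encoding).
%% Erlang compares tuples element by element, so the later timestamp wins and
%% ties are broken by value; the losing write is dropped.
merge_lww(W1, W2) -> max(W1, W2).

%% One replica holds 2 and receives 3; the other holds 3 and receives 2.
demo() ->
    RA = merge_max(2, 3),
    RB = merge_max(3, 2),
    {RA, RB}.
```

Both replicas end at 3. merge_lww converges as well, but it discards one of the two writes purely on the basis of the clock, which is the loss of intent the slide points out.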
Conflict-Free Replicated Data Types
• Replicated abstract data types
Extension of a sequential data type that encapsulates a deterministic merge function
• Many existing designs
Sets, counters, registers, flags, maps
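As one concrete instance of a data type that carries its own merge function, here is a minimal sketch of a state-based grow-only counter. It is a textbook design chosen for brevity, not AntidoteDB's implementation; the module name gcounter_sketch and the replica names are hypothetical.

```erlang
-module(gcounter_sketch).
-export([new/0, increment/2, value/1, merge/2]).

%% The state maps each replica name to the number of increments it has issued.
new() -> #{}.

%% A replica only ever bumps its own entry.
increment(Replica, Counter) ->
    maps:update_with(Replica, fun(N) -> N + 1 end, 1, Counter).

%% The counter value is the sum over all replicas.
value(Counter) ->
    lists:sum(maps:values(Counter)).

%% Merge takes the per-replica maximum, which is commutative, associative,
%% and idempotent, so replicas can exchange states in any order.
merge(A, B) ->
    maps:fold(
        fun(Replica, N, Acc) ->
            maps:update_with(Replica, fun(M) -> max(M, N) end, N, Acc)
        end,
        A,
        B).
```

Two increments at one replica and one at another merge to a value of 3 in either merge order, and merging a state with itself leaves it unchanged.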
Causal Consistency
• Respect causality
Ensure updates are delivered in causal order [Lamport 78]
• Strongest available model
Always able to return some compatible version of an object
• Referential integrity
Causal consistency is sufficient for providing referential integrity in an AP database
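Causal delivery can be illustrated with the textbook vector-clock rule: an update is applied only once everything it depends on has been applied. The sketch below assumes that rule and is not a description of AntidoteDB's own protocol; the module causal_sketch, its functions, and the replica name ra are hypothetical.

```erlang
-module(causal_sketch).
-export([deliverable/3, deliver/3]).

%% A vector clock is a map from replica name to the number of updates seen
%% from that replica; missing entries count as 0.
entry(Replica, Clock) -> maps:get(Replica, Clock, 0).

%% An update stamped MsgClock by Sender can be applied locally when it is the
%% next update from Sender and every other dependency has already arrived.
deliverable(Sender, MsgClock, LocalClock) ->
    IsNext = entry(Sender, MsgClock) =:= entry(Sender, LocalClock) + 1,
    DepsMet = lists:all(
        fun({Replica, N}) ->
            Replica =:= Sender orelse N =< entry(Replica, LocalClock)
        end,
        maps:to_list(MsgClock)),
    IsNext andalso DepsMet.

%% Record that the update has been applied.
deliver(Sender, MsgClock, LocalClock) ->
    LocalClock#{Sender => entry(Sender, MsgClock)}.
```

With this rule, a replica that receives "update Pt(Rx)" before "create Rx" holds the update back until the create has been applied, so the patient record never references a prescription that does not exist yet.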
[Figure: timeline for replicas RA and RB and client C1, showing create Rx followed by update Dr(Rx), update Pt(Rx), and update Ph(Rx)]
Add reference in pharmacy record.
Updates are causally consistent.
Client can read inconsistent state.
Client is missing update to pharmacy.
[Figure: timeline for replicas RA and RB and client C1; transaction T1 groups create Rx, update Dr(Rx), update Pt(Rx), and update Ph(Rx), and is followed by a second transaction T2]
Group updates into an atomic transaction.
Updates reflect the “All-Or-Nothing” property through snapshots.
Transactions are delivered in causal order.
Therefore, snapshots are causally consistent.
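The all-or-nothing behaviour can be pictured with a toy in-memory store in which a transaction's updates become visible in a single step. This is a deliberately simplified sketch, not AntidoteDB's mechanism; the module txn_sketch and the keys rx, dr, pt, and ph are hypothetical stand-ins for the prescription, doctor, patient, and pharmacy records.

```erlang
-module(txn_sketch).
-export([commit/2, read/2, demo/0]).

%% The store is a map from key to value. A snapshot is simply the store value
%% a reader captured; Erlang data is immutable, so later commits cannot change it.

%% Apply every update of a transaction in one step.
commit(Updates, Store) -> maps:merge(Store, Updates).

%% Read a set of keys from a snapshot; absent keys are left out of the result.
read(Keys, Snapshot) -> maps:with(Keys, Snapshot).

%% A reader sees either none of T1's updates or all of them.
demo() ->
    Before = #{},
    T1 = #{rx => created, dr => rx, pt => rx, ph => rx},
    After = commit(T1, Before),
    {read([pt, ph], Before), read([pt, ph], After)}.
```

A reader holding the earlier snapshot sees neither the patient nor the pharmacy reference, while a reader holding the later one sees both; no snapshot contains the patient update without the pharmacy update.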
[Figure: replicas RA, RB, and RC each start with 2 remaining medications; one replica applies pp(1) while another concurrently applies add(3); all replicas converge on 4]
Correct outcome with four remaining medications.
Precondition is stable under concurrent addition.
[Figure: replicas RA, RB, and RC each start with 2; one replica applies pp(1) while replica C concurrently applies pp(2); all replicas converge on -1]
Replica C concurrently checks the precondition and delivers two medications.
Incorrect outcome violating the non-negative invariant.
Precondition is NOT stable under concurrent fulfillment.
• Forbid concurrency
Prevent operations from proceeding without synchronization, to enforce the invariant
• Allow concurrency and remove invariant
Allow the operation to proceed, knowing that the invariant may be violated under concurrent operations
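The two executions above can be replayed in a small sketch. It assumes that each replica checks the precondition against its own local state and then propagates only the resulting effect; the module stability_sketch and its function names are hypothetical.

```erlang
-module(stability_sketch).
-export([try_deliver/2, apply_effects/2, demo_add/0, demo_deliver/0]).

%% Check the precondition against the local replica state only. Returns the
%% effect to propagate to the other replicas, or rejected.
try_deliver(Remaining, N) when Remaining >= N -> {ok, -N};
try_deliver(_Remaining, _N) -> rejected.

%% Every replica eventually applies all effects; addition commutes, so the
%% order of application does not matter.
apply_effects(Remaining, Effects) -> Remaining + lists:sum(Effects).

%% pp(1) concurrent with add(3), both starting from 2.
demo_add() ->
    {ok, E1} = try_deliver(2, 1),
    apply_effects(2, [E1, 3]).

%% pp(1) concurrent with pp(2): each check passes locally against 2.
demo_deliver() ->
    {ok, E1} = try_deliver(2, 1),
    {ok, E2} = try_deliver(2, 2),
    apply_effects(2, [E1, E2]).
```

demo_add/0 ends at 4, while demo_deliver/0 ends at -1 even though both local checks succeeded, which is why concurrent deliveries need either synchronization or a weaker invariant.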
CISE Analysis
• Individually correct
Individual operations never violate the invariant
• Convergence
Concurrent effects commute
• Precondition stability
Preconditions are stable under every pair of concurrent operations
If satisfied, the invariant is guaranteed under concurrency.
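The precondition-stability rule can be approximated by brute force over a small range of states. The sketch below is a bounded check for illustration only, not the CISE tool or a proof; it assumes an operation is described as a {Name, Precondition, Effect} triple over an integer state, and the module cise_sketch is hypothetical.

```erlang
-module(cise_sketch).
-export([stable/3, unstable_pairs/2]).

%% An operation is {Name, Pre, Effect}: Pre is a predicate on the state and
%% Effect maps a state to the next state.

%% Op1's precondition is stable under Op2 if, in every listed state where both
%% preconditions hold, Op1's precondition still holds after Op2's effect.
stable({_, Pre1, _}, {_, Pre2, Eff2}, States) ->
    lists:all(
        fun(S) ->
            not (Pre1(S) andalso Pre2(S)) orelse Pre1(Eff2(S))
        end,
        States).

%% List every ordered pair of operations that fails the check.
unstable_pairs(Ops, States) ->
    [{N1, N2} || {N1, _, _} = Op1 <- Ops,
                 {N2, _, _} = Op2 <- Ops,
                 not stable(Op1, Op2, States)].
```

Applied to an add operation and two delivery operations over states 0 to 5, the check flags only the pairs of deliveries, matching the medication example: additions are safe to run concurrently, deliveries are not.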
AntidoteDB
• Open-source Erlang database
Developed in Erlang, on top of the Riak Core distributed systems framework
• Transactional Causal Consistency
Only industrial-grade database providing both causal consistency and all-or-nothing transactions
• Alpha release available
Currently under development, but an alpha release of the product is available on GitHub
[Figure: datacenters A and B, each with nodes N1 and N2; components shown are TxnMgr, Materializer, Log, and InterDC-Repl]
…each operating a transaction manager, materializers, and a log.
…with a causal consistency protocol running in the wide area.
Data Model
• Register: Last-Writer Wins, Multi-Value
• Set: Grow-Only, Add-Wins, Remove-Wins
• Map
• Counter: Unlimited, Restricted ≥ 0
• Graph: Directed, Monotonic DAG, Edit graph
• Sequence
Object API
%% Identify an object by object identifier.
User1 = {michel, antidote_crdt_mvreg, user_bucket},
%% Use the update API to assign a value to this register.
{ok, Time2} = antidote:update_objects(ignore, [], [{User1, assign,
    {["Michel", "[email protected]"], ClientIdentifier}}]),
%% Read the object, providing a minimum snapshot time.
{ok, Result, Time3} = antidote:read_objects(Time2, [], [User1]).
Simple, operation-based API (think Redis, Riak CRDTs).
Causal dependencies are automatically captured by execution order.
Transaction API
%% Start a transaction with a given snapshot time; returns a transaction identifier.
{ok, TxId} = antidote:start_transaction(Timestamp, []),
%% Read objects using the interactive transaction API.
{ok, _} = antidote:read_objects([Set], TxId),
%% Update objects using the interactive transaction API.
ok = antidote:update_objects([{Set, add, "Java"}], TxId),
%% Once finished updating, commit the transaction.
{ok, _} = antidote:commit_transaction(TxId).
Transactions read causally consistent snapshots and updates are applied atomically.
Scalability
[Figure: throughput in Kops/s (axis 100 to 800) for read(update) ratios 99(1), 90(10), 75(25), and 50(50), across DCs × Servers configurations 1 x 5, 1 x 10, 1 x 25, 2 x 25, and 3 x 25]
LWW registers, 100k keys/partition, power law distribution
Cure vs. SOA
[Figure: throughput in Kops/s (axis 0 to 1100) of Eiger, GR, Cure, and EC for read(update) ratios 99(1), 90(10), 75(25), and 50(50)]
3 DCs × 25 Servers, LWW registers
Cure vs. EC
[Figure: throughput in Kops/s (axis 100 to 1200) of Cure and EC with 1KB and 10KB objects for read(update) ratios 99(1), 90(10), 75(25), and 50(50)]
3 DCs × 25 Servers, CRDT sets
Future Features
• Intra-DC replication
Antidote currently provides no replication within the datacenter and assumes only geo-replication
• ACID transactions
For Antidote to provide all of JRC, it needs ACID transaction support: no research needed, only implementation
Moving Forward
• Research prototype
Originally a research prototype to build a database requiring reduced synchronization (SyncFree FP7) with Basho, Rovio, and Trifork
• Research ahead
LightKone (H2020) will investigate moving AntidoteDB close to the edge to provide DDN services
• Industrialization
Obtaining seed funding to start a company to industrialize AntidoteDB
Resources
• https://github.com/SyncFree/antidote
AntidoteDB
• http://syncfree.github.io/antidote/
Documentation for AntidoteDB
• www.antidotedb.com
Website
• docker pull antidotedb/antidote
Try out Antidote!