Conflict-free Replicated Data Types
MARC SHAPIRO, NUNO PREGUIÇA, CARLOS BAQUERO AND MAREK ZAWIRSKI
Presented by: Ron Zisman
Motivation

Replication and consistency are essential features of large distributed systems such as the WWW, P2P systems, and cloud computing.

Lots of replicas:
- Great for fault tolerance and read latency
- Problematic when updates occur: slow synchronization, or conflicts when there is no synchronization

We look for an approach that:
- supports replication
- guarantees eventual consistency
- is fast and simple

Conflict-free objects = no synchronization whatsoever. Is this practical?
Contributions

Theory: Strong Eventual Consistency (SEC)
- A solution to the CAP problem
- Formal definitions
- Two sufficient conditions, with a strong equivalence between the two
- Incomparable to sequential consistency

Practice: CRDTs = Convergent or Commutative Replicated Data Types
- Counters
- Sets
- Directed graphs
Strong Consistency

Ideal consistency: all replicas know about the update immediately after it executes.
- Precludes conflicts: replicas apply updates in the same total order; works for any deterministic object
- Requires consensus: a serialization bottleneck; tolerates < n/2 faults
- Correct, but doesn't scale
Eventual Consistency

- Update locally and propagate: no foreground synchronization; eventual, reliable delivery
- On conflict: arbitrate, roll back, reconcile
- Consensus moved to the background: better performance, but still complex
Strong Eventual Consistency

- Update locally and propagate: no synchronization; eventual, reliable delivery
- No conflict: deterministic outcome of concurrent updates
- No consensus needed: tolerates up to n-1 faults
- Solves the CAP problem
Definition of EC

- Eventual delivery: an update delivered at some correct replica is eventually delivered to all correct replicas
- Termination: all method executions terminate
- Convergence: correct replicas that have delivered the same updates eventually reach equivalent state (this doesn't preclude rollbacks and reconciliation)
Definition of SEC

- Eventual delivery: an update delivered at some correct replica is eventually delivered to all correct replicas
- Termination: all method executions terminate
- Strong convergence: correct replicas that have delivered the same updates have equivalent state
System model

- A system of non-Byzantine processes interconnected by an asynchronous network
- Partition tolerance and recovery

What are the two simple conditions that guarantee strong convergence?
Query

- The client sends the query to any of the replicas
- Local at the source replica
- Evaluated synchronously, no side effects
State-based approach

An object is a tuple (S, s0, q, u, m): payload set S, initial state s0, query q, update u, merge m.
- Local queries, local updates
- Send the full state; on receive, merge
- An update is said to be "delivered" at a replica when it is included in its causal history
State-based replication

Local at source: x_i.u(a), x_j.u(b), ...
- Check the precondition, compute
- Update the local payload

Convergence:
- Episodically: send the payload
- On delivery: merge payloads

Causal history C: unchanged on a query; on an update u, C(x_i) := C(x_i) ∪ {u}; on a merge, C(x_i) := C(x_i) ∪ C(x_j).
Semi-lattice

A poset (S, ≤) is a join-semilattice if for all x, y in S a LUB (Least Upper Bound) x ⊔ y exists:
- Associative: (x ⊔ y) ⊔ z = x ⊔ (y ⊔ z)
- Commutative: x ⊔ y = y ⊔ x
- Idempotent: x ⊔ x = x

Examples: integers under max, sets under union (see the sketch below).
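As a quick illustration (a Python sketch of my own, not from the slides), set union and integer max both satisfy the three laws:

```python
# Set union as a join-semilattice merge: associative, commutative, idempotent.
x, y, z = {1}, {1, 2}, {3}
assert (x | y) | z == x | (y | z)          # associative
assert x | y == y | x                      # commutative
assert x | x == x                          # idempotent

# max on integers satisfies the same laws, so it is also a valid LUB.
assert max(max(1, 2), 3) == max(1, max(2, 3)) == 3
```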
State-based: monotonic semi-lattice ⇒ CvRDT

If:
- the payload type forms a semi-lattice,
- updates are increasing, and
- merge computes the Least Upper Bound,

then replicas converge to the LUB of the last values.
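As a concrete instance (a minimal sketch of mine under these three conditions, not code from the paper), consider a grow-only set: the payload is a set ordered by inclusion, adds only grow the state, and merge is union, which is exactly the LUB:

```python
# A grow-only set (G-Set): arguably the simplest state-based CRDT (CvRDT).
class GSet:
    def __init__(self):
        self.elements = set()      # payload; initial state: the empty set

    def add(self, e):              # update: increasing in the lattice
        self.elements.add(e)

    def lookup(self, e):           # query: local, no side effects
        return e in self.elements

    def merge(self, other):        # merge: set union = least upper bound
        self.elements |= other.elements

# Two replicas update independently, then exchange states and merge.
a, b = GSet(), GSet()
a.add(1)
b.add(2)
a.merge(b)
b.merge(a)
assert a.elements == b.elements == {1, 2}   # strong convergence
```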
Operation-based approach

An object is a tuple (S, s0, q, t, u, P): payload set S, initial state s0, query q, prepare-update t, effect-update u, delivery precondition P.
- prepare-update t: precondition checked at the source; 1st phase: at the source, synchronous, no side effects
- effect-update u: precondition P checked against the downstream state; 2nd phase: asynchronous, side effects applied to the downstream state
Operation-based replication

Local at source:
- Check the precondition, compute
- Broadcast to all replicas

Eventually, at all replicas:
- Check the downstream precondition
- Apply to the local replica

Causal history C: unchanged on a query or prepare-update; on an effect-update u, C(x_i) := C(x_i) ∪ {u}.
Op-based: commutativity ⇒ CmRDT

If:
- Liveness: all replicas execute all operations, in a delivery order in which the downstream precondition (P) is true, and
- Safety: concurrent operations all commute,

then replicas converge.
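As a minimal sketch (mine; the reliable broadcast layer is assumed to exist elsewhere), an op-based counter: prepare-update produces a signed increment, effect-update applies it, and since addition commutes, any delivery order of concurrent operations converges:

```python
# An op-based (CmRDT) counter sketch: operations commute, so replicas that
# deliver the same set of operations, in any order, reach the same state.
class OpCounter:
    def __init__(self):
        self.value = 0

    def increment(self):           # prepare-update: runs at the source,
        return ("add", 1)          # no side effects; returns the op to broadcast

    def decrement(self):
        return ("add", -1)

    def apply(self, op):           # effect-update: applied downstream
        _, delta = op
        self.value += delta        # addition commutes, so order doesn't matter

# The same operations delivered in different orders yield the same state.
ops = [("add", 1), ("add", -1), ("add", 1)]
a, b = OpCounter(), OpCounter()
for op in ops:
    a.apply(op)
for op in reversed(ops):
    b.apply(op)
assert a.value == b.value == 1
```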
Monotonic semi-lattice ⇔ Commutative

- A state-based object can emulate an operation-based object, and vice versa
- Use state-based reasoning, then convert to operation-based for better efficiency
Comparison

State-based:
- Update ≠ merge operation
- Simple data types
- State includes preceding updates; no separate historical information
- Inefficient if the payload is large
- Used in file systems (NFS), Dynamo

Operation-based:
- Update operation
- Higher level, more complex
- More powerful, more constraining
- Small messages
- Used in collaborative editing (Treedoc), Bayou, PNUTS

Use state-based or op-based, as convenient.
SEC is incomparable to sequential consistency

There is a SEC object that is not sequentially consistent.

Consider a Set CRDT S with operations add(e) and remove(e):
- remove(e) → add(e) yields e ∈ S
- add(e) ∥ remove(e') yields e ∈ S ∧ e' ∉ S
- add(e) ∥ remove(e) yields e ∈ S (suppose add wins)

Consider the following scenario with replicas p1, p2, p3:
1. p1 executes [add(e); remove(e')] ∥ p2 executes [add(e'); remove(e)]
2. p3 merges the states from p1 and p2: e ∈ S ∧ e' ∈ S

The state of replica p3 will never occur in a sequentially consistent execution (either remove(e) or remove(e') must be last).
SEC is incomparable to sequential consistency (cont.)

There is a sequentially consistent object that is not SEC:
- If no crashes occur, a sequentially consistent object is SEC
- In general, sequential consistency requires consensus to determine the single order of operations, which cannot be solved if n-1 crashes occur (while SEC tolerates n-1 crashes)
Example CRDTs
- Multi-master counter
- Observed-Remove Set
- Directed graph
Multi-master counter (increment only)

- Payload: a vector of integers P = [p_1, ..., p_n], one entry per replica
- Partial order: x ≤ y ⟺ ∀i: x.p_i ≤ y.p_i
- value() = Σ_i p_i
- increment() at replica i: p_i := p_i + 1
- merge(x, y) = [max(x.p_1, y.p_1), ..., max(x.p_n, y.p_n)]
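In Python, a minimal sketch of this counter (my illustration; the replica count n and this replica's index i are assumed fixed at construction):

```python
# A state-based increment-only counter (G-Counter): one slot per replica;
# merge takes the per-slot maximum, which is the LUB in the lattice.
class GCounter:
    def __init__(self, n, i):
        self.p = [0] * n           # payload: one entry per replica
        self.i = i                 # this replica's index

    def value(self):               # query: the sum of all entries
        return sum(self.p)

    def increment(self):           # update: only this replica's slot grows
        self.p[self.i] += 1

    def merge(self, other):        # merge: element-wise max
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]

a, b = GCounter(2, 0), GCounter(2, 1)
a.increment()
b.increment()
b.increment()
a.merge(b)
assert a.value() == 3
```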
Multi-master counter (increment / decrement)

- Payload: two vectors of integers, P (increments) and N (decrements)
- Partial order: x ≤ y ⟺ ∀i: x.p_i ≤ y.p_i ∧ x.n_i ≤ y.n_i
- value() = Σ_i p_i - Σ_i n_i
- increment() at replica i: p_i := p_i + 1
- decrement() at replica i: n_i := n_i + 1
- merge(x, y): the entry-wise max of the P vectors and of the N vectors
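A matching sketch (same assumptions as above). Note the design choice: decrement must grow the N vector rather than shrink P, because updates have to be increasing in the lattice:

```python
# A PN-Counter sketch: two G-Counter-style vectors, one for increments (P)
# and one for decrements (N); the value is their difference.
class PNCounter:
    def __init__(self, n, i):
        self.p = [0] * n           # increments, one entry per replica
        self.n = [0] * n           # decrements, one entry per replica
        self.i = i                 # this replica's index

    def value(self):
        return sum(self.p) - sum(self.n)

    def increment(self):
        self.p[self.i] += 1

    def decrement(self):           # decrementing still grows the state
        self.n[self.i] += 1

    def merge(self, other):        # element-wise max of both vectors
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]
```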
Set design alternatives

Sequential specification:
- {true} add(e) {e ∈ S}
- {true} remove(e) {e ∉ S}

Concurrent: {true} add(e) ∥ remove(e) {???}
- Linearizable? Error state? Last writer wins? Add wins? Remove wins?
Observed-Remove Set

- Payload: two sets of (element, unique token) pairs, added and removed
- add(e): added := added ∪ {(e, α)}, where α is a freshly generated unique token
- remove(e): remove all unique pairs observed for e, i.e. removed := removed ∪ {(e, α) | (e, α) ∈ added}
- lookup(e): ∃α such that (e, α) ∈ added \ removed
- merge(S, S'): (S.added ∪ S'.added, S.removed ∪ S'.removed)
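A state-based sketch (mine; uuid4 stands in for any unique-token generator):

```python
from uuid import uuid4

# An OR-Set sketch: each add tags the element with a fresh unique token;
# remove only affects the tagged pairs it has observed, so a concurrent
# add(e) survives a remove(e) ("add wins").
class ORSet:
    def __init__(self):
        self.added = set()         # (element, unique token) pairs
        self.removed = set()       # tombstones

    def add(self, e):
        self.added.add((e, uuid4()))          # fresh token per add

    def remove(self, e):           # remove only the pairs observed locally
        self.removed |= {(x, t) for (x, t) in self.added if x == e}

    def lookup(self, e):
        return any(x == e for (x, t) in self.added - self.removed)

    def merge(self, other):        # union both components
        self.added |= other.added
        self.removed |= other.removed
```

This is why add wins: a remove at one replica cannot tombstone a token it has never observed, so an element added concurrently elsewhere survives the merge.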
OR-Set + Snapshot

Read consistent snapshots despite concurrent, incremental updates:
- A vector clock for each process (a global notion of time)
- Payload: a set of (event, timestamp) pairs
- Snapshot: a vector clock value; lookup(e, t) considers only events whose timestamps are at or before t

Garbage collection: retain tombstones until no longer needed; a log entry is discarded as soon as its timestamp is less than all remote vector clocks (i.e., it has been delivered to all processes).
Sharded OR-Set

- Very large objects are split into independent shards (static: by hash; dynamic: by consensus)
- Statically-sharded CRDT: each shard is a CRDT; an update touches a single shard; no cross-object invariants; a combination of independent CRDTs remains a CRDT
- Statically-sharded OR-Set: a combination of smaller OR-Sets; consistent snapshots require clocks that cross shards
Directed Graph: Motivation

Design a web search engine: compute PageRank over a directed graph.
- Efficiency and scalability: asynchronous processing
- Responsiveness: incremental processing, as fast as each page is crawled

Operations:
- Find new pages: add vertex
- Parse page links: add/remove arc
- Add URLs of linked pages to be crawled: add vertex
- Deleted pages: remove vertex (lookup masks incident arcs)
- Broken links allowed: add arc works even if the tail vertex doesn't exist
Graph design alternatives

Graph = (V, A) where A ⊆ V × V.

Sequential specification:
- {v', v'' ∈ V} addArc(v', v'') {...}
- {(v', v'') ∉ A} removeVertex(v') {...}

Concurrent: removeVertex(v) ∥ addArc(v', v''):
- Linearizable? Last writer wins?
- addArc(v', v'') wins? Then v' or v'' is restored if removed
- removeVertex(v) wins? Then all arcs to or from v are removed
Directed Graph (op-based)

Payload: an OR-Set V of vertices and an OR-Set A of arcs (see the sketch below).
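A sketch (mine, reusing the state-based ORSet sketch above for simplicity rather than the op-based formulation of the slides), showing how lookup masks arcs with missing endpoints:

```python
# A directed graph built from two OR-Sets: V for vertices, A for arcs.
# removeVertex "wins" by masking: incident arcs are hidden by lookup
# rather than eagerly deleted.
class Digraph:
    def __init__(self):
        self.V = ORSet()           # vertices
        self.A = ORSet()           # arcs, stored as (tail, head) pairs

    def add_vertex(self, v):
        self.V.add(v)

    def remove_vertex(self, v):    # incident arcs are masked, not removed
        self.V.remove(v)

    def add_arc(self, tail, head): # allowed even if an endpoint is missing
        self.A.add((tail, head))   # (the slides allow broken links)

    def lookup_arc(self, tail, head):
        # An arc is visible only if both of its endpoints are present.
        return (self.A.lookup((tail, head))
                and self.V.lookup(tail) and self.V.lookup(head))

    def merge(self, other):
        self.V.merge(other.V)
        self.A.merge(other.A)
```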
Summary

- A principled approach: Strong Eventual Consistency
- Two sufficient conditions: state-based: monotonic semi-lattice; op-based: commutativity
- Useful CRDTs: multi-master counter, OR-Set, directed graph
Future Work

Theory:
- The class of computations accomplished by CRDTs
- Complexity classes of CRDTs
- Classes of invariants supported by a CRDT
- CRDTs and self-stabilization, aggregation, and so on

Practice:
- Library implementations of CRDTs
- Supporting non-critical synchronous operations (committing a state, global reset, etc.)
- Sharding
Extras: MV-Register and the Shopping Cart Anomaly

MV-Register (≈ LWW-Set register):
- Payload: a set of (value, versionVector) pairs
- assign: overwrite the value, increment the version vector
- merge: the union of every element in each input set that is not dominated by an element in the other input set
- A more recent assignment overwrites an older one; concurrent assignments are merged by union (version-vector merge)
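A sketch (mine; version vectors are plain dicts from replica id to counter, and dominates is a hypothetical helper, not from the slides):

```python
# An MV-Register sketch: each value carries a version vector; merge keeps
# every element that is not dominated by an element of the other input set.
def dominates(a, b):
    # a dominates b if a >= b in every component and a > b in at least one.
    keys = set(a) | set(b)
    return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) > b.get(k, 0) for k in keys))

class MVRegister:
    def __init__(self, i):
        self.i = i                 # this replica's id
        self.pairs = []            # list of (value, version vector)

    def assign(self, value):       # overwrite: the new vv dominates all current
        vv = {}
        for _, old in self.pairs:
            for k, n in old.items():
                vv[k] = max(vv.get(k, 0), n)
        vv[self.i] = vv.get(self.i, 0) + 1
        self.pairs = [(value, vv)]

    def values(self):              # concurrent assignments: several values
        return [v for v, _ in self.pairs]

    def merge(self, other):        # keep the mutually non-dominated elements
        keep = [(v, vv) for (v, vv) in self.pairs
                if not any(dominates(ovv, vv) for _, ovv in other.pairs)]
        keep += [(v, vv) for (v, vv) in other.pairs
                 if not any(dominates(svv, vv) for _, svv in self.pairs)]
        # Deduplicate pairs present in both inputs.
        uniq = {(v, tuple(sorted(vv.items()))): (v, vv) for (v, vv) in keep}
        self.pairs = list(uniq.values())
```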
The shopping cart anomaly: a deleted element reappears.
- The MV-Register does not behave like a set
- Assignment is not an alternative to proper add and remove operations
The problem with eventual consistency jokes is that you can't tell who doesn't get it from who hasn't gotten it.