The bad The good The ugly Messages References
The CAP theorem
The bad, the good and the ugly
Michael Pfeiffer
Advanced Networking Technologies
FG Telematik/Rechnernetze
TU Ilmenau
2017-05-15
1 The bad: The CAP theorem’s proof
2 The good: A different perspective
3 The ugly: CAP and SDN
Section 1
The bad: The CAP theorem’s proof
The CAP theorem
Central proposition
In a distributed system, it is impossible to provide
• Consistency,
• Availability, and
• Partition tolerance
all at once, i.e. at least one of them has to be sacrificed.
• Suggested by Brewer in 1999/2000, proven by Gilbert and Lynch in 2002 [1]
• In many networks, the absence of partitions cannot be guaranteed (firmware bugs, administrative errors, . . . )
→ choice between CP and AP
Formal model
Network partition
All messages between nodes in different components are lost.
Availability: Available data objects
• Every request received by a non-failing node must result in a response.
• No time boundary, but a network partition can last ‘forever’, thus a strong availability requirement.
Consistency: Atomic data objects
• ∃ total order on all operations such that each operation looks as if it were completed at a single instant.
• Equivalent: Requests must act as if they were processed on a single node, one at a time.
Proof
Proof by contradiction. Assume there is a CAP system:
[Figure: replicas G1 and G2, connected by link E (now partitioned); clients C1 and C2.
1. C1 writes x ← 42 to G1; 2. G1 answers ‘success!’; 3. C2 asks G2 for x; 4. G2 must answer (availability) but cannot know the new value (partition): inconsistency.]
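The proof scenario can also be sketched in a few lines of illustrative Python (all names are invented for this sketch, not from any real system): two replicas that keep answering during a partition necessarily diverge.

```python
# Sketch of the Gilbert/Lynch proof scenario: replicas G1 and G2 that
# stay available while partitioned cannot stay consistent.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.peers = []        # replicas reachable over the network

    def write(self, key, value):
        self.store[key] = value
        for peer in self.peers:     # replicate to reachable peers only
            peer.store[key] = value
        return "success"            # availability: always answer

    def read(self, key):
        return self.store.get(key)  # availability: always answer

g1, g2 = Replica("G1"), Replica("G2")
g1.peers, g2.peers = [g2], [g1]

g1.peers, g2.peers = [], []   # partition: messages between G1 and G2 are lost

g1.write("x", 42)             # 1./2.: C1 writes to G1, gets "success"
print(g2.read("x"))           # 3./4.: C2 reads from G2 -> None, not 42
```

Both nodes answered every request, so the system was ‘available’, yet the read on G2 contradicts the acknowledged write on G1.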
Classical strategies for CP and AP
CP systems
• Delay the acknowledgement of a write operation until the new value has been propagated to all nodes
• Examples:
  • Relational database with synchronous replication
  • 2PC (two-phase commit protocol)

AP systems

• Answer with the (possibly stale) last known value
• Examples:
  • Slave DNS servers
  • NoSQL databases
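A toy sketch of the two strategies (the store classes are made up for illustration, not a real database API): the CP write refuses to acknowledge unless every replica was reached, while the AP read always answers with whatever local value it has.

```python
# CP: block the write until all replicas acknowledged; fail during a partition.
class CPStore:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        acks = 0
        for r in self.replicas:
            if r.get("reachable", True):
                r[key] = value
                acks += 1
        if acks < len(self.replicas):   # some replica was unreachable
            raise TimeoutError("partition: cannot reach all replicas")
        return "committed"

# AP: answer immediately with the last known (possibly stale) value.
class APStore:
    def __init__(self, replica):
        self.replica = replica

    def read(self, key):
        return self.replica.get(key)    # possibly stale, but always answers

replicas = [{"reachable": True}, {"reachable": False}]  # one side partitioned
cp = CPStore(replicas)
try:
    cp.write("x", 42)
except TimeoutError as e:
    print(e)                  # CP sacrifices availability during the partition

ap = APStore({"x": 41})       # stale copy from before the partition
print(ap.read("x"))           # AP answers, but the value may be outdated
```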
Section 2
The good: A different perspective
A different perspective (by Brewer [2])
The partition decision
If a partition occurs during the processing of an operation, each node can decide to
• cancel the operation (favour C over A), or
• proceed, but risk inconsistencies (favour A over C).

But: It is possible to decide differently every time, based on the circumstances.

This means:
• No partition → No problem
• But during a partition, all systems must decide eventually
• Permanently retrying is in fact a choice for C over A
Mitigation strategies
• Generally: To keep consistency, some operations must be forbidden during a partition
• Others are okay (e.g. read queries)
• Often: Guarantee consistency only to a certain degree
• Example: Read-your-own-writes consistency
  • Facebook: A user’s timeline is stored at a master copy and cached at slaves
  • Usually users see (potentially stale) copies at slaves
  • But when they post something, their reads are redirected to the respective master for a certain time
• Different strategies are possible on different levels, e.g. inside a single site and between sites (latency!)
• Often: Progress is possible in one component; multiple consensus algorithms available (e.g. dynamic voting)
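A minimal sketch of read-your-own-writes routing as in the Facebook example, assuming a hypothetical master/slave pair and a fixed sticky window (all names and timings are invented for illustration):

```python
import time

MASTER = {"timeline": ["old post"]}
SLAVE = {"timeline": ["old post"]}    # replicated asynchronously, may lag
STICKY_SECONDS = 60.0
recent_writers = {}                   # user -> time of their last write

def post(user, entry):
    MASTER["timeline"].append(entry)  # writes always go to the master
    recent_writers[user] = time.time()

def read_timeline(user):
    wrote_at = recent_writers.get(user, 0.0)
    if time.time() - wrote_at < STICKY_SECONDS:
        return MASTER["timeline"]     # read-your-own-writes: use the master
    return SLAVE["timeline"]          # others may see a stale copy

post("alice", "new post")
print(read_timeline("alice"))   # alice sees her own write immediately
print(read_timeline("bob"))     # bob may still see the stale slave copy
```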
Partition recovery
What if we still want to continue service during partition?
1 Detect partition
2 Enter a special partition mode
3 Continue service
4 After partition: Recovery
The small problem: Partition detection
• Nodes can disagree whether a partition exists
• Consensus about the partition state is not possible
• Nodes may enter the partition mode at different times
• A distributed commit protocol is required (2PC, Paxos, . . . )
The big problem: Partition recovery
A (very) simple example:
• Users register on a web site
• Every user is assigned a unique ID (SQL: serial, auto_increment)
• During partition: The same ID might be assigned twice
• Recovery: Recreate the uniqueness of IDs

Partition recovery: It’s about invariants

• In a consistent system, invariants are guaranteed
  • Even when the system’s designer does not know them
• In an available system, invariants must be explicitly restored after a partition
  • The system’s designer must know the invariants and how to restore them
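The recovery step for the user-ID example could look roughly like the following sketch (the merge function is hypothetical, one of many possible repair policies):

```python
def merge_users(side_a, side_b):
    """Merge two (id, name) lists from both sides of a healed partition,
    renumbering colliding IDs to restore the uniqueness invariant."""
    merged = list(side_a)
    used_ids = {uid for uid, _ in side_a}
    next_id = max(used_ids, default=0) + 1
    for uid, name in side_b:
        if uid in used_ids:      # invariant was violated during the partition
            uid = next_id        # explicit repair: assign a fresh ID
            next_id += 1
        used_ids.add(uid)
        merged.append((uid, name))
    return merged

# During the partition, both sides assigned ID 3 independently:
side_a = [(1, "ann"), (2, "bob"), (3, "cat")]
side_b = [(3, "dan")]
print(merge_users(side_a, side_b))   # dan is moved to a fresh unique ID
```

Note that even this trivial repair changes user-visible data (dan’s ID), which is exactly why the designer must know the invariant and decide on a policy.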
CRDTs
• Commutative/Conflict-free Replicated Data Types (CRDTs) are data types that provably converge
• Example: Google Docs serialises edits into a series of insert and delete operations

  Original: On Monday, the ANT lecture is at 13:00.
  Edit 1 (Monday → Thursday): On Thursday, the ANT lecture is at 13:00.
  Edit 2 (13:00 → 17:00): On Monday, the ANT lecture is at 17:00.
  Merged: On Thursday, the ANT lecture is at 17:00.

→ Application-specific invariants are not ensured automatically
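As a concrete instance of a provably convergent type, here is a minimal G-Counter, a standard CRDT (this particular type is an added illustration, not from the slides): each replica only increments its own slot, and merging takes the per-slot maximum, so replicas converge regardless of the order in which they exchange state.

```python
class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.slots = [0] * n_replicas

    def increment(self):
        self.slots[self.id] += 1      # only ever touch our own slot

    def merge(self, other):
        # element-wise maximum: commutative, associative, idempotent
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

    def value(self):
        return sum(self.slots)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # two increments on replica A
b.increment()                  # one increment on replica B, partitioned away
a.merge(b); b.merge(a)         # after the partition heals
print(a.value(), b.value())    # both converge to 3
```

Convergence is guaranteed, but as the Google Docs example shows, converging to *some* common state says nothing about application-level invariants.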
More on partition recovery
• Recovery is tedious and error-prone
• Brewer: Similar to going from single-threaded to multi-threaded programming
• Sometimes the only possibility: Ask the user (e.g. git merge)
• Balance between availability and consistency:
  • ATMs: When partitioned, limit withdrawals to an amount X
  • Invariant: No more withdrawn than allowed
  • Manual correction afterwards
• Usual tools:
  • Version vectors (vector clocks)
  • Logging, replay and rollback
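The ATM compromise can be sketched as follows (the limit and balances are made up): the machine stays available during a partition but caps withdrawals, so any violation of the balance invariant is bounded until reconciliation.

```python
PARTITION_LIMIT = 200          # the "amount X" from the slide, chosen here

def withdraw(balance, amount, partitioned):
    """Return (new_balance, message) for a withdrawal attempt."""
    if partitioned and amount > PARTITION_LIMIT:
        return balance, "denied: partition limit"   # bound the damage
    if amount > balance:
        return balance, "denied: insufficient funds"
    return balance - amount, "dispensed"

balance = 1000
balance, msg = withdraw(balance, 500, partitioned=True)
print(msg)                      # large withdrawal refused while partitioned
balance, msg = withdraw(balance, 150, partitioned=True)
print(msg, balance)             # small withdrawal still served: available
```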
Section 3
The ugly: CAP and SDN
SDN and CAP
• So far, we have talked about distributed systems on the application layer (databases, web services, . . . )
• SDN is much more basic (layer 2/3)
• Network functionality is essential → pure CP is not really an option
• AP means partition recovery is required
SDN and partition recovery
• Possible without the network up and running?
• Beware of dependency loops. . .
• Is falling back to non-SDN networking possible?
• Even if SDN has been used to replace features like VLANs?
• Relying on user input is rather unrealistic. . .
• Is it possible to figure out all the invariants?
• Most SDN publications ignore the issue. . .
• BGP does not stabilise in all cases [3]. . .
Wrapping up
1 The CAP theorem is proven and holds.
2 Do not think about CP or AP systems, but about the partition decision.
3 Many possibilities to fine-tune the balance between consistency and availability, and to recover from partitions.
4 But systems tend to become very complex.
5 Can we stomach this amount of complexity for building services as basic as network connectivity?
[1] Seth Gilbert and Nancy Lynch. ‘Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services’. In: ACM SIGACT News 33 (2 June 2002), pp. 51–59. DOI: 10.1145/564585.564601.
[2] Eric Brewer. ‘CAP twelve years later: How the “rules” have changed’. In: Computer 45 (2 Feb. 2012), pp. 23–29. DOI: 10.1109/MC.2012.37.
[3] Timothy G. Griffin and Gordon Wilfong. ‘An analysis of BGP convergence properties’. In: ACM SIGCOMM Computer Communication Review 29 (4 Oct. 1999), pp. 277–288. DOI: 10.1145/316194.316231.