Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems
Ramnatthan Alagappan, Aishwarya Ganesan, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau
OSDI ‘18

Replication Protocols
Paxos, Viewstamped replication, Raft
Foundation upon which datacenter systems are built: GFS, Colossus, BigTable

The Two Different Worlds of Replication
How and where to store system state?
World-1, disk-durable: synchronously persist updates to disks
  Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin …
  safe but suffer from poor performance
World-2, memory-durable: buffer updates only in volatile memory
  Viewstamped replication, NOPaxos [OSDI ‘16], SpecPaxos [NSDI ‘15] …
  performant but risk unsafety or unavailability
Neither approach is ideal: each is either reliable or performant, not both

Can a protocol provide strong reliability while maintaining high performance?

SAUCR: Situation-Aware Updates and Crash Recovery
Simple insight: dynamically (based on the situation) decide how to commit updates
  with many or all nodes up, buffer in memory (fast mode)
  with failures, if only bare majority up, flush to disk (slow mode)
Strong reliability while maintaining high performance

Simultaneity of Failures
SAUCR’s effectiveness depends upon simultaneity of failures
  independent and non-simultaneous correlated failures (gap of a few milliseconds to a few seconds)
    can react and switch from fast to slow mode
    preserves durability and availability
  many truly simultaneous correlated failures
    no gap and so cannot react
    remain unavailable
    however, existing data hints they are extremely rare: the Non-Simultaneity Conjecture

Results
Implemented in ZooKeeper
SAUCR improves reliability compared to memory-durable systems
  durable and available in 100s of crash scenarios
  memory-durable loses data or becomes unavailable
Improvements at no or little cost
  overheads within 0%-9% of memory-durable systems
Compared to disk-durable
  slight reduction in availability in extremely rare cases
  improves performance: 2.5x on SSDs, 100x on HDDs

Outline
Introduction
Distributed updates and crash recovery
  disk-durable protocols
  memory-durable protocols
Situation-aware updates and crash recovery
Results
Summary and conclusion

Disk-Durable Protocols
[Figure: a client sends update A=2 to the leader, which forwards it to two followers; every node holds A=1 and fsyncs A=2 to its local disk. A recovered node is ready immediately, possibly lagging.]
Update
  committed once fsync has completed on a majority
  if ack’d anyone, data on disk: safe
Recovery
  just read from local disk
Safe and available
But poor performance due to fsync: 50x slower on HDDs, 2.5x slower on SSDs
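
The commit rule on the slide above can be sketched roughly as follows. This is a minimal illustration and not ZooKeeper's or Raft's code: `Replica` and `replicate` are hypothetical names, each replica's "disk" is a local log file, and replication is a plain loop rather than network messages.

```python
import os
import tempfile


class Replica:
    """Hypothetical replica that appends updates to a local log file."""

    def __init__(self, directory, name):
        self.path = os.path.join(directory, f"{name}.log")

    def persist(self, update):
        # Append the update and force it to stable storage before acking.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(update + "\n")
            log.flush()
            os.fsync(log.fileno())
        return True  # the ack is sent only after fsync has returned

    def recover(self):
        # Disk-durable recovery: just read the local log back.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            return log.read().splitlines()


def replicate(replicas, update):
    """Return True (committed) once a majority of replicas has fsynced."""
    majority = len(replicas) // 2 + 1
    acks = 0
    for replica in replicas:
        if replica.persist(update):
            acks += 1
        if acks >= majority:
            return True
    return False
```

In this sketch the loop stops at the first majority, so the last replica may not have the update yet. That corresponds to the "lagging" node in the figure: it is still ready immediately after a restart, because whatever it acknowledged is on its disk.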

Memory-Durable Protocols (Oblivious Recovery)
[Figure: a client sends update A=2 to the leader, which forwards it to two followers; every node holds A=1 and buffers A=2 in memory. A recovered node is ready immediately.]
Update
  committed once buffered on a majority
Recovery
  oblivious: doesn’t realize loss on failure
e.g., ZooKeeper with forceSync = false; practitioners do use this config!
Performant
But can lead to data loss

Data Loss Example in Oblivious Approach
[Figure: step-by-step cluster states for the scenario below.]
A=1 committed; two nodes slow or failed
A node holding A=1 crashes and recovers: it loses its data but is oblivious and immediately joins
Lagging nodes along with recovered node form majority; this majority does not know of the previously committed update
Committed update is lost
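
The scenario above can be replayed in a few lines. This is a simplified sketch under stated assumptions: the function name `oblivious_data_loss` is made up for illustration, nodes are list indices, and "memory" is a set of update strings that is wiped on a crash.

```python
def majority(n):
    return n // 2 + 1


def oblivious_data_loss(n=5):
    """Replay the slide's scenario; return True if the committed update is lost."""
    # Volatile memory of each node: the set of updates it has buffered.
    memory = [set() for _ in range(n)]

    # A=1 is buffered on a bare majority and acknowledged as committed;
    # the remaining nodes are slow or failed and never see it.
    holders = list(range(majority(n)))
    for node in holders:
        memory[node].add("A=1")

    # One holder crashes: its volatile memory is wiped.
    crashed = holders[0]
    memory[crashed] = set()

    # Oblivious recovery: the node rejoins immediately with empty state
    # and may team up with the lagging nodes.
    lagging = [node for node in range(n) if node not in holders]
    new_quorum = lagging + [crashed]

    # The update is lost if these nodes form a majority and none of
    # them knows about A=1.
    forms_majority = len(new_quorum) >= majority(n)
    return forms_majority and all("A=1" not in memory[node] for node in new_quorum)
```

With five nodes, the two lagging nodes plus the recovered node make three, which is a majority that has never seen A=1.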

Memory-Durable Protocols (Loss-Aware Recovery)
[Figure: a client sends update A=2 to the leader, which forwards it to two followers; every node holds A=1 and buffers A=2 in memory. A recovering node becomes ready only after it receives majority responses.]
Update
  committed once buffered on a majority
Recovery
  loss-aware: realizes loss, waits for majority responses
e.g., Viewstamped replication
Avoids loss (unlike oblivious) but can lead to unavailability

Unavailability Example in Loss-Aware Approach
[Figure: step-by-step cluster states for the scenario below.]
A=1 committed; two nodes crashed
Another node crashes and recovers: it cannot collect majority responses although a majority is up, so the system is unavailable
Failed nodes recover but stay in recovering
Unavailable even after all nodes recover
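
A small state model shows why the cluster stays stuck. It is a sketch, not the Viewstamped replication recovery protocol itself: `loss_aware_recovery` is a hypothetical helper, and it assumes that only nodes in the "ready" state can answer a recovery request.

```python
def loss_aware_recovery(states):
    """Return node states after recovering nodes try to rejoin.

    states: one of "ready", "recovering", or "down" per node.
    Assumption: a recovering node becomes ready only if a majority of
    the cluster is ready and can answer its recovery request.
    """
    majority = len(states) // 2 + 1
    if states.count("ready") >= majority:
        return ["ready" if s == "recovering" else s for s in states]
    return list(states)


def is_available(states):
    """The cluster can commit only with a majority of ready nodes."""
    return states.count("ready") >= len(states) // 2 + 1
```

For the slide's scenario, `["ready", "ready", "recovering", "down", "down"]` has three nodes up but only two that can respond, so the recovering node never becomes ready. When the two failed nodes come back they are also "recovering", and the state stays unchanged.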

Outline
Introduction
Distributed updates and crash recovery
Situation-aware updates and crash recovery
  SAUCR insights, guarantees, and overview
  situation-aware updates
  situation-aware crash recovery
Results
Summary and conclusion

SAUCR Intuition and Insight
Existing protocols are static in nature: do not adapt to failures
  Memory-durable: always buffer, even with many failures (poor reliability)
  Disk-durable: always persist, even when no failures (poor performance)
Insight: reacting to failures and adapting to situation can achieve reliability and performance
  common case, when many or all nodes are up (no or few failures): buffer in memory, as memory-durable does
  with failures, when only the minimum is up: flush to disk, as disk-durable does

Guarantees Depend upon Simultaneity of Failures
With non-simultaneous failures, a gap exists, so SAUCR can react and ensure durability
  independent: likelihood of many nodes failing together is negligible
  correlated: many nodes fail together
    although many nodes fail, not necessarily simultaneous; most cases, non-simultaneous
With simultaneous correlated failures, there is no gap, so SAUCR cannot react and remains unavailable
  We conjecture they are extremely rare: a gap exists between failures
    correlated but a few seconds apart [Ford et al., OSDI ‘10]
    analysis reveals a gap of 50 ms or more almost always
Most cases: any no. of independent and non-simultaneous correlated failures, same as disk-durable
Rare cases: more than a majority crash truly simultaneously, remain unavailable

SAUCR Overview
Updates
  when more than a majority up, buffer updates in memory (fast mode)
    e.g., 4 or 5 nodes up in a 5-node cluster
  when nodes fail and only a bare majority alive, flush to disk (slow mode)
    e.g., only 3 nodes up in a 5-node cluster
Crash Recovery
  when a node recovers from a crash, it recovers its data
    either from its disk (if crashed in slow mode)
    or from other nodes (if crashed in fast mode)

Situation-Aware Updates
[Figure: timeline of cluster states with leader L.]
all nodes up: fast mode, buffer updates
4 nodes up (more than majority): remain in fast mode
only majority up: switch to slow, flush to disk
commit subsequent updates in slow mode
one node recovers and catches up: switch to fast
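
The mode decision in this timeline reduces to comparing the number of live nodes with the bare majority. A minimal sketch, with `commit_mode` as a hypothetical name and the "unavailable" outcome added here for completeness:

```python
def bare_majority(cluster_size):
    return cluster_size // 2 + 1


def commit_mode(nodes_up, cluster_size):
    """Pick how updates are committed given how many nodes are up."""
    if nodes_up > bare_majority(cluster_size):
        return "fast"         # buffer updates in memory
    if nodes_up == bare_majority(cluster_size):
        return "slow"         # flush updates to disk
    return "unavailable"      # no majority, so nothing can commit


# Timeline from the slide for a 5-node cluster: 5, 4, 3, 3, then 4 nodes up.
timeline = [commit_mode(up, 5) for up in (5, 4, 3, 3, 4)]
```

The sketch ignores how failures are detected and how buffered data is flushed during the switch; it only captures the threshold.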

Failure Reaction
Basic failure-detection mechanism: heartbeats
  Follower failures: remain in fast mode or switch to slow mode
  Leader failures: on a missing heartbeat, followers flush to disk
  [Figure: leader and follower timelines; one label reads "steps down".]
Challenges: too many packets, spurious elections, too much data to flush
  Techniques in the paper …
Result: can react to failures even when they are only a few milliseconds apart, preserving durability and availability

Mode-Aware Crash Recovery
Disk-durable: always recover from disk
Memory-durable: always recover from other nodes (loss-aware)
SAUCR
  recover from local disk: ready immediately
  lost updates: recover from other nodes, ready after a bare minority (bare majority - 1) of responses

Intuition for why SAUCR’s recovery is safe
Assume update-A committed, S1 recovers and has seen A before crash
Safety condition: update-A must be recovered
  If update-A was committed in fast mode, then at least one node in any bare minority must contain update-A
  If update-A was committed in slow mode, S1 recovers from disk
Proof sketch in the paper …
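
The fast-mode case can be checked by brute force for small clusters. The sketch below rests on assumptions that the slide does not spell out: a fast-mode commit is taken to require the update on bare majority + 1 nodes (suggested by the later remark that SAUCR writes to one additional node), S1 is one of those nodes, and no other node loses the update in the meantime. The function name and the `extra` parameter are hypothetical.

```python
from itertools import combinations


def bare_majority(n):
    return n // 2 + 1


def fast_mode_recovery_is_safe(n, extra=1):
    """Check that every bare minority of the other nodes holds update-A.

    extra=1 models the assumed fast-mode quorum (bare majority + 1);
    extra=0 models a plain majority quorum, for comparison.
    """
    recovering = 0  # S1, which lost update-A in the crash
    others = [node for node in range(n) if node != recovering]
    bare_minority = bare_majority(n) - 1
    commit_size = bare_majority(n) + extra

    # S1 had seen A, so it was one of the holders; try every possible
    # choice of the remaining holders among the other nodes.
    for rest in combinations(others, commit_size - 1):
        holders = set(rest)
        # S1 may hear from any bare minority of the other nodes.
        for responders in combinations(others, bare_minority):
            if not holders.intersection(responders):
                return False  # some bare minority misses update-A
    return True
```

With the extra node, the check passes for 3, 5, and 7 nodes; with a plain majority quorum it fails for 5 nodes, since two responders can both be nodes that never saw the update.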

Outline
Introduction
Distributed updates and crash recovery
Situation-aware updates and crash recovery
Results
Summary and conclusion

Evaluation
We implement SAUCR in ZooKeeper
Compare SAUCR’s reliability and performance against
  disk-durable ZooKeeper (forceSync = true)
  memory-durable ZooKeeper (forceSync = false)
  viewstamped replication (ideal model)
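
Reliability Testing
Cluster crash-testing framework
  generates cluster-state sequences
  [Figure: lattice of cluster states for five nodes, from all nodes up (12345), through 1234, 1235, 1245, 1345, 2345 and smaller sets such as 123, 124, 125, 345, 12, 13, 14, 45, down to single nodes 1 2 3 4 5.]
How it works? Please see our paper …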
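
The states in the figure are the non-empty subsets of the node set, grouped by how many nodes are up. The sketch below only enumerates that lattice; it is not the paper's framework, and how the framework chains states into sequences is left to the paper. `cluster_states` is a hypothetical name.

```python
from itertools import combinations


def cluster_states(n):
    """List cluster states level by level, from all nodes up to one node up.

    Each state is a string naming the nodes that are up, e.g. "1245".
    """
    nodes = "".join(str(i) for i in range(1, n + 1))
    return [
        ["".join(up) for up in combinations(nodes, size)]
        for size in range(n, 0, -1)
    ]


levels = cluster_states(5)
```

Here `levels[0]` holds the single state with all five nodes up and `levels[1]` holds the five states with exactly one node down, in the same order as the figure.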

Reliability Results
                            Non-Simultaneous                   Simultaneous
System                      Correct  Unavailable  Data loss    Correct  Unavailable  Data loss
Memory-durable ZooKeeper    703      0            561          703      0            561
Viewstamped replication     217      1047         0            217      1047         0
Disk-durable ZooKeeper      1264     0            0            1264     0            0
SAUCR                       1264     0            0            1200     64           0
non-simultaneous: gap of 50 ms; simultaneous: no gap
memory-durable ZooKeeper silently loses data
viewstamped replication leads to permanent unavailability
SAUCR reacts to non-simultaneous failures: durable and available
other systems behave the same as in non-simultaneous cases
simultaneous: SAUCR by design remains unavailable in some cases

Macro-benchmark Performance: YCSB-load
[Figure: bar chart of throughput (KOps/s) on HDD and SSD for memory-durable ZooKeeper, SAUCR, and disk-durable ZooKeeper; annotations: 100x (HDD), 2.5x (SSD), 9% (both).]
Compared to disk-durable, both memory-durable and SAUCR are faster
SAUCR’s performance matches memory-durable ZooKeeper
  within 9% of memory-durable ZooKeeper even for write-intensive workloads
  overheads because SAUCR writes to one additional node

Summary
Replication protocols are an important foundation
  need to be performant, yet also provide high reliability
Dichotomy: disk-durable vs. memory-durable protocols
  unsavory choices: either performant or reliable
SAUCR: situation-aware updates and crash recovery
  provides both high performance and reliability

Conclusions
Paying careful attention to how failures occur
  can find approaches that provide both performance and reliability
  more data from real-world deployments?
Hybrid approach, an effective systems-design technique, applicable to distributed updates and recovery too
  worthwhile to look at other important protocols/systems where we make similar two-ends-of-the-spectrum tradeoffs?
Thank you!
Poster #6