Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems

Ramnatthan Alagappan, Aishwarya Ganesan, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau

OSDI ‘18

Replication Protocols

Viewstamped replication, Raft, Paxos

Foundation upon which datacenter systems are built

GFS, Colossus, BigTable

The Two Different Worlds of Replication

How and where to store system state?

World-1: Disk-durable
synchronously persist updates to disks
Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin …
safe but suffer from poor performance

World-2: Memory-durable
buffer updates only in volatile memory
Viewstamped replication, NOPaxos [OSDI ‘16], SpecPaxos [NSDI ’15] …
performant but risk unsafety or unavailability

Neither approach is ideal: reliable or performant

Can a protocol provide strong reliability while maintaining high performance?


SAUCR: Situation-Aware Updates and Crash Recovery

Simple insight: dynamically (based on the situation) decide how to commit updates

with many or all nodes up, buffer in memory (fast mode)
with failures, if only bare majority up, flush to disk (slow mode)

Strong reliability while maintaining high performance

Simultaneity of Failures

SAUCR’s effectiveness depends upon simultaneity of failures

independent and non-simultaneous correlated (gap of a few milliseconds to a few seconds)
can react and switch from fast to slow mode
preserves durability and availability

many truly simultaneous correlated
no gap and so cannot react
remain unavailable

however, existing data hints they are extremely rare: the Non-Simultaneity Conjecture

Results

Implemented in ZooKeeper

SAUCR improves reliability compared to memory-durable systems
durable and available in 100s of crash scenarios
memory-durable loses data or becomes unavailable

Improvements at no or little cost
overheads within 0%-9% of memory-durable systems

Compared to disk-durable
slight reduction in availability in extremely rare cases
improves performance: 2.5x on SSDs, 100x on HDDs

Outline

Introduction
Distributed updates and crash recovery: disk-durable protocols, memory-durable protocols
Situation-aware updates and crash recovery
Results
Summary and conclusion

Disk-Durable Protocols

[Figure: a leader and two followers, each holding A=1; a client sends the update A=2]

Update
A=2 reaches the leader and both followers; each node calls fsync
fsync completed on a majority? If so, Committed is returned to the client

Recovery
if ack’d anyone, data on disk: safe
recovery: just read from local disk, ready immediately
a lagging node (still at A=1) is also ready immediately

Safe and available

But poor performance due to fsync: 50x on HDDs, 2.5x on SSDs
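The disk-durable commit and recovery rules can be sketched in a few lines. This is a minimal illustration, not code from any of the systems named in the talk: the helper names (`persist`, `is_committed`, `recover`) and the one-record-per-line log format are assumptions made for the example.

```python
import os
import tempfile


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def persist(log_path: str, record: bytes) -> None:
    """Append a record and force it to stable storage before acknowledging."""
    with open(log_path, "ab") as log:
        log.write(record + b"\n")
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to push its cache to the device


def is_committed(fsync_acks: int, cluster_size: int) -> bool:
    """Disk-durable rule: committed once a majority has completed fsync."""
    return fsync_acks >= majority(cluster_size)


def recover(log_path: str) -> list[bytes]:
    """Recovery: just read the local log back from disk."""
    with open(log_path, "rb") as log:
        return log.read().splitlines()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "node1.log")
        persist(path, b"A=1")
        persist(path, b"A=2")
        print(recover(path))
        print(is_committed(fsync_acks=2, cluster_size=3))
```

The `os.fsync` call on every update is the step the slide blames for the slowdown; the memory-durable protocols that follow drop it.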


Memory-Durable Protocols (Oblivious Recovery)

[Figure: three nodes, each holding A=1 in memory; a client sends the update A=2]

Update
buffered on a majority? If so, Committed is returned to the client

Recovery
Oblivious: doesn’t realize loss on failure
the crashed node restarts with empty memory and is ready immediately

e.g., ZooKeeper with forceSync = false
practitioners do use this config!

Performant

But can lead to data loss

Data Loss Example in Oblivious Approach

[Figure: five-node cluster; three nodes hold A=1]

A=1 committed; two nodes slow or failed
one node that holds A=1 crashes, recovers
loses its data but oblivious: immediately joins
lagging nodes along with recovered node form majority; lose committed update
majority do not know of previously committed update
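The data-loss scenario can be replayed with a toy model. The sketch below is only an illustration under assumed names (`Node`, `quorum_knows`, nodes labeled S1 to S5); it models memory as a dictionary that is wiped on a crash and does not reflect any real system's code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    memory: dict = field(default_factory=dict)  # volatile: lost on a crash

    def crash_and_recover_obliviously(self) -> None:
        # Memory is wiped, and the node rejoins at once without noticing.
        self.memory = {}


def is_majority(group, cluster) -> bool:
    return len(group) >= len(cluster) // 2 + 1


def quorum_knows(group, key) -> bool:
    """True if at least one node in the group still holds the update."""
    return any(key in node.memory for node in group)


def data_loss_example() -> bool:
    """Return True if the committed update is visible to the new majority."""
    cluster = [Node(f"S{i}") for i in range(1, 6)]
    s1, s2, s3, s4, s5 = cluster
    for node in (s1, s2, s3):          # S4 and S5 are slow or failed
        node.memory["A"] = 1
    assert is_majority([s1, s2, s3], cluster)  # A=1 is committed
    s3.crash_and_recover_obliviously()
    new_quorum = [s3, s4, s5]          # recovered node plus lagging nodes
    assert is_majority(new_quorum, cluster)
    return quorum_knows(new_quorum, "A")


if __name__ == "__main__":
    print("committed update survives:", data_loss_example())
```

The new quorum is a legitimate majority, yet none of its members holds A=1, which is the silent loss the slide describes.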


Memory-Durable Protocols (Loss-Aware Recovery)

[Figure: three nodes, each holding A=1 in memory; a client sends the update A=2]

Update
A=2 is buffered in memory and Committed is returned to the client

Recovery
Loss-aware: realizes loss, waits for majority
recovering: wait for majority responses, then ready

e.g., Viewstamped replication

Avoids loss (unlike oblivious) but can lead to unavailability

Unavailability Example in Loss-Aware Approach

[Figure: five-node cluster; three nodes hold A=1]

A=1 committed; two nodes crashed
one node that holds A=1 crashes, recovers
cannot collect majority responses although majority up: unavailable
failed nodes recover, stay in recovering
unavailable even after all nodes recover
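The same scenario can be traced with a small state model. This is a simplified sketch, assuming three hypothetical node states (`ready`, `recovering`, `down`) and assuming that a recovering node becomes ready only when a majority of the cluster is ready to respond; real loss-aware protocols have more detail than this.

```python
def majority(cluster_size: int) -> int:
    return cluster_size // 2 + 1


def ready_count(states: dict) -> int:
    return sum(1 for state in states.values() if state == "ready")


def is_available(states: dict) -> bool:
    """The cluster can serve requests only with a majority of ready nodes."""
    return ready_count(states) >= majority(len(states))


def recovery_round(states: dict) -> dict:
    """Recovering nodes finish only if a majority of ready nodes can respond."""
    if ready_count(states) >= majority(len(states)):
        return {n: ("ready" if s == "recovering" else s) for n, s in states.items()}
    return dict(states)


def unavailability_example():
    # A=1 committed on S1..S3; S4 and S5 have crashed.
    states = {"S1": "ready", "S2": "ready", "S3": "ready",
              "S4": "down", "S5": "down"}
    history = [is_available(states)]
    states["S3"] = "recovering"                 # S3 crashes and restarts
    states = recovery_round(states)
    history.append(is_available(states))
    states["S4"] = states["S5"] = "recovering"  # failed nodes restart
    states = recovery_round(states)
    history.append(is_available(states))
    return history, states


if __name__ == "__main__":
    print(unavailability_example())
```

After the last step all five nodes are up, but only two are ready, so no recovering node can ever collect a majority of responses.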

Outline

Introduction
Distributed updates and crash recovery
Situation-aware updates and crash recovery: SAUCR insights, guarantees, and overview; situation-aware updates; situation-aware crash recovery
Results
Summary and conclusion

SAUCR Intuition and Insight

Existing protocols are static in nature: do not adapt to failures

Memory-durable: always buffer even with many failures (poor reliability)
Disk-durable: always persist even when no failures (poor performance)

Insight: reacting to failures and adapting to situation can achieve reliability and performance

when no or few failures, could buffer in memory
when failures arise, flush

common case, when many or all up: buffer in memory (memory-durable)
with failures, when only minimum up: flush to disk (disk-durable)

Guarantees Depend upon Simultaneity of Failures

With non-simultaneous failures, a gap exists; SAUCR can react and ensure durability
independent: likelihood of many nodes failing together is negligible
correlated: many nodes fail together
although many nodes fail, not necessarily simultaneous; most cases, non-simultaneous

With simultaneous correlated failures, there is no gap; SAUCR cannot react and is unavailable
We conjecture they are extremely rare: a gap exists between failures
correlated but a few seconds apart [Ford et al., OSDI ‘10]
analysis reveals a gap of 50 ms or more almost always

Most cases: any no. of independent and non-simultaneous correlated failures (same as disk-durable)
Rare cases: more than a majority crash truly simultaneously (remain unavailable)

SAUCR Overview

Updates
when more than a majority up, buffer updates in memory (fast mode)
e.g., 4 or 5 nodes up in a 5-node cluster
when nodes fail and only a bare majority alive, flush to disk (slow mode)
e.g., only 3 nodes up in a 5-node cluster

Crash Recovery
when a node recovers from a crash, it recovers its data
either from its disk (if crashed in slow mode)
or from other nodes (if crashed in fast mode)

Situation-Aware Updates

[Figure: timeline of a five-node cluster with leader L]

all nodes up: fast mode, buffer updates
4 nodes up (more than majority): remain in fast mode
only majority up: switch to slow, flush to disk
commit subsequent updates in slow mode
one node recovers and catches up: switch to fast
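The mode rule for a five-node cluster can be written as a small function. This is a sketch of the rule as stated on the slides, not SAUCR's implementation; the function name `choose_mode` and the `unavailable` label for fewer than a majority are assumptions made for illustration.

```python
def bare_majority(cluster_size: int) -> int:
    """Smallest majority, e.g. 3 in a 5-node cluster."""
    return cluster_size // 2 + 1


def choose_mode(nodes_up: int, cluster_size: int) -> str:
    """Pick how updates are committed given how many nodes are alive."""
    if nodes_up < bare_majority(cluster_size):
        return "unavailable"  # no majority: nothing can commit
    if nodes_up == bare_majority(cluster_size):
        return "slow"         # only a bare majority up: flush to disk
    return "fast"             # more than a majority up: buffer in memory


if __name__ == "__main__":
    # Timeline: all up, one fails, another fails, steady, one recovers.
    for up in (5, 4, 3, 3, 4):
        print(up, "nodes up ->", choose_mode(up, cluster_size=5))
```

Replaying the timeline gives fast, fast, slow, slow, fast, matching the sequence of switches listed above.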

Failure Reaction

Basic failure-detection mechanism: heartbeats

Follower failures
remain in fast mode, or switch to slow mode once only a bare majority is up

Leader failures
on a missing heartbeat, followers flush to disk

[Figure labels: Leader, steps down]

Challenges: too many packets, spurious elections, too much data to flush
Techniques in the paper …

Result: can react to failures even when they are only a few milliseconds apart, preserving durability and availability
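The follower side of the heartbeat rule might look like the sketch below. Everything here is hypothetical: the `Follower` class, the `heartbeat_timeout` parameter, and the in-memory `flushed` list that stands in for a real write followed by fsync. The techniques SAUCR actually uses to handle the listed challenges are in the paper and are not reproduced here.

```python
class Follower:
    """Buffers updates in memory and flushes them if the leader goes quiet."""

    def __init__(self, heartbeat_timeout: float):
        self.heartbeat_timeout = heartbeat_timeout  # seconds (assumed unit)
        self.last_heartbeat = 0.0
        self.buffer: list[str] = []   # volatile updates
        self.flushed: list[str] = []  # stand-in for data written and fsynced

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def on_update(self, update: str) -> None:
        self.buffer.append(update)

    def tick(self, now: float) -> bool:
        """Flush buffered updates if a heartbeat is overdue; report whether we did."""
        overdue = now - self.last_heartbeat > self.heartbeat_timeout
        if overdue and self.buffer:
            self.flushed.extend(self.buffer)
            self.buffer.clear()
            return True
        return False


if __name__ == "__main__":
    follower = Follower(heartbeat_timeout=0.05)
    follower.on_heartbeat(now=0.0)
    follower.on_update("A=2")
    print(follower.tick(now=0.03))  # heartbeat not yet overdue
    print(follower.tick(now=0.10))  # heartbeat missed: flush
```

Passing the clock in as `now` keeps the sketch deterministic; a real node would read a monotonic clock instead.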

Mode-Aware Crash Recovery

Disk-durable: always recover from disk
Memory-durable: always recover from other nodes (loss-aware)

SAUCR
crashed in slow mode: recover from local disk, ready immediately
crashed in fast mode: lost updates, recover from other nodes; ready after a bare minority (bare majority - 1) of responses
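The recovery decision reduces to a check on the mode a node crashed in and the number of responses it has collected. The sketch below uses assumed names (`bare_minority`, `is_ready`) and treats the mode as a plain string; how a node learns which mode it crashed in is not covered by the slides and is left out.

```python
def bare_majority(cluster_size: int) -> int:
    return cluster_size // 2 + 1


def bare_minority(cluster_size: int) -> int:
    """Bare majority minus one, e.g. 2 in a 5-node cluster."""
    return bare_majority(cluster_size) - 1


def is_ready(crashed_in_mode: str, responses: int, cluster_size: int) -> bool:
    """Decide whether a restarted node may rejoin the cluster."""
    if crashed_in_mode == "slow":
        return True  # data was flushed: recover from local disk at once
    if crashed_in_mode == "fast":
        # Buffered updates were lost: wait for a bare minority of other nodes.
        return responses >= bare_minority(cluster_size)
    raise ValueError(f"unknown mode: {crashed_in_mode}")


if __name__ == "__main__":
    print(is_ready("slow", responses=0, cluster_size=5))
    print(is_ready("fast", responses=1, cluster_size=5))
    print(is_ready("fast", responses=2, cluster_size=5))
```

Compared with the loss-aware rule, which waits for a majority, the fast-mode branch needs one response fewer, which is what lets a recovering node make progress when only a bare majority of the cluster is up.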

Intuition for why SAUCR’s recovery is safe

Assume update-A committed, S1 recovers and has seen A before crash

Safety condition: update-A must be recovered

If A was committed in fast mode, then at least one in any bare minority must contain update-A
If update-A was committed in slow mode, S1 recovers from disk

Proof sketch in the paper …
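The fast-mode claim can be checked by brute force for small clusters. The sketch assumes that a fast-mode commit requires the update on one node more than a bare majority (4 of 5), which is suggested by the later remark that SAUCR writes to one additional node but is an assumption here. It also ignores further crashes among the responders, so it illustrates the counting argument only and is no substitute for the proof in the paper.

```python
from itertools import combinations


def recovery_always_sees_update(cluster_size: int, holders_count: int) -> bool:
    """Check every placement of update-A and every bare minority of responders.

    Node 0 plays S1: it held A before crashing and now asks the others.
    """
    s1 = 0
    others = [node for node in range(cluster_size) if node != s1]
    bare_minority = cluster_size // 2  # bare majority - 1
    for rest in combinations(others, holders_count - 1):
        holders = set(rest) | {s1}
        for responders in combinations(others, bare_minority):
            if not holders.intersection(responders):
                return False  # some bare minority has never seen A
    return True


if __name__ == "__main__":
    # Assumed fast-mode commit size for 5 nodes: bare majority + 1 = 4.
    print(recovery_always_sees_update(cluster_size=5, holders_count=4))
    # With only a bare majority of holders, two lagging nodes could answer.
    print(recovery_always_sees_update(cluster_size=5, holders_count=3))
```

With four holders only one other node lacks A, so any two responders include a holder; with three holders the two lagging nodes can form the bare minority on their own.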

Outline

Introduction
Distributed updates and crash recovery
Situation-aware updates and crash recovery
Results
Summary and conclusion

Evaluation

We implement SAUCR in ZooKeeper
Compare SAUCR’s reliability and performance against
disk-durable ZooKeeper (forceSync = true)
memory-durable ZooKeeper (forceSync = false)
viewstamped replication (ideal model)

Reliability Testing

Cluster crash-testing framework
Generates cluster-state sequences

[Figure: lattice of cluster states, each labeled with the set of alive nodes: 12345; 1234, 1235, 1245, 1345, 2345; 123, 124, 125, …, 345; 12, 13, 14, …, 45; 1, 2, 3, 4, 5]

How it works? Please see our paper…

Reliability Results

                            Non-Simultaneous                 Simultaneous
Systems                     Correct  Unavailable  Data loss  Correct  Unavailable  Data loss
memory-durable zookeeper    703      0            561        703      0            561
viewstamped replication     217      1047         0          217      1047         0
disk-durable zookeeper      1264     0            0          1264     0            0
SAUCR                       1264     0            0          1200     64           0

non-simultaneous: gap of 50 ms, simultaneous: no gap
memory-durable zookeeper silently loses data
viewstamped replication leads to permanent unavailability
SAUCR reacts to non-simultaneous: durable and available
simultaneous: other systems behave the same as non-simultaneous cases
simultaneous: SAUCR by design remains unavailable in some cases

Macro-benchmark Performance: YCSB-load

[Figure: bar chart of throughput (KOps/s) on HDD and SSD for memory-durable ZooKeeper, SAUCR, and disk-durable ZooKeeper; annotations: 100x (HDD), 2.5x (SSD), 9% (HDD and SSD)]

Compared to disk-durable, both memory-durable and SAUCR are faster
SAUCR’s performance matches memory-durable ZooKeeper
within 9% of memory-durable ZooKeeper even for write-intensive workloads
overheads because SAUCR writes to one additional node

Summary

Replication protocols are an important foundation
need to be performant, yet also provide high reliability

Dichotomy: disk-durable vs. memory-durable protocols
unsavory choices: either performant or reliable

SAUCR: situation-aware updates and crash recovery
provides both high performance and reliability

Conclusions

Paying careful attention to how failures occur
can find approaches that provide both performance and reliability
more data from real-world deployments?

Hybrid approach, an effective systems-design technique, applicable to distributed updates and recovery too
worthwhile to look at other important protocols/systems where we make similar two-ends-of-the-spectrum tradeoffs?

Thank you!
Poster #6
