144
Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems Ramnatthan Alagappan, Aishwarya Ganesan, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau

VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

Fault-Tolerance, Fast and Slow:Exploiting Failure Asynchrony

in Distributed SystemsRamnatthan Alagappan, Aishwarya Ganesan, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau

Page 2: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Replication Protocols

1

Viewstamped replication RaftPaxos

Page 3: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Replication Protocols

1

GFSColossusBigTable

Foundation upon which datacenter systems are built

Viewstamped replication RaftPaxos

Page 4: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

The Two Different Worlds of Replication

2

Page 5: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

The Two Different Worlds of Replication

2

World-1 World-2

Page 6: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

The Two Different Worlds of Replication

2

How and where to store system state?World-1 World-2

Page 7: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

The Two Different Worlds of Replication

2

How and where to store system state?

synchronously persist updates to disks

Disk-durable

World-1 World-2

Page 8: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

buffer updates only in volatile memory

The Two Different Worlds of Replication

2

How and where to store system state?

synchronously persist updates to disks

Disk-durable

World-1

Memory-durable

World-2

Page 9: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

buffer updates only in volatile memory

The Two Different Worlds of Replication

2

How and where to store system state?

synchronously persist updates to disks

Disk-durable

World-1

Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin …

Memory-durable

World-2

Page 10: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

buffer updates only in volatile memory

The Two Different Worlds of Replication

2

How and where to store system state?

synchronously persist updates to disks

Disk-durable

World-1

Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin …

Memory-durable

World-2

Viewstamped replication, NOPaxos [OSDI ‘16],SpecPaxos [NSDI ’15] …

Page 11: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

buffer updates only in volatile memory

The Two Different Worlds of Replication

2

How and where to store system state?

synchronously persist updates to disks

Disk-durable

World-1

Neither approach is ideal: reliable or performant

Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin …

Memory-durable

World-2

Viewstamped replication, NOPaxos [OSDI ‘16],SpecPaxos [NSDI ’15] …

Page 12: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

buffer updates only in volatile memory

The Two Different Worlds of Replication

2

How and where to store system state?

synchronously persist updates to disks

Disk-durable

World-1

Neither approach is ideal: reliable or performant

Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin …

Memory-durable

World-2

Viewstamped replication, NOPaxos [OSDI ‘16],SpecPaxos [NSDI ’15] …

safe but suffer from poor performance

Page 13: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

buffer updates only in volatile memory

The Two Different Worlds of Replication

2

How and where to store system state?

synchronously persist updates to disks

Disk-durable

World-1

Neither approach is ideal: reliable or performant

Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin …

Memory-durable

World-2

Viewstamped replication, NOPaxos [OSDI ‘16],SpecPaxos [NSDI ’15] …

safe but suffer from poor performance

performant but riskunsafety or unavailability

Page 14: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Can a protocol provide strong reliability while maintaining high performance?

3

Page 15: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

SAUCR:Situation-Aware Updates and Crash Recovery

4

Page 16: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

SAUCR:Situation-Aware Updates and Crash Recovery

Simple insight: dynamically (based on the situation) decide how to commit updates

4

Page 17: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

SAUCR:Situation-Aware Updates and Crash Recovery

Simple insight: dynamically (based on the situation) decide how to commit updates

with many or all nodes up, buffer in memory – fast modewith failures, if only bare majority up, flush to disk – slow mode

4

Page 18: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

SAUCR:Situation-Aware Updates and Crash Recovery

Simple insight: dynamically (based on the situation) decide how to commit updates

with many or all nodes up, buffer in memory – fast modewith failures, if only bare majority up, flush to disk – slow mode

Strong reliability while maintaining high performance

4

Page 19: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Simultaneity of Failures

SAUCR’s effectiveness depends upon simultaneity of failures

5

Page 20: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Simultaneity of Failures

SAUCR’s effectiveness depends upon simultaneity of failuresindependent and non-simultaneous correlated (gap of a few milliseconds to a few seconds)

can react and switch from fast to slow modepreserves durability and availability

5

Page 21: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Simultaneity of Failures

SAUCR’s effectiveness depends upon simultaneity of failuresindependent and non-simultaneous correlated (gap of a few milliseconds to a few seconds)

can react and switch from fast to slow modepreserves durability and availability

many truly simultaneous correlatedno gap and so cannot reactremain unavailable

5

Page 22: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Simultaneity of Failures

SAUCR’s effectiveness depends upon simultaneity of failuresindependent and non-simultaneous correlated (gap of a few milliseconds to a few seconds)

can react and switch from fast to slow modepreserves durability and availability

many truly simultaneous correlatedno gap and so cannot reactremain unavailable

however, existing data hints they are extremely rare –the Non-Simultaneity Conjecture

5

Page 23: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Results

6

Page 24: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Results

Implemented in ZooKeeper

6

Page 25: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Results

Implemented in ZooKeeperSAUCR improves reliability compared to memory-durable systems

durable and available in 100s of crash scenariosmemory-durable loses data or becomes unavailable

6

Page 26: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Results

Implemented in ZooKeeperSAUCR improves reliability compared to memory-durable systems

durable and available in 100s of crash scenariosmemory-durable loses data or becomes unavailable

Improvements at no or little costoverheads within 0%-9% of memory-durable systems

6

Page 27: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Results

Implemented in ZooKeeperSAUCR improves reliability compared to memory-durable systems

durable and available in 100s of crash scenariosmemory-durable loses data or becomes unavailable

Improvements at no or little costoverheads within 0%-9% of memory-durable systems

Compared to disk-durableslight reduction in availability in extremely rare casesimproves performance – 2.5x on SSDs, 100x on HDDs

6

Page 28: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Outline

IntroductionDistributed updates and crash recovery

disk-durable protocolsmemory-durable protocols

Situation-aware updates and crash recoveryResultsSummary and conclusion

7

Page 29: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

Page 30: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

Leader Follower Follower

Update

Page 31: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1

Leader Follower Follower

Update

Page 32: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1

A=2

Leader Follower Follower

Client

Update

Page 33: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1

A=2

Leader Follower Follower

Client

Update

A=2A=2

Page 34: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1

A=2

Leader Follower Follower

Client

Update

A=2 A=2

Page 35: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

Client

Update

fsync

A=2 A=2

fsync

Page 36: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

fsync completed on a majority ?

Update

fsync

A=2 A=2

fsync

Page 37: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

fsync completed on a majority ?

Update

fsync

A=2 A=2

fsync

Page 38: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

fsync completed on a majority ?

if ack’d anyone, data on disk – safe Update

fsync

A=2 A=2

fsync

Page 39: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

fsync completed on a majority ?

A=1 A=2

if ack’d anyone, data on disk – safe Update

fsync

A=2 A=2

fsync

Page 40: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

fsync completed on a majority ?

A=1 A=2

if ack’d anyone, data on disk – safe Update

fsync

A=2 A=2

fsync

Page 41: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

fsync completed on a majority ?

A=1 A=2

if ack’d anyone, data on disk – safe Update

fsync

A=2 A=2

fsync

Page 42: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

fsync completed on a majority ?

A=1 A=2

ready

imm

edia

te

if ack’d anyone, data on disk – safe

A=1 A=2

Update

fsync

A=2 A=2

fsync recovery: just read from local disk

Page 43: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

fsync completed on a majority ?

A=1 A=2

ready

imm

edia

te

if ack’d anyone, data on disk – safe

A=1 A=2

Update

fsync

A=2 A=2

fsync recovery: just read from local disk

A=1

ready

imm

edia

te

A=1

lagging

Page 44: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

fsync completed on a majority ?

A=1 A=2

ready

imm

edia

te

if ack’d anyone, data on disk – safe

A=1 A=2

Update

fsync

A=2 A=2

Safe and available

fsync recovery: just read from local disk

A=1

ready

imm

edia

te

A=1

lagging

Page 45: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-Durable Protocols

8

A=1 A=1 A=1A=2

Leader Follower Follower

fsync

ClientCommitted

Recovery

But poor performance due to fsync – 50x on HDDs, 2.5x on SSDs

fsync completed on a majority ?

A=1 A=2

ready

imm

edia

te

if ack’d anyone, data on disk – safe

A=1 A=2

Update

fsync

A=2 A=2

Safe and available

fsync recovery: just read from local disk

A=1

ready

imm

edia

te

A=1

lagging

Page 46: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 9

Leader Follower Follower

Client

Update

Memory

A=1

Memory

A=1

Memory

A=1

A=2

Memory-Durable Protocols (Oblivious Recovery)

Page 47: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 9

Leader Follower Follower

ClientCommitted

Update

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

Memory-Durable Protocols (Oblivious Recovery)

Page 48: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 9

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

Memory-Durable Protocols (Oblivious Recovery)

Page 49: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 9

Leader Follower Follower

ClientCommitted

RecoveryUpdateOblivious: doesn’t realize loss on failure

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

Memory-Durable Protocols (Oblivious Recovery)

Page 50: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Memory

9

Leader Follower Follower

ClientCommitted

RecoveryUpdateOblivious: doesn’t realize loss on failure

A=1 A=2

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

Memory-Durable Protocols (Oblivious Recovery)

Page 51: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Memory

9

Leader Follower Follower

ClientCommitted

RecoveryUpdateOblivious: doesn’t realize loss on failure

A=1 A=2

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

Memory-Durable Protocols (Oblivious Recovery)

Page 52: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Memory

9

Leader Follower Follower

ClientCommitted

RecoveryUpdateOblivious: doesn’t realize loss on failure

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

Memory-Durable Protocols (Oblivious Recovery)

Page 53: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

MemoryMemory

9

Leader Follower Follower

ClientCommitted

RecoveryUpdateOblivious: doesn’t realize loss on failure

ready

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

imm

edia

te

Memory-Durable Protocols (Oblivious Recovery)

Page 54: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

MemoryMemory

9

Leader Follower Follower

ClientCommitted

RecoveryUpdateOblivious: doesn’t realize loss on failure

ready

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

imm

edia

te

Memory-Durable Protocols (Oblivious Recovery)

e.g., ZooKeeper with forceSync = falsepractitioners do use this config!

Page 55: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

MemoryMemory

9

Leader Follower Follower

ClientCommitted

RecoveryUpdateOblivious: doesn’t realize loss on failure

ready

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

imm

edia

te

Memory-Durable Protocols (Oblivious Recovery)

Performant

e.g., ZooKeeper with forceSync = falsepractitioners do use this config!

Page 56: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

But can lead to data loss

MemoryMemory

9

Leader Follower Follower

ClientCommitted

RecoveryUpdateOblivious: doesn’t realize loss on failure

ready

Memory

A=1

Memory

A=1

Memory

A=1A=2 A=2 A=2

buffered on a majority ?

imm

edia

te

Memory-Durable Protocols (Oblivious Recovery)

Performant

e.g., ZooKeeper with forceSync = falsepractitioners do use this config!

Page 57: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

Page 58: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

Page 59: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committed

Page 60: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committedtwo nodes slow or failed

Page 61: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committed

A=1

A=1

A=1

two nodes slow or failed

Page 62: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committed

A=1

A=1

A=1

crashestwo nodes slow or failed

Page 63: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committed

A=1

A=1

A=1

crashestwo nodes slow or failed

, recovers

Page 64: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committed

A=1

A=1

crashestwo nodes slow or failed

, recoversloses its data but oblivious:

immediately joins

Page 65: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committed

A=1

A=1

crashestwo nodes slow or failed

, recoversloses its data but oblivious:

immediately joins

A=1

A=1

Page 66: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committed

A=1

A=1

crashes

A=1

A=1

two nodes slow or failed

, recoversloses its data but oblivious:

immediately joins

Page 67: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Data Loss Example in Oblivious Approach

10

A=1

A=1

A=1

A=1 committed

A=1

A=1

crashes

A=1

A=1

two nodes slow or failed

, recoversloses its data but oblivious:

immediately joins

lagging nodes along with recovered node form majority;

lose committed update

majority do not know

of previously committed update

Page 68: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 11

Leader Follower Follower

ClientCommitted

Update

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

Memory-Durable Protocols (Loss-Aware Recovery)

A=2

Page 69: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

Memory-Durable Protocols (Loss-Aware Recovery)

A=2

Page 70: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Loss-aware: realizes loss, waits for majority

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

Memory-Durable Protocols (Loss-Aware Recovery)

A=2

Page 71: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Memory

11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Loss-aware: realizes loss, waits for majority

A=1 A=2

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

Memory-Durable Protocols (Loss-Aware Recovery)

A=2

Page 72: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Memory

11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Loss-aware: realizes loss, waits for majority

A=1 A=2

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

Memory-Durable Protocols (Loss-Aware Recovery)

A=2

Page 73: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Memory

11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Loss-aware: realizes loss, waits for majority

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

Memory-Durable Protocols (Loss-Aware Recovery)

A=2

Page 74: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Loss-aware: realizes loss, waits for majority

recoveringwait for majority

responsesMemory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

Memory-Durable Protocols (Loss-Aware Recovery)

A=2

Page 75: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Memory

11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Loss-aware: realizes loss, waits for majority

recoveringwait for majority

responses

ready

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

maj

ority

re

spon

ses

Memory-Durable Protocols (Loss-Aware Recovery)

A=1 A=2

A=2

Page 76: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Memory

11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Loss-aware: realizes loss, waits for majority

recoveringwait for majority

responses

ready

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

maj

ority

re

spon

ses

Memory-Durable Protocols (Loss-Aware Recovery)

A=1 A=2

A=2 e.g., Viewstamped replication

Page 77: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Avoids loss (unlike oblivious) but can lead to unavailability

Memory

11

Leader Follower Follower

ClientCommitted

RecoveryUpdate

Loss-aware: realizes loss, waits for majority

recoveringwait for majority

responses

ready

Memory

A=1

Memory

A=1

Memory

A=1A=2A=2

buffered on a majority ?

maj

ority

re

spon

ses

Memory-Durable Protocols (Loss-Aware Recovery)

A=1 A=2

A=2 e.g., Viewstamped replication

Page 78: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

Page 79: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

A=1

A=1

A=1

A=1 committedtwo nodes crashed

Page 80: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

A=1

A=1

A=1

A=1 committed

A=1

A=1

A=1

two nodes crashed

Page 81: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

A=1

A=1

A=1

A=1 committed

A=1

A=1

A=1

crashestwo nodes crashed

Page 82: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

A=1

A=1

A=1

A=1 committed

A=1

A=1

A=1

crashestwo nodes crashed

, recovers

Page 83: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

A=1

A=1

A=1

A=1 committed

A=1

A=1

crashestwo nodes crashed

, recoverscannot collect

majority responsesalthough majority up – unavailable

Page 84: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

A=1

A=1

A=1

A=1 committed

A=1

A=1

crashestwo nodes crashed

, recoverscannot collect

majority responsesalthough majority up – unavailable

A=1

A=1

Page 85: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

A=1

A=1

A=1

A=1 committed

A=1

A=1

crashestwo nodes crashed

, recoverscannot collect

majority responsesalthough majority up – unavailable

A=1

A=1

failed nodes recover

Page 86: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Unavailability Example in Loss-Aware Approach

12

A=1

A=1

A=1

A=1 committed

A=1

A=1

crashestwo nodes crashed

, recoverscannot collect

majority responsesalthough majority up – unavailable

A=1

A=1

failed nodes recoverstay in recovering

unavailable even after all nodes recover

Page 87: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Outline

IntroductionDistributed updates and crash recoverySituation-aware updates and crash recovery

SAUCR insights, guarantees, and overviewsituation-aware updatessituation-aware crash recovery

ResultsSummary and conclusion

13

Page 88: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18 14

SAUCR Intuition and Insight

Page 89: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Existing protocols are static in nature: do not adapt to failures

14

SAUCR Intuition and Insight

Page 90: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Existing protocols are static in nature: do not adapt to failures

14

buffer even with many failures

Memory-durable

always

poor reliability

SAUCR Intuition and Insight

Page 91: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Existing protocols are static in nature: do not adapt to failures

14

buffer even with many failures

Memory-durable

persist even when no failures

Disk-durable

alwaysalways

poor reliability poor performance

SAUCR Intuition and Insight

Page 92: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Existing protocols are static in nature: do not adapt to failures

Insight: reacting to failures and adapting to situation can achieve reliability and performance

14

buffer even with many failures

Memory-durable

persist even when no failures

Disk-durable

alwaysalways

poor reliability poor performance

SAUCR Intuition and Insight

Page 93: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Existing protocols are static in nature: do not adapt to failures

Insight: reacting to failures and adapting to situation can achieve reliability and performance

when no or few failures could buffer in memory

14

buffer even with many failures

Memory-durable

persist even when no failures

Disk-durable

alwaysalwayscommon case

when many or all up

buffer in memory

Memory-durable

poor reliability poor performance

SAUCR Intuition and Insight

Page 94: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Existing protocols are static in nature: do not adapt to failures

Insight: reacting to failures and adapting to situation can achieve reliability and performance

when no or few failures could buffer in memorywhen failure arise, flush

14

buffer even with many failures

Memory-durable

persist even when no failures

Disk-durable

alwaysalwayswith failures

when only minimum upcommon case

when many or all up

buffer in memory

flush to disk

Memory-durable Disk-durable

poor reliability poor performance

SAUCR Intuition and Insight

Page 95: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Guarantees Depend upon Simultaneity of Failures

15

Page 96: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Guarantees Depend upon Simultaneity of Failures

15

With non-simultaneous, gap exists, SAUCR can react and ensures durability

Page 97: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Guarantees Depend upon Simultaneity of Failures

15

With non-simultaneous, gap exists, SAUCR can react and ensures durabilityindependent: likelihood of many nodes failing together is negligible

Page 98: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Guarantees Depend upon Simultaneity of Failures

15

With non-simultaneous, gap exists, SAUCR can react and ensures durabilityindependent: likelihood of many nodes failing together is negligiblecorrelated: many nodes fail together

although many nodes fail, not necessarily simultaneous; most cases, non-simultaneous

Page 99: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Guarantees Depend upon Simultaneity of Failures

15

With non-simultaneous, gap exists, SAUCR can react and ensures durabilityindependent: likelihood of many nodes failing together is negligiblecorrelated: many nodes fail together

although many nodes fail, not necessarily simultaneous; most cases, non-simultaneous

With simultaneous correlated, no gap, SAUCR cannot react, unavailable

Page 100: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Guarantees Depend upon Simultaneity of Failures

15

With non-simultaneous, gap exists, SAUCR can react and ensures durabilityindependent: likelihood of many nodes failing together is negligiblecorrelated: many nodes fail together

although many nodes fail, not necessarily simultaneous; most cases, non-simultaneous

With simultaneous correlated, no gap, SAUCR cannot react, unavailable

We conjecture they are extremely rare: a gap exists between failurescorrelated but a few seconds apart [Ford et al., OSDI ‘10]analysis reveals a gap of 50 ms or more almost always

Page 101: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Guarantees Depend upon Simultaneity of Failures

15

With non-simultaneous, gap exists, SAUCR can react and ensures durabilityindependent: likelihood of many nodes failing together is negligiblecorrelated: many nodes fail together

although many nodes fail, not necessarily simultaneous; most cases, non-simultaneous

With simultaneous correlated, no gap, SAUCR cannot react, unavailable

We conjecture they are extremely rare: a gap exists between failurescorrelated but a few seconds apart [Ford et al., OSDI ‘10]analysis reveals a gap of 50 ms or more almost always

Most cases: any no. of independent and non-simultaneous correlated – same as disk-durableRare cases: more than a majority crash truly simultaneously – remain unavailable

Page 102: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

SAUCR Overview

16

Page 103: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

SAUCR Overview

Updateswhen more than a majority up, buffer updates in memory – fast mode

e.g., 4 or 5 nodes up in a 5-node cluster

16

Page 104: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

SAUCR Overview

Updateswhen more than a majority up, buffer updates in memory – fast mode

e.g., 4 or 5 nodes up in a 5-node cluster

When nodes fail and only a bare majority alive, flush to disk – slow modee.g., only 3 nodes up in a 5-node cluster

16

Page 105: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

SAUCR Overview

Updateswhen more than a majority up, buffer updates in memory – fast mode

e.g., 4 or 5 nodes up in a 5-node cluster

When nodes fail and only a bare majority alive, flush to disk – slow modee.g., only 3 nodes up in a 5-node cluster

Crash Recoverywhen a node recovers from a crash, it recovers its data

either from its disk (if crashed in slow mode)or from other nodes (if crashed in fast mode)

16

Page 106: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

Page 107: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L

all nodes up

Page 108: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L

all nodes upfast mode -

buffer updates

Page 109: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L

all nodes up 4 nodes up (more than majority)fast mode -

buffer updates

Page 110: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L

all nodes up 4 nodes up (more than majority)fast mode -

buffer updates remain infast mode

Page 111: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L L

all nodes up 4 nodes up (more than majority)

only majority upfast mode -

buffer updates remain infast mode

Page 112: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L L

all nodes up 4 nodes up (more than majority)

only majority upfast mode -

buffer updates remain infast mode

switch to slow, flush to disk

Page 113: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L L L

all nodes up 4 nodes up (more than majority)

only majority upfast mode -

buffer updates remain infast mode

switch to slow, flush to disk

Page 114: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L L L

all nodes up 4 nodes up (more than majority)

only majority up commitsubsequent

updates in slow mode

fast mode -buffer updates remain in

fast mode

switch to slow, flush to disk

Page 115: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L L L L

all nodes up 4 nodes up (more than majority)

only majority up commitsubsequent

updates in slow mode

one node recoversand catches up;fast mode -

buffer updates remain infast mode

switch to slow, flush to disk

Page 116: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L L L L

all nodes up 4 nodes up (more than majority)

only majority up commitsubsequent

updates in slow mode

one node recoversand catches up;fast mode -

buffer updates remain infast mode

switch to slow, flush to disk

Page 117: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L L L L

all nodes up 4 nodes up (more than majority)

only majority up commitsubsequent

updates in slow mode

one node recoversand catches up;fast mode -

buffer updates remain infast mode

switch to slow, flush to disk

Page 118: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Situation-Aware Updates

17

L L L L L

all nodes up 4 nodes up (more than majority)

only majority up commitsubsequent

updates in slow mode

one node recoversand catches up;fast mode -

buffer updates remain infast mode

switch to slow, flush to disk switch to fast

Page 119: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Failure Reaction

32

Basic failure-detection mechanism: heartbeats

remain in fast mode

Follower failures

switch to slow mode

Leader failures

on a missing heartbeat, followers flush to disk

Leader Leader

Challenges: too many packets, spurious elections, too much data to flushTechniques in the paper …

Result: can react to failures even when they are only a few milliseconds apart, preserving durability and availability

steps down

Page 120: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Mode-Aware Crash Recovery

19

Page 121: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

Page 122: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

Page 123: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

Page 124: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

Page 125: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

Page 126: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

recover from local disk

Page 127: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

recover from local disk

ready

immediate

Page 128: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

recover from local disk

ready

immediate

Page 129: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

recover from local disk

ready

lost updates

recover from other nodes

immediate

Page 130: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

recover from local disk

ready

lost updates

recover from other nodes

a bare minority (bare majority - 1)responses

immediate

Page 131: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Disk-durable: always recover from diskMemory-durable: always recover from other nodes (loss-aware)

Mode-Aware Crash Recovery

19

SAUCR

recover from local disk

ready

lost updates

recover from other nodes

ready

a bare minority (bare majority - 1)responses

immediate

Page 132: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Intuition for why SAUCR’s recovery is safe

20

Page 133: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Intuition for why SAUCR’s recovery is safe

20

Assume update-A committed, S1 recovers and has seen A before crashS1

Page 134: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Intuition for why SAUCR’s recovery is safe

20

Assume update-A committed, S1 recovers and has seen A before crash

Safety condition: update-A must be recovered

S1

Page 135: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Intuition for why SAUCR’s recovery is safe

20

Assume update-A committed, S1 recovers and has seen A before crash

Safety condition: update-A must be recoveredIf A was committed in fast mode, then at least one in any bare minority must contain update-A

AAA

S1

Page 136: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Intuition for why SAUCR’s recovery is safe

20

Assume update-A committed, S1 recovers and has seen A before crash

Safety condition: update-A must be recoveredIf A was committed in fast mode, then at least one in any bare minority must contain update-AIf update-A was committed in slow mode, S1 recovers from disk

A AA

S1

Page 137: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Intuition for why SAUCR’s recovery is safe

20

Assume update-A committed, S1 recovers and has seen A before crash

Safety condition: update-A must be recoveredIf A was committed in fast mode, then at least one in any bare minority must contain update-AIf update-A was committed in slow mode, S1 recovers from diskProof sketch in the paper …

A AA

S1

Page 138: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Outline

IntroductionDistributed updates and crash recoverySituation-aware updates and crash recoveryResultsSummary and conclusion

51

Page 139: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Evaluation

We implement in SAUCR in ZooKeeperCompare SAUCR’s reliability and performance against

disk-durable ZooKeeper (forceSync = true)memory-durable ZooKeeper (forceSync = false)viewstamped replication (ideal model)

52

Page 140: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Reliability Testing

53

12345

1234 1235 1245 1345 2345

123 124 345…12

1 2 3 4 5

125

4514 …

13

Cluster crash-testing frameworkGenerates cluster-state sequences

How it works?Please see our paper…

Page 141: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

703 0

217 1047

1264 0 0

1264 0 0

Reliability Results

54

Correct Unavailable Data loss Correct Unavailable Data loss

703 0 561

217 1047 0

1264 0 0

Systemsmemory-durable

zookeeperviewstampedreplication

disk-durablezookeeperSAUCR 1200 64 0

Simultaneous

non-simultaneous: gap of 50 ms, simultaneous: no gap memory-durable zookeeper silently loses dataviewstamped replication leads to permanent unavailabilitySAUCR reacts to non-simultaneous – durable and availableother systems behave the same as non-simultaneous casessimultaneous: SAUCR by design remains unavailable in some cases

561

0

Non-Simultaneous

Page 142: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Macro-benchmark Performance: YCSB-load

55

Compared to disk-durable, both memory-durable and SAUCR are faster SAUCR’s performance matches memory-durable ZooKeeper

within 9% of memory-durable Zookeeper even for write-intensive workloadsoverheads because SAUCR writes to one additional node

0

5

10

15

20

25

HDD SSD

Thr

ough

put

(KO

ps/s

)

Memory-durable ZooKeeper SAUCR Disk-durable ZooKeeper

100x

2.5x

9%9%

Page 143: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Summary

26

Replication protocols are an important foundationneed to be performant, yet also provide high reliability

Dichotomy: disk-durable vs. memory-durable protocolsunsavory choices: either performant or reliable

SAUCR – situation-aware updates and crash recoveryprovides both high performance and reliability

Page 144: VWHPV - research.cs.wisc.edu26', ¶ exiihu xsgdwhv rqo\ lq yrodwloh phpru\ 7kh 7zr 'liihuhqw :ruogv ri 5hsolfdwlrq +rz dqg zkhuh wr vwruh v\vwhp vwdwh" v\qfkurqrxvo\ shuvlvw

OSDI ‘18

Conclusions

Paying careful attention to how failures occurcan find approaches that provide both performance and reliability more data from real-world deployments?

Hybrid approach – an effective systems-design technique –applicable to distributed updates and recovery too

worthwhile to look at other important protocols/systems where we make similar two-ends-of-the-spectrum tradeoffs?

Thank you!Poster #6

27