Gray Failure: The Achilles’ Heel of Cloud-Scale Systems Ryan Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, Randolph Yao




Outline

Background & the gray failure problem

Real-world gray failure cases in Azure

A model and a definition for gray failure

differential observability

Potential future directions


Rapid Growth of Cloud System Infra

» Software user shift

direct: e.g., Office 365, Google Drive

indirect: e.g., Netflix on AWS

» Workload diversity

website, workflow, big data, machine learning

» Internal composition

more data centers, larger clusters, special h/w

containerization, micro-services


Demanding Requirement on Availability

» Users intolerant of service downtime

» Failure more costly

SLA violation, reputation hit, customer loss, engineering resource waste

» New availability bar

3 nines to 5 or 6 nines


Key: Embrace Fault-tolerance!

Rich history since 1950s

Steps: Redundancy, Correctness, Decomposition

Building block: process pair, RAID, primary/backup, state machine replication, transaction, checkpoint, chain replication, virtual synchrony, Paxos, Zab, PBFT, Gossip, Zyzzyva, N-version, 2PC, triple modular redundancy, erasure coding


Status Quo

» By and large, the efforts paid off

many faults successfully detected, tolerated, and repaired every day

few global outages

99.9% is achievable

» But moving forward…

simplistic assumptions start to break

reasoning about availability becomes hard

frequent bizarre phenomena in production

99.999% and beyond is challenging

[Figure: trend sketch labeled "scale & complexity" and "availability?"]

A common theme: the overlooked gray failure problem

The Elephant in the Cloud - Gray Failure

Fail-stop: A component either works correctly or stops
(process pair, RAID, primary backup, chain replication, 2PC, TMR, erasure coding, Paxos, virtual synchrony, Zab)

Byzantine: A component may behave arbitrarily
(PBFT, Zyzzyva, UpRight, Q/U, BAR Gossip, Aliph)

Gray failure: A component appears to be still working but is in fact experiencing severe issues

symptom

• subtle and ambiguous: e.g., switch random packet loss, non-fatal exceptions, memory thrashing, flaky disk I/O, overload, ...

occurrence

• across s/w and h/w stack in the infra due to various defects

• behind most service incidents we’ve seen in Azure

danger

• fault-tolerance ineffective or counterproductive

• faults take engineers & designers huge efforts to nail down

• teams play the blame game with each other


Real-world Gray Failure Cases in Azure

Case I: Redundancy in Datacenter Network (1)

[Figure: datacenter network with Core, Aggregation, and ToR layers; nodes A and B; requests r1 and r2; one element marked "crash"]

increasing # of core switches helps with availability

Case I: Redundancy in Datacenter Network (2)

Workload: single round trip

[Figure: same topology with Core, Aggregation, and ToR layers; nodes A and B; requests r1 and r2; one element marked "random packet drop", labeled p]

• packets will not be re-routed: application glitches or increased latency

• increasing # of core switches may not affect chance of being affected

Case I: Redundancy in Datacenter Network (3)

Workload: send multiple requests, wait for all to finish (e.g., search)

[Figure: topology with Core, Aggregation, and ToR layers; nodes A to E; requests r1 to r4; one element marked "random packet drop"]

• high chance to involve every core switch for each front-end request

• gray failure at any core switch will cause delay

• more core switches, worse tail latencies: $1 - (1 - p)^n$
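
The expression on this slide can be checked numerically. The following is a minimal sketch, assuming that p is the probability that any one core switch is in a gray-failure state (randomly dropping packets), that switches misbehave independently, and that a wait-for-all request touches all n core switches. The function name and the sample value of p are illustrative and do not come from the talk.

```python
def prob_request_affected(p: float, n: int) -> float:
    """Probability that at least one of n core switches is misbehaving.

    Assumption: each switch is independently in a gray-failure state with
    probability p, and a fan-out request touches every core switch.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Illustrative value only: a 1% chance that a given switch is misbehaving.
for n in (1, 4, 16, 64):
    print(f"n={n:3d}  P(affected)={prob_request_affected(0.01, n):.4f}")
```

Under these assumptions the value grows with n, which matches the slide's point: adding core switches raises the chance that a wait-for-all request hits a bad one.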

Case II: Failure Detector in Compute Service

[Figure components: Fabric Controller (Primary), Fabric Controller (Replica), Network, Physical Node (Host OS, Host Agent, Hypervisor, vSwitch), VM0 to VM3 each with a Guest Agent, Role Instances]

Hierarchical agents to catch failure in different layers

• crash case: crash! → "VM3 dead" → reboot

• gray failure case: connectivity issue → "VM3 good" → No action needed, but: Can’t SSH/RDP
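
The two scenarios on this slide can be sketched in a few lines. This is a toy model, not Azure's implementation: it assumes the controller decides from the guest agent's heartbeat alone, and that the heartbeat travels on a path that does not exercise the faulty network component. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    guest_agent_alive: bool   # what the heartbeat path can see
    network_reachable: bool   # what a user connecting over SSH/RDP depends on


def controller_decision(vm: VM) -> str:
    """Hypothetical policy: act only on the guest agent's heartbeat."""
    return "no action needed" if vm.guest_agent_alive else "reboot"


def user_can_connect(vm: VM) -> bool:
    """A user needs both a running guest and a working network path."""
    return vm.guest_agent_alive and vm.network_reachable


# Crash case: the heartbeat stops, so the detector reacts.
crashed = VM("VM3", guest_agent_alive=False, network_reachable=True)
# Gray failure case: the heartbeat is fine, but the VM is unreachable.
gray = VM("VM3", guest_agent_alive=True, network_reachable=False)

for vm in (crashed, gray):
    print(controller_decision(vm), "| user can connect:", user_can_connect(vm))
```

In the second case the detector and the user disagree about the same VM, which is the situation the slide labels "VM3 good" alongside "Can’t SSH/RDP".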

Case III: Recovery in Storage Service

[Figure components: Front Ends, Stream Manager, Extent Nodes (EN) EN1 to EN5, …]

Sequence shown across the slide builds:

• low free blocks

• "EN1, EN2, EN3 healthy"

• write → crash

• "EN3 is down"

• low free blocks

• "EN2, EN3, EN4 healthy"

• write → crash

• "EN3 is down"

• "EN3 is broken"; EN3 no longer appears among the extent nodes (EN1 EN2 EN4 EN5 …)

• re-replication, fragmentation
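
The crash-and-recover loop in this case can be imitated with a small simulation. This is a sketch under stated assumptions, not the storage service's actual logic: the health view looks only at liveness and ignores free blocks, a node that crashes restarts and looks healthy again, and a node is declared broken after two crashes. All names and the crash limit are hypothetical, and the sketch simply picks the first three live nodes for each write.

```python
from dataclasses import dataclass


@dataclass
class ExtentNode:
    name: str
    free_blocks: int
    up: bool = True
    crashes: int = 0
    removed: bool = False

    def write(self, blocks: int) -> bool:
        """Attempt a write; a node short on free blocks crashes instead."""
        if self.free_blocks < blocks:
            self.up = False
            self.crashes += 1
            return False
        self.free_blocks -= blocks
        return True


def healthy_nodes(nodes, limit=3):
    """Liveness-only health view: free_blocks is ignored (the blind spot)."""
    return [n for n in nodes if n.up and not n.removed][:limit]


def run(nodes, rounds, crash_limit=2):
    """Replay a few write rounds and log what the health view concludes."""
    log = []
    for _ in range(rounds):
        for node in healthy_nodes(nodes):
            if not node.write(blocks=1):
                log.append(f"{node.name} is down")
                if node.crashes >= crash_limit:
                    node.removed = True
                    log.append(f"{node.name} is broken")
                else:
                    node.up = True  # node restarts and looks healthy again
    return log


nodes = [ExtentNode(f"EN{i}", free_blocks=100) for i in range(1, 6)]
nodes[2].free_blocks = 0  # EN3 is low on free blocks
log = run(nodes, rounds=3)
print(log)
```

Because the health view never sees the shortage of free blocks, the same node keeps being selected until it is removed, at which point its data would have to be re-replicated elsewhere.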


Understanding Gray Failure


The Many Faces of Gray Failure

So, what is a gray failure?

A performance issue.

A problem that some think is a failure but some think is not, e.g., a 2% packet loss. The ambiguity itself defines gray failure. If everyone agrees it is a problem, it is not a gray failure.

A Heisenbug: sometimes it occurs and sometimes it does not.

The system is failing slowly, e.g., memory leak.

There is an increasing number of transient errors in the system, which results in reduced system capacity even if the system still manages to continue working.

What is a formal way to define and study gray failure, one that potentially sheds light on how to address it?

An Abstract Model

[Figure: a System containing a System Core, an Observer, and a Reactor, with "probe" and "report" arrows; apps App1, App2, App3, …, Appn sit outside the system]

System examples:

• distributed storage system

• IaaS platform

• data center network

• search engine

• …

App examples: web app (App1), analytics (App2), system2 (App3), …, user/operator (Appn)

Note: these are logical entities

Page 69: Gray Failure: The Achilles’ Heel of Cloud-Scale …huang/talk/hotos17_talk.pdfGray Failure: The Achilles’ Heel of Cloud-Scale Systems Ryan Huang, Chuanxiong Guo, Lidong Zhou, Jacob

Gray Failure Trait: Differential Observability

93

System Core

System

Observer

Reactor

App1 App2 App3

probe

report

Appn

Page 70: Gray Failure: The Achilles’ Heel of Cloud-Scale …huang/talk/hotos17_talk.pdfGray Failure: The Achilles’ Heel of Cloud-Scale Systems Ryan Huang, Chuanxiong Guo, Lidong Zhou, Jacob

Gray Failure Trait: Differential Observability

94

System Core

System

Observer

Reactor

App1 App2 App3

probe

report

Appn

observation

observation

Page 71: Gray Failure: The Achilles’ Heel of Cloud-Scale …huang/talk/hotos17_talk.pdfGray Failure: The Achilles’ Heel of Cloud-Scale Systems Ryan Huang, Chuanxiong Guo, Lidong Zhou, Jacob

Gray Failure Trait: Differential Observability

95

System Core

System

Observer

Reactor

App1 App2 App3

probe

report

Appn

observation

observation

different entities come into different conclusions

about whether a system is working or not

difference

Page 75

Gray Failure Trait: Differential Observability

[Diagram: system core with observer and reactor; apps App1, App2, App3, ..., Appn; probe, report, and observation arrows]

Difference between the observations:

                           observer deems system good          observer deems system bad
All apps deem system good  ❶ Healthy or w/ minor latent fault  ❷ Fault tolerance at play, or a false positive
Appi deems system bad      ❸ Gray Failure!                     ❹ Crash, fail-stop
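
The four combinations of app and observer verdicts on this slide can be written down as a small classifier. This is only an illustrative sketch: the names `Quadrant` and `classify` are hypothetical and not from the talk, and real verdicts would come from monitoring data rather than booleans.

```python
from enum import Enum


class Quadrant(Enum):
    # Labels follow the slide; the enum itself is an illustrative assumption.
    HEALTHY = "healthy or with minor latent fault"
    MASKED_OR_FALSE_POSITIVE = "fault tolerance at play, or a false positive"
    GRAY_FAILURE = "gray failure"
    CRASH = "crash, fail-stop"


def classify(app_verdicts, observer_deems_good):
    """Map verdicts to a quadrant.

    app_verdicts: iterable of bools, True if that app deems the system good.
    observer_deems_good: True if the system's own observer deems it good.
    """
    all_apps_good = all(app_verdicts)
    if all_apps_good and observer_deems_good:
        return Quadrant.HEALTHY
    if all_apps_good and not observer_deems_good:
        return Quadrant.MASKED_OR_FALSE_POSITIVE
    if observer_deems_good:
        # Some app sees a problem that the observer does not see.
        return Quadrant.GRAY_FAILURE
    return Quadrant.CRASH
```

For example, `classify([True, False, True], observer_deems_good=True)` falls in the gray failure quadrant, because one app deems the system bad while the observer still deems it good.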

Page 76

Applying the Model

Case I

term          example
system        data center network
observer      switch peers
reactor       routing protocols
app1          simple web server
app2          search engine
gray failure  app2 observed glitches but neighbors of the bad switch (and app1) didn’t

[Diagram: system core with observer and reactor; apps App1, App2, App3, ..., Appn; probe and report arrows]

Page 84

Applying the Model

Case III

term          example
system        Azure storage service
observer      storage master
reactor       storage master
app1..n       Azure VMs
gray failure  some VMs hit remote I/O exceptions while storage master deems ENs healthy

[Diagram: system core with observer and reactor; apps App1, App2, App3, ..., Appn; probe and report arrows]

- EN3 degraded
- Observation difference

- EN3 crashed & rebooted
- No observation difference

- EN3 degraded
- Observation difference

…

- EN3 crashed & removed
- No observation difference

…

- More ENs affected
- More observation difference

- Observation difference completely gone, too late

Page 87

Direction 1: Close Observation Gap

Traditional failure detector → multi-dimensional health monitor

Doctors can’t use heartbeat as the only signal of a patient’s health

[Figure: heartbeat-based hierarchical failure detector]

[Figure: in-VM performance counters before and during the gray failure incident (case II)]
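
The contrast between a heartbeat-only detector and a multi-dimensional health monitor can be sketched as follows. Everything here is an assumption made for illustration: the `HealthSample` fields, the function names, and the thresholds are hypothetical and are not the signals used in the talk.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    # Hypothetical signals; a real monitor would pick its own dimensions.
    heartbeat_ok: bool
    io_error_rate: float   # fraction of failed I/O requests
    p99_latency_ms: float  # 99th percentile request latency


def heartbeat_only(sample):
    """Traditional detector: healthy as long as heartbeats arrive."""
    return sample.heartbeat_ok


def multi_dimensional(sample, max_error_rate=0.01, max_p99_latency_ms=500.0):
    """Return the list of signals that look unhealthy (empty means healthy).

    The thresholds are illustrative defaults, not recommended values.
    """
    problems = []
    if not sample.heartbeat_ok:
        problems.append("heartbeat")
    if sample.io_error_rate > max_error_rate:
        problems.append("io_error_rate")
    if sample.p99_latency_ms > max_p99_latency_ms:
        problems.append("p99_latency_ms")
    return problems
```

A degraded node such as `HealthSample(heartbeat_ok=True, io_error_rate=0.2, p99_latency_ms=1200.0)` passes the heartbeat check but is flagged on two other dimensions, which is the observation gap this direction aims to close.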

Page 90

Direction 2: Approximate Application View

• Infeasible to completely eliminate differential observability due to multi-tenancy and modularity constraints

• System sends probes to emulate common application usage

[Figure: PingMesh [SIGCOMM ’15] over a data center topology (servers, ToR, leaf, spine; pod, podset), showing a failure in the spine]
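
One way to read probe results of this kind is to group them by how far they travel in the topology. The sketch below is a simplified illustration and not PingMesh's actual analysis: the probe tuple format and the function name `failure_rate_by_scope` are assumptions.

```python
from collections import defaultdict


def failure_rate_by_scope(probes):
    """Compute the probe failure rate for intra-pod and cross-pod traffic.

    probes: iterable of (src_pod, dst_pod, ok) tuples, where ok is True
    if the probe succeeded. This tuple layout is a hypothetical format.
    """
    sent = defaultdict(int)
    failed = defaultdict(int)
    for src_pod, dst_pod, ok in probes:
        scope = "intra-pod" if src_pod == dst_pod else "cross-pod"
        sent[scope] += 1
        if not ok:
            failed[scope] += 1
    return {scope: failed[scope] / sent[scope] for scope in sent}
```

If intra-pod probes succeed while a noticeable share of cross-pod probes fail, the fault is more likely in a higher tier (leaf or spine) than in the servers or ToR switches, although confirming that would need further localization.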

Page 92

Direction 3: Leverage Power of Scale

Break the observation silos to complement each other

Observable vs. Detectable

• Although an end-to-end probe may expose differential observability, it may not identify who is responsible for the difference

• Example solution: infer signals from many network devices to localize the fault

Blame game

• A thinks B is problematic, B thinks A is problematic

• Example solution: correlate VM failure events with topology info to judge
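
Correlating VM failure events with topology information could look roughly like the sketch below. It is an assumption-laden illustration: the mapping `vm_to_components`, the function name, and the `min_share` threshold are hypothetical, and a production system would need a more careful statistical treatment.

```python
from collections import Counter


def suspect_components(failed_vms, vm_to_components, min_share=0.8):
    """Return components shared by at least min_share of the failed VMs.

    failed_vms: list of VM identifiers that reported failures.
    vm_to_components: dict mapping a VM to the components it depends on
    (for example its host and ToR switch); a hypothetical topology format.
    """
    counts = Counter()
    for vm in failed_vms:
        # A set avoids counting a component twice for the same VM.
        counts.update(set(vm_to_components[vm]))
    threshold = min_share * len(failed_vms)
    return sorted(c for c, n in counts.items() if n >= threshold)
```

With three failed VMs on different hosts that all sit under the same ToR switch, the shared switch stands out as the common factor, which helps settle the blame game between components.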

Page 94

Direction 4: Harness Temporal Patterns

Time series and trend analysis of key health signals

[Plot: observability over time]

[Diagram: system core with observer and reactor; apps App1, App2, App3, ..., Appn; probe, report, and observation arrows]

Page 95

Direction 4: Harness Temporal Patterns

Time series and trend analysis of key health signals

Storage side observed availability issue
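
A minimal form of trend analysis is to fit a straight line to recent samples of a health signal and watch its slope. The sketch below uses an ordinary least-squares slope over evenly spaced samples; the function names and the threshold parameter are hypothetical, and the talk does not prescribe this particular method.

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def is_degrading(error_counts, threshold):
    """Flag a signal whose error count grows faster than threshold per sample."""
    return slope(error_counts) > threshold
```

A slowly rising error count can be flagged this way before any single sample crosses a fixed alarm level, which matters for gray failures that worsen gradually.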

Page 96

Conclusion

• Cloud systems are adept at handling crash and fail-stop failures

» decades of efforts and research advances have paid off

• Gray failure is a major challenge moving forward

» behind many service incidents

» an acute pain for system designers and engineers

» fault tolerance 1.0 → 2.0

• A first attempt to define and explore this problem domain

» differential observability is a fundamental trait

» addressing this trait is key to tackling gray failure

» potential future directions with open challenges

Page 97

Discussion

• Why does differential observability occur?

» simplistic model and assumption about component behavior

» modularity principle, unexpected dependency, improper error handling

» focus on narrow point availability but dismissed app availability

• Should (can) academia work on this problem?

» practitioners have been troubled by these issues for quite a while, relying on intuition, workaround, resources and process

» hungry for principled approach to understand and tackle the problem

» many issues exist in open-source distributed system stack as well