
Author: EMC. Created Date: 3/2/2015


1 © Copyright 2015 EMC Corporation. All rights reserved.

RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

Ao Ma, Fred Douglis, Guanlin Lu, Darren Sawyer (EMC)

Surendar Chandra, Windsor Hsu (Datrium)


Disk failures are commonplace

• Whole-disk failure

• Partial failure

RAID is widely deployed

• Protect data against failures with redundancy

Pervasive RAID Protection


Storage systems are evolving

• Escalating use of less reliable drives causes more whole-disk failures

• Increasing disk capacity results in more sector errors

Solution

• Add extra redundancy (RAID5, RAID6, …)

– Ensure data reliability at the cost of storage efficiency

RAID Overview

Is adding extra redundancy an efficient solution?


Analyzed 1 million SATA disks and revealed

• Failure modes degrading RAID reliability

• Reallocated sectors reflect disk reliability deterioration

• Disk failure is predictable

Built RAIDSHIELD, an active defense mechanism

• Reconstruct failing disk before it’s too late!

• PLATE: single-disk proactive protection

– Deployment eliminates 70% of RAID failures

• ARMOR: disk group proactive protection

– Recognize vulnerable RAID groups

What We Did


Background

Disk failure analysis

RAIDSHIELD:

• Identify failure indicator

• Reallocated Sector (RS) characterization

• Single disk proactive protection

• Disk group proactive protection


Outline


Disk failure does not follow a fail-stop model

The production systems studied define failure as

• Connection is lost

• An operation exceeds the timeout threshold

• Write fails

Whole-disk Failure Definition


• Each disk drive model is denoted as <family-capacity>

• Relative sizes within a family are ordered by the capacity number

– E.g. A-2 is larger than A-1

Disk Model   Population (Thousands)   First Deployment   Log Length (Months)
A-1          34                       06/2008            60
A-2          165                      11/2008            60
B-1          100                      06/2008            48
C-1          93                       10/2010            36
C-2          253                      12/2010            36
D-1          384                      09/2011            21

Disk Data Collection
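As a quick check on the "1 million SATA disks" figure quoted elsewhere in this deck, the per-model populations listed on this slide can be summed. The snippet below is only a sketch of that arithmetic; the dictionary name is made up for illustration.

```python
# Population per disk model, in thousands, as listed on this slide.
population_thousands = {
    "A-1": 34,
    "A-2": 165,
    "B-1": 100,
    "C-1": 93,
    "C-2": 253,
    "D-1": 384,
}

# Sum the per-model populations to get the size of the studied fleet.
total_thousands = sum(population_thousands.values())
print(f"Total disks: about {total_thousands * 1000:,}")
```

The total comes to a little over one million drives, consistent with the figure quoted in the talk.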


What Do Real Disk Failures Look Like?


[Figure: two bar charts, one per disk model (A-2 and A-1). x-axis: Month, the age at failure in six-month bins starting at 0-6; y-axis: Percentage (%) of failed drives. Bar labels in bin order, A-2: 0, 0, 0, 1, 4, 24, 46, 20, 3, 1; A-1: 0, 0, 1, 2, 4, 15, 34, 29, 11, 3.]

A large fraction of failed drives fail at a similar age

Distribution of Lifetime of Failed Drives


The number of affected disks keeps growing

• About 10% of disks have sector errors in their third year

Sector error counts increase continuously

• Average error count increases 25% to 300% year over year

Increasing Frequency of Sector Errors


Drives fail at a similar age

• Failure rate is not constant

• A high risk of multiple simultaneous failures

Increasing frequency of sector errors

• Exacerbates the risk of reconstruction failures

Passive Redundancy is Inefficient

Ensuring reliability in the worst case requires adding considerable extra redundancy, making it unattractive from a cost perspective


Motivation

• Ensure data safety with minimal redundancy

• Proactively recognize impending failures and migrate vulnerable data in advance

Methodology

• Identify indicator of impending failure

• Indicator characterization

• Proactive protection

RAIDSHIELD, The Proactive Protection


Potential indicators

• Various disk errors

Criterion for a good indicator

• It occurs much more frequently on failed disks than on working disks

Approach

• Quantify the discrimination between error values on failed disks and on working ones

– Deciles comparison is used

Identify Failure Indicator
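The deciles comparison named on this slide can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, the per-disk counts are synthetic values invented for the example, and the "inclusive" quantile method is an assumption, since the deck does not say how the deciles were computed.

```python
import statistics


def deciles(counts):
    """Return the nine decile cut points (10th to 90th percentile)."""
    # method="inclusive" treats the sample as the whole population (assumed).
    return statistics.quantiles(counts, n=10, method="inclusive")


def compare_deciles(failed_counts, working_counts):
    """Pair up the decile values of failed and working disks."""
    return list(zip(deciles(failed_counts), deciles(working_counts)))


# Synthetic per-disk error counts, invented purely for illustration.
failed_counts = [0, 3, 8, 20, 45, 90, 150, 400, 900, 1500, 2200]
working_counts = [0, 0, 0, 0, 0, 1, 1, 2, 4, 9, 30]

for d, (f, w) in enumerate(compare_deciles(failed_counts, working_counts), start=1):
    print(f"decile {d}: failed={f:.1f} working={w:.1f}")
```

An error type is a good candidate indicator when the failed-disk deciles sit far above the working-disk deciles across most of the distribution, not only in the tail.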


Failed disks have more media errors than working ones

The discrimination is not significant enough

Media Error Comparison

[Figure: bar chart for disk model A-2. x-axis: Deciles (1 to 9); y-axis: Media Error Count. Failed disks: 2, 3, 5, 9, 15, 22, 32, 47, 86; working disks: 1, 1, 1, 2, 3, 4, 7, 13, 30.]


[Figure: bar chart for disk model A-2. x-axis: Deciles (1 to 9); y-axis: Reallocated Sector Count. Failed disks: 2, 23, 87, 187, 327, 522, 812, 1242, 2025; working disks: 0, 0, 0, 0, 0, 1, 2, 6, 29.]

RS is strongly correlated with disk failures

Reallocated Sector (RS) Comparison
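To make the contrast between the two comparisons concrete, the sketch below computes the failed-to-working ratio at a given decile. The numbers are the bar labels read off the A-2 media error and reallocated sector charts in this deck, on the assumption that the first series in each chart is the failed disks; the function name is hypothetical.

```python
# Decile values (deciles 1 to 9) as read from the A-2 chart labels.
MEDIA_FAILED = [2, 3, 5, 9, 15, 22, 32, 47, 86]
MEDIA_WORKING = [1, 1, 1, 2, 3, 4, 7, 13, 30]
RS_FAILED = [2, 23, 87, 187, 327, 522, 812, 1242, 2025]
RS_WORKING = [0, 0, 0, 0, 0, 1, 2, 6, 29]


def ratio_at(failed, working, decile):
    """Failed-to-working ratio at a 1-based decile; inf if the working value is 0."""
    f, w = failed[decile - 1], working[decile - 1]
    return float("inf") if w == 0 else f / w


for d in (5, 9):
    media = ratio_at(MEDIA_FAILED, MEDIA_WORKING, d)
    rs = ratio_at(RS_FAILED, RS_WORKING, d)
    print(f"decile {d}: media ratio={media:.1f}, RS ratio={rs:.1f}")
```

At every decile the RS ratio is at least as large as the media error ratio, and from the second decile on it is much larger, which is the sense in which RS discriminates better between failed and working disks.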


Most failed drives tend to have a larger number of RS than working ones

RS has the strongest correlation with whole-disk failures, followed by media errors, pending sector errors, and uncorrectable sector errors

Correlation Between Sector Errors and Whole-disk Failure

RS is a strong indicator of impending disk failure


A larger RS count implies a higher failure rate within the two-month observation window

Disk Failure Rate Given Different RS Count

[Figure: bar chart for disk model A-2. x-axis: RS count (0 to 600 in steps of 40); y-axis: Disk Failure Rate (%). Bar labels in order: 1.7, 67, 75, 80, 83, 85, 86, 89, 90, 90, 91, 92, 93, 93, 94, 95.]

RS Characterization (1)
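A failure rate conditioned on the RS count, as plotted on this slide, could be computed along the following lines. This is a sketch under stated assumptions: the `DiskRecord` type and its fields are hypothetical, and conditioning on "RS count at or above a value" is an assumed reading of the x-axis, which the slide does not spell out.

```python
from dataclasses import dataclass


@dataclass
class DiskRecord:
    rs_count: int           # reallocated sectors at the start of the window
    failed_in_window: bool  # whether the disk failed during the window


def failure_rate(records, min_rs):
    """Percentage of disks with rs_count >= min_rs that failed in the window.

    Returns None when no disk reaches min_rs.
    """
    selected = [r for r in records if r.rs_count >= min_rs]
    if not selected:
        return None
    failed = sum(r.failed_in_window for r in selected)
    return 100.0 * failed / len(selected)


# Synthetic records, invented for illustration only.
records = [DiskRecord(0, False) for _ in range(8)]
records += [DiskRecord(50, True), DiskRecord(50, False)]

for min_rs in (0, 40):
    print(f"RS >= {min_rs}: {failure_rate(records, min_rs)}% failed")
```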


Larger RS count, faster to fail

[Figure: Disk Failure Time Given Different RS Count. x-axis: RS Count; y-axis: Time Margin (Days); series: 10%, 25%, median, 75%, and 90% percentiles.]

RS Characterization (2)


RS count indicates the degree of disk reliability deterioration

Use the RS count to predict impending disk failure in advance

PLATE: Single Disk Proactive Protection


[Figure: chart of two series against the RS threshold. x-axis: RS threshold (20, 40, 60, 80, 100, 200, 300, 400, 500, 600); y-axis: Percentage (%). Failures predicted: 70.1, 66.6, 64, 61.8, 59.9, 52.1, 47, 42.6, 39, 36.9; false positive: 4.5, 2.8, 2.1, 1.7, 1.4, 0.8, 0.7, 0.4, 0.3, 0.27.]

Both the predicted-failure rate and the false-positive rate decrease as the threshold increases

Simulation Result: Failures Captured Rate Given Different RS Threshold
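The trade-off on this slide can be illustrated with a simple threshold rule: flag a disk once its RS count reaches the threshold. The sketch below is not the deployed PLATE logic; the function name is hypothetical, the data is synthetic, and the two rates are defined here, by assumption, as the share of failed disks flagged and the share of working disks flagged.

```python
def threshold_rates(failed_rs, working_rs, threshold):
    """Return (failures predicted %, false positive %) for an RS threshold.

    failed_rs:  RS counts of disks that later failed
    working_rs: RS counts of disks that kept working
    """
    predicted = 100.0 * sum(rs >= threshold for rs in failed_rs) / len(failed_rs)
    false_pos = 100.0 * sum(rs >= threshold for rs in working_rs) / len(working_rs)
    return predicted, false_pos


# Synthetic RS counts, invented for illustration only.
failed_rs = [0, 10, 30, 150, 700]
working_rs = [0] * 9 + [25]

for threshold in (20, 200):
    predicted, false_pos = threshold_rates(failed_rs, working_rs, threshold)
    print(f"threshold {threshold}: predicted={predicted:.1f}% "
          f"false positive={false_pos:.1f}%")
```

Raising the threshold flags fewer disks of both kinds, so choosing it means balancing missed failures against unnecessary replacements.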


[Figure: stacked bar chart. x-axis: Without Proactive Protection, With Proactive Protection; y-axis: Percentage (%) of recovery incidents. Without: Hardware Failures 5, Others 15, Triple Failures 80. With: Hardware Failures 5, Others 15, Triple Failures 10, Eliminated Triple Failures 70.]

Single-disk proactive protection eliminates about 70% of RAID failures, equivalent to 88% of the triple-disk failures

PLATE Deployment Result: Causes of Recovery Incidents
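The two percentages on this slide are linked by simple arithmetic, sketched below. It assumes the chart labels are read correctly as triple failures falling from 80% to 10% of recovery incidents; the variable names are illustrative.

```python
# Share of recovery incidents caused by triple-disk failures (percent),
# as read from the chart labels on this slide.
triple_without_plate = 80
triple_with_plate = 10

# Incidents eliminated, as a share of all recovery incidents.
eliminated = triple_without_plate - triple_with_plate

# The same reduction, as a share of the triple-disk failures alone.
share_of_triple = eliminated / triple_without_plate

print(f"Eliminated {eliminated}% of all incidents, "
      f"i.e. {share_of_triple:.1%} of triple-disk failures")
```

The second figure works out to 87.5%, which the slide rounds to 88%.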


10% remaining triple failures

• PLATE misses RAID failures caused by multiple less reliable drives whose RS counts have not exceeded the threshold

Triage

• Prioritize disk groups with highest risk

Motivation of ARMOR: The RAID Group Proactive Protection


[Figure: four disk groups of four disks each: DG1 (disks 1-1 to 1-4), DG2 (2-1 to 2-4), DG3 (3-1 to 3-4), and DG4 (4-1 to 4-4). Legend: Threat of Failure (?), Imminent Failure (X), Good Disk. Group labels: Healthy DG1, Imminent Failure of DG2, Protected DG3, Possible Failure of DG4.]

Single-disk protection (PLATE): replace 2-3, 2-4, and 3-4

• Cannot identify DG4 or the difference between DG2 and DG3

Group protection (ARMOR): replace DG4 or increase redundancy

• Protects DG4 and recognizes the difference between DG2 and DG3

Disk Group Protection Example


Calculate the single-disk failure probability

• Conditional probability through Bayes' theorem

Calculate the probability of a vulnerable RAID

• Combine the single-disk probabilities into a joint probability

ARMOR Methodology
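The two steps on this slide can be sketched as below. This is an illustration under explicit assumptions, not the authors' implementation: the function names are hypothetical, disks are treated as failing independently, and a group is counted as vulnerable when at least two of its disks fail, a definition the slide itself does not give.

```python
from math import prod


def p_fail_given_rs(p_rs_given_fail, p_fail, p_rs):
    """Bayes' theorem: P(fail | RS) = P(RS | fail) * P(fail) / P(RS)."""
    return p_rs_given_fail * p_fail / p_rs


def p_group_vulnerable(disk_probs):
    """P(at least two disks in the group fail), assuming independence.

    disk_probs holds one failure probability per disk in the group.
    """
    # Probability that no disk fails.
    p_none = prod(1 - p for p in disk_probs)
    # Probability that exactly one disk fails.
    p_one = sum(
        p * prod(1 - q for j, q in enumerate(disk_probs) if j != i)
        for i, p in enumerate(disk_probs)
    )
    return 1 - p_none - p_one


# Made-up inputs for illustration only.
single = p_fail_given_rs(p_rs_given_fail=0.5, p_fail=0.02, p_rs=0.05)
group = p_group_vulnerable([single, 0.1, 0.01, 0.01])
print(f"single-disk probability: {single:.2f}, group vulnerability: {group:.4f}")
```

Disk groups could then be ranked by this probability so that the riskiest groups are handled first, in line with the triage goal stated earlier in the deck.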


The discrimination shows that ARMOR is effective at recognizing endangered DGs

In practice, it identifies most DG failures that are not predicted by PLATE

[Figure: bar chart of the deciles distribution. x-axis: Deciles (1 to 9); y-axis: Probability. Groups with more than one failure: 0.25, 0.33, 0.39, 0.44, 0.46, 0.5, 0.63, 0.73, 0.93; groups without failure: 0.15, 0.2, 0.23, 0.25, 0.27, 0.28, 0.3, 0.31, 0.32.]

Evaluation


Google reports that SMART metrics such as reallocated sectors strongly suggest an impending failure, but also finds that half of the failed disks show no such errors [Pinheiro’07]

• Different workload and RAID rewrite

Disk failure prediction

• Average maximum latency [Goldszmidt’12]

• SMART failure prediction [Murray’05, Hughes’02]

Related Work


We analyzed 1 million SATA drives

• Observed failure modes degrading RAID reliability

• Revealed that the RS count reflects disk reliability deterioration

• Disk failure is predictable

We built RAIDSHIELD, an active defense mechanism

• PLATE: single-disk proactive protection

– Deployment eliminates 70% of RAID failures

• ARMOR: disk group proactive protection

– Recognize vulnerable RAID groups

– Hope to deploy in future

Is adding extra redundancy an efficient solution?

• Use as much redundancy as needed to ensure availability

• Proactive replacement should decrease the level needed

Summary


RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

Questions?

Acknowledgements: Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau; the Data Domain engineering team; members of AD and the CTO office; Stephen Manley

Contact: [email protected]