RAID Shield Characterizing, Monitoring, and Proactively Protecting Against Failures · Author: EMC · Created Date: 3/2/2015

© Copyright 2015 EMC Corporation. All rights reserved.

RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

Ao Ma

Fred Douglis

Guanlin Lu

Darren Sawyer

EMC

Surendar Chandra

Windsor Hsu

Datrium


Disk failures are commonplace

• Whole-disk failure

• Partial failure

RAID is widely deployed

• Protect data against failures with redundancy

Pervasive RAID Protection


Storage system is evolving

• Escalated use of less reliable drives causes more whole-disk failures

• Increasing disk capacity results in more sector errors

Solution

• Add extra redundancy (RAID5, RAID6, …)

– Ensure data reliability at the cost of storage efficiency

RAID Overview

Is adding extra redundancy an efficient solution?


Analyzed 1 million SATA disks and revealed

• Failure modes degrading RAID reliability

• Reallocated sectors reflect disk reliability deterioration

• Disk failure is predictable

Built RAIDSHIELD, an active defense mechanism

• Reconstruct failing disk before it’s too late!

• PLATE: single-disk proactive protection

– Deployment eliminates 70% of RAID failures

• ARMOR: disk group proactive protection

– Recognize vulnerable RAID groups

What We Did


Background

Disk failure analysis

RAIDSHIELD:

• Identify failure indicator

• Reallocated Sector (RS) characterization

• Single disk proactive protection

• Disk group proactive protection


Outline


Disk failure does not follow a fail-stop model

The production systems studied define failure as

• Connection is lost

• An operation exceeds the timeout threshold

• Write fails

Whole-disk Failure Definition


• Each disk drive model is denoted as <family-capacity>

• Relative sizes within a family are ordered by the capacity number

– E.g., A-2 is larger than A-1

Disk Model   Population (Thousands)   First Deployment   Log Length (Months)
A-1          34                       06/2008            60
A-2          165                      11/2008            60
B-1          100                      06/2008            48
C-1          93                       10/2010            36
C-2          253                      12/2010            36
D-1          384                      09/2011            21

Disk Data Collection


What Do Real Disk Failures Look Like?


[Figure: percentage (%) of failed drives by age, panels A-1 and A-2. X-axis: Month, in 6-month bins starting at 0-6; Y-axis: Percentage (%). Bar labels in bin order, A-1: 0, 0, 1, 2, 4, 15, 34, 29, 11, 3; A-2: 0, 0, 0, 1, 4, 24, 46, 20, 3, 1.]

A large fraction of failed drives are found at a similar age

Distribution of Lifetime of Failed Drives


The number of affected disks keeps growing

• About 10% of disks get sector errors at the 3rd year

Sector error numbers increase continuously

• Average error count increases 25% to 300% year over year

Increasing Frequency of Sector Errors


Drives fail at a similar age

• Failure rate is not constant

• A high risk of multiple simultaneous failures

Increasing frequency of sector errors

• Exacerbates the risk of reconstruction failures

Passive Redundancy is Inefficient

Ensuring reliability in the worst case requires adding considerable extra redundancy, making it unattractive from a cost perspective


Motivation

• Ensure data safety with minimal redundancy

• Proactively recognize impending failures and migrate vulnerable data in advance

Methodology

• Identify indicator of impending failure

• Indicator characterization

• Proactive protection

RAIDSHIELD, The Proactive Protection


Potential indicators

• Various disk errors

Criteria of a good indicator

• It occurs much more frequently on failed disks than on working disks

Approach

• Quantify the discrimination between error counts on failed disks and on working ones

– Deciles comparison is used

Identify Failure Indicator
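The deciles comparison named above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names and the sample counts are hypothetical, and the standard library's inclusive quantile method is an assumed choice.

```python
from statistics import quantiles


def deciles(values):
    """Return the 9 cut points (10th to 90th percentile) of a sample."""
    return quantiles(values, n=10, method="inclusive")


def decile_ratio(failed, working):
    """Per-decile ratio of failed-disk to working-disk error counts.

    The ratio is None wherever the working-disk decile is zero.
    """
    return [f / w if w else None
            for f, w in zip(deciles(failed), deciles(working))]


# Hypothetical per-disk error counts, for illustration only.
failed_counts = [2, 20, 90, 190, 330, 520, 810, 1240, 2000, 2600, 3100]
working_counts = [0, 0, 0, 0, 0, 1, 2, 6, 30, 45, 60]

failed_deciles = deciles(failed_counts)
working_deciles = deciles(working_counts)
ratios = decile_ratio(failed_counts, working_counts)
```

An error type is a stronger candidate indicator when the failed-disk deciles sit far above the working-disk deciles across most of the distribution, not only in the top decile.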


Failed disks have more media errors than working ones

The discrimination is not significant enough

Media Error Comparison

[Figure: media error count by decile (1 to 9) for model A-2. Failed disks: 2, 3, 5, 9, 15, 22, 32, 47, 86; working disks: 1, 1, 1, 2, 3, 4, 7, 13, 30.]


[Figure: reallocated sector count by decile (1 to 9) for model A-2. Failed disks: 2, 23, 87, 187, 327, 522, 812, 1242, 2025; working disks: 0, 0, 0, 0, 0, 1, 2, 6, 29.]

RS is strongly correlated with disk failures

Reallocated Sector (RS) Comparison


Most failed drives tend to have a larger number of RS than working ones

RS is strongly correlated with whole-disk failures, followed by media errors, pending sector errors and uncorrectable sector errors

Correlation Between Sector Errors

And Whole-disk Failure

RS is a strong indicator of impending disk failure


Larger RS count implies higher failure rate in two-month observation window

Disk Failure Rate Given Different RS Count

[Figure: disk failure rate (%) versus RS count for model A-2. Bar labels in axis order for RS counts 0, 40, 80, ..., 600: 1.7, 67, 75, 80, 83, 85, 86, 89, 90, 90, 91, 92, 93, 93, 94, 95.]

RS Characterization (1)
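One way to compute a failure rate conditioned on RS count is sketched below. The slide does not spell out the exact definition, so this assumes the rate is the fraction of disks at or above a given RS count that fail within a 60-day window; the record type, field names, and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskObservation:
    rs_count: int                   # RS count at the start of the window
    days_to_failure: Optional[int]  # None if the disk did not fail


def failure_rate_given_rs(observations, min_rs, window_days=60):
    """Fraction of disks with rs_count >= min_rs that fail within the window.

    Returns None when no disk reaches min_rs.
    """
    cohort = [o for o in observations if o.rs_count >= min_rs]
    if not cohort:
        return None
    failed = sum(
        1 for o in cohort
        if o.days_to_failure is not None and o.days_to_failure <= window_days
    )
    return failed / len(cohort)


# Hypothetical observations, for illustration only.
observations = [
    DiskObservation(0, None),
    DiskObservation(5, None),
    DiskObservation(50, 30),
    DiskObservation(60, 200),
    DiskObservation(400, 10),
    DiskObservation(500, None),
]

rate_all = failure_rate_given_rs(observations, 0)
rate_rs40 = failure_rate_given_rs(observations, 40)
```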


Larger RS count, faster to fail

[Figure: time margin before failure (days) versus RS count, with 10%, 25%, median, 75%, and 90% percentiles.]

Disk Failure Time Given Different RS Count

RS Characterization (2)


RS count indicates the degree of disk reliability deterioration

Use the RS count to predict impending disk failure in advance

PLATE: Single Disk Proactive Protection


[Figure: percentage (%) of failures predicted and of false positives versus RS threshold. RS thresholds: 20, 40, 60, 80, 100, 200, 300, 400, 500, 600. Failures predicted: 70.1, 66.6, 64, 61.8, 59.9, 52.1, 47, 42.6, 39, 36.9. False positive: 4.5, 2.8, 2.1, 1.7, 1.4, 0.8, 0.7, 0.4, 0.3, 0.27.]

Both the predicted failure and false positive rates decrease as the threshold increases

Simulation Result: Failures Captured Rate Given Different RS Threshold
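A threshold sweep of this kind can be sketched as below. This is an illustration under stated assumptions, not the simulation used for the slide: it assumes "failures predicted" is the share of failed disks whose RS count reached the threshold and "false positive" is the share of working disks flagged by the same rule. The function name and the sample counts are hypothetical.

```python
def sweep_thresholds(failed_rs, working_rs, thresholds):
    """Return (threshold, predicted_rate, false_positive_rate) per threshold.

    failed_rs:  RS counts observed on disks that later failed
    working_rs: RS counts observed on disks that kept working
    """
    rows = []
    for t in thresholds:
        predicted = sum(rs >= t for rs in failed_rs) / len(failed_rs)
        false_pos = sum(rs >= t for rs in working_rs) / len(working_rs)
        rows.append((t, predicted, false_pos))
    return rows


# Hypothetical RS counts, for illustration only.
failed_rs = [10, 30, 150, 250, 700]
working_rs = [0, 0, 5, 25, 0, 0, 0, 0, 0, 110]

rows = sweep_thresholds(failed_rs, working_rs, [20, 100, 200])
```

Raising the threshold can only shrink the set of flagged disks, so both rates are non-increasing in the threshold; choosing an operating point trades missed failures against unnecessary replacements.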



[Figure: causes of recovery incidents as Percentage (%), without and with proactive protection. Categories: Hardware Failures, Others, Triple Failures, Eliminated Triple Failures. Bar labels as extracted: 5, 5, 15, 15, 80, 10, 70.]

Single-disk proactive protection eliminates about 70% of RAID failures, equivalent to 88% of the triple-disk failures

PLATE Deployment Result: Causes of Recovery Incidents
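The two percentages on this slide are consistent with each other under one reading of the chart labels: triple failures account for 80% of recovery incidents without proactive protection and 10% with it. That reading is an assumption drawn from the bar values and from the "10% remaining triple failures" noted on the following slide. The arithmetic, as a quick check:

```python
# Assumed reading of the chart: shares of all recovery incidents, in percent.
triple_before = 80   # triple failures without proactive protection
triple_after = 10    # triple failures remaining with proactive protection

eliminated = triple_before - triple_after        # 70 percentage points
share_of_all_failures = eliminated / 100         # 0.70 of all RAID failures
share_of_triple_failures = eliminated / triple_before  # 0.875, about 88%
```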



10% remaining triple failures

• PLATE misses RAID failures caused by multiple less reliable drives whose RS counts have not exceeded the threshold

Triage

• Prioritize disk groups with highest risk

Motivation of ARMOR: The RAID Group Proactive Protection


[Figure: four disk groups (DG1 to DG4) of four disks each, labeled 1-1 through 4-4. Legend: Good Disk, Threat of Failure, Imminent Failure. Group labels: Healthy DG1, Imminent Failure of DG2, Protected DG3, Possible Failure of DG4.]

Single disk protection (PLATE): Replace 2-3, 2-4, 3-4

• Cannot identify DG4 or distinguish DG2 from DG3

Group protection (ARMOR): Replace DG4 or increase redundancy

• Protects DG4 and recognizes the difference between DG2 and DG3

Disk Group Protection Example


Calculate the single disk failure probability

• Conditional probability through Bayes Theorem

Calculate the probability of a vulnerable RAID

• Combination of those single disk probabilities through joint probability

ARMOR Methodology
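The two steps above can be sketched as follows. This is a minimal illustration, not ARMOR's implementation: it assumes disks fail independently and that a group counts as vulnerable when at least two disks fail (the evaluation compares groups with more than one failure, but the cutoff here is an assumption). All function names and probability values are hypothetical.

```python
def disk_failure_probability(p_rs_given_fail, p_fail, p_rs):
    """Bayes' theorem: P(fail | RS) = P(RS | fail) * P(fail) / P(RS)."""
    return p_rs_given_fail * p_fail / p_rs


def group_vulnerability(disk_probs, min_failures=2):
    """P(at least min_failures disks fail), assuming independent disks.

    dist[k] holds P(exactly k failures) among the disks processed so far.
    """
    dist = [1.0]
    for p in disk_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)      # this disk survives
            nxt[k + 1] += mass * p        # this disk fails
        dist = nxt
    return sum(dist[min_failures:])


# Hypothetical inputs: P(RS | fail) = 0.6, P(fail) = 0.05, P(RS) = 0.1.
p_disk = disk_failure_probability(0.6, 0.05, 0.1)

# A hypothetical four-disk group with two moderately risky drives.
p_group = group_vulnerability([p_disk, p_disk, 0.01, 0.01])
```

A group of several moderately risky disks can score higher than a group with one disk near failure, which is the case a per-disk threshold alone does not capture.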


The discrimination shows ARMOR is effective at recognizing endangered DGs

In practice, it identifies most DG failures that are not predicted by PLATE

[Figure: deciles distribution (1 to 9) of the computed probability. Groups with more than one failure: 0.25, 0.33, 0.39, 0.44, 0.46, 0.5, 0.63, 0.73, 0.93; groups without failure: 0.15, 0.2, 0.23, 0.25, 0.27, 0.28, 0.3, 0.31, 0.32.]

Evaluation


Google reports that SMART metrics such as reallocated sectors strongly suggest an impending failure, but also finds that half of the failed disks show no such errors [Pinheiro’07]

• Different workload and RAID rewrite

Disk failure prediction

• Average maximum latency [Goldszmidt’12]

• SMART failure prediction [Murray’05, Hughes’02]

Related Work


We analyzed 1 million SATA drives

• Observe failure modes degrading RAID reliability

• Reveal that RS count reflects disk reliability deterioration

• Disk failure is predictable

We built RAIDSHIELD, an active defense mechanism

• PLATE: single disk proactive protection

– Deployment eliminates 70% of RAID failures

• ARMOR: disk group proactive protection

– Recognize vulnerable RAID groups

– Hope to deploy in future

Is adding extra redundancy an efficient solution?

• Use as much redundancy as needed to ensure availability

• Proactive replacement should decrease the level needed

Summary


RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

Questions?

Acknowledgement: Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau

Data Domain engineering team, members of AD and CTO office, Stephen Manley

Contact: [email protected]
