Performance/Reliability of Disk Systems

• So far, we looked at ways to improve the performance of disk systems.

• Next, we will look at ways to improve the reliability of disk systems.

• What is reliability? Availability of data when there is a disk “failure” of some sort.

• This is achieved at the cost of some redundancy (of data and/or disks).

Disk failures – A classification

1. Intermittent failure: An attempt to read or write a sector is unsuccessful, but with repeated tries we are able to read or write successfully.

2. Media decay: A bit or bits are permanently corrupted, and it becomes impossible to read a sector correctly no matter how many times we try.

3. Write failure: We attempt to write a sector, but we can neither write successfully nor can we retrieve the previously written sector. A possible cause: power outage during the writing of the sector.

4. Disk crash: The entire disk becomes unreadable, suddenly and permanently.

Intermittent Failures

• Disk sectors are stored with some redundant bits, whose purpose is to tell whether what we are reading from the sector is correct or not.

– The reading function returns a pair (w,s), where w is the data in the sector that is read, and s is a status bit that tells whether or not the read was successful.

• In an intermittent failure, we may get a status "bad" several times, but if the read function is repeated enough times (100 times is a typical limit), then eventually a status "good" will be returned.

• Writing: A straightforward way to perform the check is to read the sector and compare it with the sector we intended to write. However, instead of performing the complete comparison at the disk controller, it is simpler to attempt to read the sector and see if its status is "good."

– If so, we assume the write was correct, and if the status is "bad" then the write was apparently unsuccessful and must be repeated.
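The retry and write-verify logic above can be sketched as follows. This is only an illustration: `FlakySector` is a hypothetical in-memory stand-in for a sector, and the function names are assumptions, not part of any real disk controller interface.

```python
MAX_TRIES = 100  # the "typical limit" quoted in the text


class FlakySector:
    """Hypothetical sector whose first `bad_reads` reads report status 'bad'."""

    def __init__(self, data, bad_reads):
        self.data = data
        self.bad_reads = bad_reads

    def read(self):
        # Returns the pair (w, s) from the model in the text.
        if self.bad_reads > 0:
            self.bad_reads -= 1
            return None, "bad"
        return self.data, "good"

    def write(self, data):
        self.data = data


def read_with_retry(read_sector, max_tries=MAX_TRIES):
    """Repeat the read until the status is 'good' or the limit is reached."""
    for _ in range(max_tries):
        w, s = read_sector()
        if s == "good":
            return w, s
    return None, "bad"


def write_with_verify(write_sector, read_sector, data, max_tries=MAX_TRIES):
    """Write, then read back; only the status is checked, not the full data."""
    for _ in range(max_tries):
        write_sector(data)
        _, s = read_sector()
        if s == "good":
            return True
    return False
```

A sector that fails a few times is still read successfully; one that keeps failing past the limit is reported as "bad", which is the point at which the failure is no longer treated as intermittent.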

Checksums for failure detection

• A useful model of disk read: the reading function returns (w,s), where
– w is the data in the sector that is read, and
– s is the status bit.

• How does s get its “good” or “bad” value?

• Easy: each sector has additional bits, called a checksum (written by the disk controller).

• A simple form of checksum is the parity bit. In each example the last bit is the parity bit: 011010001 111011100

• The number of 1’s among data bits and their parity is always even.

• The read function returns s = “good” if the data bits together with their parity bit contain an even number of 1’s; otherwise, s = “bad”.
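A small sketch of the even-parity scheme, using bit strings for readability. The function names `store` and `read` are illustrative assumptions; a real controller works on raw sector bits.

```python
def parity_bit(data_bits: str) -> str:
    """Return the bit that makes the total number of 1's even."""
    return "1" if data_bits.count("1") % 2 else "0"


def store(data_bits: str) -> str:
    """Append the parity bit to the data, as the controller would on write."""
    return data_bits + parity_bit(data_bits)


def read(stored_bits: str):
    """Return (w, s): the data bits and a status derived from the parity check."""
    w = stored_bits[:-1]
    s = "good" if stored_bits.count("1") % 2 == 0 else "bad"
    return w, s
```

With these definitions, `store("01101000")` gives `"011010001"` and `store("11101110")` gives `"111011100"`, matching the two examples on the slide.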

(Interleaved) Parity bits

• It is possible that more than one bit in a sector is corrupted.
– Error(s) may not be detected.

• Suppose bit errors occur randomly: the probability of an undetected error (i.e. an even number of 1’s) is then 50%. (Why?)

• Let’s have 8 parity bits, one for each bit position within a byte. Here three data bytes are followed by their parity byte:

01110110 11001101 00001111 10110100

• The probability of an undetected error is 1/2^8 = 1/256.

• With n parity bits, the probability of undetected error = 1/2^n.
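The 8 interleaved parity bits are simply the bitwise XOR of the data bytes. A minimal sketch, with `interleaved_parity` and `check` as assumed helper names:

```python
from functools import reduce


def interleaved_parity(data_bytes):
    """Parity bit i covers bit i of every byte, so the result is the XOR of all bytes."""
    return reduce(lambda a, b: a ^ b, data_bytes, 0)


def check(data_bytes, parity):
    """Status of a read: 'good' if the stored parity byte matches the data."""
    return "good" if interleaved_parity(data_bytes) == parity else "bad"


data = [0b01110110, 0b11001101, 0b00001111]
parity = interleaved_parity(data)
```

For the slide's example, `format(parity, "08b")` is `"10110100"`. An error goes undetected only if every one of the 8 bit positions sees an even number of flips, which is where the 1/2^8 figure comes from.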

Recovery from disk crashes

• Mean time to failure (MTTF) = the time by which 50% of the disks have crashed, typically 10 years.

• Simplified (assuming this happens linearly):
– In the 1st year = 5%,

– In the 2nd year = 5%,

– …

– In the 20th year = 5%

• However the mean time to a disk crash doesn’t have to be the same as the mean time to data loss; there are solutions.

Redundant Array of Independent Disks, RAID

• RAID 1: Mirror each disk (data/redundant disks)

• If a disk fails, restore using the mirror

Assume:
• 5% failure per year; MTTF = 10 years (for disks).
• 3 hours to replace and restore a failed disk.

If one disk fails, the other had better not fail in the next three hours.
• Probability of that second failure = 5% × 3/(24 × 365) = 1/58,400.
• If one disk fails every ten years, then one of two will fail every 5 years.
• One in 58,400 of those failures results in data loss; MTTF = 5 × 58,400 = 292,000 years. This is the mean time to failure for data.

Drawback: We need one redundant disk for each data disk.
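The arithmetic behind the 292,000-year figure can be checked directly. The variable names below are illustrative; the numbers are the assumptions stated on the slide.

```python
from fractions import Fraction

disk_failure_rate = Fraction(5, 100)   # 5% of disks fail per year
repair_hours = 3                       # time to replace and restore a disk
hours_per_year = 24 * 365

# Chance that the mirror also fails during the 3-hour repair window.
p_second_failure = disk_failure_rate * Fraction(repair_hours, hours_per_year)

# One disk fails every 10 years, so one of the two fails every 5 years.
years_between_first_failures = Fraction(10, 2)

# Only one in 1/p of those first failures leads to data loss.
mttf_data_years = years_between_first_failures / p_second_failure
```

Here `p_second_failure` works out to 1/58,400 and `mttf_data_years` to 292,000.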

RAID 4

• Problem with RAID 1 (also called mirroring): n data disks & n redundant disks

• RAID 4: One redundant disk only.

• x ⊕ y: modulo-2 sum of x and y (XOR)

• 11110000 ⊕ 10101010 = 01011010

• n data disks & 1 redundant disk (for any n)

• Each block in the redundant disk has the parity bits for the corresponding blocks in the other disks: (Block-interleaved parity).

• Number the blocks (on each disk): 1,2,3,…,k

i-th block of Disk 1: 11110000
i-th block of Disk 2: 10101010
i-th block of Disk 3: 00111000
i-th block of red. disk: 01100010
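The redundant block in the example is the bitwise XOR of the corresponding data blocks. A short sketch, with `parity_block` as an assumed helper name and 8-bit integers standing in for whole blocks:

```python
def parity_block(blocks):
    """Modulo-2 sum (bitwise XOR) of the corresponding blocks of the data disks."""
    parity = 0
    for block in blocks:
        parity ^= block
    return parity


data_blocks = [0b11110000, 0b10101010, 0b00111000]  # disks 1, 2, 3
redundant = parity_block(data_blocks)
```

`format(redundant, "08b")` yields `"01100010"`, the block stored on the redundant disk.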

Properties of XOR:

• Commutativity: x ⊕ y = y ⊕ x

• Associativity: x ⊕ (y ⊕ z) = (x ⊕ y) ⊕ z

• Identity: x ⊕ 0 = 0 ⊕ x = x (0 is the all-zeros vector)

• Self-inverse: x ⊕ x = 0
– As a useful consequence, if x ⊕ y = z, then we can “add” x to both sides and get y = x ⊕ z.

– More generally, if

0 = x_1 ⊕ ... ⊕ x_n

then “adding” x_i to both sides, we get:

x_i = x_1 ⊕ ... ⊕ x_{i-1} ⊕ x_{i+1} ⊕ ... ⊕ x_n

Failure recovery in RAID 4

We must be able to restore whatever disk crashes.

• Just compute the modulo-2 sum of corresponding blocks of the other disks.

• Use the equation

x_j = x_1 ⊕ ... ⊕ x_{j-1} ⊕ x_{j+1} ⊕ ... ⊕ x_n ⊕ x_red

• Example:

i-th block of Disk 1: 11110000
i-th block of Disk 2: 10101010
i-th block of Disk 3: 00111000
i-th block of red. disk: 01100010

Disk 2 crashes. Compute it by taking the modulo-2 sum of the rest: 11110000 ⊕ 00111000 ⊕ 01100010 = 10101010.
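Recovery of the crashed disk can be sketched as follows. The dictionary layout and the name `recover_block` are assumptions made for illustration.

```python
def recover_block(blocks, failed):
    """Rebuild the block of the failed disk as the XOR of all surviving blocks.

    `blocks` maps a disk name to its i-th block; the entry for `failed`
    is ignored, since that disk can no longer be read.
    """
    value = 0
    for name, block in blocks.items():
        if name != failed:
            value ^= block
    return value


blocks = {
    "disk1": 0b11110000,
    "disk2": 0b10101010,  # pretend this one has crashed
    "disk3": 0b00111000,
    "red": 0b01100010,
}
restored = recover_block(blocks, failed="disk2")
```

The same function restores the redundant disk if that is the one that crashed, since all n+1 blocks XOR to zero.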

RAID 4 (Cont’d)

• Reading: as usual

– Interesting possibility: If we want to read from disk i, but it is busy and all other disks are free, then instead we can read the corresponding blocks from all other disks and modulo 2 sum them.

• Writing:
– Write block.
– Update redundant block.

How do we get the value for the redundant block?

• Naively: read the n-1 corresponding blocks of the other data disks and recompute the parity. This takes n+1 disk I/O’s (n-1 block reads, 1 data block write, 1 redundant block write).

• Better: How?

How do we get the value for the redundant block?

• Better writing: To write block j of data disk i (new value = v):
– Read the old value of that block, say o.

– Read the jth block of the redundant disk, say r.

– Compute w = v ⊕ o ⊕ r.

– Write v in block j of disk i.

– Write w in block j of the redundant disk.

• Total: 4 disk I/O’s (true for any number of data disks)

• Problem: Why does this work?
– Intuition: v ⊕ o is the “change” to the parity.

– Redundant disk must change to compensate.

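Example

i-th block of Disk 1: 11110000
i-th block of Disk 2: 10101010
i-th block of Disk 3: 00111000
i-th block of red. disk: 01100010

Suppose we change 10101010 into 01101110. The new redundant block is the modulo-2 sum of the old value, the new value, and the old redundant block:

  10101010
  01101110
  01100010
  --------
  10100110

Recomputing the parity from all three data blocks gives the same result:

  11110000
  01101110
  00111000
  --------
  10100110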
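The four-I/O write can be sketched in a few lines. `new_parity` is an assumed name; the variables follow the slide's notation (o = old value, v = new value, r = old redundant block).

```python
def new_parity(o, v, r):
    """Redundant block after overwriting old value o with new value v.

    o ^ v is the set of bit positions that changed; XOR-ing it into r
    flips exactly those parity bits.
    """
    return v ^ o ^ r


o = 0b10101010  # old block of disk 2
v = 0b01101110  # new block of disk 2
r = 0b01100010  # old redundant block

w = new_parity(o, v, r)

# Naive alternative: recompute from all data blocks (disks 1, 2 (new), 3).
w_naive = 0b11110000 ^ v ^ 0b00111000
```

Both `w` and `w_naive` come out as `0b10100110`, but the first needs only two reads regardless of how many data disks there are.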

RAID 5

• RAID 4: Problem: The redundant disk is involved in every write (though RAID 4 is more cost-effective than mirroring). A bottleneck!

• Solution is RAID 5: vary the redundant disk for different blocks.
– Example: n disks; block j is redundant on disk i if i = remainder of j/n.

• Example: n=4. So, there are 4 disks.
– The first disk, numbered 0, would be the “redundant” one when considering cylinders numbered 0, 4, 8, 12, etc. (because they leave remainder 0 when divided by 4).

– The disk numbered 1 would be the “redundant” one for its cylinders numbered 1, 5, 9, etc.
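The rotation rule is a single modulo operation. In this sketch `redundant_disk` is an assumed name, and disks are numbered from 0 as in the example.

```python
def redundant_disk(j, n_disks):
    """Disk that holds the redundant block for block (cylinder) number j."""
    return j % n_disks


n_disks = 4
layout = {
    disk: [j for j in range(13) if redundant_disk(j, n_disks) == disk]
    for disk in range(n_disks)
}
```

`layout[0]` is `[0, 4, 8, 12]` and `layout[1]` is `[1, 5, 9]`, so the redundant role is spread evenly over the disks.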

RAID 5 (Cont’d)

• The reading/writing load for each disk is the same.

• Problem: In one block write, what’s the probability that a disk is involved? (Here the array has n+1 disks in total, so n = 3 for four disks.)
– Each disk has 1/(n+1) probability to have the block.

– If not, i.e. with probability n/(n+1), then it has 1/n chance that it will be the redundant block for that block.

– So, each of the n+1 disks is involved in:

1/(n+1) * 1 + (n/(n+1))*(1/n) = 2/(n+1) of the writes (1/2 for four disks).
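The formula can be checked with exact fractions. This sketch assumes, as the formula does, an array of n+1 disks in total; `write_involvement` is an illustrative name.

```python
from fractions import Fraction


def write_involvement(n):
    """Fraction of single-block writes that touch a given disk (n+1 disks total)."""
    holds_block = Fraction(1, n + 1)                   # the disk stores the written block
    holds_parity = (1 - holds_block) * Fraction(1, n)  # otherwise it may hold the parity
    return holds_block + holds_parity
```

For four disks (n = 3) this gives 1/2, compared with RAID 4, where the single redundant disk takes part in every write.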

RAID 6 - for multiple disk crashes

Let’s focus on recovering from two disk crashes.

Setup:

• 7 disks, numbered 1 through 7

• The first 4 are data disks, and disks 5 through 7 are redundant.

• The relationship between data and redundant disks is summarized by a 3 x 7 matrix of 0's and 1's

1 2 3 4 5 6 7

1 1 1 0 1 0 0

1 1 0 1 0 1 0

1 0 1 1 0 0 1

The columns for the redundant disks have a single 1.

The columns for the data disks each have at least two 1's.

The disks with 1 in a given row of the matrix are treated as if they were the entire set of disks in a RAID level 4 scheme.

RAID 6 - example

1) 11110000

2) 10101010

3) 00111000

4) 01000001

5) 01100010

6) 00011011

7) 10001001

1 2 3 4 5 6 7

1 1 1 0 1 0 0

1 1 0 1 0 1 0

1 0 1 1 0 0 1

Disk 5 is the modulo-2 sum of disks 1, 2, 3; likewise, disk 6 is the sum of disks 1, 2, 4, and disk 7 is the sum of disks 1, 3, 4.
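The redundant blocks in the example follow from the matrix row by row. In this sketch, `redundant_blocks` is an assumed name and disks are numbered from 1 as on the slide.

```python
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

DATA = {1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01000001}


def redundant_blocks(matrix, data):
    """For each row, XOR the data disks marked 1 and assign the result
    to the redundant disk marked 1 in that row."""
    n_data = len(data)
    result = {}
    for row in matrix:
        # The redundant disk of this row is the only column past the data disks with a 1.
        red = next(d for d in range(n_data + 1, len(row) + 1) if row[d - 1])
        value = 0
        for d in range(1, n_data + 1):
            if row[d - 1]:
                value ^= data[d]
        result[red] = value
    return result


redundant = redundant_blocks(MATRIX, DATA)
```

The result reproduces disks 5, 6 and 7 of the example: `01100010`, `00011011` and `10001001`.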



RAID 6 Failure Recovery

• Why is it possible to recover from two disk crashes?

• Let the failed disks be a and b.

• Since all columns of the redundancy matrix are different, we must be able to find some row r in which the columns for a and b are different.

• Suppose that a has 0 in row r, while b has 1 there.

• Then we can compute the correct b by taking the modulo-2 sum of corresponding bits from all the disks other than b that have 1 in row r. – Note that a is not among these, so none of them have failed.

• Having done so, we must recompute a, with all other disks available.
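The two-step recovery argument can be turned into a short procedure: repeatedly look for a row in which exactly one failed disk has a 1, and rebuild that disk from the others in the row. This is a sketch under the 7-disk setup of the example; `recover` is an assumed name.

```python
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def recover(disks, failed, matrix=MATRIX):
    """Return a copy of `disks` (disk number -> block) with the failed disks rebuilt."""
    disks = {d: v for d, v in disks.items() if d not in failed}
    remaining = set(failed)
    while remaining:
        progressed = False
        for row in matrix:
            members = [d for d in range(1, len(row) + 1) if row[d - 1]]
            missing = [d for d in members if d in remaining]
            if len(missing) == 1:
                # Exactly one failed disk in this row: treat the row as RAID 4.
                b = missing[0]
                value = 0
                for d in members:
                    if d != b:
                        value ^= disks[d]
                disks[b] = value
                remaining.discard(b)
                progressed = True
        if not progressed:
            raise ValueError("too many failures to recover")
    return disks


FULL = {1: 0b11110000, 2: 0b10101010, 3: 0b00111000, 4: 0b01000001,
        5: 0b01100010, 6: 0b00011011, 7: 0b10001001}
restored = recover(FULL, failed={2, 5})
```

With disks 2 and 5 lost, the second row (disks 1, 2, 4, 6) rebuilds disk 2 first, after which the first row rebuilds disk 5.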

RAID 6 – How many redundant disks?

• The number of disks can be one less than any power of 2, say 2^k – 1.

• Of these disks, k are redundant, and the remaining 2^k – 1 – k are data disks, so the redundancy grows roughly as the logarithm of the number of data disks.

• For any k, we can construct the redundancy matrix by writing all possible columns of k 0's and 1's, except the all-0's column.
– The columns with a single 1 correspond to the redundant disks, and the columns with more than one 1 are the data disks.
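The construction can be sketched by enumerating all non-zero k-bit columns. The column order chosen here (data disks first, then redundant disks) is an assumption for readability; any order of distinct non-zero columns works. `raid6_matrix` is an illustrative name.

```python
def raid6_matrix(k):
    """Build a k x (2^k - 1) redundancy matrix as a list of rows."""
    columns = range(1, 2 ** k)  # every k-bit pattern except all zeros
    data_cols = sorted((c for c in columns if bin(c).count("1") > 1), reverse=True)
    red_cols = sorted((c for c in columns if bin(c).count("1") == 1), reverse=True)
    ordered = data_cols + red_cols
    # Row r holds bit r (counted from the most significant) of each column.
    return [[(c >> (k - 1 - r)) & 1 for c in ordered] for r in range(k)]
```

For k = 3 this reproduces the 3 x 7 matrix used in the example; for k = 4 it yields a scheme with 15 disks, 4 of which are redundant.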

RAID 6 - exercise

• Find a RAID level 6 scheme using 15 disks, 4 of which are redundant.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

1 0 1 1 1 0 0 0 1 1 1 1 0 0 0

1 1 0 1 1 0 1 1 0 1 0 0 1 0 0

1 1 1 0 1 1 0 1 0 0 1 0 0 1 0

1 1 1 1 0 1 1 0 1 0 0 0 0 0 1
