
Distributed and Parallel Databases, 2, 261-293 (1994) © 1994 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Performance of RAID5 Disk Arrays with Read and Write Caching

JAI MENON

IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099

Received August 27, 1993; Revised January 28, 1994

MENONJM@ALMADEN.IBM.COM

Abstract. In this paper, we develop analytical models and evaluate the performance of RAID5 disk arrays in normal mode (all disks operational), in degraded mode (one disk broken, rebuild not started) and in rebuild mode (one disk broken, rebuild started but not finished). Models for estimating rebuild time under the assumption that user requests get priority over rebuild activity have also been developed. Separate models were developed for cached and uncached disk controllers. Particular emphasis is on the performance of cached arrays, where the caches are built of Non-Volatile memory and support write caching in addition to read caching. Using these models, we evaluate the performance of arrayed and unarrayed disk subsystems when driven by a database workload such as those seen on systems running any of several popular database managers. In particular, we assume single-block accesses, flat device skew and little seek affinity.

With the above assumptions, we find six significant results. First, in normal mode, we find there is no difference in performance between subsystems built out of either small arrays or large arrays as long as the total number of disks used is the same. Second, we find that if our goal is to minimize the average response time of a subsystem in degraded and rebuild modes, it is better to use small arrays rather than large arrays in the subsystem. Third, we find the counter-intuitive result that if our goal is to minimize the average response time of requests to any one array in the subsystem, it is better to use large arrays than small arrays in the subsystem. We call this the best worst-case phenomenon.

Fourth, we find that when no caching is used in the disk controller, subsystems built out of arrays have a normal mode performance that is significantly worse than an equivalent unarrayed subsystem built of the same drives. For the specific drive, controller, workload and system parameters we used for our calculations, we find that, without a cache in the controller and operating at typical I/O rates, the normal mode response time of a subsystem built out of arrays is 50% higher than that of an unarrayed subsystem. In rebuild mode, we find that a subsystem built out of arrays can have anywhere from 100% to 200% higher average response time than an equivalent unarrayed subsystem.

Our fifth result is that, with cached controllers, the performance differences between arrayed and equivalent unarrayed subsystems shrink considerably. We find that the normal mode response time in a subsystem built out of arrays is only 4.1% higher than that of an equivalent unarrayed system. In degraded (rebuild) mode, a subsystem built out of small arrays has a response time 11% (13%) higher and a subsystem built out of large arrays has a response time 15% (19%) higher than an unarrayed subsystem.

Our sixth and last result is that cached arrays have significantly better response times and throughputs than equivalent uncached arrays. For one workload, a cached array with good hit ratios had 5 times the throughput and 10 to 40 times lower response times than the equivalent uncached array. With poor hit ratios, the cached array is still a factor of 2 better in throughput and a factor of 4 to 10 better in response time for this same workload.

We conclude that 3 design decisions are important when designing disk subsystems built out of RAID level 5 arrays. First, it is important that disk subsystems built out of arrays have disk controllers with caches, in particular Non-Volatile caches that cache writes in addition to reads. Second, if one were trying to minimize the worst response time seen by any user, one would choose disk array subsystems built out of large RAID level 5 arrays because of the best worst-case phenomenon. Third, if average subsystem response time is the most important design metric, the subsystem should be built out of small RAID level 5 arrays.

Keywords: disk arrays, caching, non-volatile caches, write caches, storage subsystem performance

Page 2: Performance of RAID5 disk arrays with read and write caching

262 M~,NON

1. Introduction

A disk array is a set of disk drives (and associated controller) which can automatically recover data when one (or more) drives in the set fails. Such recovery is done using redundant data that is stored on the disks and is maintained by the controller. Thus, the primary function provided by a disk array is enhanced data availability and enhanced immunity from data loss. Optionally, some disk arrays also provide high data rate by providing the capability of reading from multiple drives in the set in parallel. A nice description of five types of disk arrays called RAID levels 1 through 5 is given in [24], and a sixth type of disk array is described in [9].

For this paper, we are primarily interested in the RAID Level 5 disk array and its performance. Such an array uses a parity technique described in [16, 5, 24]. This technique requires fewer disks than duplexing (for example 12% more disks instead of 100% more disks), and achieves fast recovery times compared to the checkpoint and log technique used by many database systems to recover from disk failures. We illustrate the array parity technique below on a 4 + P disk array. In this diagram, Pi is a parity block that protects the four data blocks labelled Di.

            Disk 1         Disk 2         Disk 3         Disk 4         Disk 5
Track 1     D1 D2 D3 D4    D1 D2 D3 D4    D1 D2 D3 D4    D1 D2 D3 D4    P1 P2 P3 P4
Track 2     D5 D6 D7 D8    D5 D6 D7 D8    D5 D6 D7 D8    P5 P6 P7 P8    D5 D6 D7 D8

(Parity group 1 consists of the four blocks labelled D1 and P1; parity group 8 consists of the four blocks labelled D8 and P8.)

We show only two tracks (each with four blocks) on each of the disk drives (disks for short). A column consisting of 4 data blocks and a parity block is called a parity group. There are eight parity groups in the diagram shown. The parity width of the array is the number of data blocks per parity group--4 in our example. An array with small parity width (say 4 or 8) is referred to as a small array, while an array with large parity width (say 16 or more) is referred to as a large array. In the figure, P1 contains the parity or exclusive OR of the blocks labeled D1 that are in the same parity group. Similarly, P2 is the exclusive OR of the blocks labeled D2, P3 the exclusive OR of the D3s, and so on. Such a disk array is robust against single disk crashes; if disk 1 were to fail, data on it can be recreated by reading data from the remaining four disks and performing the appropriate exclusive OR operations.

Whenever the controller receives a request to write a data block, it must also update the corresponding parity block for consistency. If D1 is to be altered, the new value of P1 is calculated as:

new P1 = (old D1 XOR new D1 XOR old P1)
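To make the parity update rule concrete, the following sketch (not from the paper; the block size, contents and function names are illustrative assumptions) applies the exclusive OR arithmetic to byte blocks and checks that a lost block can be recreated from the survivors.

    # Sketch of the RAID5 parity arithmetic described above (illustrative only;
    # block size and contents are arbitrary).

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        """Byte-wise exclusive OR of two equal-length blocks."""
        return bytes(x ^ y for x, y in zip(a, b))

    # A 4 + P parity group: four data blocks D1..D4 and their parity P.
    d = [bytes([i] * 8) for i in (1, 2, 3, 4)]
    p = d[0]
    for blk in d[1:]:
        p = xor_blocks(p, blk)                 # P = D1 XOR D2 XOR D3 XOR D4

    # Small write of D1, using: new P = old D1 XOR new D1 XOR old P
    new_d1 = bytes([9] * 8)
    p = xor_blocks(xor_blocks(d[0], new_d1), p)
    d[0] = new_d1

    # Recovery: if the disk holding D1 fails, rebuild it from the survivors.
    rebuilt = p
    for blk in d[1:]:
        rebuilt = xor_blocks(rebuilt, blk)
    assert rebuilt == d[0]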


Since the parity must be altered each time the data is modified, these arrays require four disk accesses to write a data block:

1. Read the old data

2. Read the old parity

3. Write the new data

4. Write the new parity

where the reads must be completed before the writes can begin. In the diagram, the parity blocks for the first four parity groups are on disk 5, for the

second four parity groups on disk 4, and so on. If all the parity blocks for all the parity groups had been on one and the same disk, we would have had a RAID level 4 [24]. With parity rotated amongst all the drives, we have either a RAID Level 5 or a parity striped array [9]. 1 We are particularly interested in RAID level 5 and parity striped arrays because they perform better than other types of disk arrays on transaction processing systems and because they achieve high availability in a cost-effective way; we will focus exclusively on them in this paper.
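As an aside, the rotation just described can be captured by a simple placement function; the sketch below follows the 4 + P example (parity on disk 5 for the first track of parity groups, disk 4 for the next, and so on). The function name and the rotation order beyond what the example shows are assumptions for illustration.

    # Minimal sketch of rotated (RAID5-style) parity placement, patterned after
    # the 4 + P example above: parity moves one disk to the left every track
    # (every groups_per_track parity groups). Illustrative only.

    def parity_disk(group: int, n_disks: int = 5, groups_per_track: int = 4) -> int:
        """Return the 0-based index of the disk holding parity for a parity group."""
        track = group // groups_per_track
        return (n_disks - 1 - track) % n_disks

    # Parity groups 1-4 (track 1) map to disk 5 (index 4), groups 5-8 to disk 4 (index 3).
    print([parity_disk(g) for g in range(8)])   # [4, 4, 4, 4, 3, 3, 3, 3]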

The purpose of this paper is to evaluate the performance of track-striped RAID level 5 and/or parity striped arrays in transaction-processing environments. It considerably extends the work we originally reported in [20]. While there have been a number of performance studies of arrays ([24, 3, 9, 22, 10]), this paper addresses some of the unexplored work in the previous papers. Unlike previous work, we compare the performance of arrays of disks to that of unarrayed systems built from the SAME disks. Previous work has compared the performance of arrays of small form-factor (5.25" or 3.5") disks to unarrayed systems built of larger (10.8" or 14") disks. We also explore the effects of caching on arrays, calling such systems Cached RAIDs. More details on how Cached RAIDs operate are given in [21]. In particular, we look at Non-Volatile caches which cache both read and write data. 2 For Cached RAIDs, we look at performance in normal, degraded and rebuild modes of operation. We have extended the work reported in [20] to give read operations priority over destage operations from the RAID cache (destages from RAID cache look like writes to the disk) and we have explored the effects of different read cache hit ratios, write cache hit ratios and read-to-write ratios on performance. We also use more accurate analysis to calculate rebuild times than the analysis we used in our previous paper and we present results for both small and large arrays.

Simplifying assumptions are made to keep the analysis tractable, so there is no claim that the analysis is highly accurate. However, we believe that the analysis is accurate enough for performance comparisons between subsystems built from arrays with different parity widths and between subsystems built from arrays and unarrayed subsystems, which are the two goals of this report. Simulations, not reported here, have been used to verify various aspects of our analysis for accuracy.

This paper also addresses the performance of arrayed disk subsystems in degraded mode (when a disk in an array has failed, but before we begin to rebuild the data from the device to a spare device). 3 Finally, we also evaluate the performance of arrayed disk subsystems


in rebuild mode (during the time data is being rebuilt to a spare device). We analyze the performance of the subsystem to user requests during the rebuild and we also calculate the time to rebuild assuming that user requests (including cache destage operations) have higher priority than rebuild activity.

The rest of this report is organized as follows. In the next section, we briefly describe the models we used to make our calculations. Then, we describe the cases we analyze--widths of arrays, drive and controller parameters, workload parameters, etc. Next, we present performance results for arrays without a cache in the controller. Finally, we present results for normal mode performance, degraded mode performance, rebuild mode performance and rebuild time for arrays with controller caches. We end by summarizing our conclusions.

2. Brief Description of Analysis

We separately analyze two cases--one without a disk cache, one with a Non-Volatile disk cache.

2.1. Analysis of Disk Arrays Without Caching

First we present the analysis for normal mode, then for degraded mode.

2.1.1. Normal Mode Analysis

The N + P array is modelled as N + 1 (number of disks) identical parallel queues as shown in Figure 1 on page 265. 4 Requests arrive according to a Poisson process with arrival rate K. The read requests arrive at rate rK, the write requests at rate (1 - r)K, where r is the read fraction in the workload. Read requests are randomly assigned to one of the queues with probability 1/(N + 1), whereas write requests are split (forked) into two identical requests and assigned to two different queues picked randomly. 5 Write requests are not considered complete until both of the split (forked) requests are completed (they can be modelled as a fork-join queue, since the forked requests must both be completed for the write request to be completed). Each of the split requests consists of an average seek, an average latency, a one block read, most of a one revolution delay, and a one block write. Later, we will refer to each of the split requests as a Read-Modify-Write or RMW request. The read requests consist of an average seek, an average latency and a one block read.

The average service time (ST) for a request in a queue is calculated as a weighted average of the service time for a read request (STr) and the service time of a RMW request (STrmw) as:

ST = r × STr + 2 × (1 - r) × STrmw

The service times are assumed to be independent and exponentially distributed with mean ST. Let p be the utilization of the disk server. Then, the response time for a read request (or a RMW request) Tr is easily calculated as ST/(1 - p) from the result in [15] for M/M/1

Figure 1. Uncached Array Model. [Diagram: N + 1 parallel disk queues, each with mean service time ST; each queue receives rK/(N + 1) read requests, and each write is split into two RMW requests sent to two different queues, giving 2(1 - r)K/(N + 1) RMW requests per queue.]

queues. The response time for a write request Tw is calculated as (12 - p)*Tr/8 from the result for fork/join queues in [23]. Finally, the average request response time of the array may now be calculated as a weighted average of Tr and Tw. 6
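As a numerical sketch of this normal-mode model (the structure follows the equations above; the drive timings and request rate in the example are assumed values, not the paper's parameters):

    # Sketch of the uncached normal-mode model. Timings below are assumptions.

    def normal_mode_response(K, r, N, ST_r, ST_rmw):
        """K: array request rate (req/s), r: read fraction, N: parity width of an
        N + P array, ST_r/ST_rmw: read and RMW service times in seconds."""
        ST = r * ST_r + 2 * (1 - r) * ST_rmw   # weighted service time, as above
        rho = (K / (N + 1)) * ST               # per-disk utilization
        if rho >= 1:
            raise ValueError("unstable: per-disk utilization >= 1")
        Tr = ST / (1 - rho)                    # M/M/1 response for a read (or one RMW)
        Tw = (12 - rho) * Tr / 8               # two-way fork/join approximation
        return r * Tr + (1 - r) * Tw           # average request response time

    # Example: 8 + P array, 125 req/s, 70% reads, 16.5 ms reads, 27.7 ms RMWs.
    print(normal_mode_response(K=125, r=0.7, N=8, ST_r=0.0165, ST_rmw=0.0277))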

2.1.2. Degraded Mode Analysis

It is easy to see that, in normal mode, each of the N + 1 devices in an array gets

rK/(N + 1) reads + (1 - r)K/(N + 1) RMW data + (1 - r)K/(N + 1) RMW parity

Consider requests to the broken disk in degraded mode. The rK/(N + 1) reads that used to go to it will cause every surviving disk's traffic to increase by that amount, since every read to the broken disk is replaced by a read to every surviving disk. 7 Next, consider a write that used to go to the broken disk, which caused a RMW of data to the broken disk and a RMW of parity to some other disk. We can no longer RMW from the broken disk, but we still need to calculate and write parity to the parity disk. This is done by reading all the surviving data blocks in the parity group, XORing them with the new data to be written to generate the new parity, then writing the new parity to the parity disk. The net result is that on every surviving disk except the one containing parity we increased the number of reads, and on the surviving disk containing parity, we changed a RMW operation to a simple write parity operation (no need to read old parity). Since we are using rotated parity, it is easy to see that on each surviving disk, we have reduced the number of RMW parity operations from

(1 - r)K/(N + 1)

to

(N - 1) × (1 - r)K / (N × (N + 1))

But, we have increased the simple data read or parity write operations on each disk by (1 - r)K/(N + 1).

Finally, consider a write that used to go to some disk D which generated a RMW parity on the broken disk. Since the disk containing parity is broken, we no longer need to RMW data on D; rather we may directly write to it. As a result, we reduce the number of RMW data operations on each surviving disk from

(1 - r)K/(N + 1)

to

(N - 1) × (1 - r)K / (N × (N + 1))

But, we have increased the simple write operations on each disk by (1 - r)K / (N × (N + 1)).

This gives us the result that in degraded mode, each of the N surviving disks gets

2rK/(N + 1) reads
+ (N - 1) × (1 - r)K / (N × (N + 1)) RMW data + (1 - r)K / (N × (N + 1)) data writes
+ (N - 1) × (1 - r)K / (N × (N + 1)) RMW parity + (1 - r)K/(N + 1) data reads or parity writes

The above results can be used to compute p, the device utilization, and ST, the weighted average service time of a device server. We now need to calculate the response time for reads Tr and the response time for writes Tw.

There are two kinds of reads--reads that go to a surviving disk and get response time T1 and reads that go to the broken disk (requiring the forking of N identical requests to each of the surviving disks) and get response time TN. From [23], we use the result that

TN = [HN/H2 + (4p/11) × (1 - HN/H2)] × ((12 - p)/8) × T1

where T1 is ST/(1 - p) and HN is the Nth harmonic number. Tr is calculated as a weighted average between T1 and TN, using the fact that rK/(N + 1) reads get response time TN and NrK/(N + 1) reads get response time T1.

There are three kinds of writes--those to the broken device (there are (1 - r)K/(N + 1) of these), those to a good device but with parity on the broken device (there are (1 - r)K/(N + 1) of these also) and those to a good device with parity also on a good device (there are (N - 1)(1 - r)K/(N + 1) of these). The first kind of write gets response time TN-1 + T1, since it requires waiting for N - 1 reads to complete, followed by the writing of parity (TN-1 is calculated like TN above, except N is replaced with N - 1); the second kind of write gets response time T1 above since we no longer need to RMW data;


and the third kind of write gets response time T2, which is calculated as (12 - p)T1/8, since it requires completion of 2 different RMWs on 2 different disks. By weighting T1, T2 and TN appropriately, we are able to calculate Tw, the response time for a write. Finally, based on the read fraction r, the average response time to a request may now be calculated as a weighted average of Tr and Tw.
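A sketch of the degraded-mode per-disk rates derived above follows (illustrative only; the function name and example numbers are assumptions):

    # Per-surviving-disk operation rates (per second) for an uncached N + P array
    # in degraded mode, following the expressions derived in this section.

    def degraded_rates(K, r, N):
        reads        = 2 * r * K / (N + 1)
        rmw_data     = (N - 1) * (1 - r) * K / (N * (N + 1))
        data_writes  = (1 - r) * K / (N * (N + 1))
        rmw_parity   = (N - 1) * (1 - r) * K / (N * (N + 1))
        reads_or_pwr = (1 - r) * K / (N + 1)    # extra data reads or parity writes
        return reads, rmw_data, data_writes, rmw_parity, reads_or_pwr

    # Device utilization then follows as the rate-weighted sum of service times,
    # e.g. rho = reads*ST_read + rmw_data*ST_rmw + ... for assumed service times.
    print(degraded_rates(K=125, r=0.7, N=8))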

2.2. Analysis of Disk Arrays With Non-Volatile Caching

As before, we first present the analysis for normal mode, then for degraded mode.

2.2.1. Normal Mode Analysis

We assume that the controller consists of 4 processors (also called storage directors in IBM parlance [19]) whose job is to examine a shared, Non-Volatile cache for hits and misses on read and write requests. We refer to the Non-Volatile cache as NVS (Non-Volatile store). The NVS only contains data blocks; parity blocks are not cached. This is because our simulations (not reported here) indicate that it is better to cache data blocks to improve read performance than it is to cache parity blocks to improve write performance. If the block requested by a read is in cache, it is considered a read hit; else it is a read miss. Read hits are satisfied directly out of the cache; read misses require access to the disks. If the block written by a write request is in cache, it is a write hit; else it is a write miss. On a write hit or miss, the data block is accepted from the system and placed in the Non-Volatile cache, after which the write is considered "done". This block is destaged to disk (pushed from NVS and written to disk) at some later time, along with other blocks that also need to be destaged and that are on the same track or cylinder. Disks only see read misses and destages.

Our analysis of IBM traces from transaction-processing and database applications ([12]) convinces us that, for large caches, it is very reasonable to set the write hit ratio close to 1, since most database writes are preceded by reads which are almost always likely to be in the disk controller cache. 8 Write hits may be divided into hits on dirty pages (pages that have been previously written by the host system and are sitting in the cache waiting to be destaged) and hits on clean pages. We call the fraction of write hits that are hits on dirty pages the NVS hit ratio. If the system request rate is K, the fraction of requests that are write hits to clean pages is C and the fraction of requests that are write misses is D, then KC + KD is the rate at which new dirty blocks are created in cache. In steady state, this must also be the rate at which dirty blocks are destaged from cache. If X blocks are destaged per destage, the destage rate can be expressed as (KC + KD)/X destages/sec. If B is the fraction of requests that are read misses, then there are KB read misses/sec.

This system is modelled as an M/M/4 system (to represent the four storage directors managing the cache) which generates read misses and destages at some rate to be handled by N + 1 disks each of which is represented by 2 queues, one for read misses and one for destages. Of the two queues at each device, the read miss queue has higher priority than the destage queue. The situation is shown in Figure 2 on page 269. There are four types of


requests from the system--read hits, read misses, write hits and write misses. Read hits, write hits and write misses are satisfied directly by the cache; hence their response time is modelled as the response time of an M/M/4 system. Read misses first pass through the M/M/4 system with some response time, then enter the read miss queue of one of the N + 1 devices with equal probability. The utilization of the devices is calculated easily, since the read miss rate and the destage rate are known, and the service time for a read miss and the service time for a destage are also known.

A destage is treated as 3 or 4 separate operations--read old parity, write new parity, optionally read old data and write new data. The response time of a destage is not reported. The destage rate must keep up with the write rate from the system, but since the system does not wait for the destage to be completed, the response time for the destage is of little interest. On a read miss, we assume that the requested block and all remaining blocks following it in the rest of that track are staged into cache, as is done in the IBM 3990 disk controller ([19]). The response time of the read miss only includes the time to retrieve the block requested; the staging of the rest of the track is done in the background and keeps the device busy, but does not contribute directly to read miss response time. We assume that read misses have higher priority than destage requests at the individual devices. This is why there are two queues at each device, a high-priority queue into which the read misses go, and a low-priority queue into which the separate operations constituting the destage are sent. We assume a non-preemptive priority queueing discipline; a low-priority operation will commence only if there are no high-priority operations waiting, but once a low-priority operation starts, it will not be preempted by a high-priority operation. This is because preemption is not available on standard disks today.

All the quantities needed to calculate the read miss response time at the devices are thus known. The total read miss response time is calculated as the sum of the response time through the cache (M/M/4 queue) and the response time at the device (a server with 2 classes of jobs, each with exponential service times and a non-preemptive priority discipline). The response time at the device consists of a waiting time and a service time. The waiting time is given by the equation ([6])

(λ1 × 2 × ST1² + λ2 × 2 × ST2²) / (2 × (1 - p1))

where λ1 and ST1 are the arrival rate and service time of read misses, λ2 and ST2 are the arrival rate and service time of destage operations, and p1 is the device utilization due to read misses. The average request response time may now be calculated as a function of read hit response time, write hit response time, write miss response time and read miss response time. For simplicity, let KB, the read miss rate, be R, let KC, the destage rate due to write hits on clean pages, be D1 and let KD, the destage rate due to write misses, be D2. Then, each device gets R/(N + 1) read misses per second, and it also gets 3D1/(N + 1) + 4D2/(N + 1) destage operations per second.
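A sketch of this device-level calculation follows; read misses are the high-priority class and destage operations the low-priority class. The rates and service times in the example are illustrative assumptions, not the paper's parameters.

    # Read-miss response time at one disk of a cached N + P array, using the
    # non-preemptive priority waiting-time formula above. Values are assumptions.

    def read_miss_response_at_disk(R, D1, D2, N, ST1, ST2):
        """R: read-miss rate, D1/D2: destage rates from clean-page write hits and
        from write misses (all per second), ST1/ST2: service times in seconds."""
        lam1 = R / (N + 1)                          # read misses per disk
        lam2 = 3 * D1 / (N + 1) + 4 * D2 / (N + 1)  # destage operations per disk
        rho1 = lam1 * ST1                           # device busy due to read misses
        W = (lam1 * 2 * ST1**2 + lam2 * 2 * ST2**2) / (2 * (1 - rho1))
        return W + ST1                              # waiting time + service time

    # Example with assumed rates (R = 210/s, D1 = 120/s, D2 = 0) and 16.5 ms
    # service times for both classes.
    print(read_miss_response_at_disk(R=210.0, D1=120.0, D2=0.0, N=8,
                                     ST1=0.0165, ST2=0.0165))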

Figure 2. Cached Array Model. [Diagram: an M/M/4 queue representing the four storage directors, in front of N + 1 disks; each disk has a high-priority queue receiving R/(N + 1) read misses per second and a low-priority queue receiving 3D1/(N + 1) + 4D2/(N + 1) destage operations per second.]

2.2.2. Degraded Mode Analysis

After the failure of one of the disks, the R/(N + 1) reads/sec to the failed device increases the traffic to all the surviving disks by R/(N + 1) reads/sec. Next consider the 3D1/(N + 1) destage operations, consisting of D1/(N + 1) data writes, D1/(N + 1) parity reads and D1/(N + 1) parity writes. The D1/(N + 1) writes of data per second that used to go to the failed device do not affect the traffic to the surviving disks in any way. This is because we still need to read and write parity from the corresponding parity disk. 9 Similarly, the D1/(N + 1) reads of parity and writes of parity that used to go to the failed device do not affect the traffic to the surviving disks at all. This is because we still need to write the data to the corresponding data disk that survives.

Finally, consider the 4D2/(N + 1) destage operations per sec to the failed disk, consisting of 2D2/(N + 1) reads and writes of data and 2D2/(N + 1) reads and writes of parity. Consider the 2D2/(N + 1) reads and writes of data to the failed disk which used to cause reads and writes of parity on other disks. Now, for the disk that contains parity it eliminates the need to read parity, but on all other disks it increases the reads of data. This is because parity is now calculated by reading data from all the remaining disks and XORing it with new data to produce new parity. So, each surviving disk does 2D2/(N + 1) reads and writes of data - D2/(N × (N + 1)) reads of parity + ((N - 1)/N) × (D2/(N + 1)) reads of data. Next, consider the 2D2/(N + 1) reads and writes of parity to the failed disk which used to cause reads and writes of data on surviving disks. Now, we no longer need to read data from surviving disks since we can write data to them directly, so traffic to surviving disks is decreased. Each surviving disk does 2D2/(N + 1) reads and writes of parity - D2/(N × (N + 1)) reads of data.


Putting it all together, each surviving disk gets

2R/(N + 1) reads/sec + [3D1/(N + 1) + 4D2/(N + 1) + ((N - 1)/N) × D2/(N + 1) - 2D2/(N × (N + 1))] destage ops/sec

This simplifies to

2R/(N + 1) reads/sec + [3D1/(N + 1) + 4D2/(N + 1) + ((N - 3)/N) × D2/(N + 1)] destage ops/sec

At the devices, we know the service time for the reads and the service time for the destages, so we can calculate the response time of reads to a surviving disk T1 as described previously for the M/M/1 queue with 2 priorities using the equation from [6]. For reads to the failed device, the response time is calculated as

TN = [HN/H2 + (4p/11) × (1 - HN/H2)] × ((12 - p)/8) × T1

The read response time Tr can be calculated as a weighted average of T1 and TN.

2.3. Rebuild Analysis

An excellent analysis of rebuild time for disk arrays is made in [22]. That paper considers three different rebuild strategies--baseline rebuild (sequential reconstruction of blocks from the failed disk to a spare disk), rebuild with redirection of reads (satisfying some read requests from the spare disk if the block requested has already been rebuilt to the spare disk) and piggybacking rebuild (a read to the failed disk that has not yet been rebuilt must be reconstructed anyway; write the reconstructed block to the spare disk now rather than wait for the sequential rebuild process to reach this block). In this paper, we only consider the baseline rebuild procedure. However, we make a more exact analysis of rebuild than [22], which only uses device utilizations to calculate rebuild times and does not differentiate between rebuilding the first of a group of tracks and rebuilding the rest of the tracks in that group successively. Also, instead of concentrating on calculating the rebuild time (as in [22]), we are also interested in calculating the performance of the array to user requests during the rebuild process. 10

In making this analysis, we assume that user requests get priority over background rebuild activity. This is analyzed as follows. Each device in the array will process normal user requests. Only when there are no user requests in its queue will a device attempt to rebuild the next sequential track. At the end of rebuilding this track, if a new user request has arrived, the device returns to handle the new user request. Otherwise, it continues to rebuild the next sequential track, always examining the queue for user requests at the end of each track rebuild.

Figure 3 on page 271 shows a simplified, discrete-time, Markov model representation of the states of one of the surviving disks in the array. From Figure 3 on page 271, we see

that there are three states for a device--a first state in which it is processing normal user requests, a second state in which it is seeking and then reading a track to be used for rebuild, and a third state in which the next track is read for rebuild (no seek required since we just finished rebuilding the last track; we ignore single cylinder seeks at the end of a cylinder). The second and third states are entered only if there are no user requests to be processed by the device. Each of the three states is divided into substates, where the number of substates for a state is the mean time it takes to complete the operation represented by that state, in milliseconds.

Figure 3. States of a Surviving Disk During Rebuild. [Discrete-time Markov state diagram of the three states described here.]

There are u substates in the first state, where u is the average service time ST of a user request. Once the first substate is entered, the device goes to the second substate in the second time unit, the third substate in the third time unit, and so on, until it reaches the uth substate in the uth time unit. In the next time unit, two state transitions are possible--if there are more user requests to process, the device reenters the first substate of the first state with probability (1 - q); if there are no user requests to process then the device enters the first of n substates of the second state, where n is the time to seek and read a track. After n time units in these substates, two transitions are again possible--the device may return to the first substate of the first state with probability (1 - p) if a new user request has arrived in n time units, else it proceeds to the first substate of the third state where it proceeds to rebuild the next track. In the latter case, it stays in these substates for m time units (the time to read a track) after which one of two transitions are possible--return to the first substate


of the third state with probability r (the probability that no new user request arrives in time m), or return to the first substate of the first state with probability (1 - r). The probability of being in any of the substates of the first state is A, of being in any of the substates of the second state is B and of being in any of the substates of the third state is C. In Appendix A, we show how to solve for A, B and C. It is easy to see that tracks are being rebuilt at a rate of B + C tracks/time unit. Rebuild time is simply the total number of tracks on the device divided by B + C. We assume, for this analysis, that there is little difference between the rate at which the different surviving disks proceed through rebuild, so that the difference in time between when the fastest surviving disk reads the last track and the slowest surviving disk reads the last track is small enough compared to the total rebuild time to be ignored. 11

Next, let us define Δ as the difference between the response time for a particular user request in rebuild mode and the response time for the same user request in degraded mode. In the worst case, if the user request arrives just after a track rebuild has started, it will need to wait n (recall that n is the time to seek and read a track for rebuild) time units, so Δ is n; in the best case, the user request arrives just when a track rebuild has completed, so Δ is 0. The actual value lies between n and 0. If a track rebuild always took n time units, the average value of Δ would be n/2. Since a track rebuild may take either n or m time units, the average value of Δ is calculated as a weighted average of n/2 and m/2. The average value of Δ is calculated in the appendix. Then, the rebuild response time is calculated as

degraded response time + Δ
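Appendix A (not reproduced here) solves the Markov chain analytically. Purely as an illustration of the rebuild policy, the sketch below estimates rebuild time by directly simulating a single surviving disk whose user requests have non-preemptive priority over track rebuilds; all parameter values are assumptions.

    import random

    # Illustrative simulation of the baseline rebuild policy described above:
    # a disk rebuilds one track whenever its user queue is empty; the first track
    # of a rebuild run costs n_first ms (seek + read), later tracks m_next ms.

    def simulate_rebuild(lam, mean_service, n_first, m_next, tracks, seed=1):
        """lam: user arrivals per ms, mean_service: mean user service time (ms)."""
        random.seed(seed)
        t, queue, rebuilt, in_run = 0.0, 0, 0, False
        next_arrival = random.expovariate(lam)
        while rebuilt < tracks:
            while next_arrival <= t:            # queue up arrivals seen so far
                queue += 1
                next_arrival += random.expovariate(lam)
            if queue > 0:                       # user requests have priority
                t += random.expovariate(1.0 / mean_service)
                queue -= 1
                in_run = False                  # next rebuild needs a seek again
            else:                               # idle: rebuild the next track
                t += n_first if not in_run else m_next
                rebuilt += 1
                in_run = True
        return t                                # total rebuild time in ms

    # Example with assumed numbers: 30,000 tracks, 16.5 ms user service,
    # 30 user requests/sec, 27 ms first track of a run, 11 ms per later track.
    print(simulate_rebuild(lam=0.030, mean_service=16.5, n_first=27.0,
                           m_next=11.0, tracks=30000) / 60000.0, "minutes")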

3. Assumptions and Cases

3.1. System Organizations--Equal Capacity and Equal Cost

While the preceding analysis made no assumptions on the number of disks, most of the results we present assume that 32 disks are needed to store all the data in the application. We obtained similar results for larger systems with 80 disks, but space prohibits us from reporting those results.

We present results from the analysis of four subsystems--first, a system with 32 independent disks not organized as an array; second, a subsystem organized as two 16 + P arrays and a shared spare (total of 35 disks: 32 data, 2 parity, 1 spare); third, a subsystem organized as four 8 + P arrays with a shared spare (total of 37 disks); and finally a system organized as eight 4 + P arrays and a shared spare (41 disks total). Each of these array organizations is analyzed with and without a Non-Volatile disk cache.

The above four subsystems are compared because they all are able to store the same amount of user data--we call these equal capacity subsystems. However, these subsystems use different numbers of disks and hence are not all the same cost. Briefly, we also present some results comparing the subsystem organized as four 8 + P arrays with another subsystem organized as two 17 + P arrays--we call these equal cost subsystems.


3.2. Workload Parameters

We are particularly interested in the performance of the subsystems for database and transaction processing applications. Our experience with such applications indicates that some of these workloads are dominated by single block read and write requests. 12 Therefore, our entire analysis assumes single-block accesses to the arrays. Next, we have assumed that the application or database administrator is able to carefully balance the workload between the disks in the I/O subsystem so that no significant skew exists among the different devices in the array or between the different arrays in the subsystem. Consistent with the assumption of uniform skew is the assumption that the average seek per request is close to that which would be achieved if every request were truly random. While it is well-known that clustered accesses tend to reduce the average seek considerably from that for truly random accesses ([17]), it is also true that such systems tend to see considerable skew between devices. When one attempts to flatten skew by striping across devices such as is done in arrays, one loses the seek affinity between requests and gets a situation closer to random.

The workload parameters we have used are largely those obtained by analysis of IMS customer traces ([12]). The average block size read or written is set to 4096 bytes. The number of blocks destaged together is set at 2; that is, when a block is destaged we assume there is one other block on that same track which may also be destaged with very little extra penalty. The read fraction r is varied from 0.5 to 0.875. The write hit ratio is varied from 0.3 to 1. The NVS hit ratio (the fraction of write hits that are hits on dirty pages) is set to a constant of 0.6. The read hit ratio is varied between 0.1 and 0.7. The above read and write hit ratios are for the unarrayed subsystems. For the arrayed subsystems, we assume lower hit ratios, since their caches must contain both the old and new values of data blocks in cache until the block is destaged. 13

3.3. Disk, Controller and Channel Parameters

We have largely used the IBM 3390 Model 1 disk drive ([13]) characteristics as the basis for choosing performance parameters for the disk drive in our study. So, we have picked an average seek time of 10 msecs and a data rate of 4.2 MB/sec. The 3390 has a latency of 7.1 msecs, but we have chosen 5.56 msecs for our device in this study, since we expect that future devices will have smaller form factor than the 3390 and therefore be able to spin faster. Finally, for the rebuild calculations, we have assumed that there are 30,000 tracks per device (15 tracks per cylinder and 2000 cylinders). For the controller, we have assumed an overhead of 0.5 msecs, and for the channel which attaches the controller to the host system, we have assumed a data rate capability of 18 MB/sec.
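For reference, the single-block service times implied by these parameters can be computed as below. The decomposition into seek, latency and transfer, and the near-full extra revolution charged to a read-modify-write, are assumptions based on the model description in Section 2, not figures quoted by the paper.

    # Approximate single-block service times implied by the drive parameters above.

    seek_ms     = 10.0                            # average seek
    latency_ms  = 5.56                            # average rotational latency
    rev_ms      = 2 * latency_ms                  # one full revolution ~ 11.1 ms
    xfer_ms     = 4096 / 4.2e6 * 1000.0           # ~0.98 ms for a 4 KB block

    read_ms = seek_ms + latency_ms + xfer_ms      # ~16.5 ms
    # RMW: seek, latency, read the block, wait most of a revolution, write it back.
    rmw_ms  = seek_ms + latency_ms + xfer_ms + (rev_ms - xfer_ms) + xfer_ms  # ~27.7 ms
    print(round(read_ms, 2), round(rmw_ms, 2))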

3.4. Performance Measures

The performance of a disk subsystem is presented as a curve of response time (including queueing delays) versus I/O rate. [7] predicts that (on IBM systems) the I/O rate as a function of GB of data will vary between a low of 1 acc/sec/GB for user data to a high of


11 acc/sec/GB for system data (in the 1995 time period). If we assume that the drive we have picked is a 3.5" form-factor drive, and extrapolating out a few years the fact that 2 GB drives are already available from drive manufacturers like IBM, we let our drive have a capacity of 3 GB. Then, our 32 drive systems store approximately 100 GB. Therefore, we expect the I/O rate to this disk subsystem to vary between 100 IOs/sec for user data to 1100 IOs/sec for system data. For the uncached configurations, we will use 500 IOs/sec as the I/O rates at which to compare response times and for the cached configurations, we will use 1000 IOs/sec as the I/O rates at which to compare response times. This allows comparisons at medium and high I/O rates. The lower the response time curve over the range of I/O rates considered typical, the better the performance of the disk subsystem.

Response time curves will be presented for the different disk subsystems in normal, degraded and rebuild modes. Also, we will present a curve of rebuild time as a function of I/O rate. The arrays generally operate in normal mode. When a disk fails, the array drops into degraded mode. If a spare is available, the array spends almost no time in degraded mode, since it almost immediately begins the rebuild operation and so moves into rebuild mode. It remains in rebuild mode for a time equal to the rebuild time. Finally, it leaves rebuild mode and returns to normal mode when the rebuild has been completed.

4. Results for Arrays Without Controller Caches

Our main emphasis in this paper is on the performance of Cached RAIDs. So, we will minimize the number of results we present for the uncached cases. There are three main results we illustrate with a few graphs. Our first result is that uncached RAIDs are unattractive because of how significantly their performance suffers in comparison to unarrayed systems. Our second result is that small arrays are better than large arrays if our goal is to minimize average response time of the subsystem. Our third result is that large arrays are better than small arrays if our goal is to minimize the average response time of any array in the subsystem.

In Figure 4 on page 275, we present response time results for unarrayed and arrayed systems in normal, degraded and rebuild mode. The first of the two graphs plots response time versus read to write ratio for different subsystems. The second of the two graphs shows the ratio between response time of the arrayed system and response time of the unarrayed system versus read to write ratio. The results indicate that the arrayed systems can be anywhere from 30% worse (when the workload is mostly reads) to 200% worse (when there are an equal number of reads and writes and the array is operating in rebuild mode) compared to unarrayed subsystems. This illustrates our first result.

In Figure 5 on page 276, we present response time results for arrays of different parity widths. The first graph of the figure compares equal capacity systems. There are four curves in this graph, one for unarrayed subsystems, and the other 3 for arrayed subsystems in normal, degraded and rebuild modes respectively. Clearly, we see that arrays built of a small number of drives have much better response times than arrays built of a large number of drives. For example, the smallest array has a response time of 27 msecs in rebuild mode, whereas the largest array has a response time of 42 msecs in rebuild mode. However, these results are somewhat unfair in that arrays built of a small number of drives have more drives

overall than arrays built of a large number of drives (arrays built of 2 + P have 48 drives total, but arrays built of 16 + P have only 34 drives total).

Figure 4. Unarrayed Versus Arrayed Performance. [Response time at 500 IOs/sec versus R/W ratio for 32 unarrayed devices and a 4*(8+P) subsystem in normal, degraded and rebuild modes, together with the ratio of arrayed to unarrayed response time.]

To remove this unfairness, the second graph in the figure shows a comparison between a 4*(8 + P) system and a 2*(17 + P) system, both of which have the same number of drives (36). We see that both systems have the same normal mode performance, but once again, the systems built of small arrays have better average degraded and rebuild mode performance. This illustrates our second result. However, when we look at the response time of IOs to the broken array, we see that systems built of large arrays have the edge over systems built of small arrays. This illustrates our third result.

This third result is counter-intuitive, but may be explained as follows. Consider an N + P array which has a broken drive. For simplicity, consider only the read requests. 1/(N + 1) of the reads that used to go to the broken device increase the utilization of all the other devices in the array by approximately a factor of 2, independent of N. Let the device utilization of a device be x in normal mode; then it is 2x now (independent of N, the parity width of the array). N/(N + 1) of the reads go to a surviving disk and see response time T (independent of N). 1/(N + 1) of the reads that used to go to the broken disk now require accessing all other N disks and see response time Q (the larger N is, the larger Q is). The average response time of the array is

NT/(N + 1) + Q/(N + 1)

Figure 5. Performance of Arrays Versus Array Width. [Left: response time versus array width (not including parity) at R/W ratio 7:1 and 500 IOs/sec for the equal capacity subsystems 2*(16+P), 4*(8+P), 8*(4+P) and 16*(2+P), compared with 32 unarrayed devices, in normal, degraded and rebuild modes. Right: equal cost comparison of 4*(8+P) versus 2*(17+P), including worst-case degraded and rebuild mode curves.]

The larger the N, the greater the effect of the first term, and the better the response time of the broken array in degraded mode tends to be. For example, for a 17 + P array, the weight of the first term is 17/18 whereas for a 4 + P array, the weight of the first term is only 4/5. The fact that Q is larger for larger N tends to be overcome by the division by N + 1. This explains why, in Figure 5, the larger parity width array actually has better response time than the smaller parity width array.
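A quick numerical check of this weighting argument is shown below; the values of T and Q are purely illustrative assumptions (in practice they depend on utilization), and only the (NT + Q)/(N + 1) weighting comes from the text.

    # Best worst-case illustration: average read response of the broken array.

    def degraded_read_response(N, T, Q):
        # N/(N+1) of the reads see T; 1/(N+1) see the N-way forked response Q.
        return (N * T + Q) / (N + 1)

    T = 20.0                                  # ms on a surviving disk (assumed)
    for N, Q in [(4, 60.0), (17, 90.0)]:      # Q grows with N, but slower than N + 1
        print(N, degraded_read_response(N, T, Q))
    # 4 + P:  (4*20 + 60)/5   = 28.0 ms
    # 17 + P: (17*20 + 90)/18 ~ 23.9 ms  -- the larger array looks better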

5. Results for Arrays With Controller Caches

We begin by presenting results for a base case where the read to write ratio is set to 7, the read hit ratio is 70% and the write hit ratio is set to 100%. These are typical for IMS. Next, we show sensitivity of the results to variations in write hit ratio. Finally, we show sensitivity to variations in read hit ratio. Sensitivity to variations in read to write ratio are also shown.


5.1. Results for Base Case

We first present normal mode results, then degraded mode results, then rebuild mode results and, finally, rebuild time results.

5.1.1. Normal Mode Results

Results of response time versus I/O rate are presented in Figure 6 on page 278. 14 Recall that the expected I/O rate varies between 100 and 1100 IOs/sec. The graphs show that cached systems of 32 drives are quite capable of handling more than 1000 IOs/sec. The results also indicate that the unarrayed subsystem of 32 disks is the best performing subsystem. However, we see that the arrayed subsystems are not much worse. For example, even at 1000 IOs/sec, the best arrayed subsystem has a response time that is only 3% worse and even the worst arrayed subsystem has a response time that is only 9% worse. A number of reasons contribute to this effect--the most important is that the destage of writes is no longer in the response time, since destages are done asynchronously. Other smaller effects are that destages now need only 3 accesses rather than 4 (old data is always in cache), and that destages are combined with other destages to minimize disk seeking. Taken together, this means that the extra disk activity for arrays compared to unarrayed subsystems is less when there is a cache than it is when there is no cache.

Among the subsystems, the 8*(4 + P) subsystem has slightly better performance than the 4*(8 + P) subsystem which, in turn, is better than the 2*(16 + P) subsystem. These slight differences in favor of the subsystems built from arrays with small parity widths can be strictly accounted for on the basis of the larger number of total disk drives in those subsystems.

It is illuminating to see the average per disk utilization shown in Figure 7 on page 279 for both the unarrayed case and the 4*(8 + P) case. Disk (or device) utilization has two components--disk utilization due to read misses and disk utilization due to destages. The disk utilization for read misses actually goes down when going from the unarrayed case to the arrayed case; this is because there are more disks (36 versus 32) for the arrayed case, so the per disk utilization is lower. On the other hand, the disk utilization due to destages is higher for the arrayed case, in spite of the larger number of disks, because of the extra work needed to update parity. The total disk utilization is higher for the arrayed case compared to the unarrayed case, accounting for the slightly worse performance of arrays which we previously saw.

5.1.2. Degraded Mode Results

For brevity, we only present average degraded mode performance results. The equal capacity average degraded mode performance results are presented in Figure 8 on page 280. It is interesting to see that the performance of the arrayed subsystems continues to be quite close to that of the unarrayed system. Even at 1000 IOs/sec, the 4*(8 + P) system has a response time that is only 18% higher than that of the unarrayed system, and the 8*(4 + P) subsystem response time is only 9% higher than that of the unarrayed system. With NVS caching, the performance of arrays relative to unarrayed systems is quite acceptable. As before, arrays of small parity widths have the best average degraded mode performance.

Figure 6. Normal Mode Performance. [Response time versus I/O rate with cache/NVS; R/W ratio = 7, read hit ratio = 0.7, write hit ratio = 1; curves for 32 unarrayed devices and the 4*(8+P), 8*(4+P) and 2*(16+P) arrays.]

5.1.3. Rebuild Mode Results

The equal capacity average rebuild mode results are presented in Figure 9 on page 281.

The subsystems built from arrays with small parity widths have better average rebuild performance. Also, the arrayed subsystems do not have substantially worse response time than the unarrayed subsystems; the response time of the 8*(4 + P) subsystem is only 12% higher than that of the unarrayed subsystem at 1000 IOs/sec and the response time for the 4*(8 + P) subsystem is only 24% higher than that of the unarrayed subsystem.

Rebuild time plots are shown in Figure 10 on page 282. The curves indicate that the rebuild time is as low as about 8 minutes at 1000 IOs/sec. The results also indicate that, for the equal capacity comparisons, the subsystems built from arrays with small parity widths take less time to rebuild. While not shown, we found that at equal cost, there is no difference in time to rebuild between subsystems built from arrays with small or large parity widths.

Figure 7. Device Utilizations. [Device utilization versus I/O rate in normal mode (R/W ratio = 7, read hit ratio = 0.7, write hit ratio = 1) for the 4*(8+P) and 32 unarrayed device cases, broken into the components due to read misses and due to destages.]

5.1.4. Summary Results

In Figure 11 on page 283, we show a plot of subsystem performance over time for three cases--an unarrayed subsystem, a 4*(8 + P) subsystem, and a 2*(17 + P) subsystem. For performance, we have chosen to show response time at 1000 IOs/sec (the highest I/O rate) in order to see how bad arrayed subsystems can get relative to unarrayed subsystems. In the beginning, all subsystems are in normal mode and have the best performance. Then, the arrayed subsystems enter degraded mode briefly, before they enter rebuild mode. The amount of time spent in rebuild mode is shown to scale. Finally, following rebuild, all the arrayed subsystems return to normal mode. We see that, in normal mode, the arrayed subsystem response time is only 6.4% higher than that of the unarrayed subsystem (compared to almost 50% for the uncached case). In degraded mode, the subsystem built from small parity width arrays has a response time 18% higher and the subsystem built from large parity width arrays has a response time 27% higher than the unarrayed subsystem. Finally, in rebuild mode, we see that the 4*(8 + P) subsystem has 24% higher response time than the unarrayed subsystem and we also see that the 2*(17 + P) subsystem has

38% higher response time than the unarrayed subsystem. As we had expected, the cached controllers are able to bring arrayed subsystem performance closer to unarrayed subsystem performance. In fact, at the more realistic I/O rate of 500 IOs/sec (not shown), the normal mode response time of the arrayed subsystem is only 4.1% higher, the average degraded mode response times of the arrayed subsystems are 10% and 15% higher respectively, and the average rebuild mode response times are 13% and 19% higher than the response time of the unarrayed subsystem.

Figure 8. Average Degraded Mode Performance. [Response time versus I/O rate with cache/NVS; R/W ratio = 7, read hit ratio = 0.7, write hit ratio = 1; curves for 32 unarrayed devices and the 4*(8+P), 8*(4+P) and 2*(16+P) arrays.]

5.2. Variations in Read to Write Ratio and Write Hit Ratio

The results so far indicate that uncached arrays have unacceptable performance relative to uncached, unarrayed subsystems, while cached arrays have very acceptable performance relative to cached unarrayed subsystems. Here, we explore whether the latter result still holds true when we vary the read to write ratio and the write hit ratio. Figure 12 on page 283 shows a plot of the ratio of arrayed response time to unarrayed response time for different values of read to write ratio and write hit ratio (in normal mode). The worst case for arrays (performance 30% worse) is when the write hit ratio is very low (0.3) and the read to write ratio is 1:1. Clearly, arrays are at their worst when there are a lot of writes in the workload and the write hit ratio is low, since arrays do poorly on writes, particularly write misses, which require 4 disk accesses.
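
To make the write penalty concrete, the short sketch below (our own illustration, not part of the paper's analytical model; the helper names are ours) counts the back-end disk accesses needed to destage a single dirty block, using the access counts given here and in note 8: four accesses when the old data is not in the cache, three when it is, versus a single write for an unarrayed disk.

def raid5_destage_accesses(old_data_in_cache):
    """Back-end disk accesses to destage one dirty block in a RAID5 array."""
    if old_data_in_cache:
        # read old parity, write new parity, write new data (see note 8)
        return 3
    # read old data, read old parity, write new data, write new parity
    return 4

def unarrayed_destage_accesses():
    """An unarrayed disk only needs the single write of the new data."""
    return 1

if __name__ == "__main__":
    for in_cache in (True, False):
        print(in_cache, raid5_destage_accesses(in_cache), "vs", unarrayed_destage_accesses())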

Figure 9. Average Rebuild Mode Performance. Response time (msec, 0 to 20) versus IOs/sec in rebuild mode, with cache/NVS; R/W ratio = 7, read h.r. = 0.7, write h.r. = 1. Curves are shown for 32 unarrayed devices and for 4*(8+P), 8*(4+P) and 2*(16+P) arrays.

It is interesting and illuminating to compare the plots of Figure 13 on page 284, which shows unarrayed response time as a function of read to write ratio and write hit ratio, with the plots of Figure 14 on page 284, which shows arrayed response time as a function of read to write ratio and write hit ratio. Figure 12, which we already saw, was in fact obtained as the ratio of the response times from these two figures. Note that response times get better as the fraction of writes in the workload increases, since writes are accomplished at electronic speeds in NVS whereas reads require disk access. However, the array response times do not get better as fast or as much as the unarrayed response times, so the ratio between the two gets worse as the fraction of writes in the workload increases.
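
The following toy calculation illustrates this trend. All of its numbers are our own placeholders, not the paper's model: a 12 ms disk service time, a 0.5 ms cache service time, 36 array disks versus 32 unarrayed disks, an M/M/1 queue per device, and four back-end accesses per array destage versus one for the unarrayed case. With these assumptions both front-end response times drop as the write fraction grows, but the arrayed subsystem improves less, so the arrayed-to-unarrayed ratio climbs.

def response_time(io_rate, write_frac, read_hit, write_hit,
                  destage_accesses, disks, service_ms=12.0, cache_ms=0.5):
    """Approximate front-end response time (ms); one M/M/1 queue per device."""
    read_miss_rate = io_rate * (1.0 - write_frac) * (1.0 - read_hit)
    destage_rate = io_rate * write_frac * (1.0 - write_hit) * destage_accesses
    rho = (read_miss_rate + destage_rate) * (service_ms / 1000.0) / disks
    if rho >= 1.0:
        return float("inf")                      # devices saturated
    disk_response = service_ms / (1.0 - rho)     # M/M/1 response time at a device
    # Writes and read hits complete at NVS/cache speed; read misses go to disk.
    return (write_frac * cache_ms
            + (1.0 - write_frac) * (read_hit * cache_ms
                                    + (1.0 - read_hit) * disk_response))

if __name__ == "__main__":
    for wf in (0.125, 0.25, 0.5):                # write fractions for R/W = 7:1, 3:1, 1:1
        arrayed = response_time(1000, wf, 0.7, 0.7, destage_accesses=4, disks=36)
        unarrayed = response_time(1000, wf, 0.7, 0.7, destage_accesses=1, disks=32)
        print(f"write fraction {wf:.3f}: arrayed/unarrayed ratio = {arrayed / unarrayed:.2f}")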

From Figure 15 on page 285, we see that the arrayed subsystem response time is never more than 40% worse than the unarrayed subsystem response time in degraded mode; and from Figure 16 on page 285, we see that the arrayed subsystem response time is never more than 45% worse than the unarrayed subsystem response time in rebuild mode.

5.3. Variations in Read Hit Ratio

Next, we set read hit ratio to 10% instead of 70%. Such a subsystem is closer to an uncached subsystem, since the hit ratio is so low as to make the cache mostly ineffective for read hits. Neither the arrayed nor the unarrayed subsystems are capable of doing 1000 IOs/sec, so we compare their relative performance at 400 IOs/sec. Figure 17 on page 286, Figure 18 on page 286 and Figure 19 on page 287 show that the arrayed response times are never more than 20% worse in normal mode, 28% worse in degraded mode and 35% worse in rebuild mode than unarrayed subsystems. This indicates that NVS caching is effective in making array performance acceptably close to that of unarrayed subsystems even when read hit ratios are not good.

Figure 10. Rebuild Time. Rebuild time as a function of the I/O rate (IOs/sec), with cache/NVS; R/W ratio = 7, read h.r. = 0.7, write h.r. = 1. Curves are shown for 4*(8+P), 8*(4+P) and 2*(16+P) arrays.

6. Comparing Arrays with Caches to Arrays without Caches

We conclude by comparing the performance of cached arrays to uncached arrays. We believe that a small amount of cache can be added to a disk array without significantly affecting cost. The results of this section show that even a small cache can improve performance (both throughput and response time) significantly.

Figure 20 on page 287 compares the performance of 4*(8 + P) arrays when subjected to a workload with a read to write ratio of 7:1. Three curves are shown in this figure--a curve for uncached arrays; a curve for cached arrays that get good hit ratios (read hit ratio of 0.7 and write hit ratio of 1); and a curve for cached arrays that get poor hit ratios (read hit ratio of 0.1 and write hit ratio of 0.3). A cached array may get poor hit ratios either because the size of the cache used is small or because the workload is not cache friendly. Even though we stopped the graph at 3000 IOs/sec, the cached array with good hit ratios was able to reach a throughput of 4000 IOs/sec. On the other hand, the uncached array maximum throughput was 1600 IOs/sec and the cached array with poor hit ratios had a maximum throughput of 1800 IOs/sec. The cached array also has 4 to 40 times lower response time if it gets good hit ratios; and 1.5 to 2 times lower response time if it gets poor hit ratios.

Figure 11. Time Plot. Time plot of array performance at 1000 IOs/sec with NVS cache; R/W ratio = 7, read h.r. = 0.7, write h.r. = 1. Response time is plotted against elapsed time (minutes) for 32 unarrayed devices, a 4*(8+P) subsystem and a 2*(17+P) subsystem.

Figure 12. Ratio of Arrayed to Unarrayed Response Times. 32 unarrayed devices versus 4*(8+P) with NVS cache; normal mode performance at 1000 IOs/sec, read h.r. = 0.7. The ratio is plotted against R/W ratio for write hit ratios of 1, 0.7 and 0.3.

Figure 13. Unarrayed Response Times. Performance of 32 unarrayed devices with NVS cache at 1000 IOs/sec; read h.r. = 0.7. Response time is plotted against R/W ratio for write hit ratios of 1, 0.7 and 0.3.

Figure 14. Arrayed Response Times. Normal mode performance of 4*(8+P) arrays at 1000 IOs/sec with NVS cache; read h.r. = 0.7. Response time is plotted against R/W ratio for write hit ratios of 1, 0.7 and 0.3.

In Figure 21 on page 288, we compare the performance of these same arrays when the workload has a read to write ratio of 1:1. Since there are more writes in this workload, we expect write caching to be very beneficial and this is borne out by the results. We found that a cached array with good hit ratios has 5 times the maximum throughput of the equivalent uncached array for this workload. Also, response times are a factor of 10 to 40 lower for the cached array with good hit ratios. With poor hit ratios, the cached array is still a factor of 2 better in throughput and a factor of 4 to 10 better in response time.

Figure 15. Arrayed to Unarrayed in Degraded Mode. 32 unarrayed devices versus 4*(8+P) with NVS cache; degraded mode performance at 1000 IOs/sec, read h.r. = 0.7. The response time ratio is plotted against R/W ratio for write hit ratios of 1, 0.7 and 0.3.

Figure 16. Arrayed to Unarrayed in Rebuild Mode. 32 unarrayed devices versus 4*(8+P) with NVS cache; rebuild mode performance at 1000 IOs/sec, read h.r. = 0.7. The response time ratio is plotted against R/W ratio for write hit ratios of 1, 0.7 and 0.3.

7. Conclusions

In this paper, we have developed analytical models for evaluating the performance of arrayed disk subsystems in normal mode (all disks operational), in degraded mode (one disk broken, rebuild not started) and in rebuild mode (one disk broken, rebuild started but not finished). Models for estimating rebuild time under the assumption that user requests get priority and can preempt rebuild activity have also been developed. Separate models were developed for cached and uncached disk controllers.

Figure 17. Arrayed to Unarrayed in Normal Mode. 32 unarrayed devices versus 4*(8+P) with NVS cache; normal mode performance at 400 IOs/sec, read h.r. = 0.1. The response time ratio is plotted against R/W ratio for write hit ratios of 1, 0.7 and 0.3.

Figure 18. Arrayed to Unarrayed in Degraded Mode. 32 unarrayed devices versus 4*(8+P) with NVS cache; degraded mode performance at 400 IOs/sec, read h.r. = 0.1. The response time ratio is plotted against R/W ratio for write hit ratios of 1, 0.7 and 0.3.

Figure 19. Arrayed to Unarrayed in Rebuild Mode. 32 unarrayed devices versus 4*(8+P) with NVS cache; rebuild mode performance at 400 IOs/sec, read h.r. = 0.1. The response time ratio is plotted against R/W ratio for write hit ratios of 1, 0.7 and 0.3.

Figure 20. Cached Versus Uncached Arrays. R/W ratio = 7, 4*(8+P) arrays. Response time versus IOs/sec for uncached arrays, for cached arrays with poor hit ratios (read h.r. = 0.1, write h.r. = 0.3) and for cached arrays with good hit ratios (read h.r. = 0.7, write h.r. = 1).

Using these models, we then evaluated the performance of disk subsystems built from (4 + P) RAID level 5 arrays, (8 + P) RAID level 5 arrays, (16 + P) RAID level 5 arrays, and (17 + P) RAID level 5 arrays in normal, degraded and rebuild modes when driven by a database workload such as those seen on systems running popular database managers. In particular, we assumed single-block accesses, flat device skew and little seek affinity. In normal mode, we found there was no difference in performance between disk subsystems built from arrays with small or large parity widths as long as the total number of disks used is the same. In degraded and rebuild modes, we found and explained a counter-intuitive result we called the best worst-case effect for large arrays, which showed that the larger the array parity width in a disk subsystem, the better the degraded (rebuild) mode performance of the broken array. However, we also found that the larger the array parity width in a disk subsystem, the worse the average degraded (rebuild) mode performance of the entire subsystem of arrays that includes the broken array. Put another way, larger parity widths minimize the worst response time seen by any user, if one assumes that all of a user's requests are contained within one array, since they minimize the worst response time to any array. However, a subsystem built from arrays with small parity widths has the best average performance in degraded and in rebuild modes.

Figure 21. Cached Versus Uncached Arrays. R/W ratio = 1, 4*(8+P) arrays. Response time versus IOs/sec for uncached arrays, for cached arrays with poor hit ratios (read h.r. = 0.1, write h.r. = 0.3) and for cached arrays with good hit ratios (read h.r. = 0.7, write h.r. = 1).

For small request transaction-processing workloads, our results showed that arrayed subsystems have significantly worse performance than equivalent unarrayed subsystems built of the same drives, when no caching is used in the disk controller. For the specific drive, controller, workload and system parameters we used for our calculations, we found that without a cache in the controller, the response time of an arrayed subsystem (at typical I/O rates) in normal mode is 50% higher than that of an unarrayed subsystem. In rebuild mode, we found that the 4*(8 + P) subsystem had 100% higher average response time and the 2*(17 + P) subsystem had almost 200% higher response time than the equivalent unarrayed subsystem.

With cached controllers, our results showed that performance differences between arrayed and equivalent unarrayed subsystems shrank considerably. We found that in normal mode, the arrayed subsystem response time is only 4.1% higher than that of the unarrayed subsystem (compared to almost 50% for the uncached case). In degraded mode, the subsystem built from arrays with small parity widths has a response time 11% higher and the subsystem built from arrays with large parity widths has a response time 15% higher than the unarrayed subsystem. Finally, in rebuild mode, we found that the 4*(8 + P) subsystem had 13% higher average response time and the 2*(17 + P) subsystem had 19% higher response time than the unarrayed subsystem. As we had expected, the cached controllers are able to bring arrayed subsystem performance closer to unarrayed subsystem performance. We found that our results about cached controllers were fairly robust even when we varied read to write ratio, read hit ratios and write hit ratios.

Our results showed that cached arrays have significantly better response times and throughputs than equivalent uncached arrays. For a workload with a read to write ratio of 1, a cached array with good hit ratios had 5 times the throughput of the equivalent uncached array. Also, response times were a factor of 10 to 40 lower for the cached array with good hit ratios. With poor hit ratios, the cached array is still a factor of 2 better in throughput and a factor of 4 to 10 better in response time for this same workload. For a workload with high read content (read to write ratio of 7), the cached array has 3 times better throughput and 4 to 40 times lower response time if it gets good hit ratios; and 10% better throughput and 1.5 to 2 times lower response time if it gets poor hit ratios.

Our results indicate that it is important that array disk controllers be cached. Even a small amount of cache can produce significant improvements in response time and throughput. Uncached arrays have a very significant negative effect on performance for database workloads. If one were trying to minimize the worst response time seen by any user, one would choose a subsystem built from arrays with large parity widths because of the best worst-case phenomenon. However, this has a very detrimental effect on average performance in degraded and rebuild modes. We advocate the choice of disk subsystems built from cached arrays with small parity widths (4 + P or 8 + P).

Acknowledgments

The work reported here benefitted from numerous discussions with many members of the Hagar disk array project; in particular, Dick Mattson and Spencer Ng. Dick Mattson was instrumental in pointing me towards the use of Markov models for the rebuild analysis and helped me construct the first version of the actual model to be used. Spencer Ng constructed an earlier version of an array analytical model, and our models have benefitted from each other's work. The analysis has been much improved and made more rigorous following many discussions with Alex Thomasian, who read through many draft versions of this paper.

Appendix A. Rebuild Time Analysis Using Markov Models

The Markov model tracks the states of one of the surviving disks in the array. p is the probability that no user request arrives in time n. If λ is the arrival rate of user requests to the array and the arrival process is Poisson, then, from [15], we know that

p = e^{-λn}

Similarly, r, the probability that no user request arrives in time m, is

r = e^{-λm}

The Markov model can be solved for A, B and C, if we use the state-equilibrium equations, and we know p, q and r. However, q is still not known. So, we will need another equation, different from the state-equilibrium equations, to solve the Markov model. We now find this extra equation.

ρ1 is the device utilization in degraded mode. uA is the device utilization processing user requests in rebuild mode. The two must be equal, since we process the same number of user requests in a given unit of time whether we are in degraded mode or in rebuild mode. So,

uA = ρ1

This gives us a way to calculate A. Now, we can solve for B and C as follows. In steady-state, the state-equilibrium equations are easily seen to be Aq = B and Bp + Cr = C. From the latter,

C = Bp/(1 - r)

We also know that

uA + nB + mC = 1

since the probability of being in one of the states of the Markov chain is 1. Substituting C in terms of B in this equation, we have

B = (1 - uA)/(n + mp/(1 - r))

Once B is solved for, then C can be solved for. B + C is the rate at which tracks are rebuilt, and the rebuild time is trks/(B + C), where trks is the number of tracks on a disk drive.

Next, let us calculate Δ, which is the average of the difference between rebuild mode response time and degraded mode response time. This is simply half the average time it takes to rebuild a track. A track rebuild takes n or m time units, and on average it takes

(nB + mC)/(B + C)

So,

Δ = (nB + mC)/(2(B + C))
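
As a sanity check on the algebra above, here is a minimal Python transcription of the appendix equations. The parameter values in the example run are placeholders chosen only to exercise the formulas; they are not the drive or workload parameters used in the paper.

import math

def rebuild_model(lam, u, n, m, rho1, trks):
    """Solve the appendix Markov model; times in seconds, rates per second."""
    p = math.exp(-lam * n)            # probability of no user arrival in time n
    r = math.exp(-lam * m)            # probability of no user arrival in time m
    A = rho1 / u                      # from uA = rho1
    B = (1.0 - u * A) / (n + m * p / (1.0 - r))
    C = B * p / (1.0 - r)
    rebuild_time = trks / (B + C)     # B + C is the rate at which tracks are rebuilt
    delta = (n * B + m * C) / (2.0 * (B + C))
    return rebuild_time, delta

if __name__ == "__main__":
    # Placeholder inputs chosen only to exercise the formulas.
    t, d = rebuild_model(lam=30.0, u=0.012, n=0.03, m=0.06, rho1=0.5, trks=2700)
    print(f"rebuild time = {t:.0f} s, delta = {d * 1000:.1f} ms")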

Notes

1. The difference is that RAID level 5 arrays stripe data across the disks of the array, whereas parity striped arrays do not stripe data. RAID level 5 arrays allow data to be striped at one of many levels--blocks, tracks, multiple tracks, cylinders, etc. If data is striped at a track level or greater (as has been proposed in [4]) and all user requests access only a block or a few blocks of data, then there may not be much difference in performance between a RAID level 5 and a parity striped array.

2. The cache may be built out of Non-Volatile RAM such as flash memory, or it may be battery-backed RAM.

3. Some terminology is in order here. In this paper, a disk may be called a disk drive, a disk device, a drive or a device. Sometimes, we refer to a failed disk as a broken disk or a broken device or a broken drive. Also, we refer to an array with a broken disk as a broken array. A good disk in a broken array is called a surviving disk.

4. We assume that this group of N + 1 disks looks either like 1 large disk, or several smaller disks to the host operating system. Let the disks as seen by the operating system be logical disks, and the N + 1 disks of the array be physical disks, or simply, disks. So, for example, we can have 1 large logical disk realized on N + 1 physical disks. We also assume that the operating system software can issue multiple requests to a logical disk, so that queueing against a logical disk occurs in the disk controller rather than in the operating system. This is typical of operating systems like Unix when attaching to SCSI [1] disks with the tagged command queueing feature.

5. Simulations of track-striped disk arrays under real-life workloads indicate that requests are distributed evenly across the physical disks of the array, even when the original request stream accessed different portions of the database non-uniformly. So, our assumption that each disk gets the same traffic is a good one.

6. A more accurate analysis is as follows. Treat each server as a two-stage parallel server, with one stage for read requests and another for RMW requests. Then, each device is modelled as an M/H2/1 queue instead of an M/M/1 queue. In addition to calculating the average ST, also calculate the second moment of the service time. Now, use the P-K formula [15] for M/G/1 queues to calculate a waiting time at the device. Tr is calculated as waiting time plus service time for a read. The response time for a RMW request, Trmw, is calculated as waiting time plus service time for a RMW request. Tw is calculated as ((12 - ρ)/8) × Trmw. Finally, the average request response time of the array may now be calculated as a weighted average of Tr and Tw. The difference between this more accurate analysis and the one we employed is very small (less than 2%) even at very high disk utilizations. Because of this and because our emphasis is on cached rather than uncached RAIDs, we used the simpler and computationally more efficient analysis.
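
A rough sketch of this more accurate calculation follows. The arrival rate, request mix and service times are placeholders of our own choosing, the two service branches are assumed exponential, and for simplicity the same read fraction is used both for the device service mix and for the final response-time weighting; the Tw scaling uses the two-server fork-join factor (12 - ρ)/8 as reconstructed above.

def mg1_array_response(arrival_rate, frac_read, s_read, s_rmw):
    """P-K based device response; arrival_rate in ops/sec, service times in sec."""
    frac_rmw = 1.0 - frac_read
    es = frac_read * s_read + frac_rmw * s_rmw                          # E[S]
    es2 = frac_read * 2.0 * s_read ** 2 + frac_rmw * 2.0 * s_rmw ** 2   # E[S^2]
    rho = arrival_rate * es
    wait = arrival_rate * es2 / (2.0 * (1.0 - rho))     # P-K waiting time
    t_read = wait + s_read
    t_rmw = wait + s_rmw
    t_write = (12.0 - rho) / 8.0 * t_rmw                # data and parity RMWs in parallel
    # For simplicity, weight reads and writes by the same mix used at the device.
    return frac_read * t_read + frac_rmw * t_write

if __name__ == "__main__":
    # Placeholder load: 40 ops/sec per device, 80% reads, 12 ms reads, 20 ms RMWs.
    print(f"average response = {mg1_array_response(40.0, 0.8, 0.012, 0.020) * 1000:.1f} ms")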

7. The corresponding blocks in the parity group must be read from the surviving disks in the array and XORed to produce the result needed.

8. This means that when a block is destaged, most of the time only 3 disk accesses are needed--read old parity, write new parity, write new data. Read old data is not needed, since it is almost always already in cache.

9. Since the old data is in cache, the array can calculate new parity by reading old parity and XORing with old and new data in cache. The fact that the disk containing old data is broken does not change the situation at all.
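
Notes 8 and 9 amount to the standard RAID5 small-write parity update, which the fragment below illustrates (the buffer contents are arbitrary and the helper name is ours): with the old data still in cache, new parity is simply old parity XOR old data XOR new data, so the broken data disk never needs to be read.

def new_parity(old_parity, old_data, new_data):
    """RAID5 small-write parity update: P' = P xor D_old xor D_new."""
    return bytes(p ^ a ^ b for p, a, b in zip(old_parity, old_data, new_data))

if __name__ == "__main__":
    old_data = bytes([0x11] * 8)
    new_data = bytes([0x2A] * 8)
    other_block = bytes([0x5C] * 8)          # the other data block in the parity group
    old_parity = bytes(a ^ b for a, b in zip(old_data, other_block))
    updated = new_parity(old_parity, old_data, new_data)
    # The updated parity still covers the surviving block and the new data.
    assert bytes(a ^ b for a, b in zip(updated, other_block)) == new_data
    print(updated.hex())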

10. An important reason we selected the baseline rebuild process is to keep the performance of the array mostly invariant during the entire rebuild process. This makes it easier to talk about performance in rebuild mode. The more sophisticated rebuild schemes will improve performance in rebuild mode as the rebuild proceeds, making it necessary to talk about performance at the beginning of rebuild versus performance at the end of rebuild. Even with the baseline rebuild process, rebuild mode performance of the array remains invariant only for cached controllers. With uncached controllers, there is a small difference in rebuild mode performance at the start of rebuild versus rebuild mode performance later during the rebuild even if the baseline rebuild process is used. This is because of differences in how writes to the broken disk and writes to a surviving disk whose parity is on the broken disk are handled. For such writes, if the affected location on the broken disk has been rebuilt to the spare, the write operates like it did during normal operation (RMW data, RMW parity); else the write operates like it did during degraded mode. For purposes of this study, we ignore this small difference in rebuild performance during the rebuild process and present rebuild mode performance as it would be at the start of the rebuild process (when the affected locations are not likely to have been rebuilt).

11. This has been verified with simulations we reported in [11].

12. This is more true of IBM's older database system IMS ([14]) than it is of our newer database system DB2 ([25]). Nonetheless, it is still a significant fraction of the DB2 workload. In the disk array literature, such requests have been called small reads/writes.

13. For example, consider a write hit to a clean page. In this case, the array controller allocates a new page in cache to hold the new block just received from the host. The old value of the block (the clean page) is preserved, since it is needed to calculate new parity at destage time. We estimate (results from trace studies and discussion with [12]) that array controllers need approximately a 17% larger cache to achieve the same hit ratio as an unarrayed controller, because it uses 17% of its cache space to hold the old values of data blocks. So, to calculate the read and write hit ratios for the arrayed controllers, we assume that we have effectively a cache that is only 0.83 as large as the cache for the unarrayed case. We then use the empirical result in [18] that miss ratios are inversely proportional to the cube root of cache sizes. For example, this means that if the unarrayed controller has a read hit ratio of 10%, the arrayed controller will be assumed to have a hit ratio of about 4.3% for the same size cache.
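
The adjustment in this note can be reproduced in a couple of lines (the function name is ours): shrink the effective cache to 0.83 of its size and scale the miss ratio by the inverse cube root of that factor, per [18].

def arrayed_hit_ratio(unarrayed_hit_ratio, size_factor=0.83):
    """Scale the miss ratio by the inverse cube root of the effective cache size."""
    miss = 1.0 - unarrayed_hit_ratio
    arrayed_miss = miss * size_factor ** (-1.0 / 3.0)
    return 1.0 - arrayed_miss

if __name__ == "__main__":
    print(f"{arrayed_hit_ratio(0.10):.3f}")   # about 0.042, matching the note's ~4.3%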

14. In all our response time plots, the Y-axis consistently goes from 0 to 20 msecs, so that the different response time plots can be readily compared to one another; we have done this even when it might be more appropriate to show a different response time range on the Y-axis, such as might be the case for the response time plot we are now looking at.

References

1. ANSI Standard X3.131, Small Computer Systems Interface (SCSI), American National Standards Institute (ANSI) (New York, 1986).

2. Bates, K. H. and TeGrotenhuis, M., Shadowing Boosts System Reliability, Computer Design 24,4 (Apr. 1985) pp. 129-137.

3. Chen, P. M., Gibson, G. and Patterson, D., An Evaluation of Redundant Arrays of Disks Using an Amdahl 5890, Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems (Boulder, Colorado, May 1990) pp. 75-85.

4. Chen, P. M. and Patterson, D. A., Maximizing Performance in a Striped Disk Array, 17th International Symposium on Computer Architecture (1990) pp. 322-331.

5. Clark, B. E. et al., Parity Spreading to Enhance Storage Access, United States Patent 4,761,785 (Aug. 1988).

6. Conway, R. W., Maxwell, W. L. and Miller, L. W., Theory of Scheduling, Addison-Wesley (New York, 1967).

7. Goldstein, S., Storage Performance-an Eight Year Outlook, IBM Santa Teresa Technical Report TR 03.308 (Santa Teresa, July 1987).

8. Gray, J. N., Notes on Database Operating Systems, Operating Systems: An Advanced Course (New York, 1978).

9. Gray, J. N., Horst, B. and Walker, M., Parity Striping of Disk Arrays: Low-Cost Reliable Storage With Acceptable Throughput, Tandem Computers Technical Report (Brisbane, Australia, August 1990) pp. 148-161.

10. Holland, M. and Gibson, G. A., Parity Declustering for Continuous Operation in Disk Arrays, Proc. 5th ASPLOS (Boston, Oct. 1992) pp. 23-35.

11. Hou, R. Y., Menon, J. and Patt, Y. N., Balancing IO Response Time and Rebuild Time in RAID5 arrays, Proc. Hawaii Int'l Conf. on System Sciences (Hawaii, Jan. 1993) pp. 70-79.

12. Hyde, J., Personal Communication, IBM Publication Group (1990).

13. IBM 3390 Direct Access Storage Introduction, IBM Publication GC26-4573 (San Jose, 1989).

14. IMS-VS Version 1: Program Logic Manual, IBM Publication LY20-8004 (San Jose, 1977).

15. Kleinrock, L., Queueing Systems, Vol. I, John Wiley (New York, 1975).

16. Lawlor, F. D., Efficient Mass Storage Parity Recovery Mechanism, IBM Technical Disclosure Bulletin 24,2 (July 1981) pp. 986-987.

17. Lynch, W. C., Do Disk Arms Move? Performance Evaluation Review 1 3,16 (July 1972).

18. McNutt, Bruce, A Simple Statistical Model of Cache Reference Locality, CMG Proceedings (1991).

19. Menon, J. M. and Hartung, M., The IBM 3990 Disk Cache, Compcon 1988 (San Francisco, June 1988).

20. Menon, J. and Mattson, R. L., Performance of Arrays in Transaction Processing Environments, Proc. of 12th International Conference on Distributed Computing Systems (Yokohama, Japan, June 1992) pp. 302-309.

21. Menon, J. and Cortney, J., The Architecture of a Fault-Tolerant Cached RAID Controller, 20th Annual International Symposium on Computer Architecture (San Diego, California, May 1993) pp. 76-86.

22. Muntz, R. R. and Lui, C. S., Performance Analysis of Disk Arrays Under Failure, 16th VLDB Conference (Brisbane, Australia, 1990) pp. 162-173.

23. Nelson, R. and Tantawi, A. N., Approximate Analysis of Fork-Join Synchronization in Parallel Queues, IEEE Transactions on Computers 37,6 (June 1988) pp. 739-743.

24. Patterson, D., Gibson, G. and Katz, R. H., Reliable Arrays of Inexpensive Disks (RAID), ACM SIGMOD Conference 1988 (Chicago, June 1988) pp. 109-116.

25. Teng, J., DB2 Buffer Pool Management, GUIDE 79 Conference (San Jose, March 1991).