SENSITIVITY ANALYSIS OF RELIABILITY AND PERFORMABILITY MEASURES FOR MULTIPROCESSOR SYSTEMS *

James T. Blake, Andrew L. Reibman†, and Kishor S. Trivedi

Department of Computer Science
Duke University
Durham, North Carolina 27706

Abstract

Traditional evaluation techniques for multiprocessor systems use Markov chains and Markov reward models to compute measures such as mean time to failure, reliability, performance, and performability. In this paper, we discuss the extension of Markov models to include parametric sensitivity analysis. Using such analysis, we can guide system optimization, identify parts of a system model sensitive to error, and find system reliability and performability bottlenecks.

As an example we consider three models of a 16-processor, 16-memory system. A network provides communication between the processors and the memories. Two crossbar-network models and the Omega network are considered. For these models, we examine the sensitivity of the mean time to failure, unreliability, and performability to changes in component failure rates. We use the sensitivities to identify bottlenecks in the three system models.

Index terms -- Multistage interconnection networks, Omega networks, performability, reliability, sensitivity analysis.

1 Introduction

As the use of multiprocessor systems increases, the reliability and performance characteristics of the various design options for realizing these systems must be carefully analyzed. Several papers examine the reliability

*This work was supported in part by the Air Force Office of Scientific Research under grant AFOSR-84-0132.

†Now with AT&T Bell Laboratories, Holmdel, NJ 07733

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

[4,17,18] and performability [2,13] of multiprocessor systems. In addition to computing traditional measures of system performance, it is often interesting to determine the performance or reliability "bottleneck" of a system or to optimize system architectures. Towards this end, we discuss parametric sensitivity analysis [5], the computation of derivatives of system measures with respect to various model parameters [6,7,19]. These derivatives can be used to guide system optimization. Parameters with large sensitivities usually deserve close attention in the quest to improve system characteristics. These parameters may also indicate elements of a model that are particularly prone to error [20].

The paper is organized as follows: A brief summary of Markov and Markov reward models for performance and reliability modeling is given in Section 2. Section 3 discusses the computation of parametric sensitivities. Three reliability/performability models for a multiprocessor system (MPS) are given in Section 4. In Section 5, we present numerical results for the parametric sensitivity of these models.

2 Markov Reliability Models

The evolution of a degradable system through various configurations with different sets of operational components can be represented by a discrete-state, continuous-time Markov chain (CTMC), $\{Z(t), t \geq 0\}$, with state space $\Omega = \{1, 2, \ldots, k\}$. For each $i, j \in \Omega$, let $q_{ij}$ be the transition rate from state $i$ to state $j$, and define

$q_{ii} = -\sum_{j=1, j \neq i}^{k} q_{ij}.$

Then, $Q = [q_{ij}]$ is the $k$ by $k$ transition rate matrix.

© 1988 ACM 0-89791-254-3/88/0005/0177 $1.50

We let $P_i(t) = \mathrm{Prob}[Z(t) = i]$ be the probability that the system is in state $i$ at time $t$. The transient state probability row-vector $P(t)$ can be computed by solving a matrix differential equation [22],

$\dot{P}(t) = P(t)Q.$  (1)

Methods for computing $P(t)$ are compared in [15].

We can divide the state space into two sets: UP, the set of operational states, and DOWN, the set of failure or down states. If all DOWN states are absorbing (failure) states, we can obtain the system reliability from the state probabilities,

$R(t) = \sum_{i \in UP} P_i(t).$

Associated with each state of the CTMC is a reward rate that represents the performance level of the system in that state. The CTMC and the reward rates are combined to form a Markov reward model [8]. Each state represents a different system configuration. Transitions to states with smaller reward rates (lower performance levels) are component failure transitions, and, in repairable systems, transitions to states with higher performance levels are repair transitions.

The choice of a performance measure for determining reward rates is a function of the system to be evaluated. Often a raw measure of system capacity such as the instruction execution rate is useful. For an interconnection network (IN), one appropriate performance measure is bandwidth. At other times, a queueing-theoretic performance model may be used to compute the reward rates. Since the time-scale of the performance-related events (bandwidth) is much faster than the time-scale of the reliability-related events (component failures), steady-state values from performance models are used to specify the performance levels or reward rates for each state.

We let $r_i$ denote the reward rate associated with state $i$, and call $r$ the reward vector. The reward rate of the system at time $t$ is given by the process $X(t) = r_{Z(t)}$. The expected reward rate at time $t$ is

$E[X(t)] = \sum_{i \in \Omega} r_i P_i(t).$

This quantity is also called the computation availability [2].

If we let $Y(t)$ be the amount of reward accumulated (the amount of work done) by a system during the interval $(0, t)$, then

$Y(t) = \int_0^t X(u)\,du.$  (2)

Furthermore, if we use bandwidth to construct the reward vector, then from equation (2), $Y(t)$ represents the number of requests that the IN is capable of satisfying by time $t$. The expected accumulated reward is

$E[Y(t)] = E\left[\int_0^t X(u)\,du\right] = \sum_{i \in \Omega} r_i \int_0^t P_i(u)\,du.$  (3)

In order to compute $E[Y(t)]$, we let $L_i(t) = \int_0^t P_i(u)\,du$. Then, the row vector $L(t)$ can be computed by solving the system of differential equations:

$\dot{L}(t) = L(t)Q + P(0).$  (4)

Methods of solving this system of equations are discussed in [16].

A special case of the expected accumulated reward is the mean time to failure (MTTF). The MTTF is defined as

$MTTF = \int_0^\infty R(t)\,dt.$  (5)

The MTTF is a special case of $E[Y(\infty)]$, with reward rate 0.0 assigned to all DOWN states (which are assumed to be absorbing) and reward rate 1.0 assigned to all UP states. To compute MTTF, we solve for $\tau$ in

$\tau \tilde{Q} = -\tilde{P}(0),$  (6)

where $\tilde{P}(0)$ is the partition of $P(0)$ corresponding to the UP states only. The matrix $\tilde{Q}$ is obtained by deleting the rows and columns in $Q$ corresponding to DOWN states. Any linear algebraic system solver can be used to solve this system of equations. Although one might like to use direct methods like Gaussian elimination, for large, sparse models, iterative methods are more practical [21]. The matrix $-\tilde{Q}$ is a non-singular, diagonally-dominant M-matrix. Thus, if we use an iterative method such as Gauss-Seidel, SOR, or optimal SOR to solve (6), it is guaranteed to converge to the solution [23]. Then,

$MTTF = \sum_{i \in UP} \tau_i.$  (7)
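To make equations (6) and (7) concrete, the following sketch solves the restricted linear system for a small hypothetical two-component series chain (states 0 and 1 are UP, state 2 is the absorbing DOWN state; the structure and rates are invented for illustration). For a model this small a direct solver suffices, though the paper recommends iterative methods for large sparse chains.

```python
import numpy as np

# Hypothetical 3-state CTMC: states 0, 1 are UP; state 2 is absorbing DOWN.
# lam1: failure rate out of state 0; lam2: failure rate out of state 1.
lam1, lam2 = 0.001, 0.002
Q = np.array([[-lam1, lam1, 0.0],
              [0.0, -lam2, lam2],
              [0.0, 0.0, 0.0]])

# Restrict Q to the UP states (delete DOWN rows/columns) to get Q-tilde.
Q_up = Q[:2, :2]
P0_up = np.array([1.0, 0.0])   # system starts fully operational

# Equation (6): solve tau * Q_up = -P(0); transpose to use a column solver.
tau = np.linalg.solve(Q_up.T, -P0_up)
mttf = tau.sum()               # equation (7): sum tau_i over UP states
print(mttf)                    # analytically 1/lam1 + 1/lam2 = 1500.0
```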

3 Sensitivity Analysis

The results obtained from a model are sensitive to many factors. For example, the effect of a change in distribution on a stochastic model is often considered. In this paper, we concentrate our attention on parametric sensitivity analysis, a technique to compute the effect of changes in the rate constants of a Markov model on the measures of interest. Parametric sensitivity analysis helps: (1) guide system optimization, (2) find reliability, performance, and performability bottlenecks in the system, and (3) identify the model parameters that could produce significant modeling errors.

One approach to parametric sensitivity analysis is to use upper and lower bounds on each parameter in the model to compute optimistic and conservative bounds on system reliability [20]. Our approach is to compute the derivative of the measures of interest with respect to the model parameters [6,19]. A bound on the perturbed solution can then be computed with a simple Taylor series approximation.

We assume that the transition rates $q_{ij}$ are functions of some parameter $\lambda$. For a given value of $\lambda$, we want to compute the derivative of various measures with respect to $\lambda$ (e.g., $\partial P_i(t)/\partial\lambda$). If we let $S(t)$ be the row vector of the sensitivities $\partial P_i(t)/\partial\lambda$, then from equation (1) we obtain

$\dot{S}(t) = S(t)Q + P(t)V,$  (8)

where $V$ is the derivative of $Q$ with respect to $\lambda$. Assuming the initial conditions do not depend on $\lambda$, we have

$S(0) = \frac{\partial P(0)}{\partial\lambda} = \lim_{t \to 0} \frac{\partial P(t)}{\partial\lambda} = 0.$

We can then solve equations (1) and (8) simultaneously using

$[\dot{P}(t), \dot{S}(t)] = [P(t), S(t)] \begin{bmatrix} Q & V \\ 0 & Q \end{bmatrix}$  (9)

subject to the initial condition

$[P(0), S(0)] = [P(0), 0].$  (10)
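A minimal sketch of integrating the coupled system (9)-(10) with a hand-rolled classical Runge-Kutta step, on a tiny hypothetical chain (structure and rates are invented for illustration; the sensitivity parameter is taken to be the rate lam1, so V has entries ±1 where lam1 appears in Q):

```python
import numpy as np

# Illustrative 3-state chain; V = dQ/d(lam1).
lam1, lam2 = 0.001, 0.002
Q = np.array([[-lam1, lam1, 0.0],
              [0.0, -lam2, lam2],
              [0.0, 0.0, 0.0]])
V = np.array([[-1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

# Block upper-triangular generator from equation (9).
A = np.block([[Q, V], [np.zeros_like(Q), Q]])

def rk4(y, A, t_end, steps=2000):
    """Classical 4th-order Runge-Kutta for the linear row-vector ODE y' = yA."""
    h = t_end / steps
    for _ in range(steps):
        k1 = y @ A
        k2 = (y + 0.5 * h * k1) @ A
        k3 = (y + 0.5 * h * k2) @ A
        k4 = (y + h * k3) @ A
        y = y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

y0 = np.concatenate([np.array([1.0, 0.0, 0.0]), np.zeros(3)])  # [P(0), S(0)]
y = rk4(y0, A, t_end=1000.0)
P, S = y[:3], y[3:]
print(P.sum())   # state probabilities still sum to 1
```

For this chain $P_0(t) = e^{-\lambda_1 t}$, so its sensitivity is $-t e^{-\lambda_1 t}$, which the integrated `S[0]` reproduces.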

Let $\eta$ be the number of non-zero entries in $Q$, and let $\eta_v$ be the number of non-zero entries in $V$.

For acyclic models, an efficient algorithm that requires $O(2\eta + \eta_v)$ floating-point operations (FLOPS) is discussed in [12]. For more general models with cycles, we can use an explicit integration technique like Runge-Kutta. The execution time of explicit methods like Runge-Kutta is $O((2\eta + \eta_v)(q + v)t)$ FLOPS, where $q = \max_i |q_{ii}|$ and $v = \max_i |v_{ii}|$. To solve equation (10) with Uniformization [7], we choose $q > \max_i |q_{ii}|$ and $Q^* = Q/q + I$. Then,

$S(t) = \frac{\partial}{\partial\lambda} \sum_{i=0}^{\infty} P(0)(Q^*)^i e^{-qt} \frac{(qt)^i}{i!} = \sum_{i=0}^{\infty} \Pi'(i)\, e^{-qt} \frac{(qt)^i}{i!},$  (11)

where

$\Pi'(i) = \Pi'(i-1)Q^* + \Pi(i-1) \frac{\partial Q^*}{\partial\lambda},$  (12)

and

$\Pi(i) = \Pi(i-1)Q^*, \qquad \Pi(0) = P(0).$  (13)

If the CTMC's initial conditions do not depend on $\lambda$, then $\Pi'(0) = 0$. Also note that $\frac{\partial Q^*}{\partial\lambda} = V/q$. With a sparse matrix implementation, Uniformization requires $O((2\eta + \eta_v)qt)$ FLOPS. Both Runge-Kutta's and Uniformization's performance degrades linearly as $q$ (or $v$) grows. Problems with values of $q$ that are large relative to the length of the solution interval are called stiff. Large values of $q$ (and $v$) are common in systems with repair or reconfiguration. An attractive alternative for such stiff problems is an implicit integration technique with execution time $O(2\eta + \eta_v)$ [15].
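The series (11)-(13) can be sketched directly. The two-state chain, the rate `lam`, and the choice $q = 1.1 \max_i |q_{ii}|$ below are illustrative assumptions, and the infinite sum is truncated at a fixed term count rather than by an error bound:

```python
import math
import numpy as np

# Hypothetical 2-state chain (operational -> failed at rate lam);
# the parameter is lam itself, so V = dQ/d(lam).
lam = 0.001
Q = np.array([[-lam, lam], [0.0, 0.0]])
V = np.array([[-1.0, 1.0], [0.0, 0.0]])

t = 1000.0
q = 1.1 * max(abs(Q[i, i]) for i in range(2))   # q > max |q_ii|
Qstar = Q / q + np.eye(2)

Pi = np.array([1.0, 0.0])        # Pi(0) = P(0), equation (13)
dPi = np.zeros(2)                # Pi'(0) = 0
S = np.zeros(2)
weight = math.exp(-q * t)        # Poisson weight e^{-qt}(qt)^i / i!, i = 0
for i in range(200):             # truncated series, equation (11)
    S += dPi * weight
    dPi = dPi @ Qstar + Pi @ (V / q)   # equation (12)
    Pi = Pi @ Qstar                     # equation (13)
    weight *= q * t / (i + 1)
print(S[0])   # analytically dP_0/d(lam) = -t e^{-lam t}
```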

We can derive the sensitivity of $E[X(t)]$ from the sensitivities of the state probabilities:

$\frac{\partial E[X(t)]}{\partial\lambda} = \frac{\partial}{\partial\lambda} \sum_{i \in \Omega} r_i P_i(t) = \sum_{i \in \Omega} \frac{\partial r_i}{\partial\lambda} P_i(t) + \sum_{i \in \Omega} r_i S_i(t).$  (14)

Similarly, we can derive the sensitivity of $E[Y(t)]$ by differentiating equation (3):

$\frac{\partial E[Y(t)]}{\partial\lambda} = \sum_{i \in \Omega} \frac{\partial r_i}{\partial\lambda} L_i(t) + \sum_{i \in \Omega} r_i \int_0^t S_i(u)\,du.$  (15)

As in the instantaneous measures case, methods for computing the cumulative state probability sensitivity vector, $\int_0^t S(u)\,du$, include numerical integration, the ACE algorithm for acyclic models, and Uniformization.

For the special case of mean time to failure, if we differentiate equation (6), we can solve for $s$ in

$s\tilde{Q} = -\tau \frac{\partial \tilde{Q}}{\partial\lambda},$  (16)

where $\tau$ is the solution obtained from equation (6). Then,

$\frac{\partial MTTF}{\partial\lambda} = \sum_{i \in UP} \frac{\partial \tau_i}{\partial\lambda} = \sum_{i \in UP} s_i.$  (17)

This linear system can be solved using the same algorithms used to solve equation (6).
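Equations (16)-(17) reduce MTTF sensitivity to one extra linear solve of the same form as (6). A sketch on a hypothetical two-UP-state series chain, with the parameter taken to be the rate lam1 (chain and rates invented for illustration):

```python
import numpy as np

lam1, lam2 = 0.001, 0.002
Q_up = np.array([[-lam1, lam1],    # Q-tilde: generator restricted to UP states
                 [0.0, -lam2]])
dQ_up = np.array([[-1.0, 1.0],     # derivative of Q-tilde w.r.t. lam1
                  [0.0, 0.0]])

# Equation (6): tau * Q_up = -P(0)
tau = np.linalg.solve(Q_up.T, -np.array([1.0, 0.0]))
# Equation (16): s * Q_up = -tau * dQ_up
s = np.linalg.solve(Q_up.T, -(tau @ dQ_up))
dMTTF = s.sum()                    # equation (17)
print(dMTTF)   # here MTTF = 1/lam1 + 1/lam2, so dMTTF/d(lam1) = -1/lam1**2
```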

Having computed the derivative of some measure, say MTTF, with respect to various system parameters $\lambda_i$, there are at least three distinct ways to use the results. The first application is to provide error bounds on the solution when given bounds on the input parameters. Assume that each of the parameters $\lambda_i$ is contained in an uncertainty interval of width $\Delta\lambda_i$. We can then approximately determine an uncertainty interval $\Delta MTTF$,

$\Delta MTTF \approx \sum_i \Delta\lambda_i \frac{\partial MTTF}{\partial\lambda_i}.$  (18)

A second use of parametric sensitivities is in the identification of portions of a model that need refinement. There is some cost involved in reducing the size of the intervals $\Delta\lambda_i$, since it requires taking additional measurements or performing more detailed analysis. We assume the cost (or time) of reduction in $\Delta\lambda_i$ is proportional to $\Delta\lambda_i/\lambda_i$ and let

$I = \arg\max_i \left| \lambda_i \frac{\partial MTTF}{\partial\lambda_i} \right|,$  (19)

where $\arg\max_i |x_i|$ denotes the value of $i$ that maximizes $|x_i|$. Then, refining the $I$th parameter is the most cost-effective way to improve the accuracy of the model.

A third application of parametric sensitivities is system optimization and bottleneck analysis. Assume that there are $N_i$ copies of component $i$ in the system, and that the failure rate of component $i$ is $\lambda_i$. Furthermore, assume the cost of the $i$th subsystem is given by some function $c_i N_i \lambda_i^{-\alpha_i}$. Define the optimization problem:

Maximize: $MTTF$
Subject To: $\sum_i c_i N_i \lambda_i^{-\alpha_i} \leq COST.$  (20)

Using the method of Lagrange multipliers [1], the optimal values of $\lambda_i$ satisfy:

$\frac{\lambda_i^{\alpha_i+1}}{c_i N_i \alpha_i} \frac{\partial MTTF}{\partial\lambda_i} = \mathrm{constant}.$  (21)

Let

$I^* = \arg\max_i \left| \frac{\lambda_i^{\alpha_i+1}}{c_i N_i \alpha_i} \frac{\partial MTTF}{\partial\lambda_i} \right|.$  (22)

Then, the most cost-effective point to make an incremental investment is in subsystem type $I^*$. In other words, the system bottleneck from the MTTF point of view is subsystem $I^*$. In our numerical examples, we will use this definition of bottleneck. For convenience, we also assume that $c_i = \alpha_i = 1$ for all $i$, although other cost functions could be used. At the conclusion of the numerical results section, we compare these results with those obtained using the second scaling approach.

4 Multiprocessor Example

Consider a multiprocessor system (MPS) that consists of 16 processors (Ps), 16 shared memories (Ms), and an interconnection network (IN) that connects the processors to the memories. We consider three different interconnection network models. First, the IN may be modeled as one large crossbar switch, as in the C.mmp system [18] (see Figure 1). We refer to this model as $SYS_s$.

Figure 1: Multiprocessor System Using a Crossbar Switch as a Single Component Interconnection Network.

Figure 2: Multiprocessor System Using a Crossbar Switch Composed of Multiplexers/Demultiplexers as the Interconnection Network.

A second, more detailed network model, referred to as $SYS_d$, is shown in Figure 2. Here, the IN is composed of sixteen 1 × 16 demultiplexers and sixteen 16 × 1 multiplexers [18]. In this arrangement, each processor is connected to a demultiplexer and each memory is connected to a multiplexer. Functionally, this is equivalent to the crossbar switch, but it provides additional fault-tolerance. In $SYS_s$, any switch failure results in the complete disconnection of all processors and memories. In $SYS_d$, a multiplexer failure disconnects only the associated processor or memory.

In the third model, the IN is an Omega network [11]. This network has two stages and is constructed from eight 4 × 4 switching elements (SEs). This is an economical alternative to a crossbar switch, since the complexity of the crossbar is $O(N^2)$ whereas that of the Omega network is $O(N \log N)$ [10]. The MPS using the Omega IN is shown in Figure 3. We refer to this model as $SYS_\Omega$.

Figure 3: Multiprocessor System Using an Omega Network with 4 × 4 Switching Elements as the Interconnection Network.

4.1 Reliability

We use the switch fault model for reliability analysis. The primary assumption in this model is that each component is an atomic structure. Therefore, the failure of any device in a component causes a total failure of the component. For brevity, we also assume that system failure occurs only as a result of the accumulation of component failures (exhaustion of redundancy). This "perfect coverage" assumption can be easily extended to incorporate "imperfect coverage."

Gate count is used to "equalize" the failure rates of the three models of the IN. From [10], an $n \times n$ crossbar switch requires $4n(n-1)$ gates, where $n$ is the number of inputs and outputs, and an $n \times 1$ multiplexer requires $2(n-1)$ gates, where $n$ is the number of inputs to the multiplexer. A demultiplexer also requires $2(n-1)$ gates by similar reasoning. These gate counts are based on switching element construction that utilizes a tree-like arrangement of gates. For the 16 × 16 MPS, there are 960 gates in the simple 16 × 16 crossbar switch, 30 gates in a demultiplexer/multiplexer, and 48 gates in the 4 × 4 SE (assuming the SE uses a crossbar construction). Since the switch-fault model assumption is being used, if $\delta_s$ denotes the failure rate of the 16 × 16 crossbar switch, then $\delta_s/960$ is the gate failure rate, $\delta_d = \delta_s/32$ is the demultiplexer/multiplexer failure rate, and $\delta_\Omega = \delta_s/20$ is the 4 × 4 SE failure rate.
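The gate-count bookkeeping above is easy to check mechanically. This sketch restates the formulas from [10]; the function names are ours:

```python
# Gate counts: an n x n crossbar needs 4n(n-1) gates; an n x 1 multiplexer
# (or 1 x n demultiplexer) needs 2(n-1) gates.
def crossbar_gates(n):
    return 4 * n * (n - 1)

def mux_gates(n):
    return 2 * (n - 1)

g_xbar = crossbar_gates(16)   # 960 gates in the 16 x 16 crossbar
g_mux = mux_gates(16)         # 30 gates per (de)multiplexer
g_se = crossbar_gates(4)      # 48 gates per 4 x 4 switching element

# Equalizing failure rates by gate count: with delta_s the crossbar's
# failure rate, each gate fails at delta_s / 960, so
ratio_d = g_xbar // g_mux     # delta_d = delta_s / 32
ratio_se = g_xbar // g_se     # delta_Omega = delta_s / 20
print(g_xbar, g_mux, g_se, ratio_d, ratio_se)   # 960 30 48 32 20
```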

For each model of the 16 × 16 MPS system, we construct a Markov reliability model. For $SYS_s$, we enumerate all possible system states. For $SYS_d$, we modify the CTMC used in $SYS_s$ by adjusting the failure rates of the processors and memories to account for their associated demultiplexers/multiplexers. In $SYS_\Omega$, the CTMC must account for the failure behavior of the processors, memories, and the SEs to which they are connected. By exploiting the symmetry in the Omega network, the state description can be accomplished with an 8-tuple. The initial state is (44444444), where position $i$ ($1 \leq i \leq 4$) represents the number of functioning processors connected to an operational SE in position $i$; similarly for the memories, where $5 \leq i \leq 8$. This Markov chain embodies the concept of bulk failures. For a given $i$, either a processor or memory may fail and the value at position $i$ decreases by one, or an SE may fail and the value at position $i$ becomes zero. For $SYS_\Omega$, state lumping [9] reduces the state space of the CTMC. To extend the conditions for lumpability to Markov reward models, we require that every pair of states $u$ and $v$ in the "lumped state" must have identical reward rates (i.e., $r_u = r_v$).

4.2 Performance

For each state in the system reliability model, we need to determine an associated reward rate (performance level). We use the average number of busy memories (memory bandwidth) as the reward rate for each system configuration. This is an appropriate choice of performance metric for the MPS, since the efficiency of the system is limited by the ability of the processors to randomly access the available memories.

In $SYS_s$ and $SYS_d$, the networks are non-blocking, i.e., contention for the memories occurs at the memory ports. In contrast, $SYS_\Omega$ has a blocking network, since contention also occurs inside the IN. If two or more processors compete for the same output link of an SE, only one request is successful and the remaining requests are dropped.

In determining the bandwidth of a given configuration of the multiprocessor system, the independent-reference model assumptions stated in [14] for the analysis of circuit-switched networks are used. Let $p_{in}$ denote the probability that a processor issues a request during a particular memory request cycle, and $p_{out}$ denote the probability that a particular memory receives a request at its input link. Since it is assumed that requests are not buffered in the IN, nor are multiple requests accepted at a memory on any cycle, computation of the memory bandwidth for the MPS is accomplished in a straightforward manner.

Over time, components of the MPS fail, and as a result, the performance of the system decreases. For the crossbar network, implemented as a single element in $SYS_s$ or using multiplexers in $SYS_d$, we use the model developed by Bhandarkar [3] to obtain the average number of busy memories, or memory bandwidth, for each degraded system configuration. For the Omega network, we use an extension of the performance model in [14].

For the $n \times n$ crossbar switch, the probability that a particular processor requests a particular memory for a given network cycle is $p_{in}/n$. The probability that a particular memory is selected by at least one processor is

$p_{out} = 1 - \left(1 - \frac{p_{in}}{n}\right)^n.$  (23)

The bandwidth for the system is just $p_{out}$ times $n$; hence

$BW_{xbar} = n\left(1 - \left(1 - \frac{p_{in}}{n}\right)^n\right).$  (24)

In the presence of memory or processor failures, this equation must be modified, since the number of operational memories is not, in general, equal to the number of operational processors. In [3], a detailed combinatorial and Markovian analysis was performed to determine the bandwidth in the asymmetric case. Let $i$ denote the number of operational processors and $j$ denote the number of operational memories. Further, let $\ell = \min\{i, j\}$ and $m = \max\{i, j\}$. Then, for $p_{in} = 1.0$, Bhandarkar found that the average number of busy memories in the system is accurately predicted by the formula $m(1 - (1 - 1/m)^\ell)$. Therefore, the reward rate associated with a particular structure state $(i, j)$ is

$BW_{xbar} = m\left(1 - (1 - 1/m)^\ell\right).$  (25)
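Equation (25) is simple enough to exercise directly. This sketch (function name ours) reproduces the 10.3 busy-memory figure that Table 1 reports for the fully operational 16 × 16 crossbar:

```python
# Equation (25): memory bandwidth of a crossbar-connected MPS with
# i operational processors and j operational memories, for p_in = 1.0.
def crossbar_bandwidth(i, j):
    lo, m = min(i, j), max(i, j)
    return m * (1.0 - (1.0 - 1.0 / m) ** lo)

print(round(crossbar_bandwidth(16, 16), 1))   # fully operational: 10.3
```

Sanity check: with a single operational processor, exactly one memory is kept busy, whatever the memory count.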

Next, consider the $N \times N$ Omega network with switching elements of size $n \times n$. Number the stage to which the processors are attached as stage 1, and the last stage, to which the memories are attached, as stage $\nu$. The switching elements are $n \times n$ crossbars, and the request probability on a particular output link of a switching element at stage $i$ can be denoted as $p_i$.

Figure 4: $n \times n$ Switching Element.

This value is also the probability that there is an input request for an SE in the next stage. A recurrence relation exists for computing these request probabilities:

$p_{i+1} = 1 - \left(1 - \frac{p_i}{n}\right)^n.$  (26)

Note that $p_{in}$ is the probability that there is a request at the first stage, and $p_\nu = p_{out}$ (the probability that there is a request for a particular memory). The bandwidth is computed as the product of the request probability for a particular memory and the number of memories [14]:

$BW_\Omega = N\left(1 - \left(1 - \frac{p_{\nu-1}}{n}\right)^n\right).$  (27)
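The recurrence (26) and equation (27) can be sketched as follows (function name and the explicit stage count are ours); with $p_{in} = 1.0$ it reproduces the 8.4 busy-memory figure of Table 1, and with a single 16 × 16 stage it collapses to the crossbar value:

```python
# Equations (26)-(27): request-probability recurrence for an N x N Omega
# network built from n x n switching elements, all components operational.
def omega_bandwidth(N, n, stages, p_in=1.0):
    p = p_in
    for _ in range(stages):
        p = 1.0 - (1.0 - p / n) ** n   # equation (26)
    return N * p                        # N memories, each busy w.p. p_out

# The 16 x 16 network from 4 x 4 SEs has two stages:
print(round(omega_bandwidth(16, 4, 2), 1))   # roughly 8.4 busy memories
```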

In the presence of failures, this equation must be modified to account for graceful degradation. In the first stage of the network, consider a particular input link to an $n \times n$ SE, say link 0 in Figure 4, and denote its request probability by $p_{in,0}$. It may request a particular output link with equal probability, so it does not request a specific link with probability $(1 - p_{in,0}/n)$. Similarly, input link 1 does not request the same link with probability $(1 - p_{in,1}/n)$. If the processor attached to an input link has failed, then $p_{in,j} = 0$. Now, the probability of a request for a specific output link, say $i$, as a result of the (perhaps unequal) request probabilities of the input links is computed as

$p_{out,i} = 1 - \prod_{j=0}^{n-1}\left(1 - \frac{p_{in,j}}{n}\right).$  (28)

The bandwidth of the SE is then

$BW_{SE} = \begin{cases} n\,p_{out,i} & \text{if the SE has not failed,} \\ 0 & \text{otherwise.} \end{cases}$  (29)

The outputs of this SE serve as inputs to the SEs in the next stage. At the final stage of the Omega network, some memories may be inoperable. The network bandwidth is computed as the sum of the request probabilities for the operational memories. Let $N_o$ denote the set of operational memories; then

$BW_\Omega = \sum_{j \in N_o} (p_{out})_j.$  (30)

In the next section, equation (25) is used to compute the reward rate for each state of the $SYS_s$ and $SYS_d$ models. Similarly, equation (30) is used to compute the reward rate for each state of the $SYS_\Omega$ model.
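A sketch of the degraded-case computation (28)-(30) for one hypothetical failure pattern (a single failed processor, all memories and SEs operational). The stage-to-stage wiring below and the helper names are illustrative assumptions, not the paper's lumped model:

```python
# Equation (28) for one SE: all outputs are symmetric because each input
# spreads its request uniformly over the n output links.
def se_outputs(p_inputs, failed=False):
    n = len(p_inputs)
    if failed:                        # equation (29), failed-SE branch
        return [0.0] * n
    miss = 1.0
    for p in p_inputs:
        miss *= (1.0 - p / n)
    return [1.0 - miss] * n

# Stage 1: four 4 x 4 SEs; suppose one processor on the first SE has failed.
stage1_in = [[0.0, 1.0, 1.0, 1.0]] + [[1.0] * 4 for _ in range(3)]
stage1_out = [se_outputs(p) for p in stage1_in]
# Assumed Omega wiring: output j of stage-1 SE i feeds input i of stage-2 SE j.
stage2_in = [[stage1_out[i][j] for i in range(4)] for j in range(4)]
stage2_out = [se_outputs(p) for p in stage2_in]
# Equation (30): sum request probabilities over operational memories (all 16).
bw = sum(sum(se) for se in stage2_out)
print(round(bw, 2))   # below the fully operational value of about 8.44
```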

5 Numerical Results

In this section, we examine the transient reliability, mean time to failure, and expected reward rate at time $t$ for our three 16 × 16 MPS models. For each model, we compute the sensitivities of these three measures to changes in the component failure rates.

A given task for a multiprocessor system may require $U$ processors and $V$ memories, where $U$ and $V$ are both less than the total resources available on a fully-operational MPS. So given the task, a multiprocessor system is operational as long as $U$ processors can access $V$ memories. As in [18], we consider the case where 4 processors and 4 memories are required ($K = U = V = 4$). For $SYS_s$ and $SYS_d$, the corresponding CTMCs have 170 states. For $SYS_\Omega$, the CTMC has more than 64000 states before state lumping, and 3970 states after lumping. We have also computed results for the system that requires 12 processors and 12 memories ($K = 12$). For brevity, most of the data for this second case is omitted.

We use failure data from the analysis of the C.mmp system [18]. By a parts count method, Siewiorek determined the failure rates per hour for the components to be:

Processor: 0.0000689    Memory: γ = 0.0002241    Switch: δ_s = 0.0002024

Like Siewiorek, we assume that component lifetimes are exponentially distributed.

In the rest of this section, we first consider some single-valued measures of network performance and reliability. We then consider time-dependent system reliability and its sensitivity. Finally, we consider $E[X(t)]$, the expected reward rate (bandwidth) at time $t$, a measure of degradable system performance.

Architecture   Bandwidth   MTTF (K = 12)   MTTF (K = 4)   Cost
SYS_s          10.3        1322.3          3611.8         960
SYS_d          10.3        1537.9          6708.6         960
SYS_Ω           8.4        1497.2          6575.5         384

Table 1: Comparison of MPS Models.

5.1 Single-Valued Measures

In Table 1, we use three single-valued measures to compare the three MPS models. Using equations (25) and (30), the bandwidth is computed using the approach described in Section 4.2. If we consider bandwidth alone, $SYS_s$ and $SYS_d$ are indistinguishable, and $SYS_\Omega$ is the least preferred choice. Next, system MTTF is computed using equations (6) and (7). Based on the MTTF, $SYS_d$ is the most reliable system, and $SYS_s$ is the least reliable. The cost of processors and memories is the same for all three models, so we use the cost of the IN to contrast the models. The cost of the IN is computed using a gate count. $SYS_\Omega$ requires less than one-half the number of gates needed by the other two systems.

Next we consider the sensitivity of the MTTF estimates given in Table 1 to changes in component failure rates. For each model, using equation (17), we compute the sensitivity of MTTF with respect to the processor failure rate, memory failure rate, and switching element failure rate. Note that each system has a different number of switching elements and these SEs have different failure rates. To find the system bottlenecks, we use the cost model described in Section 3 with a_i = c_i = 1. The parametric sensitivities are multiplied by a factor of λ_i/N_i. The results are shown in Table 2. The bottlenecks for each system configuration are italicized. Because SYSs is most sensitive to switch failures, for this model, the switch is the reliability bottleneck. The memories are the bottleneck for the other two models.
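The kind of computation behind equations (6), (7), and (17) can be sketched for a toy model. The following is a minimal illustration, not the paper's 16-processor models: a two-unit system whose generator, restricted to the transient states, is solved once for the MTTF and once more for its derivative with respect to the failure rate.

```python
import numpy as np

def mttf_and_sensitivity(lam):
    """MTTF and dMTTF/dlam for a toy CTMC: state 0 (two units up) goes to
    state 1 (one unit up) at rate 2*lam; state 1 is absorbed (system
    failure) at rate lam. The model and rate are illustrative only."""
    # Generator restricted to the transient states {0, 1}.
    B = np.array([[-2.0 * lam, 2.0 * lam],
                  [0.0,        -lam]])
    # tau[i] = expected time to absorption from state i: solve B tau = -1.
    tau = np.linalg.solve(B, -np.ones(2))
    # Differentiating B tau = -1 gives B (dtau/dlam) = -(dB/dlam) tau.
    dB = np.array([[-2.0, 2.0],
                   [0.0, -1.0]])
    dtau = np.linalg.solve(B, -dB @ tau)
    return tau[0], dtau[0]  # system starts in state 0

mttf, sens = mttf_and_sensitivity(0.001)
# Closed form for this chain: MTTF = 3/(2*lam), dMTTF/dlam = -3/(2*lam**2).
```

Multiplying `sens` by λ_i/N_i would give a cost-scaled quantity of the kind reported in Table 2.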

5.2 Reliability

In Figure 5, we plot the time-dependent reliability curves for the three MPS models. Failure rates for the IN are determined using the gate count method discussed in Section 4.1. We restrict our attention to the case K = 4. Because SYSs is vulnerable to a single-point switch failure, Rs(t) is significantly less than Rd(t) or Rn(t). Modeling the IN at the demultiplexer/multiplexer level increases the predicted reliability since the failure of individual components is not catastrophic. Also, observe that Rn(t) < Rd(t). In the K = 12 case (not shown), the degree of separation between the reliability of SYSs and the other two models is less pronounced.

MPS     Processors          Memories                Network
        K=12      K=4       K=12        K=4         K=12        K=4
SYSs   -21.29    -2.06     -1462.78   -2297.36     -4663.64   -89899.40
SYSd   -35.01   -20.14     -1974.08   -9069.45        -0.93       -3.57
SYSn   -35.45   -34.84     -1863.69   -8655.67       -10.64      -39.74

Table 2: Scaled Sensitivity of MTTF with Respect to Parameters (× (λ_i/N_i) × 10^5).

Figure 5: Comparison of the Reliabilities of the Three MPS Models for K = 4.

Figure 6: Scaled Parametric Sensitivity of Unreliability -- Simple Crossbar Model (SYSs).

Scaled parametric sensitivities for SYSs and SYSn are plotted in Figures 6 and 7. We omit the plot for SYSd, because it is almost identical to the plot for SYSn. These parametric sensitivities are scaled by multiplying by the factor λ_i/N_i. Regardless of mission time, all three systems are insensitive to small changes in the processor failure rate. For SYSs, switch failure is the reliability bottleneck. For SYSd and SYSn, increased fault-tolerance in the switch makes the memories the reliability bottleneck, regardless of mission time.

5.3 Performability

For K = 4, Figure 8 plots E[X(t)], the expected system bandwidth at time t. SYSd is superior to the other two systems. For small values of t, SYSs is superior to SYSn, while the converse is true for moderate values of t. This crossover occurs because for small K, up to three SEs can fail in SYSn without causing system failure, while for SYSs, when the IN fails, the system is down.
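The quantity E[X(t)] is the reward-weighted state-probability sum for the Markov reward model. As a hedged sketch, the snippet below evaluates E[X(t)] = Σ_i r_i P_i(t) for a toy three-state degradable system (the chain, rewards, and rates are invented for illustration) and takes the failure-rate sensitivity by a finite difference rather than by the analytic method of Section 3.

```python
import numpy as np
from scipy.linalg import expm

def expected_reward(lam, t):
    """E[X(t)] for a toy degradable system (illustrative, not the paper's
    models): state 0 = both units up (reward 2), state 1 = one unit up
    (reward 1), state 2 = system failed (reward 0)."""
    Q = np.array([[-2.0 * lam, 2.0 * lam, 0.0],
                  [0.0,        -lam,      lam],
                  [0.0,        0.0,       0.0]])
    p0 = np.array([1.0, 0.0, 0.0])   # start with both units up
    r = np.array([2.0, 1.0, 0.0])    # reward (bandwidth) rate per state
    return (p0 @ expm(Q * t)) @ r    # E[X(t)] = sum_i r_i * P_i(t)

# Sensitivity dE[X(t)]/dlam via a central finite difference.
lam, t, h = 1.0e-4, 1000.0, 1.0e-8
sens = (expected_reward(lam + h, t) - expected_reward(lam - h, t)) / (2.0 * h)
# sens is negative: raising the failure rate lowers the expected reward
# rate, the opposite sign from the unreliability sensitivity.
```

For this particular chain E[X(t)] reduces to 2e^(-λt), so the finite-difference value can be checked against -2t·e^(-λt).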

Parametric sensitivities for E[X(t)] of the MPS models are plotted in Figures 9 and 10. We omit the plot for SYSd because it is almost identical to the plot for SYSn. These parametric sensitivities are scaled by multiplying by a factor of λ_i/N_i. We note that the sensitivities have the opposite sign from the sensitivities of system unreliability; an increase in failure rate increases unreliability but decreases the expected reward rate. We also note that, unlike the sensitivity of unreliability, the processor failure rate sensitivity curve is visible. Although it is unlikely that enough processors would ever fail to cause total system failure, a few processor failures might occur, reducing system performance. In SYSs, the switch is the performability bottleneck. Because SYSd and SYSn have fault-tolerant switches, regardless of mission time, memories are their performability bottleneck.

5.4 Analysis with an Alternate Sensitivity Measure

As we mentioned in Section 3, a second use of parametric sensitivities is in the identification of portions of a model that need refinement. Instead of using a cost function, as in the three previous subsections, here we consider relative changes, Δλ_i/λ_i. This quantity is obtained by scaling the parametric sensitivities (multiplying each S(t) by λ_i). Using this approach changes the results obtained for SYSs. With the "cost-based" measure used in Section 5.1, the MTTF of SYSs was most sensitive to switch failures for both K = 4 and K = 12. With the alternate scaling used here, the MTTF of SYSs is most sensitive to switch failures for K = 4, but for K = 12, it is most sensitive to memory failures. If we want to improve the MTTF model for SYSs, then K is also a factor in determining what component of the model should be refined.

Figure 7: Scaled Parametric Sensitivity of Unreliability -- Omega Network Model (SYSn).
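The difference between the two scalings can be made concrete with a small sketch. The sensitivities, rates, and component counts below are invented for illustration (they are not taken from the paper's models) and are deliberately chosen so that the two scalings disagree, as they do for SYSs at K = 12.

```python
# Illustrative (invented) unscaled sensitivities S_i = dMTTF/dlambda_i,
# failure rates lambda_i, and component counts N_i for two parameters.
params = {
    "switch": {"S": -9.0e6, "lam": 2.0e-4, "N": 1},
    "memory": {"S": -2.0e7, "lam": 2.2e-4, "N": 16},
}

# Cost-based scaling (Section 5.1): multiply S_i by lambda_i / N_i.
cost_scaled = {k: p["S"] * p["lam"] / p["N"] for k, p in params.items()}
# Relative scaling (this section): multiply S_i by lambda_i alone.
relative = {k: p["S"] * p["lam"] for k, p in params.items()}

# The bottleneck is the parameter with the most negative scaled value;
# with these numbers the two scalings single out different components.
bottleneck_cost = min(cost_scaled, key=cost_scaled.get)   # "switch"
bottleneck_rel = min(relative, key=relative.get)          # "memory"
```

The cost-based view divides by the component count, asking where improving one component pays off most; the relative view asks which rate estimate, if off by a given percentage, perturbs the result most.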

If we repeat the reliability sensitivity analysis with the alternate scaling, SYSs is initially most sensitive to switch failures, but as mission time increases, exhaustion of memory redundancy becomes a greater problem. For t ≥ 4000, SYSs reliability is most sensitive to changes in the memory failure rate. For E[X(t)] of SYSs, a similar crossover is observable at t = 4000. If we want to improve the reliability or performability models for SYSs for small t, the failure rate of the switch should be more accurately determined. For large values of t, the failure rate of the memory system should be more accurately determined.

6 Conclusions

System modelers often rely on single-valued measures like MTTF. This oversimplification may hide important differences between candidate architectures. Time-dependent reliability analysis provides additional data, but unless a whole series of models is run, it does not suggest where to spend additional design effort. In this paper, we discussed the use of Markov reward models and parametric sensitivity analysis. Markov reward models allow us to model the performance of degradable systems. Parametric sensitivity analysis helps identify critical system components or portions of the model that are particularly sensitive to error.

Figure 8: Comparison of the Expected Reward Rates at time t for the Three MPS Models for K = 4.

To demonstrate the use of parametric sensitivity analysis in the evaluation of competing system designs, we considered three models of a multiprocessor system constructed from processors, shared memories, and an interconnection network. For each model, we computed the parametric sensitivity of mean time to failure, system unreliability, and time-dependent expected reward rates. By scaling with respect to a cost function, we were able to identify the reliability, performability, and MTTF bottlenecks in each system. The three models produced different results. The differences between the models highlight the need for detailed models and show the role of analytic modeling in choosing design alternatives and guiding design refinements.

References

[1] M. Avriel. Nonlinear Programming: Analysis and Methods. Prentice-Hall, Englewood Cliffs, NJ, 1976.

[2] M. D. Beaudry. Performance Related Reliability for Computer Systems. IEEE Transactions on Computers, C-27(6):540-547, June 1978.

[3] D. P. Bhandarkar. Analysis of Memory Interference in Multiprocessors. IEEE Transactions on Computers, C-24(9):897-908, September 1975.

[4] C. R. Das and L. N. Bhuyan. Reliability Simulation of Multiprocessor Systems. In Proceedings of the International Conference on Parallel Processing, pages 591-598, August 1985.

[5] P. M. Frank. Introduction to System Sensitivity Theory. Academic Press, New York, 1978.

[6] A. Goyal, S. Lavenberg, and K. Trivedi. Probabilistic Modeling of Computer System Availability. Annals of Operations Research, 8:285-306, March 1987.


Figure 9: Scaled Parametric Sensitivity of Performance Level -- Simple Crossbar Model (SYSs).

Figure 10: Scaled Parametric Sensitivity of Performance Level -- Omega Network Model (SYSn).

[7] P. Heidelberger and A. Goyal. Sensitivity Analysis of Continuous Time Markov Chains Using Uniformization. In G. Iazeolla, P. J. Courtois, and O. J. Boxma, editors, Proceedings of the 2nd International Workshop on Applied Mathematics and Performance/Reliability Models of Computer/Communication Systems, pages 93-104, Rome, Italy, May 1987.

[8] R. A. Howard. Dynamic Probabilistic Systems, Volume II: Semi-Markov and Decision Processes. John Wiley and Sons, New York, 1971.

[9] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van Nostrand-Reinhold, Princeton, NJ, 1960.

[10] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.

[11] D. H. Lawrie. Access and Alignment of Data in an Array Processor. IEEE Transactions on Computers, C-24:1145-1155, December 1975.

[12] R. A. Marie, A. L. Reibman, and K. S. Trivedi. Transient Solution of Acyclic Markov Chains. Performance Evaluation, 7(3):175-194, 1987.

[13] J. F. Meyer. On Evaluating the Performability of Degradable Computing Systems. IEEE Transactions on Computers, C-29(8):720-731, August 1980.

[14] J. H. Patel. Performance of Processor-Memory Interconnections for Multiprocessors. IEEE Transactions on Computers, C-30(10):771-780, October 1981.

[15] A. Reibman and K. Trivedi. Numerical Transient Analysis of Markov Models. Computers and Operations Research, 15(1):19-36, 1988.

[16] A. L. Reibman and K. S. Trivedi. Transient Analysis of Cumulative Measures of Markov Chain Behavior. 1987. Submitted for publication.

[17] D. P. Siewiorek. Multiprocessors: Reliability Modeling and Graceful Degradation. In Infotech State of the Art Conference on System Reliability, pages 48-73, Infotech International, London, 1977.

[18] D. P. Siewiorek, V. Kini, R. Joobbani, and H. Bellis. A Case Study of C.mmp, Cm*, and C.vmp: Part II -- Predicting and Calibrating Reliability of Multiprocessors. Proceedings of the IEEE, 66(10):1200-1220, October 1978.

[19] M. Smotherman. Parametric Error Analysis and Coverage Approximations in Reliability Modeling. PhD thesis, Department of Computer Science, University of North Carolina, Chapel Hill, NC, 1984.

[20] M. Smotherman, R. Geist, and K. Trivedi. Provably Conservative Approximations to Complex Reliability Models. IEEE Transactions on Computers, C-35(4):333-338, April 1986.

[21] W. Stewart and A. Goyal. Matrix Methods in Large Dependability Models. Research Report RC-11485, IBM, November 1985.

[22] K. S. Trivedi. Probability and Statistics with Reliability, Queueing and Computer Science Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.

[23] R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs, NJ, 1962.
