SENSITIVITY ANALYSIS OF RELIABILITY AND PERFORMABILITY MEASURES FOR MULTIPROCESSOR SYSTEMS*

James T. Blake, Andrew L. Reibman,† and Kishor S. Trivedi
Department of Computer Science
Duke University
Durham, North Carolina 27706

Abstract
Traditional evaluation techniques for multiprocessor systems use Markov chains and Markov reward models to compute measures such as mean time to failure, reliability, performance, and performability. In this paper, we discuss the extension of Markov models to include parametric sensitivity analysis. Using such analysis, we can guide system optimization, identify parts of a system model sensitive to error, and find system reliability and performability bottlenecks.
As an example, we consider three models of a 16-processor, 16-memory system. A network provides communication between the processors and the memories. Two crossbar-network models and the Omega network are considered. For these models, we examine the sensitivity of the mean time to failure, unreliability, and performability to changes in component failure rates. We use the sensitivities to identify bottlenecks in the three system models.

Index terms -- Multistage interconnection networks, Omega networks, performability, reliability, sensitivity analysis.
1 Introduction
As the use of multiprocessor systems increases, the reliability and performance characteristics of the various design options for realizing these systems must be carefully analyzed. Several papers examine the reliability [4,17,18] and performability [2,13] of multiprocessor systems. In addition to computing traditional measures of system performance, it is often interesting to determine the performance or reliability "bottleneck" of a system or to optimize system architectures. Towards this end, we discuss parametric sensitivity analysis [5], the computation of derivatives of system measures with respect to various model parameters [6,7,19]. These derivatives can be used to guide system optimization. Parameters with large sensitivities usually deserve close attention in the quest to improve system characteristics. These parameters may also indicate elements of a model that are particularly prone to error [20].

The paper is organized as follows: A brief summary of Markov and Markov reward models for performance and reliability modeling is given in Section 2. Section 3 discusses the computation of parametric sensitivities. Three reliability/performability models for a multiprocessor system (MPS) are given in Section 4. In Section 5, we present numerical results for the parametric sensitivity of these models.

*This work was supported in part by the Air Force Office of Scientific Research under grant AFOSR-84-0132.
†Now with AT&T Bell Laboratories, Holmdel, NJ 07733.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
2 Markov Reliability Models
The evolution of a degradable system through various configurations with different sets of operational components can be represented by a discrete-state, continuous-time Markov chain (CTMC), $\{Z(t), t \geq 0\}$, with state space $\Omega = \{1, 2, \ldots, k\}$. For each $i, j \in \Omega$, let $q_{ij}$ be the transition rate from state $i$ to state $j$, and define

$$q_{ii} = -\sum_{\substack{j=1 \\ j \neq i}}^{k} q_{ij}.$$

Then, $Q = [q_{ij}]$ is the $k \times k$ transition rate matrix.
We let $P_i(t) = \Pr[Z(t) = i]$ be the probability that the system is in state $i$ at time $t$. The transient state probability row-vector $P(t)$ can be computed by solving a matrix differential equation [22],

$$\dot{P}(t) = P(t)Q. \quad (1)$$

Methods for computing $P(t)$ are compared in [15].

© 1988 ACM 0-89791-254-3/88/0005/0177 $1.50
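To make equation (1) concrete, the following sketch solves for $P(t)$ by uniformization, the series method that Section 3 revisits for sensitivities. The function name, the truncation tolerance, and the two-state example chain are our own illustration, not part of the paper's models.

```python
import math

def transient_probs(Q, p0, t, tol=1e-12):
    """Solve dP/dt = P(t)Q (eq. 1) by uniformization:
    P(t) = sum_i p0 (Q*)^i e^{-qt} (qt)^i / i!, with Q* = Q/q + I."""
    k = len(Q)
    q = max(abs(Q[i][i]) for i in range(k)) or 1.0
    Qs = [[Q[i][j] / q + (1.0 if i == j else 0.0) for j in range(k)]
          for i in range(k)]
    pi = list(p0)                 # pi = p0 (Q*)^i
    w = math.exp(-q * t)          # Poisson weight e^{-qt} (qt)^i / i!
    out = [w * x for x in pi]
    i, acc = 0, w
    while acc < 1.0 - tol:        # stop once the Poisson mass reaches ~1
        i += 1
        pi = [sum(pi[r] * Qs[r][c] for r in range(k)) for c in range(k)]
        w *= q * t / i
        acc += w
        out = [o + w * x for o, x in zip(out, pi)]
    return out
```

For a single component with failure rate 0.001/hour (generator `[[-0.001, 0.001], [0, 0]]`), this reproduces $P_0(t) = e^{-\lambda t}$.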
We can divide the state space into two sets: UP, the set of operational states, and DOWN, the set of failure or down states. If all DOWN states are absorbing (failure) states, we can obtain the system reliability from the state probabilities,

$$R(t) = \sum_{i \in UP} P_i(t).$$
Associated with each state of the CTMC is a reward rate that represents the performance level of the system in that state. The CTMC and the reward rates are combined to form a Markov reward model [8]. Each state represents a different system configuration. Transitions to states with smaller reward rates (lower performance levels) are component failure transitions, and, in repairable systems, transitions to states with higher performance levels are repair transitions.
The choice of a performance measure for determining reward rates is a function of the system to be evaluated. Often a raw measure of system capacity such as the instruction execution rate is useful. For an interconnection network (IN), one appropriate performance measure is bandwidth. At other times, a queueing-theoretic performance model may be used to compute the reward rates. Since the time-scale of the performance-related events (bandwidth) is much faster than the time-scale of the reliability-related events (component failures), steady-state values from performance models are used to specify the performance levels or reward rates for each state.
We let $r_i$ denote the reward rate associated with state $i$, and call $r$ the reward vector. The reward rate of the system at time $t$ is given by the process $X(t) = r_{Z(t)}$. The expected reward rate at time $t$ is

$$E[X(t)] = \sum_{i \in \Omega} r_i P_i(t).$$

This quantity is also called the computation availability [2].
If we let $Y(t)$ be the amount of reward accumulated (the amount of work done) by a system during the interval $(0, t)$, then

$$Y(t) = \int_0^t X(u)\,du. \quad (2)$$

Furthermore, if we use bandwidth to construct the reward vector, then from equation (2), $Y(t)$ represents the number of requests that the IN is capable of satisfying by time $t$. The expected accumulated reward is

$$E[Y(t)] = E\left[\int_0^t X(u)\,du\right] = \sum_{i \in \Omega} r_i \int_0^t P_i(u)\,du. \quad (3)$$

In order to compute $E[Y(t)]$, we let $L_i(t) = \int_0^t P_i(u)\,du$. Then, the row vector $L(t)$ can be computed by solving the system of differential equations:

$$\dot{L}(t) = L(t)Q + P(0). \quad (4)$$

Methods of solving this system of equations are discussed in [16].
A special case of the expected accumulated reward is the mean time to failure (MTTF). The MTTF is defined as

$$MTTF = \int_0^\infty R(t)\,dt. \quad (5)$$

The MTTF is a special case of $E[Y(\infty)]$, with reward rate 0.0 assigned to all DOWN states (which are assumed to be absorbing) and reward rate 1.0 assigned to all UP states. To compute MTTF, we solve for $\tau$ in

$$\tau \hat{Q} = -\hat{P}(0), \quad (6)$$

where $\hat{P}(0)$ is the partition of $P(0)$ corresponding to the UP states only, and the matrix $\hat{Q}$ is obtained by deleting the rows and columns of $Q$ corresponding to DOWN states. Any linear algebraic system solver can be used to solve this system of equations. Although one might like to use direct methods like Gaussian elimination, for large, sparse models, iterative methods are more practical [21]. The matrix $-\hat{Q}$ is a non-singular, diagonally-dominant M-matrix. Thus, if we use an iterative method such as Gauss-Seidel, SOR, or optimal SOR to solve (6), it is guaranteed to converge to the solution [23]. Then,

$$MTTF = \sum_{i \in UP} \tau_i. \quad (7)$$
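As a sketch of this procedure, the fragment below solves equation (6) by Gauss-Seidel on the transposed system and then applies equation (7). The two-state example in the test, a duplexed component with per-unit failure rate λ whose MTTF is 3/(2λ), and all names are illustrative assumptions, not the paper's 16 × 16 models.

```python
def mttf(Qup, p0, sweeps=10000, tol=1e-12):
    """Solve tau Qup = -p0 (eq. 6) by Gauss-Seidel on the transposed
    system Qup^T tau^T = -p0^T, then return MTTF = sum(tau) (eq. 7)."""
    k = len(Qup)
    A = [[Qup[j][i] for j in range(k)] for i in range(k)]   # A = Qup^T
    b = [-x for x in p0]
    tau = [0.0] * k
    for _ in range(sweeps):
        delta = 0.0
        for i in range(k):
            s = sum(A[i][j] * tau[j] for j in range(k) if j != i)
            new = (b[i] - s) / A[i][i]
            delta = max(delta, abs(new - tau[i]))
            tau[i] = new
        if delta < tol:                 # converged (guaranteed for -Q an M-matrix)
            break
    return sum(tau)
```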
3 Sensitivity Analysis
The results obtained from a model are sensitive to many factors. For example, the effect of a change in distribution on a stochastic model is often considered. In this paper, we concentrate our attention on parametric sensitivity analysis, a technique to compute the effect of changes in the rate constants of a Markov model on the measures of interest. Parametric sensitivity analysis helps: (1) guide system optimization, (2) find reliability, performance, and performability bottlenecks in the system, and (3) identify the model parameters that could produce significant modeling errors.
One approach to parametric sensitivity analysis is to use upper and lower bounds on each parameter in the model to compute optimistic and conservative bounds on system reliability [20]. Our approach is to compute the derivative of the measures of interest with respect to the model parameters [6,19]. A bound on the perturbed solution can then be computed with a simple Taylor series approximation.
We assume that the transition rates $q_{ij}$ are functions of some parameter $\lambda$. For a given value of $\lambda$, we want to compute the derivative of various measures with respect to $\lambda$ (e.g., $\partial P_i(t)/\partial \lambda$). If we let $S(t)$ be the row vector of the sensitivities $\partial P_i(t)/\partial \lambda$, then from equation (1) we obtain

$$\dot{S}(t) = S(t)Q + P(t)V, \quad (8)$$

where $V$ is the derivative of $Q$ with respect to $\lambda$. Assuming the initial conditions do not depend on $\lambda$, we have

$$S(0) = \frac{\partial P(0)}{\partial \lambda} = \lim_{t \to 0} \frac{\partial P(t)}{\partial \lambda} = 0.$$

We can then solve equations (1) and (8) simultaneously using

$$[\dot{P}(t), \dot{S}(t)] = [P(t), S(t)] \begin{bmatrix} Q & V \\ 0 & Q \end{bmatrix}, \quad (9)$$

subject to the initial condition

$$[P(0), S(0)] = [P(0), 0]. \quad (10)$$
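The joint system (9)-(10) can be integrated directly. Below is a minimal classical Runge-Kutta (RK4) sketch of that integration; the single-absorbing-failure example chain in the test is chosen by us because its sensitivity is known in closed form, $S_0(t) = -t\,e^{-\lambda t}$.

```python
def sensitivities(Q, V, p0, t, steps=2000):
    """Integrate dP/dt = PQ and dS/dt = SQ + PV (eqs. 1 and 8) jointly
    by fixed-step RK4, with S(0) = 0 (eq. 10)."""
    k = len(Q)
    def deriv(P, S):
        dP = [sum(P[r] * Q[r][c] for r in range(k)) for c in range(k)]
        dS = [sum(S[r] * Q[r][c] + P[r] * V[r][c] for r in range(k))
              for c in range(k)]
        return dP, dS
    P, S = list(p0), [0.0] * k   # S(0) = 0 when p0 does not depend on lambda
    h = t / steps
    for _ in range(steps):
        k1P, k1S = deriv(P, S)
        k2P, k2S = deriv([p + h/2*d for p, d in zip(P, k1P)],
                         [s + h/2*d for s, d in zip(S, k1S)])
        k3P, k3S = deriv([p + h/2*d for p, d in zip(P, k2P)],
                         [s + h/2*d for s, d in zip(S, k2S)])
        k4P, k4S = deriv([p + h*d for p, d in zip(P, k3P)],
                         [s + h*d for s, d in zip(S, k3S)])
        P = [p + h/6*(a + 2*b + 2*c + d)
             for p, a, b, c, d in zip(P, k1P, k2P, k3P, k4P)]
        S = [s + h/6*(a + 2*b + 2*c + d)
             for s, a, b, c, d in zip(S, k1S, k2S, k3S, k4S)]
    return P, S
```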
Let $\eta$ be the number of non-zero entries in $Q$, and let $\eta_v$ be the number of non-zero entries in $V$. For acyclic models, an efficient algorithm that requires $O(2\eta + \eta_v)$ floating-point operations (FLOPS) is discussed in [12]. For more general models with cycles, we can use an explicit integration technique like Runge-Kutta. The execution time of explicit methods like Runge-Kutta is $O((2\eta + \eta_v)(q + v)t)$ FLOPS, where $q = \max_i |q_{ii}|$ and $v = \max_i |v_{ii}|$. To solve equation (10) with Uniformization [7], we choose $q \geq \max_i |q_{ii}|$ and $Q^* = Q/q + I$. Then,

$$S(t) = \frac{\partial}{\partial \lambda} \sum_{i=0}^{\infty} P(0)(Q^*)^i\, e^{-qt}\frac{(qt)^i}{i!} = \sum_{i=0}^{\infty} \Pi'(i)\, e^{-qt}\frac{(qt)^i}{i!}, \quad (11)$$
where

$$\Pi'(i) = \Pi'(i-1)Q^* + \Pi(i-1)\frac{\partial Q^*}{\partial \lambda}, \quad (12)$$

and

$$\Pi(i) = \Pi(i-1)Q^*, \qquad \Pi(0) = P(0). \quad (13)$$

If the CTMC's initial conditions do not depend on $\lambda$, then $\Pi'(0) = 0$. Also note that $\partial Q^*/\partial \lambda = V/q$. With a sparse matrix implementation, Uniformization requires $O((2\eta + \eta_v)qt)$ FLOPS. Both Runge-Kutta's and Uniformization's performance degrades linearly as $q$ (or $v$) grows. Problems with values of $q$ that are large relative to the length of the solution interval are called stiff. Large values of $q$ (and $v$) are common in systems with repair or reconfiguration. An attractive alternative for such stiff problems is an implicit integration technique with execution time $O(2\eta + \eta_v)$ [15].
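The uniformized recursion (11)-(13) can be sketched as below, reusing the same illustrative two-state chain as before; the truncation rule (stop once the Poisson weights have accumulated to $1 - tol$) is our choice, not prescribed by the paper.

```python
import math

def sens_uniformized(Q, V, p0, t, tol=1e-12):
    """Sensitivity row vector S(t) via eqs. (11)-(13)."""
    k = len(Q)
    q = max(abs(Q[i][i]) for i in range(k)) or 1.0
    Qs = [[Q[i][j] / q + (1.0 if i == j else 0.0) for j in range(k)]
          for i in range(k)]
    dQs = [[V[i][j] / q for j in range(k)] for i in range(k)]  # dQ*/dlambda = V/q
    Pi = list(p0)        # Pi(i)  = Pi(i-1) Q*                  (eq. 13)
    dPi = [0.0] * k      # Pi'(i) = Pi'(i-1) Q* + Pi(i-1) V/q   (eq. 12)
    w = math.exp(-q * t)
    S = [0.0] * k        # Pi'(0) = 0 contributes nothing
    i, acc = 0, w
    while acc < 1.0 - tol:
        i += 1
        dPi = [sum(dPi[r] * Qs[r][c] + Pi[r] * dQs[r][c] for r in range(k))
               for c in range(k)]      # update Pi' before Pi (uses old Pi)
        Pi = [sum(Pi[r] * Qs[r][c] for r in range(k)) for c in range(k)]
        w *= q * t / i
        acc += w
        S = [s + w * x for s, x in zip(S, dPi)]
    return S
```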
We can derive the sensitivity of $E[X(t)]$ from the sensitivities of the state probabilities:

$$\frac{\partial E[X(t)]}{\partial \lambda} = \frac{\partial}{\partial \lambda} \sum_{i \in \Omega} r_i P_i(t) = \sum_{i \in \Omega} \frac{\partial r_i}{\partial \lambda} P_i(t) + \sum_{i \in \Omega} r_i S_i(t). \quad (14)$$

Similarly, we can derive the sensitivity of $E[Y(t)]$ by differentiating equation (3):

$$\frac{\partial E[Y(t)]}{\partial \lambda} = \sum_{i \in \Omega} \frac{\partial r_i}{\partial \lambda} L_i(t) + \sum_{i \in \Omega} r_i \int_0^t S_i(u)\,du. \quad (15)$$

As in the instantaneous measures case, methods for computing the cumulative state probability sensitivity vector, $\int_0^t S(u)\,du$, include numerical integration, the ACE algorithm for acyclic models, and Uniformization.

For the special case of mean time to failure, if we differentiate equation (6), we can solve for $s$:

$$s\hat{Q} = -\tau \frac{\partial \hat{Q}}{\partial \lambda}, \quad (16)$$

where $\tau$ is the solution obtained from equation (6). Then,

$$\frac{\partial MTTF}{\partial \lambda} = \sum_{i \in UP} \frac{\partial \tau_i}{\partial \lambda} = \sum_{i \in UP} s_i. \quad (17)$$

This linear system can be solved using the same algorithms used to solve equation (6).
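Equations (6), (16), and (17) amount to two linear solves with the same coefficient matrix. The sketch below uses a dense Gaussian elimination rather than the sparse iterative solvers the paper recommends, on an illustrative duplexed-component chain whose MTTF is $3/(2\lambda)$, so $\partial MTTF/\partial\lambda = -3/(2\lambda^2)$; all names are our own.

```python
def solve_row(A, b):
    """Solve x A = b for a row vector x: Gaussian elimination on A^T x^T = b^T."""
    k = len(A)
    M = [[A[c][r] for c in range(k)] + [b[r]] for r in range(k)]  # [A^T | b]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(M[r][i]))          # partial pivot
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, k):
            f = M[r][i] / M[i][i]
            M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    x = [0.0] * k
    for i in range(k - 1, -1, -1):                                # back-substitute
        x[i] = (M[i][k] - sum(M[i][j] * x[j] for j in range(i + 1, k))) / M[i][i]
    return x

def mttf_sensitivity(Qup, Vup, p0):
    """Eqs. (6), (16), (17): return (MTTF, dMTTF/dlambda)."""
    k = len(Qup)
    tau = solve_row(Qup, [-x for x in p0])                        # tau Qup = -p0
    rhs = [-sum(tau[r] * Vup[r][c] for r in range(k)) for c in range(k)]
    s = solve_row(Qup, rhs)                                       # s Qup = -tau Vup
    return sum(tau), sum(s)
```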
Having computed the derivative of some measure, say MTTF, with respect to various system parameters $\lambda_i$, there are at least three distinct ways to use the results. The first application is to provide error bounds on the solution when given bounds on the input parameters. Assume that each of the parameters $\lambda_i$ is contained in an uncertainty interval of width $\Delta\lambda_i$. We can then approximately determine an uncertainty interval $\Delta MTTF$,

$$\Delta MTTF \approx \sum_i \Delta\lambda_i \frac{\partial MTTF}{\partial \lambda_i}. \quad (18)$$
A second use of parametric sensitivities is in the identification of portions of a model that need refinement. There is some cost involved in reducing the size of the intervals $\Delta\lambda_i$, since it requires taking additional measurements or performing more detailed analysis. We assume the cost (or time) of reduction in $\Delta\lambda_i$ is proportional to $\Delta\lambda_i/\lambda_i$ and let

$$I = \arg\max_i \left| \lambda_i \frac{\partial MTTF}{\partial \lambda_i} \right|, \quad (19)$$

where $\arg\max_i |x_i|$ denotes the value of $i$ that maximizes $|x_i|$. Then, refining the $I$th parameter is the most cost-effective way to improve the accuracy of the model.
A third application of parametric sensitivities is system optimization and bottleneck analysis. Assume that there are $N_i$ copies of component $i$ in the system, and that the failure rate of component $i$ is $\lambda_i$. Furthermore, assume the cost of the $i$th subsystem is given by some function $c_i N_i \lambda_i^{-\alpha_i}$. Define the optimization problem:

$$\text{Maximize: } MTTF$$
$$\text{Subject To: } \sum_i c_i N_i \lambda_i^{-\alpha_i} \leq COST. \quad (20)$$

Using the method of Lagrange multipliers [1], the optimal values of $\lambda_i$ satisfy:

$$\frac{\lambda_i^{\alpha_i+1}}{c_i N_i \alpha_i} \frac{\partial MTTF}{\partial \lambda_i} = \text{constant}. \quad (21)$$

Let

$$I^* = \arg\max_i \left| \frac{\lambda_i^{\alpha_i+1}}{c_i N_i \alpha_i} \frac{\partial MTTF}{\partial \lambda_i} \right|. \quad (22)$$

Then, the most cost-effective point to make an incremental investment is in subsystem type $I^*$. In other words, the system bottleneck from the MTTF point of view is subsystem $I^*$. In our numerical examples, we will use this definition of bottleneck. For convenience, we also assume that $c_i = \alpha_i = 1$ for all $i$, although other cost functions could be used. At the conclusion of the numerical results section, we compare these results with those obtained using the second scaling approach.
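Given a vector of MTTF derivatives, picking the bottleneck of equation (22) is a one-line argmax; the helper below and its toy numbers are purely illustrative.

```python
def bottleneck(sens, lam, N, c=None, alpha=None):
    """Eq. (22): index of the subsystem where an incremental investment
    most improves MTTF (defaults c_i = alpha_i = 1, as in the paper)."""
    k = len(sens)
    c = c or [1.0] * k
    alpha = alpha or [1.0] * k
    score = [abs(lam[i] ** (alpha[i] + 1) / (c[i] * N[i] * alpha[i]) * sens[i])
             for i in range(k)]
    return max(range(k), key=lambda i: score[i])
```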
4 Multiprocessor Example

Consider a multiprocessor system (MPS) that consists of 16 processors (Ps), 16 shared memories (Ms), and an interconnection network (IN) that connects the processors to the memories. We consider three different interconnection network models. First, the IN may be modeled as one large crossbar switch, as in the C.mmp system [18] (see Figure 1). We refer to this model as SYS_s.

Figure 1: Multiprocessor System Using a Crossbar Switch as a Single Component Interconnection Network.

Figure 2: Multiprocessor System Using a Crossbar Switch Composed of Multiplexers/Demultiplexers as the Interconnection Network.
A second, more detailed network model, referred to as SYS_d, is shown in Figure 2. Here, the IN is composed of sixteen 1 × 16 demultiplexers and sixteen 16 × 1 multiplexers [18]. In this arrangement, each processor is connected to a demultiplexer and each memory is connected to a multiplexer. Functionally, this is equivalent to the crossbar switch, but it provides additional fault-tolerance. In SYS_s, any switch failure results in the complete disconnection of all processors and memories. In SYS_d, a multiplexer failure disconnects only the associated processor or memory.

In the third model, the IN is an Omega network [11]. This network has two stages and is constructed from eight 4 × 4 switching elements (SEs). This is an economical alternative to a crossbar switch, since the complexity of the crossbar is O(N²) whereas that of the Omega network is O(N log N) [10]. The MPS using the Omega IN is shown in Figure 3. We refer to this model as SYS_Ω.

Figure 3: Multiprocessor System Using an Omega Network with 4 × 4 Switching Elements as the Interconnection Network.
4.1 Reliability

We use the switch fault model for reliability analysis. The primary assumption in this model is that each component is an atomic structure. Therefore, the failure of any device in a component causes a total failure of the component. For brevity, we also assume that system failure occurs only as a result of the accumulation of component failures (exhaustion of redundancy). This "perfect coverage" assumption can be easily extended to incorporate "imperfect coverage."
Gate count is used to "equalize" the failure rates of the three models of the IN. From [10], an n × n crossbar switch requires 4n(n − 1) gates, where n is the number of inputs and outputs, and an n × 1 multiplexer requires 2(n − 1) gates, where n is the number of inputs to the multiplexer. A demultiplexer also requires 2(n − 1) gates by similar reasoning. These gate counts are based on switching-element construction that utilizes a tree-like arrangement of gates. For the 16 × 16 MPS, there are 960 gates in the simple 16 × 16 crossbar switch, 30 gates in a demultiplexer/multiplexer, and 48 gates in the 4 × 4 SE (assuming the SE uses a crossbar construction). Since the switch-fault model assumption is being used, if δ_s denotes the failure rate of the 16 × 16 crossbar switch, then δ_s/960 is the gate failure rate, δ_d = δ_s/32 is the demultiplexer/multiplexer failure rate, and δ_Ω = δ_s/20 is the 4 × 4 SE failure rate.
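The gate-count arithmetic above is easy to check mechanically. The formulas 4n(n − 1) and 2(n − 1) come from [10] as quoted in the text; the variable names below, and the use of the switch failure rate quoted later in Section 5, are our own.

```python
def crossbar_gates(n):
    """Gates in an n x n crossbar switch: 4n(n - 1) [10]."""
    return 4 * n * (n - 1)

def mux_gates(n):
    """Gates in an n x 1 multiplexer (or 1 x n demultiplexer): 2(n - 1)."""
    return 2 * (n - 1)

delta_s = 0.0002024                          # 16 x 16 crossbar rate (per hour, Sec. 5)
gate_rate = delta_s / crossbar_gates(16)     # 960 gates -> per-gate failure rate
delta_d = gate_rate * mux_gates(16)          # 30-gate (de)multiplexer: delta_s / 32
delta_omega = gate_rate * crossbar_gates(4)  # 48-gate 4 x 4 SE: delta_s / 20
```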
For each model of the 16 × 16 MPS system, we construct a Markov reliability model. For SYS_s, we enumerate all possible system states. For SYS_d, we modify the CTMC used in SYS_s by adjusting the failure rates of the processors and memories to account for their associated demultiplexers/multiplexers. In SYS_Ω, the CTMC must account for the failure behavior of the processors, memories, and the SEs to which they are connected. By exploiting the symmetry in the Omega network, the state description can be accomplished with an 8-tuple. The initial state is (44444444), where position i (1 ≤ i ≤ 4) represents the number of functioning processors connected to an operational SE in position i; similarly for the memories, where 5 ≤ i ≤ 8. This Markov chain embodies the concept of bulk failures. For a given i, either a processor or memory may fail, and the value at position i decreases by one, or an SE may fail, and the value at position i becomes zero. For SYS_Ω, state lumping [9] reduces the state space of the CTMC. To extend the conditions for lumpability to Markov reward models, we require that every pair of states u and v in the "lumped state" must have identical reward rates (i.e., r_u = r_v).
4.2 Performance

For each state in the system reliability model, we need to determine an associated reward rate (performance level). We use the average number of busy memories (memory bandwidth) as the reward rate (performance level) for each system configuration. This is an appropriate choice of performance metric for the MPS, since the efficiency of the system is limited by the ability of the processors to randomly access the available memories.

In SYS_s and SYS_d, the networks are non-blocking, i.e., contention for the memories occurs at the memory ports. In contrast, SYS_Ω has a blocking network, since contention also occurs inside the IN. If two or more processors compete for the same output link of an SE, only one request is successful and the remaining requests are dropped.

In determining the bandwidth of a given configuration of the multiprocessor system, the independent-reference model assumptions stated in [14] for analysis of circuit-switched networks are used. Let p_in denote the probability that a processor issues a request during a particular memory request cycle, and p_out denote the probability that a particular memory receives a request at its input link. Since it is assumed that requests are not buffered in the IN, nor are multiple requests accepted at a memory on any cycle, computation of the memory bandwidth for the MPS is accomplished in a straightforward manner.
Over time, components of the MPS fail, and as a result, the performance of the system decreases. For the crossbar network, implemented as a single element in SYS_s or using multiplexers in SYS_d, we use the model developed by Bhandarkar [3] to obtain the average number of busy memories, or memory bandwidth, for each degraded system configuration. For the Omega network, we use an extension of the performance model in [14].
For the n × n crossbar switch, the probability that a particular processor requests a particular memory during a given network cycle is $p_{in}/n$. The probability that a particular memory is selected by at least one processor is

$$p_{out} = 1 - \left(1 - \frac{p_{in}}{n}\right)^n. \quad (23)$$

The bandwidth for the system is just $p_{out}$ times $n$; hence

$$BW_{xbar} = n\left(1 - \left(1 - \frac{p_{in}}{n}\right)^n\right). \quad (24)$$

In the presence of memory or processor failures, this equation must be modified, since the number of operational memories is not, in general, equal to the number of operational processors. In [3], a detailed combinatorial and Markovian analysis was performed to determine the bandwidth in the asymmetric case. Let $i$ denote the number of operational processors and $j$ denote the number of operational memories. Further, let $\ell = \min\{i, j\}$ and $m = \max\{i, j\}$. Then, for $p_{in} = 1.0$, Bhandarkar found that the average number of busy memories in the system is accurately predicted by the formula $m(1 - (1 - 1/m)^\ell)$. Therefore, the reward rate associated with a particular structure state $(i, j)$ is

$$BW_{xbar} = m\left(1 - \left(1 - \frac{1}{m}\right)^\ell\right). \quad (25)$$
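A sketch of equation (25); with all 16 processors and 16 memories up, it reproduces the full-crossbar bandwidth of about 10.3 reported later in Table 1. The function name and the zero-operational guard are our additions.

```python
def bw_crossbar(i, j):
    """Eq. (25): Bhandarkar's approximation to the expected number of busy
    memories with i working processors and j working memories (p_in = 1)."""
    l, m = min(i, j), max(i, j)
    if l == 0:                       # no working processors or no working memories
        return 0.0
    return m * (1.0 - (1.0 - 1.0 / m) ** l)
```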
Next, consider the N × N Omega network with switching elements of size n × n. Number the stage to which the processors are attached as stage 1, and the last stage, to which the memories are attached, as stage v. The switching elements are n × n crossbars, and the request probability on a particular output link of a stage-i switching element can be denoted $p_i$. This value is also the probability that there is an input request for an SE in the next stage. A recurrence relation exists for computing these request probabilities:

$$p_{i+1} = 1 - \left(1 - \frac{p_i}{n}\right)^n. \quad (26)$$

Note that $p_{in}$ is the probability that there is a request at the first stage, and $p_v = p_{out}$ (the probability that there is a request for a particular memory). The bandwidth is computed as the product of the request probability for a particular memory and the number of memories [14]:

$$BW_\Omega = N\left(1 - \left(1 - \frac{p_{v-1}}{n}\right)^n\right). \quad (27)$$

Figure 4: n × n Switching Element.
In the presence of failures, this equation must be modified to account for graceful degradation. In the first stage of the network, consider a particular input link to an n × n SE, say link 0 in Figure 4, and denote its request probability by $p_{in,0}$. It may request any particular output link with equal probability, so it does not request a specific link with probability $(1 - p_{in,0}/n)$. Similarly, input link 1 does not request the same link with probability $(1 - p_{in,1}/n)$. If the processor attached to the input link has failed, then $p_{in,i} = 0$. Now, the probability of a request for a specific output link, say $i$, as a result of the (perhaps unequal) request probabilities of the input links is computed as

$$p_{out,i} = 1 - \prod_{j=0}^{n-1}\left(1 - \frac{p_{in,j}}{n}\right). \quad (28)$$

The bandwidth of the SE is then

$$BW_{SE} = \begin{cases} n\,p_{out,i} & \text{if the SE has not failed,} \\ 0 & \text{otherwise.} \end{cases} \quad (29)$$

The outputs of this SE serve as inputs to the SEs in the next stage. At the final stage of the Omega network, some memories may be inoperable. The network bandwidth is computed as the sum of the request probabilities for the operational memories. Let $N_o$ denote the set of operational memories; then

$$BW_\Omega = \sum_{j \in N_o} (p_{out})_j. \quad (30)$$
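Equations (28)-(30) for the two-stage, 4 × 4-SE Omega network can be sketched as below. The up/down representation (one boolean per processor, memory, and SE) and the assumption that every working processor issues a request with probability 1 are our own illustrative choices; the fully-operational case reproduces the bandwidth of about 8.4 reported later in Table 1.

```python
def se_out(p_in, alive, n=4):
    """Eqs. (28)-(29): request probability on each output link of one SE
    (identical across outputs, since inputs request outputs uniformly)."""
    if not alive:
        return 0.0
    prod = 1.0
    for p in p_in:
        prod *= 1.0 - p / n
    return 1.0 - prod

def bw_omega(procs, mems, se1, se2):
    """Eq. (30): bandwidth of the 2-stage Omega network of 4 x 4 SEs.
    procs[i][j] = 1 if processor j on stage-1 SE i is up; mems likewise for
    stage-2 SEs; se1/se2 flag working SEs."""
    n = 4
    # Stage 1: each working processor puts a request on its input link.
    out1 = [se_out([float(up) for up in procs[i]], se1[i], n) for i in range(n)]
    # Shuffle wiring: stage-2 SE j receives one link from every stage-1 SE.
    out2 = [se_out(out1, se2[j], n) for j in range(n)]
    # Sum request probabilities over operational memories only.
    return sum(out2[j] * sum(mems[j]) for j in range(n))
```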
In the next section, equation (25) is used to compute the reward rate for each state of the SYS_s and SYS_d models. Similarly, equation (30) is used to compute the reward rate for each state of the SYS_Ω model.
5 Numerical Results

In this section, we examine the transient reliability, mean time to failure, and expected reward rate at time t for our three 16 × 16 MPS models. For each model, we compute the sensitivities of these three measures to changes in the component failure rates.
A given task for a multiprocessor system may require U processors and V memories, where U and V are both less than the total resources available on a fully-operational MPS. So, given the task, a multiprocessor system is operational as long as U processors can access V memories. As in [18], we consider the case where 4 processors and 4 memories are required (K = U = V = 4). For SYS_s and SYS_d, the corresponding CTMCs have 170 states. For SYS_Ω, the CTMC has more than 64000 states before state lumping, and 3970 states after lumping. We have also computed results for the system that requires 12 processors and 12 memories (K = 12). For brevity, most of the data for this second case is omitted.
We use failure data from the analysis of the C.mmp system [18]. By a parts count method, Siewiorek determined the failure rates per hour of the components to be:

Processor: 0.0000689    Memory: γ = 0.0002241    Switch: δ_s = 0.0002024

Like Siewiorek, we assume that component lifetimes are exponentially distributed.
In the rest of this section, we first consider some single-valued measures of network performance and reliability. We then consider time-dependent system reliability and its sensitivity. Finally, we consider E[X(t)], the expected reward rate (bandwidth) at time t, a measure of degradable system performance.
Architecture   Bandwidth   MTTF (K = 12)   MTTF (K = 4)   Cost
SYS_s          10.3        1322.3          3611.8          960
SYS_d          10.3        1537.9          6708.6          960
SYS_Ω           8.4        1497.2          6575.5          384

Table 1: Comparison of MPS Models.
5.1 Single-Valued Measures

In Table 1, we use three single-valued measures to compare the three MPS models. Using equations (25) and (30), the bandwidth is computed using the approach described in Section 4.2. If we consider bandwidth alone, SYS_s and SYS_d are indistinguishable, and SYS_Ω is the least preferred choice. Next, system MTTF is computed using equations (6) and (7). Based on the MTTF, SYS_d is the most reliable system, and SYS_s is the least reliable. The cost of processors and memories is the same for all three models, so we use the cost of the IN to contrast the models. The cost of the IN is computed using a gate count. SYS_Ω requires less than one-half the number of gates needed by the other two systems.
Next, we consider the sensitivity of the MTTF estimates given in Table 1 to changes in component failure rates. For each model, using equation (17), we compute the sensitivity of MTTF with respect to the processor failure rate, memory failure rate, and switching element failure rate. Note that each system has a different number of switching elements, and these SEs have different failure rates. To find the system bottlenecks, we use the cost model described in Section 3 with α_i = c_i = 1. The parametric sensitivities are multiplied by a factor of λ_i/N_i. The results are shown in Table 2. The bottlenecks for each system configuration are italicized. Because SYS_s is most sensitive to switch failures, for this model, the switch is the reliability bottleneck. The memories are the bottleneck for the other two models.
        Processors            Memories               Network
MPS     K = 12    K = 4      K = 12     K = 4       K = 12     K = 4
SYS_s   -21.29    -2.06      -1462.78   -2297.36    -46es.e4   -89899.40
SYS_d   -35.01    -20.14     -1974.08   -9069.45    -0.93      -3.57
SYS_Ω   -35.45    -34.84     -1863.69   -8655.67    -10.64     -39.74

Table 2: Scaled Sensitivity of MTTF with Respect to Parameters (entries multiplied by λ_i/N_i × 10).

5.2 Reliability

In Figure 5, we plot the time-dependent reliability curves for the three MPS models. Failure rates for the IN are determined using the gate count method discussed in Section 4.1. We restrict our attention to the case K = 4. Because SYS_s is vulnerable to a single-point switch failure, R_s(t) is significantly less than R_d(t) or R_Ω(t). Modeling the IN at the demultiplexer/multiplexer level increases the predicted reliability, since the failure of individual components is not catastrophic. Also, observe that R_Ω(t) < R_d(t). In the K = 12 case (not shown), the degree of separation between the reliability of SYS_s and the other two models is less pronounced.

Figure 5: Comparison of the Reliabilities of the Three MPS Models for K = 4.

Figure 6: Scaled Parametric Sensitivity of Unreliability -- Simple Crossbar Model (SYS_s).
Scaled parametric sensitivities for SYS_s and SYS_Ω are plotted in Figures 6 and 7. We omit the plot for SYS_d, because it is almost identical to the plot for SYS_Ω. These parametric sensitivities are scaled by multiplying by the factor λ_i/N_i. Regardless of mission time, all three systems are insensitive to small changes in the processor failure rate. For SYS_s, switch failure is the reliability bottleneck. For SYS_d and SYS_Ω, increased fault-tolerance in the switch makes the memories the reliability bottleneck, regardless of mission time.
5.3 Performability

For K = 4, Figure 8 plots E[X(t)], the expected system bandwidth at time t. SYS_d is superior to the other two systems. For small values of t, SYS_s is superior to SYS_Ω, while the converse is true for moderate values of t. This crossover occurs because, for small K, up to three SEs can fail in SYS_Ω without causing system failure, while for SYS_s, when the IN fails, the system is down.
Parametric sensitivities for E[X(t)] of the MPS models are plotted in Figures 9 and 10. We omit the plot for SYS_d because it is almost identical to the plot for SYS_Ω. These parametric sensitivities are scaled by multiplying by a factor of λ_i/N_i. We note that the sensitivities have the opposite sign from the sensitivities of system unreliability; an increase in failure rate increases unreliability but decreases the expected reward rate. We also note that, unlike the sensitivity of unreliability, the processor failure rate sensitivity curve is visible. Although it is unlikely that enough processors would ever fail to cause total system failure, a few processor failures might occur, reducing system performance. In SYS_s, the switch is the performability bottleneck. Because SYS_d and SYS_Ω have fault-tolerant switches, regardless of mission time, memories are their performability bottleneck.
5.4 Ana lys i s w i t h an A l t e r n a t e Sensi- t iv i ty M e a s u r e
As we mentioned in Section 3, a second use of parametric sensitivities is in the identification of portions of a model that need refinement. Instead of using a cost function, as in the three previous subsections, here we consider relative changes, Δλ_i/λ_i. This quantity is obtained by scaling the parametric sensitivities (multiplying each S(t) by λ_i). Using this approach changes the results obtained for SYSo. With the "cost-based" measure used in Section 5.1, the MTTF of SYSo was
Figure 7: Scaled Parametric Sensitivity of Unreliability -- Omega Network Model (SYSo).
most sensitive to switch failures for both K = 4 and K = 12. With the alternate scaling used here, the MTTF of SYSo is most sensitive to switch failures for K = 4, but for K = 12, it is most sensitive to memory failures. If we want to improve the MTTF model for SYSo, then K is also a factor in determining what component of the model should be refined.
If we repeat the reliability sensitivity analysis with the alternate scaling, SYSo is initially most sensitive to switch failures, but as mission time increases, exhaustion of memory redundancy becomes a greater problem. For t ≥ 4000, SYSo reliability is most sensitive to changes in the memory failure rate. For E[X(t)] of SYSo, a similar crossover is observable at t = 4000. If we want to improve the reliability or performability models for SYSo for small t, the failure rate of the switch should be more accurately determined. For large values of t, the failure rate of the memory system should be more accurately determined.
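The effect of the choice of scaling can be reproduced with a small sketch. The Python fragment below is a toy stand-in, not SYSo itself: a hypothetical system with 16 memories tolerating up to two failures and a single non-redundant switch, with made-up failure rates. It computes MTTF = p0·(-Q_T)^{-1}·1 over the transient states of the absorbing Markov chain, then forms both the cost-based scaling (multiplying ∂MTTF/∂λ_i by λ_i/N_i) and the relative scaling (multiplying by λ_i) used in this subsection.

```python
import numpy as np

def mttf(lam_m, lam_s):
    """MTTF of a hypothetical chain: transient states = 0, 1, 2 failed
    memories (out of 16; a third failure is fatal); the single switch
    fails at rate lam_s and takes the system down from any state."""
    QT = np.array([
        [-(16 * lam_m + lam_s), 16 * lam_m,            0.0                  ],
        [0.0,                   -(15 * lam_m + lam_s), 15 * lam_m           ],
        [0.0,                   0.0,                   -(14 * lam_m + lam_s)]])
    p0 = np.array([1.0, 0.0, 0.0])
    return p0 @ np.linalg.solve(-QT, np.ones(3))   # p0 (-Q_T)^{-1} 1

def scaled_sens(f, lam, scale, h=1e-9):
    """scale * dMTTF/dlam, estimated by central finite differences."""
    return scale * (f(lam + h) - f(lam - h)) / (2 * h)

lam_m, lam_s = 1e-5, 1e-6          # hypothetical failure rates (per hour)
N_mem, N_sw = 16, 1                # component counts for the cost scaling

cost_m = scaled_sens(lambda x: mttf(x, lam_s), lam_m, lam_m / N_mem)
cost_s = scaled_sens(lambda x: mttf(lam_m, x), lam_s, lam_s / N_sw)
rel_m  = scaled_sens(lambda x: mttf(x, lam_s), lam_m, lam_m)
rel_s  = scaled_sens(lambda x: mttf(lam_m, x), lam_s, lam_s)
print(mttf(lam_m, lam_s))          # about 1.98e4 hours
print((cost_m, cost_s), (rel_m, rel_s))
```

Because the memory sensitivity is divided by N_mem = 16 under the cost-based scaling but not under the relative one, the two scalings can rank the same parameters differently, which is the kind of K-dependent reversal reported for SYSo.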
6 Conclusions
System modelers often rely on single-valued measures like MTTF. This oversimplification may hide important differences between candidate architectures. Time-dependent reliability analysis provides additional data, but unless a whole series of models is run, it does not suggest where to spend additional design effort. In this paper, we discussed the use of Markov reward models and parametric sensitivity analysis. Markov reward models allow us to model the performance of degradable systems. Parametric sensitivity analysis helps identify critical system components or portions of the model that are particularly sensitive to error.
Figure 8: Comparison of the Expected Reward Rates at time t for the Three MPS Models for K = 4.
To demonstrate the use of parametric sensitivity analysis in the evaluation of competing system designs, we considered three models of a multiprocessor system constructed from processors, shared memories, and an interconnection network. For each model, we computed the parametric sensitivity of mean time to failure, system unreliability, and time-dependent expected reward rates. By scaling with respect to a cost function, we were able to identify the reliability, performability, and MTTF bottlenecks in each system. The three models produced different results. The differences between the models highlight the need for detailed models and show the role of analytic modeling in choosing design alternatives and guiding design refinements.
References
[1] M. Avriel. Nonlinear Programming: Analysis and Methods. Prentice-Hall, Englewood Cliffs, NJ, 1976.
[2] M. D. Beaudry. Performance-Related Reliability Measures for Computing Systems. IEEE Transactions on Computers, C-27(6):540-547, June 1978.
[3] D. P. Bhandarkar. Analysis of Memory Interference in Multiprocessors. IEEE Transactions on Computers, C-24(9):897-908, September 1975.
[4] C. R. Das and L. N. Bhuyan. Reliability Simulation of Multiprocessor Systems. In Proceedings of the International Conference on Parallel Processing, pages 591-598, August 1985.
[5] P. M. Frank. Introduction to System Sensitivity Theory. Academic Press, New York, 1978.
[6] A. Goyal, S. Lavenberg, and K. Trivedi. Probabilistic Modeling of Computer System Availability. Annals of Operations Research, 8:285-306, March 1987.
Figure 9: Scaled Parametric Sensitivity of Performance Level -- Simple Crossbar Model (SYSs).
Figure 10: Scaled Parametric Sensitivity of Performance Level -- Omega Network Model (SYSo).
[7] P. Heidelberger and A. Goyal. Sensitivity Analysis of Continuous Time Markov Chains Using Uniformization. In P. J. Courtois, G. Iazeolla, and O. J. Boxma, editors, Proceedings of the 2nd International Workshop on Applied Mathematics and Performance/Reliability Models of Computer/Communication Systems, pages 93-104, Rome, Italy, May 1987.
[8] R. A. Howard. Dynamic Probabilistic Systems, Volume II: Semi-Markov and Decision Processes. John Wiley and Sons, New York, 1971.
[9] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van Nostrand-Reinhold, Princeton, NJ, 1960.
[10] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.
[11] D. H. Lawrie. Access and Alignment of Data in an Array Processor. IEEE Transactions on Computers, C-24:1145-1155, December 1975.
[12] R. A. Marie, A. L. Reibman, and K. S. Trivedi. Transient Solution of Acyclic Markov Chains. Performance Evaluation, 7(3):175-194, 1987.
[13] J. F. Meyer. On Evaluating the Performability of Degradable Computing Systems. IEEE Transactions on Computers, C-29(8):720-731, August 1980.
[14] J. H. Patel. Performance of Processor-Memory Interconnections for Multiprocessors. IEEE Transactions on Computers, C-30(10):771-780, October 1981.
[15] A. Reibman and K. Trivedi. Numerical Transient Analysis of Markov Models. Computers and Operations Research, 15(1):19-36, 1988.
[16] A. L. Reibman and K. S. Trivedi. Transient Analysis of Cumulative Measures of Markov Chain Behavior. 1987. Submitted for publication.
[17] D. P. Siewiorek. Multiprocessors: Reliability Modeling and Graceful Degradation. In Infotech State of the Art Conference on System Reliability, pages 48-73, Infotech International, London, 1977.
[18] D. P. Siewiorek, V. Kini, R. Joobbani, and H. Bellis. A Case Study of C.mmp, Cm*, and C.vmp: Part II -- Predicting and Calibrating Reliability of Multiprocessor Hardware. Proceedings of the IEEE, 66(10):1200-1220, October 1978.
[19] M. Smotherman. Parametric Error Analysis and Coverage Approximations in Reliability Modeling. PhD thesis, Department of Computer Science, University of North Carolina, Chapel Hill, NC, 1984.
[20] M. Smotherman, R. Geist, and K. Trivedi. Provably Conservative Approximations to Complex Reliability Models. IEEE Transactions on Computers, C-35(4):333-338, April 1986.
[21] W. Stewart and A. Goyal. Matrix Methods in Large Dependability Models. Research Report RC-11485, IBM, November 1985.
[22] K. S. Trivedi. Probability and Statistics with Reliability, Queueing and Computer Science Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.
[23] R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs, NJ, 1962.