Upper Bounds on Mean Life of Self-Repairing Systems

1962 KLETSKY: UPPER BOUNDS ON MEAN LIFE OF SELF-REPAIRING SYSTEMS 43

Upper Bounds on Mean Life of Self-Repairing Systems*

EARL J. KLETSKY†, member, IRE

* Received June 9, 1961. This paper was presented at the AIEE Conference on Diagnosis of Failures in Switching Circuits, Michigan State University, East Lansing, Mich., May 15-16, 1961. The work was supported by the Rome Air Dev. Ctr. under Contract AF 30(602)-2234.

† Department of Electrical Engineering, Syracuse University, Syracuse, N.Y.

Summary-Upper bounds on mean life of self-repairing systems are established on the basis of a very general system model. Expressions for system mean life are obtained by formulating the problem as a two-dimensional difference equation in time and the number of nonfailed spare elements.

Results indicate that mean life of an isolated system cannot be expected to exceed roughly three times the mean life of the elements which make it up, regardless of how standby elements are employed. If the element failure rate in standby is substantially less than the failure rate in operation, then system mean life is essentially linearly proportional to the number of available standby elements.

INTRODUCTION

Modern systems of great complexity are usually characterized by having very high reliability during early life followed by a very rapid decrease in reliability at some future time. That is, these systems are practically certain to be working prior to a given time and are practically certain not to be working following this time. For such systems, mean time to failure (system mean life) is a most useful parameter since it provides a good measure of the time at which this very rapid decline in reliability occurs.

Included in this general class of systems are those which exhibit self-repairing properties or employ other techniques to extend useful life. Determination of mean life for systems of this type will establish upper bounds on the expected life of present, proposed, and "blue-sky" systems.

The system model studied consists of a collection of a finite number of identical elements. Successful system operation requires that a given number of these elements be operable at any time. Elements are assumed to fail catastrophically and cannot be repaired. It is, however, possible to replace elements. The model is "self-repairing" in the sense that repair of the system is accomplished by replacement of elements from a reservoir of spares. The approach employed calculates system mean life on the basis of this very general model which includes only those components essential for successful operation. Problems of interconnection, switching, diagnosis, element replacement, and physical system repair are considered incidental. For this reason the results obtained represent upper bounds on mean life.

CLOSED SYSTEMS

An important class of systems are those which are inaccessible to external repair. These are called closed systems. Such systems are completely isolated from their external environment except that energy may be supplied externally (for example, in the form of solar energy impinging on its surface). During fabrication the system is provided with all of its component parts and appropriate spares. No new elements can be ingested by the system following its initial entry into service.

It is assumed that the closed system consists of a total of N identical elements and each is characterized by the same failure model. From this total, L elements are selected to form the actual operating portion of the system. Whatever mechanisms are employed to perform the necessary switching and interconnection functions are assumed to be perfectly reliable. Failures are independent and can only be attributed to one or more of the N elements.

REDUNDANT SYSTEMS

An example of a simple closed system is the wired-in redundant system shown in Fig. 1. The system reliability for this configuration is easily shown to be

$$R(t) = \{1 - [1 - r(t)]^{W}\}^{L} \tag{1}$$

44 IRE TRANSACTIONS ON RELIABILITY AND QUALITY CONTROL October

where r(t) is the element reliability, W is the number of paralleled elements, and L is the number of cascaded subsystems. Since the system is closed, the total number of elements is N = WL. System mean life, M, can be found by computing

$$M = \int_0^\infty R(t)\,dt = \int_0^\infty \{1 - [1 - r(t)]^{W}\}^{L}\,dt. \tag{2}$$

The first member of (2) is a generally valid relation and its derivation is carried out in the Appendix. If it is assumed that element reliability is given by $r(t) = e^{-\lambda t}$, then it can be shown that (2) reduces to

$$\lambda M = \binom{L}{1}\sum_{n=1}^{W}\frac{1}{n} - \binom{L}{2}\sum_{n=1}^{2W}\frac{1}{n} + \binom{L}{3}\sum_{n=1}^{3W}\frac{1}{n} - \cdots = \sum_{s=1}^{L}(-1)^{s+1}\binom{L}{s}\sum_{n=1}^{sW}\frac{1}{n}. \tag{3}$$

This expression for mean life of the redundant system is displayed graphically in Fig. 2 for various values of L. It is immediately apparent that order of magnitude increases in system mean life over element mean life are not possible without impractical increases in the total number of elements. However, when L is large, significant increases in system mean life over the nonredundant configuration are possible through the use of redundancy. In all cases, however, the upper bound on mean life is approximately 3 to 4 times the element mean life for reasonable numbers of elements.

Fig. 2-Mean life of redundant system.

BINOMIAL SYSTEMS

It is possible to establish upper bounds on the mean life of closed systems by removing the fixed wiring and switching constraints found in redundant systems. Consider a general closed system composed of a reservoir of N identical elements shown in Fig. 3. Out of this reservoir, a subset of L elements is chosen and suitably interconnected to form a working system. Elements of the working system which fail are discarded and replaced from the reservoir. The system fails when the reservoir is emptied. Working elements have a constant failure rate $\lambda_u$ and reservoir (standby) elements have a failure rate $\lambda_s$. The model assumes that a perfectly reliable mechanism for selection and interconnection is available.
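The reservoir model just described can be illustrated with a small Monte Carlo sketch. This is not part of the paper's analysis. It assumes exponentially distributed element lifetimes (that is, constant failure rates), and every name in it (`simulate_closed_system_life`, `estimate_mean_life`, the trial count and the seed) is hypothetical and chosen only for illustration.

```python
import random


def simulate_closed_system_life(n_total, n_working, rate_working, rate_standby, rng):
    """Simulate one lifetime of the closed reservoir model (illustrative sketch).

    Assumes constant failure rates, so every waiting time is exponential.
    """
    spares = n_total - n_working  # good elements still waiting in the reservoir
    elapsed = 0.0
    while True:
        # The time to the next failure anywhere in the system is exponential
        # with the sum of all individual failure rates.
        total_rate = n_working * rate_working + spares * rate_standby
        elapsed += rng.expovariate(total_rate)
        if spares == 0:
            # A working element failed and no spare is left: system failure.
            return elapsed
        # Either a spare failed in standby, or a working element failed and
        # was replaced from the reservoir; both cost exactly one spare.
        spares -= 1


def estimate_mean_life(n_total, n_working, rate_working, rate_standby,
                       trials=20000, seed=1962):
    """Average many simulated lifetimes (trial count and seed are arbitrary)."""
    rng = random.Random(seed)
    total = sum(
        simulate_closed_system_life(n_total, n_working, rate_working, rate_standby, rng)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # Hypothetical example: N = 10 elements, L = 4 working, equal failure rates.
    print(estimate_mean_life(10, 4, 1.0, 1.0))
```

With N = 10, L = 4 and equal rates, the estimate should land near 1/4 + 1/5 + ... + 1/10, or about 1.096 element mean lives, since the system passes through N - L + 1 = 7 exponential stages.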

Fig. 1-Redundant system. (N = LW, $r = e^{-\lambda t}$, $R(t) = \{1 - (1 - r)^{W}\}^{L}$, $M = \int_0^\infty R(t)\,dt$.)

Fig. 3-Binomial system.

The following definitions and notation will be employed:

j = net number of elements which must fail in order to cause a system failure. (At t = 0, j = N - L + 1 = m; when all standby elements are exhausted, j = 1.)


$Q_j(\tau) = Q_j(t - t_0)$ = probability that the time interval for exactly j failures is less than $\tau$.

$P_j(\tau) = \partial Q_j(\tau)/\partial \tau$ = probability density of the time interval required for exactly j failures.

$a_1$ = probability that a working element fails in the time increment $\Delta t$.

$a_2$ = probability that a standby element fails in the time increment $\Delta t$.

With the above notation $Q_j(\tau)$ can be written as

$$Q_j(\tau) = (a_1 + a_2)\,Q_{j-1}(\tau - \Delta t) + (1 - a_1 - a_2)\,Q_j(\tau - \Delta t). \tag{4}$$

This relation is shown schematically on the transition diagram in Fig. 4. $a_1$ and $a_2$ can be expressed in terms of the element failure rates, the number of elements involved, and the time increment:

$$a_1 = L\lambda_u \Delta t, \qquad a_2 = (j-1)\lambda_s \Delta t = (j-1)\,d\,\lambda_u \Delta t. \tag{5}$$

When these values are introduced and the limit taken as $\Delta t$ approaches zero, (4) becomes

$$\frac{\partial Q_j(\tau)}{\partial \tau} = P_j(\tau) = \lambda_u [L + (j-1)d]\,[Q_{j-1}(\tau) - Q_j(\tau)]. \tag{6}$$

A differentiation with respect to time yields

$$\frac{\partial P_j(\tau)}{\partial \tau} = \lambda_u [L + (j-1)d]\,[P_{j-1}(\tau) - P_j(\tau)]. \tag{7}$$

Fig. 4-Closed system transition diagram. ($a_1 = L\lambda_u \Delta t$, $a_2 = (j-1)\lambda_s \Delta t$.)

By means of the Laplace transform, (7) can be put in a more manageable form. Let

$$U(j,s) = \mathcal{L}[P_j(\tau)]. \tag{8}$$

Then the transform of (7) becomes

$$sU(j,s) - P_j(0+) = \lambda_u [L + (j-1)d]\,[U(j-1,s) - U(j,s)]$$

$$U(j,s) = \frac{[L + (j-1)d]\lambda_u}{s + [L + (j-1)d]\lambda_u}\,U(j-1,s) + \frac{P_j(0+)}{s + [L + (j-1)d]\lambda_u}. \tag{9}$$

Appropriate boundary conditions are found to be

$$U(0,s) = 1, \qquad P_j(0+) = 0. \tag{10}$$

The solution to (9) subject to the conditions expressed in (10) is easily shown to be

$$U(j,s) = \prod_{k=0}^{j-1} \frac{(L + kd)\lambda_u}{s + (L + kd)\lambda_u}. \tag{11}$$

It is now necessary to establish relations between U(j,s), system reliability, and system mean life. From the definition of system reliability

$$R(t) = 1 - \int_0^t P_m(\tau)\,d\tau = 1 - Q_m(t) \tag{12}$$

$$\mathcal{L}[R(t)] = \frac{1}{s} - \frac{1}{s}\,U(m,s).$$

Hence

$$R(t) = \mathcal{L}^{-1}\!\left[\frac{1}{s} - \frac{1}{s}\,U(m,s)\right]. \tag{13}$$

As indicated by (2) an expression for system mean life can be found by integrating system reliability over all time. Hence it is possible to write

$$M = \lim_{t \to \infty} M', \quad \text{where } M' = \int_0^t R(\tau)\,d\tau \tag{14}$$

$$\mathcal{L}[M'] = \frac{1}{s}\,\mathcal{L}[R(t)] = \frac{1}{s}\left[\frac{1}{s} - \frac{1}{s}\,U(m,s)\right]. \tag{15}$$


Employing the final value theorem leads to

$$M = \lim_{t \to \infty} M' = \lim_{s \to 0} s\,\mathcal{L}[M'] = \lim_{s \to 0}\left[\frac{1}{s} - \frac{1}{s}\,U(m,s)\right]. \tag{16}$$

System mean life can now be evaluated by using (11) in (16) and carrying out the indicated limiting process. This gives

$$M = \frac{1}{\lambda_u} \sum_{k=0}^{N-L} \frac{1}{L + kd}, \qquad d = \frac{\lambda_s}{\lambda_u}. \tag{17}$$

Two important special cases arise. When d = 1 ($\lambda_s = \lambda_u$), the failure rate for standby elements is the same as that for operating elements. In this case (17) gives

$$\lambda_u M = \sum_{k=0}^{N-L} \frac{1}{L + k} = \frac{1}{L} + \frac{1}{L+1} + \cdots + \frac{1}{N} = \sum_{n=1}^{N} \frac{1}{n} - \sum_{n=1}^{L-1} \frac{1}{n}. \tag{18}$$

The finite sums can be neatly and very accurately approximated by noting that

$$\lim_{N \to \infty}\left[\sum_{n=1}^{N} \frac{1}{n} - \ln N\right] = 0.5772\ldots = \text{Euler's constant}. \tag{19}$$

Let $\sum_{n=1}^{N} 1/n = \ln N + 0.5772 + E(N)$; then the error E(N) is less than 1 per cent for N = 20 and is approximately 0.1 per cent for N = 100.

Thus for N and L - 1 greater than 20, system mean life is very nearly

$$\lambda_u M = \ln N - \ln (L-1) \approx \ln \frac{N}{L} \qquad (\lambda_s = \lambda_u). \tag{20}$$

If the upper limit of the sum is less than 20, the value of $\lambda_u M$ is easily calculated from the exact expression given by (18). A plot of $\lambda_u M$ is shown in Fig. 5. Comparison with Fig. 1 indicates that, if $\lambda_s = \lambda_u$, even removal of the wiring constraint is not sufficient to obtain large increases in system mean life.

Fig. 5-Mean life of binomial system ($\lambda_s = \lambda_u$).

The second important special case of (17) arises when d = 0 ($\lambda_s = 0$). This implies that no failures occur among the standby elements. Under these conditions, system mean life becomes

$$\lambda_u M = \sum_{k=0}^{N-L} \frac{1}{L} = \frac{N - L + 1}{L}, \qquad \lambda_s = 0. \tag{21}$$

Fig. 6 shows curves of $\lambda_u M$ as given by (21). For comparison, curves from Fig. 5 are superimposed. The improvement in mean life over the case where standby elements fail is immediately evident.

Fig. 6-Mean life of binomial system ($\lambda_s = 0$).
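To make the comparison between the wired-in redundant system and the binomial bound concrete, the following sketch evaluates both in units of element mean life. It assumes exponential element reliability. The redundant-system function uses the alternating harmonic-sum form obtained by expanding (2) with the binomial theorem, which is a reconstruction and not a verbatim copy of the printed expression. The function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb, log


def harmonic(n):
    """Exact partial sum 1 + 1/2 + ... + 1/n (zero for n = 0)."""
    return sum((Fraction(1, i) for i in range(1, n + 1)), Fraction(0))


def redundant_mean_life(width, length):
    """lambda * M for `length` cascaded stages of `width` paralleled elements.

    Assumed form: sum over s of (-1)^(s+1) * C(L, s) * H(s * W), evaluated
    with exact fractions because the alternating terms cancel heavily.
    """
    total = Fraction(0)
    for s in range(1, length + 1):
        sign = 1 if s % 2 == 1 else -1
        total += sign * comb(length, s) * harmonic(s * width)
    return float(total)


def binomial_mean_life(n_total, n_working, d):
    """lambda_u * M for the binomial system: sum of 1 / (L + k d), k = 0..N-L."""
    spares = n_total - n_working
    return sum(1.0 / (n_working + k * d) for k in range(spares + 1))


if __name__ == "__main__":
    # Same element budget, N = 16 and L = 4, wired-in versus free replacement.
    print("redundant  W=4, L=4 :", redundant_mean_life(4, 4))
    print("binomial   d=1      :", binomial_mean_life(16, 4, 1.0))
    # Larger hypothetical case: exact sum versus the logarithmic approximation.
    print("binomial   d=1 exact:", binomial_mean_life(700, 64, 1.0))
    print("ln(N/L)             :", log(700 / 64))
    print("binomial   d=0      :", binomial_mean_life(700, 64, 0.0))
```

For N = 700 and L = 64 the exact sum with d = 1 is about 2.40, against ln(N/L) of about 2.39, while d = 0 gives 637/64, or about 9.95. This is the contrast between logarithmic and linear growth that the two special cases describe.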


For the general case when $\lambda_s \neq 0$, system mean life lies between the respective curves of Fig. 6.

As a practical application of these results consider a closed system requiring 64 transistors for its operation. Let the system contain 700 transistors so that the ratio of spares to operating units is 10. If the transistors are used in a reasonably conservative fashion, then the mean life of the transistor is roughly the same whether it is in operation or in standby. If the transistor mean life is, say, 3 years, then it is impossible to design this closed system to have a mean life in excess of 7 to 8 years even if all other components in the system never fail.

OPEN SYSTEMS

Systems which can ingest new elements either automatically or manually are called open systems. The mean life of such systems can be found by extending the analysis made above for closed systems. As an upper bound on open systems consider a binomial system containing N elements, L of which must operate in order that the system operate. Assume that the failure rate for elements in operation is $\lambda_u$ and that the failure rate for elements in standby is zero. Further assume that replacements are made only when the total number of working elements is less than N. They are then made at random time intervals at an average rate b(j). The transition diagram for this model is shown in Fig. 7. From the diagram $Q_j(\tau)$ is seen to be

$$Q_j(\tau) = a\Delta t\,Q_{j-1}(\tau - \Delta t) + b\Delta t\,Q_{j+1}(\tau - \Delta t) + (1 - a\Delta t - b\Delta t)\,Q_j(\tau - \Delta t), \qquad j > 0, \tag{22}$$

where:

$a\Delta t = L\lambda_u \Delta t$ = probability that an element fails in the time interval $\Delta t$.

$b\Delta t = b(j)\Delta t$ = probability that a replacement element is ingested in the time interval $\Delta t$.

Fig. 7-Open system transition diagram.

After taking the limit as $\Delta t$ approaches zero, inserting $P_j(\tau) = \partial Q_j(\tau)/\partial \tau$, and forming the Laplace transform, (22) becomes

$$sU(j,s) = aU(j-1,s) + bU(j+1,s) - (a + b)\,U(j,s) + P_j(0+), \qquad j > 0. \tag{23}$$

The complete solution of (23) involves knowledge of appropriate boundary conditions. These are found to be

$$U(m,s) = \frac{a}{s + a}\,U(m-1,s), \qquad U(0,s) = 1, \qquad P_j(0+) = 0. \tag{24}$$

With these conditions, the difference equation to be solved can be written as

$$(s + a + b)\,U(j,s) = aU(j-1,s) + bU(j+1,s) \quad \text{for } 1 \le j < m \tag{25a}$$

$$U(m,s) = \frac{a}{s + a}\,U(m-1,s), \qquad m = N - L + 1 \tag{25b}$$

$$U(0,s) = 1. \tag{25c}$$

Eq. (25) is a second-order linear difference equation with nonconstant coefficients, b = b(j). Efficient techniques for solving this type of equation do not exist. It is possible, however, to find a solution by "calculating up" from (25a). Values for the index j + 1 can be found in terms of values for index j by successive substitution. When j = m, (25b) is used to eliminate the final unknown quantity U(m-1,s).

For the special case where the coefficients are independent of j [replacement rate b(j) = B = constant], a closed form solution for system mean life has been obtained. This solution can be expressed


as

$$M = \frac{1}{a}\left[\frac{x\,(x^{m} - 1)}{(x - 1)^{2}} - \frac{m}{x - 1}\right], \qquad x = \frac{B}{a} \neq 1, \quad m = N - L + 1, \quad a = L\lambda_u. \tag{26}$$

Eq. (26) is plotted in Fig. 8 as a function of N with x as a parameter for the special case L = 8. From this figure the effect of replacement rate on system mean life is easily observed.

Fig. 8-Mean life of binomial system with replacement (L = 8, $\lambda_s = 0$).

CONCLUSIONS

The general conclusions reached in this study can be summarized as follows:

1) If the failure rates in standby and in operation are the same then the mean life of a closed system cannot be expected to exceed roughly three times the mean life of the (identical) elements which make it up, regardless of the technique used to employ the standby elements.

2) If the failure rate of elements in standby is substantially less than the failure rate in operation then the mean life of a closed system is essentially linearly proportional to the number of standby elements available.

3) The mean life of open systems can be greatly increased by ingesting replacement elements at an average rate which is greater than the element failure rate.

Difficult technological problems remain which prevent system mean life from approaching the bounds outlined above. Among the most serious is the development of switching apparatus of suitable complexity and reliability. Others are related to the problems associated with the implementation of failure sensors and self-diagnostic routines. A third serious problem is the design of hardware capable of actually carrying out the required repair function.

APPENDIX

A useful expression for system mean life is given by

$$M = \int_0^\infty R(t)\,dt. \tag{27}$$

The proof of this expression follows:

Mean time to first failure (mean life) is defined as

$$M = \int_0^\infty t\,W(t)\,dt \tag{28}$$

where W(t) is the probability density of times to system failure.

System reliability can also be expressed as a function of W(t):

$$R(t) = 1 - \int_0^t W(x)\,dx. \tag{29}$$

Differentiating (29) with respect to time leads to

$$\frac{dR(t)}{dt} = -W(t). \tag{30}$$

Hence (28) can be written as

$$M = -\int_0^\infty t\,\frac{dR(t)}{dt}\,dt. \tag{31}$$

The right member of (31) is now integrated by parts, yielding

$$M = -\Big[\,t\,R(t)\Big]_0^\infty + \int_0^\infty R(t)\,dt = \int_0^\infty R(t)\,dt. \tag{32}$$

The integrated term vanishes at the upper limit since M is known to be finite.
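As a numerical companion to the open-system model of (25), the sketch below computes mean life directly from the transition rates rather than through the Laplace transform. It rests on a standard birth-death first-passage argument, which is an assumption of this example and not the paper's own method: if $T_j$ is the mean time to move from state j to state j - 1, then $T_m = 1/a$ and $T_j = [1 + b(j)\,T_{j+1}]/a$, and mean life is the sum of the $T_j$. For constant b(j) = B the recursion sums to a geometric-series closed form, which the code uses as a cross-check. All function names are hypothetical.

```python
def open_system_mean_life(n_total, n_working, rate_working, replacement_rate):
    """Mean life of the open system, starting with all N elements good.

    `replacement_rate(j)` plays the role of b(j); standby elements are
    assumed not to fail, as in the open-system model.
    """
    m = n_total - n_working + 1
    a = n_working * rate_working
    # step_time is the mean time to drop from state j to state j - 1.
    # In state m the system is full, so no replacement can occur.
    step_time = 1.0 / a
    total = step_time
    for j in range(m - 1, 0, -1):
        # A replacement (rate b) sends the system back up to j + 1, from
        # which it must first come down again before reaching j - 1.
        step_time = (1.0 + replacement_rate(j) * step_time) / a
        total += step_time
    return total


def constant_rate_closed_form(n_total, n_working, rate_working, b_const):
    """Geometric-series sum of the recursion above for constant b(j) = B."""
    m = n_total - n_working + 1
    a = n_working * rate_working
    x = b_const / a
    if x == 1.0:
        return m * (m + 1) / (2.0 * a)
    return (x * (x ** m - 1.0) / (x - 1.0) ** 2 - m / (x - 1.0)) / a


if __name__ == "__main__":
    # Hypothetical case: L = 8 working elements, N = 12, unit failure rate.
    for b_const in (0.0, 4.0, 8.0, 16.0):
        recursive = open_system_mean_life(12, 8, 1.0, lambda j: b_const)
        closed = constant_rate_closed_form(12, 8, 1.0, b_const)
        print(b_const, recursive, closed)
```

With B = 0 the recursion collapses to (N - L + 1)/(L λ_u), the closed-system value for non-failing spares. Once B exceeds a = L λ_u the mean life grows geometrically with the number of spares.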
