
IEEE TRANSACTIONS ON COMPUTERS, VOL. c-31, NO. 7, JULY 1982

Fault-Tolerant Design for VLSI: Effect of Interconnect Requirements on Yield Improvement of VLSI Designs

TULIN ERDIM MANGIR, MEMBER, IEEE, AND ALGIRDAS AVIZIENIS, FELLOW, IEEE

Abstract-In order to take full advantage of VLSI, new design methods are necessary to improve the yield and testability. Designs which incorporate redundancy to improve the yields of high density memory chips are well known. The goal of this paper is to motivate the extension of this technique to other types of VLSI logic circuits. The benefits and the limitations of on-chip modularization and the use of spare elements are presented, and significant yield improvements are shown to be possible.

A mathematical model for yield is developed, and the effects of interconnect densities and logic module complexities in yield improvement are investigated for regularly designed VLSI circuits. The chip contains redundant modules which can be substituted for flawed modules at the probe testing stage. Because of the interconnection complexity involved in the reconfiguration, redundancy is employed only for the logic areas, while their interconnections remain nonredundant. It is shown that yield improvements of two or more over the yields of nonredundant designs can be expected even though redundancy is provided only for the logic areas. It is also observed that the yield improvement saturates above a limited amount (10 percent) of redundancy. This indicates that the chips can be designed around an optimal amount of additional logic on the chip to improve the yields.

Index Terms-Interconnect area estimates, redundancy partitioning, redundancy placement, regular designs, VLSI yield improvement, VLSI fault tolerance.

I. INTRODUCTION

PAST improvements in circuit integration have been reductions of defect densities to achieve economically viable yields of perfect devices. This is neither possible nor necessary for VLSI. The proposed structures are so complex (>10^5 gates per chip) and the catastrophic defect size is so small (about 1 μm) that there is a great likelihood that we will never be able to make them perfect. On the other hand, we can now afford to put redundancy on the chip to make fault-tolerant design viable, by allocating some of the chip area for this purpose. However, a careful study of how and where this redundancy should be placed is necessary to avoid excessive penalty in added area and possibly more complex processing. Currently, there is no analytical model available to optimize the placement of redundancy.

Manuscript received October 8, 1981; revised January 12, 1982. This work was supported in part by the Office of Naval Research under contract N00014-79-C-0866.

T. E. Mangir is with Mangir Associates, Los Angeles, CA 90066.

A. Avizienis is with the Department of Computer Science, University of California, Los Angeles, CA 90024.

The following methods have been suggested to improve the processing yields of VLSI circuits:

1) modifying the design rules, thereby reducing the probability of yield loss due to critical spacings, such as contact to poly layers (defect avoidance);

2) modifying the designs and the design methodologies, so that redundant elements (spares) can be introduced on the chip to compensate for defective areas. Redundancy can be introduced into the design in different forms such as coding or replication.

In previous work redundant elements to improve the processing yield have been proposed and implemented for high density memories [2], [3], [6], [7], [16]. The approach is to substitute spare word or bit lines (address lines) for defective ones during the testing of the chips. The information which identifies good and bad locations can be stored on the chip by one of the following alternatives:

1) latches,
2) laser personalization [2], [6], [7], [10],
3) electrically programmable storage elements [8],
4) electrically programmable links [13].

Latches and other electrically alterable memory elements can be reprogrammed in the field, if necessary. Therefore, the redundancy included on the chip for yield improvement can also be used for field maintenance/reliability improvement.

Laser personalization involves blowing the fuses to spare elements [6], disconnecting the contacts to faulty elements [2], or joining interconnect layers together [10]. Thus it is permanent, and once done cannot be changed. In [6] laser pulses are used to burn fuses to isolate the faulty elements from the power supply. This technique is particularly suited to cases where interconnections between modules are done in serial or in pipeline fashion.

The evolution of yield with chip area has been extensively studied in the literature. Various theories have been presented and analytical expressions derived to fit statistical data based on defect density distributions [1], [15], [17], [18]. All work is based on random defect distributions, and the papers differ in their treatment of various defects being distinguishable or indistinguishable from each other. Generally, it is assumed that the yield falls off exponentially with increasing chip area. This occurs when a Poisson distribution of defect densities is assumed. However, it has been observed that in practice yield

0018-9340/82/0700-0609$00.75 © 1982 IEEE


Fig. 1. Two-dimensional array of modules and interconnects.

falls off less sharply [1], but nevertheless significantly, with increasing chip area.

The other attempts at yield calculations which take redundancy into account have been reported for memory circuits by Schuster et al. [16] and Stapper et al. [18]. The model presented in [16] is also based on the exponential dependence of yield on the active chip area. The defects are separated as correctable, uncorrectable, and gross imperfections, and the net yield is calculated as the product of these three independently calculated yields. Stapper et al. [18] have described a yield model with redundancy based on the Gamma distribution of defects for RAM chips. Mixed Poisson statistics are used to arrive at a yield expression describing yield of the redundantly designed memory chips. That approach is based on the data obtained from a specific process line for a specific product (i.e., RAM chips), and therefore may not be applicable to other circuit types or process facilities. Also, for our purposes we are interested only in the cumulative effect of the defects in the logic areas, rather than the individual defect types. In this work we have used an average random defect density and Bose-Einstein statistics to derive the yield expressions. This approach is readily extendable to yield calculations with redundancy, and also preserves the generality of the expressions as long as defects per chip can be described with an average defect density. Neither of the references [16] and [18] accounts for the effects of the complexities of areas, connectivities between different areas, or the effect of regularity of interconnections which would affect the processing tolerances, and thus yield [19].

It is important to note that the redundant areas introduced are also as susceptible to defects as other parts of the chip, since the defects are assumed to be randomly distributed. However, the probability of the added (redundant) and the nonredundant areas both being defective at the same time is very small. Also, studies done on high density memories [16], [18] indicate that a large percentage of the failures occurs because of localized defects which affect single bits, rows, or columns of the array, suggesting that this assumption is a valid one.

In Section II a yield model with redundancy is described. In Section III the effects of module complexities and interconnect densities are studied. Section IV discusses the effect of other factors on yield improvement, and Section V presents a summary of the conclusions of this work.

II. YIELD MODEL WITH MODULE REDUNDANCY

The idealized model of the chip, depicted in Fig. 1, consists of a two-dimensional arrangement of N modules, of which N − R are required to be functional, with intermodule connections running vertically and horizontally between the modules. It is assumed that all modules are identical; therefore any faulty module can be replaced by one of the R redundant (spare) modules. Examples of this type of design are: on-chip bit-slice designs, master-slice designs, array processors, and distributed-logic memories. A specific design developed at UCLA that comes very close to the model is the signed-digit arithmetic building block [20].

For the nonredundant case the yield is given by the probability of getting a defect-free chip. What redundancy allows us is to reclaim the chips with nonzero defects which would normally be considered as waste. For example, if we have one extra module on the chip (i.e., R = 1), by successfully substituting this extra module for the defective one, we should be able to add the yield of all the chips with a single defect to the yield of defect-free chips to improve the total yield. To describe the expected yield we assume a random defect density which has an average D_0, and statistical partitions to obtain the yield of redundantly designed VLSI chips as follows [12]:

    Y_R = Σ_{k=0}^{∞} [(A_R D_0)^k / (1 + A_R D_0)^{k+1}] (1 − r)^k P_{k,N,R}    (1)

where

    P_{k,N,R} = C_{N,R}    for k ≤ R,

    P_{k,N,R} = C_{N,R} Σ_{j=1}^{R} (j/N)^{k−j} Π_{i=0}^{j−1} (N − i)/N    for k > R.

In practice, the contribution of terms with k > R to total yield Y_R is quite small; therefore for practical purposes the infinite sum in (1) can be replaced by a finite sum with the upper limit of R.

The bypass coverage 0 ≤ C_{N,R} ≤ 1 is defined as the conditional probability that a module will be bypassed, given that it is faulty. The inclusion of C_{N,R} explicitly allows the consideration of less than perfect bypass mechanisms and tests employed to identify the defective logic modules. Its use here is analogous to the use of the coverage parameter in the modeling of fault-tolerant computer systems [14]. Even though it can be expressed as an explicit function of complexity, interconnection densities, amount of redundancy, and switching mechanisms, this dependency is treated implicitly in this paper.
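As a concrete illustration, the truncated form of (1) can be evaluated numerically. The sketch below is not from the paper: it assumes Bose-Einstein defect statistics as in the text, truncates the sum at k = R, and takes P_{k,N,R} = C_{N,R} for 1 ≤ k ≤ R; the function name and all numeric inputs are hypothetical.

```python
def yield_with_redundancy(A_R, D0, r, R, C=1.0):
    """Truncated form of (1): Y_R ~= sum_{k=0}^{R} p_k * (1-r)^k * P_k,
    where p_k = (A_R*D0)^k / (1 + A_R*D0)^(k+1) is the Bose-Einstein
    probability of exactly k defects, and P_k = C for 1 <= k <= R."""
    m = A_R * D0                      # average number of defects per chip
    total = 0.0
    for k in range(R + 1):
        p_k = m**k / (1.0 + m)**(k + 1)
        P_k = 1.0 if k == 0 else C    # a defect-free chip needs no bypass
        total += p_k * (1.0 - r)**k * P_k
    return total
```

The k = 0 term alone is the nonredundant yield 1/(1 + A·D_0); each added spare reclaims the corresponding k-defect chips, discounted by the probability (1 − r)^k that no defect falls in the interconnect area.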

Equation (1) provides us with the necessary analytic tool to investigate the yield improvement and cost due to the use of redundancy as a function of the parameters of (1). Specific attention will be given to the effect of interconnect requirements and module complexity in the following section. An elementary illustration of the use of (1) is given in Fig. 2, in


MANGIR AND AVIZIENIS: FAULT-TOLERANT DESIGN FOR VLSI

Fig. 2. Yield versus area (0 to 1.0 cm^2) for bypass coverage C_{N,R} = 1 and D_0 = 9/cm^2, with redundancy (number of extra modules) as parameter.

Fig. 3. Figure of merit (FM) versus percent redundancy for varying interconnect area ratios, for 10^5 gates/chip, D_0 = 9/cm^2, and active nonredundant chip area A_0 = 0.5 cm^2; N = 100, n_g = 1000, a_g = 500λ^2, y = 0.5, b = 0.5, λ = 1 μm. Legend (A_R, A_w, Y_0): (0.66, 0.06, 0.16), (0.74, 0.14, 0.15), (0.85, 0.25, 0.13), (0.99, 0.39, 0.11), (1.19, 0.59, 0.10).

which Y_R is plotted versus the total area A_R, for several values of R and for D_0 = 9/cm^2, with the simplifying assumptions of r = 0 and C_{N,R} = 1. This is an upper bound for improvement, assuming essentially independent modules and perfect bypass coverage. These assumptions will be eliminated in subsequent discussions.

For cost considerations a figure of merit FM is defined as follows. To be able to produce a working chip, on the average 1/Y_0 nonredundant chips, and 1/Y_R chips designed with redundancy, must be produced, respectively. (Y_0 and Y_R are the yields of nonredundant and redundant chips, respectively.) Working chips are defined to be those which perform their intended functions with respect to observable input/output relationships, when accessed at the probe level, even though there may be undetected internal faults.

When the ratio Y_R/Y_0 > 1, then redundancy will have increased the yield of VLSI chips. However, redundancy also increases the total area of the chips from A_0 to A_R = N·A_m + A_w. This assumes that there is no significant increase in the area of pads, pad drivers, etc. The figure of merit (FM) is defined as

    FM = (A_0/A_R) × (Y_R/Y_0)

and we consider that a cost advantage is attained by the use of redundancy when FM > 1.
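Under the same Bose-Einstein assumptions, FM can be sketched numerically. In this sketch the redundant chip has N modules of which N − R are required, as in Section II; the function name, the default perfect coverage, and all parameter values are hypothetical.

```python
def figure_of_merit(N, R, A_m, A_w, D0, C=1.0):
    """FM = (A_0/A_R) * (Y_R/Y_0), with A_0 = (N-R)*A_m + A_w (no spares)
    and A_R = N*A_m + A_w. Yields follow Bose-Einstein statistics, with
    the sum in (1) truncated at the upper limit R."""
    A_0 = (N - R) * A_m + A_w          # nonredundant chip area
    A_R = N * A_m + A_w                # chip area including R spare modules
    Y_0 = 1.0 / (1.0 + A_0 * D0)       # probability of a defect-free chip
    r = A_w / A_R                      # interconnect area ratio
    m = A_R * D0                       # mean defects on the redundant chip
    Y_R = sum(m**k / (1.0 + m)**(k + 1) * (1.0 - r)**k * (C if k else 1.0)
              for k in range(R + 1))
    return (A_0 / A_R) * (Y_R / Y_0)
```

With R = 0 the expression reduces to FM = 1; FM > 1 indicates a cost advantage from redundancy despite the added area.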

III. EFFECT OF INTERCONNECTS AND MODULE COMPLEXITIES

To incorporate the interconnect areas into the model the following two assumptions are made:

1) the number of defects (faults) in the interconnections is a function of the area of the interconnections;

2) since redundancy is provided only for the logic modules, any defect falling into the interconnect area is catastrophic.

The key parameter for the study of the interconnection effect is the ratio r, defined as r = A_w/A_R, where A_w is the intermodule interconnect area. The term (1 − r)^k in (1) represents the probability that, given k defects on the chip, A_w remains defect free. In Fig. 3 the FM is plotted versus the percentage of redundancy for a chip of nonredundant active logic area of 0.5 cm^2, for varying values of r, with a fixed defect density D_0 = 9/cm^2. Note that the maximum improvement shifts to a higher percentage of redundancy for lower values of r. This is explained by the interconnect dominance of yield improvement for large ratio of interconnect areas.

In recent papers, Donath [4], [5] and Heller [9] have studied the interconnection problem for LSI, with respect to average interconnection lengths and wireability, with the objective of finding results which reflect the characteristics of the logic design process. The approach used in this paper to estimate the interconnect requirements is based on an extension of the model described in [9]. Briefly, the model considers a doubly infinite array of equal size modules. The number of interconnections originating from each module is taken to be distributed according to a Poisson distribution with parameter y and is assumed to be independent between different modules. Each interconnection length is assumed to be independently chosen according to a geometric distribution, with the mean L'. In interconnecting the digital designs, interconnects in the same net are observed to connect more than two modules. In spite of this correlation it has been observed a posteriori that the assumption of independent origination and termination of single two-module interconnections does not introduce serious error.

The following relationship has been observed for the number T of terminals (connections) required per module [5]:

    T = y·n_g^b

where

    n_g is the number of blocks (gate equivalents) in the module,
    y is the average number of interconnects per gate,
    b is Rent's exponent, 0.5 ≤ b ≤ 1.

It is observed that b = 0.5 when the interconnections are mostly internal to the module. Larger values are observed for modules with high numbers of external interconnections.
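Rent's relationship above is straightforward to evaluate; the function name and the parameter values below are illustrative examples, not the paper's data.

```python
def terminals_per_module(n_g, y, b):
    """Rent's rule T = y * n_g**b: terminal count for a module of n_g
    gate equivalents, y interconnects per gate, and Rent exponent b."""
    return y * n_g**b

# A mostly self-contained module (b = 0.5) needs far fewer external
# connections than an interconnect-heavy one (b = 0.75) at equal gate count:
t_low = terminals_per_module(1000, 2.5, 0.5)    # about 79 terminals
t_high = terminals_per_module(1000, 2.5, 0.75)  # about 445 terminals
```

This steep growth with b is what drives the intermodule wiring area, and hence the ratio r, in the yield comparisons that follow.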

Given some regular geometric arrangement of logic areas, one can model the interconnect requirements as a stochastic process. Interconnects will originate and terminate at modules


Fig. 4. Effect of size of the blocks on FM for λ = 1 μm, n = 10^5 gates/chip, D_0 = 5/cm^2. N = 100, n_g = 1000, y = 2.5, b = 0.75. Legend (A_R, A_w, Y_0): (0.78, 0.18, 0.23), (0.43, 0.13, 0.35).

Fig. 5. Effect of number of blocks on FM for λ = 1 μm, n = 10^5 gates/chip, D_0 = 5/cm^2, a_g = 500λ^2, y = 2.5, b = 0.75. Legend (A_R, A_w, Y_0): (0.71, 0.11, 0.25), (0.78, 0.18, 0.23), (0.89, 0.29, 0.21).

according to some probabilistic models. For a given number of modules, interconnect demand is a function of the average number of interconnections per block. Another factor that affects the demand is the average length of interconnections. Consider the two-dimensional arrangement of modules as in Fig. 1.

The length of interconnects required for N modules, where the average length of interconnections is L' and (√A_m + wP) is the distance between modules, is given by

    L = N·L'·y·n_g^b·(√A_m + wP)

where

    A_m is the module area,
    w is the average number of wires in the intermodule interconnect area,
    P is the distance between centers of the wires, i.e., pitch, which is defined by technology.

For this paper horizontal and vertical pitches are assumed to be the same and P = 5λ, where λ is the linewidth.

Then, assuming that the horizontal and the vertical pitches are the same, the interconnect area between the modules is

    A_w = L_h·L_v − N·A_m

where L_h and L_v are the total horizontal and vertical interconnect lengths, respectively. (Here, for simplicity we will assume that the number of interconnect channels between the modules is √N. This is a good assumption for N > 10.) Then using the equation for A_w and Fig. 1, we get

    A_w = 2wNP√A_m + Nw^2P^2.    (2)

As expected, the interconnect area is a strong function of the average number of wires in the interconnect channel. Depending on the assumptions made on w, the effect on the yield improvement will vary.
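Equation (2) amounts to subtracting the logic area from the total array area of Fig. 1. The sketch below checks that identity numerically; the variable names and the values of N, w, P, and A_m are illustrative assumptions (areas in cm^2, pitch in cm).

```python
import math

def interconnect_area(N, w, P, A_m):
    """Equation (2): A_w = 2*w*N*P*sqrt(A_m) + N*(w*P)**2, the intermodule
    wiring area for N modules of area A_m with w wires of pitch P per
    channel."""
    return 2 * w * N * P * math.sqrt(A_m) + N * (w * P)**2

# Consistency check against the array geometry of Fig. 1: the chip side is
# sqrt(N)*(sqrt(A_m) + w*P), and A_w is the chip area minus logic area N*A_m.
N, w, A_m = 100, 10, 0.005            # hypothetical values; A_m in cm^2
P = 5 * 1e-4                          # pitch P = 5*lambda for lambda = 1 um
side = math.sqrt(N) * (math.sqrt(A_m) + w * P)
A_w = interconnect_area(N, w, P, A_m)
r = A_w / (N * A_m + A_w)             # interconnect area ratio r = A_w/A_R
```

The quadratic dependence on w·P makes A_w, and therefore r, grow quickly once the channels widen, which is the sensitivity the text describes.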

A. Effect of the Size of the Blocks

To examine the effect of the size of the individual blocks that make up the modules, we use a_g = 500λ^2 and a_g = 250λ^2, where a_g is the individual block area and each module has n_g blocks. The size of the individual blocks is an indication of how "tight" the design is. Note that this area is assumed to include the wiring area associated with the individual block. From Fig. 4 we make the following observations.

1) For larger block areas, the wiring area and the logic area ratios are smaller than for the small size block areas. This results in

a) lower yields because of the increased chip area;

b) increased improvement in yield with redundancy, for the larger block area case. We also observe that the peak improvement shifts toward less added redundancy for small block areas.

2) The same relative effect is observed for different linewidths λ. This can be explained, since as block areas are decreased the actual critical logic area is reduced, therefore less redundancy is needed. Also, for smaller block areas the overall yield becomes more interconnect dominated. In other words, within the assumptions of the model the amount of hardcore is increased.

3) Peak yield improvement is depressed for smaller linewidths. This results from smaller chip area, and therefore higher nonredundant yields.

B. Effect of the Number of Blocks

To observe the effect of the number of blocks within a module we have plotted the yield curves for a fixed total block count n, from n = 10^4 to n = 10^6 blocks (gate equivalents) per chip. For 10^5 blocks/chip with y = 2.5, b = 0.75, a_g = 500λ^2, and λ = 1 μm, we have

    N       n_g     r      Y_0
    10      10^4    0.15   0.32
    100     10^3    0.22   0.24
    1000    10^2    0.32   0.21

Yield improvement for D_0 = 5/cm^2 is shown in Fig. 5. We see that the most significant improvement is for the N = 10 case, which has the minimum intermodule area and the highest nonredundant yield. The maximum improvement can be reached by just adding one spare module. This is encouraging for those cases where a chip is composed of multiple areas of different regularly designed blocks, where the design of each area can be maximized for yield with an extra logic module.


Fig. 6. Effect of complexity of logic modules for Case 1, with n = 10^5 gates/chip, D_0 = 5/cm^2, λ = 1 μm, varying Rent's exponent b. N = 100, n_g = 1000, a_g = 500λ^2, y = 0.5. Legend (A_R, A_w, Y_0): b = 0.5: (0.68, 0.08, 0.259); b = 0.75: (1.2, 0.6, 0.168).

Fig. 7. Effect of complexity of logic modules for Case 1, with n = 10^5 gates/chip, D_0 = 5/cm^2, λ = 1 μm, varying average connections per block y. N = 100, n_g = 1000, a_g = 500λ^2, b = 0.5. Legend (A_R, A_w, Y_0): (0.68, 0.08, 0.259), (1.11, 0.51, 0.177).

C. Effect of Complexity of the Logic Modules

For our purposes complexity will be defined in two different ways: first, by the block (equivalent gate) count of each module, i.e., the number of blocks/module; and second, by studying the external interconnect requirements, i.e., the number of connections required by the module. The first case has been addressed under the section for size of blocks, so in this section only the second case is considered. We examine two cases for the dependence of the average number of interconnects in a wiring channel w on the average connection length per module L'.

Case 1: w = O(L'·y·n_g^b) in (2). This results from direct use of the equation for the number of terminals T. For the same block count and y, it is observed that b has a pronounced effect on the overall yield improvement. This can be interpreted as follows: the lower the external information required by the block (i.e., the more it is self contained), the better the improvement in yield. In Fig. 6 it is observed that even though nonredundant yields are close because of the extremely high intermodule wiring required, the maximum improvement in yield is much lower for b = 0.75. In fact, FM becomes less than

Fig. 8. Effect of module complexity for Case 1, for y > 1, with n = 10^5 gates/chip, D_0 = 5/cm^2, λ = 1 μm. N = 100, n_g = 1000, a_g = 500λ^2, y = 2.5. Legend (A_R, A_w, Y_0): b = 0.5: (1.11, 0.51, 0.177); b = 0.75: (6.24, 5.64, 0.038).

Fig. 9. Effect of complexity of logic modules for Case 2, with n = 10^5 gates/chip, D_0 = 5/cm^2, λ = 1 μm, varying Rent's exponent b. N = 100, n_g = 1000, a_g = 500λ^2, y = 2.5. Legend (A_R, A_w, Y_0): (0.67, 0.07, 0.26), (0.78, 0.18, 0.23).

one for b > 0.6, which is the crossover exponent for y > 1 (Fig. 8). In Fig. 7 FM is plotted for the same case, for varying y. Even though this (Case 1) dependency has been used widely to approximate the wiring area required for LSI chips, we have observed that this dependency overestimates connection requirements, especially for large logic blocks, which tend to be more self contained.

Case 2: w = O((L'·y·n_g^b)^(1/2)) in (2). This is a more realistic dependence for actual circuits of LSI and above complexities, where gate-to-pin ratios (or gate-to-connection ratios) are high. In this case, the dependence of yield improvement on both y and b is observed to be weak. However, for self contained modules a higher yield improvement is possible (Fig. 9).

It is concluded that, for modules with simpler interconnect requirements (Case 2), the number of modules becomes the dominant factor in determining the yield improvement even for large module sizes (Fig. 5). For designs where a high degree of connectivity is required between modules (low gate-to-pin ratios), module complexity becomes the limiting factor for yield improvement using spare modules, and the order of the curves in Fig. 5 is reversed, i.e., the least amount of improvement is possible for N = 10.


IV. OTHER CONSIDERATIONS

1) Effect of Minimum Linewidth λ: In general, smaller linewidths mean a smaller overall chip area for the same function capability. In this paper the total number of blocks is used as a measure of function capability. Overall, the yields are observed to be higher for narrower linewidths, as expected for the same defect density. However, from past experience it has been conjectured that the defect density increases inversely proportionally to the linewidth, with a ratio ~(λ1/λ2)^2 for scaled technologies, going from a process with λ1 to a scaled process with λ2, where λ1 > λ2. Also, to avoid high current density effects such as metal migration, the cross sections of the interconnects are not scaled with the process. It is experienced that thick layers of interconnects are prone to a higher density of defects. Therefore, to interpret the effect of the linewidth, appropriate defect density curves should be used corresponding to these conditions. Another point is that during the lifetime of a process, depending on the maturity and the changes made, both the defect density and the defect distribution vary. Therefore, actual yield improvement will shift up and down accordingly.

2) Effect of Extent of Failures: In deriving the yield equations we have assumed that the effects of defects in logic areas are independent of each other. Note that this assumption is not required. For a module of n_g gates, the distance between the modules is determined by the wiring requirements of the modules w. For w = 10 and a pitch P = 5λ, this distance is ~50 μm, which can usually be detected during visual inspection of the chips. Also, any defect of this size or larger will also affect the interconnect area. Since we have assumed that any fault in the intermodule interconnection area is catastrophic, defects extending to more than one logic area are also taken into account in the model.

3) Extension to Multiple Regions: On a logic chip whichconsists of multiple types of logic modules, each area can bedesigned to improve the yield of the overall design. In this casesome areas may utilize the approach addressed here, and othersmay use error coding. Each type of module would be optimizedfor yield of that area and the overall chip yield. Ifwe generalizethe mathematical approach presented here, then the overallyield with redundancy can be expressed as [ 12]

R 1=[1+ k(1[- r)k(1 + ATDO) L = All possible

combinations ofk defects in m regions

where
AT is the total chip area,
R is the least amount of redundancy,
r is the area ratio of common connections between the m regions (i.e., the hard core) to the total active chip area,
Ii is the yield improvement for each region.

V. CONCLUSIONS

In this paper we have presented a model to study fault tolerance techniques applied in regularly designed VLSI circuits to bypass manufacturing flaws by using on-chip redundant logic modules. Because of the complexity involved in reconfiguration, we have not considered redundancy in the interconnect areas, and have treated them as the hard core. We have presented expressions for redundant yield with chip area, defect density, and amount of redundancy as parameters. We have also discussed the effects of interconnect and module complexities. Our main conclusions are as follows.

* For a given chip size and defect density, yields increase rapidly at first with increased redundancy up to a maximum, then decrease relatively slowly because of the added area.

* The optimum amount of redundancy which maximizes the yield increases slowly with increasing defect density, while the maximum yield decreases.

* When yields are dominated by interconnect area, the yield improvement is not as large.

* Randomness of the intermodule connection pattern increases the interconnect area, thereby decreasing the yield improvement.

* For large modules with low intermodule interconnect requirements, a smaller number of modules on the chip results in higher yield improvements. To put it another way, for modules with a high gate-to-pin (connection) ratio, redundancy is more effective if the extra modules are provided at the large-module level, with the design being partitioned into a small total number of modules on the chip.

* For logic modules with high intermodule interconnect requirements, yield improvement is limited by the complexity of the module. (Module complexity is defined by the interconnections.) That is to say, if modules have a low gate-to-connection ratio, then redundancy must be provided in the form of small modules, with the total design being partitioned into a large number of these modules.

* For this approach to be effective, a high bypass coverage (defined as the product of the testing and replacement coverages) is necessary. This means that the probability of detecting defective modules (testing coverage) and of substituting good ones in their places (replacement coverage) must be high.

In this work we have assumed arbitrary interconnections. If interconnections are restricted to special cases such as nearest-neighbor or tree connections between the logic modules, yield improvements will be higher. However, more research is needed in design and reconfiguration techniques in order to exploit the full potential of this approach for both higher yielding and more reliable VLSI circuits.

LIST OF SYMBOLS

ag: average area of individual blocks comprising the modules,
Ao: area of chip without redundancy,
AR: area of the chip with redundancy added,
Am: area of individual module on the chip, ag·ng,
Aw: area of the intermodule interconnections,
b: Rent's exponent,
CN,R: bypass coverage,
Do: average random defect density,
FM: figure of merit = (YR/Yo) × (Ao/AR),
Iy: yield improvement, YR/Yo,


k: number of defects,
L': average intermodule interconnection length (in units of module-to-module distance),
Lh: total length of interconnect running in horizontal direction,
Lv: total length of interconnect running in vertical direction,
λ: linewidth, minimum dimension used in a technology,
N: number of modules on the chip,
n: total number of blocks (i.e., gate equivalents) on the chip,
ng: number of blocks in a module,
P: pitch, separation between the centers of interconnect wires,
Pk,N,R: probability of distributing k defects into R out of N modules,
r: ratio of intermodule interconnection area to total active chip area with redundancy, Aw/AR,
R: number of redundant modules,
T: average number of connections required per module,
w: average number of interconnect wires in the intermodule interconnect channel,
y: average number of external connections required per block (gate) of module,
Yo: yield without redundancy (i.e., R = 0),
YR: total yield with redundancy.

REFERENCES

[1] J. Bernard, "The IC yield problem: A tentative analysis for MOS/SOS circuits," IEEE Trans. Electron Devices, vol. ED-25, pp. 939-944, Aug. 1978.

[2] R. P. Cenker et al., "A fault-tolerant 64K dynamic RAM," in Dig. ISSCC, vol. 22, Feb. 1979, pp. 150-151.

[3] R. R. DeSimone et al., "FET RAM's," in Dig. ISSCC, vol. 22, Feb. 1979, pp. 154-155.

[4] W. E. Donath, "Placement and average interconnection length of computer logic," IEEE Trans. Circuits Syst., vol. CAS-27, pp. 272-277, Apr. 1979.

[5] —, "Equivalence of memory to random logic," IBM J. Res. Develop., pp. 401-407, Sept. 1974.

[6] B. R. Elmer, "Fault tolerant 92160 bit multiphase CCD memory," in Dig. ISSCC, vol. 20, Feb. 1977, pp. 116-117.

[7] B. F. Fitzgerald and E. P. Thoma, "Circuit implementation of fusible redundant addresses on RAMs for productivity enhancement," IBM J. Res. Develop., vol. 24, pp. 291-298, May 1980.

[8] D. Frohman-Bentchkowsky, "A fully decoded electrically programmable MOS-ROM," in Dig. ISSCC, vol. 14, Feb. 1971, pp. 80-81.

[9] W. R. Heller, "Prediction of wiring space requirements for LSI," in Proc. 17th Des. Automat. Conf., 1979, pp. 2-42.

[10] L. Kuhn, "Experimental study of laser formed connections for LSI wafer personalization," IEEE J. Solid-State Circuits, vol. SC-10, pp. 219-228, Aug. 1975.

[11] T. E. Mangir and A. Avizienis, "Effect of interconnect requirements on VLSI circuit yield improvement by means of redundancy," in Proc. IEEE COMPCON 1981, Feb. 1981, pp. 322-326.

[12] T. E. Mangir, "Use of on-chip redundancy for fault-tolerant VLSI design," Ph.D. dissertation, Univ. of California, Los Angeles, 1981; and Dep. Comput. Sci., Univ. of California, Los Angeles, Tech. Rep. CSD 820201, Dec. 1981.

[13] T. Mano et al., "A fault tolerant 256K RAM fabricated with molybdenum-polysilicon technology," IEEE J. Solid-State Circuits, vol. SC-15, pp. 865-872, Oct. 1980.

[14] Y. W. Ng and A. Avizienis, "A unified reliability model for fault-tolerant computers," IEEE Trans. Comput., vol. C-29, pp. 1002-1011, Nov. 1980.

[15] J. E. Price, "A new look at yield of integrated circuits," Proc. IEEE, pp. 1290-1291, Aug. 1970.

[16] S. E. Schuster, "Multiple word/bit line redundancy for semiconductor memories," IEEE J. Solid-State Circuits, vol. SC-13, pp. 689-703, Oct. 1978.

[17] C. H. Stapper, "Yield modelling and process monitoring," IBM J. Res. Develop., vol. 20, pp. 228-234, 1976.

[18] C. H. Stapper et al., "Yield model for productivity optimization of product," IBM J. Res. Develop., vol. 23, pp. 398-409, May 1980.

[19] J. T. Wallmark, "A statistical model for determining the minimum size in integrated circuits," IEEE Trans. Electron Devices, vol. ED-26, pp. 135-142, Feb. 1979.

[20] A. Avizienis and C. Tung, "A universal arithmetic building element (ABE) and design methods for arithmetic processors," IEEE Trans. Comput., vol. C-19, pp. 733-745, Aug. 1970.

Tulin Erdim Mangir (S'71-M'80) was born in Turkey on May 3, 1950. She received the B.S. degree in electrical engineering from the University of California, Los Angeles (UCLA), in 1971, the M.S. degree in electrical engineering (semiconductor device physics) from the University of Southern California, Los Angeles (USC), in 1974, the M.S. degree in electrical engineering (computer architecture) from USC in 1975, and the Ph.D. degree in engineering from UCLA in 1981.

From 1971 to 1975 she was associated with the USC School of Engineering, working on numerical analysis, device physics, and fault tolerance. From 1975 to 1976 she was with the Burroughs Corporation, working with the Intel 8080 MDS for line control applications. From 1976 to 1978 she was with the UCLA School of Engineering and Applied Sciences, teaching computer science and electrical engineering. From 1978 to 1982 she was with the Xerox Corporation, doing research in short channel effects in MOS, VLSI chip design, and VLSI systems architecture. Currently, she is a Consultant. Her current research interests are fault-tolerant VLSI, reconfigurable systems, signal and voice processing, VLSI designs for testability, semiconductor device physics, and distributed computing.

Dr. Mangir is a member of SWE.

Algirdas Avizienis (S'55-M'56-F'73) was born in Kaunas, Lithuania, in 1932. He received the B.S., M.S., and Ph.D. degrees, all in electrical engineering, from the University of Illinois, Urbana-Champaign, in 1954, 1955, and 1960, respectively.

From 1956 to 1960 he was associated with the Digital Computer Laboratory at the University of Illinois as a Research Assistant and Fellow, participating in the design of the Illiac II computer. In 1960 he joined the Spacecraft Computers Section, Jet Propulsion Laboratory (JPL), California Institute of Technology, and initiated research on reliability of computing and fault tolerance. He organized and directed the JPL-STAR Experimental Computer Research Project from 1964 to 1972. He joined the Faculty of the University of California, Los Angeles (UCLA) in 1962. Currently, he is a Professor in the Department of Computer Science, UCLA, where since 1972 he has been the Principal Investigator of the "Reliable Computing and Fault-Tolerance" Research Project. He teaches courses in computer system architecture, computer arithmetic, and fault tolerance, and is the author or coauthor of over 100 publications in these fields of study. He has served as a Consultant in studies of computer systems design and fault tolerance sponsored by the US Air Force, US Navy, NASA, and the National Bureau of Standards. He has also served on a number of study groups and panels, including the Hardware Systems Committee of the NSF COSERS Study, and a three-year term as a member of the Advisory Panel on Computer Science and Engineering for the National Science Foundation, Office of Computing Activities.

Dr. Avizienis has received numerous awards, among others the AIAA Information Systems Award, the Honor Roll of the IEEE Computer Society, the NASA Apollo Achievement Award, Best Paper selection from the 1971


IEEE TRANSACTIONS ON COMPUTERS, and the NASA Exceptional Service Medal. In the IEEE Computer Society, he founded and was the first Chairman of the Technical Committee on Fault-Tolerant Computing (1969-1973), and was the organizer and General Chairman of the First International Symposium on Fault-Tolerant Computing in 1971. He also served for four years (1971-1974) as a member of the Governing Board of the IEEE Computer Society. In international activities, he is a member of the Working Group 10.3 of the IFIP TC-10 on "Digital System Design." He has lectured and conducted joint research at the National Polytechnic Institute of Mexico, the University of Sao Paulo, Brazil, and the Laboratoire d'Automatique et d'Analyse des Systemes in Toulouse, France. In 1974 he spent a five-month research visit, sponsored by the U.S. National Academy of Science, at the Institute of Mathematics and Cybernetics, Lithuanian Academy of Sciences, Vilnius, Lithuania.

Formal Specification and Mechanical Verification of SIFT: A Fault-Tolerant Flight Control System

P. MICHAEL MELLIAR-SMITH AND RICHARD L. SCHWARTZ

Abstract-This paper describes the formal specification and proof methodology employed to demonstrate that the SIFT computer meets its requirements. The hierarchy of design specifications is shown, from very abstract descriptions of system function down to the implementation. The most abstract design specifications are simple and easy to understand, almost all details of the realization having been abstracted out, and can be used to ensure that the system functions reliably and as intended. A succession of lower level specifications refine these specifications into more detailed and more complex views of the system design, culminating in the Pascal implementation. The paper describes the rigorous mechanical proof that the abstract specifications are satisfied by the actual implementation.

Index Terms-Distributed systems, fault tolerance, reliability, SIFT, specification, verification.

I. INTRODUCTION

THIS paper describes the methodology being employed to demonstrate rigorously that the SIFT fault-tolerant computer meets its requirements. The methodology employs a hierarchy of design specifications, expressed in the mathematical domain of multisorted first-order predicate calculus. The most abstract of these, from which almost all details of mechanization have been removed, represents the requirements on the system for reliability and intended functionality. Successive specifications in the hierarchy add design and implementation detail until the Pascal programs implementing the SIFT executive are reached. At each stage a formal mathematical proof is constructed to show that the more detailed specification, appropriately mapped, exhibits precisely the properties of the more abstract specification above it in the hierarchy. The formal proof, that a SIFT system in a "safe" state operates correctly despite the presence of arbitrary faults, has been completed all the way from the most abstract specification to the Pascal program. This proof has been performed using STP, a new specification and verification system designed and implemented at SRI International. Except where explicitly indicated, every specification exhibited in this paper has been mechanically proven consistent with the requirements on SIFT and with the Pascal implementation. The proof remains to be completed that the SIFT executive performs an appropriate, safe, and timely reconfiguration in the presence of faults.

A SIFT system, designed to meet the required ultrahigh reliability by processor replication and voting, has been constructed by the Bendix Corporation and is now operating at SRI. It will be delivered shortly to NASA Langley for evaluation in the AirLab. Rather than providing a general introduction to the SIFT system and the algorithms used to achieve the desired fault tolerance, we explore the process of refining the high-level specifications of SIFT down to the implementation level. A general introduction to SIFT can be found in [1] and [2], and a description of the SIFT executive appears in [3]. The SIFT hardware is documented in [4] and [5]. The fault tolerance algorithms employed are defined in [2] and [6].

Manuscript received September 17, 1981; revised January 7, 1982. This work was supported by the NASA Langley Research Center under Contract NAS1-15428.

The authors are with the Computer Science Laboratory, SRI International, Menlo Park, CA 94025.

Sections II and IV of the paper present a brief introduction to the requirements of SIFT and the mechanisms employed to cope with the reliability requirements. Section III discusses how formal proof is used to substantiate the reliability claims. Section V outlines the specification hierarchy. After briefly describing in Section VI the mechanical specification and verification system employed in the effort, Sections VII through X describe each of the specifications in detail. Finally, in Section XI conclusions and directions for further research are discussed.

II. THE REQUIREMENTS FOR SIFT

The SIFT computer system has been designed to meet the requirements for future passenger aircraft control. Such air-

0018-9340/82/0700-0616$00.75 © 1982 IEEE
