
Fault detection and correction in array computers for image processing


W R Moore, B.Sc, Ph.D., C. Eng., M.I.E.E.

Indexing terms: Computer applications, Fault location, Image processing

Abstract: The paper addresses the problems of detecting and correcting faults that may occur in arrays of processors used for image processing. The variety of useful hardware and software solutions is reviewed. It is shown that faults can be corrected efficiently by bypassing the faulty column of the array, and a novel technique is described which detects processor faults with a very modest increase in circuitry. The addition of a parity check on the memory is sufficient to give an effective and efficient detection and correction of all permanent and many transient faults. Additionally, the use of a full parity processor increases the proportion of transient faults detected.

1 Introduction

An effective architecture for high-speed image processing seems to be an array of single-bit processors, each having its own memory and data connections to its nearest neighbours. Each processor corresponds to one or more elements of the image data (or pixels). Examples of this kind of architecture are the CLIP [1, 2], the DAP [1, 3, 4], the STARAN [5], the MPP [6] and the GRID [7]. One particular interest in these architectures is their suitability for manufacture in VLSI technology, which is currently capable of producing at least 32 processors on a single chip [7]. Small memories (perhaps up to 100 bits per processor) may also be manufactured on the chip, but larger memories are currently placed on separate chips.

Such arrays are relatively large items of electronic equipment and are therefore certain to develop faults from time to time. The array will represent a substantial capital investment and, by its nature, it is likely to be in almost continuous use. There will therefore be a premium on detecting any faults that do occur and on providing automatic correction. On the other hand, it is possible that a short break in correct processing would be acceptable, particularly if the presence of a fault has been announced.

This paper therefore reviews the available fault detection and correction techniques with the following aims in mind:

(a) All faults should be detected. If possible they should be detected within the time taken to process a few frames of the data.

(b) The majority of faults should be corrected automatically. This probably need not be instantaneous but should be fast, perhaps within the time taken to process a few tens or hundreds of frames of data. Similarly, it is unlikely to be necessary to retain the data that are within the array at the time a fault occurs.

(c) Multiple faults need not generally be considered. This assumes that external maintenance is available, so that the computer is restored to its original state before another fault occurs. Diagnostic aids which assist such maintenance would be valuable.

2 Relevant characteristics of array computers and image-processing algorithms

Array computers and image-processing algorithms have certain features which strongly influence the sort of fault detection and correction techniques that are applicable.

Paper 2104E, first received 15th December 1981 and in revised form 18th June 1982
The author is a lecturer with the Electronics Department, Southampton University and is a consultant to the General Electric Company plc, Hirst Research Centre, East Lane, Wembley, Middlesex HA9 7PP, England

IEE PROC, Vol. 129, Pt. E, No. 6, NOVEMBER 1982

2.1 Trade between time and hardware

If a problem exactly fits an array of a certain size, then it can generally be solved by an array of half this size in something over twice the time. One approach would be for each processor to represent two adjacent pixels in alternating time periods. For the serial nearest neighbour access of the DAP, the MPP and the GRID this would solve the problem in close to twice the original time but would require extra memory and therefore involve somewhat more than half the hardware. The parallel input gating of the CLIP does not adapt so easily to this mode of operation, and an additional time penalty would be incurred when performing relevant operations on it. An alternative approach would be to divide the problem into two halves and process these separately. This could get closer to halving the hardware but would generally involve a considerable time penalty in co-ordinating the two solutions.

This relationship between execution time and the hardware introduces a measure of flexibility into the fault-detection and -correction techniques that can be used by permitting a relatively simple trade to be made between the use of hardware redundancy and the use of time redundancy.

2.2 Large numbers of identical processors

The large number of identical processors in the array suggests the use of parity bits or other forms of coding across the spatially distributed data. Coding techniques generally offer an efficient use of redundancy (e.g. in comparison with complete duplication of the hardware). The use of identical processors also limits the amount of hardware which needs to be held as spares in many fault-correction schemes.

2.3 Interconnection constraints

Owing to the distribution of array computers, there are severe constraints on the additional interconnection paths that can be added for fault detection or correction. For example, it is unlikely to be economic to arrange for one spare processing element to be capable of being switched to any of the locations in the array.

2.4 Data propagation

Some image-processing algorithms propagate data through the array, so that single-point faults can rapidly lead to multiple data errors. This characteristic is a hazard to those fault-detection schemes which rely on single-point failure modes, but on the other hand may lead to such unlikely results that the fault is self-evident.

2.5 Correlation of data

Many electronic faults consist of permanent open- or short-circuit conditions which may result in wild changes to the data held in the array. These may often be detected by their contrast with the normal high degree of spatial correlation of data across the image. Such faults may also be corrected easily, e.g. input noise may be cleaned up by the use of median filtering [8].

0143-7062/82/060229 + 06 $01.50/0

3 Fault-detection and -correction techniques

Fault-detection and -correction techniques may be roughly divided into software or hardware approaches (or a combination of the two). Software solutions will certainly slow the computation down, and so require time redundancy, but may also involve some hardware redundancy. This may arise from a need for extra control facilities or perhaps from a reduction in the resolution of the output requiring the use of a somewhat larger array to achieve equivalent results. Similarly, hardware solutions will certainly require hardware redundancy but may also slow the computation down and therefore require some time redundancy. Fortuitously, as noted in Section 2.1, array computers permit some trading between time and hardware redundancy, and this makes the optimum mixture of software and hardware techniques easier to evaluate.

A wide variety of fault-detection and -correction techniques exist (see, for example, References 9 and 10) but the discussion of Sections 1 and 2 enables the field to be narrowed substantially. The modest speed required for fault correction implies that the correction techniques ought to employ largely time redundancy and minimise the hardware redundancy. Fault-masking schemes, such as triplication followed by majority voting, do not therefore appear sensible. In contrast, the aim of fast fault detection, along with the normally fast data processing, means that it may well be justifiable to use a hardware fault-detection scheme, even to the extreme of duplicating the hardware.

Since the array is composed of identical processors, the minimum hardware redundancy for fault correction would be to hold just one spare processor. This would create an enormous connectivity problem, however, and the smallest replaceable unit that appears to be realistic for automatic correction is a column (or a row) of the array.

If several processors are packaged in one integrated circuit, it may be more appropriate to consider a column of integrated circuits as the smallest replaceable unit, and this will contain a block of columns of processors. Connectivity considerations also point to such replacement being mechanised by having a redundant column in the array and arranging that any one column containing a faulty processor can be bypassed to create a working computer (as in the MPP, Reference 6). The mechanism is described in more detail in Section 4.

Note that the smallest replaceable unit for any subsequent maintenance may well be different and is more likely to be a circuit board or an integrated circuit package.

3.1 Hardware duplication

Perhaps the simplest concept in fault detection is to duplicate the hardware and compare the results from the two identical halves. This comparison could be performed between individual processors or solely between outputs of the arrays.

A comparison of the outputs of the two identical arrays would detect errors within the time taken to process one frame of the image and would require minimal interconnections. Alternatively, the interprocessor comparison would also pinpoint the location of a fault and could therefore trigger a column replacement mechanism. The detection could be done continuously by additional hardware comparators, or else by a software comparison. With a software approach, the more frequently the comparison is performed, the larger will be the time redundancy. In either case, the comparison will require additional interconnection paths between the two identical arrays.

This duplication of the hardware represents a considerable investment but in return yields a simple solution detecting all faults quickly. The investment might also be offset by incorporating software control, so that the overall system can, on occasion, be used as a double-sized array when it is known to be fault free.

A compromise solution would be to duplicate just one column, or block of columns, of the array at any given instant of time. Software or hardware control could rotate the pair of columns acting as duplicates at the end of each frame of the processing, so that the whole array is checked after a complete rotation. Obviously, faults would not then be detected immediately but the hardware cost would be considerably reduced. Furthermore, this solution fits well with a column replacement mechanism as shown in Section 5.
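The rotating duplicate can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the per-column operation (bit inversion), the stuck-at-0 fault model and the names `process_column` and `frames_until_detection` are hypothetical and are not taken from the paper. It also assumes that the spare column itself is fault free.

```python
def process_column(column, stuck_row=None):
    """Toy per-column operation (assumption): invert every bit.

    If stuck_row is given, the output of that row is stuck at 0.
    """
    out = [1 - bit for bit in column]
    if stuck_row is not None:
        out[stuck_row] = 0
    return out


def frames_until_detection(frames, faulty_col, stuck_row):
    """Return the index of the first frame in which the rotating
    duplicate disagrees with the monitored column, or None.

    frames is a list of frames; each frame is a list of columns and
    each column is a list of bits. One column is monitored per frame,
    stepping across the array by one column after every frame.
    """
    n_cols = len(frames[0])
    for t, frame in enumerate(frames):
        monitored = t % n_cols
        fault = stuck_row if monitored == faulty_col else None
        working = process_column(frame[monitored], fault)
        duplicate = process_column(frame[monitored])  # spare assumed good
        if working != duplicate:
            return t
    return None
```

With four columns, a fault in column 2 is first examined in frame 2. If the data in that frame happen to make the faulty output equal to the correct one, the fault is not seen until the next rotation (frame 6 in this toy setting), which is the price of checking only one column at a time.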

3.2 Software duplication

If the time is available, software duplication might be used instead of hardware duplication for fault detection. The duplication could exist in the form of retrying the problem, but with a vertical shift of the data to avoid repeating identical errors. Alternatively, the software could split the array into two halves, each solving the whole problem. In both cases, the solution rate will be at least halved and some modest addition to the hardware logic may be required.

The time redundancy could be reduced by duplicating only the occasional frame of data on the assumption that faults need not be detected immediately. Naturally, this would not detect transient faults.
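A minimal sketch of the retry-with-shift idea follows. The array operation (each processor ORs its pixel with its western neighbour, with cyclic edges), the fault model (one processor whose output is forced to 0) and all function names are assumptions made for illustration only.

```python
def rotate_rows(image, k):
    """Cyclically shift the rows of image down by k (negative k shifts up)."""
    n = len(image)
    return [image[(r - k) % n] for r in range(n)]


def array_op(image, fault=None):
    """Toy array operation (assumption): OR each pixel with its western
    neighbour, cyclically. fault=(row, col) forces that processor's
    output to 0, modelling a permanent fault at a fixed position.
    """
    rows, cols = len(image), len(image[0])
    out = [[image[r][c] | image[r][(c - 1) % cols] for c in range(cols)]
           for r in range(rows)]
    if fault is not None:
        fr, fc = fault
        out[fr][fc] = 0
    return out


def retry_with_shift_detects(image, fault=None, shift=1):
    """Run the operation twice, the second time on vertically shifted
    data, undo the shift and report whether the two results differ.
    """
    first = array_op(image, fault)
    second = rotate_rows(array_op(rotate_rows(image, shift), fault), -shift)
    return first != second
```

Because the faulty processor handles a different pixel on the second pass, the two results disagree unless the fault happens to be masked by the data on both passes (for example, an all-zero image with an output forced to 0).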

3.3 Self-detection circuits

A number of techniques have been devised for designing circuits which announce their own faults. Of these, two actually use less than twice the hardware required for a fault-free circuit.

Two-rail logic [11] takes both true and complement values of every input and produces true and complement outputs. A fault in the circuit destroys the complementary relationship of one or more output pairs. The total logic is rather less than twice the original, because of the elimination of inverters (the delay time will be similarly reduced). The method compares favourably with duplication of every processor but is clearly susceptible to multiple faults on a single chip, which need not be the case with the other methods discussed. It also lacks the adaptability of duplication.

Alternating logic [12] employs self-dual circuits which produce a complementary output when presented with complementary inputs. True and complement inputs are presented in alternate time slots, and faults are revealed when the corresponding outputs are not complementary. Again, the overall logic is less than twice that of the original, but the computation time is doubled and this appears to rule it out as a useful technique in array computers.
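The alternating-logic check can be sketched in a few lines. The example below uses the three-input majority function, which is self-dual; the choice of circuit, the stuck-at-1 fault model and the function names are illustrative assumptions rather than details from Reference 12.

```python
def majority(a, b, c):
    """Three-input majority: a self-dual function, since complementing
    all inputs complements the output."""
    return (a & b) | (b & c) | (a & c)


def stuck_at_one(_a, _b, _c):
    """Hypothetical faulty version of the circuit: output stuck at 1."""
    return 1


def alternating_check(circuit, inputs):
    """Evaluate the circuit on the true inputs and then, in a second
    time slot, on the complemented inputs.

    Returns (value, fault_flag); the flag is raised when the two
    outputs are not complementary.
    """
    true_out = circuit(*inputs)
    comp_out = circuit(*(1 - bit for bit in inputs))
    return true_out, true_out == comp_out
```

The fault-free majority circuit never raises the flag, whereas the stuck-at-1 version raises it for every input, at the cost of two evaluations per result. This doubling is the time penalty noted above.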

3.4 Coding techniques

The presence of parallel data within the array strongly suggests the use of coding techniques which generally provide a much more efficient use of hardware than duplication. A parity bit is sufficient to detect all single errors and, if a parity bit is added to each column of the array, then the fault will also be diagnosed with sufficient precision to trigger the column replacement mechanism.

The parity bit is easily generated and is particularly useful for checking the memory. It is, to a large extent, possible to check all data paths and routing within the processors and the array by incorporating parity processors, as in Reference 4. Essentially, each column of the array has a parity processor; cyclic routing of data horizontally across the array produces corresponding shifts in parity across the columns and cyclic shifts vertically preserve the parity of each column.

The parity-processor checking of Reference 4 is suppressed for noncyclic routing of data, but additional logic could be provided to cope with this case. In a VLSI environment, a parity processor would probably be fabricated on every integrated circuit chip, along with several normal processors. Note that it is essential for horizontal routing that each parity processor confines its operation to one column of processors, which will probably not be the block of processors on that particular chip. This would have to be accomplished by external wiring.

Other more complicated codes could be used to detect multiple faults and to correct faults, but this does not appear to be necessary.

Parity is not preserved during logical operations of the processing element. In theory, any logical circuit may be coded with similar effect [13, 14] but in practice the resulting parity circuit would be extremely complicated for two reasons. First, the code circuit for a sequential circuit requires either access to all the internal states of the original sequential circuit, or else to have equivalent states of its own. Secondly, the simple structure of the original circuit with a number of largely noninteracting processors would not be reflected in the parity processors. It therefore seems that some logical faults cannotbe covered by a parity detection approach.
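The behaviour of column parity under routing can be checked with a short sketch. A bit plane is modelled here as a list of rows; this representation and the function names are assumptions made for illustration and do not describe the parity processors of Reference 4.

```python
from functools import reduce
from operator import xor


def column_parities(plane):
    """Return one parity bit per column (XOR of the bits in that column)."""
    return [reduce(xor, col) for col in zip(*plane)]


def shift_vertical(plane, k=1):
    """Cyclically shift every column down by k rows."""
    n = len(plane)
    return [plane[(r - k) % n] for r in range(n)]


def shift_horizontal(plane, k=1):
    """Cyclically shift every row east by k columns."""
    n = len(plane[0])
    return [[row[(c - k) % n] for c in range(n)] for row in plane]


def rotate(vec, k=1):
    """Cyclically shift a vector of parity bits by k positions."""
    n = len(vec)
    return [vec[(i - k) % n] for i in range(n)]


plane = [[1, 0, 1],
         [0, 0, 1],
         [1, 1, 1]]
parity = column_parities(plane)

# A vertical cyclic shift leaves every column parity unchanged.
vertical_ok = column_parities(shift_vertical(plane)) == parity

# A horizontal cyclic shift moves the parity bits by the same amount.
horizontal_ok = column_parities(shift_horizontal(plane)) == rotate(parity)

# A single-bit error changes the parity of exactly one column,
# which identifies the column to be bypassed.
corrupted = [row[:] for row in plane]
corrupted[1][2] ^= 1
bad_columns = [c for c, (p, q) in
               enumerate(zip(parity, column_parities(corrupted))) if p != q]
```

By contrast, the parity of a logical result cannot be predicted from the parities of the operands alone. For the one-column planes `[[1], [1]]` and `[[1], [0]]`, the operand parities are 0 and 1, but the bitwise AND has parity 1, so no function of the two parity bits alone will track every logical operation.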

3.5 Testing

A variety of schemes could be devised to test an array computer under software control. It may be tested by the host computer, or test patterns may be permanently stored in the array to be exercised and compared with known good patterns. Alternatively, software control could make each processor test its neighbour.

A complete test of the processors and their memory is likely to be slow so that this approach is more likely to be useful in controlling the fault correction mechanism when the detection technique does not diagnose the fault precisely enough.

It is, however, possible that testing could be used as a primary detection technique if time permits. The time redundancy might be minimised at the expense of a delay in detecting faults, by performing an elementary check after every frame or every few frames of data. This might consist of a simple functional check of the processing elements and their routing, together with a rotating check on the memory over some longer time cycle. Certainly, transient faults would not be detected and memory faults could remain dormant for a considerable time.

3.6 Fault-tolerant algorithms

An alternative approach is to mask faults by developing image-processing algorithms which allow for erroneous data transmitted by single processing elements. No work on this is known, but References 8 and 15 give sufficient details of standard (fault-free) algorithms to enable some broad generalisation to be made.

3.6.1 Grey-scale algorithms: In images which are represented as grey-scale data, there will normally be a high correlation between adjacent pixels, and this fact may be used to eliminate many errors produced by computer faults. Two algorithms which are commonly used to reduce noise [8] are median filtering, which selects the median value from a group of adjacent pixels, and out-of-range filtering, which discards those values which are different from their neighbours by more than a specified threshold. The latter algorithm might also be used as prefiltering to other operations to reject data passed from faulty processors.
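Both filters are easy to sketch for a small grey-scale image. The version below is a minimal illustration under stated assumptions: the neighbourhood is the four nearest neighbours with cyclic edges, the median is taken over the pixel and those four neighbours, and a pixel counts as out of range only if it differs from every one of its four neighbours by more than the threshold. None of these choices, nor the function names, come from Reference 8.

```python
from statistics import median


def neighbours4(image, r, c):
    """Values of the N, S, W and E neighbours of pixel (r, c), cyclic edges."""
    rows, cols = len(image), len(image[0])
    return [image[(r - 1) % rows][c], image[(r + 1) % rows][c],
            image[r][(c - 1) % cols], image[r][(c + 1) % cols]]


def median_filter(image):
    """Replace each pixel by the median of itself and its four neighbours."""
    return [[median(neighbours4(image, r, c) + [image[r][c]])
             for c in range(len(image[0]))]
            for r in range(len(image))]


def out_of_range_mask(image, threshold):
    """Flag pixels that differ from all four neighbours by more than
    threshold; flagged values could be discarded or reported as a
    possible fault."""
    return [[all(abs(image[r][c] - v) > threshold
                 for v in neighbours4(image, r, c))
             for c in range(len(image[0]))]
            for r in range(len(image))]
```

For a uniform image with one wild value, the median filter restores the surrounding level and the out-of-range mask flags only the wild pixel, because each of its neighbours still agrees with its own remaining neighbours.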

It is not clear from published research what the computational advantages are of operating on smaller or larger areas of the image (e.g. four nearest neighbours, or eight more), although the larger the area, the more inherently fault-tolerant is the result. Naturally, time will be taken to perform the extra filtering and the resolution of the final image will be reduced, so that the fault-tolerant algorithm will use both time and hardware redundancy. Since the fault-tolerant algorithm would mask the fault, no additional fault correction mechanism is required. Faults do need to be detected in the end, however, in order to carry out appropriate maintenance. Out-of-range filtering could provide fault detection information but otherwise a periodic fault detection scheme appears to be necessary. Since it would not be needed very frequently it should use, primarily, time redundancy.

3.6.2 Binary algorithms: On the surface, it seems that it is more difficult to make binary algorithms fault tolerant because adjacent pixels are essentially uncorrelated and because expand and shrink functions can also propagate errors throughout the array. The other side to this is that such algorithms are fast and require little storage. It may therefore be reasonably efficient to use software control to duplicate the binary algorithms in the image manipulation.

It appears then that, although research on fault-tolerant algorithms would be very helpful, it is unlikely that the results would be competitive with the more efficient of the options described in the preceding Sections.

4 Mechanism for fault correction by column bypassing

It appears from Section 3 that faults can be corrected effectively by bypassing the faulty column. This is certainly an efficient solution, inasmuch as it requires only a small proportion of redundant processing elements. Specific bypassing mechanisms are described below, in order to reveal what further redundancy is necessary to perform the required rerouting of data.

The basic mechanism is illustrated in Fig. 1 for an array with square connectivity (each processor having access to four neighbours N, E, S, W). The array is assumed to be packaged in integrated circuits which each contain a block of processors of (say) x rows and y columns. The column bypassing mechanism requires an extra column of integrated circuits (y columns of processors) and additional bypassing paths in the E-W direction. The N-S connections within each column are unchanged. Control signals to each block column tell it whether to exchange data with its nearest block neighbours (E and W) or with one of its next nearest block neighbours (EE or WW). It is expected that pins on the integrated circuits will be at a premium and that these control signals will therefore be input serially on an existing control pin. For the same reason, the options below are compared in terms of their pin requirements as well as their logic circuit requirements.

Precise details of the bypassing circuit depend on how the data are routed through the array and three variations are considered below.
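The control side of column bypassing can be sketched as follows. The example assumes a non-cyclic row of block columns numbered from west to east, one of which (the faulty one, or the spare when there is no fault) is bypassed. The function names and the extra label "bypassed" for the skipped column are hypothetical; only the states normal, bypass east and bypass west correspond to the control register described in this Section.

```python
def control_states(n_physical, bypassed):
    """Control state for each physical block column, west to east."""
    states = []
    for p in range(n_physical):
        if p == bypassed:
            states.append("bypassed")      # label is an assumption
        elif p + 1 == bypassed:
            states.append("bypass east")   # exchange data with EE
        elif p - 1 == bypassed:
            states.append("bypass west")   # exchange data with WW
        else:
            states.append("normal")
    return states


def east_neighbour(p, bypassed, n_physical):
    """Physical column that column p sends its eastward data to,
    or None at the eastern edge of a non-cyclic array."""
    q = p + 1
    if q == bypassed:
        q += 1                             # use the EE link instead of E
    return q if q < n_physical else None


def logical_to_physical(n_physical, bypassed):
    """Physical column index used for each logical column of the image."""
    return [p for p in range(n_physical) if p != bypassed]
```

With five physical block columns and column 2 bypassed, the four logical columns map to physical columns 0, 1, 3 and 4, and only the two columns adjacent to the bypassed one leave the normal state.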

4.1 Square-connected array with parallel links to four nearest neighbours

In this case (as in the DAP, Reference 1), each processor has a dedicated output path transmitting in four directions (N, E, S, W). Fig. 2 shows, by dotted lines, the extra circuits required for column bypassing. Note that N and S connections have been omitted for clarity. Separate W and WW outputs are provided to guard against short-circuit faults in one of those block columns. In a standard array, the corner processors on the chip would have used a common pin for transmissions in two directions, but this is ruled out for the same reason. The total requirements are therefore:

Extra logic per chip
  2x multiplexers
  2x + 4 output drivers
  A 3-state control register (normal; bypass east; bypass west)

Extra pins per chip
  2x inputs
  2x + 4 outputs

4.2 Octagonally connected array with parallel links to eight nearest neighbours

In this case (as in the CLIP, Reference 1), each processor has a dedicated output path transmitting in eight directions (N, NE, E, SE, S, SW, W, NW). This increases the flexibility of the processor and only requires an extra four input pins into each corner processor on the chip. The additional circuitry for column bypassing is similar to that for Section 4.1 preceding, but now six out of eight directions are involved in bypassing. Again, separate outputs are required for data going to different columns, in order to guard against short-circuit faults.

Extra logic per chip
  2x + 4 multiplexers
  2x + 4 output drivers
  A 3-state control register

Extra pins per chip
  2x + 4 inputs
  2x + 4 outputs

Fig. 1 Basic column bypassing mechanism
a Standard array (each circle represents one integrated circuit, so this array has 4x by 3y processing elements)
b Fault-tolerant array (dotted lines illustrate the extra circuitry required for column bypassing)

Fig. 2 Bypassing circuitry for square-connected array with parallel links to four nearest neighbours

Fig. 3 Bypassing circuitry for square-connected array with serial links to four nearest neighbours

4.3 Square-connected array with multiplexed serial links to four nearest neighbours

In this case, each processor has a multiplexed bidirectional link with its four nearest neighbours. The MPP [6] and the GRID [7] use this approach. Each processing element can then only access one neighbouring processor at a time, and although this will slow down certain image-processing algorithms, it does not appear to be a severe disadvantage. It does halve the number of pins required for the data routing in comparison with the case of Section 4.2 above (a saving of 2x + 2y pins per chip). In the GRID, four multiplexers between the links of each processor allow direct access to all eight nearest neighbours over this network. Fig. 3 shows by dotted lines the extra circuits required for column bypassing. These additional requirements are therefore:

Extra logic per chip
  2x multiplexers
  2x output drivers
  A 3-state control register

Extra pins per chip
  2x input/output pins

In some array computers [6], a serial input/output bit plane is used in parallel with the processing array. This is best organised with a N→S routing of data, so that simple 2-way switches at the northern input and southern output are sufficient to allow for the column bypass. The output switches should be set to ignore the bypassed column for the frame of data being output, and the input switches set to ignore the column which will be bypassed when the new data are operated on (not necessarily the same column). It is clear that an E→W routing of this bit plane would involve considerably more extra circuitry.

Other array computers [3, 7] use N-S and E-W highways for data input and output. These can be dealt with in a similar manner, although the difficulties associated with the E-W routing are avoided if the probability of a faulty chip affecting the highway is remote.

The logic requirements for column bypassing are clearly modest in comparison with the expected complexity of the standard array, but the additional pin requirements per chip may present more difficulties. It is seen, however, that the additional pins are fewer than the saving which can be made by using a serial transmission between processors in place of a parallel approach.
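The pin comparison can be made concrete with a small calculation based on the per-chip figures listed in Sections 4.1 to 4.3. The function names and the block size used in the example (x = 4 rows by y = 8 columns, i.e. 32 processors per chip) are illustrative assumptions.

```python
def extra_pins(x):
    """Extra pins per chip for column bypassing, for a block of x rows,
    using the totals listed in Sections 4.1, 4.2 and 4.3."""
    return {
        "parallel, four neighbours": 2 * x + (2 * x + 4),
        "parallel, eight neighbours": (2 * x + 4) + (2 * x + 4),
        "serial, four neighbours": 2 * x,
    }


def serial_saving(x, y):
    """Pins per chip saved by multiplexed serial links, as quoted in
    Section 4.3 (2x + 2y)."""
    return 2 * x + 2 * y


pins = extra_pins(4)
saving = serial_saving(4, 8)
```

For this block size the serial scheme needs 8 extra pins for bypassing against a quoted saving of 24, so the serial array with bypassing still uses fewer pins than the parallel arrangement it is compared with. In general 2x is always less than 2x + 2y.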

5 Use of bypassed column to duplicate and monitor an adjacent column

The redundancy requirements of the fault-detection schemes reviewed in Section 3 are mainly self-explanatory. However, the hardware approach of duplicating just a proportion of the array at any given instant of time is believed to be a novel solution and its redundancy requirement is not obvious. For this reason, more details of this solution are given below.

Given the circuitry required in Section 4 for bypassing a faulty column of the array, very modest additions are sufficient to enable the bypassed column to duplicate the processing of a neighbouring column. Faults can then be detected by comparing the data produced by these two duplicates. It is necessary to compare the duplicate data routed out to the east and to the west (and NE, NW, SE, SW in the case of an octagonally connected array). This comparison can be performed conveniently in the two columns adjacent to the duplicate pair, because they already have access to the necessary data. Fig. 4 illustrates the most likely approach. Here the bypassed column duplicates its western neighbour by routing data to and from E and WW. Data output by the duplicate pair to the east are compared within the eastern neighbour and data output to the west are compared in the next adjacent western column. The additional circuitry required is:

Extra logic per chip
    2x comparators (for case 4.2: 2x + 4 comparators)
    Two more states to control register (Normal; bypass east, bypass west; bypass west and compare W with WW, compare E with EE)
    A fault-indication output driver

Extra pins per chip
    Possibly one for fault indication and possibly a second for fault indication from the northern neighbour if fault indications are OR-ed together down a column. Alternatively, the fault indication might be output serially on a pin already available or OR-ed together with any other fault indication, such as from a parity test.

Fig. 4 Comparison of data produced by bypassed column and its western neighbour
(Column labels: next western column; western neighbour (being duplicated); bypassed column; eastern neighbour. The western neighbour and the bypassed column form the duplicate pair. Control states shown: comparing E with EE; bypassing E; bypassing W; bypassing W and comparing W with WW.)

The pair of columns being monitored can be changed and, if the pair is stepped across the array after each frame of data has been processed, most hard faults will be detected at the end of a fixed number of frames.

Errors which may not be detected in a fixed time include memory faults for which the faulty value and the true value are equal during the monitored frame, but not at other times.

The basic scheme is very simple and involves minimal additions to the hardware. It might be supposed that the fault detection scheme would simply trigger the appropriate fault-correction bypass, but a number of further points need to be noted.

(a) The scheme only detects that one of two columns is faulty, and further diagnosis is needed to decide which is to be permanently bypassed.

(b) This diagnosis might be made by comparing first one column of the pair and then the other with its alternative neighbour. In the case of noncyclic arrays (with definite edges), processors at the edge will not have a second suitable neighbour and it is suggested that switches be provided to create the necessary cyclicity.

(c) Faults on the data lines can create confusing results. Figs. 5 and 6 show that a fault on a bypassing line can be incorrectly diagnosed as a fault in the bypassed processor, and that a fault on a direct line can appear as two separate faults. It is therefore necessary to check the routing paths independently in order to confirm the diagnosis suggested by the duplication scheme.

(d) If a separate input/output bit plane is used, then further comparators should be employed along the southern edge to compare the serial outputs of the duplicate columns. Since the duplicate pair may well have been stepped along the array before these serial outputs are available, the comparators in use will have to correspond to the duplicate pair in use at the time the data were calculated (not necessarily the pair in use at the current time).

(e) There is little difficulty in using more than one bypass column to detect and correct more than one fault, provided that faults do not occur in adjacent columns.

Fig. 5 A fault on a bypass line can appear to be a fault of the bypassed column
(Panels: a fault on bypass path 2-4 suggests that column 2 or 3 is faulty; a fault on bypass path 2-4 suggests that column 3 or 4 is faulty.)

Fig. 6 A fault on a direct line can appear to be two separate faults
(Columns numbered 1 to 6. Panels: a fault on direct path 3-4 suggests column 2 or 3 is faulty; a fault on direct path 3-4 suggests column 4 or 5 is faulty.)
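The stepping of the duplicate pair and the diagnosis suggested in points (a) and (b) can be sketched in Python. This is a toy model rather than the circuitry of the paper. It assumes a cyclic array, a single fault, and a hypothetical processor column that inverts its input bits, with the faulty column having its first output bit stuck at 1. All names (`column_output`, `pair_disagrees`, `detect`, `diagnose`) are illustrative. The comparison, which in hardware takes place in the columns adjacent to the duplicate pair, is reduced here to a direct comparison of the two outputs.

```python
def column_output(col, pixels, stuck_col):
    """Toy column: inverts each pixel bit; in the faulty column the first
    output bit is stuck at 1 (an assumed fault model)."""
    out = [1 - p for p in pixels]
    if col == stuck_col:
        out[0] = 1
    return out


def pair_disagrees(a, b, pixels, stuck_col):
    """Both columns process the same data; True if their outputs differ."""
    return column_output(a, pixels, stuck_col) != column_output(b, pixels, stuck_col)


def detect(n_cols, frames, stuck_col, start=0):
    """Step the duplicate pair one column eastwards after each frame.
    Returns (frame index, suspect pair) for the first mismatch, else None."""
    for k, pixels in enumerate(frames):
        west = (start + k) % n_cols
        east = (west + 1) % n_cols
        if pair_disagrees(west, east, pixels, stuck_col):
            return k, (west, east)
    return None


def diagnose(pair, n_cols, pixels, stuck_col):
    """Point (b): compare each suspect with its alternative neighbour
    (cyclic array assumed). Returns the faulty column, or None if the
    result is inconclusive and the routing paths need checking (point (c))."""
    west, east = pair
    west_bad = pair_disagrees((west - 1) % n_cols, west, pixels, stuck_col)
    east_bad = pair_disagrees(east, (east + 1) % n_cols, pixels, stuck_col)
    if west_bad != east_bad:
        return west if west_bad else east
    return None


frames = [[1, 0, 1]] * 6              # six frames of identical test data
hit = detect(6, frames, stuck_col=3)  # pair (2, 3) is flagged at frame 2
if hit is not None:
    suspect = diagnose(hit[1], 6, frames[0], stuck_col=3)
```

In this model a fault is found within one pass of the pair across the array, provided the data excite it. With frames whose first pixel is 0, the stuck bit agrees with the true value and the fault passes undetected, which mirrors the remark about errors that may not be detected in a fixed time.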

6 Conclusion

This paper shows that quite simple fault detection and correction schemes are available if either the amount of hardware or the solution time can be doubled. This represents a rough upper limit to the amount of redundancy which need be used, but other solutions are available which use less redundancy.

Fault correction can be performed efficiently by means of a spare column in the array and interconnections which allow any one particular column to be bypassed. A mechanism for this has been described. Most fault detection techniques diagnose the fault with adequate precision to initiate such corrective action for the majority of faults. However, since time is not at a premium, self-test or external test routines can be used to control the correction mechanism.

If such a correction scheme is used, it is a simple matter to exploit the by-passed column to detect faults by arranging for it to duplicate the function of its neighbour and by comparing their outputs. In this way, permanent faults in the array can generally be detected after a delay at most equal to the number of frames needed to try the duplicate pair in every position across the array. A number of minor points indicate that the step between detection and correction should probably be under software control. This fault-detection scheme is complementary to the use of parity detection which can detect all permanent and transient faults on data stored in memory, and these together constitute an efficient solution to the detection and correction of faults in arrays of processors.
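As a reminder of how a memory parity check operates, the following Python sketch stores one even-parity bit alongside a group of data bits. The grouping of bits and the function names are assumptions made for illustration; the organisation of the parity check in an actual array is not specified here.

```python
from functools import reduce
from operator import xor


def parity_bit(bits):
    """Even-parity bit: the XOR of all the data bits."""
    return reduce(xor, bits, 0)


def store(bits):
    """Append the parity bit to the data bits before writing to memory."""
    return list(bits) + [parity_bit(bits)]


def parity_ok(word):
    """True if the stored word (data plus parity) still has even parity."""
    return reduce(xor, word, 0) == 0


word = store([1, 0, 1, 1])
corrupted = word.copy()
corrupted[2] ^= 1  # a single stored bit changes value
```

A single parity bit flags any odd number of changed bits in the group, so a single-bit memory fault is caught as soon as the affected word is read, whereas an even number of changed bits in the same group would pass unnoticed.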

In the event that transient faults of the processors need to be detected, or that the latency of the rotating duplication is too long, the extension of parity checking into the processor [4] will detect many of the processor and all of the routing faults instantly, and therefore provide a further improvement at a reasonable cost.

7 Acknowledgments

This paper describes work carried out by the author while acting as a Consultant to the GEC Research Laboratories, Hirst Research Centre, Wembley, and he would like to thank Andrew McCabe and Ian Robinson for their helpful discussions.

8 References

1 DUFF, M.J.B.: 'Array processing', Electron. & Power, 1980, 26, (11), pp. 888-893

2 DUFF, M.J.B.: 'Review of CLIP4 image processing system'. Proceedings of national computer conference, Anaheim, CA, 5th-8th June 1978, pp. 1055-1060

3 REDDAWAY, S.F.: 'The DAP approach', in JESSHOPE, C.R., and HOCKNEY, R.W. (Eds.): 'Infotech state of the art report on super computers, Vol. 2' (Infotech International, Maidenhead, 1979), pp. 309-329

4 HUNT, D.J.: UK Patent Application GB 2037042 A

5 BATCHER, K.E.: 'The Staran computer', in JESSHOPE, C.R., and HOCKNEY, R.W. (Eds.): 'Infotech state of the art report on super computers, Vol. 2' (Infotech International, Maidenhead, 1979), pp. 33-49

6 BATCHER, K.E.: 'Design of a massively parallel processor', IEEE Trans., 1980, C-29, pp. 836-840

7 ROBINSON, I.N., and MOORE, W.R.: 'A parallel array architecture and its implementation in silicon', IEEE custom integrated circuits conference, Rochester, NY, May 1982, pp. 41-45

8 PRATT, W.K.: 'Digital image processing' (John Wiley, 1978)

9 KRAFT, G.D., and TOY, W.N.: 'Microprogrammed control and reliable design of small computers' (Prentice-Hall, 1981)

10 BREUER, M.A., and FRIEDMAN, A.D.: 'Diagnosis and reliable design of digital systems' (Computer Science Press, 1976)

11 DUKE, K.A.: 'Detect errors in complex logic', Electron Design, 12th Oct. 1972, pp. 88-93

12 WOODARD, S.E., and METZ, G.: 'Self-checking alternating logic: Sequential circuit design'. Proceedings of IEEE symposium computer architecture, Apr. 1978, pp. 114-122

13 SENGUPTA, A., CHATTOPADHYAY, D.K., PALIT, A., BANDYOPADHYAY, A.K., and CHOUDHURY, A.K.: 'Realization of fault-tolerant machines - linear code application', IEEE Trans., 1981, C-30, pp. 237-240

14 LARSEN, R.W., and REED, I.S.: 'Redundancy by coding vs redundancy by replication for failure-tolerant sequential circuits', ibid., 1972, C-21, pp. 130-137

15 NUDD, G.R.: 'Image understanding architectures'. Proceedings of national computer conference, Anaheim, CA, 19th-22nd May 1980, pp. 337-390

William R. Moore received his B.Sc. in electrical engineering from the University of Bristol in 1968 and his Ph.D. in Control Engineering from the University of Cambridge in 1979. From 1969 to 1973 he worked on digital avionic control systems for British Aerospace, Filton, and from 1973 to 1976 he undertook research at the University of Cambridge on new approaches to the design of fault-tolerant control systems. From 1976 to 1980 he lectured at the University of Hull on control systems, transducers and reliability. From 1981 he has lectured in electronics at the University of Southampton and also acted as a consultant to the GEC Research Laboratories, Hirst Research Centre. He is actively interested in control systems, VLSI architectures, reliability and fault tolerance.
