
A Low-Overhead Multiple-SEU Mitigation Approach for SRAM-based FPGAs with Increased Reliability


IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 61, NO. 3, JUNE 2014 1389

A Low-Overhead Multiple-SEU Mitigation Approach for SRAM-based FPGAs with Increased Reliability

Hasan Baig, Jeong-A Lee, Senior Member, IEEE, and Zahid Ali Siddiqui

Abstract: The mitigation of single-event upsets (SEUs) through modular or functional redundancy is a traditional approach for designing fault-tolerant systems; however, even in multiple redundant systems, SEUs can lead to a system failure if they occur simultaneously. Previous fault-tolerant approaches have proposed run-time reconfiguration to regain the lost functionality. We worked with a similar strategy to overcome failures caused by unidirectional SEUs occurring simultaneously in both frontline and redundant modules, but the approach we propose in this paper not only improves reliability but also requires low overhead compared to previous methodologies. The proposed architecture is an array of computation tiles containing computation cells and corresponding hot-spares. Each computation tile has a separate region for spare cells. The simultaneous faults are handled by an on-chip fault-tolerant core and external host software that partially reconfigure the spare-cells region of a computation tile. The proposed architecture is implemented on a Xilinx Virtex-5 FPGA device and verified with the aid of a simple digital application. Compared to previous schemes, our approach requires up to 9.6x less area overhead while providing 57.6% more reliability to mask multiple unidirectional SEUs.

Index Terms: Fault-tolerance, FPGA, partial reconfiguration, reliability, self-reconfiguration, self-repair, single-event-upset (SEU).

I. INTRODUCTION AND PREVIOUS WORK

A FIELD-PROGRAMMABLE GATE ARRAY (FPGA) device, customizable by SRAM, is a type of integrated circuit that consists of an array of programmable logic blocks interconnected by a programmable routing network and I/O blocks. SRAM-based FPGA devices are becoming popular for remote applications because of their high performance, reduced development cost, and re-programmability. Radiation in the environment can seriously affect the functionality of a circuit. A single-event upset (SEU) occurs when a charged particle, present in the environment, hits the silicon of a circuit, transferring enough energy to provoke a fault in the system.

Manuscript received November 28, 2013; revised February 20, 2014; accepted March 30, 2014. Date of publication May 14, 2014; date of current version June 12, 2014. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2013R1A1A3012335) and by the Ministry of Education, Science and Technology (MEST) (2010-0009303). (Corresponding author: J.-A. Lee.)

H. Baig and J.-A. Lee are with the Department of Computer Engineering, Chosun University, Gwangju, South Korea (e-mail: [email protected]; [email protected]).

Z. A. Siddiqui is with the Department of Electronic Engineering, NED University of Engineering and Technology, Karachi 75270, Pakistan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNS.2014.2315432

An SEU can have a persistent (permanent) or temporary (transient) effect depending on the amount of energy transferred by the charged particle. The main consequence of a transient effect is a bit flip in a memory element, while a persistent effect can result in a stuck-at fault. To keep these programmable devices operational in such a hostile environment, where human intervention for maintenance and repair in the field is impossible (although in some cases ground intervention for software patches can be done routinely to correct systematic problems), a fault-tolerant and self-healing reconfigurable architecture is necessary.

Triple Modular Redundancy (TMR) has traditionally been used for protecting digital logic from SEUs in space-borne applications [26]. Several types of TMR, including localized TMR, distributed TMR, global TMR, and XTMR, have been proposed by different vendors such as Mentor Graphics and Xilinx [33]. The main usage has been either at the module level or for the protection of sequential elements in digital logic. With the use of reprogrammable logic, as in SRAM-based FPGAs, the protection of the combinational logic is insufficient since the logical functionality of the FPGA can be changed by a charged particle hitting the on-chip configuration SRAM. Protection of the combinational logic is therefore required to avoid involuntary changes to circuit functionality. In TMR, three identical hardware modules are implemented to carry out the same function, with the majority output being used [1], [2]. TMR therefore cannot handle more than one fault occurring simultaneously [3], [4].

An alternative approach comes from the concept of self-repairing embryonic systems, which was first introduced in the 1990s [5], [6]. Since then, self-repairing digital systems, as an advanced form of fault-tolerant systems, have received increasing attention as modern digital systems become more complex and faster [7], [8].

Research has already been carried out on self-repairing digital systems, including [9]–[11], in which a logical system realized by lookup tables (LUTs) has been suggested. Dynamic partial reconfiguration has also been studied for fast fault recovery [12], [13], in which the faulty partition of an FPGA is reconfigured without stopping the overall operation.

Another important aspect of any FPGA architecture is the routing network, which occupies 80–90% of the area of an FPGA. The self-healing fault-tolerant architectures that have thus far been proposed consist of separate original and spare fault-tolerant modules that require an external router. This external router not only occupies additional chip area but also makes the routing logic ambiguous.

0018-9499 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 1. Self-healing architecture proposed in [10].

The overall architecture proposed by P. K. Lala [10] is shown in Fig. 1. In this architecture, all functional cells are identified as “f”, router cells as “r”, and spare cells as “s” (spare cells are used to replace faulty functional cells). It can be seen in Fig. 1 that each functional cell is surrounded by two spare and two router cells. The mechanism for routing the input connections of a faulty cell to the input ports of the corresponding spare cell is not discussed in [10]. Two functional cells share a single spare cell, so additional decision-making circuitry is required to replace the specific faulty cell with the spare cell. This makes the routing in this architecture quite complex and ambiguous.

An additional consequence of using an external router is that this architecture is not compatible with existing island-style routing networks. If a new architecture has to be developed, additional external routers are required, and consequently more area is needed in addition to the 90% of the FPGA area used by the existing island-style routing architecture. Regardless of the routing complexity, this scheme has a large area overhead (see Section V).

A hybrid scheme of wiring a redundant and a stem (empty) cell for fault recovery is proposed by K. Sokehwan et al. in [14]. In this framework, if a fault occurs in a working cell, the corresponding redundant cell replaces it instantly, and normal operation can be restored immediately. In this case, each functional cell has its own spare cell, and the inputs are pre-routed for both. Although this reduces the routing complexity, it still requires an external router to decide which cell should be active: the original cell or its corresponding spare.

Fig. 2 shows the top-level diagram of the fault-tolerant FPGA architecture proposed in [14]. The expanded block is a working module consisting of 16 working cells, 16 spare cells, and 4 stem cells. The light gray blocks represent the original working cells, while the dark blocks are the corresponding spare cells. The white blocks are stem cells that are used to heal cells that are permanently faulty. For each pair of working cell and spare cell, an external router is used to replace a faulty cell with the corresponding redundant cell. Here again the routing is vague; in particular, inter-working-module communication is not described. The area-overhead to fault-coverage ratio is also high; this is discussed in more detail in Section V.

II. PROBLEMS AND MOTIVATIONS

The most important consideration is that the fault-tolerant circuitry is itself SEU sensitive. A fault-tolerant scheme is therefore useless if the fault-tolerant circuitry requires a larger area than the area being protected. The main problem in previously proposed architectures is the large area-overhead to fault-coverage ratio.

Fig. 2. Self-healing architecture proposed in [14].

Another important concern is the presence of extra routing logic in the previous architectures. If one needs to implement previously proposed fault-tolerant schemes in the existing island-style routing architecture, additional area is required to implement the router logic.

For this reason, it is difficult for commercial FPGA vendors to refine their existing routing architecture according to the newly proposed schemes in order to incorporate fault-tolerant capabilities. This underlines the need to develop a design that can easily be integrated with existing routing strategies. We have proposed such an architecture, which can not only be implemented in existing FPGA devices but can also be fabricated entirely as a new device that utilizes the existing island-style routing strategies.

It should also be noted that the traditional fault-tolerant scheme, TMR, is unable to produce the correct output if two of the three redundant modules become faulty during the same clock cycle.

Our proposed architecture is able to tolerate both unidirectional bit-flip and stuck-at faults. We have improved the stem-cell strategy, proposed in [14], of recovering permanent stuck-at faults through differentiation (reprogramming) via dynamic partial reconfiguration. Our architecture focuses on SEUs occurring in the logic of Computation Tiles (see Section III-C) and not on SEUs that occur in the Fault-Tolerant Core (see Section III-D). Error-detection schemes and SEUs occurring in the nets are also out of the scope of this paper. Our main goal is to compare the overhead and reliability of our approach for SEU mitigation in logic with that of previous approaches.

Fig. 3. A fault-tolerant FPGA architecture with self-repairing capabilities. (a) Proposed architecture. (b) Computation Tile. (c) Computation Cell.

The rest of this paper is organized as follows. Section III gives a detailed description of the proposed architecture, the fault-tolerant computation cell, the fault detection and self-repairing mechanism, the fault-tolerant core, and the external host software. Section IV covers the experimental testing of the architecture with a case example. Section V gives a performance comparison of the proposed architecture with previously proposed schemes. Section VI concludes the paper by considering future enhancements.

III. THE PROPOSED ARCHITECTURE

A. Brief Overview of a Complete Architecture

A generic fault-tolerant FPGA computation cell has been developed that can be arrayed over the entire die to build up a symmetrical fault-tolerant FPGA architecture. The proposed architecture is an array of computation tiles (CTs), as depicted in Fig. 3(a).

Each computation tile, shown in Fig. 3(b), is divided into a number of computation cells. All computation tiles also contain a separate portion for spare cells, as shown in Fig. 3(b). This portion holds the redundant copy of the original function of all computation cells present in the same computation tile. The region of spare cells is partially reconfigurable. This region is reconfigured dynamically when both the original and spare functions become faulty at the same time (that is, in the same clock cycle). We refer to this condition as “simultaneous errors (SE)” and use this terminology throughout this paper. The computation tile is described in detail in Section III-C.

The simultaneous errors occurring in a computation tile are handled by a fault-tolerant core (shown in Fig. 3(a)). This core is preferably placed at the die center to maintain the homogeneity of the proposed architecture. It also defines the healing priority when simultaneous errors occur in more than one tile at the same time. The fault-tolerant core communicates via a UART interface with external PC software, which identifies the faulty tile and downloads its corresponding partial bitstream (of the spare-cell region) to recover that particular tile from the simultaneous error condition.

B. The Fault-Tolerant FPGA Computation Cell

Recall the TMR scheme, in which a module is triplicated and bit-wise majority voting is performed on the output of the triplicated circuit. The majority voter outputs the logic value (1 or 0) corresponding to at least two of its inputs. Since this technique cannot identify the faulty module, we adopt a technique in which the function is copied twice along with its error-detection codes (EDCs).

The proposed fault-tolerant computation cell, shown in Fig. 3(c), is capable of detecting all unidirectional errors in a function and is able to mask them with its redundant spare function. It consists of two LUTs: an original-function LUT with pre-computed EDCs and a second one for fault detection. The corresponding redundant spare-function LUT is placed in the spare-cell region along with its EDC. The fault-checking circuitry identifies errors in both the original function and the spare function at the same time. Thus, the spare LUT acts as a hot-spare and masks an SEU occurring in the original LUT instantly. In the case of simultaneous errors, the architecture is capable of reconfiguring the spare-cells region during runtime via dynamic partial reconfiguration. This technique is different from blind scrubbing, which scrubs the entire (or partial) configuration bitstream periodically without considering whether an upset is present or not [27]. In our architecture, partial reconfiguration (or partial scrubbing) is not performed periodically but only when needed, with the help of the internal fault-tolerant core.

Fig. 4 shows the internal connections of a fault-tolerant FPGA computation cell. Note that the function inputs to the original and its corresponding spare LUTs are pre-routed. The internal components of the second LUT inside the computation cell are also visible. This complete circuit can be implemented in single fracturable LUTs. The XOR gates (see Fig. 4) check for faults in both the original and spare functions. A non-matching value indicates a fault either in the original or in the spare function. The MUX serves as an internal router and routes the non-faulty signal out of the cell.

Fig. 4. Interconnections of fault-tolerant computation cells with their corresponding spare cells.

The select input of the MUX is driven by the output of the XOR gate that checks the original-function LUT. When
1. Neither the original nor the spare function is faulty: the original function is routed out of the cell.
2. Only the spare function is faulty: the original function is routed out of the cell.
3. The original function is faulty: the spare function is routed out of the cell.
4. Both the original and spare functions are faulty: the Simultaneous Error signal is generated through an AND gate.

The EDCs are simply the complement of the original function. The inverted function (or EDC) is used to detect unidirectional bit-flip errors.

It seems initially that the area overhead in our scheme is greater (because of the EDCs) than that of TMR. However, if the components are implemented in fracturable LUTs, the area overhead, in terms of LUTs, can be reduced. All the latest FPGA devices have at least six-input LUTs that can be fractured into two same-input functions. Hence, two functions that have the same set of inputs can be placed in one fracturable LUT. In our scheme, the original function, the spare function, and the EDCs all share the same set of inputs. We therefore pack the original function, the spare function, and the corresponding EDCs in two fracturable LUTs, as shown in Fig. 4.

This reduces the area overhead in terms of LUTs. Consider an example of a one-LUT function: if our scheme is adopted, only three LUTs will be required. If TMR is adopted, the area overhead will be the same, i.e., three LUTs. This is because, out of three redundant modules, two copies can be accommodated in a single fracturable LUT, as shown in Fig. 5, and the third copy will be placed in another fracturable LUT. However, the majority voter cannot be a part of the second fracturable LUT because it has a different set of inputs. Therefore, the majority-voter circuit must be placed in a separate fracturable LUT. TMR implementation using fracturable LUTs for functions with fewer than four inputs could be area efficient on Altera Stratix-II and Stratix-III FPGA devices [24], but these devices are sensitive to single-event latchup and are often not used in harsh radiation environments.

Fig. 5. Implementation of TMR using fracturable LUTs.
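The cell behavior described in this subsection (complemented EDC, XOR checks, MUX selection, and the AND-generated simultaneous-error signal) can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: LUTs are modeled as truth-table lists, the names `make_edc`, `is_faulty`, and `computation_cell` are hypothetical, and the convention that equal function and EDC bits flag a fault is an assumption derived from the EDC being the complement of the function.

```python
def make_edc(table):
    """EDC as described in the text: the bitwise complement of the truth table."""
    return [1 - bit for bit in table]


def is_faulty(table, edc, addr):
    """XOR check. Assumed convention: a healthy entry differs from its EDC bit,
    so equal values (XOR == 0) flag a fault."""
    return (table[addr] ^ edc[addr]) == 0


def computation_cell(orig, orig_edc, spare, spare_edc, addr):
    """Return (output_bit, simultaneous_error) for one input address."""
    orig_fault = is_faulty(orig, orig_edc, addr)
    spare_fault = is_faulty(spare, spare_edc, addr)
    # MUX: the select line is driven by the original-function check only.
    output = spare[addr] if orig_fault else orig[addr]
    # AND gate: both copies are faulty at once.
    return output, orig_fault and spare_fault


golden = [0, 1, 1, 0]   # hypothetical 2-input function (XOR) as a truth table
edc = make_edc(golden)
upset = [1, 1, 1, 0]    # 0 -> 1 upset at address 0; the EDC bit there is already 1

print(computation_cell(golden, edc, golden, edc, 0))  # case 1: (0, False)
print(computation_cell(golden, edc, upset, edc, 0))   # case 2: (0, False)
print(computation_cell(upset, edc, golden, edc, 0))   # case 3: (0, False)
print(computation_cell(upset, edc, upset, edc, 0))    # case 4: (1, True)
```

In this model a unidirectional upset can never turn a function bit and its EDC bit into a new complementary pair, which is why the check catches it; an error that flips both bits in opposite directions would pass unnoticed.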

C. The Computation Tile

Only the spare-cell region is kept partially reconfigurable because the whole FPGA area cannot be made partially reconfigurable at the same time. A standard partially reconfigurable design consists of two portions: static and dynamic. The static part is not run-time reconfigurable; the dynamic region, however, can be reconfigured on the fly. The nets of the static region cannot pass through any dynamic partially reconfigurable area. For this reason, we kept only the redundant spare functions in the partially reconfigurable region.


Fig. 3(b) depicts a computation tile. The length of a computation tile is proportional to the configuration frame length. The configuration frame defines the granularity of a partial reconfiguration, that is, the minimum number of Configurable Logic Blocks (CLBs) reconfigured during partial reconfiguration. For example, if the configuration frame length of a device is 20, then at least 20 CLBs will be reconfigured during partial reconfiguration of that device. The configuration frame length varies from device to device.

If a simultaneous error occurs in any of the computation cells, the whole spare-cell region is reconfigured according to the configuration granularity of the device. As a result, whether a simultaneous error occurs in a single computation cell or in multiple cells, all of them are healed at once.

It is important to keep the width of a computation tile as low as possible because the number of spare cells in a spare-cells region is directly proportional to the number of computation cells in a tile. A large number of computation cells in a computation tile requires a large number of spare cells, which increases the size of the spare-cell region. This in turn increases the size of the partial bitstream of the spare-cells region, and hence the reconfiguration time. Therefore, it is better to have a large number of small computation tiles, even though the computation load on the fault-tolerant core will increase. Reconfiguration delay analysis is discussed in Section IV.
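The granularity argument above can be made concrete with a few lines of arithmetic. The sketch below assumes that a spare-cell region is always reconfigured in whole multiples of the configuration frame length; this rounding rule and the function name `clbs_reconfigured` are illustrative assumptions, and only the frame length of 20 comes from the example in the text.

```python
def clbs_reconfigured(spare_clbs, frame_len):
    """CLBs touched when a spare-cell region of `spare_clbs` CLBs is rewritten.

    Assumption: the region is rounded up to a whole number of configuration
    frames, each covering `frame_len` CLBs.
    """
    if spare_clbs < 0 or frame_len <= 0:
        raise ValueError("sizes must be non-negative and frame_len positive")
    frames = -(-spare_clbs // frame_len)  # integer ceiling division
    return frames * frame_len


# With the frame length of 20 used in the text, even a small spare region
# costs a full frame, and crossing a frame boundary doubles the cost.
for region in (5, 20, 21):
    print(region, "->", clbs_reconfigured(region, 20))
```

Under this assumption the partial bitstream size, and with it the reconfiguration time, grows in frame-sized steps, which is consistent with the preference for many small tiles.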

D. The Fault-Tolerant Core

As mentioned in Section III-A, the fault-tolerant core is required to monitor and handle simultaneous errors via dynamic partial reconfiguration. The FPGA die is divided into two halves to efficiently handle simultaneous errors in both regions of the FPGA die. Fig. 6(a) shows each tile marked by its unique ID, for each half of the die, in order to identify which tile has a simultaneous fault.

The reason we have divided the die into two halves can be explained by first considering the case when the die is not divided for simultaneous error handling. In this case, the architecture will be similar to that shown in Fig. 6(b), with tile indices from CT0 to CT43. Any of the tiles can be given the highest preference for healing a simultaneous error. Suppose that the highest preference is given to CT0 and the lowest to CT43.

In this scenario, if a simultaneous error occurs, for example, in all of CT0–CT7 at the same time, then CT4 has to wait until the simultaneous errors present in CT0 to CT3 are healed. To avoid such a situation, we divided the die into two halves, as depicted in Fig. 6(a), where each tile is marked by its unique ID in both halves of the die to identify which tile has a simultaneous fault. If the left half is given the highest priority, then in the scenario discussed above, CT0 on the left-hand side will be healed first, and then CT0 on the right-hand side (corresponding to CT4 of Fig. 6(b)). Then CT1 on the left-hand side will be healed, followed by CT1 on the right, and so on. Similarly, the die can be divided into four quarters to further improve the simultaneous error handling.

Fig. 6. The computation tile indices when (a) the die is divided into two halves, and (b) the die is not divided.

Fig. 7 shows the internal components of a fault-tolerant core. It consists of two Simultaneous Error Handlers, one each for the right and left halves (see Fig. 6(a)) of the FPGA chip. These simultaneous error handlers are 32-to-5 priority encoders with some additional circuitry. The priority encoder on the left decides which tile on the left side of the die should be healed first when a simultaneous error occurs in more than one of them at a time. Similarly, the simultaneous error handler on the right makes the same decision for the right half. The responsibility of the priority controller is to decide which half of the die should be given the highest priority. In our proposed architecture, either half of the die can be given the highest priority.

Fig. 7. The components of a fault-tolerant core.

Fig. 8. The off-chip host PC software responsible for testing and healing simultaneous faults by downloading pre-generated partial bitstreams.

The frame generator generates the frame containing information about the computation tiles with simultaneous faults. The fault-tolerant core then sends this frame via the UART interface to the external host software, which heals the faulty tile by downloading its corresponding pre-generated partial bitstream of the stem-cell region.
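The healing order produced by the two simultaneous error handlers and the priority controller can be sketched in software. The model below is an illustration only, not the hardware design: it assumes that each handler picks the lowest-numbered faulty tile in its half and that the priority controller simply alternates between halves, as in the CT0-left, CT0-right, CT1-left example above. The names `priority_encode` and `healing_order` are hypothetical.

```python
def priority_encode(flags):
    """32-to-5 style priority encoder: index of the lowest-numbered set flag,
    or None when no tile in this half reports a simultaneous error."""
    for tile_id, faulty in enumerate(flags):
        if faulty:
            return tile_id
    return None


def healing_order(left_flags, right_flags, first="left"):
    """Order in which tiles are healed, as (half, tile_id) pairs.

    Assumption: the priority controller alternates between the two halves,
    starting with `first`, and each half heals its highest-priority tile.
    """
    pending = {"left": list(left_flags), "right": list(right_flags)}
    order = []
    half = first
    while any(pending["left"]) or any(pending["right"]):
        tile_id = priority_encode(pending[half])
        if tile_id is not None:
            order.append((half, tile_id))
            pending[half][tile_id] = False  # tile healed by reconfiguration
        half = "right" if half == "left" else "left"
    return order


# Scenario from the text: CT0-CT7 of the undivided die fail together, which
# corresponds to CT0-CT3 in each half of the divided die.
left = [True] * 4 + [False] * 28
right = [True] * 4 + [False] * 28
print(healing_order(left, right)[:4])
```

In this sketch the tile that was CT4 in the undivided numbering (CT0 on the right) is healed second instead of fifth, which is the benefit the division into halves is meant to provide.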

E. The Testing and Fault-Injection Software

A software application was developed on the LabVIEW platform to emulate the functionality of the proposed architecture (a screenshot is shown in Fig. 8). This application runs on a host PC and communicates with the FPGA device via a UART interface. The software serves two purposes: fault injection and testing, and execution of the self-repairing operation to recover from simultaneous faults. This requires the partial bitstreams for all spare-cells regions to be ready prior to running the experiment. These bitstreams are stored in a predefined folder accessible to the prototype software. The bitstream downloading commands are stored in separate command files, one for each bitstream. When any of the buttons (depicted in Fig. 8) is activated, the prototype software runs the command file associated with that option to download the associated bitstream via the JTAG interface. Thus, running the Xilinx iMPACT software manually for bitstream downloading is not required.

1) Self-Repairing Mechanism: When a simultaneous error occurs, the fault-tolerant core sends the information frame to the host software via the UART interface. The software then analyzes the frame to find out which computation tile is faulty and downloads its corresponding pre-generated partial bitstream. This whole process, including receiving the information frame, its analysis, and bitstream downloading, takes approximately one second to complete. Of this time, the partial bitstream download via the JTAG interface takes no more than a millisecond, as shown later in the reconfiguration time performance table (Table II) in Section IV.
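The host-side repair step (receive a frame, identify the tile, download its bitstream) can be outlined as follows. The paper does not give the frame layout or the file naming, so everything specific here is hypothetical: a one-byte frame with bit 5 selecting the half and bits 4..0 carrying the 5-bit tile index, a `bitstreams` folder, and the `*_spare.bit` names. The download step is passed in as a callable so the sketch runs without any hardware; in the actual setup it would run the command file that drives the JTAG download.

```python
def decode_frame(frame_byte):
    """Split a hypothetical one-byte frame into (half, tile_id).

    Assumed layout: bit 5 = half (0 left, 1 right), bits 4..0 = tile index
    as produced by a 32-to-5 priority encoder.
    """
    half = "right" if frame_byte & 0x20 else "left"
    tile_id = frame_byte & 0x1F
    return half, tile_id


def bitstream_for(half, tile_id, folder="bitstreams"):
    """Path of the pre-generated partial bitstream (hypothetical naming)."""
    return f"{folder}/{half}_ct{tile_id}_spare.bit"


def repair(frame_byte, download):
    """Analyze one frame and trigger the download for the faulty tile."""
    half, tile_id = decode_frame(frame_byte)
    path = bitstream_for(half, tile_id)
    download(path)  # stand-in for running the per-bitstream command file
    return path


# Dry run: record which bitstream would be downloaded for a frame that
# reports tile 3 in the right half.
downloaded = []
repair(0x23, downloaded.append)
print(downloaded)
```

Keeping the frame decoding separate from the download step makes it easy to exercise the tile-to-bitstream mapping before connecting the real UART and JTAG tools.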

TABLE I. MANUAL RECONFIGURATION OPTIONS IN HOST SOFTWARE

TABLE II. ACTUAL PARTIAL RECONFIGURATION TIME OF SPARE-CELLS REGION IN EXPERIMENT USING DIFFERENT CONFIGURATION MODES

E. 2) Fault-Injection and Testing: Fault-injection techniques are essential for validating the dependability of a system by analyzing its behavior when a fault occurs. Several techniques have been developed to inject faults into a system prototype, including fault injection through radiation exposure, hardware-based systems, software-based setups, simulation, and emulation. Fault injection via radiation exposure may be favored by engineers for its ability to imitate the physical phenomenon, but it has several disadvantages for testing prototype architectures. First, neither the exact location at which a fault is injected nor the number of faults injected can be controlled properly. Moreover, this technique is time consuming and expensive [28]. The major disadvantages of software-based fault injection are the limited set of locations at which faults can be injected and its limited observability and controllability. Simulation-based fault-injection environments not only require a large development effort but also take a long time to simulate the design model; real-time fault injection in a prototype is also not possible with simulation. In contrast, emulation-based fault injection via partial run-time reconfiguration (RTR) has been found to be the best way of injecting faults in SRAM-based FPGAs [29], [30]. Another advantage of emulation is that it allows the designer to study the actual behavior of the circuit in the application environment, taking into account real-time interactions [31]. Our main goal was to choose a scheme that provides fast fault injection with good control and visibility; hence, we adopted an emulation-based fault-injection approach to test our proposed architecture. The pros and cons of different types of fault-injection techniques and tools are discussed in [32]. As discussed earlier, the developed prototype software not only repairs simultaneous faults automatically, but is also capable of inducing temporary and simultaneous errors in the FPGA die to verify the functionality of the proposed architecture.

Fig. 9. Functional diagram of a test application.

Unlike [14], in which extra circuitry is implemented for error induction, we employed a differential-based [18] partial reconfiguration technique to induce error(s) in any particular cell. Using differential-based partial reconfiguration, it is possible to dynamically change the LUT contents without interrupting the ongoing process [18]. We used this technique to dynamically change the contents of both original and spare functions (as well as the pre-stored EDC contents) to insert temporary and permanent errors.

We generated separate bitstreams for each error case summarized in Table I. When a simultaneous error is generated, the fault-tolerant core reports it to the external PC software via the UART interface. The software then identifies the faulty computation tile in the FPGA and downloads its corresponding partial bitstream to reconfigure it. This is how the software automatically repairs any fault(s) in our proposed fault-tolerant FPGA architecture.
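The repair flow described above can be sketched as a small host-side routine. This is an illustrative sketch only: the report format (a collection of tile IDs), the `column` field used for ordering, the bitstream file names, and the `download` callback are all hypothetical stand-ins, because the actual UART frame layout and the JTAG download call are not specified here. It shows faulty tiles being healed one after another, leftmost tile first, each with its own pre-generated partial bitstream.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Tile:
    tile_id: int
    column: int      # hypothetical die position; a smaller value is further left
    bitstream: str   # hypothetical name of the pre-generated partial bitstream


def healing_order(faulty_ids: Iterable[int], tiles: Dict[int, Tile]) -> List[Tile]:
    """Return the reported faulty tiles sorted left to right."""
    faulty = [tiles[t] for t in set(faulty_ids)]
    return sorted(faulty, key=lambda t: (t.column, t.tile_id))


def repair(faulty_ids: Iterable[int],
           tiles: Dict[int, Tile],
           download: Callable[[str], None]) -> List[int]:
    """Reconfigure each faulty tile in priority order; return the healed IDs."""
    healed = []
    for tile in healing_order(faulty_ids, tiles):
        download(tile.bitstream)   # stand-in for the real partial-bitstream download
        healed.append(tile.tile_id)
    return healed


# Two-tile layout mirroring the test application: Tile 1 on the left, Tile 2 on the right.
TILES = {
    1: Tile(1, column=0, bitstream="tile1_spares_partial.bit"),
    2: Tile(2, column=1, bitstream="tile2_spares_partial.bit"),
}

if __name__ == "__main__":
    sent: List[str] = []
    print(repair([2, 1], TILES, sent.append))
    print(sent)
```

Passing the download step in as a callback keeps the ordering logic testable without hardware; in a real setup it would wrap whatever tool performs the JTAG download.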

IV. EXPERIMENTAL TESTING AND PERFORMANCE ANALYSIS

We have tested our proposed architecture on an XC5VLX110T Virtex-5 FPGA device mounted on an XUPV5-LX110T general-purpose development board. There is no special reason for selecting this platform for testing; any device that supports partial reconfiguration will suffice.

We developed a simple application to induce and observe bit-flip and stuck-at errors. The application is divided into eight functions that work together to perform an LED-shifting operation. We placed four functions in one tile and the remaining four in another. This application uses eight general-purpose on-board user LEDs to demonstrate the results of both error induction and healing. We placed one tile in the left half of the die and another in the right half in order to test the permanent error-handling operation of a priority controller. Fig. 9 shows the functional diagram of a test application. Each tile contains four cells named C1, C2, C3, and C4.

After running the software, the full application bitstream is downloaded first by pressing the “Full Bitstream” button, shown in Fig. 8. Different error conditions can then be tested, as depicted in Table I. To emulate a temporary error, we change the contents of the original function in Cell 1 of Tile 1 and the contents of the spare function of Cell 2 of Tile 2. The function keeps running without hindrance whenever a temporary error occurs because it follows rules two and three, described in Section III-B.

We have also tested the scenario where a simultaneous error occurs in more than one computation tile at a time. After activating the “Simult_Err_T1T2_Both” option, the error is induced in both tiles at the same time. The software then heals Tile 1 first, as it lies on the left half of the die, and then Tile 2. The user can also use the “Reconfiguration Delay” option (Fig. 8) to observe the fault induction and its sequential recovery by the host software. Project demonstration videos can be seen at the following links:

• Brief introduction of self-healing software (http://youtu.be/EfHrXE1jqPQ)

• Full demonstration of a test application (http://youtu.be/Gyt-aqmd7jw)

The autonomous healing is performed here by the external host software, which downloads partial bitstreams via the JTAG interface. The demonstration shows that a simultaneous error takes less than a second to heal (this includes the time for receiving a frame from the fault-tolerant core, frame processing in software, and downloading the corresponding partial bitstream via the JTAG interface). Table II shows the actual partial reconfiguration time of spare cells in our experiment using different configuration interfaces on the Xilinx XUPV5 board. As mentioned before, the application is evenly distributed into two tiles, each with four computation cells, and so the size of the spare-cells region of both tiles in this example is the same (8,192 bytes, or 65,536 bits).

As discussed in Section III-C, the time required to reconfigure the spare-cells region depends on its size. If a computation tile contains a large number of computation cells, then the same number of spare cells will be required. This in turn increases the size of the partial bitstream of the spare-cells region, and hence the healing time.

Fig. 10 shows how the healing time of the spare-cells region varies in our experiment depending on the placement of computation cells in different numbers of computation tiles. When all eight computation cells are placed in a single computation tile, the healing time increases owing to the larger partial bitstream of its spare-cells region. When the same computation cells are distributed across eight different computation tiles, the healing time of the spare cells is reduced.

We also estimated the area overhead required to implement the same experiment using TMR, [10], and [14], and compared it to our scheme. The comparison is made on the basis of fracturable LUT implementation. Fig. 11 shows that TMR and our proposed scheme each require 3x the area to cover transient faults. On the other hand, the same fault coverage requires 29x and 17x the area in the schemes of P. K. Lala [10] and K. Sokehwan et al. [14], respectively. For covering simultaneous faults, our scheme is also more area efficient than the architecture used by K. Sokehwan et al. Simultaneous-fault coverage is supported by neither TMR nor P. K. Lala’s scheme.
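To make the comparison concrete, the area factors quoted above for transient-fault coverage (3x, 3x, 29x, and 17x) can be normalized to the proposed scheme. The snippet below is only a sketch of that arithmetic; the dictionary keys and function name are illustrative and not taken from the paper. The ratio 29/3 is roughly 9.67, which is in line with the "up to 9.6x" area figure quoted elsewhere in the paper.

```python
from typing import Dict

# Area factors relative to the unprotected circuit, as quoted for transient-fault coverage.
AREA_OVERHEAD: Dict[str, float] = {
    "TMR": 3.0,
    "proposed": 3.0,
    "Lala [10]": 29.0,
    "Sokehwan [14]": 17.0,
}


def relative_to(baseline: str, overheads: Dict[str, float]) -> Dict[str, float]:
    """Express every scheme's area as a multiple of the baseline scheme's area."""
    base = overheads[baseline]
    return {name: value / base for name, value in overheads.items()}


ratios = relative_to("proposed", AREA_OVERHEAD)

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}x the area of the proposed scheme")
```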


Fig. 10. Logarithmic plot of bitstream sizes of spare-cells region vs. reconfiguration delay for different configuration modes. The units of the y-axis are shown in the data series legends above the plot.
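The dependence of healing time on region size and configuration bandwidth can be illustrated with a first-order model, time = bitstream bits / interface bandwidth. The sketch below rests on stated assumptions rather than measured data: it assumes the spare-cells bitstream grows linearly with the number of cells in a tile (2,048 bytes per cell, inferred from the 8,192-byte region for four cells), and the two bandwidth values are hypothetical placeholders, not the figures of Table II. Real partial bitstreams are frame-granular and carry command overhead, so actual sizes and times will differ.

```python
from typing import Dict


def region_bytes_per_tile(total_cells: int, num_tiles: int, bytes_per_cell: int) -> int:
    """Spare-cells region size of the fullest tile when cells are spread evenly."""
    cells_per_tile = -(-total_cells // num_tiles)  # ceiling division
    return cells_per_tile * bytes_per_cell


def reconfig_time_s(region_bytes: int, bandwidth_bits_per_s: float) -> float:
    """First-order healing time: bits to download divided by link bandwidth."""
    return region_bytes * 8 / bandwidth_bits_per_s


# Assumption: linear scaling, derived from 8,192 bytes for a four-cell tile.
BYTES_PER_CELL = 8192 // 4

# Hypothetical bandwidths in bits per second, for illustration only.
HYPOTHETICAL_MODES: Dict[str, float] = {
    "slow serial link": 1e6,
    "fast parallel port": 1e8,
}


def healing_table(total_cells: int = 8) -> Dict[int, Dict[str, float]]:
    """Per-tile healing time for 1, 2, 4, and 8 tiles under each hypothetical mode."""
    table = {}
    for num_tiles in (1, 2, 4, 8):
        size = region_bytes_per_tile(total_cells, num_tiles, BYTES_PER_CELL)
        table[num_tiles] = {
            mode: reconfig_time_s(size, bw) for mode, bw in HYPOTHETICAL_MODES.items()
        }
    return table


if __name__ == "__main__":
    for num_tiles, row in healing_table().items():
        print(num_tiles, row)
```

Under this model, spreading the same eight cells over more tiles shrinks each tile's spare-cells region and therefore the time to heal any single tile, which matches the trend described for the experiment.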

Fig. 11. Area overhead required to implement our experiment for transient and simultaneous fault coverage.

V. ARCHITECTURAL COMPARISON WITH PREVIOUS WORK

We have compared our proposed architecture with previously developed fault-tolerant architectures: TMR [3], P. K. Lala et al. [10], and K. Sokehwan et al. [14].

In TMR, additional logic is sometimes required to implement the majority-logic voters. In the worst case, TMR can require up to six times the area of the original circuit [22]. P. K. Lala et al. [10] and K. Sokehwan et al. [14] developed architectures that consist of a separate original cell, a spare cell, and a cell router that routes the non-faulty cell (either the original or a spare) to other cells. In addition, the spare function is activated and goes through a self-checking mechanism only when an original function becomes faulty. Therefore, the delay to route the non-faulty function to another cell increases, because the original function is checked first; if it is found faulty, then the spare function is activated and checked before routing its output

to the input of another cell. These schemes thus not only have a high area overhead, but also experience a high inter-cell routing delay. Our proposed architecture, however, consists of computation cells rather than a pair of original and spare cells. The function router is also a part of the computation cell. As a result, not only are the original and spare functions both checked at the same time, but only a single LUT is required for both of them (see Fig. 4). This reduces the area and routing-delay overheads compared to [10] and [14].

Also in [14], each working module has 16 working cells, and a single self-tester is available for each working module. Therefore, when more than one working cell in a working module is faulty, the self-test can be carried out on only one of them at a time. The self-test module generates pseudo-random bit patterns and passes them to the circuit under test (CUT). It then examines the fault in logic by comparing the resulting values from the CUT. It categorizes a fault as transient if two output trains from the CUT are exactly the same, and as permanent if they differ from each other. This self-testing facility is not provided in [10]. In our proposed architecture, however, error-detection codes are stored along with the original and spare functions. These pre-generated codes are compared with the functions to identify which one is faulty, either the spare or the original function.

An additional register is required in both [10] and [14] to remember the erroneous state and the faulty cell ID. However, in TMR and our architecture, the redundant function instantly takes over the erroneous state of an original function without memorizing its current state in any extra register.

The permanent error condition described by K. Sokehwan

et al. is healed via a stem-cell reconfiguration process [14]. According to the strategy adopted for reconfiguring a stem cell in [14], only one cell of a working module can be healed from a permanent error at a time.

It is also claimed in [14] that partial reconfiguration techniques have been employed to speed up the permanent fault-recovery process, which contradicts their video demonstration. The permanent fault-recovery demonstration video referred to in [14] (which can be seen at [21]) shows that the complete design flow is rerun, from synthesis through bitstream generation and downloading, which takes approximately 2 minutes and 20 seconds, including the bitstream downloading time. The device resets for about 9–10 seconds during reconfiguration, which clearly shows that the full device is reconfigured rather than only the partial area of the faulty cell. In comparison to [14], however, we do not have any further redundancy other than the single spare function. We define the condition of simultaneous error (see Section III-A) and its healing mechanism through spare-cells reconfiguration. Our architecture heals all simultaneous errors in a single computation tile at once (see Section III-C). In addition, unlike [14], our self-repairing software reconfigures a faulty tile with its corresponding pre-generated partial bitstream. It takes no more than a second to reconfigure the faulty tile without hindering the other on-chip running processes.

Table III summarizes the architectural comparison between [10], [14], TMR [3], and our proposed architecture. The area overheads are not calculated in [10] and [14]. We therefore


TABLE III
ARCHITECTURAL COMPARISON WITH PREVIOUS WORK

roughly estimated the optimized minimum area overheads of [10] and [14] by calculating the number of additional resources required (in terms of fracturable LUTs and registers) for fault-tolerant capabilities. For example, in the architecture proposed by P. K. Lala [10], the functional cell alone requires 20 additional components. The functional cell adopted there consists of a logic block, a tri-state buffer, a demultiplexer, and a 6-bit control register. The expanded version of the logic block, shown in [10], consists of two 8x1 RAMs (3-input complementary functions), one 2-to-1 multiplexer, three D flip-flops, a 1-to-8 demultiplexer, a tri-state buffer, a 1-bit error register, a two-pair two-rail checker, an XOR gate, and an I/O block. The two 8x1 RAMs, which hold 3-input complementary functions, can be taken as a single entity because they can be implemented in a single fracturable LUT. The I/O block (shown in [10]) consists of three 8x1 MUXs with their corresponding select signals stored in three instances of 3-bit RAMs. Therefore, the complete structure of a functional cell of [10] requires almost 20 components. The resources required by the router and spare cells are additional. Altogether, these resources cost more than 29x the area, i.e., 9.6x more area than our approach. Similarly, the area overhead in [14] is greater than 17x, as mentioned in Table III.

The fault mechanism in FPGAs also presents another challenge. Since all FPGA resources are sensitive to SEUs, any additional logic added to a circuit to improve reliability can itself be vulnerable to SEUs. In other words, logic added to negate the effect of SEUs can increase the overall SEU-sensitive cross section of a design. If this increase in cross section is greater than the reduction in cross section the scheme provides, there will be a net loss in reliability [23].

The computation of SEU-sensitive area in our architecture is

shown in Table IV. The readings are obtained by implementing the architecture on the Xilinx Virtex-5 XC5VLX110T FPGA device. The SEU sensitivity of the other architectures has been estimated in a similar manner. For example, in P. K. Lala’s architecture [10], the number of SEU-sensitive components in the functional cell, logic block, I/O block, router cell, and spare cell are 3, 11, 6, 3, and 15, respectively. The three-input complementary functions, RAM1 and RAM2 in the logic block, are SEU insensitive and thus not included. Therefore, out of 38 components, the following five components are SEU insensitive in [10]:

• One pair of original and complementary functions in the logic block of the functional cell

• Four pairs of spares and their corresponding complementary functions in the spare cell

The SEU-sensitive area in the K. Sokehwan et al. paper has been found in a similar way. Table V shows a comparison of the percentage area that remains sensitive to SEUs between [10], [14], TMR [3], and our proposed architecture. Only our scheme and [14] can handle simultaneous faults.

TABLE IV
SEU-SENSITIVE AREA CALCULATIONS FOR OUR ARCHITECTURE

TABLE V
AREA COMPARISON FOR SEU SENSITIVITY

As shown in Table V, only 21.4% of the SEU-sensitive area in [14] is protected, as opposed to our approach, which provides 79% protection and is thus 57.6% more reliable. It is therefore evident that the logic added for mitigation in [10] and [14] is larger than the area being protected from SEUs. This area itself is sensitive to SEUs, and thus its inclusion reduces the reliability of a device. TMR, on the other hand, has a reliability of 75% to cover faults occurring either in the original or redundant modules, one at a time. However, our architecture provides the highest reliability to cover faults occurring either individually or simultaneously in both the original and its redundant modules.
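As a quick check of the arithmetic, the 57.6% figure is consistent with a percentage-point difference between the two protected fractions quoted above (79% and 21.4%), rather than a relative improvement. The short sketch below reproduces that reading; the names are illustrative and not taken from the paper, and only the two percentages stated in the text are used.

```python
from typing import Dict

# Protected share of the SEU-sensitive area, as quoted in the text (percent).
PROTECTED_PCT: Dict[str, float] = {
    "proposed": 79.0,
    "Sokehwan [14]": 21.4,
}


def still_sensitive_pct(protected_pct: float) -> float:
    """Share of the area that remains SEU sensitive, in percent."""
    return round(100.0 - protected_pct, 1)


def point_gain(a_pct: float, b_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(a_pct - b_pct, 1)


gain = point_gain(PROTECTED_PCT["proposed"], PROTECTED_PCT["Sokehwan [14]"])

if __name__ == "__main__":
    for name, pct in PROTECTED_PCT.items():
        print(f"{name}: {pct}% protected, {still_sensitive_pct(pct)}% still sensitive")
    print(f"Difference: {gain} percentage points")
```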

VI. CONCLUSIONS AND FUTURE ENHANCEMENTS

This paper presents a fault-tolerant scheme to mask unidirectional errors caused by SEUs occurring either simultaneously or one at a time. Our proposed architecture additionally heals all of the spare cells in a computation tile at once. This means that whether a simultaneous error occurs in a single computation cell or in multiple cells inside the same computation tile, all cells will be healed at once. The effect of the spare-cells region size on its healing time is reviewed with different configuration bandwidth modes. An architectural comparison is also performed, in which our proposed architecture is found to be the most area efficient and the most reliable among the existing fault-tolerant schemes. The present model is tested and verified on a Xilinx XUPV5 board.

The proposed fault-tolerant scheme can be used in conjunction with configuration scrubbing to additionally cover upsets in routing logic. Just as the golden memory and its scrubbing controller are kept in radiation-hardened devices in the memory-scrubbing technique [25], the fault-tolerant core and the partial bitstreams of spare-cells regions can also be housed in a radiation-hardened chip. A soft-core microprocessor can be developed, as a part of the fault-tolerant core, to perform the functionality of the present off-chip host software.

REFERENCES

[1] R. E. Lyons and W. Vanderkulk, “The use of triple-modular redundancy to improve computer reliability,” IBM J., vol. 6, no. 2, pp. 200–209, Apr. 1962.

[2] C. LaFrieda, E. Ipek, J. F. Martínez, and R. Manohar, “Utilizing dynamically coupled cores to form a resilient chip multiprocessor,” in Proc. 37th Annu. Int. Conf. Depend. Syst. Netw., 2007, pp. 317–326.

[3] C. Carmichael, “Triple module redundancy design techniques for Virtex FPGAs,” Xilinx Application Note, XAPP197 (v1.0.1), Jul. 6, 2006.

[4] S. Habinc, “Functional triple modular redundancy (FTMR),” Gaisler Research, FPGA-003-01, Ver 0.2, Dec. 2002.

[5] D. Mange, E. Sanchez, A. Stauffer, G. Tempesti, P. Marchal, and C. Piguet, “Embryonics: A new methodology for designing field-programmable gate arrays with self-repair and self-replicating properties,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 3, pp. 387–399, Sep. 1998.

[6] D. Mange, M. Sipper, A. Stauffer, and G. Tempesti, “Towards robust integrated circuits: The embryonics approach,” Proc. IEEE, vol. 88, no. 4, pp. 516–541, Apr. 2000.

[7] L. J. K. Durbeck and N. J. Macias, J. Schewel, P. James-Roxby, H. Schmit, and J. McHenry, Eds., “Defect-tolerant, fine-grained parallel testing of a Cell Matrix,” Proc. SPIE, vol. 4867, pp. 71–85, 2002.

[8] J. M. Emmert, C. E. Stroud, and M. Abramovici, “Online fault tolerance for FPGA logic blocks,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 2, pp. 216–226, Feb. 2007.

[9] S. Mitra, W.-J. Huang, N. R. Saxena, S.-Y. Yu, and E. J. McCluskey, “Reconfigurable architecture for autonomous self-repair,” IEEE Design Test Comput., vol. 21, no. 3, pp. 228–240, May 2004.

[10] P. K. Lala, B. K. Kumar, and J. P. Parkerson, “On self-healing digital system design,” Microelectron. J., vol. 37, no. 4, pp. 353–362, Apr. 2006.

[11] W. Barker, D. M. Halliday, Y. Thoma, E. Sanchez, G. Tempesti, and A. M. Tyrrell, “Fault tolerance using dynamic reconfiguration on the poetic tissue,” IEEE Trans. Evol. Comput., vol. 11, no. 5, pp. 666–684, Oct. 2007.

[12] J. D. Hadley and B. L. Hutchings, “Designing a partially reconfigured system in FPGAs for fast board development and reconfigurable computing,” Proc. SPIE, vol. 2607, pp. 210–220, 1995.

[13] E. J. McDonald, “Runtime FPGA partial reconfiguration,” IEEE Aerosp. Electron. Syst. Mag., vol. 23, no. 7, pp. 10–15, Jul. 2008.

[14] K. Sokehwan et al., “A hierarchical self-repairing architecture for fast fault recovery of digital systems inspired from paralogous gene regulatory circuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 12, pp. 2315–2328, Dec. 2012.

[15] W. Lie and W. Feng-yan, “Dynamic partial reconfiguration in FPGAs,” in Proc. 3rd Int. Conf. Intelligent Information Technology Application, 2009, pp. 445–448.

[16] J. Thorvinger, “Dynamic partial reconfiguration of an FPGA for computational hardware support,” M.Sc. thesis, Lund Institute of Technology, Lund, Sweden, Jun. 2004.

[17] XILINX, “Partial reconfiguration user guide, UG702 (v13.1),” Mar. 01, 2011.

[18] [Online]. Available: http://www.xilinx.com/tools/partial-reconfiguration.htm.


[19] E. Eto, “Difference-based partial reconfiguration,” XILINX Application Note, XAPP 290 (v2.0), Dec. 03, 2007.

[20] [Online]. Available: http://youtu.be/Gyt-aqmd7jw

[21] [Online]. Available: http://sbie.kaist.ac.kr/media/sr_permanent_fault.avi

[22] M. J. Wirthlin, N. Rollins, M. Caffrey, and P. Graham, “Hardness by design techniques for field-programmable gate arrays,” in Proc. 11th Annu. NASA Symp. VLSI Design, Coeur d’Alene, ID, USA, May 2003, pp. WA11.1–WA11.6.

[23] K. S. Morgan et al., “A comparison of TMR with alternative fault-tolerant design techniques for FPGAs,” IEEE Trans. Nucl. Sci., vol. 54, no. 6, pp. 2065–2072, Dec. 2007.

[24] Altera Corporation, “FPGA architecture,” WP-01003-1.0 (ver 1.0), Jul. 2006.

[25] I. Herrera-Alzu and M. López-Vallejo, “Design techniques for Xilinx Virtex FPGA configuration memory scrubbers,” IEEE Trans. Nucl. Sci., vol. 60, no. 1, pp. 376–385, Feb. 2013.

[26] R. Yuan, “Triple modular redundancy (TMR) in a configurable fault tolerant processor (CFTP) for space applications,” M.S. thesis, Naval Postgraduate School, Monterey, CA, USA, Dec. 2003.

[27] J. Heiner et al., “FPGA partial reconfiguration via configuration scrubbing,” in Proc. Field Programmable Logic and Applications Conf., Aug. 2009, pp. 99–104.

[28] C. Hsueh, K. Tsai, and K. Iyer, “Fault injection techniques and tools,” IEEE Comput., vol. 30, no. 4, pp. 75–82, Apr. 1997.

[29] L. Antoni, R. Leveugle, and B. Fehér, “Using run-time reconfiguration for fault injection applications,” IEEE Trans. Instrum. Meas., vol. 52, no. 5, Oct. 2003.

[30] L. Sterpone and M. Violante, “A new partial reconfiguration-based fault injection system to evaluate SEU effects in SRAM-based FPGAs,” IEEE Trans. Nucl. Sci., vol. 54, no. 4, pp. 965–970, Aug. 2007.

[31] E. Böhl, W. Harter, and M. Trunzer, “Real time effect testing of processor faults,” in Proc. 5th IEEE Int. On-Line Testing Workshop, Jul. 1999, pp. 39–43.

[32] H. Ziade et al., “A survey on fault injection techniques,” Int. Arab J. Inf. Technol., vol. 1, no. 2, Jul. 2004.

[33] M. Berg, “TMR schemes by MEI technologies/NASA GSFC,” in Proc. Eur. Space Agency FPGA Tool Workshop, Noordwijk, The Netherlands, Sep. 2009.