Low Overhead Fault-Tolerance Technique for Dynamically Reconfigurable Softcore Processor

8/2/2019 Low Overhead Fault-Tolerance Technique for Dynamically Reconfigurable Softcore Processor

1/15

IEEE TRANSACTIONS ON COMPUTERS, VOL. XX, NO. YY, XXXXX 2012 1

Low Overhead Fault-Tolerance Technique forDynamically Reconfigurable Softcore Processor

Hung-Manh Pham, Student Member, IEEE, Sebastien Pillement, Member, IEEE, and

Stanisaw J. Piestrak, Member, IEEE

!

AbstractIn this paper, we propose a new approach to implement a

reliable softcore processor on SRAM-based FPGAs, which can miti-

gate radiation-induced temporary faults (single-event upsets (SEUs)) at

moderate cost. A new Enhanced Lockstep scheme built using a pair

of MicroBlaze cores is proposed and implemented on Xilinx Virtex-5

FPGA. Unlike the basic lockstep scheme, ours allows to detect and

eliminate its internal temporary configuration upsets without interruptingnormal functioning. Faults are detected and eliminated using a Configu-

ration Engine built on the basis of the PicoBlaze core which, to avoid

a single point of failure, is implemented as fault-tolerant using triple

modular redundancy (TMR). A softcore processor can recover from

configuration upsets through partial reconfiguration combined with roll-

forward recovery. SEUs affecting logic which are significantly less likely

than those affecting configuration are handled by checkpointing and

rollback. Finally, to handle permanent faults, the tiling technique is also

proposed. The new Enhanced Lockstep scheme requires significantly

shorter error recovery time compared to conventional lockstep scheme

and uses significantly smaller number of slices compared to known

TMR-based design (although at the cost of longer error recovery time).

The efficiency of the proposed approach was validated through fault

injection experiments.

Index Terms Error recovery, fault injection, fault-tolerance, FPGA,

lockstep, reconfigurable system, single-event upset (SEU), softcore

processor.

1 INTRODUCTION

MODERN FPGAs, besides customary reconfigurableresources, offer the designers the possibilities ofimplementing programmable processors having featuresof Commercial Off-The-Shelf (COTS) components (noneed to modify processor architecture or applicationsoftware). In particular, Xilinx FPGA devices include two

categories of processors: the hardcore embedded pro-cessor (PowerPC) and softcore processors (MicroBlaze,PicoBlaze) [1]. Hardcore embedded processors are hard-wired on the FPGA die and their number is limitedon each device (1, 2, 4 or no hardcore processor). Onthe other hand, softcore processors use reconfigurable

H.-M. Pham and S. Pillement are with University of Rennes 1/IRISA/INRIA,Res. Team CAIRN, 22305 Lannion, France (email: [email protected];[email protected]). S. J. Piestrak is with LIEN Lab., University ofLorraine, 54506 Vandoeuvre-les-Nancy, France; his research was done whenhe was with IRISA/INRIA, Res. Team CAIRN, 22305 Lannion, France, onleave from Univ. of Metz, 57070 Metz, France (email: [email protected])

Manuscript received ....., ....; revised ....., .....

resources, so their number that can be actually imple-mented depends on the device size only.

Xilinx FPGAs use SRAM-based technologies which areknown to be very susceptible to radiation and electro-magnetic noise [2]. The major effects caused by them

are known as Single-Event Upsets (SEUs) or soft errors,because only some logic state(s) of memory element(s)are changed but the circuit/device itself is not perma-nently damaged. In FPGAs, SEUs may directly corruptcomputation results or induce changes to configurationmemory; the latter can cause changes in the functionalityand performance of the device. Due to their flexibil-ity, FPGAs are attractive for mission-critical embeddedapplications, but their reliability could be insufficientunless some fault-tolerance techniques capable of mit-igating soft errors are used. These techniques should al-low for on-line error detection or/and correction duringsystem operation, very fast fault location, quick recovery

from temporary failures, and fast permanent fault repairthrough reconfiguration.

Here, we are interested in designing and implement-ing a fault-tolerant softcore processor using Virtex-5FPGA. FPGA implementations of various fault-tolerantsoftcore and hardcore processors can be found in [3][10]. In [3], lockstep systems using dual hardcore proces-sors PowerPC found in certain FPGAs are constructed.However, a limited number of available PowerPC pro-cessors in FPGAs restricts their utilization to build e.g. afault-tolerant Multi-Processor System-on-Chip (MPSoC).Therefore, one option to implement larger number oflockstep modules in a single FPGA is to employ soft-

core processors that can use all available reconfigurableresources of the device.

In [4], the authors claim proposing a lockstep schemeusing two hardcore PowerPC processors embedded inXilinx Virtex II Pro FPGA, which could be considered forapplication in softcore processors as well. However, thisscheme is rather a handshake scheme than a lockstep one

because, to perform consistency checks, the processorsexecute the same program but not simultaneously. Asa result, the overall time overhead is relatively large(besides context saving and restoring, if needed), becauseof the sequential execution of two identical tasks on

Digital Object Indentifier 10.1109/TC.2012.55 0018-9340/12/$31.00 2012 IEEE

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


2/15


two different processors, which might be prohibitivelylong in some real-time applications. Moreover, usingcheckpointing and rollback for context recovery resultsin an extra time overhead. Hence, it is worthwhile toconsider the possibility of context saving and restoringonly in case of errors, even at the cost of extra hardware.

The fault-tolerant softcore processor architectures from[5] using triple modular redundancy (TMR) are able tocorrect single errors in a faulty module through roll-forward error recovery. Although there is no need toregularly save the context when no errors occur, the TMRapproach consumes well over 200% of extra hardwareresources which could limit the possibilities of buildinga more powerful system like MPSoC.

A fault-tolerant version of an open-source LEON3processor [6], called LEON3-FT, was proposed in [7]. It isprovided with means allowing for detection and correc-tion of errors caused by SEUs in the processing unit andall on-chip RAM memories. However, the FPGA imple-mentation of LEON3-FT is highly resource-consuming,

because all reconfigurable logic like Configurable LogicBlocks (CLBs), Look-Up Tables (LUTs), and and flip-flops, must be protected using TMR.

The solutions from [8][10] have the ability to detectand correct errors using low level approaches capable ofdetecting bit-flips in the configuration memory, but theyare unable to distinguish whether the detected bit-flipsreally affect the design. Very often, a flipped bit doesnot belong to any design, so that halting the system tocorrect this error is pointless.

Finally, the cost of full TMR can be excessive insome applications like space systems, in which powerconsumption and weight are limited. In particular, the

available internal FPGA resources like CLBs, LUTs, andInput/Output Blocks (IOBs) could be insufficient to im-plement a full TMR version (in some designs, at leastinputs must be separated to avoid a single point of fail-ure). Moreover that sample designs show that full TMRversions could require significantly more overhead than200% (e.g. in [11], 430% for slices and 320% for LUTs).In those cases, alternative fault-tolerance techniques thatcan provide substantial cost reduction, although at theexpense of some increase of the failure rate, could be ofsome interest. Several such techniques which rely eitheron partial TMR applied only to the most critical parts ofthe design and/or are application-dependent have been

proposed e.g. in [11][14].In this paper, we propose a new architecture of a

fault-tolerant reconfigurable system which can be im-plemented at a reduced hardware and time cost on anySRAM-based FPGA with integrated softcore processors.Our actual implementation on Xilinx Virtex-5 FPGArelies on using an Enhanced Lockstep scheme builtusing a pair of MicroBlaze cores. To identify the faultycore, we propose a specially designed fault-tolerant (FT)Configuration Engine built using PicoBlaze. Once the exacterror location is determined by a specially designedScan Motor and a Frame Address Generator, the error is

corrected through partial reconfiguration combined withroll-forward recovery technique.

The paper is organized as follows. Section 2 detailssome basic FPGA features that are essential to supportfault-tolerance in the designs proposed here, presentsthe fault and error model of SRAM FPGAs, and sur-veys the fault-tolerance techniques commonly used inFPGA-based systems. Section 3 details the new fault-tolerant architecture. The fault mitigation strategy andthe state recovery procedure are described in Sections4 and 5, respectively. FPGA implementation details andcomparisons against existing designs are presented inSection 6. Section 7 presents fault injection campaignswhich provided statistical data to validate our design.The last section presents some conclusions and givessome perspectives for further research.

2 PRELIMINARIES

In this section, we will present some basic FPGA featuresthat are essential to support fault-tolerance in the designs

proposed here, the fault and error model of SRAMFPGAs, and the survey of the fault-tolerance techniquescommonly used in FPGA-based systems.

2.1 Partial Reconfiguration of FPGA

Currently, the main interest in the granularity of theFPGA programming data is related to the dynamic re-configuration property provided by some recent FPGAsto perform on-line programming (dynamic reconfigura-tion) of a portion of their logic (partial reconfiguration)without affecting the rest of the system.

Fig. 1. Dynamically reconfigurable architecture of Virtex-5 FPGA.

Figure 1 shows the internal structure of the XilinxVirtex-5 FPGA. Virtex FPGAs have configuration mem-

ory arranged in frames that are tiled across the device.Frames are the smallest addressable segments of thedevice configuration memory space. The FPGA fabric isphysically divided into different rows, with each rowdivided into columns. The rows are numbered from 0(up to 9) in the top and bottom halves of the FPGA,starting from the center (see Fig. 6-11 in [15]). A columnin the device matrix corresponds to a block in the array(CLB, DSP, block RAM (BRAM), IOB, etc.). In Virtex-5,a CLB column is 20 CLBs high by 1 CLB wide (Ch. 6in [15]) and one CLB contains two slices, whereas oneBRAM column contains four 4KB BRAM blocks. The



3/15


configuration of a column is defined by a set of certainconfiguration frames whose unique 32-bit address can

be determined using five parameters detailed in Fig. 2,stored in the Frame Address Register (FAR). The majoraddresses are numbered from left to right (physically inthe device matrix) starting with 0, and they respectivelycorrespond to the positions of the columns inside a row.The addresses with the same major address and differentminor addresses belong to the frames determining theconfiguration of one column. The numbers of minoraddresses inside a column depend on logic type (CLB,DSP or BRAM interconnect) (Table 6-16 in [15]).

Fig. 2. Division of 32-bit frame address (Fig. 6-10 in [15]).

Frame-based reconfiguration allows to modify a framethe smallest reconfigurable entity, whereas a modular ormodule-based reconfigurable system is one which has atleast one entire block dynamically reconfigurable [16].In a modular reconfigurable system, the FPGA fabricconsists of some static logic and one or more partiallyreconfigurable regions (PRRs). This fabric partitioningenables reconfiguration of a single PRR without systeminterruption (the static region and other PRRs continueexecution while only the reconfigured PRR halts) [17].

The behavior of the FPGA is determined by configura-tion bitstreams that consist of a sequence of instructionsand control signals data. Downloading this sequence al-

lows to program the FPGA to perform requested designfunctions. A non-reciprocal relation exists between thedesign and its bitstream: it is not possible to extractthe design structure and implementation on the FPGAfrom the bitstream. The reconfiguration process itselfcan be done either completely or partially by sendingthe related bitstream (full or partial) to the InternalConfiguration Access Port (ICAP) (Ch. 4 in [15]).

2.2 MicroBlaze Softcore Processor Structure

Besides hardcore embedded processors like PowerPC,that are hardwired in the die, FPGAs can also implement

softcore processors offered by manufacturers, like Mi-croBlaze, PicoBlaze, and Nios, as well as those proposed

by the open-source community, like LEON3 [6] andLEON4 [18]. Softcore processors are Intellectual Property(IP) blocks written in hardware description languages(HDL) like VHDL or Verilog, to be implemented usingreconfigurable resources of FPGAs. Figure 3 shows atypical MicroBlaze processor system that consists of a 32-

bit MicroBlaze core, the program memory BRAM withits data and instruction buses, and the Processor LocalBus (PLB) [19]the central bus of the MicroBlaze corewith some peripherals connected to it.

Fig. 3. MicroBlaze softcore processor structure.

2.3 Fault and Error Model of SRAM FPGAs

We assume a commonly used fault model of SRAM FP-GAs, which includes temporary faults (SEUs) affectingconfiguration memory or user memory implementinge.g. flip-flops and registers. Single SEUs will be of ourprimary concern here, because they are the most likelyto occur. Nevertheless, unlike most authors, we willalso consider permanent faults affecting configurationmemory (in Subs. 6.1).

2.3.1 Configuration UpsetsSEUs in configuration memory may result in modifi-cations of the functionalities of the application designthe FPGA implements. For a given design, all configu-ration memory bits can be classified as being sensitive(whose upset induces errors) and non-sensitive [20]. Thisis because amongst numerous configuration memory

bits only some are actually utilized by the users design,hence SEUs affecting the configuration bits that are notutilized by a specific design will not affect the behaviorof that design. Although of temporary nature, SEUs mayhave permanent effects until the device is reconfigurede.g. by readback or scrubbing [21]. In addition to con-figuration sensitivity, the sensitive bits can be furthercategorized into two following categories [20].

Non-persistent bits are those configuration bits which,when upset, may induce non-persistent functionalerrors which disappear once the device is recon-figured, so that the design can return to normaloperation. The non-persistent bits generally involvepurely combinational circuitry of the design.

Persistent bits are those configuration bits which,when upset, induce persistent functional errors,which do not disappear even after the device isreconfigured. The persistent bits generally involve

any part of the design that contains the sequentialcircuitry or BRAM. The frame-based reconfigurationfollowed by the internal reset could eliminate per-sistent errors but, unfortunately, it cannot deal withcorrupted data written to the BRAM. One feasiblesolution for the latter problem is the following: (i)to reconfigure the whole module (module-based re-configuration) to eliminate configuration errors andre-establish the start-up state of the BRAM; then (ii)to perform an internal reset of the registers and flip-flops to the initial correct state; and finally, (iii) toperform the context recovery.



4/15


2.3.2 User-Logic Upsets

The user logic contains memory elements (registers andflip-flops) which change their states during normal op-eration. Obviously, they can also be affected by SEUswhich could provoke the so-called logic errors due to

bit flips and which may have permanent effects untilthe affected memory element is rewritten. Because their

contents is not readily available for inspection, any con-current error detection is virtually impossible withoutusing fault-tolerance techniques specifically dedicatedto handle logic errors (like TMR). Nevertheless, theprobability of undetected logic errors seems relativelylimited by taking into account the following observa-tion. In modern FPGA devices, the configuration cellsoccupy more than 98% of all memory elements: e.g.Xilinx Virtex-5 SX50T FPGA contains a total number of20,019,328 configuration bits and only 28,800 flip-flops[15], [22]. The occurrence probability of logic upsets threeorders of magnitude lower than of configuration upsetsis confirmed by the experimental results reported in Figs

6 and 7 of [23]. Moreover, it was reported in [24] (Fig.11) that frequency variation has no evident impact onthe number of observed upsets.

2.4 Standard Configuration Data Error HandlingTechniques in FPGAs

Configuration data of SRAM FPGAs, containing mil-lions of configuration bits, are particularly vulnerableto SEUs. They can be protected against errors by var-ious standard means provided by FPGAs manufactur-ers, like configuration readback, configuration scrubbing,and error detecting or correcting codes [8], [21]. These

techniques can be applied continuously in the back-ground of a user design. FRAME ECC is the Virtex-5and -6 primitive using single-error-correcting/double-error-detecting (SEC/DED) Hamming error correctingcode (ECC) which during configuration readback allowsto correct single errors and/or detect double errors inthe configuration frame data (Ch. 4 in [15]).

2.5 Hardware Redundancy Techniques for FPGAs

The most widely used fault-tolerant methods used tomitigate logic errors in FPGA designs rely on hardwareredundancy: (i) duplication with comparison (DWC) for

detecting faults (Fig. 4(a)) and (ii) triple modular redun-dancy (TMR) with majority voter for masking faults (Fig.4(b)).

In DWC, the original block to be protected is replicatedtwice and the results produced by the original andthe outputs of replicated blocks are compared to detectfaults. The lockstep scheme of Fig. 4(a), is the imple-mentation of DWC at the processor level, supported bysome Xilinx FPGAs [3]. Two identical processors P1 andP2 receive the same inputs, simultaneously execute thesame instructions, and their results are compared step-

by-step at each clock cycle. P2 generates the reference

results to be compared against those ofP1 that providesthe system output. Basically, DWC is able to detect butnot to correct errors, because it cannot point out thefaulty processor. However, it could be capable to toleratetemporary faults, provided that it is supported by somere-execution procedure. In case of FPGA implementa-tion, the system needs also to be reconfigured to recovercorrect functionalities.

Fig. 4. Hardware redundancy: (a) duplication with com-parison (DWC)basic lockstep scheme; (b) triple modu-lar redundancy (TMR).

In TMR, the original block is replicated thrice, all three

blocks receive the same inputs, and a majority 2-out-of-3 voter is used to determine the correct result. It is alsopossible to detect inconsistencies in the outputs of thethree blocks to identify the faulty one. TMR allows tomask directly both temporary and permanent faults of asingle block.

DWC and TMR are conceptually relatively simple andeasy to implement in FPGA. Unfortunately, they are alsovery costly, because they involve respectively over 100%and 200% hardware overhead. Therefore, they must beused skillfully to guarantee required reliability level atthe minimum cost.

2.6 Error Recovery Techniques

Once an error is detected, the next step is error recovery,i.e., the process of removing errors from the system and

bringing it back to the error-free state. The processor con-text is the set of information needed to define uniquelythe state of the processor at a given moment. It couldinclude the states of the processor registers, the cache,the memory, etc. Saving and restoring all relevant valuesis essential for effective processor context switching anderror recovery.

Several error recovery schemes have been proposed[25][27] which can be classified as backward and for-

ward. The backward error recovery techniques, whichare based on time redundancy, rely on saving the cor-rect processor context and restoring it once the errorsare removed, so that the processor could resume cor-rect functioning at the last saved point (checkpoint).Amongst various context recovery techniques differingin the moment of saving and restoring the context, themost often used is rollback recovery using checkpoints.Checkpointing is the set of operations needed to retrieveand store the systems context, whereas rollback is theoperation that returns the processor state saved duringlast checkpoint before the error occurrence. The major



5/15


drawback of rollback recovery using checkpoints is thetime overhead (regular saving of the processor contextwith checkpointing frequency growing with the errorrate, computations lost from the last checkpoint followed

by restoring of the processor context). Figure 5 shows thescheme of checkpointing and rollback applied for basiclockstep scheme recovery. Because it is not possible toidentify the faulty processor without extra diagnosingsupport, The error recovery in FPGA designs can beachieved through reconfiguration of both processor coreswhich, however, can be time-consuming.

The latter problems are alleviated in roll-forward errorrecovery which does not need regular context saving.When an error occurs, it is corrected by copying thecorrect state of the processor from a fault-free mirrorprocessor, provided that there are means to identify it.Because a lockstep scheme using roll-forward rather thanrollback seems to offer better performance without sig-nificant increase of hardware resources, we will considerit in the designs proposed here.

Fig. 5. Checkpointing and rollback for basic lockstepscheme recovery.

3 NEW FAULT-TOLERANT ARCHITECTURE

Figure 6 shows the architecture of the fault-tolerantsystem proposed here, whose two main blocks are: theEnhanced Lockstep scheme and the fault-tolerant (FT)Configuration Engine. It relies on using two Xilinx Virtexhardware primitives: the reconfiguration port ICAP andthe FRAME ECC [15] as well as the highly reliable exter-nal Golden Memory to store the configurations (whosestudy is out of the scope of this paper).

Fig. 6. Block scheme of the fault-tolerant architecture.

3.1 Enhanced Lockstep Scheme

The basic lockstep scheme is the realization of DWC atthe processor level. It can be implemented in FPGA us-ing e.g. a softcore processor MicroBlaze. Unfortunately,

Fig. 7. Enhanced Lockstep scheme.

it can only detect errors without indicating the faultymodule. To alleviate this limitation, we propose here thenew Enhanced Lockstep scheme (shown in Fig. 7) whichis provided with means to identify the faulty processor,thus allowing to continue execution with the remainingfault-free processor.

Two identical MicroBlaze processors P1 and P2 arethe heart of the Enhanced Lockstep scheme. Each ofthem is a complex system consisting of a 32-bit proces-sor core, central bus, etc. (Fig. 3), designed using theprocedure proposed in [28]. Their outputs are identicalduring fault-free functioning, any disagreement indicat-ing error(s). A total of 150 bits of output signals ofthe PLB bus and the peripheral outputs are compared

by the Comparator/Multiplexer (COMP MUX). Thesesignals were chosen because the PLB is the central busof a MicroBlaze system and all the communications withthe peripherals must go through it: its size and data

throughput increase the likeliness to detect as soon aspossible any error which could occur during processoroperation. To note that each softcore instance should

be fixed to a specific location in a reconfigurable zone(i.e., P1 and P2 are implemented in separate PRRs).Otherwise, all the resources of the two cores could be

blended together, so that it would not be possible todistinguish the faulty one when an error is detected.Also, the location fixation allows for reconfiguration ofthe faulty core without interrupting the other.

The COMP MUX consists of two blocks: (i) the Com-parator that indicates any mismatch between the outputsOut1 and Out2 of P1 and P2 (containing the PLB

and final output signals), and (ii) the Multiplexer whichconnects one of the processors to the system output, sothat if one of them is reported to be faulty, the otheris switched on. The switching is an atomic operationexecuted in one clock cycle. Once the error is localized

by the FT Configuration Engine presented in Subs. 3.2,the affected processor is reconfigured to eliminate itsconfiguration upset. Then, the two processors need to

be synchronized to put the newly reconfigured one tothe same state as the correct one, thus enabling themto continue executing the same task in lockstep again.The recovery process of the Enhanced Lockstep scheme



6/15


is handled by the Context Recovery Block (CRB). TheCRB has a memory shared by both processors to storetheir context using on-chip BRAM that has dual-portconnection providing simultaneous access to two Recov-ery Buses RB1 and RB2. These buses serve to save andrestore the processors contexts to the BRAM as well asto control the CRB during the context switching process.To note that the COMP MUX itself is also a dynamicmodule that can be reconfigured in case of error by theFT Configuration Engine.

3.2 Fault-Tolerant Configuration Engine

The activation of the disagreement signal by the Com-parator triggers the procedure of localizing the faultyprocessor which is executed by the Configuration Engine(CE) (Fig. 8). The Configuration Engine consists of a Scan

Motor, a Frame Address Generator, and an 8-bit softcoreprocessor PicoBlaze (chosen because of its small size).Reliable uninterrupted operation of the ConfigurationEngine is crucial for the correct functioning of the wholeEnhanced Lockstep system. Because the ConfigurationEngine itself is also exposed to configuration upsets, toavoid a single point of failure, we made it fault-tolerantusing TMR (i.e., the FT Configuration Engine consistsof three Configuration Engines of Fig. 8 and a 2-out-of-3 majority voter). Moreover, if the FT ConfigurationEngine detects an error in one of its modules, the faultymodule will be disconnected and then the FT Configura-tion Engine will reconfigure it by transferring the correct

bitstream from the golden memory through the ICAP.

Fig. 8. Configuration Engine (CE).

3.2.1 Scan Motor

The Scan Motor continuously works in the background(hence, it does not affect the user design) to point out thephysical position of the configuration upset: it cyclicallyreads the configuration frame by frame using readbackthrough ICAP and checks each frame for errors. The

FRAME ECC primitive is used to identify any changesin the configuration memory due to upsets. If it hap-pens, the Scan Motor localizes and reports the erroneousFrame Addresses to the PicoBlaze, although the Config-uration Engine is unable to identify the faulty module.The actual address of the faulty frame of the lockstepmodule is decoded by the method explained below inSection 3.2.2. The module-based reconfiguration of thefaulty block (realized by the Configuration Engine) takesplace by sending the appropriate bitstream to the ICAP.

In a partially reconfigurable design, any reconfig-urable module is physically assigned to a specific region.

The addresses of the frames of that region which areoccupied by a given module can be enumerated usingRapidSmith [29]. In our design, once determined, theyare compressed and stored in the internal memory of thePicoBlaze. If during scan process, an upset is detected ina frame which does not belong to any module, it is verylikely that it is located in the interconnection betweenmodules and then it can be removed by the frame-basedreconfiguration.

3.2.2 Determination of Frame Addresses

The frame addresses of any reconfigurable module to bescanned (two softcores and the COMP MUX for eachentity of the Enhanced Lockstep scheme) can be deter-mined from three 32-bit words (stored in the internalmemory of the Configuration Engine) whose contentsshows Fig. 9. Word1 contains the Module ID which is theidentifier of any reconfigurable module, whose address-ing space is sufficiently large to identify practically anynumber of modules which would fit into a single FPGA.

Word2 contains the number of columns (Size) and the24-bit Starting frame address (FA) needed to determineall other addresses of the frame of this module. To ensurethat all columns (major addresses) are consecutive, eachmodule spans only one whole row. Because there are themaximum of 25 columns inside a row in the selectedFPGA and the largest reconfigurable module in oursystem (a softcore) requires 21 columns, eight bits sufficeto determine the size of any module. Finally, becausethere are different numbers of minor addresses insidea major address (a column) depending on logic type(CLB, DSP or BRAM interconnect), all relative positionsof the special columns (BRAM interconnect and DSP) are

stored inside the PRR of a module in 32-bit Word3.

Fig. 9. Three words used to determine frame addresses

4 FAULT MITIGATION STRATEGY

Figure 10 illustrates the fault mitigation strategy appliedto deal with errors occurring in the Enhanced Lock-step scheme. During normal operation, the COMP MUXcontinuously supervises the dual-core lockstep module,while the FT Configuration Engine scans all the recon-figurable blocks in the background. If the COMP MUXtriggers a mismatch signal, its cause can be a fault eitherin one of two softcore processors or in the COMP MUXitself. The COMP MUX sends a mismatch signal to



7/15


the FT Configuration Engine to request for scanning ofthe Enhanced Lockstep scheme. The FT ConfigurationEngine starts to scan the COMP MUX immediately and,if a fault is in the COMP MUX, the FT ConfigurationEngine will reconfigure it. If the COMP MUX is fault-free, the error must be due to a fault in one of thetwo cores, so the FT Configuration Engine only needsto scan one core to identify the source of error. The scanitself does not affect the cores task, although during thisspecial scan, the Enhanced Lockstep scheme needs to bepaused to prevent any catastrophic results.

Fig. 10. Fault mitigation strategy for the Enhanced Lock-step scheme.

If the error persists in a frame despite that configura-tion upsets have been recently corrected in it, a perma-nent configuration fault is declared. Depending on thefault duration, different reconfiguration techniques areselected: normal partial reconfiguration for a temporaryfault or tiling technique [30], [31] for a permanent fault.

The principle of tiling avoids usage of the faulty zoneof the FPGA by pre-compiling the same design with var-ious implementations, each of which has the followingtwo properties. First, it has a prohibited zone, so thata permanent fault can be masked by charging the ap-propriate configuration of the implementation in which

the prohibited zone overlaps the faulty area. Second, ithas its own bitstream, so that the fault masking pro-cess is done by downloading the appropriate bitstreamthrough ICAP by executing the partial reconfigurationprocedure. In our system, the bitstream generation forvarious implementation is done using a basic placementconstraint PROHIBIT [32] which is assigned through theuser placement constraint file provided to the synthesistool. The prohibited rectangle zone is defined by thecoordinates X and Y of two points (top-right and down-left). By varying these coordinates, the prohibited zonecan be displaced inside the PRR.

Fig. 11. Tiling technique using PROHIBIT constraint.

5 STATE RECOVERY PR OC ED U RE F OR T H EENHANCED LOCKSTEP SCHEME

Recall that the comparator can signal a mismatch causedby any upset in the dual processor core, without the ca-pability of indicating the faulty core. However, the faultdiagnostic process running in background (configurationscan and frame address determination) make possible todiagnose the faulty core. It is even likely that it wouldreport an error despite the comparator indicates no mis-match, which would mean that a non-sensitive bit wasflipped. To avoid accumulation of configuration upsets,it is corrected anyway using frame-based reconfiguration(which does not affect the processor operation). In theworst case, even if no errors were observed becauseof the dormant effect, they could still be detected bycheckpointing afterwards. Once a faulty processor isfound, it is reconfigured to bring it back to its initialstate as at the start-up, which is followed by loading init the same state as the other processor, so that its staterecovery process can be launched.

Fig. 12. Roll-forward state recovery procedure for theEnhanced Lockstep scheme.

Figure 12 shows the scheme of the roll-forward pro-cedure used to recover and synchronize the states of thetwo softcore MicroBlaze processors P1 and P2. The

context of a softcore MicroBlaze processor is representedby the 32-bit values of 31 General Purpose Registers andtwo Special Registers: the Program Counter (PC) andthe Machine Status Register (MSR) [33]. In the presentversion of our design, the context exchange betweenMicroBlaze processors is only applied to a mono-tasksystem without caches. Nevertheless, if necessary, thesame context exchange strategy could be also appliedto a more complex software (multi-task and/or withcaches), although it would involve longer recovery du-ration (the same problem would occur in case of TMR).During normal execution of both processors (A), if a



8/15


mismatch signal is triggered (B), the scan process (C)is launched immediately to localize the error (in thisexample, only the COMP MUX and P1 are scannedwhile the P2 is paused) and the reconfiguration processcorrects the error (D). Once it is completed (E), P2signals its ready status for the state recovery procedureto the CRB. The moment of recovery is decided by P1(e.g., it can wait until the end of a critical task thatshould not be interrupted). To start the recovery process,P1 signals the CRB inside the COMP MUX to startrecovery via RB1 bus. The CRB sends immediately aninterrupt signal via buses RB1 and RB2 to announce thetwo processors to begin recovery (F). Once the recoveryprocess completes, the two processors re-synchronizethemselves (G) and continue executing correctly the task(return to A state). During the period between D and F,only one processor is operational which could be an is-sue, should another fault appear in it. Nevertheless, thisduration is very short (about 4 ms) and the probability ofoccurrence of two consecutive upsets in two processors

within 4 ms in many applications seems rather low.According to [34], the nominal failure rate of Virtex-5FPGA equals to 151 FIT/Mb with 95% confidence rangein laboratory conditions (one FIT is one failure in 109

hours). Therefore, the probability of occurrence of onesensitive SEU during 4 ms in another processor (whileone is being reconfigured) is about 21017. To deal withthis highly unlikely event, the Configuration Engine can

be programmed to scan simultaneously the unprotectedprocessor to signal any SEUs. Recall also that partialprotection against faults was considered an acceptablesolution in various TMR-based designs as well [11][14].

The recovery process is detailed in Fig. 13. The CRB

launches an interrupt (INT) to force both processorsto enter the recovery routine (Start). At the beginning,P1 saves its context in the shared memory BRAM ofthe CRB, while P2 is waiting for the context savingprocess to finish (Wait for SAVE DONE). Then, P2starts restoring the context (Restore Context), while P1waits for the synchronization signal from the CRB. OnceP2 finishes restoring (RESTORE DONE), it signals itsready status to the CRB (Wait for SYNC), so that the CRBcould instantly send the synchronization signal (SYNC),and both processors would be liberated from recoveryto continue tasks execution synchronously.

To avoid undetected errors in BRAMs of each proces-

Fig. 13. Recovery process detail for the Enhanced Lock-step scheme.

sor, they were designed using Xilinx Design Suite v13.2,so that all single errors in BRAMs can be detected andcorrected using Error Detection and Correction (EDAC)implementation in LMB BRAMs [35]. The ECC used is a(32,7) Hamming code, i.e., the ECC codeword has a totalof 39 bits. Because BRAM is organized in 8-bit blocks,one additional BRAM block must be used to store 7check bits, which accounts for 25% BRAM overhead. TheEDAC circuitry occupies 92 slices which are all under thescan by the Configuration Engine as well.

Depending on the reliability level required, logic up-sets could be treated accordingly. In the minimalistcase, before each use, the whole CRB (including itsCRB memory contents and control logic) is refreshed

by partial reconfiguration, accompanied by internal resetto eliminate all undetected persistent errors. If errordetection and correction is required, the system canperform checkpointing and rollback, provided that agolden copy of the context is stored in the CRB memorywhich could be also protected by ECC (as suggested in

[36]) and would not require partial reconfiguration of itscontents.

6 IMPLEMENTATION DETAILS AND COMPAR-ISONS

We have implemented the system of Fig. 6 using Xil-inx Virtex-5 XC5VSX50T FPGA and Xilinx Design Suitev13.2. The 90-rotated layout view of the system isshown in Fig. 14. The dual lockstep cores operate at 125MHz (i.e., one cycle takes 8 ns), whereas each of threePicoBlazes inside the FT Configuration Engine runs at60 MHz (i.e., one cycle takes 16.7 ns).

Fig. 14. 90-rotated layout view of the Enhanced Lock-step scheme.

6.1 Tiling Implementation

Because configuration scrubbing techniques cannot han-dle permanent faults, to deal with the latter, we have ap-plied the tiling technique to the two processors (P1 andP2) and the COMP MUX. Figure 15 shows differentimplementations using the PROHIBIT macro (presentedin Section 4) which allow to avoid using a zone insidethe PRR ofP1. Notice that in each implementation, thecorresponding prohibited zone has a different location.



9/15


Fig. 15. Three sample tiling implementations of P1.

The following trade-offs are observed between thetiling granularity and the capacity of external memorywhich stocks tiling configuration bitstreams. The fineris the granularity (hence, the prohibited zone) of tiling,the more configurations are required to cover the whole

PRR and the more external memory space is needed tostore them. If a prohibited zone corresponds to a column,the number of needed configurations equals the numberof columns occupied by the concerned block. The costof stocking the bitstreams can be reduced considerably

by generating only the differential bitstream betweentwo adjacent columns, although it requires to solve twofollowing major issues.

(i) For two configurations, all frames of the othercolumns must be the same, except for the two prohibitedzones (prohibited columns). So that only two relatedcolumns need to be modified when switching from oneconfiguration to another.

(ii) The existence of persistent error(s) inside the framecontaining a permanent fault within a column could bea major problem. If some configuration bits of the zonecontaining a permanent fault cause persistent error(s),the whole processor (including BRAM contents) must

be reset in one of two ways applied on different levels:(a) by the module-based reconfiguration of the wholeprocessor followed by the reset signal controlled by theFT Configuration Engine, or (b) by the global reset of theentire platform.

However, finding a procedure leading to determiningthe optimal solutions (related to both (i) and (ii)) is

beyond the scope of this paper and will be consideredin the future.

6.2 Complexity Evaluation

Table 1 characterizes the complexity of the basic blocks ofthe system proposed. The FT Configuration Engine thatsupervises the whole FPGA device uses 3 120 + 24 =384 slices (three Configuration Engines and the voter).This constant overhead appears only once in the FPGAdevice, independently on the number of locksteppedpairs of softcore processors in the device. The overheadinvolved in each item of the Enhanced Lockstep scheme

present on the FPGA is one extra MicroBlaze Processor(560 slices and 40 KB BRAM), one COMP MUX (80slices), and one CRB (80 slices). The CRB contains 4KB BRAM (the smallest amount available in the devicematrix) that is shared by the two cores for the staterecovery process, despite that 1 KB would be sufficient.Each Configuration Engine, the dual core lockstep block,and the COMP MUX (except for the majority voter thatdrives the output signals of the FT Configuration Engine)are dynamically reconfigurable to be able to deal withconfiguration memory upsets. The only exception is thevoter, which however occupies only 24 slices, so itsupset probability is negligibly small compared to thatof the whole system. If necessary, this small voter can

be implemented using external hardened componentsor using Xilinx TMR solution XTMR [37], [38] whereinthe voter is also triplicated. Finally, because one BRAMcolumn contains 16 KB and one MicroBlaze Processorrequires 40 KB of BRAM including ECC, to put it intoone PRR, 48 KB BRAM must be used, which corresponds

to 3 BRAM columns. For the same reason, CRB uses 16KB of BRAM, even that 4 KB could be sufficient.

TABLE 1

FPGA resources needed by basic blocks of theEnhanced Lockstep scheme

Module Required[slices]BRAM

[KB]PRR Size

[slices]

Configuration Engine 99 120Voter 24MicroBlaze Processor 524 40 560COMP MUX 51 80CRB 72 4 80

XC5VSX50T8160 594

TABLE 2

Partial reconfiguration (PR) overhead and bitstreamsneeded for the system (without tiling)

Module PR overhead[slices]Bitstream size

[KB]Reconf. time

[s]

3 CE 3 21 3 44 7522 P 2 36 2 176 3000

COMP MUX 29 12 205CRB 8 50 853

Total 172 570

Table 2 shows the partial reconfiguration hardwareoverhead due to PRR partitioning. The PRR size andthe required resources for one Configuration Engine are120 and 99 slices, respectively (including the overhead of120 99 = 21 slices for one Configuration Engine). Thepartial reconfiguration (PR) overheads for one softcore,the COMP MUX, and the CRB are respectively 36, 29,and 8 slices. Bitstream sizes of PRRs and related module-

based reconfiguration times are also included, for com-pleteness. The reconfiguration bandwidth of 60 MB/swas assumed, which corresponds to 8-bit ICAP interface



10/15


Fig. 16. Context recovery C code.

operating at 60 MHz and 16-bit Flash Golden Memoryaccessed at 30 MHz.

The duration of task interrupt ti (in case of a sin-gle error) will be evaluated depending on the sourceand type of error. A non-persistent error necessitatesframe-based reconfiguration (frame reload) to eliminatethe error without interrupting the module operation. Apersistent error requires module-based reconfiguration

which resets the reconfigured module.1) An error in the COMP MUX

a) A non-persistent error: ti1a = tsCM;b) a persistent error: ti1b = tsCM + trCM,where: tsCMthe scan time of COMP MUX andtrCMthe reconfiguration time of COMP MUX.

2) An error in a processor of the lockstepped paira) A non-persistent error: ti2a = tsCM + tsP + tsw;

b) a persistent error: ti2b = tsCM + tsP + tsw + trec,where: tsPthe time to scan one processor, tswthe time to switch the COMP MUX, and trecthetime to recover the processor context in case ofpersistent error.

The FT Configuration Engine requires 41 clock cyclesto complete a frame scan. The number of configurationframes occupied by COMP MUX and one processorare 72 (80 slices) and 604 (560 slices), respectively. Thecontents of BRAM varies during normal operation, henceit is really difficult to check the correctness of its con-tents using configuration scan. Nevertheless, becausethe BRAM contents is protected by ECC, the BRAMframes of both the processor and the CRB are excludedfrom the scan. The maximum time to localize a faultis that required for scanning the COMP MUX and oneprocessor, i.e.: tsCM = 72 41 16.7 ns = 49.2 s and

tsP = 604 41 16.7 ns = 413.6 s. The minimumtime is that required to scan one frame690 ns. Ifthe FT Configuration Engine detects an error in theCOMP MUX, the reconfiguration of COMP MUX takestrCM = 205 s. The time for COMP MUX to switch theoutput of the lockstepped pair moduleonly one clockcycle equal to 8 nsis negligible.

Figure 16 shows the C code of the processor contextsaving and restoring processes. The registers R1-R31,the Machine Status Register (MSR), and the ProgramCounter (PC) register are saved from the starting address0x83C18000 of the BRAM inside the CRB. The context

TABLE 3Durations of task interrupts depending on error types,

affected modules, and error positions (in s)

Interrupt COMP MUX Processorduration Non-persist. Persistent Non-persist. Persistent

Min 0.69 205.7 49.9 55.1Max 49.20 254.2 462.8 468.0

Average 24.95 230.0 256.8 261.6

TABLE 4Comparison of single-processor architectures

Basic TMR EnhancedSimple Lockstep [5] Lockstep

Slices [%] 100 297 384 297BRAM [KB] 32 64 48 112Time Overhead tcor=1150trec trec tscan=95trecServices Contin. No Low High High

saving and restoring processes need respectively 272 cy-cles (272 8 ns = 2.2 s) and 325 cycles (2.6 s). The time

of the whole recovery process of Fig. 13, consisting of thedurations of context saving, restoring, and control, is trec= 5.2 s. Therefore, when a persistent error occurs inthe last frame to be scanned, the maximum system timeoverhead resulting from the duration of task interruptcan be determined as follows:

1) If a non-persistent or persistent error occurs in theCOMP MUX, it equals to ti1a = 49.2 s or ti1b =49.2 + 205 = 254.2 s, respectively.

2) If a non-persistent or persistent error occurs in thelockstepped pair module, it equals to ti2a = 49.2 +413.6 = 462.8 s or ti2b = 49.2 + 413.6 + 5.2 = 468s, respectively.

Besides the maximum durations of tasks interruptsdue to persistent errors, Table 3 shows minimum du-rations when a persistent error occurs in the first frameto be scanned.

6.3 Comparison with Other Designs

Table 4 summarizes various characteristics of differentsingle-processor architectures, all using MicroBlaze soft-core. To the best of our knowledge, only two FPGAarchitectures suitable to implement a fault-tolerant soft-core processor were reported in the literature. One is



11/15


TABLE 5Usage of slices by FT-MPSoC systems (560 slices =

100%; Xilinx FPGA XC5VSX50T = 1457%)

No. of processors 1 2 3 4 5 6 7

Ours [%] 297 525 753 981 1209 1437 1665TMR from [5] [%] 384 699 1014 1329 1644 1959 2274

a lockstep system using hardcore embedded processorPowerPC [3], and the other is the only fault-tolerantFPGA implementation using the MicroBlaze softcoreprocessor using TMR [5]. To have some reference com-plexity figures, we have also implemented the basic lock-step scheme. In our Enhanced Lockstep scheme, the twosoftcores and the voter are bound to one PRR. We havefound that the hardware overhead of the basic lockstep isthe same as of the proposed Enhanced Lockstep solution:either system consumes twice of slices as the simpleprocessor. Because there must be a fault-tolerant spareprocessor which controls the reconfiguration process of

the dual-core module in case of mismatch, we havechosen the TMR implementation of the 8-bit PicoBlaze(like in the FT Configuration Engine). Obviously, it usesthe same amount of slices as the FT ConfigurationEngine, the COMP MUX, and the CRB (384+80+80=544slices, i.e. 97% of those of the simple processor using 560slices). As for BRAM, a simple processor requires 32 KB,whereas our Enhanced Lockstep scheme requires 112 KBconsisting of two copies of BRAM of each processor (2 48 KB) and the BRAM of the CRB (16 KB). Because theTMR softcore system from [5] uses ECC for protectingBRAM data against errors, it uses a total of 48 KB BRAM.

Table 4 shows that our solution requires significantly

lower time overhead than the basic lockstep system.Firstly, because the scan process which pauses the systemis significantly shorter than the reconfiguration processof the whole dual-core block to correct an error ac-cording to the basic lockstep scheme. (The maximumtime overhead when an error occurs in the EnhancedLockstep is the scan time and the proposed roll-forwardrecovery time.) Secondly, the time overhead of the En-hanced Lockstep scheme, dominated by tscan, is about95 times larger than trec = 5.2s (the proposed roll-forward recovery duration). Furthermore, the proposedroll-forward recovery strategy used in our system isalways faster than the checkpointing and rollback strate-

gies used in the basic lockstep scheme. When an erroroccurs in the basic lockstep scheme, the time overheadis dominated by the correction process using dynamicreconfiguration (about 1150 times longer than one roll-forward recovery). Moreover, the major drawback ofthe basic lockstep scheme is low continuity of services

because, after error occurrence, its dual-core moduleneeds to be reconfigured immediately to correct error.

Comparing single fault-tolerant softcore processors,one using the TMR approach from [5] and our usingEnhanced Lockstep scheme, it is seen that the latterenjoys an advantage of smaller hardware overhead. This

advantage grows with the number of processorsdueto the presence of only one FT Configuration Engineper FPGA device. An FT-MPSoC system applying ourapproach incurs a constant hardware overhead of the FTConfiguration Engine (384 slices) and 720 slices (114% ofa processor) for each added lockstepped pair (560 slicesfor one processor, 80 slices for one COMP MUX, and 80slices for one CRB). In an FT-MPSoC system applyingTMR, 1120 slices (2 560) overhead is required for eachadded processor. Table 5 compares hardware overheadof the two approaches depending on the number ofprocessors. The more processors the FT-MPSoC con-tains, the relatively lower hardware overhead our systemrequiresthanks to the presence of only one FT Config-uration Engine in the FPGA. In particular, all slices avail-able in the XC5VSX50T FPGA suffice to construct an FT-MPSoC system containing 6 processors using EnhancedLockstep methodology and only 4 processors using TMR(shaded cells in Table 5). Unfortunately, the trade-off issignificantly longer error recovery time overhead of our

design.

6.4 Handling Logic Errors

Many fault-tolerance techniques in reconfigurable archi-tectures are intended to deal with SEUs in configurationmemory as the major source of concern [2], [8][10],[21]. Compared to state-of-the-art solutions dealing withconfiguration memory upsets in softcore systems, ourmethodology offers better performance. As for logicerrors (due to upsets in registers, flip-flops, etc.), TMRhas been a commonly accepted strategy [38]. In oursystem, these types of errors are handled using thefollowing strategy involving low hardware and timing

costs. Periodically, the two cores perform checkpointingto allow for recovery from upsets affecting logic, ifneeded. During checkpointing, the cores contexts arecopied into the first-half of the CRB memory. Whilecopying the context, the data present in the RecoveryBuses are compared to detect any mismatch. If the twocontexts agree, the context of P1 is saved as correctinto the second-half of the CRB memory and, should amismatch be detected during next checkpointing, it isused for rollback,

To deal with logic errors, the Configuration Enginemust scan the COMP MUX and each core of the lock-stepped pair. If the scan process cannot identify the

faulty module while the comparator still continues tosignal a mismatch, it is assumed that an SEU caused alogic error. As a consequence, the scan process could lastlonger, because more modules need to be scanned andwithin these modules all the frames must be scanned(max scan time), which would involve more timingoverhead.

As already mentioned in Subs. 2.3.2, the probabilitythat an SEU affects logic could be three orders of mag-nitude lower than for configuration bits. That is why oursystem can perform checkpointing and rollback signifi-cantly less frequently than the classical lockstep scheme



12/15


(TCP in Fig. 5 could be significantly longer), because thelatter treats upsets affecting both configuration and logicuniformly. Consequently, the resulting timing overhead(related to checkpointing and rollback) of the EnhancedLockstep scheme is significantly lower.

Fig. 17. New CRB to deal with upsets affecting logic

Figure 17 shows the new CRB modified to deal withupsets affecting logic. Before each checkpointing driven

by the FT Configuration Engine (FT-CE) via the CP sig-nal, the CRB is refreshed through partial reconfiguration,followed by internal reset to eliminate any undetected

persistent errors, except for BRAM whose contents isprotected by ECC. During checkpointing, the contextsof two cores are compared and the context of P1 iscopied by default into the first half (Addr) of the CRBmemory (BRAM with ECC). Because the Recovery BusesRB1 and RB2 connect also to the PLB bus and theCOMP MUX compares the PLB signals, there is no needto compare RB1 and RB2, as this comparison is alreadyperformed by the COMP MUX. If no error is signaled

by the COMP MUX during checkpointing, this contextis assumed correct and copied into the second half (Addr+ 2K) which serves as a golden copy for rollback (GoldenContext Copy in Fig. 17).

As for the additional hardware overhead needed todeal with upsets affecting logic, the CRB can also be used

but at the cost of extra 84 slices, which include 4 slicesof the Golden Context Copy block and 80 slices of theEDAC circuitry of the (32,7) Hamming code (see [36]).

7 VALIDATION USING FAULT INJECTION

To validate the efficiency of the fault-tolerant approachproposed, we have provided the FT Configuration En-gine with the possibility to carry out automatic faultinjection campaigns for the configuration memory ofP1 and the COMP MUX. The goal is to obtain some

statistical data on the design robustness, measured bythe percentages of sensitive bits and persistent errors.However, notice that the system malfunctions occurringdue to the faults in the BRAM of the CRB are not easy toobserve, because the contents of BRAMs change duringoperation, so the sensitivities and error persistences ofthe BRAM bits vary as well. As a consequence, it isnearly impossible to obtain the statistically meaningfulfault injection results for the BRAM. Moreover, becausehere BRAM blocks are protected against errors by ECC,faults are not injected in the BRAM ofP1 and the CRB.Therefore, we have considered only single configuration

bit flips injected using the frame-based reconfigurationthrough ICAP according to the fault injection procedureshown in Fig. 18.

Begin

End

Module-based reconfigurationand reset of P1/COMP_MUX

Application run

Error detected?

Non-sensitive

bit declared

Sensitive

bit declared

Bit flip removal(by frame-based reconfiguration)

Application re-run

Error detected?

Non-persistenterror declared

Persistenterror declared

Last bit flip?YES

NO

YESNO

YESNO

Bit flip injection(by frame-based reconfiguration)

Fig. 18. Flow diagram of the fault injection procedure.

Recall that P1 and the COMP MUX occupy 604 and72 configuration frames (which are separate for thesetwo blocks), respectively. Because each configurationframe contains 41 32-bit words, i.e. 1312 bits, the totalof 792448 bits (604 frames 1312 bits) and 94464 bits(72 frames 1312 bits) are related to P1 and the

COMP MUX, respectively. To evaluate the sensitivityof configuration bits and persistence of errors of theEnhanced Lockstep scheme, single bits are flipped inrandomly chosen locations in about 5% of the total bits(exactly 40000 and 4000 configuration bits ofP1 and theCOMP MUX, respectively).

Once a fault is injected, the lockstepped pair of proces-sors P1 and P2 executes the application program thatchecks whether a peripheral of a softcore operates cor-rectly. This application was chosen, because it involves arelatively large number of transactions passing throughPLB bus and thus offers the advantage that COMP MUXcan verify the consistency of outputs of P1 and P2

for frequently varying data. Although this applicationprogram does not cover the complete set of instructions,it is nevertheless a valuable proof of concept for theproposed system. For each fault, the application startsfrom the same initial state and runs during 120 106

cycles = 0.96 s. The procedure flow differs dependingon whether COMP MUX detects any mismatch duringthis period.

(i) If there is no mismatch, the bit is declared non-sensitive and the injection fault experiment terminatesfor this bit. Then, the tested block (P1 or COMP MUX)is reconfigured with a correct bitstream using module-



13/15


based reconfiguration (to eliminate any undetected per-sistent errors that could have resulted from the last bitflip, and thus make the FPGA ready for the next bit flip).

(ii) The first mismatch detected during this periodindicates that the bit is sensitive and hence the firstapplication run is stopped. To determine whether thissensitive bit causes a persistent error or not, the bit flipis first corrected by replacing the faulty frame with thecorrect one by frame-based reconfiguration. Then, thesame application is re-run from the very beginning foranother 0.96 seconds. Any mismatch signal triggeredduring the second application run is attributed to thepresence of a persistent error. As in the previous case,the module-based reconfiguration is launched to correctthis error and make the FPGA ready for another bit flip.

Finally, notice that any bit flop injected inCOMP MUX may result in only two outcomes:either it has no impact on COMP MUX (no mismatchsignal triggered), or COMP MUX triggers erroneouslya mismatch signal, which allow us to qualify the bit as

non-sensitive or sensitive, respectively.TABLE 6

Statistical results of fault injection

Module Slices Utiliz-ationInjectedbit flips

Sensitivebits

Sensi-tivity

Persis-tence

P1 560 93% 39360 3165 8.6% 2.3%COMP MUX 80 83% 3936 83 2.1% 0.4%

Table 6 shows the results of fault injection campaignscarried out for the single Enhanced Lockstep schemeimplemented on Virtex-5 FPGA, involving the slice con-figuration frames of P1 and the COMP MUX (but

excluding the BRAM configuration frame of the CRB).Because usually the PRRs are larger than the requiredresources of the reconfigurable module, the Utilizationcolumn in Table 6 represents the occupation ratios ofthe related blocks. Fault injection experiments revealthat only 8.6% and 2.1% configuration bits of P1 andCOMP MUX, respectively, are sensitive. More impor-tant, only a negligible percentage of these few sensitiveconfiguration bits actually cause persistent errors2.3%and 0.4% for P1 and COMP MUX, respectively, i.e.,on average, only one out of 43 sensitive configuration

bit flips in P1 and as few as one out of 250 sensitiveconfiguration bit flips in COMP MUX causes a persistent

error (which confirms some observations made earlierby other researchers). The considerably larger sensitivityof P1 than of the COMP MUX explains why one ofthe two processors (P1 or P2) is scanned before theCOMP MUX. In particular, the fault injection engine canalso be used to find out sensitive bits of all configurationframesto optimize the scanning order: in case of error,the configuration frames can be scanned in order of theirdecreasing sensitivities, beginning with the configura-tion frame containing the most sensitive bits.

Based on the fault injection results given in Table 6, theaverage interrupted duration caused by a configuration

error in P1 is: 261.6 s 2.3% + 256.8 s (8.6 - 2.3)%= 23 s. Now consider the basic lockstep system withoutthe identification capability of the faulty core. The twoprocessor cores must be reconfigured at the same timeto eliminate the sensitive error, which requires 6 ms (2 3 ms) to complete. Therefore, the average durationof interrupt caused by a configuration error in the basiclockstep is: 6 ms 8.6% = 516 s, which is 22.4 timeslonger than in our Enhanced Lockstep. Notice also thatthis 516 s has not yet taken into account the durationof task interrupts due to checkpointing and rollback,should they be integrated in the system.

8 CONCLUSION AND PERSPECTIVES

In this paper, we have proposed a new architectureof a fault-tolerant reconfigurable system which can beimplemented on any SRAM-based FPGA with integratedsoftcore processors. An Enhanced Lockstep scheme builtusing a pair of MicroBlaze cores was proposed and

implemented on Xilinx Virtex-5 FPGA. Unlike the basiclockstep scheme, ours allows to identify the faulty coreusing a Configuration Engine which allows to recoverfrom single-event upsets (SEUs) through partial reconfig-uration combined with roll-forward recovery technique.As a result, there the problem of fault latency is al-leviated, because faults are detected immediately, oncethey cause an error. To handle permanent faults, thetiling technique was also suggested. The ConfigurationEngine was built using PicoBlaze cores and, to avoida single point of failure, was implemented as fault-tolerant using triple modular redundancy (TMR). Thescheme proposed is a valuable alternative to other fault-

tolerant softcore schemes: it requires significantly shortererror recovery time compared to conventional lockstepscheme, and uses significantly smaller number of slicescompared to known TMR-based design although at sig-nificantly longer error recovery time. The advantage oflower hardware overhead of the new scheme comparedto TMR grows steadily with the number of processors,when applied to build a fault-tolerant Multi-ProcessorSystem-on-Chip (MPSoC). The efficiency of the proposedapproach was validated through fault injection experi-ments. Finally, to deal with SEUs affecting logic whichare significantly less likely than those affecting configu-ration, low hardware and timing overhead technique of

checkpointing and rollback was suggested.We believe that the proposed system can be applied

to the new Virtex-6 devices as well (which also containthe FRAME ECC and ICAP primitives), whereas thedesign methodology proposed can be adapted to someother softcore processors like Nios, LEON, etc. The samemechanism of context recovery could be performed,

because the contexts of these softcore processors areusually contained in 32 registers as well. Should an-other processor with different number of registers betargeted, one would have only to adjust the memorysize of the Context Recovery Block. If the number of



14/15


registers increases, it could involve extended durationof the context recovery processes, which could impactthe stand-by period of the system when SEUs occur.

Future work would include developing the method-ology which would facilitate the creation of tiling con-figurations using Xilinx Design Language (XDL) rep-resentation of the circuit accompanied by RapidSmithtaking into account the sensitivity and persistence oferrors caused by faults of the configuration bits insidethe permanently defected zone. We would like also toconsider fault injection in users logic and BRAM tovalidate the significant advantages of our system indealing with SEUs directly affecting logic.

ACKNOWLEDGMENTS

The work of M.-P. Pham was supported by ConseilGeneral des Cotes dArmorCG22 and the CIFAERproject (No. ANR-07-ARFU-002-02).

REFERENCES[1] Xilinx, Inc., PowerPC 405 Processor Block Reference Guide(UG018), 2004. [Online]. Available: www.xilinx.com/support/documentation/user guides/ug018.pdf

[2] , Single-Event Upset Mitigation Selection Guide,Appl. Note XAPP987 (v1.0), Mar. 2008. [Online].Available: http://www.xilinx.com/support/documentation/application notes/xapp987.pdf

[3] , PPC405 Lockstep System on ML310, Appl. Note XAPP564v1.0.2, 29 Jan. 2007. [Online]. Available: http://www.xilinx.com/support/documentation/application notes/xapp564.pdf

[4] F. Abate et al., New techniques for improving the performance ofthe lockstep architecture for SEEs mitigation in FPGA embeddedprocessors, IEEE Trans. Nucl. Sci., vol. 56, no. 4, Pt. 1, pp. 19922000, Aug. 2009.

[5] Y. Ichinomiya et al., Improving the robustness of a softcoreprocessor against SEUs by using TMR and partial reconfigura-

tion, in Proc. IEEE Annu. Int. Symp. on Field-Programmable CustomComputing Machines, May 2010, pp. 4754.

[6] Aeroflex Gaisler AB, LEON3 Product Sheet. [Online]. Available:www.aeroflex.com/gaisler

[7] , LEON3-FT SPARC V8 Processor. [Online]. Available:www.aeroflex.com/gaisler

[8] Xilinx, Inc., SEU strategies for Virtex-5 devices, XAPP864,5 Mar. 2009. [Online]. Available: http://www.xilinx.com/support/documentation/application notes/xapp864.pdf

[9] M. Lanuzza et al., An efficient and low-cost design methodologyto improve SRAM-based FPGA robustness in space and avionicsapplications, Proc. Int. Workshop on Reconfigurable Computing:

Architectures, Tools and Applications, Lect. Notes on Comput. Sci., vol.5453, pp. 7484, 2009.

[10] B. Dutton and C. Stroud, Single event upset detection andcorrection in Virtex-4 and Virtex-5 FPGAs, in Proc. ISCA 24thInt. Conf. on Computers and Their Applications (CATA 2009), New

Orleans, Louisiana, USA, 8-10 Apr. 2009, pp. 5762.[11] S.-F. Liu et al., Increasing reliability of FPGA-based adaptive

equalizers in the presence of single event upsets, IEEE Trans.Nucl. Sci., vol. 58, no. 3, pp. 10721077, Jun. 2011.

[12] M. Alderighi et al., Evaluation of single event upset mitigationschemes for SRAM based FPGAs using the FLIPPER fault injec-tion platform, in Proc. 22nd IEEE Int. Symp. on Defect and FaultTolerance in VLSI Systems, Rome, Italy, 2628 Sep. 2007, pp. 105113.

[13] B. Pratt et al., Fine-grain SEU mitigation for FPGAs using partialTMR, IEEE Trans. Nucl. Sci., vol. 55, no. 4, pp. 22742280, Aug.2008.

[14] H. Quinn, P. Graham, and B. Pratt, An automated approach to es-timating hardness assurance issues in triple-modular redundancycircuits in Xilinx FPGAs, IEEE Trans. Nucl. Sci., vol. 55, no. 6, pp.30703076, Dec. 2008.

[15] Xilinx, Inc., Virtex-5 FPGA configuration user guide (UG191v3.6), 2009. [Online]. Available: www.xilinx.com/support/documentation/user guides/ug191.pdf

[16] , Two flows for partial reconfiguration: Module based orsmall bit manipulations (XAPP290), 2002.

[17] , Early access partial reconfiguration (v1.2), User Guide

UG208, Sept. 2008.[18] Aeroflex Gaisler AB, LEON4 32-bit processor core, 2010.[Online]. Available: www.aeroflex.com/gaisler

[19] Processor Local Bus (PLB) v4.6, data sheet.[20] K. Morgan et al., SEU-induced persistent error propagation in

FPGAs, IEEE Trans. Nucl. Sci., vol. 52, no. 6, pp. 24382445, Dec.2005.

[21] Xilinx, Inc., Correcting single-event upsets through Virtexpartial configuration, Appl. Note XAPP216 v1.0, Jun.2000. [Online]. Available: http://www.xilinx.com/support/documentation/application notes/xapp216.pdf

[22] , Virtex-5 FPGA User-Guide, UG190, v4.5, p. 383, Jan. 2009.[23] F. Sturesson, S. Mattsson, C. Carmichael, and R. Harboe-Sorensen,

Heavy ion characterization of SEU mitigation methods for theVirtex FPGA, in Proc. European Conf. on Radiation and Its Effectson Components and Systems (RADECS), 2001, pp. 285291.

[24] E. Fuller et al., Radiation testing update, SEU mitigation, and

availability analysis of the Virtex FPGA for space re-configurablecomputing, in Int. Conf. on Military and Aerospace ProgrammableLogic Devices (MAPLD), Laurel, Maryland, USA, 2628 Sep. 2000,paper P30.

[25] D. Pradhan, Fault-Tolerant Computer System Design. Upper SaddleRiver, NJ, USA: Prentice-Hall, 1996.

[26] D. K. Pradhan and N. H. Vaidya, Roll-forward and rollbackrecovery: Performance-reliability trade-off, IEEE Trans. Comput.,vol. 46, no. 3, pp. 372378, Mar. 1997.

[27] , Roll-forward checkpointing scheme: A novel fault-tolerantarchitecture, IEEE Trans. Comput., vol. 43, no. 10, pp. 11631174,Oct. 1994.

[28] H.-M. Pham, S. Pillement, and D. Demigny, A fault-tolerantlayer for dynamically reconfigurable multi-processor system-on-chip, in Proc. Int. Conf. on ReConFigurable Computing and FPGAs,Cancun, Mexico, Dec. 2009, pp. 284289.

[29] C. Lavin et al., Rapid prototyping tools for FPGA designs:

RapidSmith, in Proc. Int. Conf. on Field-Programmable Technol.(FPT), Beijing, China, 810 Dec. 2010, pp. 353356.[30] A. Kanamaru et al., Tile-based fault tolerant approach using

partial reconfiguration, Proc. Int. Workshop on Reconfigurable Com-puting: Architectures, Tools and Applications, Lect. Notes on Comput.Sci., vol. 5453, pp. 293299, 2009.

[31] W.-J. Huang and E. J. McCluskey, Column-based precompiledconfiguration techniques for FPGA fault tolerance, in Proc. Annu.Int. IEEE Symp. Field-Programmable Custom Computing Machines,2001, pp. 137146.

[32] Xilinx, Inc., Constraints guide (UG625 v13.2), Jul. 2011. [Online].Available: http://www.xilinx.com/support/documentation/swmanuals/xilinx13 2/cgd.pdf

[33] , MicroBlaze Processor Reference Guide (UG081 v10.3),2009. [Online]. Available: http://www.xilinx.com/support/documentation/sw manuals/mb ref guide.pdf

[34] A. Lesea, Continuing experiments of atmospheric neutron

effects on deep submicron integrated circuits (WP286), 2009.[Online]. Available: www.xilinx.com/support/documentation/white papers/wp286.pdf

[35] Xilinx, Inc., IP processor LMB BRAM interface controller(v3.00b), Jun. 2011.

[36] , Single error correction and double er-ror detection (XAPP645 v2.2), Aug. 2006. [Online].Available: http://www.xilinx.com/support/documentation/application notes/xapp645.pdf

[37] , Xilinx TMRTool. [Online]. Available: http://www.xilinx.com/publications/prod mktg/XTMRTool ssht.pdf

[38] , Triple module redundancy design techniques for VirtexFPGAs, Appl. Note XAPP197, v1.0.1, Jul. 2006.



15/15


Hung-Manh Pham is a postdoctoral researchassistant with CAIRN INRIA Res. Team, IRISALab. (Research Institute in Computer Scienceand Random Systems). in Lannion, France. Hereceived his M.Sc. degree in Engineering Sci-ence in 2007 from the Pierre and Marie CurieUniversity (UPMC) in Paris, France, and his PhDdegree in Signal Processing and Telecommuni-cation from the University of Rennes 1, Francein 2010. His research interests include design of

embedded systems, dynamically reconfigurablesystems and fault-tolerant computing.

Sebastien Pillement is an associate professorat IUT in Lannion, the subdivision of the Univer-sity of Rennes 1, France, since 1999. He is alsoa research member of the CAIRN INRIA Res.Team of the IRISA Lab. He received a Ph.D.degree and Habilitation degrees in ComputerScience respectively from the the University ofMontpellier II and the University of Rennes 1. Hisresearch interests include dynamically reconfig-urable architectures, system on chips, designmethodology and NoC (Network on Chip) based

circuits. He focuses his research on designing flexible and efficientarchitectures managed in real-time. He is the authoror coauthor of about100 journal and conference papers.

Stanislaw J. Piestrak is a full professor at Uni-versity of Lorraine (created in 2012 from the fu-sion of four universities, including the Universityof Metz which he joined in 2004), France. Hereceived the Ph.D. and Habilitation degrees incomputer science in 1982 and 1996, respec-tively from the Wroclaw University of Technologyand Gdansk University of Technology, both inPoland. During three academic years from 2008to 2011 he was on leave with the CAIRN INRIA

Res. Team of the IRISA Lab. in Lannion. Until2004 he was with the Institute of Engineering Cybernetics at the Wro-claw University of Technology where he became a Professor in 1999.While in Poland he was invited by various universities abroad: as aVisiting Assistant Professor at the University of Southwestern Louisianain Lafayette, USA during the academic year 1984/5 and University ofGeorgia in Athens, USA during two academic years 19857. He wasalso the JSPS visiting scientist at Tokyo Institute of Technology, Japanduring the academic year 1993/4. On numerous occasions he alsovisited several French universities including TIMA/INPG in Grenoble,University of Rennes 1/ENSSAT in Lannion, and University of Metz. Hisresearch interests include design and analysis of VLSI digital circuits,fault-tolerant computing (in particular, self-checking circuits design, cod-ing theory, and reconfigurable systems), and computer arithmetic (de-sign of RNS-based hardware for high-speed digital signal processing).


Documents

Low Overhead Fault-Tolerance Technique for Dynamically Reconfigurable Softcore Processor