
Dynamic Transient Fault Detection and Recovery for Embedded Processor Datapaths

Garo Bournoutian
University of California, San Diego
9500 Gilman Dr. #0404, La Jolla, CA [email protected]

Alex Orailoglu
University of California, San Diego
9500 Gilman Dr. #0404, La Jolla, CA [email protected]

ABSTRACT

As microprocessors continue to evolve and grow in functionality, the use of smaller nanometer technology scaling coupled with high clock frequencies and exponentially increasing transistor counts dramatically increases the susceptibility to transient faults. However, the correct and reliable operation of these processors is often compulsory, both in terms of consumer experience and for high-risk embedded domains such as medical and transportation systems. Thus, economical fault detection and recovery becomes essential to meet all necessary market requirements. This paper explores the efficient leveraging of superscalar, out-of-order architectures to enable multi-cycle transient fault-tolerance throughout the datapath in a novel manner. By using dynamic instruction execution redundancy, soft errors within the datapath are both detected and recovered. The proposed microarchitecture selectively reevaluates corrupted instructions, reducing the recovery impact by preserving completed instructions unaffected by the fault. The additional computational workload is dynamically staggered to leverage the out-of-order nature of the architecture and minimize resource conflicts and delays.

Categories and Subject Descriptors

B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance

General Terms

Design, Reliability

Keywords

fault tolerance, soft errors, reliability, embedded, datapath

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CODES+ISSS’12, October 7–12, 2012, Tampere, Finland.
Copyright 2012 ACM 978-1-4503-1426-8/12/09 ...$15.00.

1. INTRODUCTION

As today’s embedded processors continue Moore’s perpetual trend of exponentially increasing transistor counts, the issue of transient hardware faults will become more prevalent. This is exacerbated by the smaller feature sizes, high clock frequencies, and reduced voltage levels and noise margins being utilized to meet market demands. External events such as cosmic rays or ambient radiation can more easily alter voltage levels and the data represented by those levels [1]. This can lead to temporary inaccuracies within the data computations occurring within the processor, often termed single-event upsets (SEU), transient faults, or soft errors. In the past, the frequency of such transient faults was low, making fault-tolerant computers attractive only for high-risk, mission-critical domains such as medical devices, space programs, and ground and air transportation. However, future microprocessors will be more and more susceptible to transient faults as designs continue to grow in complexity and shrink in size [2, 3, 4]. As we begin to embark on 22nm technology designs, the impact of soft errors will become a significant factor in correct device operation and in the delivery of a positive consumer experience.

Furthermore, a glance into the future of embedded microprocessors will reveal an impending upheaval of present device technologies. Diminishing transistor-speed scaling and practical energy limits of current Silicon CMOS technology will soon compel a transition to a new technology scaling regime. Many arenas are currently under exploration, including carbon nanotubes, graphene, and quantum electronics. Such ground-breaking technologies will provide immensely higher device densities, but will also lead to substantially higher transient fault rates due to quantum effects, increased sensitivity to noise, and decreased fabrication tolerance [5].

As these next-generation processors are deployed into more ubiquitous arenas, such as automotive, industrial, and medical devices, the criticality of accurate computation becomes paramount. For example, a seemingly minute transient fault within a car’s anti-lock braking system may result in a fatal accident. Indeed, the emergent domain of dependable microprocessors, such as the ARM Cortex-R series, can attest to the growing need for error-resistant processors even in the embedded domain.

Additionally, while fulfilling the goal of providing this fault tolerance, care must be taken to ensure that any associated overhead in terms of cost, performance, area, and power is not prohibitive. Simply duplicating or triplicating the entire system would indeed achieve resilience, but that would be quite inefficient. In order to remain competitive, a more intelligent and frugal approach to fault detection and recovery is necessitated. It has been shown that the soft error rate (SER) for SRAM cells decreases with decreasing technology size, and that the SER of latches remains relatively constant [6]. On the other hand, the SER of combinational logic within the processor increases quite rapidly as the feature size is reduced [6] and as voltage is decreased [4]. Thus, given that memory arrays can affordably be hardened using parity, Hamming codes, or ECC since the cost of the coding logic is amortized over the array, focusing on hardening the combinational logic and state throughout the datapath will become the primary challenge.

This paper proposes a novel approach to leveraging a conventional superscalar, out-of-order architecture to efficiently enable fault-tolerance throughout the datapath, including multi-cycle SEUs. Redundancy is achieved by dynamically duplicating and independently executing each instruction within the datapath. The final result of each instruction pair is compared to detect soft errors occurring within the datapath. Upon fault detection, only the erroneous portion of the instruction chain (i.e. those instructions that were poisoned by the faulty computation) is selectively discarded. Other completed instructions that are independent and unaffected by the fault are preserved in order to reduce unnecessary re-computation time and energy. The duplicated instructions are dynamically staggered in order to leverage the out-of-order nature of the architecture and minimize resource conflicts and performance delays. This helps reduce structural hazards within the functional units, and allows for higher functional unit utilization and overall instruction throughput. We show the implementation of this architecture and provide experimental data taken over a general sample of complex, real-world benchmark applications to show the benefits of such an approach. The simulation results show full soft error recovery, while incurring only an 11% to 26% reduction in processor performance.

2. RELATED WORK

Prior research has investigated the inherent hardware redundancy provided by simultaneous multithreading (SMT) architectures. The AR-SMT architecture proposed duplicating an application into two instruction threads, one leading (A) and one trailing (R) [7]. Each instruction pair is compared and validated. Information from the leading thread is also used to assist the trailing thread with control and data predictions to reduce the performance penalty. Similarly, the authors in [8] leverage the SMT architecture, but relax the coupling between the two threads by only comparing those instructions that produce side-effects visible outside the processor core. In both these cases, fault detection is the primary contribution, with fault recovery involving much overhead.

From the perspective of chip multiprocessors (CMP), [9] proposed executing two duplicate application threads on different processor cores. In comparison to the SMT proposals, leveraging completely separate hardware in different cores obviates the possibility of a transient fault persisting long enough within a common hardware block to corrupt both threads. On the other hand, the inter-process communication to maintain and compare information across cores is costly. Similarly, the authors in [10] proposed executing two duplicate applications on different processor cores, but instead relied on the cache memory interface as the point of comparison and checkpointing to help alleviate the overheads of processor-to-processor communication.

Superscalar out-of-order (OOO) architectures have provided another appealing foundation for fault tolerance. The authors in [11] proposed duplicating instruction execution within the OOO processor, leveraging two adjacent re-order buffer (ROB) entries per instruction. Once both copies complete execution, they are compared and, if identical, committed. On the other hand, if the pair differs, then all the ROB entries that have not been committed are discarded and execution is restarted from the last committed PC value. In this manner, the inherent rewind capability of the OOO processor allows fault recovery at no additional hardware cost. In a similar vein, the O3RS architecture proposed duplicating instruction execution, but removed the requirement of using two ROB entries per instruction [12]. Instead, it employs temporal redundancy by adding a differentiating state of first versus second execution to the ROB entries. In order to be committed, instructions must go from the first state to the second state and then be compared. If they differ, the state is moved back to first and subsequent ROB entries are also invalidated. Additionally, the idea of staggering the two computational threads was introduced in [13] in order to handle resource hazards, and was shown to reduce the performance overhead.

Hybrid approaches of using an OOO processor coupled with an in-order redundancy pipeline have also been proposed. The DIVA architecture consists of a prototypical OOO processor, plus a very simple in-order checker processor [14]. Before an instruction in the OOO pipeline can be retired, its result must be compared with the result produced by the in-order checker. While this improves the throughput (since the checker closely matches the throughput of the OOO processor), it requires separate functional units to be employed within the in-order checker, which increases hardware overhead. The authors in [13] reduced this hardware overhead by allowing their in-order SHREC checker to share the same pool of functional units as the main OOO processor.

3. MOTIVATION

As mentioned in the introduction, combinational logic within the processor will increasingly become the victim of soft errors [6]. In particular, hardening the datapath of the processor is challenging due to the irregularity of the combinational logic used for computation. Whereas common fault-tolerance techniques, such as ECC, can be employed for uniform, array-like processor structures, the datapath poses additional complexity.

When discussing faults within the combinational logic of a circuit, it is important to observe that not every transient glitch will result in an SEU. The momentary voltage or current fluctuations that occur within the circuit are instead termed single-event transients (SETs). Only if the SET propagates through the circuit and results in an incorrect value being latched into a storage element does it become an SEU. Thus, the timing and longevity of the transient glitch can affect whether or not it manifests as an actual soft error or gets attenuated away. Because of this, the probability that a momentary glitch is captured as valid data in combinational logic increases linearly with frequency, since clock edges occur more often [15, 16, 17].

Thus, ultra deep submicron scaling, high clock frequencies, and reduced voltage levels and noise margins all contribute to greatly increasing the probability of transient glitches manifesting as SEUs within the processor datapath. Indeed, [18] reports that the measured alpha-particle soft error rate (SER) increases by ≈2-3x when voltage is reduced from 0.95V to 0.75V in a 32nm circuit, while [4] reports a doubling in SER when reducing voltage from 0.7V to 0.5V in 28nm. Furthermore, the rate of propagation of a glitch through the circuit is dependent on the linear energy transfer (LET) of the particle strike. LETs as low as 3 MeV·cm²/mg are capable of generating sizable transients, and an LET of 70 MeV·cm²/mg can cause durations exceeding 1ns [17]. Since many high-end embedded systems operate at multi-gigahertz frequencies, where the clock period is 0.5ns or less, this implies that some SETs can last more than one cycle when induced by a particle with large enough LET. Given this, the consideration of guarding against multi-cycle faults within the datapath is also essential.

Fault tolerance requires both fault detection and fault recovery once the fault is detected. There are numerous hardware techniques to detect whether a transient fault has occurred within a system. For example, one can simply create two duplicate systems and run them side-by-side, verifying the results at each time interval. Any differences would indicate the presence of a transient fault. While this approach would not degrade performance, the area and material cost of having two complete copies of the system is prohibitive. Furthermore, to enable fault recovery, a third system would typically be required. On the other hand, one can take a single system and execute the target application multiple times (whether in serial or in parallel), checking for differences. This temporal approach does not incur the area costs of hardware replication, yet the performance of such an approach would be poor. For embedded processors, the sensitivity to such area or performance overheads is often more pronounced and efficiency within the fault tolerant design is paramount.

In particular, there are three primary goals one should strive to attain to deliver robust and economic fault tolerance. First, given that the majority of the computations will not encounter a soft error, it is critical to minimize the performance impact of the fault detection logic for these non-faulty cases. Second, given the increased likelihood of soft errors occurring more frequently in ultra deep submicron feature sizes [2, 4], reducing the overhead of recovering from these faults is also important, both in terms of performance and scalability. Lastly, the assumption that transient faults only affect a single cycle is no longer true, necessitating the ability to handle multi-cycle faults. A novel architecture is proposed that will simultaneously achieve these three goals, intelligently leveraging a number of different design techniques to accomplish frugal multi-cycle transient fault tolerance within the datapath. An overview of this proposed architecture is discussed in the next section.

4. PROPOSED ARCHITECTURE

As previously mentioned, fault tolerant proposals often involve some form of spatial or temporal redundancy in order to detect and recover from faults. Such redundancy often incurs degradation in terms of area or performance. However, many of the inherent design properties of superscalar out-of-order (OOO) processors can be leveraged to help bridge this gap. In essence, these systems provide some duplication of hardware structures (functional units) and allow multiple independent instruction streams to execute in parallel. In order to detect transient faults, instruction redundancy can be leveraged within the OOO processor datapath. Instructions can be dynamically duplicated and independently executed. In this manner, fault detection within the datapath will simply become a matter of comparing the resulting values of the two instructions for differences. But, unlike full processor replication, many of the other hardware storage structures, such as the instruction queue, register file, caches, and TLBs, can remain shared and be hardened via other common techniques such as ECC or Hamming codes. Figure 1 shows the sphere of replication wherein the computation is duplicated. Only the computational datapath is replicated, greatly reducing the cost and area overhead of the fault detection. Furthermore, the same mechanisms in place for speculative out-of-order execution can be employed to enable fault recovery by simply raising a soft exception that rewinds execution to a valid prior state. Building on this speculative OOO architecture, the fulfillment of the three primary design goals is described in the following subsections.

Figure 1: Generic Tomasulo Architecture Showing Sphere of Replication

4.1 Improving Non-Faulty Performance

One major concern about introducing fault tolerance is the associated performance penalty incurred regardless of whether faults occur. Simplistically, in a processor that is 100% utilized, duplicating every instruction would degrade performance to 50% of the non-duplicating baseline. However, all computational resources of a processor are rarely utilized 100% continuously. In reality, many functional units may be idle at any given time. In order to reduce the impact of instruction duplication, one would ideally wish to have the duplicated instructions execute on idle functional units, and thus avoid elongating the time of the primary instruction thread.

Yet, if one simply creates two duplicate, independent execution threads, the functional unit access patterns will be identical. This will lower the likelihood of being able to perfectly interleave the secondary thread’s hardware needs with that of the first’s. The differing lengths of time each functional unit class takes (e.g. floating-point divide vs. integer add) will prevent a solution of simply offsetting the secondary thread by a fixed period of time. Furthermore, once the secondary thread begins, it may then occupy a resource that is also needed by the primary thread, which will elongate the overall time. An example of this is shown in Figure 2(a). The primary thread (shown in blue) and the secondary thread (shown in red) are independent. Assume there is only one functional unit of type A and it takes two cycles to complete, while there are two functional units of type B and they take one cycle to complete. As one can see, this would force both threads to be sequential, causing the overall time to take 6 cycles.

To rectify this inefficiency, a decoupling of the secondary thread is proposed. Duplicate secondary instructions are still created, but the source operands for each secondary instruction are connected to the corresponding primary instruction(s). Figure 2(b) shows this proposed modification. As one can see, because of this extra degree of freedom, the processor can complete these same six instructions in four cycles instead of six. The hardware utilization is improved, since the additional B-type functional units are now leveraged instead of waiting idly. In this fashion, one can ameliorate and amortize the performance overhead of instruction duplication (regardless of the presence of faults) by decoupling the secondary instructions and allowing them to execute at a lower priority in any functional units that may be available.

Figure 2: Duplicate Instruction Thread Decoupling Improves Performance

4.2 Improving Faulty Performance

As fault rates increase, optimizing the fault recovery aspect of the proposed architecture becomes more important. Assuming a transient fault has been detected, the most basic recovery approach for a speculative out-of-order processor would be to rewind the state to just before the offending instruction executes, and then restart execution from that point. This effectively throws away all computation that occurred after the offending instruction. This behavior can be quite wasteful if the number of subsequent instructions computed after the instruction in question is large, and those computed instructions are unaffected by the offending instruction.

Instead, upon fault detection, only the erroneous portion of the instruction chain (i.e. those instructions that were poisoned by the faulty computation) should be selectively discarded. Other completed instructions that are independent and unaffected by the fault are preserved in order to reduce unnecessary re-computation overhead. This not only abates resource contention within the system, but also enjoys the added benefit of reducing power consumption. Furthermore, as the occurrence of transient faults increases, the proposed architecture will be able to more robustly handle recovery without becoming overwhelmed and unable to resolve faults faster than they occur. Thus, this approach will be able to scale with increases in soft error frequency.

Figure 3: Impact of Uniformly Distributed Fault Results in Balanced Average Re-Computation Effort

In addition to employing poison bits to localize the re-computation effort and avoid invalidating unaffected work, a further optimization is proposed to try to recuperate potentially poisoned work. Given that there are two differing copies of the offending instruction, but no knowledge of which one is valid, a third computation of the offending instruction is initiated and the result of that computation acts as a vote in favor of one of the other two copies. In this manner, the third computation can select which of the two original copies is the correct one, and preserve the valid computational work that was performed in subsequent instructions. The erroneous copy of the instruction is discarded, as well as any subsequent instructions that had the corresponding poison bits.
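The voting step described above can be sketched in a few lines. This is a minimal illustration, assuming result values are plain integers; the function name `arbitrate` and the string labels it returns are hypothetical and not part of the proposed hardware.

```python
from typing import Optional


def arbitrate(primary: int, secondary: int, tertiary: int) -> Optional[str]:
    """Return which original copy the tertiary result votes for.

    Assumes the caller has already observed primary != secondary.
    Returns "primary" or "secondary" when the tertiary value matches
    that copy, or None when it matches neither copy.
    """
    if tertiary == primary:
        return "primary"
    if tertiary == secondary:
        return "secondary"
    return None


if __name__ == "__main__":
    # The tertiary result agrees with the secondary copy, so the
    # primary copy is the one treated as erroneous.
    print(arbitrate(primary=7, secondary=9, tertiary=9))
```

A `None` result corresponds to the case where the third computation itself disagrees with both copies and a further attempt would be needed.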

Assuming a uniform fault distribution, a fault could occur within a primary or secondary instruction with equal probability. Yet, if the fault occurred in the primary instruction, the poison bit setup would impact both the dependent primary instructions and the dependent secondary instructions. On the other hand, if the fault occurred in the secondary instruction, only that single instruction would be affected (since all secondary instructions pull their source operands from primary instructions). As one can observe, probabilistically these two cases balance out. Half of the time, the fault will be in a primary instruction, which would impact the entire dependency chain and require those instructions to be re-computed, as shown in Figure 3(a). The other half of the time, it will occur in the secondary instruction and that single instruction can just be discarded at that point, as shown in Figure 3(b).
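The balance between the two cases can be made concrete with a small expected-value calculation. This is a simplified model rather than a result from the paper: it assumes a primary fault forces re-execution of the offending entry plus each poisoned dependent, while a secondary fault forces no further re-execution. The function name and its parameters are hypothetical.

```python
def expected_reexecuted_entries(poisoned_dependents: int,
                                p_primary: float = 0.5) -> float:
    """Expected number of ROB entries re-executed per detected fault.

    A primary fault re-executes the offending entry plus every poisoned
    dependent entry; a secondary fault re-executes nothing further.
    The tertiary (voting) computation occurs in both cases and is not
    counted here.
    """
    if not 0.0 <= p_primary <= 1.0:
        raise ValueError("p_primary must be a probability")
    primary_cost = 1 + poisoned_dependents
    secondary_cost = 0
    return p_primary * primary_cost + (1.0 - p_primary) * secondary_cost


if __name__ == "__main__":
    # With three poisoned dependents and a uniform fault distribution,
    # the average is half the cost of the primary-fault case.
    print(expected_reexecuted_entries(3))
```

Under the uniform assumption the average cost is half of the primary-fault cost, which is the balancing effect the paragraph describes.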

4.3 Handling Multi-Cycle Faults

The main concern with multi-cycle faults is the condition where both the primary and secondary instructions are executed on the same functional unit while that unit is being perturbed by the multi-cycle fault. In this situation, both copies of the instruction may match, but both may be incorrect. Thus, the fault detection capability of the system becomes undermined. In order to rectify this situation, each functional unit must guard against being selected to compute an instruction that had the complement instruction previously executed on it. This will ensure the same functional unit will not be used by both copies of the same instruction. One challenge is if the system only has a single functional unit, as may be the case for more infrequent and expensive calculations such as floating-point division (FDIV). In these cases, since there is only a single functional unit, it must be utilized by both the primary and secondary instructions. As will be shown in Section 6, a temporal factor will be included in the guard logic to allow a single functional unit to be shared by both the primary and secondary instructions.
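One way to picture such a guard is a short per-unit history of recently executed instructions, each recorded as a ROB # plus its Type. The sketch below is illustrative only: the class name, the method names, and the rule that any differing Type for the same ROB # counts as a complement are assumptions, and expiry by elapsed time is not modeled (entries only age out as newer instructions execute on the unit).

```python
from collections import deque
from typing import Deque, Tuple

PRIMARY, SECONDARY, TERTIARY = 0, 1, 2


class FunctionalUnitGuard:
    """Tracks the last n instructions executed on one functional unit."""

    def __init__(self, n: int) -> None:
        # A bounded deque silently drops the oldest entry once full.
        self.history: Deque[Tuple[int, int]] = deque(maxlen=n)

    def can_accept(self, rob_id: int, instr_type: int) -> bool:
        """Reject an instruction whose complement ran here recently."""
        return not any(r == rob_id and t != instr_type
                       for r, t in self.history)

    def record(self, rob_id: int, instr_type: int) -> None:
        """Note that this instruction has executed on the unit."""
        self.history.append((rob_id, instr_type))


if __name__ == "__main__":
    guard = FunctionalUnitGuard(n=2)
    guard.record(5, PRIMARY)
    print(guard.can_accept(5, SECONDARY))  # complement ran here recently
    print(guard.can_accept(6, SECONDARY))  # unrelated ROB entry
```

The history depth `n` plays the role of the window during which a lingering fault could still corrupt the second copy.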

5. DESIGN DECISIONS

While early OOO implementations had a central register update unit (RUU) for bookkeeping, most modern implementations decentralize the information and have a separate reservation station, ROB, and register file to improve throughput in the presence of speculation. This latter class of speculative out-of-order architectures is derived from the Tomasulo design [19], and is the foundation of our proposed fault tolerant implementation. Furthermore, this proposal focuses on the datapath of the processor, which is more challenging to protect due to its irregularity when compared to array-like storage structures. The other processor components, such as cache memories, register files, and TLBs, are assumed to be hardened against faults using existing solutions, such as ECC, and will not be discussed further.

While the original Tomasulo architecture only scheduled instructions via the issue queue read from instruction memory, enabling faulty instructions to re-execute will require a mechanism to reintroduce a completed instruction back into the execution engine. The detailed modifications made to implement this are discussed in the next section.

Instructions that cause external side-effects, such as loads, stores, and conditional branch instructions, will not be fully duplicated. Allowing two copies of such instructions to execute can cause serious performance and correctness issues within the pipeline. For example, having a load instruction duplicated may cause undesirable cache implications, such as evicting other entries that were also active. Additionally, the bandwidth of the memory subsystem is often a limiting factor and unnecessary congestion should be avoided. Thus, instead of fully duplicating memory instructions, only the effective address calculation is hardened via this proposed duplication scheme, but the actual memory access operation occurs only once. This avoids memory consistency issues and possible performance degradation due to cache sensitivities. As modern architectures typically harden the memory via ECC or similar safeguards, it can be assumed that correct values are retrieved from or stored into memory cells with no additional effort. Similarly, having two copies of a conditional branch instruction that may be divergent will cause many issues. Given this, branch instructions are also not fully duplicated. Rather, only the conditional computation within the datapath is duplicated, but the actual modification of the program counter is only done once. Again, it is presumed that storage elements providing the target address are hardened against faults using typical memory-array fault tolerance techniques.

6. IMPLEMENTATION

The typical reservation station entry contains the following fields:

• OpCode - operation to be performed
• Vj - value of operand 1
• Vk - value of operand 2
• Qj - ROB # of operand 1 (0 if already available)
• Qk - ROB # of operand 2 (0 if already available)
• Dest - destination ROB #
• A - effective address for load/store
• Busy - indicates entry and corresponding FU occupied

Figure 4: Modifications to Reservation Station Entries

Figure 5: Modifications to Reservation Station Entries

For this proposal, an additional 2-bit field is needed per reservation station entry:

• Type - indicates if this is primary(0), secondary(1), or tertiary(2) instruction
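As a software analogy, the extended reservation station entry could be modeled as a small record. This is a sketch for illustration rather than a hardware description; the Python names (`InstrType`, `ReservationStationEntry`, `operands_ready`) are hypothetical, while the fields mirror the lists above.

```python
from dataclasses import dataclass
from enum import IntEnum


class InstrType(IntEnum):
    """Encoding of the added 2-bit Type field."""
    PRIMARY = 0
    SECONDARY = 1
    TERTIARY = 2


@dataclass
class ReservationStationEntry:
    opcode: str                   # OpCode: operation to be performed
    vj: int = 0                   # Vj: value of operand 1
    vk: int = 0                   # Vk: value of operand 2
    qj: int = 0                   # Qj: ROB # of operand 1 (0 if available)
    qk: int = 0                   # Qk: ROB # of operand 2 (0 if available)
    dest: int = 0                 # Dest: destination ROB #
    a: int = 0                    # A: effective address for load/store
    busy: bool = False            # Busy: entry and its FU are occupied
    instr_type: InstrType = InstrType.PRIMARY  # added Type field

    def operands_ready(self) -> bool:
        """Both operands are available when neither waits on a ROB entry."""
        return self.qj == 0 and self.qk == 0


if __name__ == "__main__":
    entry = ReservationStationEntry("ADD", qj=3, dest=7,
                                    instr_type=InstrType.SECONDARY)
    print(entry.operands_ready())
```

All three Type values fit in two bits, matching the field width stated above.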

The necessary modifications to the reservation station entries are shown in Figure 4, highlighted in red. Moreover, the typical ROB entry contains four fields:

• InstructionType - branch, store, or ALU/load
• Dest - register or memory address
• Value - value of instruction result
• Ready - indicates if Value is ready

However, in order to enable this proposal each ROB entry will instead contain:

• OpCode - operation to be performed
• Dest - register or memory address
• Poison - bit-vector of progenitor ROB entries
• Qj - ROB or RF # of operand 1
• Qk - ROB or RF # of operand 2
• Value[2] - value of instruction result
• Ready[2] - indicates if Value is ready

The necessary modifications for each ROB entry are shown in Figure 5, again highlighted in red. The bit numbers shown assume an ROB size of 256 entries, and can be adjusted accordingly.
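The modified ROB entry can likewise be modeled as a record with duplicated Value and Ready slots. The sketch below is an illustration under assumptions: the Poison bit-vector is held in a Python integer, the index width is derived from the 256-entry ROB size rather than read from Figure 5, and the names `RobEntry`, `both_ready`, and `mismatch` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

ROB_SIZE = 256
# Bits needed to index one ROB entry (derived, not taken from Figure 5).
INDEX_BITS = (ROB_SIZE - 1).bit_length()


@dataclass
class RobEntry:
    opcode: str                   # OpCode: operation to be performed
    dest: int                     # Dest: register or memory address
    poison: int = 0               # Poison: bit i set if ROB entry i is a progenitor
    qj: int = 0                   # Qj: ROB or RF # of operand 1
    qk: int = 0                   # Qk: ROB or RF # of operand 2
    # Slot 0 holds the primary result, slot 1 the secondary result.
    value: List[int] = field(default_factory=lambda: [0, 0])
    ready: List[bool] = field(default_factory=lambda: [False, False])

    def both_ready(self) -> bool:
        return all(self.ready)

    def mismatch(self) -> bool:
        """True once both copies have completed and their results differ."""
        return self.both_ready() and self.value[0] != self.value[1]


if __name__ == "__main__":
    entry = RobEntry("ADD", dest=4)
    entry.value, entry.ready = [10, 11], [True, True]
    print(entry.mismatch())
```

Keeping both results in one entry is what lets the comparison happen locally before commit.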

The instruction stream will dynamically be duplicated within the computational datapath, running independently on the available hardware. Figure 1 shows the sphere of replication wherein the computation is duplicated. Once an instruction is issued, two independent computations will exist for that instruction. Subsequent dependent instructions that are also duplicated will receive both of their operands from only the primary thread (or the register file, if already committed). In this manner, the additional computation overhead of the second instruction can be decoupled and executed whenever there are idle computation resources available. The pair of instructions will occupy a single ROB entry, but will account for two reservation station entries. Within a given ROB entry, the new Poison field that is added will indicate all the other ROB entries that, if erroneous, may potentially pollute this entry’s source operands and cause an error. The Value and Ready fields are duplicated to hold the results of both the primary and secondary instructions received from a functional unit. When a functional unit completes its computation, it will send the result value along with the destination ROB # and whether it was a primary or secondary instruction, allowing the correct ROB entry Value field to be updated.

Once both copies of the instruction are completed (i.e. both Ready fields in the ROB entry are true), they are compared to detect if a transient fault has occurred. If they match, the instruction is ready to be committed by writing the computed value (either from the primary or secondary Value field within the ROB) into the destination register. This commit will only occur once the ROB entry reaches the head of the buffer (to ensure in-order commit, as normally expected). Once the commit occurs, any other ROB entries that refer to the committed entry in their Qj or Qk fields will update their index to instead point to the register file destination that was committed by the ROB. In this manner, if a subsequent ROB entry needs to be re-executed (due to a mismatch within itself or from one of its progenitor instructions), the Qj and Qk values will be used to correctly repopulate a reservation station entry to allow for the re-execution of that particular ROB entry.

On the other hand, if the two values mismatch, a fault has been detected. In this situation, a third, independent copy of the mismatched instruction will be submitted as a new entry into the reservation station, utilizing the information stored locally within the ROB fields. The reservation station will treat this tertiary instruction as the highest priority, assigning it into the first available functional unit that is able to service the computation. Once the result is computed, it will be compared against both existing Value fields in the ROB for a match. If neither matches, another tertiary copy will be submitted and the process repeats until the resulting value matches one of the two existing Value fields. An optional maximum retry limit can be added to account for the case of an unrecoverable fault.

Once the tertiary result matches one of the two original copies, that copy (be it primary or secondary) will be considered correct and the other invalid. If it is the primary thread that is deemed invalid, then the offending ROB entry will broadcast its ROB # to the other ROB entries. The others will perform a simple bitwise AND against their Poison field to detect if they may have been poisoned by the fault. The offending ROB entry and all those ROB entries that were poisoned will clear their Ready fields and resubmit a new pair of entries into the reservation station for both primary and secondary instructions, utilizing the information stored locally within the ROB fields. If it is the secondary thread that is deemed invalid, no extra work is incurred and the ROB entry is now ready to be committed. The advantage of this approach is that, statistically, in half of the cases where a mismatch occurs the erroneous copy is the secondary instruction, so the ROB entries are preserved and do not need to be re-computed, reducing the performance and power overhead of fault recovery. To illustrate this concept, Figure 6 shows (a) an example data-flow graph (DFG) of six instructions, (b) the case of a fault occurring in a primary instruction resulting in the poisoning of data-dependent instructions, and (c) the case of a fault occurring in a secondary instruction which does not affect data-dependent instructions.

Figure 6: Example of Selective Poisoning in Both Primary and Secondary Threads
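The poison broadcast amounts to one bitwise AND per ROB entry. The following sketch assumes each entry's Poison vector is stored as an integer whose bit i is set when ROB entry i is a (possibly transitive) progenitor; the function name `poisoned_entries`, the dictionary layout, and the example dependency chain are hypothetical.

```python
from typing import Dict, List


def poisoned_entries(faulty_rob: int, poison: Dict[int, int]) -> List[int]:
    """Return the ROB numbers whose Poison vector names the faulty entry.

    `poison` maps a ROB # to its Poison bit-vector. The faulty entry
    itself is not included in the result; it is re-executed regardless.
    """
    mask = 1 << faulty_rob  # one-hot encoding of the broadcast ROB #
    return sorted(rob for rob, vec in poison.items() if vec & mask)


if __name__ == "__main__":
    # Entry 2 depends on 1, entry 3 depends on 2 (and so on 1),
    # entry 4 is independent of both.
    vectors = {
        2: 1 << 1,
        3: (1 << 1) | (1 << 2),
        4: 0,
    }
    print(poisoned_entries(1, vectors))
```

Only the entries returned here, together with the offending entry, would clear their Ready fields and resubmit; independent entries such as entry 4 in the example keep their completed results.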

The reservation station functional unit allocation algorithm is modified to service instructions in a fixed-priority fashion. Tertiary instructions (those that have been spawned in order to vote on the correctness of either the primary or secondary instruction) are considered the highest priority. Primary instructions are considered the next highest priority, and secondary instructions are considered the lowest priority. The reservation station will try to assign the highest-priority instruction it can find that has both of its operand values ready to the next available functional unit. If multiple equivalent-priority instructions are ready to be executed, one is selected at random. An important observation is that this priority-based allocation scheme avoids deadlock. While secondary instructions may get stuck waiting for a functional unit if many primary instructions continuously occupy the functional units, the general structure of the ROB dampens this resource starvation. No new instructions can be issued into the ROB if all the ROB entries are full. If secondary instructions are being held in the reservation stations, they preclude the retirement of their corresponding ROB entries. Thus, at some point the ROB will be full and no new primary instructions will be issued. This will cause the stream of primary instructions to wane and allow the waiting secondary instructions to execute. Once the secondary instructions finish and the ROB entries start to commit, new ROB entries will become available and the system can continue to move forward.
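The fixed-priority selection described above can be sketched in a few lines. The names `RsEntry` and `select_next`, and the numeric encoding of the instruction type, are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Assumed encoding: a lower number means a higher priority.
TERTIARY, PRIMARY, SECONDARY = 0, 1, 2


@dataclass
class RsEntry:
    rob_id: int
    kind: int              # TERTIARY, PRIMARY, or SECONDARY
    operands_ready: bool   # True once both operand values are available


def select_next(entries: List[RsEntry], rng=random) -> Optional[RsEntry]:
    """Pick the entry to send to the next available functional unit."""
    ready = [e for e in entries if e.operands_ready]
    if not ready:
        return None
    best = min(e.kind for e in ready)
    # Ties among equal-priority instructions are broken at random.
    return rng.choice([e for e in ready if e.kind == best])
```

A secondary instruction is chosen only when no ready tertiary or primary instruction exists, which is the starvation scenario that the full-ROB argument above bounds.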

The existence of multi-cycle faults necessitates some additional intelligence in the reservation stations. Utilizing the exact same functional unit for the complementary pair of instructions can allow multi-cycle faults to undermine the fault tolerance of the system. To rectify this, the reservation station must ensure that the primary and secondary instructions do not execute on the same functional unit, or at least not within a certain time period. This is accomplished by having a small table within the reservation station that keeps track of the last N instructions that were executed on each functional unit. The value of N can be determined by the likelihood of a multi-cycle fault lasting longer than N functional unit calculations. For example, if it is infeasible for a fault to last longer than 2 functional unit calculations (e.g. SUB or FDIV), then the table will contain just two entries. The entries need only hold the ROB # plus the 2 bits of Type to determine which type of instruction it was (primary, secondary, or tertiary). Additionally, to support the case of a single functional unit (e.g. for FDIV), a time-based decay is added to remove the table entries after a period of inactivity greater than the anticipated multi-cycle fault length. In this manner, full fault tolerance against multi-cycle faults is accomplished, while incurring minimal overhead in the regular operation of the entire datapath.

Table 1: Hardware Config 1
  ROB Entries                   256
  Reservation Station Entries   32 per FU type
  LSQ Entries                   64
  Instruction L1 cache          64 KB, 2-way set-associative
  Data L1 cache                 32 KB, 2-way set-associative
  Unified L2 cache              512 KB, 4-way set-associative
  Functional Unit Mix           4 IALU (1), 2 IMUL/IDIV (3/19), 2 FALU (2), 2 FMUL/FDIV (4/12)

Based on the proposed architectural modifications, each reservation station would need 2 bits per entry in addition to the preexisting 127 bits. Similarly, each ROB entry would require an additional 309 bits on top of the preexisting 67 bits. Given an architecture with 256 ROB entries and 128 total reservation station entries (for all types of functional units), this would amount to approximately 9.7 KB of additional storage elements. While this amount of data is not trivial, it is quite a reasonable price to pay to enable transient fault detection and recovery throughout the processor datapath and demonstrates the frugality of the proposed fault tolerant design.
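The storage estimate above can be checked with a short calculation. The per-entry bit counts and entry counts are taken from the text; the variable names, and the assumption that 1 KB means 1024 bytes, are ours.

```python
ROB_ENTRIES = 256          # from the text
RS_ENTRIES = 128           # total across all functional unit types
EXTRA_BITS_PER_RS = 2      # added to each reservation station entry
EXTRA_BITS_PER_ROB = 309   # added to each ROB entry

extra_bits = RS_ENTRIES * EXTRA_BITS_PER_RS + ROB_ENTRIES * EXTRA_BITS_PER_ROB
extra_bytes = extra_bits // 8
extra_kb = extra_bytes / 1024   # assuming 1 KB = 1024 bytes

print(extra_bits, extra_bytes, round(extra_kb, 1))  # 79360 9920 9.7
```

Almost all of the overhead comes from the ROB fields; the reservation station contribution is only 256 bits.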

7. EXPERIMENTAL RESULTS

In order to assess the benefit from this proposed architectural design, we utilized the SimpleScalar framework [20]. The stock code initially utilized a basic register update unit (RUU) structure, combining the reorder buffer (ROB) and reservation stations, and provided no register renaming. In order to fully exploit possible parallelism and instruction throughput, we greatly modified the default sim-outorder simulator to implement a full speculative Tomasulo architecture [19], including register renaming and decentralized reservation stations. This allows hardware instruction scheduling to cross basic block boundaries and implicitly reduces register pressure, both of which help improve IPC (instructions per cycle). Furthermore, the simulator is augmented with the fault detection and recovery scheme presented in this paper. When operating in fault-tolerant mode, all datapath instructions will be replicated and validated. For instructions with side effects, such as memory accesses and control flow operations, only the numerical computation (e.g. effective memory address calculation or branch condition comparison) will be replicated; the actual modification of the LSQ (load/store queue) or program counter will only be permitted from the primary instruction. This behavior ensures program validity; if a fault does occur, the precise exception handling of the speculative Tomasulo implementation will be able to correctly rewind the system state.

Table 2: Hardware Config 2
  ROB Entries                   128
  Reservation Station Entries   16 per FU type
  LSQ Entries                   64
  Instruction L1 cache          64 KB, 2-way set-associative
  Data L1 cache                 32 KB, 2-way set-associative
  Unified L2 cache              512 KB, 4-way set-associative
  Functional Unit Mix           4 IALU (1), 2 IMUL/IDIV (3/19), 2 FALU (2), 1 FMUL/FDIV (4/12)

A random fault injection routine was added to the simulator, allowing a configurable rate at which to randomly corrupt instructions during execution. As each instruction begins execution within a functional unit, the fault injection routine is queried to determine if that instruction should be forced to be corrupted. If so, the resulting value from the functional unit is XORed with −1, effectively flipping all the bits. Furthermore, in 20% of the instances where the fault injection routine triggers a fault, it will also mark the corresponding functional unit as faulty for 2 instructions, simulating a multi-cycle fault. In these cases, the next instruction that enters the functional unit will also be corrupted in the exact same fashion.
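The fault injection routine described above can be sketched as follows. This is not the simulator's code: the `FaultInjector` class, its parameter names, and the 32-bit result width are assumptions, and "faulty for 2 instructions" is read here as the triggering instruction plus the next one on the same functional unit.

```python
import random

WORD_MASK = 0xFFFFFFFF  # assumed 32-bit result width


class FaultInjector:
    def __init__(self, fault_rate, multi_cycle_prob=0.2,
                 multi_cycle_len=2, seed=None):
        self.fault_rate = fault_rate
        self.multi_cycle_prob = multi_cycle_prob
        self.multi_cycle_len = multi_cycle_len
        self.rng = random.Random(seed)
        self.remaining = {}  # fu_id -> further instructions to corrupt

    @staticmethod
    def corrupt(value):
        # XOR with -1 flips every bit; the mask keeps the result in word range.
        return (value ^ -1) & WORD_MASK

    def execute(self, fu_id, value):
        """Return the (possibly corrupted) result produced on unit fu_id."""
        if self.remaining.get(fu_id, 0) > 0:
            self.remaining[fu_id] -= 1      # lingering multi-cycle fault
            return self.corrupt(value)
        if self.rng.random() < self.fault_rate:
            if self.rng.random() < self.multi_cycle_prob:
                self.remaining[fu_id] = self.multi_cycle_len - 1
            return self.corrupt(value)
        return value & WORD_MASK
```

For example, `FaultInjector(1e-4, seed=1)` would correspond to the 0.01% fault rate used in the experiments, with roughly one in five injected faults persisting into the next instruction on the same unit.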

We chose two representative superscalar out-of-order system configurations to demonstrate our proposal. Table 1 defines Config 1, which is a more aggressive system employing a 256-entry ROB and 32-entry reservation stations for each functional unit type. Table 2 defines Config 2, which is more conservative and has half the number of ROB and reservation station entries, and also only a single floating-point multiplier and divider (FMUL/FDIV). Both configurations have pipelined functional units (except for the IDIV and FDIV units). The number of cycles each functional unit type consumes is listed within the parentheses in each table.

The complete SPEC CPU2000 benchmark suite [21] is used, providing 12 integer and 14 floating-point real-world applications. A listing of these benchmarks and their respective descriptions is provided in Table 3. The benchmarks are cross-compiled for the PISA instruction set using the highest level of optimization available for the language-specific compiler. The reference inputs are used for each benchmark, with each benchmark executed in its entirety from start to finish.

In order to assess our proposed architecture, we simulated the benchmarks on four different processor models: Baseline, FT No Faults, FT 1E-4 Fault Rate, and FT 1E-1 Fault Rate. The Baseline processor model is a plain speculative Tomasulo architecture [19], with no instruction replication or fault tolerance enabled. The FT No Faults processor model enables instruction replication and fault tolerance checking, but will not inject any faults. This demonstrates the overhead of our proposed fault tolerance scheme without the additional cost of recovering from faults. The FT 1E-4 Fault Rate and FT 1E-1 Fault Rate processor models also enable the fault tolerance scheme, but will inject faults at the rate of 0.01% and 10%, respectively. The experiment range is selected broadly so as to demonstrate the effectiveness of our design in enduring a variety of fault rates, including high ones.

Table 3: Description of Benchmarks

  Integer benchmarks
    bzip2      Memory-based compression algorithm
    crafty     High-performance chess playing game
    eon        Probabilistic ray tracer visualization
    gap        Group theory interpreter
    gcc        GNU C compiler
    gzip       LZ77 compression algorithm
    mcf        Vehicle scheduling combinatorial optimization
    parser     Dictionary-based word processing
    perlbmk    Perl programming language interpreter
    twolf      CAD place-and-route simulation
    vortex     Object-oriented database transactions
    vpr        FPGA circuit placement and routing

  Floating-point benchmarks
    ammp       Computational chemistry
    applu      Parabolic partial differential equations
    apsi       Meteorology pollutant distribution
    art        Image recognition / neural networks
    equake     Seismic wave propagation simulation
    facerec    Face recognition image processing
    fma3d      Finite-element crash simulation
    galgel     Computational fluid dynamics
    lucas      Number theory / primality testing
    mesa       3D graphics library
    mgrid      Multi-grid 3D potential field solver
    sixtrack   High energy nuclear physics accelerator
    swim       Shallow water modeling
    wupwise    Physics / quantum chromodynamics

Figure 7: Performance Impact of Fault Tolerance on Integer Benchmarks using Config 1

Figure 7 compares the IPC of the four processor models using Config 1 for each of the 12 integer benchmarks. As one can see, the Baseline attained an average IPC of approximately 1.77. When we enable the fault tolerance datapath replication, the average IPC falls to 1.57, which is a reduction of about 11%. Given that every computational instruction is being executed twice, such a low reduction in IPC is quite impressive. Indeed, if every functional unit were continuously utilized throughout the lifetime of the program, such instruction duplication would amount to an IPC reduction of at least 50%. However, due to the intelligent and efficient implementation of our system, the duplicated instructions are able to dynamically reclaim idle hardware resources and keep the overall IPC high while doing double the number of computations.

Furthermore, the introduction of a 0.01% fault rate does not cause any noticeable decline in IPC. In fact, the overhead of recovering from these faults is quite minimal due to the selective fault chaining that was implemented. When the primary and secondary instructions mismatch and the tertiary voter instruction determines which of the two is incorrect, only the erroneous instruction and those instructions that were poisoned by it will be thrown away and re-computed. This allows many independent instructions that were executed in between to be preserved, as opposed to just rewinding the entire ROB back to the offending instruction and starting over. Furthermore, even the re-computation of the poisoned chain is something that will be avoided 50% of the time; if the fault is determined to have occurred in the secondary instruction, no re-computation is necessary. This not only helps reduce the performance overhead of fault recovery, but also helps curb unnecessary power consumption.

This efficiency becomes even more apparent as the fault rate goes up to 10%. In the FT 1E-1 Fault Rate model, we can still see that the system is able to make forward progress, even though about 1 out of every 10 instruction executions was faulty and 20% of those faults were multi-cycle and corrupted the next instruction as well. While it may seem like an extreme case to have such a high fault rate, it is important to demonstrate the ability of the system to adeptly handle high rates of failure.

Figure 8: Performance Impact of Fault Tolerance on Floating-Point Benchmarks using Config 1

Figure 8 compares the IPC for the same four processor models using Config 1 across the floating-point benchmarks. In general, the floating-point benchmarks are able to achieve a higher IPC, as their diversity of additional floating-point instructions helps reduce the demands on the other functional units. The average IPC for the Baseline in this case was 2.54, while the FT No Faults model achieved 1.87. This reflects a reduction in IPC of approximately 26%. As one can see, the IPC reduction is more pronounced in the floating-point benchmarks. This is due to the smaller number and longer execution time of floating-point functional units, which can become a bottleneck as two copies of each instruction need to be computed. The overhead is still quite low, considering the doubling in the computational work that is taking place.

Similar to the integer benchmarks, introducing a modest amount of faults does not significantly reduce the average IPC for floating-point applications. One is able to achieve complete datapath fault tolerance, while only incurring a 26% reduction in performance. Furthermore, the system is still able to handle the high level of faults of the FT 1E-1 Fault Rate processor, demonstrating that even in the face of extreme fault occurrence the system is able to make forward progress and eventually provide correct values.
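The Config 1 reductions quoted above follow directly from the reported average IPC values. The short check below uses only numbers stated in the text; the helper name `ipc_reduction` is ours.

```python
def ipc_reduction(baseline: float, fault_tolerant: float) -> float:
    """Fractional IPC loss of the fault-tolerant model relative to Baseline."""
    return (baseline - fault_tolerant) / baseline


int_reduction = ipc_reduction(1.77, 1.57)   # integer benchmarks, Config 1
fp_reduction = ipc_reduction(2.54, 1.87)    # floating-point benchmarks, Config 1

print(round(int_reduction * 100), round(fp_reduction * 100))  # 11 26
```

Both figures are well below the 50% loss that full duplication would cost if the functional units had no idle slots to absorb the secondary instructions.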

Figure 9 and Figure 10 show the same IPC information for integer and floating-point applications, respectively, but using the more conservative Config 2. In these cases, the number of ROB entries and reservation stations is reduced by 50%. This reduces the inherent IPC by constraining the number of instructions that can be in flight, and also shrinks the effective window of how far the primary instruction thread can run ahead of the secondary instructions before having to wait for the secondary instructions to catch up. These resource limitations will also further degrade the performance when instruction replication is enabled, since the available slack in the system is more constrained and the availability of empty reservation station entries is diminished.

Figure 9: Performance Impact of Fault Tolerance on Integer Benchmarks using Config 2

As one can see, the Baseline IPC for integer and floating-point benchmarks using Config 2 was 1.53 and 1.72, respectively. When instruction replication is enabled without any faults, the IPC values are reduced by about 18% and 37% to become 1.26 and 1.09, respectively. Once fault injection is enabled, the IPC degrades slightly, as would be expected due to the instruction re-computation overhead. Furthermore, there may be an additional delay in sending the tertiary instruction to the reservation station for execution if all entries in the smaller reservation station are already occupied.

Additionally, Config 2 demonstrates the effectiveness of our proposal in situations where only a single functional unit may exist. In this configuration, only a single FMUL and FDIV functional unit is available. In the case where a multi-cycle fault occurs during an FDIV computation of a primary instruction, the system must be smart enough to guard against the corresponding secondary instruction running immediately afterward. If that situation were allowed to happen, the secondary instruction would be corrupted in an identical fashion, and the fault detection would fail to cover that case. In executing the various benchmarks on Config 2, we were able to verify that the architecture properly guarded against this situation, and only allowed a corresponding instruction to execute on the same functional unit after the necessary delay period to ensure the multi-cycle fault would be over.

Figure 10: Performance Impact of Fault Tolerance on Floating-Point Benchmarks using Config 2

In addition to the aforementioned IPC values computed for the two fault rates of 1E-4 and 1E-1, we were interested to see the impact on IPC as a result of increasing fault rates. We chose 15 different fault rates, ranging from 1E-6 to 1E-1, to examine how rapidly IPC degrades as the fault rate increases. Obviously, at some point the fault rate will outpace the ability of the system to recover, causing faults faster than they can be re-computed. As this point is approached, one would expect to see an exponential drop-off in IPC. A group of six benchmarks was chosen to explore this behavior: perlbmk, gap, and crafty from the integer suite, and sixtrack, apsi, and applu from the floating-point suite. Using the Config 1 setup, these benchmarks were executed while varying the fault rate.

Figure 11 presents the effect of the increasing fault rate on the IPC for the three integer benchmarks. As one can see, the IPC remains somewhat stable with fault rates below 1E-3 (about 1 instruction every 1,000 instructions executed). As the fault rate increases beyond that level, the degradation of IPC becomes more pronounced and eventually begins to rapidly plunge as faults become increasingly frequent. When the fault rate reaches 1E-1 (about 1 instruction every 10 instructions executed), the IPC is on a sharp trajectory to soon become stalled and unable to move forward due to overwhelming faults.

Figure 12 presents the same relationship of IPC versus fault rate, but using the three floating-point benchmarks. Similar to the integer data, the IPC remains about the same until reaching 1E-3. At that point, the IPC decays exponentially and eventually will result in the same livelock situation where faults occur faster than they can be recovered from.

Figure 11: Integer IPC versus Fault Rate


Figure 12: Floating-Point IPC versus Fault Rate

8. CONCLUSIONS

As the feature size and operating voltages of high-end embedded processors continue to shrink to satisfy consumers' insatiable demand for greater functionality and less power consumption, sensitivity to soft errors increases dramatically. In the upcoming generation of 22nm devices and below, the transient fault rate in high-performance processors will increase noticeably. Given this, the correct and reliable operation of these microprocessors will become a more prominent objective in order to satisfy consumer experience levels. Furthermore, as these next-generation processors are deployed into more ubiquitous arenas, such as automotive, industrial, and medical devices, the criticality of accurate computation becomes paramount. Moreover, it has been shown that the combinational logic of the datapath will become a major contributor of soft errors, and an efficient solution to provide fault tolerance is necessary.

In this paper, we have presented a novel and frugal fault tolerance framework for current superscalar out-of-order architectures. In particular, the system is able to account for multi-cycle SEUs, which many existing approaches do not guard against. Furthermore, by leveraging most of the existing hardware structures of the out-of-order machine, the entire datapath can be hardened against transient faults with minimal additional hardware costs. Redundancy is achieved by dynamically duplicating and independently executing each instruction within the datapath's out-of-order infrastructure. Each pair of instructions is compared to detect soft errors, and recovery is accomplished by selectively re-executing only the necessary instructions that were poisoned by the fault. A key optimization is that other completed instructions that were independent and unaffected by the fault are preserved in order to reduce unnecessary re-computation overhead. Furthermore, in order to maximize the IPC during non-faulty operation, the duplicated instructions are allowed to be dynamically staggered and utilize idle hardware functional units in an out-of-order fashion to reduce resource conflicts and improve IPC.

Extensive experimental results are provided to validate the correct operation of the architecture. Complete fault detection and recovery is achieved within the processor datapath, while only incurring moderate reductions of IPC of 11% to 26% in order to handle the replicated computations. Furthermore, the system is able to withstand very high rates of soft errors, and only begins to significantly degrade performance as the fault rate reaches 10% of instructions executed.

9. REFERENCES

[1] R. Baumann. Soft errors in advanced computer systems. IEEE Design and Test of Computers, 22(3):258–266, 2005.

[2] K. Lee, A. Shrivastava, M. Kim, N. Dutt, and N. Venkatasubramanian. Mitigating the impact of hardware defects on multimedia applications: A cross-layer approach. In Proc. Intl. Conf. on Multimedia, pages 319–328, 2008.

[3] N. Seifert, V. Ambrose, B. Gill, Q. Shi, R. Allmon, C. Recchia, S. Mukherjee, N. Nassif, J. Krause, J. Pickholtz, and A. Balasubramanian. On the radiation-induced soft error performance of hardened sequential elements in advanced bulk CMOS technologies. In Proc. Intl. Reliability Physics Symp., pages 188–197, 2010.

[4] A. Dixit and A. Wood. The impact of new technology on soft error rates. In Proc. Intl. Reliability Physics Symp., pages 486–492, 2011.

[5] J. Han, J. Gao, P. Jonker, Q. Yan, and J. A. B. Fortes. Toward hardware-redundant, fault-tolerant logic for nanoelectronics. IEEE Design and Test of Computers, 22(4):328–339, 2005.

[6] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proc. Intl. Conf. on Dependable Systems and Networks, pages 389–398, 2002.

[7] E. Rotenberg. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In Proc. 29th Intl. Symp. on Fault-Tolerant Computing, pages 84–91, 1999.

[8] S. K. Reinhardt and S. S. Mukherjee. Transient fault detection via simultaneous multithreading. In Proc. 27th Intl. Symp. on Computer Architecture, pages 25–36, 2000.

[9] M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. Transient-fault recovery for chip multiprocessors. In Proc. 30th Intl. Symp. on Computer Architecture, pages 98–109, 2003.

[10] C. Yang and A. Orailoglu. A light-weight cache-based fault detection and checkpointing scheme for MPSoCs enabling relaxed execution synchronization. In Proc. Intl. Conf. on Compilers, Architectures and Synthesis for Embedded Systems, pages 11–20, 2008.

[11] J. Ray, J. C. Hoe, and B. Falsafi. Dual use of superscalar datapath for transient-fault detection and recovery. In Proc. 34th Intl. Symp. on Microarchitecture, pages 214–224, 2001.

[12] A. Mendelson and N. Suri. Designing high-performance & reliable superscalar architectures: The out of order reliable superscalar (O3RS) approach. In Proc. Intl. Conf. on Dependable Systems and Networks, pages 473–481, 2000.

[13] J. C. Smolens, J. Kim, J. C. Hoe, and B. Falsafi. Efficient resource sharing in concurrent error detecting superscalar microarchitectures. In Proc. 37th Intl. Symp. on Microarchitecture, pages 257–268, 2004.

[14] T. M. Austin. Diva: A reliable substrate for deep submicron microarchitecture design. In Proc. 32nd Intl. Symp. on Microarchitecture, pages 196–207, 1999.

[15] S. Buchner, M. Baze, D. Brown, D. McMorrow, and J. Melinger. Comparison of error rates in combinational and sequential logic. IEEE Transactions on Nuclear Science, 44(6):2209–2216, 1997.

[16] L. Anghel and M. Nicolaidis. Cost reduction and evaluation of temporary faults detecting technique. In Proc. Intl. Conf. on Design, Automation and Test in Europe, pages 591–598, 2000.

[17] P. E. Dodd, M. R. Shaneyfelt, J. A. Felix, and J. R. Schwank. Production and propagation of single-event transients in high-speed digital logic ICs. IEEE Transactions on Nuclear Science, 51(6):3278–3284, 2004.

[18] B. Gill, N. Seifert, and V. Zia. Comparison of alpha-particle and neutron-induced combinational and sequential logic error rates at the 32nm technology node. In Proc. 47th Intl. Symp. on Reliability Physics, pages 199–205, 2009.

[19] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Fifth Edition, 2011.

[20] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. Computer, pages 59–67, 2002.

[21] SPEC CPU2000 Benchmarks. http://www.spec.org/cpu/.
