Continuous Reliability Monitoring Using Adaptive Critical Path Testing

Bardia Zandian*, Waleed Dweik, Suk Hun Kang, Thomas Punihaole, Murali Annavaram
Electrical Engineering Department, University of Southern California

Mail Code: EEB200, University of Southern California, Los Angeles, CA 90089
Tel: (213)740-3299, {bzandian,dweik,sukhunka,punihaol,annavara}@usc.edu

Abstract

As processor reliability becomes a first order design constraint, this research argues for a need to provide continuous reliability monitoring. We present an adaptive critical path monitoring architecture which provides an accurate and real-time measure of the processor's timing margin degradation. Special test patterns check a set of critical paths in the circuit-under-test. By activating the actual devices and signal paths used in normal operation of the chip, each test will capture the up-to-date timing margin of these paths. The monitoring architecture dynamically adapts testing interval and complexity based on analysis of prior test results, which increases the efficiency and accuracy of monitoring. Experimental results based on an FPGA implementation show that the proposed monitoring unit can be easily integrated into existing designs. Monitoring overhead can be reduced to zero by scheduling tests only when a unit is idle.

Keywords: Reliability, Critical Paths, Timing Margins
Submission Category: Regular Paper for DCCS

1. Introduction

Reduced processor reliability is one of the negative repercussions of silicon scaling. Reliability concerns stem from multiple factors, such as manufacturing imprecision that leads to several within-die and die-to-die variations [7, 8, 10], ultra-thin gate-oxide layers that break down under high thermal stress, Negative Bias Temperature Instability (NBTI) [2], and Electromigration. Many of these reliability concerns lead to timing degradation first and eventually to processor breakdown [7]. Timing degradation occurs extremely slowly over time and can even be reversed in some instances, such as those caused by NBTI effects. When individual device variations are taken into consideration, timing degradation is hard to predict or accurately model analytically. Most commercial products address the timing margin degradation problem by inserting a guardband at design and fabrication time. A guardband reduces the performance of the chip during its entire lifetime just to ensure correct functionality in a small fraction of time during the late stages of the chip's lifetime.

Our inability to precisely and continuously monitor timing degradation is the primary reason for over-provisioning of resources. Without an accurate and real-time measure of timing margin, designers are forced to use conservative guardbands and/or use expensive error detection and recovery methods. Processors currently provide performance, power, and thermal monitoring capabilities. These capabilities are exploited by designers and developers to understand and debug performance and power problems and design solutions to address them. In this paper, we argue that providing reliability monitoring to improve visibility into the timing degradation process is equally important to future processors as there will be an ever increasing need to monitor device reliability. Such monitoring capability enables just-in-time activation of error detection and recovery methods, such as those proposed in [1, 13, 16, 17, 24, 25].

The primary contribution of this research is to propose a runtime reliability monitoring framework that uses adaptive critical path testing. The proposed mechanism injects specially designed test vectors into a circuit-under-test (CUT) that not only measure the functional correctness of the CUT but also its timing margin. The outcomes of these tests are analyzed to get a measure of the current timing margin of the CUT. Furthermore, a partial set of interesting events from test injection results is stored in flash memory to provide an unprecedented view into the timing margin degradation process over long time scales. For monitoring to be effective, we believe, it must satisfy the following three criteria:

1. Continuous Monitoring: Unlike performance and power, reliability must be monitored continuously over extended periods of time, possibly many years.

2. Adaptive Monitoring: Monitoring must dynamically adapt to changing operating conditions. Due to differences in device activation factors and device variability, the timing degradation rate may differ from one CUT to another. Even a chip in the early stages of its expected lifetime can become vulnerable due to aggressive runtime power and performance optimizations such as operating at near-threshold voltages [15] and higher frequency operation [17, 24]. Some effects such as NBTI-related timing degradation are even reversible [2]. Thus there is a need to adapt the monitoring mechanisms to match the operating conditions at runtime.

3. Low Overhead Monitoring: The monitoring architecture should have low area overhead and design complexity. Obviously, the monitoring framework should be implementable with minimal modifications to existing processor structures, and should not in itself be a source of errors.

The reliability monitoring approach introduced in this paper is designed to satisfy the above criteria. Using the proposed monitoring mechanism has many benefits. With continuous, adaptive, and low-overhead monitoring, conservative preset guardbands can be tightened. Unlike current approaches where error detection and correction mechanisms are continuously enabled, the processor can deploy preemptive error correction measures during in-field operation only when the timing margin is small enough to affect circuit functionality. Furthermore, the unprecedented view provided by continuous monitoring will enable designers to correlate predicted behavior from analytical models with the in-field behavior and use these observations to make appropriate design changes for future processors.

The rest of this paper is organized as follows: Section 2 explains our proposed monitoring architecture and how we achieve the three desired criteria mentioned above. Section 3 shows the experimental setup used to evaluate the effectiveness of our proposed architecture. Results from our evaluations are discussed in Section 4. Section 5 describes related work and compares the proposed approach to prior work. We conclude in Section 6.

2. Reliability Monitoring Framework

The reliability monitoring framework proposed in this paper is based on the notion of critical path tests. Specially designed test vectors are stored in an on-chip repository and are selected for injection into the CUT at specified time intervals. Using outcomes from these tests, the monitoring unit measures the current timing margin of the CUT. In the following two subsections we describe in detail the architecture of the proposed reliability monitoring unit and how critical path test vectors are selected.

2.1. Architecture of the Monitoring Unit

Figure 1(a) shows the overview of the proposed Reliability Monitoring Unit (RMU) and how it interfaces with the CUT. In our current design we assume that a CUT may contain any number of data or control signal paths that end in a flip-flop, but it should not contain intermediate storage elements. The four shaded boxes in the figure are the key components of our proposed RMU. The first component is the Test Vector Repository (TVR). The TVR holds a set of test patterns and the expected correct outcomes when these test patterns are injected into the CUT. The TVR is filled once with CUT-specific test vectors at the post-fabrication phase. We describe the process for test vector selection in more detail in Section 2.2. A multiplexer, MUX1, is used to select either the regular operating frequency of the CUT or one test frequency from a small set of testing frequencies. Test frequency selection is described in Section 2.3. A second multiplexer, MUX2, on the input path of the CUT allows the CUT to receive inputs either from the normal execution trace or from the TVR. MUX1 input selection is controlled by the Freq. Select signal and MUX2 input selection is controlled by the Test Enable signal. Both these signals are generated by the Dynamic Test Control (DTC) unit.

The DTC selects a set of test vectors from the TVR to inject into the CUT. After each test vector injection the CUT output is compared with the expected correct output and a test pass/fail signal is generated. For every test vector injection an entry is filled in the Reliability History Table (RHT). Each RHT entry stores a time stamp of when the test was conducted, the test vector index, the testing frequency, the pass/fail result, and environment conditions such as CUT temperature. The RHT is implemented as a two-level FIFO structure where the first level (RHT-L1) stores only the most recent test injection results in an on-die SRAM structure. The second-level RHT (RHT-L2) is implemented in a flash memory that can store test history information over multiple years and can be implemented in the same package but on a different die, similar to multi-chip packages. While RHT-L1 stores a complete set of prior test injection results within a small time window, RHT-L2 stores only interesting events, such as test failures and excessive thermal gradients. The DTC reads only RHT-L1 data to determine when to perform the next test as well as how many circuit paths to test in the next test phase.
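To make the RHT organization concrete, the following is a minimal Python sketch of the two-level history table described above. The field names, the 256-entry L1 depth (borrowed from the FPGA implementation in Section 3.1), and the thermal threshold are our own illustrative choices, not the paper's RTL.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RHTEntry:
    timestamp: int        # cycle count when the test was conducted
    vector_index: int     # index of the test vector in the TVR
    test_freq_mhz: float  # frequency at which the vector was injected
    passed: bool          # pass/fail outcome of the output comparison
    temperature_c: float  # operating condition sampled at test time


class ReliabilityHistoryTable:
    """Two-level FIFO: RHT-L1 keeps a small window of all results on-die;
    RHT-L2 (flash-backed) keeps only interesting events such as failures."""

    def __init__(self, l1_depth: int = 256, thermal_limit_c: float = 85.0):
        self.l1 = deque(maxlen=l1_depth)   # oldest entry overwritten when full
        self.l2 = []                       # stands in for the flash-resident log
        self.thermal_limit_c = thermal_limit_c

    def record(self, entry: RHTEntry) -> None:
        self.l1.append(entry)
        if self._is_interesting(entry):
            self.l2.append(entry)

    def _is_interesting(self, entry: RHTEntry) -> bool:
        # Failures and excessive thermal readings are logged long-term.
        return (not entry.passed) or entry.temperature_c > self.thermal_limit_c

    def recent(self, n: int):
        """The DTC reads only the most recent L1 entries to pick the next
        test interval and complexity."""
        return list(self.l1)[-n:]
```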

2.2. Test Vector Selection

Figure 1. (a) RMU (b) Reduced Guardband

Prior studies using industrial CPU designs, such as the Intel Core 2 Duo processor [3], show that microarchitectural block designs often result in three groups of circuit paths. The first group contains a few paths (< 1%) with zero timing margin; the second group contains several paths (about 10%) with less than 10% timing margin, followed by a vast majority of paths (about 90%) with a large timing margin. Verification and test engineers spend a significant amount of effort to analyze the first two sets of critical paths and generate test vectors to activate these paths for pre- and post-fabrication testing purposes. Obviously, these paths are the ones with the least amount of timing margin and are most susceptible to timing failures. Even in the absence of manual effort to identify these paths, standard place-and-route tools can also classify the paths into categories based on their timing margin and then generate test vectors for activating the paths. We propose to exploit the effort already spent either by the test designers or the CAD tools to create critical path test vectors. The TVR is thus initially filled with test vectors that test paths with less than 10% timing margin. Based on results shown in [3], even for very large and complex CUTs such as cache controllers, the TVR may store on the order of 50-100 test vectors. In our current implementation, the TVR stores vectors in sorted order of their timing criticality, filled once during the post-fabrication phase. Compression techniques can be used to further reduce the already small storage needs of the TVR.

The DTC makes decisions regarding which vectors to inject and when to inject them. Note that the minimum number of test vectors needed for one path to be tested for a rising or a falling transition is two. Instead of using just the two input combinations that exercise the most critical path in the CUT, multiple test vectors that exercise a group of critical paths in the CUT are used in each test phase. Thus, the test vectors used in each test phase are a small subset of the vectors stored in the TVR. This subset is dynamically selected by the DTC based on the history of test results and the CUT's operating condition. Initially the DTC selects test vectors in order, from the most critical (slowest) path at design time to less critical paths. As path criticality changes during the lifetime of the chip, cases might be observed where paths that were initially thought to be faster are failing while the expected slower paths are not. Hence, the order of the critical paths can be dynamically updated by the DTC by moving the failing input patterns to the top of the test list. To account for the unpredictability in device variations, the DTC also randomly selects additional test vectors from the TVR during each test phase, making sure that all the paths in the TVR are checked frequently enough. One simple test vector selection approach is to use the top eight critical path test vectors stored in the TVR followed by two test vectors that are selected in a round-robin fashion from the remaining test vectors in the TVR. This multi-vector test approach allows more robust testing since critical paths may change over time due to different usage patterns, device variability, and differences in the number of devices present on each signal path.
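The selection heuristic described above can be expressed compactly. The sketch below is our illustration of the "top eight critical vectors plus two round-robin vectors" policy with failing patterns promoted to the head of the list; class and method names such as `TestVectorRepository` and `select_for_phase` are ours, not the paper's.

```python
class TestVectorRepository:
    """Vectors are kept in order of timing criticality (slowest path first)."""

    def __init__(self, vectors):
        self.vectors = list(vectors)   # filled once, post-fabrication
        self._rr_cursor = 0            # round-robin position over the tail

    def promote_failures(self, failed_indices):
        # Failing patterns move to the top of the test list so the most
        # vulnerable paths are exercised first in the next phase.
        failed = [v for i, v in enumerate(self.vectors) if i in failed_indices]
        rest = [v for i, v in enumerate(self.vectors) if i not in failed_indices]
        self.vectors = failed + rest

    def select_for_phase(self, top_k: int = 8, extra: int = 2):
        """Top-k most critical vectors plus a few round-robin picks from the
        remainder, so every TVR entry is eventually re-checked."""
        chosen = list(self.vectors[:top_k])
        tail = self.vectors[top_k:]
        for _ in range(min(extra, len(tail))):
            chosen.append(tail[self._rr_cursor % len(tail)])
            self._rr_cursor += 1
        return chosen
```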

While our approach ensures that dynamic variations of critical paths are accounted for, the fundamental assumption is that a non-critical path that is not tested by vectors in the TVR will not fail while all the critical paths that are tested through the TVR are still operational. It is important to note that due to the physical nature of the phenomena causing aging-related timing degradation, the probability of a path which is not in the TVR failing before any of the circuit paths that are being tested is extremely low. Studies on 65nm devices show that variations in threshold voltage due to NBTI are very small and gradual over time. In particular, the probability of a large variation in threshold voltage which results in large path delay changes is nearly zero [18, 14]. Thus the probability of an untested path failing is the same as the probability of a path with timing margin greater than the guardband suddenly deteriorating to the extent that it violates timing. Chip designers routinely make the assumption that such sudden deterioration is highly unlikely while selecting the guardband. Hence we believe our fundamental assumption is acceptable by industrial standards.

2.3. Test Frequency Selection

In order to accurately monitor the remaining timing margin of a path in the CUT, a path gets tested at multiple test frequencies above its nominal operating frequency. The highest frequency at which the path passes a test provides us with information regarding the current level of timing degradation that the path has experienced. The test frequency range is selected between the nominal operating frequency and the frequency without a guardband. This range is then divided into multiple frequency steps. When using the proposed RMU, it would be possible to reduce the default guardband to a smaller value while still meeting the same FIT goal. This is shown in Figure 1(b), where (Tclk1 = 1/fclk1) > (Tclk2 = 1/fclk2) > ∆Dinit. Tclk1 is the original clock period when using the default guardband and Tclk2 is the clock period with a reduced guardband (RGB). Note that in either case the initial path delay of the CUT is the same, ∆Dinit. The purpose of the RMU is to continuously monitor the CUT and check if the CUT timing is encroaching into the reduced guardband. For instance, if ∆Dinit is 0.33 nanoseconds (3 GHz) and the RGB is 10%, then the nominal operating cycle time of the CUT will be 0.36 nanoseconds (2.7 GHz). Since the CUT is operating at 2.7 GHz with a 10% timing margin, the RMU uses the RGB of 10% as the test frequency range. The test frequency range is then divided into 10 equal frequency steps between 2.7 GHz and 3 GHz. For each test vector selected by the DTC, the CUT is tested at each of the 10 test frequency steps to measure the current timing margin of the CUT. It should be noted that using the RGB instead of the more conservative guardband is not necessary for correct operation of the proposed monitoring unit; the RGB is a performance improvement that is made possible by the RMU. If this potential opportunity to increase the operating frequency is not taken advantage of, the default processor guardband can be used in place of the RGB in all of the above discussion.
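For the running example above (nominal 2.7 GHz, guardband-free 3 GHz, 10 steps), the test frequency ladder can be derived as in the sketch below; the function name and the use of the paper's rounded frequencies are our own choices.

```python
def test_frequency_steps(f_nominal_ghz: float,
                         f_no_guardband_ghz: float,
                         steps: int = 10):
    """Divide the reduced-guardband range into equal frequency steps;
    each selected test vector is injected at every step."""
    delta = (f_no_guardband_ghz - f_nominal_ghz) / steps
    return [f_nominal_ghz + i * delta for i in range(1, steps + 1)]


# Example from the text: 2.7 GHz nominal with a 10% RGB, 3 GHz without a guardband.
print(test_frequency_steps(2.7, 3.0))   # 2.73, 2.76, ..., 3.0
```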

Our approach uses a narrow testing frequency range which is subdivided into multiple frequency steps. Providing such small frequency steps is not a difficult challenge. Prior research has shown that it is feasible to provide internal clocking capabilities in increments of 2 MHz with limited design effort [26]. Alternatively, an aging-resilient delay generator such as the one proposed in [1] can be used with minimal area and power overhead.

2.4. DTC and Opportunistic Tests

The DTC is the critical decision-making unit that determines the interval between tests (called the interval throughout the rest of the paper) and the number of test vectors to inject (called the complexity) during each test phase. These two design choices trade off increased accuracy against decreased performance due to testing overhead.

The DTC reads the most recent RHT entries to decide the interval and complexity of the next testing phase. The 100 most recent RHT entries are analyzed by the DTC to see if any tests have failed during the testing phase. Each failed test entry indicates which test vector did not produce the expected output and at what testing frequency. The DTC then selects the union of all the failed test vectors to be included in the next phase of testing. If no prior tests have failed, the DTC simply selects a set of test vectors from the top of the TVR. We explore two different choices for the number of vectors selected in our results section.

Once the test complexity has been determined, the DTC then decides the test interval. There are two different approaches for selecting when to initiate the next test phase. In the first approach the DTC initially selects a large test interval, say 1 million cycles between two test phases, and then the DTC dynamically alters the test interval to be inversely proportional to the number of failures seen in the past few test phases. For instance, if two failures were noticed in the last eight test phases then the DTC decreases the new test interval to be half of the current test interval.

An alternate approach to determining the test interval is opportunistic testing. In this approach the DTC initiates a test injection only when the CUT is idle, thereby resulting in zero performance overhead. Current microprocessors provide multiple opportunities for testing a CUT. We describe a few such opportunities. On a branch misprediction the entire pipeline is flushed and instructions from the correct execution path are fetched into the pipeline. Since the newly fetched instructions take multiple cycles to reach the back-end, the execution, writeback, and retirement stages of the pipeline are idle waiting for new instructions. When a long-latency operation such as an L2 cache miss is encountered, even aggressive out-of-order processors are unable to hide the entire miss latency, thereby stalling the pipeline. On an I-cache miss the fetch stage of the pipeline is stalled waiting for the cache miss, which provides an opportunity to test the fetch stage of the pipeline. In addition to these common events, there are several rare events such as exceptions that cause a pipeline flush and provide testing opportunities. Finally, computer system utilization rarely reaches 100%, and the idle time between two utilization phases provides an excellent opportunity to test any CUT within the system. We quantify the distance between idle times and the duration of idle time for a select set of microarchitectural blocks in the experimental results section.

The DTC can automatically adapt to the reliability needs of the system. For a CUT which is unlikely to have failures during the early stages of its in-field operation, test interval and complexity can be reduced. Even if an opportunistic testing approach is used, one can simply ignore most opportunities when the CUT is working well and use only a few opportunities for testing. As the CUT ages or when the CUT is vulnerable due to low power settings, the DTC can increase testing. Note that the time scale for the test interval is extremely long; for instance, NBTI-related timing degradation occurs only after many seconds or minutes of intense activity in the CUT. Hence the testing interval will be on the order of seconds even in the worst case. We explore performance penalties for different selections of test intervals and test complexities in the experimental results section.

2.5. Design Issues

We would like to note that there are design alternatives to several of the RMU components described. For implementing variable test frequencies, there are existing infrastructures for supporting multiple clocks within a chip. For instance, DVFS is supported on most processors for power and thermal management. While the current granularity of scaling frequency may be too coarse, it is possible to create much finer clock scaling capabilities as described in [26].

The RHT-L2 size can also be quite small. For instance, each test vector generates 2 bytes of data in our current implementation, which can be reduced further by using compression techniques. Since testing will be done rarely, say once every minute, even a small 64KB RHT-L2 can store nearly 45 days worth of continuous test data. By selectively storing only failed test vectors in the RHT, the storage capacity can be significantly prolonged to store multiple months of data using a relatively small flash.

While in our description each RMU is associated with one CUT, it is possible to have multiple CUTs share a single RMU. This RMU can maintain multiple RHTs and TVRs for different CUTs and use a single DTC logic block to make the testing decisions. In addition, a system-wide DTC can combine data from multiple local RHTs in a hierarchical manner to make an appropriate global testing decision.

Note that with opportunistic testing a CUT is never taken offline. Rather, we use the idle time to inject test vectors. Even if we do not use opportunistic testing, we should emphasize that testing the CUT does not in itself lead to a noticeable increase in aging of the CUT. The percentage of time a CUT is tested is negligible compared to the normal usage time. For instance, if the CUT is tested once a second with 1000 tests (an extremely aggressive testing scenario) for a system which has a 1 GHz clock, our approach increases the CUT usage by 0.0001%. This is a negligible penalty we pay for providing continuous monitoring.

3. Experimental Methodology

Our proposed approach is not an error-correcting mechanism. It provides monitoring capabilities which may in turn be used to more effectively apply prior error detection or correction mechanisms. As such, monitoring by itself does not improve the FIT rate. The following are the aims of our experimental design: First, the area overhead and design complexity of the RMU are measured using a close-to-industrial design implementation of the RMU and all the associated components. Such a design also provides us an opportunity to verify that the RMU does not negatively impact the CUT performance during its normal operation. Second, the state space of DTC and RHT interactions is explored and different test interval and test complexity parameters are studied. Third, the overhead of monitoring is measured in terms of how much monitoring may reduce performance. As described earlier, the overhead of testing can be reduced to zero if testing is done opportunistically. We use software simulation to measure the duration of each opportunity (in cycles) and the distance between two opportunities. In all our experiments, instead of waiting for the occurrence of rare stochastic events, the effects of timing degradation are deliberately accelerated to measure the overheads; the results produced under this assumption show only the worst-case overhead of monitoring compared to a real implementation.

3.1. FPGA Emulation Setup

We modeled the proposed RMU design, including the TVR, DTC, and RHT, using Verilog HDL. We then selected a double-precision Floating Point Multiplier Unit (FPU) as the CUT to be monitored. The FPU selected in our design implements the 64-bit IEEE 754 floating point standard. It is a single-stage combinational block that contains thousands of circuit paths. Hence, the FPU gives us credible enough results due to its size and complexity. We focus most of our experimental design effort on the implementation of the RMU.

RHT-L1 is implemented as a 256-entry table using FPGA SRAMs. During each test phase the test vector being injected, the testing frequency, the test result, and the operating conditions are stored as a single entry in RHT-L1. When a test fails the test result is also sent to RHT-L2, which is implemented as an unlimited table on CompactFlash memory. When RHT-L1 is full the oldest entry in RHT-L1 is overwritten; hence, the DTC can only observe at most the past 256 tests for making decisions on selecting the test interval. In our implementation the DTC reading results from the RHT does not impact the clock period of the CUT and does not interfere with the normal operation of the CUT. Furthermore, reading the RHT can be done simultaneously while test injection results are being written to the RHT.

One important issue that needs to be addressed is the difference which exists between the FPU's implementation on the FPGA as compared to an ASIC (Application Specific Integrated Circuit) synthesis. The critical paths would be different between these two implementations. These differences would also be observed between instances of ASIC implementations of the same unit when different CAD tools are used for synthesis and place-and-route (PAR), or between different fabrication technology nodes. However, to maximize implementation similarity to ASIC designs, the FPU is implemented using fifteen DSP48E slices which exist on the Virtex 5 XC5VLX100T FPGA chip. These DSP slices use dedicated combinational logic within the FPGA rather than the traditional SRAM LUTs used in conventional FPGA implementations. Using these DSP slices increases the efficiency of the FP multiplier and makes its layout and timing characteristics more similar to an industrial ASIC implementation. Recall that in our RMU each CUT will have a corresponding TVR that stores test vectors that exercise the most critical paths specific to that CUT. Since the TVR stores CUT-specific test vectors, the potential differences in critical paths between different implementations do not negatively impact any of our qualitative observations and results presented here.

Section 2.2 described an approach for selecting test vectors to fill the TVR by exploiting the designer's effort to characterize and test critical paths. For the FPU selected in our experiments we did not have access to any existing test vector data. Hence, we devised a simple approach to generate test vectors using benchmark-driven input vector generation, as described in this section. We selected a set of 5 CPU2K floating point benchmarks, Applu, Apsi, Mgrid, Swim, and Wupwise, and generated a trace of floating point multiplies from these benchmarks by running them on Simplescalar [4] integrated with Wattch [11] and Hotspot [21]. The simulator is configured as a 4-way issue out-of-order processor with a Pentium-4 processor layout. The first 300 million instructions in each benchmark were skipped and then the floating point multiplication operations in the next 100 million instructions were recorded. The number of FP multiplies recorded for each trace ranges from 4.08 million (Wupwise) to 24.39 million (Applu). Each trace record contains the input operand pair, the expected correct output value, and the temperature of the chip; we will shortly explain how the temperature is used to mimic processor aging. These traces are stored on a CompactFlash memory that is accessible to the FPU.

Initially the FPU is clocked at its nominal operating frequency reported by the CAD tool and all input traces produce correct outputs. The FPU is then progressively overclocked in incremental steps to emulate the gradual timing degradation that the FPU experiences during its lifetime. We then feed all input traces to the FPU at each of the incremental overclocked steps. At the first overclocking step above the nominal operating frequency, input operand pairs that exercise the longest paths in the FPU will fail to meet timing. At the next overclocking step more input operand pairs that exercise the next set of critical paths will fail, in addition to the first set of failed input operand pairs.

In our design, the FPU's nominal clock period is 60.24 nanoseconds (16.6 MHz), where all the input operand pairs generate the correct output. We then overclocked the FPU from a 60.24 nanosecond clock period to a 40 nanosecond clock period in equal decrements of 6.66 nanoseconds. These correspond to 4 frequencies: 16.6 MHz, 18.75 MHz, 21.42 MHz, and 25 MHz. The clock period reduction in each step was the smallest decrement we could achieve on the FPGA board. The percentage of paths failing to meet timing gradually increases at each step and the fail rate increases dramatically above 25 MHz. The percentage of FP multiply instructions that failed as the test clock frequency was increased is shown in Figure 2. Results show the number of input vectors that failed to produce correct output for one CUT but with multiple benchmarks. When the CUT is initially overclocked by a small amount the number of input vectors that fail is relatively small. These are the vectors that activate the critical paths with the smallest timing margin. However, as the CUT is overclocked beyond 21.42 MHz (essentially simulating a large timing degradation), not surprisingly, almost all input vectors fail across all benchmarks. The knee in the curves indicates that almost all paths in the FPU have combinational delays that would not be met at frequencies above 21.42 MHz and hence nearly all input test vectors will fail at that frequency. Finally, as to be expected, CUT timing violations are mostly independent of the benchmarks. Hence, we do not need to select a large number of benchmarks to demonstrate the results of the RMU.

Figure 2. Timing Margin Degradation

We selected the top 1000 failing input operand pairs as the test vectors to fill the TVR. Since the minimum clock period decrease is 6.66 nanoseconds, the number of input test vectors that fail increases in very large increments with each overclocking step, and we were forced to fill the TVR with many more test vectors than needed. Input test vectors could be differentiated more precisely with finer changes to the clock frequency, thereby reducing the size of the TVR.
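The benchmark-driven characterization described above amounts to replaying the recorded operand pairs at each overclocked step and remembering the first step at which each pair mis-computes. The sketch below is our reconstruction of that flow under stated assumptions; `fpu_output_at` is a hypothetical hook standing in for running the FPGA-hosted FPU at a given clock period.

```python
def characterize_test_vectors(trace, overclock_periods_ns, fpu_output_at,
                              tvr_size=1000):
    """Replay every traced operand pair at each progressively shorter clock
    period and record the longest period at which it first fails. Pairs that
    fail earliest exercise the most critical paths and become TVR entries."""
    first_fail = {}   # (a, b) -> longest clock period (ns) at which it failed
    for a, b, expected in trace:
        for period in sorted(overclock_periods_ns, reverse=True):
            if fpu_output_at(a, b, period) != expected:
                first_fail[(a, b)] = period
                break
    # Rank by earliest failure (largest failing period first), keep the top entries.
    ranked = sorted(first_fail.items(), key=lambda kv: -kv[1])
    return [pair for pair, _ in ranked[:tvr_size]]
```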

3.2. Three Scenarios for Monitoring

We emulated three scenarios for measuring the monitoring overhead. The three scenarios are: Early-stage monitoring, Mid-stage monitoring, and Late-stage monitoring.

Early-stage monitoring: In early-stage monitoring we emulated the conditions of a chip which has just started its in-field operation, for instance, the first year of the chip's operation. Since the amount of timing degradation is relatively small, it is expected that the CUT will pass all the tests conducted by the RMU. To emulate this condition, our RMU tests the CUT at only the nominal frequency, which is 16.6 MHz in our experimental setup. The test vectors injected into the CUT do not produce any errors and hence the DTC does not change either the test interval or the test complexity. We explored two different test intervals, where tests are conducted at intervals of 100,000 cycles and 1,000,000 cycles. At each test injection phase we also explored two different test complexity settings. We used a test complexity of either 5 test vectors or 20 test vectors for each test injection phase.

Mid-stage monitoring: In mid-stage monitoring we emulated the conditions when the chip's timing margin has degraded just enough that some of the circuit paths that encroach into the reduced guardband (RGB as described in Section 2.3) of the CUT start to fail. These failures are detected by the RMU when it tests the CUT with a test frequency near the high end of the test frequency range. We emulated this failure behavior by selecting the frequency for testing the CUT to be higher than the nominal operating frequency of the CUT. In our experiments we used 18.75 MHz for testing, which is 12.5% higher than the nominal frequency. Since this testing frequency mimics a timing degradation of 12.5%, some test vectors are bound to fail during CUT testing. In mid-stage monitoring the DTC's adaptive testing is activated, where it dynamically changes the test interval. In our current implementation the DTC simply computes the number of failed tests seen in the last 8 test injections and then uses that count to determine when to activate the next test phase. We explored two different test interval selection mechanisms. The first scheme uses a linear decrease in the test interval as the fail rate of the tests increases. The maximum (initial) test interval selected for the emulation is 100,000 cycles and this is reduced in equal steps down to 10,000 cycles when the fail rate is detected as being 100%. For instance, when the number of test failures in the last eight tests is zero, the DTC makes a decision to initiate the next test after 100,000 cycles. If one out of the eight previous tests has failed then the DTC selects 88,750 cycles as the test interval (100,000 - (90,000/8)). An alternative scheme uses an initial test interval of 1 million cycles; as the error rate increases, the test interval is reduced in eight steps by halving the interval at each step. For instance, when the number of test failures in the last eight tests is zero, the DTC makes a decision to initiate the next test after 1,000,000 cycles. If two out of the eight previous tests have failed then the DTC selects 250,000 cycles as the test interval (1,000,000/2^2). The exponential scheme uses more aggressive testing when the test fail rate is above 50% but significantly more relaxed testing when the fail rate is below 50%. These two schemes provide us an opportunity to measure how different assumptions about DTC adaptation policies translate into testing overheads. Note that many of the parameters mentioned here, such as test interval, test complexity, and the number of RHT entries to analyze, are design space options.
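With the constants stated above, the two interval-adaptation schemes reduce to the small functions below; the function names are ours, and the assertions reproduce the worked examples from the text.

```python
def linear_interval(failures_in_last_8: int) -> int:
    """Linear scheme: 100,000 cycles at zero failures, shrinking in equal
    steps to 10,000 cycles when all of the last eight tests failed."""
    return 100_000 - failures_in_last_8 * (90_000 // 8)


def exponential_interval(failures_in_last_8: int) -> int:
    """Exponential scheme: start at 1,000,000 cycles and halve the interval
    once for every failure seen in the last eight tests."""
    return 1_000_000 // (2 ** failures_in_last_8)


# Worked examples from the text.
assert linear_interval(0) == 100_000 and linear_interval(1) == 88_750
assert linear_interval(8) == 10_000
assert exponential_interval(0) == 1_000_000 and exponential_interval(2) == 250_000
```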

Late-stage monitoring: In the late-stage monitoring scenario the chip has aged considerably. The timing margin has shrunk significantly, with many paths in the CUT operating with limited timing margin. In this scenario, path failures detected by the RMU are not only dependent on the test frequency but also on the surrounding operating conditions. For instance, a path that has tested successfully across the range of test frequencies under one operating condition may fail when the test is conducted under a different operating condition. The reason for the prevalence of such behavior is the non-uniform sensitivity of the paths to effects such as Electromigration and NBTI that occur over long time scales.

Figure 3. Dynamic Adaptation of RMU

The goal of our late-stage monitoring emulation is to mimic the pattern of test failures discussed above. One reasonable approximation to emulate late-stage test failure behavior is to use the temperature of the CUT at the time of testing as a proxy for the operating condition. Hence, at every test phase we read the CUT temperature, which has been collected as part of our trace, to emulate delay changes. For every 1.25 °C rise in temperature the baseline test clock period is reduced by 6.66 nanoseconds. We recognize that this is an unrealistic increase in test clock frequency for a small temperature increase. However, as mentioned earlier, our FPGA emulation setup restricts each clock period change to a minimum granularity of 6.66 nanoseconds. This assumption forces the RMU to test the CUT much more frequently than would occur in a real-world scenario. The adaptive testing mechanism of the DTC is much more actively invoked in the late-stage monitoring scenario.
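A minimal sketch of the temperature-to-clock-period mapping used in this emulation follows. The 60.24 ns baseline, the 40 ns floor, and the reference temperature `T_REF_C` are our assumptions about how the stated 1.25 °C / 6.66 ns rule is anchored; the paper does not give these anchors explicitly.

```python
BASELINE_PERIOD_NS = 60.24   # nominal FPU clock period in the emulation (assumed anchor)
MIN_PERIOD_NS = 40.0         # shortest period reachable on the FPGA board
T_REF_C = 45.0               # assumed reference temperature for the trace


def emulated_test_period_ns(temperature_c: float) -> float:
    """Shorten the baseline test clock period by 6.66 ns for every full
    1.25 C rise above the reference temperature, clamped to the board's
    minimum achievable period."""
    steps = max(0, int((temperature_c - T_REF_C) // 1.25))
    return max(MIN_PERIOD_NS, BASELINE_PERIOD_NS - 6.66 * steps)
```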

3.3. Dynamic Adaptation of RMU

Figure 3 shows how the DTC dynamically adapts the test interval to varying wearout conditions in the late-stage monitoring scenario with TC=20, using the linear test interval adaptation scheme. The data in this figure represent a narrow execution slice from Apsi running on the FPGA implementation of the FPU which is being monitored by the RMU. The horizontal axis shows the trace record number, which represents the progression of execution time. The first plot from the bottom shows the test fail rate seen by the DTC. Recall that the DTC reads the last 8 test injection results from the RHT and counts the number of failures. The highlighted oval area shows a dramatic increase in the test fail rate, and the middle plot shows how the DTC reacts by dynamically changing the interval between Test Enable signals. Focusing again on the highlighted oval area in the middle plot, it is clear that the test interval has been dramatically reduced since the gap between successive test phases has shrunk. The top plot zooms in on one test phase to show 20 back-to-back tests that were injected into the CUT, corresponding to our TC=20 setting. Figure 3 confirms that our proposed RMU design dynamically adapts to changing wearout conditions.

4. Experimental Results

In this section, results related to the area and design overhead of the RMU, in addition to the performance overhead of monitoring, are presented. Opportunities to perform tests without interrupting the normal operation of the processor are studied in Section 4.3.


Figure 4. Overhead of (a) Linear (b) Exponential Schemes

4.1. Area Overhead of RMU

The RMU and FPU implemented on the FPGA utilize 4994 FPGA slices, out of which 4267 are used by the FPU and 727 slices are used for the RMU implementation. Out of the 8818 SRAM LUTs used in our design, the RMU consumes 953 LUTs while the remaining 7865 LUTs are used by the FPU. The FPU also uses 15 dedicated DSP48E slices for building the double-precision FP multiplier, while only one DSP48E slice is used by the RMU logic. Note that the overall area overhead of monitoring can be reduced by increasing the granularity of the CUTs being monitored, say by increasing the CUT size from the FPU to a whole execution pipeline stage of a core. Since the RMU size stays relatively constant irrespective of the CUT being monitored, the strength of the proposed scheme is this ability to adjust the CUT size so as to make the RMU area overhead acceptable for any design.

4.2. Monitoring Overhead

Figure 4(a) shows the execution time overhead of testing compared to the total execution time of the benchmark traces. The horizontal axis shows the three monitoring scenarios: Early-stage (labeled Early), Mid-stage (Mid), and Late-stage (Late) monitoring. The vertical axis shows the number of test injections as a percentage of the total trace length collected for each benchmark. The DTC uses a linear decrease in the test interval from 100,000 cycles to 10,000 cycles depending on the test fail rates. Results with test complexity (TC) of 5 and 20 test vectors per test phase are shown. The early-stage monitoring overhead is fixed for all the benchmarks and depends only on the test interval and complexity. Testing overhead varies per benchmark during Mid and Late testing. The reason for this behavior is that benchmarks which utilize the CUT more frequently, i.e., benchmarks that have more FP multiplications, increase the CUT activity factor, which in turn accelerates CUT degradation. Degradation is detected by the DTC, which may then increase the monitoring overhead to test the CUT more frequently. The worst-case overhead in the late-stage monitoring scenario is still only 0.07%.

Figure 5. Test Fail Rates for Linear Scheme

Figure 4(b) shows the same results when the DTC uses exponential test intervals. Note that the vertical axis scale range is different between Figure 4 parts (a) and (b). Comparing the percentage of time spent in testing between the linear and exponential interval adjustment schemes, it is clear that the exponential method results in less testing overhead in almost every one of the emulated scenarios. This observation is a direct result of the fact that the exponential scheme starts with a higher initial test interval setting and the DTC only decreases the testing interval to conservatively low values when test failures are detected. Figure 5 shows the percentage of the tests that have failed in each of the emulation schemes described earlier for Figure 4. The vertical axis shows only the mid-stage and late-stage monitoring scenarios since early-stage monitoring does not generate any test failures. Only the linear test interval scheme results are shown, for both of the test complexities emulated, TC=5 and TC=20. Test fail rates increase dramatically from the mid-stage to the late-stage scenario, as expected during in-field operation. Benchmarks such as Apsi and Wupwise have zero test failures in the mid stage because they do not stress the FPU as much as the other cases and hence the emulated timing degradation of the FPU is small. However, in the late-stage emulations the FPU's timing degrades rapidly and hence the test fail rates dramatically increase. There is a direct correlation between the fail rate observed in Figure 5 and the test overhead reported in Figure 4, which is a result of the DTC's adaptive design. Similar observations hold for the exponential test interval scheme.

4.3. Opportunistic Testing

Testing overhead can be reduced to zero if tests are performed opportunistically when the CUT is idle. There are two types of opportunities: local opportunities exist when a particular block within a processor is idle, while global opportunities exist when many CUTs in a processor are all idle at the same time. For instance, after a branch misprediction the entire pipeline is flushed, leaving most of the CUTs in the processor idle. In general, cache misses, branch mispredictions, exceptions, imbalanced distribution of different instruction types, and variable execution latencies for different units in the processor can contribute to a variety of CUT idle periods.

We studied two important characteristics of testing opportunities for a range of CUTs: 1) the duration of an opportunity, and 2) the distance between opportunities. For generating these two characterization results we used the Simplescalar [4] simulator configured as a 4-wide issue out-of-order processor with 128 in-flight instructions and 16KB L1 I/D caches. Since these opportunities depend on the application running on the processor, we decided to run 10 benchmarks from the CPU2K benchmark suite, each for one billion cycles. The benchmarks used are Gzip, Crafty, Bzip, Mcf, and Parser in addition to the five used in our FPGA emulations. Our experiments are carried out using execution-driven simulations with a detailed processor model which has one floating point multiplier/divider (FPMD), four integer ALUs, one floating point ALU (FPU), and one integer multiplier/divider unit (IMD). We measure the duration of idle times and the distance between successive idle times to quantify local opportunities. To quantify global opportunities we collected two sets of data. We measured the duration of and distance between idle time periods at the pipeline stage level for instruction fetch (IF), integer execution (EINT), floating point execution (EFP), and the retirement stage (RE). We also measured the distance between L1 data cache misses (L1DCM), branch mispredictions (BMP), L2 cache misses (L2CM), and L1 instruction cache misses (L1ICM).

Figure 6(a) shows the distribution of opportunity durations in terms of percentage of total execution cycles for the above-mentioned local and global opportunities. The first four bars are local opportunities and the second four bars are global opportunities. Data collected for the four ALUs has been averaged and shown as one data set.

The distribution of distances between consecutive opportunities is shown in Figure 6(b). In this figure the first three bars are local and the rest are global opportunities. Please note that the vertical axes for both Figure 6 (a) and (b) are in logarithmic scale. The last cluster of columns on these figures shows the cumulative value of all the opportunities with duration/distance above 1000. Data for the integer multiplier/divider unit are not shown in Figure 6(b) because all the opportunities for this unit (0.2% of execution cycles) happened within 100 cycles of each other.

Figure 6. Distribution of (a) Opportunity Duration (b) Distance Between Opportunities

Most local and global opportunities studied have a duration of 10-100 cycles, which is sufficient time for testing with multiple test vectors. A more detailed breakdown of the short-duration opportunities (below 100 cycles) indicates that for the eight units studied, on average, cases with duration lower than 10 clock cycles account for 0.8% of the total simulated execution cycles. Opportunities of duration 10-50 cycles account for 0.4% of the execution cycles and opportunities of 50-100 cycles duration account for 0.01%. Since these opportunities occur very close to each other, it is even possible to spread a single test phase across two consecutive opportunity windows.

Based on the data shown in Figure 6(b), local opportunities (the first group of three bars) are prevalent when the distance between opportunities is small. The number of local opportunities decreases as the distance increases. However, global opportunities (the remaining group of bars) are prevalent both at short distances as well as long distances. Hence, we believe that local opportunities are better candidates for opportunistic testing. While Figure 6(a) provides us with valuable information on the duration of opportunities available for testing, Figure 6(b) shows whether these opportunities happen often enough to be useful for our monitoring goals.

There are far more test opportunities than the number of tests needed in our monitoring methodology, and hence the opportunities are taken advantage of only when the monitoring unit requires a test phase. To better utilize the idle cycles, the DTC specifies a future time window in which a test must be conducted, rather than an exact time to conduct a test. When the CUT is idle within the target time window the test is conducted. On rare occasions when there are insufficient opportunities for testing, the DTC can still force a test by interrupting the normal CUT operation. Even in such a scenario the performance overhead of using our methodology is extremely low.
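The window-based scheduling described above could look like the following sketch, where the DTC hands the scheduler a deadline instead of an exact test time. The class and method names are ours, not the paper's implementation.

```python
class OpportunisticScheduler:
    """Run a pending test phase during any CUT idle period that falls inside
    the DTC-specified window; force the test at the deadline if no idle
    opportunity appeared in time."""

    def __init__(self):
        self.deadline = None   # cycle by which the pending test must run
        self.pending = None    # callable that injects the queued test phase

    def request_test(self, run_phase, now: int, window: int) -> None:
        self.pending = run_phase
        self.deadline = now + window

    def tick(self, now: int, cut_idle: bool) -> None:
        if self.pending is None:
            return
        if cut_idle or now >= self.deadline:
            # Idle cycles give a zero-overhead slot; hitting the deadline
            # forces the test by interrupting normal CUT operation.
            self.pending()
            self.pending, self.deadline = None, None
```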

Monitoring multiple CUTs of small size instead of one large CUT might result in increased area overhead due to the presence of multiple RMUs, but it provides the possibility of taking advantage of more fine-grained local opportunities while the other parts of the processor continue to function fully. This is beneficial even if the testing is not performed opportunistically, because even if a test is forced on a unit and the normal operation of that unit is interrupted, all other parts of the processor can continue functioning. Centralized test control with distributed TVRs and RHTs can be implemented so that monitoring of multiple CUTs is controlled by one DTC unit. A detailed study of such architectures is part of our future work.

5. Related Work

There has been prior work on modeling, prediction, and detection of wearout faults. Modeling chip lifetime reliability using device failure models has been studied in [23, 19]. These models provide an architecture-level reliability analysis framework which is beneficial for prediction of circuit wearout at design time, but they do not address the actual in-field wearout of the circuit, which is highly dependent on dynamically changing operating conditions. The mechanism proposed by Bower et al. [9] uses a DIVA [5] checker to detect hard errors; the units causing the fault are then deconfigured to prevent the fault from happening again. Shyam et al. [20] explored a defect protection method to detect permanent faults using Built-In Self Test (BIST) for extensively checking the circuit nodes of a VLIW processor. BulletProof [12] focuses on a comparison of different defect-tolerant CMPs. A fault prediction mechanism for detecting NBTI-induced PMOS transistor aging timing violations has been studied in [1]. The circuit failure prediction proposed in that work is done at runtime by analyzing data collected from sensors that are inserted at various locations in the circuit. Blome et al. [6] use a wearout detection unit that performs online timing analysis to predict imminent timing failures. Double-sampling latches, which are used in Razor [13] for voltage scaling, can also be used for detecting timing degradation due to aging. FIRST (Fingerprints In Reliability and Self Test) [22] proposes using the existing scan chains on the chip and performing periodic tests under reduced frequency guardbands to detect signs of emerging wearout.

It has been shown that some device timing degradation due to aging, such as that caused by NBTI, can be recovered from by reducing the device activity for a sufficient amount of time [2]. Using our method, the optimal time for applying these recovery methods can be selected based on the information provided by the monitoring unit. Some of the related works mentioned above have suggested methods which use separate test structures that model the CUT, such as buffer chains or sensing flip-flops. The main advantage of our method compared to these previous approaches is that our test vectors activate the actual devices and paths that are used at runtime; hence each test captures the most up-to-date condition of the devices, taking into account the overall lifetime wearout of each device.

Implementing the dedicated sensing circuitry proposed in some of the prior works for monitoring a sufficient number of circuit paths requires significant extra hardware and lacks the flexibility and adaptability of the mechanism proposed in this work. Our method not only has the adaptability needed to reduce performance overheads due to testing (especially during the early stages of the processor's lifetime) but is also more scalable and customizable for increased coverage without incurring significant modification to the circuit; monitoring of additional paths can simply be achieved by increasing the size of the TVR. In addition to the above key points, the capability to dynamically select the circuit paths to test results in more targeted testing of the actual devices in the circuit, which significantly reduces the number of tests that need to be performed.

6. Conclusions

As processor reliability becomes a first-order design constraint for low-end to high-end computing platforms, there is a need for continuous monitoring of circuit reliability. Just as processors currently provide power and performance monitoring capabilities, this paper argues that reliability monitoring will enable more efficient and just-in-time activation of error detection and correction mechanisms. This paper presents a low-cost architecture for monitoring a circuit using adaptive critical path tests. The proposed architecture dynamically adjusts the monitoring overhead based on the current operating conditions of the circuit. Many of the parameters controlling the monitoring overhead, such as the interval between tests, the number of tests performed in each test phase, the clock frequency used for testing, and the selection of test vectors, can be dynamically adapted at runtime. This runtime adaptability not only provides continuous monitoring with minimal overhead but is also essential for robust monitoring in the presence of unreliable devices and rapid variations in operating conditions.

Our close-to-industrial-strength RTL implementation effort shows that the proposed design is feasible with minimal area and design complexity. FPGA emulation results show that even in worst-case wearout scenarios the adaptive methodology works with negligible performance penalty. We also showed that numerous opportunities suitable for multi-path testing exist when different parts of the processor are not being utilized.

References

[1] M. Agarwal, B. Paul, M. Zhang, and S. Mitra. Circuit failure prediction and its application to transistor aging. In 25th IEEE VLSI Test Symposium, pages 277–286, May 2007.

[2] M. A. Alam and S. Mahapatra. A comprehensive model of PMOS NBTI degradation. Microelectronics Reliability, 45(1):71–81, 2005.

[3] M. Annavaram, E. Grochowski, and P. Reed. Implications of device timing variability on full chip timing. Proceedings of the 13th International Symposium on High Performance Computer Architecture, pages 37–45, Feb 2007.

[4] T. Austin, E. Larson, and D. Ernst. SimpleScalar: an infrastructure for computer system modeling. Computer, 35(2):59–67, Feb 2002.

[5] T. M. Austin. DIVA: A reliable substrate for deep submicron microarchitecture design. Proceedings of the 32nd Annual International Symposium on Microarchitecture, pages 196–207, Dec 1999.

[6] J. Blome, S. Feng, S. Gupta, and S. Mahlke. Self-calibrating online wearout detection. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 109–122, 2007.

[7] S. Borkar. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. Proceedings of the 38th Annual International Symposium on Microarchitecture, pages 10–16, Dec 2005.

[8] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations and impact on circuits and microarchitecture. Proceedings of the 40th Annual Conference on Design Automation, pages 338–342, 2003.

[9] F. A. Bower, D. J. Sorin, and S. Ozev. A mechanism for online diagnosis of hard faults in microprocessors. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 197–208, 2005.

[10] K. Bowman, S. Duval, and J. Meindl. Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. IEEE Journal of Solid-State Circuits, 37(2):183–190, 2002.

[11] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. SIGARCH Computer Architecture News, 28(2):83–94, 2000.

[12] K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke, T. Austin, and M. Orshansky. BulletProof: a defect-tolerant CMP switch architecture. In The Twelfth International Symposium on High-Performance Computer Architecture, pages 5–16, Feb. 2006.

[13] D. Ernst, N. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. Razor: a low-power pipeline based on circuit-level timing speculation. Proceedings of the 36th Annual International Symposium on Microarchitecture, pages 7–18, Dec. 2003.

[14] T. Fischer, E. Amirante, P. Huber, K. Hofmann, M. Ostermayr, and D. Schmitt-Landsiedel. A 65nm test structure for SRAM device variability and NBTI statistics. Solid-State Electronics, 53(7):773–778, 2009.

[15] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: simple techniques for reducing leakage power. Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 148–157, July 2002.

[16] M. Li, P. Ramachandran, U. Karpuzcu, S. Hari, and S. Adve. Accurate microarchitecture-level fault modeling for studying hardware faults. In Proceedings of the 15th International Symposium on High-Performance Computer Architecture, Feb 2009.

[17] X. Liang, G. Wei, and D. Brooks. Revival: A variation-tolerant architecture using voltage interpolation and variable latency. In 35th International Symposium on Computer Architecture, pages 191–202, June 2008.

[18] S. Rauch. Review and reexamination of reliability effects related to NBTI-induced statistical variations. IEEE Transactions on Device and Materials Reliability, 7(4):524–530, Dec. 2007.

[19] J. Shin, V. Zyuban, Z. Hu, J. Rivers, and P. Bose. A framework for architecture-level lifetime reliability modeling. In Proceedings of the International Conference on Dependable Systems and Networks, pages 534–543, June 2007.

[20] S. Shyam, K. Constantinides, S. Phadke, V. Bertacco, and T. Austin. Ultra low-cost defect protection for microprocessor pipelines. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 73–82, Oct 2006.

[21] K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan. Temperature-aware microarchitecture. Proceedings of the 30th International Symposium on Computer Architecture, pages 2–14, June 2003.

[22] J. Smolens, B. Gold, J. Kim, B. Falsafi, J. Hoe, and A. Nowatzyk. Fingerprinting: bounding soft-error detection latency and bandwidth. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224–234, New York, NY, USA, 2004. ACM.

[23] J. Srinivasan, S. Adve, P. Bose, and J. Rivers. The case for lifetime reliability-aware microprocessors. Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 276–287, June 2004.

[24] A. Tiwari and J. Torrellas. Facelift: Hiding and slowing down aging in multicores. In Proceedings of the 41st International Symposium on Microarchitecture, pages 129–140, Nov 2008.

[25] A. Uht. Going beyond worst-case specs with TEAtime. Computer, 37(3):51–56, Mar 2004.

[26] Q. Wu, P. Juang, M. Martonosi, and D. Clark. Voltage and frequency control with adaptive reaction time in multiple-clock-domain processors. In 11th International Symposium on High-Performance Computer Architecture, pages 178–189, Feb. 2005.
