Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan D EPARTMENT OF E LECTRICAL AND C OMPUTER E NGINEERING T HE U NIVERSITY OF B RITISH C OLUMBIA

Towards Understanding the Effects of Intermittent Hardware Faults on Programs

Layali Rashid, Karthik Pattabiraman and Sathish GopalakrishnanDepartment of Electrical and Computer EngineeringThe University of British ColumbiaModeling The Propagation of Intermittent Hardware Faults in Programs

(the first project to study ifault effect on programs)SG: Emphasize that this work is unique.

1Motivation: Why Intermittent Faults?Intermittent faults are likely to be a significant concern in future processors [Cons08].

Transient FaultIntermittent FaultsDo not persist forever unlike permanent faultsPersist for longer duration than transient faultsMay impact program more than transient faults

One reason for ifaults only.Intermittent hardware faults are bursts of errors that occur at the same location (micro-architectural component) and last from a few cycles to a few seconds.PVT fluctuations. In-progress wear-out. Manufacturing residuals.We believe that understanding intermittent error propagation is important in designing fault tolerance techniques.

2Motivation

Intermittent faults can be caused by many reasons, such as in progress wear-out and temperature fluctuations. Say that you are not addressing design defects.3Motivationmov R1, #5mov R2, #6mov R3, #7ld R4, R1, Array_Addrld R5, R2, Array_Addrld R6, R3, Array_Addrmult R7, R5, R4

Intermittent Faultmov R1, #5mov R2, #6mov R3, #7ld R4, R1, Array_Addrld R5, R2, Array_Addrld R6, R3, Array_Addrmult R7, R5, R4

MotivationIntermittent FaultFailureProgram ExecutiontimeIntermittent Fault5Primary Research QuestionsDo all intermittent faults lead to program crash?

How many instructions are executed before the program crashes?

How many dynamic values are corrupted by the fault before the program crashes?Dont say the next question.Over 70% of the faults lead to program crash, 4% cause SDCs and 26% are benign. 99% of the faults crash within a few thousands of instructions.99% of the faults corrupt at most few dozens of variables.

6ContributionsUnderstand intermittent fault propagation through fault-injection experiments in programs.

Model the propagation of intermittent faults in programs at the instruction level.Validate the model using fault injections.

7Motivation: Why Model Error Propagation?Fault injection experiments are prohibitively expensive.Intermittent faults vary in location and duration.More than two orders of magnitude slower than modeling.

Modeling error propagation provides more insights that may help in tolerating faults.

I find that 2.8ApproachFault ModelSimpleScalarsimulatorActual ResultsFault InjectionsSay that SS is a cycle accurate simulator and I can inject faults at each cycle. Don't say that you've changed the simulator.

9ApproachFault ModelSimpleScalarsimulatorActual ResultsFault InjectionsPick start, duration, flip10Fault ModelAssumption: An intermittent fault affects two or more consecutive instructions in the program.

This fault model captures many faults including those in:ALULoad/Store Unit

We specify an intermittent fault by:Fault startFault length

This fault model captures different fault types, including those that affect ALU and LD/SW unit.

Fault start: program instruction that is first affected Fault length: number of instructions over which the fault persists.

11Experimental Setup (1)Computation of intermittent fault propagationIntermittent fault injections in instruction-level simulator (SimpleScalar).

Evaluate the Siemens benchmark suite.

We modify the ss simulator to inject faults12Major FindingsFailure categories:

Of the intermittent faults that are non-benign, 95% result in a program crash.91 to 95% of the faults cause program to crash within 300 instructions of the faults start.SimpleScalar simulator took 6 to 70 hours per program.

Comments on Siemens benchmarks briefly (maybe something about the size).

13ApproachCrash ModelDynamic Dependency GraphFault ModelExpected ResultsFault InjectionActual ResultsHere is my approach. The ifault propagation model is based on dynamic dependency graph that integrates a fault model and a crash model. Mention that because you are not using simulations in your model you need to construct a crash model.

14Approach Memory address Branch/jump address Function call addressCrash ModelDynamic Dependency GraphFault ModelExpected ResultsActual ResultsFault InjectionMention that you need a crash model for DDG and not for SS because there is not fault injection in our model. Explain what happens in the model and simulation.It is important to build an accurate model of when programs crash due to an error. This is because all errors, and more importantly the propagation of all errors within programs, will of course end at the instance of a program crash. We focus on crashes because they are the most common consequences of intermittent faults.

15ApproachA directed acyclic graph that models the dynamic dependencies between instructions [Agrawal90]....### #Crash ModelDynamic Dependency GraphFault ModelExpected ResultsActual ResultsFault InjectionDDG MetricsCrash Distance (CD): number of instructions that execute from the time an intermittent fault occurs until the program crashes (due to fault).

Intermittent Propagation Set (IPS): set of program values to which an intermittent fault propagates.

Say the intuition only don't read.

You're telling a story, don't say as 17ExampleCode FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R47Intermittent Error1257Array_Addr#5#636#7...Intermittent Propagation Set (1,2) = {?}Crash Distance (1, 2) = ?4Crash NodeI characterize the error propagations in programs using two quantities, intermittent propagation set and crash distance.Mention that the nodes correspond to the instructions in DDG.

18ExampleCode FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R471257Array_Addr#5#636#7...Intermittent Propagation Set (1,2) = {1, 2, 4}4Intermittent ErrorCrash Node4Intermittent Propagation Set (IPS): set of program values to which an intermittent fault propagates until the crash node,

19ExampleCode FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R47Crash Distance (1, 2) = 44Intermittent ErrorCrash Node1257Array_Addr#5#636#7...4Crash Distance (CD): number of instructions that execute from the time an intermittent fault occurs until the program crashes (due to fault).

20Experimental Setup (2)Evaluating the models accuracyConstruct the DDG of each program.Measure the difference between the predicted and the actual CD and IPS for crashes.

We modify the ss simulator to inject faults21Performance ComparisonCrash ModelDynamic Dependency GraphFault ModelExpected IPS and CDSimpleScalarsimulatorActual IPS and CD4 to 89 sec6 to 70 hrs 2275 to 82% of the predicted crash distances are within 10 nodes, 89 to 93% are within 100 nodes from actual crash distances.

DDG Model Vs. SimpleScalar (CD)

Accuracy within 1% of the program.DDG under predicts CD for less than 10% of the faults, explain the -ve numbers and the impact23DDG Model Vs. SimpleScalar (IPS)

88 to 94% of the difference sets cardinalities are within 10 nodes and that 93 to 97% are within 100 nodes.

Explain what is the difference in ips cardinality. Absolute diff.24ContributionsUnderstand intermittent fault propagation through fault injection experiments in programs.

Model the propagation of intermittent faults in programs at the instruction-level.Validate the model using fault injections.

25ConclusionsWe enhanced Dynamic Dependency Graph to accurately model intermittent fault propagation in programs.

Of the intermittent faults that are non-benign, 95% result in a program crash.

91 to 95% of the faults cause program to crash within 300 instructions of the faults start. 26DiscussionDetecting intermittent faults that affect programs using software-based techniques can be efficient.

Diagnosing intermittent faults using software-based techniques can be accurate.

Recovery using check-pointing techniques on the order of thousands of instructions will be effective.

Studying ifault behaviour in programs in crucial for fault tolerance.Detection: when ifaults affect programs, they usually cause exceptions.

Diagnosis: corruption caused by ifaults to the program state is rather limited since the program crashes soon.

Recovery: the program crash usually occur soon enough that it will not propagate to more than one checkpoint.27The EndBackup SlidesEffect of Intermittent Fault Length

The average actual CD is 7 to 8 nodes. Hence, most crashes occur within 8 nodes from the start of the intermittent fault. 30

Documents

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan D EPARTMENT OF E LECTRICAL AND C OMPUTER E NGINEERING T HE U NIVERSITY OF B RITISH C OLUMBIA