Evaluating Impact of Soft-Errors in an Embedded System
Vijay Sheshadri
Graduate Student
Dept. of Electrical Engineering
April 22, 2023 2
What is a Soft-error?
A transient fault caused by cosmic-ray particles.
A charged particle strikes a component.
The charged particle creates electron-hole pairs (EHPs), which are collected by the drain.
Sufficient charge collection causes an erroneous bit-flip (e.g., a stored 1 becomes 0).
Soft-error in a System
[Flowchart: outcome of a soft-error in a bit]
Is the bit read?
  no: benign fault, no error
  yes: does the bit have error protection?
    Error can be corrected (e.g., ECC): no error
    Error is only detected (e.g., parity + no recovery): Detected, but Unrecoverable Error (DUE)
    no protection: does the bit matter?
      yes: Silent Data Corruption (SDC)
      no: benign fault, no error
Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
Masking of Soft-error
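[Figure: combinational logic (gates B, C, D, E) between input registers (I1 to I7) and output registers (O1, O2). A particle strike either becomes a soft error at the output registers or causes no soft error because of electrical masking, latching-window masking, or logical masking.]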
FIT Equation: Vulnerability Factors
$\text{FIT} = \sum_{i} (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
$\text{vulnerability factor} = \text{TVF} \times \text{AVF}$
Timing Vulnerability Factor (TVF): fraction of time the bit is vulnerable
Architectural Vulnerability Factor (AVF): fraction of time the bit matters for the final output of a program
Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
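The FIT equation can be sketched in a few lines of Python. This is only an illustration: the device names and all numeric rates below are hypothetical placeholders, not values from the cited work. One FIT is one failure per 10^9 device-hours.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str             # hypothetical device label
    intrinsic_fit: float  # intrinsic error rate in FIT (failures per 1e9 device-hours)
    tvf: float            # timing vulnerability factor, 0.0 .. 1.0
    avf: float            # architectural vulnerability factor, 0.0 .. 1.0


def total_fit(devices):
    """Sum of intrinsic error rate x TVF x AVF over all vulnerable devices."""
    return sum(d.intrinsic_fit * d.tvf * d.avf for d in devices)


# Hypothetical numbers, chosen only to exercise the formula.
devices = [
    Device("register_file", intrinsic_fit=100.0, tvf=1.0, avf=0.2),
    Device("branch_predictor", intrinsic_fit=50.0, tvf=1.0, avf=0.0),
    Device("program_counter", intrinsic_fit=10.0, tvf=1.0, avf=1.0),
]

print(f"Total FIT: {total_fit(devices):.1f}")
```

A structure with AVF = 0 contributes nothing to the total, however high its intrinsic error rate.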
Architectural Vulnerability Factor
Fraction of time the bit matters for the final output of a program
Branch predictor: does not matter at all (AVF = 0%)
Program counter: almost always matters (AVF ~ 100%)
Computing AVF for complex structures:
  Statistical fault injection
  ACE (Architecturally Correct Execution) analysis
Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
Soft-error & Automobiles
Mar 2010: NHTSA enlisted the NASA Engineering and Safety Center (NESC) to investigate “Unintended Acceleration”
Apr 2011: NESC discounts single-event upsets (SEUs) in its report to NHTSA, stating that the ICs are manufactured using SOI (silicon-on-insulator) technology
Per the AEC-Q100 standard, SEU testing is required for automotive electronics with more than 1 Mb of RAM
An Example
Predicted block RAM upset rate for a Virtex-5 FPGA = 635 FIT/Mb = 1.5E-05 upsets per day per Mb.
Ref: A. Lesea, “Continuing Experiments of Atmospheric Neutron Effects on Deep Submicron Integrated Circuits,” WP286 (v1.0), Xilinx, Inc., 2008
Assume this FPGA is used in a throttle control module.
If the vendor produces 500,000 such vehicles (taking 1 Mb of block RAM per vehicle, as the arithmetic implies), then total upsets per day = 1.5E-05 x 500,000 ≈ 7.6 upsets per day across the fleet (using the unrounded rate of 1.524E-05)
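The arithmetic in this example can be checked with a short Python sketch. It assumes the standard definition of one FIT as one failure per 10^9 device-hours, and it assumes 1 Mb of block RAM per vehicle, which the slide's numbers imply but do not state.

```python
HOURS_PER_DAY = 24
DEVICE_HOURS_PER_FIT = 1e9  # 1 FIT = 1 failure per 1e9 device-hours


def fit_to_upsets_per_day(fit):
    """Convert a FIT rate into expected upsets per day."""
    return fit / DEVICE_HOURS_PER_FIT * HOURS_PER_DAY


fit_per_mb = 635.0     # from the cited Xilinx white paper
mb_per_vehicle = 1.0   # assumption: not stated on the slide
vehicles = 500_000

per_mb_per_day = fit_to_upsets_per_day(fit_per_mb)
fleet_per_day = per_mb_per_day * mb_per_vehicle * vehicles

print(f"Upsets per day per Mb: {per_mb_per_day:.3e}")
print(f"Fleet upsets per day:  {fleet_per_day:.2f}")
```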
Soft-error Mitigation
Robust (radiation-hardened) circuit designs are resilient to soft-errors
Soft-error mitigation at:
  Device level: silicon-on-insulator, triple-well
  Circuit level: DICE cell, triple-modular redundancy
  Architecture level: RMT, lock-stepping, ECC
Soft-error Mitigation
Soft-error mitigation techniques incur penalties in
  area (spatial redundancy)
  timing (temporal redundancy)
Selective hardening of components reduces the penalty
  Often based on logical/electrical/timing derating
A low-cost mitigation technique is proposed for critical applications, based on application derating
  Certain applications can mask or recover from transient faults*
*Ref: V. Wong et al., “Soft Error Resilience of Probabilistic Inference Applications,” SELSE II, 2006
Critical Application - An Analogy
Climate monitor/display
Airbag deployment
GPS
Cruise control
• A micro-controller embedded in a car dashboard may be handling many applications.
• A critical application in this case could be ‘Airbag deployment’.
• A soft-error (SE) during this application could be catastrophic.
Target Module
PWM (pulse-width modulation): the output is a pulse whose width sets the speed of the motor.
Etpwmi0 module
  ~800 FFs & ~3000 logic gates
  180-nm CMOS technology, 80 MHz frequency
[Figure: block diagram with ADC, CPU core, PWM, and motor]
Basic Simulation Steps*
Pre-analysis: identify the components utilized by the critical application
Fault injection: inject a single fault at a random time instance by depositing the opposite value on the component
Error metrics:
  Error count => no. of mismatches between the output and the reference
  PW count => no. of clock cycles the output is ‘1’, as compared to the reference
*Ref: J. Blome et al., “Cost-Efficient Soft Error Protection for Embedded Microprocessors,” CASES, 2006
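The two error metrics can be illustrated with a minimal Python sketch. It assumes the output and the reference are sampled once per clock cycle into equal-length lists of 0/1 values; the function names and sample waveforms are hypothetical, not taken from the actual test bench.

```python
def error_count(output, reference):
    """Number of clock cycles in which the output differs from the reference."""
    if len(output) != len(reference):
        raise ValueError("waveforms must cover the same number of cycles")
    return sum(1 for o, r in zip(output, reference) if o != r)


def pw_count(signal):
    """Number of clock cycles in which the signal is '1' (its pulse width)."""
    return sum(1 for s in signal if s == 1)


# Hypothetical per-cycle samples of the PWM output.
reference = [0, 1, 1, 1, 1, 0, 0, 0]
faulty = [0, 1, 1, 0, 1, 1, 0, 0]

print("Error count:", error_count(faulty, reference))
print("PW count (faulty vs. reference):", pw_count(faulty), "vs.", pw_count(reference))
```

In this sample the two waveforms have the same PW count even though they mismatch in two cycles, which is one reason to track both metrics.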
Simulation tools
Verilog netlist simulated with timing information, using Synopsys VCS
Fault-injection module coded in C. Uses VPI (Verilog Procedural Interface) functions to
  Access a net in the netlist (vpiHandle)
  Read the value of the net (vpi_get_value)
  Overwrite the value of the net (vpi_put_value)
Simulation – Pre-analysis
Pre-analysis
  Categorize FFs based on their activity
    a) Low-activity FFs (no. of toggles less than 2)
    b) High-activity FFs (no. of toggles higher than 2)
  Opposite values are forced and the output pulse is observed for errors
  FFs in which errors were observed are identified and subjected to fault-injection
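The activity-based categorization can be sketched as follows. This is a simplified illustration that assumes each flip-flop's value has been recorded once per clock cycle; the flip-flop names are hypothetical. The slide does not say how an FF with exactly 2 toggles is classified, so this sketch assumes it counts as high-activity.

```python
def count_toggles(trace):
    """Number of value changes between consecutive samples of one flip-flop."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)


def classify_by_activity(traces, threshold=2):
    """Split flip-flops into low- and high-activity groups.

    Assumption: an FF with exactly `threshold` toggles is treated as high-activity.
    """
    groups = {"low": [], "high": []}
    for name, trace in traces.items():
        key = "low" if count_toggles(trace) < threshold else "high"
        groups[key].append(name)
    return groups


# Hypothetical per-cycle traces for three flip-flops.
traces = {
    "ff_cfg": [0, 0, 0, 0, 0, 0],   # never toggles
    "ff_en": [0, 1, 1, 1, 1, 1],    # toggles once
    "ff_cnt0": [0, 1, 0, 1, 0, 1],  # toggles every cycle
}

print(classify_by_activity(traces))
```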
Simulation – Fault-injection
Fault injection
  For the FFs obtained from pre-analysis, inject a fault at a random instant of time (within the time interval of the first output pulse)
  Measure Error count & PW count.
  Identify FFs whose error is within acceptable limits
[Figure: the test bench (Verilog) passes the original value to the fault-injection module (C + VPI), which returns the modified value; the fault-injection window is marked on the output pulse]
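The injection step can be mimicked in pure Python with a toy model. This is not the Etpwmi0 netlist or the C + VPI module: it assumes a hypothetical PWM made of a free-running counter and a duty register, where the output is 1 while the counter is below the duty value. A fault inverts one randomly chosen bit at a random cycle inside the first output pulse.

```python
import random


def simulate_pwm(period, duty, cycles, fault=None):
    """Toy PWM: the output is 1 while counter < duty.

    fault is None or a (cycle, register, bit) tuple, where register is
    "duty" or "counter". At that cycle the chosen bit is inverted.
    """
    counter = 0
    output = []
    for t in range(cycles):
        if fault is not None and fault[0] == t:
            _, register, bit = fault
            if register == "duty":
                duty ^= 1 << bit
            elif register == "counter":
                counter ^= 1 << bit
            else:
                raise ValueError(f"unknown register: {register}")
        output.append(1 if counter < duty else 0)
        counter = 0 if counter + 1 >= period else counter + 1
    return output


def inject_random_fault(period, duty, cycles, register, bits, rng):
    """Flip one random bit at a random cycle within the first output pulse."""
    cycle = rng.randrange(duty)  # first pulse spans cycles 0 .. duty-1 in this toy
    bit = rng.randrange(bits)
    golden = simulate_pwm(period, duty, cycles)
    faulty = simulate_pwm(period, duty, cycles, fault=(cycle, register, bit))
    errors = sum(g != f for g, f in zip(golden, faulty))
    pw_delta = sum(faulty) - sum(golden)
    return cycle, bit, errors, pw_delta


rng = random.Random(0)  # seeded so a run can be repeated
print(inject_random_fault(period=8, duty=4, cycles=16, register="duty", bits=3, rng=rng))
```

Running many such injections per flip-flop group and counting the runs with a non-zero error count gives the kind of statistics a fault-injection campaign collects.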
Absolute error vs. Acceptable error
Absolute error: raise the error flag for any mismatch between the output pulse and the reference
Acceptable error: raise the error flag only if the mismatch between the output pulse and the reference lies outside the tolerance limit*
Examples: delayed pulse, self-correcting pulse
[Figure: two waveform examples, each showing where the fault is injected, the target FF, the actual output, and the reference copy; in one example the actual output is delayed relative to the reference]
*Ref: X. Li et al., “Exploiting Soft Computing for Increased Fault Tolerance,” Workshop on Architectural Support for Gigascale Integration, 2006
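The difference between the two criteria can be sketched in Python. The slides do not define the tolerance limit, so this sketch assumes it is a maximum number of mismatching clock cycles; the waveforms are hypothetical.

```python
def mismatch_cycles(output, reference):
    """Number of clock cycles in which the output differs from the reference."""
    if len(output) != len(reference):
        raise ValueError("waveforms must cover the same number of cycles")
    return sum(o != r for o, r in zip(output, reference))


def absolute_error_flag(output, reference):
    """Flag any mismatch at all."""
    return mismatch_cycles(output, reference) > 0


def acceptable_error_flag(output, reference, tolerance_cycles):
    """Flag only mismatches beyond the tolerance (assumed: a cycle count)."""
    return mismatch_cycles(output, reference) > tolerance_cycles


reference = [0, 1, 1, 1, 1, 0, 0, 0]
delayed = [0, 0, 1, 1, 1, 1, 0, 0]  # same pulse, one cycle late

print("Absolute error flag:  ", absolute_error_flag(delayed, reference))
print("Acceptable error flag:", acceptable_error_flag(delayed, reference, tolerance_cycles=2))
```

A pulse delayed by one cycle mismatches in two cycles, so it trips the absolute criterion but passes the acceptable one at a tolerance of two cycles.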
Simulations-Combinational logic
Fault injection steps:
  SE modeled as a 1 ns pulse (system clock frequency = 80 MHz)
  Transient pulse injected onto the gate output
  Target combinational circuit selected at random
Example: 2-input NAND gate
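[Figure: 2-input NAND gate with inputs A and B and output Y; waveforms of A, B, and Y show the injected fault on Y, with the actual output compared against the reference copy]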
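Why a 1 ns transient rarely becomes an error at 80 MHz can be illustrated with a simple latching-window model. This is a sketch under stated assumptions, not the timing simulation itself: it treats the pulse as reaching a flip-flop input unattenuated, places clock edges at multiples of the 12.5 ns period, and uses hypothetical setup and hold times of 0.1 ns each.

```python
import math

CLOCK_PERIOD_NS = 1e9 / 80e6  # 80 MHz clock -> 12.5 ns period
PULSE_WIDTH_NS = 1.0          # transient width used in the slides
SETUP_NS = 0.1                # hypothetical value, not from the slides
HOLD_NS = 0.1                 # hypothetical value, not from the slides


def pulse_is_latched(start_ns, width_ns=PULSE_WIDTH_NS, period_ns=CLOCK_PERIOD_NS,
                     setup_ns=SETUP_NS, hold_ns=HOLD_NS):
    """True if the pulse overlaps the setup/hold window of any clock edge."""
    end_ns = start_ns + width_ns
    first_edge = math.floor(start_ns / period_ns)
    last_edge = math.ceil(end_ns / period_ns)
    for k in range(first_edge, last_edge + 1):
        edge = k * period_ns
        if start_ns < edge + hold_ns and end_ns > edge - setup_ns:
            return True
    return False


def latch_probability(width_ns=PULSE_WIDTH_NS, period_ns=CLOCK_PERIOD_NS,
                      setup_ns=SETUP_NS, hold_ns=HOLD_NS):
    """Chance of overlap for a pulse starting uniformly at random in a period."""
    return min(1.0, (width_ns + setup_ns + hold_ns) / period_ns)


print("Pulse at 5.0 ns latched? ", pulse_is_latched(5.0))
print("Pulse at 12.0 ns latched?", pulse_is_latched(12.0))
print(f"Latching probability: {latch_probability():.3f}")
```

Under these assumptions, most pulses fall outside the window and are removed by latching-window masking, before logical and electrical masking are even considered.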
Results
Pre-analysis: ~18% of FFs are used by the application
Fault injection: the number of faults injected is proportional to the number of flip-flops in the group
Low-toggle FFs are more numerous, hence the number of faults injected into low-toggle FFs is higher
Results
Low-toggle FFs are more vulnerable to soft-errors, since an erroneous bit-flip may remain in place without being overwritten
High-toggle FFs are written very often, so an erroneous bit-flip has a higher probability of being overwritten
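The intuition can be made concrete with a small Python sketch that measures how long a flipped value survives in a flip-flop. It assumes, as a simplification, that every later write restores a correct value; the write schedules are hypothetical.

```python
def corrupted_cycles(flip_cycle, write_cycles, total_cycles):
    """Cycles a flipped bit stays in the FF before the next write overwrites it.

    Assumption: any write after the flip restores a correct value. If no
    write follows, the bad value persists to the end of the run.
    """
    later_writes = [w for w in write_cycles if w > flip_cycle]
    end = min(later_writes) if later_writes else total_cycles
    return end - flip_cycle


TOTAL_CYCLES = 100
high_toggle_writes = range(0, TOTAL_CYCLES, 4)  # hypothetical: written every 4 cycles
low_toggle_writes = [0]                         # hypothetical: written once at start

print("High-toggle FF:", corrupted_cycles(10, high_toggle_writes, TOTAL_CYCLES), "cycles")
print("Low-toggle FF: ", corrupted_cycles(10, low_toggle_writes, TOTAL_CYCLES), "cycles")
```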
Computing AVF
AVF = Pe x (% component)
Pe = probability that a fault injected in the component results in an error = (no. of errors) / (no. of faults injected)
% component = the percentage of that component with respect to the total number of components
Example: for a latch,
  a. if # errors = 50% of injected faults (Pe = 0.5)
  b. if latches make up 20% of the circuit
  then AVF = 0.5 x 0.2 = 0.1
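The same calculation as a short Python sketch. The raw counts below are hypothetical, chosen only so that they reproduce the latch example (Pe = 0.5, 20% of the circuit).

```python
def component_avf(errors, faults_injected, component_count, total_components):
    """AVF = Pe x (component's share of the circuit)."""
    if faults_injected <= 0 or total_components <= 0:
        raise ValueError("faults_injected and total_components must be positive")
    pe = errors / faults_injected
    share = component_count / total_components
    return pe * share


# Hypothetical counts matching the latch example: Pe = 0.5, share = 0.2.
avf_latch = component_avf(errors=50, faults_injected=100,
                          component_count=200, total_components=1000)
print(f"AVF (latch example): {avf_latch:.2f}")
```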
AVF - Results
Low-activity FFs have a higher Pe and are more numerous; hence they have a higher AVF
Combinational logic, though large in number, has Pe ~4E-03, which keeps its AVF low
Summary
Fault-resilience scheme for critical applications using application derating and inherent error tolerance
For the application considered,
  ~12% of the sequential logic was safety-critical (previous work reports 30% of sequential logic hardened for 99% fault coverage in an ARM embedded processor running an image-processing algorithm)
  failures in combinational logic were negligible
The worst-case scenario would only be the same as radiation-hardening a generic system, i.e., all the hardware is identified as safety-critical
Future Work
Perform fault-injection analysis on the processor core managing the control loop
Conduct neutron beam experiments on the circuit to compare with simulations and find FIT rate
Implement circuit hardening and test the system to ascertain its robustness