Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Chair for Dependable Nano Computing (ITEC) 0 KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
INSTITUTE OF COMPUTER ENGINEERING (ITEC) – CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC)
www.kit.edu
CLASS: Combined Logic Architecture Soft Error Sensitivity Analysis
Mojtaba Ebrahimi, Liang Chen, Hossein Asadi, and Mehdi B. Tahoori
Chair for Dependable Nano Computing (ITEC) 1
Soft Errors
Soft Errors caused by radiation-induced
particle strikes
Strike releases electron & hole pairs
Absorbed by source & drain
Alter transistor state
Soft error at logic-level and higher:
Bit-flip in flip-flops
Transient pulses in combinational gates
System Soft Error Rate (SER) grows
exponentially with Moore’s law [Dixit11IRPS][Ibe10TED]
Number of transistors per system is
proportional to SER
+ - + + + -
- -
Transistor Device
source drain
Particle
Strike
0
1
2
3
4
5
250 180 130 90 65 45 32 22Design Rule (nm)
Soft
Erro
r R
ate
(No
rmal
ize
d)
SER vs. Design Rule [adapted from Ibe10TED]
Chair for Dependable Nano Computing (ITEC) 2
Not all soft errors cause system failures
Circuit level masking (logical, timing, and electrical)
Architecture level masking
Application level masking
Nodes not equally susceptible to soft errors
Finding most susceptible modules
Applying selective protection schemes
Balancing reliability, cost, and performance
SER Estimation technique is needed
Rank the nodes according to their SER
Motivation: Importance of SER Estimation
Chair for Dependable Nano Computing (ITEC) 3
Previous Soft Error Rate (SER) estimation techniques
Fault injection (time consuming)
Analytical (accuracy)
Architecture-level techniques regular structures
(register-file, cache)
Circuit-level techniques irregular structures
(controller unit, pipeline)
Microprocessors include both regular and irregular structures
Independent analysis is not accurate
Proposed: CLASS: Combined Logic Architecture Soft error Sensitivity
analysis
Combines a circuit-level technique with an architecture-level analysis
Discrete time Markov chains for handling different error propagation
and masking scenarios
Motivation: What is missing?
Chair for Dependable Nano Computing (ITEC) 4
Preliminaries
CLASS
Experimental Results
Conclusions
Outline
Chair for Dependable Nano Computing (ITEC) 5
Preliminaries
CLASS
Experimental Results
Conclusions
Outline
Chair for Dependable Nano Computing (ITEC) 6
ACE (Architecturally Correct Execution) analysis is the
most common technique [Mukherjee03Micro]
a.k.a Life-time analysis
Lifetime of a bit is divided into :
ACE
e.g. Program counter almost always in ACE state
un-ACE (unnecessary for ACE)
e.g. Branch predictor always in un-ACE state
AVF (Architecture Vulnerability Factor)
Fraction of time that system is on ACE state
Program counter ~ 100%, Branch predictor = 0%
ACE analysis overestimates failure rate
In some cases up to 7x [George10DSN] Overdesign
Analytical Architecture Level SER Estimation
Techniques
Chair for Dependable Nano Computing (ITEC) 7
Binary/Algebraic Decision Diagrams (BDD/ADD)-based
techniques [Zhang06ISQED] [Norman05TCAD] [Krishnaswamy05DATE] [Miskov08TCAD]
Based on decision diagrams
Highly accurate
Scalability problem
Error Probability Propagation (EPP)-based techniques [Asadi06ISCAS] [Fazeli10DSN] [Chen12IOLTS]
Based on propagation rules
Reasonable accuracy
Scalable
Analytical Circuit Level SER Estimation
Techniques
Chair for Dependable Nano Computing (ITEC) 8
Shortcoming of Previous techniques
Circuit-level techniques consider:
ACE cannot completely model effect of irregular structures
in error propagation from regular structures
A circuit What seen by circuit
level techniques
Chair for Dependable Nano Computing (ITEC) 9
Preliminaries
CLASS
Experimental Results
Conclusions
Overview
Chair for Dependable Nano Computing (ITEC) 10
CLASS: A Simple Example
Error at Output
(Failure)
Latent in Memory
Error at Error Site
0.2 0.3
1
Masked
0.5
Error at Output
(Failure)
Latent in Memory
Error at Error Site
0.2 0.3
1
0.6 Error at Memory Outputs
Masked0.4
0.5
Error at Output
(Failure)
Latent in Memory
Error at Error Site
0.1
0.2 0.3
0.4
1
0.6 Error at Memory Outputs
Masked0.4
0.5
0.5
Memory
Primary Inputs
Primary Ouputs
Q
QSET
CLR
D
Q
QSET
CLR
D
0.3
0.2 0.6
0.1
0.4
Propagation from error site Propagation from memory
contents to its output
Propagation from memory
output (Iterative)
Saturates in few iterations
Discrete Time Markov Chain (DTMC)
Chair for Dependable Nano Computing (ITEC) 11
Overview of CLASS
Irregular Structures(Combinational and
Sequential Elements)
Phase ISystem Decomposition
Irre
gula
r St
ruct
ure
sR
egu
lar
Stru
ctu
res
Memory m
Memory 2
Memory 1
DTMC
Phase IIIJoint Analysis
Enhanced EPP Analysis
Phase IIIndependent Analysis
Memory ACE Analysis
(MACE)
Chair for Dependable Nano Computing (ITEC) 12
Important Error Propagations
Propagation from irregular structures to regular structures
and vice versa :
Short-term effect: error in memory inputs affects its outputs
immediately
Long-term effect: error in memory inputs contaminates its contents
MVF (Memory Vulnerability Factor): propagation probability of an
error from memory contents to its output
MemoryIrregular Structure
Primary Inputs
Primary Ouputs
Memory Inputs
Memory Outputs
1
1 Short-term Effect
1
2
3
MemoryIrregular Structure
Primary Inputs
Primary Ouputs
Memory Inputs
Memory Outputs
2
1
1
2
Short-term Effect
Long-term Effect
MemoryIrregular Structure
Primary Inputs
Primary Ouputs
Memory Inputs
Memory Outputs
2 3
1
1
2
3
Short-term Effect
Long-term Effect
MVF
Chair for Dependable Nano Computing (ITEC) 13
A simple memory
Computation of Short- and Long-term Effects
Probability of having short- and long-term
effect for each input port of memory
Short-term effect
Long-term effect
Error
at Port
Probability of Error
None Output
(Short-term)
Contents
(Long-term)
Both
Address 0 1-SP(WEn) SP(WEn) 0
Error
at Port
Probability of Error
None Output
(Short-term)
Contents
(Long-term)
Both
Address 0 1-SP(WEn) SP(WEn) 0
Data-in 1-SP(WEn) 0 0 SP(Wen)
WEn 0 0 0 1
Chair for Dependable Nano Computing (ITEC) 14
Looking at memory boundaries
Intervals leading to a read access are ACE
MVFWord: Fraction of time that it has an ACE state
MVFMemory : Average of all MVFWord
Memory ACE (MACE)
Write
Read
Write
Read
Read
Write
Write
ACE ACEunACE unACE
Time
Chair for Dependable Nano Computing (ITEC) 15
CLASS is compatible with different circuit-level techniques
It inherits characteristics of employed circuit-level technique
Multi-cycle 4-value EPP [Fazeli10DSN] is employed
Based on topological traverse of netlist
O(n2), Inaccuracy within 2%
Circuit-level technique should be enhanced to:
Model short-term effect of error in memories
Compute long-term effect probability (treated as independent error)
Enhanced Circuit-level Technique
BMemory
Address
AQ
QSET
CLR
D
Q
QSET
CLR
D
SP=0.8
SP=0.5
Data-in
WEn
SP=0.25Error Probability (EP)
EP=1
MVF×EPP(Memory output to primary output)
1×0.8=0.8
EP=0.80.8×0.75=0.6
EP=0.6
Short-term
Lang-term0.8×0.25=0.2
0.6×0.5=0.3
EP=0.3
EP=0.3
Chair for Dependable Nano Computing (ITEC) 16
Integration of MACE and EPP Using DTMC
Memory
Logic
Primary Inputs
Primary Ouputs
Error at Output
(Failure)
Latent in Memory
Error at Error Site
PMLT
PELT PEO
PMO
1
MVF Error at Memory Outputs
Masked1-MVF
1-PELT-PEO
1-PMLT-PMO
Error propagation DTMC
A simple system
Steady state
solution of DTMC
Independent of
error site
Symbol Definition
PEO P(Error site Primary Outputs)
PELT P(Error site Latent in Memory)
PMO P(Memory Output Primary Outputs)
PMLT P(Memory Output Latent in Memory)
MVF P(Latent in Memory Memory Outputs)
Chair for Dependable Nano Computing (ITEC) 17
Overall Flow
HDL DiscritpionApplicationSystem
SpecificationHDL Discritpion
Gate-level Netlist
Application
VCD File
Synthesis
Post-synthesis Simulation
System Specification
Done by Commercial
Tools
HDL Discritpion
Gate-level Netlist
MVF
Application
EPPs
VCD File
Signal Probabilites
Synthesis
Post-synthesis Simulation
MACE Analysis
EPP Analysis
DTMC Matrix
CLASS
Solve Matrix for Each Cell
Vulnerability of Cells
VCD file Analysis
System Specification
Done by Commercial
Tools
Chair for Dependable Nano Computing (ITEC) 18
Preliminaries
CLASS
Experimental Results
Conclusions
Outline
Chair for Dependable Nano Computing (ITEC) 19
OpenRISC 1200 processor
Synthesis results:
44,480 gates
4,960 flip-flops
4 Memory Units
Overall: 49,444 error sites
Benchmarks are selected
from Mibench:
QSort, CRC32, Stringsearch
Experimental Setup
[www.opencores.org]
Chair for Dependable Nano Computing (ITEC) 20
EPP [Asadi06ISCAS][Fazeli10DSN]
ACE[Biswas05ISCA]
implemented for caches and their tags
Simulation-based Statistical Fault Injection (SFI)
Faults are injected during post-synthesis simulation
One fault per running entire application
Each FI on average takes 87 seconds
Only logic masking is considered
Comparable with functional simulation
System configuration:
Intel Xeon E5520
16 GB RAM
Comparison with Other Techniques
Chair for Dependable Nano Computing (ITEC) 21
Inaccuracy Analysis
Regular structures:
OpenRISC Regular Structures
Instruction Cache (IC)
Instruction Cache Tag (ICT)
Data Cache (DC)
Data Cache Tag (DCT)
CLASS
Average inaccuracy: 3.4%
Maximum inaccuracy : 9.2%
ACE
Average inaccuracy : 11.7%
Maximum inaccuracy : 23.3%
CLASS may overestimate or
underestimate Due to inaccuracy of 4-value EPP
SFI
CLASS
ACE
Fai
lure
Pro
bab
ilit
y
Chair for Dependable Nano Computing (ITEC) 22
Inaccuracy Analysis
Irregular structures:
Three representative blocks of
OpenRISC:
ALU
Instruction Cache FSM (IC FSM)
Pipeline fetch unit
CLASS
Average inaccuracy : 2.1%
Maximum inaccuracy : 4.0%
EPP
Average inaccuracy : 7.2%
Maximum inaccuracy : 26.3%
Average inaccuracy of individual
error sites is less than 7%.
SFI
CLASS
EPP
Fai
lure
Pro
bab
ilit
y
Chair for Dependable Nano Computing (ITEC) 23
Runtime
Individual inaccuracy of 7% needs 196 faults at each error site
49,444 possible error sites
49,444 x 196 x 87 more than 27 years
CLASS is 5 orders of magnitude faster than simulation-based
SFI Less than one hour
Runtime Breakdown:
VCD Analysis 85%
EPP 14%
Solving DTMCs 1%
Chair for Dependable Nano Computing (ITEC) 24
Preliminaries
CLASS
Experimental Results
Conclusions
Outline
Chair for Dependable Nano Computing (ITEC) 25
CLASS: SER estimation technique for entire systems
Unlike existing techniques which only focused on either
regular or irregular structures
Interactions between regular and irregular structures are
considered
Compared to SFI:
Inaccuracy is less than 7%
5 orders of magnitude faster
Multiple times more accurate than EPP (4x) and ACE (3x)
Conclusion
Chair for Dependable Nano Computing (ITEC) 28
Sources of inaccuracy in current implementation
Propagation of an error to several memory units
Multiple propagation of an error to memory outputs
Sources of Inaccuracy
Chair for Dependable Nano Computing (ITEC) 29
Access Types: Read Miss (RM) and Read Hit (RH)
Un-ACE Intervals * RM
ACE Intervals: *RH
ACE Analysis: Instruction Cache
How to Estimate the Soft Error Rate of an Embedded Processor?
Chair for Dependable Nano Computing (ITEC) 30
Access types: Read Miss (RM), Read Hit (RH),
Write Miss (WM), Write Hit (WH)
ACE intervals: *RH
Un-ACE intervals: *WH
Potentially ACE intervals: *WM, *RM
ACE Analysis: Data Cache (Write-back)
Chair for Dependable Nano Computing (ITEC) 31
Short-term effect of error in memories
Long-term effect are considered as propagated as
independent error
Enhanced to calculate the probability of error propagation
to memory inputs
Only logic masking is considered
Multi-cycle 4-value EPP [Fazeli10DSN] is employed
CLASS has no limitation regarding circuit-level techniques
Enhanced EPP