SWAT: Designing Resilient Hardware by
Treating Software Anomalies
Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
Motivation
• Hardware will fail in the field for several reasons
– Transient errors (high-energy particles), wear-out (weaker devices), design bugs, and so on
⇒ Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy solutions (e.g., nMR) too expensive
⇒ Need low-cost solutions for multiple failure sources
* Must incur low area, performance, power overhead
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
– Zero to low overhead “always-on” monitors
⇒ Diagnose cause after symptom detected
– May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
SWAT Framework Components
• Detection: Symptoms of software misbehavior
• Recovery: Checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, bounded by checkpoints]
Advantages of SWAT
• Handles all faults that matter
– Oblivious to low-level failure modes and masked faults
• Low, amortized overheads
– Optimize for common case, exploit SW reliability solutions
• Customizable and flexible
– Firmware control adapts to specific reliability needs
• Holistic systems view enables novel solutions
– Synergistic detection, diagnosis, recovery solutions
• Beyond hardware reliability
– Long-term goal: unified system (HW+SW) reliability
– Potential application to post-silicon test and debug
SWAT Contributions
[Framework diagram annotated with contributions: very low-cost detectors (ASPLOS’08, DSN’08) ⇒ low SDC rate, latency; application-aware SWAT ⇒ even lower SDC, latency; in-situ diagnosis (DSN’08); accurate fault modeling (HPCA’09); multithreaded workloads (MICRO’09)]
This Talk
[Same annotated framework diagram, highlighting the contributions covered in this talk]
Outline
• Introduction to SWAT
• SWAT Detection
• SWAT Diagnosis
• Analysis of Recovery in SWAT
• Conclusions
• Future work
Fault Detection w/ HW Detectors [ASPLOS ’08]
• Simple HW-only detectors to observe anomalous SW behavior
– Minimal hardware area ⇒ low-cost detectors
– Incur near-zero perf overhead in fault-free operation
– Require no changes to SW
[Detectors, coordinated by SWAT firmware:]
– Fatal traps: division by zero, RED state, etc.
– Kernel panic: OS enters panic state due to fault
– High OS: high contiguous OS activity
– Hangs: simple HW hang detector
– App abort: application abort due to fault
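For concreteness, below is a minimal C++ sketch of one such detector: a heuristic hang detector that flags execution stuck in a tight loop. The window length, distinct-PC threshold, and interface are illustrative assumptions, not the paper’s hardware design.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Heuristic hang detector sketch: if a long window of retired instructions
// touches only a handful of distinct PCs, execution is likely stuck in a
// tight loop and a hang symptom is raised. Thresholds are illustrative.
class HangDetector {
public:
    HangDetector(std::size_t maxDistinctPCs, std::uint64_t windowLen)
        : maxDistinctPCs_(maxDistinctPCs), windowLen_(windowLen) {}

    // Called once per retired instruction; returns true on a suspected hang.
    bool onRetire(std::uint64_t pc) {
        seen_.insert(pc);
        if (++count_ < windowLen_) return false;
        bool hang = seen_.size() <= maxDistinctPCs_;  // few PCs => tight loop
        seen_.clear();
        count_ = 0;
        return hang;
    }

private:
    std::size_t maxDistinctPCs_;
    std::uint64_t windowLen_;
    std::uint64_t count_ = 0;
    std::unordered_set<std::uint64_t> seen_;
};
```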
Fault Detection w/ SW-assisted Detectors
• Simple HW detectors effective, require no SW changes
• SW-assisted detectors to augment HW detectors
– Minimal changes to SW for more effective detectors
• Amortize resiliency cost with SW bug detection
• Explored two simple SW-assisted schemes
– Detecting out-of-bounds addresses
* Low HW overhead, near-zero impact on performance
– Using likely program invariants
* Instrumented binary, no HW changes
* <5% performance overhead on x86 processors
Fault Detection w/ SW-assisted Detectors
• Address out-of-bounds detector
– Monitor boundaries of heap, stack, globals
– Address beyond these bounds ⇒ HW fault
– HW-only detectors catch such faults only at longer latency
• iSWAT: using likely program invariants to detect HW faults
– Mine “likely” invariants on data values
* E.g., 10 ≤ x ≤ 20
* Hold on observed inputs, expected to hold on others
– Violation of likely invariant ⇒ HW fault
– Useful to detect faults that affect only data
* iSWAT not explored further in this talk [Sahoo et al., DSN ‘08]
(Both checks are sketched in code after the address-space figure below.)
[Figure: application address space — app code, globals, heap, stack, libraries, empty, reserved]
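A minimal C++ sketch of the two SW-assisted checks follows. The interfaces and region bookkeeping are assumptions for illustration; the actual designs live in hardware and in the instrumented binary.

```cpp
#include <cstdint>

// Sketch of the out-of-bounds address check. Region bounds are assumed to
// be reported by malloc, the compiler, and function entry/exit.
struct Region { std::uint64_t lo, hi; };

struct OutOfBoundsDetector {
    Region heap, stack, globals;  // kept up to date by software

    static bool inRegion(const Region& r, std::uint64_t addr) {
        return addr >= r.lo && addr < r.hi;
    }

    // Conceptually consulted on every load/store address: an address
    // outside every valid region is treated as a HW-fault symptom.
    bool isAnomalous(std::uint64_t addr) const {
        return !inRegion(heap, addr) && !inRegion(stack, addr) &&
               !inRegion(globals, addr);
    }
};

// iSWAT-style likely-invariant check inserted by binary instrumentation:
// a mined range invariant such as 10 <= x <= 20 is tested at runtime, and
// a violation is treated as a possible HW fault.
inline bool invariantViolated(std::int64_t x, std::int64_t lo, std::int64_t hi) {
    return x < lo || x > hi;
}
```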
Evaluating Fault Detection
• Microarchitecture-level fault injection (latch elements)
– GEMS timing models + Simics full-system simulation
– All SPEC 2k C/C++ workloads in 64-bit OpenSolaris OS
• Stuck-at, transient faults in 8 µarch units (single fault model)
– 10,000 faults of each type ⇒ statistically significant
• Simulate impact of fault in detail for 10M instructions
[Timeline: fault injected → detailed timing simulation for 10M instructions → if no symptom within 10M instructions, functional simulation runs to completion → outcome classified as masked or silent data corruption (SDC)]
• Metrics: SDC rate, detection latency
SDC Rate of HW-only Detectors
• Simple detectors give 0.7% SDC rate for permanent faults
• Faults in FPU need better detectors
– Mostly corrupt only data ⇒ iSWAT may detect
[Bar chart: outcomes of permanent fault injections per µarch unit — Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU, Total, Total w/o FP; y-axis: % of total injections split into Masked, Detected, SDC]
SDC Rate of HW-only Detectors
• Transient faults also have low SDC rate of 0.3%
• High rate of masking from transients
– Consistent with prior work on transients
[Bar chart: outcomes of transient fault injections per µarch unit — same units and categories as above]
Application-Aware SDC Analysis
• SDCs are undetected faults that corrupt only data values
– SWAT detectors catch other corruptions
– Most faults do not corrupt only data values
• But some “SDCs” are actually acceptable outputs!
– Traditionally, SDC ⇒ output differs from fault-free output
– But different outputs may still be acceptable
* Different solutions, solutions with degraded quality, etc.
* E.g., same-cost place & route, acceptable PSNR, etc.
• SWAT detectors cannot (should not) detect acceptable changes in output
– For each app, define % degradation in output quality
• 10/16 SPEC apps have multiple correct solutions (results for all)
• App-aware analysis ⇒ remarkably low SDC rate for SWAT
– Only 28 faults show >0% degradation from golden output
– 10 of >16,000 injected faults are SDCs at >1% degradation
• Ongoing work: formalization of why/when SWAT works
Application-Aware SDC Analysis
[Bar charts: number of faults classified as SDC vs. output-degradation threshold]
– Permanent faults: SWAT 38; >0%: 9; >0.01%: 5; >0.1%: 4; >1%: 4
– Transient faults: SWAT 24; >0%: 19; >0.01%: 13; >0.1%: 10; >1%: 6
Detection Latency
• Detection latency dictates recoverability
– Fault recoverable as long as fault-free checkpoint exists
• Traditional detection latency = arch state corruption to detection
– Checkpoint records bad arch state ⇒ SW affected
• But not all arch state corruptions affect SW output
• New detection latency = SW state corruption to detection
[Timeline: fault → bad arch state → bad SW state → detection; old latency runs from bad arch state to detection, new latency from bad SW state to detection; checkpoints taken before the bad SW state remain recoverable]
Detection Latency
[Line charts: % of detected faults (y-axis, 40–100%) vs. detection latency in instructions (<10k, <100k, <1M, <10M, >10M), for permanent and transient faults, HW-only detectors]
• >98% of all faults detected within 10M instructions
– Recoverable using HW checkpoint schemes
Detection Latency
[Same charts, adding the SW-assisted detector curve (“Latency w/ SW”) alongside “Latency HW-only”]
• >98% of all faults detected within 10M instructions
– Recoverable using HW checkpoint schemes
• Out-of-bounds detector further reduces detection latency
– Many of the longer-latency detections were address violations
Detection Latency
• Measuring new latency important to study recoverability
– Significant differences between old and new latency
[Same charts, adding the old-latency HW-only curve alongside the new-latency curves]
Fault Detection - Summary
• Simple detectors effective in detecting HW faults
• Low SDC rate even with HW-only detectors
• Short detection latencies for hardware faults
– SW-assisted out-of-bounds detector reduces latency further
– Measuring new detection latency important for recovery
• Next: Diagnosis of detected faults
Fault Diagnosis
• Symptom-based detection is cheap but
– May incur long latency from activation to detection
– Difficult to diagnose root cause of fault
• Goal: Diagnose the fault with minimal hardware overhead
– Rarely invoked ⇒ higher perf overhead acceptable
[Diagram: a detected symptom may stem from a SW bug, a transient fault, or a permanent fault — diagnosis must distinguish them]
SWAT Single-threaded Fault Diagnosis [Li et al., DSN ‘08]
• First, diagnosis for single threaded workload on one core
– Multithreaded w/ multicore later – several new challenges
Key ideas
• Single-core fault model; multicore ⇒ fault-free core available
• Chkpt/replay for recovery ⇒ replay on good core, compare
• Synthesizing DMR, but only for diagnosis
[Diagram: traditional DMR runs P1 and P2 with an always-on comparison ⇒ expensive; SWAT synthesizes DMR only on a fault, pairing the faulty core with a fault-free core for comparison]
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
[Decision tree after symptom detected:]
– Rollback on faulty core: no symptom ⇒ transient or non-deterministic s/w bug ⇒ continue execution
– Symptom reappears ⇒ deterministic s/w or permanent h/w bug ⇒ rollback/replay on good core:
* Symptom ⇒ deterministic s/w bug (send to s/w layer)
* No symptom ⇒ permanent h/w fault, needs repair!
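The decision tree above maps directly to a small piece of control logic. The C++ sketch below abstracts the checkpoint/rollback machinery behind two callbacks; the function names and interface are illustrative assumptions.

```cpp
#include <functional>

// Sketch of the diagnosis decision tree. Each callback rolls back to the
// last checkpoint, re-executes on the named core, and reports whether the
// symptom reappears.
enum class Diagnosis {
    TransientOrNonDetSwBug,  // continue execution
    DeterministicSwBug,      // send to s/w layer
    PermanentHwFault         // needs repair
};

Diagnosis diagnose(const std::function<bool()>& replayOnFaultyCore,
                   const std::function<bool()>& replayOnGoodCore) {
    if (!replayOnFaultyCore())
        return Diagnosis::TransientOrNonDetSwBug;
    if (replayOnGoodCore())
        return Diagnosis::DeterministicSwBug;   // symptom follows the code
    return Diagnosis::PermanentHwFault;         // symptom follows the core
}
```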
µarch-level Fault Diagnosis
[Diagram: symptom detected → diagnosis separates software bug, transient fault, and permanent fault; a permanent fault triggers microarchitecture-level diagnosis that pinpoints “unit X is faulty”]
Trace-Based Fault Diagnosis (TBFD)
• µarch-level fault diagnosis using rollback/replay
• Key: the execution that caused the symptom ⇒ its trace activates the fault
– Deterministically replay trace on faulty, fault-free cores
– Divergence ⇒ faulty hardware used ⇒ diagnosis clues
• Diagnose faults to µarch units of processor
– Check µarch-level invariants in several parts of processor
– Diagnosis in out-of-order logic (meta-datapath) is complex
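A sketch of the divergence search is below. The per-instruction record format is an assumption; the point is that the first mismatch between the faulty-core trace and the fault-free replay implicates the µarch resources the faulty core used at that instruction.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Retired-instruction record compared between faulty and fault-free runs
// (format is illustrative).
struct RetiredInstr {
    std::uint64_t pc;      // instruction address
    std::uint64_t result;  // destination value / store value / branch target
};

inline bool operator==(const RetiredInstr& a, const RetiredInstr& b) {
    return a.pc == b.pc && a.result == b.result;
}

std::optional<std::size_t> firstDivergence(
    const std::vector<RetiredInstr>& faultyTrace,
    const std::vector<RetiredInstr>& goodTrace) {
    std::size_t n = std::min(faultyTrace.size(), goodTrace.size());
    for (std::size_t i = 0; i < n; ++i)
        if (!(faultyTrace[i] == goodTrace[i]))
            return i;  // µarch units used by instruction i become suspects
    return std::nullopt;
}
```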
Trace-Based Fault Diagnosis: Evaluation
• Goal: Diagnose faults at reasonable latency
• Faults diagnosed in 10 SPEC workloads
– ~8500 detected faults (98% of unmasked)
• Results
– 98% of the detections successfully diagnosed
– 91% diagnosed within 1M instr (~0.5ms on 2GHz proc)
SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO ‘09]
• Challenge 1: Deterministic replay involves high overhead
• Challenge 2: Multithreaded apps share data among threads
• Symptom-causing core may not be faulty
• No known fault-free core in system
[Diagram: a fault in core 2 corrupts a value that a store writes to memory; core 1 loads the corrupted value, so the symptom is detected on a fault-free core]
mSWAT Diagnosis - Key Ideas
[Table: challenges → key ideas]
– Multithreaded applications need full-system deterministic replay ⇒ isolated deterministic replay of each thread (capture thread TA’s inputs while threads TA–TD run on cores A–D, then replay TA alone)
– No known good core ⇒ emulated TMR
mSWAT Diagnosis - Key Ideas
[Diagram: emulated TMR — the captured threads TA–TD, originally on cores A–D, are re-executed with the thread-to-core mapping rotated (TD TA TB TC, then TC TD TA TB), so every thread runs on multiple cores]
mSWAT Diagnosis: Evaluation
• Diagnose detected perm faults in multithreaded apps
– Goal: Identify faulty core, TBFD for µarch-level diagnosis
– Challenges: Non-determinism, no fault-free core known
– ~4% of faults detected on a fault-free core
• Results
– 95% of detected faults diagnosed
* All detections from a fault-free core diagnosed
– 96% of diagnosed faults require <200KB buffers
* Can be stored in lower-level cache ⇒ low HW overhead
• SWAT diagnosis can work with other symptom detectors
SWAT Recovery
• Recovery masks effect of fault for continuous operation
• Checkpointing is “always-on” ⇒ must incur minimal overhead
– Low area overhead, minimal performance impact
• SWAT symptom detection assumes checkpoint recovery
– Fault allowed to corrupt the architecture state
[Diagram: recovery = checkpoint/replay (rollback to pristine state, re-execute) + I/O buffering (prevent irreversible effects)]
Components of Recovery
• Checkpointing
– Periodic snapshot of registers, undo log for memory
– Restore register/memory state upon detection
• I/O buffering
– External outputs buffered until known to be fault-free
– HW buffer to record I/O until next checkpoint interval
[Diagram: register snapshots 1–3 taken at successive checkpoints; memory undo logs 1–2 record the old value on each store; I/O buffer 1 holds device I/O until it is committed at the next checkpoint]
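A minimal C++ sketch of the undo-log idea follows. It is an illustration of the mechanism described above, not ReVive’s implementation; memory is modeled as an address-to-value map.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Undo-log memory checkpointing sketch: the first store to each location
// in an interval logs the old value; rollback replays the log in reverse.
class UndoLogCheckpoint {
public:
    void beginInterval() {  // also where registers would be snapshotted
        log_.clear();
        logged_.clear();
    }

    // Intercepts every store: log the old value once per location.
    void onStore(std::unordered_map<std::uint64_t, std::uint64_t>& mem,
                 std::uint64_t addr, std::uint64_t newVal) {
        if (logged_.insert(addr).second)
            log_.push_back({addr, mem[addr]});
        mem[addr] = newVal;
    }

    // On symptom detection: undo all stores since the checkpoint.
    void rollback(std::unordered_map<std::uint64_t, std::uint64_t>& mem) const {
        for (auto it = log_.rbegin(); it != log_.rend(); ++it)
            mem[it->addr] = it->oldVal;
    }

private:
    struct Entry { std::uint64_t addr, oldVal; };
    std::vector<Entry> log_;
    std::unordered_set<std::uint64_t> logged_;
};
```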
Analysis of Recovery Overheads [Led by Alex Li]
• Goal: Measure overheads from checkpointing, I/O buffering
– Measured on 2 server applications – apache, sshd
– ReVive for chkpt, several techniques for I/O buffering
• State-of-the-art incurs high overhead at short chkpt intervals
– >30% performance overhead at interval of <1M cycles!
• Long chkpt interval ⇒ I/O buffering incurs high HW overhead
– Checkpoint intervals of 10M ⇒ HW buffer of 100KB
• Push and pull effect between recovery components
• Ongoing work: SWAT recovery module with low overheads
Summary: SWAT works!
[Framework diagram annotated with contributions: very low-cost detectors (ASPLOS’08, DSN’08) ⇒ low SDC rate, latency; application-aware SWAT ⇒ even lower SDC, latency; in-situ diagnosis (DSN’08); accurate fault modeling (HPCA’09); multithreaded workloads (MICRO’09)]
SWAT Advantages and Limitations
• Advantages
– Handles all faults that matter, oblivious to failure modes
– Low, amortized overheads across HW/SW reliability
– Customizable and flexible due to firmware implementation
– Concepts applicable beyond hardware reliability
• Limitations
– SWAT reliability guarantees largely empirical
– SWAT firmware, recovery module not yet ready
– Off-core faults, other fault models not evaluated
Future Work
• Formalization of when/why SWAT works
• Near zero cost recovery
• More server/distributed applications
• Other core and off-core parts, other fault models
• Prototyping SWAT on FPGA
– With T. Austin/ V. Bertacco at University of Michigan
SWAT: Designing Resilient Hardware by
Treating Software Anomalies
Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
Backup Slides
Address Out-of-Bounds Detector
• Address faults may result in long detection latencies
– Corrupt address unallocated but in valid page
– Many data value corruptions before symptom
• Low-cost address out-of-bounds detector
– Compiler tells hardware the static bounds
– Malloc reports heap bounds to HW
– Stack limits recorded on function execution
• Can amortize cost across software bug detectors
[Figure: application address space from 0x0 through 0x100000000 to 0xffff… (2^64−1) — app code, globals, heap, stack, libraries, empty, reserved]
Permanent Faults: HW-only Detectors
• Fatal Traps and Panics detect most faults
• Large fraction of detections by symptoms from OS
[Bar chart: detection outcomes of permanent fault injections per µarch unit — Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU, Total, Total w/o FP; y-axis: % of total injections split into Masked, Fatal Trap, Hangs, Panic, App-Abort, High-OS, Symptom >10M, SDC Accept, SDC Unaccept]
Measuring Detection Latency
• New detection latency = SW state corruption to detection
• But identifying SW state corruption is hard!
– Need to know how faulty value used by application
– If faulty value affects output, then SW state corrupted
• Measure latency by rolling back to older checkpoints
– Only for analysis, not required in real system
[Timeline: fault → bad arch state → bad SW state → symptom detection; the analysis rolls back to successively older checkpoints and replays until the fault effect is masked, bounding the new latency]
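The measurement loop can be sketched as follows. This is analysis-only code under assumed interfaces (checkpoint handles and a replay predicate are hypothetical): walking from the newest checkpoint backward, a replay that still reproduces the symptom means SW state was already corrupt at that checkpoint.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Offline detection-latency measurement sketch: find the oldest checkpoint
// whose replay still reproduces the symptom; the new latency is measured
// from that point to the detection.
std::size_t oldestCorruptCheckpoint(
    const std::vector<std::size_t>& checkpoints,  // ordered newest -> oldest
    const std::function<bool(std::size_t)>& replayShowsSymptom) {
    std::size_t oldestBad = checkpoints.front();
    for (std::size_t ck : checkpoints) {
        if (!replayShowsSymptom(ck))
            break;        // fault effect masked: SW state still clean here
        oldestBad = ck;   // SW state already corrupted at this checkpoint
    }
    return oldestBad;
}
```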
Extending SWAT Diagnosis to Multithreaded Apps
• Naïve extension – N known good cores to replay the trace
– Too expensive in area
– Requires full-system deterministic replay
• Simple optimization – one spare core
– Not scalable: requires N full-system deterministic replays
– Requires a spare core
– Single point of failure
[Diagram: spare core S replays each core’s execution in turn; the symptom reappears for C1 and C3 but not for C2 ⇒ faulty core is C2]
mSWAT Fault Diagnosis Algorithm
[Flow: symptom detected → capture fault-activating trace → re-execute captured trace → look for divergence → faulty core]
[Example: threads TA–TD run on cores A–D while their traces are captured; each trace is then replayed with the mapping rotated (TD TA TB TC). TA’s replay diverges on core B; replaying TA on yet another core shows no divergence, so the divergence came from core B ⇒ faulty core is B]
Recording Fault Activating Trace
[Flow: symptom detected → capture fault-activating trace → deterministic isolated replay → look for divergence → faulty core]
• What info to capture for deterministic isolated replay?
• Capture all inputs to thread as trace
– Record data values of all loads
• Ensures isolated deterministic replay
– Isolated replay ⇒ lower overhead
Comparing Deterministic Replays
[Flow: symptom detected → capture fault-activating trace → deterministic isolated replay → look for divergence → faulty core]
• How to identify divergence?
• Comparing all instructions ⇒ large buffer needed
• Faults propagate to SW through branch, load, store instructions
– Other faults manifest in these instructions
• Record, compare only these instructions
– Lower HW buffer overhead
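The two slides above combine into a small data structure: load values are the thread’s only external inputs (enabling isolated replay), while branch/load/store outcomes are the only records compared. The C++ sketch below uses an assumed record layout for illustration.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Per-thread trace sketch for mSWAT-style isolated replay and comparison.
struct Outcome { std::uint64_t pc; std::uint64_t value; };

inline bool operator==(const Outcome& a, const Outcome& b) {
    return a.pc == b.pc && a.value == b.value;
}

struct ThreadTrace {
    std::deque<std::uint64_t> loadValues;  // inputs: make replay deterministic
    std::vector<Outcome> outcomes;         // outputs: compared after replay
};

// Replay-side load: consume the recorded value instead of touching shared
// memory, isolating the thread from the rest of the system.
std::uint64_t replayLoad(ThreadTrace& t) {
    std::uint64_t v = t.loadValues.front();
    t.loadValues.pop_front();
    return v;
}

// A replay that fails to reproduce the recorded branch/load/store outcomes
// implicates the hardware used in the native run or in the replay.
bool diverged(const ThreadTrace& native, const ThreadTrace& replay) {
    return !(native.outcomes == replay.outcomes);
}
```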
mSWAT Diagnosis: Hardware Cost
• Trace captured in native execution
– HW support for trace collection
• Deterministic replay is firmware emulated
– Requires minimal hardware support
– Replay threads in isolation ⇒ no need to capture memory orderings
• Long detection latency ⇒ large trace buffers (8MB/core)
– Iterative diagnosis algorithm reduces buffer overhead
[Flow: the capture/replay/divergence loop is repeatedly executed on short traces, e.g. 100,000 instructions each]
Results: mSWAT Fault Diagnosis
• Over 95% of detected faults are successfully diagnosed
• All faults detected on a fault-free core are diagnosed
[Bar chart: % of detected faults correctly diagnosed vs. undiagnosed per faulty unit — Decoder 99, INT ALU 100, Reg Dbus 99, Int reg 87, ROB 100, RAT 78, AGEN 99; average 95.9]
Trace-Based Fault Diagnosis (TBFD)
[Flow: permanent fault detected → invoke TBFD → roll back the faulty core to a checkpoint and replay execution, collecting µarch info (faulty trace); load the same checkpoint on a fault-free core for fault-free instruction execution (test trace); the diagnosis algorithm compares the two traces, synchronizing state and acting on each divergence]
• Key questions: what info to collect? what info to compare? what to do on divergence?
SWAT Recovery
Recovery: Mask effect of error for continuous operation
• Is I/O buffering required?
– Previous software solutions vulnerable to hardware faults
• Overheads for I/O buffering and checkpointing?
– I/O buffering interval > 2 * detection latency
– Chkpt interval * # of chkpts > 2 * detection latency
[Diagram: recovery = checkpoint/replay (rollback to pristine state, re-execute) + I/O buffering (prevent irreversible effects)]
What to Checkpoint and Buffer?
[Diagram: CPU, memory, and devices (SCSI controller, host-PCI bridge, console, network); device-to-memory writes are part of the checkpointed state, while CPU-to-device writes go through an output buffer]
• Schemes compared: Proc+Mem, Proc+AllMem, FullSystem, Proc+AllMem+OutBuf
What to Checkpoint, Buffer: Methodology
• Buffer requirements of I/O-intensive workloads
– Apache, SSH daemon serving client requests on network
• Server daemon serving multiple client threads
– Fault injection and detection only at server
* SDC rate similar to single-threaded; focus on recovery
• After detection, rollback to checkpoint, replay without fault
– Outcomes: recoverable, Detected Unrecoverable Error (DUE), SDC
[Diagram: Simics simulates a server stack (application/OS/hardware, with the fault injected in hardware) connected through a simulated network to a client stack]
What to Checkpoint, Buffer?
• Need output buffering for full recovery!
[Bar chart: % of detected faults that are Recovered / DUE / SDC per scheme — Proc+Mem 69%, Proc+AllMem 70%, FullSystem 71%, Proc+AllMem+OutBuf 99% recovered]
How Much Output Buffering?
• Monitored CPU-to-device writes for different buffering intervals
• Interval of 10M instructions needs >100KB of output buffer
• Interval of 10K needs <1KB of buffer!
[Log-log plot: output-buffer size in bytes (10 to 1,000,000) vs. buffering interval in instructions (10k to 100M) for apache, sshd, parser, mcf]
How Much Checkpoint Overhead?
• Used state of the art: ReVive
– Effect of different cache sizes, intervals unknown
• Methodology
– 16-core multicore with shared L2
– 4 SPLASH parallel apps (worst case for original ReVive)
– Vary L2 cache size between 256KB and 2048KB
– Vary checkpoint interval between 500K and 50M cycles
Overhead of Hardware Checkpointing
• Intervals, cache sizes have large impact on performance
• <5M-cycle checkpoint intervals can have unacceptable overheads
– But longer intervals require a 100KB output buffer!
– Need cheaper checkpoint mechanisms
[Plot (Ocean): slowdown (0.9–1.35×) vs. checkpoint interval in cycles (100k–100M) for L2 sizes 256KB, 512KB, 1024KB, 2048KB]
SwatSim: Fast and Accurate Fault Modeling
• Need for accurate µarch-level fault models
– Must be able to observe system level effects
– µarch (latch) level injections fast but inaccurate
– Gate level injections accurate but too slow
• Can we achieve µarch-level speed at gate-level accuracy?
• SwatSim - Gate-level accuracy at µarch-level speeds
– Simulate mostly at µarch level
– Simulate only faulty component at gate-level, on-demand
– Permanent faults require back-and-forth between levels
SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Flow: the µarch simulation executes r3 ← r1 op r2; if the faulty unit is not used, µarch simulation simply continues; if it is, the inputs are sent as stimuli to an on-demand gate-level fault simulation, and the response, with the fault propagated to the output, feeds the result (r3) back into the µarch simulation]
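The dispatch at the heart of this flow can be sketched in a few lines of C++. The interface is an assumption for illustration; in SwatSim the gate-level side is a Verilog netlist simulation, not a C++ callback.

```cpp
#include <cstdint>
#include <functional>

// Mixed-level dispatch sketch: most operations run at fast µarch-level
// fidelity; an operation that uses the faulty unit is forwarded, on
// demand, to a gate-level model of just that unit with the fault injected.
struct MixedLevelALU {
    bool isFaultyUnit;  // was the permanent fault injected into this unit?
    std::function<std::uint64_t(std::uint64_t, std::uint64_t)> gateLevelSim;

    std::uint64_t execute(std::uint64_t r1, std::uint64_t r2) const {
        if (!isFaultyUnit)
            return r1 + r2;          // fast µarch-level behavior (e.g., add)
        return gateLevelSim(r1, r2); // stimuli in, faulty response out
    }
};
```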
SwatSim Results
• SwatSim implemented within full-system simulation
– GEMS+Simics for µarch and functional simulations
– NCVerilog + VPI for gate-level ALU, AGEN
• Performance overhead
– 100,000X faster than gate-level full-processor simulation
– 2X slowdown over µarch-level simulation
• Accuracy of µarch fault models using SWAT coverage/latency
– Compared µarch stuck-at with SwatSim stuck-at, delay faults
– µarch fault models generally inaccurate
* Accuracy varies depending on structure, fault model
* Differences in activation rate, multi-bit flips
• Unsuccessful attempts to derive more accurate µarch fault models
⇒ Need SwatSim, at least for now