SWAT: Designing Resilient Hardware by
Treating Software Anomalies
Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
Motivation
• Hardware will fail in the field for several reasons
– Transient errors (high-energy particles), wear-out (weaker devices), design bugs, and so on
⇒ Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy solutions (e.g., nMR) too expensive
⇒ Need low-cost solutions for multiple failure sources
* Must incur low area, performance, power overhead
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
– Zero to low overhead “always-on” monitors
⇒ Diagnose cause after symptom detected
– May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
SWAT Framework Components
• Detection: Symptoms of software misbehavior
• Recovery: Checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, bounded by checkpoints]
Advantages of SWAT
• Handles all faults that matter
– Oblivious to low-level failure modes and masked faults
• Low, amortized overheads
– Optimize for common case, exploit SW reliability solutions
• Customizable and flexible
– Firmware control adapts to specific reliability needs
• Holistic systems view enables novel solutions
– Synergistic detection, diagnosis, recovery solutions
• Beyond hardware reliability
– Long-term goal: unified system (HW+SW) reliability
– Potential application to post-silicon test and debug
SWAT Contributions
[Framework diagram annotated with contributions: very low-cost detectors (ASPLOS’08, DSN’08) ⇒ low SDC rate, latency; application-aware SWAT ⇒ even lower SDC, latency; in-situ diagnosis (DSN’08); accurate fault modeling (HPCA’09); multithreaded workloads (MICRO’09)]
This Talk
[Same annotated framework diagram, highlighting the contributions covered in this talk]
Outline
• Introduction to SWAT
• SWAT Detection
• SWAT Diagnosis
• Analysis of Recovery in SWAT
• Conclusions
• Future work
Fault Detection w/ HW Detectors [ASPLOS ’08]
• Simple HW-only detectors to observe anomalous SW behavior
– Minimal hardware area ⇒ low-cost detectors
– Incur near-zero perf overhead in fault-free operation
– Require no changes to SW
[Detectors, coordinated by SWAT firmware:]
– Fatal traps: division by zero, RED state, etc.
– Kernel panic: OS enters panic state due to fault
– High OS: high contiguous OS activity
– Hangs: simple HW hang detector
– App abort: application abort due to fault
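For concreteness, below is a minimal C++ sketch of one such detector: a heuristic hang detector that flags execution stuck in a tight loop. The window length, distinct-PC threshold, and interface are illustrative assumptions, not the paper’s hardware design.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Heuristic hang detector sketch: if a long window of retired instructions
// touches only a handful of distinct PCs, execution is likely stuck in a
// tight loop and a hang symptom is raised. Thresholds are illustrative.
class HangDetector {
public:
    HangDetector(std::size_t maxDistinctPCs, std::uint64_t windowLen)
        : maxDistinctPCs_(maxDistinctPCs), windowLen_(windowLen) {}

    // Called once per retired instruction; returns true on a suspected hang.
    bool onRetire(std::uint64_t pc) {
        seen_.insert(pc);
        if (++count_ < windowLen_) return false;
        bool hang = seen_.size() <= maxDistinctPCs_;  // few PCs => tight loop
        seen_.clear();
        count_ = 0;
        return hang;
    }

private:
    std::size_t maxDistinctPCs_;
    std::uint64_t windowLen_;
    std::uint64_t count_ = 0;
    std::unordered_set<std::uint64_t> seen_;
};
```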
Fault Detection w/ SW-assisted Detectors
• Simple HW detectors effective, require no SW changes
• SW-assisted detectors to augment HW detectors
– Minimal changes to SW for more effective detectors
• Amortize resiliency cost with SW bug detection
• Explored two simple SW-assisted schemes
– Detecting out-of-bounds addresses
* Low HW overhead, near-zero impact on performance
– Using likely program invariants
* Instrumented binary, no HW changes
* <5% performance overhead on x86 processors
Fault Detection w/ SW-assisted Detectors
• Address out-of-bounds detector
– Monitor boundaries of heap, stack, globals
– Address beyond these bounds ⇒ HW fault
– HW-only detectors catch such faults only at longer latency
• iSWAT: using likely program invariants to detect HW faults
– Mine “likely” invariants on data values
* E.g., 10 ≤ x ≤ 20
* Hold on observed inputs, expected to hold on others
– Violation of likely invariant ⇒ HW fault
– Useful to detect faults that affect only data
* iSWAT not explored further in this talk [Sahoo et al., DSN ‘08]
(Both checks are sketched in code after the address-space figure below.)
[Figure: application address space — app code, globals, heap, stack, libraries, empty, reserved]
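A minimal C++ sketch of the two SW-assisted checks follows. The interfaces and region bookkeeping are assumptions for illustration; the actual designs live in hardware and in the instrumented binary.

```cpp
#include <cstdint>

// Sketch of the out-of-bounds address check. Region bounds are assumed to
// be reported by malloc, the compiler, and function entry/exit.
struct Region { std::uint64_t lo, hi; };

struct OutOfBoundsDetector {
    Region heap, stack, globals;  // kept up to date by software

    static bool inRegion(const Region& r, std::uint64_t addr) {
        return addr >= r.lo && addr < r.hi;
    }

    // Conceptually consulted on every load/store address: an address
    // outside every valid region is treated as a HW-fault symptom.
    bool isAnomalous(std::uint64_t addr) const {
        return !inRegion(heap, addr) && !inRegion(stack, addr) &&
               !inRegion(globals, addr);
    }
};

// iSWAT-style likely-invariant check inserted by binary instrumentation:
// a mined range invariant such as 10 <= x <= 20 is tested at runtime, and
// a violation is treated as a possible HW fault.
inline bool invariantViolated(std::int64_t x, std::int64_t lo, std::int64_t hi) {
    return x < lo || x > hi;
}
```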
Evaluating Fault Detection
• Microarchitecture-level fault injection (latch elements)
– GEMS timing models + Simics full-system simulation
– All SPEC 2k C/C++ workloads in 64-bit OpenSolaris OS
• Stuck-at, transient faults in 8 µarch units (single fault model)
– 10,000 faults of each type ⇒ statistically significant
• Simulate impact of fault in detail for 10M instructions
[Timeline: fault injected → detailed timing simulation for 10M instructions → if no symptom within 10M instructions, functional simulation runs to completion → outcome classified as masked or silent data corruption (SDC)]
• Metrics: SDC rate, detection latency
SDC Rate of HW-only Detectors
• Simple detectors give 0.7% SDC rate for permanent faults
• Faults in FPU need better detectors
– Mostly corrupt only data ⇒ iSWAT may detect
[Bar chart: outcomes of permanent fault injections per µarch unit — Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU, Total, Total w/o FP; y-axis: % of total injections split into Masked, Detected, SDC]
SDC Rate of HW-only Detectors
• Transient faults also have low SDC rate of 0.3%
• High rate of masking from transients
– Consistent with prior work on transients
[Bar chart: outcomes of transient fault injections per µarch unit — same units and categories as above]
Application-Aware SDC Analysis
• SDCs are undetected faults that corrupt only data values
– SWAT detectors catch other corruptions
– Most faults do not corrupt only data values
• But some “SDCs” are actually acceptable outputs!
– Traditionally, SDC ⇒ output differs from fault-free output
– But different outputs may still be acceptable
* Different solutions, solutions with degraded quality, etc.
* E.g., same-cost place & route, acceptable PSNR, etc.
• SWAT detectors cannot (should not) detect acceptable changes in output
– For each app, define % degradation in output quality
• 10/16 SPEC apps have multiple correct solutions (results for all)
• App-aware analysis ⇒ remarkably low SDC rate for SWAT
– Only 28 faults show >0% degradation from golden output
– 10 of >16,000 injected faults are SDCs at >1% degradation
• Ongoing work: formalization of why/when SWAT works
Application-Aware SDC Analysis
[Bar charts: number of faults classified as SDC vs. output-degradation threshold]
– Permanent faults: SWAT 38; >0%: 9; >0.01%: 5; >0.1%: 4; >1%: 4
– Transient faults: SWAT 24; >0%: 19; >0.01%: 13; >0.1%: 10; >1%: 6
Detection Latency
• Detection latency dictates recoverability
– Fault recoverable as long as fault-free checkpoint exists
• Traditional detection latency = arch state corruption to detection
– Checkpoint records bad arch state ⇒ SW affected
• But not all arch state corruptions affect SW output
• New detection latency = SW state corruption to detection
[Timeline: fault → bad arch state → bad SW state → detection; old latency runs from bad arch state to detection, new latency from bad SW state to detection; checkpoints taken before the bad SW state remain recoverable]
Detection Latency
[Line charts: % of detected faults (y-axis, 40–100%) vs. detection latency in instructions (<10k, <100k, <1M, <10M, >10M), for permanent and transient faults, HW-only detectors]
• >98% of all faults detected within 10M instructions
– Recoverable using HW checkpoint schemes
Detection Latency
[Same charts, adding the SW-assisted detector curve (“Latency w/ SW”) alongside “Latency HW-only”]
• >98% of all faults detected within 10M instructions
– Recoverable using HW checkpoint schemes
• Out-of-bounds detector further reduces detection latency
– Many of the longer-latency detections were address violations
Detection Latency
• Measuring new latency important to study recoverability
– Significant differences between old and new latency
[Same charts, adding the old-latency HW-only curve alongside the new-latency curves]
Fault Detection - Summary
• Simple detectors effective in detecting HW faults
• Low SDC rate even with HW-only detectors
• Short detection latencies for hardware faults
– SW-assisted out-of-bounds detector reduces latency further
– Measuring new detection latency important for recovery
• Next: Diagnosis of detected faults
Fault Diagnosis
• Symptom-based detection is cheap but
– May incur long latency from activation to detection
– Difficult to diagnose root cause of fault
• Goal: Diagnose the fault with minimal hardware overhead
– Rarely invoked ⇒ higher perf overhead acceptable
[Diagram: a detected symptom may stem from a SW bug, a transient fault, or a permanent fault — diagnosis must distinguish them]
SWAT Single-threaded Fault Diagnosis [Li et al., DSN ‘08]
• First, diagnosis for single threaded workload on one core
– Multithreaded w/ multicore later – several new challenges
Key ideas
• Single-core fault model; multicore ⇒ fault-free core available
• Chkpt/replay for recovery ⇒ replay on good core, compare
• Synthesizing DMR, but only for diagnosis
[Diagram: traditional DMR runs P1 and P2 with an always-on comparison ⇒ expensive; SWAT synthesizes DMR only on a fault, pairing the faulty core with a fault-free core for comparison]
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
[Decision tree after symptom detected:]
– Rollback on faulty core: no symptom ⇒ transient or non-deterministic s/w bug ⇒ continue execution
– Symptom reappears ⇒ deterministic s/w or permanent h/w bug ⇒ rollback/replay on good core:
* Symptom ⇒ deterministic s/w bug (send to s/w layer)
* No symptom ⇒ permanent h/w fault, needs repair!
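The decision tree above maps directly to a small piece of control logic. The C++ sketch below abstracts the checkpoint/rollback machinery behind two callbacks; the function names and interface are illustrative assumptions.

```cpp
#include <functional>

// Sketch of the diagnosis decision tree. Each callback rolls back to the
// last checkpoint, re-executes on the named core, and reports whether the
// symptom reappears.
enum class Diagnosis {
    TransientOrNonDetSwBug,  // continue execution
    DeterministicSwBug,      // send to s/w layer
    PermanentHwFault         // needs repair
};

Diagnosis diagnose(const std::function<bool()>& replayOnFaultyCore,
                   const std::function<bool()>& replayOnGoodCore) {
    if (!replayOnFaultyCore())
        return Diagnosis::TransientOrNonDetSwBug;
    if (replayOnGoodCore())
        return Diagnosis::DeterministicSwBug;   // symptom follows the code
    return Diagnosis::PermanentHwFault;         // symptom follows the core
}
```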
µarch-level Fault Diagnosis
[Diagram: symptom detected → diagnosis separates software bug, transient fault, and permanent fault; a permanent fault triggers microarchitecture-level diagnosis that pinpoints “unit X is faulty”]
Trace-Based Fault Diagnosis (TBFD)
• µarch-level fault diagnosis using rollback/replay
• Key: the execution that caused the symptom ⇒ its trace activates the fault
– Deterministically replay trace on faulty, fault-free cores
– Divergence ⇒ faulty hardware used ⇒ diagnosis clues
• Diagnose faults to µarch units of processor
– Check µarch-level invariants in several parts of processor
– Diagnosis in out-of-order logic (meta-datapath) is complex
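A sketch of the divergence search is below. The per-instruction record format is an assumption; the point is that the first mismatch between the faulty-core trace and the fault-free replay implicates the µarch resources the faulty core used at that instruction.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Retired-instruction record compared between faulty and fault-free runs
// (format is illustrative).
struct RetiredInstr {
    std::uint64_t pc;      // instruction address
    std::uint64_t result;  // destination value / store value / branch target
};

inline bool operator==(const RetiredInstr& a, const RetiredInstr& b) {
    return a.pc == b.pc && a.result == b.result;
}

std::optional<std::size_t> firstDivergence(
    const std::vector<RetiredInstr>& faultyTrace,
    const std::vector<RetiredInstr>& goodTrace) {
    std::size_t n = std::min(faultyTrace.size(), goodTrace.size());
    for (std::size_t i = 0; i < n; ++i)
        if (!(faultyTrace[i] == goodTrace[i]))
            return i;  // µarch units used by instruction i become suspects
    return std::nullopt;
}
```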
Trace-Based Fault Diagnosis: Evaluation
• Goal: Diagnose faults at reasonable latency
• Faults diagnosed in 10 SPEC workloads
– ~8500 detected faults (98% of unmasked)
• Results
– 98% of the detections successfully diagnosed
– 91% diagnosed within 1M instr (~0.5ms on 2GHz proc)
SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO ‘09]
• Challenge 1: Deterministic replay involves high overhead
• Challenge 2: Multithreaded apps share data among threads
• Symptom-causing core may not be faulty
• No known fault-free core in system
[Diagram: a fault in core 2 corrupts a value that a store writes to memory; core 1 loads the corrupted value, so the symptom is detected on a fault-free core]
mSWAT Diagnosis - Key Ideas
[Table: challenges → key ideas]
– Multithreaded applications need full-system deterministic replay ⇒ isolated deterministic replay of each thread (capture thread TA’s inputs while threads TA–TD run on cores A–D, then replay TA alone)
– No known good core ⇒ emulated TMR
mSWAT Diagnosis - Key Ideas
[Diagram: emulated TMR — the captured threads TA–TD, originally on cores A–D, are re-executed with the thread-to-core mapping rotated (TD TA TB TC, then TC TD TA TB), so every thread runs on multiple cores]
mSWAT Diagnosis: Evaluation
• Diagnose detected perm faults in multithreaded apps
– Goal: Identify faulty core, TBFD for µarch-level diagnosis
– Challenges: Non-determinism, no fault-free core known
– ~4% of faults detected on a fault-free core
• Results
– 95% of detected faults diagnosed
* All detections from a fault-free core diagnosed
– 96% of diagnosed faults require <200KB buffers
* Can be stored in lower-level cache ⇒ low HW overhead
• SWAT diagnosis can work with other symptom detectors
SWAT Recovery
• Recovery masks effect of fault for continuous operation
• Checkpointing is “always-on” ⇒ must incur minimal overhead
– Low area overhead, minimal performance impact
• SWAT symptom detection assumes checkpoint recovery
– Fault allowed to corrupt the architecture state
[Diagram: recovery = checkpoint/replay (rollback to pristine state, re-execute) + I/O buffering (prevent irreversible effects)]
Components of Recovery
• Checkpointing
– Periodic snapshot of registers, undo log for memory
– Restore register/memory state upon detection
• I/O buffering
– External outputs buffered until known to be fault-free
– HW buffer to record I/O until next checkpoint interval
[Diagram: register snapshots 1–3 taken at successive checkpoints; memory undo logs 1–2 record the old value on each store; I/O buffer 1 holds device I/O until it is committed at the next checkpoint]
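A minimal C++ sketch of the undo-log idea follows. It is an illustration of the mechanism described above, not ReVive’s implementation; memory is modeled as an address-to-value map.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Undo-log memory checkpointing sketch: the first store to each location
// in an interval logs the old value; rollback replays the log in reverse.
class UndoLogCheckpoint {
public:
    void beginInterval() {  // also where registers would be snapshotted
        log_.clear();
        logged_.clear();
    }

    // Intercepts every store: log the old value once per location.
    void onStore(std::unordered_map<std::uint64_t, std::uint64_t>& mem,
                 std::uint64_t addr, std::uint64_t newVal) {
        if (logged_.insert(addr).second)
            log_.push_back({addr, mem[addr]});
        mem[addr] = newVal;
    }

    // On symptom detection: undo all stores since the checkpoint.
    void rollback(std::unordered_map<std::uint64_t, std::uint64_t>& mem) const {
        for (auto it = log_.rbegin(); it != log_.rend(); ++it)
            mem[it->addr] = it->oldVal;
    }

private:
    struct Entry { std::uint64_t addr, oldVal; };
    std::vector<Entry> log_;
    std::unordered_set<std::uint64_t> logged_;
};
```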
Analysis of Recovery Overheads [Led by Alex Li]
• Goal: Measure overheads from checkpointing, I/O buffering
– Measured on 2 server applications – apache, sshd
– ReVive for chkpt, several techniques for I/O buffering
• State-of-the-art incurs high overhead at short chkpt intervals
– >30% performance overhead at interval of <1M cycles!
• Long chkpt interval ⇒ I/O buffering incurs high HW overhead
– Checkpoint intervals of 10M ⇒ HW buffer of 100KB
• Push and pull effect between recovery components
• Ongoing work: SWAT recovery module with low overheads
Summary: SWAT works!
[Framework diagram annotated with contributions: very low-cost detectors (ASPLOS’08, DSN’08) ⇒ low SDC rate, latency; application-aware SWAT ⇒ even lower SDC, latency; in-situ diagnosis (DSN’08); accurate fault modeling (HPCA’09); multithreaded workloads (MICRO’09)]
SWAT Advantages and Limitations
• Advantages
– Handles all faults that matter, oblivious to failure modes
– Low, amortized overheads across HW/SW reliability
– Customizable and flexible due to firmware implementation
– Concepts applicable beyond hardware reliability
• Limitations
– SWAT reliability guarantees largely empirical
– SWAT firmware, recovery module not yet ready
– Off-core faults, other fault models not evaluated
Future Work
• Formalization of when/why SWAT works
• Near zero cost recovery
• More server/distributed applications
• Other core and off-core parts, other fault models
• Prototyping SWAT on FPGA
– With T. Austin/ V. Bertacco at University of Michigan
SWAT: Designing Resilient Hardware by
Treating Software Anomalies
Byn Choi, Siva Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Sarita Adve, Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
Backup Slides
Address Out-of-Bounds Detector
• Address faults may result in long detection latencies
– Corrupt address unallocated but in valid page
– Many data value corruptions before symptom
• Low-cost address out-of-bounds detector
– Compiler tells hardware the static bounds
– Malloc reports heap bounds to HW
– Stack limits recorded on function execution
• Can amortize cost across software bug detectors
[Figure: application address space from 0x0 through 0x100000000 to 0xffff… (2^64−1) — app code, globals, heap, stack, libraries, empty, reserved]
Permanent Faults: HW-only Detectors
• Fatal Traps and Panics detect most faults
• Large fraction of detections by symptoms from OS
[Bar chart: detection outcomes of permanent fault injections per µarch unit — Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU, Total, Total w/o FP; y-axis: % of total injections split into Masked, Fatal Trap, Hangs, Panic, App-Abort, High-OS, Symptom >10M, SDC Accept, SDC Unaccept]
Measuring Detection Latency
• New detection latency = SW state corruption to detection
• But identifying SW state corruption is hard!
– Need to know how faulty value used by application
– If faulty value affects output, then SW state corrupted
• Measure latency by rolling back to older checkpoints
– Only for analysis, not required in real system
[Timeline: fault → bad arch state → bad SW state → symptom detection; the analysis rolls back to successively older checkpoints and replays until the fault effect is masked, bounding the new latency]
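The measurement loop can be sketched as follows. This is analysis-only code under assumed interfaces (checkpoint handles and a replay predicate are hypothetical): walking from the newest checkpoint backward, a replay that still reproduces the symptom means SW state was already corrupt at that checkpoint.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Offline detection-latency measurement sketch: find the oldest checkpoint
// whose replay still reproduces the symptom; the new latency is measured
// from that point to the detection.
std::size_t oldestCorruptCheckpoint(
    const std::vector<std::size_t>& checkpoints,  // ordered newest -> oldest
    const std::function<bool(std::size_t)>& replayShowsSymptom) {
    std::size_t oldestBad = checkpoints.front();
    for (std::size_t ck : checkpoints) {
        if (!replayShowsSymptom(ck))
            break;        // fault effect masked: SW state still clean here
        oldestBad = ck;   // SW state already corrupted at this checkpoint
    }
    return oldestBad;
}
```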
Extending SWAT Diagnosis to Multithreaded Apps
• Naïve extension – N known good cores to replay the trace
– Too expensive in area
– Requires full-system deterministic replay
• Simple optimization – one spare core
– Not scalable: requires N full-system deterministic replays
– Requires a spare core
– Single point of failure
[Diagram: spare core S replays each core’s execution in turn; the symptom reappears for C1 and C3 but not for C2 ⇒ faulty core is C2]
mSWAT Fault Diagnosis Algorithm
[Flow: symptom detected → capture fault-activating trace → re-execute captured trace → look for divergence → faulty core]
[Example: threads TA–TD run on cores A–D while their traces are captured; each trace is then replayed with the mapping rotated (TD TA TB TC). TA’s replay diverges on core B; replaying TA on yet another core shows no divergence, so the divergence came from core B ⇒ faulty core is B]
Recording Fault Activating Trace
[Flow: symptom detected → capture fault-activating trace → deterministic isolated replay → look for divergence → faulty core]
• What info to capture for deterministic isolated replay?
• Capture all inputs to thread as trace
– Record data values of all loads
• Ensures isolated deterministic replay
– Isolated replay ⇒ lower overhead
Comparing Deterministic Replays
[Flow: symptom detected → capture fault-activating trace → deterministic isolated replay → look for divergence → faulty core]
• How to identify divergence?
• Comparing all instructions ⇒ large buffer needed
• Faults propagate to SW through branch, load, store instructions
– Other faults manifest in these instructions
• Record, compare only these instructions
– Lower HW buffer overhead
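The two slides above combine into a small data structure: load values are the thread’s only external inputs (enabling isolated replay), while branch/load/store outcomes are the only records compared. The C++ sketch below uses an assumed record layout for illustration.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Per-thread trace sketch for mSWAT-style isolated replay and comparison.
struct Outcome { std::uint64_t pc; std::uint64_t value; };

inline bool operator==(const Outcome& a, const Outcome& b) {
    return a.pc == b.pc && a.value == b.value;
}

struct ThreadTrace {
    std::deque<std::uint64_t> loadValues;  // inputs: make replay deterministic
    std::vector<Outcome> outcomes;         // outputs: compared after replay
};

// Replay-side load: consume the recorded value instead of touching shared
// memory, isolating the thread from the rest of the system.
std::uint64_t replayLoad(ThreadTrace& t) {
    std::uint64_t v = t.loadValues.front();
    t.loadValues.pop_front();
    return v;
}

// A replay that fails to reproduce the recorded branch/load/store outcomes
// implicates the hardware used in the native run or in the replay.
bool diverged(const ThreadTrace& native, const ThreadTrace& replay) {
    return !(native.outcomes == replay.outcomes);
}
```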
mSWAT Diagnosis: Hardware Cost
• Trace captured in native execution
– HW support for trace collection
• Deterministic replay is firmware emulated
– Requires minimal hardware support
– Replay threads in isolation ⇒ no need to capture memory orderings
• Long detection latency ⇒ large trace buffers (8MB/core)
– Iterative diagnosis algorithm reduces buffer overhead
[Flow: the capture/replay/divergence loop is repeatedly executed on short traces, e.g. 100,000 instructions each]
Results: mSWAT Fault Diagnosis
• Over 95% of detected faults are successfully diagnosed
• All faults detected on a fault-free core are diagnosed
[Bar chart: % of detected faults correctly diagnosed vs. undiagnosed per faulty unit — Decoder 99, INT ALU 100, Reg Dbus 99, Int reg 87, ROB 100, RAT 78, AGEN 99; average 95.9]
Trace-Based Fault Diagnosis (TBFD)
[Flow: permanent fault detected → invoke TBFD → roll back the faulty core to a checkpoint and replay execution, collecting µarch info (faulty trace); load the same checkpoint on a fault-free core for fault-free instruction execution (test trace); the diagnosis algorithm compares the two traces, synchronizing state and acting on each divergence]
• Key questions: what info to collect? what info to compare? what to do on divergence?
SWAT Recovery
Recovery: Mask effect of error for continuous operation
• Is I/O buffering required?
– Previous software solutions vulnerable to hardware faults
• Overheads for I/O buffering and checkpointing?
– I/O buffering interval > 2 * detection latency
– Chkpt interval * # of chkpts > 2 * detection latency
[Diagram: recovery = checkpoint/replay (rollback to pristine state, re-execute) + I/O buffering (prevent irreversible effects)]
What to Checkpoint and Buffer?
[Diagram: CPU, memory, and devices (SCSI controller, host-PCI bridge, console, network); device-to-memory writes are part of the checkpointed state, while CPU-to-device writes go through an output buffer]
• Schemes compared: Proc+Mem, Proc+AllMem, FullSystem, Proc+AllMem+OutBuf
What to Checkpoint, Buffer: Methodology
• Buffer requirements of I/O-intensive workloads
– Apache, SSH daemon serving client requests on network
• Server daemon serving multiple client threads
– Fault injection and detection only at server
* SDC rate similar to single-threaded; focus on recovery
• After detection, rollback to checkpoint, replay without fault
– Outcomes: recoverable, Detected Unrecoverable Error (DUE), SDC
[Diagram: Simics simulates a server stack (application/OS/hardware, with the fault injected in hardware) connected through a simulated network to a client stack]
What to Checkpoint, Buffer?
• Need output buffering for full recovery!
[Bar chart: % of detected faults that are Recovered / DUE / SDC per scheme — Proc+Mem 69%, Proc+AllMem 70%, FullSystem 71%, Proc+AllMem+OutBuf 99% recovered]
How Much Output Buffering?
• Monitored CPU-to-device writes for different buffering intervals
• Interval of 10M instructions needs >100KB of output buffer
• Interval of 10K needs <1KB of buffer!
[Log-log plot: output-buffer size in bytes (10 to 1,000,000) vs. buffering interval in instructions (10k to 100M) for apache, sshd, parser, mcf]
How Much Checkpoint Overhead?
• Used state of the art: ReVive
– Effect of different cache sizes, intervals unknown
• Methodology
– 16-core multicore with shared L2
– 4 SPLASH parallel apps (worst case for original ReVive)
– Vary L2 cache size between 256KB and 2048KB
– Vary checkpoint interval between 500K and 50M cycles
Overhead of Hardware Checkpointing
• Intervals, cache sizes have large impact on performance
• <5M-cycle checkpoint intervals can have unacceptable overheads
– But longer intervals require a 100KB output buffer!
– Need cheaper checkpoint mechanisms
[Plot (Ocean): slowdown (0.9–1.35×) vs. checkpoint interval in cycles (100k–100M) for L2 sizes 256KB, 512KB, 1024KB, 2048KB]
SwatSim: Fast and Accurate Fault Modeling
• Need for accurate µarch-level fault models
– Must be able to observe system level effects
– µarch (latch) level injections fast but inaccurate
– Gate level injections accurate but too slow
• Can we achieve µarch-level speed at gate-level accuracy?
• SwatSim - Gate-level accuracy at µarch-level speeds
– Simulate mostly at µarch level
– Simulate only faulty component at gate-level, on-demand
– Permanent faults require back-and-forth between levels
SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Flow: the µarch simulation executes r3 ← r1 op r2; if the faulty unit is not used, µarch simulation simply continues; if it is, the inputs are sent as stimuli to an on-demand gate-level fault simulation, and the response, with the fault propagated to the output, feeds the result (r3) back into the µarch simulation]
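The dispatch at the heart of this flow can be sketched in a few lines of C++. The interface is an assumption for illustration; in SwatSim the gate-level side is a Verilog netlist simulation, not a C++ callback.

```cpp
#include <cstdint>
#include <functional>

// Mixed-level dispatch sketch: most operations run at fast µarch-level
// fidelity; an operation that uses the faulty unit is forwarded, on
// demand, to a gate-level model of just that unit with the fault injected.
struct MixedLevelALU {
    bool isFaultyUnit;  // was the permanent fault injected into this unit?
    std::function<std::uint64_t(std::uint64_t, std::uint64_t)> gateLevelSim;

    std::uint64_t execute(std::uint64_t r1, std::uint64_t r2) const {
        if (!isFaultyUnit)
            return r1 + r2;          // fast µarch-level behavior (e.g., add)
        return gateLevelSim(r1, r2); // stimuli in, faulty response out
    }
};
```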
SwatSim Results
• SwatSim implemented within full-system simulation
– GEMS+Simics for µarch and functional simulations
– NCVerilog + VPI for gate-level ALU, AGEN
• Performance overhead
– 100,000X faster than gate-level full-processor simulation
– 2X slowdown over µarch-level simulation
• Accuracy of µarch fault models using SWAT coverage/latency
– Compared µarch stuck-at with SwatSim stuck-at, delay faults
– µarch fault models generally inaccurate
* Accuracy varies depending on structure, fault model
* Differences in activation rate, multi-bit flips
• Unsuccessful attempts to derive more accurate µarch fault models
⇒ Need SwatSim, at least for now