On Cosmic Rays, Bat Droppings, and what to do about them David Walker Princeton University with Jay Ligatti, Lester Mackey, George Reis and David August

Page 1: On  Cosmic Rays,  Bat Droppings,  and what to do about them

On Cosmic Rays, Bat Droppings,

and what to do about them

David Walker

Princeton University

with Jay Ligatti, Lester Mackey, George Reis and David August

Page 2: On  Cosmic Rays,  Bat Droppings,  and what to do about them

A Little-Publicized Fact

1 + 1 = 2 (or, occasionally, 3)

Page 3: On  Cosmic Rays,  Bat Droppings,  and what to do about them

How do Soft Faults Happen?

A high-energy particle passes through a device and collides with a silicon atom

Collision generates an electric charge that can flip a single bit

“Galactic Particles”: high-energy particles that penetrate to Earth’s surface, through buildings and walls

“Solar Particles”: affect satellites; cause < 5% of terrestrial problems

Alpha particles from bat droppings
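
As a minimal sketch of what "flipping a single bit" means for a stored value, the snippet below inverts one bit of an integer. The helper name `flip_bit` is hypothetical and the fault is simulated in software; it only illustrates the effect, not the physics.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit position inverted, simulating a single-bit upset."""
    return value ^ (1 << bit)

correct = 1 + 1                      # 2, binary 10
corrupted = flip_bit(correct, 0)     # lowest bit flipped: binary 11
print(correct, corrupted)            # prints: 2 3
```

Flipping the same bit a second time restores the original value, which is why a single corrupted copy can be outvoted or corrected when redundant copies exist.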

Page 4: On  Cosmic Rays,  Bat Droppings,  and what to do about them

How Often do Soft Faults Happen?

Page 5: On  Cosmic Rays,  Bat Droppings,  and what to do about them

How Often do Soft Faults Happen?

[Figure: city altitude (feet, 0 to 12,000) plotted against cosmic ray flux/fail rate (multiplier, 0 to 15) for NYC; Tucson, AZ; Denver, CO; and Leadville, CO. Source: IBM Soft Fail Rate Study; Mainframes; 83-86]

Page 6: On  Cosmic Rays,  Bat Droppings,  and what to do about them

How Often do Soft Faults Happen?

[Figure: city altitude (feet, 0 to 12,000) plotted against cosmic ray flux/fail rate (multiplier, 0 to 15) for NYC; Tucson, AZ; Denver, CO; and Leadville, CO. Source: IBM Soft Fail Rate Study; Mainframes; 83-86 [Zeiger-Puchner 2004]]

Some Data Points:
• 83-86: Leadville (highest incorporated city in the US): 1 fail/2 days
• 83-86: Subterranean experiment under 50 ft of rock: no fails in 9 months
• 2004: 1 fail/year for laptop with 1GB RAM at sea level
• 2004: 1 fail/trans-Pacific round trip
[Zeiger-Puchner 2004]

Page 7: On  Cosmic Rays,  Bat Droppings,  and what to do about them

How Often do Soft Faults Happen?

Soft Error Rate Trends [Shekhar Borkar, Intel, 2004]

[Figure: relative soft error rate increase (0 to 150) versus chip feature size (180, 130, 90, 65, 45, 32, 22, 16 nm), annotated “~8% degradation/bit/generation”. Markers: “we are approximately here” and “6 years from now”.]

Page 8: On  Cosmic Rays,  Bat Droppings,  and what to do about them

How Often do Soft Faults Happen?

Soft Error Rate Trends [Shekhar Borkar, Intel, 2004]

[Figure: relative soft error rate increase (0 to 150) versus chip feature size (180, 130, 90, 65, 45, 32, 22, 16 nm), annotated “~8% degradation/bit/generation”. Markers: “we are approximately here” and “6 years from now”.]

• Soft error rates go up as:
  • Voltages decrease
  • Feature sizes decrease
  • Transistor density increases
  • Clock rates increase
(all future manufacturing trends)

Page 9: On  Cosmic Rays,  Bat Droppings,  and what to do about them

How Often do Soft Faults Happen?

In 1948, Presper Eckert noted that the cascading effects of a single-bit error destroyed hours of ENIAC’s work. [Zeiger-Puchner 2004]

In 2000, Sun server systems deployed to America Online, eBay, and others crashed due to cosmic rays [Baumann 2002]

“The wake-up call came in the end of 2001 ... billion-dollar factory ground to a halt every month due to ... a single bit flip” [Zeiger-Puchner 2004]

Los Alamos National Lab Hewlett-Packard ASC Q 2048-node supercomputer was crashing regularly from soft faults due to cosmic radiation [Michalak 2005]

Page 10: On  Cosmic Rays,  Bat Droppings,  and what to do about them

What Problems do Soft Faults Cause?

a single bit in memory gets flipped

a single bit in the processor logic gets flipped and
• there’s no difference in external observable behavior
• the processor locks up
• the computation is silently corrupted
  • register value corrupted (simple data fault)
  • control-flow transfer goes to wrong place (control-flow fault)
  • different opcode interpreted (instruction fault)

Page 11: On  Cosmic Rays,  Bat Droppings,  and what to do about them

FT Solutions

Redundancy in Information
• eg: Error correcting codes (ECC)
• pros: protects stored values efficiently
• cons: difficult to design for arithmetic units and control logic

Redundancy in Space
• multiple redundant hardware devices
• eg: Compaq Non-stop Himalaya runs two identical programs on two processors, comparing pins on every cycle
• pros: efficient in time
• cons: expensive in hardware (double the space)

Redundancy in Time
• perform the same computations at different times (eg: in sequence)
• pros: efficient in hardware (space is reused)
• cons: expensive in time (slower --- but not twice as slow)
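
To make "redundancy in information" concrete, here is a small sketch of a Hamming(7,4) code, a textbook error-correcting code chosen purely for illustration (the slides mention ECC only generically). Four data bits are stored with three parity bits, and any single flipped bit in the 7-bit word can be located and repaired. The function names `encode` and `decode` are hypothetical.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(word):
    """Correct up to one flipped bit and return the 4 data bits."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 0 means no error, else the 1-based error position
    if syndrome:
        w[syndrome - 1] ^= 1          # repair the flipped bit
    return [w[2], w[4], w[5], w[6]]

stored = encode([1, 0, 1, 1])
stored[4] ^= 1                        # simulate a soft fault in one stored bit
print(decode(stored))                 # prints: [1, 0, 1, 1]
```

This works well for stored values because the data does not change between write and read; the slide's point is that the same trick is hard to apply to arithmetic units and control logic, where values are transformed rather than stored.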

Page 12: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Solutions in Time

Compiler generates code containing replicated computations, fault detection checks and recovery routines
• eg: Rebaudengo 01, CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ...

pros: software-controlled --- new code with better reliability properties may be deployed whenever, wherever needed

cons: for fixed reliability policy, slower than specialized hardware solutions

Page 13: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Solutions in Time

Compiler generates code containing replicated computations, fault detection checks and recovery routines
• eg: Rebaudengo 01, CFCSS [Oh et al. 02], SWIFT or CRAFT [Reis et al. 05], ...

pros: flexibility --- new code with better reliability properties may be deployed whenever, wherever needed

cons: for fixed reliability policy, slower than specialized hardware solutions

cons: it might not actually work
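
A rough sketch of the "replicated computation plus detection check" idea follows. It is not taken from SWIFT, CRAFT or any of the cited systems; the names `run_twice_and_compare`, `with_injected_fault` and `FaultDetected` are hypothetical, and the fault is simulated by corrupting the result of one chosen call.

```python
class FaultDetected(Exception):
    """Raised when the two redundant results disagree."""

def run_twice_and_compare(compute, *args):
    """Redundancy in time: compute twice, accept the result only if both agree."""
    first = compute(*args)
    second = compute(*args)
    if first != second:
        raise FaultDetected(f"results differ: {first} != {second}")
    return first

def with_injected_fault(compute, faulty_call):
    """Wrap compute so that the result of call number faulty_call has its low bit flipped."""
    calls = 0
    def wrapper(*args):
        nonlocal calls
        calls += 1
        result = compute(*args)
        return result ^ 1 if calls == faulty_call else result
    return wrapper

def double(x):
    return x + x

print(run_twice_and_compare(double, 2))    # prints: 4
try:
    run_twice_and_compare(with_injected_fault(double, 2), 2)
except FaultDetected as err:
    print("detected:", err)                # prints: detected: results differ: 4 != 5
```

Two copies are enough to detect a single fault but not to decide which copy is right; recovery then needs either a third copy or a re-execution.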

Page 14: On  Cosmic Rays,  Bat Droppings,  and what to do about them

“It might not actually work”

Page 15: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Agenda

Answer basic scientific questions about software-controlled fault tolerance:
• Do software-only or hybrid SW/HW techniques actually work?
• For what fault models? How do we specify them?
• How can we prove it?

Build compilers that produce software that runs reliably on faulty hardware
• Moreover: Let’s not replace faulty hardware with faulty software.
• Let’s prove every binary we produce is fault tolerant relative to the specified fault model

Page 16: On  Cosmic Rays,  Bat Droppings,  and what to do about them

A Possible Compiler Architecture

[Diagram: compiler front end -> ordinary program -> reliability transform -> fault tolerant program -> optimization -> optimized FT program]

Page 17: On  Cosmic Rays,  Bat Droppings,  and what to do about them

A Possible Compiler Architecture

[Diagram: compiler front end -> ordinary program -> reliability transform -> fault tolerant program -> optimization -> optimized FT program]

Testing Requirements: all combinations of features multiplied by all combinations of faults

Page 18: On  Cosmic Rays,  Bat Droppings,  and what to do about them

A More Reliable Compiler Architecture

[Diagram: compiler front end -> ordinary program -> reliability transform -> fault tolerant program with reliability proof -> optimization -> optimized FT program with modified proof, plus a proof checker]

Page 19: On  Cosmic Rays,  Bat Droppings,  and what to do about them

A More Reliable Compiler Architecture

[Diagram: compiler front end -> ordinary program -> reliability transform -> fault tolerant program with reliability proof -> optimization -> optimized FT program with modified proof, plus a proof checker]

Testing Requirements: all combinations of features multiplied by all combinations of faults

Page 20: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Central Technical Challenges

Designing

Page 21: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Step 1: Lambda Zap

Lambda Zap [ICFP 06]

a lambda calculus that exhibits intermittent data faults + operators to detect and correct them

a type system that guarantees observable outputs of well-typed programs do not change in the presence of a single fault
• types act as the “proofs” of fault tolerance
• expressive enough to implement an ordinary typed lambda calculus

End result: the foundation for a fault-tolerant typed intermediate language

Page 22: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Fault Model

Lambda zap models simple data faults only

( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

Not modelled:
• memory faults (better protected using ECC hardware)
• control-flow faults (ie: faults during control-flow transfer)
• instruction faults (ie: faults in instruction opcodes)

Goal: to construct programs that tolerate 1 fault
• observers cannot distinguish between fault-free and 1-fault runs

Page 23: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Lambda to Lambda Zap: The main idea

let x = 2 in
let y = x + x in
out y

Page 24: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Lambda to Lambda Zap: The main idea

let x = 2 in
let y = x + x in
out y

replicate instructions:

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]    (atomic majority vote + output)

Page 25: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Lambda to Lambda Zap: The main idea

let x = 2 in
let y = x + x in
out y

let x1 = 2 in
let x2 = 2 in
let x3 = 7 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]

Page 26: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Lambda to Lambda Zap: The main idea

let x = 2 in
let y = x + x in
out y

let x1 = 2 in
let x2 = 2 in
let x3 = 7 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]

corrupted values are copied and percolate through the computation, but the final output is unchanged
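
The triplicated program and its behaviour under one fault can be mimicked in a few lines. This is only a simulation of the idea: `majority` stands in for the atomic vote performed by `out`, and the fault is injected by passing a corrupted `x3`. Both function names are hypothetical.

```python
from collections import Counter

def majority(replicas):
    """Return the value held by at least two of the three replicas."""
    value, count = Counter(replicas).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two replicas agree")
    return value

def translated_program(x1=2, x2=2, x3=2):
    """Mirror of the replicated code: three independent copies, then a vote."""
    y1 = x1 + x1
    y2 = x2 + x2
    y3 = x3 + x3
    return majority([y1, y2, y3])

print(translated_program())         # prints: 4
print(translated_program(x3=7))     # prints: 4 (y3 becomes 14 but is outvoted)
```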

Page 27: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Lambda to Lambda Zap: Control-flow

let x = 2 in
if x then e1 else e2

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
if [x1, x2, x3] then [[ e1 ]] else [[ e2 ]]

majority vote on control-flow transfer

recursively translate subexpressions

Page 28: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Lambda to Lambda Zap: Control-flow

let x = 2 in
if x then e1 else e2

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
if [x1, x2, x3] then [[ e1 ]] else [[ e2 ]]

majority vote on control-flow transfer (function calls replicate arguments, results and function itself)

recursively translate subexpressions
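
A small sketch of voting on a branch condition, assuming boolean conditions for simplicity. The helper names `vote` and `voted_if` are hypothetical; the branches are passed as thunks so that only the chosen one runs.

```python
def vote(a, b, c):
    """Majority of three replicas; fails if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority")

def voted_if(x1, x2, x3, then_branch, else_branch):
    """Decide the control-flow transfer once, from the voted condition."""
    return then_branch() if vote(x1, x2, x3) else else_branch()

print(voted_if(True, True, True, lambda: "e1", lambda: "e2"))    # prints: e1
print(voted_if(True, False, True, lambda: "e1", lambda: "e2"))   # prints: e1 (one corrupted replica)
```

Voting before the jump matters because, unlike a wrong data value, a wrong branch cannot be outvoted later: all replicas would already be executing the wrong code.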

Page 29: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Almost too easy, can anything go wrong?...

Page 30: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Almost too easy, can anything go wrong?...

yes!

optimization reduces replication overhead dramatically (eg: ~ 43% for 2 copies), but can be unsound!

original implementation of SWIFT [Reis et al.] optimized away all redundancy, leaving them with an unreliable implementation!!

Page 31: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Faulty Optimizations

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]

CSE:

let x1 = 2 in
let y1 = x1 + x1 in
out [y1, y1, y1]

In general, optimizations eliminate redundancy; fault-tolerance requires redundancy.

Page 32: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Essential Problem

bad code:

let x1 = 2 in
let y1 = x1 + x1 in
out [y1, y1, y1]

voters depend on common value x1

Page 33: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Essential Problem

bad code:

let x1 = 2 in
let y1 = x1 + x1 in
out [y1, y1, y1]

voters depend on common value x1

good code:

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]

voters do not depend on a common value

Page 34: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Essential Problem

bad code:

let x1 = 2 in
let y1 = x1 + x1 in
out [y1, y1, y1]

voters depend on a common value

good code:

let x1 = 2 in
let x2 = 2 in
let x3 = 2 in
let y1 = x1 + x1 in
let y2 = x2 + x2 in
let y3 = x3 + x3 in
out [y1, y2, y3]

voters do not depend on a common value (red on red; green on green; blue on blue)
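
The difference between the two versions shows up as soon as a fault hits `x1`. The sketch below simulates both; `replicated` and `after_cse` are hypothetical names for the good and bad code, and the fault is modelled by passing a corrupted `x1`.

```python
def vote(a, b, c):
    """Majority of three replicas; fails if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority")

def replicated(x1=2, x2=2, x3=2):
    """Good code: each voter depends on its own copy of x."""
    y1 = x1 + x1
    y2 = x2 + x2
    y3 = x3 + x3
    return vote(y1, y2, y3)

def after_cse(x1=2):
    """Bad code: common subexpression elimination left a single copy feeding all voters."""
    y1 = x1 + x1
    return vote(y1, y1, y1)

print(replicated(x1=7))   # prints: 4  (the corrupted copy is outvoted)
print(after_cse(x1=7))    # prints: 14 (all three voters see the same corrupted value)
```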

Page 35: On  Cosmic Rays,  Bat Droppings,  and what to do about them

A Type System for Lambda Zap

Key idea: types track the “color” of the underlying value & prevent interference between colors

Colors C ::= R | G | B

Types T ::= C int | C bool | C (T1,T2,T3) -> (T1’,T2’,T3’)

Page 36: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Sample Typing Rules

Judgement Form: G |--z e : T where z ::= C | .

simple value typing rules:

(x : T) in G
---------------
G |--z x : T

------------------------
G |--z C n : C int

------------------------------
G |--z C true : C bool

Page 37: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Sample Typing Rules

Judgement Form: G |--z e : T where z ::= C | .

sample expression typing rules:

G |--z e1 : C int    G |--z e2 : C int
-------------------------------------------------
G |--z e1 + e2 : C int

G |--z e1 : R bool    G |--z e2 : G bool    G |--z e3 : B bool
G |--z e4 : T    G |--z e5 : T
-----------------------------------------------------
G |--z if [e1, e2, e3] then e4 else e5 : T

G |--z e1 : R int    G |--z e2 : G int    G |--z e3 : B int    G |--z e4 : T
-----------------------------------------------------
G |--z out [e1, e2, e3]; e4 : T
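
To see how such rules reject code whose voters share a value, here is a toy checker covering only constants, variables, `+` and `out`. It is a simplified sketch, not the type system from the paper: it ignores the `z` annotation on the judgement, the tuple encoding of expressions is an assumption made for this example, and the function name `type_of` is hypothetical.

```python
def type_of(env, expr):
    """Return the (color, base) type of expr, or raise TypeError.

    Expression encoding (hypothetical, for this sketch only):
      ("int", color, n)           colored integer constant
      ("var", name)               variable looked up in env
      ("add", e1, e2)             e1 + e2
      ("out", e1, e2, e3, rest)   out [e1, e2, e3]; rest
    """
    tag = expr[0]
    if tag == "int":
        return (expr[1], "int")
    if tag == "var":
        return env[expr[1]]
    if tag == "add":
        left, right = type_of(env, expr[1]), type_of(env, expr[2])
        # Both operands must be ints of the same color.
        if left != right or left[1] != "int":
            raise TypeError(f"cannot add {left} and {right}")
        return left
    if tag == "out":
        actual = [type_of(env, e) for e in expr[1:4]]
        # The three voters must be red, green and blue ints, in that order.
        if actual != [("R", "int"), ("G", "int"), ("B", "int")]:
            raise TypeError(f"out expects R, G, B ints, got {actual}")
        return type_of(env, expr[4])
    raise ValueError(f"unknown expression {tag!r}")

env = {"y1": ("R", "int"), "y2": ("G", "int"), "y3": ("B", "int")}
done = ("int", "R", 0)
good = ("out", ("var", "y1"), ("var", "y2"), ("var", "y3"), done)
bad = ("out", ("var", "y1"), ("var", "y1"), ("var", "y1"), done)

print(type_of(env, good))            # prints: ('R', 'int')
try:
    type_of(env, bad)
except TypeError as err:
    print("rejected:", err)
```

The CSE-damaged program `out [y1, y1, y1]` fails the check because all three voters are red, which is the static counterpart of "voters depend on a common value".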

Page 38: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Sample Typing Rules

Judgement Form: G |--z e : T where z ::= C | .

recall “zap rule” from operational semantics:

( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

before: |-- v1 : T

after: |-- v2 ?? T ==> how will we obtain type preservation?

Page 39: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Sample Typing Rules

Judgement Form: G |--z e : T where z ::= C | .

recall “zap rule” from operational semantics:

( M, F[ v1 ] ) ---> ( M, F[ v2 ] )

before: |-- v1 : C U

after: |--C v2 : C U by rule:

----------------------   (no conditions)
G |--C C v : C U

“faulty typing” occurs within a single color only.

Page 40: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Theorems

Theorem 1: Well-typed programs are safe, even when there is a single error.

Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors [with a caveat].

Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap [that satisfies the caveat].

ICFP 06

Page 41: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Caveat

bad, but well-typed code:

out [2, 3, 3]

outputs 3 after no faults: out [2, 3, 3]

outputs 2 after 1 fault: out [2, 2, 3]

Goal: 0-fault and 1-fault executions should be indistinguishable

Solution: computations must be independent, but equivalent

More importantly:
• out [2, 3, 3] is obviously a symptom of a compiler bug
• out [2, 3, 4] is even worse: good runs never come to consensus
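
The caveat is easy to reproduce with a plain majority vote. In this sketch `vote` stands in for `out`, and `replicas_agree` is a hypothetical runtime stand-in for the equivalence requirement that the type system is meant to enforce statically.

```python
def vote(a, b, c):
    """Majority of three replicas; fails if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority")

def replicas_agree(a, b, c):
    """True when the three replicas are equal, as fault-free replicas should be."""
    return a == b == c

print(vote(2, 3, 3))             # prints: 3 (no faults)
print(vote(2, 2, 3))             # prints: 2 (one fault turned the middle 3 into a 2)
print(replicas_agree(2, 3, 3))   # prints: False
```

Because the fault-free replicas already disagree, a single fault is enough to swing the vote, so the 0-fault and 1-fault runs become distinguishable.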

Page 42: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Caveat

modified typing:

G |--z e1 : R U    G |--z e2 : G U    G |--z e3 : B U    G |--z e4 : T
G |--z e1 ~~ e2    G |--z e2 ~~ e3
----------------------------------------------------------------------------
G |-- out [e1, e2, e3]; e4 : T

Page 43: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Caveat

More generally, programmers may form “triples” of equivalent values

Introduction form:

[e1, e2, e3]

• a collection of 3 items
• each of the 3 stored in a separate register
• a single fault affects at most one

Elimination form:

let [x1, x2, x3] = e1 in e2

Page 44: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Caveat

More generally, programmers may form “triples” of equivalent values

Introduction form:

G |--z e1 : R U    G |--z e2 : G U    G |--z e3 : B U
G |--z e1 ~~ e2    G |--z e2 ~~ e3
---------------------------------------------
G |-- [e1, e2, e3] : [R U, G U, B U]

Elimination form:

G |--z e1 : [R U, G U, B U]
G, x1:R U, x2:G U, x3:B U, x1 ~ x2, x2 ~ x3 |--z e2 : T
---------------------------------------------
G |-- let [x1, x2, x3] = e1 in e2 : T

Page 45: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Theorems*

Theorem 1: Well-typed programs are safe, even when there is a single error.

Theorem 2: Well-typed programs executing with a single error simulate the output of well-typed programs with no errors.

Theorem 3: There is a correct, type-preserving translation from the simply-typed lambda calculus into lambda zap.

* There is still one “i” to be dotted in the proofs of these theorems. Lester Mackey, brilliant Princeton undergrad, has proven all key theorems modulo the dotted “i”.

Page 46: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Step 2: Fault Tolerant Typed Assembly Language (TAL/FT)

Lambda zap is a playground for studying the principles of fault tolerance in an idealized setting

TAL/FT is a more realistic assembly-level, hybrid HW/SW, fault tolerance scheme with

(1) a formal fault model

(2) a formal definition of fault tolerance relative to memory-mapped I/O

(3) a sound type system for proving compiled programs are fault tolerant

Page 47: On  Cosmic Rays,  Bat Droppings,  and what to do about them

A More Reliable Compiler Architecture

[Diagram: compiler front end -> ordinary program -> reliability transform -> TAL/FT with reliability proof -> optimization -> optimized TAL/FT with modified proof, plus a proof checker; overlaid labels: “types”, “types”, “type”]

Page 48: On  Cosmic Rays,  Bat Droppings,  and what to do about them

TAL/FT: Key Ideas (Fault Model)

Fault model:
• registers may incur arbitrary faults in between execution of any two instructions
• memory (including code) is protected by ECC
• fault model formalized as part of hardware operational semantics

Page 49: On  Cosmic Rays,  Bat Droppings,  and what to do about them

TAL/FT: Key Ideas (Properties)

[Diagram: Processor, ECC-protected memory and Mem-mapped I/O device, connected by “store” and “read” arrows]

Primary Goal: if there is one fault then either

(1) Mem-mapped I/O device sees exactly the same sequence of stores as a fault-free execution, or

(2) Hardware detects and signals a fault and mem-mapped I/O sees a prefix of the stores from a fault-free execution

Secondary Goal: no false positives
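
The primary goal can be read as a predicate over store traces. The sketch below is one possible encoding, assuming a trace is a list of (address, value) pairs; the function name `acceptable` and this representation are hypothetical and not part of TAL/FT itself.

```python
def acceptable(observed, fault_free, fault_signalled):
    """Check the primary goal for one run that suffered at most one fault.

    observed        stores seen by the memory-mapped I/O device in this run
    fault_free      stores a fault-free execution would have produced
    fault_signalled whether the hardware detected and signalled a fault
    """
    if observed == fault_free:
        return True                                   # case (1): identical sequence
    is_prefix = observed == fault_free[:len(observed)]
    return fault_signalled and is_prefix              # case (2): signalled, prefix only

clean = [(0x10, 1), (0x14, 2), (0x18, 3)]
print(acceptable(clean, clean, False))                # prints: True
print(acceptable(clean[:2], clean, True))             # prints: True
print(acceptable(clean[:2], clean, False))            # prints: False (silent truncation)
print(acceptable([(0x10, 1), (0x14, 9)], clean, True))  # prints: False (corrupted store escaped)
```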

Page 50: On  Cosmic Rays,  Bat Droppings,  and what to do about them

TAL/FT: Key Ideas (Mechanisms)

Compiler strategy
• create two redundant computations as in lambda zap
• two copies ==> fault detection
• fault recovery handled by a higher-level process

Hardware support
• special instructions & modified store buffer for implementing reliable stores
• special instructions for reliable control-flow transfers

Type system
• Simple typing mechanisms based on original TAL [Morrisett, Walker, et al.]
• Redundant values with separate colors like in lambda zap
• Value identities needed for equivalence checking tracked using singleton types combined with some ideas drawn from traditional Hoare logics
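
One way a "modified store buffer" could turn two redundant computations into reliable stores is sketched below. The slide does not spell out the protocol, so the two-phase design here is an assumption made for illustration: the first copy enqueues its store, and the store only reaches memory when the second copy issues a matching one. The names `CheckedStoreQueue`, `store_first`, `store_second` and `FaultSignal` are hypothetical.

```python
class FaultSignal(Exception):
    """Raised when the two redundant copies disagree about a store."""

class CheckedStoreQueue:
    def __init__(self):
        self.pending = []    # (address, value) pairs announced by the first copy
        self.memory = {}     # committed stores, visible to the I/O device

    def store_first(self, address, value):
        """First copy announces a store; nothing is committed yet."""
        self.pending.append((address, value))

    def store_second(self, address, value):
        """Second copy confirms the oldest pending store, committing it on a match."""
        if not self.pending or self.pending.pop(0) != (address, value):
            raise FaultSignal(f"mismatched store to {address:#x}")
        self.memory[address] = value

queue = CheckedStoreQueue()
queue.store_first(0x10, 4)
queue.store_second(0x10, 4)          # copies agree: the store is committed
print(queue.memory)                  # prints: {16: 4}

queue.store_first(0x14, 4)
try:
    queue.store_second(0x14, 14)     # one copy was corrupted before the store
except FaultSignal as err:
    print("fault:", err)             # prints: fault: mismatched store to 0x14
print(queue.memory)                  # prints: {16: 4}
```

Under this assumed design a corrupted value never reaches memory: the device either sees the fault-free store or the run stops with a signal, which matches the "same sequence or signalled prefix" goal.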

Page 51: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Current & Future Work

Build the first compiler that can automatically generate reliability proofs for compiled programs
• TAL/FT refinement and implementation
• type- and reliability-preserving optimizations

Study alternative fault detection schemes
• fault detection & recovery on current hardware
• exploit multi-core alternatives

Understand the foundational theoretical principles that allow programs to tolerate transient faults
• general purpose program logics for reasoning about faulty programs

Page 52: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Other Research I Do

PADS [popl 06, sigmod 06 demo, popl 07]
• automatic generation of tools (parsers, printers, validators, format translators, query engines, etc.) for “ad hoc” data formats
• with Kathleen Fisher (AT&T)

Program Monitoring [popl 00, icfp 03, pldi 05, popl 06, ...]
• semantics, design and implementation of programs that monitor other programs for security (or other purposes)

TAL & other type systems [popl 98, popl 99, toplas 99, jfp 02, ... ]
• theory, design and implementation of type systems for compiler target and intermediate languages

Page 53: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Conclusions

Semiconductor manufacturers are deeply worried about how to deal with soft faults in future architectures (10+ years out)

Using proofs and types I think we are going to be able to develop highly reliable software that runs on unreliable hardware

Page 54: On  Cosmic Rays,  Bat Droppings,  and what to do about them

end!

Page 55: On  Cosmic Rays,  Bat Droppings,  and what to do about them

The Caveat

Page 56: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Function O.S. follows

Page 57: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Lambda to Lambda Zap: Control-flow

let f = \x.e in
f 2

let [f1, f2, f3] = \x. [[ e ]] in
[f1, f2, f3] [2, 2, 2]

majority vote on control-flow transfer

Page 58: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Lambda to Lambda Zap: Control-flow

let f = \x.e in
f 2

let [f1, f2, f3] = \x. [[ e ]] in
[f1, f2, f3] [2, 2, 2]

majority vote on control-flow transfer

operational semantics:

(M; let [f1, f2, f3] = \x.e1 in e2) ---> (M,l=\x.e1; e2[ l / f1][ l / f2][ l / f3])

Page 59: On  Cosmic Rays,  Bat Droppings,  and what to do about them

TAL/FT Hardware

store queue

replicated program counters

ECC-protected Caches/Memory

Page 60: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Related Work Follows

Page 61: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Software Mitigation Techniques

Examples:
• N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], ...
• Hybrid hardware-software techniques: Watchdog Processors, CRAFT [Reis et al. 2005], ...

Pros:
• immediate deployment
  • would have benefitted Los Alamos Labs, etc...
• policies may be customized to the environment, application
• reduced hardware cost

Cons:
• For the same universal policy, slower (but not as much as you’d think).

Page 62: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Software Mitigation Techniques

Examples:
• N-version programming, EDDI, CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], etc...
• Hybrid hardware-software techniques: Watchdog Processors, CRAFT [Reis et al. 2005], etc...

Pros:
• immediate deployment: if your system is suffering soft error-related failures, you may deploy new software immediately
  • would have benefitted Los Alamos Labs, etc...
• policies may be customized to the environment, application
• reduced hardware cost

Cons:
• For the same universal policy, slower (but not as much as you’d think).
• IT MIGHT NOT ACTUALLY WORK!

Page 63: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Mitigation Techniques

Hardware:
• error-correcting codes
• redundant hardware

Pros:
• fast for a fixed policy

Cons:
• FT policy decided at hardware design time
  • mistakes cost millions
• one-size-fits-all policy
• expensive

Software and hybrid schemes:
• replicate computations

Pros:
• immediate deployment
• policies customized to environment, application
• reduced hardware cost

Cons:
• for the same universal policy, slower (but not as much as you’d think).

Page 64: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Mitigation Techniques

Hardware:
• error-correcting codes
• redundant hardware

Pros:
• fast for a fixed policy

Cons:
• FT policy decided at hardware design time
  • mistakes cost millions
• one-size-fits-all policy
• expensive

Software and hybrid schemes:
• replicate computations

Pros:
• immediate deployment
• policies customized to environment, application
• reduced hardware cost

Cons:
• for the same universal policy, slower (but not as much as you’d think).
• It may not actually work!
  • much research in HW/compilers community completely lacking proof

Page 65: On  Cosmic Rays,  Bat Droppings,  and what to do about them

Solutions in Time

Solutions in Hardware
• replication of instructions and checking implemented in special-purpose hardware
• eg: Reinhardt & Mukherjee 2000
• pros: transparent to software
• cons: one-size-fits-all reliability policy
• cons: can’t fix existing problem; specialized hardware has reduced market

Solutions in Software (or hybrid Hardware/Software)
• compiler generates replicated instructions and checking code
• eg: Reis et al. 05
• pros: flexibility: new reliability policies may be deployed whenever needed
• cons: for fixed reliability policy, slower than specialized hardware solutions
• cons: it might not actually work