
New Approaches to Fault-Tolerant Systems Design

Andreas Steininger

Vienna University of Technology

My contact data

Andreas Steininger
Vienna University of Technology
Faculty of Informatics
Institute of Computer Engineering
Embedded Computing Systems Group
Treitlstrasse 3
A-1040 Vienna

Austria

[email protected]

http://ti.tuwien.ac.at/ecs


Main Contributors to this Material

Dr. Thomas Kottke R. Bosch AG / EADS

Dr. Peter Tummeltshammer R. Bosch AG / Thales

Dr. Christoph Scherrer Alcatel / Thales

Dr. Eric Armengaud DecomSys / VirtualVehicle

Dr. Karl Thaller DecomSys / Elektrobit Austria

Dr. Martin Horauer UAT Technikum Wien

Paul Milbredt AUDI AG

Outline

• Fault tolerance – some (very) basics
• Automotive electronics: the specific situation
• Design of a cost efficient fault tolerant node
  – Basic architecture
  – Temporal diversity
  – Treatment of common cause faults
  – Switching performance mode / safety mode
  – Fault-tolerance validation by fault injection

Faults, Errors and Failures

[Figure: a fault inside the computer causes an error, which becomes a failure once it reaches the output (the output shows "1" where "0" is expected).]

Error Detection

• Fault detection: usually too difficult (too many possibilities)
• Failure detection: too late, we want to prevent the failure!
• Error detection: to decide that "1" is wrong we need a reference. Where to get this reference from?
  – Option 1, time redundancy: perform the same computation a second time (hopefully the fault is gone by then…)
  – Option 2, space redundancy: use a second computer in parallel (hopefully this one works well…)
  – Option 3, information redundancy: add additional information (hopefully not affected as well…)
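The three ways of obtaining a reference can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `compute` is a stand-in computation, the fault model is a single flipped result bit, and a CRC-32 checksum plays the role of the "additional information".

```python
import zlib

def compute(x, fault=False):
    """Stand-in computation; a transient fault flips the lowest result bit."""
    result = x * x
    return result ^ 1 if fault else result

def time_redundancy(x, fault_in_first_run):
    """Run the same computation twice on the same hardware and compare."""
    first = compute(x, fault=fault_in_first_run)
    second = compute(x)  # assumption: the transient fault is gone by now
    return first == second  # True means "no error detected"

def space_redundancy(x, fault_in_replica_a):
    """Run two replicas in parallel and compare their outputs."""
    a = compute(x, fault=fault_in_replica_a)
    b = compute(x)  # assumption: the second replica works well
    return a == b

def information_redundancy(payload, corrupt=False):
    """Attach a checksum and verify it after storage or transmission."""
    checksum = zlib.crc32(payload)
    received = bytes([payload[0] ^ 0x01]) + payload[1:] if corrupt else payload
    return zlib.crc32(received) == checksum
```

In each case the check returns `False` only when the reference disagrees with the observed value, which is exactly the "hopefully…" caveat on the slides: a fault that also corrupts the reference in the same way goes unnoticed.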

Achieving Fault Tolerance

• Fail safe: system can be safely stopped when an error is detected
  – example: train
• Fail operational: system must keep on working when an error is detected
  – example: autopilot in airplane

[Figure: block diagrams of redundant computers with error detection (ED) for both cases.]
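The difference between the two strategies can be sketched in a few lines. Majority voting over three replicas is used here as one common way to stay operational; it is an assumption for illustration, since the slide itself only shows redundant computers with error detection. The function names are hypothetical.

```python
from collections import Counter

def fail_safe_output(a, b):
    """Duplex: deliver the value only if both replicas agree, else stop."""
    if a == b:
        return a
    return None  # None models the safe state (system stopped)

def fail_operational_output(a, b, c):
    """Triplex: a majority vote masks a single faulty replica."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes >= 2:
        return value
    return None  # no majority: more than one replica is faulty
```

With two replicas a disagreement can only be detected, not resolved, so the only safe reaction is to stop. A third replica lets the system identify the outlier and continue.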


Electronics in Cars – some Facts

• high proportion of value: up to 30%
• high development potential: more than 80% of the innovations
• high number of Electronic Control Units (ECUs): up to 70
• complex distributed system: different networks & topologies

Electronics in Cars – Benefits

• cheap alternative to existing mechanical solutions
  – lighter, smaller, cheaper, more flexible, …
• enabler for further optimizations
  – electronic ignition, motor management, …
• key to new functionality
  – safety: ESP, active suspension, crash sensing, …
  – comfort: air conditioning, infotainment, …
  – security: immobilizer, alarm, electronic key, GPS tracking, …
  – autonomy: anticipatory braking, lane keeping, …

Key Demands

• Safety
  – high risk potential (energy!)
  – high public awareness
  – no safe state (in general)
  – certification required (EN 61508, ISO 26262)
  – high complexity of system & application
  – legal issues (liability)
• Real-Time
  – engine: 6000 rpm = 1 revolution per 10 ms
  – VDM: 100 km/h = 28 cm per 10 ms
  – need to synchronize distributed activities
  – real-time communication
  – image processing tasks
• Low Cost
  – extreme competition
  – high cost inhibits introduction
  – tailored safety concepts: minimum degree of replication, use structural redundancies
  – generic solutions: scalable, configurable, flexible
  – marginal costs beat NRE
• Robustness
• Testability
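The two real-time figures quoted under "Real-Time" (one engine revolution per 10 ms at 6000 rpm, about 28 cm travelled per 10 ms at 100 km/h) follow from simple unit conversions. The helper names below are hypothetical.

```python
def revolution_period_ms(rpm):
    """Time for one revolution in milliseconds."""
    return 60_000.0 / rpm

def distance_cm(speed_kmh, interval_ms):
    """Distance travelled in centimetres during the given interval."""
    metres_per_second = speed_kmh * 1000.0 / 3600.0
    return metres_per_second * (interval_ms / 1000.0) * 100.0

print(revolution_period_ms(6000))        # 10.0
print(round(distance_cm(100, 10), 1))    # 27.8
```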

Current Status

• fail safe functions realized:
  – shut off upon error
  – mechanical fall-back system assumes control
• no true “by wire” functions
  – single-channel solutions sufficient
• tolerance against random faults
  – avoid design faults by field experience => no diversity
  – avoid common cause faults by design (?)
• single fault assumption
  – keep faults rare (shielding, etc.)


A Fault Tolerant Node

• mission: make a node (processor) fault tolerant
• need to consider CPU and memory
• aim is “fail safe” (but keep option for fail op in mind)
  – simplex unit with error detection capabilities
  – duplication and comparison
  – hybrid approach

Options for the CPU Core

• Single core + ED: modify custom CPU core
  – parity for buses
  – two-rail coding for signals
  – self-checking implementation of simple units
  – duplicate & compare for complex units
  – careful layout
• Dual core + cmp: duplicate custom CPU core
  – master/checker operation
  – shared (safe) memory
  – validity check for inputs
  – self-checking comparator checks equality of outputs
  – option: clock delay
  – option: mode switch
• Superscalar proc. + cmp + ED
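Parity for buses and a two-rail (dual-rail) comparator can be sketched in software as follows. This is a behavioural model under stated assumptions, not the hardware used in the talk: the 16-bit width, the function names, and the chain (rather than tree) arrangement of checker cells are all illustrative choices. The cell equations are the standard two-rail checker.

```python
def parity_bit(word):
    """Even parity: the bit that makes the total number of ones even."""
    return bin(word).count("1") % 2

def two_rail_cell(x, y):
    """Two-rail checker cell: the output pair is complementary
    only if both input pairs are complementary."""
    (x0, x1), (y0, y1) = x, y
    return ((x0 & y0) | (x1 & y1), (x0 & y1) | (x1 & y0))

def outputs_equal(master, checker, width=16):
    """Compare two words through a chain of two-rail cells.
    Pairing each master bit with the inverted checker bit yields a
    complementary pair exactly when the two bits are equal."""
    pairs = [((master >> i) & 1, 1 - ((checker >> i) & 1)) for i in range(width)]
    result = pairs[0]
    for pair in pairs[1:]:
        result = two_rail_cell(result, pair)
    return result[0] != result[1]  # complementary pair => no error
```

The appeal of the two-rail encoding is that the comparator's own result is a code word: a stuck output line produces a non-complementary pair, so a fault in the comparator shows up as an error instead of silently reporting "equal".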

Solution Example “Dual Core Frame”

• benefits
  – can use custom core without modifications
  – safety analysis valid for other cores as well
  – promises high ED coverage with moderate efforts
  – CPU is hard to protect otherwise
• crucial points
  – enable easy recovery (=> keep outage short)
  – eliminate single points of failure
  – detect common cause faults

Protection in the Dual Core Frame

[Figure: Core 1 (Master) and Core 2 (Checker) share the "safe memories" (Instr. Mem, Data Mem). Each core has the ports Instr. Addr., Instr., Data Addr., Data out and Data in; comparators (=?) check the core outputs and raise Error_Sig. Protection mechanisms: parity for buses, dual-rail coding, self-checking comparators.]

Potential for Common Cause Faults

• identical input data
• identical clock (lock step)
• shared clock generator
• shared power supply
• both processors on same die (physical proximity; thermal & mechanical coupling)

Temporal Diversity

• operate checker with a delay against master
  – same fault hits at different point of computation
  – therefore different effect => detect by comparison
  – different critical paths emerge
• store master output for comparison
• choose delay of 1 / 1.5 / 2 clock cycles
  – larger delay causes high effort for little gain (=> experiments)
  – error detection latency is equal to the delay
  – need to delay memory write and outputs by this amount
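Why a delay helps against common cause faults can be shown with a toy simulation. The assumptions are loud ones: both cores run the same stand-in computation, a disturbance hits both cores in the same clock cycle and flips one bit of whatever result is produced in that cycle, and only whole-cycle delays are modelled (the 1.5-cycle option is not).

```python
def run_core(inputs, start_cycle, glitch_cycle):
    """Outputs of one core that handles inputs[i] in clock cycle
    start_cycle + i. A glitch in glitch_cycle flips bit 0 of the
    result produced in that cycle."""
    outputs = []
    for i, x in enumerate(inputs):
        result = (3 * x + 1) & 0xFF      # stand-in computation
        if start_cycle + i == glitch_cycle:
            result ^= 1                  # effect of the common cause fault
        outputs.append(result)
    return outputs

def mismatch_detected(inputs, delay, glitch_cycle):
    """Compare the stored master outputs with the checker outputs
    of the same computation step."""
    master = run_core(inputs, 0, glitch_cycle)
    checker = run_core(inputs, delay, glitch_cycle)  # runs `delay` cycles behind
    return any(m != c for m, c in zip(master, checker))
```

With `delay=0` the glitch corrupts the same step in both cores, the outputs stay identical and the comparison sees nothing. With `delay=1` the same glitch corrupts different steps, so the comparison of step-aligned outputs reports a mismatch.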

Temporal Diversity: Implementation

[Figure: the dual core frame with Core #1 (Master), Core #2 (Checker), Instr. Mem and Data Mem; comparators (=?) on the core outputs raise Error. Further labels in the figure: DT.]

Fail Safe Dual Core Frame – Summary

• safe memories for instructions and data
• comparison of all core outputs
• parity protection for buses (data, address)
• dual rail coding for single signals (int, rst, err)
• totally self-checking comparators
• temporal diversity

How safe is the proposed solution?

Assessment of the Solution’s Quality

• How to measure quality? (Aim is fail safe)
  – error detection coverage => detect all errors
  – error detection latency => detect them quickly
• Which method to choose?
  – theoretical analysis / modelling
  – experimental fault injection
  – field observation

Fault Injection Experiment

• 2 SPEAR cores in fail safe frame (= DUT)
• synthesized to EDIF netlist
• exhaustive list of stuck-at-1 and stuck-at-0 faults, injected one by one into netlist
• download to FPGA, application run
• “golden device” as reference (= REF)
• upon mismatch (DUT ≠ REF) => check comparator

Results of FI Experiment

                         master    slave    frame   overall
detected
  no effect                 204    51170     3517     54891
  before effect           19047       98      734     19879
  during effect (RD)          0        0        0         0
  during effect (WR)        559        0      921      1480
  after effect (RD)       31455        0       87     31542
  after effect (WR)           0        0        0         0
not detected
  no effect                4269     4276     1073      9618
  with effect                 0        0        0         0
overall                   55534    55544     6332    117410

• No change of memory contents in case of error
• Erroneous read access is uncritical
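The counts in the "overall" column of the results table can be cross-checked, and the detection coverage for faults that actually had an effect can be derived from them. The dictionary layout and variable names are choices made for this sketch; the numbers are taken from the table.

```python
# "overall" column: (detected?, effect category) -> number of injected faults
overall = {
    (True, "no effect"): 54891,
    (True, "before effect"): 19879,
    (True, "during effect RD"): 0,
    (True, "during effect WR"): 1480,
    (True, "after effect RD"): 31542,
    (True, "after effect WR"): 0,
    (False, "no effect"): 9618,
    (False, "with effect"): 0,
}

total = sum(overall.values())
with_effect = sum(n for (_, effect), n in overall.items() if effect != "no effect")
undetected_with_effect = sum(
    n for (detected, effect), n in overall.items()
    if not detected and effect != "no effect"
)
coverage = 1.0 - undetected_with_effect / with_effect

print(total, with_effect, coverage)
```

The total matches the table's grand total of 117410, and no fault with an effect escaped detection in this experiment. The only undetected faults are those that never caused a deviation.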

Enabling fast Recovery

• error signal (dual rail)
  – notifies external component / memory
  – turns any further WR into RD (error confinement)
  – triggers processor interrupt
• status register (memory mapped)
  – updated by HW
  – indicates source of error (data parity, address mismatch, …)
• recovery
  – can build on uncorrupted status
  – can benefit from detailed status information
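The error-confinement idea (once the error signal is raised, any further write is turned into a read) can be modelled in a few lines. The class, its method names and the string used for the status register are hypothetical; real hardware would implement this in the memory interface.

```python
class ConfinedMemory:
    """Memory model that stops accepting writes once an error is signalled."""

    def __init__(self, size):
        self.cells = [0] * size
        self.error = False
        self.status = None  # hypothetical status register: source of the error

    def raise_error(self, source):
        """Called by the (modelled) hardware when a check fails."""
        self.error = True
        self.status = source

    def write(self, addr, value):
        if self.error:
            return self.cells[addr]  # WR turned into RD: contents stay intact
        self.cells[addr] = value
        return value

    def read(self, addr):
        return self.cells[addr]
```

For example, after `raise_error("address mismatch")` a call to `write(0, 9)` leaves cell 0 unchanged, so a recovery routine can rely on the memory contents and read the error source from `status`.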

Why is fast Recovery important?

• application specific fault-tolerance time
  – application can “survive” without computer even in fail-operational case
  – typ. some 10 ms for car (recall: 100 km/h = 28 cm per 10 ms)
• meaning of fast recovery
  – if failed computer recovers within FT time, no need for hot standby => COST!
• re-booting after failure is
  – pragmatic
  – safe
  – expensive!

Fail Safe Dual Core – Summary 1

• duplicate & compare
  – generic approach, applicable to any core type
  – covers all (local) errors
  – need to carefully eliminate single points of failure
  – need to complement with protection for signals & buses
• temporal diversity
  – mitigates (many) common cause failures
  – requires output delay to ensure error confinement

Possible Sources of CCFs

• Design & process
  – design fault or (latent) process deficiency
• Thermal coupling
  – hot spot affects both replicas in the same way
• Mechanical defect
  – affects both replicas symmetrically
• Electrical coupling
  – wire bound (shared lines: VDD, reset, clock)
  – wireless (EMI)

Why use Single Die then?

• cheaper and faster
  – use two instances of same design
  – fast & comprehensive comparison
• CCFs on single die
  – intuitively higher threat
  – quantification of threat?
  – mitigation techniques?

The Actual Problem with CCFs

One fault event
• affects both replicas
AND
• is not detected by comparator, i.e. leads to “symmetric” fault effect
AND
• produces an erroneous output, i.e. does not crash the cores

Possible Countermeasures for CCFs

Sources:
• Design & process
  – design fault or (latent) process deficiency
• Thermal coupling
  – hot spot affects both replicas in the same way
• Mechanical defect
  – affects both replicas symmetrically
• Electrical coupling
  – wire bound (shared lines: VDD, reset, clock)
  – wireless (EMI)

Countermeasures:
• diversity, burn-in, fault avoidance
• asymmetric propagation paths
• asymmetric critical paths
• asymmetric antennas (?)

Propagation Speed Comparison

• Thermal & mechanical propagation are relatively slow
• tens of thousands of clock cycles pass within 1 ms

Experimental Assessment

Evaluation experiments:
1) single corresponding points in Core 1 and Core 2, with offset t
2) multiple corresponding points, with offset t
3) single non-corresponding points, no offset

[Figure: fault injection setup. Master and Checker feed a compare unit; a Golden Node serves as reference. Observed signals: Data, Addr, Iaddr, We. Question checked: erroneous write access?]

Symmetry Requirements for CCF

• even a small offset …
• fault multiplicity …
• asymmetry of impact …

… improve detection coverage

[Figure: results per processor unit. Labels: RF (7028), ExVecTab (8202), ALU (2472), PSW (308), DEC (152), P2 (158), PC+P1 (182).]

Squeezing out more Efficiency

• dual core is expensive
  – normally yields performance improvement
  – would be welcome here as well: increasing performance demand @ limited clock rates
  – but: exclusively dedicated to safety here
• observation: not all tasks are safety critical
• enable flexible switching between “safety mode” and “performance mode”

Operation in Performance Mode

• cores execute different instruction streams in parallel
• both cores have direct access to memory / peripherals
• instruction caches introduced to minimize penalties from conflicting access
• temporal diversity disabled
• comparator disabled

Requirements on the Mode Switching

• coherent operation in safety mode
  – internal states of cores must be aligned before switching to safety mode (register file, cache)
• safe operation in safety mode
  – switching must not introduce safety leakage
  – no corruption of safety-relevant data in perform. mode
• low performance penalty for mode switching
  – slow or complicated switching would spoil the anticipated performance gain

Implementation of the Split Core Frame

[Figure: block diagram of the split core frame. Blocks: Core 1, Core 2, two instruction caches, instruction RAM and data RAM with control logic, safe instruction memory, safe data memory, mode-switch detect units and the mode-switch unit. Signals per core: instruction address, instruction, data address, data out, data in, clk, wait signal, interrupt, mode switch. Buses carry parity: address parity, instruction parity, address with parity, data with parity.]

Mode Switch: Safety => Performance

[Figure: timing diagram of the transition from safety mode to performance mode. Signals: clk, clk_core2, core1 signal, core2 signal, message1, message2, wait1, wait2, status safety mode.]

    LDL r1, 248       load ID reg address
    LDH r1, 255
    mode switching    mode switch instr => core1 wait => core2 wait => clk align => switch mode
    LDW r2, r1        load & check ID bit => cond branch core2
    BTEST r2, 1
    JMPI_CT

Mode Switch: Performance => Safety

[Figure: timing diagram of the transition from performance mode to safety mode. Signals: clk, clk_core2, core1 signal, core2 signal, message1, message2, wait1, wait2, status safety mode.]

• core1 encounters mode switch instr
  => trigger MSU (core1 signal)
  => halt core1 (wait1)
  => interrupt core2 (message2)
• core2 encounters interrupt
  => save context
  => jump to mode switch instr
• core2 executes mode switch
  => halt core2 & switch clock
  => resume core1
  => resume core2 after delay
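The ordering of the performance-to-safety hand-shake can be written down as a small behavioural model. The classes and the trace format are hypothetical and only mirror the signal names from the slide (core1 signal, wait1, message2, wait2); timing, clock switching and the actual context save are not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    name: str
    halted: bool = False
    context_saved: bool = False

@dataclass
class ModeSwitchUnit:
    mode: str = "performance"
    trace: list = field(default_factory=list)

    def switch_to_safety(self, core1, core2):
        # core1 reaches the mode switch instruction and triggers the MSU
        self.trace.append("core1 signal")
        core1.halted = True               # wait1: core1 is halted
        self.trace.append("wait1")
        self.trace.append("message2")     # core2 is interrupted
        core2.context_saved = True        # core2 saves its context
        # core2 jumps to the mode switch instruction and executes it
        self.trace.append("core2 signal")
        core2.halted = True               # wait2: core2 halted, clock switched
        self.trace.append("wait2")
        self.mode = "safety"
        core1.halted = False              # resume core1
        core2.halted = False              # resume core2 after the delay
        return self.trace
```

The point of the sequence is that core1 waits until core2 has also reached the switch instruction, so both cores resume from an aligned state, with core2 offset by the temporal-diversity delay.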

Fault Injection in Safety Mode

                         master    slave    frame   overall
detected
  no effect                1029    56962     5334     63325
  before effect            5026        0     1324      6350
  within 1.5 cy           50956        0      569     51525
  later                       0        0        0         0
not detected
  no effect                7055     7102     4275     18432
  with effect                 0        0        0         0
overall                   64066    64064    11502    139632

• Delayed WR still ensures error confinement

Fault Injection in Performance Mode

fault injected in performance mode, then switch to safety mode

Columns: detection in perf mode (early, late, stuck) | detection in safety mode (≤1.5cy, >1.5cy) | never

effect in
  perf only      1149  423  25617  34583  458
  both modes     --  --  --  0  0  0
  safety only    --  --  --  9654  0  0
  none           1473  47715  18560

• No undetected effects / late detections in safety mode
• Watchdog important to prevent hang-up in perf mode

We still need a “Safe Memory”

Why not duplicate & compare?

• detect bit flips in storage cells
  – parity (or EDC/ECC)
• detect erroneous address decoding
  – special decoder logic design
• protect interfaces
  – parity for data, address and control buses
• prevent illegal WR access
  – provide mask input for write enable
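Two of these mechanisms, parity per stored word and a mask input for write enable, can be sketched as a software model. The class and attribute names are hypothetical, and single-bit even parity is used as the simplest case of the "parity (or EDC/ECC)" option.

```python
def even_parity(value):
    """Parity bit that makes the total number of ones even."""
    return bin(value).count("1") % 2

class SafeMemory:
    """Memory model with one parity bit per word and a write-enable mask."""

    def __init__(self, size):
        self.data = [0] * size
        self.parity = [0] * size
        self.write_mask = False  # when set, write enable is suppressed

    def write(self, addr, value):
        if self.write_mask:
            return False  # illegal WR access prevented
        self.data[addr] = value
        self.parity[addr] = even_parity(value)
        return True

    def read(self, addr):
        value = self.data[addr]
        if even_parity(value) != self.parity[addr]:
            raise ValueError(f"parity error at address {addr}")
        return value
```

A single flipped bit in a stored word changes its parity and is caught on the next read. An even number of flips in the same word would go unnoticed, which is where a stronger EDC/ECC code comes in.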

Possible Address Decoder Errors

• correct behavior:
  – any given address activates exactly one assigned memory cell
• erroneous behaviors:
  – an address activates no memory cell at all
  – an address activates more than one memory cell
  – an address activates a wrong memory cell
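The correct behavior and the three error classes can be stated as a check on the decoder's select lines. The list-of-bits representation and the function names are assumptions made for this sketch.

```python
def decode(addr, n_cells):
    """Fault-free decoder: one select line per cell, exactly one active."""
    return [1 if i == addr else 0 for i in range(n_cells)]

def classify(select_lines, addr):
    """Classify the decoder output for the requested address."""
    active = [i for i, line in enumerate(select_lines) if line]
    if not active:
        return "no cell activated"
    if len(active) > 1:
        return "multiple cells activated"
    if active[0] != addr:
        return "wrong cell activated"
    return "ok"
```

The first two error classes violate the one-hot property of the select lines and can be detected by looking at the lines alone. The third one leaves the one-hot property intact, so detecting it needs extra information about the address, for example its parity.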

Checking the Address Decoder

• large decoders built from cascade of smaller ones
• re-check parity behind cell array: OR over even cells vs. parity?
• check for missing or multiple cell activations: XOR(upper half) vs. XOR(lower half)?

[Figure: address decoder built from AND gates (&) driving the memory cell array. Labels: A0, A1, A2, AP, XOR, dual-rail checker, pe.]

Summary

• the automotive domain has its own laws and rules
  – need “extremely cost-effective robust solutions for safety-critical real-time applications, versatile and custom tailored”
• on node level
  – different redundancy concepts applicable
  – example: dual core CPU and memory with protection mech’s
  – on-line testing for memory may be required
• on system level
  – crucial role of communication infrastructure
  – advantages of time triggered approach
  – insufficient suitability of structural testing


Hungry for more?

http://ti.tuwien.ac.at/ecs

[email protected]

Related publications of my group (1)

[1] T. Kottke and A. Steininger, “A Fail-Silent Memory for Automotive Applications”, 9th IEEE European Test Symposium, Corsica 2004.

[2] T. Kottke and A. Steininger, “A Generic Dual Core Architecture with Error Containment”, Journal of Computing and Informatics, vol. 23, no.5, 2004.

[3] T. Kottke and A. Steininger, “A Reconfigurable Generic Dual-Core Architecture”, Int’l Conference on Dependable Systems and Networks (DSN2006), Philadelphia, 2006.

[4] T. Kottke and A. Steininger, “A Fail-Silent Reconfigurable Superscalar Processor”, 13th IEEE Pacific Rim Int’l Symposium on Dependable Computing, Melbourne, 2007.

[5] C. El Salloum, A. Steininger, P. Tummeltshammer and W. Harter, “Recovery Mechanisms for Dual Core Architectures”, 21st IEEE Int’l Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’06), Washington, 2006.

[6] A. Steininger and C. Temple, “Economic Self-Test in the Time-Triggered Architecture”, IEEE Design & Test of Computers, vol 3/1999

[7] A. Steininger, “Testing and Built-in Self-Test – A Survey”, Journal of Systems Architecture 46(2000)

Related publications of my group (2)

[8] A. Steininger and C. Scherrer, “On the Necessity of BIST in Safety-Critical Applications – A Case Study”, 29th Annual Int’l Symposium on Fault-Tolerant Computing (FTCS’29), Madison, 1999.

[9] C. Scherrer and A. Steininger, “How does Resource Utilization Affect Fault Tolerance?”, 2000 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’00), Yamanashi, 2001.

[10] C. Scherrer and A. Steininger, “How to Tune the MTTF of a Fail-Silent System”, 2001 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’01), San Francisco, 2001

[11] C. Scherrer and A. Steininger, “Dealing with Dormant Faults in an Embedded Fault-Tolerant Computer System”, IEEE Transactions on Reliability, vol. 52, no. 4, 2003.

[12] K. Thaller and A. Steininger, “A Transparent Online Memory Test for Simultaneous Detection of Functional Faults and Soft Errors in Memories”, IEEE Transactions on Reliability, vol. 52, no. 4, 2003.

Related publications of my group (3)

[13] E. Armengaud, F. Rothensteiner, A. Steininger, R. Pallierer, M. Horauer, M. Zauner, “A Structured Approach for the Systematic Test of Embedded Automotive Communication Systems”, Int’l Test Conference 2005, Austin 2005.

[14] E. Armengaud, A. Steininger and M. Horauer, “Automatic Parameter Identification in FlexRay based Automotive Communication Networks”, 11th IEEE Int’l Conference on Emerging Technologies and Factory Automation, Prague 2006.

[15] E. Armengaud, A. Steininger, M. Horauer, “Towards a Systematic Test of Embedded Automotive Communication Systems”, IEEE Transactions on Industrial Informatics vol 4, no 3

[16] P. Milbredt, A. Steininger and M. Horauer, “Automated Testing of FlexRay Clusters for System Inconsistencies in Automotive Networks”, 4th Int’l Symposium on Electronic Design, Test and Applications, Hong Kong, 2008.

[17] P. Milbredt, A. Steininger, M. Horauer, “An investigation of the Clique Problem in FlexRay”, Proc. 3rd IEEE Symposium on Industrial Embedded Systems, Las Vegas, 2008.

Related publications of my group (4)

[18] P. Tummeltshammer and A. Steininger, “Power Supply Induced Common Cause Faults – Experimental Assessment of Potential Countermeasures”, 9th IEEE International Conference on Dependable Systems and Networks, Estoril, 2009.

[19] E. Armengaud, A. Steininger, M. Horauer, R. Pallierer, “A Layer Model for the Systematic Test of Time-Triggered Automotive Communication Systems”, 5th IEEE Int’l Workshop on Factory Communication Systems, Vienna, 2004.


[21] E. Armengaud and A. Steininger, “Pushing the Limits of Remote Online Diagnosis in Embedded Real-Time Networks”, 6th IEEE Int’l Workshop on Factory Communication Systems, Torino, 2006.


Related PhD theses of my group

T. Kottke, “Untersuchung von fehlertoleranten Prozessorarchitekturen für sicherheitsrelevante Automobilanwendungen”, PhD thesis, Vienna University of Technology, 2005. (German)

C. Scherrer, “Zuverlässigkeit zweifach redundanter Architekturen unter besonderer Berücksichtigung latenter Fehler”, PhD thesis, Vienna University of Technology, 2002. (German)

K. Thaller, “A Transparent Online Memory Test”, PhD thesis, Vienna University of Technology, 2001.

E. Armengaud, “A Transparent Online Test Approach for Time-Triggered Communication Protocols”, PhD thesis, Vienna University of Technology, 2008.

P. Tummeltshammer, “An Analysis of Common Cause Failures in Dual Core Architectures”, PhD thesis, Vienna University of Technology, 2009.

G. Fuchs, “Fault-Tolerant Distributed Algorithm for Robust Tick Synchronization: Concepts, Implementations and Evaluations”, PhD thesis, Vienna University of Technology, 2009

Related Projects

STEACS (Systematic Test of Embedded Automotive Communication Systems)
http://embsys.technikum-wien.at/projects/steacs/index.html

EXTRACT (Exploiting Synchrony for Transparent Communication Services Testing)
http://ti.tuwien.ac.at/ecs/research/projects/extract

DARTS (Distributed Algorithms for Robust Tick Synchronization)
http://ti.tuwien.ac.at/ecs/research/projects/DARTS