Fault-Tolerant Systems Design

Part 1

vargas@computer.org

1. Introduction: Basic Definitions

Fault tolerance is the ability of a system to continue performing its tasks correctly after the occurrence of a fault.

The reliability of a system is the function R(t), defined as the probability that the system performs correctly throughout the time interval [t0, t], given that it was performing correctly at time t0.
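As a minimal sketch of how R(t) might be evaluated, the snippet below assumes a constant failure rate (the exponential failure law). That assumption is not stated in the slides, and the function name `reliability` and the numeric values are illustrative only.

```python
import math


def reliability(failure_rate: float, t: float, t0: float = 0.0) -> float:
    """Probability of correct operation over [t0, t].

    Assumption (not from the slides): a constant failure rate, so that
    R(t) = exp(-failure_rate * (t - t0)).
    """
    if failure_rate < 0 or t < t0:
        raise ValueError("failure_rate must be >= 0 and t must be >= t0")
    return math.exp(-failure_rate * (t - t0))


# Hypothetical numbers: 1e-4 failures per hour over a 1000-hour mission.
print(round(reliability(1e-4, 1000.0), 4))  # 0.9048
```

Under this assumption, R(t0) = 1 and R(t) decays toward 0 as the interval grows.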

Availability is the function A(t), defined as the probability that the system is operating correctly and is available to perform its tasks at the instant of time t.

2. Design of FT Systems

Fault-tolerant systems can be designed by means of two basic approaches:

Fault masking

Detection, location, and recovery (via reconfiguration of the system to remove the defective part)

If reconfiguration is the chosen approach, then:

Before reconfiguration: fault detection techniques and fault location techniques

After reconfiguration: fault recovery techniques

Fault recovery techniques:

Rollback recovery

Forward recovery

All techniques for designing FT systems are based on some type and degree of redundancy.

Redundancy is implemented through the use of hardware, software, information, or time beyond what is necessary for normal system operation.

It has a non-negligible impact on the system in terms of performance, size, weight, power consumption, and reliability.

Redundancy at the HW level:

1. Passive

2. Active (or Dynamic)

3. Hybrid

HW Redundancy: 1. Passive

Passive redundancy is based on the concept of fault masking: the occurrence of faults is hidden, and faults are prevented from resulting in errors. It is developed around the concept of majority voting.

Passive techniques do not provide for fault detection; they simply mask the faults.

[Figure: Basic concept of Triple Modular Redundancy (TMR). Blocks: Module 1, Module 2, Module 3, Voter, Output.]

[Figure: The use of triplicated voters in a TMR configuration. Blocks: Proc 1, Proc 2, Proc 3; three voters; Mem 1, Mem 2, Mem 3.]
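The majority vote at the heart of TMR can be sketched in a few lines. This is a minimal illustration, assuming the three module outputs are integers of equal width; the function name `majority_vote` and the sample values are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three module outputs.

    Each output bit takes the value that at least two modules agree on,
    so an error confined to one module is masked.
    """
    return (a & b) | (a & c) | (b & c)


good = 0b1011_0010
faulty = good ^ 0b0000_1000  # hypothetical single-bit error in one module

print(majority_vote(good, faulty, good) == good)  # True
```

Note that the voter masks the faulty output without reporting it, which matches the passive approach: no fault detection is provided.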

[Figure: Example of SW voting. Labels: Proc 1, Proc 2, Proc 3; Task A (three copies); Task B; Voter Task.]

HW voting vs. SW voting? Factors to consider:

1. The availability of a processor to perform the voting

2. The speed at which voting must be performed

3. The criticality of space, power, and weight limitations

4. The number of different voters that must be provided

5. The flexibility required of the voter with respect to future changes in the system

In practical applications of voting, the three results in a TMR system may not completely agree, even in a fault-free environment.

For example, A/D converters in sensors may produce quantities that disagree in the least-significant bits. This disagreement can propagate into larger discrepancies after computation, which can significantly affect the voting process.

Solution: Mid-Value Select Technique

A TMR system selects the value that lies between the other two.

[Figure: Mid-value selection. Labels: corrupted signal, uncorrupted signals, selected signal.]
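Mid-value selection amounts to taking the median of the three module outputs. A minimal sketch, with a hypothetical function name and made-up sensor readings:

```python
def mid_value_select(x: float, y: float, z: float) -> float:
    """Return the value that lies between the other two (the median of three)."""
    return sorted((x, y, z))[1]


# Two readings differ only slightly; the third is corrupted.
readings = (10.02, 9.98, 57.3)
print(mid_value_select(*readings))  # 10.02
```

Unlike an exact-match vote, this selection still produces a sensible output when the fault-free values differ in their least-significant bits.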

HW Redundancy: 2. Active (or Dynamic)

Active redundancy attempts to achieve fault tolerance by means of fault detection, fault location, reconfiguration, and recovery. The property of fault masking is not obtained: there is no attempt to prevent faults from producing errors within the system.

It is more suitable for applications where temporary erroneous results are acceptable, as long as the system reconfigures and regains its operational status within a satisfactory length of time.

Duplication of functional units

Standby blocks: hot standby sparing, cold standby sparing

[Figure: A software implementation of duplication with comparison. Blocks: Processor A and Processor B, each running a comparison task and holding its own result in private memory; shared memory holding Processor A's result and Processor B's result; error signals A and B.]
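Duplication with comparison can be sketched as two replicas of the same task whose results are compared to raise an error signal. This is a simplified, single-process illustration; the function names and the injected fault are hypothetical, and the shared and private memories of the figure are not modelled.

```python
from typing import Callable, Tuple


def duplicate_and_compare(task_a: Callable[[int], int],
                          task_b: Callable[[int], int],
                          data: int) -> Tuple[int, bool]:
    """Run the same task on two (simulated) processors and compare the results.

    Returns (result_a, error_signal). A mismatch only reveals that a fault
    exists; it does not tell which replica is faulty.
    """
    result_a = task_a(data)
    result_b = task_b(data)
    return result_a, result_a != result_b


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    return x * x + 1  # hypothetical fault injected for illustration


print(duplicate_and_compare(square, square, 7))         # (49, False)
print(duplicate_and_compare(square, faulty_square, 7))  # (49, True)
```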

HW Redundancy: 3. Hybrid

Hybrid redundancy combines the attractive features of both the active and the passive approaches.

SW Redundancy:

Consistency Checks

Capability Checks

N-Self-Checking Programming

N-Version Programming

Recovery Blocks

SW Redundancy: Consistency Checks

Consistency checks use prior knowledge about the characteristics of a piece of information to verify that the information is correct.

In most applications, it is known in advance that a given quantity or operand cannot take values outside predefined limits.

Examples:

In a typical control application, a processing system samples and stores many sensor readings; each reading can be checked against the range that the sensor is expected to produce.

The amount of cash requested by a patron at a bank’s teller machine should never exceed the maximum withdrawal allowed.

The address generated by a computer should never lie outside the address range of the available memory.

In a computer, each instruction code can be checked to verify that it is not one of the illegal codes.
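The consistency-check examples can each be written as a simple range or membership test. The sketch below is illustrative only: every limit, constant, and function name is a hypothetical placeholder, not a value taken from the slides.

```python
MAX_WITHDRAWAL = 500             # hypothetical withdrawal limit
MEMORY_SIZE = 0x10000            # hypothetical 64 KiB of available memory
SENSOR_RANGE = (-40.0, 125.0)    # hypothetical valid sensor range
LEGAL_OPCODES = {0x00, 0x01, 0x02, 0x10}  # hypothetical instruction set


def reading_is_consistent(value: float) -> bool:
    low, high = SENSOR_RANGE
    return low <= value <= high


def withdrawal_is_consistent(amount: int) -> bool:
    return 0 < amount <= MAX_WITHDRAWAL


def address_is_consistent(address: int) -> bool:
    return 0 <= address < MEMORY_SIZE


def opcode_is_consistent(opcode: int) -> bool:
    return opcode in LEGAL_OPCODES


print(reading_is_consistent(300.0))     # False: outside the sensor range
print(withdrawal_is_consistent(200))    # True
print(address_is_consistent(0x12345))   # False: beyond available memory
print(opcode_is_consistent(0xFF))       # False: illegal instruction code
```

A failed check signals that the information is wrong; it does not by itself identify the fault that produced it.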

SW Redundancy: Capability Checks

Capability checks are performed to verify that a system possesses the expected capability.

Examples:

Check whether a computer has its complete memory available.

Check whether the processors in a multiprocessor system are working properly.

Periodically, a processor can execute specific instructions on specific data and compare the results with known results stored in a ROM; this checks the ALU and the memory.
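The periodic self-test described above can be sketched as a known-answer table. Here a Python tuple stands in for the ROM, and the operations, operands, and the name `alu_self_test` are assumptions made for illustration.

```python
import operator

# Known-answer table: (operation, operand_a, operand_b, expected_result).
# In a real system the expected results would be stored in ROM.
KNOWN_ANSWERS = (
    (operator.add, 0x55, 0xAA, 0xFF),
    (operator.and_, 0xF0, 0x3C, 0x30),
    (operator.xor, 0xFF, 0x0F, 0xF0),
    (operator.mul, 7, 6, 42),
)


def alu_self_test(table=KNOWN_ANSWERS) -> bool:
    """Execute each operation and compare its result with the stored answer."""
    return all(op(a, b) == expected for op, a, b, expected in table)


print(alu_self_test())  # True on a fault-free machine
```

Such a test would typically be scheduled periodically, so a fault is found within one test period of its occurrence.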

SW Redundancy: N-Self-Checking Programming

[Figure: The N-Self-Checking Programming approach to software fault tolerance. Blocks: program inputs; Program Version 1 through Program Version n, each with its acceptance tests; selection logic; program outputs.]
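A minimal sketch of the idea: each program version is paired with its own acceptance test, and the selection logic forwards an output that has passed its test. For simplicity the versions are evaluated sequentially here, which is an assumption of this sketch; all names and the square-root example are hypothetical.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

Version = Callable[[float], float]
AcceptanceTest = Callable[[float, float], bool]


def n_self_checking(x: float,
                    components: Sequence[Tuple[Version, AcceptanceTest]]
                    ) -> Optional[float]:
    """Selection logic: return the first output that passes its acceptance test."""
    for version, accept in components:
        try:
            y = version(x)
        except Exception:
            continue  # a crashing version is treated as a failed component
        if accept(x, y):
            return y
    return None  # no version produced an acceptable output


def halve(x: float) -> float:
    return x / 2  # hypothetical buggy "square root"


def sqrt_is_acceptable(x: float, y: float) -> bool:
    return abs(y * y - x) < 1e-6


components = [(halve, sqrt_is_acceptable), (math.sqrt, sqrt_is_acceptable)]
print(n_self_checking(9.0, components))  # 3.0
```

The quality of the result depends on the acceptance tests: a wrong output that passes its test is forwarded unchanged.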

Information Redundancy:

Parity, Berger, and m-of-n codes

Arithmetic codes

Hamming codes

Checksum codes

CRC (Cyclic Redundancy Checking) codes
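As an illustration of the simplest code in the list, the sketch below appends a single even-parity bit to a data word. It assumes non-negative integers, and the function names are hypothetical.

```python
def add_even_parity(word: int) -> int:
    """Append one parity bit (as the new LSB) so the number of 1s is even."""
    parity = bin(word).count("1") % 2
    return (word << 1) | parity


def parity_ok(codeword: int) -> bool:
    """A codeword is valid when it contains an even number of 1s."""
    return bin(codeword).count("1") % 2 == 0


codeword = add_even_parity(0b1011001)
print(parity_ok(codeword))            # True
print(parity_ok(codeword ^ 0b100))    # False: single-bit error detected
print(parity_ok(codeword ^ 0b110))    # True: double-bit error goes undetected
```

Single parity detects any odd number of bit errors but cannot locate or correct them.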

Time Redundancy:

Transient fault detection

Permanent fault detection

Recomputation for error correction

Time Redundancy: Transient Fault Detection

The fundamental concept is to perform the same computation two or more times and compare the results to determine whether a discrepancy exists.
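A minimal sketch of this repeat-and-compare scheme follows. The helper name `compute_with_recheck` and the simulated glitch are hypothetical; a real transient fault would come from the hardware rather than from a counter.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def compute_with_recheck(computation: Callable[[], T], repetitions: int = 2) -> T:
    """Run the same computation several times and compare the results."""
    results = [computation() for _ in range(repetitions)]
    if any(result != results[0] for result in results[1:]):
        raise RuntimeError("discrepancy detected between repeated computations")
    return results[0]


class GlitchyAdder:
    """Hypothetical unit that returns a wrong sum on its first call only."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> int:
        self.calls += 1
        return 2 + 2 + (1 if self.calls == 1 else 0)


print(compute_with_recheck(lambda: 2 + 2))  # 4

try:
    compute_with_recheck(GlitchyAdder())
except RuntimeError as error:
    print(error)  # discrepancy detected between repeated computations
```

With two repetitions a discrepancy can only be detected; deciding which result is correct needs further information.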

Time Redundancy: Permanent Fault Detection

[Figure: Permanent fault detection with time redundancy. Blocks at time t0: data, computation, store result. Blocks at time t1: data, encode data, computation, decode result, store result. The stored results are compared, and a mismatch produces an error signal.]

Time Redundancy: Recomputation for Error Correction

The time redundancy approach can also provide error correction if the computations are repeated three or more times.

Consider the example of a logical AND operation. Suppose the operation is performed three times: first, without shifting the operands; second, with a one-bit logical shift of the operands; and third, with a two-bit logical shift of the operands.
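The three shifted computations can be sketched as follows. The fault model (one output bit of the AND unit stuck at 1), the left shift direction, and the final bitwise majority vote are assumptions of this sketch; the slide itself only describes the three shifted operations.

```python
def faulty_and(a: int, b: int, stuck_bit: int = 3) -> int:
    """Hypothetical AND unit whose output bit `stuck_bit` is stuck at 1."""
    return (a & b) | (1 << stuck_bit)


def majority(x: int, y: int, z: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (x & y) | (x & z) | (y & z)


def and_with_correction(a: int, b: int, alu=faulty_and) -> int:
    """AND the operands with shifts of 0, 1, and 2 bits, then vote bit by bit."""
    results = []
    for shift in (0, 1, 2):
        shifted_result = alu(a << shift, b << shift)  # compute on shifted operands
        results.append(shifted_result >> shift)       # shift the result back
    return majority(*results)


a, b = 0b1010_0101, 0b1100_0110
print(faulty_and(a, b) == (a & b))           # False: a single run is corrupted
print(and_with_correction(a, b) == (a & b))  # True: the vote corrects the error
```

Because the faulty bit position affects a different bit of the data in each of the three runs, at most one result is wrong at any given bit, and the majority vote recovers the correct value.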