Lecture07_FaultTolerance

7/31/2019 Lecture07_FaultTolerance


Fault Tolerance in Distributed Systems



A System's Ability to Tolerate Failure-1

Reliability: the likelihood that a system will remain operational for the duration of a mission

A requirement might be stated as 0.999999 availability for a 10 hr mission -> the probability of failure during the mission must be at most 10^-6

Very high reliability is most important in critical

applications - space shuttle, industrial control,

in which failure could mean loss of life!



A System's Ability to Tolerate Failure-2

Availability: expresses the fraction of time a

system is operational

A 0.999999 availability means the system is not operational for at most one hour in a million hours

A system with high availability may in fact fail -> its recovery time and failure frequency must be small enough to achieve the desired availability

High availability is important in airline reservations, telephone switching, etc., in which every minute of downtime translates into revenue loss
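The arithmetic behind an availability target can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and the second function assumes every failure takes the same recovery time.

```python
def max_downtime_hours(availability: float, period_hours: float) -> float:
    """Largest total downtime allowed in a period at the given availability."""
    return (1.0 - availability) * period_hours


def achieved_availability(failures: int, recovery_hours: float,
                          period_hours: float) -> float:
    """Fraction of the period the system is operational.

    Assumption: each failure costs the same recovery time, so downtime is
    simply failure count times recovery time.
    """
    downtime = failures * recovery_hours
    return (period_hours - downtime) / period_hours


if __name__ == "__main__":
    # 0.999999 availability over a million hours allows about one hour down.
    print(max_downtime_hours(0.999999, 1_000_000))
    # Ten failures of six minutes each over a million hours.
    print(achieved_availability(10, 0.1, 1_000_000))
```

The second call shows why a system that fails fairly often can still be highly available: what matters is the product of failure frequency and recovery time.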



Importance of Design

A good fault-tolerant system design requires a careful study of failures, causes of failures, and system responses to failures

Such a study should be carried out in detail before the design begins and must remain part of the design process



Requirement Specification-1

Planning to avoid failures is most

important

A designer must analyze the environment

and determine the failures that must be

tolerated to achieve the desired level of

reliability

To optimize fault tolerance, it is important

to estimate actual failure rate for each

possible failure



Requirement Specification-2

Failure types:

Some are more probable than others

Some are transient, others permanent

Some occur in hardware, others in

software



Design-1

Design of systems that tolerate faults that occur while the system is in use

Basic Principle - Redundancy

Spatial - redundant hardware

Informational - redundant data structures

Temporal - redundant computation



Design-2

Redundancy costs money and time

One must optimize the design by trading off the amount of redundancy used against the desired level of fault tolerance

Temporal redundancy usually requires re-computation, and it results in a slower recovery from failure

Spatial redundancy has faster recovery but increases hardware cost, space, power, etc. requirements



Design-3

Commonly Used Techniques for

Redundancy

Modular redundancy

Uses multiple, identical replicas of hardware

modules and a voter mechanism

The outputs from the replicas are compared, and the correct output is determined by majority vote

Can tolerate most hardware faults that affect only a minority of the hardware modules
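A voter of this kind might look like the sketch below. It is an illustration only: the function name is hypothetical, and replica outputs are assumed to be directly comparable values (real voters often need tolerance bands for analog or floating-point outputs).

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas.

    Returns None when no value is reported by more than half of the
    replicas, i.e. when the faults are not confined to a minority.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


if __name__ == "__main__":
    # Three replicas, one of them faulty: the fault is masked.
    print(majority_vote([42, 42, 7]))
    # All three disagree: the voter cannot decide.
    print(majority_vote([1, 2, 3]))
```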



Design-4

N-Version Programming

Write multiple versions of a software module

Outputs from these versions are received and the correct output is determined via a voting mechanism

Each version is written by a different team, with the hope that they will not contain the same bugs

Can tolerate software bugs that affect a minority of versions

Cannot tolerate correlated faults - the reason for failure is common to two (or more) modules, e.g. two modules share a single power supply, failure of which causes both to fail



Error-Control Coding-1

Replication is expensive

For certain applications - RAM, buses - error correcting codes can be used

Hamming or other codes

Checkpoints and rollbacks

A checkpoint is a copy of an application's state saved in some storage that is immune to the failures under consideration

A rollback restarts the execution from a previously saved checkpoint



Error-Control Coding-2

When a failure occurs, the application's state is rolled back to the previous checkpoint and restarted from there.

Can be used to recover from transient as

well as permanent hardware failures

Can be used for uniprocessor and

distributed applications
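Checkpoint and rollback can be sketched for a single process as below. Everything here is an assumption made for illustration: the class and function names are hypothetical, the in-memory deep copy merely stands in for failure-immune storage, and the injected fault is assumed to be transient so that one re-execution succeeds.

```python
import copy


class CheckpointedCounter:
    """Toy application whose whole state is one dictionary."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # Stand-in for storage that survives the failures being considered.
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        """Save a copy of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the most recently saved state."""
        self.state = copy.deepcopy(self._checkpoint)

    def run_step(self, value, fail=False):
        self.state["step"] += 1
        if fail:
            self.state["total"] = None  # simulate a corrupted state
            raise RuntimeError("transient failure")
        self.state["total"] += value


def run_with_recovery(app, values, fail_at):
    """Process values, checkpointing after every step.

    The step with index fail_at fails once; the application is rolled back
    to the previous checkpoint and the step is executed again.
    """
    for i, value in enumerate(values):
        try:
            app.run_step(value, fail=(i == fail_at))
        except RuntimeError:
            app.rollback()
            app.run_step(value)
        app.checkpoint()
    return app.state
```

Checkpointing after every step keeps the re-computation short at the cost of more checkpoint writes; a real system would tune that interval.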



Recovery Blocks

Uses multiple alternates to perform the same function

One module is primary, others are secondary

When the primary completes execution, its outcome is checked by an acceptance test

If the output is not acceptable, a secondary module executes, and so on until either an acceptable output is obtained or the alternates are exhausted

This method can tolerate software failures, because alternates are usually implemented with different approaches (software algorithms)
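The control flow of a recovery block can be sketched as follows. The square-root alternates and the acceptance test are hypothetical examples chosen for this sketch. The alternates here do not modify shared state; a fuller implementation would also restore the state before trying the next alternate.

```python
class AlternatesExhausted(Exception):
    """Raised when no alternate passes the acceptance test."""


def recovery_block(alternates, acceptance_test, x):
    """Try the primary first, then each secondary, until one is accepted."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crashing alternate counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise AlternatesExhausted("no alternate produced an acceptable output")


def buggy_sqrt(x):
    """Hypothetical faulty primary: halves the input."""
    return x / 2


def newton_sqrt(x):
    """Secondary written with a different algorithm (Newton iteration)."""
    guess = x or 1.0
    for _ in range(50):
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_acceptable(x, result):
    """Acceptance test: squaring the result must give back the input."""
    return abs(result * result - x) <= 1e-9 * max(1.0, x)


if __name__ == "__main__":
    print(recovery_block([buggy_sqrt, newton_sqrt], sqrt_acceptable, 9.0))
```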



Dependability Evaluation-1

Once a system has been designed, it must be evaluated to determine if it meets reliability and dependability objectives

Two dependability approaches:

Use an Analytical Model

o can help developers to determine a system's possible states and probabilities of transitions among them

o can be difficult to analyze models accurately

Injecting Faults

o Various types of faults can be injected to determine various dependability metrics



Dependability Evaluation-2

In distributed systems, a transaction-based service can accept occasional failures followed by a lengthy recovery procedure

A Real-time Service - Process Control

o may have inputs that are readings taken from sensors

o may have outputs to actuators that are used to control a process directly or to activate alarms so that humans can intervene in the process

o due to strict timing requirements, recovery must be achieved within a very small time limit, e.g. air traffic control, monitoring patients, controlling reactors



Dependability Evaluation-3

A Fault-Tolerant Service

For a service to perform correctly, both the effect on a server's resources and the response sent to the client must be correct

Correct behavior must be specified

Failure Semantics - the ways in which the service can fail - must be specified

Can detect a fault, and thus either

o fails predictably

o masks the fault from its users

Operates in the presence of faults in services on which it depends



Fault models

Omission Failure

A server omits to respond to a request or to receive a request

Response Failure

Value failure - returns wrong value

State transition failure - has wrong effect on resources

Timing Failure - any response that is not available to a client within a specified real-time interval

Server Crash Failure - a server repeatedly fails to respond to requests until it is restarted

Amnesia-crash - a server starts in its initial state, having forgotten its state at the time of the crash, i.e. loses the values of the data items

Pause-crash - a server restarts in the state before the crash

Halting-crash - server never restarts



Fault Example

UDP service

has omission failures because it occasionally loses messages

does not have value failures because it

does not transmit corrupt messages.

UDP uses checksums to mask the value

failures of the underlying IP by converting

them to omission failures
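How a checksum turns a value failure into an omission failure can be sketched as below. This is not the real UDP algorithm: UDP uses a 16-bit one's-complement checksum, while this sketch uses CRC-32 from the standard library as a stand-in, and the function names are hypothetical.

```python
import zlib
from typing import Optional


def send(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte checksum."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def receive(datagram: bytes) -> Optional[bytes]:
    """Return the payload, or None if the datagram is damaged.

    A corrupted datagram is silently dropped, so the receiver sees a
    missing message (omission failure) instead of a wrong one (value
    failure).
    """
    if len(datagram) < 4:
        return None
    checksum, payload = datagram[:4], datagram[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None
    return payload


def corrupt(datagram: bytes) -> bytes:
    """Simulate the underlying network flipping one bit of the last byte."""
    return datagram[:-1] + bytes([datagram[-1] ^ 0x01])


if __name__ == "__main__":
    print(receive(send(b"hello")))
    print(receive(corrupt(send(b"hello"))))
```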



Handling Failures



Process Resilience

Protection against process failures can

be achieved through process replication

into groups.

Flat Group: All the members are equal

Hierarchical Group: There is a

coordinator



Flat Groups versus Hierarchical

Groups

Figure 8-3. (a) Communication in a flat

group.

(b) Communication in a simple

hierarchical group.



Failure masking

In the simplest fault case (faulty processes simply stop), k+1 processes provide k fault tolerance.

If the faulty processes continue to run, providing faulty responses, but do not team up to give the same wrong response, then to have a k fault tolerant system we need 2k+1 processes.

Assume all the messages arrive at all nodes at the same time.
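The two group sizes can be checked with a small sketch. The function names are hypothetical, and the faulty replicas are modelled as each returning its own distinct wrong answer.

```python
from collections import Counter


def replicas_needed(k: int, faulty_replicas_answer: bool) -> int:
    """Group size needed to tolerate k faulty processes.

    k + 1 is enough when faulty processes simply stop responding;
    2k + 1 is needed when they keep running and return wrong responses.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if faulty_replicas_answer else k + 1


def vote(responses):
    """Strict-majority vote; None when there is no majority."""
    value, count = Counter(responses).most_common(1)[0]
    return value if count > len(responses) // 2 else None


def masked_result(k: int, correct="ok"):
    """Vote over 2k + 1 responses, k of which are wrong."""
    n = replicas_needed(k, faulty_replicas_answer=True)
    responses = [correct] * (n - k) + [f"bad-{i}" for i in range(k)]
    return vote(responses)
```

With only 2k replicas the k + 1 correct responses needed for a strict majority are no longer guaranteed, which is why the extra replica matters.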



Agreement issue in Faulty Systems

Possible assumptions about the underlying system:

1. Synchronous versus asynchronous systems.

2. Communication delay is bounded or not.

3. Message delivery is ordered or not.

4. Message transmission is done through unicasting or multicasting.



Circumstances under which distributed

agreement can be reached



Byzantine Agreement Problem:

Lamport et al. 1982

Assume a reliable, synchronous, ordered, unicast-based message system. There are N processes, k of which may be faulty or even malicious. A faulty process may send different values to different processes.



Byzantine failure

In fault-tolerant distributed computing, a Byzantine failure is an arbitrary fault that occurs during the execution of an algorithm in a distributed system. When a Byzantine failure has occurred, the system may respond in any unpredictable way.

These arbitrary failures may be loosely categorized as follows:

a failure to take another step in the algorithm, also known as a crash failure;

a failure to correctly execute a step of the algorithm; and

arbitrary execution of a step other than the one indicated by the algorithm.



Byzantine failure

Byzantine refers to the Byzantine Generals' Problem, an agreement problem in which generals of the Byzantine Empire's army must decide unanimously whether or not to attack some enemy army.

The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors amongst the generals.

These traitors can act arbitrarily in order to force good generals into a wrong decision: trick some generals into attacking; force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or so confuse some generals that they never make up their minds. If the traitors succeed in any of these goals, any resulting attack is doomed, as only a concerted effort can result in victory.

Lamport et al. proved that there is a solution when fewer than 1/3 of the generals are bad.



Example: Byzantine Agreement problem for

four processes

Three non-faulty processes and one faulty process. (a) Each process sends its value to the others. Process 3 lies, giving different values x, y, z to different processes.



Byzantine Agreement problem-2

(b) The vectors that each process assembles based on (a).

(c) The vectors that each process receives in step 3. Process 3 sends different vectors to different processes.



Byzantine Agreement problem-3

The three non-faulty processes can agree on the values received from 1, 2, and 4, so the value of malicious process 3 is irrelevant

If N=3 and k=1, that is, only two non-faulty processes, this will not work!

Lamport proved that, with 2k+1 non-faulty processes, the system will survive k faulty processes, which makes a total of 3k+1 processes
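The exchange of values and vectors in the four-process example can be simulated with the sketch below. It illustrates the k=1 case only and proves nothing in general: the function name is hypothetical, and the faulty process is modelled naively as sending a fresh, distinct value in every message, whereas a real adversary may behave more cleverly.

```python
from collections import Counter
from itertools import count

UNKNOWN = None


def byzantine_round(values, faulty):
    """Run one value exchange and one vector exchange.

    values maps process id -> private value; faulty is the set of faulty
    ids. Returns, for each non-faulty process, its result vector as a dict
    (process id -> agreed value, or UNKNOWN when no majority exists).
    """
    pids = sorted(values)
    honest = [p for p in pids if p not in faulty]
    lie = count(1000)  # source of arbitrary values used by faulty processes

    # Steps 1-2: every process sends its value to the others; each
    # non-faulty receiver assembles what it heard into a vector.
    vectors = {
        r: {s: (next(lie) if s in faulty else values[s]) for s in pids}
        for r in honest
    }

    results = {}
    for r in honest:
        # Step 3: r receives the vector of every other process; a faulty
        # sender makes up an arbitrary vector.
        received = [
            {p: next(lie) for p in pids} if s in faulty else vectors[s]
            for s in pids if s != r
        ]
        # Step 4: element-wise strict majority over the received vectors.
        result = {}
        for p in pids:
            value, n = Counter(v[p] for v in received).most_common(1)[0]
            result[p] = value if n > len(received) / 2 else UNKNOWN
        results[r] = result
    return results


if __name__ == "__main__":
    print(byzantine_round({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3}))
    print(byzantine_round({1: 1, 2: 2, 3: 3}, faulty={3}))
```

With four processes the three non-faulty ones end up with identical vectors in which only the entry for process 3 is unknown. With three processes each non-faulty process receives only two vectors, no entry reaches a majority, and nothing can be agreed.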



BAP with two correct and one faulty process

Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.



Client Server Communication

TCP (Transmission Control Protocol), a connection-oriented end-to-end communication protocol, is a reliable protocol. But it does not prevent connection crash failures, which require setting up a new connection



RPC Semantics in the Presence of Failures

Five different classes of failures that can occur in RPC systems:

1. The client is unable to locate the server.

2. The request message from the client to the server is lost.

3. The server crashes after receiving a request. This can be handled with principles such as:

At least once

At most once

The preferred principle is exactly once, which is virtually impossible to implement

4. The reply message from the server to the client is lost.

5. The client crashes after sending a request.

Each case needs to be resolved properly to mask the failures
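The difference between at-least-once and at-most-once behaviour shows up when a reply is lost and the client retransmits. The sketch below is an illustration under assumptions: the class and function names are hypothetical, the operation is a non-idempotent deposit, and duplicates are filtered with a cache of replies keyed by request id.

```python
import itertools


class Server:
    """Toy server with one non-idempotent operation: deposit."""

    def __init__(self):
        self.balance = 0
        self._replies = {}  # request id -> reply already sent

    def handle(self, request_id, amount, at_most_once=True):
        if at_most_once and request_id in self._replies:
            # Duplicate request: resend the old reply, do not re-execute.
            return self._replies[request_id]
        self.balance += amount
        reply = self.balance
        self._replies[request_id] = reply
        return reply


def call_with_retries(server, request_id, amount, lost_replies, at_most_once):
    """Client that retransmits the same request until a reply gets through.

    The first lost_replies replies are treated as lost in the network.
    """
    for attempt in itertools.count():
        reply = server.handle(request_id, amount, at_most_once)
        if attempt >= lost_replies:
            return reply


if __name__ == "__main__":
    filtering = Server()
    call_with_retries(filtering, "req-1", 10, lost_replies=2, at_most_once=True)
    print(filtering.balance)

    naive = Server()
    call_with_retries(naive, "req-1", 10, lost_replies=2, at_most_once=False)
    print(naive.balance)
```

The reply cache lives in memory, so it disappears if the server crashes. That is one reason exactly-once semantics is so hard to achieve.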





Reliable Multicasting

Use negative acknowledgements, as in scalable reliable multicasting (SRM)

Non-hierarchical and hierarchical solutions are possible

Atomic multicasting requires all the replicas to reach agreement on the success or failure of the multicast. This is known as distributed commit: two-phase or three-phase commit protocols can be used.
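The failure-free path of two-phase commit can be sketched as follows. The class, state, and function names are hypothetical, and the sketch leaves out what makes the protocol hard in practice: timeouts, logging to stable storage, and crashes of the coordinator or of participants.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Participant:
    """One replica taking part in the commit decision."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def prepare(self) -> Vote:
        """Phase 1: answer the coordinator's vote request."""
        self.state = "READY" if self.can_commit else "ABORT"
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def finish(self, decision: Vote) -> None:
        """Phase 2: apply the coordinator's global decision."""
        self.state = "COMMIT" if decision is Vote.COMMIT else "ABORT"


def two_phase_commit(participants) -> Vote:
    """Coordinator: commit only if every participant votes to commit."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    unanimous = all(v is Vote.COMMIT for v in votes)
    decision = Vote.COMMIT if unanimous else Vote.ABORT
    for p in participants:  # phase 2: broadcast the decision
        p.finish(decision)
    return decision
```

A single abort vote is enough to abort the multicast at every replica, which gives the all-or-nothing behaviour that atomic multicasting asks for.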



Recovery from a failure

When and how the state of a distributed system can be recorded and recovered by means of checkpointing and logging.

To be able to recover to a stable state, it is important that the state is safely stored.



Stable Storage

Stable storage is an example of group masking at the disk block level

Designed to ensure permanent data is recoverable after a system failure during a disk write operation or after a disk block has been damaged



Stable Storage

Provided by a Careful Storage Service

Unit of storage is the stable block

Each stable block is represented by two careful blocks that hold the contents of the stable block in duplicate

The Write operation writes one careful block, ensuring it is correct before writing the second block

Careful blocks are disk blocks stored with a checksum to mask value failures; the blocks are located on different disk drives with independent failure modes

Value failures are converted to omission failures

The Read operation reads one of the pair of careful blocks; if an omission failure occurs then it reads the other, thus masking the omission failures of the Careful Storage Service.



Stable Storage

Stable Storage - Crash Recovery

When a server is restarted after a crash, the pair of careful blocks (representing the stable block) will be in one of the following states:

both good and the same

both good and different

one good, one bad

What does the recovery procedure do in each of the above cases?
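One conventional answer is sketched below. It rests on assumptions: the class names are hypothetical, CRC-32 stands in for whatever checksum the careful storage service uses, and a Python object stands in for each disk. Because a write always updates the first careful block before the second, two good but different blocks mean the crash happened between the two writes, so the first block holds the newer data.

```python
import zlib
from typing import Optional


class CarefulBlock:
    """A disk block stored together with a checksum."""

    def __init__(self):
        self.data = b""
        self.checksum = zlib.crc32(b"")

    def write(self, data: bytes) -> None:
        self.data = data
        self.checksum = zlib.crc32(data)

    def read(self) -> Optional[bytes]:
        # A checksum mismatch (value failure) is reported as missing data
        # (omission failure).
        return self.data if zlib.crc32(self.data) == self.checksum else None


class StableBlock:
    """A stable block kept in duplicate as two careful blocks."""

    def __init__(self):
        self.first = CarefulBlock()
        self.second = CarefulBlock()

    def write(self, data: bytes) -> None:
        self.first.write(data)
        if self.first.read() != data:  # verify before touching the copy
            raise IOError("first careful block did not verify")
        self.second.write(data)

    def read(self) -> Optional[bytes]:
        data = self.first.read()
        return data if data is not None else self.second.read()

    def recover(self) -> None:
        a, b = self.first.read(), self.second.read()
        if a is not None and b is not None:
            if a != b:
                self.second.write(a)  # both good and different
            # both good and the same: nothing to do
        elif a is not None:
            self.second.write(a)      # second is bad: copy first over it
        elif b is not None:
            self.first.write(b)       # first is bad: copy second over it
        # Both bad is ruled out by the independent-failure assumption.
```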
