kanchan-kanojia
7/31/2019 Lecture07_FaultTolerance
Fault Tolerance in Distributed Systems
A system's ability to tolerate failure-1
Reliability: the likelihood that a system will remain operational for the duration of a mission
a requirement might be stated as 0.999999 reliability for a 10 hr mission -> the probability of failure during the mission must be at most 10^-6
Very high reliability is most important in critical
applications - space shuttle, industrial control,
in which failure could mean loss of life!
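The reliability requirement above can be turned into a failure-rate budget. The sketch below assumes a constant failure rate, so that reliability over a mission of length t is R(t) = exp(-rate * t); that model and the function names are illustrative assumptions, not part of the lecture.

```python
import math

def max_failure_rate(reliability, mission_hours):
    """Largest constant failure rate (per hour) that still meets the target.

    Assumes the exponential model R(t) = exp(-rate * t).
    """
    return -math.log(reliability) / mission_hours

def mission_reliability(rate_per_hour, mission_hours):
    """Probability of surviving the whole mission under the same model."""
    return math.exp(-rate_per_hour * mission_hours)

# 0.999999 reliability over a 10 hr mission allows roughly 1e-7 failures per hour.
rate = max_failure_rate(0.999999, 10.0)
print(rate, mission_reliability(rate, 10.0))
```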
A system's ability to tolerate failure-2
Availability: expresses the fraction of time a system is operational
a 0.999999 availability means the system is not operational at most one hour in a million hours
a system with high availability may in fact fail -> its recovery time and failure frequency must be small enough to achieve the desired availability
high availability is important in airline reservations, telephone switching, etc., in which every minute of downtime translates into revenue loss
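The link between failure frequency, recovery time, and availability can be sketched with the usual steady-state formula, availability = MTTF / (MTTF + MTTR). The formula and the names `mttf_hours` and `mttr_hours` are assumptions added for illustration; the slides only state the fraction-of-time definition.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up, given mean time to failure and to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

def max_downtime_hours(target_availability, period_hours):
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - target_availability) * period_hours

# A system that fails often can still be highly available if it recovers quickly.
print(availability(mttf_hours=999_999, mttr_hours=1))
print(max_downtime_hours(0.999999, 1_000_000))
```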
Importance of Design
A good fault-tolerant system design
requires a careful study of failures, causes
of failures, and system responses to
failures
Such a study should be carried out in
detail before the design begins and must
remain part of the design process
Requirement Specification-1
Planning to avoid failures is most
important
A designer must analyze the environment
and determine the failures that must be
tolerated to achieve the desired level of
reliability
To optimize fault tolerance, it is important
to estimate the actual failure rate for each
possible failure
Requirement Specification-2
Failure types:
Some are more probable than others
Some are transient, others permanent
Some occur in hardware, others in software
Design-1
Design of systems that tolerate faults that occur while system is in use
Basic Principle - Redundancy
Spatial - redundant hardware
Informational - redundant data structures
Temporal - redundant computation
Design-2
Redundancy costs money and time
One must optimize the design by trading off the amount of redundancy used against the desired level of fault tolerance
Temporal redundancy usually requires re-computation and results in a slower recovery from failure
Spatial redundancy has faster recovery but increases hardware cost, space, and power requirements
Design-3
Commonly Used Techniques for
Redundancy
Modular redundancy
Uses multiple, identical replicas of hardware
modules and a voter mechanism
The outputs from the replicas are compared, and
correct output is determined - majority vote
Can tolerate most hardware faults that can affect
the minority of the hardware modules
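The voter described above can be sketched in a few lines. This is a minimal illustration, assuming replica outputs are hashable values and that "correct" means "returned by a strict majority"; the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no value has a majority, which means too many
    replicas failed for the fault to be masked.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value

# Three replicas, one of them faulty: the fault is masked.
print(majority_vote([42, 42, 7]))  # 42
```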
Design-4
N-Version Programming
Write multiple versions of a software module
Outputs from these versions are received and the correct output is determined via a voting mechanism
Each version is written by a different team, with the hope that they will not contain the same bugs
Can tolerate software bugs that affect a minority of versions
Cannot tolerate correlated faults - the reason for failure is common to two (or more) modules, e.g. two modules share a single power supply, failure of which causes both to fail
Error-Control Coding-1
Replication is expensive
For certain applications - RAM, buses - error correcting codes can be used
Hamming or other codes
Checkpoints and rollbacks
A checkpoint is a copy of an application's state saved in some storage that is immune to the failures under consideration
A rollback restarts the execution from a previously saved checkpoint
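As an illustration of the error correcting codes mentioned above, here is a sketch of the classic Hamming(7,4) code, which adds three parity bits to four data bits and corrects any single flipped bit. The slides do not say which Hamming variant is meant, so this particular code and the function names are assumptions.

```python
def hamming74_encode(data_bits):
    """Encode [d1, d2, d3, d4] as the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the four data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                   # a single-bit fault in storage or on a bus
print(hamming74_decode(word))  # [1, 0, 1, 1]
```

Seven stored bits for four data bits is much cheaper than keeping three full copies, which is the point the slide makes about replication being expensive.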
Error-Control Coding-2
When a failure occurs, the application's
state is rolled back to the previous
checkpoint and restarted from there.
Can be used to recover from transient as
well as permanent hardware failures
Can be used for uniprocessor and
distributed applications
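Checkpoint and rollback can be sketched as follows. The class name `CheckpointedApp` is hypothetical, and an in-memory deep copy stands in for the failure-immune storage that a real system would need.

```python
import copy

class CheckpointedApp:
    """Application state plus one saved checkpoint (held in memory for illustration)."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Save a copy of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and resume from the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

app = CheckpointedApp({"total": 0})
app.state["total"] += 5
app.checkpoint()
app.state["total"] += 100  # work done after the checkpoint
app.rollback()             # a failure is detected: that work is lost and must be redone
print(app.state)           # {'total': 5}
```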
Recovery Blocks
Uses multiple alternates to perform the same function
One module is primary, others are secondary
When primary completes execution, its outcome is checked by an acceptance test
If the output is not acceptable, a secondary module executes and so on until either an acceptable output is obtained or alternates are exhausted
This method can tolerate software failures, because alternates are usually implemented with different approaches (software algorithms)
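The recovery block scheme above can be sketched as a loop over alternates guarded by an acceptance test. The names `recovery_block`, `buggy_sqrt`, `library_sqrt`, and `close_enough` are hypothetical. Restoring state before each retry is omitted here because the alternates are pure functions.

```python
import math

def recovery_block(alternates, acceptance_test, *args):
    """Try the primary first, then each secondary, until a result passes the test."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue  # an alternate that raises counts as a failed attempt
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all alternates exhausted")

def buggy_sqrt(x):
    return x / 2  # primary with a software fault

def library_sqrt(x):
    return math.sqrt(x)  # secondary built with a different approach

def close_enough(result, x):
    return abs(result * result - x) < 1e-9

print(recovery_block([buggy_sqrt, library_sqrt], close_enough, 9.0))  # 3.0
```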
Dependability Evaluation-1
Once a system has been designed, it must be evaluated to determine if it meets reliability and dependability objectives
Two dependability approaches:
Use an Analytical Model
o can help developers to determine a system's possible states and probabilities of transitions among them
o can be difficult to analyze models accurately
Injecting Faults
o Various types of faults can be injected to determine various dependability metrics
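The two approaches can be compared on a toy system. The sketch below injects random faults into three simulated replicas behind a majority voter and estimates how often the fault is masked. The setup (independent faults with a fixed probability, faulty replicas that never agree with each other) is an assumption chosen for illustration; under it the analytical answer is (1-p)^3 + 3p(1-p)^2.

```python
import random
from collections import Counter

def voted_output(correct_value, fault_prob, rng):
    """Run three replicas with injected faults and return the voted output, or None."""
    outputs = [
        correct_value if rng.random() >= fault_prob else ("bad", i)
        for i in range(3)
    ]
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None

def estimate_masking_rate(fault_prob, trials, seed=0):
    """Fraction of trials in which the voter still delivers the correct value."""
    rng = random.Random(seed)
    masked = sum(voted_output("ok", fault_prob, rng) == "ok" for _ in range(trials))
    return masked / trials

def analytical_masking_rate(p):
    """Probability that at least two of three independent replicas are correct."""
    return (1 - p) ** 3 + 3 * p * (1 - p) ** 2

print(estimate_masking_rate(0.1, 20_000), analytical_masking_rate(0.1))
```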
Dependability Evaluation-2
In distributed systems, a transaction-based service can accept occasional failures followed by a lengthy recovery procedure
A Realtime Service - Process Control
o may have inputs that are readings taken from sensors
o may have outputs to actuators that are used to control a process directly or to activate alarms so that humans can intervene in the process
o due to strict timing requirements, recovery must be achieved within a very small time limit, e.g. air traffic control, monitoring patients, controlling reactors
Dependability Evaluation-3
A Fault-Tolerant Service
For a service to perform correctly, both the effect on a server's resources and the response sent to the client must be correct
Correct behavior must be specified
Failure Semantics - the ways in which the service can fail - must be specified
Can detect a fault, thus,
o fails predictably
o masks fault from its users
o operates in the presence of faults in services on which it depends
Fault models
Omission Failure
o A server omits to respond to a request or to receive a request
Response Failure
o Value failure - returns wrong value
o State transition failure - has wrong effect on resources
Timing Failure - any response that is not available to a client within a specified real time interval
Server Crash Failure: a server repeatedly fails to respond to requests until it is restarted
o Amnesia-crash - a server starts in its initial state, having forgotten its state at the time of the crash, i.e. loses the values of the data items
o Pause-crash - a server restarts in the state before the crash
o Halting-crash - server never restarts
Fault Example
UDP service
has omission failures because it
occasionally loses messages
does not have value failures because it
does not transmit corrupt messages.
UDP uses checksums to mask the value
failures of the underlying IP by converting
them to omission failures
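Converting a value failure into an omission failure can be sketched as "verify the checksum, and drop the message on mismatch". The example uses CRC32 from `zlib` as a stand-in; UDP itself uses a 16-bit checksum, and the `send` and `receive` helpers are hypothetical.

```python
import zlib

def send(payload):
    """Prefix the payload with a 4-byte checksum."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def receive(datagram):
    """Return the payload, or None if it was corrupted in transit.

    To the caller, None looks exactly like a lost message: the value failure
    of the underlying layer has become an omission failure.
    """
    checksum, payload = datagram[:4], datagram[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None
    return payload

datagram = bytearray(send(b"hello"))
datagram[5] ^= 0x01              # corruption introduced by the network
print(receive(bytes(datagram)))  # None
```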
Handling Failures
Process Resilience
Protection against process failures can
be achieved through process replication
into groups.
Flat Group: All the members are equal
Hierarchical Group: There is a
coordinator
Flat Groups versus Hierarchical
Groups
Figure 8-3. (a) Communication in a flat
group.
(b) Communication in a simple
hierarchical group.
Failure masking
In the simplest fault case, where faulty processes simply stop, k+1 processes provide k fault tolerance.
If the faulty processes continue to run and give faulty responses, but do not team up to give the same wrong response, then a k fault tolerant system needs 2k+1 processes.
Assume all the messages arrive at all
nodes at the same time.
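The two group sizes above follow from simple counting: if faulty processes just stop, one survivor is enough, while if they keep answering, the k wrong answers must be outvoted by k+1 correct ones. A small sketch, with a hypothetical function name:

```python
def processes_needed(k, faulty_keep_responding):
    """Group size needed to tolerate k faulty processes.

    k + 1 when faulty processes simply stop, 2k + 1 when they keep sending
    (non-colluding) wrong responses that have to be outvoted.
    """
    return 2 * k + 1 if faulty_keep_responding else k + 1

for k in (1, 2, 3):
    print(k, processes_needed(k, False), processes_needed(k, True))
```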
Agreement issue in Faulty Systems
Possible assumptions about the underlying system:
1. Synchronous versus asynchronous systems.
2. Communication delay is bounded or not.
3. Message delivery is ordered or not.
4. Message transmission is done through unicasting or multicasting.
Circumstances under which distributed
agreement can be reached
Byzantine Agreement Problem:
Lamport et al. 1982
Assume a reliable, synchronous, ordered, unicast-based message system. There are N processes, k of which may be faulty or even malicious. A faulty process may send different values to different processes.
Byzantine failure
In fault-tolerant distributed computing, a Byzantine failure is an arbitrary fault that occurs during the execution of an algorithm in a distributed system. When a Byzantine failure has occurred, the system may respond in any unpredictable way.
These arbitrary failures may be loosely categorized as follows:
o a failure to take another step in the algorithm, also known as a crash failure;
o a failure to correctly execute a step of the algorithm; and
o arbitrary execution of a step other than the one indicated by the algorithm.
Byzantine failure
Byzantine refers to the Byzantine Generals' Problem, an agreement problem in which generals of the Byzantine Empire's army must decide unanimously whether or not to attack some enemy army.
The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors amongst the generals.
These traitors can act arbitrarily in order to force good generals into a wrong decision: trick some generals into attacking; force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or so confuse some generals that they never make up their minds. If the traitors succeed in any of these goals, any resulting attack is doomed, as only a concerted effort can result in victory.
Lamport et al. proved that there is a solution when fewer than 1/3 of the generals are traitors.
Example: Byzantine Agreement problem for
four processes
Three non-faulty processes and one faulty process. (a) Each process sends its value to the others. Process 3 lies, giving different values x, y, z to different processes.
Byzantine Agreement problem-2
(b) The vectors that each process assembles based on (a).
(c) The vectors that each process receives in step 3. Process 3 sends different vectors to different processes.
Byzantine Agreement problem-3
Three processes can agree on the values received from 1, 2, and 4, so the value sent by malicious process 3 is irrelevant
If N=3 and k=1, that is, only two non-faulty processes, this will not work!
Lamport proved that, with 2k+1 non-faulty processes, the system will survive k faulty processes, which makes a total of 3k+1 processes
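The vector exchange in the four-process example can be simulated directly. The sketch below is an illustration, not a proof: it assumes one particular adversary (a faulty process sends a fresh, different value in every message), and the function name `byzantine_round` and the data layout are hypothetical.

```python
from collections import Counter
import itertools

UNKNOWN = None

def byzantine_round(values, faulty):
    """Simulate the two exchange rounds and the final majority step.

    values: process id -> private value; faulty: set of faulty process ids.
    Returns, for each non-faulty process, its result vector (id -> value or UNKNOWN).
    """
    ids = sorted(values)
    lie = itertools.count(1000)  # every lie is a fresh value, so lies never agree

    # Rounds 1 and 2: each process assembles a vector of the values it was sent.
    vectors = {
        p: {q: values[q] if (q == p or q not in faulty) else next(lie) for q in ids}
        for p in ids
    }

    results = {}
    for p in ids:
        if p in faulty:
            continue
        # Round 3: p receives one vector from every other process.
        received = [
            {r: next(lie) for r in ids} if q in faulty else vectors[q]
            for q in ids if q != p
        ]
        # Round 4: keep an element only if a majority of received vectors agree on it.
        result = {}
        for r in ids:
            value, count = Counter(v[r] for v in received).most_common(1)[0]
            result[r] = value if count * 2 > len(received) else UNKNOWN
        results[p] = result
    return results

print(byzantine_round({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3}))  # N=4: agreement on 1, 2, 4
print(byzantine_round({1: 1, 2: 2, 3: 3}, faulty={3}))        # N=3: everything UNKNOWN
```

With four processes, the three correct ones end up with identical vectors in which only the faulty process's entry is UNKNOWN. With three processes, each correct process receives only two vectors, so no entry can reach a majority.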
BAP with two correct and one faulty process
Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.
Client Server Communication
TCP (Transmission Control Protocol), a connection-oriented end-to-end communication protocol, is a reliable protocol. But it does not prevent connection crash failures, which require setting up a new connection
RPC Semantics in the Presence of Failures
Five different classes of failures that can occur in RPC systems:
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request. This can be handled with principles such as:
o At least once
o At most once
o The preferred principle is exactly once, which is virtually impossible to implement
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.
Each case needs to be resolved properly to mask the failures
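Cases 2 and 4 above are commonly masked by combining client retries with duplicate filtering at the server. The sketch below shows that combination for a lost reply. All class and function names are hypothetical, and the reply cache lives in memory, so it would not survive the server crash of case 3. That is one reason exactly-once semantics is so hard to achieve.

```python
class Server:
    """Non-idempotent service that filters duplicates by request id (at most once)."""

    def __init__(self):
        self.balance = 0
        self._replies = {}  # request id -> cached reply

    def deposit(self, request_id, amount):
        if request_id in self._replies:  # retransmission: do not execute again
            return self._replies[request_id]
        self.balance += amount
        self._replies[request_id] = self.balance
        return self.balance

class LossyChannel:
    """Delivers every request but loses the first `lost_replies` replies."""

    def __init__(self, server, lost_replies):
        self.server = server
        self.lost_replies = lost_replies

    def call(self, request_id, amount):
        reply = self.server.deposit(request_id, amount)
        if self.lost_replies > 0:
            self.lost_replies -= 1
            raise TimeoutError("reply lost")
        return reply

def call_with_retry(channel, request_id, amount, max_attempts=5):
    """Client side: resend the same request id until a reply arrives."""
    for _ in range(max_attempts):
        try:
            return channel.call(request_id, amount)
        except TimeoutError:
            continue
    raise TimeoutError("no reply from server")

server = Server()
channel = LossyChannel(server, lost_replies=2)
print(call_with_retry(channel, "req-1", 10), server.balance)  # 10 10
```

Without the duplicate filter, the three deliveries of the same request would have deposited 30 instead of 10.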
Reliable Multicasting
Use negative acknowledgements, as in scalable reliable multicasting (SRM)
Non-hierarchical and hierarchical solutions are possible
Atomic multicasting requires all the replicas to reach agreement on the success or failure of the multicast. This is known as distributed commit: two-phase or three-phase commit protocols can be used.
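The two-phase commit protocol mentioned above can be sketched with a coordinator function and in-process participants. Timeouts, logging, and coordinator failure are left out, and the `Participant` class is a hypothetical stand-in for a replica.

```python
class Participant:
    """A replica that votes in phase 1 and obeys the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Return True if the operation committed everywhere, False if it was aborted."""
    # Phase 1 (voting): every participant must vote yes; an exception counts as no.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())
        except Exception:
            votes.append(False)
    decision = all(votes)
    # Phase 2 (completion): broadcast the single global decision.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision

group = [Participant("a"), Participant("b", can_commit=False), Participant("c")]
print(two_phase_commit(group), [p.state for p in group])
```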
Recovery from a failure
When and how the state of a distributed system can be recorded and recovered by means of check-pointing and logging.
To be able to recover to a stable state, it is important that the state is safely stored.
Stable Storage
Stable storage is an example of group masking at the disk block level
Designed to ensure permanent data is recoverable after a system failure during a disk write operation or after a disk block has been damaged
Stable Storage
Provided by a Careful Storage Service
Unit of storage is the stable block
Each stable block is represented by two careful blocks that hold the contents of the stable block in duplicate
Write operation writes one careful block, ensuring it is correct before writing the second block
Careful blocks are disk blocks stored with a checksum to mask value failures; the blocks are located on different disk drives with independent failure modes
Value failures are converted to omission failures
The Read operation reads one of the pair of careful blocks; if an omission failure occurs then it reads the other, thus masking the omission failures of the Careful Storage Service.
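The write and read rules above can be sketched with two simulated drives. The class names `CarefulDisk` and `StableStorage` are hypothetical, the drives are plain dictionaries, and CRC32 stands in for whatever checksum a real careful storage service would use.

```python
import zlib

class CarefulDisk:
    """One simulated drive: block number -> (checksum, data)."""

    def __init__(self):
        self.blocks = {}

    def write(self, number, data):
        self.blocks[number] = (zlib.crc32(data), data)

    def read(self, number):
        """Return the data, or None (an omission failure) if missing or corrupt."""
        entry = self.blocks.get(number)
        if entry is None:
            return None
        checksum, data = entry
        return data if zlib.crc32(data) == checksum else None

class StableStorage:
    """Each stable block is kept as two careful blocks on independent drives."""

    def __init__(self):
        self.disks = (CarefulDisk(), CarefulDisk())

    def write(self, number, data):
        first, second = self.disks
        first.write(number, data)
        if first.read(number) != data:  # make sure the first copy is good
            raise IOError("first careful block could not be verified")
        second.write(number, data)      # only then touch the second copy

    def read(self, number):
        for disk in self.disks:
            data = disk.read(number)
            if data is not None:
                return data
        return None

storage = StableStorage()
storage.write(0, b"account=5")
checksum, _ = storage.disks[0].blocks[0]
storage.disks[0].blocks[0] = (checksum, b"account=9")  # damage the first copy
print(storage.read(0))  # b'account=5'
```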
Stable Storage - Crash Recovery
When a server is restarted after a crash, the pair of careful blocks (representing the stable block) will be in one of the following states:
o both good and the same
o both good and different
o one good, one bad
What does the recovery procedure do in each of the above cases?
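One common answer, offered here as a sketch rather than as the lecture's own solution, follows from the write order (the first block is written and verified before the second): if both blocks are good but differ, the crash happened between the two writes and the first block holds the newer data. The `recover` function is hypothetical; a block is passed as its data when good and as None when bad.

```python
def recover(first, second):
    """Return the repaired (first, second) pair of careful blocks."""
    if first is not None and second is not None:
        if first == second:
            return first, second  # both good and the same: nothing to do
        return first, first       # both good and different: copy first onto second
    if first is not None:
        return first, first       # second is bad: copy the good block over it
    if second is not None:
        return second, second     # first is bad: copy the good block over it
    raise IOError("both careful blocks are bad; the stable block is lost")

print(recover(b"new", b"old"))  # (b'new', b'new')
print(recover(None, b"old"))    # (b'old', b'old')
```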