Lecture07_FaultTolerance


  • 7/31/2019 Lecture07_FaultTolerance

    1/39

    Fault Tolerance in Distributed Systems


    A system's ability to tolerate failure - 1

    Reliability: the likelihood that a system will remain operational for the duration of a mission

      o A requirement might be stated as 0.999999 reliability for a 10-hour mission, i.e. the probability of failure during the mission must be at most 10^-6

    Very high reliability matters most in critical applications (space shuttle, industrial control) in which failure could mean loss of life


    A system's ability to tolerate failure - 2

    Availability: the fraction of time a system is operational

      o An availability of 0.999999 means the system is out of operation for at most one hour in a million hours

      o A system with high availability may in fact fail; its recovery time and failure frequency must be small enough to achieve the desired availability

      o High availability is important in airline reservations, telephone switching, etc., in which every minute of downtime translates into revenue loss
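To illustrate how recovery time and failure frequency combine, steady-state availability is commonly estimated as MTTF / (MTTF + MTTR). The sketch below assumes that standard formula; the function names and the sample figures are hypothetical and do not come from the lecture.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    mttf_hours: mean time to failure; mttr_hours: mean time to repair.
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def max_downtime_hours(target_availability: float, period_hours: float) -> float:
    """Downtime budget over a period for a given availability target."""
    return (1.0 - target_availability) * period_hours


# Hypothetical system: fails about once every 1000 hours, recovers in 3.6 seconds
print(f"availability    = {availability(1000.0, 0.001):.7f}")
# Budget implied by an availability of 0.999999 over a million hours
print(f"downtime budget = {max_downtime_hours(0.999999, 1_000_000):.3f} h")
```

The point of the example is that a system may fail fairly often and still meet a demanding availability target, provided each recovery is fast enough.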


    Importance of Design

    A good fault-tolerant system design requires a careful study of failures, causes of failures, and system responses to failures

    Such a study should be carried out in detail before the design begins and must remain part of the design process


    Requirement Specification-1

    Planning to avoid failures is most

    important

    A designer must analyze the environment

    and determine the failures that must be

    tolerated to achieve the desired level of

    reliability

    To optimize fault tolerance, it is important to estimate the actual failure rate for each possible failure


    Requirement Specification-2

    Failure types:

      o Some are more probable than others

      o Some are transient, others permanent

      o Some occur in hardware, others in software


    Design-1

    Design of systems that tolerate faults that occur while the system is in use

    Basic principle: redundancy

      o Spatial - redundant hardware

      o Informational - redundant data structures

      o Temporal - redundant computation


    Design-2

    Redundancy costs money and time

    One must optimize the design by trading off the amount of redundancy used against the desired level of fault tolerance

    Temporal redundancy usually requires recomputation and results in slower recovery from failure

    Spatial redundancy gives faster recovery but increases hardware cost, space, power, and similar requirements


    Design-3

    Commonly Used Techniques for Redundancy

    Modular redundancy

      o Uses multiple, identical replicas of hardware modules and a voter mechanism

      o The outputs from the replicas are compared, and the correct output is determined by majority vote

      o Can tolerate most hardware faults that affect only a minority of the hardware modules
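The voter in a modular-redundancy scheme can be sketched in a few lines. This is a minimal software illustration, assuming replica outputs are directly comparable values; the function name `majority_vote` is hypothetical, and a hardware voter would of course be built differently.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of all replicas agree.
    return value if count > len(outputs) // 2 else None


# Triple modular redundancy: one of three replicas returns a wrong result
print(majority_vote([42, 42, 17]))
# No strict majority: the voter cannot decide and reports None
print(majority_vote([1, 2, 3]))
```

With 2k+1 replicas the voter still returns the correct value when up to k replicas are wrong, which is why the faults must be confined to a minority of modules.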


    Design-4

    N-Version Programming

      o Write multiple versions of a software module

      o Outputs from these versions are received, and the correct output is determined via a voting mechanism

      o Each version is written by a different team, in the hope that the versions will not contain the same bugs

      o Can tolerate software bugs that affect a minority of versions

      o Cannot tolerate correlated faults, where the reason for failure is common to two (or more) modules, e.g. two modules share a single power supply, failure of which causes both to fail


    Error-Control Coding - 1

    Replication is expensive

    For certain applications (RAM, buses), error-correcting codes can be used

      o Hamming or other codes

    Checkpoints and rollbacks

      o A checkpoint is a copy of an application's state saved in some storage that is immune to the failures under consideration

      o A rollback restarts the execution from a previously saved checkpoint


    Error-Control Coding - 2

    When a failure occurs, the application's state is rolled back to the previous checkpoint and execution restarts from there

    Can be used to recover from transient as

    well as permanent hardware failures

    Can be used for uniprocessor and

    distributed applications
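Checkpointing and rollback can be sketched for a single process as follows. This is a toy illustration, not the lecture's implementation: the function `run_with_recovery`, the injected fault, and the in-memory variable standing in for failure-immune storage are all assumptions.

```python
import copy


def run_with_recovery(items, fail_at, checkpoint_every=2):
    """Sum items with periodic checkpoints; one fault is injected at index fail_at.

    Returns (total, number_of_rollbacks).
    """
    state = {"next": 0, "total": 0}
    saved = copy.deepcopy(state)  # checkpoint; assumed to survive the fault
    fault_pending = True
    rollbacks = 0
    while state["next"] < len(items):
        try:
            state["total"] += items[state["next"]]  # state is now half-updated
            if fault_pending and state["next"] == fail_at:
                fault_pending = False
                raise RuntimeError("simulated transient fault")
            state["next"] += 1
            if state["next"] % checkpoint_every == 0:
                saved = copy.deepcopy(state)  # take a new checkpoint
        except RuntimeError:
            state = copy.deepcopy(saved)  # rollback: discard work since checkpoint
            rollbacks += 1
    return state["total"], rollbacks


print(run_with_recovery([1, 2, 3, 4, 5], fail_at=3))
```

The work done between the last checkpoint and the fault is recomputed after the rollback, which is the temporal-redundancy cost: recovery is cheap in hardware but slower.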


    Recovery Blocks

    Uses multiple alternates to perform the same function

    One module is primary; the others are secondary

    When the primary completes execution, its outcome is checked by an acceptance test

    If the output is not acceptable, a secondary module executes, and so on until either an acceptable output is obtained or the alternates are exhausted

    This method can tolerate software failures because the alternates are usually implemented with different approaches (software algorithms)
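The control flow of a recovery block can be sketched as below. The helper names (`recovery_block`, `buggy_primary`, `accept`) and the integer-square-root task are hypothetical choices for illustration; a full recovery block would also restore the program state before each alternate runs, which this sketch omits because the alternates here are pure functions.

```python
import math


class AlternatesExhausted(Exception):
    """Raised when no alternate produces an acceptable result."""


def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crashing alternate counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise AlternatesExhausted(f"no acceptable result for input {x!r}")


def buggy_primary(x):
    """Primary module with a deliberate bug: halves instead of taking the root."""
    return x // 2


def accept(x, r):
    """Acceptance test for an integer square root: r*r <= x < (r+1)*(r+1)."""
    return r * r <= x < (r + 1) * (r + 1)


# The primary fails the test, so the secondary (math.isqrt) is used instead
print(recovery_block([buggy_primary, math.isqrt], accept, 16))
```

Note that the acceptance test checks the result independently of how it was computed; its quality limits which software faults the scheme can catch.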


    Dependability Evaluation - 1

    Once a system has been designed, it must be evaluated to determine whether it meets its reliability and dependability objectives

    Two approaches to evaluating dependability:

      o Use an analytical model
          - can help developers to determine a system's possible states and the probabilities of transitions among them
          - it can be difficult to analyze models accurately

      o Inject faults
          - various types of faults can be injected to determine various dependability metrics


    Dependability Evaluation-2

    In distributed systems, a transaction-based service can accept occasional failures followed by a lengthy recovery procedure

    A real-time service (e.g. process control)

      o may have inputs that are readings taken from sensors

      o may have outputs to actuators that are used to control a process directly or to activate alarms so that humans can intervene in the process

      o due to strict timing requirements, recovery must be achieved within a very small time limit, e.g. air traffic control, monitoring patients, controlling reactors


    Dependability Evaluation-3

    A Fault-Tolerant Service

      o For a service to perform correctly, both the effect on a server's resources and the response sent to the client must be correct

      o Correct behavior must be specified

      o Failure semantics, i.e. the ways in which the service can fail, must be specified

      o Can detect a fault and then either fails predictably or masks the fault from its users

      o Operates in the presence of faults in the services on which it depends


    Fault models

    Omission failure
      o A server omits to respond to a request or to receive a request

    Response failure
      o Value failure - returns the wrong value
      o State transition failure - has the wrong effect on resources

    Timing failure
      o Any response that is not available to a client within a specified real-time interval

    Server crash failure: a server repeatedly fails to respond to requests until it is restarted
      o Amnesia crash - a server restarts in its initial state, having forgotten its state at the time of the crash, i.e. it loses the values of the data items
      o Pause crash - a server restarts in the state it was in before the crash
      o Halting crash - a server never restarts


    Fault Example

    UDP service

      o has omission failures because it occasionally loses messages

      o does not have value failures because it does not transmit corrupt messages

    UDP uses checksums to mask the value failures of the underlying IP by converting them to omission failures
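The idea of converting a value failure into an omission failure can be sketched with any checksum. The example below uses CRC-32 from Python's `zlib` purely as a stand-in (an assumption for illustration; UDP itself uses a 16-bit checksum), and the function names `send` and `receive` are hypothetical.

```python
import zlib
from typing import Optional


def send(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive(datagram: bytes) -> Optional[bytes]:
    """Return the payload, or None if the datagram is corrupt.

    Returning None models dropping the message: the caller sees a lost
    message (omission failure) instead of wrong data (value failure).
    """
    if len(datagram) < 4:
        return None
    payload, checksum = datagram[:-4], datagram[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None
    return payload


datagram = send(b"hello")
corrupted = bytes([datagram[0] ^ 0x01]) + datagram[1:]  # flip one bit in transit
print(receive(datagram))
print(receive(corrupted))
```

A higher layer then only has to handle one kind of failure, a missing message, typically with timeouts and retransmission.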


    Handling Failures


    Process Resilience

    Protection against process failures can be achieved by replicating processes into groups.

      o Flat group: all the members are equal

      o Hierarchical group: there is a coordinator


    Flat Groups versus Hierarchical Groups

    Figure 8-3. (a) Communication in a flat group. (b) Communication in a simple hierarchical group.


    Failure masking

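    In the simplest fault case, where faulty processes just stop, k+1 processes provide k-fault tolerance: if k of them fail, the remaining one can still answer

    If the faulty processes continue to run and give faulty responses, but do not team up to give the same wrong response, then a k-fault-tolerant system needs 2k+1 processes, so that the k+1 correct ones form a majority

    This assumes that all the messages arrive at all nodes at the same time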


    Agreement issue in Faulty Systems

    Possible assumptions about the underlying system:

    1. Synchronous versus asynchronous systems.

    2. Communication delay is bounded or not.

    3. Message delivery is ordered or not.

    4. Message transmission is done through unicasting or multicasting.


    Circumstances under which distributed

    agreement can be reached


    Byzantine Agreement Problem: Lamport et al. 1982

    Assume a reliable, synchronous, ordered, unicast-based message system. There are N processes, k of which may be faulty or even malicious. A faulty process may send different values to different processes.


    Byzantine failure

    In fault-tolerant distributed computing, a Byzantine failure is an arbitrary fault that occurs during the execution of an algorithm in a distributed system. When a Byzantine failure has occurred, the system may respond in any unpredictable way.

    These arbitrary failures may be loosely categorized as follows:

      o a failure to take another step in the algorithm, also known as a crash failure;

      o a failure to correctly execute a step of the algorithm; and

      o arbitrary execution of a step other than the one indicated by the algorithm.


    Byzantine failure

    "Byzantine" refers to the Byzantine Generals' Problem, an agreement problem in which generals of the Byzantine Empire's army must decide unanimously whether or not to attack some enemy army.

    The problem is complicated by the geographic separation of the generals, who must communicate by sending messengers to each other, and by the presence of traitors amongst the generals.

    These traitors can act arbitrarily in order to force good generals into a wrong decision: trick some generals into attacking; force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or so confuse some generals that they never make up their minds. If the traitors succeed in any of these goals, any resulting attack is doomed, as only a concerted effort can result in victory.

    Lamport et al. proved that a solution exists when fewer than one third of the generals are traitors.


    Example: Byzantine Agreement problem for four processes

    Three non-faulty processes and one faulty process. (a) Each process sends its value to the others. Process 3 lies, giving different values x, y, z to different processes.


    Byzantine Agreement problem-2

    (b) The vectors that each process assembles based on (a).

    (c) The vectors that each process receives in step 3. Process 3 sends different vectors to different processes.


    Byzantine Agreement problem-3

    The three non-faulty processes can agree on the values received from processes 1, 2, and 4, so the value of malicious process 3 is irrelevant

    If N=3 and k=1, that is, only two non-faulty processes, this will not work!

    Lamport proved that, with 2k+1 non-faulty processes, the system will survive k faulty processes, which makes a total of 3k+1 processes
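The four-process example can be replayed in a small simulation. This is a sketch of the vector-exchange scheme described on the preceding slides, under stated assumptions: the helper `lie` models a faulty process by sending a different made-up string to every receiver, and all names (`byzantine_agreement`, `lie`, `UNKNOWN`) are hypothetical.

```python
from collections import Counter

UNKNOWN = None


def lie(sender, receiver, slot):
    """Arbitrary, receiver-dependent value sent by a faulty process."""
    return f"lie-{sender}-{receiver}-{slot}"


def byzantine_agreement(values, faulty):
    """values: process id -> private value; faulty: set of faulty ids.

    Returns, for each non-faulty process, the vector it decides on.
    """
    ids = sorted(values)
    # Steps 1-2: everyone sends its value; each process assembles a vector.
    vector = {
        p: {q: values[q] if q == p or q not in faulty else lie(q, p, q)
            for q in ids}
        for p in ids
    }
    decided = {}
    for p in ids:
        if p in faulty:
            continue
        result = []
        for slot in ids:
            # Step 3: p receives every other process's vector entry for slot;
            # faulty processes forward garbage instead of their real vector.
            reports = [lie(q, p, slot) if q in faulty else vector[q][slot]
                       for q in ids if q != p]
            value, count = Counter(reports).most_common(1)[0]
            # Step 4: keep the entry only if a strict majority of reports agree.
            result.append(value if count > len(reports) // 2 else UNKNOWN)
        decided[p] = result
    return decided


print(byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3}))
print(byzantine_agreement({1: 1, 2: 2, 3: 3}, faulty={3}))
```

With N=4 and k=1 the three correct processes end up with the same vector, with only the faulty process's entry unknown. With N=3 and k=1 no entry reaches a majority, which is the case the slide says will not work.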


    BAP with two correct processes and one faulty process

    Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.


    Client Server Communication

    TCP (Transmission Control Protocol), a connection-oriented end-to-end communication protocol, is a reliable protocol. But it does not prevent connection crash failures, which require setting up a new connection


    RPC Semantics in the Presence of Failures

    Five different classes of failures can occur in RPC systems:

    1. The client is unable to locate the server.

    2. The request message from the client to the server is lost.

    3. The server crashes after receiving a request. This can be handled with semantics such as:

         o At least once

         o At most once

       The preferred semantics is exactly once, which is virtually impossible to implement.

    4. The reply message from the server to the client is lost.

    5. The client crashes after sending a request.

    Each case needs to be resolved properly to mask the failures
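One common way to deal with lost replies (case 4) without executing an operation twice is for the client to retransmit with the same request identifier and for the server to filter duplicates. The sketch below is an illustration under that assumption; the class `AtMostOnceServer`, the function `call_with_retries`, and the deposit operation are hypothetical. The reply cache lives in memory, so the sketch does not cover a server crash (case 3).

```python
class AtMostOnceServer:
    """Executes each request id at most once and caches the reply."""

    def __init__(self):
        self.balance = 0
        self._replies = {}  # request id -> cached reply

    def handle(self, request_id, amount):
        if request_id in self._replies:
            return self._replies[request_id]  # duplicate: do not re-execute
        self.balance += amount  # non-idempotent operation
        self._replies[request_id] = self.balance
        return self.balance


def call_with_retries(server, request_id, amount, lost_replies):
    """Client side: resend the same request until a reply gets through.

    lost_replies simulates how many replies are dropped by the network.
    """
    attempts = 0
    while True:
        attempts += 1
        reply = server.handle(request_id, amount)
        if attempts > lost_replies:
            return reply, attempts


server = AtMostOnceServer()
print(call_with_retries(server, "req-1", 100, lost_replies=2))
print(server.balance)
```

Without the duplicate filter the same retry loop would give at-least-once behaviour, which is only safe for idempotent operations.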




    Reliable Multicasting

    Use negative acknowledgements, as in Scalable Reliable Multicasting (SRM)

    Non-hierarchical and hierarchical solutions are possible

    Atomic multicasting requires all the replicas to reach agreement on the success or failure of the multicast. This is known as distributed commit: two-phase or three-phase commit protocols can be used.
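The voting and decision phases of a two-phase commit, applied to an all-or-nothing multicast, can be sketched as follows. The `Replica` class and the `atomic_multicast` coordinator function are hypothetical, and the sketch deliberately ignores timeouts and crashes of the coordinator or participants, which a usable protocol has to handle.

```python
class Replica:
    """Hypothetical group member that buffers a message until told to commit."""

    def __init__(self, name, can_accept=True):
        self.name = name
        self.can_accept = can_accept
        self.pending = None
        self.delivered = []

    def prepare(self, message):
        """Phase 1: vote yes only if the message can be delivered later."""
        if self.can_accept:
            self.pending = message
        return self.can_accept

    def commit(self):
        """Phase 2 (commit): deliver the buffered message."""
        self.delivered.append(self.pending)
        self.pending = None

    def abort(self):
        """Phase 2 (abort): discard the buffered message."""
        self.pending = None


def atomic_multicast(replicas, message):
    """Coordinator: the message is delivered everywhere or nowhere."""
    votes = [r.prepare(message) for r in replicas]  # phase 1: collect votes
    decision = all(votes)
    for r in replicas:  # phase 2: broadcast the decision
        if decision:
            r.commit()
        else:
            r.abort()
    return "commit" if decision else "abort"


a, b, c = Replica("a"), Replica("b"), Replica("c", can_accept=False)
print(atomic_multicast([a, b], "m1"))
print(atomic_multicast([a, b, c], "m2"))
print(a.delivered, c.delivered)
```

A single "no" vote aborts the multicast for every replica, including those that had already buffered the message.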


    Recovery from a failure

    When and how can the state of a distributed system be recorded and recovered, by means of checkpointing and logging?

    To be able to recover to a stable state, it is important that the state is safely stored.


    Stable Storage

    Stable storage is an example of group masking at the disk block level

    Designed to ensure that permanent data is recoverable after a system failure during a disk write operation or after a disk block has been damaged


    Stable Storage

    Provided by a Careful Storage Service

      o The unit of storage is the stable block

      o Each stable block is represented by two careful blocks that hold the contents of the stable block in duplicate

      o The write operation writes one careful block, ensuring it is correct before writing the second block

      o Careful blocks are disk blocks stored with a checksum to mask value failures; the blocks are located on different disk drives with independent failure modes

      o Value failures are converted to omission failures

      o The read operation reads one of the pair of careful blocks; if an omission failure occurs, it reads the other, thus masking the omission failures of the Careful Storage Service
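The read and write rules above can be sketched with two in-memory objects standing in for disk blocks on separate drives. The class names `CarefulBlock` and `StableBlock` and the use of CRC-32 as the checksum are assumptions made for illustration.

```python
import zlib


class CarefulBlock:
    """A disk block stored with a checksum; a corrupt block reads as None."""

    def __init__(self):
        self.data = b""
        self.checksum = zlib.crc32(b"")

    def write(self, data):
        self.data = data
        self.checksum = zlib.crc32(data)

    def read(self):
        if zlib.crc32(self.data) != self.checksum:
            return None  # value failure converted to an omission failure
        return self.data


class StableBlock:
    """One stable block backed by two careful blocks holding duplicates."""

    def __init__(self):
        self.first = CarefulBlock()
        self.second = CarefulBlock()

    def write(self, data):
        self.first.write(data)
        if self.first.read() != data:  # verify before touching the second copy
            raise IOError("first careful block did not verify")
        self.second.write(data)

    def read(self):
        data = self.first.read()
        return data if data is not None else self.second.read()


block = StableBlock()
block.write(b"ledger")
block.first.data = b"garbage"  # simulate damage to the first copy
print(block.read())
```

Because the second copy is written only after the first has been verified, at most one of the two copies can be left damaged by a single interrupted write.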


    Stable Storage - Crash Recovery

    When a server is restarted after a crash, the pair of careful blocks (representing the stable block) will be in one of the following states:

      o both good and the same

      o both good and different

      o one good, one bad

    What does the recovery procedure do in each of the above cases?
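The slide leaves the question open. One conventional answer, offered here as an assumption and not as the lecture's own solution, is: if both blocks are good and the same, do nothing; if both are good but different, the crash happened between the two writes, so copy the first (newer) block over the second; if one is bad, copy the good block over it. The sketch below models a careful block as a hypothetical `(data, checksum)` pair, with CRC-32 as the assumed checksum.

```python
import zlib


def make_block(data: bytes):
    """Build a careful block as a (data, checksum) pair."""
    return (data, zlib.crc32(data))


def is_good(block) -> bool:
    """A block is good when its stored checksum matches its data."""
    data, checksum = block
    return zlib.crc32(data) == checksum


def recover(first, second):
    """Return the repaired (first, second) pair after a restart."""
    good_first, good_second = is_good(first), is_good(second)
    if good_first and good_second:
        if first[0] == second[0]:
            return first, second  # both good and the same: nothing to do
        return first, first  # both good, different: first was written last
    if good_first:
        return first, first  # second is bad: overwrite it with the first
    if good_second:
        return second, second  # first is bad: overwrite it with the second
    raise IOError("both careful blocks are bad; not one of the listed cases")


new, old = make_block(b"new"), make_block(b"old")
torn = (b"torn", zlib.crc32(b"torn") ^ 1)  # checksum deliberately wrong
print(recover(new, old))
print(recover(torn, old))
```

The "first is newer" rule relies on the write order stated earlier: the first careful block is always written and verified before the second.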