

On the Quality of Service of Crash-Recovery Failure Detectors

Tiejun Ma, Jane Hillston, and Stuart Anderson

Abstract: We model the probabilistic behavior of a system comprising a failure detector and a monitored crash-recovery target. We extend failure detectors to take account of failure recovery in the target system. This involves extending QoS measures to include the recovery detection speed and the proportion of failures detected. We also extend the procedure for estimating the failure detector parameters needed to achieve a required QoS, so that it can be used to configure a crash-recovery failure detector. We investigate the impact of the dependability of the monitored process on the QoS of our failure detector. Our analysis indicates that variation in the MTTF and MTTR of the monitored process can have a significant impact on the QoS of our failure detector. Our analysis is supported by simulations that validate our theoretical results.

Index Terms: Failure detectors, crash recovery, quality of service, availability, dependability, performance.

1 INTRODUCTION

Fault tolerance is one of the most important issues for achieving dependable distributed systems. One of the most challenging problems in this research area is to tolerate the Byzantine failure, which is also sometimes called the arbitrary failure. This means that a process may behave in an arbitrary manner, producing arbitrary responses at arbitrary times [1]. It is the most difficult failure to detect. One possible solution to Byzantine failure detection is to adopt consensus algorithms. To achieve K fault tolerance, 3K + 1 service replications are needed [2]. In the worst case, the K faulty services may send incorrect values, or incorrectly represent the values of others, but the remaining 2K + 1 services can still return the same correct answer. Crash failure detection is one of the most important building blocks for achieving a successful consensus. However, detecting crash failures is a difficult problem. In [3], Fischer et al. show that it is impossible to distinguish a crashed process from a very slow one in a purely asynchronous system, a result known as the Fischer-Lynch-Paterson impossibility result. Subsequently, failure detector oracles, which give possibly erroneous information about the state of the monitored target, have been proposed. In [4], Chandra and Toueg introduce the concept of unreliable crash failure detectors to detect the eventual crash behavior of a process and classify a set of abstract failure detectors based on the failure detectors' eventual behavior to solve a certain set of membership problems. This work inspired many researchers to study the quality of service (QoS), such as the speed and accuracy, of crash failure detector implementations and failure detection algorithms, e.g., [5], [6], [7], [8], [9], [10].

It is important to note that most of this previous work on the QoS of crash failure detectors is based on the crash-stop or fail-free assumption. The fail-free assumption assumes that failures do not occur. The crash-stop assumption assumes that there is only one failure and the monitoring procedure terminates once that crash failure is detected. The algorithms based on these assumptions focus on how to estimate the probabilistic message arrival time and a suitable time-out period for a failure detector to ensure a required QoS.

However, fail-free and crash-stop can be strong assumptions. An alternative approach is to consider the crash-recovery paradigm as discussed by Guerraoui and Rodrigues [11]. A process can keep crashing and recovering infinitely often and it is eventually always up and running. In theory, a process recovery can be achieved by adopting stable storage and the state information of the process can be stored and retrieved from the storage. After a crash is detected, the recovery procedure can be initiated to retrieve the latest stored process information. In practice, in order to provide high availability, self-repairing and self-healing mechanisms are widely adopted in fault-tolerant systems to achieve automatic recovery after a crash occurs. Particularly, in middleware systems, many techniques and algorithms have been proposed to achieve the self-repairing or self-healing goal, e.g., [12], [13], [14], [15].

In such systems, it is assumed that the system undergoes periodic crashes. During a crash period, the system is unable to service any requests or send any messages, externally behaving as if the system is unreachable. The end of the crash period is marked by a recovery, after which the system returns to normal service and its internal state is restored to the state before the crash failure occurred.

For such systems, crash-recovery failure needs to be considered as a frequently occurring failure type to be detected. However, the crash-recovery case has been little studied. There are more possible discrepancies between the failure detector and the monitored target, which increases the size of the state space of the monitoring process and makes the QoS analysis for such a paradigm more complicated.

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 3, JULY-SEPTEMBER 2010 271

. T. Ma is with the Department of Computing, Imperial College London, South Kensington Campus, 180 Queens Gate London, SW7 2AZ, UK. E-mail: [email protected].

. J. Hillston and S. Anderson are with the Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK. E-mail: {jeh, soa}@inf.ed.ac.uk.

Manuscript received 19 Feb. 2008; revised 21 Apr. 2009; accepted 30 June 2009; published online 11 Aug. 2009. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TDSC-2008-02-0037. Digital Object Identifier no. 10.1109/TDSC.2009.36.

1545-5971/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society


In [16], we presented an evaluation of the QoS of a crash-recovery failure detector based on a simple time-out algorithm. A crash-recovery target was modeled as an alternating renewal process. The simulation results showed that the crash-recovery behavior of the monitored target will impact the QoS of such a failure detector, which implied that the crash-recovery paradigm merited further study. Such an analysis was presented in [17]. In that paper, we outlined how to model the failure detection pair in a crash-recovery run and how to configure the failure detector to satisfy a given QoS requirement. The current paper represents a substantial expansion of [17]. We present more analytical details and support the results with further simulation studies. Analytical results, derived directly from the equations in this paper, are also plotted and compared with the simulation results. We are then able to present a detailed analysis for each of the QoS metrics, which shows the validity of our model.

1.1 Our Contribution

We show how to remove the fail-free or crash-stop assumption and model the probabilistic behavior of a failure detector with respect to a crash-recovery target, taking into consideration general dependability metrics, such as mean time to failure (MTTF) and mean time to recovery (MTTR). We outline how the QoS of a failure detector is limited by the dependability of the monitored target. Moreover, we establish that the crash-stop or fail-free models are special cases of the crash-recovery model.

In order to effectively assess the QoS of the failure detector in a crash-recovery run, we have defined new QoS metrics to measure the recovery detection speed and the proportion of the failures of the monitored target which are detected. To make an accurate estimation of the failure detector's parameters needed to achieve a required QoS, a configuration procedure for a crash-recovery failure detector is outlined. We demonstrate how to achieve the QoS from a given set of requirements based on the NFD-S algorithm proposed by Chen et al. [5], with suitable modifications (see Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2009.36). To the best of our knowledge, none of these aspects of QoS of failure detectors have been presented before.

1.2 Related Work

In [5], Chen et al. propose a set of QoS metrics to measure the accuracy and speed of a failure detector. Their model contains a pair of processes: one is the monitor process, the other is the monitored process, and there is only one crash during the monitoring period. The analysis is based on two separate stages of failure detection: the precrash stage, which is a fail-free run; and the postcrash stage, which is a crash-stop run when the monitoring procedure will be terminated. In order to formally define the QoS metrics, Chen et al. [5] define state transitions of a failure detector monitoring a target process under the fail-free assumption. At any time, the failure detector's state is either Trust or Suspect with respect to the monitored process's liveness. If a failure detector moves from a Trust state to a Suspect state, then an S-transition occurs; if the failure detector moves from a Suspect state to a Trust state, then a T-transition occurs. Fig. 1 shows the state transitions of the failure detector and the QoS metrics. In terms of the transitions defined above and the fail-free assumption, Chen et al. define the following QoS metrics for a failure detector: failure detection time ($T_D$), mistake recurrence time ($T_{MR}$), mistake duration ($T_M$), good period duration ($T_G$), and query accuracy probability ($P_A$).

Some recent research has extended the QoS work of [5] in a number of ways. For example, the authors of [6], [9], [10], [18] refine the model with different probabilistic message delay and loss estimation methods. Meanwhile, others, such as [7], [8], [19], [20], [21], focus on the scalability and adaptivity of crash failure detection. But all of these papers are based on eventual crash-stop behavior of the monitored process or the fail-free assumption. Crash-recovery failure detectors have been considered by several groups. For example, Boichat and Guerraoui [22] implemented reliable and total order broadcast primitives, assuming a practical asynchronous crash-recovery model in which the processes and channels may crash and recover or crash and never recover, and [23], [24], [25], [26] each propose failure detectors to solve consensus problems rather than focusing on the QoS of the failure detector itself. In [23], the monitored process is characterized as always-up, eventually-up, eventually-down, or unstable. A process which crashes and recovers infinitely many times is regarded as unstable. But crash-recovery looping behavior exists for most systems. From the perspective of stochastic theory, crash-recovery behavior can be regarded as a regenerative process in which the probabilistic live and recovery times are not zero. In the following sections, we will analyze such a crash-recovery paradigm and its failure detector from a QoS perspective.

This paper is organized as follows: in Section 2.1, we model a crash-recovery service with general dependability metrics. Then, we show our model of the probabilistic message communication and its QoS metrics. In Section 3, we show how to model the crash-recovery failure detector's probabilistic behavior. We refine the completeness of a crash-recovery failure detector and extend the QoS metrics to measure the completeness and the recovery detection speed of such a failure detector. Then, we show how to involve the general dependability metrics for an approximate analysis of the QoS of a failure detector and how to configure a crash-recovery failure detector to satisfy a given set of QoS requirements. Moreover, we discuss the impact of the dependability of the crash-recovery service on the QoS of failure detectors in detail. In Section 4, the estimation of the input parameters of a crash-recovery failure detector is presented. We show how to estimate the message delay, message loss, MTTF, MTTR, etc., in a crash-recovery run. In Section 5, the analytical and simulation results are plotted and analyzed in detail. We show that the dependability of a crash-recovery target has an impact on the QoS of a failure detector and that our analysis is valid. In Section 6, a brief summary of the paper is presented. Appendix A provides a notation table for the variables used in the paper. Appendix B shows the pseudocode of the NFD-S algorithm. Appendix C presents the main proofs of the lemmas and theorems presented in this paper.

Fig. 1. The QoS metrics without considering false positive mistakes.

2 CRASH-RECOVERY SERVICE AND QoS OF MESSAGE COMMUNICATION

In this section, we outline the assumptions underlying our framework, considering the crash-recovery behavior of the target service, its dependability characteristics, and the behavior of the communication channel which supports the failure detection process.

2.1 The Crash-Recovery Service Modeling

For a crash-recovery target service (CR-TS), we consider that the service might crash at an arbitrary time and take some time to be repaired and restarted after it fails. Let $S$ be the state space of a stochastic process $Z := \{Z(t), t \ge 0\}$, where $Z$ captures a CR-TS's lifetime. Then, $S$ can be regarded as {Alive, Crash} and the CR-TS can periodically switch between these two states. A transition occurs when the state of the CR-TS changes. Fig. 2 shows the state transitions of a CR-TS, where a C-transition occurs when the state of the CR-TS switches from the Alive state to the Crash state; an R-transition occurs when the state of the CR-TS switches from the Crash state to the Alive state.

Assumption 1. If the service's recovery is treated as a restart, then the CR-TS's lifetime $Z$ is a regenerative process.

Assumption 1 will be used in the following. It is based on the following observations. The CR-TS will periodically crash and recover, leading to a sequence of time points, $S_1, S_2, \ldots, S_n, \ldots$ ($n \ge 0$), representing the times of the CR-TS's recovery. The behavior of the system after $S_n$ ($n \ge 0$) is independent of what has occurred before, and thus, $S_n$ can be regarded as a restart. Moreover, the probability of $S_n$ occurring is 1. This makes the time points $S_1, S_2, \ldots, S_n$ regeneration points.

Since the CR-TS's lifetime $Z$ is a regenerative process and the sequence $\{S_1, S_2, \ldots, S_n, \ldots\}$ characterizes the lifetime of the service, we can give an alternative definition of the stochastic process $Z$. The stochastic process $Z$ is a set of random variables $\{X(n), n \in \mathbb{N}\}$, where $X(n)$ is the random variable representing the time which elapses from the time of the $n$th regeneration point to the $(n+1)$th one (i.e., $X(n) = S_{n+1} - S_n$). For simplicity of presentation, we use $X$ instead of $X(n)$ in the following since it is sufficient to consider a single regeneration period. Furthermore, we can consider $X$ to be the sum of two independent random variables: $X_a$ and $X_c$. Here, $X_a$ represents the time which elapses from the time that the CR-TS starts a regeneration period to the time the CR-TS fails and $X_c$ represents the time from when the CR-TS fails until the time of the next regeneration point.

Lemma 1. In steady state, the CR-TS is an alternating renewal process and the time between any two consecutive recovery time points is one period of the crash-recovery service's lifetime.

Thus, we assert that in order to design a failure detector for the CR-TS, which is sensitive to the CR-TS's behavior, we only need to consider one period of the CR-TS since all of the other periods are independent and identically distributed.
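The regenerative view can be illustrated with a minimal simulation sketch. It draws independent periods $X = X_a + X_c$ and estimates the fraction of time the CR-TS spends in the Alive state. The exponential distributions, the function names, and the numeric values are assumptions made for this illustration only; the model above does not prescribe any particular distribution.

```python
import random

def simulate_cr_ts(mttf, mttr, periods, seed=0):
    """Draw `periods` regeneration periods as (x_a, x_c) pairs.

    x_a is the time spent Alive and x_c the time spent in Crash.
    Both are sampled from exponential distributions with means
    mttf and mttr (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [(rng.expovariate(1.0 / mttf), rng.expovariate(1.0 / mttr))
            for _ in range(periods)]

def alive_fraction(samples):
    """Fraction of the total simulated time spent in the Alive state."""
    alive = sum(x_a for x_a, _ in samples)
    total = sum(x_a + x_c for x_a, x_c in samples)
    return alive / total

samples = simulate_cr_ts(mttf=100.0, mttr=10.0, periods=20000)
# The estimate should be close to 100 / (100 + 10).
print(round(alive_fraction(samples), 3))
```

Because the periods are independent and identically distributed, statistics gathered over many periods estimate the behavior of a single period, which is the property the analysis relies on.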

2.2 Dependability of a Crash-Recovery Service

Dependability, one of the most important issues for computer systems, is a complex attribute. Laprie et al. [1] define the concept of dependability as the property of a computer system such that reliance can justifiably be placed on the service it delivers. Associating timing information with the behavior of a system, its dependability can be described quantitatively. Generally speaking, the dependability of a system can be measured according to a number of different aspects such as reliability, availability, consistency, usability, security, etc. In order to simplify the measurements which are related to failure detection, here, we only introduce reliability, availability, and consistency, which are strongly related to the QoS of failure detectors.

In [27], Knight and Strunk give a definition of software reliability and availability. We extend this with a definition of consistency as follows:

. Reliability: is the probability that the system will operate correctly in a specified operating environment up until time $t$ ($t > 0$).

. Availability: is the probability that the system will be operational at time $t$.

. Consistency: is the probability that in a specified operating environment, the system will return to normal operation correctly after a failure within time $t$.

These three metrics present different aspects of the system dependability. Generally, reliability presents how long a system will operate correctly and can be captured by MTTF, which records the likelihood of a service to persist without a failure. Availability presents the probability that a system is accessible or reachable with correct operation at an arbitrary time and can be captured by mean time to failure divided by mean time between failures ($\frac{MTTF}{MTBF}$). Consistency presents the ability of a system to recover from a failure state to the correct operation state and can be captured by MTTR, which records how quickly a system recovers.
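As a small worked example of the availability ratio, the sketch below computes $\frac{MTTF}{MTBF}$ from MTTF and MTTR. It assumes the common convention MTBF = MTTF + MTTR, which the text above does not state explicitly; the function name and the sample values are hypothetical.

```python
def availability(mttf, mttr):
    """Availability as MTTF / MTBF.

    Assumes MTBF = MTTF + MTTR (a common convention, taken here as
    an assumption). Both arguments must use the same time unit.
    """
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)

# A target that runs 990 time units on average and needs 10 to recover.
print(availability(990.0, 10.0))
# Halving the recovery time raises availability without changing MTTF.
print(availability(990.0, 5.0) > availability(990.0, 10.0))
```

The second call illustrates why consistency (MTTR) and reliability (MTTF) can be traded against each other when a system is designed to be highly accessible.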

In different scenarios, different aspects of dependability may be given greater relative importance. For example, consistency may be valued more than reliability in a system designed to be always accessible. This means that fault-tolerance mechanisms should be able to adapt to reflect differing dependability requirements.


Fig. 2. Crash-recovery service modeling.


2.3 QoS of Message Communication

In order to measure the communication between the FDS and the target service quantitatively, we define the communication path between the FDS and the target service as a channel. Each communication component pair holds one or more virtual one-way, source-to-destination channels. Messages can only flow from the source component to the destination component. In addition, the channel model in this paper relies on the assumption of a basic unreliable communication channel with fairness, no-creation, and no-duplication [28]. This has some similarities with the Stubborn channels in [28], but they allow duplicated messages and we assume that there are no duplicated messages in our model.

This channel-based communication, which maintains the interaction between the FDS and the CR-TS, can be characterized by the QoS of the communication, the adopted failure detection algorithm, and the adopted communication protocol, each of which has some associated properties. In particular, we take the message transmission behavior to be probabilistic: we describe the message delay or loss as probabilistic behaviors associated with the communication channel.

Definition 1. Let $D$ be a random variable representing the time which elapses from the time a message is sent until the time it arrives at the destination and $E(D)$ be the average message delay; let $p_L$ be the probability of a message loss during the transmission; let $X_L$ be a random variable representing the number of consecutive messages lost and $E(X_L)$ be the average number of consecutive messages lost.

From these definitions, properties such as the following can be derived:

Lemma 2. If each message's transmission and loss behavior are independent, then the probability that $x$ ($x \ge 1$) consecutive messages are lost is

$$\Pr(X_L = x) = p_L^{x} \cdot (1 - p_L).$$
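The formula of Lemma 2 can be checked numerically with a short sketch. The first function evaluates the closed form; the second estimates the same probability by simulating independent losses and counting how many messages are lost in a row before one gets through. The function names, the loss probability, and the number of trials are assumptions chosen for illustration.

```python
import random

def prob_consecutive_losses(x, p_loss):
    """Closed form of Lemma 2: p_L^x * (1 - p_L)."""
    return p_loss ** x * (1.0 - p_loss)

def empirical_loss_run(x, p_loss, trials, seed=0):
    """Estimate the probability that exactly x messages are lost in a
    row before the next message is delivered, assuming each message is
    lost independently with probability p_loss."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        run = 0
        while rng.random() < p_loss:  # this message is lost
            run += 1
        if run == x:
            hits += 1
    return hits / trials

p = 0.3
for x in (1, 2, 3):
    print(x, prob_consecutive_losses(x, p), empirical_loss_run(x, p, 100000))
```

With a loss probability of 0.3, the closed form gives 0.21 for a single loss and drops geometrically for longer runs, which is why modest increases in the time-out can mask most loss-induced mistakes.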

Overall, the QoS of this channel-based communication between the FDS and the CR-TS can be captured by $E(D)$, $p_L$, and $E(X_L)$. In the following sections, we analyze how the FDS monitors the CR-TS and how the FDS can be configured based on the characteristics of this channel-based communication.

3 QoS OF THE CRASH-RECOVERY FDS

3.1 System Model

We consider a distributed system model with two services: one FDS and one CR-TS, distributed over a wide-area network. The FDS and the CR-TS are connected by an unreliable communication channel (see Section 2.3). Liveness (heartbeat) messages are transmitted through the channel. The communication channel does not create or duplicate liveness messages, but the messages might be lost or delayed indefinitely during transmission.¹ The CR-TS can fail by crashing but can be repaired and restarted after some repair time, i.e., it behaves as a crash-recovery model. The drift of the local clocks of the FDS and the CR-TS is small enough to be ignored and their local clocks are sufficiently synchronized (this can be guaranteed by some time synchronization service such as the Network Time Protocol used in [6]) to be regarded as a clock synchronized system. The failure detection algorithm we adopt is the NFD-S algorithm proposed in [5].
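To make the push-style monitoring concrete, the following is a minimal sketch of a freshness-point time-out rule of the kind used by heartbeat detectors with synchronized clocks. It is written in the spirit of NFD-S but is not a reproduction of the pseudocode in Appendix B. The parameter names `eta` (heartbeat sending interval) and `delta` (time-out shift), the function names, and the sample arrival times are all assumptions of this sketch.

```python
import math

def freshness_index(t, eta, delta):
    """Largest i with tau_i <= t, where tau_0 = 0 and tau_i = i * eta + delta."""
    if t < eta + delta:
        return 0
    return math.floor((t - delta) / eta)

def fd_output(t, arrivals, eta, delta):
    """Return "Trust" or "Suspect" for a query at time t.

    `arrivals` maps a heartbeat sequence number j to its arrival time
    at the FDS; lost heartbeats are simply absent. Heartbeat j is
    assumed to be sent at time j * eta on a clock shared by both sides.
    The FDS trusts the target only if some heartbeat that is still
    fresh (j >= i) has already arrived.
    """
    i = freshness_index(t, eta, delta)
    fresh = any(j >= i and arrived <= t for j, arrived in arrivals.items())
    return "Trust" if fresh else "Suspect"

# Heartbeat 3 is lost; the others arrive with small delays.
arrivals = {1: 1.2, 2: 2.1, 4: 4.3}
for t in (0.5, 1.4, 2.6, 3.6, 4.4, 5.6):
    print(t, fd_output(t, arrivals, eta=1.0, delta=0.5))
```

In this sketch a larger `delta` tolerates longer delays and more losses at the cost of slower crash detection, while a smaller `eta` shortens the wait for the next fresh heartbeat after a recovery.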

3.2 Modeling a Push-Style Crash-Recovery FDS

The failure detector (FDS) has a set of suspicion levels $S_s := \{\text{Trust}, \text{Suspect}\}$ as in [5]. The FDS can either trust or suspect a CR-TS's liveness. Thus, for a fail-free run, a service only has one state: Alive. The state space of an FDS is $S_f := \{\text{Trust-Alive}, \text{Suspect-Alive}\}$, and the event space of an FDS is $F := \{\text{S-transition}, \text{T-transition}\}$ (Fig. 3a). For a fail-free run, the QoS metrics of an FDS can be measured quite straightforwardly. The average time spent in the Trust state is the mean length of the good period $E(T_G)$; the average time spent in the Suspect state is the mean time of the mistake duration $E(T_M)$; the average time between two consecutive transfers to the Suspect state (two consecutive S-transitions) is the mean time of the mistake recurrence $E(T_{MR})$.

However, precisely speaking, the state space of an FDS is $S_c := S \times S_s$, where $S$ is the state space of the target service. Therefore, for a CR-TS with failures, the state space of its FDS increases because the service has more than one state (see Fig. 3b). If there are more than two suspicion levels, then $S_c$ will increase as well. The QoS metrics of an FDS are no longer as simple as for fail-free runs.
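The product construction can be spelled out in a few lines. The sketch below enumerates the combined state space for a two-state target and two suspicion levels and marks the states in which the FDS output disagrees with the target. The label Suspect-Crash is formed by analogy with the Trust-Alive naming used above, and `is_mistake` is a hypothetical helper; both are assumptions of this illustration.

```python
from itertools import product

TARGET_STATES = ("Alive", "Crash")       # state space S of the CR-TS
SUSPICION_LEVELS = ("Trust", "Suspect")  # suspicion levels S_s of the FDS

def fds_state_space():
    """Enumerate the product of S and S_s as "<level>-<state>" labels."""
    return [f"{level}-{state}"
            for state, level in product(TARGET_STATES, SUSPICION_LEVELS)]

def is_mistake(level, state):
    """The FDS is mistaken when it trusts a crashed target or
    suspects a live one."""
    return (level == "Trust") != (state == "Alive")

for state, level in product(TARGET_STATES, SUSPICION_LEVELS):
    tag = "mistake" if is_mistake(level, state) else "correct"
    print(f"{level}-{state}: {tag}")
```

Adding a third suspicion level or a third target state only requires extending the two tuples, which shows how quickly the combined state space grows.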

For a fail-free run ($MTTF \to +\infty$) or a crash-stop run ($MTTR \to +\infty$), the CR-TS's current state $S_{CR\text{-}TS}$ will be Alive for all time up to the crash, and it is easy to deduce the FDS's accuracy $S_A$ directly from the FDS's current state. However, for a crash-recovery run, since the CR-TS could fail or recover at an arbitrary time, $S_A$ cannot be deduced solely from the state of the FDS.

Fig. 3. State space in a crash-recovery run. (a) Fail-free transition. (b) Crash-recovery transition.

1. This channel-based message transmission is the same as the probabilistic network model in [5].

Furthermore, compared with a fail-free or crash-stop run, there are more mistake types in a crash-recovery run. In previous work, such as [5], [6], [8], [9], [10], [18], [20], only the mistakes caused by the message transmission behaviors (message delay and loss) are considered. But in a crash-recovery run, a mistake starts whenever the CR-TS's and FDS's states diverge. Thus, there are also mistakes caused by the CR-TS's crash (see $T_F$ in Fig. 1 or $T_M^3$ in Fig. 4c) and recovery (see Fig. 4d) due to the delayed detection of such events. Fig. 4 shows the four types of mistake which could occur within a crash-recovery run. $T_M^1$ in Fig. 4a represents a mistake caused by a message delay. $T_M^2$ in Fig. 4b represents a mistake caused by a message loss. $T_M^3$ in Fig. 4c represents a mistake caused by the CR-TS's crash, while the FDS still trusts the CR-TS. $T_M^4$ in Fig. 4d represents a mistake caused by the CR-TS's recovery, while the FDS still suspects the CR-TS. A message loss or delay will result in a Suspect-Alive mistake of the FDS (see Fig. 3b). A crash failure will result in a Trust-Crash mistake. A recovery event will result in a Suspect-Alive mistake. Mistakes with different causes call for different FDS parameter reconfiguration plans. For instance, the best way for the FDS to tolerate more message losses or a longer message delay is to increase the time-out duration; the best way for the FDS to minimize the mistake duration caused by a crash event is to decrease the time-out duration; and the best way to minimize the mistake duration caused by a recovery event is to increase the liveness message sending frequency. Thus, we can see that an inaccurate mistake type identification might reduce the QoS of an FDS and should be avoided.

From the above analysis, we can see that due to the increased number of mistake types in a crash-recovery run, the definition of the QoS metrics in [5] using transitions is not valid in a crash-recovery run. Thus, we redefine them as below:

. Detection time ($T_D$): The elapsed time from when the monitored target crashes until the failure detector correctly suspects the monitored target.

. Mistake recurrence time ($T_{MR}$): The time between the occurrence of two consecutive mistakes.

. Mistake duration ($T_M$): The time to correct a mistaken suspect or trust.

. Good period duration ($T_G$): The duration for which the failure detector maintains the correct state information.

. Query accuracy probability ($P_A$): The probability that the state information from the failure detector is correct at an arbitrary time.

The above QoS metrics can measure some QoS aspects ofa failure detector in a crash-recovery run. However, theycannot measure how fast a recovery can be detected, theproportion of the detected failures over the occurredfailures (completeness), etc. In the following section, weextend the QoS metrics to measure the recovery detectionspeed and the completeness of a failure detector.

3.3 Extended QoS Metrics for a Crash-Recovery FDS

For an FDS in a crash-recovery run, in addition to the QoS metrics introduced above, we propose some new QoS metrics.

First, in order to measure the speed with which an FDS can discover a recovery of the CR-TS, we define the recovery detection time ($T_{DR}$), a random variable which represents the time that elapses from the CR-TS’s recovery time (an R-transition) to the time when the FDS discovers the recovery.

Second, in a crash-recovery run, there is no eventual behavior of a CR-TS, and a fast recovery could make a failure undetectable by the FDS. Under such circumstances, the completeness property of a failure detector defined in [4] can no longer be satisfied. In order to reflect this situation, we refine the definition of completeness as follows:

• Strong completeness: Every crash failure of a recoverable process will be detected.

• Weak completeness: A specified proportion of the crash failures of a recoverable process will be detected.

Therefore, in order to measure the completeness property of a crash-recovery FDS, we propose a new QoS metric. The detected failure proportion ($R_{DF}$) is a random variable capturing the ratio of the detected crashes over the occurred crashes ($0 \le R_{DF} \le 1$). When no crash failures are detected, $R_{DF} = 0$. When all of the occurring crashes are detected, $R_{DF} = 1$. The strong completeness property of an FDS requires that $E(R_{DF}) = 1$ (where $E$ denotes expectation). The weak completeness property requires that $E(R_{DF}) \ge R_{DF}^L$, where $R_{DF}^L$ is the specified lower bound of the detected failure proportion and $0 \le R_{DF}^L \le 1$.

Fig. 4. The analysis of possible $T_M$ in a crash-recovery run. (a) $T_M^1$. (b) $T_M^2$. (c) $T_M^3$. (d) $T_M^4$.

Overall, the QoS for a crash-recovery FDS can be captured by $P_A$, $T_M$, $T_{MR}$, $T_D$, $T_{DR}$, and $R_{DF}$. In the next section, we will analyze the QoS bounds of the FDS based on the NFD-S algorithm in a crash-recovery run by adopting the proposed basic and extended QoS metrics.
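As a rough illustration of the two extended metrics, the sketch below computes an empirical detected failure proportion and a mean recovery detection time from logged events. The record layout (pairs of timestamps, with `None` marking a crash the FDS never suspected), the function names, and the sample data are assumptions made for this example; they are not part of the paper, and a single log only estimates the expectations $E(R_{DF})$ and $E(T_{DR})$.

```python
def detected_failure_proportion(crashes):
    """Empirical R_DF: detected crashes over occurred crashes.

    crashes: list of (crash_time, detection_time) pairs, where
    detection_time is None if the FDS never suspected that crash.
    """
    if not crashes:
        raise ValueError("R_DF is undefined when no crash has occurred")
    detected = sum(1 for _, detected_at in crashes if detected_at is not None)
    return detected / len(crashes)


def mean_recovery_detection_time(recoveries):
    """Empirical E(T_DR) from (recovery_time, recovery_detection_time) pairs."""
    return sum(found - recovered for recovered, found in recoveries) / len(recoveries)


def satisfies_weak_completeness(crashes, r_df_lower):
    """Check R_DF >= R_DF^L on one observed log (an estimate of the expectation)."""
    return detected_failure_proportion(crashes) >= r_df_lower


# Made-up sample data: the second crash was never detected.
crash_log = [(10.0, 11.5), (50.0, None), (90.0, 91.0)]
recovery_log = [(20.0, 20.5), (55.0, 56.5), (99.0, 100.0)]

print(detected_failure_proportion(crash_log))
print(mean_recovery_detection_time(recovery_log))
print(satisfies_weak_completeness(crash_log, 0.5))
```

Strong completeness corresponds to calling `satisfies_weak_completeness` with a lower bound of 1.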

3.4 QoS Estimate of the Crash-Recovery FDS Based on the NFD-S Algorithm

In a crash-recovery run, as the state of a CR-TS can switch between Alive and Crash, these crash or recovery events will force the output of the FDS to be accurate or inaccurate. For analyzing the behavior of the failure detection pair, we want to pick an observation period that covers all the events which may possibly occur. In our model, we pick one MTBF period as the observation period. This is because, as we discussed in Section 2.1, in order to study the steady-state behavior of a CR-TS throughout its lifetime, we only need to observe the time period between two consecutive regeneration points (recovery times) of the CR-TS, and the average duration between the two consecutive regeneration points is MTBF. In the following, we will also treat these as regeneration points of the system consisting of the failure detection pair. This is an approximation made for pragmatic reasons but it can be justified as follows:

Fig. 5 shows the relationship between an FDS and a CR-TS on the interval $t \in [t_0, t_3)$, where both $t_0$ and $t_3$ are regeneration points. Obviously, the mean time between $t_0$ and $t_3$ is the MTBF. We split $[t_0, t_3)$ into three intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$:

• $t_1$ is the time when the FDS detects the transition of the CR-TS from the Crash state to the Alive state.

• $t_2$ is the time when the service crashes. Note that the period $[t_1, t_2)$ is without failures.

Additionally, we define the following times:

• $\sigma_s$ is the first liveness message sending time after a recovery;

• $\sigma_f$ is the sending time of the last liveness message before a crash;

• $\sigma_i$ is the sending time of a liveness message between $\sigma_s$ and $\sigma_f$;

• $\eta$ is the liveness message sending interval;

• $\tau_s$ is the first decision time after recovery;²

• $\tau_i$ is the time of the $i$th freshness point corresponding to $\sigma_i$;

• $\tau_b$ is the last freshness point³ before a crash; and

• $\tau_f$ is the freshness point corresponding to $\sigma_f$.

Let time-out be the threshold waiting time for the expected arrival of the liveness message before suspecting the CR-TS (time-out $= \tau_i - \sigma_i$ in Fig. 5). Let $t_r^m$ ($m \ge 1$) be a recovery time of the current MTBF period (see Fig. 5). Then in our model, the key step in the QoS bounds analysis is to derive the average number of mistakes that will happen between the $m$th and $(m+1)$th recovery times, and the average duration of each mistake. We make the following definitions as extensions of Definition 1 in [5]:

Definition 2. For the fail-free duration $[t_1, t_2)$ within each MTBF period:

1. $k$: for any $i \ge 1$, let $k$ be the smallest integer such that for all $j \ge i + k$, $m_j$ is sent at or after time $\tau_i$, where $m_j$ is the $j$th heartbeat message.⁴

2. For any $i \ge 1$, let $p_j^i(x)$ be the probability that the FDS does not receive the $(i+j)$th message $m_{i+j}$ by time $\tau_i + x$, for every $j \ge 0$ and every $x \ge 0$; let $p_0^i = p_0^i(0)$.

3. For any $i \ge 2$, let $q_0^i$ be the probability that the FDS receives message $m_{i-1}$ before time $\tau_i$.

4. For any $i \ge 1$, let $u^i(x)$ be the probability that the FDS suspects the CR-TS at time $\tau_i + x$, for every $x \in [0, \eta)$.

5. $p_s^i$: for any $i \ge 2$, let $p_s^i$ be the probability that an S-transition occurs at time $\tau_i$.

According to the QoS analysis of the NFD-S algorithm in Proposition 3 in [5], we now analyze the basic QoS metrics of the FDS based on the NFD-S algorithm in a crash-recovery run and show that the following relations hold:

Proposition 1.

1. $k = \lceil \text{time-out}/\eta \rceil$.

2. For all $j \ge 0$ and for all $x \ge 0$,

$$p_j^i(x) = \big(p_L + (1 - p_L) \cdot \Pr(D > \text{time-out} + x - j\eta)\big) \cdot \Pr\big(X_a > \tau_i - t_r^m + x\big).$$

3. $q_0^i = (1 - p_L) \cdot \Pr(D < \text{time-out} + \eta) \cdot \Pr\big(X_a > \tau_i - t_r^m\big)$.

4. For all $x \in [0, \eta)$, $u^i(x) = \prod_{j=0}^{k} p_j^i(x)$.

5. $p_s^i = q_0^i \cdot u^i(0)$.

In Proposition 1, the bounds of each QoS metric are derived based on the analysis of the average number of possible mistakes within the distinct intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$. In consequence, the following theorem holds and can be used to estimate the FDS’s parameters or QoS bounds within a crash-recovery run:

Theorem 1. The crash-recovery FDS based on the NFD-S algorithm has the following properties:

276 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 3, JULY-SEPTEMBER 2010

Fig. 5. The analysis of the FDS based on the NFD-S algorithm in a crash-recovery run.

2. The actual arrival time of the first received valid liveness message.

3. The expected arrival time of the liveness message.

4. $k$ is assumed to be approximately independent of $i$. In fact, in a crash-recovery run, $k$ is not completely independent of $i$. However, if the CR-TS remains alive for a reasonable duration, $k$ will be almost independent of $i$ except for the last few messages before the CR-TS crashes.


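$$\mathrm{MTBF} \ge E(T_{MR}) \ge \frac{\mathrm{MTBF}}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}. \tag{1}$$

If $X_c > \eta + \text{time-out}$, then

$$\frac{\mathrm{MTBF}}{2} \ge E(T_{MR}) \ge \frac{\mathrm{MTBF}}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}, \tag{2}$$

$$P_A \ge 1 - \frac{E(T_D) + E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx}{\mathrm{MTBF}}, \tag{3}$$

$$E(T_M) \le \frac{E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx + E(T_D)}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + 1}, \tag{4}$$

$$E(T_{DR}) = E(D) + \eta \cdot E(X_L), \tag{5}$$

$$E(R_{DF}) \ge \Pr(X_c > \eta + \text{time-out}). \tag{6}$$

Details of the proof of the theorem can be found in [29] and Appendix C.2.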
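To make the quantities in Proposition 1 and the bounds (1) and (3)-(6) more concrete, the following sketch evaluates them numerically. It is only an illustration under stated assumptions: the message delay $D$, the alive duration $X_a$, and the crash duration $X_c$ are all taken to be exponentially distributed (with means $E(D)$, MTTF, and MTTR), MTBF is taken as MTTF + MTTR, $E(T_D)$ and the offset $\tau_i - t_r^m$ are supplied by the caller, and the integral of $u^i(x)$ is approximated with a midpoint rule. The function and parameter names are hypothetical.

```python
import math


def prop1_terms(eta, timeout, p_loss, mean_delay, mttf, elapsed=0.0, steps=1000):
    """Proposition 1 quantities under exponential D and X_a (an assumption).

    elapsed stands for tau_i - t_r^m, the time since the last recovery.
    """
    def pr_delay_gt(t):  # Pr(D > t) for an exponential delay
        return 1.0 if t <= 0 else math.exp(-t / mean_delay)

    def pr_alive_gt(t):  # Pr(X_a > t) for an exponential time to failure
        return 1.0 if t <= 0 else math.exp(-t / mttf)

    k = math.ceil(timeout / eta)

    def p(j, x):  # item 2 of Proposition 1
        lost_or_late = p_loss + (1 - p_loss) * pr_delay_gt(timeout + x - j * eta)
        return lost_or_late * pr_alive_gt(elapsed + x)

    def u(x):  # item 4: product over j = 0..k
        return math.prod(p(j, x) for j in range(k + 1))

    q0 = (1 - p_loss) * (1 - pr_delay_gt(timeout + eta)) * pr_alive_gt(elapsed)
    ps = q0 * u(0.0)
    h = eta / steps  # midpoint rule for the integral of u over [0, eta)
    integral_u = sum(u((n + 0.5) * h) for n in range(steps)) * h
    return k, q0, ps, integral_u


def theorem1_bounds(eta, timeout, p_loss, mean_delay, mttf, mttr, e_td, e_xl):
    """Bounds (1) and (3)-(6) of Theorem 1 under the assumptions above."""
    k, q0, ps, integral_u = prop1_terms(eta, timeout, p_loss, mean_delay, mttf)
    mtbf = mttf + mttr
    e_tdr = mean_delay + eta * e_xl                                # (5)
    points = math.ceil((mttf - e_tdr) / eta) + 1
    tmr_lower = mtbf / (points * ps + math.ceil(mean_delay / eta) + 2)  # (1)
    total_mistake = e_td + e_tdr + (mttf - e_tdr) / eta * integral_u
    pa_lower = 1 - total_mistake / mtbf                            # (3)
    tm_upper = total_mistake / (points * ps + 1)                   # (4)
    rdf_lower = math.exp(-(eta + timeout) / mttr)                  # (6), exponential X_c
    return {"k": k, "q0": q0, "ps": ps, "E_TDR": e_tdr, "MTBF": mtbf,
            "TMR_lower": tmr_lower, "PA_lower": pa_lower,
            "TM_upper": tm_upper, "RDF_lower": rdf_lower}
```

For example, `theorem1_bounds(1.0, 1.0, 0.01, 0.02, 100.0, 5.0, e_td=2.0, e_xl=0.01 / 0.99)` uses the worst-case detection time $\eta + \text{time-out}$ as a stand-in for $E(T_D)$, which is itself a conservative assumption.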

When the monitored target is fail-free or crash-stop,⁵ for the basic QoS metrics in [5], applying (1)-(4) of Theorem 1, we can easily deduce that

$$E(T_{MR}) \ge \frac{\eta}{p_s^i}, \tag{7}$$

$$E(T_M) \le \frac{1}{p_s^i} \cdot \int_0^{\eta} u^i(x)\,dx \le \frac{\eta}{q_0^i}, \tag{8}$$

$$P_A \ge 1 - \frac{1}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx. \tag{9}$$

Thus, $E(T_{MR})$, $E(T_M)$, and $P_A$ are exactly reduced to the QoS analysis results in [5] (see Appendix C.4 for the details of the proof sketch). We can conclude that in terms of failure detection, a fail-free run or a crash-stop run with MTTF tending to infinity is a particular case of a crash-recovery run. If the monitored target’s MTTF is not sufficiently long and the target is recoverable, then the impact of its dependability must also be taken into consideration. In the following section, we will introduce how to configure the crash-recovery FDS according to the QoS bounds we have derived from Theorem 1.
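The limiting step behind (7) and (9) can be sketched as follows. This is only an informal outline, assuming that MTBF = MTTF + MTTR, that MTTR, $E(T_D)$, $E(T_{DR})$, and $E(D)$ stay finite, and that $p_s^i > 0$ is treated as fixed while MTTF grows; the full argument is the one referenced in Appendix C.4.

```latex
\begin{align*}
\lim_{\mathrm{MTTF}\to\infty}
  \frac{E(T_D) + E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta}\int_0^{\eta} u^i(x)\,dx}
       {\mathrm{MTTF} + \mathrm{MTTR}}
  &= \frac{1}{\eta}\int_0^{\eta} u^i(x)\,dx,
  \\[4pt]
\lim_{\mathrm{MTTF}\to\infty}
  \frac{\mathrm{MTTF} + \mathrm{MTTR}}
       {\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) p_s^i
        + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}
  &= \frac{\eta}{p_s^i}.
\end{align*}
```

Substituting the first limit into (3) gives (9), and the second limit is the right-hand side of (1), which gives (7).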

3.5 The Configuration of the Crash-Recovery FDS Based on the NFD-S Algorithm

For crash failure detectors, it is crucial to select suitable input parameters (such as the liveness message intersending interval and the time-out duration) to satisfy a given set of QoS requirements. In this section, we will show how to achieve this in a crash-recovery run based on the NFD-S algorithm. In a crash-recovery run, an assumption that the sequence numbers of the heartbeat messages are continually increasing after every recovery of the CR-TS is needed to ensure that the NFD-S algorithm is still valid after each recovery. However, without persistent storage to snapshot the runtime information frequently, when a crash failure occurs, all of the current runtime information might be lost. Thus, continuously increasing the heartbeat sequence number cannot be guaranteed.

Since the NFD-S algorithm assumes that the local clocks of the FDS and the CR-TS are synchronized, we can compare the sending times of heartbeat messages instead of the heartbeat sequence numbers in the algorithm. Then, for a crash-recovery FDS, if the QoS requirements of the FDS are given, the configuration procedure is illustrated in Fig. 6.

Initially, we can assume that the QoS of message communication is perfect (e.g., $p_L = 0$, $E(D)$ is small, and $E(X_L) = 0$), and the CR-TS is fail-free. As the monitoring procedure continues, the estimation of the QoS of message communication and the dependability metrics of the CR-TS will become more accurate. Thus, the FDS will be reconfigured to adapt to changing input parameters, which helps to better estimate $\eta$ and time-out.

Then for given QoS requirements, expressed as bounds, the following inequalities need to be satisfied, where a superscript $U$ denotes an upper bound and a superscript $L$ denotes a lower bound:

$$T_D \le T_D^U, \quad E(T_{MR}) \ge T_{MR}^L, \quad P_A \ge P_A^L, \quad E(T_M) \le T_M^U, \quad E(T_{DR}) \le T_{DR}^U, \quad E(R_{DF}) \ge R_{DF}^L. \tag{10}$$

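From Theorem 1, we can estimate the parameters ($\eta$ and time-out) of the NFD-S algorithm according to the following inequalities:

$$\eta + \text{time-out} \le T_D^U, \quad \eta > 0, \tag{11}$$

$$\frac{\mathrm{MTBF}}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2} \ge T_{MR}^L, \tag{12}$$

$$1 - \frac{E(T_D) + E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx}{\mathrm{MTBF}} \ge P_A^L, \tag{13}$$

$$\frac{E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx + E(T_D)}{\left(\left\lceil \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rceil + 1\right) \cdot p_s^i + 1} \le T_M^U, \tag{14}$$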


Fig. 6. The extended FDS configuration based on the NFD-S algorithm in a crash-recovery run.

5. The precrash duration of the crash-stop process is a long run.


$$E(D) + \eta E(X_L) \le T_{DR}^U, \tag{15}$$

$$\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L. \tag{16}$$

Then, the task of the NFD-S algorithm is to find the largest $\eta$ satisfying inequalities (12)-(15) and, if such an $\eta$ exists, to find the largest time-out that satisfies $\eta + \text{time-out} \le T_D^U$ and $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$. This can be done in the following steps:

Step I. If $T_{MR}^L < \mathrm{MTBF}$, continue; else the QoS of the FDS cannot be achieved.

Step II. Find the largest $\eta$ that satisfies the inequalities (12)-(15); if no such $\eta$ exists, the QoS cannot be achieved.

Step III. If $\eta > 0$, find the largest time-out $\le T_D^U - \eta$ such that $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$.

From the above steps, the estimation of $\eta$ and time-out for a crash-recovery FDS based on the NFD-S algorithm amounts to finding a numerical solution for the inequalities (11)-(16). This can be done using binary search, similarly to the approach outlined in [5]. However, the estimation of the input parameters of the configuration becomes more difficult because parameters such as $E(X_L)$, MTTF, and MTTR are needed. How to estimate these parameters will be discussed in Section 4.
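The three steps above can be sketched as a small search routine. This is a simplified illustration, not the paper's procedure in full: the check of inequalities (12)-(15) is abstracted into a caller-supplied predicate `eta_ok` (a hypothetical name), that predicate is assumed to be monotone in $\eta$ so that binary search applies, and $X_c$ is assumed to be exponentially distributed with mean MTTR so that (16) has the closed form $\eta + \text{time-out} \le -\mathrm{MTTR}\,\ln R_{DF}^L$.

```python
import math


def largest_feasible(low, high, feasible, tol=1e-6):
    """Largest value in [low, high] accepted by feasible(), or None.

    Assumes feasible() is True up to some threshold and False beyond it.
    """
    if not feasible(low):
        return None
    if feasible(high):
        return high
    while high - low > tol:
        mid = (low + high) / 2.0
        if feasible(mid):
            low = mid
        else:
            high = mid
    return low


def configure(t_d_upper, t_mr_lower, mtbf, mttr, r_df_lower, eta_ok):
    """Return (eta, time_out), or None if the QoS cannot be achieved."""
    # Step I: the required mistake recurrence time must be below MTBF.
    if not t_mr_lower < mtbf:
        return None
    # Step II: largest eta in (0, T_D^U] passing the (12)-(15) check.
    eta = largest_feasible(1e-9, t_d_upper, eta_ok)
    if eta is None:
        return None
    # Step III: largest time-out with eta + time-out <= T_D^U and
    # Pr(X_c > eta + time-out) >= R_DF^L (exponential X_c assumed).
    limit = t_d_upper
    if r_df_lower > 0:
        limit = min(limit, -mttr * math.log(r_df_lower))
    time_out = limit - eta
    if time_out < 0:
        return None
    return eta, time_out
```

In practice `eta_ok` would evaluate (12)-(15) for a candidate $\eta$ using estimates of $p_L$, $E(D)$, $E(X_L)$, MTTF, and MTTR; a toy predicate such as `lambda eta: eta <= 1.0` is enough to exercise the search.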

Note that for this configuration procedure, choosing a different message transmission protocol (e.g., TCP and UDP) can imply different QoS for message communication. Thus, this new configuration can be more adaptive to the message transmission characteristics. For example, if the message loss probability or message delay is high for a certain protocol, then the FDS can switch to a more reliable protocol to achieve a better QoS without increasing the communication frequency or the time-out length.

In the next section, we will discuss how to estimate the QoS of message transmission and the dependability metrics of the CR-TS.

4 PARAMETER ESTIMATION

In the previous section, we explained how to configure a crash-recovery FDS. However, for this procedure, several input parameters are needed (see Fig. 6). In this section, we will show how to estimate these input parameters for an FDS configuration.

4.1 Dependability Metrics Estimation for the CR-TS

From the CR-TS modeling in Section 2, we see that there is an intimate relationship between the MTTF, MTTR, and MTBF and the QoS of the FDS. In order to estimate these dependability metrics, we only need to estimate the crash and recovery times of the CR-TS. We assume that the clocks of the FDS and the CR-TS are synchronized. Let $t_r^1$ be the CR-TS’s first start time; then for $m \ge 1$, $t_r^m$ represents the $m$th recovery time, $t_{dr}^m$ represents the $m$th recovery detection time, $t_c^m$ represents the $m$th crash time, and $t_d^m$ represents the $m$th crash detection time (see Fig. 7). $t_r^m$ can be saved to the persistent storage by the CR-TS after a recovery has completed (see [29]). $t_d^m$ can be recorded by the FDS when a failure is detected, and $E(T_D)$ can be estimated by using $\frac{1}{n}\sum_{m=1}^{n}(t_d^m - t_c^m)$ when $t_c^m$ is known. In practice, $t_c^m$ can be estimated by saving the latest successful message sending time $\sigma_l$ in the persistent storage. If a crash event is uniformly distributed on $[\sigma_l, \sigma_l + \eta)$, then after a recovery has completed, the average $t_c^m$ can be estimated by $t_c^m = \sigma_l + \frac{\eta}{2}$. Notice that a smaller message intersending time ($\eta$) can result in a more accurate $t_c^m$ estimate. Then, the CR-TS’s MTBF, MTTF, MTTR, and the probability that the CR-TS has not crashed up to time $\tau_i + x$ since its last recovery, $\Pr(X_a > \tau_i + x - t_r^m)$, can be estimated as follows:

Estimate MTBF. From the definition of MTBF, we know that MTBF is only related to the CR-TS’s recovery times $t_r^m$. These $t_r^m$ values can be obtained by adopting the recovery time estimation methods proposed in [29]. Thus, MTBF can be estimated as below:

$$\mathrm{MTBF} = E\big(t_r^{m+1} - t_r^m\big) = \frac{1}{n}\sum_{m=1}^{n}\big(t_r^{m+1} - t_r^m\big). \tag{17}$$

Estimate MTTF. MTTF can be estimated by using the recovery time ($t_r^m$) and the crash detection time ($t_d^m$), since $E(t_d^m - t_r^m) = \mathrm{MTTF} + E(T_D)$. Then,

$$\mathrm{MTTF} = E\big(t_d^m - t_r^m\big) - E(T_D) = \frac{1}{n}\sum_{m=1}^{n}\big(t_d^m - t_r^m\big) - E(T_D). \tag{18}$$

Estimate MTTR. MTTR can be estimated from MTBF and MTTF directly, since $\mathrm{MTTR} = \mathrm{MTBF} - \mathrm{MTTF}$, or by using $t_r^{m+1}$ and $t_d^m$, since $E(t_r^{m+1} - t_d^m) = \mathrm{MTTR} - E(T_D)$. Then,

$$\mathrm{MTTR} = E\big(t_r^{m+1} - t_d^m\big) + E(T_D) = \frac{1}{n}\sum_{m=1}^{n}\big(t_r^{m+1} - t_d^m\big) + E(T_D). \tag{19}$$
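The estimators (17)-(19), together with the crash time estimate $t_c^m = \sigma_l + \eta/2$, can be sketched as follows. The function name, argument layout (plain lists of timestamps, with one more recovery time than crashes), and the sample values are assumptions made for this example.

```python
def estimate_dependability(recoveries, detections, last_sends, eta):
    """Estimate E(T_D), MTBF, MTTF, and MTTR from logged timestamps.

    recoveries: t_r^1 .. t_r^{n+1} (n + 1 recovery times)
    detections: t_d^1 .. t_d^n     (crash detection times at the FDS)
    last_sends: sigma_l for each crash (last successful send before it)
    eta:        liveness message sending interval
    """
    n = len(detections)
    # Crash assumed uniform on [sigma_l, sigma_l + eta): use the midpoint.
    crashes = [sigma_l + eta / 2.0 for sigma_l in last_sends]
    e_td = sum(d - c for d, c in zip(detections, crashes)) / n
    mtbf = sum(recoveries[m + 1] - recoveries[m] for m in range(n)) / n            # (17)
    mttf = sum(detections[m] - recoveries[m] for m in range(n)) / n - e_td          # (18)
    mttr = sum(recoveries[m + 1] - detections[m] for m in range(n)) / n + e_td      # (19)
    return {"E_TD": e_td, "MTBF": mtbf, "MTTF": mttf, "MTTR": mttr}


# Made-up log with two crash-recovery cycles and eta = 1.
estimates = estimate_dependability(
    recoveries=[0.0, 110.0, 220.0],
    detections=[101.5, 211.5],
    last_sends=[99.5, 209.5],
    eta=1.0,
)
print(estimates)
```

By construction the two MTTR routes agree: the estimates returned satisfy MTTF + MTTR = MTBF.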

Estimate $\Pr(X_a > \tau_i + x - t_r^m)$. When the probability density function $f_a$ or the probability distribution function $F_a$ of $X_a$ is known, the probability that the CR-TS does not crash until $\tau_i + x$ after its last recovery can be estimated as

$$\Pr\big(X_a > \tau_i + x - t_r^m\big) = 1 - \int_0^{\tau_i + x - t_r^m} f_a(s)\,ds = 1 - F_a(s)\Big|_0^{\tau_i + x - t_r^m}. \tag{20}$$

When $x = 0$, we obtain that

$$\Pr\big(X_a > \tau_i - t_r^m\big) = 1 - \int_0^{\tau_i - t_r^m} f_a(s)\,ds = 1 - F_a(s)\Big|_0^{\tau_i - t_r^m}. \tag{21}$$


Fig. 7. Dependability metrics estimation.


When the probability density function $f_a$ and the probability distribution function $F_a$ of $X_a$ are unknown, an empirical distribution function (EDF) estimation method can be adopted to estimate $f_a$ or $F_a$. In addition, $\Pr(X_a > \tau_i + x - t_r^m)$ is used to estimate the probability that an S-transition happens on $[t_1, t_2)$ (see Proposition 1), which is used to count the average number of mistakes in that period. If we maximize $\Pr(X_a > \tau_i + x - t_r^m)$, then a maximum average number of mistakes on $[t_1, t_2)$ will be obtained. Therefore, we will get stricter QoS bound estimates for $P_A$, $T_M$, and $T_{MR}$. Thus, we can adopt $i = 1$ and $x = 0$ to simplify the estimation of $\Pr(X_a > \tau_i + x - t_r^m)$. Notice that the above method is only for the strict bound estimation rather than an optimized estimation.
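A minimal sketch of the EDF approach, assuming the observed alive durations (samples of $X_a$) are available as a list: the survival probability $\Pr(X_a > t)$ is estimated as the fraction of samples that exceed $t$. The function name and sample values are illustrative only.

```python
from bisect import bisect_right


def empirical_survival(samples, t):
    """Estimate Pr(X_a > t) as 1 - F_n(t), with F_n the empirical CDF."""
    ordered = sorted(samples)
    # bisect_right counts the samples that are <= t.
    return 1.0 - bisect_right(ordered, t) / len(ordered)


alive_durations = [80.0, 120.0, 95.0, 150.0]  # made-up observations of X_a
print(empirical_survival(alive_durations, 100.0))
```

Evaluating the estimate at $t = \tau_i - t_r^m$ gives the factor that appears in items 2 and 3 of Proposition 1.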

4.2 Message Loss Length Estimation

As discussed earlier, the parameters related to message transmission are the average message delay ($E(D)$), the probability of message loss ($p_L$), and the consecutive message loss number $X_L$ (see Fig. 6). Since $p_L$ and $E(D)$ estimation can be done very easily and has been introduced in many other papers such as [5], we do not discuss it here. The additional parameter $X_L$ is also used and captures the bursty message loss behavior. In this section, we propose a basic estimation method for $X_L$, assuming independent message transmissions.

Lemma 3. If each message’s transmission and loss behavior is independent, then the mean number of consecutive message losses is $E(X_L) = \frac{p_L(1 - p_L^M)}{1 - p_L} - M p_L^{M+1}$, where $M$ is the maximum number of consecutive messages lost and $p_L$ is the probability that each message is lost during the transmission.

The proof can be found in [29].

Remark 1. When $M \to +\infty$ and $0 < p_L < 1$, we have $p_L^M \to 0$ and $M p_L^M \to 0$, so we obtain that

$$E(X_L) = \frac{p_L}{1 - p_L}.$$

From the above lemma, we see that if each liveness message’s transmission is independent, $E(X_L)$ depends only on $p_L$ and can be computed straightforwardly.
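The closed form in Lemma 3 can be checked numerically. The sketch below compares it with the direct sum $\sum_{k=0}^{M} k\,p_L^k(1 - p_L)$, which is one reading of the loss-run distribution that reproduces the lemma's expression exactly (the distribution itself is an assumption here; the proof is in [29]). Function names are illustrative.

```python
def mean_loss_run(p_loss, max_run):
    """Closed form of E(X_L) from Lemma 3."""
    return (p_loss * (1 - p_loss ** max_run) / (1 - p_loss)
            - max_run * p_loss ** (max_run + 1))


def mean_loss_run_direct(p_loss, max_run):
    """Direct sum of k * p^k * (1 - p) for k = 0..M (assumed run distribution)."""
    return sum(k * p_loss ** k * (1 - p_loss) for k in range(max_run + 1))


def mean_loss_run_limit(p_loss):
    """Limit of E(X_L) as M grows (Remark 1)."""
    return p_loss / (1 - p_loss)


print(mean_loss_run(0.01, 5), mean_loss_run_direct(0.01, 5), mean_loss_run_limit(0.01))
```

For small $p_L$ the limit in Remark 1 is reached after only a few terms, so the choice of $M$ matters little in that regime.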

4.3 The Impact of Service Dependability Metrics on the QoS of the FDS

A thorough analysis of the impact of the service dependability metrics on the QoS of the FDS has been presented in [16]. Here, we only highlight the main observations.

4.3.1 The Impact on TM and TD

Generally, for an FDS, the time-out length governs the failure detection speed because the FDS makes its decision at the time-out points. As the time-out length decreases, the FDS will make faster, but less accurate, decisions. As time-out increases, $T_D$ grows (detection slows down) but the FDS can tolerate more message delays or losses, which can improve the detection accuracy to some extent. For a CR-TS, continually increasing the time-out length may mean that failures become undetectable, because its recovery duration could be shorter than $T_D$. Thus, $E(T_M)$ will not increase beyond the recovery duration, MTTR.⁶

4.3.2 The Impact on TMR

For a fail-free run, Chen et al. showed that when the time-out length increases linearly, $T_{MR}$ increases exponentially (Fig. 12 in [5]). This implies that for such systems, an arbitrary level of $T_{MR}$ can be achieved. Roughly speaking, in a fail-free run, when time-out increases to $n \times \eta$ ($n \in \mathbb{Z}^+$ and $n \ge 1$), the FDS can tolerate around $n$ consecutive communication message losses. Mistakes caused by message latency or loss then become rapidly less frequent, with the recurrence time growing on the order of $\frac{1}{P^n}$, where

$$P = p_L + (1 - p_L) \cdot \Pr(\text{time-out} < \text{Delay} < +\infty).$$

For a crash-recovery run, mistakes may occur on both crash and recovery (see Fig. 3b) since message transmission latency will delay the detection of the CR-TS’s state change. These mistakes are inevitable. This means that the upper bound on $T_{MR}$ is governed by MTTF and MTTR (see inequalities (1)-(2) in Theorem 1). Even if all message delays and losses can be tolerated, $E(T_{MR})$ cannot increase to an arbitrary level when MTTF is not $+\infty$ and MTTR is not $+\infty$ or 0. If failure is detectable, $E(T_{MR})$ cannot exceed $\frac{\mathrm{MTBF}}{2}$ since for each MTBF duration, there will be at least two mistakes, corresponding to the two changes of state in the CR-TS. When failure is undetectable, mistakes may happen at the CR-TS’s crash or recovery time. Then, $E(T_{MR})$ cannot exceed MTBF. Thus, after $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, the overall $E(T_{MR})$ approaches MTBF gradually.

4.3.3 The Impact on PA

$P_A$, the proportion of time that the FDS is not in a mistake state, will depend on the ratio of $E(T_M)$ and $E(T_{MR})$ ($P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$ in [5]). If a service is fail-free, $P_A$ can rapidly approach 1. But in a crash-recovery run, when the time-out length is increased, both $E(T_M)$ and $E(T_{MR})$ will eventually reach their upper bounds, meaning that $P_A$ will also be bounded. Generally, as time-out increases, fewer failures will be detected and the mistakes caused by failures (see $T_M^3$ in Fig. 4c) will have more impact on $E(T_M)$; thus, $E(T_M)$ will approach MTTR, since the maximum length of $E(T_M^3)$ is MTTR. As the time-out length becomes larger with respect to MTTR, more failures become undetectable. Thus, $E(T_M)$ will gradually approach MTTR.

The speed of increase of $T_{MR}$ will depend on when $T_{MR}$ reaches $\frac{\mathrm{MTBF}}{2}$. Before this bound is reached, as the time-out length increases, $T_{MR}$ can increase exponentially fast, as more message losses can be tolerated. After $T_{MR}$ exceeds $\frac{\mathrm{MTBF}}{2}$, it can only increase gradually to MTBF, as time-out increases and more and more crashes become undetectable. Thus, when $T_{MR}$ reaches its upper bound but $T_M$ has not yet reached its upper bound, $P_A$ will decrease as the time-out length increases. When both $T_M$ and $T_{MR}$ reach their upper bounds, $P_A$ will approach $\frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, which is equal to the availability of the CR-TS.
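The limiting value can be read off directly from the relation $P_A = 1 - E(T_M)/E(T_{MR})$ once both expectations sit at their upper bounds. The worked numbers below use MTTF = 100 and MTTR = 5 purely as an illustrative choice, with MTBF = MTTF + MTTR.

```latex
\begin{align*}
P_A \;\to\; 1 - \frac{\mathrm{MTTR}}{\mathrm{MTBF}}
      \;=\; \frac{\mathrm{MTBF} - \mathrm{MTTR}}{\mathrm{MTBF}}
      \;=\; \frac{\mathrm{MTTF}}{\mathrm{MTBF}},
\qquad
\frac{100}{100 + 5} \approx 0.952 .
\end{align*}
```

Intuitively, in this regime the FDS is mistaken for roughly one crash duration per MTBF period, so its accuracy coincides with the fraction of time the CR-TS is up.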

5 SIMULATION EVALUATION AND ANALYSIS

In previous sections, we have shown how to calculate the parameters of the FDS with a given set of QoS requirements and analyzed the QoS bounds of the crash-recovery FDS based on the NFD-S algorithm. In this section, we introduce our analytical and simulation results, which verify our previous analysis work.

6. Assuming that $p_L$ and $D$ are not very large and $\mathrm{MTTR} \gg \eta$.

5.1 Evaluation of the Crash-Recovery FDS Based on the NFD-S Algorithm

For the simulation studies, we fix the heartbeat interval at $\eta = 1$ and gradually increase the time-out length. The message transmission parameters are $p_L = 0.01$ and $E(D) = 0.02$, and the delay is assumed to be exponentially distributed. These settings are similar to those used in the simulations in [5].

The CR-TS is defined as a recoverable process with various values of MTTF and MTTR (exponentially distributed). We choose the exponential distribution for the following reasons. First, exponential failures are widely adopted for reliability analysis in many practical systems; second, unlike some heavy-tailed distributions such as the log-normal distribution, crashes and recoveries with an exponential distribution will occur with reasonable interarrival times, avoiding the CR-TS behaving like a fail-free or crash-stop process.
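The CR-TS side of such a simulation can be sketched in a few lines. The code below is not the simulator used in the paper; it is a small illustration, with hypothetical names, that draws alternating exponential alive and crash durations and reports the empirical availability together with the fraction of crashes lasting longer than $\eta + \text{time-out}$ (the quantity bounded from below in (6)).

```python
import math
import random


def simulate_cr_ts(mttf, mttr, cycles, detect_window, seed=1):
    """Simulate alternating exponential up/down periods of a CR-TS.

    Returns (availability, fraction of crashes longer than detect_window),
    where detect_window plays the role of eta + time-out.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    up_total = 0.0
    down_total = 0.0
    long_crashes = 0
    for _ in range(cycles):
        up = rng.expovariate(1.0 / mttf)    # alive duration, mean MTTF
        down = rng.expovariate(1.0 / mttr)  # crash duration, mean MTTR
        up_total += up
        down_total += down
        if down > detect_window:
            long_crashes += 1
    availability = up_total / (up_total + down_total)
    return availability, long_crashes / cycles


availability, detectable = simulate_cr_ts(100.0, 5.0, 20000, detect_window=2.0)
print(availability, 100.0 / 105.0)       # empirical vs MTTF / MTBF
print(detectable, math.exp(-2.0 / 5.0))  # empirical vs Pr(X_c > 2) for exponential X_c
```

With $\eta = 1$ and time-out $= 1$, the second pair of numbers indicates roughly what share of crashes last long enough for the FDS to have a chance of detecting them.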

5.1.1 Analysis for the Basic QoS Metrics

We implemented the NFD-S algorithm presented in [5] to evaluate the QoS of the FDS and compared the results with the analytical results derived from Theorem 1. Figs. 8, 9, and 10 compare the QoS of the FDS based on the NFD-S algorithm (simulation results) and the corresponding analytical results from different perspectives. From these three figures, we have the following observations.

Fig. 8 presents the $E(T_M)$ of the FDS derived from simulation and analytical results for two values of MTTR, 5 and 50, with corresponding values of MTTF, 100 and 1,000. The simulation result for MTTR = 5 shows that as the time-out length increases, $E(T_M)$ will tend to MTTR, i.e., $E(T_M)$ is bounded by MTTR. With the exponentially distributed MTTR used in the simulation, the proportion of the detectable crashes will decrease more gradually. Thus, $E(T_M)$ approaches MTTR more slowly than in the analytical results.

Simulation results for MTTR = 50 confirm that if MTTR becomes large, as the time-out length increases, $E(T_M)$ can also grow large, since the bound is now large. Note that in the graph, we see only the linear part rather than the complete characteristics. If the time-out length were increased to 200, $E(T_M)$ would approach MTTR = 50 closely.

An interesting phenomenon is visible in the graph as time-out increases from 0.5 to 1.1: $E(T_M)$ decreases (or increases more slowly), and then increases again. We analyze this phenomenon in detail as follows. Recall that for a given length of time-out, there are four aspects which have an impact on $T_M$: the message delay and loss, and the CR-TS’s crash and recovery (see Fig. 4). $T_M$ caused by a message delay is governed by the ratio between $E(D)$ and $T_D$. For the same $E(D)$, as time-out increases, more delayed messages can be tolerated. Thus, $T_M$ caused by a message delay ($T_M^1$) will decrease and occur less frequently. $T_M$ caused by a message loss ($T_M^2$) is related to $\eta$, $p_L$, $E(D)$, and the time-out length. For constant message communication QoS (i.e., fixed $p_L$ and $E(D)$), $T_M$ caused by message loss is governed by the ratio between $\eta$ and $T_D$. Since more message losses can be tolerated as the time-out length increases, the average duration of $T_M^2$ will decrease, and $T_M^2$ will occur less frequently. $T_M$ caused by a crash ($T_M^3$) is mainly governed by $T_D$ (see Fig. 4c), because if a crash occurs, a false positive mistake will last until the time-out time or until the CR-TS recovers. For detectable crashes, as the time-out length increases, $T_M^3$ will increase. $T_M$ caused by a recovery ($T_M^4$) is mainly governed by $p_L$ and $E(D)$ (see Fig. 4d), since after the CR-TS’s recovery, a recovery can be detected as soon as a valid liveness message is received.

From the above analysis, we know that for the same $\eta$, $p_L$, $E(D)$, MTTF, and MTTR, when the time-out length increases, the average mistake duration caused by message delays and message losses will decrease ($T_M^1\downarrow$ and $T_M^2\downarrow$), the average mistake duration caused by the CR-TS’s crash will increase ($T_M^3\uparrow$), and the average mistake caused by the CR-TS’s recovery from a detectable crash is unaffected ($T_M^4$), but fewer crashes and recoveries will be detected. In the simulation with $p_L = 0.01$ and MTBF = 105, when time-out is small, $T_M^2$ and $T_M^3$ occur with similar frequency. When time-out increases from 0.5 to 1.0 (the FDS can tolerate zero message loss and most message delays), $E(T_M)$ increases slowly because the impacts of $T_M^1\downarrow$, $T_M^2\downarrow$, $T_M^3\uparrow$, and $T_M^4$ counterbalance. Overall, $E(T_M)$ is stable within this period. As the time-out length increases, $T_M^2$ will occur less frequently. But $T_M^3$ occurs every MTBF period. Thus, as the time-out increases, $T_M^3$ will dominate and $E(T_M)$ will increase gradually.

In the simulation with $p_L = 0.01$ and MTBF = 1,050, when the time-out length is small, $T_M^2$ will have more impact than $T_M^3$, because $T_M^2$ occurs more frequently than the crash and recovery. Therefore, as the time-out length increases, the average duration of $T_M^2$ decreases and it occurs less frequently; $E(T_M)$ will increase more slowly or even decrease since more message losses are tolerated. But if time-out continues to increase, $T_M^3$ will become dominant and $E(T_M)$ will then increase gradually.

Fig. 8. The NFD-S algorithm: $E(T_M)$.

Fig. 9. The NFD-S algorithm: $E(T_{MR})$.

Overall, Fig. 8 shows that in a crash-recovery run, $E(T_M)$ exhibits quite different characteristics from a fail-free or crash-stop run. If the message delay and the probability of message loss are not very large, $E(T_M)$ is bounded by MTTR. From Fig. 8, we also observe that $E(T_M)$ can possibly be decreased for some time-out values. Unlike in a fail-free run, continually increasing the time-out length cannot achieve a better $E(T_M)$.

Fig. 9 presents $E(T_{MR})$ of the FDS derived analytically and from simulation with exponential MTTF and MTTR as above. We can see that with constant time-out length, as MTBF increases, $E(T_{MR})$ also increases. This implies that $E(T_{MR})$ is greatly impacted by the dependability of the CR-TS.

We can also see that for both these simulation cases, $E(T_{MR})$ initially increases exponentially fast but after $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, the rate of increase is reduced. For the CR-TS with exponential MTTR, $E(T_{MR})$ will increase gradually and approach MTBF, until all crashes become undetectable. This is because for nondeterministic MTTR, as the time-out length increases, the proportion of the detectable crashes decreases. Therefore, for the detectable crashes, $T_{MR} \le \frac{\mathrm{MTBF}}{2}$, and for the undetectable crashes, $T_{MR} \le \mathrm{MTBF}$. Thus, $E(T_{MR})$ will increase gradually within $\left[\frac{\mathrm{MTBF}}{2}, \mathrm{MTBF}\right]$, and finally stabilize at MTBF. All of these results match our analysis in Section 4.3 well and indicate that if a CR-TS is not fail-free ($\mathrm{MTTF} \to \infty$) or crash-stop ($\mathrm{MTTR} \to \infty$), $E(T_{MR})$ will be bounded by MTBF when failures are undetectable and by $\frac{\mathrm{MTBF}}{2}$ when failures are detectable.

Fig. 10 considers $P_A$ under the same communication QoS. We see that when MTBF increases, $P_A$ will be improved. This is because $E(T_{MR})$ also increases. Thus, from the equation $P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$, we know that for the same time-out length, when MTBF increases, a better $P_A$ can be achieved. However, from Fig. 10, we can also see that as the time-out length increases, $P_A$ is not always increasing as in a fail-free or crash-stop run. Continually increasing time-out could decrease $P_A$. This is because $T_{MR}$ is bounded by $\frac{\mathrm{MTBF}}{2}$ or MTBF as discussed above. After $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, it increases slowly rather than exponentially fast, but $E(T_M)$ increases linearly and faster than $E(T_{MR})$. Thus, $P_A$ decreases, and finally, $P_A$ will approach $\frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, which is equal to the availability of the CR-TS.

The above results indicate that for a highly available CR-TS, a reasonable QoS for the FDS can be achieved even if the FDS always trusts the CR-TS, when only the QoS metrics defined in [5] are considered. This is especially true for a highly available and highly consistent but not highly reliable CR-TS. However, the completeness property of the FDS will not be satisfied. Consequently, these simulation results demonstrate the necessity of the additional QoS metrics we proposed in Section 3.3 to measure the completeness aspects and the speed of the recovery detection of a crash-recovery FDS. Furthermore, these results also demonstrate the necessity of adopting the recovery detection protocols in [29], which can improve the proportion of detected failures without reducing other QoS aspects.

In Figs. 8, 9, and 10, we can also observe how the dependability of a CR-TS can influence the QoS of the FDS. Particularly, for a highly available but not highly reliable CR-TS, the dependability of the CR-TS can have more impact than the performance of the algorithm and the QoS of message transmission. In such situations, the dependability of the CR-TS must be taken into account for the FDS design and implementation.

From Figs. 8, 9, and 10, we can see that $P_A$, $E(T_{MR})$, and $E(T_M)$ have bounds. Continually increasing the time-out length might not be a reasonable way to achieve better $P_A$, $E(T_{MR})$, and $E(T_M)$. A potential trade-off exists between the QoS metrics. For instance, for the NFD-S algorithm, $\text{time-out} \in (1, 1.1)$ ($\text{time-out} + \eta \in [2, 2.1]$) might achieve the best overall QoS.

In addition, $E(T_M)$ in a crash-recovery run exhibits quite different characteristics compared with a fail-free or crash-stop run. This is because in a crash-recovery run, the mistakes caused by the crash and recovery are taken into consideration, which means continually increasing the time-out length will not always decrease $E(T_M)$. It may have the effect of increasing false positive mistakes ($T_M^3$, see Fig. 4). As the time-out length increases, mistakes caused by message delays and losses will occur less frequently, and false positive mistakes (which were not considered previously) will start to dominate the QoS of the FDS.

Fig. 10. The NFD-S algorithms: $P_A$.

From Figs. 8, 9, and 10, we can observe that the simulation results of $E(T_M)$ are smaller than the analytical results, and the simulation results of $E(T_{MR})$ and $P_A$ are larger than the analytical results, which indicates that the bound analysis of the basic QoS metrics in Theorem 1 is valid and the simulation results satisfy the QoS requirements according to the analysis. We can also observe a gap between the analytical and simulation results. This is caused by the overestimation or underestimation of some values within the analytical results. $E(T_M)$ is overestimated by using the total mistake duration over the underestimated average number of mistakes that might occur within a crash-recovery period. Thus, the analytical results of $E(T_M)$ will be larger than the simulation results. Similarly, $E(T_{MR})$ is underestimated by using the observation duration (MTBF) over an overestimation of the number of mistakes that might occur within a period. For instance, the number of mistakes within the period is estimated as $\lceil E(D)/\eta \rceil + 1$, which is an upper bound rather than the average number. It follows that $E(T_{MR})$ of the analytical results will be smaller than the simulation results. Finally, $P_A$ is underestimated by using one minus an overestimated total mistake duration over the observation period (MTBF). Thus, $P_A$ of the analytical results will be smaller than the simulation results. All of these results satisfy the QoS requirements $E(T_M) < T_M^U$, $P_A > P_A^L$, and $E(T_{MR}) > T_{MR}^L$. In addition, according to the NFD-S algorithm, the failure detection time $T_D$ is bounded by $\eta$ + time-out regardless of the correctness of the detection; thus, $T_D < T_D^U$ must be satisfied.
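The direction of each bound can be checked with a short inequality sketch. The symbols used here are shorthand introduced only for illustration and are not part of the analysis itself: $N$ and $S$ stand for the actual average number of mistakes and the actual total mistake duration within one crash-recovery period, $N^{-} \le N \le N^{+}$ for the underestimated and overestimated mistake counts, and $S^{+} \ge S$ for the overestimated total mistake duration.

```latex
\[
\begin{aligned}
E(T_M)    &\approx \frac{S}{N} \;\le\; \frac{S^{+}}{N^{-}}
          && \text{(analytical value is an upper bound)} \\
E(T_{MR}) &\approx \frac{\mathrm{MTBF}}{N} \;\ge\; \frac{\mathrm{MTBF}}{N^{+}}
          && \text{(analytical value is a lower bound)} \\
P_A       &\approx 1 - \frac{S}{\mathrm{MTBF}} \;\ge\; 1 - \frac{S^{+}}{\mathrm{MTBF}}
          && \text{(analytical value is a lower bound)}
\end{aligned}
\]
```

Under these assumptions, a simulated run is expected to fall on the favorable side of each analytical value, which matches the ordering observed in Figs. 8, 9, and 10.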

From Figs. 8, 9, and 10, we can also see that there are some gaps between the analytical results and the simulation results. This is mainly caused by the overestimating and underestimating method we adopted to restrict the failure detector’s QoS bound as discussed above. In addition, we use MTBF, MTTF, and MTTR, which are the expected values rather than the real values for each failure and recovery. In the simulation, the results are calculated according to the randomly generated failure time and recovery time, which represent the real time to failure and recovery, and these random variables will deviate from the expected values. Thus, there will be some discrepancies between the simulation and analytical results. These gaps show that there is still room to improve the accuracy of the model and it would be interesting to investigate this point further in the future.
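The effect of using expected values in place of sampled values can be illustrated with a minimal sketch. It assumes exponentially distributed times to failure and to recovery, which is an assumption made here for illustration and not necessarily the distribution used in the simulations above; the function names are hypothetical. The sketch compares the availability observed over a finite number of crash-recovery cycles with the expected value MTTF / (MTTF + MTTR).

```python
import random


def expected_availability(mttf, mttr):
    """Steady-state availability computed from the expected values only."""
    return mttf / (mttf + mttr)


def simulated_availability(mttf, mttr, cycles, seed):
    """Availability observed over `cycles` randomly generated crash-recovery cycles.

    Assumption (illustrative only): time to failure and time to recovery
    are exponentially distributed with means `mttf` and `mttr`.
    """
    rng = random.Random(seed)
    up_time = 0.0
    down_time = 0.0
    for _ in range(cycles):
        up_time += rng.expovariate(1.0 / mttf)    # sampled time to failure
        down_time += rng.expovariate(1.0 / mttr)  # sampled time to recovery
    return up_time / (up_time + down_time)


if __name__ == "__main__":
    mttf, mttr = 100.0, 10.0
    print("expected:", expected_availability(mttf, mttr))
    for cycles in (10, 100, 10000):
        observed = simulated_availability(mttf, mttr, cycles, seed=1)
        print(cycles, "cycles, observed:", observed)
```

With few cycles the observed value can differ noticeably from the expected one, and the difference shrinks as the number of cycles grows, which is one plausible source of the gaps discussed above.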

5.1.2 Analysis for the Extended QoS Metrics

We also plot the simulation and analytical results for the failure detection proportion ($R_{DF}$) defined in Section 3.3 to demonstrate the impact of the failure and recovery events on this metric.

Fig. 11 shows the proportion of failures detected by the FDS, for different dependability characteristics of the CR-TS, based on both simulation and analytical results. As the time-out length increases, $E(R_{DF})$ of the NFD-S algorithm decreases. When MTTR becomes shorter, $E(R_{DF})$ will decrease faster. This is because the smaller MTTR is, the sooner time-out $+\ \eta$ crosses MTTR ($T_D^U > \mathrm{MTTR}$). Therefore, more crashes remain undetected when the NFD-S algorithm is adopted. In Fig. 11, we can also see that the simulation results of $E(R_{DF})$ are larger than the analytical results, which means that the bound analysis of $E(R_{DF})$ is valid and the simulation results satisfy the QoS requirements in terms of $R_{DF}^L$. However, since most existing failure detection algorithms rely on increasing the time-out length to tolerate more message losses and delays, if a CR-TS is recoverable and recovers fast, it could be difficult for these algorithms to achieve the QoS in [5] and satisfy the completeness property at the same time. In such a situation, the recovery detection protocol introduced in [29] can be adopted, which can solve this problem reasonably well.
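The trend in the detected-failure proportion can be reproduced with a deliberately simplified sketch. It is not the analytical bound derived in this paper: it assumes that a crash is detected only if the downtime exceeds the worst-case detection time, time-out + η, and that the time to recovery is exponentially distributed. Both are assumptions made here for illustration, and the function names are hypothetical.

```python
import math
import random


def detected_fraction_closed_form(timeout, eta, mttr):
    """P(downtime > timeout + eta) for exponentially distributed downtime."""
    return math.exp(-(timeout + eta) / mttr)


def detected_fraction_simulated(timeout, eta, mttr, crashes, seed):
    """Monte Carlo estimate of the same proportion over `crashes` crashes."""
    rng = random.Random(seed)
    bound = timeout + eta  # assumed worst-case detection time
    detected = 0
    for _ in range(crashes):
        downtime = rng.expovariate(1.0 / mttr)  # sampled time to recovery
        if downtime > bound:
            detected += 1  # the crash lasts long enough to be suspected
    return detected / crashes


if __name__ == "__main__":
    eta = 1.0
    for mttr in (5.0, 50.0):
        for timeout in (1.0, 2.0, 4.0):
            exact = detected_fraction_closed_form(timeout, eta, mttr)
            estimate = detected_fraction_simulated(timeout, eta, mttr, 50000, seed=1)
            print(mttr, timeout, exact, estimate)
```

In this simplified model the proportion falls as the time-out grows and falls faster for a shorter MTTR, which agrees qualitatively with the behavior described for Fig. 11.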

6 CONCLUSION

In this paper, the crash-recovery target and its failure detector are modeled as stochastic processes. We redefined previously proposed QoS metrics to be applicable to crash-recovery failure detection and introduced some new metrics to measure the recovery detection speed and the completeness property of a failure detector. We also discussed the impact of the monitored target’s crash-recovery behavior on each QoS metric and showed that if a failure detector’s parameters are to be accurately estimated, these dependability characteristics must be taken into account. Thus, we showed how to configure the failure detector to satisfy a given set of requirements based on the dependability characteristics in addition to the QoS of message transmission (see Fig. 12). This was based on the NFD-S algorithm [5]. Our analysis shows that the QoS analysis in [5] is a particular case of a crash-recovery run. Furthermore, we discussed how to estimate the input parameters for the algorithm.

Finally, the plotted simulation and analytical results demonstrate that our QoS bound analysis is valid and can be used as an approximate solution for the computation of the failure detector’s parameters or the QoS bounds estimation if the failure detector’s parameters are given. Our simulation results confirm that when a failure detector is designed and implemented, the dependability of the crash-recovery target needs to be considered in order to achieve more accurate parameter estimation. Furthermore, if the recovery of the monitored target needs to be detected, further enhancement of the existing algorithms is needed.


Fig. 11. The NFD-S algorithms: $E(R_{DF})$.

Fig. 12. The QoS relationship between communication, CR-TS, and FDS.


ACKNOWLEDGMENTS

The authors would like to thank Isi Mitrani, Mahesh Marina, and the anonymous reviewers for their comments and suggestions which helped improve the quality of this paper. Hillston’s work is supported in part by the SENSORIA project, an EU FET-IST GC 2 project (IST-3-016004-IP-09).

REFERENCES

[1] J. Laprie, A. Avizienis, and H. Kopetz, Dependability: Basic Concepts and Terminology. Springer-Verlag, 1992.

[2] L. Lamport, R. Shostak, and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Programming Languages and Systems, vol. 4, no. 3, pp. 382-401, 1982.

[3] M.J. Fischer, N.A. Lynch, and M.S. Paterson, “Impossibility of Distributed Consensus with One Faulty Process,” J. ACM, vol. 32, no. 2, pp. 374-382, Apr. 1985.

[4] T.D. Chandra and S. Toueg, “Unreliable Failure Detectors for Asynchronous Systems (Preliminary Version),” Proc. 10th ACM Symp. Principles of Distributed Computing (PODC ’91), pp. 325-340, 1991.

[5] W. Chen, S. Toueg, and M.K. Aguilera, “On the Quality of Service of Failure Detectors,” IEEE Trans. Computers, vol. 51, no. 5, pp. 561-580, May 2002.

[6] L. Falai and A. Bondavalli, “Experimental Evaluation of the QoS of Failure Detectors on Wide Area Network,” Proc. Int’l Conf. Dependable Systems and Networks, pp. 624-633, July 2005.

[7] N. Hayashibara, A. Cherif, and T. Katayama, “Failure Detectors for Large-Scale Distributed Systems,” Proc. 21st IEEE Symp. Reliable Distributed Systems, pp. 404-409, 2002.

[8] N. Hayashibara, X. Defago, R. Yared, and T. Katayama, “The Accrual Failure Detector,” Proc. 23rd IEEE Int’l Symp. Reliable Distributed Systems, pp. 66-78, 2004.

[9] R.C. Nunes and I. Jansch-Porto, “QoS of Timeout-Based Self-Tuned Failure Detectors: The Effects of the Communication Delay Predictor and the Safety Margin,” Proc. Int’l Conf. Dependable Systems and Networks, pp. 753-761, 2004.

[10] I. Sotoma and E.R.M. Madeira, “A Markov Model for Quality of Service of Failure Detectors in the Presence of Loss Bursts,” Proc. 18th Int’l Conf. Advanced Information Networking and Applications, vol. 2, pp. 62-67, 2004.

[11] R. Guerraoui and L. Rodrigues, Introduction to Reliable Distributed Programming. Springer, 2006.

[12] E.M. Dashofy, A. van der Hoek, and R.N. Taylor, “Towards Architecture-Based Self-Healing Systems,” Proc. First Workshop Self-Healing Systems (WOSS ’02), pp. 21-26, 2002.

[13] M.E. Shin and D. Cooke, “Connector-Based Self-Healing Mechanism for Components of a Reliable System,” Proc. 2005 Workshop Design and Evolution of Autonomic Application Software, pp. 1-7, 2005.

[14] R. Koo and S. Toueg, “Checkpointing and Rollback-Recovery for Distributed Systems,” IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.

[15] D. Manivannan and M. Singhal, “A Low-Overhead Recovery Technique Using Quasi Synchronous Checkpointing,” Proc. IEEE Int’l Conf. Distributed Computing Systems, pp. 100-107, 1996.

[16] T. Ma, J. Hillston, and S. Anderson, “Evaluation of the QoS of Crash-Recovery Failure Detection,” Proc. ACM Symp. Applied Computing (DADS Track), 2007.

[17] T. Ma, J. Hillston, and S. Anderson, “On the Quality of Service of Crash-Recovery Failure Detectors,” Proc. Int’l Conf. Dependable Systems and Networks, June 2007.

[18] M. Bertier, O. Marin, and P. Sens, “Implementation and Performance Evaluation of an Adaptable Failure Detector,” Proc. Int’l Conf. Dependable Systems and Networks, pp. 354-363, 2002.

[19] I. Gupta, T.D. Chandra, and G.S. Goldszmidt, “On Scalable and Efficient Distributed Failure Detectors,” Proc. 12th ACM Symp. Principles of Distributed Computing, pp. 170-179, 2001.

[20] R.V. Renesse, Y. Minsky, and M. Hayden, “A Gossip-Style Failure Detection Service,” technical report, Cornell Univ., 1998.

[21] P. Stelling, I. Foster, C. Kesselman, C.A. Lee, and G. von Laszewski, “A Fault Detection Service for Wide Area Distributed Computations,” Cluster Computing, vol. 2, no. 2, pp. 117-128, 1999.

[22] R. Boichat and R. Guerraoui, “Reliable and Total Order Broadcast in the Crash Recovery Model,” PhD thesis, Ecole Polytechnique Fed., 2001.

[23] M.K. Aguilera, W. Chen, and S. Toueg, “Failure Detection and Consensus in the Crash-Recovery Model,” Distributed Computing, vol. 13, no. 2, pp. 99-125, Apr. 2000.

[24] D. Dolev, R. Friedman, I. Keidar, and D. Malkhi, “Failure Detectors in Omission Failure Environments,” Technical Report 96-1608, Dept. of Computer Science, Cornell Univ., 1996.

[25] M. Hurfin, A. Mostefaoui, and M. Raynal, “Consensus in Asynchronous Systems Where Processes Can Crash and Recover,” Proc. 17th IEEE Symp. Reliable Distributed Systems, pp. 280-286, Oct. 1998.

[26] R. Oliveira, R. Guerraoui, and A. Schiper, “Consensus in the Crash-Recover Model,” Technical Report 97-239, Dept. d’Informatique, EPFL, http://citeseer.ist.psu.edu/oliveira97consensus.html, 1997.

[27] J.C. Knight and E.A. Strunk, “Software Dependability,” Proc. Int’l Conf. Dependable Systems and Networks, Tutorials, June 2006.

[28] R. Guerraoui, R. Oliveira, and A. Schiper, “Stubborn Communication Channels,” technical report, Dept. d’Informatique, EPFL, 1998.

[29] T. Ma, “QoS of Crash-Recovery Failure Detection,” PhD dissertation, The Univ. of Edinburgh, Mar. 2007.

Tiejun Ma received the BEng degree in automation and the BEng degree in computer science from Dalian University of Technology, China, and the MSc and PhD degrees from the Laboratory for Foundations of Computer Science, School of Informatics, The University of Edinburgh, in 2003 and 2007, respectively. He is a postdoc research associate of the Large-Scale Distributed System Group, Department of Computing, Imperial College. Before moving to the Imperial College, he was a staff member at the Oxford e-Research Centre, Oxford University, United Kingdom. His principal research interests are large-scale distributed systems, dependable computing, fault tolerance, performance evaluation, and grid computing.

Jane Hillston received the BA degree in mathematics from the University of York, United Kingdom, the MSc degree in mathematics from Lehigh University, and the PhD degree in computer science from The University of Edinburgh in 1994. She is a professor of quantitative modeling in the School of Informatics at The University of Edinburgh, and holds an Advanced Research Fellowship from the Engineering and Physical Sciences Research Council. She is a fellow of the Royal Society of Edinburgh. After a brief period working in industry, she joined the Department of Computer Science at The University of Edinburgh, as a research assistant, in 1989. Her work on the stochastic process algebra PEPA (www.dcs.ed.ac.uk/pepa) was recognized by the British Computer Society in 2004, which awarded her the first Roger Needham Award. Currently, her principal research interests are in the use of stochastic process algebras to model and analyze computer systems and biological systems and the development of efficient solution techniques for such models.

Stuart Anderson is a senior lecturer in the School of Informatics at the University of Edinburgh. His principal research interests are in the dependability of sociotechnical systems, in particular, the analysis of the role of risk and trust in such systems.
