
Performance Evaluation of a QoS-Aware Framework for Providing Tunable Consistency and Timeliness

Sudha Krishnamurthy, William H. Sanders
Coordinated Science Laboratory,

Dept. of Computer Science,

and Dept. of Electrical & Computer Engineering

University of Illinois at Urbana-Champaign

Email: {krishnam,whs}@crhc.uiuc.edu

Michel Cukier
Center for Reliability Engineering

Dept. of Materials & Nuclear Engineering

University of Maryland, College Park

Email: [email protected]

Abstract— Strong replica consistency models ensure that the data delivered by a replica always includes the latest updates, although this may result in poor response times. On the other hand, weak replica consistency models provide quicker access to information, but do not usually provide guarantees about the degree of staleness in the data they deliver. In order to support emerging distributed applications that are characterized by high concurrency demands, an increasing shift towards dynamic content, and timely delivery, we need quality-of-service models that allow us to explore the intermediate space between these two extreme approaches to replica consistency. Further, to better support time-sensitive applications that can tolerate relaxed consistency in exchange for better responsiveness, we need to understand how the desired level of consistency affects the timeliness of a response. The QoS model we have developed to realize these objectives considers both timeliness and consistency, and treats consistency along two dimensions: order and staleness. In this paper, we experimentally evaluate the framework we have developed to study the timeliness/consistency tradeoffs for replicated services and present experimental results that compare these tradeoffs in the context of sequential and FIFO ordering.

I. INTRODUCTION

Distributed applications often have to contend with different degrees of unpredictability, which arise from several factors, such as unanticipated faults in the system and transient overloads caused by sharing the network and computing resources with other users and applications. Owing to this unpredictability, users can benefit if they are allowed to control the quality-of-service (QoS) they receive from the services they access. To support the diverse QoS requirements of modern applications, we need to build QoS-aware middleware layers that allow the clients to express their application-specific requirements using the right level of abstraction. The middleware should be able to map the application-specific requirements to the properties of the resources it manages, so that the requirements can guide the middleware in effectively sharing the resources among the users. Furthermore, owing to the dynamic nature of the distributed environment, the middleware should be able to adapt the allocation of resources to meet the requirements of the users, based on feedback from the environment.

This research has been supported by DARPA contract F30602-98-C-0187.

Considerable work on building such QoS-aware middleware layers for specific applications has been done. For example, Agilos [8] uses a control-theoretic approach to adaptively allocate resources, such as network bandwidth and CPU cycles, to multimedia-related applications. Globus [2] is another middleware that manages the allocation of multiple resources, such as the network, CPU, and disk resources, for a variety of distributed applications, by combining features of a reservation-based approach and an adaptation-based approach.

In contrast, we have designed a distributed, QoS-aware middleware framework that manages all of the server replicas offering a service, and adaptively assigns replicas to service a client based on the client's QoS requirements and the responsiveness of the replicas [6]. This work was done in the context of AQuA [9], which is a CORBA-based dependable middleware that transparently replicates objects across a LAN. Replicating a service provides dependability and enables us to deliver good response times, by selecting different replicas to service different clients concurrently. However, since concurrent operations have the potential to introduce replica inconsistency, one of the challenges in replicating distributed services is the problem of delivering a consistent and timely response to the clients. Traditional replica consistency models use a binary approach: strong consistency models ensure that the data delivered by a replica always includes the latest updates, although this may result in poor response times; weak consistency models provide quicker access to information, but do not usually provide guarantees about the degree of staleness in the information they deliver. In our research, we target client applications that have specific temporal and consistency requirements. These applications can tolerate a certain degree of relaxed consistency in exchange for better response time. Our framework allows clients to access replicated services by requesting quality-of-service along two dimensions: timeliness of response and consistency of delivered data. While our framework ensures that the consistency requirements are always met, the uncertainty in most distributed environments makes it hard to provide deterministic guarantees for meeting the timeliness requirements of applications. Hence, our approach provides probabilistic temporal guarantees.

One of the important responsibilities of our QoS-aware layer is the selection of appropriate replicas to service the clients to meet their QoS requirements. One approach would be to select all the available replicas to service a single client. However, such an approach is not scalable, as it increases the load on all the replicas and results in higher response times for the remaining clients [5]. On the other hand, assigning a single replica to service each client allows us to service multiple clients concurrently. However, should a replica fail while servicing a request, the failure could result in an unacceptable delay for the client being serviced. Hence, neither approach is suitable when a client has specific timing constraints and when failure to meet the constraints results in a penalty for the client. We use an approach based on an application-aware model of adaptation, in which the middleware and application collaborate to adapt. This supports application diversity by allowing the applications to determine the mapping between their QoS requirements and the replicas. The middleware keeps track of the current state and responsiveness of the replicas by monitoring them at runtime. It then uses probabilistic models to select an appropriate subset of replicas to service a client, by using the runtime measurements as inputs to the model.

Our framework can be used to support multiple consistency semantics and to study the timeliness/consistency tradeoffs for replicated services. In [6], we described how the framework can be used to support relaxed consistency semantics using sequential ordering [1]. In this paper, we have extended the framework to support FIFO ordering, which provides weaker consistency semantics than sequential ordering. We also provide an experimental evaluation of the framework for both types of ordering and present results that compare the timeliness/consistency tradeoffs for sequential and FIFO ordering. The results we present are promising and show that the probabilistic approach we have developed allows a middleware to effectively adapt the allocation of replicated servers to meet the QoS requirements under different scenarios.

The remainder of this paper is organized as follows. In Section II we describe our QoS model for accessing replicated objects. In Section III we review the main features of the adaptive framework we have designed to support multiple consistency semantics at the middleware layer. In Section IV, we explain the probabilistic models that form the basis for evaluating the timeliness/consistency tradeoffs. We present our experimental analysis in Section V. We present our conclusions in Section VI, and discuss ideas for future extensions in Section VII.

II. QOS MODEL FOR ACCESSING REPLICATED OBJECTS

We use a request model that allows the middleware to distinguish update invocations, which modify the state of the object they invoke, from read-only invocations that merely retrieve the state. Our QoS model allows a client to specify its timeliness requirements using a pair of attributes: <response time, probability of timely response>. This pair specifies the time by which a client expects a response after it has transmitted its request, and the probability with which it expects its temporal constraint to be met. Failure to meet a client's deadline results in a timing failure for the client. In our QoS model, the timeliness attribute is applicable only for read-only requests and not for update operations.

Several metrics have been proposed to quantify consistency. For example, the TACT middleware [11], a closely related work, uses numerical error, order error, and staleness in real-time to assess the tradeoffs between consistency and availability of replicas. On the other hand, our QoS model regards consistency as a two-dimensional attribute: <ordering guarantee, staleness threshold>. The ordering guarantee is a service-specific attribute that denotes the guarantee provided by a service to all its clients about the order in which their requests will be processed by the servers, so as to prevent conflicts between operations. We currently support sequential and FIFO ordering. The staleness threshold, which is specified by the client, is a measure of the maximum degree of staleness a client is willing to tolerate in the response it receives. It bounds the degree of staleness in a way that is meaningful to the users. In order to meet a client's QoS specification, a response delivered to the client should be no more stale than the staleness threshold specified by the client. We compute the staleness of a replica by associating each update operation with a timestamp. We use timestamps based on "logical clocks" [7] because this obviates the need for synchronized clocks across the distributed replicas. A replica whose staleness is x is one whose state has not yet been updated to reflect the modifications ensuing from the most recent x updates, but reflects all updates committed prior to that. As an example of the use of the above model, a client of a document-sharing application can specify that he wishes to obtain a copy of the document that is not more than 5 versions old within 500.0 milliseconds with a probability of at least 0.8.
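To make the two-part specification concrete, the following Python sketch encodes the attributes a client might submit; the names are illustrative and are not part of the AQuA interface. The instance shown corresponds to the document-sharing example above.

from dataclasses import dataclass

# Hypothetical encoding of the QoS specification described in the text;
# field names are illustrative, not part of AQuA.
@dataclass
class QoSSpec:
    deadline_ms: float          # d: response time constraint
    p_timely: float             # Pc(d): minimum probability of timely response
    ordering: str               # service-specific: "sequential" or "fifo"
    staleness_threshold: int    # a: maximum staleness tolerated, in versions

# A copy no more than 5 versions old, within 500.0 ms, with probability >= 0.8.
spec = QoSSpec(deadline_ms=500.0, p_timely=0.8,
               ordering="sequential", staleness_threshold=5)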

III. SUPPORT FOR TUNABLE CONSISTENCY

We now review the main ideas of the framework that makes use of the above QoS model to support tunable consistency and timeliness at the middleware layer. The main components of this framework are the hierarchical organization of the replicas, the protocols that implement different ordering guarantees, and a mechanism that dynamically assigns replicated servers to service a client based on the client's QoS requirements. Before we review these individual components, we first provide an overview of the AQuA middleware, shown in Figure 1.

A. Overview of AQuA

[Fig. 1: AQUA OVERVIEW. The figure shows a client and a server, each with an AQuA gateway and handler, layered over Maestro/Ensemble on a LAN; it contrasts the communication path perceived by the client/server with the intercepted path through the gateways.]

AQuA enhances the capabilities of CORBA objects by transparently replicating the objects across a LAN. A dependability manager manages the replication level for different applications based on their dependability requirements. Replicas offering the same service are organized into a group. Communication between members of a group takes place through the Maestro-Ensemble group communication layer [10], [3], above which AQuA is layered. The use of group communication in AQuA is transparent to the end applications. Hence each of the clients, which are all CORBA objects, is given the perception that it is communicating with a single server object using CORBA's remote method invocation, although the client's request may be processed by multiple server replicas. This transparency is achieved using an AQuA gateway, which transparently intercepts a local application's CORBA message and forwards it to the destination replica group through Maestro-Ensemble, as shown in Figure 1. For the sake of clarity, in this figure we have illustrated a server replica group that has only a single member. In reality, this group may have multiple replica members. While previous work in AQuA [9] has focused on gateway handlers for providing fault tolerance, we have enhanced AQuA by developing gateway handlers that support tunable consistency and timeliness for time-sensitive applications.

B. Hierarchical Replica Organization

We organize all the replicas offering a service into two groups: a primary replication group and a secondary replication group. We also use a QoS group, which encompasses all of the replicas of a service and their clients. In our implementation, all of these groups are derived from Maestro groups. We depend on Maestro-Ensemble to provide reliability, virtual synchrony, and FIFO messaging guarantees, and build upon these guarantees to provide the different end-to-end consistency guarantees. Maestro-Ensemble is also responsible for informing the group members when changes in the group membership occur.

[Fig. 2: TIMED CONSISTENCY HANDLERS IN THE AQUA GATEWAY. The figure shows a client gateway with TOTAL and FIFO handlers communicating through Maestro/Ensemble with the gateways of Server A and Server B; the labels W, S, and G mark the queuing, service, and gateway delays.]

The primary and secondary replication groups of an object may be adaptively organized to implement multiple consistency semantics. The primary replication group is used to implement strong consistency semantics, whereas the secondary group implements weaker consistency semantics. This two-level replica organization was motivated by the need to favor operations that can tolerate relaxed consistency, to a certain degree, in exchange for a timely response. While a write-all scheme that writes to all the replicas concurrently always provides access to the latest updates, it may result in higher response times for the read operations. Our approach, on the other hand, performs updates on the smaller primary group, while allowing the secondary replicas, which are greater in number, to handle the read-only operations. The primary replicas subsequently bring the state of the secondary replicas up-to-date using lazy update propagation. We can bound the degree of divergence between the states of primary and secondary replicas by choosing an appropriate frequency for the lazy update propagation. Thus, while clients that need the most up-to-date state to be reflected in their response may have to depend more on the response from a primary replica, clients that are willing to tolerate a certain degree of staleness in their response can achieve better response times, due to the higher availability of the secondary replicas. In Section V-B, we will present experimental results that justify the use of a hierarchical replica organization.
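A minimal sketch of the lazy propagation loop, assuming a publish() callable that stands in for a Maestro/Ensemble multicast to the secondary group (the paper does not show the actual implementation):

import threading

# Illustrative lazy publisher: one primary replica periodically pushes its
# committed state, tagged with its logical timestamp, to the secondaries.
# get_state and publish are assumed callables, not AQuA APIs.
def lazy_publisher(get_state, publish, lui_seconds):
    def propagate():
        state, version = get_state()   # current state plus its update number
        publish(state, version)        # secondaries overwrite their state
        timer = threading.Timer(lui_seconds, propagate)
        timer.daemon = True
        timer.start()
    propagate()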

C. Ordering Guarantees

In order to maintain replica consistency, we need to ensure that the replicas service their clients by respecting the ordering guarantee associated with the service. Our framework allows different ordering guarantees to be implemented as timed consistency handlers within the AQuA gateway, as shown in Figure 2. The framework we described in [6] supported only sequential ordering. In this paper, we extend this framework by adding FIFO ordering. The sequential handler was motivated by applications, such as document-sharing applications, in which the replicated servers provide total ordering by maintaining a common state that is shared by all the clients. On the other hand, the FIFO handler, which provides weaker consistency, was designed to support applications, such as banking transactions, in which the replicated servers maintain states that are specific to each client. A client can communicate with a replicated service by using the gateway handler appropriate for the service. For example, Figure 2 shows a client communicating with Service A using a sequential handler and with Service B using a FIFO handler. We have designed the protocols to ensure that the ordering guarantees are provided even when replica failures occur. We do not, however, describe the details of the failure handling in this paper, owing to the space constraint.

We will now compare and contrast the sequential and FIFO handlers with respect to the way they service a client's update and read-only requests. In the case of both handlers, a client's update request is sent to all the primary replicas. The secondary replicas do not directly service a client's update request. Instead, the secondary replicas update their state when one of the members of the primary group lazily propagates its updated state to the secondary group. We call this member the lazy publisher. A client may send its read-only request to a subset of primary and secondary replicas. In Section IV, we will describe the selection of this subset. A selected replica responds to the read request immediately if its most recently updated state is no more than x versions old, where x is the staleness threshold specified by the client in its QoS specification. In other words, the replica performs an immediate read operation if it meets the staleness specification of the client. However, a secondary replica may have a state that is more stale than the staleness threshold specified by the client. The reason for this is that the secondary replicas update their state only upon receiving the state update from the lazy publisher. In such a case, the replica performs a deferred read by buffering the read request and responding to the client upon receiving the next state update from the lazy publisher.
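The immediate/deferred decision can be sketched as follows; all names are illustrative, and the real handler logic in AQuA is more involved:

# Sketch of a replica's handling of a read request. `my_version` is the
# replica's most recently committed update number, `latest_version` is the
# most recent update number known for this state, and `a` is the client's
# staleness threshold.
def respond(request, state):
    # Placeholder: deliver the replica's current state to the client gateway.
    return ("reply", request, state)

def handle_read(request, a, state, my_version, latest_version, deferred):
    if latest_version - my_version <= a:
        return respond(request, state)   # immediate read: state is fresh enough
    deferred.append(request)             # deferred read: buffer until next update
    return None

def on_lazy_update(new_state, deferred):
    # After applying the state from the lazy publisher, answer buffered reads.
    for request in deferred:
        respond(request, new_state)
    deferred.clear()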

The sequential and FIFO handlers differ in the order in which the replicas commit the updates and the manner in which a replica determines if its state meets the staleness threshold specified by a client. In the sequential consistency case, all the replicas see the effects of the updates in the same sequential order. The order in which the replicas commit updates is determined by the Global Sequence Number (GSN) of the update operation, which is assigned by the leader of the primary group and broadcast by the leader to the other primary replicas. The leader of the primary group is elected by Ensemble. The leader merely serves as the sequencer and does not actually service the client's request. The value of the GSN at any instant of time may be considered to be the value of the leader's logical clock. For read-only operations, this GSN serves as the basis for determining the staleness of a replica. In contrast, the FIFO handler does not use a dedicated sequencer to determine the order in which the replicas commit their updates. Instead, the clients send a sequence number along with their invocations. The primary replicas commit the updates of each client in increasing order of this sequence number. In the case of a read request, the replicas use this client-specific sequence number to determine if their state with respect to a client's updates is within the client-specified staleness threshold. They then perform an immediate or deferred read accordingly.
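A hold-back queue is one standard way to realize such in-order commits; the sketch below is illustrative and is not taken from AQuA. Under sequential ordering the number would be the leader-assigned GSN; under FIFO ordering it would be the per-client sequence number, with one committer per client.

import heapq

# Hold-back queue: commit updates strictly in sequence-number order, even if
# they are received out of order.
class OrderedCommitter:
    def __init__(self):
        self.next_seq = 1
        self.pending = []          # min-heap of (seq, update)

    def receive(self, seq, update, apply):
        heapq.heappush(self.pending, (seq, update))
        while self.pending and self.pending[0][0] == self.next_seq:
            _, u = heapq.heappop(self.pending)
            apply(u)               # commit in order
            self.next_seq += 1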

IV. REPLICA SELECTION

Having described the replica organization and the protocols that provide the different application-specific ordering guarantees for the replicated servers, we now review the third main component of our adaptive framework. This is a selection module that is local to each client's gateway handler and provides a mechanism to select replicas to service a client's read request, based on their ability to meet the client's temporal requirements, as well as on whether the state of the replica is within the staleness threshold specified by the client in its QoS specification. For an update request, the mechanism selects all the primary replicas.

The selection problem is similar to a resource allocation problem and is challenging, because in most distributed environments it is impossible for a client to know with certainty if any single replica can meet its deadline. Further, while the client can be certain that the state of the primary replicas is always up-to-date because of the immediate update propagation, it cannot make such guarantees about the state of the secondary replicas, which update their state lazily. Hence, our selection scheme makes use of spatial redundancy by selecting multiple replicas to service a read request, and delivers the earliest response to the client. Since distributed services may have high concurrency demands, the selection algorithm has to choose the degree of redundancy to service a request judiciously. Our algorithm makes use of probabilistic models to estimate a replica's staleness and to predict the probability that the replica will be able to meet the client's deadline. These models make their predictions based on information gathered by monitoring the replicas at run-time. Our selection algorithm then uses this online prediction to choose a subset of replicas that can meet the client's timing constraints with at least the probability requested by the client. While the algorithm ensures that the response delivered to the client will meet the staleness constraint, it can only provide probabilistic guarantees about meeting the temporal constraint.

A. Parameters of the Probabilistic Model

To identify the parameters of our probabilistic model, we conducted experiments to determine the factors that affect the response time in AQuA. Based on our experimental analysis, we found that in the case of an immediate read, the response time random variable Ri for a replica i is given by Equation 1:

Ri = Si + Wi + Gi (1)

For a deferred read, Ri is given by Equation 2:

Ri = Si + Wi + Gi + Ui (2)

where Si is the random variable denoting the service time for a read request processed by replica i; Wi is the random variable denoting the queuing delay experienced by a request waiting to be serviced by i; Gi is the random variable denoting the two-way gateway-to-gateway delay between the client and replica i; and Ui is the duration of time the replica spends waiting for the next lazy update. In the case of sequential ordering, the queuing delay includes the time the replica spends waiting for the sequencer to send the GSN for the request. As shown in Figure 2, the service times (S) and queuing delays (W) are specific to the individual replicas, while the gateway delays (G) are specific to a client-replica pair. For each read request, we experimentally measure the values of the above performance parameters by instrumenting the gateway handlers. The values of Si, Wi, and Ui for a read request are measured by the server-side handler. The server handler then publishes the new measurements to all the clients. The value of the two-way gateway delay, Gi, is measured by the client-side handler when it receives a response from replica i. For each replica, the client handlers record the most recent l measurements of these parameters in separate sliding windows in an information repository that is local to each client. The size of the sliding window, l, is chosen so as to include a reasonable number of recent requests, while eliminating obsolete measurements. The details of the gateway instrumentation and online performance monitoring are provided in [6].
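A minimal sketch of such a client-local repository, assuming bounded windows keyed by replica and parameter (names are illustrative; l = 20 matches the window size used in the experiments of Section V):

from collections import deque

# Client-local information repository: keep the last l measurements of each
# performance parameter (S, W, U, ...) per replica in a bounded window.
class PerfRepository:
    def __init__(self, l=20):
        self.l = l
        self.windows = {}   # (replica_id, param) -> deque of recent values

    def record(self, replica_id, param, value):
        key = (replica_id, param)
        if key not in self.windows:
            self.windows[key] = deque(maxlen=self.l)
        self.windows[key].append(value)

    def history(self, replica_id, param):
        return list(self.windows.get((replica_id, param), []))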

B. Estimating the Probability of a Timely Response

As mentioned in Section II, our work targets clients that have specific consistency and timeliness constraints. Each client expresses its constraints in the form of a QoS specification that includes the response time constraint, d; the minimum probability of meeting this constraint, Pc(d); and the maximum staleness, a, that it can tolerate in its response. If a response fails to meet the deadline constraint of the client, then it results in a timing failure for the client. Using the above performance history, collected by monitoring the performance of the replicas online, our selection module predicts the probability, PK(d), that at least one response from the subset K ⊆ M, consisting of k > 0 replicas, will arrive by the client's deadline, d, and thereby prevent a timing failure. M denotes the complete set of replicas that offer the service requested by the client. The set K is made up of a subset Kp of primary replicas and a subset Ks of secondary replicas (i.e., K = Kp ∪ Ks). While each replica in K independently processes the client's request and returns its response, only the first response received for a request is delivered to the client. Hence, a timing failure occurs only if no response is received from any of the replicas in the selected set K within d time units after the request was transmitted. Using this independence assumption, we have

PK(d) = 1 − P(no replica i ∈ K : Ri ≤ d)
      = 1 − P(no i ∈ Kp : Ri ≤ d) · P(no j ∈ Ks : Rj ≤ d)    (3)

Although the independence assumption may not be strictly true in some cases (e.g., if the network delays are correlated), it does result in a model that is fast enough to solve online. This tradeoff was necessary since our work targets time-sensitive applications. The experimental results we present in Section V show that, despite this approximation, the resulting model makes reasonably good predictions most of the time.

Before we proceed to explain how we evaluate PK(d), we will introduce a few notations. Let t denote the time at which a request is transmitted by the client to the server. Since replicas are selected at the time of request transmission, we also use t to denote the time at which the replica selection is done. Let P(Ri ≤ d) denote the probability that a response from replica i will be received by the client within the client's deadline, d. This probability depends on whether the replica is functioning and has a state that can satisfy the client-specified staleness threshold. We can make use of these individual probabilities to choose a subset K of replicas such that PK(d) ≥ Pc(d). The replicas in the set K will then form the final set selected to service the request. Let Ai(t) denote the staleness of the state of replica i at time t, and let P(Ai(t) ≤ a) denote the staleness factor for replica i at time t. The staleness factor is defined as the probability that the state of replica i at the time of request transmission, t, is within the staleness threshold, a, specified by the client. Since the update requests of the clients are propagated to the primary group immediately, the staleness factor P(Ai(t) ≤ a) = 1 for a primary replica. Hence, in the case of the primary subset, we have

P(no i ∈ Kp : Ri ≤ d) = ∏_{i∈Kp} P(Ri > d) = ∏_{i∈Kp} (1 − F^I_Ri(d))    (4)

where F^I_Ri denotes the response time distribution function for replica i, conditioned on the fact that it can respond immediately to a read request without waiting for a state update.

For a secondary replica, the staleness factor does not always have a value of 1. Therefore, for a replica j ∈ Ks,

P(Rj > d) = P(Rj > d | Aj(t) ≤ a) · P(Aj(t) ≤ a) + P(Rj > d | Aj(t) > a) · P(Aj(t) > a)

Since the lazy update is propagated to all the secondary replicas at the same time, it is reasonable to assume that their degrees of staleness at the time of request transmission, t, are identical. Hence, rather than associating the staleness factor with an individual secondary replica as above, we use As(t) to denote the staleness of the secondary group at the time of request transmission t. The expression for the probability that no replica in the secondary subset can respond within the deadline d is then given by

P(no j ∈ Ks : Rj ≤ d)
  = [∏_{j∈Ks} P(Rj > d | As(t) ≤ a)] · P(As(t) ≤ a) + [∏_{j∈Ks} P(Rj > d | As(t) > a)] · P(As(t) > a)
  = [∏_{j∈Ks} (1 − F^I_Rj(d))] · P(As(t) ≤ a) + [∏_{j∈Ks} (1 − F^D_Rj(d))] · (1 − P(As(t) ≤ a))    (5)

where F^I_Rj, as before, denotes the response time distribution function of the replica j for an immediate read, and F^D_Rj is the response time distribution function, given that the replica defers the read until it has received the next lazy state update.

We make use of the performance history of the replicas recorded at runtime, as explained in Section IV-A, to compute the values of F^I_Ri and F^D_Ri for a replica i. To evaluate F^I_Ri(d), we first compute the probability mass function (pmf) of Si and Wi based on the relative frequency of their values recorded in the sliding window. We then use the pmf of Si, the pmf of Wi, and the most recently recorded value of Gi to compute the pmf of the response time Ri as a discrete convolution of Si, Wi, and Gi. For Gi, we decided to use its most recently recorded value rather than record its history over a period of time, because the gateway-to-gateway delay in a LAN does not fluctuate as much as the other parameters do. In environments where this observation is not true, it would be simple to extend our approach to record the gateway delay over a sliding window, as we do for the service time and queuing delay. Finally, the pmf of Ri obtained using the discrete convolution can be used to compute the value of the distribution function F^I_Ri(d). We follow a similar procedure to compute F^D_Ri(d), although in this case we record a performance history of Ui and include the pmf of Ui in the convolution.
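The computation just described can be sketched as follows, assuming each window holds values discretized to a common time unit (illustrative names, not the AQuA code):

from collections import Counter
from itertools import product

# Empirical pmf from a sliding window of measurements.
def pmf(samples):
    counts = Counter(samples)
    n = len(samples)
    return {v: c / n for v, c in counts.items()}

# Discrete convolution of two pmfs (pmf of the sum of two random variables).
def convolve(p, q):
    r = {}
    for (x, px), (y, qy) in product(p.items(), q.items()):
        r[x + y] = r.get(x + y, 0.0) + px * qy
    return r

# F^I_Ri(d): convolve S_i and W_i with the point mass at the latest G_i,
# then sum the probability of all response times no greater than d.
def immediate_read_cdf(s_window, w_window, g_latest, d):
    r_pmf = convolve(convolve(pmf(s_window), pmf(w_window)), {g_latest: 1.0})
    return sum(p for value, p in r_pmf.items() if value <= d)

# For F^D_Ri(d), the same procedure additionally convolves in the pmf of the
# lazy-update wait U_i recorded in its own sliding window.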

We now describe how a client's selection module estimates the staleness factor, P(As(t) ≤ a), at the time it selects replicas to service the client's read transaction. The staleness of a secondary replica at the instant t is the number of update requests that have been received by the primary group since the time of the last lazy update. Let tl denote the duration between the time of request transmission, t, and the time of the last lazy update. Let Nu(tl) be the number of update requests committed by the primary group in the duration tl. Since As(t) = Nu(tl), we have P(As(t) ≤ a) = P(Nu(tl) ≤ a). Using the assumption that the update arrivals follow a Poisson distribution with rate λu, we obtain

P(As(t) ≤ a) = P(Nu(tl) ≤ a) = Σ_{n=0}^{a} ((λu·tl)^n · e^(−λu·tl)) / n!    (6)

The sequential and FIFO handlers differ slightly in the way they evaluate the update arrival rate, λu. In sequential ordering, the replica state is shared by all the clients, and therefore λu is the rate at which updates are received by the primary replicas from all the clients. However, in the case of FIFO ordering, since the updates to the replicated object are specific to the individual clients, λu is the rate at which the client that is making the replica selection updates the replicated object. In either case, the staleness of the secondary replicas can be determined probabilistically if the client handler knows the arrival rate of the update requests and the time elapsed since the last lazy update. Like the performance parameters in Section IV-A, this information is obtained by instrumenting the gateway handlers [6]. Although we have assumed Poisson arrivals in our work, it should be possible to evaluate the staleness factor for other kinds of update distributions by using the appropriate models. Now that we have a way to compute all the components of our probabilistic model, we can use the expressions from Equations 4, 5, and 6 in Equation 3 to evaluate PK(d).
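The following sketch assembles Equations 4, 5, and 6 into Equation 3; the function and parameter names are illustrative:

import math

# Equation 6: staleness factor P(As(t) <= a) under Poisson update arrivals
# with rate lam_u over the duration t_l since the last lazy update.
def staleness_factor(a, lam_u, t_l):
    return sum((lam_u * t_l) ** n * math.exp(-lam_u * t_l) / math.factorial(n)
               for n in range(a + 1))

# Equation 3 via Equations 4 and 5. `primaries` holds F^I_Ri(d) for each
# selected primary; `secondaries` holds (F^I_Rj(d), F^D_Rj(d)) pairs.
def p_timely(primaries, secondaries, stale_factor):
    prim_miss = math.prod(1.0 - f_imm for f_imm in primaries)          # Eq. 4
    sec_miss = (math.prod(1.0 - f_imm for f_imm, _ in secondaries)     # Eq. 5
                * stale_factor
                + math.prod(1.0 - f_def for _, f_def in secondaries)
                * (1.0 - stale_factor))
    return 1.0 - prim_miss * sec_miss                                  # Eq. 3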

C. Selection of Replicas with Dynamic State

Algorithm 1 outlines the selection algorithm that enables a client gateway to select a set of replicas that can together meet the client's QoS specification, based on the prediction made by the probabilistic models described in the previous section. The algorithm uses the model's prediction to select no more than the number of replicas necessary to meet the client's response time constraint with the probability the client has requested. The algorithm is executed by the selection module in a client gateway when the client performs a read-only request on a server object. If the client makes an update request, the selection module chooses all the primary replicas.

Algorithm 1 Replica Selection Algorithm: Dynamic State
Require: V = <i, F^I_Ri(d), F^D_Ri(d), erti>, staleFactor
Require: Client inputs: a: staleness threshold, d: deadline, Pc(d): minimum probability of meeting this deadline
1: primCDF ⇐ 1; secImmedCDF ⇐ 1; secDelayedCDF ⇐ 1; secCDF ⇐ 1
2: sortedList ⇐ sort V in decreasing order of erti
3: K ⇐ [first(sortedList)]; maxCDFReplica ⇐ [first(sortedList)]; advance(sortedList)
4: for all i in sortedList do {visit the remaining replicas in sorted order}
5:   K ⇐ K ∪ i
6:   if F^I_Ri(d) > maxCDFReplica.immedCDF() then
7:     found ⇐ includeCDF(maxCDFReplica, maxCDFReplica.immedCDF(), maxCDFReplica.delayedCDF())
8:     maxCDFReplica ⇐ i
9:   else
10:    found ⇐ includeCDF(i, F^I_Ri(d), F^D_Ri(d))
11:  end if
12:  if found = true then {found an acceptable set}
13:    return K
14:  end if
15: end for
16: return K {return the set comprising all the replicas}

17: includeCDF(replica, immedCDF, delayedCDF)
18: begin
19:   if replica ∈ PrimaryGroup then
20:     primCDF ⇐ primCDF * (1 − immedCDF)
21:   else
22:     secImmedCDF ⇐ secImmedCDF * (1 − immedCDF); secDelayedCDF ⇐ secDelayedCDF * (1 − delayedCDF)
23:     secCDF ⇐ secImmedCDF * staleFactor + secDelayedCDF * (1 − staleFactor)
24:   end if
25:   if 1 − (primCDF * secCDF) ≥ Pc(d) then
26:     return true {found an acceptable replica set}
27:   else
28:     return false {need more replicas}
29:   end if
30: end

The algorithm receives as inputs the QoS specification, and the list of secondary and primary replicas, along with relevant information about them. For each replica i, the algorithm also receives the values of the replica's immediate and deferred response time distribution functions, which are denoted by F^I_Ri(d) and F^D_Ri(d). However, for a primary replica i, F^D_Ri(d) is not used. The algorithm also receives the staleness factor for the secondary replicas, which is computed using Equation 6. Furthermore, the algorithm receives the elapsed response time, erti, which is the duration that has elapsed since a reply was last received by the client from replica i. While the response time distributions, which are computed from the performance history as explained in Section IV-B, are specific to the individual replicas and are nearly identical in all the client information repositories, the ert information is specific to each client-replica pair and is likely to be different in all the repositories. The algorithm first sorts the replicas in decreasing order of their elapsed response time, ert. This allows the clients to favor the selection of replicas that they used least recently, and thereby alleviate the occurrence of "hot-spots." Replicas that have the same value of ert are sorted in decreasing order of the values of their distribution functions.

The steps of the algorithm are as follows. The algorithm traverses the replica list in sorted order, including each visited replica in the candidate set K, until it includes enough replicas in K to satisfy the terminating condition PK(d) ≥ Pc(d). Each time it includes a new replica, the algorithm invokes the function includeCDF() to obtain the value of PK(d) for the replicas that are currently included in the set K. This function receives as arguments the values of F^I_Ri(d) and F^D_Ri(d) for the newly included replica, and uses them to compute the value of PK(d) according to Equation 3. The function then tests the terminating condition in Line 25 and returns true if the condition is satisfied, indicating that an appropriate replica subset has been found. Finally, in the case of sequential ordering, the selected set K is extended to include the sequencer, which is responsible for broadcasting the global sequence number.
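For illustration, a simplified greedy loop in this spirit, reusing p_timely() from the earlier sketch, might look like the following; it omits the maxCDFReplica exclusion discussed in the next paragraph and the sequencer bookkeeping, and the names are illustrative:

from collections import namedtuple

Candidate = namedtuple("Candidate", "replica_id is_primary f_imm f_def")

def select_replicas(candidates, pc_d, stale_factor):
    # candidates must already be sorted in decreasing order of ert_i.
    prim, sec, chosen = [], [], []
    for c in candidates:
        chosen.append(c)
        if c.is_primary:
            prim.append(c.f_imm)
        else:
            sec.append((c.f_imm, c.f_def))
        if p_timely(prim, sec, stale_factor) >= pc_d:
            return chosen          # acceptable subset found
    return chosen                  # fall back to all available replicas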

Notice that when evaluating PK(d), we exclude the response time distribution of the member, maxCDFReplica, that has the highest probability among the selected members of responding by the requested deadline. We do this in order to choose a subset that can meet a client's time constraint despite a single replica failure. We propose that if we can choose a set of replicas that can satisfy the timing constraint with the specified probability despite the failure of the member that has the highest probability of meeting the client's deadline, then such a set should be able to handle the failure of any other member in the set. In [5] we have provided a formal justification for this proposal. The above exclusion in effect simulates the failure of the replica with the highest probability of meeting the client's deadline among the selected replicas, and therefore allows us to tolerate single replica failures. Although Algorithm 1 addresses only single replica crashes, it should be possible to extend it to handle multiple replica crashes.

V. EXPERIMENTAL RESULTS

We conducted experiments to evaluate the performance of our selection approach and experimentally analyzed the tradeoffs between timeliness and consistency, using the sequential and FIFO ordering handlers we implemented in AQuA. In this section, we discuss the results we obtained. Our experimental setup is composed of a set of uniprocessor Linux machines distributed over a 100 Mbps LAN. The overhead of the probabilistic selection algorithm increases with the number of replicas and the size of the sliding window used to record the performance histories. For a sliding window of size 20, we found that the selection overhead increased linearly from 400 microseconds to 1200 microseconds when we increased the number of replicas from 2 to 10. The overhead includes the time to compute the distribution functions and the time to select the replica subset. Computation of the response time distribution function contributes about 90% of the overhead, while selection of the replica subset using Algorithm 1 contributes the remaining 10%. The overhead is incurred during each request and is accounted for during replica selection. All confidence intervals for the results we present in the following sections are at a 95% level, and have been computed under the assumption that the number of timing failures follows a binomial distribution [4].
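For instance, under the binomial assumption, a 95% confidence interval for an observed failure probability over n requests can be approximated with the usual normal approximation, as sketched below; whether [4] prescribes this exact approximation is not stated in the paper.

import math

# Normal-approximation 95% CI for a binomial proportion: p_hat +/- 1.96 *
# sqrt(p_hat * (1 - p_hat) / n), where n is the number of requests in a run
# and `failures` the number of missed deadlines.
def binomial_ci95(failures, n):
    p_hat = failures / n
    half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)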

[Fig. 3: SEQUENTIAL VS. FIFO ORDERING: ADAPTIVITY OF PROBABILISTIC MODEL. (a) Average number of replicas selected vs. deadline (milliseconds), sequential; (b) the same, FIFO. Curves combine requested probabilities of 0.9 and 0.5 with LUI values of 2 and 4 seconds (sequential) or 2 and 6 seconds (FIFO).]

A. Effectiveness of Probabilistic Model

One of the main objectives of Algorithm 1 is to assign the replicated servers to service a request adaptively, based on the current responsiveness of the replicas and the QoS requested by the clients. We now present the results of the experiments we conducted to evaluate the adaptivity and effectiveness of the probabilistic selection algorithm for sequential and FIFO ordering. We used an experimental setup composed of 10 server replicas. 4 of the server replicas were in the primary group, and the remaining replicas were in the secondary group. In addition, we had a sequencer when we were using sequential ordering. The service time for all the read and update requests was normally distributed with a mean of 100 milliseconds and a variance of 50 milliseconds. We used two clients that ran on two different machines and independently issued requests to the replicated service with a 1000 millisecond request delay, which we define as the duration that elapses before a client issues its next request after completion of its previous request. In every run, each of the two clients issued 1000 alternating write and read requests to the service. One of the clients (which we refer to as Client1) requested the same QoS for all of the runs; it specified a staleness threshold of value 4, a deadline of 200 milliseconds, and a minimum probability of timely response of 0.1. The second client (which we refer to as Client2) specified a staleness threshold of value 2 in all the runs, and requested a different deadline in each run.

To study the behavior of the selection algorithm for different values of the probability of timely response specified by a client, we repeated the experiments for two different probability values specified by Client2 in its QoS specification: 0.9 and 0.5. In order to study the effect of the staleness of the replicas on the timeliness of their response, we conducted experiments using different values for the lazy update interval (LUI). The LUI is the periodicity with which the lazy publisher propagates its updates to the secondary replicas. For sequential ordering, we carried out experiments using lazy update intervals of 2 seconds and 4 seconds. In the case of FIFO ordering, we observed nearly identical performances for our experiments using LUI values of 2 seconds and 4 seconds. Hence, for FIFO ordering, we have shown the results using LUI values of 2 seconds and 6 seconds instead.

[Fig. 4: SEQUENTIAL VS. FIFO ORDERING: TIMELINESS/CONSISTENCY TRADEOFFS. (a) Observed probability of timing failure vs. deadline (milliseconds), sequential; (b) the same, FIFO; probability/LUI combinations as in Figure 3.]

We evaluated the adaptivity of our algorithm by determining how the degree of redundancy chosen to service a request varies with the requested QoS. In our experiments we did this by measuring the average number of replicas selected by the selection algorithm to service Client2, as the client varied its QoS specification. Figure 3a shows the results for sequential ordering and Figure 3b shows the results obtained for FIFO ordering. From the graphs we observe that, regardless of the ordering, the number of selected replicas decreases as the requested deadline varies from 100 milliseconds to 200 milliseconds. Similarly, the number of selected replicas decreases as the requested probability of timely response reduces from 0.9 to 0.5. Another observation from these two figures is that for the same QoS specification and the same LUI values, the number of replicas selected in the case of sequential ordering is higher than in the case of the weaker, FIFO ordering. All of these observations lead us to conclude that as the QoS specification becomes less stringent, Algorithm 1 is able to adapt by choosing no more than the number of replicas that are needed in order to meet the client's QoS requirement, as predicted by the probabilistic model. The less stringent a client's QoS specification is, the higher the probability that a chosen replica will meet the client's specification, and hence the fewer the replicas required to meet the specification.

We evaluated the effectiveness of our probabilistic model by experimentally determining if the selected replicas were indeed able to meet the QoS specification of the clients. As mentioned in Section IV, our algorithm guarantees that a client will always receive a response that is within the staleness threshold that it specifies. However, the algorithm provides only probabilistic guarantees about meeting the temporal requirement. In our experiments, we used the observed probability of timing failures as the metric to assess the effectiveness of our model. For each of the QoS specifications of the second client, we experimentally computed the probability of timing failures in a run by monitoring the number of requests in the run for which the client failed to receive a response within the requested deadline. This monitoring was done by the timing failure detector, which is a component of the client's gateway handler. Figure 4 shows how successful the selected replicas (shown in Figure 3) were in meeting the QoS specifications of Client2 for sequential and FIFO ordering.

The first observation from Figures 4a and 4b is that in each case, the set of replicas selected by Algorithm 1 was able to meet the client's QoS requirements successfully by maintaining the timing failure probability within the failure probability that was acceptable to the client. For example, in Figure 4a, consider the case in which the LUI is 4 seconds and the client requests a probability of timely response of at least 0.9. The probability of timing failures we observe experimentally is 0.1 for a deadline value of 100 milliseconds, and reduces as the deadlines become more flexible. We observe similar behavior for the other cases as well. Thus, for the experimental runs we conducted, the model we used was effective in predicting the set of replicas that can return the appropriate response by the client's deadline, with at least the probability requested by the client.

A second observation from Figures 4a and 4b is that as the interval between lazy updates increases, the observed probability of timing failures also increases. That is because the replica's state becomes increasingly stale as the lazy updates become less frequent. That, in turn, increases the probability that a chosen replica may have to defer its response until it has received the next lazy update, in order to meet the client's staleness threshold. The delayed response results in a higher probability of timing failures. A final observation from Figure 4 is that for the same QoS specification and lazy update frequency, the timing failure probability observed in the case of sequential ordering is higher than in the case of FIFO. The reason is that in sequential consistency, we allow both clients to update the same replicated state. The common updates cause the replicated state to become obsolete faster, thereby increasing the probability of a delayed response. Therefore, in order to achieve timing failure probabilities that are closer to those obtained using the FIFO handler, we need to propagate the lazy updates more frequently when sequential ordering is used.

B. Single-Tier vs. Hierarchical Replica Organization

In Section III-B, we mentioned that our motivation for using a two-tier replica organization was to favor read operations that can tolerate relaxed consistency to a certain degree, in exchange for better responsiveness. Our thesis was that by limiting the writes to a small subset of replicas, we can reduce the occurrence of timing failures by allowing the remaining replicas to service the read operations, which we assume to be more frequent than the write operations. However, one may argue that a write-all scheme that writes to all the replicas concurrently may result in fewer timing failures. The reason is that unlike the replicas in a two-tier scheme, the replicas in a write-all scheme can respond without waiting for a state update.

[Fig. 5: SINGLE-TIER VS. HIERARCHICAL REPLICA GROUPS. Observed probability of timing failure vs. deadline (milliseconds) for client update rates of 0.95 and 1.8, with 100% of the replicas in the primary group (single-tier) and with 40% in the primary group (hierarchical).]

Figure 5 presents the results of the experiments we carried out to compare the performance of the two schemes under different scenarios. We used the same setup with 10 replicas and 2 clients, as explained in Section V-A. In the two-tier organization, 40% of the replicas were in the primary group, and the lazy update interval for updating the secondary replicas was 2 seconds. For the write-all scheme, we used a single-tier organization in which all the replicas were designated as primaries. We carried out our experiments with two clients having the same QoS specification that was used for the results presented in Section V-A. Figure 5 compares the probability of timing failures observed for different update rates using the sequential handler. These failure probabilities were measured for Client2, which varied its deadline from 100 milliseconds to 200 milliseconds and requested a probability of timely response of at least 0.9.

From the results in Figure 5 we see that when the arrival rate of the updates from the clients is small (about 1 update per second), the performance of the two schemes is nearly identical. However, when the update arrival rate of the clients is almost doubled, the write-all scheme results in a significantly higher number of timing failures. The reason is that as the update rate increases, the client-induced load on all the replicas in the write-all scheme increases. Although the replicas in the write-all scheme do not have to wait for a state update in order to respond to a request, they have to wait for the previous write to complete before they can respond. Since all the replicas are involved in writes, the response time for the following read operations is higher. On the other hand, in the case of the two-tier organization, only 40% of the replicas are involved in write operations. Although some of the remaining 60% may have to defer their response until they receive the next state update, the value of the lazy update interval we have chosen ensures that the probability of that is small. We observed similar results comparing the two schemes using FIFO ordering guarantees. The above results justify the choice of a hierarchical replica organization to support relaxed consistency models. Although in this work we restrict ourselves to a two-tier organization of replicas to study the tradeoffs between timeliness and consistency, it should be easy to extend this architecture to multiple tiers representing intermediate degrees of staleness in the replica states.

[Fig. 6: SEQUENTIAL CONSISTENCY: VARIABLE REQUEST DELAY. (a) Observed probability of timing failure and (b) average number of replicas selected, each vs. deadline (milliseconds), for requested probabilities of 0.9 and 0.5 with request delays of 250 ms and 1000 ms.]

C. Performance Under Load

We now describe the experiments we carried out to determine how well our model adapts to meet the client's QoS specification under different client-induced loads. We varied the client-induced load by varying the request delay and the number of clients accessing a service. When we varied the client-induced load by varying the request delay, we used the same experimental setup with two clients that we described in Section V-A. The induced load on the servers is higher for smaller request delays. Figure 6 presents the results for sequential ordering, using a lazy update interval of 2 seconds, for two different values of the request delay: 1000 milliseconds and 250 milliseconds. The graphs in Figure 6a compare the observed timing failure probability, while the graphs in Figure 6b compare the average number of replicas selected.

The first observation from Figure 6 is that the observed failure probability increases as the request delay decreases from 1000 milliseconds to 250 milliseconds. That is because as the request delay approaches the mean service time of 100 milliseconds, the number of requests that experience queuing delays at the servers increases. We also observe from the graphs in Figure 6 that as the queuing delay increases, the probabilistic scheme is sometimes unable to find enough replicas to meet the deadline with the probability requested by the client. For instance, when the request delay is 250 milliseconds, the replica subset chosen by the probabilistic scheme is unable to meet deadline values ≤ 120 milliseconds with a probability ≥ 0.9, even though the request is sent to all 10 available replicas, as shown in Figure 6b. In such cases, the selection handler can inform the client that there are insufficient resources to satisfy its QoS requirement, so that the client can choose either to renegotiate its QoS specification or to send its requests later, when the system is less loaded.
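The selection behavior described above can be illustrated with a minimal Python sketch (our own simplification, assuming independent replica response times and known per-replica timeliness estimates; not necessarily the exact algorithm used by the framework):

def select_replicas(per_replica_prob, required_prob):
    # Greedy sketch: grow the subset, most responsive replicas first,
    # until the combined probability that at least one replica responds
    # within the deadline reaches the client's requested probability.
    ranked = sorted(range(len(per_replica_prob)),
                    key=lambda i: per_replica_prob[i], reverse=True)
    chosen, p_all_miss = [], 1.0
    for i in ranked:
        chosen.append(i)
        p_all_miss *= 1.0 - per_replica_prob[i]
        if 1.0 - p_all_miss >= required_prob:
            return chosen, 1.0 - p_all_miss
    # Even all replicas together fall short: the selection handler can
    # report insufficient resources, so the client may renegotiate.
    return chosen, 1.0 - p_all_miss

subset, p = select_replicas([0.6, 0.4, 0.7, 0.5], required_prob=0.9)
print(subset, p)   # [2, 0, 3], 0.94 under these assumed estimates

When even the full replica set cannot reach the requested probability, as in the 250-millisecond case above, the sketch returns all replicas along with the best achievable probability, mirroring the situation in which the selection handler reports insufficient resources.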

We also varied the client-induced load by varying the number of clients accessing a service concurrently. The induced load on the servers increased with the number of clients.

[Fig. 7: SEQUENTIAL CONSISTENCY: VARIABLE NUMBER OF CLIENTS. (a) Observed probability of timing failure vs. deadline (milliseconds), with curves for (2 clients, LUI: 2 secs), (4 clients, LUI: 2 secs), and (8 clients, LUI: 2 secs). (b) Average number of replicas selected vs. deadline (milliseconds), same curves.]

All the clients used the same request delay value of 1000 milliseconds. One of the clients specified a different deadline in each run and requested that its deadline be met with a probability ≥ 0.5. All of the remaining clients specified a deadline of 200 milliseconds in each run and requested that this deadline be met with a probability ≥ 0.1. The plots in Figure 7 evaluate the performance of the probabilistic scheme using 2, 4, and 8 clients when the lazy update interval is 2 seconds. Figure 7a shows the timing failure probability for each case, as measured at the client that specified that its probability of timely response should be at least 0.5, and Figure 7b shows the corresponding average number of replicas selected by the probabilistic scheme to meet this client's QoS specification. As expected, the observed timing failure probability increased as the number of clients requesting service increased, because of the higher queuing delays. However, we find that in each case, the model was able to adapt appropriately and select a subset of replicas that could meet the client's QoS specification.

VI. CONCLUSIONS

The experimental results we have presented show that the model we developed to assign the replicated servers to service the clients allows a QoS-aware middleware to adapt the assignment based on changes in the load and in the QoS requirements of the clients. The model also helped us understand the tradeoffs between timeliness and consistency for different consistency semantics. The results show that the frequency of lazy updates is an important parameter that allows us to tune the tradeoff between the desired levels of consistency and timeliness. The factors that need to be considered when choosing this frequency include the arrival rate of updates from the clients, the ratio between the sizes of the primary and secondary replica groups, and the timeliness and consistency specifications of the clients.
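As a rough illustration of how these factors interact, the following sketch (ours; it reuses deferral_probability() from the earlier sketch, and the candidate intervals and target are assumptions) picks the longest lazy update interval whose estimated deferral probability at the secondaries stays below a target:

def tune_lazy_interval(update_rate, staleness_bound, target=0.05,
                       candidates=(8.0, 4.0, 2.0, 1.0, 0.5)):
    # Try long (cheap) intervals first; return the first one whose
    # estimated deferral probability stays below the target.
    for interval in candidates:
        if deferral_probability(update_rate, interval,
                                staleness_bound) <= target:
            return interval
    return candidates[-1]   # even frequent pushes miss the target

A real deployment would also have to weigh the relative sizes of the primary and secondary groups and the clients' deadlines, which this one-dimensional search ignores.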

VII. FUTURE EXTENSIONS

Our research has demonstrated the need for feedback and the efficacy of simple analytical models that map a user's QoS specification onto the underlying properties of the resources. Although our probabilistic approach was developed mainly to adaptively share replicated servers in uncertain environments, similar techniques can be applied to a range of systems problems, including scheduling and other resource-allocation problems. Given the increasing demand for different services and the diversity of the requirements of client applications, such adaptive frameworks that rely on feedback-based control are likely to play an increasing role in solving a range of problems related to building dependable systems.

One limitation of our work is that it measures staleness in terms of "versions," or logical clocks. We made that choice to avoid depending on synchronized clocks. However, in environments in which clocks are synchronized, our approach can be modified to measure staleness in terms of real-time clocks. The modified approach would allow users of real-time, distributed database applications, such as traffic monitoring, online stock trading, and electronic patient records in a hospital, to access information that is no older than a specified interval of time, within a certain deadline, probabilistically. Another limitation is that we currently admit all clients, and inform a client that the observed failure probability exceeds its expectations only after the timing failures have been detected. With some modifications, we could instead use our framework to perform admission control, in order to determine which clients can be admitted based on the current availability of the replicas. Finally, it would be easy to extend our framework so that the clients can replace the probability of timely response with a higher-level specification, such as a priority or the cost the client is willing to pay for timely delivery. The middleware can then internally map these higher-level inputs to an appropriate probability value and perform adaptive replica selection, as described.
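To show the difference between the two staleness metrics concretely, here is a small Python sketch (ours; all names are hypothetical) in which a replica's freshness can be checked either by version lag, which needs no synchronized clocks, or by wall-clock age, which assumes clocks are synchronized across hosts:

import time
from dataclasses import dataclass

@dataclass
class ReplicaState:
    version: int        # logical version of the last applied update
    applied_at: float   # wall-clock time of the last applied update

def fresh_enough(replica, latest_version, max_versions=None, max_age=None):
    # Version-based staleness: how many updates the replica lags behind.
    if max_versions is not None and \
            latest_version - replica.version > max_versions:
        return False
    # Time-based staleness: how old the replica's state is, in seconds;
    # meaningful only if the hosts' clocks are synchronized.
    if max_age is not None and time.time() - replica.applied_at > max_age:
        return False
    return True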
