
Distributed View Divergence Control of Data Freshness in Replicated Database Systems

    Takao Yamashita, Member, IEEE

Abstract—In this paper, we propose a distributed method to control the view divergence of data freshness for clients in replicated database systems whose facilitating or administrative roles are equal. Our method provides data with statistically defined freshness to clients when updates are initially accepted by any of the replicas and then asynchronously propagated among the replicas, which are connected in a tree structure. To provide data with the freshness specified by clients, our method selects multiple replicas using a distributed algorithm so that they statistically receive all updates issued up to a specified time before the present time. We evaluated by simulation the distributed algorithm that selects replicas for view divergence control in terms of controlled data freshness and time, message, and computation complexity. The simulation showed that our method achieves more than a 36.9 percent improvement in data freshness compared with epidemic-style update propagation.

Index Terms—Data replication, weak consistency, freshness, delay, asynchronous update.

    1 INTRODUCTION

WITH the progress of network computing technologies, data processing occupies an important role in various applications such as electronic commerce, decision-support systems, and information dissemination. Data replication methods [1], [2], [3], [4] are effective for achieving the required scalability and reliability for data processing. They are categorized into two types according to the timing of their data updates: eager and lazy [5]. In eager replication, data on replicas are simultaneously updated using read-one-write-all or quorum consensus [1], [2]. In lazy replication, an update is initially processed by one replica and then gradually propagated to the other replicas [6], [7], [8]. Data replication methods can also be categorized into two types, master and group, according to the number of replicas that are candidates for initially accepting update requests from clients [5]. In group replication, multiple replicas can initially process updates, while in master replication a single replica initially accepts updates for a data object. Lazy-group replication is most advantageous for achieving high scalability and availability, but it cannot offer strict data consistency. Many applications do not require strict data consistency but allow weak data consistency. In addition, scalability of data processing is essential in large-scale systems. Hence, lazy-group replication is suitable under such conditions. Lazy-group replication has a server-based peer-to-peer architecture because the facilitating or administrative roles of replicas can be equal.

Recently, cross-enterprise business-to-business collaboration, such as multiorganization supply chain management and virtual Web malls, has become a key industry trend of information technology [9]. Therefore, large-scale resource sharing using Grid technologies is needed not only for science and technical computing but also for industrial areas [9]. As a result, information processing infrastructures for enterprises are distributed. To provide services to customers according to their requests, enterprises have to process a large number of transactions from clients. In lazy-group replication, to increase the processing performance of update transactions, the transactions are aggregated into one when they are propagated among replicas. Therefore, deferring update propagation for a particular period is necessary to improve the processing performance of update transactions. In addition, the greater the number of replicated database systems, the longer the update propagation delay. A number of applications, such as decision-support systems, have to retrieve distributed data reflecting such updates from multiple sites depending on user requirements. In decision-support systems, freshness is one of the most important attributes of data [10], [11]. Hence, in lazy-group replication, the data freshness obtained by clients should be controlled so that it satisfies client requirements.

In this paper, we propose a distributed method to control the view divergence of data freshness for clients in replicated database systems that have a server-based peer-to-peer architecture. Our method enhances the centralized approach to view divergence control of data freshness [12], [13] by improving its computation complexity. To control data freshness, which is statistically defined, our method selects multiple replicas, called a read replica set, using a distributed algorithm so that they statistically receive all updates issued up to a specified time before the present time with a particular probability. Our method then provides the data that reflect all updates received by the selected replicas.


. The author is with NTT Information Sharing Platform Laboratories, NTT Corporation, 3-9-11 Midori-Cho, Musashino-Shi, Tokyo 180-8585, Japan. E-mail: [email protected].

Manuscript received 29 Feb. 2008; revised 29 Sept. 2008; accepted 12 Nov. 2008; published online 21 Nov. 2008. Recommended for acceptance by S. Wang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2008-02-0116. Digital Object Identifier no. 10.1109/TKDE.2008.230.


The remainder of this paper is organized as follows: We describe the motivation of our research in Section 2. Section 3 briefly introduces the centralized view divergence control method [12], [13]. This includes a centralized algorithm for determining a read replica set to control the data freshness. Section 4 describes the distributed view divergence control method, which includes the distributed algorithm used to determine a read replica set. In Section 5, we evaluate by simulation the data freshness that can be controlled by the proposed method and compare it with related work. We then evaluate the time, message, and computation complexity of the proposed distributed algorithm. In Section 6, we compare our method with related work. Section 7 concludes this paper with a summary of the main points.

    2 MOTIVATION

Lazy-group replication is most suitable for large-scale data processing systems, as described in Section 1. In lazy-group replication, data freshness depends on replicas. We are motivated to control the view divergence of data freshness in lazy-group replication by two types of applications for distributed computing infrastructures and enterprise systems. The first type of application is data warehousing. In data warehousing, data are first extracted from operational databases such as online transaction processing (OLTP) systems. Such data are then hierarchically gathered for analysis and reporting in a data warehouse [14]. A data warehouse is constructed separately from operational databases in order to process complex queries. OLTP systems, however, process short query, insert, and update transactions [14]. When we adopt lazy-group replication to achieve high scalability of OLTP systems and data sharing among multiple organizations, inserts and updates are asynchronously propagated among replicas. Data extracted from replicas can vary depending on the source replicas due to the asynchronous arrival of inserts and updates. To improve data freshness for data warehousing, we have to gather such extracted data from distributed systems. This is because data quality is essential for data warehouses. The users of data warehouses make decisions using data satisfying their requirements of data quality. One measure of data quality is timeliness [15]. For example, decision makers sometimes need the entire sales of a product for every hour, day, month, quarter, and year [15]. If they need the entire sales of periods 1, 2, 3, and 4 at a particular time in period 5, as shown in Fig. 1, decision makers then have to retrieve data reflecting updates issued by the end of period 4. Therefore, control of data freshness is needed in decision making using data warehouses: when and how new data are needed depends on the business, and propagating all updates in a short time so that the freshness of local data can meet any business request is very costly.

The second type of application is infrastructure services in distributed systems, such as directory services [9], [16], which handle a large number of short queries, inserts, and updates. When an application looks up a new entry stored in those services and the lookup fails, the data it accessed may not have been up-to-date. A new entry should not always be distributed quickly, for the scalability of data processing, because a large majority of users might not always look up a new entry. In such a case, an application and/or its users can estimate possible reasons for the lookup failure and should try to search for newer data. Another example of this type is mobile host tracking, because a host that tries to communicate with a mobile host can estimate whether the obtained location of the mobile host is already old data. A host needs the new location data of a mobile host right after it moves.

Cross-enterprise business-to-business collaboration is a key industry trend of information technology [17], [9]. There are two options for a system architecture controlling data freshness in enterprises: centralized and decentralized architecture. Both are available for controlling data freshness when they are used in only one enterprise. For cross-enterprise business-to-business collaboration, the available system architecture depends on the relation among the enterprises. If there is no central organization that administers the business activities of all enterprises or the data shared among them, adopting a centralized architecture for controlling data freshness is difficult. Therefore, a decentralized architecture is advantageous and flexible because it is available in a variety of relations among enterprises.

    3 CENTRALIZED VIEW DIVERGENCE CONTROL

    3.1 System Architecture

The system architecture used in our method is shown in Fig. 2. This system processes two types of transactions: refer and update. To process them, there are three types of nodes: client, front-end, and replica nodes. Each replica node is composed of more than one database server for fault tolerance, as described in [12], [13]. Replicas are connected by logical links.

As described in Section 1, our system is designed for enterprise and cross-enterprise systems. A system architecture for enterprise use must be simple for easy administration and operation. For example, when update propagation troubles occur in complicated systems, tracing where and how they occurred is difficult. Cross-enterprise systems in particular require simple interfaces between enterprises for troubleshooting. In addition, a simple system architecture leads to simple estimation of update delays for controlling data freshness.


    Fig. 1. Need for data freshness in data warehousing.

Fig. 2. System architecture used in view divergence control for lazy-group replication.

From the viewpoint of fault tolerance, the system architecture should have redundancy of processing nodes and update propagation paths. To overcome the disadvantage of a tree-topology network, in which there is only one path between replicas, every replica in our system is composed of multiple database servers to tolerate node failures. In addition, we can assume that network failures are recovered by routing protocols [18]. Therefore, we use a tree structure for update propagation among replicas.

An update that changes data in replicated database systems is propagated through the links of the tree as a refresh transaction into which some updates are aggregated. How updates are aggregated into one refresh transaction depends on the application. For example, in mobile host tracking, old data are overwritten by the newest data; as a result, old updates are omitted. For decision-support systems that handle the sales of a product in an area, a refresh transaction includes an update of the total sales contained in all aggregated updates.

In our system, a replica may join and leave the tree-topology network. When a replica joins the tree-topology network, the replicas' administrator determines how it should be connected based on the distances between replicas.

A client retrieves and changes the values of data objects stored in replicated database systems through a front-end node. The functions of a client are to send refer and update requests to a front-end node to retrieve and change data, respectively, and to receive the processing results in reply from the front-end node. A refer request includes the degree of freshness required by the client. The degree of freshness, called read data freshness (RDF), is statistically defined in this system. RDF is formally described in Section 3.3. When clients request the same RDF, our method restricts the differences in RDF accordingly. We call this difference in RDF among clients the view divergence.

A front-end node is a proxy for clients to refer to and update data in replicas. When a front-end node receives an update from a client, it simply transmits the update to a replica. In Fig. 2, when an update from client c1 is received by front-end f1, it is forwarded to replica r3. When a front-end node receives a refer request from a client, it calculates data values satisfying the degree of freshness required by the client after it has retrieved the data values and transaction processing logs of one or more replicas. We call this set of replicas accessed by a front-end node a read replica set. In Fig. 2, when front-end f3 receives a refer request from client c3, it retrieves data values and transaction processing logs from replicas r3, r9, and r12.

To process refer and update transactions, a replica has five functions:

1. processing refer and update requests to its local database;
2. processing asynchronously propagated updates in refresh transactions;
3. propagating update and refresh transactions to other replicas;
4. calculating read replica sets for multiple degrees of freshness by cooperating with other replicas; and
5. informing all front-end nodes about whether or not it is a read replica for a degree of freshness.

In our system, the calculation of read replica sets for multiple degrees of freshness is performed periodically using samples of update delay from a past period whose statistical characteristics are determined to be the same as those of the current period using nonparametric statistical methods. The periods whose statistical characteristics are compared are determined using knowledge of the application domain so that the statistical characteristics do not change much within each period. Nonparametric statistical methods are distribution-free statistical methods, such as the Mann-Whitney U test or the median test [19], [20], [21]. The Mann-Whitney U and median tests verify the similarity between two sets of samples from the viewpoints of rank sum and median, respectively. Here, the samples of update delay used in the nonparametric statistical methods are acquired by inserting time stamps of the starting time of update propagation in a refresh transaction. Read replica sets are not calculated every time a refer request arrives at a front-end node, because calculating a read replica set for every refer request would incur a high transaction-processing overhead and a comparatively long time to complete processing a refer request. Let $T_i$ be the degrees of freshness for which read replica sets are calculated, where $i$ is an integer and $T_i < T_{i+1}$ for any $i$. Here, the smaller the $T_i$, the newer the retrieved data. When a client requests a degree of freshness $T_p$, the read replica set for the $T_i$ that satisfies $T_i \le T_p < T_{i+1}$ is selected to provide data with the degree of freshness required by the client [12]. $T_i$ should be selected so that application requirements are satisfied, because when and how new data are needed depends on the application and the timing of data acquisition, as described in Section 2.
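To make this selection rule concrete, the following minimal sketch (in Python, with illustrative names and values that are not from the paper) picks the precomputed read replica set for a requested degree of freshness $T_p$ by searching the sorted degrees $T_i$ for the one satisfying $T_i \le T_p < T_{i+1}$.

```python
# A minimal sketch, not the paper's pseudocode: `precomputed` holds the degrees
# T_i (sorted ascending) and the read replica set calculated for each of them.
import bisect

def select_read_replica_set(precomputed, tp):
    """precomputed: sorted list of (T_i, read_replica_set) pairs, T_i ascending."""
    degrees = [t for t, _ in precomputed]
    i = bisect.bisect_right(degrees, tp) - 1   # largest i with T_i <= tp
    if i < 0:
        raise ValueError("requested freshness is newer than any precomputed degree")
    return precomputed[i][1]

# Illustrative values: read replica sets precomputed for 10, 60, and 300 seconds.
sets = [(10, {"r3", "r9", "r12"}), (60, {"r3", "r9"}), (300, {"r3"})]
print(select_read_replica_set(sets, 45))   # uses the set for T_i = 10
```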

    3.2 Available Transactions

In lazy-group replication, any node can initially process an update. Updates are then asynchronously propagated among replicas. Because the order of transactions processed by replicas can vary, updates may conflict with each other. Conflicts between updates cause inconsistent states of data objects. To eliminate inconsistency, various techniques are used to process updates and record deletions in lazy-group replication. For update processing, attributes associated with update requests and data objects, such as time stamps and version numbers, are used [5]. For eliminating inconsistent states caused by differences in the order of updates and record deletions among replicas, a method called a death certificate or tombstone is used. It offers data consistency among replicas by keeping the time stamps of deleted objects for a particular period [22].

In our system, when a front-end node provides the values of data objects that satisfy the degree of freshness required by a client, it behaves like a replica to process refresh transactions after receiving data values and transaction processing logs from the read replicas. In general, a front-end node first selects one of the read replicas. Second, to distinguish transactions that have not yet been received by the selected read replica but have already been received by at least one of the other read replicas, the front-end node compares the transaction logs received from the read replicas.


Then, it generates the refresh transactions and record deletions that have not yet reached the selected read replica and applies them to the data values received from that replica. For some applications, the role of a front-end node is simpler than described above. For example, in the location tracking of a mobile host, the value of a data object is replaced by a replica with a newer value. Therefore, a front-end node selects the newest value of a data object among those in the read replicas. Because a front-end node performs the same process as a replica does when it provides data with the degree of freshness required by clients, our method does not further restrict the transactions available in lazy-group replication.

    3.3 Read Data Freshness

As a measure of data freshness, there are two options: the probabilistic degree and the worst boundary of data freshness. When we use the probabilistic degree of freshness, updates issued before a particular time are reflected in the obtained data with some probability. On the other hand, data reflect all updates issued before a particular time when data freshness is defined by its worst boundary. There are a number of applications whose users are tolerant of some degree of error inherent in the data. For some decision-support processes, knowledge workers can tolerate some degree of error and omission if they are aware of the degree and nature of the error [15]. In addition, the probabilistic degree of data freshness can represent its worst boundary, which is the probabilistic degree of data freshness with probability 1. Therefore, we use the probabilistic degree of data freshness as its measure, referring to this measure as RDF. The degree of RDF represents a client's requirements in terms of the view divergence. The formal definition of RDF, or $T_p$, is $T_p = t_c - t_e$, where time $t_e$ is such that all updates issued before $t_e$ in the network are reflected in the data acquired by a client with probability $p$, and $t_c$ is the present time. If the last update reflected in the acquired data was at $t_l$ (

  • 3.4.2 Terminology

We define four terms for explaining our algorithm: a range originator, a read replica (set), a classified replica, and a mandatory replica. Table 1 gives their formal definitions. A range originator for replica r is a replica from which a refresh transaction can reach replica r within time $T_p$ with probability $p$, where $T_p$ is the degree of RDF required by a client. A range originator can be determined by statistical estimation methods [19], [20], [21].

For example, a range originator set can be calculated using samples of update delay and the nonparametric estimation method described in Appendix C, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230. Let $O_i$ be the set of range originators for replica $i$. Then, we say that replica $j$ is covered or noncovered by replica $i$ if condition $j \in O_i$ or $j \notin O_i$ is satisfied, respectively. In addition, we say that replica $i$ is superior to, inferior to, or equivalent to replica $j$ in range-originator-set (ROS) capability when condition $O_i \supset O_j$, $O_i \subset O_j$, or $O_i = O_j$ is satisfied, respectively.

A read replica is a replica to which a front-end node sends a refer transaction to obtain the value of a data object. A read replica set is the set of read replicas for a degree of RDF required by a client. The union of the range originator sets of all read replicas in a read replica set is the set of all replicas. Our algorithm calculates a minimum read replica set.

A classified replica is a replica whose set of range originators is not a subset of that of any other replica. A classified replica is a candidate for an element of a read replica set.

A mandatory replica is a classified replica that has one or more range originators not included in the range originator set of any other classified replica. A mandatory replica is an indispensable element of the minimum read replica set. As a result, a mandatory replica becomes a read replica.
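As a concrete reading of these definitions, the following sketch (plain Python sets keyed by replica id, with illustrative names; not the paper's pseudocode) expresses the classified-replica, mandatory-replica, and read-replica-set conditions as set operations.

```python
# ros: dict mapping replica id -> its range originator set (a Python set).

def is_classified(i, ros):
    """O_i is not a proper subset of any other replica's ROS; replicas with equal
    ROSs all pass here, and the algorithm later keeps only one of them."""
    return not any(ros[i] < ros[j] for j in ros if j != i)

def is_mandatory(i, ros):
    """A classified replica that has a range originator no other classified
    replica has."""
    if not is_classified(i, ros):
        return False
    others = [ros[j] for j in ros if j != i and is_classified(j, ros)]
    return bool(ros[i] - set().union(*others))

def covers_all_replicas(candidate, ros):
    """A read replica set must cover every replica with the union of its ROSs."""
    return set().union(*(ros[r] for r in candidate)) == set(ros)
```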

    3.4.3 Calculating a Minimum Read Replica Set

In our method, we have to choose a read replica set such that for any replica r, at least one element of the set can receive updates from replica r within $T_p$ with probability $p$. This means that the union of the range originator sets of all read replicas is the set of all replicas. If a replica could have an arbitrary set of replicas as its range originator set, the problem of calculating a minimum read replica set would be an instance of the set-covering problem, represented as follows. An instance $(X, F)$ of the set-covering problem consists of a finite set $X$ and a family $F$ of subsets of $X$ such that every element of $X$ belongs to at least one subset in $F$ (i.e., $X = \bigcup_{S \in F} S$). The problem is to find a minimum-size subset $C \subseteq F$ [27]. The set-covering problem, which has several variations including the hitting-set problem, is NP-hard [27]. In our method, the set of replicas and the range originator sets correspond to $X$ and $F$, respectively. However, there is a certain relationship among the range originator sets of replicas that depends on the location of the replicas in the network for update propagation, because we use a tree-topology network for update propagation and the range originator set of a replica has to follow the properties of probabilistic delay described below. Hence, the problem of calculating a minimum read replica set in our method is not the set-covering problem, because a range originator set cannot have arbitrary elements, which is apparent from Theorem 1 and Lemma 1 described in Sections 4.3.1 and 4.3.2, respectively.

The probabilistic delay has properties different from those of distance in a weighted graph. For example, when there is a replica $j$ on the path from $i$ to $k$, the distance from replica $i$ to $k$ is the sum of the distances from replica $i$ to $j$ and from $j$ to $k$. However, let $d_{xy}$ be the upper confidence limit, with probability $p$, of the delay time from replica $x$ to $y$. Then, $d_{ik}$ is not the sum of $d_{ij}$ and $d_{jk}$. As another example, when there is a replica $k$ along the paths from replica $i$ to $l$ and from $j$ to $l$, and when condition $d_{ik} \le d_{jk}$ is satisfied, $d_{il}$ is not always equal to or less than $d_{jl}$. Our algorithm uses the following two properties that arise from the statistical delay distribution of update propagation [12]. The first property is that if condition $d_{ik} \le T_p$ is satisfied, then conditions $d_{ij} \le T_p$ and $d_{jk} \le T_p$ are satisfied for any replica $j$ that is along the path from replica $i$ to $k$. The second property is that if condition $d_{ij} > T_p$ is satisfied, then condition $d_{ik} > T_p$ is satisfied for any replica $k$ such that replica $j$ is along the path from replica $i$ to $k$.
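The first property implies that every replica on the path from a range originator to replica k is itself a range originator for k, so a range originator set forms a connected subtree around k. The sketch below exploits this with a pruned traversal of the propagation tree; it is an illustration, not the paper's procedure, and `delay_limit(s, k)` stands for an assumed helper that returns the upper confidence limit, with probability p, of the delay from s to k (for example, estimated from delay samples as in Appendix C).

```python
from collections import deque

def range_originators(k, adjacency, delay_limit, tp):
    """adjacency: dict replica -> set of adjacent replicas in the propagation tree."""
    originators = {k}                # a replica trivially covers itself
    queue = deque([k])
    while queue:
        r = queue.popleft()
        for s in adjacency[r]:
            if s in originators:
                continue
            if delay_limit(s, k) <= tp:
                originators.add(s)   # s reaches k within Tp with probability p
                queue.append(s)      # keep exploring behind s
            # else: any replica behind s reaches k only through s, so by the
            # first property it cannot reach k within Tp either; prune the branch.
    return originators
```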

The flow of the centralized algorithm to calculate a minimum read replica set is shown in Fig. 3a, and its pseudocode is given in Fig. 4. The main part of the algorithm is composed of three processes: classified-replica, mandatory-replica, and minimum-subtree determination. The three processes are iterated until one replica covers all replicas in the tree. Range originator sets are first calculated in step 2) of Fig. 4 before the three processes are performed. The range originator set of replica $i$ is calculated by removing replicas covered by mandatory replicas from its original range originator set. In classified-replica determination, which corresponds to steps 3) and 4) in Fig. 4, replica $i$ is excluded from the candidates for read replicas when $O_i$ is a subset of the range originator set of another replica, because a replica whose range originator set is a superset of $O_i$ is more capable of decreasing the number of read replicas.


TABLE 1. Formal Definition of Terminology

If multiple replicas whose range originator sets are equal remain, all but one of them are excluded as candidates for read replicas because they have the same capability to cover replicas. The remaining replicas are then classified replicas, which are candidates for read replicas. In mandatory-replica determination, which corresponds to steps 5) and 6) in Fig. 4, classified replica $j$ is selected as a mandatory replica when it has one or more elements that are not included in $O_k$ for any other classified replica $k$, where $j$ is not equal to $k$. This is because we cannot replace classified replica $j$ with any other classified replica to construct a minimum read replica set. In minimum-subtree determination, which corresponds to steps 7), 8), and 9) in Fig. 4, we calculate a minimum subtree, defined as the subtree that includes the replicas not covered by mandatory replicas and has the minimum number of edges. All calculated mandatory replicas are added to the read replica set. We iterate classified-replica, mandatory-replica, and minimum-subtree determination to decrease the size of the tree until one replica $l$ covers all remaining replicas in the tree, as described in step 1) of Fig. 4. Finally, we add replica $l$ to the read replica set in step 10) of Fig. 4.
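The following Python sketch mirrors this iteration under simplifying assumptions; it is not the pseudocode of Fig. 4. Range originator sets are plain sets keyed by replica id, and the minimum-subtree step is reduced to dropping replicas that are already covered, ignoring the connecting replicas of the tree.

```python
def minimum_read_replica_set(ros):
    """ros: dict replica id -> its range originator set."""
    read_replicas = set()
    uncovered = set(ros)                     # replicas not yet covered
    remaining = dict(ros)                    # candidates still in the subtree
    while remaining:
        # Step 2): remove replicas already covered by mandatory replicas.
        pruned = {i: o & uncovered for i, o in remaining.items()}
        # Step 1): stop when one replica covers everything that is left.
        full = [i for i in pruned if pruned[i] == uncovered]
        if full:
            read_replicas.add(full[0])       # step 10)
            break
        # Steps 3)-4): classified replicas; keep one of each duplicate ROS.
        classified = {}
        for i, o in pruned.items():
            if any(o < p for p in pruned.values()):
                continue                     # a strict superset exists
            if not any(o == q for q in classified.values()):
                classified[i] = o
        # Steps 5)-6): mandatory replicas cover something nobody else covers.
        mandatory = set()
        for i, o in classified.items():
            others = set().union(*[q for j, q in classified.items() if j != i])
            if o - others:
                mandatory.add(i)
        if not mandatory:                    # cannot happen on a tree (see text);
            break                            # guard for this simplified setting
        read_replicas |= mandatory
        uncovered -= set().union(*[pruned[i] for i in mandatory])
        # Steps 7)-9), simplified: keep only still-uncovered replicas.
        remaining = {i: pruned[i] for i in pruned if i in uncovered}
    return read_replicas
```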

The computation complexity of this centralized algorithm is $O(n^3)$, as proved in [12], where $n$ is the number of replicas. This computation complexity is determined by the number of range-originator-set comparisons in each iteration and by the number of iterations of the three processes described above, which are $O(n^2)$ and $O(n)$, respectively. In classified-replica and mandatory-replica determination, every pair of range originator sets of replicas and of classified replicas, respectively, needs to be compared in every iteration. The key point of the low computation complexity is that there is at least one mandatory replica, which corresponds to a read replica, in every iteration. This is caused by the use of a tree-topology network for update propagation and the two properties of probabilistic delay time described in the second paragraph of this section. On the other hand, the centralized algorithm apparently cannot solve the set-covering problem with low computation complexity because it cannot always find one or more mandatory replicas in every iteration. The computation complexity of the centralized algorithm is acceptable but should be improved for a large number of replicas. Therefore, we decrease the computation complexity of the algorithm to calculate a minimum read replica set in the distributed view divergence control of data freshness described in Section 4.

    4 DISTRIBUTED VIEW DIVERGENCE CONTROL

4.1 Assumptions for Distributed View Divergence Control

In addition to the assumptions for the centralized view divergence control described in Section 3.4.1, we assume the following conditions:

1. A replica can directly communicate with only its adjacent replicas.
2. A replica knows the adjacent replica to which it should send a message destined for a particular replica. This can be achieved by various routing protocols [18].
3. A front-end node knows the set of all replicas. This can be accomplished using diffusing computation and its termination detection [28], [29].
4. Every replica knows the set of all front-end nodes.

On termination of the distributed algorithm, all replicas inform all front-end nodes about whether or not they are read replicas.


    Fig. 4. Centralized algorithm to calculate a minimum read replica set.

Fig. 3. Flow of centralized and distributed algorithms to calculate a minimum read replica set.

A front-end node can determine when it has learned the set of all read replicas because it knows the set of all replicas and every replica informs the front-end nodes whether or not it is a read replica.

    4.2 Overview

The distributed view divergence control method described in this section is based on the centralized view divergence control method described in the previous section, in which only the calculation of a minimum read replica set requires centralized operation. Therefore, a distributed view divergence control method needs a distributed algorithm to calculate a minimum read replica set, which includes classified-replica, mandatory-replica, and minimum-subtree determination. For efficiency, distributed view divergence control achieves low time and computation complexity of the algorithm to calculate a minimum read replica set. In particular, we decrease the computation complexity of the algorithm to calculate a minimum read replica set below that of the centralized view divergence control, which is $O(n^3)$. We accomplish low time and computation complexity by using the relation among the range originator sets of replicas caused by the topology of update propagation and the probabilistic delay. When the centralized algorithm to calculate a minimum read replica set is executed in a distributed manner, classified-replica, mandatory-replica, and minimum-subtree determination are iterated a number of times. Therefore, each replica needs to detect the termination of the current process to determine when it should start the next process.

Fig. 3b shows the distributed algorithm executed by every replica to calculate a minimum read replica set. In the centralized algorithm, the steps that are not included in any of classified-replica, mandatory-replica, and minimum-subtree determination are steps 1), 2), and 10) in Fig. 4. In the distributed algorithm, we incorporate steps 1) and 2) into minimum-subtree determination. On the other hand, step 10) performs the final selection of a mandatory replica when there is at least one replica that covers all remaining replicas. This can be achieved by executing classified-replica and mandatory-replica determination. Therefore, to achieve step 10), classified-replica and mandatory-replica determination are executed one more time in the distributed algorithm than in the centralized algorithm. Then, the size of the subtree for the next iteration is zero, which means the termination of the distributed algorithm.

A distributed algorithm is evaluated in terms of time and message complexity [29]. Time complexity is the time an algorithm takes to terminate. Message complexity is the total number of messages exchanged among nodes in the worst case. In this paper, we measure time complexity in the same way as described in [29], where it is measured using the upper bound of time for each task of each process (denoted by $b_p$) and for the single task of each channel (denoted by $b_c$). $b_p$ includes time for protocol processing, transaction processing, and logging for recovery. $b_c$ includes time for packet forwarding and message transmission. The time complexity of the distributed algorithm to determine a minimum read replica set is an important measure because our algorithm needs to calculate read replica sets for multiple degrees of freshness periodically. Therefore, we design the distributed algorithm to calculate a minimum read replica set so that its time complexity is as small as possible. When the hop count distance between replicas $i$ and $j$ is $h$, it takes time $h(b_p + b_c) + b_p$ for replica $i$ to interact with replica $j$ in one direction. Here, let $L_i$ be the set of replicas with which replica $i$ has to interact in process $p$. We call the distance between replica $i$ and the replica in $L_i$ furthest from it the maximum distance in process $p$ by replica $i$. For low time complexity, we should decrease the maximum distance in classified-replica, mandatory-replica, and minimum-subtree determination by replica $i$ for any replica $i$ as much as possible.
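A small worked example of the one-way interaction time, assuming the reading $h(b_p + b_c) + b_p$ given above (the numeric values are illustrative only):

```python
def one_way_time(h, bp, bc):
    # h channel traversals plus processing at the sender and at each of the
    # h replicas that forward or finally receive the message: h*(bp + bc) + bp
    return h * (bp + bc) + bp

print(one_way_time(3, bp=0.002, bc=0.010))   # approximately 0.038 (units are illustrative)
```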

    4.3 Classified-Replica Determination

In the classified-replica determination of the centralized algorithm, the range originator set of every replica is compared with those of all the other replicas. If a replica interacted with all the other replicas in distributed classified-replica determination, it would take time $D(b_p + b_c) + b_p$ in the worst case, in which a replica receives information from the furthest replica, where $D$ is the diameter of the tree for update propagation. However, when new data are required, the range originator set of a replica tends to include only replicas close to it. In other words, the range originator sets of replicas far from each other tend to have no intersection. This means that the range originator set of a replica does not need to be compared with that of a replica far from it for classified-replica determination. Therefore, the time complexity of classified-replica determination in the distributed algorithm can be improved by eliminating comparisons between the range originator sets of replicas far from each other. In addition, decreasing the number of range-originator-set comparisons decreases the computation complexity. To decrease the number of range-originator-set comparisons, we divide distributed classified-replica determination into two phases. In the first phase, a replica compares its range originator set with those of its adjacent replicas to determine whether or not it is a classified replica. In the second phase, which is performed only when necessary, replicas with the same range originator set coordinate with each other to determine whether or not they are classified replicas.

    4.3.1 First Phase

To decrease the number of range-originator-set comparisons, in the first phase of classified-replica determination, a replica compares its own range originator set with only those of its adjacent replicas to determine whether or not it is a classified replica. When replicas $j$ are the adjacent replicas of replica $i$, we divide the relationship between replicas $i$ and $j$ into three cases according to which of the following three conditions is satisfied:

$$\forall j:\; O_i \not\subseteq O_j, \qquad (1)$$

$$\exists j:\; O_i \subset O_j, \qquad (2)$$

$$\exists j:\; O_i = O_j \;\wedge\; \forall k\,(k \in N_i \wedge k \neq j):\; O_i \not\subseteq O_k, \qquad (3)$$

where $O_i$ is the range originator set and $N_i$ is the set of adjacent replicas of replica $i$, and $j$ ranges over $N_i$. Condition (3) is the condition that neither condition (1) nor (2) is satisfied.


When condition (1) is satisfied, the following theorem indicates that $O_i$ is not equal to and is not a subset of $O_k$ for any replica $k$ other than replica $i$. Therefore, replica $i$ is a classified replica according to the definition of classified replicas. The theorems and lemma in the remainder of this paper are proved in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.

Theorem 1. Suppose that replicas $i$ and $j$ are adjacent and replica $k$ is covered by replica $i$ but not by replica $j$. For any replica $m$ such that the path from replica $i$ to $m$ includes $j$, replica $m$ does not cover replica $k$.

When condition (2) is satisfied, replica $i$ is, by definition, not a classified replica. Therefore, when condition (1) or (2) is satisfied, a replica can determine whether or not it is a classified replica by comparing its range originator set with only those of all its adjacent replicas.

When condition (3) is satisfied, replica $i$ cannot determine whether or not it is a classified replica in the first phase. In this case, replica $i$ has to cooperate with one or more replicas that are not its adjacent replicas in the second phase of classified-replica determination, as described in Section 4.3.2. In addition, when condition (2) is satisfied, a replica also performs part of the second phase of classified-replica determination to help other replicas that have the same range originator set identify themselves as nonclassified replicas.

In the first phase of classified-replica determination, every replica informs all its adjacent replicas about its range originator set. We call a message used in the first phase a range originator notification (RON) message. An RON message sent by replica $i$ includes $O_i$ and $i$ and is transmitted in each direction of each link. It takes time $2b_p + b_c$ for an RON message to be transmitted on a link and processed by the replica that receives it. Therefore, the time and message complexity of the first phase are $O(2b_p + b_c)$ and $O(N_v)$, respectively, where $N_v$ is the number of replicas. The termination condition of this phase is that a replica has received RON messages from all its adjacent replicas.

The first phase of classified-replica determination comprises $N_a$ RON message transmissions, receipts, and range-originator-set comparisons, where $N_a$ is the number of adjacent replicas. Hence, the computation complexity of the first phase of classified-replica determination in one iteration is $O(N_a)$. The pseudocode of the first phase of classified-replica determination is given in Appendix B.1, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.
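A minimal sketch of the first-phase decision, assuming the adjacent replicas' range originator sets have already arrived in RON messages; the function name and the returned labels are illustrative and are not taken from the pseudocode in Appendix B.1.

```python
def first_phase_decision(o_i, neighbor_sets):
    """o_i: this replica's ROS; neighbor_sets: adjacent replica id -> its ROS."""
    if any(o_i < o_j for o_j in neighbor_sets.values()):
        return "not classified"          # condition (2): a strict superset exists
    if all(not (o_i <= o_j) for o_j in neighbor_sets.values()):
        return "classified"              # condition (1): no neighbour contains O_i
    return "second phase needed"         # condition (3): some neighbour has O_j = O_i
```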

    4.3.2 Second Phase

The situation where condition (3) is satisfied by one or more replicas can be divided into two cases by using Theorem 2, which is proved by Lemma 1 below: 1) the minimum subtree that includes the replicas with the same range originator set or supersets thereof consists of replicas with the same range originator set or supersets thereof, and 2) the minimum subtree that includes the replicas with the same range originator set or supersets thereof consists solely of replicas with the same range originator set. We call cases 1) and 2) the inequivalent-replica and equivalent-replica cases, respectively.

Lemma 1. If replica $k$ is a range originator for replicas $i$ and $j$, then replica $k$ is a range originator for any replica $l$ along the path between replicas $i$ and $j$.

Theorem 2. The situation where condition (3) is satisfied by one or more replicas can be divided into an inequivalent-replica or an equivalent-replica case.

Figs. 5a and 5b show an inequivalent-replica case and an equivalent-replica case, respectively. In Fig. 5a, replicas 1, 4, 5, and 6 have the same range originator set, while that of replica 2 is a superset of it. In this example, replicas 5 and 6 satisfy condition (3). In Fig. 5b, replicas 1, 2, 4, 5, and 6 have the same range originator set, so all of them satisfy condition (3). In the second phase, different means are used to determine whether a replica is a classified replica according to whether an inequivalent-replica or an equivalent-replica case applies, as described below.

Inequivalent-replica case. When replica $i$ satisfies condition (3) in an inequivalent-replica case, there are one or more replicas that are superior to replica $i$ in ROS capability. Therefore, replica $i$ is not a classified replica. A replica with the same range originator set as $O_i$ needs to interact with either of two types of replicas to determine that it is not a classified replica: 1) a replica superior to replica $i$ in ROS capability or 2) a replica that is equivalent to replica $i$ in ROS capability and satisfies condition (2). By interacting with replicas of types 1) and 2), replica $i$ can determine that it is not a classified replica as follows. When replica $i$ interacts with a replica of type 1), replica $i$ can learn of the existence of a replica that is superior to itself in ROS capability. When replica $i$ interacts with a replica of type 2), replica $i$ can learn that a replica with the same range originator set as $O_i$ has determined itself to be a nonclassified replica. Theorem 3 described below indicates that replica $i$ should interact with a replica of type 2).

Theorem 3. When replica $j$ is superior to replica $i$ in ROS capability, there exists a replica $k$ that satisfies the following conditions: 1) replica $k$ is equivalent to replica $i$ in ROS capability; 2) replica $i$ can reach replica $k$ along a path consisting of only replicas equivalent to replica $i$ in ROS capability; and 3) replica $k$ satisfies condition (2).

Theorem 3 means that replica $i$ should interact with a replica of type 2) instead of type 1) in inequivalent-replica cases to decrease the maximum distance in the second phase by replica $i$.


    Fig. 5. Two cases in second phase of classified-replica determination.

In general, in an inequivalent-replica case, there are one or more replicas of type 2) because there can be one or more replicas superior to replica $i$ in ROS capability. Let $B_i$ be the set of replicas of type 2) for replica $i$ in an inequivalent-replica case. Replica $i$ should interact with a replica in $B_i$ so that it can determine as early as possible that it is not a classified replica. We call a message used in an inequivalent-replica case a nonclassified-replica notification (NCRN) message; it is initially sent by any replica in $B_i$ and includes that replica's range originator set. This message informs the receiver that a replica with the range originator set carried in the NCRN message is not a classified replica. Replica $i$ receives one or more NCRN messages because every replica in $B_i$ originates an NCRN message. When replica $i$ first receives one of the NCRN messages with the same range originator set as its own, it can determine that it is not a classified replica. Replica $i$ thus learns as early as possible that it is not a classified replica, and an NCRN message sent by any replica in $B_i$ carries the same information.

In an inequivalent-replica case, there are one or more subtrees $T_l$ consisting only of replicas with the same range originator set. From Theorem 3, NCRN messages should be propagated among the replicas in $T_l$. The propagation of NCRN messages in $T_l$ can be achieved by broadcasting. Because there is only one path between any pair of replicas in a tree, a node operation for broadcasting an NCRN message in a tree is simply to send the message to all its adjacent nodes except the one from which it received the original message. When the size of $B_i$ is more than one, any replica $p$ in $T_l$ forwards an NCRN message only to adjacent replicas from which it has not received NCRN messages. This is because when replica $p$ receives an NCRN message from an adjacent replica $q$, an NCRN message originated by another element of $B_i$ has already been propagated among the replicas that replica $p$ can reach through replica $q$. In the example of Fig. 5a, replicas 1 and 4 are replicas of type 2). Therefore, replica 4 sends NCRN messages to replicas 5 and 6 because, through the first phase of classified-replica determination, replica 4 learns that replicas 5 and 6 are superior to itself in ROS capability.
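A sketch of the NCRN handling, with `send` as an assumed transport primitive and the equal-ROS neighbours taken from the first phase; this is an illustration, not the paper's pseudocode.

```python
def originate_ncrn(o_i, equal_ros_neighbors, send):
    """Run by a replica of type 2), i.e., a member of B_i: it tells its equal-ROS
    neighbours that replicas with this range originator set are not classified."""
    for n in equal_ros_neighbors:
        send(n, ("NCRN", frozenset(o_i)))

def on_ncrn(from_id, message, equal_ros_neighbors, send):
    """Run by a replica satisfying condition (3): the first NCRN whose ROS equals
    its own settles the question, and the message is forwarded away from the
    sender so that it floods exactly the subtree T_l."""
    for n in equal_ros_neighbors:
        if n != from_id:
            send(n, message)
    return "not classified"
```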

In an inequivalent-replica case, NCRN messages are propagated among replicas with the same range originator set. One or two NCRN messages are transmitted along a link because a replica may send an NCRN message to an adjacent replica from which it has not received an NCRN message but which has already sent one. Therefore, the time and message complexity of the second phase in an inequivalent-replica case are $O(D_c(b_p + b_c))$ and $O(N_c)$, respectively, where $D_c$ and $N_c$ are the maximum value among the diameters of the subtrees $T_l$ and the sum of the numbers of replicas of each $T_l$, respectively. The termination condition of the second phase in an inequivalent-replica case is that a replica has received or sent NCRN messages from or to at least one adjacent replica in each $T_l$.

Equivalent-replica case. As in an inequivalent-replica case, let $T_l$ be the set of replicas with the same range originator set in an equivalent-replica case. Unlike in an inequivalent-replica case, in an equivalent-replica case, a replica $i$ in $T_l$ cannot determine whether or not it is a classified replica by comparing its range originator set with those of others, because there is no replica in the tree for update propagation that is superior to replica $i$ in ROS capability. As described for the centralized algorithm, only one replica is designated as a classified replica among replicas whose range originator sets are equal and that are not inferior to any other replica in ROS capability. This can be achieved by a leader election algorithm [29]. In a tree, this algorithm works as follows [29]. Each leaf node is initially enabled to send an elect message to its unique adjacent node, where an elect message includes the set of node identifiers that the sender of the elect message has learned. Any node that receives elect messages from all but one of its adjacent nodes is enabled to send an elect message to the remaining adjacent node. As a result, one or two nodes can determine which node, the one with the maximum identifier, is the leader. Then, such a node broadcasts the result in the tree. In the second phase of classified-replica determination, we call a message corresponding to an elect message a classified-replica probe (CRP) message. In the example of Fig. 5b, replicas 1, 2, 4, 5, and 6 exchange CRP messages to designate replica 6 as a classified replica.
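A per-replica sketch of the CRP exchange following the leader election above; the state fields and the `send` primitive are illustrative, and both the initiation by leaf replicas and the final broadcast of the result are omitted.

```python
def on_crp(self_id, from_id, ids_in_message, state, send):
    """state: 'neighbors' (adjacent replicas in T_l), 'received' (neighbours heard
    from), 'known_ids' (identifiers learned so far), 'sent' (already forwarded?)."""
    state["received"].add(from_id)
    state["known_ids"] |= ids_in_message
    pending = state["neighbors"] - state["received"]
    if len(pending) == 1 and not state["sent"]:
        # Heard from all but one neighbour: forward the accumulated identifiers.
        send(next(iter(pending)), ("CRP", state["known_ids"] | {self_id}))
        state["sent"] = True
    elif not pending:
        # Heard from every neighbour: the maximum identifier becomes the single
        # classified replica of T_l (this result would then be broadcast back).
        return max(state["known_ids"] | {self_id})
    return None
```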

The time and message complexity of the leader election algorithm described above are $O(D(b_p + b_c))$ and $O(n)$, respectively, where $D$ and $n$ are the diameter of the tree and the number of nodes, respectively. Therefore, the time and message complexity of the second phase in an equivalent-replica case are $O(D_c(b_p + b_c))$ and $O(N_c)$, respectively, where $D_c$ and $N_c$ are the maximum value among the diameters of the subtrees $T_l$ and the sum of the numbers of replicas of each $T_l$, respectively. The termination condition of the second phase in an equivalent-replica case is that a replica has sent CRP messages to all its adjacent replicas in each $T_l$.

The second phase of classified-replica determination comprises $N_b$ range-originator-set comparisons and NCRN and CRP message transmissions and receipts, where $N_b$ is the number of adjacent replicas with the same range originator set. Therefore, the computation complexity of the second phase of classified-replica determination is $O(N_b)$. The pseudocode of the second phase of classified-replica determination is given in Appendix B.2, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.

    4.4 Mandatory-Replica Determination

As described for the centralized algorithm to calculate a minimum read replica set, a classified replica is selected as a mandatory replica when its range originator set has one or more elements that are not included in the range originator set of any other classified replica. In mandatory-replica determination, we can also decrease the time and computation complexity for the same reason as in classified-replica determination, because the range originator sets of replicas far from each other tend to have no intersection. Before describing a theorem that decreases the number of range-originator-set comparisons in mandatory-replica determination, we define a relationship between replica $i$ and classified replica $j$. We say that replica $i$ is neighboring to classified replica $j$ when there is no classified replica along the path between replica $i$ and classified replica $j$.


We also use the same phrase for a mandatory replica in Section 4.5.

Theorem 4. Classified replica $i$ is a mandatory replica if and only if $O_i$ has at least one element that is not included in the range originator set of any classified replica that is neighboring to classified replica $i$.

From Theorem 4, a classified replica can determine whether or not it is a mandatory replica by sharing its range originator set with its neighboring classified replicas. In a tree for update propagation, there are subtrees whose leaves are classified replicas or leaves of the tree. Classified replicas neighboring each other exist in such a subtree. Therefore, a classified replica can determine whether or not it is a mandatory replica by sharing its range originator set within such subtrees. In the example of Fig. 6, there are four subtrees in the tree: the subtrees that consist of $\{3, 7\}$, $\{1, 3\}$, $\{1, 2\}$, and $\{1, 4, 5, 6\}$, in which the sets of range originator sets $\{\{1, 3, 7\}\}$, $\{\{1, 2, 3, 4, 5\}, \{1, 3, 7\}\}$, $\{\{1, 2, 3, 4, 5\}\}$, and $\{\{1, 2, 3, 4, 5\}, \{4, 6\}\}$ are shared, respectively, as shown in Fig. 6. When nodes hold pieces of information, the process by which every node learns the whole cumulative information is called gossiping [30]. By using gossiping, the classified replicas in a subtree defined above can share their range originator sets. The leader election problem can be solved by using gossiping [29]. Therefore, the gossiping algorithm in a tree is similar to the leader election algorithm described in Section 4.3.2. In the leader election algorithm, the whole set of node identifiers is first gathered; then, the selected leader is announced to all nodes. In gossiping, the pieces of information in the nodes are first gathered; then, the cumulative information is announced to all nodes. In mandatory-replica determination, the piece of information in replica $i$ is $O_i$ if replica $i$ is a classified replica and the empty set otherwise. In mandatory-replica determination, we call a message corresponding to an elect message used in the leader election algorithm a mandatory-replica probe (MRP) message.
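A minimal sketch of the local test that Theorem 4 permits, assuming the range originator sets of the neighbouring classified replicas have been gathered by the MRP gossiping (the names are illustrative):

```python
def mandatory_by_neighbors(o_i, neighboring_classified_ros):
    """o_i: this classified replica's ROS; neighboring_classified_ros: list of the
    ROSs of its neighbouring classified replicas (may be empty)."""
    covered_by_neighbors = set().union(*neighboring_classified_ros)
    return bool(o_i - covered_by_neighbors)   # some originator nobody nearby covers
```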

The time complexity of the leader election algorithm in a tree, which is equal to that of gossiping, is $O(D(b_p + b_c))$, where $D$ is the diameter of the tree. Therefore, the time complexity of mandatory-replica determination is $O(D_m(b_p + b_c))$, where $D_m$ is the maximum value among the diameters of the subtrees defined above. The message complexity of the leader election algorithm in a tree, which is equal to that of gossiping, is $O(n)$, where $n$ is the number of nodes in the tree. In mandatory-replica determination, the gossiping algorithm is executed in every subtree defined above. Let $n_i$ be the number of nodes in subtree $i$. The sum of $n_i$ over all subtrees is at most $O(N_v)$, where $N_v$ is the number of replicas in the tree. Therefore, the message complexity of mandatory-replica determination is $O(N_v)$. The termination condition of mandatory-replica determination is the same as that of the second phase of classified-replica determination in an equivalent-replica case because the distributed mandatory-replica determination is based on the gossiping algorithm.

Mandatory-replica determination comprises $N_a$ MRP message transmissions and receipts and at most one set comparison. Therefore, the computation complexity of mandatory-replica determination is $O(N_a)$. The pseudocode of mandatory-replica determination is given in Appendix B.3, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.

    4.5 Minimum-Subtree Determination

In the centralized algorithm, for the next iteration, minimum-subtree determination calculates the minimum subtree including all replicas that are not included in the range originator set of any mandatory replica. As described in Section 4.2, minimum-subtree determination in the distributed algorithm consists of three functions: 1) calculating the minimum subtree for the next iteration, as the centralized algorithm does; 2) removing replicas covered by mandatory replicas from the range originator sets of the replicas in the current subtree; and 3) detecting the termination of the distributed algorithm that calculates a minimum read replica set. Functions 2) and 3) correspond to steps 2) and 1) in Fig. 4, respectively.

A replica in the minimum subtree for the next iteration satisfies at least one of the following two conditions: 1) it is not included in the range originator set of any mandatory replica, or 2) it is along the path between replicas satisfying condition 1). We can use Lemma 1 to decrease the time and computation complexity of determining whether a replica satisfies condition 1). A replica can verify the satisfaction of condition 1) by comparing its identifier with all elements of the range originator sets of all neighboring mandatory replicas. If at least one mandatory replica has a range originator set including the identifier of a replica, the replica does not satisfy condition 1). When condition 1) is not satisfied, a replica has to verify the satisfaction of condition 2). To verify condition 2), a replica has to learn all replicas that do not satisfy condition 1), which requires gossiping this information among all replicas. On the other hand, gossiping the range originator sets of mandatory replicas for the verification of condition 2) also enables a replica to verify the satisfaction of condition 1). In every iteration, there is at least one replica that satisfies condition 1) because there is at least one mandatory replica. Therefore, we need the gossiping of the range originator sets of mandatory replicas to achieve function 1). In gossiping, the pieces of information in the replicas are shared among the replicas. The piece of information of a mandatory replica is its range originator set, and that of a nonmandatory replica is the empty set. We call a message used for gossiping the range originator sets of mandatory replicas a minimum-subtree probe (MSP) message.


    Fig. 6. Example of mandatory-replica determination.

For function 1), all replicas in the current subtree share the set $V$ of replicas covered by mandatory replicas, as described just above. As a result, a replica can easily achieve function 2) using $V$: it removes all elements of $V$ from its range originator set. For function 3), the termination condition of the distributed algorithm is changed from that of the centralized algorithm. As described in Section 4.2, the distributed algorithm executes classified-replica and mandatory-replica determination until there is no replica in the subtree for the next iteration. A replica can determine that its role in the distributed algorithm is completed when it is not in the minimum subtree for the next iteration, because a replica will never be included in a subtree again once it has been excluded from the subtree in an iteration. Hence, a replica is a read replica if it has become a mandatory replica at least once; otherwise, it is not a read replica. When a replica completes its role in the distributed algorithm, it informs all front-end nodes about whether or not it is a read replica. As a result, the termination condition of minimum-subtree determination is that of function 1), which is the same as that of the second phase of classified-replica determination in an equivalent-replica case.
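A small sketch of functions 1) and 2) from one replica's point of view, assuming it has received by MSP gossiping the range originator sets of this iteration's mandatory replicas; the path condition 2) of the subtree membership test is omitted here, and the names are illustrative.

```python
def covered_set(mandatory_ros_list):
    """V: the union of the range originator sets gossiped by the mandatory replicas."""
    return set().union(*mandatory_ros_list)

def after_iteration(self_id, o_i, v):
    """Function 2): prune the local ROS by V. Also report whether this replica still
    satisfies condition 1), i.e., is not covered by any mandatory replica."""
    return o_i - v, self_id not in v
```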

From the above discussion, the time and message complexity of minimum-subtree determination are equal to those of gossiping. Therefore, the time and message complexity are $O(D_g(b_p + b_c))$ and $O(N_v)$, respectively, where $D_g$ and $N_v$ are the diameter of the tree for update propagation and the number of replicas, respectively.

Minimum-subtree determination comprises $N_a$ MSP message transmissions and receipts and set comparisons. Therefore, the computation complexity of minimum-subtree determination is $O(N_a)$. The pseudocode of minimum-subtree determination is given in Appendix B.4, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230. In addition, we demonstrate our distributed algorithm in Appendix D, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.230.

    4.6 Time, Message, and Computation Complexity

We discussed the time, message, and computation complexity of classified-replica, mandatory-replica, and minimum-subtree determination in Sections 4.3, 4.4, and 4.5, respectively. In each iteration, the time and message complexity of our distributed algorithm to calculate a minimum read replica set are the sums of those of the above processes. Let ct and cm be the time and message complexity of the distributed algorithm in an iteration. Then, ct and cm are the sums of O(Dc·bs), O(Dm·bs), and O(Dg·bs) and of O(Nv + Nc), O(Nv), and O(Nv), respectively, where bs is the sum of bp and bc. In every iteration, there is at least one mandatory replica [12]. This means that the number of replicas in the tree decreases by at least one in each iteration. The time and message complexity of our distributed algorithm to calculate a minimum read replica set are consequently ct and cm multiplied by Nv, respectively. For Dc, Dm, and Nc, the conditions Dc ≤ Dg, Dm ≤ Dg, and Nc ≤ Nv are satisfied. As a result, the time and message complexity of our algorithm are at most O(Nv·Dg·bs) and O(Nv^2), respectively. In each of these inequalities, the right and left sides are almost equal when old data are required by clients. Here, Dc and Dm, which determine the time complexity of classified-replica and mandatory-replica determination, respectively, are much smaller than Dg when new data are required by clients. Hence, the time complexity for obtaining new data is much lower than that for obtaining old data, though the order of the time complexity for new data is equal to that for old data.

The computation complexity of the first and second phases of classified-replica determination is O(Na) and O(Nb), as described in Sections 4.3.1 and 4.3.2, respectively. Those of mandatory-replica and minimum-subtree determination are O(Na). Therefore, the computation complexity of our distributed algorithm executed by all replicas is O(Na·Nv^2) because there are Nv replicas, at least one replica is removed in each iteration [12], and Na is equal to or greater than Nb. The computation complexity of our distributed algorithm to calculate a minimum read replica set is lower than that of the centralized algorithm because it uses the information of the topology that connects all replicas and the properties of probabilistic delay.

    4.7 Dynamic Addition and Deletion of Replicas

In our system, replicas dynamically join and leave replicated database systems. Dynamic addition and deletion of replicas change the paths along which refresh transactions are propagated among replicas, which leads to the recalculation of range originator sets in replicas. However, a replica cannot immediately estimate the update propagation delay from others because it must collect samples of update propagation delay along new paths for a particular time period. Therefore, when a new replica joins the replicated database systems, we add it as a leaf replica node to minimize the change in update propagation paths. Then, our distributed algorithm is performed among all replicas except for the newly joining replica. A front-end node sends refer transactions to the newly joining replica and to all read replicas calculated for the original update propagation tree. When replicas have collected samples of update propagation delay from the newly joining replica, our distributed algorithm is performed by all replicas in the whole tree-topology network, including the newly joining replica.
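A minimal sketch of this join procedure, assuming the propagation tree is kept as an adjacency dictionary; the helpers attach_as_leaf and refer_targets are hypothetical names, not part of the system described above:

    def attach_as_leaf(adjacency, new_replica, parent):
        """Join a new replica as a leaf so existing propagation paths stay intact."""
        adjacency.setdefault(parent, set()).add(new_replica)
        adjacency[new_replica] = {parent}

    def refer_targets(read_replicas, joining_replicas):
        """Until newcomers have collected delay samples, refer transactions go to
        the read replica set of the original tree plus every newly joined replica."""
        return set(read_replicas) | set(joining_replicas)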

We divide replicas deleted from replicated database systems into two types: leaf and nonleaf replicas. When a leaf replica leaves the replicated database systems, our distributed algorithm works after deleting it from the range originator sets of all replicas. When a nonleaf replica leaves, the tree-topology network for update propagation is divided into two or more subtrees. Our system designates one replica among the adjacent replicas of the deleted replica as their hub and connects the designated replica to them. Because the update delay between replicas in different subtrees changes after the deletion of a replica, our distributed algorithm is performed separately in every subtree. A front-end node sends refer transactions to all members of the union of the read replica sets calculated in each subtree. When replicas have collected samples of update propagation delay from all replicas in other subtrees after the deletion of a replica, our distributed algorithm is performed by all replicas in the newly constructed tree-topology network.
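A minimal sketch of the leave procedure under the same adjacency-dictionary assumption; the rule used to pick the hub (here, the lexicographically smallest neighbor) is purely illustrative because the passage above does not specify one:

    def remove_replica(adjacency, leaving):
        """Delete a replica; if it was a nonleaf node, reconnect its former
        neighbors through one of them acting as a hub."""
        neighbors = adjacency.pop(leaving, set())
        for n in neighbors:
            adjacency[n].discard(leaving)
        if len(neighbors) >= 2:                   # nonleaf: the tree would split
            hub, *others = sorted(neighbors)      # hub selection rule is illustrative
            for n in others:
                adjacency[hub].add(n)
                adjacency[n].add(hub)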


5 EVALUATION

    5.1 Controlled Data Freshness

Data freshness controlled by our algorithm depends on the delay of update propagation and the topology that connects replicas. When our method is used in practice, the topology that connects replicas should have a minimum diameter in order to improve data freshness because the number of hops for message propagation is minimized in a tree with a minimum diameter. Therefore, we used a topology with a minimum diameter for the evaluation of data freshness controlled by our algorithm, though our algorithm also works in arbitrary tree-topology networks. In addition to a tree with a minimum diameter, we also evaluated controlled data freshness in five randomly generated trees for comparison. In these trees, there were 1,000 replica nodes whose maximum degrees were 6 and 11. This means that the maximum fan outs for message propagation are 5 and 10 when the maximum degrees of nodes are 6 and 11, respectively. The delay of update propagation is caused by complicated processes such as message delivery time in network systems, waiting time in the message queue of an operating system, and transaction processing time in database management systems. Therefore, modeling the statistical delay time distribution of update propagation is, in general, difficult. For the delay of update propagation, long and short delays may occur with low probability. This type of distribution includes the Gamma and Weibull distributions. We used the well-known Gamma distribution for the delay time because our objectives are to compare our method with related work and to evaluate the rough efficiency of our method. This probability density function is generally represented by

f(x) = λ{λ(x − c)}^{r−1} exp{−λ(x − c)} / Γ(r).   (4)

We used two probability density functions, each with different Gamma function parameters, and one of the two functions was randomly assigned to each direction of each link. The parameters were: 1) c = 10 s, λ = 2, and r = 1 and 2) c = 20 s, λ = 2, and r = 2. The delay assigned to each direction of each link includes the delays that occur in the channel of the assigned link and in both its end nodes, for the period from the beginning of update propagation in the initial end node to the completion of update processing in the terminal end node.
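For instance, this shifted Gamma delay model can be reproduced with NumPy, whose Gamma sampler is parameterized by shape r and scale 1/λ. The sketch below assumes the parameter values as reconstructed in the preceding paragraph and is not the simulator used for the evaluation:

    import numpy as np

    rng = np.random.default_rng(0)

    # Reconstructed parameter sets (c in seconds, lambda, r); see Section 5.1.
    PARAMS = [(10.0, 2.0, 1.0), (20.0, 2.0, 2.0)]

    def sample_link_delay(param_index, size=1):
        """Draw delays from the shifted Gamma density
        f(x) = lambda*(lambda*(x - c))**(r - 1) * exp(-lambda*(x - c)) / Gamma(r),
        i.e., a Gamma(shape=r, rate=lambda) variate offset by c."""
        c, lam, r = PARAMS[param_index]
        return c + rng.gamma(shape=r, scale=1.0 / lam, size=size)

    # One of the two densities is assigned at random to each link direction.
    delays = sample_link_delay(rng.integers(0, len(PARAMS)), size=5)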

The view divergence of data freshness controlled by our method and by related works, which are epidemic-style [31], [32], [33] and chain-style [34] update propagation, is shown in Fig. 7. The horizontal axis is the RDF normalized by the maximum delay time in the tree-topology network with a minimum diameter, where the maximum delay times are 2.03 × 10^2 and 1.36 × 10^2 when the maximum fan outs are 5 and 10, respectively. The vertical axis is the percentage of read replicas among all replicas. In this evaluation, the probability p used in the definition of RDF is 0.95. Epidemic-style update propagation is a message delivery method using an infect-die model [32]. In the infect-die model, a node distributes a message to randomly selected nodes when it first receives the message. A node never distributes a message to any node when it receives an already received message. In chain-style update propagation, nodes are connected in a chain [34]. When a node receives a message, the message is propagated to a constant number of the closest nodes, where only the most distant node has the role of propagating the message to other nodes.
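For illustration only, a minimal simulation of the infect-die model described above might look as follows; this is our own sketch, not the evaluation code, and the node count and fan out in the example call are taken from the setup above merely as sample arguments:

    import random

    def infect_die(num_nodes, fanout, origin=0, seed=0):
        """Infect-die epidemic: a node forwards the message to `fanout` randomly
        chosen nodes only on first receipt and never forwards it again."""
        rng = random.Random(seed)
        received = {origin}
        frontier = [origin]
        while frontier:
            next_frontier = []
            for node in frontier:
                for target in rng.sample(range(num_nodes), k=fanout):
                    if target not in received:   # already-received copies are ignored
                        received.add(target)
                        next_frontier.append(target)
            frontier = next_frontier
        return received

    # Example: 1,000 replicas with fan out 5, as in the evaluation above.
    print(len(infect_die(num_nodes=1000, fanout=5)))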

The process of direct update propagation from one replica to another in the epidemic-style and chain-style methods for lazy-group replication is similar to that in our method. Therefore, one of the two probability density functions used for the evaluation of our method was randomly assigned to each path along which an update is directly propagated from one replica to another in epidemic-style and chain-style update propagation.

From Fig. 7, the proposed method can retrieve data with normalized RDF values of 5.06 × 10^-1 and 5.07 × 10^-1 in the tree with the minimum diameter by searching only one replica node when the maximum fan outs of the tree-topology networks are 5 and 10, respectively. On the other hand, the normalized RDF values of the epidemic-style (chain-style) methods are 6.16 × 10^-1 and 6.72 × 10^-1 (1.09 × 10 and 8.16) when the maximum fan outs are 5 and 10, respectively. In addition, the proposed method achieves normalized RDF values of 3.40 × 10^-1 and 4.24 × 10^-1 by searching less than 1 percent of all replicas when the maximum fan outs of the tree-topology network are 5 and 10, respectively. As a result, our method achieves more than 36.9 and 94.8 percent improvement in RDF compared with the epidemic-style and chain-style methods, respectively. When the proposed method is used for the randomly generated trees, it can retrieve data with normalized RDF values of 1.22 and 1.16 by searching only one replica node, on average, when the maximum fan outs of the tree-topology networks are 5 and 10, respectively. This means that the degrees of RDF achieved by our method are 98.1 and 72.6 percent greater than those of the epidemic-style methods when the maximum fan outs are 5 and 10, respectively. The increase of the RDF values in the randomly generated trees is caused by the increase of the maximum update propagation delay, which is 2.09 (1.99) times that of the tree with the minimum diameter when the maximum fan out is 5 (10).


Fig. 7. View divergence of data freshness achieved by the proposed method and related work.

However, our method can achieve better RDF than the epidemic-style methods do by accessing more than 0.74 percent of all replicas on average.

From the above discussion, in the tree with the minimum diameter, our method can control RDF over the regions from 6.90 × 10 to 2.03 × 10^2 and from 5.77 × 10 to 1.36 × 10^2 (not normalized) when the maximum fan outs are 5 and 10, respectively. On the other hand, the RDF values of the epidemic-style (chain-style) methods are 1.25 × 10^2 and 9.14 × 10 (2.21 × 10^3 and 1.11 × 10^3) when the maximum fan outs are 5 and 10, respectively. If epidemic-style or chain-style methods are used, we have to increase the fan out for update propagation, which increases the load on replicas, even when fresh data are rarely needed. In addition, the RDF values achieved by the related work (epidemic-style and chain-style methods) are worse than those of our method even if we double the fan out of the related work. As a result, when we use the epidemic-style and chain-style methods, clients have to wait for more than the time corresponding to the normalized RDF value of 2.48 × 10^-1 in order to acquire the same data as in our method, or they have to use data that are older by more than that time. Therefore, in a case where when and how new data are needed depends on the application and the timing of data acquisition, as described in Section 2, our method is much more advantageous than the epidemic-style and chain-style methods.

    5.2 Efficiency of Distributed Algorithm

We evaluated the efficiency of our distributed algorithm in terms of time, message, and computation complexity. For the evaluation, we used 50 randomly generated tree-topology networks. They satisfied only the condition that the degrees of nodes are in the range from 1 to 5. The numbers of replicas in the evaluation were 100, 500, and 1,000.

Our distributed algorithm iterates three types of processes: classified-replica, mandatory-replica, and minimum-subtree determination. Fig. 8a shows the mean numbers of iterations performed by replicas. The horizontal axis is the RDF T_p with probability p = 0.95. In this evaluation, the maximum delay times for update propagation in networks with 100, 500, and 1,000 replicas were 2.71 × 10^2, 4.40 × 10^2, and 5.20 × 10^2, respectively, on average. Because the number of iterations is at most 3.12 in the figure, our algorithm terminates in a small number of iterations though it is theoretically O(Nv) in the worst case [12], where Nv is the number of replicas. In addition, the mean numbers of replicas participating in the second or later iterations in tree-topology networks with 100, 500, and 1,000 replicas are at most 7.28, 4.51 × 10, and 9.36 × 10, respectively. Hence, the time, message, and computation complexity of our algorithm are dominated largely by the first iteration.

Fig. 8b shows the time complexity of the distributed algorithm normalized by the time complexity for propagating information along the mean diameter of the networks. The mean diameters of the networks with 100, 500, and 1,000 replicas were 1.50 × 10, 2.46 × 10, and 2.93 × 10, respectively. The normalized time complexity in the figure is in the range between 1.40 and 2.70, independent of the number of replicas. This is because the normalized time complexity of our algorithm is theoretically O(Nv), which is caused by the number of iterations and dominated largely by the first iteration of our algorithm. For comparison, the time complexity to calculate a minimum read replica set using the centralized algorithm and gossiping is also plotted in Fig. 8b.

Fig. 9a shows the mean number of messages normalized by the number of replicas. Because the message complexity of our algorithm is at most O(Nv^2), as described in Section 4.6, the number of messages normalized by the number of replicas is O(Nv), which is caused by the number of iterations. However, the number of messages obtained by simulation is almost independent of the number of replicas because the number of messages is dominated largely by the first iteration of our algorithm.

As described in Section 4.6, the computation complexity of the distributed algorithm to calculate a minimum read replica set is O(Na) for one replica in one iteration, where the computation complexity of message transmission/receipt and of set comparison are both O(Na).


Fig. 8. Number of iterations and time complexity. (a) Number of iterations. (b) Time complexity of our distributed algorithm normalized by time complexity for propagating information along the mean diameter of the networks. For comparison, the time complexity of calculating a minimum read replica set using gossiping in classified-replica and mandatory-replica determination is also shown in (b).

The computation complexity of the centralized algorithm in one iteration is determined by the number of set comparisons, as described in Section 3.4.3. Therefore, we compare the computation complexity of the centralized and distributed algorithms in terms of the number of set comparisons.

Fig. 9b shows the mean number of comparisons of range originator sets normalized by the number of replicas. The number of comparisons in the distributed algorithm is much smaller than that in the centralized one. In addition, the number of comparisons in the figure is almost independent of the number of replicas because the number of comparisons of our distributed algorithm normalized by the number of replicas is theoretically O(Na·Nv), which is caused by the number of iterations, and is dominated largely by the first iteration of our algorithm, as described in this section.

From the above discussion, the evaluation results in terms of time, message, and computation complexity of our distributed algorithm showed that our method can control the view divergence in networks with 100 to 1,000 replicas with high scalability. In addition, our distributed algorithm achieves lower computation complexity than the centralized one and effective load balancing of computation complexity among replicas.

    6 RELATED WORK

Data freshness is one of the most important attributes of data quality in a number of applications, such as data warehouses, Web servers, and data integration [10], [35], [36]. Various methods have been proposed to guarantee and improve data freshness [36], [35], [26]. They guarantee or improve data freshness under the condition that only one source database exists in a system, as in lazy-master replication [36], [35], [26]. When lazy-group replication is used, a replica with the most up-to-date data does not always exist as a source database. Therefore, the aforementioned methods are not applicable to lazy-group replication. On the other hand, our method enables clients to retrieve data with various degrees of freshness from replicas in lazy-group replication architectures according to the degree of freshness required by clients.

Recently, data replication and caching methods have been studied for distributed systems with unreliable components, such as peer-to-peer systems and ad hoc networks [37], [38], [31], [34]. Such systems are usually composed of individually administered hosts that dynamically join and leave the system. These methods can probabilistically provide high availability when the operational replicas satisfy particular conditions. For replicated peer-to-peer systems, update propagation methods have been studied [31], [34]. These methods achieve effective update propagation in peer-to-peer systems. In such systems, the freshness of data that a node can retrieve depends on system conditions, such as the frequency of updates, the workload of replicas, and network delay. In contrast, our method can provide data with various degrees of freshness to clients by adaptively changing read replicas according to such system conditions.

    7 CONCLUSION

We have proposed a distributed method to control the view divergence of data freshness for clients in replicated database systems. In our method, a refer request issued by a client is sent to multiple replicas in what we call a read replica set, and the client obtains the data that reflect all updates received by the replicas in that set. Our method calculates a minimum read replica set using a distributed algorithm so that the data freshness requested by a client is satisfied. We evaluated by simulation the distributed algorithm to calculate a minimum read replica set in terms of controlled data freshness, time, message, and computation complexity. As a result, our method achieves more than 36.9 and 94.8 percent improvement in data freshness compared with the epidemic-style and chain-style update propagation methods, respectively. In addition, our method can control the view divergence in networks with 100 to 1,000 replicas with high scalability while enabling effective load balancing of view divergence control.

REFERENCES

[1] P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[2] A. Helal, A. Heddaya, and B. Bhargava, Replication Techniques in Distributed Systems. Kluwer Academic Publishers, 1996.
[3] R. Ladin, B. Liskov, and S. Ghemawat, "Providing High Availability Using Lazy Replication," ACM Trans. Computer Systems, vol. 10, no. 4, pp. 360-391, 1992.
[4] C. Pu and A. Leff, "Replica Control in Distributed Systems: An Asynchronous Approach," Proc. ACM SIGMOD '91, pp. 377-386, May 1991.
[5] J. Gray, P. Helland, P. O'Neil, and D. Shasha, "The Dangers of Replication and a Solution," Proc. ACM SIGMOD '96, pp. 173-182, June 1996.
[6] J.J. Fischer and A. Michael, "Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network," Proc. First ACM Symp. Principles of Database Systems, pp. 70-75, May 1982.
[7] D.S. Parker and R.A. Ramos, "A Distributed File System Architecture Supporting High Availability," Proc. Sixth Berkeley Workshop Distributed Data Management and Computer Networks, pp. 161-183, Feb. 1982.
[8] P. Cox and B.D. Noble, "Fast Reconciliations in Fluid Replication," Proc. Int'l Conf. Distributed Computing Systems, pp. 449-458, 2001.
[9] The Grid 2: Blueprint for a New Computing Infrastructure, I. Foster and C. Kesselman, eds. Morgan Kaufmann, 2003.
[10] B. Shin, "An Exploratory Investigation of System Success Factors in Data Warehousing," J. Assoc. for Information Systems, vol. 4, pp. 141-170, 2003.
[11] M. Bouzeghoub and V. Peralta, "A Framework for Analysis of Data Freshness," Proc. Int'l Workshop Information Quality in Information Systems, pp. 59-67, 2004.
[12] T. Yamashita and S. Ono, "View Divergence Control of Replicated Data Using Update Delay Estimation," Proc. 18th IEEE Symp. Reliable Distributed Systems, pp. 102-111, Oct. 1999.
[13] T. Yamashita and S. Ono, "Controlling View Divergence of Data Freshness in a Replicated Database System Using Statistical Update Delay Estimation," IEICE Trans. Information and Systems, vol. E88-D, no. 4, pp. 739-749, 2005.
[14] J. Han and M. Kamber, Data Mining, second ed. Morgan Kaufmann, 2006.
[15] L.P. English, Improving Data Warehouse and Business Information Quality. John Wiley & Sons, 1999.
[16] Distributed Systems, S. Mullender, ed. ACM Press, 1989.
[17] I. Foster, C. Kesselman, J.M. Nick, and S. Tuecke, "Grid Services for Distributed System Integration," Computer, vol. 35, no. 6, pp. 37-46, June 2002.
[18] C. Huitema, Routing in the Internet. Prentice-Hall, 1995.
[19] L.M. Leemis, Reliability. Prentice-Hall, 1995.
[20] S. Shiba and H. Watanabe, Statistical Methods II: Estimation (in Japanese). Shinyosha, 1976.
[21] M. Hollander and D.A. Wolfe, Nonparametric Statistical Methods, second ed. John Wiley & Sons, 1999.
[22] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry, "Epidemic Algorithms for Replicated Database Maintenance," Proc. Sixth Ann. ACM Symp. Principles of Distributed Computing, pp. 1-12, 1987.
[23] D.L. Mills, "Precision Synchronization of Computer Network Clocks," Computer Comm. Rev., vol. 24, no. 2, pp. 28-43, 1994.
[24] J. Levine, "An Algorithm to Synchronize the Time of a Computer to Universal Time," IEEE/ACM Trans. Networking, vol. 3, no. 1, pp. 42-50, Feb. 1995.
[25] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers, 1993.
[26] E. Pacitti, E. Simon, and R. Melo, "Improving Data Freshness in Lazy Master Schemes," Proc. 18th IEEE Int'l Conf. Distributed Computing Systems, pp. 164-171, May 1998.
[27] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, second ed. MIT Press, 2001.
[28] E.W. Dijkstra, "Termination Detection for Diffusing Computations," Information Processing Letters, vol. 11, no. 1, pp. 1-4, 1980.
[29] N.A. Lynch, Distributed Algorithms. Morgan Kaufmann Publishers, 1996.
[30] Combinatorial Network Theory, D. Du and D.F. Hsu, eds. Kluwer Academic Publishers, 1996.
[31] A. Datta, M. Hauswirth, and K. Aberer, "Updates in Highly Unreliable, Replicated Peer-to-Peer Systems," Proc. 23rd IEEE Int'l Conf. Distributed Computing Systems, pp. 76-88, 2003.
[32] P.T. Eugster, R. Guerraoui, A.-M. Kermarrec, and L. Massoulié, "Epidemic Information Dissemination in Distributed Systems," Computer, vol. 37, no. 5, pp. 60-67, May 2004.
[33] I. Gupta, A.-M. Kermarrec, and A.J. Ganesh, "Efficient and Adaptive Epidemic-Style Protocols for Reliable and Scalable Multicast," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 7, pp. 593-605, July 2006.
[34] Z. Wang, S.K. Das, M. Kumar, and H. Shen, "Update Propagation through Replica Chain in Decentralized and Unstructured P2P Systems," Proc. Int'l Conf. Peer-to-Peer Computing (P2P '04), pp. 64-71, 2004.
[35] A. Labrinidis and N. Roussopoulos, "Exploring the Tradeoff between Performance and Data Freshness in Database-Driven Web Servers," VLDB J., vol. 13, no. 3, pp. 240-255, 2004.
[36] R. Hull and G. Zhou, "A Framework for Supporting Data Integration Using the Materialized and Virtual Approaches," Proc. ACM SIGMOD '96, pp. 481-492, June 1996.
[37] F.M. Cuenca-Acuna, R.P. Martin, and T.D. Nguyen, "Autonomous Replication for High Availability in Unstructured P2P Systems," Proc. 22nd IEEE Int'l Symp. Reliable Distributed Systems, pp. 99-108, 2003.
[38] V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher, "Adaptive Replication in Peer-to-Peer Systems," Proc. 24th IEEE Int'l Conf. Distributed Computing Systems, pp. 360-369, 2004.

Takao Yamashita received the BS and MS degrees in electrical engineering from Kyoto University in 1990 and 1992, respectively, and the PhD degree in informatics from Kyoto University in 2006. In 1992, he joined Nippon Telegraph and Telephone Corporation. His current research interests encompass loosely coupled distributed systems, network security, and distributed algorithms. He is a member of the IEEE, the IEEE Computer Society, and the APS.

