

Data Replication in Data Intensive Scientific Applications With Performance Guarantee

Dharma Teja Nukarapu, Bin Tang, Liqiang Wang, and Shiyong Lu

Abstract— Data replication has been well adopted in data intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data intensive applications, has proven to be NP-hard and even non-approximable, making this problem difficult to solve. Meanwhile, most of the previous research in this field is either theoretical investigation without practical consideration, or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee, but also can be implemented in a distributed and practical manner. Specifically, we design a polynomial time centralized replication algorithm that reduces the total data file access delay by at least half of the reduction achieved by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and it is more scalable and adaptive to dynamic changes of file access patterns in Data Grids.

Index Terms - Data intensive applications, Data Grids, data replication, algorithm design and analysis, simulations.

I. Introduction

Data intensive scientific applications, which mainly aim to answer some of the most fundamental questions facing human beings, are becoming increasingly prevalent in a wide range of scientific and engineering research domains. Examples include human genome mapping [40], high energy particle physics and astronomy [1], [26], and climate change modeling [32]. In such applications, large amounts of data sets are generated, accessed, and analyzed by scientists worldwide. The Data Grid [49], [5], [22] is an enabling technology for data intensive applications. It is composed of hundreds of geographically distributed computation, storage, and networking resources to facilitate data sharing and management in data intensive applications. One distinct feature of Data Grids is that they produce and manage very large amounts of data sets, on the order of terabytes and petabytes. For example, the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) near Geneva, Switzerland, is the largest scientific instrument on the planet. Since it began operation in August of 2008, it was expected to produce roughly 15 petabytes of data annually, which are accessed and analyzed by thousands of scientists around the world [3].

Dharma Teja Nukarapu and Bin Tang are with the Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260; Liqiang Wang is with the Department of Computer Science, University of Wyoming, Laramie, WY 82071; and Shiyong Lu is with the Department of Computer Science, Wayne State University, Detroit, MI 48202.

Replication is an effective mechanism to reduce file transfer time and bandwidth consumption in Data Grids: placing the most frequently accessed data at the right locations can greatly improve the performance of data access from a user's perspective. Meanwhile, even though the memory and storage capacity of modern computers are ever increasing, they are still not keeping up with the demand of storing the huge amounts of data produced in scientific applications. With each Grid site having limited memory/storage resources,¹ the data replication problem becomes more challenging. We find that most of the previous work on data replication under limited storage capacity in Data Grids occupies two extremes of a wide spectrum: it is either theoretical investigation without practical consideration, or heuristics-based experimentation without performance guarantee (please refer to Section II for a comprehensive literature review). In this article, we endeavor to bridge this gap by proposing an algorithm that has a provable performance guarantee and can also be implemented in a distributed manner.

Our model is as follows. Scientific data, in the form of data files, are produced and stored at the Grid sites as the result of scientific experiments, simulations, or computations. Each Grid site executes a sequence of scientific jobs submitted by its users. To execute each job, some scientific data are usually needed as input files. If these files are not in the local storage resource of the Grid site, they will be accessed from other sites, and transferred to and replicated in the local storage of the site if necessary. Each Grid site can store such data files subject to its storage/memory capacity limitation. We study how to replicate the data files onto Grid sites with limited storage space in order to minimize the overall file access time for Grid sites to finish executing their jobs.

Specifically, we formulate this problem as a graph-theoretical problem and design a centralized greedy data replication algorithm, which provably gives a total data file access time reduction (compared to no replication) of at least half of that obtained by the optimal replication algorithm. We also design a distributed caching technique based on the centralized replication algorithm,² and show experimentally that it can be easily adopted in a distributed environment such as Data Grids. The main idea of our distributed algorithm is that when there are multiple replicas of a data file in a Data Grid, each Grid site keeps track of (and thus fetches the data file from) its closest replica site. This can dramatically improve Data Grid performance because transferring large files takes a tremendous amount of time and bandwidth [18]. The central part of our distributed algorithm is a mechanism for each Grid site to accurately locate and maintain such a closest replica site. Our distributed algorithm is also adaptive: each Grid site makes a file caching decision (i.e., replica creation and deletion) by observing the recent data access traffic going through it. Our simulation results show that our caching strategy adapts better to dynamic changes of user access behavior, compared to another existing caching technique in Data Grids [39].

¹ In this article, we use memory and storage interchangeably.

Currently, for most Data Grid scientific applications, the massive amount of data is generated at a few centralized sites (such as CERN for high energy physics experiments) and accessed by all other participating institutions. We envision that, in the future, in the closely collaborative scientific environment upon which data-intensive applications are built, each participating institution site could well be a data producer as well as a data consumer, generating data as a result of the scientific experiments or simulations it performs, and meanwhile accessing data from other sites to run its own applications. Therefore, data transfer is no longer from a few data-generating sites to all other "client" sites, but could take place between an arbitrary pair of Grid sites. It is a great challenge in such an environment to efficiently share all the widely distributed and dynamically generated data files in terms of computing resources, bandwidth usage, and file transfer time. Our network model reflects such a vision (please refer to Section III for the detailed Data Grid model).

The main results and contributions of our paper are as follows:

1) We identify the limitation of the current research on data replication in Data Grids: it is either theoretical investigation without practical consideration, or heuristics-based implementation without a provable performance guarantee.

2) To the best of our knowledge, we are the first to propose a data replication algorithm for Data Grids that not only has a provable theoretical performance guarantee, but can also be implemented in a distributed manner.

3) Via simulations, we show that our proposed replication strategies perform comparably with the optimal algorithm and significantly outperform an existing popular replication technique [39].

² We emphasize the difference between replication and caching in this article. Replication is a proactive and centralized approach wherein the server replicates files into the network to achieve a certain global performance optimum, whereas caching is reactive and takes place on the client side to achieve a certain local optimum; it depends on the locally observed network traffic, job and file distribution, etc.

4) Via simulations, we show that our replication strategy adapts well to dynamic access pattern changes in Data Grids.

Paper Organization. The rest of the paper is organized as follows. We discuss the related work in Section II. In Section III, we present our Data Grid model and formulate the data replication problem. Section IV and Section V propose the centralized and distributed algorithms, respectively. We present and analyze the simulation results in Section VI. Section VII concludes the paper and points out some future work.

II. Related Work

Replication has been an active research topic for many years in the World Wide Web [35], peer-to-peer networks [4], ad hoc and sensor networks [25], [44], and mesh networks [28]. In Data Grids, enormous scientific data and complex scientific applications call for new replication algorithms, which have attracted much research recently.

The most closely related work to ours is by Cibej, Slivnik, and Robic [47]. The authors study data replication in Data Grids as a static optimization problem. They show that this problem is NP-hard and non-approximable, which means that there is no polynomial time algorithm that provides an approximation solution if P ≠ NP. The authors discuss two solutions: integer programming and simplifications. They only consider static data replication for the purpose of formal analysis. The limitation of the static approach is that the replication cannot adjust to dynamically changing user access patterns. Furthermore, their centralized integer programming technique cannot be easily implemented in a distributed Data Grid. Moreover, Baev et al. [6], [7] show that if all the data have uniform size, then this problem is indeed approximable, and they give 20.5-approximation and 10-approximation algorithms. However, their approach, which is based on rounding an optimal solution to the linear programming relaxation of the problem, cannot be easily implemented in a distributed way. In this work, we follow the same direction (i.e., uniform data size), but design a polynomial time approximation algorithm, which can also be easily implemented in a distributed environment like Data Grids.

Raicu et al. [36], [37] study both theoretically and empirically the resource allocation in data intensive applications. They propose a "data diffusion" approach that acquires computing and storage resources dynamically, replicates data in response to demand, and schedules computations close to the data. They give an online algorithm with competitive ratio O(NM), where N is the number of stores, each of which can store M objects of uniform size. However, their model does not allow keeping multiple copies of an object simultaneously in different stores. In our model, we assume each object can have multiple copies, each on a different site.

Some economic-model-based replica schemes have been proposed. The authors in [11], [8] use an auction protocol to make the replication decision and to trigger long-term optimization by using file access patterns. You et al. [50] propose utility-based replication strategies. Lei et al. [30] and Schintke and Reinefeld [41] address data replication for availability in the face of unreliable components, which is different from our work.

Jiang and Zhang [27] propose a technique in Data Grids to measure how soon a file is to be re-accessed before being evicted, compared with other files. They consider both the consumed disk space and the disk cache size. Their approach can accurately rank the value of each file to be cached. Tang et al. [45] present a dynamic replication algorithm for multi-tier Data Grids. They propose two dynamic replica algorithms: Single Bottom Up and Aggregate Bottom Up. Performance results show that both algorithms reduce the average response time of data access compared to a static replication strategy in a multi-tier Data Grid. Chang and Chang [13] propose a dynamic data replication mechanism called Latest Access Largest Weight (LALW). LALW associates a different weight with each historical data access record: a more recent data access has a larger weight. LALW gives a more precise metric to determine a popular file for replication. Park et al. [33] propose a dynamic replication strategy, called BHR, which benefits from "network-level locality" to reduce data access time by avoiding network congestion in a Data Grid. Lamehamedi et al. [29] propose a lightweight Data Grid middleware, at the core of which are some dynamic data and replica location and placement techniques.

Ranganathan and Foster [39] present six different replication strategies: No Replication or Caching, Best Client, Cascading Replication, Plain Caching, Caching plus Cascading Replication, and Fast Spread. All of these strategies are evaluated with three user access patterns (Random Access, Temporal Locality, and Geographical plus Temporal Locality). Via simulation, the authors find that Fast Spread performs the best under Random Access, and Cascading works better than the others under Geographical plus Temporal Locality. Due to its wide popularity in the literature and its simplicity to implement, we compare our distributed replication algorithm to this work.

System-wise, there are several real testbed and system implementations utilizing data replication. One system is the Data Replication Service (DRS) designed by Chervenak et al. [17]. DRS is a higher-level data management service for scientific collaborations in Grid environments. It replicates a specified set of files onto a storage system and registers the new files in the replica catalog of the site. Another system is the Physics Experimental Data Export (PheDEx) system [21]. PheDEx supports both the hierarchical distribution and the subscription-based transfer of data.

To execute the submitted jobs, each Grid site either gets the needed input data files to its local computing resource, schedules the job at sites where the needed input data files are stored, or transfers both the data and the job to a third site that performs the computation and returns the result. In this paper, we focus on the first approach. We leave the job scheduling problem [48], which studies how to map jobs into Grid resources for execution, and its coupling with data replication, as our future research. We are aware of very active research studying the relationship between these two [12], [38], [16], [23], [14], [46], [19], [10]. The focus of our paper, however, is on the data replication strategy with a provable performance guarantee.

Fig. 1. Data Grid Model.

As demonstrated by the experiments of Chervenak et al. [17], the time to execute a scientific job is mainly the time it takes to transfer the needed input files from server sites to local sites. Similar to other work in replica management for Data Grids [30], [41], [10], we only consider the file transfer time (access time), not the job execution time in the processor or any other internal storage processing or I/O time. Since the data are read-only for many Data Grid applications [38], we do not consider consistency maintenance between the master file and the replica files. Readers who are interested in consistency maintenance in Data Grids may refer to [20], [42], [34].

III. Data Grid Model and Problem Formulation

We consider a Data Grid model as shown in Figure 1. A Data Grid consists of a set of sites. There are institutional sites, which correspond to the different scientific institutions participating in the scientific project. There is one top level site, which is the centralized management entity in the entire Data Grid environment; its major role is to manage the Centralized Replica Catalogue (CRC). The CRC provides location information about each data file and its replicas, and it is essentially a mapping between each data file and all the institutional sites where the data is replicated. Each site (the top level site or an institutional site) may contain multiple grid resources. A grid resource could be either a computing resource, which allows users to submit and execute jobs, or a storage resource, which allows users to store data files.³ We assume that each site has both computing and storage capacities, and that within each site, the bandwidth is high enough that the communication delay inside the site is negligible.

³ We differentiate site and node in this paper. We refer to a site at the Grid level and a node at the cluster level, where each node is a computing resource, a storage resource, or a combination of both.


For the data file replication problem addressed in this article, there are multiple data files, and each data file is produced by its source site (the top level site or an institutional site may act as the source site of more than one data file). Each Grid site has limited storage capacity and can cache/store multiple data files subject to its storage capacity constraint.

Data Grid Model. A Data Grid can be represented as an undirected graph $G(V, E)$, where the set of vertices $V = \{1, 2, \ldots, n\}$ represents the sites in the Grid, and $E$ is the set of weighted edges in the graph. The edge weight may represent a link metric such as loss rate, distance, delay, or transmission bandwidth. In this paper, the edge weight represents the bandwidth, and we assume all edges have the same bandwidth $B$ (in Section VI, we study a heterogeneous environment where different edges have different bandwidths). There are $p$ data files $D = \{D_1, D_2, \ldots, D_p\}$ in the Data Grid; $D_j$ is originally produced and stored in its source site $S_j \in V$. Note that a site can be the original source site of multiple data files. The size of data file $D_j$ is $s_j$. Each site $i$ has a storage capacity of $m_i$ (for a source site $i$, $m_i$ is the available storage space after storing its original data).

Users of the Data Grid submit jobs to their own sites, and the jobs are executed in FIFO order. Assume that Grid site $i$ has $n_i$ submitted jobs $\{t_{i1}, t_{i2}, \ldots, t_{in_i}\}$, and each job $t_{ik}$ ($1 \le k \le n_i$) needs a subset $F_{ik}$ of $D$ as its input files for execution. If we use $w_{ij}$ to denote the number of times that site $i$ needs $D_j$ as an input file, then $w_{ij} = \sum_{k=1}^{n_i} c_k$, where $c_k = 1$ if $D_j \in F_{ik}$ and $c_k = 0$ otherwise. The transmission time of sending data file $D_j$ along any edge is $s_j / B$. We use $d_{ij}$ to denote the number of transmissions needed to transmit a data file between sites $i$ and $j$ (which is equal to the number of edges on the path between these two sites). We do not consider the file propagation time along an edge, since it is quite small compared with the transmission time and thus negligible. The total data file access cost in the Data Grid before replication is the total transmission time spent to get all needed data files for each site:

$$\sum_{i=1}^{n} \sum_{j=1}^{p} w_{ij} \times d_{iS_j} \times s_j / B.$$

The objective of our file replication problem is to minimize the total data file access cost by replicating data files in the Data Grid. Below, we give a formal definition of the file replication problem addressed in this article.

Problem Formulation. The data replication problem in the Data Grid is to select a set of sets $M = \{A_1, A_2, \ldots, A_p\}$, where $A_j \subset V$ is the set of Grid sites that store a replica copy of $D_j$, to minimize the total access cost in the Data Grid:

$$\tau(G, M) = \sum_{i=1}^{n} \sum_{j=1}^{p} w_{ij} \times \min_{k \in \{S_j\} \cup A_j} d_{ik} \times s_j / B,$$

under the storage capacity constraint that $|\{A_j \mid i \in A_j\}| \le m_i$ for all $i \in V$, which means that each Grid site $i$ appears in at most $m_i$ sets of $M$. Here, each site accesses a data file from its closest site with a replica copy. Cibej, Slivnik, and Robic show that this problem (with arbitrary data file sizes) is NP-hard and even non-approximable [47]. However, Baev et al. [6], [7] have shown that when the data files have uniform size (i.e., $s_i = s_j$ for any $D_i, D_j \in D$), this problem is approximable. Below we present a centralized greedy algorithm with a provable performance guarantee for uniform data size, and a distributed algorithm, respectively.
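To make the formulation concrete, below is a minimal Python sketch (not from the paper) of how the total access cost $\tau(G, M)$ could be evaluated for a candidate replication $M$; the hop-distance matrix, access frequencies, source sites, file sizes, and bandwidth are assumed to be given, and all identifiers are illustrative.

# Illustrative sketch (not the authors' implementation): computing the total
# access cost tau(G, M) of a candidate replication M, following the formulation above.
from typing import List, Set

def total_access_cost(
    hop_dist: List[List[int]],   # hop_dist[i][k]: number of hops between sites i and k
    w: List[List[float]],        # w[i][j]: number of times site i needs file j
    source: List[int],           # source[j]: source site S_j of file j
    size: List[float],           # size[j]: size s_j of file j
    bandwidth: float,            # uniform edge bandwidth B
    replicas: List[Set[int]],    # replicas[j] = A_j: sites holding a replica of file j
) -> float:
    n, p = len(hop_dist), len(source)
    cost = 0.0
    for i in range(n):
        for j in range(p):
            # each site fetches file j from its closest holder: S_j or any site in A_j
            holders = {source[j]} | replicas[j]
            d_min = min(hop_dist[i][k] for k in holders)
            cost += w[i][j] * d_min * size[j] / bandwidth
    return cost

With all $A_j$ empty, this reduces to the pre-replication cost given earlier.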

IV. Centralized Data Replication Algorithm in Data Grids

Our centralized data replication algorithm is a greedy algorithm. Initially, all Grid sites have empty storage space (except for sites that originally produce and store some files). Then, at each step, the algorithm places one data file into the storage space of one site such that the reduction of the total access cost in the Data Grid is maximized at that step. The algorithm terminates when all the storage space of the sites has been filled with replicated data files, or the total access cost cannot be reduced further. Below is the algorithm.

Algorithm 1: Centralized Data Replication Algorithm
BEGIN
    M = A_1 = A_2 = ... = A_p = ∅ (empty sets);
    while (the total access cost can still be reduced by replicating data files in the Data Grid)
        Among all sites with available storage capacity and all data files, find the data file D_i and the site m whose replication gives the maximum
            τ(G, {A_1, A_2, ..., A_i, ..., A_p}) - τ(G, {A_1, A_2, ..., A_i ∪ {m}, ..., A_p});
        A_i = A_i ∪ {m};
    end while;
    RETURN M = {A_1, A_2, ..., A_p};
END. ♦
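The following Python sketch (illustrative, not the authors' implementation) realizes Algorithm 1 for uniform-size files under the hop-count cost model used above; names such as hop_dist and capacity are assumptions made for the example.

# Illustrative sketch (not the authors' code) of the centralized greedy
# replication algorithm (Algorithm 1) for uniform-size files.
from typing import List, Set

def greedy_replication(
    hop_dist: List[List[int]],   # hop_dist[i][k]: hops between sites i and k
    w: List[List[float]],        # w[i][j]: access frequency of site i for file j
    source: List[int],           # source[j]: source site of file j
    capacity: List[int],         # capacity[i]: remaining storage (in files) at site i
) -> List[Set[int]]:
    n, p = len(hop_dist), len(source)
    replicas: List[Set[int]] = [set() for _ in range(p)]   # A_1 ... A_p, initially empty

    def nearest(i: int, j: int) -> int:
        # hop distance from site i to its closest copy of file j
        return min(hop_dist[i][k] for k in ({source[j]} | replicas[j]))

    while True:
        best_gain, best = 0.0, None
        for m in range(n):
            if capacity[m] <= 0:
                continue
            for j in range(p):
                if m == source[j] or m in replicas[j]:
                    continue
                # cost reduction if file j is replicated at site m:
                # every site whose current nearest copy is farther than m benefits
                gain = sum(
                    w[i][j] * max(0, nearest(i, j) - hop_dist[i][m]) for i in range(n)
                )
                if gain > best_gain:
                    best_gain, best = gain, (m, j)
        if best is None:            # no placement reduces the total access cost
            break
        m, j = best
        replicas[j].add(m)          # place the replica and consume one storage unit
        capacity[m] -= 1
    return replicas

At each iteration, the placement with the largest positive access cost reduction is chosen, and the loop stops when no placement reduces the cost further or no storage space remains, matching the termination condition of Algorithm 1.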

The total running time of the greedy data replication algorithm is $O(p^2 n^3 m)$, where $n$ is the number of sites in the Data Grid, $m$ is the average number of memory pages in a site, and $p$ is the total number of data files. Note that the number of iterations in the above algorithm is bounded by $nm$, and at each iteration, we need to compute at most $pn$ benefit values, where each benefit value computation may take $O(pn)$ time. Below we present two lemmas showing some properties of the above greedy algorithm. They are also necessary for us to prove the theorem below, which shows that the greedy algorithm returns a solution with a near-optimal total access cost reduction.

Lemma 1: The total access cost reduction is independent of the order of the data file replications.
Proof: This is because the reduction of the total access cost depends only on the initial and final storage configuration of each site, which does not depend on the exact order of the data replications.

Lemma 2: Let $M$ be a set of cache sites at an arbitrary stage. Let $A_1$ and $A_2$ be two distinct sets of sites not included in $M$. Then the access cost reduction due to the selection of $A_2$ based upon the cache site set $M \cup A_1$, denoted as $a$, is less than or equal to the access cost reduction due to the selection of $A_2$ based upon $M$, denoted as $b$.


Fig. 2. Each site $i$ in the original graph $G$ has $m_i$ memory space; each site in the modified graph $G'$ has $2m_i$ memory space.

Proof: Let $ClientSet(A_2, M)$ be the set of sites that have their closest cache sites in $A_2$ when $A_2$ is selected as cache sites upon $M$. With the selection of $A_1$ before $A_2$, some sites in $ClientSet(A_2, M)$ may find that they are closer to some cache sites in $A_1$ and then "deflect" to become elements of $ClientSet(A_1, M)$. Therefore, $ClientSet(A_2, M \cup A_1) \subseteq ClientSet(A_2, M)$.

Since the selection of $A_1$ before $A_2$ does not change the closest sites in $M$ for all sites in $ClientSet(A_2, M \cup A_1)$, if $ClientSet(A_2, M \cup A_1) = ClientSet(A_2, M)$, then $a = b$; if $ClientSet(A_2, M \cup A_1) \subset ClientSet(A_2, M)$, then $a < b$.

The above lemma says that the total access cost reduction due to an arbitrary cache site never increases with the selection of other cache sites before it.

We present below a theorem that shows the performance guarantee of our centralized greedy algorithm. In the theorem and the following proof, we assume that all data files are of the same uniform size (each occupying unit memory). The proof technique used here is similar to that used in [44] for a closely related problem of data caching in ad hoc networks.

Theorem 1: Under the Data Grid model we propose, the centralized data replication algorithm delivers a solution whose total access cost reduction is at least half of the optimal total access cost reduction.

Proof: Let $L$ be the total number of memory pages in the Data Grid. WLOG, we assume $L$ is also the total number of iterations of the greedy algorithm (note that it is possible that the total access cost cannot be reduced further even though sites still have available memory space). We denote the sequence of selections made by the greedy algorithm as $\{n^g_1 f^g_1, n^g_2 f^g_2, \ldots, n^g_L f^g_L\}$, where $n^g_i f^g_i$ indicates that at iteration $i$, data file $f^g_i$ is replicated at site $n^g_i$. We also denote the sequence of selections made by the optimal algorithm as $\{n^o_1 f^o_1, n^o_2 f^o_2, \ldots, n^o_L f^o_L\}$, where $n^o_i f^o_i$ indicates that at iteration $i$, data file $f^o_i$ is replicated at site $n^o_i$. Let $O$ and $C$ be the total access cost reductions of the optimal algorithm and the greedy algorithm, respectively.

Next, we consider a modified Data Grid $G'$, wherein each site $i$ doubles its memory capacity (now $2m_i$). We construct a cache placement solution for $G'$ such that for each site $i$, the first $m_i$ memory pages store the data files obtained by the greedy algorithm and the second $m_i$ memory pages store the data files selected by the optimal algorithm, as shown in Figure 2. Now we calculate the total access cost reduction $O'$ for $G'$. Obviously, $O' \ge O$, simply because each site in $G'$ caches extra data files beyond the data files cached at the same site in $G$.

According to Lemma 1, since the total access cost reduction is independent of the order of the individual selections, we consider the sequence of cache selections in $G'$ as $\{n^g_1 f^g_1, n^g_2 f^g_2, \ldots, n^g_L f^g_L, n^o_1 f^o_1, n^o_2 f^o_2, \ldots, n^o_L f^o_L\}$, that is, the greedy selections followed by the optimal selections. Therefore, in $G'$, the total access cost reduction is the reduction due to the greedy selection sequence plus the reduction due to the optimal selection sequence. For the greedy selection sequence, after the selection of $n^g_L f^g_L$, the access cost reduction is $C$. For the optimal selection sequence, we need to calculate the access cost reduction due to the addition of each $n^o_i f^o_i$ ($1 \le i \le L$) based on the already added selections $\{n^g_1 f^g_1, n^g_2 f^g_2, \ldots, n^g_L f^g_L, n^o_1 f^o_1, n^o_2 f^o_2, \ldots, n^o_{i-1} f^o_{i-1}\}$. From Lemma 2, we know that this is at most the access cost reduction due to the addition of $n^o_i f^o_i$ based on the already added sequence $\{n^g_1 f^g_1, n^g_2 f^g_2, \ldots, n^g_{i-1} f^g_{i-1}\}$. Note that the latter is at most the access cost reduction due to the addition of $n^g_i f^g_i$ based on the same sequence $\{n^g_1 f^g_1, n^g_2 f^g_2, \ldots, n^g_{i-1} f^g_{i-1}\}$, since the greedy algorithm always picks, at each iteration, the selection with the maximum access cost reduction. Thus, the sum of the access cost reductions due to the selections $\{n^o_1 f^o_1, n^o_2 f^o_2, \ldots, n^o_L f^o_L\}$ is less than or equal to $C$ as well. We conclude that $O \le O'$, which is less than or equal to $2C$.

V. Distributed Data Caching Algorithm in Data Grids

In this section, we design a localized distributed caching algorithm based on the centralized algorithm. In the distributed algorithm, each Grid site observes the local Data Grid traffic to make intelligent caching decisions. Our distributed caching algorithm is advantageous since it does not need global information such as the network topology of the Grid, and it is more reactive to network states such as the file distribution, user access pattern, and job distribution in the Data Grid. Therefore, our distributed algorithm can adapt well to such dynamic changes in Data Grids.

The distributed algorithm is composed of two important components: a nearest replica catalog (NRC) maintained at each site, and a localized data caching algorithm running at each site. Again, as stated in Section III, the top level site maintains a Centralized Replica Catalogue (CRC), which is essentially a replica site list $C_j$ for each data file $D_j$. The replica site list $C_j$ contains the set of sites (including the source site $S_j$) that have a copy of $D_j$.

Nearest Replica Catalog (NRC). Each site $i$ in the Grid maintains an NRC, and each entry in the NRC is of the form $(D_j, N_j)$, where $N_j$ is the nearest site that has a replica of $D_j$. When a site executes a job, it determines from its NRC the nearest replica site for each of its input data files and goes to it directly to fetch the file (provided the input file is not in its local storage). At the initialization stage, the source sites send messages to the top level site informing it about their original data files. Thus, the centralized replica catalog initially records each data file and its source site. The top level site then broadcasts the replica catalogue to the entire Data Grid, and each Grid site initializes its NRC to the source site of each data file. Note that if $i$ is the source site of $D_j$ or has cached $D_j$, then $N_j$ is interpreted as the second-nearest replica site, i.e., the closest site (other than $i$ itself) that has a copy of $D_j$. The second-nearest replica site information is helpful when site $i$ decides to remove the cached file $D_j$. If there is a cache miss, the request is redirected to the top level site, which sends the requesting site the replica site list for that data file. After receiving such information, the site updates its NRC table and sends the request to its nearest replica site for that data file; therefore, a cache miss takes a much longer time. The above information is in addition to any information (such as routing tables) maintained by the underlying routing protocol in the Data Grid.

Maintenance of NRC. As stated above, when a site $i$ caches a data file $D_j$, $N_j$ is no longer the nearest replica site, but rather the second-nearest replica site. Site $i$ sends a message to the top level site with information that it is a new replica site of $D_j$. When the top level site receives the message, it updates its CRC by adding site $i$ to data file $D_j$'s replica site list $C_j$. Then it broadcasts a message to the entire Data Grid containing the tuple $(i, D_j)$, indicating the ID of the new replica site and the ID of the newly made replica file. Consider a site $l$ that receives such a message $(i, D_j)$. Let $(D_j, N_j)$ be the NRC entry at site $l$, signifying that $N_j$ is the replica site for $D_j$ currently closest to $l$. If $d_{li} < d_{lN_j}$, then the NRC entry $(D_j, N_j)$ is updated to $(D_j, i)$. Note that the distance values, in terms of number of hops, are available from the routing protocol.

When a site $i$ removes a data file $D_j$ from its local storage, its second-nearest replica site $N_j$ becomes its nearest replica site (note that the corresponding NRC entry does not change). In addition, site $i$ sends a message to the top level site with information that it is no longer a replica site of $D_j$. The top level site updates its replica catalogue by deleting $i$ from $D_j$'s replica site list, and then broadcasts a message with the information $(i, D_j, C_j)$ to the Data Grid, where $C_j$ is the replica site list for $D_j$. Consider a site $l$ that receives such a message, and let $(D_j, N_j)$ be its NRC entry. If $N_j = i$, then site $l$ updates its nearest replica site entry using $C_j$ (with the help of the routing table).
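As a rough illustration of the NRC bookkeeping just described (a sketch, not the actual Data Grid implementation), the following Python class keeps one nearest-replica entry per data file at a site and updates it on the two kinds of broadcasts issued by the top level site; the class and method names are invented for the example.

# Illustrative sketch (not the authors' code) of the per-site nearest replica
# catalog (NRC) and how it reacts to the top level site's broadcasts.
from typing import Dict, List

class SiteNRC:
    def __init__(self, site_id: int, hop_dist: List[List[int]], sources: Dict[int, int]):
        self.site_id = site_id
        self.hop_dist = hop_dist                  # hop_dist[a][b]: hops between sites a and b
        # Initially, the nearest replica site of each file is its source site.
        self.nearest: Dict[int, int] = dict(sources)   # file id -> nearest replica site

    def on_replica_added(self, new_site: int, file_id: int) -> None:
        # Broadcast (i, D_j): site new_site now holds a replica of file_id.
        current = self.nearest[file_id]
        if self.hop_dist[self.site_id][new_site] < self.hop_dist[self.site_id][current]:
            self.nearest[file_id] = new_site

    def on_replica_removed(self, old_site: int, file_id: int, replica_sites: List[int]) -> None:
        # Broadcast (i, D_j, C_j): site old_site dropped file_id; replica_sites is the
        # remaining replica site list kept by the CRC.  Recompute only if needed.
        if self.nearest[file_id] == old_site:
            self.nearest[file_id] = min(
                replica_sites, key=lambda s: self.hop_dist[self.site_id][s]
            )

A real site would additionally fall back to the CRC at the top level site on a cache miss, as described above.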

Localized Data Caching Algorithm. Since each site has limited storage capacity, a good data caching algorithm that runs distributedly at each site is needed. To do this, each site observes the data access traffic locally for a sufficiently long time. The local access traffic observed by site $i$ includes its own local data requests, non-local data requests for data files cached at $i$, and the data request traffic that site $i$ forwards to other sites in the network.

Before we present the data caching algorithm, we give the following two definitions:

• Reduction in access cost of caching a data file. The reduction in access cost as the result of caching a data file at a site is given by: access frequency in the local access traffic observed by the site × distance to the nearest replica site.

• Increase in access cost of deleting a data file. The increase in access cost as the result of deleting a data file at a site is given by: access frequency in the local access traffic observed by the site × distance to the second-nearest replica site.

For each data file $D_j$ not cached at site $i$, site $i$ calculates the reduction in access cost of caching the file $D_j$, while for each data file $D_j$ cached at site $i$, site $i$ computes the increase in access cost of deleting the file. With the help of the NRC, each site can compute such a reduction or increase in access cost in a distributed manner using only local information. Thus our algorithm is adaptive: each site makes a data caching or deletion decision by observing locally the most recent data access traffic in the Data Grid.

Cache Replacement Policy. With the above knowledge, a site always tries to cache data files that can fit in its local storage and that give the most reduction in access cost. When the local storage capacity of a site is full, the following cache replacement policy is used. Let $|D|$ denote the size of a data file (or a set of data files) $D$. If the access cost reduction of caching a newly available data file $D_j$ is higher than the access cost increase of some set $D$ of cached data files where $|D| > |D_j|$, then the set $D$ is replaced by $D_j$.
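A minimal sketch of this localized caching decision is given below, assuming the site already tracks the locally observed access frequency of each file and, via its NRC, the hop distances to the nearest and second-nearest replicas; choosing eviction victims in order of smallest access cost increase is one possible way of selecting the set $D$ and is not prescribed by the paper.

# Illustrative sketch (not the authors' code) of the localized caching decision:
# cache a newly seen file if evicting some set of cached files costs less than
# the access cost reduction the new file brings.
from typing import Dict, Set

def consider_caching(
    new_file: int,
    freq: Dict[int, float],        # locally observed access frequency of each file
    dist_nearest: Dict[int, int],  # hops to the nearest replica of each uncached file
    dist_second: Dict[int, int],   # hops to the second-nearest replica of each cached file
    cached: Set[int],              # IDs of files currently cached at this site
    size: Dict[int, float],        # file id -> size
    capacity: float,               # local storage capacity
) -> bool:
    # Reduction in access cost if new_file is cached locally.
    gain = freq[new_file] * dist_nearest[new_file]

    # Pick eviction victims with the smallest increase in access cost
    # (frequency x distance to the second-nearest replica) until the new file fits.
    free = capacity - sum(size[f] for f in cached)
    victims, loss = [], 0.0
    for f in sorted(cached, key=lambda f: freq[f] * dist_second[f]):
        if free >= size[new_file]:
            break
        victims.append(f)
        loss += freq[f] * dist_second[f]
        free += size[f]

    if free < size[new_file] or loss >= gain:
        return False               # caching new_file is not worthwhile
    for f in victims:              # evict the chosen files and cache the new one
        cached.discard(f)
    cached.add(new_file)
    return True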

Data File Consistency Maintenance. In this work, we assume that all the data are read-only, and thus no data consistency mechanism is needed. However, if we do consider the consistency issue, two mechanisms can be considered. First, when the file at a site is modified, it can be treated as a data file removal in our distributed algorithm discussed above, with some necessary deviations. That is, the site sends a message to the top level site with information that it is no longer a replica site of the file. The top level site broadcasts a message to the Data Grid so that each site can delete the corresponding nearest replica site entry. The second mechanism is Time to Live (TTL), which is the time until the replica copies are considered valid. Therefore, the master copy and its replicas are considered outdated at the end of the TTL time value and will not be used for job execution.

VI. Performance Evaluation

In this section, we demonstrate the performance of our designed algorithms via extensive simulations. First, we compare the centralized greedy algorithm (referred to as Greedy) with the centralized optimal algorithm (referred to as Optimal) and two other intuitive heuristics using our own simulator written in Java (Section VI-A). Then, using GridSim [43], a well-known Grid simulator, we evaluate our distributed algorithm with respect to an existing caching algorithm from the well-cited literature [39] (Section VI-B), which is based upon a hierarchical Data Grid topology. Finally, we compare our distributed algorithm with the popular LRU/LFU algorithms implemented in OptorSim [10] in a general Data Grid topology.


Fig. 3. Performance comparison of the Greedy, Optimal, Local Greedy, and Random algorithms: total access cost under (a) varying number of data files, (b) varying storage capacity, and (c) varying number of Grid sites. Unless varied, the number of sites is 10, the number of data files is 10, and the storage capacity of each site is 5. Data file size is 1.

A. Comparison of Centralized Algorithms

Optimal Replication Algorithm. Optimal is an exhaustive, and thus time-consuming, replication algorithm that yields an optimal solution. For a Data Grid with $n$ sites and $p$ data files of unit size, where each site has storage capacity $m$, the optimal algorithm enumerates all possible file replication placements in the Data Grid and finds the one with the minimum total access cost. There are $(C_p^m)^n$ such file placements, where $C_p^m$ is the number of ways of selecting $m$ files out of $p$ files. Due to the high time complexity of the optimal algorithm, we experiment on Data Grids with 5 to 25 sites, where the average distance between two sites is 3 to 6 hops.
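For such small instances, the Optimal baseline can be realized by brute-force enumeration, as in the Python sketch below (illustrative only; hop counts are used as the access cost, consistent with the uniform file size and uniform bandwidth assumed here).

# Illustrative sketch (not the authors' code) of the exhaustive Optimal baseline:
# enumerate every assignment of m unit-size files to each site's storage and keep
# the placement with the minimum total access cost.  Only feasible for tiny grids.
from itertools import combinations, product
from typing import List, Set, Tuple

def optimal_replication(
    hop_dist: List[List[int]],   # hop_dist[i][k]: hops between sites i and k
    w: List[List[float]],        # w[i][j]: access frequency of site i for file j
    source: List[int],           # source[j]: source site of file j
    m: int,                      # storage capacity (in files) of every site
) -> Tuple[float, List[Set[int]]]:
    n, p = len(hop_dist), len(source)

    def cost(replicas: List[Set[int]]) -> float:
        return sum(
            w[i][j] * min(hop_dist[i][k] for k in ({source[j]} | replicas[j]))
            for i in range(n) for j in range(p)
        )

    best_cost, best_placement = float("inf"), None
    # Each site independently picks any m of the p files: (C(p, m))^n placements.
    for choice in product(combinations(range(p), m), repeat=n):
        replicas: List[Set[int]] = [set() for _ in range(p)]
        for site, files in enumerate(choice):
            for j in files:
                replicas[j].add(site)
        c = cost(replicas)
        if c < best_cost:
            best_cost, best_placement = c, replicas
    return best_cost, best_placement

Since there are $(C_p^m)^n$ placements, this enumeration is feasible only for the small Data Grids used in this comparison.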

Local Greedy Algorithm and Random Algorithm. We also compare Greedy with two other intuitive replication algorithms: the Local Greedy Algorithm and the Random Algorithm. Local Greedy replicates each site's most frequently requested files in its local storage. Random proceeds in rounds; in each round, a random file is selected and placed at a randomly chosen non-full site that does not already have a copy of that file. The algorithm stops when all the storage space of all sites is occupied.

In most comparisons of centralized algorithms, we assume that all file sizes are equal and all edge bandwidths are equal. Since the total data file access cost between sites is defined as the total transmission time spent to obtain the data file, which is equal to

number of hops between the two sites × file size / bandwidth,

we can use the number of hops to approximate the file access cost. However, we also show that even in heterogeneous scenarios wherein the file sizes and bandwidths are not equal, our centralized and distributed algorithms still perform very well (Figure 5 and Figure 13).

Recall that each data file has a uniform size of 1. For all the algorithms, initially the $p$ files are randomly placed in some source sites; for the same comparison, the initial file placements are the same for all algorithms. Recall that for source sites, their storage capacities are those that remain after storing their original data files. Each data point in the plots below is an average over five different initial placements of the $p$ files. We assume that the number of times each site accesses each file is a random number between 1 and 10.

In all plots, we show error bars indicating the 95 percent confidence interval.

Varying Number of Data Files. First, we vary the number of data files in the Data Grid while fixing the number of sites and the storage capacity of each site. Due to the high time complexity of the optimal algorithm, we consider a small Data Grid with $n = 10$ and $m = 5$, and we vary $p$ to be 5, 10, 15, and 20. The comparison of the four algorithms is shown in Figure 3 (a). In terms of the total access cost, we observe that Optimal < Greedy < Local Greedy < Random. That is, among all the algorithms, Optimal performs the best since it exhausts all the replication possibilities, but Greedy performs quite close to Optimal. Furthermore, we observe that Greedy performs better than Local Greedy and Random, which demonstrates that it is indeed a good replication algorithm. The figure also shows that when the storage capacity of each site is five and the total number of files is five, Greedy, like the other three algorithms, eventually replicates all five files at each Grid site, resulting in zero total access cost.

Varying Storage Capacity. Figure 3 (b) shows the results of varying the storage capacity of each site while keeping the number of data files and the number of sites unchanged. We choose $n = 10$ and $p = 10$, and vary $m$ from 1, 3, 5, 7, to 9 units. It is obvious that with the increase of the memory capacity of each Grid site, the total access cost decreases, since more file replicas are able to be placed into the Data Grid. Again, we observe that Greedy performs comparably to Optimal, and both perform better than Local Greedy and Random.

Varying Number of Grid Sites. Figure 3 (c) shows the result of varying the number of sites in the Data Grid. We consider a small Data Grid, choose $p = 10$ and $m = 5$, and vary $n$ as 5, 10, 15, 20, or 25. The optimal algorithm performs only slightly better than the greedy algorithm, in the range of 15% to 20%.

Comparison in Large Scale. We study the scalability of the different algorithms by considering a Data Grid at large scale, with 500 to 2,000 data files, the storage capacity of each site varying from 10 to 90, and the number of Grid sites varying from 10 to 50. Figure 4 shows the performance comparison of Greedy, Local Greedy, and Random. Again, it shows that Greedy performs the best among the three algorithms.

Fig. 4. Performance comparison of the Greedy, Local Greedy, and Random algorithms in large scale: total access cost under (a) varying number of data files, (b) varying storage capacity, and (c) varying number of Grid sites. Unless varied, the number of sites is 30, the number of data files is 1,000, and the storage capacity of each site is 50. Data file size is 1.

Fig. 5. Performance comparison of the Optimal, Greedy, Local Greedy, and Random algorithms with varying data file sizes: total access cost under (a) varying number of data files, (b) varying storage capacity, and (c) varying number of Grid sites. Unless varied, the number of sites is 10, the number of data files is 10, and the storage capacity of each site is 5. Data file sizes vary from 1 to 10.

Fig. 6. Illustration of the simulation topology: (a) Cascading Replication [39]; (b) simulation topology. Tier 3 has 32 sites; we do not show all of them due to space limit.

Comparison with Varying Data File Size. Even though Theorem 1 is valid only for uniform data size, we show experimentally that our greedy algorithm also achieves good system performance for varying data file sizes. Figure 5 shows the performance comparison of Optimal, Greedy, Local Greedy, and Random, wherein each data file size is a random number between 1 and 10. Greedy again performs the best among the three heuristic algorithms. It can be seen that, due to the non-uniform data sizes, the access cost is no longer zero for all four algorithms when there are five files and each site has a storage capacity of five.

B. Distributed Algorithm versus Replication Strategies by Ranganathan and Foster [39]

In this section, we compare our distributed algorithm with the replication strategy proposed by Ranganathan and Foster [39]. First, we give an overview of their strategies, and then we present the simulation environment and discuss the comparison simulation results.

1) Replication Strategies by Ranganathan and Foster: Ranganathan and Foster study data replication in a hierarchical Data Grid model (see Figure 6). The hierarchical model, represented as a tree topology, has been used in the LHCGrid project [2], which serves the scientific collaborations performing experiments at the Large Hadron Collider (LHC) at CERN. In this model, there is a tier 0 site at CERN, and there are regional sites, which are from the different institutes in the collaboration. For comparison, we also use this hierarchical model and assume that all data files are at the CERN site initially.⁴ Like [2], we assume that only the leaf sites execute jobs. When the needed data files are not stored locally, a site always goes one level higher for the data. If the data is found, it fetches the data back to the leaf site; otherwise, it goes one level higher again until it reaches the CERN site.

Ranganathan and Foster present six different replication strategies: (1) No Replication or Caching; (2) Best Client, whereby a replica is created at the best client site (the site that has the largest number of requests for the file); (3) Cascading Replication, whereby once the popularity of a file exceeds a threshold in a given time interval, a replica is created at the next level on the path to the best client; (4) Plain Caching, whereby the client that requests the file stores a copy of the file locally; (5) Caching plus Cascading Replication, which combines the Plain Caching and Cascading Replication strategies; and (6) Fast Spread, whereby replicas of the file are created at each site along its path to the client. All of the above strategies are evaluated with three user access patterns: (1) Random Access: no locality in patterns; (2) Temporal Locality: data accesses that contain a small degree of temporal locality (recently accessed files are likely to be accessed again); and (3) Geographical plus Temporal Locality: data accesses containing a small degree of temporal and geographical locality (files recently accessed by a site are likely to be accessed again by a close-by site). Their simulation results show that Caching plus Cascading works better than the others in most cases. In our work, we compare our strategy with Caching plus Cascading (referred to as Cascading in the simulation) under Geographical plus Temporal Locality.

Cascading Replication Strategy. Figure 6 (a) shows in detail how Cascading works. There is a file F1 at the root site. When the number of requests for file F1 exceeds the threshold at the root, a replica copy of F1 is sent to the next layer on the path to the best client. Eventually the threshold is exceeded at the next layer as well, and a replica copy of F1 is sent to Client C.

TABLE I
SIMULATION PARAMETERS FOR HIERARCHICAL TOPOLOGY.

Description                              Value
Number of sites                          43
Number of files                          1,000 - 5,000
Size of each file                        2 GB
Available storage of each site           100 - 500 GB
Number of jobs executed by each site     400
Number of input files of each job        1 - 10
Link bandwidth                           100 MB/s

2) Simulation Setup: In the simulation, we use the GridSim simulator [43] (version 4.1). The GridSim toolkit allows modeling and simulation of entities in parallel and distributed computing environments, for system users, applications, resources, and resource brokers. Version 4.1 of GridSim incorporates a Data Grid extension to model and simulate Data Grids. In addition, GridSim offers extensive support for the design and evaluation of scheduling algorithms, which suits the needs of our future work, wherein a more comprehensive framework of data replication and job scheduling will be studied. As a result, we use GridSim in our simulation.

⁴ However, our Data Grid model is not limited to a tree-like hierarchical model. In Section VI-C, we use a general graph model and show that our distributed replication algorithm performs better than the LRU/LFU algorithms implemented in OptorSim [10].

Figure 6 (b) shows the topology used in our simulation. Similar to the one in [39], there are four tiers in the Grid, with all data being produced at the topmost tier 0 (the root). Tier 1 consists of the regional centers of different continents; here we consider two. The next tier is composed of different countries. The final tier consists of the participating institution sites. The Grid contains a total of 43 sites, 32 of which execute tasks and generate requests.

Table I shows the simulation parameters, most of which are from [39] for the purpose of comparison. The CERN site originally generates and stores 1,000 to 5,000 different data files, each of which is 2 GB. Each site (except for the CERN site) dynamically generates and executes 400 jobs, and each job needs 1 to 10 files as input files. To execute each job, the site first checks whether the input files are in its local storage. If not, it then goes to the nearest replica site to get each file. The available storage capacity at each site varies from 100 GB to 500 GB. The link bandwidth is 100 MB/s. Our simulation is run on a DELL desktop with a Core 2 Duo CPU and 1 GB RAM.

We emphasize that the above parameters take scaling into account. Usually, the amount of data in a Data Grid system is on the order of petabytes. To enable simulation of such a large amount of data, a scale of 1:1000 is considered. That is, the number of files in the system and the storage capacity of each site are both reduced by a factor of 1,000.

Data File Access Patterns. Each job has one to ten input files, as noted above. For the input files of each job, we adopt two file access patterns: a random access pattern and a tempo-spatial access pattern. In the random access pattern, the input files of each job are randomly chosen from the entire set of data files available in the Data Grid. In the tempo-spatial access pattern, the input data files required by each job follow a Zipf distribution [51], [9] and a geographic distribution, as explained below.

• In the Zipf distribution (the temporal access part), for the $p$ data files, the probability for each job to access the $j$th ($1 \le j \le p$) data file is $P_j = \frac{1/j^{\theta}}{\sum_{h=1}^{p} 1/h^{\theta}}$, where $0 \le \theta \le 1$. When $\theta = 1$, the above distribution follows the strict Zipf distribution, while for $\theta = 0$, it follows the uniform distribution. We choose $\theta$ to be 0.8 based on real web trace studies [9].

• In the geographic distribution (the spatial access part), the data file access frequencies of jobs at a site depend on its geographic location in the Grid, such that sites that are in proximity have similar data file access frequencies. In our case, leaf sites close to each other have similar data file access patterns. Specifically, suppose there are $n$ client sites, numbered $1, 2, \ldots, n$, where client 1 is close to client 2, which is close to client 3, etc. Suppose the total number of data files is $p$. If the accessed data ID is $id$ according to the original Zipf-like access pattern, then, for client $i$ ($1 \le i \le n$), the new data ID is $(id + p/n \cdot i) \bmod p$. In this way, neighboring clients should have similar, but not the same, access patterns, while far-away clients have quite different access patterns. A small sketch of this access-pattern generation is given right after this list.
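Below is a small Python sketch (illustrative, not the simulator's code) of one way to generate the tempo-spatial access pattern described in the two items above: file popularity follows the Zipf-like distribution with $\theta = 0.8$, and the per-client ID shift supplies the geographic part; the function names are invented for the example.

# Illustrative sketch (not the simulator's code): drawing a job's input-file IDs
# under the tempo-spatial pattern, i.e., Zipf-like popularity (theta = 0.8)
# combined with the per-client geographic ID shift described above.
import random

def zipf_probabilities(p: int, theta: float = 0.8):
    weights = [1.0 / (j ** theta) for j in range(1, p + 1)]
    total = sum(weights)
    return [x / total for x in weights]        # P_j for j = 1..p

def draw_file_id(client: int, n_clients: int, p: int, probs, rng=random) -> int:
    # Temporal part: pick a file ID according to the Zipf-like distribution.
    file_id = rng.choices(range(p), weights=probs, k=1)[0]
    # Spatial part: shift the ID so that nearby clients (similar indices)
    # favour overlapping, but not identical, sets of files.
    return (file_id + (p // n_clients) * client) % p

# Example: client 3 of 32 drawing one of 1,000 files
probs = zipf_probabilities(1000)
print(draw_file_id(client=3, n_clients=32, p=1000, probs=probs))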

Fig. 7. Performance comparison between our distributed algorithm and Cascading in the hierarchical topology: average file access time under (a) varying total number of files and (b) varying storage capacity of each site, for the Random, TS-Static, and TS-Dynamic access patterns. In (a), the storage capacity of each site is 300 GB; in (b), the number of data files is 3,000. Each data file size is 2 GB.

[Figure 8: plots of the percentage increase of file access time (%) for Distributed and Cascading. (a) Varying number of data files (1,000-5,000). (b) Varying storage capacity of each Grid site (100-500 GB).]

Fig. 8. Percentage increase of the average file access time due to dynamic access pattern.

[Figure 9: plots of average file access time (seconds) for Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns. (a) Varying total number of files (1,000-1,000,000). (b) Varying storage capacity of each site (1-1,000 TB).]

Fig. 9. Performance comparison between our distributed algorithm and Cascading in a typical Grid environment. In (a), the storage capacity of each site is 500 TB; in (b), the number of data files in the Grid is 500,000. Each data file size is 1 GB.


suppose there are n client sites, numbered 1, 2, ..., n, where client 1 is close to client 2, which is close to client 3, and so on. Suppose the total number of data files is p. If the accessed data ID is id according to the original Zipf-like access pattern, then, for client i (1 ≤ i ≤ n), the new data ID is (id + (p/n) · i) mod p. In this way, neighboring clients have similar, but not identical, access patterns, while far-away clients have quite different access patterns (a sketch of this request generator follows this list).
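As an illustration only (this code is ours, not the simulator's; the function names and the use of Python's random.choices are assumptions), the tempo-spatial request stream described above can be generated by combining a Zipf draw with the per-client ID shift:

    import random

    def zipf_weights(p, theta=0.8):
        # Unnormalized Zipf weights, proportional to 1/j^theta for files j = 1..p.
        return [1.0 / (j ** theta) for j in range(1, p + 1)]

    def next_request(client_i, n_clients, p, theta=0.8, rng=random):
        # Temporal part: draw a file index with Zipf-skewed probability.
        base_id = rng.choices(range(p), weights=zipf_weights(p, theta), k=1)[0]
        # Spatial part: shift the id space per client, so nearby clients overlap.
        return (base_id + (p // n_clients) * client_i) % p

    # Nearby clients (i = 1, 2) request similar id regions; a distant client (i = 16) does not.
    print(next_request(1, 32, 1000), next_request(2, 32, 1000), next_request(16, 32, 1000))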

To investigate the effects of dynamic user behaviors on the different algorithms, we further divide the tempo-spatial access pattern into two: static and dynamic. In the dynamic access pattern, the file access pattern of each site changes in the middle of the simulation, while in the static access pattern it does not change. In summary, our simulation uses the following three file access patterns: random access pattern (Random), static tempo-spatial access pattern (TS-Static), and dynamic tempo-spatial access pattern (TS-Dynamic).

3) Simulation Results and Analysis: Below we present our simulation results and analysis.

Varying Number of Data Files. In this part of the simulation, we vary the number of data files from 1,000 to 5,000 in steps of 1,000, while fixing the storage capacity of each Grid site at 300 GB. Figure 7 (a) shows the comparison of the average file access time of the two algorithms under all three access patterns. We note the following observations. First, for both algorithms under all three access patterns, the average file access time increases with the number of files. However, our distributed greedy data replication algorithm yields an average file access time three to four times lower than that obtained by Cascading under all three access patterns. Second, for both our distributed algorithm and Cascading, the average file access time under TS-Static is 20% to 30% smaller than that under the Random access pattern. This is because TS-Static incorporates both temporal and spatial access patterns, which benefit both replication algorithms. Figure 7 (a) also shows the performance of both algorithms under the TS-Dynamic access pattern, wherein the file access pattern of each site changes during the simulation. As shown, both our distributed replication algorithm and Cascading suffer from the file access pattern change, with increased average file access time compared with TS-Static. This is because it takes some time for both algorithms to obtain a good replica placement for the dynamic file access pattern. Under a static access pattern, the Data Grid eventually reaches a stable stage where the file placement at each site remains unchanged, thus giving a lower file access time. Figure 8 (a) also shows that our algorithm adapts to the access pattern change better than the Cascading algorithm: the percentage increase of the average file access time of our algorithm due to the dynamic access pattern is around 50%, while for Cascading it is around 100%.

Varying Storage Capacity of Each Site. We also vary the storage capacity of each Grid site from 100 to 500 GB in steps of 100 GB, while fixing the number of files in the Grid at 3,000. The performance comparison is shown in Figure 7 (b). As expected, as the storage capacity increases, the file access time decreases for both algorithms, since a larger storage capacity allows each site to store more replicated files for local access. We make the same observation that our distributed replication algorithm performs much better than Cascading under the Random and TS-Static/Dynamic access patterns. Moreover, for both our distributed algorithm and Cascading, the average file access time under Random is about 20% to 30% higher than that under TS-Static. Again, compared to the TS-Static pattern, the file access time of both algorithms increases under the TS-Dynamic pattern. However, as shown in Figure 8 (b), our distributed replication algorithm evidently adapts to the dynamic access pattern better than Cascading does, giving a smaller percentage increase of the average file access time.

TABLE II
SIMULATION PARAMETERS FOR TYPICAL GRID ENVIRONMENT.

Description                            Value
Number of sites                        100
Number of files                        1,000 - 1,000,000
Size of each file                      1 GB
Available storage of each site         1 TB - 1 PB
Number of jobs executed by each site   10,000
Number of input files of each job      1 - 10
Link bandwidth                         4000 MB/s

TABLE III
SIMULATION PARAMETERS FOR TYPICAL CLUSTER ENVIRONMENT.

Description                            Value
Number of nodes                        1,000
Number of files                        1,000 - 1,000,000
Size of each file                      1 GB
Available storage of each site         10 GB - 1 TB
Number of jobs executed by each site   1,000
Number of input files of each job      1 - 10
Link bandwidth                         100 MB/s

Comparison in Typical Grid Environment. In a typical Grid environment, there are usually only tens to hundreds of different sites, each with large storage (petabytes per site), and the network bandwidth between Grid sites can be 10 GB/s, or even as high as 100 GB/s. To simulate such an environment, we use the scaled parameters shown in Table II. Figures 9 (a) and (b) show the performance comparison of our distributed algorithm and Cascading when varying the total number of files (while fixing the storage capacity at 500 TB) and the storage capacity of each site (while fixing the number of data files at 500,000), respectively. Below are the observations. First, when there are 1,000, 50,000, or 200,000 files in the Grid (Figure 9 (a)), or when the storage capacity is 600 TB, 800 TB, or 1,000 TB (Figure 9 (b)), the performance of both algorithms under the same access pattern, and the performance of the same algorithm under different access patterns, are very similar. This is because, in both cases, each site has enough space to store all of its requested files; therefore, no cache replacement is involved. As such, even with different access patterns, each site works the same way with both algorithms: requesting the input files for each job, storing the received input files in local


[Figure 10: plots of average file access time (seconds) for Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns. (a) Varying total number of files (1,000-1,000,000). (b) Varying storage capacity of each node (1-1,000 GB).]

Fig. 10. Performance comparison between our distributed algorithm and Cascading in a typical cluster environment. In (a), the storage capacity of each node is 500 GB; in (b), the number of data files in the cluster is 500,000. Each data file size is 1 GB.

[Figure 11: plots of average file access time (seconds) for Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns. (a) Varying total number of files (1,000-1,000,000). (b) Varying storage capacity of each node (1-1,000 GB).]

Fig. 11. Performance comparison between our distributed algorithm and Cascading in a cluster environment with full connectivity. In (a), the storage capacity of each node is 500 GB; in (b), the number of data files in the cluster is 500,000. Each data file size is 1 GB.

storage, and executing the next job. Otherwise, we observe performance differences similar to those shown in Figure 7. Second, in the cases where performance does differ, the percentage increase of access time due to the dynamic pattern is very small for our distributed algorithm, around 5% to 8%. This shows that our algorithm adjusts well to the dynamic access pattern in a typical Grid environment. Third, Figure 9 (b) shows that with increased storage capacity at each site, the performance difference between our distributed algorithm and Cascading shrinks for all three access patterns. For example, when the storage capacity is 1 TB, Cascading yields more than twice the file access time of our distributed algorithm, while at 400 TB it costs about 40% more access time than our distributed algorithm. This shows that in the more storage-constrained scenarios, our caching algorithm is the better mechanism for reducing file access time and thus job execution time.

Comparison in Typical Cluster Environment. In a typical cluster environment, a site often has 1,000 to 10,000 nodes (note the difference between sites and nodes, as explained in Section III), each with small local storage in the range of 10 GB to 1 TB. The network bandwidth within a cluster is typically 1 GB/s. We use the parameters in Table III to simulate the cluster environment. Figure 10 shows the performance comparison of our distributed algorithm and Cascading under the different access patterns. We observe that for our distributed algorithm, the percentage increase of access time due to the dynamic pattern is relatively large, around 40% to 100%. This shows that our algorithm does not adjust well to the dynamic access pattern in a typical cluster environment. This is because in a cluster, the diameter of the network (the number of hops in the longest shortest path between two cluster nodes) is much larger than in the typical Grid environment, and the bandwidth in a cluster is much smaller than in Grids. As a result, when the file access pattern of each site changes in the middle of the simulation, it takes a longer time to propagate such information to other nodes.


TABLE IV
STORAGE CAPACITY OF EACH SITE IN EU DATA GRID TESTBED 1 [10]. IC IS IMPERIAL COLLEGE.

Site          Bologna  Catania  CERN   IC  Lyon  Milano  NIKHEF  NorduGrid  Padova  RAL  Torino
Storage (GB)       30       30  10000  80    50      50      70         63      50   50      50

Therefore, the nearest-replica catalog maintained by each node could get outdated, causing frequent cache misses for needed input files.

Comparison in a Cluster Environment with Full Connectivity. We consider a cluster with a fully connected set of nodes and show the simulation results in Figure 11. There are two major observations. First, comparing Figure 9 with Figure 11, we observe that the average file access time in a cluster with full connectivity is lower than that of a Grid. This is because, even though a typical Grid bandwidth might be higher than that of a cluster (4000 MB/s versus 100 MB/s, as shown in Tables II and III), there are far fewer point-to-point links in a Grid of 100 sites than in a cluster with a fully connected set of 1,000 nodes (around 5 × 10^3 edges versus 5 × 10^5 edges). Therefore, the full network bisection bandwidth in a cluster is higher than that of a Grid (5 × 10^5 × 100 = 5 × 10^7 MB/s versus 5 × 10^3 × 4000 = 2 × 10^7 MB/s). Second, we observe that for all three access patterns, our distributed algorithm performs the same as Cascading. This is because with full connectivity, each site can get the needed data files directly from the CERN site; therefore, each site is unable to observe the data access traffic of other sites and make intelligent caching decisions. Both our distributed caching algorithm and the Cascading algorithm boil down to the same LRU/LFU cache replacement algorithm. In a fully connected environment, our distributed algorithm loses its advantage as an effective caching algorithm, like any other distributed caching algorithm.

Fig. 12. EU Data Grid Testbed 1 [10]. Numbers indicate the bandwidth between two sites.

C. Distributed Algorithm versus LRU/LFU in General Topology

OptorSim [10] is a Grid simulator that simulates a general-topology Grid testbed for data intensive high energy physics experiments. Three replication algorithms are implemented in OptorSim: (i) LRU (Least Recently Used), which deletes the files that have been used least recently; (ii) LFU (Least Frequently Used), which deletes the files that have been used least frequently in the recent past; and (iii) an economic model in which sites "buy" and "sell" files using an auction mechanism, and only delete files if they are less valuable than the new file.
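For concreteness, LRU replacement can be sketched as follows (a generic illustration, not OptorSim's actual implementation); LFU differs only in evicting the file with the lowest recent access count instead of the oldest access:

    from collections import OrderedDict

    class LRUFileCache:
        # Generic LRU replacement for equally sized files (illustrative only).
        def __init__(self, capacity_files):
            self.capacity = capacity_files
            self.files = OrderedDict()               # order = recency of access

        def access(self, file_id):
            if file_id in self.files:
                self.files.move_to_end(file_id)      # most recently used moves to the end
                return True                          # hit
            if len(self.files) >= self.capacity:
                self.files.popitem(last=False)       # evict the least recently used file
            self.files[file_id] = True               # store the newly fetched file
            return False                             # miss

    cache = LRUFileCache(capacity_files=2)
    for f in ["a", "b", "a", "c", "b"]:
        print(f, "hit" if cache.access(f) else "miss")
    # Accessing "c" evicts "b" (least recently used), so the final "b" access is a miss.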

We compare our distributed algorithm with LRU and LFU in a general topology because previous empirical research has found that, even though the LRU/LFU strategies are extremely simple, they are very difficult to improve upon. The topology we use is the simulated topology of EU DataGrid TestBed 1 [10], shown in Figure 12. The storage capacity of each site is given in Table IV. Since the simulation results of LRU and LFU are quite similar, we only show the results of LRU. Figures 13 (a) and (b) show the performance comparison of our distributed algorithm and LRU when varying the total number of files and the storage capacity of each site, respectively. Our distributed algorithm outperforms LRU in most cases. However, the performance difference is not as significant as that between our distributed algorithm and Cascading. In particular, under the TS-Dynamic access pattern, when the storage capacity of each site is 400 GB or 500 GB, the average file access time of LRU is even slightly smaller than that of our distributed algorithm.

VII. Conclusions and Future Work

In this article, we study how to replicate data files in data intensive scientific applications to reduce the file access time, taking into consideration the limited storage space of Grid sites. Our goal is to effectively reduce the access time of the data files needed for job executions at Grid sites. We propose a centralized greedy algorithm with a performance guarantee and show that it performs comparably with the optimal algorithm. We also propose a distributed algorithm wherein Grid sites react closely to the Grid status and make intelligent caching decisions. Using GridSim, a distributed Grid simulator, we demonstrate that the distributed replication technique significantly outperforms a popular existing replication technique, and it is more adaptive to the dynamic change of file access patterns in Data Grids.

We plan to further develop our work in the following directions:

• As ongoing and future work, we are exploiting the synergies between data replication and job scheduling to achieve better system performance. Data replication and job scheduling are two different but complementary


[Figure 13: plots of average file access time (seconds) for Distributed and Cascading/LRU under the Random, TS-Static, and TS-Dynamic access patterns. (a) Varying total number of files (1,000-5,000). (b) Varying storage capacity of each site (100-500 GB).]

Fig. 13. Performance comparison between our distributed algorithm and LRU in general topology.

functions in Data Grids: one minimizes the total file access cost (and thus the total job execution time of all sites), and the other minimizes the makespan (the maximum job completion time among all sites). Optimizing both objective functions in the same framework is a very difficult (if not infeasible) task. There are two main challenges: first, how to formulate a problem that incorporates not only data replication but also job scheduling, and that addresses both total access cost and maximum access cost; and second, how to find an efficient algorithm that, if it cannot find optimal solutions minimizing the total/maximum access cost, gives near-optimal solutions for both objectives. These two challenges remain largely unanswered in the current literature.

• We plan to design and develop data replication strategies in scientific workflow [31] and large-scale cloud computing environments [24]. We will also pursue how provenance information [15], the derivation history of data files, can be exploited to improve the intelligence of data replication decision making.

• A more robust dynamic model and replication algorithm will be developed. Currently, each site observes the data access traffic for a sufficiently long time window, which is set in an ad hoc manner. In the future, a site should dynamically decide such an "observing window" depending on the traffic it is observing.

REFERENCES

[1] The Large Hadron Collider. http://public.web.cern.ch/Public/en/LHC/LHC-en.html.
[2] LCG Computing Fabric. http://lcg-computing-fabric.web.cern.ch/lcg-computing-fabric/.
[3] Worldwide LHC Computing Grid. http://lcg.web.cern.ch/LCG/.
[4] A. Aazami, S. Ghandeharizadeh, and T. Helmi. Near optimal number of replicas for continuous media in ad-hoc networks of wireless devices. In Proc. of International Workshop on Multimedia Information Systems, 2004.
[5] B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, S. Tuecke, and I. Foster. Secure, efficient data transport and replica management for high-performance data-intensive computing. In Proc. of IEEE Symposium on Mass Storage Systems and Technologies, 2001.
[6] I. Baev and R. Rajaraman. Approximation algorithms for data placement in arbitrary networks. In Proc. of ACM-SIAM Symposium on Discrete Algorithms (SODA 2001).
[7] I. Baev, R. Rajaraman, and C. Swamy. Approximation algorithms for data placement problems. SIAM J. Comput., 38(4):1411-1429, 2008.
[8] W. H. Bell, D. G. Cameron, R. Carvajal-Schiaffino, A. P. Millar, K. Stockinger, and F. Zini. Evaluation of an economy-based file replication strategy for a data grid. In Proc. of International Workshop on Agent Based Cluster and Grid Computing (CCGrid 2003).
[9] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In Proc. of IEEE International Conference on Computer Communications (INFOCOM 1999).
[10] D. G. Cameron, A. P. Millar, C. Nicholson, R. Carvajal-Schiaffino, K. Stockinger, and F. Zini. Analysis of scheduling and replica optimisation strategies for data grids using OptorSim. Journal of Grid Computing, 2(1):57-69, 2004.
[11] M. Carman, F. Zini, L. Serafini, and K. Stockinger. Towards an economy-based optimization of file access and replication on a data grid. In Proc. of International Workshop on Agent Based Cluster and Grid Computing (CCGrid 2002).
[12] A. Chakrabarti and S. Sengupta. Scalable and distributed mechanisms for integrated scheduling and replication in data grids. In Proc. of the 10th International Conference on Distributed Computing and Networking (ICDCN 2008).
[13] R.-S. Chang and H.-P. Chang. A dynamic data replication strategy using access-weight in data grids. Journal of Supercomputing, 45:277-295, 2008.
[14] R.-S. Chang, J.-S. Chang, and S.-Y. Lin. Job scheduling and data replication on data grids. Future Generation Computer Systems, 23(7):846-860.
[15] A. Chebotko, X. Fei, C. Lin, S. Lu, and F. Fotouhi. Storing and querying scientific workflow provenance metadata using an RDBMS. In Proc. of the IEEE International Conference on e-Science and Grid Computing (e-Science 2007).
[16] A. Chervenak, E. Deelman, M. Livny, M.-H. Su, R. Schuler, S. Bharathi, G. Mehta, and K. Vahi. Data placement for scientific applications in distributed environments. In Proc. of IEEE/ACM International Conference on Grid Computing (Grid 2007).
[17] A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, and B. Moe. Wide area data replication for scientific collaboration. In Proc. of IEEE/ACM International Workshop on Grid Computing (Grid 2005).
[18] A. Chervenak, R. Schuler, M. Ripeanu, M. A. Amer, S. Bharathi, I. Foster, and C. Kesselman. The Globus replica location service: Design and experience. IEEE Transactions on Parallel and Distributed Systems, 20:1260-1272, 2009.
[19] N. N. Dang and S. B. Lim. Combination of replication and scheduling in data grids. International Journal of Computer Science and Network Security, 7(3):304-308, March 2007.


[20] D. Dullmann and B. Segal. Models for replica synchronisation and consistency in a data grid. In Proc. of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001).
[21] J. Rehn et al. PhEDEx: High-throughput data transfer management system. In Proc. of Computing in High Energy and Nuclear Physics (CHEP 2006).
[22] I. Foster. The Grid: A new infrastructure for 21st century science. Physics Today, 55:42-47, 2002.
[23] I. Foster and K. Ranganathan. Decoupling computation and data scheduling in distributed data-intensive applications. In Proc. of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 2002).
[24] I. Foster, Y. Zhao, I. Raicu, and S. Lu. Cloud computing and grid computing 360-degrees compared. In Proc. of IEEE Grid Computing Environments, in conjunction with IEEE/ACM Supercomputing, pages 1-10, 2008.
[25] C. Intanagonwiwat, R. Govindan, and D. Estrin. Directed diffusion: A scalable and robust communication paradigm for sensor networks. In Proc. of ACM International Conference on Mobile Computing and Networking (MOBICOM 2000).
[26] J. C. Jacob, D. S. Katz, T. Prince, G. B. Berriman, J. C. Good, A. C. Laity, E. Deelman, G. Singh, and M.-H. Su. The Montage architecture for grid-enabled science processing of large, distributed datasets. In Proc. of the Earth Science Technology Conference, 2004.
[27] S. Jiang and X. Zhang. Efficient distributed disk caching in data grid management. In Proc. of IEEE International Conference on Cluster Computing (Cluster 2003).
[28] S. Jin and L. Wang. Content and service replication strategies in multi-hop wireless mesh networks. In Proc. of ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM 2005).
[29] H. Lamehamedi, B. K. Szymanski, and B. Conte. Distributed data management services for dynamic data grids, unpublished.
[30] M. Lei, S. V. Vrbsky, and X. Hong. An on-line replication strategy to increase availability in data grids. Future Generation Computer Systems, 24:85-98, 2008.
[31] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, and J. Hua. A reference architecture for scientific workflow management systems and the VIEW SOA solution. IEEE Transactions on Services Computing, 2(1):79-92, 2009.
[32] M. Mineter, C. Jarvis, and S. Dowers. From stand-alone programs towards grid-aware services and components: A case study in agricultural modelling with interpolated climate data. Environmental Modelling and Software, 18(4):379-391, 2003.
[33] S. M. Park, J. H. Kim, Y. B. Lo, and W. S. Yoon. Dynamic data grid replication strategy based on internet hierarchy. In Proc. of the Second International Workshop on Grid and Cooperative Computing (GCC 2003).
[34] J. Perez, F. García-Carballeira, J. Carretero, A. Calderon, and J. Fernandez. Branch replication scheme: A new model for data replication in large scale data grids. Future Generation Computer Systems, 26(1):12-20, 2010.
[35] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the placement of web server replicas. In Proc. of IEEE Conference on Computer Communications (INFOCOM 2001).
[36] I. Raicu, I. Foster, Y. Zhao, P. Little, C. Moretti, A. Chaudhary, and D. Thain. The quest for scalable support of data intensive workloads in distributed systems. In Proc. of the ACM International Symposium on High Performance Distributed Computing (HPDC 2009).
[37] I. Raicu, Y. Zhao, I. Foster, and A. Szalay. Accelerating large-scale data exploration through data diffusion. In Proc. of the International Workshop on Data-aware Distributed Computing (DADC 2008).
[38] A. Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R. Sakellariou, K. Vahi, K. Blackburn, D. Meyers, and M. Samidi. Scheduling data-intensive workflows onto storage-constrained distributed resources. In Proc. of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2007).
[39] K. Ranganathan and I. T. Foster. Identifying dynamic replication strategies for a high-performance data grid. In Proc. of the Second International Workshop on Grid Computing (GRID 2001).
[40] A. Rodriguez, D. Sulakhe, E. Marland, N. Nefedova, M. Wilde, and N. Maltsev. Grid enabled server for high-throughput analysis of genomes. In Proc. of Workshop on Case Studies on Grid Applications, 2004.
[41] F. Schintke and A. Reinefeld. Modeling replica availability in large data grids. Journal of Grid Computing, 2(1):219-227, 2003.
[42] H. Stockinger, A. Samar, K. Holtman, B. Allcock, I. Foster, and B. Tierney. File and object replication in data grids. In Proc. of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001).
[43] A. Sulistio, U. Cibej, S. Venugopal, B. Robic, and R. Buyya. A toolkit for modelling and simulating data grids: An extension to GridSim. Concurrency and Computation: Practice and Experience, 20(13):1591-1609, 2008.
[44] B. Tang, S. R. Das, and H. Gupta. Benefit-based data caching in ad hoc networks. IEEE Transactions on Mobile Computing, 7(3):289-304, March 2008.
[45] M. Tang, B.-S. Lee, C.-K. Yeo, and X. Tang. Dynamic replication algorithms for the multi-tier data grid. Future Generation Computer Systems, 21:775-790, 2005.
[46] M. Tang, B.-S. Lee, C.-K. Yeo, and X. Tang. The impact of data replication on job scheduling performance in the data grid. Future Generation Computer Systems, 22:254-268, 2006.
[47] U. Cibej, B. Slivnik, and B. Robic. The complexity of static data replication in data grids. Parallel Computing, 31(8-9):900-912, 2005.
[48] S. Venugopal and R. Buyya. An SCP-based heuristic approach for scheduling distributed data-intensive applications on global grids. Journal of Parallel and Distributed Computing, 68:471-487, 2008.
[49] S. Venugopal, R. Buyya, and K. Ramamohanarao. A taxonomy of data grids for distributed data sharing, management, and processing. ACM Computing Surveys, 38(1), 2006.
[50] X. You, G. Chang, X. Chen, C. Tian, and C. Zhu. Utility-based replication strategies in data grids. In Proc. of Fifth International Conference on Grid and Cooperative Computing (GCC 2006).
[51] G. K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, 1949.


Dharma Teja Nukarapu received the BE degree in Information Technology from Jawaharlal Nehru Technological University, India, in 2007, and the MS degree from the Department of Electrical Engineering and Computer Science, Wichita State University, in 2009. He is currently a Ph.D. student in the Department of Electrical Engineering and Computer Science at Wichita State University. His research interests are data caching, replication, and job scheduling in data intensive scientific applications.

Bin Tang received the BS degree in physics from Peking University, China, in 1997, the MS degrees in materials science and computer science from Stony Brook University in 2000 and 2002, respectively, and the PhD degree in computer science from Stony Brook University in 2007. He is currently an Assistant Professor in the Department of Electrical Engineering and Computer Science at Wichita State University. His research interest is the algorithmic aspects of data intensive sensor networks. He is a member of the IEEE.

Liqiang Wang is currently an Assistant Professor in the Department of Computer Science at the University of Wyoming. He received the BS degree in mathematics from Hebei Normal University, China, in 1995, the MS degree in computer science from Sichuan University, China, in 1998, and the PhD degree in computer science from Stony Brook University in 2006. His research interest is the design and analysis of parallel computing systems. He is a Member of the IEEE.

Shiyong Lu received the PhD degree in Computer Science from the State University of New York at Stony Brook in 2002, the ME from the Institute of Computing Technology of the Chinese Academy of Sciences at Beijing in 1996, and the BE from the University of Science and Technology of China at Hefei in 1993. He is currently an Associate Professor in the Department of Computer Science, Wayne State University, and the Director of the Scientific Workflow Research Laboratory (SWR Lab). His research interests include scientific workflows and databases. He has published more than 90 papers in refereed international journals and conference proceedings. He is the founder and currently a program co-chair of the IEEE International Workshop on Scientific Workflows (2007-2010), and an editorial board member for the International Journal of Semantic Web and Information Systems and the International Journal of Healthcare Information Systems and Informatics. He is a Senior Member of the IEEE.