
Noname manuscript No. (will be inserted by the editor)

Extending OAI-PMH over structured P2P networks for digital preservation

Everton F. R. Seara · Marcos S. Sunye · Luis C. E. Bona · Tiago Vignatti · Andre L. Vignatti · Anne Doucet

the date of receipt and acceptance should be inserted later

Abstract The Open Archives Initiative (OAI) allows both libraries and museums to create and share their own low-cost Digital Libraries (DL). OAI Digital Libraries are based on the OAI-PMH protocol which, although consolidated as a standard for disseminating metadata, addresses neither digital preservation nor availability of content, essential requirements in this type of system. Building new mechanisms that guarantee improvements in this scope, at no or low additional cost, is a great challenge. This paper proposes a distributed archiving system based on a P2P network that allows OAI-based libraries to replicate digital objects in order to ensure their reliability and availability. The proposed system both keeps and extends the current OAI-PMH protocol characteristics and is designed as a set of OAI repositories, where each repository has an independent failure probability assigned to it. Items are inserted with a desired reliability that is satisfied by replicating them in subsets of repositories. Communication between the nodes (repositories) of the network is organized in a distributed hash table, and multiple hash functions are used to select the repositories that will keep the replicas of each stored item. The OAI characteristics combined with a structured P2P digital preservation system make possible the construction of a reliable and totally distributed digital library. The archiving system was evaluated through experiments in a real environment, and the OAI-PMH extension was validated by the implementation of a proof-of-principle prototype.

E. F. R. Seara · M. S. Sunye · L. C. E. Bona · T. Vignatti
Federal University of Paraná, Department of Informatics - Curitiba - Brazil. E-mail: {rufino, sunye, bona, vignatti}@inf.ufpr.br

A. L. Vignatti
University of Campinas, Institute of Computing - Campinas - Brazil. E-mail: [email protected]

Anne Doucet
Université PARIS VI, LIP6, Département DAPA - Paris - France. E-mail: [email protected]

Keywords digital library · long-term preservation · digital archiving · peer-to-peer

1 Introduction

Since the amount of content on the World Wide Web has been increasing rapidly over the last years, users have come to rely on specialized tools to navigate through vast volumes of data, and several Information Retrieval systems and tools have been proposed to fill this need. The arrival of XML provided a structured organization of content and made possible a great growth of Digital Libraries (DL) in several areas.

In order to create a standard mechanism for information interoperability among Digital Library repositories, the Open Archives Initiative (OAI) [3] proposed the Protocol for Metadata Harvesting – OAI-PMH [18]. OAI Digital Libraries, also known as Data Providers (DP), are repositories responsible for both storing digital objects and exposing their structured metadata via the OAI-PMH framework. Each metadata record follows the Dublin Core schema [34] and describes a digital object. Service Providers then make OAI-PMH service requests to harvest metadata and offer a global search.

The OAI harvesting protocol is supported by many open source packages, such as EPrints and DSpace [2,32]. Nowadays, commercial library administration software like Virtua TLS [6] has become OAI-PMH compliant, and search engines, like Google, are collecting metadata from OAI repositories. An OAI federation is capable of offering a global search over a set of Digital Libraries that share their metadata.

In OAI-PMH each repository stores its digital objects independently. Although this approach keeps the protocol quite simple, because only metadata is shared, it does not address digital data preservation, i.e., digital objects can be easily lost if there is no external backup mechanism assigned to each repository. Moreover, the enormous need for disk space and the cost involved in using specialized hardware often make this solution economically unfeasible. An alternative to preserve digital content is therefore to replicate the information in multiple storage repositories consisting of conventional, low-cost computers [12].

In this context, we consider that OAI digital libraries share the same goal, the dissemination of digital content on the Internet, and can therefore collaborate to create a long-term data preservation environment that ensures the reliability and availability of their objects. To enable this collaboration, we propose an OAI-PMH extension – explained in detail in Section 3 – that keeps the current OAI-PMH protocol characteristics and organizes Data Providers over a distributed system, allowing digital objects to be replicated between nodes in the system.

It is important to note that in our extension we are interested in replicating only the objects' content, not their metadata, since the OAI-PMH harvesting mechanism already performs this job efficiently. However, considering that objects are replicated among data providers, which can store content from several other repositories, we created a new OAI-PMH verb on the service provider side, called GetRecord. This verb enables data providers to recover any object's metadata in the entire OAI federation.

Besides storing replicated objects, our OAI-PMH extension provides user-transparent access to them through a Round Robin DNS, which directs requests only to available nodes in the system. With this solution, objects can be recovered from the system even if the original data provider has failed.

Although the proposed extension is generically defined and can thus be carried out over any distributed system, Peer-to-Peer (P2P) networks appear as a promising approach to organize these kinds of systems, since they are highly scalable for the distribution and retrieval of data. At the same time, the replication mechanisms available in most P2P networks are not sufficient to ensure the preservation of data over a long period of time.

To solve this problem, we created a totally distributed P2P archiving system. In this system, OAI repositories are organized by a distributed hash table (DHT) [25,28,31,37] and multiple hash functions are used as the replication mechanism. The choice of structured P2P, instead of non-structured, is motivated by its scalability regarding the number of nodes. Moreover, the search algorithms of non-structured P2P networks often cannot locate rare items, which is unacceptable in our context. Multiple hash functions are used to perform a selection over a specific set of repositories, a feature the DHT model does not provide by itself. The system was evaluated through experiments in a real environment.

The P2P digital archiving system motivates the definition of a model for data replication, described in Section 4. Thus, we propose a replication model where a reliability metric is associated with each repository. This metric denotes the probability that the data will not be lost or damaged in a given period of time. Furthermore, each item (digital information) needs to be stored with a desired reliability that reflects its importance, allowing high or low durability (preservation time). To ensure the desired reliability of an item, several replicas of it are created in different repositories. The number of replicas needed to preserve a specific item is determined by the reliability metric of each repository. This allows an optimization of network resource usage compared to other systems where the number of replicas is fixed [21,26]; however, more elaborate replication strategies must be used in this case. Another contribution of this work is to present and compare three different strategies for the reliable replication problem.

Fig. 1 Repositories labeled with independent reliabilities.

For example, Figure 1 shows a network with five repositories, identified by 1, 2, 3, 4 and 5, with reliabilities of 40%, 80%, 30%, 60% and 25%, respectively. Suppose that a user wants to insert an item in the network with a desired reliability of 90%. A simple tactic would replicate this item in the order of the repository identifiers until it achieves the desired reliability. Thus, with one replica in repository 1, the item is guaranteed a reliability of 40%. Adding a replica in repository 2, the guaranteed reliability would be $1 - (0.6 \times 0.2) = 88\%$, still not reaching the desired reliability of 90%. With an additional replica in repository 3, the guaranteed reliability would be $1 - (0.6 \times 0.2 \times 0.7) = 91.6\%$, therefore reaching the desired reliability of the item.

Contributions and Results Obtained: This paper presents an OAI-PMH extension that defines some new functionalities for both Service and Data Providers to enable digital content preservation in an OAI federation, and uses a Round Robin DNS to provide user-transparent access to the objects. This extension allows users to see the entire system as a single server, and to transparently insert and retrieve objects from the P2P archiving system. Moreover, the proposed OAI extension enables data providers to retrieve an object's metadata from service providers, making it possible to gain access to the object even if its original data provider has crashed. The implementation of a proof-of-principle prototype showed full integration between the OAI extension and the P2P archiving system, ensuring reliability and availability in OAI digital libraries.

We also present a model for the reliable replication of data in digital archiving systems with immutable data. To cope with scalability, we create a structured P2P network for reliable archiving, built over the proposed model. We also present three strategies for reliable replication. Experiments in real environments have been conducted to evaluate the P2P digital archiving system.

In the proposed model, independent probabilities of failure are associated with each storage repository. These probabilities allow items to be stored according to their preservation time needs. Moreover, the repositories have different storage capacities, i.e., the repositories are heterogeneous. Thus, due to limited storage capacity, we aim to insert the maximum number of items in the network while always satisfying the desired reliability of the items.

To maximize the number of items inserted in the network, we designed the strategies with two (heuristic) goals: balance the load among the repositories and, at the same time, minimize the total number of replicas created. This is justified by two facts: (i) balancing the load, as well as minimizing the replicas created, avoids overloading the repositories. Furthermore, (ii) if load balancing is not performed, a repository can easily become completely full and, consequently, the number of repositories that can be selected decreases, thus decreasing the number of available options that meet the desired reliability of an item. This, in turn, forces the creation of a larger number of replicas to satisfy the desired reliability of an item, and hence the total number of items that can be inserted in the network decreases due to the limited capacity of the repositories.

In the case of the heterogeneous repositories that we consider, balancing the load and minimizing the replicas do not necessarily imply maximizing the number of items inserted in the network, due to the different capacities of each repository. Even so, we pursue these two goals, expecting the above arguments to hold approximately. Indeed, experimental results of the simulations presented in [33] confirm that these are good goals for the heterogeneous case.

In Section 4.2, we present three strategies for creating replicas. All these strategies aim to optimize the number of replicas created and the load balancing. However, each strategy is based on different arguments to justify its use. The Randomized strategy has load balancing as its main motivation, justified by theoretical results on balls and bins [24]. Also, for this strategy, we obtained a theoretical bound on the number of replicas created. The Greedy over Subset strategy has the minimization of the replicas created as its main motivation, but adjustments are made to achieve a good load balancing. The Ideal Subset strategy solves the problem without giving priority to minimizing the replicas or balancing the load, acting as a "mean" between the two goals.

Through simulations, presented in [33], we found that among the three strategies the Ideal Subset strategy proved to be the most effective in the creation of replicas and load balancing, and also obtained the highest number of inserted objects in the network. This strategy has a computational complexity that would not be feasible in practice if not carefully formulated.

Due to their high scalability for the distribution and retrieval of data, structured P2P networks appear as natural candidates to implement the proposed model. Thus, in Section 5, we present a reliable P2P archiving system, which is one of the main contributions of this work. Structured P2P networks have an inherent difficulty in their method of routing information, which complicates the selection of specific nodes (repositories) for storage. This problem is overcome by using multiple hash functions, which is the main contribution of that section. We describe in detail the algorithms that use multiple hash functions for the basic operations of the network. The implementation of the P2P digital archiving system was evaluated in a real environment, where digital objects were inserted in a network to stress their preservation time as a function of their importance.

Organization: Section 2 presents related work and its differences from ours. In Section 3 we give a brief background on the OAI-PMH framework and explain in detail the architecture of the OAI extension and how it inserts and recovers objects from the system. Section 4 presents the replication model for archiving systems and the strategies used for creating replicas. In Section 5, we present the reliable P2P archiving system and evaluate it in a real environment. Conclusions and future work are presented in Section 6.

2 Related Work

The usage of P2P networks to create distributed digital libraries has been addressed in several works [8–10,36]. In general, the goal of these approaches is to create mechanisms to search content in distributed repositories.

Freelib [10] is a standalone client that, once installed on a user's machine, connects to a P2P network and automatically creates connections among people who share common interests, i.e., have similar searches. Similarly, Flexible CAN [36] is used for information management and retrieval and detects groups of users with the same interest, creating communities and allowing them to share their resources.

OAI-P2P [9] proposes a P2P network formed by data providers in order to support distributed search over all connected metadata repositories. In this solution, as our work also proposes, data providers are organized in a P2P network, but no solution to preserve digital content is proposed.

When considering digital preservation and long-term archiving environments, it is very important to emphasize that information is not updated or removed [22], i.e., a replica of an object is immutable in the system. Thus, in this work we are not interested in strategies and mechanisms for replica updates.

Traditional solutions for backup or data storage, such as replicated file systems and RAID disks [19], do not provide the degree of autonomy that P2P networks do. Unlike the P2P archiving system proposed in this work, these systems use a centralized control to manage the content that needs to be replicated. P2P systems for file sharing such as Kazaa, eDonkey and Gnutella [23] focus on searching resources in dynamic collections, and are not concerned with the reliability of the information. In such systems, a file is replicated every time it is copied to a new node.

Other mechanisms, like CFS [13] and PAST [27], are not able to change the number of replicas dynamically, since they employ replication in the neighborhood. The number of replicas cannot exceed the size of the neighbor lists, since the number of neighbors is tied to the DHT protocol. Any digital archiving system built on these systems would not be able to achieve the degree of replication required to preserve its information. Other DHT-based storage systems, such as OceanStore [17] and Glacier [7], consider a simpler model where all nodes have the same probability of failure. BRICKS [26] considers availability instead of reliability, associating a single failure probability with all nodes in the network.

LOCKSS (Lots of Copies Keep Stuff Safe) [21] uses a P2P network where nodes are controlled by autonomous organizations to preserve information for a long period. It uses a complex audit scheme to detect and repair damage in the replicas. Unlike our model, the LOCKSS system treats its repositories with a single probability of failure, which does not accurately model real environments. Furthermore, LOCKSS considers a fixed number of replicas for its items and does not integrate with OAI-PMH in a real-time environment.

After studying and evaluating the solutions described above, we believe that our solution has a significant differential, which is to join a P2P digital archiving system (with a desired reliability for each inserted item) with the OAI-PMH framework, ensuring reliability and availability in OAI Digital Libraries.

3 Extending OAI-PMH for Digital Preservation

This section describes an OAI-PMH extension which, working together with a distributed archiving system based on P2P networks (Section 5), allows OAI-based libraries to replicate digital objects and ensure their reliability and availability. Such an extension makes communication among data providers possible and allows OAI digital libraries to preserve their digital content at low cost.

In order to give the reader some context on the important features of the OAI-PMH framework, we briefly present the protocol (Subsection 3.1) and then explain our extension.

3.1 Open Archives Initiative and Protocol

Self-archiving consists of storing copies of digital documents on the Internet in order to offer open access to them. Since this idea started in 1992, with the arXiv system [1], several software packages have been created to organize the exposure of scientific work on the World Wide Web [2,32]. In order to provide a centralized search among individual archives, the Open Archives Initiative [3] proposed, in 2001, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [18]. The OAI-PMH framework provides application-independent interoperability based on metadata harvesting, where two classes of participants are found [18]: (i) Data Providers; and (ii) Service Providers.

Data Providers (DP), or OAI repositories, are network-accessible servers responsible for storing digital objects and exposing their metadata to harvesters. All data providers supply metadata in a common format – the Dublin Core Metadata Element Set [34] – and can thus address interoperability and extensibility. Each digital object in a data provider has a Digital Object Identifier (DOI) – in URI (Uniform Resource Identifier) syntax – that is used in OAI-PMH requests to identify the object and extract its metadata from the repository.

Fig. 2 Global search over a set of OAI repositories.

Service Providers (SP), or harvesters, are client applications that issue OAI-PMH requests to harvest either new or out-of-date metadata from a set of data providers and store them in a single database. As shown in Figure 2, users access service providers to perform a global search over metadata from an unlimited number of data providers. When the user finds the desired metadata in the service provider, a Handle Server maps the DOI to the data provider where the object is stored. Handle Servers [4,5] are necessary to keep the identifier tolerant to DP changes (server name or even domain name).

Problems with this topology: OAI-PMH promotes a very efficient dissemination of metadata on the Internet; however, objects are stored in a single data provider and can therefore become unavailable if this specific computer is offline. Moreover, if the computer crashes, content can be permanently lost. To solve these problems, we combined the advantages of the OAI-PMH framework and P2P networks to preserve, at low cost, digital content in OAI digital libraries.

3.2 OAI-PMH Extension Architecture

Many challenges are involved in creating a long-term digital preservation system. In our work, we are mainly interested in keeping digital objects available and reliable for long periods of time, even if adverse situations happen.

To achieve this goal, we organize data providers as nodes in a P2P digital archiving system, as illustrated in Figure 3. This means that besides their original functions (sharing metadata and storing their own content), data providers are able to replicate and store replicas from other libraries, reducing the possibility of content being lost. Replicas of an object can be stored in several data providers, but metadata is maintained and managed only in its original repository, i.e., metadata is not replicated. This decision is justified by the fact that metadata is likely to be modified, and the cost of keeping multiple synchronized copies is very high.

Fig. 3 OAI-PMH extension topology.

To ensure that metadata will not be lost and will be available whenever needed, our extension requires at least one service provider to regularly harvest metadata from all data providers in the federation. Thus, users can access service providers and perform a global search. When a user finds the desired metadata, he gets the Digital Object Identifier (DOI) and is able to gain access to the object.

It is very important to note that in conventional OAI-PMH systems the DOI normally points to either a specific data provider or a Handle Server, which redirects to the data provider. If this data provider is offline, it is not possible to gain access to the object. To solve this problem, in our extension the DOI does not point to a specific data provider, but to a unique URI that is resolved by a Round Robin DNS (see Figure 3). This DNS knows which nodes are alive in the network and thus redirects the user to one of these nodes. When a node receives the request, it shows the object's information and recovers the content from the P2P digital archiving system. Requests for any object can be received by any node.

In the following, we show the responsibilities of each component in our extension and how an object is inserted and retrieved in the system.

Data Provider: In the OAI-PMH framework a data provider is responsible for sharing metadata and providing access to its objects. Our extension defines that a data provider, in addition to its original functions, is part of a P2P digital archiving system, which allows the replication of its objects in other repositories and collaboration in finding objects in any other data provider in the federation.

Data providers are also responsible for creating both the object's DOI and its metadata when an object is inserted in the system. The DOI is formed by a unique URI which identifies the federation (stored in a configuration file in the DP), a number that identifies the DP – generated when it is registered with the Open Archives Initiative – and a serial number that locally identifies the object, as Figure 4 presents. Once created, the DOI is inserted in the metadata and is used as the key to insert and retrieve the object in the archiving system.

Fig. 4 How the Digital Object Identifier is formed.
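As a concrete illustration of this identifier scheme, the sketch below (in Java, the language of our prototype, see Section 3.5) assembles a DOI from the three components just described. The class name, the "/" separator and the example URI are our assumptions; the exact layout is the one given in Figure 4.

    // Hypothetical sketch of DOI assembly; separator and names are assumptions.
    public final class DoiFactory {
        private final String federationUri; // unique URI identifying the federation
        private final int dpNumber;         // assigned when the DP registers with OAI
        private long serial = 0;            // serial number that locally identifies the object

        public DoiFactory(String federationUri, int dpNumber) {
            this.federationUri = federationUri;
            this.dpNumber = dpNumber;
        }

        public synchronized String nextDoi() {
            return federationUri + "/" + dpNumber + "/" + (++serial);
        }

        public static void main(String[] args) {
            DoiFactory f = new DoiFactory("http://federation.example/oai", 42);
            System.out.println(f.nextDoi()); // e.g. http://federation.example/oai/42/1
        }
    }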

In addition, data providers are responsible for identifying the desired reliability of the object and sending this information to the P2P archiving system when the object is inserted. This information defines the number of replicas in the system, as Sections 4 and 5 present. Finally, we extended data providers to contact service providers, via the verb GetRecord, and retrieve an object's metadata when necessary (see Section 3.4).

6

Service Provider: In the OAI-PMH framework, service providers are responsible for harvesting metadata from a set of data providers and offering a global search. In our extension, service providers continue performing the same job; however, a group of service providers, called masters, should harvest all metadata in the federation. This is necessary because here the masters are also metadata suppliers to the data providers. A list identifying the masters is maintained by the data providers.

To perform this work, service providers accept requests via the verb GetRecord. This verb follows the same definition as its original counterpart, implemented on the data provider side. It receives the DOI as a parameter and returns the corresponding metadata. Requests are made via HTTP (Hypertext Transfer Protocol) and the metadata is delivered following the Dublin Core Metadata Element Set [34].

It is important to note that if either an error or an exception occurs, masters should return a message, in XML format, describing the problem. This message follows the GetRecord specification described in [18].

This solution ensures that metadata will not be easily lost, because it is stored in several masters, and allows data providers to gain access to metadata of which they are not the "owners".

3.3 How to Insert an object

Inserting a new object in our system is quite simple. First, a data provider reads, from a system administrator, all the information that will compose the metadata. The content to be inserted and its desired reliability are also required.

After getting this information, the data provider creates the DOI, following the standard presented in Figure 4, and inserts it in the object's metadata. Once created, the metadata can be harvested by service providers.

Finally, the data provider invokes the function insert, implemented in the P2P digital archiving system, which replicates the content over the network until it reaches its desired reliability. Algorithm 1 illustrates these steps.

input: metadata_info, content, reliability
begin
    DOI ← URI + DP_number + sequence        /* creates the DOI */
    metadata ← metadata_info + DOI
    P2P.insert(DOI, content, reliability)
end

Algorithm 1: Insert by Data Provider

3.4 How to Retrieve an object

The first step to retrieve an object from the system is to access a service provider and recover its DOI. Since a DOI is a simple URI, users can use a Web browser to obtain information about the object and access its content.

To address a specific computer, a URI is normally resolved by a Domain Name System (DNS). In our extension, a Round Robin DNS is used to perform this work. Round Robin DNS [11] is a load-balancing and fault-tolerance technique for redundant Internet Protocol service hosts (e.g., Web servers) that manages the DNS responses to address requests from client computers according to an appropriate statistical model.

There are many implementations of Round Robin DNS; in its simplest form, it responds to DNS requests not with a single IP address but with a list of IP addresses of several servers that host identical services. Thus, when a user accesses the DOI address, he is redirected to an available data provider. The list of available nodes is maintained by a node checker, which regularly checks the activity of the nodes. If a node is offline, it is removed from the list until it recovers its normal operation.
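On the client side, a standard resolver already returns the full address list published for such a round-robin name, so a caller can simply try the addresses in order. A minimal sketch, assuming a hypothetical federation host name:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class ResolveFederation {
        public static void main(String[] args) throws UnknownHostException {
            // "doi.federation.example" is a hypothetical round-robin DNS name
            // mapping to every data provider currently listed by the node checker.
            InetAddress[] providers = InetAddress.getAllByName("doi.federation.example");
            for (InetAddress p : providers)
                System.out.println("candidate data provider: " + p.getHostAddress());
        }
    }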

When a data provider receives the request for an object, it verifies whether the object's metadata is locally stored. If it is not, the data provider requests the metadata from the service provider, via the GetRecord verb, using the DOI as input. Once the metadata is recovered, the data provider presents its information to the user and recovers the content from the P2P digital archiving system using the DOI as key, as illustrated in Algorithm 2.

input: DOI
begin
    metadata ← get_metadata_local(DOI)
    if metadata not found then
        metadata ← ServiceProvider.getRecord(DOI)
        if metadata not found then
            return "Object does not exist"
    showMetadata(metadata)        /* shows the object's information to the user */
    return P2P.retrieve(DOI)
end

Algorithm 2: Retrieve by Data Provider

3.5 Evaluation

To evaluate our OAI extension we implemented a proof-of-principle prototype. This prototype was implemented in the Java programming language, and XML documents were used to store metadata and system configuration information. Messages among the components were sent and received via sockets [35].

The main purpose of this evaluation is to show that the OAI-PMH extension works in consonance with the P2P digital archiving system explained in Section 5. To achieve this goal, we created a set of logs during the execution of Algorithms 1 and 2. Figure 5 shows a log, from a data provider, which illustrates an object being recovered.

Fig. 5 Data Provider log illustrating an object being recovered.

It is important to note that we mapped several different behaviors with logs. In [29] it is possible to find other experiments that helped us validate our proposed extension. In a real-world environment, the extension can be implemented over a number of open source packages, like EPrints and DSpace [2,32].

4 Model and Proposed Strategies

The model is composed of a set $N$ of $|N| = n$ items (digital objects). All items have identical size, without loss of generality equal to 1. Each item $i$ has an associated probability $0 < r_i < 1$, called the desired reliability of the item. Furthermore, we have a set $M = \{\ell_1, \ldots, \ell_m\}$ of $|M| = m$ repositories, where each repository $\ell_j$ has an associated storage capacity $c_j$ and a probability $0 < p_j < 1$, called the reliability of the repository. To determine this reliability, we can consider parameters such as the number of bugs, the vulnerabilities of the system, human factors, the frequency at which the machines are repaired, among others. The determination of the desired reliability of an item is not in the scope of this work.

The reliability of a subset $S \subseteq M$ is defined as $1 - \prod_{\ell_j \in S}(1 - p_j)$, which denotes the probability that at least one repository in $S$ does not lose data in a given time interval.

We define the problem as follows. Items arrive one by one, i.e., initially there is no item, and the items arrive as time passes. When an item $i$ arrives, we have to choose a subset $S_i$ of repositories where each repository in $S_i$ receives a copy of the item (a replica), so as to satisfy the desired reliability $r_i$. In other words, we select $S_i \subseteq M$ such that

$$1 - \prod_{\ell_j \in S_i} (1 - p_j) \ge r_i. \qquad (1)$$
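A minimal sketch of this feasibility test, reproducing the Figure 1 numbers (repositories 1, 2 and 3 with reliabilities 0.4, 0.8 and 0.3 give $1 - 0.6 \cdot 0.2 \cdot 0.7 = 0.916 \ge 0.9$); the class and method names are ours:

    public class Reliability {
        // Reliability of a subset S as in Equation (1): 1 - prod (1 - p_j).
        static double subsetReliability(double[] p) {
            double lossAll = 1.0;
            for (double pj : p) lossAll *= (1.0 - pj); // prob. every repository in S fails
            return 1.0 - lossAll;
        }

        public static void main(String[] args) {
            double[] s = {0.4, 0.8, 0.3};                    // repositories 1, 2, 3 of Figure 1
            System.out.println(subsetReliability(s) >= 0.9); // true: 0.916 >= 0.9
        }
    }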

In addition, each repository in $S_i$ must have enough free space to receive the copy. The load of a repository is the number of replicas assigned to it. The replicas are never updated, i.e., they are immutable. The objective of the problem is to maximize the number of items inserted in the network, satisfying the desired reliability of the items and the capacity of the repositories.

As argued before, we focus on two goals: (i) on the one hand, we want to minimize the total number of replicas created, i.e., minimize $\sum_{i} |S_i|$; (ii) on the other hand, at the same time, the replicas should be placed so as to balance the load. To measure the load balancing, we evaluate two metrics: the makespan, i.e., the load of the most loaded repository, and the standard deviation of the repository loads. Note that minimizing the makespan and the standard deviation of the loads is sufficient to ensure that all repositories have a balanced load.

In what follows, we make some comments regarding the feasibility of the problem.

Observation 1 Assigning two or more replicas of the same item to a repository does not bring any benefit.

By Observation 1, in the worst case we create $m$ replicas for each item.

Observation 2 There are instances of the problem that admit no feasible solution.

To illustrate Observation 2, suppose that all repositories have a low reliability, e.g. $1/(m+1)$. By Observation 1, there is no advantage in creating more than $m$ replicas. Thus, when creating replicas of a given item in all $m$ repositories, we guarantee the following reliability:

$$1 - \left(1 - \frac{1}{m+1}\right)^m \le 1 - \frac{1}{e} \le 2/3.$$

In other words, it is impossible to guarantee the desired reliability of an item that requires a reliability greater than $2/3$. In view of Observation 2, we would like to identify conditions that make an instance of the problem feasible or infeasible. A necessary and sufficient condition (i.e., an "if and only if" condition) would be ideal in this case.

Observation 3 Due to the on-line nature of the input, it is impossible to decide whether an instance of the problem is feasible or infeasible.

By Observation 3, we can only say that an instance is infeasible if the item that arrived last makes it infeasible, which excludes the existence of a necessary and sufficient condition for detecting feasibility. However, without "looking into the future" and considering only the items that have already arrived, the problem is feasible if and only if $1 - \prod_{\ell_j \in S_i}(1 - p_j) \ge r_i$ for all $i \in N$.

A naive approach would treat both goals independently, for example, first solving the minimization of the replicas and then balancing the load. This is not a good approach, because the number of replicas created depends directly on the repositories to which the replicas were allocated. As an example, we illustrate a situation where we first solve the problem of replicas and then solve the load balancing. Suppose an instance where the repository $\ell_j$ has reliability $p_j$ and all items have desired reliability less than $p_j$. It suffices to create a single replica of each item in $\ell_j$. However, $\ell_j$ will be overloaded, and when we start the load balancing phase, we have to remove items from $\ell_j$. In this way, we have to select other repositories to accommodate new replicas of these items to meet their desired reliability, falling again into the problem that we thought was solved before the load balancing.

4.1 Equivalent Definition

Next, we rewrite the desired reliability constraint to ease the handling of the replica creation problem, presenting an equivalent definition that replaces the product by a summation:

$$1 - \prod_{\ell_j \in S_i} (1 - p_j) \ge r_i \;\equiv\; \prod_{\ell_j \in S_i} (1 - p_j) \le 1 - r_i$$

$$\equiv\; \prod_{\ell_j \in S_i} e^{\ln(1 - p_j)} \le e^{\ln(1 - r_i)}$$

$$\equiv\; e^{\sum_{\ell_j \in S_i} \ln(1 - p_j)} \le e^{\ln(1 - r_i)}.$$

But

$$e^{\sum_{\ell_j \in S_i} \ln(1 - p_j)} \le e^{\ln(1 - r_i)} \iff \sum_{\ell_j \in S_i} \ln(1 - p_j) \le \ln(1 - r_i).$$

As $0 < p_j, r_i < 1$, the value of the logarithm function is negative. Thus, for clarity of notation, we redefine the problem variables. Let $a_j = -\ln(1 - p_j)$ and $b_i = -\ln(1 - r_i)$. Therefore, the desired reliability constraint can be rewritten in the following equivalent way:

$$\sum_{\ell_j \in S_i} a_j \ge b_i. \qquad (2)$$

That is, for an item $i$ we have to select $S_i \subseteq M$ such that the sum of the $a_j$ exceeds $b_i$. Equation 2, besides being easier to handle, is equivalent to Equation 1.

4.2 Strategies for Replica Creation

In what follows, we present three strategies for the creation of replicas. In all strategies, we perform a selection over a set of repositories. When doing a selection on $M$, the worst-case complexity is linear in $m$, which is a good theoretical bound. In practice, however, a selection over the set of all machines of the network is infeasible. Thus, in the strategies described below, we denote by $M_o \subseteq M$ the set of machines that are available, based on a feasible number of machines that we can select in practical situations. For each item that arrives to be inserted, we consider that $M_o$ is selected at random from $M$; in practice, the way $M_o$ is selected may depend on the system features. Note that the strategies presented are generic, i.e., they can be used in various reliable replication settings. Therefore, the size of $M_o$ may depend on the features of the real-world situation under consideration.

It is worth noting that in all strategies, when a replica is assigned to a full repository, we ignore it and randomly select another repository from the considered set.

Randomized Strategy: Here, we present the first proposed strategy to solve the problem. Algorithm 3 shows the details.

begin
    S ← ∅
    while reliability of S is less than r_i do
        choose ℓ_j ∈ M_o uniformly at random
        S ← S ∪ {ℓ_j}
    return S
end

Algorithm 3: Randomized
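A runnable sketch of Algorithm 3 in Java; the plain array of reliabilities standing in for $M_o$ is our own simplification, and the loop assumes $M_o$ can in fact satisfy $r_i$ (otherwise Algorithm 3 itself would not terminate either):

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    public class RandomizedStrategy {
        // Draws repositories from mo (their reliabilities p_j) uniformly at
        // random until the accumulated reliability of S reaches ri.
        static Set<Integer> select(double[] mo, double ri, Random rnd) {
            Set<Integer> s = new HashSet<>();
            double lossAll = 1.0;                       // prod over S of (1 - p_j)
            while (1.0 - lossAll < ri) {
                int j = rnd.nextInt(mo.length);         // uniform choice, as in Algorithm 3
                if (s.add(j)) lossAll *= (1.0 - mo[j]); // a repeated pick changes nothing
            }
            return s;
        }

        public static void main(String[] args) {
            double[] mo = {0.4, 0.8, 0.3, 0.6, 0.25};   // the Figure 1 reliabilities
            System.out.println(select(mo, 0.9, new Random()));
        }
    }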

In this subsection, we use $m = |M_o|$ and assume that the reliabilities of the repositories are uniformly distributed in an interval. Formally, each value $p_j$ is chosen according to the continuous uniform distribution in the interval $[a,b]$, where $a > 0$ and $b < 1$. We assume that the expected number of reliable repositories (in a specified time interval) is at least $8\ln m$; note that this is not a strong assumption, since the expected fraction $\frac{8\ln m}{m}$ of reliable repositories goes to 0 as $m$ goes to infinity. For example, for $m = 1000$, we assume that the expected fraction of reliable repositories is $\frac{8\ln 1000}{1000} \approx 0.05$, i.e., we assume that, in expectation, 5% of the repositories are reliable.

Let $X_j$ be a random variable that is equal to 1 if the repository $\ell_j$ is reliable in the specified time interval, and 0 otherwise. Let $X = \sum_{\ell_j \in M_o} X_j$ be the random variable of the total number of reliable repositories in a given time interval. As we assume that the reliabilities of the repositories are distributed according to the continuous uniform distribution, $E[X] = \frac{m(a+b)}{2}$. In a given time interval, $X$ can be less than $E[X]$; however, with high probability, it cannot be much less than the expected value, as shown in Lemma 1.

Lemma 1 Let $X_j$ be a random variable that is equal to 1 if the repository $\ell_j$ is reliable, 0 otherwise. Let $X = \sum_{\ell_j \in M_o} X_j$ be the random variable of the total number of reliable repositories in a given time interval. Then $\Pr\left(X < \frac{1}{4}E[X]\right) < \frac{1}{m^2}$.

Proof Note that $X$ is a sum of Poisson trials. Therefore, we can apply a known Chernoff bound [24], obtaining $\Pr\left(X < \left(1 - \frac{3}{4}\right)E[X]\right) \le e^{-\frac{9E[X]}{32}} \le m^{-2}$, where the last inequality follows from the fact that $E[X] \ge 8\ln m$.

Lemma 1 tells us that with high probability (i.e., probability greater than $1 - \frac{1}{m^2}$), a fraction $\frac{a+b}{8}$ of the repositories are reliable. That is, with high probability, when selecting a repository uniformly at random, we have probability at least $\frac{a+b}{8}$ that it is a reliable repository. Thus, by choosing $k$ repositories uniformly at random, the probability that one or more of them are reliable is at least $1 - \left(1 - \frac{a+b}{8}\right)^k$, and we want this probability to be greater than the desired reliability $r_i$. Thus, by choosing $k = \left\lceil \frac{8}{a+b} \ln\left(\frac{1}{1 - r_i}\right) \right\rceil$ repositories uniformly at random and placing the replicas on them, the desired reliability of item $i$ is satisfied with high probability, as Theorem 1 claims.

Theorem 1 If item $i$ places $k = \left\lceil \frac{8}{a+b} \ln\left(\frac{1}{1 - r_i}\right) \right\rceil$ replicas in $k$ repositories chosen uniformly at random then, with high probability, the desired reliability of $i$ is satisfied.

As an example of Theorem 1's application, suppose that the reliabilities of the repositories are uniformly distributed between 50% and something close to 100%. Then, an item with a desired reliability of 95% needs to choose a minimum of $\left\lceil \frac{8}{0.5+1} \ln\left(\frac{1}{1-0.95}\right) \right\rceil = 16$ repositories uniformly at random to place its replicas.

Regarding the load balancing, we note that in Algorithm 3 a replica always chooses a repository uniformly at random. That is, Algorithm 3 simulates the balls-and-bins process, well studied in the area of randomized algorithms, and the existing results can be used in our problem [24]. The results of Theorems 2 and 3 can be easily obtained from the results on balls and bins and therefore the proofs are omitted.

Theorem 2 If $n$ balls are thrown independently and uniformly at random into $m$ bins, then the load of the most loaded bin is bounded by $\frac{2en}{m} + 2\log m$ with high probability.

Theorem 3 Let $n \ge m\log m$. If $n$ balls are thrown independently and uniformly at random into $m$ bins, then the load of the most loaded bin is bounded by $\frac{n}{m} + \sqrt{\frac{8n\log m}{m}}$ with high probability.

Theorems 2 and 3 refer, respectively, to a small and a large number of balls (in our problem, the items). However, in practical situations it is likely that $n \ge m\log m$ and, in this case, Theorem 3 tells us that with high probability the most loaded repository has a constant number of items in addition to the optimal balance.

Greedy over Subset Strategy: Suppose we want to minimize the total number of replicas without worrying about the load balancing. Then, using the equivalent definition of Section 4.1, it suffices to solve the integer linear program (ILP) below:

$$\min \sum_{\ell_j \in M_o} x_j$$

$$\text{s.t.} \quad \sum_{\ell_j \in M_o} a_j x_j \ge b_i$$

$$x_j \in \{0,1\} \quad \forall \ell_j \in M_o$$

The ILP described can be optimally solved by sorting the values $a_j$ in non-increasing order and taking the values in that order until the sum of the selected values reaches $b_i$. This greedy strategy, besides being very simple and efficient, optimally solves the ILP because, at any given step, there is no advantage in selecting a value smaller than the next one in the sorted order. Algorithm 4 shows the details of this strategy.

begin
    sort M_o in non-increasing order according to the values a_j
    S ← ∅
    t ← 1
    while ∑_{ℓ_j ∈ S} a_j < b_i do
        S ← S ∪ {ℓ_t ∈ M_o}
        t ← t + 1
    return S
end

Algorithm 4: Greedy over Subset
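A sketch of Algorithm 4 in Java, working directly on the transformed values $a_j = -\ln(1-p_j)$; as before, the plain arrays are our simplification and the code assumes $M_o$ can satisfy $b_i$:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class GreedyOverSubset {
        // Returns the indices of the chosen repositories, most reliable first.
        static List<Integer> select(double[] p, double ri) {
            double bi = -Math.log(1.0 - ri);            // b_i = -ln(1 - r_i)
            List<Integer> order = new ArrayList<>();
            for (int j = 0; j < p.length; j++) order.add(j);
            // ascending ln(1 - p_j) means non-increasing a_j: most reliable first
            order.sort(Comparator.comparingDouble((Integer j) -> Math.log(1.0 - p[j])));
            List<Integer> s = new ArrayList<>();
            double sum = 0.0;
            for (int j : order) {
                if (sum >= bi) break;                   // desired reliability reached
                s.add(j);
                sum += -Math.log(1.0 - p[j]);           // add a_j
            }
            return s;
        }

        public static void main(String[] args) {
            double[] mo = {0.4, 0.8, 0.3, 0.6, 0.25};
            System.out.println(select(mo, 0.9));        // [1, 3]: 1 - 0.2*0.4 = 0.92
        }
    }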

Note that this strategy accumulates the replicas on the repositories in $M_o$ with higher reliability, which is not good for the load balancing. We know that $M_o$ is randomly chosen for each item insertion, but this does not suffice to improve the load balancing: if $M_o$ is large, we fall again into the problem of accumulating the replicas in the repositories with higher reliability; moreover, if $M_o$ is small, we do not have many options to choose from, or there are not enough repositories to satisfy the desired reliability of an item. Several sizes of $M_o$ are evaluated in [33].

Ideal Subset Strategy: To create the replicas, we select the subset $S \subseteq M_o$ that provides the reliability closest to the desired reliability of the item. That is, we choose $S \subseteq M_o$ that minimizes $1 - \prod_{\ell_j \in S}(1 - p_j) - r_i$. Using the equivalent definition of the problem, we need to solve the following ILP:

$$\min \sum_{\ell_j \in M_o} a_j x_j - b_i$$

$$\text{s.t.} \quad \sum_{\ell_j \in M_o} a_j x_j \ge b_i$$

$$x_j \in \{0,1\} \quad \forall \ell_j \in M_o$$

Note that if the solution value is equal to 0, then there is a subset of values that sums to exactly $b_i$; if the solution value is greater than 0, then there is no such subset. Thus, if we could solve this ILP, we could also solve the SUBSET-SUM decision problem, which is NP-complete [14]. Therefore, as the ILP is an optimization problem, it is NP-hard; assuming P $\neq$ NP, no polynomial algorithm can solve such a problem. However, there is a dynamic programming algorithm which solves this problem in pseudo-polynomial time and is satisfactory in practice for the vast majority of instances [14]. The details of the algorithm are omitted.
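Since those details are omitted in the paper, the sketch below shows one standard way such a dynamic program can be realized, under the assumption (ours) that the real values $a_j$ and $b_i$ are scaled to integers by a factor SCALE. It computes the smallest reachable sum $\ge b_i$, i.e. the ILP's objective value; recovering the subset itself would additionally require parent pointers.

    import java.util.Arrays;

    public class IdealSubsetDp {
        // Pseudo-polynomial subset-sum style DP: find the reachable sum of
        // a_j values that is >= b and closest to it. SCALE is an assumption.
        static int bestScaledSum(double[] a, double b) {
            final int SCALE = 1000;
            int[] w = Arrays.stream(a).mapToInt(x -> (int) Math.round(x * SCALE)).toArray();
            int target = (int) Math.ceil(b * SCALE);
            int cap = Arrays.stream(w).sum();           // largest sum any subset can reach
            boolean[] reach = new boolean[cap + 1];
            reach[0] = true;
            for (int wj : w)                            // each repository used at most once
                for (int sum = cap; sum >= wj; sum--)
                    if (reach[sum - wj]) reach[sum] = true;
            for (int sum = target; sum <= cap; sum++)   // smallest reachable sum >= b
                if (reach[sum]) return sum;
            return -1;                                  // no subset satisfies b: infeasible
        }

        public static void main(String[] args) {
            double[] p = {0.4, 0.8, 0.3, 0.6, 0.25};
            double[] a = Arrays.stream(p).map(x -> -Math.log(1.0 - x)).toArray();
            System.out.println(bestScaledSum(a, -Math.log(1.0 - 0.9)) / 1000.0);
        }
    }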

The solution of the above ILP does not necessarily provide a good solution for minimizing the number of replicas created. Nevertheless, the Ideal Subset strategy is motivated by the fact that in practical situations it is expected not to create too many replicas and to select different subsets for each item, so that the distribution of replicas among the repositories balances their loads. Note that if we had not used the equivalent definition, we would not be able to model this case as a subset sum problem, and we would need a truly exponential algorithm to solve it, which is an unfeasible alternative in this case.

5 A Peer-to-Peer Digital Archiving System

The model proposed in Section 4 is designed in a generic way and can be implemented on any distributed mechanism for organizing the storage repositories. In particular, structured P2P using a DHT appears as a natural candidate, as it is highly scalable for data distribution and retrieval. However, a difficulty inherent to structured P2P networks is the accurate selection of the nodes (repositories) that will store the replicas: it is not trivial to select a specific subset of nodes using the routing method of DHTs. Therefore, the implementation of the digital archiving system needs to define an architecture that accommodates all the features that the model of Section 4 requires. Thus, we present a scheme for node selection using multiple hash functions, which allows the selection of a particular set of nodes.

5.1 Architecture

Structured P2P Networks and DHT: The system routes the messages of the network through structured P2P networks, using DHTs. The choice of structured P2P, instead of non-structured, is motivated by its scalability regarding the number of nodes. A problem of non-structured P2P networks is that they often use brute-force search algorithms ("flooding") and are more suitable for popular content. Moreover, in many cases, the search algorithms of non-structured P2P networks cannot locate rare items, which is unacceptable in the context of digital archiving, where the objects are equally popular [20].

DHTs have a problem with transient populations, i.e., maintaining the structure of the routing tables is relatively expensive under churn. However, machine transiency in organizations that intend to preserve digital documents is not as frequent as in the machines used by traditional applications in the non-structured architecture [21]. Therefore, the necessary adjustments in the topology of the structured network do not overload the archiving system in case of churn.

Specific Selection of Repositories: The strategies proposed in Section 4.2 assume that it is possible to perform a selection of specific repositories. However, the DHT by itself does not provide a mechanism for selecting a specific node, due to its method of indexing keys. So if we are interested in storing the content in a given set of repositories, we must provide a mechanism that simulates the process of specific selection. To perform the selection of specific nodes, we propose the use of multiple hash functions, as explained below.

A digital object consists of a key, which is the identifier of the object; a value, which is the content of the object; and a parameter of desired reliability, which is the reliability that should be achieved when inserting the object in the network. Let $h_1, h_2, \ldots, h_r$ be the $r$ hash functions. The hash functions have global visibility, i.e., they are the same for all nodes. Given the key $k$ of a digital object, we apply $k$ to the hash functions, i.e., $h_1(k), h_2(k), \ldots, h_r(k)$. Each of the $r$ generated hashes maps to a node in the network. Thus, for each object to be inserted, we get $r$ nodes where we can place replicas (i.e., the set $M_o$). From this set of $r$ nodes, we use a strategy of replica creation (e.g., the strategies of Section 4.2) to define the subset of these $r$ nodes that receive the replicas. It is not difficult to obtain a family of such hash functions. One way is to use a single hash function $h$ and append a number $i = 1, \ldots, r$ to the key of the object, which is used as the argument to the function $h$. For instance, if the object key is the string foo, then $h(\texttt{foo1}), h(\texttt{foo2}), \ldots, h(\texttt{foor})$ would give us $r$ hashes of this object.
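A sketch of this key-derivation scheme in Java; the choice of SHA-1 as the single underlying hash $h$ and the hexadecimal printing are our assumptions:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class MultiHash {
        // Derives r DHT keys from one object key by appending i to it,
        // exactly as described in the text: h(key + i), i = 1..r.
        static byte[][] deriveKeys(String key, int r) throws NoSuchAlgorithmException {
            MessageDigest h = MessageDigest.getInstance("SHA-1"); // assumed hash function
            byte[][] keys = new byte[r][];
            for (int i = 1; i <= r; i++)
                keys[i - 1] = h.digest((key + i).getBytes(StandardCharsets.UTF_8));
            return keys;
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            for (byte[] k : deriveKeys("foo", 6))       // h(foo1) ... h(foo6)
                System.out.println(new java.math.BigInteger(1, k).toString(16));
        }
    }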

Figure 6 illustrates the selection of repositories performed for three different objects. In the figure, keys are represented by object a, object b and object c, and the dotted lines denote the set of nodes associated with each key after applying $r = 6$ hash functions. As we have 6 hash functions, the resulting hashes map to 6 nodes of the network. From each subset of nodes associated with an object, the strategy of creating replicas is applied to determine the nodes that receive the replicas. The black circles represent the repositories that have been chosen to hold the replicas.

Fig. 6 Subsets of repositories associated with their respective digital objects.

It is worth noting that performing a selection among all nodes of the network (or, equivalently, using $m$ hash functions) is not a good approach, because an instance would have the size of the network, which is unfeasible in real-world situations. On the other hand, considering a subset of small size is a feasible implementation option even in face of the scalability of the network. Thus, based on the results shown in [33], we can choose the optimal size of the subset and, therefore, decide how many hash functions to use.

The main advantage of using multiple hash functions is the ability to select specific nodes. Without this, the strategies proposed in Section 4.2 could not be implemented in structured P2P networks using a DHT. Moreover, an advantage of the multiple hash function scheme is the ability to use any DHT protocol to route messages in the network, unlike replication strategies in P2P using the neighborhood or the path [16], which are tied to the protocol. Multiple hash functions also allow the flexibility for digital objects to have different numbers of replicas, which is not possible in the symmetric replication strategy [15]. Another feature is the easy retrieval of a given replica without necessarily retrieving another one first, making it easy to develop a retrieval algorithm. In the correlated hash strategy [16], all the keys of a given object are correlated with the first key, thus not allowing this feature.

The selection of repositories proposed above involves no centralized information about the location of the stored replicas. An alternative would be the use of a super-node holding a "directory" that could be consulted for the exact location of the replicas of a given object. However, this super-node would be a contention point. Our system avoids super-nodes, adopting a completely distributed approach where no information is centralized.

5.2 Algorithms for System Operation

To operate the P2P digital archiving system, we need to define some basic operations. In this section, we present the algorithms insert and retrieve, used respectively for the insertion and retrieval of a digital object in the system. These algorithms implement the multiple hash functions discussed earlier. Note that both algorithms are executed locally in each node. For example, a user who wishes to insert or retrieve an object contacts any node of the network, which in turn initiates the DHT message-routing process.

input: key, value, reliability r_i
begin
    M_o ← ∅
    for i = 1 to r do
        M_o ← M_o ∪ {ℓ_{h_i(key)}}
    S ← insertion_strategy(M_o, r_i)
    foreach s ∈ S do
        j ← hash function number of s
        put(h_j(key), value)
end

Algorithm 5: insert(key, value, reliability)

To insert an object, the routine insert(key, value, reliability) is used, as shown in Algorithm 5. When inserting an object in the network, the desired reliability of the object is previously chosen by the user. Initially, $M_o$ starts empty. The first loop selects the subset $M_o$ of size $r$ associated with the key of the object; $\ell_{h_i(key)}$ is an abuse of notation which denotes the node pointed to by the $i$-th hash of the key. In this loop, we implicitly save the values $i$ used for each node; this will be used later. After that, the function insertion_strategy($M_o$, $r_i$) is executed, which returns the subset $S \subseteq M_o$ of nodes that will receive the replicas. The function insertion_strategy can be replaced by any replication strategy, for example, those presented in Section 4.2. In our implementation, the reliabilities of the nodes are stored in the DHT. The last loop is where the insertion occurs. The value $j$ denotes the hash function number of the node considered; as stated previously, these values were saved in the first loop. The routine put(h_j(key), value) places a replica of the object in the chosen location. The implementation of the algorithm takes care of not assigning two or more replicas of the same object to one node.

input: key
begin
    for i = 1 to r do
        value ← get(h_i(key))
        if value is not null then
            return value
    return −1        /* not found */
end

Algorithm 6: retrieve(key)

To retrieve an object, the user performs the function retrieve(key), where key is the key of the object to be retrieved, as shown in Algorithm 6. The idea of the algorithm is to search all $r$ nodes for a replica of the item, using the hash functions for that. There are two cases where the DHT does not return the object: when the node is not present or when the node does not contain a replica of the object.
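A sketch of Algorithm 6 against a hypothetical DHT client; the Dht interface below is our assumption (it is not Overlay Weaver's API), and deriveKeys is the helper from the multiple-hash sketch of Section 5.1:

    public class RetrieveExample {
        // Hypothetical DHT client: get() returns null both when the responsible
        // node is absent and when it holds no replica of the object.
        interface Dht { byte[] get(byte[] key); }

        // Tries the r derived locations in order; the first replica found wins.
        static byte[] retrieve(Dht dht, String key, int r) throws Exception {
            for (byte[] k : MultiHash.deriveKeys(key, r)) {
                byte[] value = dht.get(k);
                if (value != null) return value;
            }
            return null; // not found in any of the r locations
        }
    }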

Fig. 7 Number of objects that can be retrieved as a function of the age of the system (0–30 years), for objects inserted with desired reliabilities of 99%, 90%, 70%, 50% and 30%.

5.3 Experimental Results

The P2P archiving system was implemented and evaluated through experiments carried out in a real-world environment. The implementation uses the Overlay Weaver environment to build networks [30]. Overlay Weaver provides great flexibility in the choice of DHT protocols and other high-level services implemented as overlay networks. In particular, this environment supports Chord, Pastry, Tapestry and Kademlia; in our experiments we use Chord. It is worth noting that Overlay Weaver has a replication mechanism of its own, which was turned off for our experiments. The experiments were conducted in a network with 12 nodes; this quantity is enough to validate the ability of long-term archiving and the replica creation mechanism.

The reliability of each node was taken to be the probability that the node does not lose information during a period of 1 year. Thus, we can simulate the state of the network with respect to the preservation of information over the years. In our experiments, we do not evaluate changes in the reliability of the nodes over the years.
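As a worked illustration of this model (assuming independence across years, which is consistent with the fixed annual reliabilities used here), a node with annual reliability p remains intact after t years with probability

    p^t,    e.g.  0.9^{10} ≈ 0.35  and  0.3^{5} ≈ 0.002,

so low-reliability nodes are expected to disappear within the first few years of the simulated lifetime.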

The experiment starts with a network into which various objects are inserted. Objects are inserted with different reliabilities. For the replication of objects we used the Ideal Subset strategy, with subsets of size 6. Initially all nodes are reliable (online). When a node becomes unreliable, it loses its information and is disconnected from the network (offline). We purposely did not implement a strategy to recover offline nodes, because we want to stress that objects with a high desired reliability are preserved longer.
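The yearly dynamics of this setup can be reproduced with a small simulation. The sketch below is our own simplification, not the code of the actual 12-node deployment: each year, every online node is assumed to survive with its annual reliability, and an object counts as retrievable while at least one of its replica holders is still online.

    import random

    def simulate(reliability, replicas, years=30, seed=1):
        # reliability: {node_id: annual reliability p}
        # replicas:    {object_id: set of node_ids holding a replica}
        random.seed(seed)
        online = set(reliability)
        for year in range(1, years + 1):
            # each online node survives this year with probability p
            online = {n for n in online
                      if random.random() < reliability[n]}
            alive = sum(1 for nodes in replicas.values() if nodes & online)
            print(f"year {year:2d}: {len(online):2d} nodes online, "
                  f"{alive:2d} objects retrievable")

    # Example with the reliability mix used in the experiment (illustrative):
    # simulate({0: .3, 1: .3, 2: .5, 3: .5, 4: .7, 5: .7,
    #           6: .8, 7: .8, 8: .8, 9: .9, 10: .9, 11: .9}, replicas)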

Figure 7 shows the results. Of the total of 12 nodes, 2 have 30% reliability, 2 have 50%, 2 have 70%, 3 have 80%, and the remaining 3 have 90%. Initially, in the first year, we insert a total of 25 objects, divided into 5 groups, each group containing objects with a desired reliability of, respectively, 30%, 50%, 70%, 90% and 99%.

After the first year of the system's existence, a node with reliability 30%, a node with 50% and another with 90% failed. Even with these failures, all 25 objects could still be retrieved. At the end of the fifth year, two more nodes (30% and 70% reliability) were disconnected from the network. In the fifth year, it was no longer possible to retrieve the 3 objects with 30% desired reliability, 4 objects of 50%, 1 of 70% and 1 of 90%. In the tenth year, three additional nodes went offline, with reliabilities of 70%, 80% and 90%, respectively. It was then impossible to locate a total of 3 objects of 30% desired reliability, 4 objects of 50%, 4 of 70% and 2 of 90%. At the end of the fifteenth year, a node with 80% reliability was disconnected, and the same objects as in the tenth year could still be retrieved. In the twentieth year, a node of 50% was disconnected, and only one node of 80% and another of 90% remained in the system. At this age of the system, it was possible to locate all objects of 99% desired reliability and 1 object of 90%; all other objects were lost. In the twenty-fifth year, the node of 80% failed, and it was still possible to retrieve 4 objects of 99% and 1 of 90%. In the thirtieth year, the last node was disconnected from the network and no further replicas existed.

6 Conclusions and Future Work

The experiments conducted in this work demonstrate that OAI-based digital libraries, combined with a P2P archiving system, can preserve information for a long period of time. The importance of preserving each digital object, measured in the model by the desired reliability, results in a different lifetime for each object: objects inserted with a high desired reliability had a greater lifetime. Thus, we can identify three main characteristics of our system:

OAI-PMH integration: the system is intended for OAI-based systems and keeps all OAI-PMH functionalities unchanged. This feature is important, since OAI-PMH is widely used and can be considered a standard for building digital libraries.

Independence of object preservation: different information requires different storage times. Collections of photos, journals and articles may need a few years of storage; other information, such as digital objects in museums and libraries, requires hundreds of years. Our system allows flexibility in the choice of the lifetime of the objects to be preserved.

Optimization of the storage resources: storage repositories may suffer many types of damage to their contents, so each repository has a different reliability. Equipping each storage repository with a parameter that measures its independent probability of failure is the closest way to model real networks. This approach allows flexibility in the preservation time of each object and therefore different numbers of replicas for them, leading to better usage of the network's storage repositories.

Future work includes the implementation of a complete digital preservation system. To do this, the system should be concerned with auditing the replicas and with other threats, such as software obsolescence, not considered in this work. Furthermore, very little emphasis was given to the retrieval of items. Possibly, we can apply our digital archiving model to systems such as LOCKSS and BRICKS, and evaluate its feasibility. Finally, once the complete digital preservation system is finished, we intend to implement a whole distributed digital library, based on this work, in a real-world environment.

References

1. arXiv.org e-Print archive. www.arxiv.org.
2. Open access and institutional repositories with EPrints. www.eprints.org.
3. Open Archives Initiative. www.openarchives.org.
4. The Digital Object Identifier System. www.doi.org.
5. The Handle System. www.handle.net.
6. Virtua TLS. www.vtls.com.
7. Glacier: highly durable, decentralized storage despite massive correlated failures. In Proceedings of NSDI '05, Berkeley, CA, USA, 2005. USENIX Association.
8. Maristella Agosti, Hans-Jörg Schek, and Can Türker, editors. Digital Library Architectures: Peer-to-Peer, Grid, and Service-Orientation. Pre-proceedings of the Sixth Thematic Workshop of the EU Network of Excellence DELOS, S. Margherita di Pula, Cagliari, Italy, 24–25 June 2004. Edizioni Libreria Progetto, Padova, 2004.
9. Benjamin Ahlborn. OAI-P2P: A Peer-to-Peer Network for Open Archives. In ICPPW '02: Proceedings of the 2002 International Conference on Parallel Processing Workshops, page 462, Washington, DC, USA, 2002. IEEE Computer Society.
10. A. Amrou, K. Maly, and M. Zubair. FreeLib: Peer-to-peer-based Digital Libraries. In AINA '06: Proceedings of the 20th International Conference on Advanced Information Networking and Applications - Volume 1, pages 9–14, Washington, DC, USA, 2006. IEEE Computer Society.
11. T. Brisco. DNS Support for Load Balancing, 1995.
12. Brian Cooper and Hector Garcia-Molina. Creating trading networks of digital archives. In JCDL '01: Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, 2001.
13. Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), Chateau Lake Louise, Banff, Canada, October 2001.
14. Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. 1979.
15. Ali Ghodsi, Luc Onana Alima, and Seif Haridi. Symmetric replication for structured peer-to-peer systems. In DBISP2P, pages 74–85, 2005.
16. Salma Ktari, Mathieu Zoubert, Artur Hecker, and Houda Labiod. Performance evaluation of replication strategies in DHTs under churn. In Proceedings of MUM '07. ACM, 2007.
17. John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Chris Wells, and Ben Zhao. OceanStore: an architecture for global-scale persistent storage. In Proceedings of ASPLOS-IX, New York, NY, USA, 2000. ACM.
18. Carl Lagoze and Herbert Van de Sompel. The Open Archives Initiative: building a low-barrier interoperability framework. In JCDL '01: Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, pages 54–62, New York, NY, USA, 2001. ACM.
19. Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, and Liuba Shrira. Replication in the Harp file system. SIGOPS, 1991.
20. Qin Lv, Pei Cao, Edith Cohen, Kai Li, and Scott Shenker. Search and replication in unstructured peer-to-peer networks. In Proceedings of SIGMETRICS '02. ACM, 2002.
21. P. Maniatis, M. Roussopoulos, T. Giuli, D. Rosenthal, and M. Baker. The LOCKSS peer-to-peer digital preservation system. ACM Trans. Comput. Syst., 23, 2005.
22. Vidal Martins, Esther Pacitti, and Patrick Valduriez. Survey of data replication in P2P systems. Technical report, INRIA, 2006.
23. D. S. Milojicic, V. Kalogeraki, R. Lukose, and K. Nagarajan. Peer-to-peer computing. Technical report, HP Labs, 2002.
24. Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. 2005.
25. Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Schenker. A scalable content-addressable network. In Proceedings of SIGCOMM '01, New York, NY, USA, 2001. ACM.
26. Thomas Risse and Predrag Knezevic. A self-organizing data store for large scale distributed infrastructures. In ICDEW '05: Proceedings of the 21st International Conference on Data Engineering Workshops, Washington, DC, USA, 2005. IEEE Computer Society.
27. A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility, 2001.
28. Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. Lecture Notes in Computer Science, 2001.
29. Everton Flavio Rufino Seara. Uma arquitetura OAI para Preservação Digital utilizando redes Peer-to-Peer Estruturadas [An OAI architecture for digital preservation using structured peer-to-peer networks]. Master's thesis, Federal University of Paraná, 2008.
30. Kazuyuki Shudo, Yoshio Tanaka, and Satoshi Sekiguchi. Overlay Weaver: An overlay construction toolkit. Comput. Commun., 2008.
31. Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of SIGCOMM '01, pages 149–160, 2001.
32. Robert Tansley, Mick Bass, David Stuve, Margret Branschofsky, Daniel Chudnov, Greg McClellan, and MacKenzie Smith. The DSpace institutional digital repository system: current functionality. In JCDL '03: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 87–97, Washington, DC, USA, 2003. IEEE Computer Society.
33. Tiago Vignatti, Luis Carlos Erpen De Bona, Andre Vignatti, and Marcos Sfair Sunye. Long-term digital archiving based on selection of repositories over P2P networks. In IEEE P2P '09: Ninth International Conference on Peer-to-Peer Computing, 2009. To appear.
34. S. Weibel, J. Kunze, C. Lagoze, and M. Wolf. Dublin Core Metadata for Resource Discovery, 1998.
35. J. Winett. Definition of a socket, May 1971.
36. Yanfei Xu. A P2P Based Personal Digital Library for Community. In PDCAT '05: Proceedings of the Sixth International Conference on Parallel and Distributed Computing Applications and Technologies, pages 796–800, Washington, DC, USA, 2005. IEEE Computer Society.
37. Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John D. Kubiatowicz. Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, January 2004.