
Future Generation Computer Systems 24 (2008) 824–832

    Contents lists available at ScienceDirect

    Future Generation Computer Systems

    journal homepage: www.elsevier.com/locate/fgcs

    Improving the performance of Federated Digital Library services

    Jernej Trnkoczy, Vlado Stankovski

    Faculty of Civil and Geodetic Engineering, University of Ljubljana, Jamova 2, SI-1000 Ljubljana, Slovenia

Article info

    Article history:

Received 4 December 2007
Received in revised form 8 April 2008
Accepted 8 April 2008
Available online 18 April 2008

    Keywords:

OAI-PMH
Grid
Federated Digital Library
Performance

Abstract

The number of Digital Libraries (DLs) accessible over the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has been constantly increasing in the past years. Earlier efforts in the DL area have concentrated on metadata harvesting and provisioning of value-added Federated Digital Library (FDL) services to the users. FDL services, however, have to meet significant performance and scalability requirements, which is difficult to achieve in centralized metadata harvesting systems. The goal of the present study was to evaluate the benefits of using Web Services Resource Framework (WSRF) compliant grid middleware infrastructure for providing efficient and reliable FDL services. The presented FDL application allows for parallel harvesting of OAI-PMH compliant DLs. The results show that this approach efficiently solves the performance related problems, while it also contributes to greater flexibility of the system. The quality of service is improved as metadata can be updated frequently, and the system does not exhibit a single point of failure.

© 2008 Elsevier B.V. All rights reserved.

    1. Introduction

The advancement of World Wide Web (WWW) technologies causes an exponential growth of available, widely distributed digital content. Web search engines, such as Google or Yahoo, and encyclopedias, such as Wikipedia, already point to millions of digital objects and may be considered as huge, ubiquitous digital libraries. Digital Library (DL) technologies address the needs for the management of vast amounts of available digital content (i.e. free-text articles and multimedia) and provide sophisticated content and knowledge services for the users.

An important goal in the development of DL technology is to improve the quality, scope and accuracy of existing Web search engines by utilizing structured resource descriptions, i.e. semantically rich metadata. This is possible since metadata are usually freely available. On the other hand, due to restrictive copyrights, the actual digital content is freely available only in a limited number of cases. A practical approach is, therefore, to harvest metadata from a number of geographically distributed DLs at a central location, index these metadata, and let the users search the generated index from a single user interface. The resulting search services are also known as Federated Digital Libraries (FDLs).

When building a FDL, it is important to take into consideration the growing number of available DLs, the necessity to use advanced, computationally intensive information retrieval algorithms, as well as the growing number of users who need personalized perspectives on the DL content. All these factors are likely to cause scalability and performance problems when building FDLs. For example, Maly et al. [17] used exhaustive harvesting to build and update a large collection of metadata. Their FDL took over four days to complete one cycle of harvesting from over 160 existing DLs and two additional days to index the harvested metadata. They found that the harvesting process was long running because of the low network bandwidth and slow response of the contacted DLs.

* Corresponding author. Tel.: +386 (0)1 4768511, +386 (0)41 200565 (mobile); fax: +386 (0)1 4250681.
E-mail address: [email protected] (V. Stankovski).
URL: http://www.stankovski.net (V. Stankovski).

In this kind of application, the scalability, reliability and performance problems may possibly be alleviated by the use of grid technology [8]. Grid technology is particularly beneficial when computational power and storage must be scaled to meet the demands of complex problem solving applications. Properties like these make grid technology particularly suitable for the area of DLs, which is demonstrated by a number of on-going research projects, such as GRACE [12], DILIGENT [2], DELOS [6], Digital Library GRID [16] and Cheshire3 [14].

The proposed innovation is to develop a full range of FDL services on top of mainstream, WSRF-standard [4] compliant grid technology. The application will benefit by exploiting available, otherwise idle computational resources on the Internet. We investigated a use case scenario in which:

- the end-user selects a set of distributed DLs;
- metadata are harvested from the selected DLs;
- the harvested metadata are transformed into a proprietary format, which is used by a particular indexing algorithm;
- a central index is computed;
- the index is used by a search service;


- the user can now search for digital objects contained in geographically distributed DLs.

Our goal is to overcome the scalability, reliability and performance problems of today's FDLs by distributing the metadata harvesting, indexing and other computing power and time consuming tasks on various computational clusters on the Internet. As a starting point, we investigated the possibility of speeding up the process of metadata harvesting by its parallelization. In this paper, we will therefore focus on the investigation of the key performance parameters, which are related to the metadata harvesting problem. To the best of our knowledge, this is the first study that investigates the performance of metadata harvesting in the context of WSRF-standard compliant grid environments. As an exception, the Grid File Transfer Protocol (GridFTP) service [1] is used, which is not WSRF compliant.

The paper is organized as follows. Section 2 presents the state-of-the-art in the area of grid computing for DL applications. Section 3 describes the methodology and the middleware technologies that were used to build the experimental grid environment. Section 4 describes the actual grid test bed used in our experiments and the developed grid-enabled FDL application. The evaluation of the system performance is presented in Section 5, and finally, Section 6 discusses the results obtained and presents the conclusions.

    2. State-of-the-art overview

With the rapid evolution of grid technologies and the benefits they offer, several projects combining DL and grid technologies have recently emerged. Here is a brief overview of these projects.

The key goal of the Digital Library Grid [16] project is similar to that of existing Web search engines such as Google, i.e. to harvest all of the existing content repositories in the world. For this purpose, grid technology is used to distribute the cost of high latency harvesting and indexing tasks to grid nodes, and only leave the cost of maintaining the federated search service to a service provider.

Their grid-based architecture, similar to ours, enables parallel harvesting over the OAI-PMH protocol and it supports: dynamic allocation of harvesting nodes, scheduling of harvesting tasks to maximize the performance, and uniform load distribution for the indexing node. However, the Digital Library Grid architecture is tuned only for distributed harvesting and indexing, and it has been implemented with an earlier version (version 3) of the Globus Toolkit, which is not WSRF-compliant. Their system scales up the harvesting task, but it does not provide for larger scale virtualization and personalization of the services.

In the Cheshire3 [14] project, a low level architecture has been defined that permits DL operations to be distributed over many nodes on a network, vastly increasing the throughput of data for computational and storage intensive processes. The implementation uses distributed indexing and search processes over a cluster of high performance machines to achieve high speed indexing. Their implementation is not based on standard grid middleware and protocols, such as WSRF, and it uses a proprietary grid solution.

Grid-IR [18] is an initiative to realize Information Retrieval (IR) on the Open Grid Services Architecture (OGSA) platform. It aims to move existing IR standards (such as Z39.50) to the Web service platform. The Grid-IR approach differs from ours in the sense that their architecture is purely service based. This means that every entity in the system is implemented as a service (e.g. metadata service, collection management service, indexing service, searching service, query processing service). Furthermore, the Grid-IR project builds on the distributed model of DL federation, while our approach builds on the harvesting model. On the other hand, the Grid-IR initiative is currently a proposed working group of the Open Grid Forum.

The DILIGENT project [2] aims to build a test bed that integrates Grid and Digital Library technologies. Their developments are based on the achievements of the European Enabling Grids for E-science (EGEE) project. The EGEE infrastructure already provides some of the functionality required for DILIGENT (e.g. the dynamic allocation of resources, support for cross-organizational resource sharing, security infrastructure). For effectively supporting DLs, additional services, such as support for redundant storage and automatic data distribution, a metadata broker, metadata and content management, advanced resource brokers, approaches for ensuring content security in distributed environments, and the management of content and community workflows, are currently being developed, in addition to services that support the creation and management of Virtual DLs.

The GRACE project [12] addresses situations where no centralized index is available. It proposes the development of a distributed search and categorization engine that enables just-in-time, flexible allocation of data and computational resources. GRACE adopts the grid middleware developed by the Large Hadron Collider (LHC) Computing Grid (LCG). In this project, grid technology is used to meet the computational demands of natural language processing methods, which are mainly text normalization and categorization for indexing purposes. This is accomplished by distributing the computationally intensive part on a grid, which involves secure and dynamic sharing of computational and storage resources.

    3. Methodology and grid middleware technologies

This section focuses on two fundamental components of the proposed FDL system: the grid middleware services and tools used to build the FDL application, and the OAI-PMH protocol by which metadata records are harvested from distributed DLs. The grid test bed used in this study is based on state-of-the-art DataMiningGrid [23], Globus Toolkit [9] and Condor [22] middleware technologies, which are described in the following sections.

    3.1. Grid middleware services

One of the most important grid-related standards developed in recent years is the Web Services Resource Framework (WSRF), a specification promoted by the Organization for the Advancement of Structured Information Standards (OASIS). WSRF provides a generic, open framework for modeling and accessing stateful resources using Web services, a functionality that is typically needed in today's grid computing infrastructures. Web services, as currently specified by the World Wide Web Consortium (W3C), are usually stateless, i.e. there is no standard way that a Web service can keep its state from one invocation to another. However, grid applications do generally require statefulness, and the WSRF specification defines a standard way of making Web services stateful. The latest WSRF specification is version 1.2, and it was approved as an OASIS Standard in 2006, a status of the highest level of ratification.

Different grid middleware software solutions exist and continue to be developed. One of the first grid middleware toolkits implementing the WSRF v. 1.2 specification [4] is the Globus Toolkit 4 (GT4) [10]. GT4 provides a range of grid services that can be directly used to build a distributed grid environment. These include data management, job execution management, community authorization services etc. All these services can be used to build custom grid applications, and are elaborated in detail elsewhere [1,7,11,20]. Besides these ready-to-use services, GT4 provides an Application Programming Interface (API) that allows for the development of proprietary WSRF-compliant services. For these reasons, GT4 was selected for use in this study.


Following is a short review of relevant ready-to-use GT4 services. The Web Service Grid Resource Allocation and Management (WS-GRAM) service provides all basic mechanisms required for execution management, i.e. initiation, monitoring, management, scheduling, and coordination of remote computations. GT4 also provides a number of services for data management. The GridFTP and Reliable File Transfer (RFT) [15] services are particularly useful for the FDL application. These data services are mainly used for the transfer and management of distributed, file based data, including program executables and their software libraries. GridFTP is used, for example, to transfer executables and required libraries to the selected computational server in the grid. Information services are used to discover, characterize and monitor resources, services and computation [3]. GT4's Monitoring and Discovery System 4 (MDS4) provides information about the available grid resources and their status. It has the ability to collect and store information from multiple, distributed information sources. This information is used to monitor (e.g. to track usage) and discover (e.g. to assign computing jobs and other tasks) the current state of services and resources in a grid system. The DataMiningGrid high-level services (in particular the Resource Broker and Information Services) use the MDS4 service. In our FDL application, the following GT4 services are extensively used: WS-GRAM, GridFTP, and MDS4.

Scheduling of grid jobs in local computing clusters is achieved by using the Condor [22] middleware. Condor is specialized workload management software for submitting compute-intensive jobs to local computational clusters. In our application, GT4 submits a subset of parallel jobs to appropriate Condor clusters, and it is up to the Condor software to place them into a local queue, choose when and where in the local cluster to run the jobs, carefully monitor the progress of the jobs, and ultimately inform the GT4 services upon their completion.
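To make the local scheduling step concrete, a single harvesting job could be described to Condor roughly as follows. This is a hypothetical submit description: in our test bed such job descriptions are generated via WS-GRAM rather than written by hand, and the executable name and arguments are invented for illustration.

    # Hypothetical Condor submit description for one harvesting job.
    # In the test bed WS-GRAM generates the job description; the
    # executable name and arguments below are invented.
    universe   = vanilla
    executable = dl-harvester
    arguments  = http://repository.example.org/oai oai_dc
    output     = harvest.out
    error      = harvest.err
    log        = harvest.log
    queue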

    3.2. DataMiningGrid high-level services

In addition to the core grid services provided by GT4, other high-level WSRF compliant ready-to-use services have recently been developed under the DataMiningGrid project [5]. Here, we provide a brief overview of the Resource Broker and the Information Integrator Service, which are used extensively in our personalized FDL application. These services support the parallel execution of a variety of batch-style programs on arbitrary machines in the grid environment.

    3.2.1. Resource broker

The Resource Broker service [13] is responsible for the execution of software resources, such as the DL harvesting application, as stand-alone applications anywhere in the grid environment. It provides matching between the request for application execution, which is also called a job in grid terminology, and the available computational and data resources in the grid. It takes as input the computational requirements of the job (Central Processing Unit power, memory, disk space etc.) and the data requirements of the job (data size, data transfer speed, data location etc.) and selects the most appropriate execution machine for the particular job. The job is passed on to the WS-GRAM service and executed either on an underlying Condor cluster or by using the GT4 Fork mechanism.

The Resource Broker service is capable of job delegation to resources spanning multiple administrative domains. The execution machines are automatically selected, so that the inherent complexity of the underlying infrastructure is hidden from the users. The Resource Broker service performs the orchestration of automatic data and application transfers between the grid nodes, using the GridFTP and RFT components of GT4 for the transfers.

The DataMiningGrid Resource Broker can execute multi-jobs. Multi-jobs are collections of single jobs that are bound for parallel execution. In the DataMiningGrid, a multi-job usually consists of a single application, which is instantiated with different input parameters and/or input data sets.

In the case of our FDL application, a multi-job is formed by instantiating the DL-Harvester application (see Section 3.3) several times, each time with a different DL to be harvested. The individual jobs are then executed in parallel on various computational servers in the grid environment. Each job, therefore, represents the harvesting of one DL, while a multi-job represents the harvesting of several DLs in parallel.

    3.2.2. Information integrator service

The Resource Broker makes extensive use of the Information Integrator service, which is also provided by the DataMiningGrid and operates in connection with the MDS4 service provided by GT4. The Information Integrator service is designed to feed into other grid components and services, including services for discovery, replication, scheduling, troubleshooting, application adaptation, and so on. Its key role is to create and maintain a register of grid-enabled applications. It facilitates the discovery of grid-enabled applications on the grid, and their later use through the Resource Broker service.

3.3. The OAI-PMH protocol and a DL-Harvester application

Metadata contained in DLs can be accessed over various protocols, such as Z39.50 [24] or the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [19]. OAI-PMH is a simple protocol that allows data providers to expose their metadata for harvesting. It is specified by the Open Archives Initiative (OAI), which develops and promotes interoperability standards to facilitate the efficient dissemination of metadata on the WWW. The technological framework for this purpose is the above-mentioned OAI-PMH protocol. This protocol is independent of both the type of content offered (e.g. free-text articles, multimedia) and the economic mechanisms surrounding that content, and it promises to have a big impact on opening up access to a wide range of digital materials. Currently (Feb. 2008), there are 771 OAI-PMH compliant repositories listed on the OAI web page [26], 1075 on the OpenDOAR directory of academic open access repositories [27] and 1010 on the Registry of Open Access Repositories (ROAR) [28] web portal. With the growing acceptance of the OAI initiative, the number of OAI-compliant repositories is rapidly increasing.

The OAI-PMH protocol supports metadata dissemination and harvesting in different metadata formats. The requested metadata records are returned as well-formed Extensible Markup Language (XML) instance documents that are valid according to a prescribed XML schema. The characters are encoded in the 8-bit UCS/Unicode Transformation Format (UTF-8), and the Hypertext Transfer Protocol (HTTP) is used for transport. As a minimum standard for interoperability, OAI-PMH compliant DLs must be able to disseminate metadata in the Dublin Core (DC) format [25]. DC defines fifteen metadata elements for simple content description and discovery, such as Title, Creator, Subject, Description, Publisher, etc. These kinds of metadata were used for the present study.
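As an illustration, the Dublin Core payload of a harvested record looks roughly as follows; the namespaces are those prescribed by OAI-PMH and DC, while the field values are invented.

    <!-- Sketch of an oai_dc metadata payload; the values are invented. -->
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Article Title</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:subject>digital libraries</dc:subject>
      <dc:publisher>Example Press</dc:publisher>
      <dc:date>2007-11-05</dc:date>
    </oai_dc:dc>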

Fig. 1. The grid-enabled DL-Harvester application in the DataMiningGrid test bed.

The OAI-PMH protocol supports both full harvesting and selective harvesting of DLs. Metadata harvesting is achieved through the ListRecords HTTP request. When this request is issued to a repository, it returns a complete list of the metadata records contained in that repository. If the repository is big, the list of metadata records may be too large, so several HTTP requests and responses are needed in order to achieve full harvesting. In this case:

(1) The repository replies to the ListRecords request with an incomplete list and a resumption token. The number of metadata records included in the returned incomplete list is not defined by the protocol itself, so it varies depending on the repository implementation.

(2) In order to assemble a complete list, the harvester needs to issue additional requests, using resumption tokens as arguments, until the last record list with an empty resumption token is received.

(3) A complete list of records is then formed by concatenating the separate lists collected from the sequence of requests.
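For illustration, the loop in steps (1)-(3) can be sketched in a few lines of Python. This is a minimal sketch only: the repository URL is hypothetical, and the error handling, UTF-8 repair and HTTP retry logic that a production harvester such as the DL-Harvester needs are omitted.

    # Minimal OAI-PMH full-harvesting loop (illustrative sketch only).
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH XML namespace

    def harvest(base_url, metadata_prefix="oai_dc"):
        records = []
        url = f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}"
        while True:
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            records.extend(tree.iter(OAI + "record"))
            token = tree.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                return records  # absent or empty token: the list is complete
            # follow the resumption token with a further ListRecords request
            url = (f"{base_url}?verb=ListRecords"
                   f"&resumptionToken={urllib.parse.quote(token.text)}")

    records = harvest("http://repository.example.org/oai")  # hypothetical DL
    print(len(records), "metadata records harvested")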

The OAI-PMH protocol also provides specifications for selective harvesting. This makes it possible to limit harvesting requests to portions of the available metadata in a repository. Two types of harvesting criteria may be combined in an OAI-PMH request: (1) datestamps, to harvest only those records that have been created, deleted or modified within a specified date range, and (2) set membership, to harvest only records that belong to a certain category defined by the library.
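For example, the two criteria translate into plain request parameters; the base URL and set name in the following requests are hypothetical.

    http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2008-01-01&until=2008-01-31
    http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=physics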

For the purpose of this study, we developed and grid-enabled a DL-Harvester application, which harvests a selected DL over the OAI-PMH protocol. The developed DL-Harvester is a batch-style harvester application, which takes as input the Uniform Resource Identifier of the DL to be harvested, and additional input parameters that allow for selective metadata harvesting. The DL-Harvester application supports the control flow of the OAI-PMH protocol by handling resumption tokens and concatenating response results automatically. The DL-Harvester application can therefore be easily configured to perform either full harvesting or selective harvesting. Nevertheless, in our use-case scenario each user is allowed to select his own set of DLs to be harvested. The user-selected libraries are harvested on-the-fly; hence, full harvesting has to be performed (see [23] for details).

It should also be noted that selective harvesting is often impossible because (1) support for deleted records is inconsistently implemented in existing DLs, and (2) the instability of DL servers frequently causes problems in determining datestamps to re-sync the harvested metadata with the remote DL. Therefore, in practice, the only reliable way to ensure that aggregated metadata are up to date is to perform a new full harvesting cycle (see [29] for details).

Due to the reasons listed above, along with the fact that full harvesting is the most time consuming, an experimental setup was designed in which only full harvesting was performed, rather than selective (incremental) harvesting of DLs.

    4. Experimental setting

    4.1. Resources and test bed

For the purpose of the FDL application, we used a grid test bed, which was developed by the DataMiningGrid Consortium. The test bed spans three countries: the United Kingdom, Germany and Slovenia. The part of the test bed used in the present study is depicted in Fig. 1. It consists of 4 front-end servers with GT4 installations and local computational clusters with a varying number of computational machines (from 20 to 80). Condor is used as a local scheduler that controls the local computational clusters. All four GT4 servers run core GT4 and high-level DataMiningGrid services to support the execution of different grid-enabled applications in the test bed.


Fig. 2. Harvesting multi-job for five Digital Libraries.

The DataMiningGrid test bed provides a number of capabilities, the most important being the following:

- The ability to execute a variety of batch-style applications, including the DL-Harvester application, at any appropriate computational server in the grid. Over 25 grid-enabled applications are currently stored in executable repositories on various grid servers. Several of these applications may be used for designing sophisticated DL services. For example, along with the DL-Harvester, a computationally intensive distributed indexing algorithm was also grid-enabled, and the end-users may at any time decide to run it on the corpus of harvested metadata.

- Meta-scheduling, i.e. the dynamic and automatic allocation of optimal computational servers in the grid environment, which is achieved through the use of the DataMiningGrid Resource Broker, the DataMiningGrid Information Integrator service and MDS4.

- Application and data movement across different administrative domains, which is achieved through the use of the GridFTP and RFT services.

In addition to these, the DataMiningGrid test bed has a number of other capabilities, such as a Grid Security Infrastructure, which are extensively described elsewhere [21].

    4.2. Grid-enabling the DL-Harvester application

In order to grid-enable the DL-Harvester application, we followed a very simple two-step procedure. In the first step, the actual DL-Harvester executable is uploaded to one of the grid servers. In the second step, an XML document that describes the DL-Harvester application is prepared and registered with the DataMiningGrid Information Integrator service, which passes the XML document to the associated MDS4 service.

From this point forward the DL-Harvester application is ready to be used in the grid environment. The application and all its properties may later be easily found in the grid by searching MDS4. This information is also used by other grid services, such as the Resource Broker. The XML document that describes the DL-Harvester application is, in fact, an instance of a generic Application Description Schema (ADS instance), which was developed recently by the DataMiningGrid project. The ADS provides properties to describe applications in a uniform way so that they can later be executed in a grid environment. The ADS is described in detail elsewhere [21]. The ADS instance contains valuable information about the application domain, properties of the executable, a description of its input parameters and data, processing, storage and memory requirements, and information about the exact storage location of the DL-Harvester executable in the grid environment.
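To give a flavour of such a description, a heavily abbreviated, hypothetical ADS instance might look as follows; the element and attribute names are invented for illustration, and the actual schema is specified in [21].

    <!-- Hypothetical, abbreviated ADS instance; element names are
         invented, the actual schema is defined in [21]. -->
    <application name="DL-Harvester" domain="digital-libraries">
      <executable location="gsiftp://gridserver.example.org/apps/dl-harvester"/>
      <input>
        <parameter name="dlURI" type="URI"/>
        <parameter name="from" type="date" optional="true"/>
        <parameter name="until" type="date" optional="true"/>
      </input>
      <requirements cpuCount="1" memoryMB="512" diskMB="1024"/>
    </application>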

The FDL application is implemented as a client to the Resource Broker service. The client first composes a multi-job to be executed on the grid. This is done by filling additional properties in the (DL-Harvester's) ADS instance. For example, the URLs of all DLs to be harvested are included into the ADS instance, the storage location where harvested records will be concatenated is also specified, and so on. The result is a fully populated XML instance, which represents the description of a multi-job to be executed on the grid. The client then issues this multi-job to the Resource Broker service (Step 1 in Fig. 1). Once it receives a multi-job, the Resource Broker service selects appropriate computational servers in the test bed. The GridFTP service is then called to transfer copies of the DL-Harvester executable to all of the selected grid servers (Step 2). After the DL-Harvester transfer is complete, the Resource Broker submits the harvesting jobs to the WS-GRAM services (Step 3). While the jobs are executed, the Resource Broker keeps a record of their execution (e.g. time of submission, owner, and status) (Step 4).

After execution is completed, the Resource Broker transfers the harvested metadata records and log files from the computational servers to a dedicated Storage Server by using GridFTP (Step 5). The aggregated metadata can now be further processed or used. For example, it would be possible to run an indexing application, again using grid nodes to reduce processing time. As the last step, the Resource Broker service cleans up all of the temporarily generated files on the computational servers. This process completes the multi-job.
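For illustration, the composition step can be pictured as building one job entry per selected DL. The following Python sketch shows the data shape only; the field names and locations are invented, and the real multi-job is a populated ADS instance submitted to the Resource Broker.

    # Sketch of multi-job composition: one harvesting job per selected DL.
    # Field names and locations are invented; the real description is a
    # populated ADS instance (see the text above).
    dl_urls = [
        "http://repo-a.example.org/oai",
        "http://repo-b.example.org/oai",
        "http://repo-c.example.org/oai",
    ]
    multi_job = {
        "application": "DL-Harvester",
        "output": "gsiftp://storage.example.org/results/",  # stage-out target
        "jobs": [{"dlURI": url} for url in dl_urls],        # parallel jobs
    }
    print(len(multi_job["jobs"]), "harvesting jobs composed")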

    4.3. A possible execution scenario and performance measures

Fig. 2 depicts a possible execution scenario for a harvesting multi-job with 5 DLs, while Table 1 defines a number of performance measures which are used in this study. Stage-in time A is the time from the moment of submission of the multi-job to the Resource Broker until the first job starts to run.


Table 1
Definition of performance measures

n                             Number of jobs in a multi-job
A                             Stage-in time
B = B1 + B2                   Additional time due to suboptimal scheduling and grid synchronization overhead
C1, C2, ..., Cn               Run times of the individual instances of the DL-Harvester application
Cmax = max{C1, C2, ..., Cn}   Run time of the longest lasting instance of the DL-Harvester application
D                             Stage-out time
E = A + B1 + Cmax + B2 + D    Multi-job run time
F = C1 + C2 + ... + Cn        Sequential multi-job run time
G = F / n                     Average run time of a sequential job
T = F / E                     Actual speed-up
Ttheory = F / Cmax            Maximum theoretically achievable speed-up

This time includes all the processing time needed to determine the execution machines and to transfer the DL-Harvester application to these machines. The additional time needed for execution due to the grid synchronization overhead, compared to an ideal case when all individual jobs execute within the time frame of the longest job, is represented by time B. Each job's overhead includes the Condor overhead time (the time for job scheduling at the local computational cluster). The time Cmax represents the duration of the longest job in the multi-job, which usually corresponds to the largest DL within a set of DLs. Stage-out time D is the time from the end of the last job in the multi-job until the multi-job is completed, i.e. until the time when the results are made available. D is largely the time needed for the transfer of the harvested metadata records to the specified location where all of the results are merged (e.g. for subsequent indexing purposes). E represents the overall time from the multi-job submission until the multi-job completion. The theoretical speed-up is computed as the sum of all harvesting run-times (as if they were executed in sequence) divided by the longest job run-time, and represents the theoretically maximum achievable speed-up (i.e. in the case when all libraries are harvested in parallel and their processing and data transfer overhead is equal to zero).

    5. Performance measurements

The performance measurements had two main goals: (1) to identify the speed-up factors with a growing number of DLs harvested in parallel, and (2) to assess the overhead introduced by the use of grid technology.

As a first step in this study, we conducted a detailed analysis of the available OAI-PMH compliant DLs. Although several thousand OAI-PMH enabled DLs exist, only a limited number of these precisely comply with the standard and operate reliably without human intervention. The most common problem encountered is related to UTF-8 errors, which result in non-valid XML documents returned by DLs. Other problems include improper date stamping, bad resumption tokens etc. A report on problems with harvesting OAI-PMH repositories can be found in [29]. Due to these problems, only 56 reliable OAI-PMH compliant DLs were identified and used for the study.

Traditionally, speed-up is measured by varying the number of jobs, which must be of the same size (i.e. their execution time is the same on the same computing node). This, however, was not possible to achieve in our scenario. A number of factors may significantly influence the harvesting time, e.g. the number of records harvested, the network bandwidth, and/or the number of users that simultaneously harvest the DL. Adding a long-lasting harvesting job to a multi-job of several short-lasting harvesting jobs would significantly influence the speed-up measurements, so we tried to avoid such a situation. This implied that we had to categorize the harvesting jobs into sets, the jobs belonging to one set being at least comparable in their size.

    Fig. 3. Dependence of the harvesting time on the number of harvested records.

The 56 selected DLs varied significantly with respect to the number of metadata records they contained. The smallest DL stored only 223 metadata records, while the largest DL stored 317,884 metadata records.

Experimentally, it was confirmed that the harvesting time of a DL depends largely on the number of metadata records it contains (see Fig. 3). These results were obtained by taking into account 868 executions of the DL-Harvester; the Pearson R2 value is 0.8547. In our use case scenario, each job represents the full harvesting of one DL. The 56 DLs were divided into three groups according to the number of records they contained. There were 32 small (S), 19 medium-sized (M) and 5 large (L) DLs. In total, 11 experiments were scheduled (see Table 2). The jobs within a multi-job were comparable in size, and consequently, it was possible to form multi-jobs of various sizes (big, medium, small). This experimental setting made it possible to investigate the influence of the size of the harvesting multi-job on the speed-up and grid overhead measures.

The experiments were executed in a real-world grid test bed that was used by several other users at the same time. Therefore, special care was taken not to execute the multi-jobs in a heavily over-loaded grid environment, which could influence the speed-up measurements. The maximum number of DLs which could be harvested in parallel on the grid was 32 (in experiment S-32), so we made sure that the number of unoccupied computational machines in the test bed was always higher than 32 at the time of execution.

In total, 95 multi-jobs were run. A multi-job run time was compared with the time needed to run the jobs sequentially, and the speed-up value was calculated. Table 3 shows average values of the measured parameters, since each experiment (i.e. a multi-job) was repeated 10 times in the case of small and medium sized DLs and 5 times in the case of the large DLs. In the case of the large libraries, the average run of experiment L-5 took almost 7 h, more precisely 24,953 s.


Table 2
Experimental set-up

Experiment   No. of jobs in a multi-job (n)   No. of experiment repetitions (k)   Average no. of metadata records   DL size
S-1          1                                10                                  971                               Small
S-5          5                                10                                  901                               Small
S-10         10                               10                                  949                               Small
S-32         32                               10                                  952                               Small
M-1          1                                10                                  5,241                             Medium
M-5          5                                10                                  7,636                             Medium
M-10         10                               10                                  7,566                             Medium
M-19         19                               10                                  8,327                             Medium
L-1          1                                5                                   31,705                            Large
L-3          3                                5                                   140,329                           Large
L-5          5                                5                                   154,118                           Large

Table 3
Speed-up measurement results (times A-F in seconds; T and Ttheory are dimensionless)

Experiment   A      B      Cmax      D       E         F         T     Ttheory
S-1          48.8   13.3   98.2      34.1    194.4     98.2      0.5   1.0
S-5          57.6   29.6   240.6     39.7    367.5     526.8     1.5   2.7
S-10         63.7   39.9   160.0     49.0    312.6     680.2     2.3   5.1
S-32         77.0   88.9   215.0     75.9    456.8     1980.8    4.3   9.6
M-1          53.7   15.4   625.1     35.6    729.8     625.1     0.9   1.0
M-5          66.4   43.8   646.0     54.4    810.6     2396.1    3.0   3.7
M-10         66.5   34.3   751.7     77.6    930.1     4401.6    4.8   5.9
M-19         74.0   55.4   937.3     125.0   1191.7    8370.1    7.0   8.9
L-1          65.8   12.4   3056.0    52.6    3186.8    3056.0    1.0   1.0
L-3          53.6   33.8   12535.0   195.8   12818.2   31646.6   2.5   2.5
L-5          66.2   52.4   24492.8   342.4   24953.8   59534.4   2.4   2.4
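As a consistency check on the definitions in Table 1, the L-5 row reproduces both speed-up figures; since the table reports averages over repetitions, the other rows need not match ratios of their averaged columns exactly.

    E       = A + B + Cmax + D = 66.2 + 52.4 + 24492.8 + 342.4 = 24953.8 s
    T       = F / E    = 59534.4 / 24953.8 ≈ 2.4
    Ttheory = F / Cmax = 59534.4 / 24492.8 ≈ 2.4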

    Fig. 4. Evaluation of the grid overhead with small, medium and large DLs.

In Fig. 4 it is possible to visually compare the grid-synchronization overhead (in percentages relative to the total run time of the multi-job), which increases with the number of jobs within a multi-job. Grid synchronization overhead occurs because of the suboptimal scheduling policy of the Resource Broker. In optimal conditions all the jobs should be completed within the time interval of the longest job in a multi-job, but this was not the case in our experiments, as can be seen in Fig. 2. Also, as expected, the total grid overhead (including stage-in overhead, synchronization overhead and stage-out overhead) is much smaller in the experiments conducted with medium and large DLs, while in the case of small DLs the total grid overhead is very large. For example, in experiment S-32 the total grid overhead represents approximately 80% of the total run time of the multi-job.

Finally, Fig. 5 shows that the speed-up increases linearly with the growing number of harvested DLs. This increase is faster in the case of medium and large size DLs. The linear approximation formulae are significant in the case of small and medium size DLs and are as follows:

- for small DLs: T = 0.1142n + 0.7704, Pearson R2 = 0.9617
- for medium DLs: T = 0.3336n + 0.9839, Pearson R2 = 0.9716
- for large DLs: T = 0.3587n + 0.8663, Pearson R2 = 0.7068.
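Plugging the largest multi-jobs into these fits gives a quick plausibility check against the measured values in Table 3:

    small DLs,  n = 32:  T = 0.1142 * 32 + 0.7704 ≈ 4.4  (measured: 4.3)
    medium DLs, n = 19:  T = 0.3336 * 19 + 0.9839 ≈ 7.3  (measured: 7.0)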

Fig. 5. Speed-up measurements for small, medium and large DLs.

6. Discussion and conclusions

In this paper, we presented a grid-based application that addresses the performance, scalability and reliability requirements of existing Federated Digital Library solutions. The provision of new, sophisticated, reliable, personalized FDL solutions requires an infrastructure and services capable of solving complex problems. This kind of computational, data and informational complexity cannot be adequately addressed by pure Web service technology, which is demonstrated by the number of related research projects combining DL and grid technology (see Section 2 for more details). To the best of our knowledge, these projects have not yet published results with which we could compare. Our results show that open, standard interfaces and WSRF-compliant services for grid computing may be used to address the investigated problems.

At a technical level, we have achieved the execution of the DL-Harvester application in a geographically distributed environment without prior installation, and have exploited the redundancy of computational servers in the grid environment in order to achieve application speed-up. Our DL-Harvester application is capable of performing selective harvesting according to dates; however, this feature was not used when performing the experiments. The obtained results are promising and indicate that a system like the one presented in this study may be useful for developing a production-level FDL.

The digital libraries differed largely in terms of their performance. Some parameters that influence the performance are: the number of metadata records in the DL, the different DL software implementations, the network bandwidth, the number of concurrent DL users etc. For these reasons, the performance of a DL may vary on an hourly basis.

We were mostly interested in the variations of the speed-up and the related grid overhead with a growing number of libraries harvested in parallel. These variations were observed separately for the harvesting of libraries containing small, medium and large numbers of metadata records. The grid overhead was assessed: (1) by comparing the differences of the stage-in, grid-synchronization overhead and stage-out times in the various experiments (the grid-synchronization overhead rate was lower in the case of the medium and large size DLs), and (2) by comparing the actually obtained speed-up results with the theoretical maximum achievable speed-up (the difference was small in the case of medium size libraries, and it was minimal in the case of the largest libraries, see Table 3). This implies that the use of grid technology is especially beneficial when the individual jobs of a grid multi-job harvest large numbers of metadata records. In this case the relative total grid overhead remains low.

The measurements show that the speed-up increases approximately linearly with a growing number of harvested DLs. This increase is faster if the grid jobs are large, since in this case the relative grid overhead is small. Another observation is that the difference between the theoretical and actual speed-up increases with the increasing number of jobs within a multi-job. The reason for this is the sub-optimal scheduling policy which is used by the Resource Broker (see the example of 5 DLs presented in Fig. 2). The greater the number of jobs that are part of the multi-job, the higher the possibility that some of these jobs will be sub-optimally scheduled and, consequently, the longer it will take to execute the multi-job. This could be improved, for example, by applying an advanced scheduling system using adaptive scheduling algorithms, such as those described in [30].

    Based on the study, it is possible to conclude that:

- Harvesting small digital libraries (from 200 to 2000 metadata records, i.e. harvesting times in the order of 100s of seconds) is not a problem suitable for global grids;

- The use of grid environments is beneficial with larger libraries (2000 and more records, and harvesting times of more than 10 min); the larger the DLs, the greater the benefits of using grid technology;

- In order to achieve good speed-up, the jobs within a multi-job should be comparable in size. In the case of the DL-Harvester, this can be achieved by partitioning the harvesting task of a large DL into several jobs by using selective harvesting;

- The grid overhead increases with the number of jobs. This may be improved by improving the global scheduling policy which is used by the Resource Broker.

The implemented system prototype, which is based on the latest DataMiningGrid (released in March 2007) and GT4 technologies, is generic, as it also allows for the inclusion of arbitrary harvesting, indexing, ontology learning and other applications in the grid environment. This, in turn, will allow service providers to build innovative, scalable, high-performance FDL applications that were impossible to imagine in the past. As the next research step we are planning to set up a complete FDL service, where the harvesting process will be followed by indexing and search phases.

We have shown that grid technology may be beneficial in the case of FDL applications, especially with the growing size and the rapid increase of the number of such repositories. Improvements are still needed in the resource scheduling policies, which may significantly reduce the grid synchronization overhead. We identified two main reasons why the use of grid technology is likely to improve future FDL services. The first reason is the use of distributed computational resources to speed up the process of harvesting, indexing and other computationally intensive tasks, which allows for frequent harvesting and indexing and therefore keeps the FDL up-to-date. The second reason is the improved system reliability, since the grid system has no single point of failure.


    Acknowledgement

This work has been conducted under the DataMiningGrid project, Data Mining Tools and Services for Grid Computing Environments, research grant EU IST-2004-004475.

    References

[1] G. Aloisio, M. Cafaro, I. Epicoco, Early experiences with the GridFTP protocol using the GRB-GSIFTP library, Future Generation Computer Systems 18 (8) (2002) 1053–1059.
[2] D. Castelli, Digital libraries of the future and the role of libraries, Library Hi Tech 24 (4) (2006) 496–503.
[3] K. Czajkowski, C. Kesselman, S. Fitzgerald, I. Foster, Grid information services for distributed resource sharing, in: Proc. 10th IEEE International Symposium on High-Performance Distributed Computing, 2001, p. 181.
[4] K. Czajkowski, D. Ferguson, I. Foster, J. Frey, S. Graham, D. Snelling, S. Tuecke, From open grid services infrastructure to web services resource framework: Refactoring and evolution, Retrieved April 07, 2008 from http://www.globus.org/wsrf/specs/ogsi_to_wsrf_1.0.pdf.
[5] DataMiningGrid (Data Mining Tools and Services for Grid Computing Environments) project, Retrieved April 07, 2008 from http://www.datamininggrid.org.
[6] DELOS (Digital Library Architectures: Peer-to-peer, grid, and service-orientation) network of excellence on digital libraries, Retrieved April 07, 2008 from http://www.delos.info/.
[7] M. Feller, I. Foster, S. Martin, GT4 GRAM: A functionality and performance study, Retrieved April 07, 2008 from http://www.globus.org/alliance/publications/papers/TG07-GRAM-comparison-final.pdf.
[8] I. Foster, C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, San Francisco, CA, USA, 2004.
[9] I. Foster, C. Kesselman, The Globus project: A status report, Future Generation Computer Systems 15 (5–6) (1999) 607–621.
[10] I. Foster, Globus Toolkit version 4: Software for service-oriented systems, in: IFIP Intl. Conf. on Network and Parallel Computing, in: Lecture Notes in Computer Science, vol. 3779, Springer, 2005, pp. 2–13.
[11] I. Foster, C. Kesselman, J. Nick, S. Tuecke, The physiology of the Grid: An open grid services architecture for distributed systems integration, Retrieved April 07, 2008 from http://www.globus.org/alliance/publications/papers/ogsa.pdf.
[12] G. Haya, F. Scholze, J. Vigen, Developing a grid-based search and categorization tool, High Energy Physics Libraries Webzine, Issue 8, October 2003.
[13] V. Kravtsov, T. Niessen, V. Stankovski, A. Schuster, Service-based resource brokering for grid-based data mining, in: Proc. of the 2006 International Conference on Grid Computing and Applications, 2006, pp. 163–169.
[14] R.R. Larson, R. Sanderson, Grid based digital libraries: Cheshire3 and distributed retrieval, in: Proc. Fifth ACM/IEEE Joint Conf. on Digital Libraries, Denver, CO, USA, 2005, pp. 112–113.
[15] R.K. Madduri, C.S. Hood, W.E. Allcock, Reliable file transfer in grid environments, in: Proceedings of the 27th Annual IEEE Conference on Local Computer Networks, 2002, pp. 737–738.
[16] K. Maly, M. Zubair, V. Chilukamarri, P. Kothari, GRID based federated digital library, in: Proc. of the 2nd Conference on Computing Frontiers, 2005, pp. 97–105.
[17] K. Maly, M. Zubair, X. Li, A high performance implementation of an OAI-based federation service, in: Proceedings of the 11th International Conference on Parallel and Distributed Systems, ICPADS'05, vol. 01, 2005, pp. 769–774.
[18] G.B. Newby, K. Gamiel, N. Nassar, Secure information sharing and information retrieval infrastructure with GridIR, in: Intelligence and Security Informatics, First NSF/NIJ Symposium, ISI, Tucson, AZ, USA, in: Lecture Notes in Computer Science, Springer, Berlin, 2003.
[19] The Open Archives Initiative, The Open Archives Initiative Protocol for Metadata Harvesting, Protocol Version 2.0 of 2002-06-14, Retrieved April 07, 2008 from http://www.openarchives.org/OAI/openarchivesprotocol.html.
[20] J.M. Schopf, L. Pearlman, N. Miller, C. Kesselman, I. Foster, M. D'Arcy, A. Chervenak, Monitoring the grid with the Globus Toolkit MDS4, in: Proc. of SciDAC 2006, Scientific Discovery Through Advanced Computing, 25–29 June 2006, Denver, Colorado, USA, Journal of Physics: Conference Series 46 (2006) 521–526.
[21] V. Stankovski, M. Swain, V. Kravtsov, T. Niessen, D. Wegener, J. Kindermann, W. Dubitzky, Grid-enabling data mining applications with DataMiningGrid: An architectural perspective, Future Generation Computer Systems 24 (4) (2008) 259–279.
[22] D. Thain, T. Tannenbaum, M. Livny, Distributed computing in practice: The Condor experience, Concurrency and Computation: Practice & Experience 17 (2–4) (2005) 323–356.
[23] J. Trnkoczy, Ž. Turk, V. Stankovski, A grid-based architecture for personalized federation of digital libraries, Library Collections, Acquisitions, and Technical Services 30 (3–4) (2006) 139–153.
[24] NISO standard: ANSI/NISO Z39.50 - Information Retrieval: Application Service Definition & Protocol Specification, Retrieved April 07, 2008 from http://www.niso.org/kst/reports/standards/.
[25] NISO standard: ANSI/NISO Z39.85 - The Dublin Core Metadata Element Set, Retrieved April 07, 2008 from http://www.niso.org/kst/reports/standards/.
[26] OAI Registered Data Providers, Retrieved February 21, 2008 from http://www.openarchives.org/Register/BrowseSites.
[27] OpenDOAR directory of academic open access repositories, Retrieved February 21, 2008 from http://www.opendoar.org/.
[28] Registry of Open Access Repositories (ROAR), Retrieved February 21, 2008 from http://roar.eprints.org/.
[29] C. Lagoze, D. Krafft, T. Cornwell, N. Dushay, D. Eckstrom, J. Saylor, Metadata aggregation and automated digital libraries: A retrospective on the NSDL experience, in: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, Chapel Hill, NC, USA, 2006, pp. 230–239.
[30] Y. Gao, H. Rong, J.Z. Huang, Adaptive grid job scheduling with genetic algorithms, Future Generation Computer Systems 21 (1) (2005) 151–161.

Jernej Trnkoczy studied telecommunications and was awarded his engineering degree in 2003 from the Faculty of Electrical Engineering, University of Ljubljana. He is employed as a researcher at the Laboratory for Digital Signal Processing (LDOS) and is also engaged in post-graduate study at the same Faculty. His research interests include distributed computing, grid and P2P technologies, and their applications in information retrieval systems. He has been involved in several grid and P2P related European projects, such as the EU IST DataMiningGrid project.

Vlado Stankovski was awarded his B.Sc. and M.Sc. degrees in computer science from the University of Ljubljana in 1995 and 2000, respectively. He began his career in 1995 as a consultant and later as a project manager with the Fujitsu-ICL Corporation in Prague. From 1998 to 2002 he worked as a researcher at the University Medical Centre in Ljubljana. Since 2003, he has been employed as a researcher at the Department of Civil Informatics at the Faculty of Civil and Geodetic Engineering. Recently, he was the technical manager of the EU IST DataMiningGrid project. He specializes in semantic grid technologies.
