
Int J Digit Libr
DOI 10.1007/s00799-007-0011-z

REGULAR PAPER

Evaluation of digital libraries

Norbert Fuhr · Giannis Tsakonas · Trond Aalberg · Maristella Agosti · Preben Hansen · Sarantos Kapidakis · Claus-Peter Klas · László Kovács · Monica Landoni · András Micsik · Christos Papatheodorou · Carol Peters · Ingeborg Sølvberg

© Springer-Verlag 2007

Abstract Digital libraries (DLs) are new and innovative information systems, under constant development and change, and therefore evaluation is of critical importance to ensure not only their correct evolution but also their acceptance by the user and application communities. The Evaluation activity of the DELOS Network of Excellence has performed a large-scale survey of current DL evaluation activities. This study has resulted in a description of the state of the art in the field, which is presented in this paper. The paper also proposes a new framework for the evaluation of DLs, as well as for recording, describing and analyzing the related research field. The framework includes a methodology for the classification of current evaluation procedures. The objective is to provide a set of flexible and adaptable guidelines for DL evaluation.

N. Fuhr · C.-P. Klas (B)
University of Duisburg-Essen, Duisburg, Germany
e-mail: [email protected]

T. Aalberg · I. Sølvberg
Norwegian University of Science and Technology, Trondheim, Norway

P. Hansen
Swedish Institute of Computer Science, Kista, Sweden

L. Kovács · A. Micsik
Computer and Automation Research Institute of the Hungarian Academy of Sciences (MTA SZTAKI), Budapest, Hungary

M. Agosti
University of Padua, Padova, Italy

G. Tsakonas · S. Kapidakis · C. Papatheodorou
Department of Archive and Library Sciences, Ionian University, Corfu, Greece

C. Peters
CNR-ISTI, Pisa, Italy

M. Landoni
Department of Computer and Information Sciences, University of Strathclyde, Glasgow, UK

1 Introduction

Digital libraries (DLs) are complex systems; they can be, and are, viewed from different perspectives. The methods and metrics for the evaluation of DLs may vary according to whether they are viewed as institutions, as information systems, as new technologies, as collections, or as new services.

A DL is a particular kind of information system and consists of a set of components, typically a collection (or collections), a computer system offering diverse services on the collection (a technical infrastructure), people, and the environment (or usage) for which the system is built. When designing a DL, the starting points are its intended usage and the corresponding user needs. The model is based upon the assumption that the user and the user needs specify the main requirements with respect to the range and content of the collections. The nature of the collections will thus predetermine the range of technologies that are needed. The attractiveness of the collections to the users and the ease of use of the technologies by the user group will determine the extent of the usage of the DL.

DLs are components in several types of applications in areas such as cultural heritage, health, government, learning, and science. Technological advances in areas like information searching and retrieval, information storage, user interfaces, and telecommunications, as well as the increasing availability of a variety of multimedia collections, make it possible to offer new and better services for user groups. As DLs have been around for a few years now, an increasing number of users have some familiarity with them. The expectations and demands for better service and functionality from these users are increasing. Thus the importance of quality in DL content and services is higher than ever. However, the quality of a complex system is never better than the quality of its "weakest" component. In order to improve quality there is a need for definitions of what is intended by quality, and for the appropriate metrics and methods for evaluation.

The design and development of a DL is expensive. Evaluation results from previous systems can give guidelines as well as assist in determining methods for the construction of cost-effective and sustainable new systems. "Doing it right" in the first phases of system development is critical for the final result, as the quality of an end-product depends to a great extent on the quality of early requirements and conceptual models.

The Evaluation Forum of the DELOS Network of Excellence on DLs (2000–2003) first proposed a framework for the classification of DLs, and produced the first version of a MetaLibrary of test-beds to be used in DL evaluation [1]. This was presented at the first DELOS workshop on evaluation, the Budapest workshop [2]. From the Budapest workshop two broad themes emerged: the complementary needs for metrics and test-beds, and for evaluation in the context of specific DL applications. One of the first activities of the new DELOS evaluation workpackage under the Sixth Framework Programme was to organize a workshop in Padua, Italy, October 4–5, 2004 [3]. At the Padua Workshop an extensive overview of the literature in all areas of DL evaluation was presented and different aspects were addressed: user experiences; DL models; DL usage; content evaluation; system evaluation; and experiences from the CLEF and the INEX initiatives.1 The workshop provided a good foundation for our description of the state of the art in DL evaluation, and the identification of major issues for further research in this area.

The workshop results constitute the context, the aims and the motivation of this paper. It is recognized that DLs support specific activities in specific contexts and that they need to be evaluated to determine how useful, usable and effective they are. Therefore in Sect. 2 we introduce the principal concepts and conditions in the DL evaluation domain, and in the following section (Sect. 3) we present the state of the art in the domain. In Sect. 4, we employ a revised version of the first DELOS evaluation framework in order to provide a model that holistically covers the current state in DL evaluation. In Sect. 5 we present a new DL framework, which operates dynamically at diverse levels. A set of recommendations is listed in Sect. 6. A summary and conclusions are given in the final section (Sect. 7).

1 Both the Cross-Language Evaluation Forum (CLEF) and the Initiative for the Evaluation of XML Retrieval (INEX) are sponsored by the DELOS Evaluation workpackage. See http://www.clef-campaign.org/ and http://inex.is.informatik.uni-duisburg.de/.

2 Basic concepts and assumptions of DL evaluation

Although evaluation activities started soon after the first DL systems were available, the underlying assumptions and goals of these efforts were quite disparate. Thus, there is a need for agreement with respect to the central concepts, assumptions, parameters, and criteria of DL evaluations. In this section, we pinpoint some of these issues.

2.1 Evaluating the interaction between user and content

Digital libraries can be used for many reasons, but the most central set of use cases focuses on information access. Finding certain types of content, retrieving specific information, locating known items, accessing material the client does not know enough about: there are many content-based, more or less goal-directed motivations that will lead a user to the access terminal of a digital collection of information.

2.2 Current evaluation schemes

The system components that provide the user with access to the content of the library are, for the most part, based on information retrieval technology. This is a technology that is well researched and has developed its own tried and tested methodologies for evaluation. Much of what is known about the characteristics of the central components of any information system is due to a well-established and accepted formalization of the tasks they are designed to perform and a consequent evaluation of the performance. This established evaluation framework is laboratory-based and implies a fairly draconic generalization of usage, user task, and user situation. It is derived from ideas first introduced in the Cranfield experiments [4].


2.3 Current evaluation: the notion of relevance

The target notion of experimental evaluation thus far has been that of relevance. The concept of relevance is central to understanding the interactions between users, usage, information needs and informational content. Relevance – the momentary quality of a text that makes it valuable enough to read – is a function of task, text characteristics, user preferences and background, situation, tool, temporal constraints, and untold other factors. This has been understood and operationalized as a relation between documents and assessments made by a panel of human judges, in partial contradiction of the everyday use of the word.

Relevance, as measured by information retrieval evaluation cycles, does not take into account user satisfaction and pleasure, quality of information, relations between documents, or reliability of information. Most importantly, it is completely abstracted away from every conceivable context one might care to investigate. This includes the various types of contexts in which the item of information, the user, the producer, and the session may be engaged.

2.4 Current evaluation: the metrics of precision and recall

Given a target notion of relevance, system performance is evaluated using the well-accepted twin measures of recall and precision. These measures have been defined to model completeness and exactitude of systems, respectively, but not to model user satisfaction, result pertinence or system effectiveness for a given task context. Even more important, the TREC interactive track has shown that the results from batch evaluations [5] do not carry over to interactive retrieval – users are able to compensate for most of the differences observed in batch experiments.
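In their standard set-based form, with R the set of documents judged relevant and A the set of documents retrieved by the system, the two measures are:

\[
\mathrm{precision} = \frac{|R \cap A|}{|A|}, \qquad \mathrm{recall} = \frac{|R \cap A|}{|R|}
\]

Precision thus captures the exactitude of an answer set and recall its completeness with respect to the relevance judgments.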

2.5 New demands on evaluation

In DL research and deployment, new use cases and market opportunities beyond those of ad-hoc retrieval are being developed and introduced, and information access is being placed in context. When evaluating a DL as a content-bearing tool, system retrieval performance is but one component of overall system acceptability, and evaluation must take more than system performance into account.

Purely laboratory-based approaches will not be able to fulfil this task. First and foremost, evaluation schemes must be designed specifically for the DL field. Secondly, the session must be taken as the unit for evaluation, not the information request. Users of information access systems interact with information, with the content of the systems they use. They retrieve material for a reason; they continuously evaluate the information they are accessing; they formulate, reformulate, and change their goals dynamically as they go; the information they access changes their perspective. They judge documents by perusing them in detail or by assessing them briefly in comparison with other available documents. An information access session is a real and perceived item in information access behavior, but systems typically are unaware of interaction on levels of abstraction higher than immediate request-retrieval turns.

2.6 Current tools and future tools – how to evaluate?

The technological support for many of these interactional aspects of information access does not yet exist other than implicitly in the system context. Most of the persistence or continuity in interaction with the system is provided by the user, rather than the system. Thus, the scope of an information retrieval component should be extended so that it is more task-effective, more personalized, or more enjoyable. In order to evaluate these systems, new metrics are required which tie together the disparate and vague notions of user satisfaction, pertinence to task, and system performance.

2.7 A new relevance?

If the concept of relevance is de-constructed and enhanced to take context into account, and information access systems are made to model information users, information producers, information access sessions, and contexts, future research will be better able to satisfy the needs of information seekers, both professional and incidental. Extending the notion of relevance so that it does not lose its attractive formal qualities is not a straightforward task, and has been attempted in various research endeavours in the past. It is completely crucial for any extension to be agreed upon by a large number of researchers. A well-crafted context-sensitive relevance measure will be a durable contribution to the field, which thereafter will only be able to ignore context and usage factors at its peril.

3 State of the art

The goal of this section is to present the current DL foundations, as well as to describe the description and evaluation trends of various research and practice sectors, which could also be adopted by the DL community.

Digital libraries represent a truly interdisciplinary research domain; existing models from a number of different communities can be applied to components of the DL paradigm. For example, the Reference Model for an Open Archival Information System (OAIS-RM) formally defines the processes for the archiving, preservation and access of digital objects [6]. The FRBR (Functional Requirements for Bibliographic Records) data schema, developed under the umbrella of IFLA, provides a framework and conceptual model for creating bibliographic records [7].

The scope of these models goes beyond DLs, and vice versa, the area of DLs is much wider than what these models cover. How broad is the definition of a DL [8,9]? So far, there is no single accepted definition. At the broadest level, the following aspects of a DL have to be considered: an infrastructure of various networked services, a structured collection of digital objects, and an organization with different user roles and human interaction patterns.

Among the models for the DL community, one can find a reference model for actors and roles in DLs, proposed by a DELOS/NSF working group [10]. Within this reference model, users, professionals and agents were identified as the main classes of actors, and their roles were analyzed in characteristic DL use cases. In the context of the DELOS Network of Excellence2 an effort is underway to complete previous efforts and to develop a comprehensive reference model for DL management systems [11].

2 See http://www.delos.info/.

At present, a formal and general model for DLs is the 5S model [12]. According to this model a DL consists of a repository, metadata catalogues, services and a society of users. The 5S refers to streams and structures for the construction of digital objects, spaces for the description of digital object collections and their interrelations, scenarios for the definition of how services and activities change the state of the system, and finally societies for the interconnection of roles and activities within the user community. The 5S model is based on a mathematical formalism, and has been used in various case studies, including the generation of a taxonomy of DL terms. 5SL is a declarative language based on this model for the generation of DL applications.

Models that relate to DLs may contain or may be extended to contain a formalization of evaluation targets and aspects of quality. The 5S model has already been extended towards formalizing quality aspects of DLs. Properties such as accessibility, preservability, timeliness, and completeness have been defined mathematically using this model, so that they are usable as evaluation metrics.
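To give a flavour of such formalizations (a simplified illustration only, not the exact definitions used in the 5S work), the completeness of an actual collection C with respect to an ideal, intended collection I, and the timeliness of a digital object d at evaluation time t, could be written as:

\[
\mathrm{completeness}(C, I) = \frac{|C \cap I|}{|I|}, \qquad
\mathrm{timeliness}(d, t) = \frac{1}{1 + \big(t - t_{\mathrm{mod}}(d)\big)}
\]

where \(t_{\mathrm{mod}}(d)\) is the time of the last update of \(d\), measured in some fixed unit. Once cast in this form, such properties yield numerical values that can be compared across collections and tracked over time, which is what makes them usable as evaluation metrics.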

Effectiveness, cost-effectiveness and the cost–benefit ratio are metrics often used for measuring the performance of library services [13]. Nicholson's conceptual framework for the holistic evaluation of library services [14] proposes an evaluation matrix with dimensions of perspective (internal – library system – and external – user) and topic (library system and use). This matrix helps evaluators to choose targets for measurements and methods for measuring selected targets. In Nicholson's view, evaluation implies measurement, and evaluation criteria are calculated by combining some of the measurements. He introduces the evaluation viewpoints of users, library personnel and decision makers, where evaluations are refined and propagated from users through library personnel towards decision makers, and decisions and changes are propagated in the opposite direction. Optimally, he argues, changes in the library will induce measurements of their impact; the evaluation of this will suggest further changes, and so on. Mapping current measurement and evaluation activities of a library onto this framework helps to relate these activities to each other and to decision making in a holistic view.
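To make the matrix concrete, a hypothetical encoding of its two dimensions with example measurements per cell might look as follows; the cell contents are illustrative choices, not measurements prescribed by [14].

# Hypothetical evaluation matrix in the spirit of Nicholson's framework:
# perspective x topic, with illustrative measurements per cell.
evaluation_matrix = {
    ("internal", "library system"): ["index coverage", "metadata consistency"],
    ("internal", "use"):            ["transaction log statistics", "query failure rate"],
    ("external", "library system"): ["perceived ease of use", "interface satisfaction"],
    ("external", "use"):            ["task completion in user studies", "self-reported usefulness"],
}

def measurements_for(perspective: str, topic: str) -> list[str]:
    """Return the example measurements recorded for one cell of the matrix."""
    return evaluation_matrix[(perspective, topic)]

if __name__ == "__main__":
    print(measurements_for("external", "use"))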

In addition to models and frameworks, there are also some practical methodologies and tools that can be used to measure the value of DL services. For example, the LibQUAL+ methodology is a complete market survey of user perceptions for identifying gaps in service delivery [15]. It is used extensively in the U.S. within the Association of Research Libraries (ARL) and it is under continuous development in pursuit of applicability in the DL sector [16]. The goal of eVALUEd3 is to produce a transferable model for e-library evaluation and to provide training and dissemination in e-library evaluation.

3 http://www.evalued.uce.ac.uk/.

These models differ from DL evaluation models in that they relate the outcome of DL operation to certain served communities, for example the impact on, or satisfaction of, geographically fixed communities. What emerges from this section is the inherent difficulty of evaluation activities. The complexity of the constituent elements, structure and provision of DLs determines to a great extent the variety of evaluation activities in terms of formalism, dimensions, depth and practicality. This was reinforced by the outcomes of the first survey carried out by the DELOS WG on evaluation, mainly in the area of semantics and construct granularity [1]. It is a notable observation that under current conditions evaluation as a whole is a multi-faceted process that demands contributions from varied and mutually distant communities.

4 Defining a conceptual model for DL evaluation

The DELOS Working Group on Evaluation has worked on the definition of a DL conceptual model. The initial version of the model [1] was discussed at the first DELOS evaluation workshop held in Budapest, 2002.

This approach (see Fig. 1) uses a generic definition of a DL as its starting point, and is illustrated by the small circles within the central circle labelled "DL DOMAIN". The model falls into three non-orthogonal components: the users, the data/collection, and the technology. The definition of the set of users predetermines the appropriate range and content of the collection (thick arrow connecting "users" to "collection"). The nature of the collection predetermines the range of appropriate technologies that can be used (thick arrow from "collection" to "technology"). The attractiveness of the collection to the users and the ease with which they use the technologies will determine the extent of usage of the DL (thin arrows show the human–computer interactions, while the dotted arrows show the collective contribution of user, collection and technology interactions to observed overall usage).

From this starting point, it is possible to move outwards to the domain of the DL researchers (outer ring), and to use the non-orthogonal relationships between the principal research areas (users, usage, collection and technology) to create a set of researcher requirements for a DL test-bed. Because 'content is king', the nature, extent and form of the collection predetermine both the range of potential users and the required technology set. Evaluation criteria are assigned to each of the components, and metrics are suggested for these criteria.

Fig. 1 A DL classification and evaluation scheme

4.1 The interaction triptych model

At the second DL evaluation workshop in Padua [3], special attention was given to the relations between the components of a DL, i.e. the relations User–Content, Content–System and User–System [17], where user here represents both the user and the usage nodes of Fig. 1. This is shown in Fig. 2. The Content–System pair is related to performance attributes (precision, recall, response time, etc.), the User–System pair is related to usability aspects (effectiveness, satisfaction, etc.), and the User–Content pair is related to usefulness aspects. These pairs represent the relations – interactions among the DL components – and define a three-axes framework for DL evaluation which we call the Interaction Triptych [18]. The following subsections are structured according to this logical model when discussing various evaluation aspects.
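This mapping between component pairs and evaluation axes can be summarized in a small data structure; the encoding below is purely illustrative, and the example metrics are those named in the text and in Sect. 4.2.

# Illustrative encoding of the Interaction Triptych: each axis connects a pair
# of DL components and carries example evaluation criteria mentioned in the text.
INTERACTION_TRIPTYCH = {
    "performance": {"components": ("Content", "System"),
                    "example_metrics": ["precision", "recall", "response time"]},
    "usability":   {"components": ("User", "System"),
                    "example_metrics": ["effectiveness", "efficiency", "satisfaction"]},
    "usefulness":  {"components": ("User", "Content"),
                    "example_metrics": ["task relevance", "perceived usefulness"]},
}

for axis, info in INTERACTION_TRIPTYCH.items():
    pair = " - ".join(info["components"])
    print(f"{axis}: {pair} -> {', '.join(info['example_metrics'])}")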

4.1.1 User

The user is the first component of any interaction process, and user characteristics are complex and constantly evolving [1]. Even in specific evaluation projects, where research parameters are constrained, the number and the attributes of the users are many and complicated.

Fig. 2 The interaction triptych model


Whatever the particular methodology chosen, when performing a user-oriented evaluation the objectives are to acquire a deep understanding of the user requirements for technology design and planning, and to receive systematic user feedback throughout the design process. Marchionini [19] highlights two important issues that must be taken into account: evaluation research should be longitudinal in order to capture a rich and reliable data set for analysis, and it should be multifaceted, using a combination of methods. Each evaluation may also have its own set of measures and data collection methods.

4.1.2 Content

Content is the prime reason for interacting with a DL. This component addresses the user's information needs. The relation between the user and the content strongly depends on the informational need of the user. The perceived usefulness of content is the first selection criterion for the user. During a search session, the users' perceptions of the usefulness of the content and their information needs can change, and they may be forced to reformulate them and re-direct the information strategy (if any).

Content characterizes the DL and affects the consequent processes dramatically. It is indispensably bound with the purposes of the DL and the desired outcomes of its operation, and serves the information needs of a community. The evaluation of content must take place under strict rules and standards that guarantee user access to high quality, appropriate-for-their-needs information. However, underneath this user-centered layer of evaluation lie numerous other issues; issues related to the nature, the structure and the administration of the content.

4.1.3 System

The system is the best known component of the interaction process, as it is governed by the rationale of the developer. It consists of various subsystems that perform different basic and auxiliary operations. DL systems generally include collections of multimedia digitized data and services that support the storage, access, retrieval and analysis of the data collections. The final aim of a DL system should be that of enabling people to access human knowledge any time and anywhere, in a friendly multi-modal way, by overcoming barriers of distance, language and culture, and by using multiple network-connected devices [20].

In order to fulfill all the requirements on it, a DL system must be the result of the integration of a number of different components. How to implement each single component depends on the architectural choices, on the type of data access and retrieval, and on the visualization and personalization of information. The individual evaluation of such components can be carried out following evaluation standards already available in the literature. However, an important issue is the evaluation of how the individual components interoperate inside a DL. The evaluation of a DL system makes sense mainly if it is performed in relation to users and content.

4.2 Axes of evaluation

As stated, the evaluation process should be based on the relations of the DL components. These interactions define our Interaction Triptych, which consists of the following axes:

– Usability, which defines the quality of interaction between the “User” and the “System”. This helps the user to manipulate a system effectively, in an efficient and enjoyable way, and to exploit all the available functionalities. A usable system is easy to learn, flexible and adapts to user preferences and skills.

– Usefulness, which concerns the “User” and “Content” components. The usefulness of the content and its relevance to the user tasks and needs are the reasons behind the selection and usage of a DL. This relevance is translated into actual relevance, type and level of resource relevance, and task relevance.

– Performance, which is placed between the “Content” and the “System”. Although it is of crucial importance, this is an aspect of the system that the user cannot see or evaluate directly. The performance of the system depends strongly on the formats, structures and representations of the content.

4.2.1 Usability and related studies (User–System)

In our conceptual model, usability constitutes a major research field that derives from the interaction between the user and the system. It represents a set of principles that should ensure that the system helps the users to conduct their tasks in an efficient, effective and satisfactory way. According to the most commonly quoted definition, “usability is the extent to which a product can be used by specified users to achieve specific goals with effectiveness, efficiency and satisfaction in a specified context of use” [21]. Usability concerns not only the surface representation of a system, but also its underlying structures. A usable system needs to be attractive to the user, but above all it must be learnable, safe and reliable. Usable systems need to be error tolerant, consistent, adaptive to the terminological and operational preferences of the users, supportive, easy to remember and visually clear. Usability is not an issue for the DL research area alone. A major part of the market sectors of DLs and electronic publishing is investing a considerable amount of money in the application of usability principles and methods in order to produce usable large-scale systems [22].

Blandford et al. [23], in the editorial of a special issue of the International Journal on Digital Libraries on the usability of DLs, underline the danger of producing DLs with novel and advanced features which will be neglected by the users due to the difficulty of using, learning and understanding their operation. The wealth of research into the usability of library websites has uncovered the syndrome of “librarians know better” [24]. This syndrome can be found everywhere, and it means that developers ignore the users' primary needs, behavioral models, operational skills and context infrastructure.

Usability evaluation is conceived as “the act of measuring (or identifying potential issues affecting) usability attributes of a system or device with respect to particular users, performing particular tasks in a particular context” [25]. In this wide sense, evaluation is conceived as all those processes that (at least) describe and evaluate the impact of specific interface features on the creation of a normal and effective interaction between the user and the system. In any case, usability is the main key for system developers to address user-centered design issues.

Usability issues are located at different stages of the development cycle. Formative and summative evaluation share common techniques but have different perspectives, and the results vary with respect to the number and weight of recommendations. System developers apply a wealth of techniques that range from fully automatic ones to techniques that necessitate the involvement of real users in controlled environments or operational contexts. Automated evaluation methods have been criticized for their lack of ability to produce interpretable and qualitative data that will help the evaluators understand the nature of the usability problem, its severity and the impact it has on the user interaction [26]. Techniques based on user participation are described as time and resource consuming. Nevertheless, the wealth of usability evaluation techniques can be grouped into five main categories:

– Automated techniques that use specific usability evaluation software to produce mostly quantitative data. In this category we can include transaction log analysis, a widely used technique that examines the activity of users in a given time session [27–29].

– Empirical or research-based techniques that require users' participation, observation and recording of their actions [30–32].

– Comparative techniques that collate the system with given standards and guidelines or with other operating systems. This may be performed automatically.

– Interviews, focus groups, surveys and questionnaires, which are the prime methods for collecting qualitative data, like user preferences, explanatory opinions and satisfaction rates [33,34]. Multiple variations exist for each of these techniques, and also for the way in which they are conducted (e.g. online, in person) and the level (e.g. structured, semi-structured interviews).

– Analysis techniques that require the participation of usability experts. They can examine the system either alone or with the cooperation of domain experts for better results [35].

CALIMERA, a project funded by the European Commission's IST Programme, has done a significant amount of work on usability [36]. Dicks, quoting Rubin, notes that usability evaluation has inherent limitations [37]. One of these is the fact that the testing procedure is an artificial situation, so the results of usability testing will not be fully representative. The HCI community has struggled to lift these inherent limitations by recommending the combined application of several methods in a sequential and repetitive manner [38]. In the DL research field this recommendation has been widely used; see, for example, the ADL [39], DeLIver [40] and Envision [41] projects. The combination of different methods has taken the form of the “convergent paradigm”, which, apart from improving system usability, can lead to comparative evaluation of the methods themselves [42].

Other ways of minimizing the effects of usability testing exist, such as letting the users perform their own tasks [43] or developing and practicing techniques that utilize users' spontaneous feedback during real-time interaction [44].

User interface related studies  Generally, the user interface acts as a bridge between the user and the system-environment so that the user can interact with the DL system. It is also the means by which the system reveals itself to the user. A DL must provide support for interaction. In fact, it must support different types of interactions, and therefore appropriate evaluation methodologies must be developed to measure the effectiveness of these interactions.

In a DL, the complexity of the interactions is determined by the dimensions of the users, the information sources and the objects, as well as by the complexity of the tasks to be performed. Some time ago, Bates [45] suggested a four-level approach to search activities (moves, tactics, stratagems, and strategies) that needed to be supported by an information access system (such as a DL) in order for a user to retrieve the information needed. Her proposal clearly refers to interface issues as well as to task-related issues. Cousins [46] suggested an early solution to the problem by proposing a task-oriented interface for a DL. This interface would need to take into account a range of activities and processes, and in particular the variety of:

– information sources,
– information objects,
– users and groups of users.

Any evaluation of the user interface also needs to take into account how well the different components and functionalities are integrated. This will affect the results of the information handling task.

Task and context-oriented studies  Each domain and context contains a specific set of components that form that particular DL environment. The contexts in which specific information access situations occur are therefore important. One of the goals of a DL will be to support users and communities in their task performance processes. Criteria that can be used for evaluation need to be indicated. An example of the group of studies that is concerned with work-task and domain-related evaluation studies is the Cognitive Framework for Analysis and Evaluation [47]. Evaluation is a guided bottom-up approach that focuses on individual as well as collaborating users and the context of their activities within the means and ends of their work domains [48].

Within a task-oriented evaluation perspective, aspects such as interaction and interactivity should be considered. Aspects that can be evaluated include:

– Types of tasks and stages of a single task.
– Task performance and task procedures.
– Task complexity and task variations.
– Differences between domains.
– Impact on organizational levels.
– Types of information access systems and processing activities.

Usage studies  Aspects of the evaluation continuum that have not been sufficiently addressed concern use and usage. When dealing with different types of information systems, the system design usually ends when the user has retrieved the required information. But the information handling process does not end there. There are other issues to be examined, such as: What part of the retrieved information is actually used, and in what way? When and how is the acquired/accessed/retrieved information used? What types of information components are used, and what type of outcome is produced?

As Borgman [49] points out, creating, using and seeking information involves a continuing and iterative evaluation process, which may involve a variety of evaluation methods and measures. The choice between the various methods is in itself a research question: the last word on evaluation methodologies has not yet been written, neither with regard to the art of methodology development nor to the craft of methodology choice for practical evaluation purposes.

4.2.2 Usefulness (User–Content)

Usefulness is the abstraction of every information need that stimulates user interaction with DLs. It lies between the content and the user needs, and it reflects how users perceive the relevance of a DL to their needs, the width, the breadth, the quality and the validity of its collection, and its ability to serve their goals. The assessment of usefulness depends on the features of both the user and the content components. User monitoring provides us with an insight into the way the user seeks information and uses it in the DL environment. Two major approaches are followed when addressing these issues, namely user studies and information behavior. These are two multidisciplinary areas that can address different variables of the evaluation process. Either as main evaluation processes or as supplementary components (at an a priori and/or a posteriori stage), they provide valuable tools for assessing the usefulness of the information content.

User studies  An evaluation of DL collections and services must take into account the characteristics of the user community. User-oriented evaluations are well known and have been applied in many studies. The evaluation of a DL can serve many purposes, including understanding phenomena such as human information seeking behavior and information handling and refinement. User-oriented evaluation involves many stakeholders, both individual end-users and librarians, as well as various groups from the community or society in general. Criteria used in the evaluation of user–system interaction include:

– Types of users and their characteristics, such as different levels of knowledge and experience.
– The information needs of the users.
– Different satisfaction factors (e.g. functionalities and task accomplishment).
– Types of information handling strategies.
– Tasks and task procedures.
– Representation of work domains and environment.
– Collaboration between users and groups of users.

User studies are the most common means to collect and assess information about the preferences of a user community. Surveying techniques are used for the collection of data and to produce statistical representations of the preferred properties of the DL content and of the ways and modes of access.

Moreover, differences between classes of users are revealed. Individual interests as well as dissimilarities between user classes constitute informative estimators of the generalized value of the DL. Some of the user studies mentioned in the 2003 CLIR report by Tenopir [50] focus on the profile generation of users who use electronic resources in a physical environment. We still need to form a clear image of those users who do not enjoy the advantages of a physical organization that could help them solve problems encountered when accessing the DL.

As mentioned earlier, user studies exploit surveying techniques, like questionnaires, surveys and online forms [51]. In the case of small samples, more direct methods are selected, such as interviews and focus groups. Direct contact between the evaluator and the participant helps the former to interpret previously collected statistical data, to explain behaviors and to check user concepts [52]. As already mentioned, another practical method of data collection is the employment of transaction logs. This makes it possible to collect and analyze all traces of a user's choices during a work session and thus create meaningful models of usage. Despite criticism that transaction log analysis on the web is confounded by cached files, this method is widely used in many evaluation projects around the digital information globe. The Super Journal project investigated the employment of transaction logs for the extraction of useful behavioral models and then developed an evaluation framework for the selection of the appropriate data for collection [53]. A coordinated effort among some DELOS partners and TEL was launched in 2006 to conduct search engine and Web server log analysis to gain insights for improving user–system usability in the context of a large DL such as TEL.4

4 See http://www.TheEuropeanLibrary.org/.
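As a concrete illustration of the kind of transaction log analysis referred to above, the following minimal sketch groups requests from a hypothetical access log into per-user sessions and counts query actions per session. The CSV layout, field names and 30-minute inactivity threshold are assumptions made for the example, not formats used by any of the projects cited.

import csv
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

def sessionize(log_path: str):
    """Group log records into sessions per user.

    The log is assumed to be a CSV with columns: user_id, timestamp (ISO 8601),
    action (e.g. 'query', 'view', 'download').
    """
    events = defaultdict(list)
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            events[row["user_id"]].append(
                (datetime.fromisoformat(row["timestamp"]), row["action"]))

    sessions = []
    for user, records in events.items():
        records.sort()  # chronological order per user
        current, last_time = [], None
        for ts, action in records:
            if last_time is not None and ts - last_time > SESSION_TIMEOUT:
                sessions.append((user, current))
                current = []
            current.append(action)
            last_time = ts
        if current:
            sessions.append((user, current))
    return sessions

if __name__ == "__main__":
    for user, actions in sessionize("dl_access_log.csv"):
        print(user, len(actions), actions.count("query"))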

Information behavior  This research field focuses on the patterns of behavior that are generated between different classes of users and different information systems. Information behavior is an interdisciplinary field that involves personnel, methods and applications from psychology and cognitive science, computer science, and information science [54]. Information behavior investigates the information seeking and retrieval process in detail. It splits this process into separate stages and analyzes them in correlation with actions, cognitive states and emotions. Interesting research issues are the motivations of usage, the affective states of the process [55], differences between certain classes of users [56], the structured or unstructured nature of accessing information, the serendipitous selection of information seeking directions [57], and the problematic states [58]. Methods for collecting data in information behavior studies are once again surveying techniques and questionnaires, as well as interviews [59] and focus groups. Finally, observational studies, including recording of user interaction and think-aloud protocols, are also used.

Content-related studies  The following issues may affect DL policy and decisions, as well as the technology infrastructure and processes (e.g. server technology, maintenance). In an interdependent environment, such as a DL, evaluation cannot study these factors in isolation; they must be measured in the context of the entire system. A clear picture of the properties of the content is an essential part of DL evaluation on the axis of usefulness. Apart from acquiring a user population profile, we also need to know what kind of content is available and how access and adoption are affected by it. The following issues represent the major research areas regarding the properties of content.

– Content nature
  – Form (text, video, images, sounds)
  – Language (language(s) used and whether the content is monolingual, bilingual, multilingual)
  – Method of creation (digitized, born digital)
  – Type (white, grey literature)

– Content structure
  – Level (primary object, metadata)
  – Size
  – Cohesion (homogeneous or heterogeneous attributes usually apply at collection level and are connected with interoperability standards and technologies)
  – Delivery mode (hypertext protocols, streaming technologies)

– Administration
  – Collection growth (acquisitions, drop-outs)
  – Rights (related to the license mode and content type, e.g. evaluating policies on thesis acquisition and provision)
  – License mode (open access, one-off purchasing, subscription-based with multiple variations such as moving walls).

It is obvious that some of these issues are easily resolved (for example size), whereas others need specific tools and applications, and may be related to legislative parameters. Transaction log analysis may reveal the rate of preference for each of the content attributes, search engine and Web server log analysis can serve to inform on the preferences of the users and to personalize services for the users themselves, while other methods, like Conspectus [60], can assess associations between collections and inform about collection growth.

4.2.3 Performance (System–Content)

Given a general framework of evaluation (or context, as in [61]), in this section we mainly concentrate on a system-centered approach. This approach is one of the most prevalent, and involves a study of performance aspects. It also includes the assessment of the efficiency of a feature, a specific design, or a technological component. With respect to the classification scheme of a DL given in [1], we focus on the System/Technology view of a DL. In particular, we concentrate on the evaluation aspects of information access and extraction.

In general, the evaluation of information systems has been carried out in controlled settings, using different criteria and measures, with the aim of providing benchmarks and improving performance. The performance evaluation of a DL is a non-trivial issue that should cover various aspects, such as architecture, storage and access capability, and management of multimedia content. Although DLs by their nature are distributed modular systems in which various components cooperate using some network capabilities, evaluation campaigns organized to evaluate DL components rarely exploit this distributed nature. In this section, we discuss important issues regarding both classic “off-line” evaluation and also some of the questions raised by the emerging areas of distributed access/computing (“on-line” evaluation).

“Off-line” evaluation  Current evaluation campaigns present the participants with one or more test-bed collections, a set of tasks to be performed, and a method by which the performance of their systems can be evaluated with respect to the collections and the tasks (a minimal scoring sketch in this spirit is given after the following list). This kind of approach is found in the most important evaluation campaigns, such as TREC,5 CLEF,6 NTCIR7 and INEX.8 However, a more mature way of evaluating performance within these campaigns would also take into consideration the following aspects:

– Collections change constantly from year to year. The number of different collections usually increases, as does the total amount of data to be processed. This constant growth can be seen in relation to the capacity of computers to store ever more bytes. However, comparing system performance across different years is almost impossible. Therefore we need methods to make performance analysis comparable over time.

– Together with the question of how to make collections stable, there is also the problem of a lack of a historical/temporal vision of system evaluation. Deciding whether or not a maximum level of performance has been achieved for some specific feature/design/component is a non-trivial question. Even if the focus is on one single aspect of a DL, it is not easy to claim that an asymptotic limit of best performance has been reached or that improvements can still be made.
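For illustration, the sketch below computes mean precision at a fixed cutoff from a ranked run file and a set of relevance judgments. The two whitespace-separated layouts assumed here (topic, document, score for the run; topic, document, relevance for the judgments) are simplifications in the spirit of such campaigns, not an exact specification of any of their formats.

from collections import defaultdict

def load_qrels(path: str):
    """Read relevance judgments: lines of 'topic_id doc_id relevance'."""
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, doc, rel = line.split()
            if int(rel) > 0:
                qrels[topic].add(doc)
    return qrels

def load_run(path: str):
    """Read a ranked run: lines of 'topic_id doc_id score'."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, doc, score = line.split()
            run[topic].append((float(score), doc))
    # Rank documents per topic by descending score.
    return {t: [d for _, d in sorted(docs, reverse=True)] for t, docs in run.items()}

def precision_at_k(run, qrels, k=10):
    """Mean precision@k over all topics present in the run."""
    scores = []
    for topic, ranking in run.items():
        relevant = qrels.get(topic, set())
        hits = sum(1 for doc in ranking[:k] if doc in relevant)
        scores.append(hits / k)
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    print(precision_at_k(load_run("system.run"), load_qrels("judgments.qrels")))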

“On-line” evaluation  In the campaigns mentioned in the previous paragraph, the search mechanisms of DLs are evaluated independently of other architectural features. In [62] aspects of how to manage the distributed nature of DLs were explored in depth. It is clear that new areas like peer-to-peer (P2P), grid computing, and service-oriented computing provide new opportunities for DLs. For example, data management by means of P2P architectures allows for loosely coupled integration of information services and sharing of information such as recommendations and annotations [63]. DLs where multimedia content is a key point (such as protein structure display in biomedical DLs) and an intensive computational load is required can use grid computing middleware. Other issues, such as describing the semantics and the usage of information services and combining services into work-flow processes for sophisticated search and maintenance of dependencies, can be provided by service-oriented computing procedures. This intrinsic feature of a DL suggests that the evaluation process should also be, in a sense, on-line, meaning that the evaluation process should exploit the network and the modularity of a DL.

5 http://trec.nist.gov/.
6 http://www.clef-campaign.org/.
7 http://research.nii.ac.jp/ntcir/.
8 http://inex.is.informatik.uni-duisburg.de/.

4.3 Practical issues

In the previous subsections we outlined the components of a DL and the interactions between them, as well as the principal directions, methods and techniques for their evaluation. The following questions refer to the practical implementation of an evaluation procedure. As in every evaluation project, the selection of the appropriate method or technique is determined by a number of often inter-dependent factors. Some of these concern:

– Cost (how much will it cost in terms of money, resources and time?)
– Effectiveness (how effective will it be / how many problems will be revealed?)
– Time (how much time is required?)
– Personnel (are skilled staff or usability experts needed?)
– Infrastructure (what kind of storage and analysis software is needed?)
– Place (can the evaluation be done remotely? can it be done in controlled environments or in contextual conditions?)
– Evaluation timing (at which stage of the development cycle will it occur?)
– Pace of evaluation (how frequently will it take place, with what aims and under which circumstances?)
– Collaborative evaluation (are test-bed systems needed for the evaluation?)

One major aspect that will affect evaluation concerns the type of experimental framework in which the evaluation will take place [64]. We analyze the factor “Place” since DLs have to serve physically remote users in diverse time periods. This requirement will have an impact on the criteria, measures, data collection methods, procedure and outcome. Generally, we apply either a laboratory-based evaluation or a real-life empirical evaluation setting. Usually, there is a dichotomous relationship between these approaches but, in fact, they could complement and support each other at a higher level of evaluation. For this reason, it would be interesting to be able to integrate statistical data and narratives in order to assess impact as well as performance and usage. One of the differences between these two approaches is the extent to which the evaluation process and outcome are controlled. In an experimental, laboratory-based setting we can control what and how we evaluate, whereas in a real-life situation there can be components of the evaluation setting that we cannot really control, and therefore different sets of data collection methods may have to be applied.

The process of collecting qualitative data through remote methods is very difficult; so far we have insufficient evidence with respect to several issues, e.g. how to recruit and stimulate user participation in remote evaluation settings. The DLF/CLIR 2002 report on usage and usability assessment indicates a number of concerns about assessment methods, for example the need to collect meaningful and purposeful data, and the development of skills to gather, analyze and present user data [65]. DLs are systems that, even when they have been developed to serve physically located user communities, still need to provide a remote and asynchronous mode of access to information. Thus we need methods that effectively evaluate them for cases of full remote access, where the user does not have any physical connection with the DL administration.

The selection of the right evaluation method is dependent on the above-mentioned factors and their combination with the type of data that will be collected. Methods that involve qualitative data collection include think-aloud protocols, observation, interviews, questionnaires and simulations, while, for quantitative data collection, methods such as transaction log analysis and error and time analysis can be used.

Digital libraries are destined to serve user communities [66]. If unused, these systems fall into oblivion and cease operation. This fundamental commitment of DLs calls for an evaluation of the role that the DL plays in serving a community and its members. The difficult task of this evaluation is to assess effectively how well this community and its members are served. There are diverse levels of assessment. In addition to the level of evaluation (individual/community), other factors like the mode of use (e.g. collaborative) and the context of use (information access, education, research) make the whole process even more complex. System developers and evaluators must implement the right evaluation process, so that it is more efficient and can generate useful system design recommendations. Therefore a prerequisite of this type of evaluation campaign should be the careful and analytical mapping of the evaluation research onto the real needs and aims. This process must be done in a systematic way in order to cover all facets and to address the main questions, and should be finalized in a set of methods, criteria and metrics that will be used in the campaign.


4.4 A classification model of evaluation

The Evaluation Computer model [67], given in Fig. 3, is an analytic mechanism for the description and classification of evaluation procedures. Based on a faceted analysis, the Evaluation Computer is a framework that can be used to create comprehensive profiles that can be comparatively listed in order to assess the depth and width of coverage of the current evaluation foci. According to this model, the evaluation process is defined as a selection of a set of points in a multidimensional space. The space consists of a set of aspects, namely evaluation, user, organizational, content and system aspects. Each aspect is divided into a set of facets. Thus the Evaluation Computer is able to systematically record the main items regarding the evaluation process (evaluation aspects), the DL components (user, content, system aspects) and the contextual conditions (organizational aspects).

The central section in Fig. 3 provides an example. The first part of the evaluation profile building process (the inner ring) concentrates on the evaluation aspect. In the case of this example, it indicates that the nature of the evaluation activity is summative and that the method selected is transaction log analysis. The second part of the profile (the next ring) is determined by user aspects. The evaluation activity here uses logs resulting from usage by the general public, and there are no geographical constraints. This worldwide user community uses the DL for educational purposes (organizational aspects). What is evaluated is shown in the fourth ring: the interaction of users with the content, and more specifically with textual objects in their full format (content aspects). As shown in the outermost ring, the users interact with the DL through a web interface, while the distribution parameter is based on the dominant “client-server” architecture (system aspects).
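The worked example above can equally be recorded as a simple faceted data structure; the field names below are one possible encoding of the rings described in the text, not a format prescribed by the Evaluation Computer itself.

# Illustrative faceted profile mirroring the example of Fig. 3; the keys and
# facet names are an assumed encoding, not a prescribed schema.
evaluation_profile = {
    "evaluation_aspects":     {"nature": "summative", "method": "transaction log analysis"},
    "user_aspects":           {"population": "general public", "geography": "unconstrained"},
    "organizational_aspects": {"purpose": "education"},
    "content_aspects":        {"object_type": "textual", "format": "full"},
    "system_aspects":         {"interface": "web", "distribution": "client-server"},
}

def describe(profile: dict) -> str:
    """Flatten a faceted profile into a one-line classification string."""
    return "; ".join(f"{aspect}: {facets}" for aspect, facets in profile.items())

if __name__ == "__main__":
    print(describe(evaluation_profile))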

Fig. 3 A possible layout of the evaluation computer

This example illustrates the ability of the Evaluation Computer to classify an evaluation activity, both at the process and at the DL component level. The resulting classification can be compared against the existing corpus of evaluation studies. This kind of analysis could help designers to better understand the reasons underlying particular decisions in previous studies, the final results and implications, and the relation between the results and the activity being undertaken.

5 DL evaluation framework

It is clear at this point that setting up a framework for the evaluation of DLs is not only a very difficult task but also quite controversial, with so many models and formalisms already available and so many contributions from related research areas.

We base our work on the framework for DL evaluation designed by Saracevic [61]; he introduced four dimensions or components (construct, context, criteria and methodology) for describing evaluation activities, which we describe in more detail in the following. The framework per se acts both as a classification scheme for existing evaluation studies and as a model for new ones. In particular, a set of guidelines can be extracted from it in order to make the overall evaluation experience not only less controversial and complex but, more importantly, easier to replicate and compare to similar ones, so that, in the future, evaluation experiments and their findings would not be obsolete individualistic exercises but would serve the entire community and allow for progress and improvements where needed. Saracevic's four dimensions have also proved to be naturally descriptive and sufficiently flexible to describe a variety of studies. This is an essential feature if the framework is to be widely adopted by the DL community. As a whole, they respond to the most crucial questions raised within the DELOS Evaluation forum:

– Why evaluate?
– What to evaluate?
– How to evaluate?

The why, what and how questions are immediate and intuitive to use and understand, but clearly have some overlaps in terms of their coverage. Furthermore, taken in isolation, they are open to individual interpretations and their complexity can grow out of control. This could make their application to model existing evaluation initiatives or design new ones cumbersome, complex and inconsistent.

Saracevic's model has instead proved its usefulness as it provides four clean-cut categories while still maintaining the rich expressivity necessary for this type of framework [68–70]. By using a more concrete terminology, specific to evaluation, it is easier to classify existing studies while keeping in mind the important questions.

5.1 The four dimensions

Construct represents what is being evaluated and how it can be used, both at a high and a low level. For instance, it could be applied in cases where the evaluation team defines its own interpretation of the DL. Similarly, it can define those aspects and/or components of the DL that are objects of the evaluation procedure. Construct answers the WHAT question at as many levels as needed.

Context is possibly the richest of all dimensions, as it accounts for everything that qualifies as motivation and framework for each evaluation study and as such covers scenarios, actors, objectives and goals, approaches and perspective in the study. It could be further developed into additional dimensions in order to provide all the necessary elements to describe the framework for each individual evaluation study. This is where the WHY question is answered.

Criteria is actually the core of the evaluation study and covers parameters, factors and measures used to assess the quality of what is evaluated and every aspect of a DL being evaluated. It responds partially to the HOW question.

Methodology, together with criteria, also answers the HOW question and provides a means to describe the procedures and protocols followed in both the gathering and the analysis of data from the evaluation experiment.

None of the four dimensions is independent of the others but, taken as a whole, they provide a self-contained environment where the various issues related to DL evaluation can be explored and described without the risk of unnecessary overlap.
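To make the four dimensions more concrete, the sketch below records a hypothetical study along each of them. The particular study, field names and values are invented for illustration and do not describe any actual evaluation.

```python
# Illustrative sketch: describing an evaluation study along Saracevic's four
# dimensions. The concrete study recorded here is hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class EvaluationStudy:
    construct: List[str]    # WHAT is being evaluated
    context: str            # WHY it is being evaluated, and in which setting
    criteria: List[str]     # against which qualities it is judged
    methodology: List[str]  # HOW data are gathered and analyzed

study = EvaluationStudy(
    construct=["search interface", "result presentation"],
    context="formative evaluation during the redesign of an academic DL",
    criteria=["task completion rate", "user satisfaction"],
    methodology=["think-aloud sessions", "post-task questionnaire"],
)
```

Describing each study in such a uniform record is one way to support the clustering of similar case studies discussed below.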

5.2 Evaluation models – Contributions to the framework

Although the three models possess significant descriptive, planning and developmental strength, they are placed under the umbrella of the Saracevic model. The Saracevic model defines the wide and narrow environmental conditions of the evaluation initiative and, as it is based on a comprehensive review of the evaluation literature, encapsulates the knowledge and experience of varied disciplines and applications in an abstracted fashion. The models are placed under its umbrella as they share similarities with its taxonomic properties and have a common mentality in their construction. Table 1 demonstrates the contribution of the three models used for the development of a conceptual DL evaluation model in the formation of a new framework.

Table 1 Contributions of models

General aim
– DELOS WG model: To classify and present an evaluation schema. It identifies the main entities in the DL lifecycle on a deterministic route.
– Interaction Triptych Framework: To identify the main DL entities, trace their relations and define evaluation axes and criteria from the user interaction point of view.
– Evaluation Computer: To formulate evaluation profiles and encourage comparisons between them. It serves as a comparison benchmark.

Why (Context)
– DELOS WG model: It defines four components describing DL operation (users, usage, data, system) and addresses system-centered evaluation aims.
– Interaction Triptych Framework: It considers the DL as an interaction of three entities (users, content, system). These entities define three evaluation axes (usefulness, usability, performance). The evaluation focuses on the improvement of user interaction.
– Evaluation Computer: An evaluation profile is considered as the application of an evaluation method to a DL contextual instance. A DL contextual instance is a tetrad of values from the domains user, content, organization and system. Each evaluation profile constitutes a complete evaluation process and its results can be compared with the results of other profiles.

What (Construct)
– DELOS WG model: It suggests the development of a testbed that aggregates the requirements for DL operation and evaluation.
– Interaction Triptych Framework: It considers evaluation as a composite process analyzed into three axes (usefulness, usability, performance).
– Evaluation Computer: It analyzes DL operation into four main components (user, content, organization, system). It combines different evaluation methods with the four components, defining particular evaluation profiles.

How (Criteria)
– DELOS WG model: As a generic model it encourages researchers to define particular criteria for each evaluation experiment.
– Interaction Triptych Framework: For each evaluation axis it provides a particular and rich set of criteria and metrics.
– Evaluation Computer: It provides a mechanism (the evaluation computer) for the generation of evaluation profiles.


Two issues constitute the difference between these models:

– Orientation: DELOS WG is system-centered, ITF is user-centered, while EC is neutral;

– Depth of the provided tools: vertical versus horizontal. Furthermore, while the first two models propose evaluation entities in the broad and narrow view of DL operation, EC provides an overview of the DL evaluation set-up.

5.3 Proposed guidelines – how to adopt the framework in the most productive way

The idea behind adopting this high-level framework is to provide designers and evaluators with a better understanding of the role and the benefits of a well designed and structured evaluation experiment. It is also a way to support evaluators in setting up effective experiments, so that their findings can be compared with similar or related studies and thus maximize their utility. The four dimensions can then be used in a variety of ways, according to the level of development of the evaluation and the needs of its designers. At a very initial stage, in preparation for the design of the evaluation study, these four aspects can be used to group similar case studies into clusters, so that the evaluation designers can obtain a better understanding of what the rest of the community is focusing on and what has been neglected.

The next stage for the evaluators is to decide on which aspects to focus, according to their needs and the project requirements. Here the four dimensions are used as the basis for a brainstorming process in which designers elaborate and adapt them to their evaluation needs. They help to structure the answers to the crucial questions: why, what and how.

– Why evaluate; i.e. determine the aims of the evaluation. In this stage strategic decisions are taken regarding the constructs, the relationships and the evaluation itself. For example, how satisfied are the users of a DL (construct: user) with the effectiveness of an existing IR system (association: user–system), and what results do the evaluators expect to be produced (e.g. a summary of system deficiencies in comparison with design aims). The Evaluation Computer could be used to formulate a profile of the evaluation process itself through the "Evaluation Aspects" ring.

– What to evaluate; this involves:
  – Determining the constructs (components, relationships)
  – Determining the type (comparative, formative, summative)
  – Determining the target DL service.
  The framework presented at the DELOS Padua Workshop [17] and shown in Fig. 2 indicated the main constructs of an evaluation procedure, as well as their associations. The interaction axes represent three main evaluation foci, which are discussed extensively in the literature. In conjunction with the Evaluation Computer, this framework can be used to define the main constructs and sort their attributes (e.g. dividing content attributes by semantic and morphological properties).

– How to evaluate; i.e. decide on the way to perform the evaluation:
  – Planning the evaluation by selecting methods, criteria, metrics, samples (e.g. humans, collections)
  – Executing the evaluation by selecting and analyzing data (main methods and alternatives)
  – Presenting the results.

At this point, the process materializes all the decisions taken in the previous stages. The Interaction Triptych model described in Sects. 4.1–4.2 presents the main evaluation criteria and metrics mapped onto the three interaction axes. This could serve as a starting point for the evaluators to find the appropriate measurements and to adapt them to the needs and requirements of their research.
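As an illustration of such a starting point, the sketch below groups a few commonly used criteria and example metrics under the three interaction axes. The particular entries are generic examples chosen for illustration, not the set of criteria defined by the Interaction Triptych model.

```python
# Illustrative sketch: candidate criteria and metrics grouped by interaction
# axis. The entries are generic examples, not the criteria prescribed by the
# Interaction Triptych model.
AXES = {
    "usefulness": {
        "relevance": "graded judgements on retrieved items",
        "task outcome": "whether the information need was satisfied",
    },
    "usability": {
        "task completion time": "seconds per task",
        "error rate": "errors per task",
        "satisfaction": "post-task questionnaire score",
    },
    "performance": {
        "retrieval effectiveness": "precision and recall on a test collection",
        "response time": "seconds per request",
    },
}

def checklist(axis: str) -> None:
    """Print the candidate measurements for one evaluation axis."""
    for criterion, metric in AXES[axis].items():
        print(f"{axis}: {criterion} -> {metric}")

checklist("usability")
```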

Once the evaluation experiment has been designed and the operational procedure is defined, the four dimensions can be used for the analysis of the data collected. This has the twofold advantage of providing direction to the evaluation and providing results comparable to those from other studies inside the same framework.

It is worth noting that any tool of this level of expressiveness and effectiveness is valuable to a community as long as it is easy to use and flexible to adapt. The proposed framework has both qualities. The verticality of the four dimensions permits the incorporation of various evaluation frameworks, the analysis of the current conditions and the proposal of solutions, as has been illustrated in the case of the Evaluation Computer and Interaction Triptych frameworks.

Overall, the four dimensions can be used in a number of practical and effective ways during the design, implementation and final analysis of any evaluation study. DLs can be complex systems and their evaluation is a challenging exercise. Any tool that could help designers in this task should be carefully considered and adopted by the DL community if it fulfils their needs. In our opinion, the simple framework proposed above could have a positive impact in a variety of ways by assisting and supporting designers of DL evaluation studies at different stages.

6 Recommendations

A number of recommendations emerge from the multi-faceted analysis presented in this paper and should be taken into consideration when setting up a DL evaluation activity:

Flexible evaluation frameworks For complex entities such as DLs, the evaluation framework should be flexible, allowing for multi-level evaluations (e.g. by following the six levels proposed by Saracevic [61], including the user and social levels). Furthermore, any evaluation framework should undergo a period of discussion, revision and validation by the DL community before being widely adopted. Flexibility helps researchers to avoid studies that become obsolete because they rest on rigid frameworks, and instead to use models that can "expand" or "collapse" according to the requirements and conditions of their project.

Involvement of practitioners and real users Practitioners have a wealth of experience and domain-related knowledge that is often neglected. Better communication and definition of common terminology, aims and objectives could establish a framework of cooperation and boost this research area.

Build on past experiences of large evaluation initiatives Evaluation initiatives, such as TREC, CLEF, INEX, and NTCIR, have collected a wealth of knowledge about evaluation methodology.

In order to foster evaluation research in general, the following issues should be addressed:

Community building in evaluation research The lack of globally accepted abstract evaluation models and methodologies can be counterbalanced by collecting, publishing and analyzing current research activities. Maintaining an updated inventory of evaluation activities and their interrelations can help to define good practice in the field and to help the research community reach a consensus.

Establishment of primary data repositories The provision of open access to primary evaluation data (e.g. transaction logs, surveys, monitored events), as is common in other research fields, should be a goal. In this respect, methods to render the primary data anonymous must be adopted, as privacy is a strong concern. Common repositories and infrastructures for storing primary and pre-processed data are proposed, along with the collaborative formation of evaluation best practices and modular building blocks to be used in evaluation activities.
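One simple way to render transaction logs anonymous before depositing them is to replace user identifiers with salted hashes, so that sessions remain linkable while users are no longer identifiable. The sketch below assumes a line-oriented log whose first field is a user identifier; the format and the salt handling are illustrative assumptions.

```python
# Minimal sketch: pseudonymizing user identifiers in a line-oriented log
# before sharing it. The log layout (user id as first field) and the salt
# handling are assumptions for illustration only.
import hashlib
import secrets

SALT = secrets.token_hex(16)  # keep secret; discard after processing

def pseudonymize(line: str) -> str:
    user_id, rest = line.split(" ", 1)
    digest = hashlib.sha256((SALT + user_id).encode("utf-8")).hexdigest()[:12]
    return f"{digest} {rest}"

raw = [
    "alice 2007-03-01T10:00:05 search dublin+core",
    "bob 2007-03-01T10:00:41 view_record rec_4711",
]
for line in raw:
    # Note: free-text fields (e.g. queries) may still leak personal data
    # and need separate treatment before the log is shared.
    print(pseudonymize(line))
```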

Standardized logging format Further use and dissemination of common logging standards is also considered useful [71]. Logging could be extended to include user behavior and system-internal activity as well, in order to support the personalization and intelligent user interface design processes.
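The sketch below indicates what such an extended, structured log record might look like, covering both user-level and system-level events. The field names are our assumptions for illustration and are not taken from the logging scheme of [71].

```python
# Illustrative sketch: a structured log record covering user behaviour and
# system-internal activity. Field names are assumptions, not the schema of [71].
import json
from datetime import datetime, timezone

def log_event(session: str, actor: str, event: str, **details) -> str:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session": session,
        "actor": actor,    # "user" or "system"
        "event": event,    # e.g. "query_issued", "index_lookup"
        "details": details,
    }
    return json.dumps(record)

print(log_event("s1", "user", "query_issued", query="digital library evaluation"))
print(log_event("s1", "system", "results_returned", hits=42, latency_ms=180))
```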

Evaluation of user behavior in-the-large Currently, evaluation is focused too much on user interface and system issues. How far the users' information needs have been satisfied (i.e. information access) must be investigated, independently of the methods used to fulfil these needs. The determination of user strategies and tactics is also recommended (such as search strategies and browsing behaviors). This relates to evaluation in context, and to the question of identifying dependencies in various contexts (e.g. sociological, business, institutional). Collecting user behavior as implicit rating information can also be used to establish collaborative filtering services in DL environments.
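As a hint of how implicit ratings could feed such a collaborative filtering service, the sketch below derives item-to-item recommendations from per-user view or download events. The event data and the simple co-occurrence measure are illustrative assumptions, not a description of any deployed DL service.

```python
# Illustrative sketch: item-to-item recommendations from implicit feedback
# (views/downloads). The data and the co-occurrence measure are assumptions.
from collections import defaultdict
from itertools import combinations

# user -> set of items the user viewed or downloaded
implicit = {
    "u1": {"doc_a", "doc_b", "doc_c"},
    "u2": {"doc_a", "doc_c"},
    "u3": {"doc_b", "doc_d"},
}

cooccurrence = defaultdict(int)
for items in implicit.values():
    for a, b in combinations(sorted(items), 2):
        cooccurrence[(a, b)] += 1
        cooccurrence[(b, a)] += 1

def recommend(item: str, top_n: int = 3):
    """Items most often co-viewed with the given item."""
    scores = {b: c for (a, b), c in cooccurrence.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("doc_a"))  # ['doc_c', 'doc_b']
```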

Differentia specifica of the domain of evaluation An important problem is how to relate a possible model of DLs to other overlapping models in other areas. How does a DL relate to other complex networked information systems (e.g. archives, portals, knowledge bases, etc.) and their models? Is it possible to connect or integrate DL models with the multitude of related existing models? The answer to this question should also help to define the independent research area of DL evaluation.

7 Conclusions

In this paper we have described the state of the art in the DL evaluation research area. In parallel, we have attempted to establish a framework for evaluating DL systems in a holistic way. Our objective is to formulate a framework that can accommodate as many aspects as can be found in the various levels of the evaluation procedure. The multitude of evaluation foci and the variety of perspectives have been a concern during the formation of the framework. Finally, we have provided recommendations concerning the actions needed and the strategies to be followed in DL evaluation. It is expected that the unification of the multi-form and varied DL evaluation activities within a common framework will need considerable time, as DL research and development is still at an early stage and wide-scale awareness, acceptance and employment of DL systems is only just beginning. However, evaluation is and will remain a crucial aspect in the evolution and acceptance of these systems.

Acknowledgements The work has been supported by the DELOS Network of Excellence on DLs, as part of the Information Society Technologies (IST) Program of the European Commission (Contract G038-507618).

Evaluation of Digital Libraries: Used Terminology

Assessment: The process of measuring, quantifying, and/or describing the attributes of a system covered by the evaluation.

Evaluation: The systematic process of determining the merit, value, and worth of something. Evaluation is broader than assessment and involves making a judgement as to the effectiveness of an assessment. Evaluation has a certain goal, methodology and devices or techniques. The goal suggests some evaluation criteria (e.g. performance), which sometimes break down into evaluation parameters (e.g. response time, availability). At times these parameters are qualitative, at times quantitative. If quantitative, a metric can be defined. In the case of metrics the values are of course comparable over different evaluations. A successful evaluation should be transferable (apply to other contexts) and confirmable (verifiable).

Evaluation, comparative: An evaluation activity that denotes the degree or grade by which a system or component has a property or quality greater or less than that of another.

Evaluation, formative: An evaluation activity carried out in parallel with the development phases of a DL system. As a part of development, a formative evaluation aims to minimize system imperfections before release.

Evaluation, summative: An evaluation activity carried out at the final stages of the development cycle or after an initial release in order to measure the performance in real operating conditions.

Evaluation device: Instrument or application that is used as part of an evaluation process.

Evaluation framework: A defined structure of methods and supporting devices designed to support the evaluation needs of a community.

Evaluation goal: The desired knowledge about a system, its performance or usability.

Evaluation method: A series of steps or actions taken to accomplish an evaluation.

Evaluation model: An abstract, general or theoretical representation of an evaluation approach.

Evaluation parameter: A measurement or value on which something else in the evaluation process depends.

Evaluator: The person, group or organization that performs the evaluation.

Interactivity: The ability of complex systems to detect and react to human behavior.

Metric: A unit of measure for some property of a system.

Precision: The fraction of relevant documents in the set of all documents returned by a search.

Quality: The desirability of properties or characteristics of a system or process. Quality is contextual.

Qualitative: A type of empirical knowledge. Qualitative data are unbiased and/or relative.

Quantitative: Quantitative properties can be meaningfully measured using numbers; properties which are not quantitative are called qualitative.

Recall: The fraction of relevant material that is returned by a search (a short computational sketch of precision and recall follows this glossary).

Response time: The time a system or functional unit takes to react to a given input.

Test bed: A collection of resources that can be used for evaluation purposes.

Use case: A technique for capturing the potential requirements of a system or software. Each use case provides one or more scenarios that convey how the system should interact with the end user or another system to achieve a specific goal.

User situation: A state of the user in which an information need emerges. A user situation is formed by the contextual conditions of the user, both in the environment and the domain.

User task: The mental and physical activity of the user to achieve a goal.

Validation: Testing to ensure the system (or component) conforms to requirements.

Verification: Testing to ensure proper functioning of the system (or component) in technical terms.
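The following minimal sketch computes the precision and recall defined above from a set of retrieved and a set of relevant document identifiers; the identifiers are placeholders used only for illustration.

```python
# Minimal sketch: precision and recall over sets of document identifiers.
def precision_recall(retrieved: set, relevant: set) -> tuple:
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Placeholder identifiers for illustration.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=0.67
```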

References

1. Fuhr, N., Hansen, P., Mabe, M., Micsik, A., Solvberg, I.: Digital Libraries: A Generic Classification and Evaluation Scheme. In: Lecture Notes in Computer Science, vol. 2163, pp. 187–199. Springer, Berlin (2001). Research and Advanced Technology for Digital Libraries: 5th European Conference, ECDL 2001, Darmstadt, Germany, September 4–9, 2001. Proceedings


2. DELOS: DELOS Workshop on Evaluation of Digital Libraries: Testbeds, Measurements, and Metrics. Technical report (2002). http://www.sztaki.hu/conferences/deval

3. Agosti, M., Fuhr, N. (eds.): Notes of the DELOS WP7 Workshop on the Evaluation of Digital Libraries, http://dlib.ionio.gr/wp7/workshop2004_program.html (2004)

4. Cleverdon, C.: The Cranfield tests on index language devices. In: Sparck-Jones, K., Willett, P. (eds.) Readings in Information Retrieval, pp. 47–59. Morgan Kaufmann, San Francisco (1997)

5. Turpin, A., Hersh, W.: Why batch and user evaluations do not give the same results. In: Proceedings of SIGIR 2001, pp. 225–231. ACM, New York (2001)

6. CCSDS: A reference model for an open archival information system. Document Number: ISO 14721:2003 (2003). http://ssdoo.gsfc.nasa.gov/nost/wwwclassic/documents/pdf/CCSDS-650.0-B-1.pdf

7. IFLANET: Functional requirements for bibliographic records (1998). http://www.ifla.org/VII/s13/frbr/frbr.pdf

8. Levy, D., Marshall, C.: Going digital: a look at assumptions underlying digital libraries. Commun. ACM 38, 77–84 (1995)

9. Borgman, C.: What are digital libraries? Competing visions. Inf. Process. Manage. 35, 227–243 (1999)

10. DELOS: Reference models for digital libraries: actors and roles – final report. Technical report (2003). http://www.dli2.nsf.gov/internationalprojects/working_group_reports/actors_final_report.html

11. Agosti, M., Candela, L., Castelli, D., Ferro, N., Ioannidis, Y., Koutrika, G., Meghini, C., Pagano, P., Ross, S., Schek, H.J., Schuldt, H.: A reference model for DLMSs interim report. In: Candela, L., Castelli, D. (eds.) Deliverable D1.4.2 – Reference Model for Digital Library Management Systems [Draft 1], DELOS, A Network of Excellence on Digital Libraries – IST-2002-2.3.1.12, Technology-enhanced Learning and Access to Cultural Heritage. http://146.48.87.122:8003/OLP/Repository/1.0/Disseminate/delos/2006_WP1_D142/content/pdf?version=1 [last visited 2006, October 2] (2006)

12. Goncalves, M.A., Fox, E., Kipp, N., Watson, L.: Streams, structures, spaces, scenarios, societies (5S): a formal model for digital libraries. ACM Trans. Inf. Syst. 22, 270–312 (2004)

13. Baker, S., Lancaster, F.: The measurement and evaluation of library services. Information Resources Press, Arlington (1991)

14. Nicholson, S.: A conceptual framework for the holistic measurement and cumulative evaluation of library services. J. Doc. 60, 164–182 (2004)

15. Kyrillidou, M., Lorie, M., Hobbs, B., Choudhury, G., Webster, D., Flores, N., Heath, F.: Emerging tools for evaluating DL services: conceptual adaptations of LIBQUAL+ and CAPM. J. Digit. Inf. 4 (2003). http://jodi.ecs.soton.ac.uk/Articles/v04/i02/Heath

16. Kyrillidou, M., Giersch, S.: Developing the DigiQUAL protocol for digital library evaluation. In: JCDL '05: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY, USA, pp. 172–173. ACM, New York (2005)

17. Tsakonas, G., Kapidakis, S., Papatheodorou, C.: Evaluation of user interaction in digital libraries. In: Agosti, M., Fuhr, N. (eds.) Notes of the DELOS WP7 Workshop on the Evaluation of Digital Libraries, Padua, Italy (2004). http://dlib.ionio.gr/wp7/workshop2004_program.html

18. Tsakonas, G., Papatheodorou, C.: Analyzing and evaluating usefulness and usability in electronic information services. J. Inf. Sci. 32(5), 400–419 (2006)

19. Marchionini, G.: Evaluating digital libraries: a longitudinal and multifaceted view. Library Trends 49 (2000)

20. DELOS: A Network of Excellence on Digital Libraries (2004). Technical Annex 1, p. 6 – DELOS Network of Excellence on Digital Libraries, as part of the Information Society Technologies (IST) Program of the European Commission (Contract G038-507618)

21. ISO: ISO 9241-11. Ergonomic Requirements for Office Work With Visual Display Terminals (VDT). Part 11: Guidance on Usability. International Standards Organization, London (1997)

22. DeGroot, S.P., Knapp, A.E.: Applying the user-centered design (UCD) process to the development of a large bibliographic navigation tool: a partnership between librarian, researcher and developer. Scopus White Paper Series (2004). http://www.elsevier.com/librarians

23. Blandford, A., Buchanan, G., Jones, M.: Usability of digital libraries. Int. J. Digit. Libr. 4, 69–70 (2004)

24. Dickstein, R., Mills, V.: Usability testing at the University of Arizona Library: how to let the users in on the design. Inf. Technol. Libr. 19 (2000)

25. Hilbert, D.M., Redmiles, D.F.: Extracting usability information from user interface events. ACM Comput. Surv. 32, 384–421 (2000)

26. Ivory, M., Hearst, M.: The state of the art in automating usability evaluation of user interfaces. ACM Comput. Surv. 33, 470–516 (2001)

27. Ke, H.R., Kwakkelaar, R., Tai, Y.M., Chen, L.C.: Exploring behavior of E-journal users in science and technology: transaction log analysis of Elsevier's ScienceDirect OnSite in Taiwan. Libr. Inf. Sci. Res. 24, 265–291 (2002)

28. Sfakakis, M., Kapidakis, S.: User behavior tendencies on data collections in a digital library. In: Lecture Notes in Computer Science, vol. 2458, pp. 550–559. Springer, Berlin (2002). Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL 2002, Rome, Italy, September 16–18, 2002. Proceedings

29. Jones, S., Cunningham, S.J., McNab, R., Boddie, S.: A transaction log analysis of a digital library. Int. J. Digit. Libr. 3, 152–169 (2000)

30. Sutcliffe, A.G., Ennis, M., Hu, J.: Evaluating the effectiveness of visual user interfaces for information retrieval. Int. J. Human-Comput. Stud. 53, 741–763 (2000)

31. Kengeri, R., Fox, E.A., Reddy, H.P., Harley, H.D., Seals, C.D.: Usability study of digital libraries: ACM, IEEE-CS, NCSTRL, NDLTD. Int. J. Digit. Libr. 2, 157–169 (1999)

32. Clark, J.A.: A usability study of the Belgian American Research Collection: measuring the functionality of a digital library. OCLC Syst. Serv. Int. Digit. Libr. Perspect. 20, 115–127 (2004)

33. Rousseau, G.K., Rogers, W., Mead, S.E., Sit, R.A., Rousseau, J.B.A.: Assessing the usability of on-line library systems. Behav. Inf. Technol. 17, 274–281 (1998)

34. Thong, J.Y.L., Hong, W., Tam, K.: Understanding user acceptance of digital libraries: what are the roles of interface characteristics, organizational context, and individual differences? Int. J. Human-Comput. Stud. 57, 215–242 (2002)

35. Hartson, R.H., Shivakumar, P., Perez-Quinones, M.A.: Usability inspection of digital libraries: a case study. Int. J. Digit. Libr. 4, 108–123 (2004)

36. Glosiene, A., Manzuch, Z.: Usability of ICT-based Systems: State-of-the-art Review. Technical Report, CALIMERA Deliverable 9 (2004)

37. Dicks, R.S.: Mis-usability: on the uses and misuses of usability testing. In: Proceedings of the 20th Annual International Conference on Computer Documentation, New York, pp. 26–30. ACM, New York (2002)


38. Doubleday, A., Springett, M.V., Sutcliffe, A.G., Ryan, M.: A comparison of usability techniques for evaluating design. In: Proceedings of the Conference on "Designing Interactive Systems: Processes, Practices, Methods and Techniques", New York, pp. 101–110. ACM, New York (1997)

39. Hill, L.L., Carver, L., Smith, T.R., Frew, J., Larsgaard, M., Dolin, R., Rae, M.A.: Alexandria digital library: user evaluation studies and system design. J. Am. Soc. Inf. Sci. 51, 246–259 (2000)

40. Bishop, A.P., Merkel, C., Neumann, L.J., Star, S.L., Sandusky, R.J., Ignacio, E.: Digital libraries: situating use in changing information infrastructure. J. Am. Soc. Inf. Sci. 51, 394–413 (2002)

41. Fox, E.A., Heath, L.S., Rao, D., Brueni, D.J., Nowell, L.T., Wake, W.C., Hix, D.: Users, user interfaces, and objects: Envision, a digital library. J. Am. Soc. Inf. Sci. 44, 480–491 (1993)

42. Buttenfield, B.: Digital libraries: philosophies, technical design considerations, and example scenarios. In: Usability Evaluation of Digital Libraries, pp. 39–59. Haworth Press, New York (1999)

43. Blandford, A., Stelmaszewska, H., Bryan-Kinns, N.: Use of multiple digital libraries: a case study. In: Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 179–188. ACM Press (2001)

44. Castillo, J., Hartson, H.R., Hix, D.: Remote usability evaluation: can users report their own critical incidents? In: CHI 1998 Conference Summary on Human Factors in Computing Systems, New York, pp. 253–254. ACM, New York (1998)

45. Bates, M.: Where should the person stop and the information search interface start? Inf. Process. Manage. 26, 575–591 (1990)

46. Cousins, S.B.: A task-oriented interface to a DL. In: Tauber, M.J. (ed.) Conference Companion on Human Factors in Computing Systems: Common Ground, New York, pp. 103–104. ACM, New York (1996)

47. Pejtersen, A., Fidel, R.: A framework for work centered evaluation and design: a case study of IR on the web. Technical report (1998)

48. Rasmussen, J., Pejtersen, A., Goodstein, L.: Cognitive Systems Engineering. Wiley, New York (1994)

49. Borgman, C.: From Gutenberg to the global information infrastructure: access to information in the networked world. MIT Press, Cambridge (2000)

50. Tenopir, C.: Use and users of electronic library resources: an overview and analysis of recent research studies. Council on Library and Information Resources, Washington (2003)

51. Banwell, L., Ray, K., Coulson, G., Urquhart, C., Lonsdale, R., Armstrong, C., Thomas, R., Spink, S., Yeoman, A., Fenton, R., Rowley, J.: The JISC user behaviour monitoring and evaluation framework. J. Document. 60, 302–320 (2004)

52. Payette, S.D., Rieger, O.Y.: Supporting scholarly inquiry: incorporating users in the design of the digital library. J. Acad. Libr. 24, 121–129 (1998)

53. Yu, L., Apps, A.: Studying E-journal user behavior using log files: the experience of SuperJournal. Libr. Inf. Sci. Res. 22, 311–338 (2000)

54. Wilson, T.D.: Information behaviour: an interdisciplinary perspective. Inf. Process. Manage. 33, 551–572 (1997)

55. Kuhlthau, C.C.: Inside the search process: information seeking from the user's perspective. J. Am. Soc. Inf. Sci. 42, 361–371 (1991)

56. Ellis, D., Haugan, M.: Modeling the information seeking patterns of engineers and research scientists in an industrial environment. J. Document. 53, 384–403 (1997)

57. Foster, A., Ford, N.: Serendipity and information seeking: an empirical study. J. Document. 59, 321–340 (2003)

58. Belkin, N.J., Oddy, R.N., Brooks, H.M.: ASK for information retrieval: Part I. Background and theory. J. Document. 38, 61–71 (1982)

59. Wilson, T.D.: Exploring models of information behaviour: the uncertainty project. Inf. Process. Manage. 35, 839–849 (1999)

60. Wood, R.: The Conspectus: a collection analysis and development success. Libr. Acquis. Pract. Theory 20, 429–453 (1996)

61. Saracevic, T.: Evaluation of digital libraries: an overview. In: Agosti, M., Fuhr, N. (eds.) Notes of the DELOS WP7 Workshop on the Evaluation of Digital Libraries, Padua, Italy (2004). http://dlib.ionio.gr/wp7/workshop2004_program.html

62. Agosti, M., Ferro, N., Frommholz, I., Thiel, U.: Annotations in digital libraries and collaboratories: facets, models and usage. In: Lecture Notes in Computer Science, vol. 3232, pp. 244–255. Springer, Berlin (2004). Research and Advanced Technology for Digital Libraries: 8th European Conference, ECDL 2004. Proceedings

63. Agosti, M., Albrechtsen, H., Ferro, N., Frommholz, I., Hansen, P., Orio, N., Panizzi, E., Pejtersen, A.M., Thiel, U.: DiLAS: a digital library annotation service. In: International Workshop on Annotation for Collaboration. Methods, Tools, and Practices, pp. 91–101 (2005)

64. Hansen, P., Jarvelin, K.: The information seeking and retrieval process at the Swedish Patent and Registration Office: moving from lab-based to real life work task environment. In: Agosti, M., Fuhr, N. (eds.) Proceedings of the ACM-SIGIR 2000 Workshop on Patent Retrieval, Athens, Greece, pp. 43–53 (2000). http://www.sics.se/∼preben/papers/SIGIR20000-WS.pdf

65. Covey, D.: Usage and usability assessment: library practices and concerns, vol. 105. CLIR (2002). http://www.clir.org/pubs/reports/pub105/pub105.pdf

66. Borgman, C.L.: Social Aspects of Digital Libraries – Final Report to the NSF. Technical report (1996). http://is.gseis.ucla.edu/research/dl/UCLA_DL_Report.html

67. Kovács, L., Micsik, A.: The evaluation computer: a model for structuring evaluation activities. In: Agosti, M., Fuhr, N. (eds.) DELOS Workshop on the Evaluation of Digital Libraries, Padova, Italy (2004). http://dlib.ionio.gr/wp7/workshop2004_program.html

68. Banwell, L., Coulson, G.: Users and user study methodology: the JUBILEE project. Inf. Res. 9, 47–56 (2004). http://informationr.net/ir/9-2/paper167.html

69. Jeng, J.: What is usability in the context of the digital library and how can it be measured? Inf. Technol. Libr. 24, 47–56 (2005)

70. Xie, H.: Evaluation of digital libraries: criteria and problems from users' perspectives. Libr. Inf. Sci. Res. 28, 433–452 (2006)

71. Klas, C.-P., Albrechtsen, H., Fuhr, N., Hansen, P., Kapidakis, S., Kovács, L., Kriewel, S., Micsik, A., Papatheodorou, C., Tsakonas, G., Jacob, E.: A logging scheme for comparative digital library evaluation. In: Lecture Notes in Computer Science, vol. 4172, pp. 267–278. Springer, Berlin (2006). Research and Advanced Technology for Digital Libraries: 10th European Conference, ECDL 2006, Alicante, Spain, September 17–22, 2006. Proceedings