Evaluating Interactive Systems in TREC

Micheline Beaulieu* and Stephen Robertson
Centre for Interactive Systems Research, Department of Information Science, City University, Northampton Square, London EC1V 0HB, United Kingdom. E-mail: [email protected]; E-mail: ser@is.city.ac.uk

Edie Rasmussen
School of Library and Information Science, University of Pittsburgh, Pittsburgh, PA 15260. E-mail: [email protected]

* To whom all correspondence should be addressed.

© 1996 John Wiley & Sons, Inc.

The TREC (Text REtrieval Conference) experiments were designed to allow large-scale laboratory testing of information retrieval techniques. As the experiments have progressed, groups within TREC have become increasingly interested in finding ways to allow user interaction without invalidating the experimental design. The development of an “interactive track” within TREC to accommodate user interaction has required some modifications in the way the retrieval task is designed. In particular there is a need to simulate a realistic interactive searching task within a laboratory environment. Through successive interactive studies in TREC, the Okapi team at City University London has identified methodological issues relevant to this process. A diagnostic experiment was conducted as a follow-up to TREC searches which attempted to isolate the human and automatic contributions to query formulation and retrieval performance.

1. Introduction

In information retrieval (IR) research, the need for rigorous evaluation was recognized very early, and evaluation has continued to have a significant impact on the direction of research in the field. The methods adopted in the first Cranfield experiments, and the subsequent intensive debate over their validity, led to the development of a set of accepted techniques which could be implemented in a so-called laboratory environment involving a standard document collection, set of queries, and relevance judgments. These evaluation methods were set forth in Information Retrieval Experiment (Sparck Jones, 1981) and more recently discussed and updated in a special issue of Information Processing & Management (Harman, 1992).

While they are well established, laboratory techniques have been criticized as unrealistic in at least two areas: their limitations in size and scale when compared to operational systems, and their failure to take into account the human contribution to the retrieval process. The first criticism, that of size and scale, has recently been addressed in the TREC (Text REtrieval Conference) series of experiments co-ordinated by the National Institute of Standards and Technology (NIST), in which research groups around the world are encouraged to test their IR systems on test collections containing several gigabytes of data, using realistic “topics” and collection-wide relevance judgments (Harman, 1995b). Though it was not a goal of TREC as originally conceived, these experiments have also begun to address the second criticism, the exclusion of user interaction from the evaluation of IR, by introducing an “interactive track” at the initiative of several of the TREC participants. In this track, participants will concentrate their effort on the interactive retrieval process within the general framework of TREC and follow guidelines which will try to ensure the general comparability of results across systems.

IR interaction has been defined as “the interactive communication processes that occur during the retrieval of information by involving all the major participants in IR, i.e., the user, the intermediary, and the IR system” (Ingwersen, 1992, p. xviii). In the present study, the emphasis is on interaction between the searcher or intermediary and the system, and the contribution that the searcher makes to the formulation of a successful search. The aim of this article is to examine how interaction may be integrated into laboratory IR evaluation and to explore the status of interactive evaluation in the TREC experiments. As an example of laboratory evaluation to examine interaction, a diagnostic experiment is described which explores contributions made through searcher intervention for document selection and term selection during the search process. The experiment is conducted using the Okapi system, TREC topics and documents, and expert searchers.


1.1 A Note on TREC

The first Text REtrieval Conference, co-sponsored by the Advanced Research Projects Agency (ARPA) and NIST, was held in 1992. The aim was to bring together IR researchers from industry and academia to conduct retrieval tests on a new large heterogeneous test collection (including the Wall Street Journal, Associated Press newswires, the San Jose Mercury News, ZIFF computer articles, Federal Register reports, U.S. Patents, and Department of Energy abstracts). Three annual rounds have since been completed and a fourth is currently underway (Harman, 1993, 1994, 1995a). The focus has been on two tasks simulating basic retrieval situations: A “routing” task and an “ad hoc” task. In the routing task, existing questions are used to search new data, as with a user profile for a selective dissemination of information (SDI) service. In the ad hoc task, new questions are put to existing data in the same way that a user would approach an online system with a query. Thus in TREC routing is represented by known topics and known relevant documents for those topics. Participants create queries from the given topics and test them against training data in order to generate an optimal query, which can then be tested against new data. In the ad hoc task, queries are generated from new topics and existing data is searched without known relevant documents.

In addition to the task, three different methods of query construction, or ways of generating queries for the topics, are specified: Automatic (queries resulting from fully automatic processing of the topics), manual (queries whose generation involves some human intervention, but without interaction with the data), and interactive (queries developed by a human searcher whilst interacting with the data).

Some 200 test topics have been created by information analysts, who also provide the relevance assessments for the retrieved documents. These are judged from a pool of the top 200 items retrieved by the different systems. Comparative performance measures based on recall/precision are then calculated for each system as well as for each topic.
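As an illustration of the pooling procedure described above, the sketch below merges the top-ranked items from each system's submission for a topic into a single pool for assessment. It is a minimal reconstruction for illustration only; the run data and function names are hypothetical, and the depth of 200 simply follows the figure given in the text.

```python
# Minimal sketch of TREC-style pooling: the union of the top-ranked documents
# from each participating system forms the set that assessors judge.
from typing import Dict, List, Set


def build_assessment_pool(runs: Dict[str, List[str]], depth: int = 200) -> Set[str]:
    """Union of the top `depth` document IDs from each system's ranked run."""
    pool: Set[str] = set()
    for system, ranked_docs in runs.items():
        pool.update(ranked_docs[:depth])
    return pool


# Hypothetical usage: two systems submitting ranked lists for one topic.
runs = {
    "systemA": ["doc12", "doc7", "doc33", "doc5"],
    "systemB": ["doc7", "doc90", "doc12", "doc41"],
}
print(sorted(build_assessment_pool(runs, depth=3)))
# -> ['doc12', 'doc33', 'doc7', 'doc90']
```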

To date, the great variations in system parameters have led to inconclusive results with regard to identifying which methods give the best overall system performance. However, many questions have been raised concerning the contribution of individual elements and approaches, e.g., linguistic versus statistical retrieval models, single-term versus compound-term indexing, selective versus massive query expansion (Sparck Jones, 1995). Although it may be premature to expect an immediate impact on operational systems, TREC is playing a very important role in providing an evaluation reference paradigm for IR research.

1.2 A Note on Okapi

Okapi is a text retrieval system which was developed as an evaluation facility for advanced retrieval methods in the Centre for Interactive Systems Research at City University London. The system is based on a term weighting probabilistic retrieval model which provides ranked output and uses relevance feedback for query expansion. The design philosophy has been to create a system which is self-explanatory and appropriate for the untrained end-user.

Unlike IR experiments based on a test collection, the evaluation approach has been to design a system, or a series of systems, which can be tested not only under laboratory conditions but also in different operational settings, with real users with real information needs searching different databases. Different experiments have been undertaken to assess the retrieval effectiveness of stemming, automatic and interactive query expansion, as well as thesaurus navigation for query formulation (Hancock-Beaulieu, Fieldhouse, & Do, 1995; Hancock-Beaulieu & Walker, 1992). Evaluation has focused on diagnostic analysis for system as well as user performance rather than on the standard precision and recall measures.

The Okapi team has also taken part in the main (non-interactive) TREC experiments, and has used the experience to improve its internal methods for term selection and weighting (Robertson, Walker, & Hancock-Beaulieu, 1995a).

2. Interaction and Evaluation in IR Systems

The emphasis on laboratory environments for IR experimentation is compatible with a system-centered approach in which the elimination of the human searcher from the experiment allows the control of variables and ensures that the effects found in the research are due to variations in system parameters. Of course, there is an inherent conflict between the need to control variables to measure system performance and the need to reflect a real world situation, where humans are involved in information retrieval. It is not only a question of whether or not the user is included within the system’s boundaries for evaluation purposes, but more importantly of ascertaining what the human element will contribute to the outcome.

The desire for controlled laboratory conditions to ensure the validity of evaluation made it difficult to study factors related to users and user behavior using these methods. In studies involving users, other techniques have been used which owe more to research methods in the social sciences, such as case studies, transaction logs, observation, interviews, and questionnaires. Qualitative, rather than quantitative, methods have been seen as appropriate for the study of IR as a process (Fidel, 1993). While these techniques have proven very useful in giving insight into user behavior, they are limited in the information they can provide about very specific system issues as they relate to users. The problem here is how to realistically allow interaction in experiments in IR systems in such a way as to isolate and evaluate the contribution that interaction is making. The challenge is twofold: To explore how to evaluate interactive systems and how to make interactive and non-interactive approaches to evaluation more compatible.

2.1 Interaction in Search Formulation

Outside the IR laboratory, of course, the searcher plays a prominent role in the process of information retrieval, and yet surprisingly little is known about how he or she interacts with IR systems. What is known is that the search process is complex; searchers have multiple options at each stage in the search process, based on their sources of search terms, the number and type of search terms employed, and the ways they combine moves in the search process (Fidel, 1991). It has been the nature of IR research that laboratory tests have more often been conducted on systems based on models developed as alternatives to the Boolean model, such as ranked output. Tests designed to study user behavior have more often been carried out on operational systems which have usually been Boolean because of the predominance of the Boolean model in commercial IR systems. An exception has been the work of the Centre for Interactive Systems Research at City University based on the Okapi system.

In an IR system based on ranked output, the searcher may participate at several stages in the process of search formulation: By providing a natural language description of the search, by contributing a list of key words or concepts, by deciding whether displayed documents are relevant or not, and by choosing terms to add to or delete from the query, either through examination of retrieved documents or by selecting from a system-provided list. However, in laboratory evaluation of these systems, it is possible to fully automate the search process, starting from automatic query generation based on a natural language statement of information need, through relevance feedback (using known relevance judgments) and query expansion (again using known relevance judgments). This allows the researcher to focus the experiment on system parameters.

While this is experimentally valid, it does not take into account the contribution of the searcher. Certainly there are reasons relating to human factors for involving the searcher in the process. Bates (1990) has commented on the desire of the searcher to maintain control while searching. Searchers are not always content with a “black box” approach to online retrieval in which they neither understand nor control the actions taken by the system to generate or improve a search strategy. A central question then becomes: Does the searcher’s intervention improve or degrade performance? Or perhaps: Can the system be designed so that the searcher’s intervention improves performance? As discussed in the next section, early TREC results have not been encouraging.

2.2 Query Expansion

Query expansion is a process whereby a query is augmented with additional search terms. The additional terms may be added automatically by the system, or manually by the searcher, or selected by the searcher on the prompting of the system. The terms may be identified through relevance feedback, term clustering, use of term co-occurrences, substitution of thesaurus classes, or any other technique which seems to suggest a relationship between terms in the original query and new terms. (The process of query expansion may also involve the complementary activity, removal of terms from the query.) Query expansion is intuitively appealing, since it should improve recall if the original query fails to incorporate terms which authors use to discuss a concept. However, early laboratory experiments were not encouraging, showing little improvement in performance when terms were automatically added to a query (Ekmekcioglu, Robertson, & Willett, 1992; Peat & Willett, 1991). The role of the user in selecting and ranking terms and term sources for query expansion has been examined in more recent studies (Efthimiadis, 1993; Spink, 1995). The diagnostic experiment reported in Section 5 compares the effects on performance of system-based and user-based decisions in selecting terms for query expansion.
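To make the mechanics concrete, the sketch below shows one simple form of relevance-feedback query expansion: candidate terms are collected from documents judged relevant, ranked by a naive frequency criterion, and the top candidates are added to the query. The scoring rule, names, and data are illustrative assumptions only; they are not the weighting scheme used by Okapi or by any other TREC system.

```python
# Generic relevance-feedback expansion: add the most frequent terms from the
# judged-relevant documents that are not already in the query.
from collections import Counter
from typing import Iterable, List, Set


def expand_query(query_terms: Set[str],
                 relevant_docs: Iterable[List[str]],
                 n_new_terms: int = 5) -> Set[str]:
    counts = Counter()
    for doc_terms in relevant_docs:
        counts.update(set(doc_terms))      # document frequency within the relevant set
    candidates = [(t, f) for t, f in counts.items() if t not in query_terms]
    candidates.sort(key=lambda tf: -tf[1])
    return query_terms | {t for t, _ in candidates[:n_new_terms]}


# Hypothetical usage.
query = {"oil", "spill"}
rel_docs = [["oil", "tanker", "spill", "alaska"],
            ["exxon", "tanker", "oil", "cleanup"]]
print(expand_query(query, rel_docs, n_new_terms=2))
```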

3. Interactive Evaluation in TREC

Although the prime object of TREC has been to conduct a Cranfield-type experiment on a large scale, some attempt has been made to accommodate interactive searching, albeit in a limited way. To date the performance of interactive searches has been very poor in comparison with the automatic runs. From the outset it was recognized that the experimental design was not well suited for evaluating interactive systems. As discussed above, devising an appropriate experimental approach to evaluate interactive searching, whether outside or within the conditions and rules of TREC, is problematic. It must be stressed that each round has been a learning experience bringing to light different constraints in the various aspects of the retrieval task.

3.1 TREC Topics

In TREC, the approach adopted for generating topics was a major consideration. Appropriate topics were identified by experienced information professionals, based on user queries already encountered in the work environment. The aim was to create rich discursive queries which were more akin to a description of an information need. Each topic in the first two rounds contained a narrative, a description, and a set of concepts which seemed to simulate the outcome of a pre-search interview between an end-user and a search intermediary.

Although this seems a reasonable scenario for providing real queries for automatic retrieval tests, certain aspects appear to be incompatible with interactive testing. Firstly, the professionals who constructed the topics had access to the test data. Thus it could be argued that in generating the topics, a certain degree of interactive searching had already taken place. Secondly, the influence of the search intermediary was most evident in the concepts generated for each topic. The concepts as presented could be translated easily into actual search expressions. Under those conditions it may be difficult for a human searcher to produce a more effective query than that already inherent in the topic. This may well account for the fact that queries which were derived manually from the topics did not perform better than those constructed by automatic methods.

As a result of these possible effects, changes were introduced in subsequent rounds. In TREC-3 concepts were not included for ad hoc topics. It is proposed for TREC-4 that topics should be generated without access to the documents. Hence, shorter, less artificial topics, which have not been manipulated in advance, should provide a fairer basis for comparison between the different methods of query processing or construction.

3.2 Query Construction and Feedback

The TREC rules allow for three methods of query construction: Automatic, manual, and interactive. The distinction between manual and interactive methods of query formulation is problematic, since both involve human intervention. The difference is that the manual approach does not include any consultation of the collection, whereas in the interactive approach the query is modified through feedback. In both cases, however, assistance can be sought from other sources, e.g., a dictionary or thesaurus.

In interactive searching, the demarcation between the query formulation process and the searching process is somewhat artificial. In a realistic setting, a user’s goal is not necessarily to generate an optimal final query but to find documents or information. Moreover, a query will be the result of a cumulative iterative process and is likely to evolve dynamically in the course of a search session.

With regard to interactive query construction, a problem arose in defining how feedback should be included in the evaluation methodology and what should constitute the output of an interactive session. In the first two rounds, all items viewed by the searcher in the course of an interactive session were included in the final ranking of the results submitted for evaluation, irrespective of the relevance judgment. Even if an item viewed by the searcher had been rejected, this was not taken into account in the final results. Whilst the “frozen ranks” method produced results comparable with queries constructed without feedback, it also reduced the effect of relevance feedback. For this reason frozen ranking was abandoned in TREC-3. To overcome this conflict, the TREC-3 interactive task was defined in terms of routing: The interactive process on the training set was used to generate an optimal query to be evaluated on the test set.

3.3 Relevance Judgments

The nature of relevance judgments is another factor to be considered in the different approaches to evaluation. In interactive searching based on relevance feedback, it is more apparent that judgments can be made at different levels. For example, a searcher may consider that a document is pertinent to a topic but does not meet all of the specified criteria of a request. In some instances an item could be a good source of terms for expanding a query, even if it falls outside the set criteria. In other cases only a specific section of a document could be deemed relevant. Since searcher relevance judgments are so fundamental in interactive searching, there may be a case for making finer distinctions in defining relevance, in order to really assess the searcher’s impact on retrieval performance. In addition, the ability to mark relevant paragraphs or sentences, particularly for full text, could be beneficial for both interactive and non-interactive evaluation.

4. Interactive Searching: The Okapi Experience

Until TREC-3, the Okapi team was the only participant to submit interactive runs (Robertson et al., 1994; Robertson, Walker, & Hancock-Beaulieu, 1995a; Robertson, Walker, Hancock-Beaulieu, Gull, & Lau, 1993; Robertson, Walker, Jones, Hancock-Beaulieu, & Gatford, 1995b). In order to highlight the issues involved, it would be useful to describe how the approach to interactive searching evolved through each round.

The highly interactive environment of the Okapi system had to be adapted to accommodate the TREC batch-mode experimental conditions. Our attempt to simulate a realistic interactive searching task was constrained in a number of ways in terms of the searchers, the task, and the interface.

4.1 Searchers

Ideally we would have liked to have had end-users with the same knowledge domain as the TREC analysts. The topics and data have specific characteristics and may reflect a certain type of information seeker, whose motivation is not necessarily easy to replicate. In TREC-1 the searchers were mainly information science students with some knowledge of searching in general, limited domain knowledge, and no particular knowledge of the system. In the last two rounds, project staff and research students with considerably more system knowledge played the role of search intermediaries. None of the other participants in the interactive experiments used actual end-users. Our searchers’ familiarity with searching a ranking system may have been an advantage, judging from the response of expert online searchers trained on Boolean methods who searched the INQUERY system for TREC-3 (Koenemann, Quatrain, Cool, & Belkin, 1995). Recruiting enough searchers to undertake 50 searches remains an obstacle for most TREC participants and is probably one of the main reasons why so few have undertaken interactive runs to date.

4.2 The Task

In the first two rounds of TREC, interactive searches were carried out for the ad hoc queries only, whereas in TREC-3 the routing task was actually specified as the official interactive task. The ad hoc interactive searches were optional and were undertaken by only one participant.

In TREC-1 it was far from obvious how to proceed. Searchers in the first instance spent some time preparing their search off-line and were also allowed to access the database in order to formulate an initial query by experimenting with different terms and making tentative relevance judgments before starting the definitive search. This approach was deemed to be contrary to the TREC rules (see Section 3.2) and was disallowed in successive rounds. Hence, whilst the query construction phase was clearly demarcated from the searching phase for the automatic runs, the distinction is less clear for ad hoc interactive searches.

Apart from the initial exploratory phase, the searching procedure was more or less the same for TREC-1 and 2. The aim of the search session was to find a reasonable number of relevant documents to provide appropriate terms for the final expanded query. Sessions involved three phases: Query construction, feedback, and the extraction of new terms for query expansion. In the first round, only a single query expansion iteration was carried out, whereas in TREC-2 the number of iterations was left to the discretion of the searcher and varied between zero and four, with a mean of two. The searcher formulated a query by generating and combining terms, and in the second round proximity operators were available to allow for the identification of phrases. A relevance judgment had to be provided for each document displayed, and when a sufficient number were identified (usually at least six), the system displayed in ranked order a new term set of up to 20 individual terms extracted from the selected relevant items. At the end of the session, the system carried out a final query expansion on the last term set and output the top ranking documents (200 and 1,000 for the first and second rounds, respectively) as the final result.

There were two major constraints in this search process. Firstly, the user-defined phrases in the query construction phase in the second round were not retained in the term extraction procedure and appeared, if at all, as single terms in the extracted term lists. Secondly, users could not delete individual terms from the extracted term set.

By contrast, the experimental conditions and the search procedure for the interactive routing task in TREC-3 were quite different from the previous rounds. The object was to generate an optimal query with assistance from the knowledge about the official relevance judgments. In making their own relevance judgments for relevance feedback, searchers concentrated on selection of documents likely to be useful for term extraction. This reduced the conflicts experienced in TREC-2 in making judgments for different purposes (see Section 3.3). The searcher was given greater freedom to manipulate and modify the query and could examine documents without being penalized by the “frozen ranks” method (see Section 3.2). Phrases were used extensively and these were weighted and retained as single search terms. In addition, “noise” from the system-generated term sets could be eliminated, e.g., numbers, proper names, or other terms which might be considered to have a disproportionately high weight. Searchers thus modified 35 out of the 50 queries. Unlike in the other rounds, the last extracted term set did not necessarily form the final query, and in a third of the queries the searcher chose a previous term set.

In spite of the greater flexibility for query modification, the performance of the interactive routing runs was very poor in comparison with the automatic ones. This was the case for all of the four participants who submitted results for the interactive routing category. In retrospect, it was evident that the human searchers did not have equal access to, and could not benefit from, the training data for query construction, i.e., from all of the official relevance judgments, as was possible for automatic methods for generating routing queries. In the case of Okapi, searchers did disagree with official judgments in the training set and they also retrieved items which had not been judged by the assessors. Other problems arose in trying to choose between term sets for the final query. It was difficult to assess whether time-dependent names, events, and places should be retained for searching the new data. The effect of removing terms and treating phrases as single terms was also an unknown with regard to the term weighting.

Although the searcher could not compete against the system in the routing task, and the artificiality of the task itself did not make it ideal for interaction, some characteristics of interactive searching did begin to emerge.

4.3 Interface Issues and Searching Behavior

The original VT100 interface for Okapi was designed so that the functionality of the system was invisible to the user.


TABLE 1. Searching behavior characteristics, TREC-3 interactive routing searches.

System                        Okapi     INQUERY    ST-PatTREC
Avg. no. query terms
  Initial query                8.06       5.38        1
    Range                      2-20       1-21        1
  Final query                 16.86      23.82       18.18
    Range                      3-28       3-120       3-42
Avg. no. iterations            8.58       7.76        8.9
    Range                      4-18       2-21        3-18
Avg. no. items viewed         27.78       6.62       18.18
    Range                     10-72       0-36        3-42
Avg. clock time (min)         39.32      15.46       32.44
    Range                      8-84       1-22       12-60

The automatic query expansion facility only required the searcher to indicate whether or not an item displayed was relevant, and to strike one key if they wanted the system to find more items similar to those selected. For the TREC tasks, an interface for expert searchers was developed to allow for query construction. This made the interface much less transparent and more cumbersome and undoubtedly increased the cognitive load on the searcher. How the various kinds of information are displayed and interpreted and the ease with which terms can be manipulated is far from trivial. In making the retrieval task more interactive, greater attention needs to be paid to interface issues. The focus on an interface for the expert user may be in some ways a retrograde step, bearing in mind that ultimately systems must be geared to end users. Nevertheless there may be a case for collecting evidence on how experienced searchers interact with ranking systems and to what extent tactics learned in Boolean environments are applicable or transferable (Keen, 1994).

Although the four participating systems in the interactive searches in TREC-3 had different functionality and different windows-based interfaces, their overall performance was comparable (Charoenkitkarn, Chignell, & Golovchinsky, 1995; Koenemann et al., 1995; Robertson et al., 1995b; Tong, 1995). Whereas Okapi searching was command driven and focused on query expansion based on relevance feedback, INQUERY was mouse driven and combined manual as well as automatic feedback methods for query construction. The third system, ST-PatTREC, was primarily a direct manipulation browsing system with a querying facility added on, and the fourth, Verity’s Topic system, concentrated on query construction using a knowledge representation facility. Table 1 provides comparative data on searching behavior characteristics for the interactive routing task for three of the systems (data on the Topic searches were not available at the time of writing). Okapi searchers relied heavily on the topics for sources of terms and generated few terms of their own, and this is likely to have been the case for the searchers of the other systems. Query expansion based on relevance feedback contributed to 96% of the searches in the Okapi system and 66% of those in INQUERY. In the case of Okapi, only the top 20 extracted terms were displayed to the user, and this accounts for the lower number of terms in the final query compared to INQUERY. For the ST-PatTREC system, query terms were generated predominantly by browsing and selecting terms directly from documents.

Interestingly, the average number of iterations (defined as the number of sub-queries or individual term sets generated for a single search) was almost identical for all three systems. It appeared that there was a law of diminishing returns in that a high number of iterations was indicative of problems and did not lead to improved performance.

A major difference seems to be in the number of full text documents viewed. The INQUERY system displayed short title records in a summary window, from which relevance judgments were made. Okapi also displayed an intermediate short record, but it included only information on term occurrence and the collection source. Searchers had to make relevance judgments on the full text. Another possibly related factor was the clock time. INQUERY searches averaged less than half the time of the other two systems, but a limit of 20 minutes had been imposed. Nevertheless, on average, Okapi searchers viewed a high number of full documents in comparison.

Clearly, the many variables in interactive searching make it very difficult to draw meaningful comparisons, not only between systems but also with the automatic methods for retrieval.

5. A Diagnostic Experiment

To date, evaluation in TREC has focused on input/output and the quantitative measures of recall and precision. The overall comparative results on system performance have raised a number of issues concerning the experimental conditions, but tell us little about individual contributory factors within or across systems. One of the problems is that the timetable allows little opportunity for carrying out retrospective qualitative diagnostics. However, more importantly, much effort needs to be put into determining what would be appropriate approaches and methods for diagnostic evaluation. The experiment described below is an initial attempt at a diagnostic comparison of interactive and automatic searches in Okapi.

5.1 Experimental Methods

Since the routing task had proven to be unsuitable for the evaluation of interactive searching, an experiment was run on an interactive ad hoc task. Three searchers carried out searches on 14 of the TREC-3 ad hoc queries.


TABLE 2. Schedule of diagnostic runs.

Run      Relevance judgments    Term selection
INT      User                   M1 + user
DIAG1    User                   M1 + auto
AUTO     Auto                   M2 + auto
DIAG2    User                   M2 + auto

The objective of the experiment was to try to isolate and compare the elements which the human searchers contributed to the retrieval performance, in particular the quality of the relevance judgments and the quality of term selection.

An interface and procedures similar to those developed for our interactive routing entry for TREC-3 were used. In particular, the searcher’s task was to arrive (through whatever interactions they found useful) at a good final query formulation. This formulation would then be run against the existing data (that is, the TREC training data) for evaluation purposes. The main interactive tools were: Term input (including phrases defined by adjacency); document viewing and marking for relevance; and viewing of ranked lists of terms generated by query expansion procedures from documents marked relevant. Usually, the final query formulation would be the result of the searcher selecting and/or rejecting terms from a list generated by a final query expansion step, from the accumulated relevant items. This list was normally 20 terms long; if the searcher deleted a term, the next one down the ranking would be added. It might contain phrases previously used by the searcher.
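To picture the behavior of this candidate term list, the sketch below shows a fixed-size window onto the ranked expansion terms, where deleting a term promotes the next-ranked candidate into view. It is an illustrative reconstruction of the mechanism only, not the Okapi code; the class name and terms are hypothetical.

```python
# A fixed-length window onto a ranked list of candidate expansion terms:
# rejecting a term pulls the next-ranked candidate into the visible window.
from typing import List


class CandidateTermList:
    def __init__(self, ranked_terms: List[str], window: int = 20):
        self.ranked = list(ranked_terms)   # full ranking from query expansion
        self.window = window

    def visible(self) -> List[str]:
        return self.ranked[:self.window]

    def delete(self, term: str) -> None:
        if term in self.ranked:
            self.ranked.remove(term)       # next-ranked term moves into view


# Hypothetical usage with a short ranking and a window of 3.
terms = CandidateTermList(["tanker", "alaska", "exxon", "cleanup", "valdez"], window=3)
print(terms.visible())     # ['tanker', 'alaska', 'exxon']
terms.delete("alaska")
print(terms.visible())     # ['tanker', 'exxon', 'cleanup']
```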

As it was an ad hoc task, the interactive session was conducted on the same dataset; however, items seen by the searcher in the course of interaction were not treated in any special way in the final run. The interactive run (INT) was then compared with three diagnostic runs described in the next section.

5.2 Diagnostic Runs

The diagnostic runs were based on two sets of variables (Table 2): The source of the relevance judgments, and the method of term selection. Two methods of term ranking and selection were used: M1 for interactive selection and M2 for automatic selection.

In order to isolate the effects of the searchers’ term manipulations (rejecting system-suggested terms and contributing phrases), a diagnostic run was conducted using completely automatic methods to extract terms from the items judged relevant by the searchers. In the first diagnostic run (DIAG1), the top 20 terms from the query expansion process, in the order in which they had been presented to the searchers, were selected (M1); any searcher-defined phrases were excluded.

In order to examine the effects of the searchers’ choice of relevant items, these were compared with the top 40 documents retrieved by a good automatic search on the topics as given. This procedure (automatic query expansion without relevance information) was shown to be beneficial in our TREC-3 automatic runs. As we had used rather better term ranking methods here than in the interactive experiments, these rather better methods (M2) were used for this comparison. Thus AUTO involves (a) an initial search on the topic as given, (b) selection of the top 40 documents from the output, (c) term extraction, selection, and weighting from this document set, and (d) a run based on this new query. The DIAG2 run uses the documents selected as relevant by the searcher, but reproduces (c) and (d) exactly.
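The relationship between the runs may be easier to see in outline: AUTO and DIAG2 share steps (c) and (d) and differ only in where the feedback documents come from. The sketch below is purely illustrative; the helper functions are placeholders standing in for the actual retrieval machinery, and their names and signatures are assumptions.

```python
# AUTO and DIAG2 share the same expansion pipeline (steps c and d); only the
# source of the feedback documents differs.
from typing import Callable, List


def expanded_run(feedback_docs: List[str],
                 extract_and_weight_terms: Callable[[List[str]], List[str]],
                 run_query: Callable[[List[str]], List[str]]) -> List[str]:
    """Steps (c) and (d): build an expanded query from the feedback set, then run it."""
    expanded_query = extract_and_weight_terms(feedback_docs)
    return run_query(expanded_query)


def auto_run(topic_query, initial_search, extract_and_weight_terms, run_query, k=40):
    # (a) initial search on the topic as given; (b) take the top-k documents.
    top_docs = initial_search(topic_query)[:k]
    return expanded_run(top_docs, extract_and_weight_terms, run_query)


def diag2_run(searcher_relevant_docs, extract_and_weight_terms, run_query):
    # Same (c)-(d) pipeline, fed with the documents the searcher judged relevant.
    return expanded_run(searcher_relevant_docs, extract_and_weight_terms, run_query)
```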

5.3 Results

Table 3 presents the four sets of summary results for the fourteen queries. INT is the final interactive formulation. The evaluation measures are a selection of those used in TREC. AveP is the 11-point average precision (at recall points 0.0, 0.1, 0.2, ..., 1.0); Px are the precision values at x documents retrieved; RP is the precision at R documents retrieved, where R is the total known relevant for that particular topic; Rcl is the recall at 1,000 documents retrieved.
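For readers less familiar with these measures, the sketch below computes them for a single topic from a ranked list and a set of known relevant documents. The interpolation used for the 11-point average is the common textbook form and is an assumption here; the official TREC figures come from the standard evaluation software used at NIST.

```python
# Per-topic evaluation measures: precision at a cutoff, R-precision, recall at
# 1,000, and 11-point interpolated average precision. Assumes at least one
# relevant document is known for the topic.
from typing import List, Set


def precision_at(ranked: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for d in ranked[:k] if d in relevant) / k


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    return precision_at(ranked, relevant, len(relevant))


def recall_at(ranked: List[str], relevant: Set[str], k: int = 1000) -> float:
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def eleven_point_average_precision(ranked: List[str], relevant: Set[str]) -> float:
    # Recall/precision observed after each retrieved document.
    points, hits = [], 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    # Interpolated precision at recall levels 0.0, 0.1, ..., 1.0: the maximum
    # precision observed at any recall greater than or equal to the level.
    total = 0.0
    for level in (i / 10 for i in range(11)):
        ps = [p for r, p in points if r >= level]
        total += max(ps) if ps else 0.0
    return total / 11
```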

A comparison of INT and DIAG1 indicates that the searcher term manipulations (deletion of some system-suggested terms, addition of phrases) are, on the whole, beneficial. A comparison of AUTO and DIAG2, on the other hand, shows that the user selection of relevant documents is only slightly better than using the top retrieved documents from a good search. (It should be noted, however, that the “good search” is based on the topics as given, which are still artificially elaborate descriptions.) The fact that the latter two runs are better than the former two is a reflection of the better term ranking procedures. It is clearly not possible to say whether searcher intervention would have improved further on these better procedures.

The problems of determining any statistical significance for these results are discussed in Section 5.5.

5.4 Analysis of Terms

Further evidence of the performance of different classes of terms in the INT and DIAG1 runs was sought by comparing the mean values for the term selection function and the mean term weights over all 14 queries, as presented in Table 4.

TABLE 3. Results of the interactive, automatic, and diagnostic ad hoc runs.

Run       AveP    P5      P30     P100    RP      Rcl
INT       0.425   0.829   0.681   0.523   0.446   0.772
DIAG1     0.379   0.729   0.645   0.494   0.418   0.725
AUTO      0.468   0.814   0.714   0.553   0.481   0.830
DIAG2     0.493   0.843   0.755   0.571   0.499   0.840


TABLE 4. Selection values and matching weights for interactive and diagnostic 1 term sets.

Mean         1       2       3       4       5       6       7       8
RSV        19.22    5.88   24.53   19.37   27.09   17.80    9.85   48.90
Weights    78.13   76.60   78.89   81.95   85.82   77.06   62.46  115.55

Mean 1 = All terms in the interactive term set, including terms removed by the user.
Mean 2 = Terms removed by the user.
Mean 3 = Terms actually used in the final interactive query formulation.
Mean 4 = Terms used in the final diagnostic formulation.
Mean 5 = Terms used in both the interactive and diagnostic formulations.
Mean 6 = Terms used only in the interactive query formulation.
Mean 7 = Terms used only in the diagnostic formulation.
Mean 8 = Phrases used in the interactive query formulation.


The Robertson Selection Value (RSV) used for term ordering for feedback in Okapi is intended to measure how useful a term is for a query, i.e., how much its inclusion in the query will improve performance (Robertson, 1990). The weights are the standard weights for the probabilistic model. A term may have a high weight but a low RSV if, for example, it is very rare: although its presence is a good indicator of relevance, it occurs too seldom to have a significant impact on overall performance on the query.

The RSV is normally used in Okapi for term selection in query expansion, but here it is being used in a diagnostic fashion. Both the RSVs and the weights below are calculated retrospectively, i.e., they are based on the official relevance judgments that are used for evaluation.
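For reference, in the usual notation of the probabilistic model (given here as background; this exact algebraic form is an assumption drawn from the cited literature rather than reproduced from the article), with $N$ documents in the collection, $R$ of them judged relevant, and a term $t$ occurring in $n_t$ documents of which $r_t$ are relevant, the matching weight and selection value take the form

\[
w_t = \log \frac{(r_t + 0.5)\,(N - n_t - R + r_t + 0.5)}{(n_t - r_t + 0.5)\,(R - r_t + 0.5)},
\qquad
\mathrm{RSV}_t = r_t \, w_t .
\]

A very rare term can therefore have a high weight $w_t$ but a low $\mathrm{RSV}_t$, because $r_t$ is necessarily small.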

The terms removed by the user (2) were justifiably removed, since they have the lowest RSVs. User-generated phrases and term deletion would seem to account for the higher selection value for the interactive term set (3) compared to the diagnostic one (4). However, the terms with the higher RSVs (5) occurred in both term sets, which would indicate that some care must be taken if the user overrides candidate terms extracted by the system. By contrast, user-generated terms (6) did not have very high selection values. This finding on the greater retrieval effectiveness of relevance feedback terms as opposed to user terms is supported by a recent study by Spink (1995). The specific character of phrases accounts for their high weight (8); however, the RSV was also high, indicating that searcher-generated phrases are indeed valuable.

With regard to term weighting, the similarity in the weights for all the different classes of term sets, with the exception of phrases, provides some evidence for the greater diagnostic power of the selection values. The analysis of average selection values given here seems to correlate well with, and to amplify, the evaluation results given in Section 5.3, and may therefore serve as a useful diagnostic tool or measure in addition to output measures. However, an examination of averages for the terms of individual queries seems to indicate that there are other factors at play, and that the method in itself is probably not sufficiently robust.

Clearly there is scope for the development of both manual and automatic diagnostic methods to throw light on various fundamental research questions, such as the relation of searcher-defined phrases and searcher selection of terms to the probabilistic model.

5.5 Statistical Significance

The question of whether these IR test results (Table 3) show any statistically significant differences is clearly of interest. An analysis of the TREC-3 data from all participating systems (Tague-Sutcliffe & Blustein, 1995) suggests that in order to be significant, very large differences in the performance measures would have to be seen. Since a major source of variation is the topic, and since the above results are based on 14 topics rather than 50, the point would apply even more strongly here. Thus, the above differences would certainly show up as not significant in the analysis.

A simpler form of significance test is the sign test based on topics. For any one of the six measures, we count up the number of topics for which Run A performs better than, worse than, or the same as Run B. These numbers are then compared to the binomial distribution. Applying this method to the comparison between INT and DIAG1, the counts on each measure favor INT, as we might expect from the averages. The largest margin, on both RP and Rcl, is 10 topics favoring INT, 3 favoring DIAG1, and one neutral. The sign test on this result gives p = 0.095. While this cannot be regarded as statistically significant, it would encourage the view that a larger sample of topics might indeed give a significant result. A comparison between AUTO and DIAG2 shows somewhat less consistency, though here too Rcl shows a 10/1/3 bias in favor of DIAG2.
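The sign test computation is simple enough to reproduce. The sketch below is an illustration rather than the analysis actually run for the article: ties are discarded and the win/loss split is compared to a Binomial(n, 1/2) distribution, so small differences from the reported p = 0.095 can arise from how ties and sidedness are handled.

```python
# Two-sided exact sign test over topics, ignoring ties.
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# The RP/Rcl comparison between INT and DIAG1: 10 topics favor INT, 3 favor
# DIAG1, and the one tied topic is ignored.
print(round(sign_test_p(10, 3), 3))   # ~0.092, in line with the reported p = 0.095
```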

The question of statistical significance arises again in the results on terms in Table 4. The statistical structure of the data summarized in Table 4 is a little complex: Averages were first taken for each category of term for each topic, and these figures were then averaged across topics. The number of terms within each category varied between topics, so that the first-stage averages were based on different numbers. The second-stage averages would normally be over the 14 topics; however, any topic having no terms in the given category would be treated as a missing value, hence some of the second-stage averages were actually based on fewer than 14 first-stage averages.

The situation is a little difficult to analyze statistically. As a first attempt, only the second-stage averaging process was considered: Thus each figure was taken as a mean of a sample of 14 or fewer. T-tests were conducted on appropriate pairs (not all comparisons make sense). Of the differences tested, only those between 2 and 3 (on RSV) and between 3 and 8 (on weight) were significant (p < 0.001 and p < 0.01, respectively). Thus, terms removed by the user are definitely worse than those kept, and phrases definitely have higher weights than the group of terms used as a whole. But the differences between means 5, 6, and 7 were not significant. Further study of the statistical character of these measurements is suggested.

6. Interactive Track in TREC-4

For TREC-4 various subsidiary tracks have been added to the main experiment in order to focus on those special aspects of IR evaluation which have come to light in the course of the previous rounds. These include, for example, natural language processing, multilingual retrieval, and data fusion. The introduction of a specific track for interactive searching in TREC is in recognition of the need to design experimental protocols to allow interaction questions to be addressed in a laboratory context. We are concerned on the one hand with the role of different kinds of searcher-provided information in the retrieval process, and on the other hand with the role of different kinds of system-provided information in the searcher’s perception of the search process. For the experimental design, particular consideration is being given to the starting point, i.e., the nature of the topics, and the definition of the task. At the time of writing, the official guidelines had not been finalized, but a consensus on the general approach had emerged.

Firstly, it is envisaged that more realistic topics can be devised to serve the requirements of both automatic and interactive searching. Secondly, with regard to the task specification, the search is being perceived as a two-stage process. In the interactive stage, the user will aim to obtain a reasonable number of relevant documents as he/she sees fit, and the output of this primary task is likely to be measured in terms of recall/precision (according to official relevance judgments) and elapsed time. It may be possible to introduce other measures of the complexity of the search, and participants will be invited to consider appropriate measures for their own systems. In the second stage of the search, the searcher will be required to generate or identify an appropriate search formulation to produce a ranked list of documents for offline evaluation using standard TREC evaluation methods. This would be consistent with the situation where a searcher is trying to establish the best possible ranked list, without examining every item.

The results of this secondary task would be more comparable with the main automatic ad hoc TREC runs. In addition, participants will be encouraged to undertake a baseline non-interactive run, which would be comparable to the interactive run. The idea would be for a run which would use essentially the same system as is used for the interactive run, but without interaction, as exemplified in Section 5. Although TREC rules limit the number of runs which can be submitted, this does not prevent participants from making replications and reporting their own comparisons.

It is recognized, as was the case for the main TREC experiment, that procedures will need to evolve. It is important at this exploratory stage to allow for flexibility and diversity. The success and development of the interactive track will also depend on attracting a sufficient number of participants, using experimental as well as operational systems. TREC has created a strong impetus to try to integrate non-interactive and interactive approaches to evaluation in information retrieval. The prospects for this collaborative venture are exciting!

Acknowledgment

Okapi’s participation in TREC was supported by the British Library Research and Development Department.

References

Bates, M. J. (1990). Where should the person stop and the information search interface start? Information Processing & Management, 26(5), 575-591.

Charoenkitkarn, N., Chignell, M., & Golovchinsky, G. (1995). Interactive exploration as a formal text retrieval method: How well can interactivity compensate for unsophisticated retrieval algorithms. In D. Harman (Ed.), Overview of the Third Text REtrieval Conference (TREC-3) (pp. 179-199). Gaithersburg, MD: NIST.

Efthimiadis, E. N. (1993). A user-centered evaluation of ranking algorithms for interactive query expansion. In R. Korfhage, E. Rasmussen, & P. Willett (Eds.), Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 146-159). Pittsburgh, PA: ACM.

Ekmekcioglu, F. C., Robertson, A. M., & Willett, P. (1992). Effectiveness of query expansion in ranked-output document retrieval systems. Journal of Information Science, 18(2), 139-141.

Fidel, R. (1991). Searchers' selection of search keys: I. The selection routine; II. Controlled vocabulary or free-text searching; III. Searching styles. Journal of the American Society for Information Science, 42, 490-527.

Fidel, R. (1993). Qualitative methods in information retrieval research. Library and Information Science Research, 15, 219-247.

Hancock-Beaulieu, M., & Walker, S. (1992). An evaluation of automatic query expansion in an online library catalogue. Journal of Documentation, 48(4), 406-421.

Hancock-Beaulieu, M., Fieldhouse, M., & Do, T. (1995). The evaluation of interactive query expansion in an online catalogue with a graphical user interface. Journal of Documentation, 51(3), 225-243.

Harman, D. (Ed.) (1992). Special issue on IR evaluation. Information Processing & Management, 28(4).

Harman, D. (Ed.) (1993). The First Text REtrieval Conference (TREC-1). Gaithersburg, MD: NIST.

Harman, D. (Ed.) (1994). The Second Text REtrieval Conference (TREC-2). Gaithersburg, MD: NIST.

Harman, D. (Ed.) (1995a). Overview of the Third Text REtrieval Conference (TREC-3). Gaithersburg, MD: NIST.

Harman, D. (1995b). Overview of the Third Text REtrieval Conference (TREC-3). In D. Harman (Ed.), Overview of the Third Text REtrieval Conference (TREC-3) (pp. 1-19). Gaithersburg, MD: NIST.

Ingwersen, P. (1992). Information Retrieval Interaction. London: Taylor Graham.

Keen, M. (1994). Query formulation in ranked output interaction. In R. Leon (Ed.), Information retrieval: New systems and current research. Proceedings of the 15th Research Colloquium of the British Computer Society Information Retrieval Specialist Group, Glasgow, 1993 (pp. 150-161). London: Taylor Graham.

Koenemann, J., Quatrain, R., Cool, C., & Belkin, N. J. (1995). New tools and old habits: The interactive searching behavior of expert online searchers using INQUERY. In D. Harman (Ed.), Overview of the Third Text REtrieval Conference (TREC-3) (pp. 145-171). Gaithersburg, MD: NIST.

Peat, H. J., & Willett, P. (1991). The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the American Society for Information Science, 42, 378-383.

Robertson, S. E. (1990). On term selection for query expansion. Journal of Documentation, 46(4), 359-364.

Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gull, A., & Lau, M. (1993). Okapi at TREC. In D. Harman (Ed.), The First Text REtrieval Conference (TREC-1) (pp. 21-30). Gaithersburg, MD: NIST.

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., Gull, A., & Gatford, M. (1994). Okapi at TREC-2. In D. Harman (Ed.), The Second Text REtrieval Conference (TREC-2) (pp. 21-34). Gaithersburg, MD: NIST.

Robertson, S. E., Walker, S., & Hancock-Beaulieu, M. (1995a). Large test collection experiments on an operational, interactive system: Okapi at TREC. Information Processing & Management, 31(3), 345-360.

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1995b). Okapi at TREC-3. In D. Harman (Ed.), Overview of the Third Text REtrieval Conference (TREC-3) (pp. 109-126). Gaithersburg, MD: NIST.

Sparck Jones, K. (1981). Information Retrieval Experiment. London: Butterworths.

Sparck Jones, K. (1995). Reflections on TREC. Information Processing & Management, 31(3), 291-314.

Spink, A. (1995). Term relevance feedback and mediated database searching: Implications for information retrieval practice and systems design. Information Processing & Management, 31(2), 161-171.

Tague-Sutcliffe, J., & Blustein, J. (1995). A statistical analysis of the TREC-3 data. In D. Harman (Ed.), Overview of the Third Text REtrieval Conference (TREC-3) (pp. 385-392). Gaithersburg, MD: NIST.

Tong, R. (1995). Interactive document retrieval using TOPIC: A report on the TREC-3 experiment. In D. Harman (Ed.), Overview of the Third Text REtrieval Conference (TREC-3) (pp. 201-209). Gaithersburg, MD: NIST.
