

Large Scale IR Evaluation

Virgil Pavlu

Thesis Advisor: Javed Aslam

February 20, 2010


Abstract

Today, search engines are embedded into all aspects of the digital world: in addition to Internet search, all operating systems have integrated search engines that respond even as you type, even over the network, even on cell phones; therefore the importance of their efficacy and efficiency cannot be overstated. There are many open possibilities for new ideas, implementations and optimizations in this area; that, combined with the immense interest from both academia and corporations, makes it a very attractive research field.

My PhD work is focused on search quality. I have developed algorithms and models for efficient evaluation, estimation of query difficulty, metasearch, and exploration of relevant patterns; I also participated in collection-building efforts for research purposes. This thesis presents a unified view of many of these tasks, often making use of intrinsic connections between them. We propose solutions that make current technologies scalable with data growth, a certainty in the near future. Some of these solutions have already been implemented and used by the research community; certain methods made possible research endeavors that would have otherwise been unworkable.

The main result of this thesis is the proposal of two methods for pooling and evaluation of IR systems. Both approaches are well studied in other domains and have a good mathematical foundation. Both offer a strong direction for efficient evaluation, without which IR experimentation cannot be performed on a large scale. An additional benefit of the sampling methods is that the estimated variance leads to a confidence value for the evaluation, a first within IR evaluation technology.


Contents

1 Introduction
  1.1 IR Research
    1.1.1 IR evaluation
    1.1.2 Text REtrieval Conference
  1.2 Thesis Preview

2 IR Evaluation
  2.1 Evaluation measures
    2.1.1 Precision and Recall
    2.1.2 Average Precision
    2.1.3 R-precision geometrical correlation with Avg precision
    2.1.4 nDCG
    2.1.5 Other measures
    2.1.6 Performance for a set of queries
  2.2 Maximum Entropy application
    2.2.1 The Maximum Entropy Method
  2.3 TREC Pooling and System Evaluation
  2.4 Evaluation experimentation
  2.5 Evaluation with incomplete judgments
    2.5.1 Confidence in the evaluation with incomplete judgments

3 Relevance priors
  3.1 From ranked list to distribution
    3.1.1 Measure-based rank weighting scheme
  3.2 Average Precision prior
    3.2.1 Sum precision as an expected value
    3.2.2 A global prior for documents

4 Sampling
  4.1 Sampling for IR
  4.2 Sampling Theory and Intuition
    4.2.1 Uniform random sampling and infAP
    4.2.2 Importance sampling
  4.3 Non-uniform sampling with replacement
    4.3.1 Estimation, scaling factors
    4.3.2 Optimal sampling distribution
    4.3.3 Estimating prec@rank and R-prec
    4.3.4 Practical summary
    4.3.5 Sampling with replacement results
  4.4 Non-uniform sampling without replacement
    4.4.1 The sample
    4.4.2 Estimators
    4.4.3 Sampling designs, computation of inclusion probabilities
    4.4.4 Sampling without replacement results
  4.5 Confidence intervals and reusability
    4.5.1 Robust Track 05 and the sab05ror1 system

5 Million Query Tracks
  5.1 Introduction
  5.2 Phase I: Running Queries
    5.2.1 Corpus
    5.2.2 Queries
    5.2.3 Submissions
    5.2.4 Submitted runs
  5.3 Phase II: Relevance judgments and judging
    5.3.1 Judging overview
    5.3.2 Judgments
    5.3.3 Selection of documents for judging
  5.4 Minimal Test Collections (MTC)
  5.5 Results
  5.6 Analysis
    5.6.1 Efficiency Studies
    5.6.2 Reusability
  5.7 Conclusion

6 Hedge online pooling
  6.1 Online pooling and evaluation
    6.1.1 Hedge methodology
  6.2 Hedge application
  6.3 Hedge results

7 Metasearch
  7.1 Established metasearch techniques
  7.2 Measure based metasearch
    7.2.1 AP-based Metasearch at TREC MQ tracks
  7.3 Hedge-based metasearch w/ relevance feedback
    7.3.1 Hedge at Terabyte Track 2006
  7.4 Query difficulty estimation
    7.4.1 Query hardness
    7.4.2 Predicting query hardness
    7.4.3 Query hardness estimation via Metasearch
    7.4.4 Query hardness results

8 Conclusions
  8.1 Thoughts on IR Evaluation
  8.2 Sampling remarks
  8.3 Online metasearch


Acknowledgements

Many people have helped me along the way, but two stand out as fundamentally changing my life. My grandma was special, in the sense that she made everything around her special, sometimes for better, sometimes for worse; the direct impact on me (and my brother) was that although we grew up in a poor neighborhood where every kid was a street kid, we were the only school kids, eventually going to the best high school in the country.

The second, a passionate, speak-your-mind, rather strange mathematician whom I met when I was 11, could never agree on anything with any adults, including his family, but enjoyed talking mathematics with anyone around. That was not the fourth grade mathematics we studied in school, but rather complicated theories which I fully understood only many years later while attending the best math college in the country. Against the odds given by that part of the world at the time, all the kids in that group ended up doing something out of the ordinary with their lives. My biggest luck was, no doubt, meeting him.

My parents did everything good parents do, but they also sometimes did what parents are not supposed to do, and that worked very well for me: given my strong interest in logical things, they let me do what I wanted as a teenager, even though sometimes that was playing tournament bridge instead of attending literature classes. Ergo I am not the best writer, but I have an understanding of anything arithmetic.

The saying goes "everyone is different", but my advisor, Jay, is definitely one of a kind. He taught me the most important PhD subject of all, "scientific intuition", of which I natively had some, though completely uneducated. Besides helping with research, and being an amazing teacher, he was like a second father to me, in a place that I found culturally very different from my natural heritage and hence sometimes hard to adapt to. My second biggest luck was meeting him.

Special thanks go to Ian Soboroff and Ben Carterette for helping me put into practice many of the ideas in this thesis, through the TREC Million Query Tracks.

My wife Micaela took care of many things while I had to stay focused on my thesis. Her dancing and singing kept my worries out of the house.

Finally, many friends and collaborators contributed good spirits, arguments, multiplayer online strategy environments, salsa dancing, card playing, and even words to this thesis: Florin, Kunal, Alin, Andreea, Emine, Evan, Robert, I thank you.



Chapter 1

Introduction

In the last few decades we have witnessed a major shift to a digital world that not only affects all major dimensions of a civilization (science, military, culture and commerce) but completely changes long established norms. Consider the concept of a library: since ancient times, many organizations collected and archived records, writings and facts into voluminous physical spaces. Today, all the information anyone needs can be stored or accessed on a pocket-size device; for example, all of Wikipedia fits on a cell phone, while the entire Library of Congress can be stored on a workstation. This is made possible by major advances in systems, architectures and miniaturization; however, we need new tools to make use of the vast amounts of data we now have access to. Two newer disciplines are quickly becoming foundations of modern libraries: Machine Learning [ML] is responsible for mining and creating knowledge from data, and Information Retrieval [IR] is responsible for accessing the data.

Search engines make up much of IR, but there is much more to them than meets the naive eye. Of course, the basic concept is search: a mechanism to answer information needs; but for it to work, a significant infrastructure must be in place, which is even more important than the search mechanism itself: caching and updating very large datasets, making sense of implicit data structure, dealing with billions of queries a day, personalizing the results, etc.

My PhD work is focused on search quality. I have developed algorithms and models for efficient evaluation, estimation of query difficulty, metasearch, and exploration of relevant patterns; I also participated in collection-building efforts for research purposes. This thesis presents a unified view of many of these tasks, often making use of intrinsic connections between them. We propose solutions that make current technologies scalable with data growth, a certainty in the near future. Some of these solutions have already been implemented and used by the research community; certain methods made possible research endeavors that would have otherwise been unworkable.

In this thesis, we often present a principled solution to a problem together with a particular (our) implementation; while other, better implementations may come along in the future, the main ideas presented here are likely to remain central for IR research, especially for system evaluation.

1.1 IR Research

Today, search engines are embedded into all aspects of the digital world: in addition to Internet search, all operating systems have integrated search engines that respond even as you type, even over the network, even on cell phones; therefore the importance of their efficacy and efficiency cannot be overstated. There are many open possibilities for new ideas, implementations and optimizations in this area; that, combined with the immense interest from both academia and corporations, makes it a very attractive research field.

Research in the IR field has connections with many other scientific disciplines, including the mathematics that provides its foundations; however, most advances have been demonstrated through massive experimentation. As such, most new developments require a well understood and broadly accepted testing methodology, so that enhancements can be tied to measured levels of performance. IR research is a very practical field, in the sense that good theory with poor experimentation gets less attention than poor theory with good experimentation.

Queries, collections, search engines and users

The main Information Retrieval task can be briefly summarized as follows: a user issues a query to a system that has access to a collection of possible results (usually documents). The system, also called a "search engine", does its best to retrieve the results that the user would like the most; obviously the system employs various techniques in order to match the query with documents and output the best matches.

The early systems [1] looked only at the document content, often matching only the terms based on frequency counts; modern systems [70] are very complex mechanisms, usually using more than a single search component in finding relevant documents for a given query: indexing, dictionaries, document content and natural language processing, connectivity, popularity, trends and cache, learning, etc. Each one of these components has been intensively studied and well documented (see, for example, the SIGIR conference series 1977-2008).

In research studies, some of the entities of this setup are very different from their real-world counterparts (such as web search). For example, when it comes to assessing performance, users are in most cases highly educated individuals who are aware that they are evaluating IR systems and therefore develop queries that are much more elaborate than the day-to-day queries a commercial search engine has to answer. Another difference is that research collections are pre-processed, organized and in some cases biased (depending on the source), and orders of magnitude smaller than the indexes of web search engines. Nevertheless, the IR research community has made major contributions to commercial search engines, e-commerce, advertising, text analysis, translation, evaluation methodology, etc.


Relevance. There are many reasons for user satisfaction (besides the obvious one that the user found what he was looking for): user interface, results presentation, pointers to other topics of interest, integration of services, etc. Generally, we say that a retrieved document is relevant if it contributes significantly to user satisfaction; in practice relevance is a number (often binary) that indicates the fulfillment of the information need associated with the document.

There has been much debate about the assessment of relevance (see the TREC conferences 1992-2008), and for good reasons. Sometimes relevance is obvious: if one is looking for the homepage of Intel Corporation, then only www.intel.com is a relevant document; however, in many situations, given a query, relevance is an elusive concept. For example, for the query "The Godfather", a document about director Francis Ford Coppola's personal experience during the making of the famous movie may or may not be relevant, depending on who is asking and why. The dependence of relevance on the actual meaning of a query is even stronger for queries like "chili peppers", which can refer to either cooking recipes or a popular rock band, but not both.

The interest of the user (the "information need") may or may not be well expressed by the particular query asked. While not the focus of this thesis, there is a large research subarea of IR dedicated to the problem of finding the best expression of an information need in terms of queries. Significant results include asking multiple queries [19, 18], updating the query [85, 15] and appropriate presentation of results [75], among other approaches.

In this thesis, we assume that the queries asked are a perfect representation of the information need, and that the information need is associated with the person making the relevance judgments; therefore the system response to a query can be measured as the performance of that system on the task of fulfilling the information need.

1.1.1 IR evaluation

Suppose the user is not satisfied with the results. Given that he/she does not have direct access to the collection (and even with access, there would be no feasible way to search manually), the user can choose either to change the query or to change the system (or both). Say some change occurs and the user is now satisfied; can we quantify this phenomenon mathematically?

We are concerned with user satisfaction only as a result of changing the system. To quantify the difference in quality of the results of two systems, we need to make some assumptions:

• the query is a perfect expression of the information need

• the same query (or set of queries) is used

• the same information need (or user) expresses satisfaction

• both systems have access to the same collection


Under these assumptions, we can reasonably state that a change in user satisfaction with the results is caused by the differences in the internal mechanics of the two systems, and not by other factors.[1] Tools for assessing system performance are critical to IR researchers, commercial search engines, e-commerce businesses, and many large institutions, because any change in the internal mechanism of a search engine needs to be evaluated in practice.

[1] Some would consider the last item on the list of assumptions part of the system's internal mechanics.

We consider the problem of large-scale retrieval evaluation, and we propose new methods for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed techniques produce accurate estimates of the standard measures themselves.

Incomplete judgments. Large-scale, TREC-style retrieval evaluation can be very expensive, often requiring that tens of thousands of documents be judged in order to obtain accurate, robust, and reusable assessments. A number of methods have been proposed to potentially alleviate this assessment burden, including shallower depth pools [115], greedily chosen dynamic pools [41, 33], and pools with randomly assigned relevance assessments [89]. However, these methods all tend to produce biased estimates of commonly used measures of retrieval performance (such as average precision), especially when relatively few relevance assessments are made.

1.1.2 Text REtrieval Conference

The Text REtrieval Conference (TREC) is an annual event run by the National Institute of Standards and Technology (NIST) that provides the infrastructure for large-scale testing of text retrieval, including reasonable (and growing) document collections, realistic test topics, appropriate scoring procedures, and a forum for the exchange of research ideas and for the discussion of research methodology.

Essentially, TREC is an effort to satisfy the four assumptions made above, with large collections. TREC has a huge impact on the IR research community, as many groups across the globe (even non-participants in the TREC conference) use TREC data and technologies to try out new ideas that otherwise would not have been explored.

TREC has run since 1992. Each year there are a number of tracks, each focused on a different aspect of retrieval, to which various research groups submit search engine results that are evaluated and published. The data is made public; the ad-hoc collections (1992-1999) contain about 1.9 million documents (mostly news articles) and about 450 topics with relevance judgments.


Because the test collections are very large, the relevance judgments cannot be exhaustive, even at TREC. Several technologies have been developed for selecting the documents to be judged, in such a way that all participating systems can be fairly evaluated. In 2007, the Million Query track used the sampling technology proposed in chapter 4.

New, bigger collections are developed about every three years. The current large collection ("GOV2") has about 25 million documents and has been used for the last three years. The next collection may be about 1 billion web pages, requiring a large effort to put together.

TREC consists of many tracks, each running for a few years. In this thesis, we shall refer to the following tracks:

• the Ad-Hoc track

• the Web track

• the Robust track

• the Terabyte track

• the Million Query track

1.2 Thesis Preview

The main result of this thesis is the proposal of two methods for pooling and evaluation of IR systems. Both approaches are well studied in other domains and have a good mathematical foundation. Both offer a strong direction for efficient evaluation, without which IR experimentation cannot be performed on a large scale. An additional benefit of the sampling method is that its estimated variance leads to a confidence value for the evaluation, a first within IR evaluation technology.

IR evaluation is discussed in chapter 2. All the necessary background is provided, along with research insights into the information embedded in IR measurements.

Relevance priors. Chapter 3 presents analyses and derivations of relevance priors. The priors developed are used in all later chapters.

Sampling technologies are discussed at length in chapter 4. Pools of documents are obtained with various sampling techniques and then used for evaluation of retrieval systems. While our estimates of performance are unbiased by statistical design, their variance depends on the sampling distribution (or prior) employed; as such, we derive a sampling distribution likely to yield low variance estimates. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical judgment pools.


The Million Query Track, part of the TREC 2007 conference, is a massive experiment involving about 10,000 queries used for system evaluation, 1,800 of which were judged. The track uses advanced pooling methods, including the sampling technique developed in chapter 4. The Million Query track is detailed in chapter 5.

Online pooling. Chapter 6 discusses online allocation as a pooling technique. Specifically, the Hedge algorithm is used to develop a unified model which, given the ranked lists of documents returned by multiple retrieval systems in response to a given query, generates document collections likely to contain large fractions of relevant documents (pooling) and accurately evaluates the underlying retrieval systems with small numbers of relevance judgments (efficient system assessment).

Metasearch is introduced in chapter 7. We discuss various metasearch techniques and compare them with previously known state-of-the-art methods. In particular, we show that Hedge pooling also provides a natural feedback mechanism for fusing the ranked lists of documents in order to obtain a high-quality combined list (metasearch). This approach is an adaptation of a popular on-line learning algorithm: in effect, our proposed system "learns" which documents are likely to be relevant from a sequence of on-line relevance judgments.


Chapter 2

IR Evaluation

As with any system, there are dedicated measures for all engineering, business and marketing aspects of a search engine; the literature concerning system design, especially on hardware, extends well beyond Computer Science. We are not concerned here with system performance (speed, resources required, cost, etc.), only with the quality of the results.

To assess the performance of an IR system we need

• a collection of documents

• a set of queries

• a set (preferably complete) of relevance judgments, consisting of a relevance value associated with every (document, query) pair; in most cases, but not all, the relevance is binary (0/1)

• an evaluation procedure (usually a mathematical formula)

Ranked lists

We consider only ranked lists as returned results; while other forms of result presentation have been proposed, ranked lists are by far the most popular form of system response, in all media. A ranked list contains the documents retrieved by the system, in the order dictated by estimated relevance, or system score. Obviously the retrieval performance depends on where the [true] relevant documents are in the list: if most relevant documents are returned at the top of the list, the user is satisfied. Therefore any reasonable global measure takes into account the ranks of the documents retrieved. As a convention, we shall call the ranks denoted by small numbers (i.e. 1, 2, 3, etc.) "high ranks"; these are the ranks at the top of the list, and the most important. The ranks at the bottom of the list are called "low ranks".

For set retrieval (no list returned, but a set instead), such as the case of Boolean retrieval, database matching or filtering, the evaluation methodology is considerably simpler: without ranks, we can only measure how many relevant documents have been retrieved ("recall"), or what percentage of the retrieved documents are relevant ("precision"). Most of the methods presented for ranked lists can be adapted for set retrieval by truncating the list at a certain threshold (rank) and considering the part of the list above the threshold as the retrieved set.

Judged documents

Figure 1: Binary relevance judgments associated with search results; R = relevant, N = nonrelevant.

Relevance judgments are usually obtained for research collections through massive human effort. Every year, IR researchers (or paid personnel) spend countless hours judging documents retrieved in response to various queries. The relevance is assessed relative to the query meaning (or "information need"), and not to the particular query terms. Some queries can be very ambiguous: a user can type "chili peppers" in a web search engine looking for cooking recipes, while another can type the exact same phrase expecting in return documents discussing a popular rock band. Certainly, documents relevant for one of the meanings would be non-relevant for the other. Some search engines deal with this problem by recognizing an ambiguous query and clustering both sets of results, asking the user to refine the query. For research purposes, however, the problem is solved by carefully designing the queries: not only are the chosen words unambiguous, but the query is formulated in plain text such that a human would immediately understand the meaning. An example of a research query (or topic):

TOPIC 752. Dam removal
Description: Where have dams been removed and what has been the environmental impact?
Narrative: Discussions of dam removal in general are relevant. Identification of specific places where dams have been removed is relevant, as is discussion of the environmental impact of removal. Applications for removal are not relevant unless reasons are given.

The result of the judging process is represented in Figure 1. For the purpose of evaluation, a ranked list of returned documents is essentially a list of ranked judgments; in most cases the judgments are binary (or transformed to binary even if the judge made a nuanced assessment) because some popular measures require binary judgments as input. Formally, for evaluation purposes, a document d in a list has only two characteristics:

• its rank, denoted r(d) (or r_s(d) for its rank in the list of system s)

• its judged relevance denoted rel(d)

Throughout this thesis, all evaluations use only these two characteristics of a document.
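To make this concrete, the short sketch below (Python; the run and the judgments are made up purely for illustration) shows the representation assumed by the snippets later in this chapter: for evaluation, a run reduces to the sequence of judgments rel(d) read off in rank order.

    # Hypothetical judged run for one query: (document id, rank r(d), judged rel(d)).
    run = [("d17", 1, 1), ("d03", 2, 0), ("d42", 3, 1), ("d08", 4, 1), ("d55", 5, 0)]

    # For evaluation only the rank and the judgment matter; reading the judgments
    # off in rank order gives the binary sequence used by the measures below.
    judgments_in_rank_order = [rel for _, _, rel in sorted(run, key=lambda t: t[1])]
    print(judgments_in_rank_order)   # [1, 0, 1, 1, 0]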

2.1 Evaluation measures

Many measures of retrieval quality have been proposed [15, 70], each appraising specific retrieval aspects. Early measures focused on recall, or the completeness of the result set; some measures have been designed to account for the user effort required to find relevant results [20]; other measures account for the rate at which relevant results are listed, etc.

Desirable properties of IR performance measurement:

• Related to user satisfaction

• Interpretable

• Able to average or collect

• High discrimination power

• Able to be analyzed

2.1.1 Precision and Recall

Given a ranked list of documents, generically called "relevant" (relevance = 1) and "non-relevant" (relevance = 0), the precision at rank r is defined as the fraction of relevant documents within ranks 1..r of the list.

prec@r = \frac{1}{r} \sum_{d:\, 1 \le r(d) \le r} rel(d)

prec@10 is especially useful in web retrieval measurement, where the first page contains roughly 10 results (e.g., Google). (Note: in many summation notations throughout this thesis the iterator d is omitted, writing only the condition below the sum symbol.)

Similarly, the recall at a certain rank r is the percentage of relevant documents found within ranks 1..r (out of all relevant documents). This is a bit tricky to measure because the total number of relevant documents in the collection (for a given query) is generally unknown. Throughout this thesis, we will make the common assumption that the total number of relevant documents R is known or can be estimated accurately.

recall@r = \frac{1}{R} \sum_{1 \le r(d) \le r} rel(d)

recall@rank is especially useful at relatively deep ranks, in the case of exhaustive search, such as library research or law documentation.

Usually, system performance is characterized by increasing recall and decreasing precision as one moves down the list. At the top few ranks (think r = 5), many systems generally retrieve relevant documents, so the precision is high; but given a total number of relevant documents R in the hundreds, the recall is very low. Conversely, at the bottom of the list (think r = 1000) a good retrieval system will have found most relevant documents, achieving high recall; however, the precision at such ranks is generally low due to the many nonrelevant documents retrieved.
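A minimal sketch of the two definitions (our illustration, not code from the thesis; the judgment list is hypothetical and R is assumed known):

    def precision_at(judgments, r):
        # fraction of relevant documents within ranks 1..r
        return sum(judgments[:r]) / r

    def recall_at(judgments, r, R):
        # fraction of all R relevant documents found within ranks 1..r
        return sum(judgments[:r]) / R

    judged = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]   # binary judgments in rank order
    print(precision_at(judged, 5))            # 3/5 = 0.6
    print(recall_at(judged, 5, R=6))          # 3/6 = 0.5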

Precision-recall curves

A simple way to visually assess the performance of a retrieval system is to graph the precision and recall at each rank that increases recall (that is, each rank that is associated with a relevant document). Recall sits on the x-axis while precision sits on the y-axis (Figure 2). Given the tradeoff between recall and precision, it is generally assumed that at recall = 0, precision = 1 and vice versa, so a generic precision-recall curve passes through the points (0,1) and (1,0).

Figure 2: Precision-recall curve of system 8manex on TREC query 407. The yellow bars show the number of nonrelevant documents located within consecutive recall levels.

R-precision

R-precision is simply the precision at rank R

Rprecision = \frac{1}{R} \sum_{1 \le r(d) \le R} rel(d)


It is easy to see that at rank R, precision and recall have the same value; hence R-precision is also called the "break-even point". Geometrically, it is where the precision-recall curve intersects the line y = x (Figure 3).

Figure 3: A generic precision-recall curve intersected with the line y = x

2.1.2 Average Precision

Since the precision-recall curve fully characterizes the performance of the system (on a given query), we want to express the "goodness" of such a curve. While (by convention) the plot passes through (1,0) and (0,1), a strong indicator of performance is the area under the curve (similar to ROC curves [47]). Instead of writing down the area as an integral, we write it by partitioning recall into intervals of length 1/R:

AP = \frac{1}{R} \sum_{d:\, rel(d)=1} prec@r(d)

We shall call this area measure "Average Precision" (AP), because algebraically it is the average of the precisions at the ranks of relevant documents. Given a ranked list of documents returned in response to a query, the average precision of this list is the average of the precisions at all relevant documents (the precision at an unretrieved relevant document is assumed to be zero), which is approximately the area under the precision-recall curve.
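A direct transcription of this formula into a sketch (again assuming binary judgments in rank order and a known R; unretrieved relevant documents contribute zero simply by never appearing in the sum):

    def average_precision(judgments, R):
        # AP = (1/R) * sum of prec@r(d) over the relevant retrieved documents d
        total, rel_so_far = 0.0, 0
        for i, rel in enumerate(judgments, start=1):   # i is the rank
            if rel:
                rel_so_far += 1
                total += rel_so_far / i                # precision at this relevant rank
        return total / R

    print(average_precision([1, 0, 1, 1, 0, 0, 1, 0, 0, 0], R=6))
    # (1/6) * (1/1 + 2/3 + 3/4 + 4/7), approximately 0.498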

Advantages of Average Precision include:

• sensitive to entire ranking: changing a single rank will change final score

• stable: a small change in ranking makes a relatively small change in score

• has both precision- and recall-oriented factors


• ranks closest to 1 receive largest weight (more about rank weighting in chapter 3)

• computed over all relevant documents

Disadvantage: less easily interpreted in terms of user satisfaction with the system response.

2.1.3 R-precision geometrical correlation with Avg precision

It has been shown that average precision and R-precision are highly correlated [106, 93] and have similar stability in terms of comparing systems using different queries [28], though the reasons for this are not entirely clear. The correlation has been considered quite surprising given the fact that R-precision considers only a single precision point while average precision evaluates the area under the entire precision-recall curve [28].

In this section, we give a geometric argument which shows that, under a very reasonable set of assumptions, average precision and R-precision both approximate the area under the precision-recall curve, thus explaining their high correlation. We further demonstrate, through the use of TREC data, that the similarity or difference between average precision and R-precision is largely governed by the adherence to, or violation of, these reasonable assumptions.

Given a query with R relevant documents, consider the list of documents returned by a retrieval system in response to this query, and let tot_rel(i) be the total number of relevant documents retrieved up to and including rank i. By definition, the R-precision rp of this list is the precision at rank R, rp = tot_rel(R)/R. Furthermore, note that the recall at rank R is also tot_rel(R)/R. Thus, at rank R, the list has both precision and recall equal to rp, and assuming a continuous precision-recall curve, this curve would pass through the point (rp, rp).

Figure 4: Precision-recall curve obtained by connecting the points (0, 1), (rp, rp), (1, 0) with straight lines.

Now consider the area under this piecewise-linear approximation of the actual precision-recall curve, i.e., the shaded area in Figure 4. It can easily be shown that this area is rp by calculating the areas associated with the square (2) and the triangles (1) and (3):

Area = Area_1 + Area_2 + Area_3 = \frac{rp(1-rp)}{2} + rp^2 + \frac{rp(1-rp)}{2} = rp

Figure 6: Actual precision-recall curves versus piecewise-linear approximations passing through the points (0, 1), (rp, rp) and (1, 0), for system fub99a in TREC8: Query 435 (AP = 0.143, RP = 0.248), Query 446 (AP = 0.467, RP = 0.488), and Query 410 (AP = 0.856, RP = 0.708).

Thus, given the facts that (1) average precision is approximately the area under the precision-recall curve, (2) under the assumptions stated, the precision-recall curve can be approximated by a piecewise-linear fit to the points {(0, 1), (rp, rp), (1, 0)}, and (3) the area under this piecewise-linear approximation is rp, we have that R-precision is approximately average precision.

We next consider the assumption that precision-recall curves are piecewise-linear. In reality, precision-recall curves tend not to have a sharp change in slope at the point (rp, rp); rather, they tend to be "smoother" and concave-up for values of rp < 1/2 and concave-down for values of rp > 1/2 (see Figure 6). Thus we expect that R-precision, the area under the piecewise-linear approximation of the actual precision-recall curve, will tend to overestimate average precision when rp < 1/2, and it will tend to underestimate average precision when rp > 1/2. This fact is also illustrated in Figure 6, where the actual precision-recall curves of the system fub99a in TREC8 are compared with piecewise-linear approximations for three different queries, one each where rp < 1/2, rp ≈ 1/2, and rp > 1/2.

This phenomenon is fairly consistent across all the runs in TREC8. In Figure 7 (left), we plot the average precisions and R-precisions for each run submitted to the conference, together with the line y = x for comparison. Note that when R-precision is less than 1/2, it tends to overestimate average precision, and when R-precision is greater than 1/2, it tends to underestimate average precision.

Figure 7: TREC8 average precision versus R-precision for each run (left; ρ = 0.964, τ = 0.865) and for AP buckets of size 0.05 (right).

2.1.4 nDCG

Jarvelin and Kekalainen developed the nDCG measure [53] on the following principle: each document should contribute to the performance score according to its relevance, but discounted according to its rank. If h is the rank discounting function and g is the relevance function, then the Discounted Cumulative Gain (DCG) is

DCG = \sum_d g(rel(d)) \cdot h(r(d))

The most popular functions g and h are the ones used at Microsoft Research, also adopted by many IR researchers, especially in the SIGIR community:

h(r) = \frac{1}{\log(1 + r)}, \quad g(x) = 2^x - 1

which give

DCG = \sum_d \frac{2^{rel(d)} - 1}{\log(1 + r(d))}

Normalized DCG (nDCG) is simply DCG normalized such that the best possible value is 1. The normalization constant Z is the DCG performance of the ideal system (where all R relevant documents are retrieved in the top R ranks, in order of relevance).

nDCG = \frac{1}{Z} \sum_d \frac{2^{rel(d)} - 1}{\log(1 + r(d))}

Contrary to the AP measure, nDCG works naturally with relevance values rel(d) that are not binary 0/1, but continuous or discrete multi-valued. In web search, for example, relevance is often encoded as 0 = "nonrelevant", 1 = "relevant", 2 = "very relevant". Another significant advantage of nDCG is that it is interpretable in terms of the user effort required to satisfy a certain information need, given the list returned by the IR system.
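A small sketch of this computation (our illustration, not the thesis's code), using the discount and gain functions above; the normalization constant Z is obtained by sorting the grades of the relevant documents in decreasing order, and the base of the logarithm cancels in the ratio:

    import math

    def dcg(rels):
        # rels: graded relevance values in rank order (position i = rank i)
        return sum((2**rel - 1) / math.log(1 + i) for i, rel in enumerate(rels, start=1))

    def ndcg(rels, all_relevant_grades):
        # all_relevant_grades: grades of every relevant document for the query,
        # used to build the ideal ranking (the normalization constant Z)
        z = dcg(sorted(all_relevant_grades, reverse=True))
        return dcg(rels) / z if z > 0 else 0.0

    # Hypothetical graded run (grades 0/1/2); the query has relevant grades [2, 2, 1, 1].
    print(ndcg([2, 0, 1, 2, 0], [2, 2, 1, 1]))   # roughly 0.82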

2.1.5 Other measures

F measure. The F measure [96] is the weighted harmonic mean of precision and recall at a given rank r. Thus F_r explicitly accounts for the tradeoff between precision and recall.

F_r = \frac{(\beta^2 + 1) \cdot prec_r \cdot recall_r}{\beta^2 \cdot prec_r + recall_r}


where β is the weighting parameter. A popular choice is β = 1, which gives the straight harmonic mean:

F_r = \frac{2 \cdot prec_r \cdot recall_r}{prec_r + recall_r}
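As a quick numeric check of the formula (hypothetical values, not taken from any experiment): with prec_r = 0.5 and recall_r = 0.25,

F_r = \frac{2 \cdot 0.5 \cdot 0.25}{0.5 + 0.25} = \frac{0.25}{0.75} \approx 0.33,

which, as expected of a harmonic mean, is pulled toward the smaller of the two values.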

Many other measures have been proposed.

Q measure. Sakai [83] introduced the Q-measure, a cumulative-ratio gain measured, like Average Precision, only at the ranks of relevant documents.

Expected/average search length [39]. When retrieval is done in batches (a common older form of presentation) instead of complete rankings, one can measure the user effort to obtain a document by looking at the batch containing the document and calculating a local expectation over the possible ranks the document may have in its batch. This is still relevant to web search, where the cost (effort) of moving to the next page of results is significantly higher than the cost of moving to the next document (on the same page).

Reciprocal rank. When only very few documents are relevant, and especially when there is only one relevant document (such as for queries asking for the home page of a person or institution), only the rank of the first relevant document matters. The reciprocal rank measure is essentially the inverse of the rank of the first relevant document, a design reminiscent of the popular IR heuristic Zipf's Law [114]. Note that, when there is a single relevant document, the reciprocal rank and average precision values are the same.

2.1.6 Performance for a set of queries

Lastly, evaluation of IR systems is usually performed over several queries (typically at least 50). To obtain an overall measure of performance, we average the measurements. If "T" is the preferred measure, then the "Mean T" performance of system s is the average across queries:

MeanT(s) = \frac{1}{N_{QUERIES}} \sum_q T_q(s)

Therefore, "MnDCG" denotes mean nDCG across queries, "Mean Average Precision" (MAP) is the average of Average Precision across queries, and so on. In our exposition, we will mostly use MAP as a ground-truth measurement, and we will report every other measure against it. MAP is by far the most popular measure within the IR research community; however, most commercial engines use some variant of nDCG.
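The averaging step itself is a one-liner; continuing the hypothetical snippets above, with per-query average precision values standing in for T_q(s):

    def mean_measure(per_query_scores):
        # "Mean T": the average of a per-query measure T over all queries
        return sum(per_query_scores) / len(per_query_scores)

    print(mean_measure([0.498, 0.25, 0.7]))   # MAP over three hypothetical queries, about 0.483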


2.2 Maximum Entropy application

How much information is encoded in a single number? Can it be decoded? That is, if a system has been assessed with an average precision of AP = 0.25, what can we infer about its precision-recall curve? We present a model [14], based on the maximum entropy method, for analyzing various measures of retrieval performance such as average precision, R-precision, and precision-at-cutoffs. The methodology treats the value of such a measure as a constraint on the distribution of relevant documents in an unknown list, and the maximum entropy distribution can be determined subject to these constraints. For good measures of overall performance (such as average precision), the resulting maximum entropy distributions are highly correlated with actual distributions of relevant documents in lists, as demonstrated through TREC data; for poor measures of overall performance, the correlation is weaker. As such, the maximum entropy method can be used to quantify the overall quality of a retrieval measure. Furthermore, for good measures of overall performance (such as average precision), we show that the corresponding maximum entropy distributions can be used to accurately infer precision-recall curves and the values of other measures of performance, and we demonstrate that the quality of these inferences far exceeds that predicted by simple retrieval measure correlation, again as demonstrated through TREC data.

The efficacy of retrieval systems is evaluated by a number of performance measures such as average precision, R-precision, and precision at rank. Broadly speaking, these measures can be classified as either system-oriented measures of overall performance (e.g., average precision and R-precision) or user-oriented measures of specific performance (e.g., precision@10) [40, 65, 46]. Different measures evaluate different aspects of retrieval performance, and much thought and analysis has been devoted to analyzing the quality of various performance measures [59, 28, 78].

We begin with the premise that the quality of a list of documents retrieved in response to a given query is strictly a function of the sequence of relevant and non-relevant documents retrieved within that list (as well as R, the total number of relevant documents for the given query). Most standard measures of retrieval performance satisfy this premise. Our thesis is then that given the assessed value of a "good" overall measure of performance, one's uncertainty about the sequence of relevant and non-relevant documents in an unknown list should be greatly reduced. Suppose, for example, one were told that a list of 1,000 documents retrieved in response to a query with 200 total relevant documents contained 100 relevant documents. What could one reasonably infer about the sequence of relevant and non-relevant documents in the unknown list? From this information alone, one could only reasonably conclude that the likelihood of seeing a relevant document at any rank level is uniformly 1/10. Now suppose that one were additionally told that the average precision of the list was 0.4 (the maximum possible in this circumstance is 0.5). Now one could reasonably conclude that the likelihood of seeing relevant documents at low numerical ranks is much greater than the likelihood of seeing relevant documents at high numerical ranks. One's uncertainty about the sequence of relevant and non-relevant documents in the unknown list is greatly reduced as a consequence of the strong constraint that such an average precision places on lists in this situation. Thus, average precision is highly informative. On the other hand, suppose that one were instead told that the precision of the documents in the rank range [100, 110] was 0.4. One's uncertainty about the sequence of relevant and non-relevant documents in the unknown list is not appreciably reduced as a consequence of the relatively weak constraint that such a measurement places on lists. Thus, precision in the range [100, 110] is not a highly informative measure. In what follows, we develop a model within which one can quantify how informative a measure is.

Figure 8: Maximum entropy die distributions with mean die rolls of 3.5, 4.5, and 5.5, respectively.

More specifically, we develop a framework based on the maximum entropy method which allows one to infer the most "reasonable" model for the sequence of relevant and non-relevant documents in a list given a measured constraint. From this model, we show how one can infer the most "reasonable" model for the unknown list's entire precision-recall curve. We demonstrate through the use of TREC data that for "good" overall measures of performance (such as average precision), these inferred precision-recall curves are accurate approximations of actual precision-recall curves; however, for "poor" overall measures of performance, these inferred precision-recall curves do not accurately approximate actual precision-recall curves. Thus, maximum entropy modeling can be used to quantify the quality of a measure of overall performance.

2.2.1 The Maximum Entropy Method

The concept of entropy as a measure of information was first introduced by Shannon [88], and the Principle of Maximum Entropy was introduced by Jaynes [54, 55, 56]. Since its introduction, the Maximum Entropy Method has been applied in many areas of science and technology [109] including natural language processing [21], ambiguity resolution [79], text classification [74], machine learning [76, 77], and information retrieval [50, 60], to name but a few examples. In what follows, we introduce the maximum entropy method through a classic example, and we then describe how the maximum entropy method can be used to evaluate measures of retrieval performance.

Suppose you are given an unknown and possibly biased six-sided die and were asked the probability of obtaining any particular die face in a given roll. What would your answer be? This problem is under-constrained and the most seemingly "reasonable" answer is a uniform distribution over all faces. Suppose now you are also given the information that the average die roll is 3.5. The most seemingly "reasonable" answer is still a uniform distribution. What if you are told that the average die roll is 4.5? There are many distributions over the faces such that the average die roll is 4.5; how can you find the most seemingly "reasonable" distribution? Finally, what would your answer be if you were told that the average die roll is 5.5? Clearly, the belief in getting a 6 increases as the expected value of the die rolls increases. But there are many distributions satisfying this constraint; which distribution would you choose?

The "Maximum Entropy Method" (MEM) dictates the most "reasonable" distribution satisfying the given constraints. The "Principle of Maximal Ignorance" forms the intuition behind the MEM; it states that one should choose the distribution which is least predictable (most random) subject to the given constraints. Jaynes and others have derived numerous entropy concentration theorems which show that the vast majority of all empirical frequency distributions (e.g., those corresponding to sequences of die rolls) satisfying the given constraints have associated empirical probabilities and entropies very close to those probabilities satisfying the constraints whose associated entropy is maximal [54].

Thus, the MEM dictates the most random distribution satisfying the given constraints, using the entropy of the probability distribution as a measure of randomness. The entropy of a probability distribution \vec{p} = \{p_1, p_2, \ldots, p_n\} is a measure of the uncertainty (randomness) inherent in the distribution and is defined as follows

H(\vec{p}) = -\sum_{i=1}^{n} p_i \lg p_i

Thus, maximum entropy distributions are probability distributions making no additional assumptions apart from the given constraints.
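As a quick check of the definition (our example, not one from the text): the uniform distribution over six die faces has

H(\vec{p}) = -\sum_{i=1}^{6} \frac{1}{6} \lg \frac{1}{6} = \lg 6 \approx 2.585 \text{ bits},

which is the maximum possible entropy over six outcomes; adding constraints can only reduce the entropy achievable by the maximizing distribution.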

In addition to its mathematical justification, the MEM tends to produce solutions one often sees in nature. For example, it is known that given the temperature of a gas, the actual distribution of velocities in the gas is the maximum entropy distribution under the temperature constraint.

We can apply the MEM to our die problem as follows. Let the probability distribution over the die faces be \vec{p} = \{p_1, \ldots, p_6\}. Mathematically, finding the maximum entropy distribution over die faces such that the expected die roll is d corresponds to the following optimization problem:

Maximize: H(\vec{p})
Subject to:
1. \sum_{i=1}^{6} p_i = 1
2. \sum_{i=1}^{6} i \cdot p_i = d


The first constraint ensures that the solution forms a distribution over the die faces, and the second constraint ensures that this distribution has the appropriate expectation. This is a constrained optimization problem which can be solved using the method of Lagrange multipliers. Figure 8 shows three different maximum entropy distributions over the die faces such that the expected die roll is 3.5, 4.5, and 5.5, respectively.

Figure 9: Maximum entropy setup for average precision.
Maximize: \sum_{i=1}^{N} H(p_i)
Subject to:
1. \frac{1}{R} \sum_{i=1}^{N} \frac{p_i}{i} \left( 1 + \sum_{j=1}^{i-1} p_j \right) = ap
2. \sum_{i=1}^{N} p_i = Rret

Figure 10: Maximum entropy setup for R-precision.
Maximize: \sum_{i=1}^{N} H(p_i)
Subject to:
1. \frac{1}{R} \sum_{i=1}^{R} p_i = rp
2. \sum_{i=1}^{N} p_i = Rret

Figure 11: Maximum entropy setup for precision-at-cutoff.
Maximize: \sum_{i=1}^{N} H(p_i)
Subject to:
1. \frac{1}{k} \sum_{i=1}^{k} p_i = PC(k)
2. \sum_{i=1}^{N} p_i = Rret
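Before moving to IR measures, here is a minimal numerical sketch of the die example (our illustration, not code from the thesis). For a mean constraint d, the Lagrange-multiplier solution has the exponential form p_i proportional to exp(λ·i), with λ chosen so that the mean equals d; the sketch finds λ by bisection and reproduces the qualitative behavior of Figure 8.

    import math

    def maxent_die(d, faces=6, tol=1e-10):
        # Maximum entropy distribution over faces 1..faces with expected roll d.
        # Solution form: p_i proportional to exp(lam * i); lam = 0 is uniform (mean 3.5).
        def mean_for(lam):
            w = [math.exp(lam * i) for i in range(1, faces + 1)]
            return sum(i * wi for i, wi in enumerate(w, start=1)) / sum(w)

        lo, hi = -50.0, 50.0                      # bracket for the multiplier lambda
        while hi - lo > tol:                      # bisection: mean_for is increasing in lam
            mid = (lo + hi) / 2
            if mean_for(mid) < d:
                lo = mid
            else:
                hi = mid
        lam = (lo + hi) / 2
        w = [math.exp(lam * i) for i in range(1, faces + 1)]
        z = sum(w)
        return [wi / z for wi in w]

    for d in (3.5, 4.5, 5.5):
        p = maxent_die(d)
        entropy = -sum(pi * math.log(pi, 2) for pi in p)
        print(d, [round(pi, 3) for pi in p], round(entropy, 3))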

Maximum entropy for IR measures

Suppose that you were given a list of length $N$ corresponding to the output of a retrieval system for a given query, and suppose that you were asked to predict the probability of seeing any one of the $2^N$ possible patterns of relevant documents in that list. In the absence of any information about the query, any performance information for the system, or any a priori modeling of the behavior of retrieval systems, the most “reasonable” answer you could give would be that all lists of length $N$ are equally likely. Suppose now that you are also given the information that the expected number of relevant documents over all lists of length $N$ is $R_{ret}$. Your “reasonable” answer might then be a uniform distribution over all $\binom{N}{R_{ret}}$ different possible lists with $R_{ret}$ relevant documents.

But what if, apart from the constraint on the number of relevant documents retrieved, you were also given the constraint that the expected value of average precision is $ap$? If the average precision value is high, then of all the $\binom{N}{R_{ret}}$ lists with $R_{ret}$ relevant documents, the lists in which the relevant documents are retrieved at low numerical ranks should have higher probabilities. But how can you determine the most “reasonable” such distribution? The maximum entropy method essentially dictates the most reasonable distribution as the solution to the following constrained optimization problem.

Let $p(r_1, \ldots, r_N)$ be a probability distribution over the relevances associated with document lists of length $N$, let $rel(r_1, \ldots, r_N)$ be the number of relevant documents in a list, and let $ap(r_1, \ldots, r_N)$ be the average precision of a list. Then the maximum entropy method can be mathematically formulated as follows:

Maximize: $H(\vec{p})$

Subject to:

1. $\sum_{r_1, \ldots, r_N} p(r_1, \ldots, r_N) = 1$

2. $\sum_{r_1, \ldots, r_N} ap(r_1, \ldots, r_N) \cdot p(r_1, \ldots, r_N) = ap$

3. $\sum_{r_1, \ldots, r_N} rel(r_1, \ldots, r_N) \cdot p(r_1, \ldots, r_N) = R_{ret}$

Note that the solution to this optimization problem is a distribution over possible lists, where this distribution effectively gives one’s a posteriori belief in any list given the measured constraints.

The previous problem can be formulated in a slightly different manner, yielding another interpretation of the problem and a mathematical solution. Suppose that you were given a list of length $N$ corresponding to the output of a retrieval system for a given query, and suppose that you were asked to predict the probability of seeing a relevant document at some rank. Since there are no constraints, all possible lists of length $N$ are equally likely, and hence the probability of seeing a relevant document at any rank is $1/2$. Suppose now that you are also given the information that the expected number of relevant documents over all lists of length $N$ is $R_{ret}$. The most natural answer would be a uniform probability of $R_{ret}/N$ at each rank. Finally, suppose that you are given the additional constraint that the expected average precision is $ap$. Under the assumption that our distribution over lists is a product distribution (effectively a fairly standard independence assumption), we may solve this problem as follows. Let

$$p(r_1, \ldots, r_N) = p(r_1) \cdot p(r_2) \cdots p(r_N)$$

where $p(r_i)$ is the probability that the document at rank $i$ is relevant. We can then solve the problem of calculating the probability of seeing a relevant document at any rank using the MEM. For notational convenience, we will refer to this product distribution as the probability-at-rank distribution, and to the probability of seeing a relevant document at rank $i$, $p(r_i)$, as $p_i$.

Standard results from information theory [42] dictate that if $p(r_1, \ldots, r_N)$ is a product distribution, then

$$H(p(r_1, \ldots, r_N)) = \sum_{i=1}^{N} H(p_i)$$

where $H(p_i)$ is the binary entropy

$$H(p_i) = -p_i \lg p_i - (1 - p_i) \lg (1 - p_i).$$


Javed Aslam showed the following result. Given a product distribution $p(r_1, \ldots, r_N)$ over the relevances associated with document lists of length $N$, the expected value of average precision is

$$\frac{1}{R} \sum_{i=1}^{N} \frac{p_i}{i}\left(1 + \sum_{j=1}^{i-1} p_j\right). \qquad (1)$$

Proof. For any document list of length $k$ and for all $i$, $1 \le i \le k$, let $x_i \in \{0, 1\}$ denote the relevance of the document at rank $i$. We may then define the sum of the precisions at relevant documents as follows:

$$sp(x_1, \ldots, x_k) = \sum_{i=1}^{k} \frac{x_i}{i} \sum_{j=1}^{i} x_j = \sum_{i=1}^{k} \sum_{j=1}^{i} \frac{x_i x_j}{i}$$

We note that

$$sp(x_1, \ldots, x_k) = \sum_{i=1}^{k-1} \frac{x_i}{i} \sum_{j=1}^{i} x_j + \frac{x_k}{k} \sum_{j=1}^{k} x_j = sp(x_1, \ldots, x_{k-1}) + \frac{x_k}{k} \sum_{j=1}^{k} x_j.$$

As a consequence, we have

$$sp(x_1, \ldots, x_{k-1}, 0) = sp(x_1, \ldots, x_{k-1}) \qquad (2)$$

and

$$sp(x_1, \ldots, x_{k-1}, 1) = sp(x_1, \ldots, x_{k-1}) + \frac{1}{k}\left(1 + \sum_{j=1}^{k-1} x_j\right). \qquad (3)$$

Now let $p(x_1, \ldots, x_n)$ denote a joint distribution over the relevances associated with document lists of length $n$. The expected sum precision is then

$$\sum_{x_1, \ldots, x_n} sp(x_1, \ldots, x_n)\, p(x_1, \ldots, x_n).$$

Now assume that $p(x_1, \ldots, x_n)$ is a product distribution, i.e.,

$$p(x_1, \ldots, x_n) = p_1(x_1) \cdot p_2(x_2) \cdots p_n(x_n).$$

For notational convenience, let $p_i = p_i(1)$ for all $i$. In other words, $p_i$ is the probability that the document at rank $i$ is relevant.


The expected sum precision can be calculated as follows.

$$
\begin{aligned}
\sum_{x_1,\ldots,x_n} sp(x_1,\ldots,x_n)\, p(x_1,\ldots,x_n)
&= \sum_{x_1,\ldots,x_n} sp(x_1,\ldots,x_n)\, p(x_1,\ldots,x_{n-1}) \cdot p_n(x_n) \\
&= \sum_{x_1,\ldots,x_{n-1}} sp(x_1,\ldots,x_{n-1},0)\, p(x_1,\ldots,x_{n-1}) \cdot p_n(0)
 + \sum_{x_1,\ldots,x_{n-1}} sp(x_1,\ldots,x_{n-1},1)\, p(x_1,\ldots,x_{n-1}) \cdot p_n(1) \\
&= \sum_{x_1,\ldots,x_{n-1}} sp(x_1,\ldots,x_{n-1})\, p(x_1,\ldots,x_{n-1}) \cdot (1-p_n)
 + \sum_{x_1,\ldots,x_{n-1}} \left( sp(x_1,\ldots,x_{n-1}) + \frac{1}{n}\Big(1+\sum_{j=1}^{n-1} x_j\Big) \right) p(x_1,\ldots,x_{n-1}) \cdot p_n \\
&= \sum_{x_1,\ldots,x_{n-1}} sp(x_1,\ldots,x_{n-1})\, p(x_1,\ldots,x_{n-1}) \cdot (1-p_n)
 + \sum_{x_1,\ldots,x_{n-1}} sp(x_1,\ldots,x_{n-1})\, p(x_1,\ldots,x_{n-1}) \cdot p_n
 + \sum_{x_1,\ldots,x_{n-1}} \frac{1}{n}\Big(1+\sum_{j=1}^{n-1} x_j\Big)\, p(x_1,\ldots,x_{n-1}) \cdot p_n \\
&= \sum_{x_1,\ldots,x_{n-1}} sp(x_1,\ldots,x_{n-1})\, p(x_1,\ldots,x_{n-1})
 + \frac{p_n}{n}\left(1+\sum_{j=1}^{n-1} \sum_{x_1,\ldots,x_{n-1}} x_j\, p(x_1,\ldots,x_{n-1})\right) \\
&= \sum_{x_1,\ldots,x_{n-1}} sp(x_1,\ldots,x_{n-1})\, p(x_1,\ldots,x_{n-1})
 + \frac{p_n}{n}\left(1+\sum_{j=1}^{n-1} p_j\right)
\end{aligned}
$$

Thus, we have a recurrence for the expected sum precision. Iterating this recurrence, we obtain

$$\sum_{x_1, \ldots, x_n} sp(x_1, \ldots, x_n)\, p(x_1, \ldots, x_n) = \sum_{i=1}^{n} \frac{p_i}{i}\left(1 + \sum_{j=1}^{i-1} p_j\right).$$

Since the average precision is the sum precision divided by the constant $R$, we have that the expected average precision is

$$\frac{1}{R} \sum_{i=1}^{n} \frac{p_i}{i}\left(1 + \sum_{j=1}^{i-1} p_j\right),$$

which ends the proof.


Figure 12: Probability-at-rank distributions for TREC8 system fub99a on query 435 (AP = 0.1433): the maximum entropy distributions corresponding to AP, RP, and PC-10, plotted as probability of relevance against rank.

Since $p_i$ is the probability of seeing a relevant document at rank $i$, the expected number of relevant documents retrieved up to rank $N$ is $\sum_{i=1}^{N} p_i$.

Now, if one were given some list of length $N$, one were told that the expected number of relevant documents is $R_{ret}$, one were further informed that the expected average precision is $ap$, and one were asked the probability of seeing a relevant document at any rank under the independence assumption stated, one could apply the MEM as shown in Figure 9. Note that one now solves for the maximum entropy product distribution over lists, which is equivalent to a maximum entropy probability-at-rank distribution. Applying the same ideas to R-precision and precision-at-cutoff k, one obtains analogous formulations as shown in Figures 10 and 11, respectively.
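For reference, Equation (1) can be evaluated directly from a probability-at-rank vector; the short sketch below is an illustration (not code from the thesis) and assumes a Python list p of per-rank relevance probabilities and a known R:

```python
def expected_average_precision(p, R):
    """Expected AP under a product (probability-at-rank) distribution,
    following Equation (1): (1/R) * sum_i (p_i / i) * (1 + sum_{j<i} p_j)."""
    total, prefix = 0.0, 0.0   # prefix holds p_1 + ... + p_{i-1}
    for i, p_i in enumerate(p, start=1):
        total += (p_i / i) * (1.0 + prefix)
        prefix += p_i
    return total / R
```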

All of these formulations are constrained optimization problems, and the method of Lagrange multipliers can, in principle, be used to find an analytical solution. When analytical solutions cannot be determined, numerical optimization methods can be employed. The maximum entropy distributions for R-precision and precision-at-cutoff k can be obtained analytically using the method of Lagrange multipliers. However, numerical optimization methods are required to determine the maximum entropy distribution for average precision. In Figure 12, examples of maximum entropy probability-at-rank curves corresponding to the measures average precision, R-precision, and precision-at-cutoff 10 for a run in TREC8 can be seen. Note that the probability-at-rank curves are step functions for the precision-at-cutoff and R-precision constraints; this is as expected since, for example, given a precision-at-cutoff 10 of 0.3, one can only reasonably conclude a uniform probability of 0.3 of seeing a relevant document at any of the first 10 ranks. Note, however, that the probability-at-rank curve corresponding to average precision is smooth and strictly decreasing.

Using the maximum entropy probability-at-rank distribution of a list, we can infer the maximum entropy precision-recall curve for the list. Given a probability-at-rank distribution $\vec{p}$, the expected number of relevant documents retrieved up to rank $i$ is $REL(i) = \sum_{j=1}^{i} p_j$. Therefore, the precision and recall at rank $i$ are $prec@i = REL(i)/i$ and $recall@i = REL(i)/R$. Hence, using the maximum entropy probability-at-rank distribution for each measure, we can generate the maximum entropy precision-recall curve of the list. If a measure provides a great deal of information about the underlying list, then the maximum entropy precision-recall curve should approximate the precision-recall curve of the actual list. However, if a measure is not particularly informative, then the maximum entropy precision-recall curve need not approximate the actual precision-recall curve. Therefore, by noting how closely the maximum entropy precision-recall curve corresponding to a measure approximates the precision-recall curve of the actual list, we can assess how much information a measure contains about the actual list, and hence how “informative” a measure is. Thus, we have a methodology for evaluating the evaluation measures themselves.
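A hypothetical helper (not from the thesis) that turns a probability-at-rank distribution into the implied precision-recall points, exactly as described above:

```python
def maxent_precision_recall(p, R):
    """Precision and recall at each rank implied by a probability-at-rank
    distribution p, using REL(i) = sum_{j<=i} p_j."""
    rel, curve = 0.0, []
    for i, p_i in enumerate(p, start=1):
        rel += p_i
        curve.append((rel / R, rel / i))   # (recall@i, precision@i)
    return curve
```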

Using the maximum entropy precision-recall curve of a measure, we can also predict the values of other measures. For example, using the maximum entropy precision-recall curve corresponding to average precision, we can predict the precision-at-cutoff 10. For highly informative measures, these predictions should be very close to reality. Hence, we have a second way of evaluating evaluation measures.

Maximum entropy application results. We tested the performance of the evaluation measures average precision, R-precision, and precision-at-cutoffs 5, 10, 15, 20, 30, 100, 200, 500 and 1000 using data from TRECs 3, 5, 6, 7, 8 and 9. For any TREC and any query, we chose those systems whose number of relevant documents retrieved was at least 10 in order to have a sufficient number of points on the precision-recall curve. We then calculated the maximum entropy precision-recall curve subject to the given measured constraint, as described above. The maximum entropy precision-recall curve corresponding to an average precision constraint cannot be determined analytically; therefore, we used numerical optimization² to find the maximum entropy distribution corresponding to average precision.

We shall refer to the execution of a retrieval system on a particular query as a run. Figure 13 shows examples of maximum entropy precision-recall curves corresponding to average precision, R-precision, and precision-at-cutoff 10 for three different runs, together with the actual precision-recall curves. We focused on these three measures since they are perhaps the most commonly cited measures in IR. We also provide results for precision-at-cutoff 100 in later plots and detailed results for all measures in a later table. As can be seen in Figure 13, using average precision as a constraint, one can generate the actual precision-recall curve of a run with relatively high accuracy.

²We used the TOMLAB Optimization Environment for Matlab.


Figure 13: Inferred precision-recall curves and the actual precision-recall curve for three runs in TREC8 (fub99a on query 435, AP = 0.1433; MITSLStd on query 404, AP = 0.2305; pir9At0 on query 446, AP = 0.4754). Each panel plots precision against recall for the actual curve and the maximum entropy curves inferred from AP, RP, and PC-10.

Figure 14: MAE and RMS errors for inferred precision-recall curves over all TRECs (3, 5, 6, 7, 8, 9), for the maximum entropy curves derived from AP, RP, PC-10, and PC-100.

In order to quantify how good an evaluation measure is at generating the precision-recall curve of an actual list, we consider two different error measures: the root mean squared error (RMS) and the mean absolute error (MAE). Let $\{\pi_1, \pi_2, \ldots, \pi_{R_{ret}}\}$ be the precisions at the recall levels $\{1/R, 2/R, \ldots, R_{ret}/R\}$, where $R_{ret}$ is the number of relevant documents retrieved by a system and $R$ is the number of documents relevant to the query, and let $\{m_1, m_2, \ldots, m_{R_{ret}}\}$ be the estimated precisions at the corresponding recall levels for a maximum entropy distribution corresponding to a measure. Then the MAE and RMS errors are calculated as follows.

$$RMS = \sqrt{\frac{1}{R_{ret}} \sum_{i=1}^{R_{ret}} (\pi_i - m_i)^2} \qquad\qquad MAE = \frac{1}{R_{ret}} \sum_{i=1}^{R_{ret}} |\pi_i - m_i|$$

The points after recall $R_{ret}/R$ on the precision-recall curve are not considered in the evaluation of the MAE and RMS errors since, by TREC convention, the precisions at these recall levels are assumed to be 0.
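As a concrete sketch (an assumed helper, not the thesis code), the two errors can be computed from the actual precisions and the maxent-inferred precisions at the first $R_{ret}$ recall levels:

```python
import math

def mae_rms(actual, inferred):
    """MAE and RMS error between actual precisions (pi_i) and inferred
    precisions (m_i) at the recall levels 1/R ... R_ret/R."""
    diffs = [a - m for a, m in zip(actual, inferred)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rms
```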

In order to evaluate how good a measure is at inferring actual precision-recall curves, we calculated the MAE and RMS errors of the maximum entropy precision-recall curves corresponding to the measures in question, averaged over all runs for each TREC.


           TREC3    TREC5    TREC6    TREC7    TREC8    TREC9    AVERAGE   %INC
AP         0.1185   0.1220   0.1191   0.1299   0.1390   0.1505   0.1298    -
RP         0.1767   0.1711   0.1877   0.2016   0.1878   0.1630   0.1813    39.7
PC-5       0.2724   0.2242   0.2451   0.2639   0.2651   0.2029   0.2456    89.2
PC-10      0.2474   0.2029   0.2183   0.2321   0.2318   0.1851   0.2196    69.1
PC-15      0.2320   0.1890   0.2063   0.2132   0.2137   0.1747   0.2048    57.8
PC-20      0.2210   0.1806   0.2005   0.2020   0.2068   0.1701   0.1968    51.6
PC-30      0.2051   0.1711   0.1950   0.1946   0.2032   0.1694   0.1897    46.1
PC-100     0.1787   0.1777   0.2084   0.2239   0.2222   0.1849   0.1993    53.5
PC-200     0.1976   0.2053   0.2435   0.2576   0.2548   0.2057   0.2274    75.2
PC-500     0.2641   0.2488   0.2884   0.3042   0.3027   0.2400   0.2747    111.6
PC-1000    0.3164   0.2763   0.3134   0.3313   0.3323   0.2608   0.3051    135.0

Table 1: RMS error values for each TREC.

        TREC3                    TREC5                    TREC6
        RP     PC-10  PC-100     RP     PC-10  PC-100     RP     PC-10  PC-100
τ_act   0.921  0.815  0.833      0.939  0.762  0.868      0.913  0.671  0.807
τ_inf   0.941  0.863  0.954      0.948  0.870  0.941      0.927  0.871  0.955
%Inc    2.2    5.9    14.5       1.0    14.2   8.4        1.5    29.8   18.3

        TREC7                    TREC8                    TREC9
        RP     PC-10  PC-100     RP     PC-10  PC-100     RP     PC-10  PC-100
τ_act   0.917  0.745  0.891      0.925  0.818  0.873      0.903  0.622  0.836
τ_inf   0.934  0.877  0.926      0.932  0.859  0.944      0.908  0.757  0.881
%Inc    1.9    17.7   3.9        0.8    5.0    8.1        0.6    21.7   5.4

Table 2: Kendall’s τ correlations and percent improvements for all TRECs.

Figure 14 shows how the MAE and RMS errors for average precision, R-precision, precision-at-cutoff 10, and precision-at-cutoff 100 compare with each other for each TREC. The MAE and RMS errors follow the same pattern over all TRECs. Both errors are consistently and significantly lower for average precision than for the other measures in question, while the errors for R-precision are consistently lower than those for precision-at-cutoffs 10 and 100.

Table 1 shows the actual values of the RMS errors for all measures over all TRECs. In our experiments, MAE and RMS errors follow a very similar pattern, and we therefore omit MAE results due to space considerations. From this table, it can be seen that average precision has consistently lower RMS errors when compared to the other measures. The penultimate column of the table shows the average RMS errors per measure averaged over all TRECs. On average, R-precision has the second lowest RMS error after average precision, and precision-at-cutoff 30 is the third best measure in terms of RMS error. The last column of the table shows the percent increase in the average RMS error of a measure when compared to the RMS error of average precision.


Figure 15: Correlation improvements, TREC8. Scatterplots of actual RP, PC-10, and PC-100 against actual AP (Kendall’s τ = 0.925, 0.818, and 0.873, respectively), and of actual RP, PC-10, and PC-100 against the values inferred from AP (Kendall’s τ = 0.932, 0.859, and 0.944, respectively).

As can be seen, the average RMS errors for the other measures are substantially greater than the average RMS error for average precision.


2.3 TREC Pooling and System Evaluation

Collections of retrieval systems are traditionally evaluated by (1) constructing a test collection of documents (the “corpus”), (2) constructing a test collection of queries (the “topics”), (3) judging the relevance of the documents to each query (the “relevance judgments”), and (4) assessing the quality of the ranked lists of documents returned by each retrieval system for each topic using standard measures of performance such as mean average precision. Much thought and research has been devoted to each of these steps in the annual TREC [51].

For large collections of documents and/or topics, it is impractical to assess the relevance of each document to each topic. Instead, a small subset of the documents is chosen, and the relevance of these documents to the topics is assessed. When evaluating the performance of a collection of retrieval systems, as in the TREC conference, this judged “pool” of documents is typically constructed by considering only documents ranked high by each system in response to a given query.

TREC employs depth pooling, wherein the union of the top k documents retrieved in each run corresponding to a given query is formed, and the documents in this depth-k pool are judged for relevance with respect to this query. In TREC, k = 100 has been shown to be an effective cutoff in evaluating the relative performance of retrieval systems [51, 115], and while the depth 100 pool is considerably smaller than the document collection, it still engenders a large assessment effort: in TREC8, for example, 86,830 relevance judgments were used to assess the quality of the retrieved lists corresponding to 129 system runs in response to 50 topics [106]. The table given below shows the relationship between pool depth and the number of judgments required per topic on average for various TRECs.
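For concreteness, a minimal sketch of depth-k pooling for a single query (illustrative only, not TREC's actual tooling):

```python
def depth_pool(runs, k):
    """TREC-style depth-k pool for one query: the union of the top-k
    documents (by rank) over all submitted runs; `runs` is a list of ranked
    lists of document ids."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:k])
    return pool
```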

While many of the top documents are retrieved by multiple systems, thus reducing the overall size of the pool, the total number of relevance judgments is still substantial. Reducing the number of relevance judgments required would permit competitions such as TREC to scale well in the future, as well as more easily permit the assessment of large numbers of systems over vast, changing data collections such as the World Wide Web.

Both shallower and deeper pools have been studied [115, 51], both for TREC and within the greater context of the generation of large test collections [41]. Pooling is an effective technique since many of the documents relevant to a topic will appear near the top of the lists returned by (quality) retrieval systems; thus, these relevant documents will be judged and used to effectively assess the performance of the collected systems.

Pools are often used to evaluate retrieval systems in the following manner. The documents within a pool are judged to determine whether they are relevant or not relevant to the given user query or topic. Documents not contained within the pool are assumed to be non-relevant. The ranked lists returned by the retrieval systems are then evaluated using standard measures of performance (such as mean average precision) using this “complete” set of relevance judgments. Since documents not present in the pool are assumed non-relevant, the quality of the assessments produced by such a pool is often in direct proportion to the fraction of relevant documents found in the pool (its recall).


Pool depth   TREC3     TREC5     TREC6     TREC7     TREC8
             (n=40)    (n=82)    (n=79)    (n=103)   (n=129)
    1          19        38        38        32        40
    2          39        68        67        55        69
    3          47        98        95        76        95
    4          60        126       120       95        119
    5          73        153       146       114       144
    6          85        181       172       134       167
    7          96        208       197       152       191
    8          107       234       221       170       215
    9          118       262       246       189       238
   10          129       288       271       207       260
   15          183       418       393       297       379
   20          235       543       513       389       494
   30          336       791       743       571       717
   40          436       1034      969       754       939
   50          531       1273      1191      936       1155
   60          626       1509      1410      1114      1366
   70          718       1745      1629      1299      1574
   80          811       1978      1845      1486      1777
   90          903       2206      2058      1675      1978
  100          995       2434      2271      1860      2176

Table 3: The size of the pool (per query) for various pool depths if the pooling is performed TREC-style. Here n is the number of input systems in the given data set.

On-line pooling techniques have been proposed which attempt to identify relevant documents as quickly as possible in order to exploit this phenomenon [41].

TREC implements and offers for download its evaluation tool, trec_eval. The program takes as input any list of results together with a judgments file or “qrel”³ and outputs many measures of performance, including the number of relevant documents retrieved, average precision, R-precision, prec@rank for various ranks, infAP, and their averages across queries.

2.4 Evaluation experimentation

In the context of IR evaluation research, we often want to test a certain evaluation methodology.

³Input files must be TREC formatted.


To do so, we usually obtain an evaluation score for each IR system involved; we then compare this set of scores with a previous, trusted set of scores for the same systems. We can either compare the two sets of scores directly, or obtain two rankings based on the scores and compare those instead. We might be interested in answering questions like:

• Is the top ranked system the same in both evaluations?

• Are the top 10 systems ranked the same? Or, more generally, how close are the two overall rankings?

• How close are the values in the two sets on a per-system basis?

Figure 16: A generic scatterplot of estimated values (y-axis) against actual values (x-axis).

Let $(a_1, a_2, \ldots, a_N)$ be the actual values and $(e_1, e_2, \ldots, e_N)$ be the estimated values. We compared the estimates obtained by the estimating methods (such as the ones discussed in chapters 4 and 6) with the “actual” evaluations, i.e., evaluations obtained using TREC judgment files. Figure 16 shows a generic scatterplot of the estimated values vs. the actual values: each dot represents a system measured both by “actual value” and by “estimated value”. Most often in this thesis, the dots represent IR systems measured in two ways: the x-axis is the TREC measurement while the y-axis is an estimate.

To evaluate the quality of the estimates, we use three different statistics: the root mean squared (RMS) error (how different the estimated values are from the actual values), the linear correlation coefficient ρ (how well the actual and estimated values fit a straight line), and Kendall’s τ (how well the estimated measures rank the systems compared to the actual rankings).

Kendall’s τ

The weakest statistic is based solely on ranks: Kendall’s τ [61] orders the two sets of values separately and measures the distance between the two rankings as the number of inversions present in one relative to the other. It is normalized so that the result lies between −1 (completely opposite rankings) and 1 (identical rankings):

$$\tau = \frac{1}{Z} \sum_{i < j} \operatorname{sign}\big((a_i - a_j)(e_i - e_j)\big), \qquad Z = \frac{N(N-1)}{2},$$

where $Z$, the number of pairs, acts as a normalization factor.

Note that Kendall’s τ depends solely on the rankings obtained from the values; the actual values do not matter. Even with a Kendall’s τ of 1, there is no guarantee that the scores are on the same linear scale, and there is no indication of whether the two sets of values are close or not.
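A direct O(N²) sketch of this pairwise definition (an illustration; library routines such as scipy.stats.kendalltau provide equivalent, tie-adjusted implementations):

```python
def sign(x):
    return (x > 0) - (x < 0)

def kendall_tau(a, e):
    """Kendall's tau between actual scores a and estimated scores e,
    computed from pairwise sign agreement over all N(N-1)/2 pairs."""
    n = len(a)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += sign((a[i] - a[j]) * (e[i] - e[j]))
    return s / (n * (n - 1) / 2)
```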

Linear Correlation Coefficient ρ

The linear correlation coefficient is a step stronger than Kendall’s τ, because it measures whether the values of the two sets are linearly correlated. Note that a high linear correlation implies a high ranking correlation (high Kendall’s τ) but not vice versa. It still does not give any information on the actual error per value.

$$\rho = \frac{\sigma_{ae}}{\sigma_a \sigma_e}$$

where $\sigma_a$ and $\sigma_e$ are the standard deviations estimated separately from the values $(a_1, a_2, \ldots, a_N)$ and $(e_1, e_2, \ldots, e_N)$ respectively, and $\sigma_{ae}$ is the estimated covariance.

RMS Error

RMS error measures the deviation of the estimated values from the actual values. Hence, it is related to the standard deviation of the estimation error. The RMS error of the estimation can be calculated as

$$RMS = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (a_i - e_i)^2}$$

Note that, in contrast to the RMS error, Kendall’s τ and ρ do not measure how much the estimated values differ from the actual values. Therefore, even if they indicate perfectly correlated estimated and actual values, the estimates may still not be accurate. Hence, it is much harder to achieve small RMS errors than to achieve high τ or ρ values.
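The two remaining statistics are short computations with numpy (a sketch under the definitions above, not the thesis's own code):

```python
import numpy as np

def rho_and_rms(a, e):
    """Linear correlation coefficient and RMS error between actual scores a
    and estimated scores e."""
    a, e = np.asarray(a, float), np.asarray(e, float)
    rho = np.corrcoef(a, e)[0, 1]             # sigma_ae / (sigma_a * sigma_e)
    rms = float(np.sqrt(np.mean((a - e) ** 2)))
    return rho, rms
```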

2.5 Evaluation with incomplete judgments

Standard methods of retrieval evaluation can be quite expensive when conducted on a large scale. Shallower pools [115] and greedily chosen dynamic pools [41, 10] have also been studied in an attempt to alleviate the assessment effort; however, such techniques tend to produce biased estimates of standard retrieval measures, especially when relatively few relevance judgments are used. For example, in many years of the annual Text REtrieval Conference (TREC), upwards of 100 runs, each consisting of a ranked list of up to 1,000 documents, were submitted with respect to each of 50 or more topics. In principle, each document in the collection would need to be assessed for relevance with respect to each query in order to evaluate many standard retrieval measures such as average precision and R-precision; in practice, this is prohibitively expensive.

bpref. Recently, Buckley and Voorhees proposed a new measure, bpref, and they show that when bpref is employed with a judged sample drawn uniformly from the depth 100 pool, the retrieval systems can be effectively ranked [29]:

$$bpref = \frac{1}{R} \sum_{r} \left(1 - \frac{n_r}{R}\right)$$

where the sum runs over the judged relevant documents retrieved and $n_r$ is the number of non-relevant documents ranked higher than rank $r$. Naturally, bpref can simply ignore the non-judged documents. The authors present some variants of the metric based on collection tuning, and provide a large set of experimental results. Recently it has been shown that infAP is more robust than bpref in dealing with incomplete judgments [110].
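A sketch of the formula as written above (illustrative only, not the official implementation): unjudged documents are skipped, and $n_r$ counts the judged non-relevant documents seen above each judged relevant document.

```python
def bpref(ranked_docs, judgments):
    """bpref from a ranked list and a dict doc -> 0/1 of judged relevances;
    unjudged documents are ignored, as the measure allows."""
    R = sum(judgments.values())
    if R == 0:
        return 0.0
    nonrel_above, total = 0, 0.0
    for doc in ranked_docs:
        if doc not in judgments:
            continue                      # skip non-judged documents
        if judgments[doc]:
            # contribution 1 - n_r/R, capped so it never goes negative
            # (an assumption; the text states the uncapped form)
            total += 1.0 - min(nonrel_above, R) / R
        else:
            nonrel_above += 1
    return total / R
```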

However, bpref is not designed to approximate any standard measure of retrieval performance, and while bpref is correlated with average precision, especially at high sampling rates, these correlations and the system rankings produced degenerate at low sampling rates.

Inferring judgments. Aslam, Yilmaz et al. [13] proposed a framework for dealing with ranking-based constraints per document. Given a set of systems and an estimate of a performance measure such as AP for each system, an over-constrained optimization problem is solved: the constraints are the ranks of the documents in each system; the objective is a least squares error targeting the estimated performance values; and the variables are the probabilities of relevance of each document. Yilmaz et al. show that, even with loose estimates of performance as input, the “inferred judgments” obtained (the solution of the optimization problem) are quite accurate.

Evaluation without relevance judgments. Taking this problem of not having relevance judgments to its logical extreme, Soboroff, Nicholas, and Cahan [89] proposed a methodology for ranking retrieval systems in the absence of relevance judgments. They pool the top documents retrieved by systems in response to a given query and then assign a random subset of this pool to be relevant, in proportion to the expected number of relevant documents to be found in such a pool. They propose a number of variants of this basic system, including deep and shallow pools, pools with and without duplicates, and random relevant subset sizes chosen to match the expected number of relevant pool documents on a per-query basis or on an average basis across the queries. Note that the choice of the size of this “random relevant subset” requires some training data, and far greater training is required when this choice is not fixed but changes on a per-query basis.

Given the randomly generated relevance judgments described above, Soboroff et al. then evaluate the ranked lists returned by each retrieval system against these judgments and rank the retrieval systems in order of relative performance. Performing this technique over the systems submitted in the TREC 3, 5, 6, 7, and 8 competitions, they achieve system rankings which correlate favorably with the actual TREC rankings using relevance judgments. As measured by Kendall’s τ, a standard measure for assessing the correlation of two ranked lists which ranges from −1 (perfectly anti-correlated) to +1 (perfectly correlated), the best variant of their system achieves performances of 0.482, 0.603, 0.576, 0.441, and 0.534 on TREC 3, 5, 6, 7, and 8, respectively. One reported failing of this technique is that it tends to consistently underestimate the ranking of the top retrieval systems.

Filtering out non-judged documents for standard measures. T. Sakai [84] argues that, instead of bpref, we are better off using standard measures (AP, prec@rank, R-precision), with the caveat that in their computation we should simply ignore the non-judged documents. In fact, [84] shows that bpref is simply AP computed by filtering out non-judged documents, except normalized differently. While his arguments are theoretically sound (we can evaluate only with the documents that are judged), in practice if a system returns a lot of unjudged documents, chances are the system performance is poor. The main point of T. Sakai is perhaps that bpref is an unnecessary metric, because we can get roughly the same performance measurement by running a slightly modified AP.

Advanced techniques. In this thesis we focus on advanced techniques for evaluation with incomplete judgments. Instead of using whatever set of judgments happens to be available (although we can use any judgments), we carefully design pooling strategies that we show help tremendously in evaluation. With more complex pooling strategies, the evaluation of performance metrics such as AP is also more complicated; for example, with sampling (chapter 4), filtering out the non-judged documents and computing AP would certainly lead nowhere.

Closely related to the sampling techniques of chapter 4, Yilmaz et al. [110] proposed a method based on random, uniform sampling. We discuss infAP in chapter 4 and make some comparisons with our own sampling methodology.

Ben Carterette et al. [33] developed a technology, “MTC”, for AP evaluation focused on system ranking. The algorithm essentially selects, at each step, the document that, if relevant, would make the biggest difference among the systems’ performances. The MTC method is discussed in chapter 5.

2.5.1 Confidence in the evaluation with incomplete judgments

While TREC-style depth pools are used to evaluate the systems from which they were generated, they are also often used to evaluate new runs which did not originally contribute to the pool. Depth 100 TREC-style pools generalize well in the sense that they can be used to effectively evaluate new runs (“reusability”), and this has been a tremendous boon to researchers who use TREC data.

We show that the pools produced by our techniques described in chapters 4 and 6 generalize well, achieving estimation errors on unseen runs comparable to the estimation errors on the runs from which the sample was drawn.

One of the main contributions of this thesis is the principled development of confidence, based on well-grounded statistics. The sampling methodology, particularly sampling without replacement (chapter 4), comes with a variance estimator which we turn into a confidence; we show that this confidence can be used both to qualify the estimate and to warn of cases where the estimate is unreliable, such as when there are not enough judged documents to properly assess the quality of a given system.


Chapter 3

Relevance priors

Most retrieved information is returned as a ranked list (think of web search, for example). A central concept in many aspects of handling ranked lists is the relative importance of the ranks. Certainly the first rank is the most important, then the second, and so on, so we expect a monotonic decrease of importance; but exactly how important is the second ranked document relative to the first? This concept comes up heavily in evaluation, with countless models of importance, but also in metasearch, system optimization, user experience, web search, direct system comparison, etc.

Relatedly, many measures exist for assessing the “distance” between two ranked lists, such as Kendall’s τ and the Spearman rank correlation coefficient. However, these measures do not distinguish differences at the “top” of the lists from equivalent differences at the “bottom” of the lists, whereas in the context of information retrieval two ranked lists would be considered much more dissimilar if their differences occurred at the “top” rather than the “bottom”. To capture this notion, one can focus on the top retrieved documents only. For example, Yom-Tov et al. [112] compute the overlap (size of intersection) among the top Z documents in each of two lists. Effectively, the overlap statistic places a uniform 1/Z “importance” on each of the top Z documents and a zero importance on all other documents.

Figure 17, taken from [87], shows the actual probability of relevance at each rank (averaged over all data) for various TREC collections.

3.1 From ranked list to distribution

More natural still, in the context of information retrieval, would be weights which are higher at top ranks and smoothly lower at lesser ranks. Recently, we proposed such weights [12, 11], which correspond to the implicit weights that the average precision measure places on each rank.


Figure 17: Actual probabilities of relevance at rank for TREC collections.

3.1.1 Measure-based rank weighting scheme

Most IR measures for ranked lists induce a “rank importance”, formally a rank weighting scheme, when assessing the quality of a given list. For example, precision at cutoff c = 10 takes into account only the top 10 ranks with uniform weighting; therefore it induces equal weights on those 10 ranks and weight 0 for the rest.

Consider now any measure used for evaluating a ranked list of documents. If we write the mathematical formula that produces the measurement, we can try to view this formula as a weighted average ($Z$ = length of the list measured):

$$\text{measurement} = \sum_{i=1}^{Z} w_i \cdot r_i$$

where the vector $w$ is the “rank importance” and $r$ has to do with the relevance judgments at the ranks. This may be easy for some measures (RP) but is definitely challenging for measures that operate not on individual ranks but rather on pairs of ranks or other subsets (AP).

Precision@rank. Let us fix a cutoff $c$ and denote by PC the “precision at cutoff $c$”. Because no recall is involved, it is easy to see that PC essentially puts a uniform distribution of weights over the top $c$ ranks and 0 for all other ranks in the list.

R-precision. RP is very similar to PC (with $c = R$ = the number of relevant documents for the query), except that $R$ is unknown as long as no relevance judgment is revealed. Here we assume that $R$ is “magically” given; that is because, as stated above, our purpose is to investigate the measures and the correlations between them; in an implementation for metasearch purposes, $R$ needs to be estimated or guessed from the data. There is a second fundamental difference between RP and PC. In standard IR settings, like TREC, performance is usually averaged over many (e.g., 50) queries. In that case $c$, the PC cutoff level, is a constant over queries while $R$ varies, making the RP cutoff appropriate for each query.

nDCG discounting function. The nDCG measure was designed [53] specifically as a dot product between a relevance function and a discount function, essentially weighting each document’s relevance by the discount of its rank. nDCG therefore naturally induces a prior over the ranks (or over the documents at those ranks):

$$w(d) \propto \frac{1}{\log(1 + r(d))}$$

Reciprocal rank. The reciprocal rank measure [101] puts on each document a weight inversely proportional to its rank. Although reciprocal rank refers only to the first relevant document found, it is straightforward to develop a prior of relevance based on an inverse-rank formulation.

Robert Savell [87], also working with the Hedge algorithm (described in chapters 6 and 7) for pooling and metasearch, studied in depth priors of relevance based on inverse rank.

$$P(d) = p_s \cdot \frac{1}{1 + c_t\,(r(d) - 1)}$$

where $p_s$ is a system parameter (usually obtained from previous training) and $c_t$ is a collection parameter.
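A one-line sketch of this prior (parameter names $p_s$ and $c_t$ as above; an illustration, not code from [87]):

```python
def inverse_rank_prior(rank, p_s, c_t):
    """Inverse-rank prior of relevance: P(d) = p_s / (1 + c_t * (rank - 1))."""
    return p_s / (1.0 + c_t * (rank - 1))
```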

R. Savell has shown [87] that the inverse-rank prior satisfies a number of desirable theoretical and practical properties, including:

• the distribution based on the inverse rank function is a generalization of a scale invariant or universal distribution;

• the inverse rank weighting class is, in fact, unique in possessing the scale invariance property;

• preservation of a consistent Bayesian interpretation of the rank weightings on individual instances.


Figure 18, taken from [87], shows empirically that the inverse-rank prior can match the actual probability of relevance extremely well, after appropriate tuning.

Figure 18: Prior of relevance based on inverse rank

3.2 Average Precision prior

The AP formula is significantly more difficult to view as a weighted average across ranks because it operates on pairs of documents (at ranks). AP is the average of the precisions at the ranks of relevant documents retrieved, counting 0 for non-retrieved relevant documents:

$$AP = \frac{1}{R} \sum_{d:\, rel(d)=1} prec@r(d)$$

For the purpose of developing a prior, we shall ignore $R$ and define “Sum Precision” ($SP$) as the numerator of Average Precision:

$$SP = \sum_{d:\, rel(d)=1} prec@r(d)$$

Since the number of relevant documents $R$ is a global constant (per query), $SP$ is just as good as $AP$ for evaluating systems for one query. On multiple queries, when computing $MAP$, the $R$ factor of course acts as a normalization (different queries have different $R$’s) and it cannot be discarded.


3.2.1 Sum precision as an expected value

Unlike R-precision or precision at standard cutoffs, deriving a sampling distribution for average precision is non-trivial; however, it yields a sampling distribution which is quite useful not only for AP estimation, but also for estimating the other measures of interest.

Next we are going to justify the above prior of relevance as a weighting scheme naturally imposed on a ranked list by the average precision measure. Let $SP$ be this sum of precisions at all relevant documents. In order to view this sum as an expectation, we define an event space corresponding to pairs of ranks $(r(d), r(f))$, a random variable $X$ corresponding to the product of the binary relevances $rel(d) \cdot rel(f)$, and an appropriate probability distribution over the event space. One such distribution corresponds to the (appropriately normalized) weights given in Table 4 (left); for convenience, we shall instead define a symmetrized version of these weights (see Table 4 (right)) and the corresponding joint distribution $J$ (appropriately normalized by $2Z$).

Figure 19: AP induced prior, over ranks.

One can compute sum precision as follows:

$$SP = \sum_{d:\, rel(d)=1} prec@r(d) = \sum_{d} rel(d) \cdot prec@r(d) = \sum_{d} rel(d) \sum_{r(f) \le r(d)} \frac{rel(f)}{r(d)} = \sum_{r(f) \le r(d)} \frac{1}{r(d)} \cdot rel(d) \cdot rel(f)$$

Thus, in order to evaluate $SP$, one must compute the weighted product of relevances of documents at pairs of ranks, where for any pair $r(f) \le r(d)$, the associated weight is $1/r(d)$.

Table 4 (left) visualizes the weighting for each pair of ranks $(r(d), r(f))$. To obtain a weight for each rank we symmetrize about the first diagonal (right), marginalize, and normalize appropriately to obtain the proposed weighting scheme:

$$W(d) = \mathrm{marginal}(J) = \frac{1}{2|s|}\left(1 + \frac{1}{r(d)} + \frac{1}{r(d)+1} + \cdots + \frac{1}{|s|}\right). \qquad (4)$$


Left (weights for pairs of ranks):
        1      2      3      ...    Z
  1     1
  2     1/2    1/2
  3     1/3    1/3    1/3
  ...
  Z     1/Z    1/Z    1/Z    ...    1/Z

Right (symmetrized weights):
        1      2      3      ...    Z
  1     2      1/2    1/3    ...    1/Z
  2     1/2    1      1/3    ...    1/Z
  3     1/3    1/3    2/3    ...    1/Z
  ...
  Z     1/Z    1/Z    1/Z    ...    2/Z

Table 4: (Left) Weights associated with pairs of ranks; normalizing by Z yields an asymmetric joint distribution. (Right) Symmetric weights; normalizing by 2Z yields the symmetric joint distribution J.

Efficient calculation of W(d). If $H(x) = 1 + 1/2 + \cdots + 1/x$ is the harmonic series up to $x$, then

$$W(d) = \frac{1}{2|s|}\Big(1 + H(|s|) - H(r(d)-1)\Big)$$

is an efficient O(1) calculation, granted that the harmonic numbers are stored in a look-up table. Also, if we can afford to approximate $W$, then we can write

$$W(d) = \frac{1}{2|s|}\Big(1 + H(|s|) - H(r(d)-1)\Big) \approx \frac{1}{2|s|}\Big(1 + \ln(|s|) - \ln(r(d)-1)\Big) \approx \frac{1}{2|s|}\left(1 + \ln\frac{|s|}{r(d)}\right)$$
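A sketch of this computation (illustrative, not the thesis implementation): precompute the harmonic numbers once, then each weight is O(1). Note that the resulting weights over ranks $1 \ldots |s|$ sum to 1, so $W$ is a distribution over the ranks of the list.

```python
def ap_prior_weights(list_length):
    """AP-induced weight W(d) at each rank r = 1..|s| (Equation (4)),
    using precomputed harmonic numbers H(0..|s|)."""
    s = list_length
    harmonic = [0.0] * (s + 1)             # harmonic[x] = 1 + 1/2 + ... + 1/x
    for x in range(1, s + 1):
        harmonic[x] = harmonic[x - 1] + 1.0 / x
    return [(1.0 + harmonic[s] - harmonic[r - 1]) / (2.0 * s)
            for r in range(1, s + 1)]
```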

Total Precision

It is known [58] that Average Precision cannot be written as a sum of the retrieved documents’ contributions (or errors). Specifically, there are no simple rank function $h$ and relevance function $g$ such that

$$SP = -\sum_{d} g(rel(d)) \cdot h(r(d))$$

Instead, if we denote by Total Precision ($TP$) the average of the precisions at all ranks,

$$TP = \frac{1}{|s|} \sum_{d} prec@r(d)$$


then we can write

$$
\begin{aligned}
\sum_{f} rel(f) \cdot W(f)
&= \sum_{f} rel(f)\, \frac{1}{2|s|}\left(1 + \frac{1}{r(f)} + \frac{1}{r(f)+1} + \cdots + \frac{1}{|s|}\right) \\
&= \frac{1}{2|s|}\left(\sum_{f} rel(f) + \sum_{r(d) \ge r(f)} \frac{rel(f)}{r(d)}\right) \\
&= \frac{1}{2|s|}\left(R + \sum_{d} \frac{1}{r(d)} \sum_{r(f) \le r(d)} rel(f)\right) \\
&= \frac{R}{2|s|} + \frac{1}{2|s|} \sum_{d} prec@r(d) \\
&= \frac{R}{2|s|} + \frac{1}{2}\, TP
\end{aligned}
$$

Assuming $|s|$ is roughly constant for most retrieved lists, $TP$ can thus be decomposed into parts-per-document, and $W$ is the contribution of each relevant document. In chapter 6, we describe the Hedge application for pooling, where at each round the “loss” is governed by $W$, for a total relative “loss” of $TP$.

3.2.2 A global prior for documents

One is often faced with the task of evaluating the average precisions of many retrieval systems with respect to a given query (as in TREC); if a global prior of document relevance is needed (as in sampling, chapter 4), then we average the per-system priors. The exponent 3/2 is detailed in chapter 4, section 4.3:

$$W(d) = \frac{1}{N_{SYSTEMS}} \sum_{s} W_s^{3/2}(d)$$
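A sketch of this averaging (with a hypothetical data layout: one dict of per-document weights per system); in practice the result would typically be renormalized into a distribution before sampling, which is an assumption here rather than something stated above:

```python
def global_prior(per_system_weights, exponent=1.5):
    """Combine per-system AP-induced priors: (1/N) * sum_s W_s(d)**(3/2)."""
    combined = {}
    for weights in per_system_weights:     # weights: dict doc -> W_s(d)
        for doc, w in weights.items():
            combined[doc] = combined.get(doc, 0.0) + w ** exponent
    n = len(per_system_weights)
    return {doc: v / n for doc, v in combined.items()}
```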


Chapter 4

Sampling

The process of sampling consists of selecting for observation a few units of a population in order to make inferences about the whole population. Thus, to estimate the prevalence of a disease, a number of medical institutions are selected, each of which has records of patients treated for the specified disease. Similarly, to estimate the popularity of a presidential candidate, a polling organization selects individuals from all states, backgrounds, businesses, etc. and collects their opinions on the candidate; then, using carefully designed mathematics, a nation-wide opinion is formed. Naively, the accuracy of the polling estimate may seem amazing: one can assess a nation-wide voting result with reasonably high confidence using only a few thousand questionnaires; scientifically, the sampling process has been very intensively studied and its mathematics are well understood.

The units selected, together with the specifics of their selection, form the sample. The first question that comes to mind is: how should one select the sample? The mechanism through which the population units are selected is called the sampling design or sampling scheme; many designs have been proposed [94, 25] and many are in use for various problems.

It is very intuitive to assume that, in order to get a reliable population estimate, the sample should be “representative” of the population. If we choose a skewed sample (one that is not representative), then we should know what parts of the sample are representative of what parts of the population; this is usually the case when the sampling is non-uniform (some units are more likely to be part of the sample than others). In importance sampling [7, 25], the units are sampled non-uniformly with probability proportional to their expected value yield (computed from a prior or from relations with other known variables). For example, in order to estimate the percentage of the world population sick with a rare disease, areas in which the disease has been observed in the past are sampled more. Such a mechanism not only massively increases the accuracy of the estimate, but also results in more sampled units being observed with positive value (in this example, more sick people included in the sample).

Another important sampling consideration is whether we can select any unit more than once (with replacement) or not (without replacement). In with-replacement sampling, the sampling mechanism is usually sequential or episodic: one unit is selected and observed at a time; the process repeats for a fixed number of steps, and there is a certain likelihood that at each step the unit selected is one of those already observed. The sample would contain the repetitions (or counts) of each unit along with the value observed. While this sampling design is very simple, it is obviously not the most efficient. Conversely, without-replacement sampling does not allow a population unit to be sampled more than once. Sampling schemes without replacement, especially non-uniform ones, are the most efficient but also the hardest to implement; while there are no counts, the fundamental mathematical property of each unit is its posterior probability of being included in the sample, the inclusion probability.

Estimation

Once we have the sample, how can we estimate the quantity of interest for the whole population? Say, the total number of people suffering from a disease, or the positive perception of a presidential candidate. Certainly we are to use a mathematical formula that takes into account the sampled units, their probabilities, counts, and the values observed; we call this formula an estimator. Many estimators have been proposed [25, 94]; some work better with specific sampling designs, some work only with certain designs. Conversely, some sampling designs have been put forward for use with specific estimators, while others are more general and can be used with more than one estimator.

A particularly important class is that of unbiased estimators, for which the expected value over all possible samples of a given size is exactly the true statistic that is estimated. For example, if one chooses randomly and uniformly 10 numbers from a population of 100, and estimates the mean of the population by taking the mean of the 10 numbers selected, that is an unbiased estimator, because the average estimated value over all possible samples is in fact the true mean of the population. Unbiased estimators are generally preferable to biased ones because, at least in expectation, they produce the correct value. However, there are cases in which the variance of the unbiased estimator is unacceptably high, or cases where an unbiased estimator is not known or is too hard to compute; in such cases, biased estimators are used if the bias is acceptable.

Confidence

Surely we can produce a sample using a sampling design, and we can estimate a statistic (like the total) using an estimator. But how reliable is the estimate? To put it mathematically, we fix a confidence threshold t and ask what interval we should consider around the estimate such that, with probability at least t, the true statistic happens to be inside the predicted interval. Something like saying that a presidential candidate will obtain, with 90% probability, 60% ± 2% of the vote; we call ±2% the 90%-confidence interval.

Confidence for unbiased estimators depends solely on their variance, and in most cases the estimator variance can also be estimated from the sample. Thus, along with the estimate, one can produce a confidence or “error margin”.

Sampling errors vs nonsampling errors

The basic sampling assumption is that once a unit is chosen by the sampling design, the value of interest (such as sick or not sick) can be determined without error. If this is the case, the errors are sampling errors and can only occur if the population was not sampled entirely, that is, some units were not considered for sampling. In many situations (IR included) this happens naturally because access to the population units is restricted by the provider mechanism. For example, certain countries would not allow an organization to test certain products; or some people would refuse to answer polling questions (but they will vote in the upcoming election).

It turns out that it is not the sampling errors that cause most of the inaccuracies in the estimates, but rather the non-sampling errors: errors caused by mislabeling of the units selected by the sampling design. Even worse, in some cases it is not an error (which can still be somewhat close to a correct observation) but the unit is missing the value (or observation) altogether; this phenomenon affects the sample size, and also the statistics of the sample such as the inclusion probabilities.

To conclude this short introduction, it is worth mentioning that sampling has been used for centuries. Since ancient times, sampling was used for budget prediction, food consumption and production, military assessments, etc. Today, it is a critical tool in many domains ranging from market research to astronomy. We show next how one can employ sampling to perform IR evaluation.

The chapter contains two main results. First, we look at several IR measures, particularly at average precision, and we show how one can sample documents from system output in order to estimate the evaluation of performance. The benefit is that the human effort (judging documents) is drastically reduced, sometimes by a factor of 20. We present two sampling methods: with and without replacement; the first one is easier to understand, while the latter is more efficient. We show empirical results and discuss applications.

The second result is the derivation of a confidence value for our estimates; the confidence serves as a reliability measure for the estimates, but also prompts for more judgments when needed. We think evaluation confidence intervals are of major importance for any task involving reusability of the data.

4.1 Sampling for IR

We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce highly biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased or low-biased estimates of the standard measures themselves.

On a query-by-query basis. The experiments are run completely query-independently, and the pools of documents are constructed on a per-query basis as well. The equivalent number of judged documents for depth d is computed per query, as the exact size of the depth-d pool for that query. We do not look at document content, but only at the rankings in all the systems and at relevance. After every query is completed we run an averaging program to get the MAP estimates.

The core of our methodology is the derivation, for each measure, of a distribution over documents such that the value of the measure is proportional to the expectation of observing a relevant document drawn according to that distribution. (In the case of average precision, the distribution is over pairs of documents, and the observation can be modeled as the product of the relevances for the pair drawn.) Given such distributions, one can estimate the expectations (and hence measurement values) using random sampling. By statistical design, such estimates will be almost unbiased, especially for a large sample. Furthermore, through the use of the statistical estimation technique of importance sampling [7], we show how low variance estimates of multiple retrieval measures can be simultaneously estimated for multiple runs given a single sample. In sum, we show how both efficient and effective estimates of standard retrieval measures can be inferred from a random sample, thus providing an alternative to large-scale TREC-style evaluations.

While many of the details are somewhat complex, the basic ideas can be summarized as follows:

1. For each measure, we derive a random variable and associated probability distribution such that the value of the measure in question is proportional to the expectation of the random variable with respect to the probability distribution. For example, to estimate precision-at-cutoff 500, one could simply uniformly sample documents from the top 500 in a given list and output the fraction of relevant documents seen. Thus, the underlying random variable for precision-at-cutoff c is dictated by the binary relevance assessments, and the associated distribution is uniform over the top c documents. (Since R-precision is precision-at-cutoff R, an identical strategy holds, given the value or an estimate of R.) For average precision, the situation is somewhat more complex: we show that the required sampling distribution is over pairs of documents and the underlying random variable is the product of the binary relevance judgments for that pair.

2. Given that the value of a measure can be viewed as the expectation of a random variable, one can apply standard sampling techniques to estimate this expectation and hence the value of the measure. To implement this methodology efficiently, one would like to estimate all retrieval measures for all runs simultaneously using a single judged sample. As such, one is confronted with the task of estimating the expectation of a random variable with respect to a known distribution by using a sample drawn according to a different (but known) distribution.


3. Finally, to implement this methodology effectively, one desires low variance unbiased estimators so that the computed empirical means will converge to their true expectations quickly. While we show that any known sampling distribution can be used to yield unbiased estimators for retrieval measures, we also describe a heuristic for generating a specific sampling distribution which is likely to yield low variance estimators.

Notation

- W = prior of relevance over documents
- S = set of sampled documents
- K = set of sampled pairs of documents formed from S
- D = sampling distribution over documents
- I = induced sampling distribution over pairs of documents
- π_d = inclusion probability (in S) for document d
- π_{df} = inclusion probability (in S) for the pair of documents (d, f)
- \hat{R} = estimate for the number of relevant documents for the query
- \hat{AP}, \hat{SP}, \widehat{R\text{-}prec}, \widehat{prec@rank} = estimates for average precision, Sum Precision, R-precision, and precision at rank, respectively
- rel(d) = relevance of judged document d
- r(d), r_s(d) = rank of d in a particular retrieved list

4.2 Sampling Theory and Intuition

As a simple example, suppose that we are given a ranked list of documents (d_1, d_2, \ldots), and we are interested in determining the precision-at-cutoff 1000, i.e., the fraction of the top 1000 documents that are relevant. Let prec@1000 denote this value. One obvious solution is to examine each of the top 1000 documents and return the number of relevant documents seen divided by 1000. Such a solution requires 1000 relevance judgments and returns the exact value of prec@1000 with perfect certainty. This is analogous to forecasting an election by polling each and every registered voter and asking how they intend to vote: in principle, one would determine, with certainty, the exact fraction of voters who would vote for a given candidate on that day. In practice, the cost associated with such “complete surveys” is prohibitively expensive. In election forecasting, market analysis, quality control, and a host of other problem domains, random sampling techniques are used instead [94].

In random sampling, one trades off exactitude and certainty for efficiency. Returning to our prec@1000 example, we could instead estimate prec@1000 with some confidence by sampling in the obvious manner: draw m documents uniformly at random from among the top 1000, judge those documents, and return the number of relevant documents seen divided by m — this is analogous to a random poll of registered voters in election forecasting. In statistical parlance, we have a sample space of documents indexed by d ∈ {1, . . . , 1000}, we have a sampling distribution over those documents p_d = 1/1000 for all 1 ≤ d ≤ 1000, and we have a random variable X corresponding to the relevance of documents,

x_d = rel(d) = \begin{cases} 0 & \text{if } d \text{ is non-relevant} \\ 1 & \text{if } d \text{ is relevant.} \end{cases}

One can easily verify that the expected value of a single random draw is prec@1000:

E[X] = \sum_{r(d)=1}^{1000} p_d \cdot x_d = \frac{1}{1000} \sum_{r(d)=1}^{1000} rel(d) = prec@1000,

and the Law of Large Numbers and the Central Limit Theorem dictate that the average of a set S of m such random draws

\widehat{prec@1000} = \frac{1}{m} \sum_{d \in S} X_d = \frac{1}{m} \sum_{d \in S} rel(d)

will converge to its expectation, prec@1000, quickly [81] — this is the essence of random sampling.
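To make the uniform-sampling estimator concrete, here is a minimal Python sketch (not part of the original experiments); ranked_list and qrels are hypothetical stand-ins for a single run and its binary relevance judgments.

import random

def estimate_prec_at_c(ranked_list, qrels, c=1000, m=100, seed=0):
    """Estimate prec@c by judging m documents drawn uniformly at random
    (with replacement) from the top c documents of one ranked list."""
    rng = random.Random(seed)
    top_c = ranked_list[:c]
    sample = [rng.choice(top_c) for _ in range(m)]       # p_d = 1/c for every top-c document
    return sum(qrels.get(d, 0) for d in sample) / m      # sample mean of rel(d)

In expectation this returns prec@c exactly, at the cost of judging only m documents instead of c.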

Random sampling gives rise to a number of natural questions: (1) How should the random sample be drawn? In sampling with replacement, each item is drawn independently and at random according to the distribution given (uniform in our example), and repetitions may occur; in sampling without replacement, a random subset of the items is drawn, and repetitions will not occur. While the former is much easier to analyze mathematically, the latter is often used in practice since one would not call the same registered voter twice (or ask an assessor to judge the same document twice) in a given survey. (2) How should the sampling distribution be formed? While prec@1000 seems to dictate a uniform sampling distribution, we shall see that non-uniform sampling gives rise to much more efficient and accurate estimates. (3) How can one quantify the accuracy and confidence in a statistical estimate? As more samples are drawn, one expects the accuracy of the estimate to increase, but by how much and with what confidence? In the paragraphs that follow, we address each of these questions, in reverse order.

While statistical estimates are generally designed to be correct in expectation, they may be high or low in practice (especially for small sample sizes) due to the nature of random sampling. The variability of an estimate is measured by its variance, and by the Central Limit Theorem, one can ascribe 95% confidence intervals to a sampling estimate given its variance. Returning to our prec@1000 example, suppose that (unknown to us) the actual prec@1000 was 0.25; then one can show that the variance in our random variable X is 0.1875 and that the variance in our sampling estimate is 0.1875/m, where m is the sample size. Note that the variance decreases as the sample size increases, as expected. Given this variance, one can derive 95% confidence intervals [81], i.e., an error range within which we are 95% confident that our estimate will lie.¹ For example, given a sample of size 50, our 95% confidence interval is ±0.12, while for a sample of size 500, our 95% confidence interval is ±0.038. This latter result states that with a sample of size 500, our estimate is likely to lie in the range [0.212, 0.288], or as percentages [0.25 − 15%, 0.25 + 15%]. In order to increase the accuracy of our estimates, we must decrease the size of the confidence interval. In order to decrease the size of the confidence interval, we must decrease the variance in our estimate, 0.1875/m. This can be accomplished by either (1) decreasing the variance of the underlying random variable X (the 0.1875 factor) or (2) increasing the sample size m. Since increasing m increases our judgment effort, we shall focus on decreasing the variance of our random variable instead.

¹For estimates obtained by averaging a random sample, the 95% confidence interval is roughly ±1.965 standard deviations, where the standard deviation is the square root of the variance, i.e., \sqrt{0.1875/m} in our example.
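As a sanity check on the numbers quoted above, a short sketch (using the normal approximation with a 1.96 multiplier) that turns the per-draw variance into a 95% confidence half-width:

import math

def ci95_half_width(var_x, m):
    """95% confidence half-width for a sample mean with per-draw variance
    var_x and sample size m (normal approximation)."""
    return 1.96 * math.sqrt(var_x / m)

# With var_x = 0.1875 (true prec@1000 = 0.25):
#   ci95_half_width(0.1875, 50)  ~ 0.12
#   ci95_half_width(0.1875, 500) ~ 0.038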

4.2.1 Uniform random sampling and infAP

In contrast, the method proposed by Yilmaz et al. [110] is quite simple: documents are chosen uniformly at random from the depth-100 pool, and only those judged documents (and knowledge of the depth-100 pool) are required to assess any given search engine run. The method is simple in its conception and implementation, and it has been adopted in the most recent TREC Terabyte and TREC video retrieval tracks and included as part of TREC's standard trec_eval procedure. Unfortunately, while more efficient than traditional depth pooling, the method is far less accurate and efficient than the (far more complex) methods using non-uniform sampling; in particular, to achieve the performance of the non-uniform methods (presented later), infAP requires sample sizes roughly five times larger. We present next a summary of the infAP method, reproduced from [110]:

Given the output of a retrieval system, the expected value of average precision can be described as follows: pick a relevant document at random from the ranked list returned as response to a user query. What is the probability of getting a relevant document at or above that rank, that is, what is the expected precision at that rank? Note that in the traditional sense of average precision, the probability of getting a relevant document at or above a rank corresponds to the precision at that rank, and picking a relevant document at random corresponds to averaging these precisions over all relevant documents.

One can think of average precision as this expected value, and in order to estimate average precision, one can instead try to estimate this expectation using the given sampled relevance judgments.

First, consider the first part of this random experiment, picking a relevant document at random from the list. Since we do uniform sampling from the depth-100 pool, the relevant documents in the list are also uniformly distributed. Now consider the expected precision at a relevant document retrieved at rank r. While computing the precision at rank r by picking a document at random at or above r, two cases can happen. With probability 1/r, we may pick the current document and, since the document is known to be relevant, the precision at this document is 1. Or we may pick a document above the current document with probability (r − 1)/r, and we calculate the expected precision (or probability of relevance) within these documents. Thus, for a relevant document at rank r, the expected value of precision at rank r can be calculated as:

E[prec@r] = \frac{1}{r} \cdot 1 + \frac{r-1}{r} \cdot E[\text{precision above } r]

Now we need to calculate the expected precision above r. Within the r − 1 documents above rank r, there are two main types of documents: documents that are not in the depth-100 pool (non-depth100), which are known to be nonrelevant, and documents that are within the depth-100 pool (depth100). Within the documents that are within the depth-100 pool, there are documents that are unsampled (unjudged) (non-sampled), documents that are sampled (judged) and relevant (rel), and documents that are sampled and nonrelevant (nonrel). While computing the expected precision within these r − 1 documents, we pick a document at random from these r − 1 documents and report the relevance of this document. With probability |non-depth100|/(r − 1), we pick a document that is not in the depth-100 pool, and the expected precision within these documents is 0. With probability |depth100|/(r − 1), we pick a document that is in the depth-100 pool. Within the documents in the depth-100 pool, we estimate the precision using the sample given. Thus, the expected precision within the documents in the depth-100 pool is |rel|/(|rel| + |nonrel|). Therefore, the expected precision above rank r can be calculated as:

E[\text{precision above } r] = \frac{|non\text{-}depth100|}{r-1} \cdot 0 + \frac{|depth100|}{r-1} \cdot \frac{|rel|}{|rel| + |nonrel|}

Thus, if we combine the two formulas, the expected precision at a relevant document that is retrieved at rank r can be computed as:

E[\text{precision at rank } r] = \frac{1}{r} \cdot 1 + \frac{r-1}{r} \left( \frac{|depth100|}{r-1} \cdot \frac{|rel|}{|rel| + |nonrel|} \right)

Note that it is possible to have no documents sampled above rank r (|rel| + |nonrel| = 0). To avoid this 0/0 condition, we employ Lidstone smoothing [36], where a small value ε is added to both the number of relevant and the number of nonrelevant documents sampled. Then, the above formula becomes:

E[\text{precision at rank } r] = \frac{1}{r} \cdot 1 + \frac{r-1}{r} \left( \frac{|depth100|}{r-1} \cdot \frac{|rel| + ε}{|rel| + |nonrel| + 2ε} \right)

Since average precision is the average of the precisions at each relevant document, we compute the expected precision at each relevant document rank using


the above formula and calculate the average of them, where the relevant documents that are not retrieved by the system are assumed to have a precision of zero. We call this new measure that estimates the expected average precision inferred AP (infAP).

[Figure 20: TREC-8 mean inferred AP versus actual MAP as the judgment set is reduced to (from left to right) 30% (RMS = 0.0188, τ = 0.9203, ρ = 0.9958), 10% (RMS = 0.0269, τ = 0.8849, ρ = 0.9893), and 5% (RMS = 0.0459, τ = 0.8427, ρ = 0.9779) of the entire judgment set.]

Note that in order to compute the above formula, we need to know which documents are in the depth-100 pool and which are not. However, the above formula has the advantage that it is a direct estimate of average precision.

Figure 20 plots the inferred MAP (for each system we compute the inferred AP for all queries and take the average) against the actual mean average precision (computed using the complete judgments). It can be seen that inferred AP values are a good approximation to the actual AP values, even for samples containing a small percentage of the judgments.
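The following is a compact sketch of the infAP computation as summarized above (it is not the official trec_eval implementation); run is a ranked list of document ids, sampled_qrels maps the uniformly sampled judged documents to 0/1, and depth100_pool is the set of pooled documents for the query, all names being illustrative.

def inf_ap(run, sampled_qrels, depth100_pool, eps=1e-5):
    """Inferred AP: expected precision at each sampled relevant rank,
    averaged over the sampled relevant documents (unretrieved sampled
    relevant documents contribute zero)."""
    num_rel = sum(1 for v in sampled_qrels.values() if v == 1)
    if num_rel == 0:
        return 0.0
    total = 0.0
    for r, doc in enumerate(run, start=1):
        if sampled_qrels.get(doc) != 1:
            continue                      # precision is computed only at sampled relevant ranks
        above = run[:r - 1]
        pooled = [d for d in above if d in depth100_pool]
        rel = sum(1 for d in pooled if sampled_qrels.get(d) == 1)
        nonrel = sum(1 for d in pooled if sampled_qrels.get(d) == 0)
        e_above = 0.0
        if r > 1:
            # documents outside the depth-100 pool count as nonrelevant;
            # pooled documents are estimated from the sample, with Lidstone smoothing
            e_above = (len(pooled) / (r - 1)) * ((rel + eps) / (rel + nonrel + 2 * eps))
        total += 1.0 / r + ((r - 1.0) / r) * e_above
    return total / num_rel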

4.2.2 Importance sampling

Back to our prec@1000 example. Even if prec@1000 seems to inherently dictate a uniform sampling distribution, one can reduce the variance of the underlying random variable X, and hence of the sampling estimate, by employing non-uniform sampling. A maxim of sampling theory is that accurate estimates are obtained when one samples with probability proportional to size (PPS) [94]. Consider our election forecasting analogy: suppose that our hypothetical candidate is known to have strong support in rural areas, weaker support in the suburbs, and almost no support in major cities. Then to obtain an accurate estimate of the vote total (or fraction of total votes) this candidate is likely to obtain, it makes sense to spend your (sampling) effort “where the votes are.” In other words, one should spend the greatest effort in rural areas to get very accurate counts there, somewhat less effort in the suburbs, and little effort in major cities where very few people are likely to vote for the candidate in question. However, one must now compensate for the fact that the sampling distribution is non-uniform — if one were to simply return the fraction of polled voters who intend to vote for our hypothetical candidate when the sample is highly skewed toward the candidate's areas of strength, then one would erroneously conclude that the candidate would win in a landslide. To compensate for non-uniform

sampling, one must under-count where one over-samples and over-count where one under-samples.

[Figure 21: Non-uniform sampling distribution.]

Employing a PPS strategy would dictate sampling “where the relevant documents are.” Analogous to the election forecasting problem, we do have a prior belief about where the relevant documents are likely to reside — in the context of ranked retrieval, relevant documents are generally more likely to appear toward the top of the list. We can make use of this fact to reduce our sampling estimate's variance, so long as our assumption holds. Consider the non-uniform sampling distribution shown in Figure 21, where

p_k = \begin{cases} 1.5/1000 & 1 \le k \le 500 \\ 0.5/1000 & 501 \le k \le 1000, \end{cases}

where we have increased our probability of sampling the top half (where more relevant documents are likely to reside) and decreased our probability of sampling the bottom half (where fewer relevant documents are likely to reside). In order to obtain the correct estimate, we must now “under-count” where we “over-sample” and “over-count” where we “under-sample.” This is accomplished by modifying our random variable X as follows:

x_k = \begin{cases} rel(k)/1.5 & 1 \le k \le 500 \\ rel(k)/0.5 & 501 \le k \le 1000. \end{cases}

Note that we over/under-count by precisely the factor that we under/over-sample; this ensures that the expectation is correct:

E[X] = \sum_{k=1}^{1000} p_k \cdot x_k = \sum_{k=1}^{500} \frac{1.5}{1000} \cdot \frac{rel(k)}{1.5} + \sum_{k=501}^{1000} \frac{0.5}{1000} \cdot \frac{rel(k)}{0.5} = \frac{1}{1000} \sum_{k=1}^{1000} rel(k) = prec@1000.


[Figure 22: Non-uniform sampling distribution with three strata.]

For a given sample S of size m, our estimator is then a weighted average

\widehat{prec@1000} = \frac{1}{m} \sum_{k \in S} X_k = \frac{1}{m} \left( \sum_{k \in S:\, k \le 500} \frac{rel(k)}{1.5} + \sum_{k \in S:\, k > 500} \frac{rel(k)}{0.5} \right),

where we over/under-count appropriately.

Note that our expectation and estimator are correct, independent of whether our assumption about the location of the relevant documents actually holds! However, if our assumption holds, then the variance of our random variable (and sampling estimate) will be reduced (and vice versa). Suppose that all of the relevant documents were located where we over-sample. Our expectation would be correct, and one can show that the variance of our random variable is reduced from 0.1875 to 0.1042 — we have sampled where the relevant documents are and obtained a more accurate count as a result. This reduction in variance yields a reduction in the 95% confidence interval for a sample of size 500 from ±0.038 to ±0.028, a 26% improvement. Conversely, if the relevant documents were located in the bottom half, the confidence interval would increase.

One could extend this idea to three (or more) strata, as in Figure 22. For each document k, let α_k be the factor by which it is over/under-sampled with respect to the uniform distribution; for example, in Figure 21, α_k is 1.5 or 0.5 for the appropriate ranges of k, while in Figure 22, α_k is 1.5, 1, or 0.5 for appropriate ranges of k. For a sample S of size m drawn according to the distribution in question, the sampling estimator would be

\widehat{prec@1000} = \frac{1}{m} \sum_{k \in S} \frac{rel(k)}{α_k}.

In summary, one can sample with respect to any distribution, and so long as one over/under-counts appropriately, the estimator will be correct. Furthermore, if the sampling distribution places higher weight on the items of interest (e.g., relevant documents), then the variance of the estimator will be reduced, yielding higher accuracy.
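A small sketch of the general weighted estimator above; alpha[k] is the factor by which rank k+1 is over- or under-sampled relative to uniform (so the sampling probability is alpha[k]/c, and the alpha values over the top c are assumed to average to 1). All names are illustrative.

import random

def estimate_prec_at_c_weighted(ranked_list, qrels, alpha, c=1000, m=100, seed=0):
    """Sample ranks of the top c non-uniformly (proportional to alpha) and
    correct for over/under-sampling by dividing each relevance by alpha."""
    rng = random.Random(seed)
    ranks = list(range(c))
    sample = rng.choices(ranks, weights=[alpha[k] for k in ranks], k=m)   # with replacement
    return sum(qrels.get(ranked_list[k], 0) / alpha[k] for k in sample) / m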


4.3 Non-uniform sampling with replacement

A carefully chosen, non-uniform distribution over the documents in the depth-100 pool is formed, and documents are sampled one at a time, with replacement, according to this distribution [12]. In order to assess a search engine system, the entire sampling distribution over the entire depth-100 pool must be available, together with the relevance judgments and sampling counts associated with the sampled documents. This evaluation method is very accurate and efficient in terms of human judging effort, achieving assessment results essentially equivalent to TREC depth-100 pooling results using sample sizes as low as 4% of the size of the traditional depth-100 pool. However, its conception and implementation are rather complex; also, the requirement of having the entire sampling distribution available at evaluation stage (as opposed to only the sampling probabilities of the judged documents) makes it somewhat inconvenient to put in practice.

Methodology

While our goal is to simultaneously estimate multiple measures of performance over multiple lists, we begin by considering the problem of estimating average precision from a random sample for only one system. In chapter 3, we presented how Average Precision can be written as an expectation, and how the sampling distribution (or prior) is derived. Let SP be the sum of precisions at all relevant documents. In what follows, we first discuss estimating SP, and later we discuss estimating R; our estimate of AP will be the ratio of the estimates of SP and R.

SP = \sum_{d:\, rel(d)=1} prec@r(d) = \sum_{d:\, rel(d)=1} rel(d) \cdot prec@r(d) = \sum_{d:\, rel(d)=1} rel(d) \cdot \frac{1}{r(d)} \sum_{r(f) \le r(d)} rel(f) = \sum_{1 \le r(f) \le r(d) \le |s|} \frac{1}{r(d)} \cdot rel(d) \cdot rel(f)

This shows that if we want SP as the expected value of observations made on pairs (d, f), the sampling weight J(d, f) should be proportional to 1/max(r(d), r(f)). If K is a multiset sample (that is, one including repetitions) of pairs drawn according to J, then we can estimate SP by

\widehat{SP} = |s| \cdot \frac{1}{|K|} \sum_{(d,f) \in K} rel(d) \cdot rel(f)

However, the effective sampling distribution for the pairs of documents is going to be different from J, for the following reasons:

• Sample documents, not pairs. To be efficient, we are going to sample documents and not pairs of documents. Given the sampled documents, we shall form all possible pairs; the distribution induced over pairs is multinomial and very different from J.

• Multiple systems. The distribution has to accommodate the goal of evaluating many systems, so it will not fit any particular system perfectly (in fact, “strange” systems that return many “unique” documents not retrieved by other systems may not be fitted at all).

• Variance minimization. We want a sampling distribution over documents that would likely minimize the overall variance of the estimator.

4.3.1 Estimation, scaling factors

How can we use the estimator if the sample has been drawn from a distribution different from the density function? Say random variable X has the density function p. The expected value of X is then E[X] = \sum_x p(x) \cdot x. Therefore, if a sample K is taken according to p, then we can estimate the mean by

\hat{E}[X] = \frac{1}{|K|} \sum_{x \in K} x

If, however, one takes a sample K of X values according to density q instead of p, then one can use the following unbiased estimator of the mean of X, as described earlier in the “intuition” section:

\hat{E}[X] = \frac{1}{|K|} \sum_{x \in K} \frac{p(x)}{q(x)} \cdot x

The ratios of the distributions are called scaling factors [7]. We shall effectively be sampling pairs from a distribution different from the one (J) necessary to estimate the expectations desired. Let I_s(d, f) be the joint distribution (over pairs) induced by the global sampling distribution D (over documents) restricted to system s (we are going to compute I_s later); then the scaling factors SF_s(d, f) correspond to the ratio between the desired and sampling distributions:

SF_s(d, f) = \frac{J_s(d, f)}{I_s(d, f)}

The estimator for SP for system s becomes

\widehat{SP}_s = |s| \cdot \frac{1}{|K_s|} \sum_{(d,f) \in K_s} rel(d) \cdot rel(f) \cdot SF_s(d, f)

where K_s ⊆ K is the subset of samples corresponding to documents retrieved by system s. Note that the above formulation works for any distribution I_s, in the sense that the estimate \widehat{SP}_s is unbiased; we will show next how to find a global distribution D over documents that minimizes overall variance.
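As an illustration of this estimator (a sketch only, with hypothetical data structures): pairs holds the judged pairs K_s for system s, rel the binary judgments, and J_s and I_s are callables returning the desired and induced pair probabilities.

def estimate_sp(pairs, rel, J_s, I_s, list_len):
    """Estimate SP for one system from its judged pair sample K_s,
    using scaling factors SF_s(d, f) = J_s(d, f) / I_s(d, f)."""
    if not pairs:
        return 0.0
    total = 0.0
    for d, f in pairs:
        sf = J_s(d, f) / I_s(d, f)                 # scaling factor
        total += rel.get(d, 0) * rel.get(f, 0) * sf
    return list_len * total / len(pairs)           # |s| * (1/|K_s|) * sum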


[Figure 23: AP evaluation diagram. From the ranked list, the sampling distribution over documents, and the judged documents (with counts), the induced distribution over judged pairs of documents is related to the required distribution over pairs via scaling factors, yielding the estimates of R and AP.]

Let D denote the sampling distribution (over documents) and I the induced distribution over pairs of documents; let S be the document sample and K the pair-of-documents sample formed from S. Using scaling factors, an unbiased estimate for R (the total number of relevant documents) is

\hat{R} = \frac{N_{DOCS}}{|S|} \sum_{d \in S} \frac{rel(d)}{D(d)}

Having the global estimate \hat{R} and \widehat{SP}_s for every system, we can estimate Average Precision²:

\widehat{AP}_s = \frac{\widehat{SP}_s}{\hat{R}}
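A sketch of these two estimates under the notation above (a hypothetical helper; sampled_docs is the sample S with repetitions, D the sampling distribution, n_docs the size of the document pool):

def estimate_r_and_ap(sampled_docs, rel, D, n_docs, sp_hat):
    """Estimate R from the document sample, then AP as the ratio SP_hat / R_hat."""
    r_hat = (n_docs / len(sampled_docs)) * sum(rel.get(d, 0) / D[d] for d in sampled_docs)
    ap_hat = sp_hat / r_hat if r_hat > 0 else 0.0
    return r_hat, ap_hat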

Bias

If \widehat{SP} and \hat{R} are random variables estimating SP and R respectively, their ratio is not truly estimating our quantity of interest, SP/R. One idea for obtaining a more correct estimate is to use a Taylor series approximation ([80], page 146).

²The ratio estimator bias will be addressed in section 4.5.


Assuming a null covariance, the formula for the ratio is

\widehat{AP} = \frac{\widehat{SP}}{\hat{R}} \cdot \frac{1}{1 + \sigma^2_{\hat{R}} / \hat{R}^2}

where \sigma^2_{\hat{R}} is the (estimated) variance of \hat{R}.

Note that the estimate of a ratio is not necessarily the ratio of estimates. More accurate ratio estimates derived via a second order Taylor series approximation [80] were tested, and they were generally found to be of little benefit for the computational effort required.
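For completeness, the first-order correction above is a one-liner (assuming var_r_hat is an estimate of the variance of the R estimate):

def bias_corrected_ap(sp_hat, r_hat, var_r_hat):
    """First-order (Taylor series) correction of the ratio estimate SP_hat / R_hat."""
    return (sp_hat / r_hat) / (1.0 + var_r_hat / (r_hat ** 2))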

Exact computation of scaling factors

Given D, the distribution we use for sampling documents, and given a sample S of such documents, we consider a pair sample K consisting of all |S|² induced pairs and estimate the required expectations from these induced pairs and appropriate scaling factors. For sufficiently large S, the distribution over induced pairs will approximate the associated product distribution, I(d, f) ≈ D(d) · D(f); however, the actual distribution is multinomial.

As a consequence, if |K_s| is the size of the subset of the K sampled pairs which are retrieved by system s, one obtains the following final scaling factors:

SF_s(d, f) = \frac{J_s(d, f)}{I(d, f) \cdot |K|/|K_s|}.

Finally, we derive the exact form of the multinomial sampling distribution over induced pairs. Sampling S documents (with replacement) from a distribution D and forming all |S|² pairs of documents yields a multinomial distribution I over the possible outcomes.

Abusing but simplifying notation, we call S both the sample and the size of the sample |S|. Let \vec{t} = (t_1, t_2, \ldots, t_W) correspond to counts for the sampled documents, where t_1 + t_2 + \cdots + t_W = S and t_d is the count associated with document d. Then

Pr(\vec{t}) = \binom{S}{t_1, t_2, \ldots, t_W} \cdot \prod_d D(d)^{t_d}.

For d ≠ f and S > 2, the induced pairs distribution is derived as follows (we denote D_d = D(d)):

I(d, f) = \sum_{\vec{t}=(\ldots,t_d,\ldots,t_f,\ldots)} \frac{t_d \cdot t_f}{S^2} \cdot Pr(\vec{t})

= \frac{1}{S^2} \sum_{t_d+t_f \le S} \left( t_d t_f \binom{S}{t_d, t_f, \ldots} D_d^{t_d} D_f^{t_f} (1 - D_d - D_f)^{S-t_d-t_f} \right)

= \frac{1}{S^2} \sum_{t_d+t_f \le S} \left( \frac{(S-2)!\,(S-1)\,S}{(t_d-1)!\,(t_f-1)!\,(S-t_d-t_f)!} D_d D_f \cdot D_d^{t_d-1} D_f^{t_f-1} (1 - D_d - D_f)^{S-t_d-t_f} \right)

= \frac{S(S-1) D_d D_f}{S^2} \sum_{t_d+t_f \le S} \left( \binom{S-2}{t_d-1, t_f-1, \ldots} D_d^{t_d-1} D_f^{t_f-1} (1 - D_d - D_f)^{S-t_d-t_f} \right)

= \frac{S-1}{S} D_d D_f

For d = f and S > 2, the induced pairs distribution is

I(d, d) = \sum_{\vec{t}=(\ldots,t_d,\ldots)} \frac{t_d^2}{S^2} \cdot Pr(\vec{t})

= \frac{1}{S^2} \sum_{t_d \le S} t_d^2 \binom{S}{t_d, \ldots} D_d^{t_d} (1 - D_d)^{S-t_d}

= \frac{1}{S^2} \sum_{2 \le t_d \le S} t_d \frac{S!}{(t_d-1)!\,(S-t_d)!} D_d^{t_d} (1 - D_d)^{S-t_d} + \mathrm{Term}(t_d = 1)

= \frac{S(S-1) D_d^2}{S^2} \sum_{2 \le t_d \le S} \frac{t_d}{t_d-1} \binom{S-2}{t_d-2, \ldots} D_d^{t_d-2} (1 - D_d)^{S-t_d} + \mathrm{Term}(t_d = 1)

= \mathrm{Term}(t_d = 1) + \frac{S(S-1) D_d^2}{S^2} \sum \binom{S-2}{t_d-2} D_d^{t_d-2} (1 - D_d)^{S-t_d} + \frac{S(S-1) D_d^2}{S^2} \sum \frac{1}{t_d-1} \binom{S-2}{t_d-2} D_d^{t_d-2} (1 - D_d)^{S-t_d}

= \frac{1}{S} D_d (1 - D_d)^{S-1} + \frac{S-1}{S} D_d^2 + \frac{1}{S} D_d
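The two closed forms above are straightforward to code; a sketch (D is a dict of sampling probabilities and S the number of with-replacement draws):

def induced_pair_prob(D, S, d, f):
    """Induced probability of the pair (d, f) when S documents are drawn with
    replacement from D and all S^2 pairs are formed (closed forms derived above)."""
    if d != f:
        return (S - 1) / S * D[d] * D[f]
    return (1.0 / S) * D[d] * (1.0 + (S - 1) * D[d] + (1.0 - D[d]) ** (S - 1))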

4.3.2 Optimal sampling distribution

In what follows, we describe a heuristic for determining a good sampling distribution D — one which corresponds to a distribution over documents (for efficiency) and which explicitly attempts to minimize the variance in the estimates produced (for accuracy). In determining a sampling distribution D, we look at the properties of I, the effective sampling distribution over pairs. We consider two factors. First, while the exact I is multinomial, we know I is approximately a symmetric product distribution, i.e., I(d, f) ≈ D(d) · D(f). Second, we seek an I which explicitly attempts to minimize the variance in our estimator, for accuracy. We begin by considering the latter factor.

Variance minimization

For a sampling distribution D and a given system s, let I_s be the distribution induced by D over pairs of documents contained in the list returned by system s. Furthermore, let Y be the random variable rel(d) · rel(f) · SF_s(d, f), such that SP = |s| · E_{I_s}[Y]. Since I_s and R are fixed, in order to minimize the variance of \widehat{AP}, we must minimize the variance of Y.

Var[Y] = E[Y^2] - E^2[Y]

= \sum_{d,f} I_s(d, f) \cdot rel(d)^2 \cdot rel(f)^2 \cdot SF_s(d, f)^2 - (SP/|s|)^2

= \sum_{rel(d,f)=1} I_s(d, f) \cdot \frac{J_s(d, f)^2}{I_s(d, f)^2} - (SP/|s|)^2

= \sum_{rel(d,f)=1} \frac{J_s(d, f)^2}{I_s(d, f)} - (SP/|s|)^2

To minimize this variance, it is enough to minimize the first term, since SP/|s| is fixed. Using the Cauchy-Schwarz inequality we get

\sum_{rel(d,f)=1} \frac{J_s^2(d, f)}{I_s(d, f)} = \left( \sum_{rel(d,f)=1} \frac{J_s^2(d, f)}{I_s(d, f)} \right) \cdot \left( \sum_{d,f} I_s(d, f) \right) \ge \left( \sum_{rel(d)=rel(f)=1} J_s(d, f) \right)^2 = \mathrm{CONSTANT\ (w.r.t.\ } I)

with equality iff the terms are proportional, i.e., for all relevant pairs (d, f):

\frac{J_s^2(d, f)}{I_s(d, f)} \propto I_s(d, f)

But of course we don’t have the relevance judgments; we conveniently use asprior of relevance the marginal of J (discussed in chapter 3) which we call W(for a system s):

Ws(d) = marginal(J) =1

2|s|

(1 +

1r(d)

+1

r(d) + 1+ · · ·+ 1

|s|

). (5)

Then the proportionality equation becomes

W_s^{1/2}(d) \, W_s^{1/2}(f) \cdot J_s^2(d, f) \propto I_s^2(d, f)


Product distribution

We want I(d, f) to be approximately a product distribution because we sample documents (not pairs) according to the marginal D, which here we approximate by the marginal of I. If I(d, f) is the product of its marginal distributions on d and f, the pairs formed from documents sampled according to D will “appear” as drawn from I (the exact computation of I once D is known was presented earlier).

To get I_s(d, f) \propto W_s^{1/2}(d) \cdot W_s^{1/2}(f) \cdot J_s(d, f) to be a product distribution, we note that W_s^{1/2}(d) \cdot W_s^{1/2}(f) already looks like a product, and we substitute J_s(d, f) with the closest (in KL distance) product distribution. That is ([42]) also the product of its marginals, W_s(d) \cdot W_s(f). Therefore we get

I_s(d, f) \simeq (W_s(d) \cdot W_s(f))^{3/2}

which of course means

D_s(d) \simeq W_s^{3/2}(d)

Simultaneous estimation for multiple runs

One is often faced with the task of evaluating the average precisions of many retrieval systems with respect to a given query (as in TREC), and in a naive implementation of the technique described, the documents judged for one system will not necessarily be reused in judging another system. In contrast, TREC creates a single pool of documents from the collection of runs to be evaluated, judges that pool, and evaluates all of the systems with respect to this single judged pool. In order to combat this potential inefficiency, we shall construct a single distribution over pairs of documents derived from the joint distributions J_s associated with every system s.

D(d) = \frac{1}{N_{SYS}} \sum_s D_s(d) = \frac{1}{N_{SYS}} \sum_s W_s^{3/2}(d)

This is the distribution we use for the experiments reported. I is computed using the multinomial formulas shown above with D as input. As already stated, the method works with any distribution D, the confidence of the results mostly depending on the size of the sample and the variance induced by the scaling factors. It is easy to see that only the distribution on the actually relevant documents matters; therefore a good distribution will put a lot of weight on all relevant documents. This is the potential place for heuristic improvements: incorporating experience, prior knowledge of the problem, particular behavior of search engines, etc.
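A sketch of how the prior of equation (5) and the global sampling distribution could be computed from a set of ranked lists (illustrative names only; a final renormalization is added here so that D sums to one):

from collections import defaultdict

def average_precision_prior(run):
    """W_s(d) from equation (5): (1/(2|s|)) * (1 + 1/r(d) + ... + 1/|s|)."""
    n = len(run)
    tail = 0.0
    W = {}
    for r in range(n, 0, -1):           # accumulate 1/r + 1/(r+1) + ... + 1/n bottom-up
        tail += 1.0 / r
        W[run[r - 1]] = (1.0 + tail) / (2.0 * n)
    return W

def sampling_distribution(runs):
    """Global D(d): average over systems of W_s(d)^(3/2), renormalized."""
    D = defaultdict(float)
    for run in runs:
        for d, w in average_precision_prior(run).items():
            D[d] += w ** 1.5 / len(runs)
    z = sum(D.values())
    return {d: v / z for d, v in D.items()}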

4.3.3 Estimating prec@rank and R-prec

To estimate precision-at-rank r, one could simply uniformly sample documents from the top r in any given list. Given that we sample documents according to D, we again employ appropriate scaling factors to obtain correct estimates for prec@r:

\widehat{prec@r} = \frac{1}{r} \sum_{d \in S,\, r(d) \le r} \frac{rel(d)}{D(d)/Z}

where Z = \sum_{r(d) \le r} D(d) restricts the sampling distribution only to documents ranked higher than or equal to r. In the experimental section, we report the estimates of prec@100.

[Figure 24: Sampling diagram. For each ranked list, the required distribution over pairs of documents is marginalized to a distribution over documents and raised to the power 3/2 (variance reduction); these per-system distributions are averaged into a single sampling distribution over documents (efficiency), from which the judged documents (with counts) and the estimate of R are obtained.]

R-precision is simply the precision at cutoff R. We do not know R; however, we can obtain an estimate \hat{R} of R as described above. Given this estimate, we simply estimate prec@\hat{R}.
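A sketch of these two estimates (illustrative only; the average over the sampled documents that fall in the top r is made explicit here):

def estimate_prec_at_rank(sampled_docs, rel, D, run, r):
    """Estimate prec@r from documents sampled according to D (with repetitions),
    restricting the sampling distribution to the top r of this run."""
    rank = {d: i + 1 for i, d in enumerate(run)}
    in_top = [d for d in sampled_docs if rank.get(d, r + 1) <= r]
    if not in_top:
        return 0.0
    z = sum(D.get(d, 0.0) for d in run[:r])            # Z = sum of D over the top r
    return sum(rel.get(d, 0) * z / D[d] for d in in_top) / (r * len(in_top))

def estimate_r_prec(sampled_docs, rel, D, run, r_hat):
    """R-precision is estimated as precision at cutoff R_hat."""
    return estimate_prec_at_rank(sampled_docs, rel, D, run, max(1, int(round(r_hat))))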

4.3.4 Practical summary

Below we give a summary of how to implement the sampling-with-replacement method. Figures 23 and 24 can serve as a guide.


• (1) For each system s, use the joint distribution over pairs of documents, J_s(d, f), dictated by SP, such that sampling pairs of documents according to J would yield SP in expectation. Marginalize to obtain a prior over documents and the approximate product distribution.

J_s(d, f) \propto \frac{1}{\max(r(d), r(f))}

W_s(d) = \mathrm{marginal}(J) = \frac{1}{2|s|} \left( 1 + \frac{1}{r(d)} + \frac{1}{r(d)+1} + \cdots + \frac{1}{|s|} \right).

• (2) To minimize variance, we use the marginal of J as a prior over relevant documents (Figure 24, step 2). We calculate the desired distribution over pairs, and from it we infer the sampling distribution over documents:

I_s(d, f) \simeq (W_s(d) \cdot W_s(f))^{3/2}

D_s(d) \simeq W_s^{3/2}(d)

• (3) Over all systems, compute D, the average of these sampling distributions over documents. D is our single, final sampling distribution for the given query.

D(d) = \frac{1}{N_{SYSTEMS}} \sum_s D_s(d)

• (4) Sample S documents with replacement according to D (Figure 24, black box) until T unique documents are drawn; judge these documents. (T is the a priori desired judgment effort.) Generate all K = |S|² pairs of judged documents (the count of each of the K pairs is the product of the counts of the respective sampled documents).

• (5) Compute the multinomial induced pairs distribution (Figure 23, step 1), I(d, f); this is the “effective” distribution over pairs of documents from which we sampled. From I and J_s, compute the required scaling factors.

I(d, f) = \frac{S-1}{S} D(d) D(f)

I(d, d) = \frac{1}{S} D(d) \left[ 1 + (S-1) D(d) + (1 - D(d))^{S-1} \right]

SF_s(d, f) = \frac{J_s(d, f)}{I(d, f) \cdot |K|/|K_s|}

• (6) From the induced pairs and the scaling factors, compute the estimates of SP for each system (Figure 23, black box).

\widehat{SP}_s = |s| \cdot \frac{1}{|K_s|} \sum_{(d,f) \in K_s} rel(d) \cdot rel(f) \cdot SF_s(d, f)


• (7) Estimate R using the sampled documents, with appropriate scaling factors (Figure 24).

\hat{R} = \frac{N_{DOCS}}{|S|} \sum_{d \in S} \frac{rel(d)}{D(d)}

• (8) Estimate AP by the ratio of the estimates for SP and R.

\widehat{AP}_s = \frac{\widehat{SP}_s}{\hat{R}}

• (9) Estimate prec@rank and R-prec:

\widehat{prec@r} = \frac{1}{r} \sum_{d \in S,\, r(d) \le r} \frac{rel(d)}{D(d)/Z}

\widehat{R\text{-}prec} = \frac{1}{\hat{R}} \sum_{d \in S,\, r(d) \le \hat{R}} \frac{rel(d)}{D(d)/Z}

4.3.5 Sampling with replacement results

Since the performance of the sampling method varies depending on the actual sample, we sampled 10 times and picked a representative sample that exhibited typical performance based on the three evaluation statistics used. We take one thing into consideration: the biggest qrel files we have are the depth-100 pools, which contain (per query) roughly 12% of the union set of documents. That means that any document not in this set is considered irrelevant without judgment; because we use the depth-100 MAP computation as a baseline, there is no point in sampling any document outside the depth-100 pool. Therefore we construct the distribution over the depth-100 pool set.

[Figure 25: Sampling vs. depth pooling mean average precision estimates at depths 1 and 10 in TREC8. Each dot (·) corresponds to a distribution-contributor run and each plus (+) to a distribution-non-contributor run (there are 129 runs in TREC8). Each panel reports training and testing RMS errors and rank-correlation statistics.]

[Figure 26: Sampling (bottom) vs. depth pooling (top) mean R-precision (left) and mean prec@100 (right) estimates at depths 1 and 10 in TREC8. Each panel reports training and testing RMS errors and rank-correlation statistics.]

We report the results of the experiments for MAP, MRP, and Mprec@100 on TREC8 in Figures 25 and 26. As can be seen, on TREC8, for both depth=1 (on average 29 judgments/query) and depth=10 (on average 200 judgments/query), there is a significant improvement in all three statistics when sampling is used versus TREC-style pooling, for all the measures. The sampling estimates have reduced variance and little or no bias compared to depth pooling estimates. This can be seen from the great reduction in the RMS error when the estimates are obtained via sampling. Furthermore, the bottom-right plots show that with as few as 200 relevance judgments on average per query, the sampling method can very accurately estimate the actual measure values, which were obtained using 1,737 relevance judgments on average per query.

It is important that the performance of sampling over the testing runs is virtually as good as the performance over the training runs. Note that the testing systems do not directly contribute to the sampling pool; their documents are sampled only because they happen to appear in training runs. The trend of the RMS error as the sample size increases from the depth-1 to the depth-10 equivalent, for training and testing systems, is shown in Figure 27. On the x-axis the units are the depth-pool-equivalent numbers of judgments converted into percentages of the depth-100 pool.

[Figure 27: RMS error train/test cross-validation comparisons for MAP, RP, and PC(100) in TRECs 7, 8, and 10, for depth pooling (train error) and sampling (10-run average train and test errors), as a function of the percentage of the pool judged. Equivalent depths are indicated on the plots.]


4.4 Non-uniform sampling without replacement

The sampling without replacement method proposed in this section combines the strengths of the above two methods: it retains some of the simplicity of the Yilmaz et al. random uniform sampling method infAP [110] while equaling or exceeding the performance of the far more complex with-replacement method presented in the previous section.

Furthermore, unlike the previous two methods, we demonstrate that this new sampling method can be adapted to incorporate additional judgments obtained via deterministic methods (such as traditional depth pooling), and thus our proposed method effectively generalizes and combines both random and fixed pooling techniques.

We also analyze the variance of the estimator and estimate confidence intervals, which are critical in warning, for example, of the case where the sample obtained is not appropriate for the estimate desired. This is often the case with a sample obtained based on the retrieval systems available today but used to evaluate a later system, that is, a system which did not participate in the sampling process.

Without replacement

A sample taken without replacement is a sample in which, once a unit has been selected, it is not considered for subsequent selections. This has several major implications for the overall goal of estimation:

• If we were to employ the intuitive sequential sampling as in the previous section, the sampling distribution would be changing with each trial. There are obvious quick ways to update the sampling distribution (renormalize the density function over the not-yet-selected units after each trial), but such updates are not easy to account for. The only sequential without-replacement setup where the sampling distribution need not be updated is uniform sampling (not our case), which corresponds to the infAP technique described earlier. None of the sampling schemes proposed in this section is sequential.

• There are no counts over the sampled objects; each object is either part of the sample or not.

• The probability that really matters (it does the job of both the scaling factors and the counts from the previous section) is, for each object, the overall probability of being included in the sample after all trials (the “inclusion probability”); the inclusion probability must be available for each of the sampled units.

• The estimator changes somewhat, though the principles remain the same: sample where you think the relevant documents are in order to reduce variance and increase accuracy. The scaling factors are replaced by inclusion probabilities π_k, and the estimator must be normalized by the size of the sample space. The estimate for our intuition example prec@1000 becomes

\widehat{prec@1000} = \frac{1}{1000} \sum_{d \in S,\, r(d) \le 1000} \frac{rel(d)}{π_d}.

The inclusion probability π_d is simply the probability that document d would be included in any sample of size |S|. In without-replacement sampling, π_d = D(d) when |S| = 1, and π_d approaches 1 as the sample size grows (D(d) is the sampling distribution or prior over documents). Note that documents with large inclusion probabilities (i.e., those likely to be sampled) are under-counted as compared to those with small inclusion probabilities, which are appropriately over-counted, as desired.
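A minimal sketch of this estimator; sample is a hypothetical dict mapping each sampled (judged) document to its inclusion probability π_d, and run is the ranked list being evaluated.

def ht_prec_at_c(sample, rel, run, c=1000):
    """Horvitz-Thompson estimate of prec@c from a without-replacement sample."""
    rank = {d: i + 1 for i, d in enumerate(run)}
    total = 0.0
    for d, pi in sample.items():
        if rank.get(d, c + 1) <= c:
            total += rel.get(d, 0) / pi     # likely documents are under-counted, unlikely ones over-counted
    return total / c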

As simple as it might look, there are major technical challenges even in the simplest implementations of without-replacement sampling. In particular, inclusion probabilities are very hard to compute for most sampling schemes; for example, if the sampling process is sequential as suggested above, for |S| > 2 trials the inclusion probabilities are not computable in closed form.

Average Precision as the mean of a population

Contrary to the with-replacement sampling, where we estimated Sum Precision as the mean of a random variable defined on pairs of documents and separately estimated R to obtain Average Precision, we now adopt a different view: we write Average Precision (AP) as the mean of the precisions at ranks with relevant documents:

AP = \frac{1}{R} \sum_{d:\, rel(d)=1} prec@r(d)

In statistical terms, we think of it as a population (of precision values at relevant ranks) for which we want to estimate the mean. Each relevant document in the collection corresponds to a population value: the precision at the rank of the document (0 if the document was not retrieved). Non-relevant documents do not correspond to population values, as they do not contribute precision values to AP. Of course, when sampling, we do not know which documents are relevant; so our method design is to consider precision values only for the ranks of sampled relevant documents, while using the sampled non-relevant documents to help estimate the required precision values.

4.4.1 The sample

Modularity. A critical observation is that the evaluation and sampling modules are independent: sampling does not require a particular evaluation, and no additional information (apart from the sample) is needed at the evaluation stage (a strong improvement over the method presented in [12]); even more, the sampling technique proposed is known to work with many other estimators (evaluations), while the proposed estimator is known to work with other sampling strategies [25]. That is particularly important if one has reason to believe that a different sampling strategy might work better for a particular problem.

Figure 28: Sampling and evaluation design

If there are additional judged documents, independent from the sampled documents, they can be added to the existing sample with inclusion probability 1. This is a powerful feature, as in practice it is often the case that additional judgments are available. Collisions (where a document is sampled and judged, and separately deterministically judged) are resolved by having the fixed judgment take priority over sampling, i.e., inclusion probability 1. This feature makes the method especially attractive for tasks like the Terabyte tracks [38, 37] or the newly proposed TREC Million Query track; both tracks have access to documents judged outside the sample.

Another important feature is that the inclusion probabilities are invariant to sub-sampling (in sampling with replacement we had to compute restricted distributions I_s and D_s for each system). That is, given the sample, one might want to ignore documents that deterministically satisfy a given property (for example, document length is smaller than a threshold, or the document is not included in a particular given list/set/category); the remaining sample, with its already computed inclusion probabilities, constitutes a valid outcome of a without-replacement strategy for the required setup. In particular, this makes the sample usable to evaluate systems that have not contributed to the sampling stage, if enough sampled documents are in their retrieved lists.

The simplicity of this interface makes it suitable for many settings. In the TREC setup it can be used in two ways: first, when massive central judging is done by assessors, it might be desirable to judge deterministically to a given depth (say the top 20 documents of every participant list) and then invoke the sampling strategy to judge additional documents. Second, when a participant receives the judgments from TREC (the “qrel” file [107]), if more judgments are needed, one can judge either hand-picked documents (if the distributed judgments are sampled) or extra hand-picked and sampled documents (if the original received judgments are deterministic) and add them to the provided sample. Any combination works as long as one of the sets of judgments to be combined is done deterministically (inclusion probability 1); in other words, two independently sampled sets generally cannot be combined unless they were done over disjoint sets of documents.

4.4.2 Estimators

Let us assume for now that we have a sampling strategy that produced a sample S of (value, inclusion probability) pairs (v_k, π_k). How can we estimate, say, the total population value? Or the mean?

Horvitz-Thompson estimator

As previewed earlier, the Horvitz-Thompson [94] estimates of the total and the mean of the population are:

\widehat{HT}_{total} = \sum_{k \in S} \frac{v_k}{π_k} \qquad \widehat{HT}_{μ} = \frac{1}{M} \sum_{k \in S} \frac{v_k}{π_k}

where M is the population size.

Variance induced by the prior and/or sampling scheme. It is easy to see that if the inclusion probabilities are proportional to the values observed, then the variance of the HT estimators is zero, and vice versa. If one can design a sampling scheme (usually consisting of a prior and a sampling mechanism) that leads to inclusion probabilities roughly proportional to the units' values —called “pps” or proportional-to-size sampling— then the HT estimators are very accurate.

Say we want to estimate the total weight of a set of animals composed of elephants, bulls and chickens. Instead of measuring the weight of each animal and adding up to get the total, we are going to use non-uniform without-replacement sampling. Since the weight of each elephant/bull/chicken is essentially the same relative to the other categories, we shall use a prior that assigns a constant probability to each elephant/bull/chicken. Then obviously the inclusion probabilities (computed given the prior and the sampling scheme) will also be constant for each type of animal; let us call them πe, πb and πc. If We, Wb and Wc are the (roughly constant) weights for each type of animal, then to minimize the variance of the total weight estimate we need to choose the prior such that the inclusion probabilities end up roughly proportional with the animal weights:

$$(\pi_e, \pi_b, \pi_c) \propto (W_e, W_b, W_c)$$

Of course, we do not know the weights of the animals, but we might have some prior knowledge (say obtained from averages of previous observations) of We, Wb and Wc; this prior knowledge is not good enough to use as fact in our estimate of the total weight, but it is good enough to base the prior on. To see how the prior affects the variance, assume that we take a sample S of |S| animals with inclusion probabilities πk and observed weights wk. Then we estimate the total weight as

$$\widehat{HT}_{total} = \sum_{k \in S} \frac{w_k}{\pi_k}$$

If our inclusion probabilities are exactly proportional with the actual animal weights, then for every animal k

$$\frac{w_k}{\pi_k} = \frac{\sum_k w_k}{\sum_k \pi_k} = \frac{W}{|S|}$$

which means $\widehat{HT}_{total} = W$, independent of the sample taken (so our estimator has zero variance).
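To make the zero-variance property concrete, here is a tiny numeric check with made-up animal weights and sample size 2 (none of these numbers come from the thesis): when πk is exactly proportional to wk, every possible sample returns the true total.

    # Numeric check of the zero-variance property under exact pps (illustrative values only).
    from itertools import combinations

    weights = [4000, 4000, 800, 800, 3, 3, 3, 3]   # elephants, bulls, chickens (made up)
    W = sum(weights)
    S = 2                                          # sample size
    pis = [S * w / W for w in weights]             # inclusion probabilities proportional to weight

    estimates = {round(sum(weights[i] / pis[i] for i in idx), 6)
                 for idx in combinations(range(len(weights)), S)}
    print(estimates)                               # a single value: the true total W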

Ratio estimators

If the population size M is unknown (suppose that we did not know the number of registered voters in an election forecasting analogy), then one can estimate this quantity as well:

$$\hat{M} = HT_{size} = \sum_{k \in S} \frac{1}{\pi_k}$$

This yields the Horvitz-Thompson generalized ratio estimator [94]

$$\hat{\mu} = \frac{\sum_{k \in S} w_k / \pi_k}{\sum_{k \in S} 1 / \pi_k}$$

The generalized ratio estimator is most useful for estimating average precision (AP), which is the average of the precisions at relevant documents; therefore the population size is the number of relevant documents R, which is unknown. Thus the generalized ratio estimator is applicable, since it effectively estimates R as well. The "values" we wish to average are the precisions at relevant documents, and the ratio estimator for AP is thus

$$\widehat{AP} = \frac{\sum_{d \in S, rel} \widehat{prec@r(d)} / \pi_d}{\sum_{d \in S, rel} 1 / \pi_d} = \frac{1}{\hat{R}} \sum_{d \in S, rel} \frac{\widehat{prec@r(d)}}{\pi_d} \qquad \text{where } \hat{R} = \sum_{d \in S, rel} \frac{1}{\pi_d}$$

Precisions at ranks need to be estimated too, both for average precision and for the prec@rank measure estimate. We do so using the Horvitz-Thompson estimator [94]:

$$\widehat{prec@r} = \frac{1}{r} \sum_{d \in S,\; r(d) \le r} \frac{rel(d)}{\pi_d}$$


While this is a valid precision estimator in general, in the case of average precision, precision is always estimated at a relevant rank. Therefore the estimator should account for that relevant document being there [110] by setting the inclusion probability for the document at rank r to 1:

$$\widehat{prec@relevant\ r} = \frac{1}{r} \left( 1 + \sum_{d \in S,\; r(d) < r} \frac{rel(d)}{\pi_d} \right)$$

We also apply the same PC (precision at cutoff) estimation to obtain R-precision, with the estimated R as the rank: $\widehat{R\text{-}precision} = \widehat{prec@\hat{R}}$.
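A minimal Python sketch of these estimators follows (hypothetical data structures: sample maps each sampled document to its (relevance, inclusion probability) pair, ranking is one system's ranked list of document ids); it illustrates the formulas above rather than reproducing the exact evaluation code used in the experiments:

    # Sampling-based estimators for prec@r and AP (binary relevance assumed).
    def estimate_prec_at(r, ranking, sample):
        total = 0.0
        for doc in ranking[:r]:
            if doc in sample:
                rel, pi = sample[doc]
                total += rel / pi
        return total / r

    def estimate_ap(ranking, sample):
        R_hat = 0.0      # estimated number of relevant documents
        ap_sum = 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in sample and sample[doc][0] == 1:
                rel, pi = sample[doc]
                # prec@(relevant rank): the document at this rank counts with probability 1
                prec = (1 + sum(sample[d][0] / sample[d][1]
                                for d in ranking[:rank - 1] if d in sample)) / rank
                ap_sum += prec / pi
                R_hat += 1 / pi
        return ap_sum / R_hat if R_hat > 0 else 0.0

R-precision can then be estimated as estimate_prec_at applied at the rank given by the rounded estimate of R.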

4.4.3 Sampling designs, computation of inclusion probabilities

As in sampling with replacement, any prior over documents can work as the sampling distribution. In fact we experimented with several quite different priors and show that they all produce reasonable estimates, in the sense that there is little or no bias. The difference in choosing various priors is in the variance of the estimate: different priors obviously lead to different inclusion probabilities; it is well known that the variance of the Horvitz-Thompson estimator is minimized when, for each object, the value sampled is roughly proportional with the inclusion probability. However it is not easy to design a sampling scheme that produces the desired inclusion probabilities, nor is it easy to compute inclusion probabilities for a given sampling scheme.

Note again that if we are wrong in guessing the prior, the estimate will still be correct in expectation, but it will have a larger variance; the worse the prior, the higher the variance. For IR evaluation, the population values are the prec@rank values, so we should design the sampling strategy such that the resulting inclusion probabilities are roughly proportional with the prec@rank estimates as dictated by the chosen prior of relevance. Next we present several priors that have been tried; later in this section we present the sampling strategies.

AP prior

The AP prior was derived in the previous section:

$$D(d) = \frac{1}{N\_SYSTEMS} \sum_s W_s^{3/2}(d)$$

where $W_s$ is the marginal of the joint table induced by average precision over the ranks of the list $|s|$:

$$W_s(d) = \frac{1}{2|s|} \left( 1 + \frac{1}{r(d)} + \frac{1}{r(d)+1} + \cdots + \frac{1}{|s|} \right).$$

This is the prior used for almost all sampling experiments in this section, and also in the next chapter, for the Million Query Track 2007.
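For concreteness, a sketch of how such a prior could be computed from the contributing runs is shown below; the input runs (a list of ranked docid lists) is a hypothetical structure, and the 3/2 exponent follows the formula as reconstructed above, so treat this as an approximation rather than the actual implementation:

    from collections import defaultdict

    def ap_prior(runs):
        """runs: list of ranked docid lists, one per contributing system."""
        weight = defaultdict(float)
        for ranking in runs:
            n = len(ranking)
            tail = 0.0                      # suffix sum 1/r + 1/(r+1) + ... + 1/n
            w_s = [0.0] * n
            for rank in range(n, 0, -1):
                tail += 1.0 / rank
                w_s[rank - 1] = (1.0 + tail) / (2 * n)
            for rank, doc in enumerate(ranking):
                weight[doc] += w_s[rank] ** 1.5       # W_s(d)^(3/2)
        # the 1/N_SYSTEMS factor is absorbed by normalizing to a distribution
        total = sum(weight.values()) or 1.0
        return {doc: w / total for doc, w in weight.items()}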


Figure 29: Sampling vs. depth pooling mean average precision estimates using the AP prior (left), uniform prior (middle) and anti-AP prior (right) in TREC8. Each dot (·) corresponds to a distribution-contributor run and each plus (+) to a distribution-non-contributor run (there are 129 runs in TREC8).

Uniform prior and anti-AP prior

In the IR research community, concerns were raised that the sampling method works so well only with a carefully designed prior, like the AP-based prior. While certainly the AP prior is designed to minimize variance, any prior should work in the sense that the estimates should be [almost] unbiased, especially for reasonably large sample sizes.

Theoretically, it is already established [94] that the expected values of the estimates have little to do with the sampling distribution used. To prove this point empirically, we ran an experiment where a sample of size 300 per query (or 16% of the TREC judging effort) was taken using the same sampling scheme, but based on three different priors: the AP prior, the anti-AP prior (where documents weighted higher by the AP prior are given less weight) and a uniform prior. The sampling results are presented in Figure 29. Note that the uniform prior performance is, as expected, similar to the performance of infAP [110] (described earlier, also based on uniform random sampling). The estimates using the anti-AP prior (right on the figure) are valid estimates, certainly usable, but with more variance than the ones obtained with a uniform prior, while the AP prior estimates have the smallest variance. This is as expected: the better the prior, the less variance.

Theoretical optimal prior

We used the AP prior for most of the experiments and analysis, due to its simplicity and efficiency shown in practice. Here we show a theoretically optimal prior in the proportional-to-size sense. If $W_s$ is a prior of relevance for system s, then the expected prec@rank is

$$E_s[prec@r] = \frac{1}{r} \sum_{r_s(f) \le r} W_s(f)$$

Since the values observed on sampled units are precision values, "proportional-to-size" for system s dictates that the inclusion probability of the unit represented by d be roughly proportional to $E_s[prec@r(d)]$. Assuming a sampling scheme (as we show later) that produces inclusion probabilities approximately proportional to the input prior, the prior for system s is thus given by $E_s[prec@r(d)]$. For a global prior, we need to average over systems:

$$optimal\_prior(d) = \frac{1}{N\_SYSTEMS} \sum_s \frac{1}{r_s(d)} \sum_{r_s(f) \le r_s(d)} W_s(f)$$

Bucketing sampling strategy

There are many ways one can imagine sampling from a given distribution [25]. For us, the most desirable features are:

• practicality for IR (simple and efficient);

• ability to add existing judged documents to the pool obtained by sampling and use all of them for evaluation;

• computability of inclusion probabilities for documents (πd) and for pairs of documents (πdf);

• handling of a non-uniform sampling distribution as input;

• probabilities proportional with size (pps): we want the resulting inclusion probability πd proportional with the input sampling distribution.

We adapt a method developed by Stevens [25, 92], sometimes referred to as stratified sampling or bucket sampling. Let D be the prior sampling distribution over N documents and m be the desired sample size (Figure 30, top). Stevens stratified sampling works as follows:

1. Order documents by sampling weight and partition them into buckets of m documents each (Figure 30, second row). The first bucket will contain the largest (by D) m documents, the second bucket the next m documents, and so on. The last bucket might have fewer than m documents, but that fact is negligible.

2. Pick buckets with replacement m times, where each time a bucket is chosen with probability equal to its cumulative sample size, i.e., the sum of the D weights associated with the documents in the bucket (Figure 30, third row).

3. For each bucket, if it got picked at stage (2) k times, sample uniformly, without replacement, k documents from the bucket (Figure 30, bottom; black bars indicate sampled documents).

Figure 30: Average Precision induced prior W, averaged over many system lists (top). Bucketed prior (second row): each bucket contains m = 14 items (in this example) and is associated with the sum of the distribution weights of its items. Third row: buckets are sampled with replacement, obtaining, for example, counts 7, 4, 2, 0, 1, 0 (summing to m = 14). Bottom: inside each bucket documents are sampled uniformly, without replacement: from the first bucket 7 items, from the second bucket 4 items, and so on.

Obviously, this strategy is fast and simple. Although buckets are sampled with replacement at first, the overall sample of documents is without replacement. Also, the inclusion probabilities πd for each document are straightforward: say d belongs to a bucket (of size m) with cumulative sample size g, and let X denote the random variable indicating how many times this bucket is picked (out of m times total for all buckets, step 2). Then

$$\pi_d = \sum_{k=1}^{m} Prob[X = k] \, \frac{k}{m} = \frac{1}{m} E[X] = \frac{1}{m} m g = g$$
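A compact sketch of this bucketing scheme (assuming the prior D is normalized to sum to 1, so that a bucket's cumulative weight g is a probability; variable names are illustrative):

    import random

    def stevens_sample(prior, m, rng=random.Random(0)):
        """prior: docid -> D(d) (sums to 1); m: desired sample size."""
        # 1. Order documents by weight and split into buckets of m documents each.
        docs = sorted(prior, key=prior.get, reverse=True)
        buckets = [docs[i:i + m] for i in range(0, len(docs), m)]
        bucket_weight = [sum(prior[d] for d in b) for b in buckets]

        # 2. Pick buckets with replacement m times, proportionally to cumulative weight.
        counts = [0] * len(buckets)
        for pick in rng.choices(range(len(buckets)), weights=bucket_weight, k=m):
            counts[pick] += 1

        # 3. Inside each bucket, draw uniformly without replacement as many documents
        #    as times the bucket was picked; inclusion probability is the bucket weight g.
        sample = {}
        for bucket, k, g in zip(buckets, counts, bucket_weight):
            for d in rng.sample(bucket, min(k, len(bucket))):
                sample[d] = g
        return sample     # docid -> inclusion probability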


This shows that the sampling strategy produces inclusion probabilities approximately proportional to the sampling probabilities. While g is the inclusion probability for all documents in the bucket (which do not have identical sampling probabilities, but close, due to bucketing), for small m it is safe to say the method is pps; for large sample sizes, the documents selected would appear more uniformly sampled than the prior dictates, but it is clear that as the sample size grows, the benefit of more documents being judged heavily outweighs the disadvantage of uniform-ization of the actual sampling distribution.

We can also compute the inclusion probability for each pair of documents (d, f), that is, the probability that both documents are included in the sample. We derive πdf by distinguishing two cases. If d and f belong to the same bucket with total probability g, then

$$\pi_{df} = \sum_{1 \le k \le m} \frac{k(k-1)}{m(m-1)} Prob[X = k] = \sum_{1 \le k \le m} \frac{k(k-1)}{m(m-1)} \binom{m}{k} g^k (1-g)^{m-k} = g^2 = \pi_d \pi_f$$

If documents d and f belong to different buckets with total probabilities g and h respectively, and X, W are the random variables indicating how many times each bucket got picked, then

$$\pi_{df} = \sum_{1 \le k+l \le m} \frac{kl}{m^2} Prob[X = k; W = l] = \sum_{1 \le k+l \le m} \frac{kl}{m^2} \binom{m}{k,\, l,\, m-k-l} g^k h^l (1-g-h)^{m-k-l} = \frac{m-1}{m} gh = \frac{m-1}{m} \pi_d \pi_f$$

The method presented above ("Stevens") is used to generate all the results for the non-uniform sampling section of this chapter; it is the method used for sampling in the Million Query tracks described in the next chapter, and it is also what we refer to as "non-uniform sampling" for the rest of the thesis. We present two slightly different sampling schemes that can be used just as well with the same estimator, and a discussion of which scheme best suits various scenarios.

Sampling method - variant 2

Given the desired sample size m, partition the population into m buckets of approximately the same cumulative sampling size. Note that this is not trivial to achieve, as the knapsack problem is NP-complete, but any approximation will do.


Then choose at random one document in each bucket, according to the relative sampling probabilities inside the bucket. If a document d happens to belong to a bucket with cumulative sampling size g, then its inclusion probability is

πd = D(d)/g

If one manages to partition the documents into buckets as suggested (buckets have roughly the same size), then g ≈ 1/m and πd ≈ mD(d), which shows that the inclusion probabilities produced are roughly proportional to their sampling (prior) probabilities, as long as the bucketing is not too far off. This, like for the main sampling scheme proposed earlier, is a desirable property: one can choose any sampling distribution D and trust that the resulting inclusion probabilities are roughly proportional to it.
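A possible greedy realization of this variant is sketched below (the bucket formation is only an approximation, as discussed above; the prior is assumed normalized to sum to 1, and all names are illustrative):

    import random

    def variant2_sample(prior, m, rng=random.Random(0)):
        """m buckets of roughly equal cumulative weight; one document drawn per bucket."""
        target = 1.0 / m
        buckets, current, weight = [], [], 0.0
        for d in sorted(prior, key=prior.get, reverse=True):
            current.append(d)
            weight += prior[d]
            if weight >= target and len(buckets) < m - 1:
                buckets.append(current)
                current, weight = [], 0.0
        if current:
            buckets.append(current)

        sample = {}
        for bucket in buckets:
            g = sum(prior[d] for d in bucket)
            chosen = rng.choices(bucket, weights=[prior[d] for d in bucket], k=1)[0]
            sample[chosen] = prior[chosen] / g        # pi_d = D(d) / g
        return sample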

There is, however, a small caveat with this method: while the inclusion probabilities πd are trivial, the inclusion probability for pairs of documents is tricky: if documents d and f belong to buckets of cumulative sample size g and l respectively, then

$$\pi_{df} = \pi_d \pi_f \approx D(d) \cdot D(f) \cdot m^2$$

but if d and f happen to be partitioned into the same bucket, then πdf = 0, since only one document is selected from each bucket. For a bucketing procedure that is deterministic, many inclusion probabilities are zero, making the variance hard to estimate. For a bucketing procedure that is non-deterministic, perhaps most pair-inclusion probabilities are nonzero, and the variance can be estimated.

Advantage: the inclusion probabilities are closer (they can even be exact) to the prior probabilities, as long as m buckets can be formed of the same size. Disadvantage: it is hard to partition units into buckets of roughly the same cumulative size, especially for large m. In particular, for large m, some units alone would be larger than the desired size of the bucket, so they form a bucket by themselves and are thus included in the sample with probability 1.

While the ratio estimators work with this sampling scheme, a slightly better, unbiased estimator is given by Rao-Hartley-Cochran [25]: since the buckets might not have the same cumulative size, use the Horvitz-Thompson estimator on each bucket separately and appropriately average the results. The RHC estimator is the same as the HT estimator when the buckets have exactly the same cumulative size.

Both this sampling scheme and the previous one are good approximations of the ideal case when the sample size m is small. When m is large enough, the inclusion probabilities are no longer proportional to the sampling distribution: in the first scheme, the buckets will be very large and all documents in one bucket are sampled uniformly; in the second scheme, the number of buckets will be very large and it is therefore harder to make them all the same size (some units will form a bucket by themselves). Fortunately, especially for the IR case, large samples are not something to worry about: we show experimentally that at a sufficiently large sample size (say 10%) our estimates are virtually perfect.


Sampling method - variant 3

Is there a way to get inclusion probabilities proportional to the sampling distribution independent of the sample size? The answer is "NO": if the sample size is large enough, all inclusion probabilities approach 1.

However, assuming that a "large sample" is still relatively small compared to the collection of documents, if one is willing to give up the requirement of a fixed sample size (only the expected size is known), there is a way: if m is the expected sample size, sample each unit d independently of the other units with probability m · D(d). Of course it is not known how many units are going to be included in the actual sample, but in expectation the number is m. For budgeting purposes, since in IR we do this for many queries, the overall sample size is likely to be very close to the expected number of documents sampled. This method has been used in the TREC Legal Track evaluation [95].

Advantages: the inclusion probability is equal to the sampling probability; the procedure is extremely easy and fast. Disadvantage: it is not known in advance how big the sample is going to be; even if at some point during the sampling process the sample exceeds the targeted size, the sampling must continue, to determine independently for each (remaining) unit whether it is part of the sample or not.
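A sketch of this independent (Poisson-style) selection, assuming m · D(d) ≤ 1 for every document (illustrative names only):

    import random

    def variant3_sample(prior, m, rng=random.Random(0)):
        """Include each document independently with probability m*D(d); the sample
        size is random (m in expectation) and pi_d equals the selection probability."""
        sample = {}
        for d, p in prior.items():
            if rng.random() < m * p:
                sample[d] = m * p        # docid -> inclusion probability
        return sample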

4.4.4 Sampling without replacement results

We tested the proposed method as a mechanism for estimating the performance of retrieval systems using data from TRECs 7, 8 and 10. We used mean average precision (MAP), mean R-precision (MRP), and mean precision at cutoff 30 (MPC(30)) as evaluation measures. We compared the estimates obtained by the sampling method with the "actual" evaluations, i.e., evaluations obtained by depth-100 TREC-style pooling. The estimates are found to be consistently good even when the total number of documents judged is far less than the number of judgments used to calculate the actual evaluations.

To evaluate the quality of our estimates, we calculated three different statistics: root mean squared (RMS) error (how different the estimated values are from the actual values, i.e., $RMS = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(a_i - e_i)^2}$, where $a_i$ are the actual and $e_i$ the estimated values), the linear correlation coefficient ρ (how well the actual and estimated values fit a straight line), and Kendall's τ (how well the estimated measures rank the systems compared to the actual rankings); τ is a function of the minimum number of pairwise adjacent interchanges needed to convert one ranking into the other. Both ρ and Kendall's τ values range from −1 (perfectly negatively correlated values) to +1 (perfectly correlated values). Note that in contrast to the RMS error, Kendall's τ and ρ do not measure how much the estimated values differ from the actual values. Therefore, even if they indicate perfectly correlated estimated and actual values, the estimates may still not be accurate. Hence, it is much harder to achieve small RMS errors than to achieve high τ or ρ values. In all our experiments, when computing the average sampling distribution we only use the runs that contributed to the pool (training systems), and use the runs that did not contribute to the pool as testing systems. In the plots that follow, the dots (·) correspond to the training systems, and the pluses (+) correspond to the testing systems.

Figure 31: Sampling vs. depth pooling mean average precision estimates at depths 1 and 10 in TREC8. Each dot (·) corresponds to a distribution-contributor run and each plus (+) to a distribution-non-contributor run (there are 129 runs in TREC8).
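For reference, the three statistics can be computed directly with scipy; the sketch below assumes actual and estimated are parallel lists of per-system scores (names are illustrative):

    import math
    from scipy.stats import pearsonr, kendalltau

    def evaluation_statistics(actual, estimated):
        rms = math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(actual))
        rho, _ = pearsonr(actual, estimated)      # linear correlation coefficient
        tau, _ = kendalltau(actual, estimated)    # rank correlation of the system orderings
        return rms, rho, tau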

We compare the sampling estimates (for various sample sizes) with "true" values obtained from the fully judged depth-100 pool. The sample sizes are based on depth pooling equivalence; depths 1 (top document for every run) and 10 (top 10 documents for every run) are displayed. TREC-style depth pooling for depths 1 and 10 corresponds to 40 and 260 relevance judgments on average per query, respectively (including the pool non-participating runs). We also compare the estimated values of the measures obtained using the sampling method with the depth pooling estimates.

Since the performance of the sampling method varies depending on the actual sample, we sampled 100 times and picked a representative sample that exhibited typical performance based on the three evaluation statistics used.

We report the results of the experiments for MAP, MRP, and MPC(30) on TREC8 in Figure 31, Figure 32, and Figure 33, respectively. As can be seen, on TREC8, for both depth 1 (on average 29 judgments/query) and depth 10 (on average 200 judgments/query), there is a significant improvement in all three statistics when sampling is used versus TREC-style pooling, for all the measures: the sampling estimates have reduced variance and little bias compared to depth pooling estimates; furthermore, the bottom-right plots of the figures show that 200 relevance judgments on average per query are enough to get "almost perfect" evaluations (τ ≈ 0.95); actual TREC8 evaluations use 1,737 relevance judgments on average per query.

Figure 32: Sampling vs. depth pooling mean R-precision estimates at depths 1 and 10 in TREC8.

Figure 34 illustrates how MAP estimates using TREC-style depth pooling compare, in terms of ρ and Kendall's τ, with those obtained using sampling as the depth of the pool changes. For depths 1 to 10, we first calculated the number of documents required to be judged using TREC-style depth pooling. Generally speaking, the results obtained are comparable to or better than a previous (much more complicated) sampling method [12].

For each sample size (equivalent depth pool size) we repeated the experiment 100 times and then calculated the average ρ (left column), RMS (middle) and τ (right column). Along with the averages displayed in Figure 34, for ρ and τ we plot the standard deviation bar, estimated (unbiased) from the 100 values [3] (sampling line on the plots, bar shows ±1 std). As the figure displays, the sampling method significantly outperforms the TREC-style depth pooling evaluations. For comparison purposes, we also include the average Kendall's τ values of bpref [29] and infAP [110], obtained using random samples of the given size, in the plots in the second column. The Kendall τ values for bpref and infAP are the average values computed over 10 different random samples.

Figure 33: Sampling vs. depth pooling mean prec at cutoff 100 estimates at depths 1 and 10 in TREC8.

Per query and per run results

There are certain situations when one needs the results of a single query, hence not taking advantage of the variance reduction achieved by averaging over 50 queries. It is certainly not expected to see the same kind of performance on a per-query basis; however, our results show definitely usable per-query estimates (Figure 35). The method described in this chapter is self-contained for a query, i.e., estimates for a query do not depend on any data from other queries. In a different setup, one may want to analyze only one run over all queries (Figure 36).

Using additional judged documents

We have shown that the proposed sampling method is very simple, practical and efficient for system evaluations (AP, RP and prec@30). The method design involves two stages: sampling based on a carefully computed distribution, and then evaluation of the systems. However, in practice, in many cases, additional judgments are available from various sources. Most importantly, those judgments are totally independent of the judgments picked by sampling, and some may be common (collisions). We next demonstrate that the evaluation stage of our method can use the additional judgments.

Figure 34: Linear correlation coefficient, Kendall's τ and RMS error comparisons for mean average precision, in TRECs 7, 8 and 10.

Figure 35: Sampling estimates for a query with mixed system performance. Dots (·) represent training runs; pluses (+) represent testing runs.


Figure 36: Sampling estimates for a fixed typical run (Sab8A1) with MAP = 0.25, all queries. Each dot (·) is an AP estimate for a query (50 in total); the MAP estimate is plotted as "×".

Additional judgments are considered independent of the sampled ones. They could be judged before or after the sampling process; the sampling stage has no knowledge of them. In a very simple fashion, the additional judgments are added to the sample with inclusion probability 1 (each), along with their judged relevance. In the case of a collision, the fixed added judgment takes priority, i.e., the inclusion probability is 1 (and not the one computed via sampling).
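In terms of the sample representation used in the earlier sketches (docid mapped to a (relevance, inclusion probability) pair), the merge is a few lines; this is illustrative only:

    def add_fixed_judgments(sample, extra_judgments):
        """extra_judgments: docid -> relevance judged outside the sampling process.
        Fixed judgments enter with inclusion probability 1 and win any collision."""
        merged = dict(sample)
        for doc, rel in extra_judgments.items():
            merged[doc] = (rel, 1.0)
        return merged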

Figure 37 presents evaluation results when additional judgments are used. The top three plots (left to right: sampling, sampling + added judgments, depth pooling) show results for a particular query on TREC8. A comparison between the left plot and the middle plot shows a significant improvement when additional documents are added; this corresponds to a scenario where sampling is done first and additional judgments are provided later. The combined result shows a significant improvement over the depth pool strategy (right plot), as expected and demonstrated above; this corresponds to a scenario where depth pool judgments are available to start with and sampling is used to obtain additional judgments (it is essentially sampling a pool that does not contain documents already judged). The bottom three plots show the corresponding results for all queries on TREC8.

Figure 37: Sampling combined with depth pooling judgments (middle) compared with sampling (left) and depth pooling (right) evaluations on TREC8. Top plots are for a fixed query, while bottom plots are averages over all queries.


4.5 Confidence intervals and reusability

For the sampling without replacement technique (previous section) we analyze bias and variance, and explain the implications for confidence in the estimates and for reusability of the sample.

New retrieval runs

A question of particular importance is how we can use the samples generated by our method to evaluate a new run, i.e., a run that did not contribute to the sampling distribution. In order for sampling to work correctly, there should be sufficiently many sampled documents in the new run so that the evaluation using sampling is meaningful.

The evaluation (estimation) methodology is independent of whether the run participated in the sampling process; therefore it can be applied to new runs in the same way as to the runs used in producing the sample. For the scale factor computation, the numerator is a function of the ranks of the sampled documents in the new list, and the denominator is computed based on the sampling distribution conditioned on the new run.

In TREC data, it is already the case that the actual pools are created from only a subset of the submitted runs. The trend of the RMS error, as the sample size increases from the depth-1 to the depth-10 equivalent, for training and testing systems, is shown in Figure 38. On the x-axis the units are the depth-pool equivalent number of judgments converted into percentages of the depth-100 pool.

Bias

While the prec@rank and R estimators are unbiased, the ratio estimator for AP is not guaranteed to be unbiased. Some of our results have a small positive bias, but it is negligible for any practical situation and definitely a large improvement compared to the depth-pooling evaluation bias. Unbiased ratio estimators have been proposed [25], but they are far more complex. Since for reasonably large sample sizes the bias of the ratio estimator is very small, we think it is worth using it for simplicity.


Figure 38: RMS error train/test cross-validation comparisons for MAP, RP, PC(30), in TRECs 7, 8 and 10. Equivalent depths are indicated on the plot.

Variance

We show next how to compute the variance of the AP estimator. For a given system, if we denote yd = prec@rs(d) − APs for every document d in the collection, we obtain the following formula for the variance, adapted from [94]:

$$var(\widehat{AP}) = \frac{1}{R^2} \left[ \sum_{all\ docs:\, rel(d)=1} \frac{1-\pi_d}{\pi_d^2} y_d^2 \;+\; \sum_{rel(d)=1} \sum_{f \ne d,\, rel(f)=1} \frac{\pi_{df} - \pi_d \pi_f}{\pi_d \cdot \pi_f} \, y_d y_f \right]$$

provided that all joint inclusion probabilities πdf are greater than zero. We shall make this formula a variance estimator, but note that this calculation also assumes that once a unit is sampled (for us, a relevant document d), the unit value (prec@r(d)) is immediately available. Our estimator has more variance (than what this formula gives) because of the variance inherited from the prec@rank estimates.

In the presented IR sampling setup, only documents that are retrieved by the systems in question can be sampled; the other documents cannot be sampled, as they have an inclusion probability of 0. If we want to use this method with a system that retrieved such documents (reusability), the variance computation has to be altered to reflect the fact that some documents could not possibly have been part of the sample (note that, in practice, we cannot distinguish between documents not sampled due to unavailability, πd = 0, and documents simply not selected by the sampling scheme, πd > 0, since we only keep track of the selected documents).

To account for some documents being ignored (πd = 0), we use a variance estimator that accounts for the number of units selected for a given system (we call that |Ss|). Let yd = prec@rank(d) − AP; then

$$var(\widehat{AP}) = \frac{1}{\hat{R} \cdot |S_s|} \left[ \sum_{d \in S, rel} \frac{1-\pi_d}{\pi_d^2} y_d^2 \;-\; \frac{1}{|S|-1} \sum_{d, f \in S, rel} \frac{y_d \cdot y_f}{\pi_d \cdot \pi_f} \right]$$

since for our sampling design, if d ≠ f, we have $\pi_{df} \approx \frac{|S|-1}{|S|} \pi_d \pi_f$.

Finally, since the set of queries Q is chosen randomly and independently, the variance of the MAP will decrease linearly with the number of queries:

$$var(\widehat{MAP}) = \frac{1}{|Q|^2} \sum_{q \in Q} var(\widehat{AP_q})$$

Confidence interval

For sufficiently many queries, the MAP estimator would appear normally distributed (Central Limit Theorem, Figure 39); then the 95% confidence interval is given by ±2 std, i.e., $\pm 2\sqrt{var(\widehat{MAP})}$.

Figure 40 shows the confidence intervals plotted around the estimated MAP value for each system in TREC8. Naturally, with small sample sizes the variance is higher and so are the confidence intervals, while at larger sizes they are smaller.
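Putting the pieces together, a small sketch of the interval computation (assuming per-query AP estimates and their estimated variances, computed as above, are available; names are illustrative):

    import math

    def map_confidence_interval(ap_estimates, ap_variances):
        q = len(ap_estimates)
        map_est = sum(ap_estimates) / q
        var_map = sum(ap_variances) / (q * q)     # var(MAP) = (1/|Q|^2) * sum var(AP_q)
        half_width = 2 * math.sqrt(var_map)       # ~95% interval under the normal approximation
        return map_est - half_width, map_est + half_width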

In most cases, if the confidence intervals (centered at the MAP values) of two systems overlap, that is a strong indication that the true MAP values are very close; when they do not overlap, it is a strong indication that the MAP values are significantly different. Empirical tests show that our variance estimator slightly underestimates the true variance, accounting for about 90% of its value, so in practice slightly larger confidence intervals should be used.

For independent queries, the variance of the MAP decreases linearly with the number of queries; with more than 1000 queries, as is the case with the Million Query Track, the MAP confidence interval length is very small, and that truly reflects the confidence that the MAP value is very close to the mean of the estimator. However, in such cases the slight bias of the estimator may affect the confidence intervals considerably.

4.5.1 Robust Track 05 and the sab05ror1 system

In the TREC Robust Tracks [103], while the task is essentially ad-hoc retrieval, the evaluation emphasizes difficult queries. One of the measures used is GMAP [82], or Geometric-Mean Average Precision, which, as any geometric mean, depends much more on the smallest values than the arithmetic mean (MAP) does. In Robust Track 2005, the 50 queries used were previously known to be difficult; they were run against a new collection, AQUAINT.

Figure 39: The binomial distribution approaches the normal distribution; 95% of the mass under the normal distribution curve is located inside the ±2 std interval.

Zobel [115] showed that many relevant documents are missed by pooling techniques, and questioned the validity of system evaluation, especially for systems that seem to find more unique relevant documents ("unique" refers to documents not retrieved by other systems). Zobel designed the following test: use your favorite document selection mechanism (for example sampling or pooling) on all systems (and evaluate), and then use the same selection on all systems except one (and evaluate). Naturally, a system that retrieves many unique relevant documents is under-evaluated when it does not participate in document selection.

The routing system sab05ror1 from Sabir Research [26] found 405 unique relevant documents. If the system was included in the pool, all these documents were used for evaluation and sab05ror1 obtained a MAP of 0.262; when excluded from the pool, the unique relevant documents were not considered and its MAP dropped to 0.22, a significant and alarming difference.


Figure 40: Confidence intervals for various sample sizes, TREC8 runs.

Sampling evaluation (or any kind of evaluation based on pooling) would obviously suffer for the same reason: if several systems are used to get access to documents, then any documents not retrieved by these pooled systems are simply ignored; if a run like sab05ror1 finds many relevant documents not available at the sampling stage, its evaluation would be decreased. There is nothing any technique (sampling included) can do about this. However, even if the estimates could be way off in such cases, our sampling technique can warn that there are too few documents available in the sample for estimating the performance of a given system. This is reflected in the variance calculation, and hence in the confidence interval length: unusually large confidence intervals serve as a warning that a specific system needs more attention.

Figure 41 shows the difference in the sampling evaluation of sab05ror1 (red on the plots) when all systems are used for sampling (left) versus the evaluation performed on a sample taken excluding all Sabir systems (right). When not participating in the sample, the system sab05ror1 is evaluated badly (as expected, [27]) but gets a large confidence interval compared to other systems, most likely due to very few documents sampled from its list. Note that in these particular plots, the confidence interval still "catches" the diagonal line, which means the true MAP value is inside.


Figure 41: Sampling of size 50 results with Robust 05 Track data; sampled pool including all systems (left) vs. sampled pool without Sabir systems (right).


Chapter 5

Million Query Tracks

5.1 Introduction

Information retrieval evaluation has typically been performed over several dozen queries, each judged to near-completeness. There has been a great deal of recent work on evaluation over much smaller judgment sets: how to select the best set of documents to judge and how to estimate evaluation measures when few judgments are available. In light of this, it should be possible to evaluate over many more queries without much more total judging effort. The Million Query Track at TREC 2007 used two document selection algorithms to acquire relevance judgments for more than 1,800 queries. We present results of the track, along with deeper analysis: investigating tradeoffs between the number of queries and the number of judgments shows that, up to a point, evaluation over more queries with fewer judgments is more cost-effective and as reliable as fewer queries with more judgments. Total assessor effort can be reduced by 95% with no appreciable increase in evaluation errors.

Over the past 40 years, Information Retrieval research has progressed against a background of ever-increasing corpus size. From the 1,400 abstracts in the Cranfield collection, the first portable test collection, to the 3,200 abstracts of the Communications of the ACM (CACM), to the 348,000 Medline abstracts (OHSUMED), to the first TREC collections of millions of documents, to the web, with its billions of HTML and other documents, IR research has had to address larger and more diverse corpora.

As corpora grow, the assessor effort needed to construct test collections grows in tandem. Sparck Jones and van Rijsbergen introduced the pooling method [91] to deal with the problem of acquiring judgments. Rather than judging every document for every query, the documents ranked by actual retrieval systems could be pooled and judged, thus focusing judging effort on those documents least likely to be nonrelevant. The pooling method has been successful for decades. However, recent work suggests that the growth in the size of corpora is outpacing even the ability of pooling to find and judge enough documents [27].



Rather than trying to keep up simply by judging more documents, there has been interest in focusing judging effort even better and making smarter inferences when few judgments are available.

With fewer judgments available, estimates of evaluation measures will have higher variance. One way to cope with this is to evaluate over more queries. Web search engines, for instance, typically judge very shallow pools for thousands of queries. For the recall-based measures that are ubiquitous in retrieval research, such a shallow pool is not enough.

In this work we describe an evaluation over a corpus of 25 million documents and 10,000 queries, the Million Query Track that ran for the first time at the Text REtrieval Conference (TREC) in 2007 [5]. Using two recent methods for selecting documents and evaluating over small collections, we achieve results very similar to an evaluation using 149 queries judged in more depth, with 62% of the assessor effort but 11 times as many queries. But this only establishes an upper bound. An evaluation with high rank correlation can be achieved with 3% of the effort over only 1.34 times as many queries, and, using analysis of variance and stability studies, we show that the amount of effort needed to establish that differences between systems are not simply due to random variation in scores is at most 5% of the effort over 1.14 times as many queries.

We begin in Section 2 by describing the methods used to select documents and evaluate for the Million Query Track. In Section 3 we describe the setup of the experiment and the relevance judgments collected. Section 4 presents initial results, comparing the two methods to the baseline set of 149 queries. In Section 5, we approach the question of the number of queries and judgments per query needed to evaluate with minimum effort, and also explore the reusability of such a small test collection.

The Million Query (1MQ) track ran for the first time in TREC 2007. It was designed to serve two purposes. First, it was an exploration of ad-hoc retrieval on a large collection of documents. Second, it investigated questions of system evaluation, particularly whether it is better to evaluate using many shallow judgments or fewer thorough judgments.

Participants in this track were assigned two tasks: (1) run 10,000 queries against a 426GB collection of documents at least once and (2) judge documents for relevance with respect to some number of queries.

Section 5.2 describes how the corpus and queries were selected, details the submission formats, and provides a brief description of all submitted runs. Section 5.3.1 provides an overview of the judging process, including a sketch of how it alternated between two methods for selecting the small set of documents to be judged. One of the methods is the sampling without replacement technique detailed in Chapter 4; the other one, developed by Ben Carterette at UMass [33], is detailed in Section 5.4.

In Section 5.3.2 we present some statistics about the judging process, such as the total number of queries judged, how many by each approach, and so on. We present some additional results and analysis of the overall track in Sections 5.5 and 5.6.


5.2 Phase I: Running Queries

The first phase of the track required that participating sites submit their retrieval runs.

5.2.1 Corpus

The 1MQ track used the so-called "terabyte" or "GOV2" collection of documents. This corpus is a collection of Web data crawled from Web sites in the .gov domain in early 2004. The collection is believed to include a large proportion of the .gov pages that were crawl-able at that time, including HTML and text, plus the extracted text of PDF, Word, and PostScript files. Any document longer than 256KB was truncated to that size at the time the collection was built. Binary files are not included as part of the collection, though they were captured separately for use in judging.

The GOV2 collection includes 25 million documents in 426 gigabytes. The collection was made available by the University of Glasgow, distributed on a hard disk that was shipped to participants for an amount intended to cover the cost of preparing and shipping the data.

5.2.2 Queries

Topics for this task were drawn from a large collection of queries that were collected by a large Internet search engine. Each of the chosen queries is likely to have at least one relevant document in the GOV2 collection because logs showed a click-through on one page captured by GOV2. Obviously there is no guarantee that the clicked page is relevant, but it increases the chance of the query being appropriate for the collection.

These topics are short, title-length (in TREC parlance) queries. In the judging phase, they were developed into full-blown TREC topics.

Ten thousand (10,000) queries were selected for the official run. The 10,000 queries included 150 queries that were judged in the context of the Terabyte Track from earlier years (though one of these had no relevant documents and was therefore excluded).

No quality control was imposed on the 10,000 selected queries. The hope was that most of them would be good quality queries, but it was recognized that some were likely to be partially or entirely non-English, to contain spelling errors, or even to be incomprehensible to anyone other than the person who originally created them.

The queries were distributed in a text file where each line has the format "N:query word or words". Here, N is the query number, followed by a colon and immediately followed by the query itself. For example, the line (from a training query) "32:barack obama internships" means that query number 32 is the 3-word query "barack obama internships". All queries were provided in lowercase and with no punctuation (it is not clear whether that formatting is a result of processing or because people use lowercase and do not use punctuation).


5.2.3 Submissions

Sites were permitted to provide up to five runs. Every submitted run was included in the judging pool and all were treated equally.

A run consisted of up to the top 1,000 documents for each of the 10,000 queries. The submission format was the standard TREC format of exactly six columns per line with at least one space between the columns. For example:

100 Q0 ZF08-175-870 1 9876 mysys1
100 Q0 ZF08-306-044 2 9875 mysys2

where:

1. The first column is the topic number.

2. The second column is unused but must always be the string "Q0" (letter Q, number zero).

3. The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.

4. The fourth column is the rank of that document for that query.

5. The fifth column is the score this system generated to rank this document.

6. The sixth column is a "run tag," a unique identifier for each group and run.

If a site would normally have returned no documents for a query, it instead returned the single document “GX000-00-0000000” at rank one. Doing so maintained consistent evaluation results (averages over the same number of queries) and did not break any evaluation tools being used.
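
As a hedged illustration (the function name and run tag are invented for the example), a line in this submission format can be produced as follows:

```python
def format_run_line(topic, docno, rank, score, run_tag):
    """Produce one TREC-format run line: six whitespace-separated columns."""
    return f"{topic} Q0 {docno} {rank} {score} {run_tag}"

# A system with no results for topic 42 would still emit a single placeholder line:
print(format_run_line(42, "GX000-00-0000000", 1, 0, "mysys1"))
```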

5.2.4 Submitted runs

The following is a brief summary of some of the submitted runs. The summaries were provided by the sites themselves and are listed in alphabetical order. (When no full summary is available, the brief summary information from the submissions has been used.)

ARSC/University of Alaska Fairbanks The ARSC multi-search system is a heterogeneous distributed information retrieval simulation and demonstration implementation. The purpose of the simulation is to illustrate performance issues in Grid Information Retrieval applications by partitioning the GOV2 collection into a large number of hosts and searching each host independently of the others. Previous TREC Terabyte Track experiments using the ARSC multi-search system have focused on the IR performance of multi-search result-set merging and the efficiency gains from truncating result-sets from a large collection of hosts before merging.


The primary task of the ARSC multi-search system in the 2007 TREC Million Query experiment is to estimate the number of hosts or subcollections of GOV2 that can be used to process 10,000 queries within the TREC Million Query Track time constraints. The secondary and ongoing task is to construct an effective strategy for picking a subset of the GOV2 collections to search at query time. The host-selection strategy used for this experiment was to restrict searches to hosts that returned the most relevant documents in previous TREC Terabyte Tracks.

Exegy Exegy’s submission for the TREC 2007 Million Query track consisted of results obtained by running the queries against the raw data, i.e., the data was not indexed. The hardware-accelerated streaming engine used to perform the search is the Exegy Text Miner (XTM), developed at Exegy, Inc. The search engine’s architecture is novel: XTM is a hybrid system (heterogeneous compute platform) employing general-purpose processors (GPPs) and field-programmable gate arrays (FPGAs) in a hardware-software co-design architecture to perform the search. The GPPs are responsible for inputting the data to the FPGAs and for reading and post-processing the search results that the FPGAs output. The FPGAs perform the actual search and, due to the high degree of parallelism available (including pipelining), are able to do so much more efficiently than the GPPs.

For the Million Query track, the results for a particular query were obtained by searching for the exact query string within the corpus. This brute-force approach, although naïve, returned relevant results for most of the queries. The mean average precision for the results was 0.3106 and 0.0529 using the UMass and the NEU approaches, respectively. More importantly, XTM completed the search for the entire set of 10,000 queries on the unindexed data in less than two and a half hours.

Heilongjiang Institute of Technology, China Used Lemur.

IBM Haifa This year, the experiments of IBM Haifa focused on the scoring function of Lucene, an Apache open-source search engine. The main goal was to bring Lucene’s ranking function to the same level as state-of-the-art ranking formulas like those traditionally used by TREC participants. Lucene’s scoring function was modified to include better document length normalization and a better term-weight setting following the SMART model.

Lucene was then compared to Juru, the home-brewed search engine used by the group in previous TREC conferences. In order to examine the ranking function alone, both Lucene and Juru used the same HTML parser, the same anchor text, and the same query parsing process, including stop-word removal, synonym expansion, and phrase expansion. Based on the 149 topics of the Terabyte tracks, the results of the modified Lucene significantly outperform the original Lucene and are comparable to Juru’s results.


In addition, a shallow query log analysis was conducted over the 10K query log. Based on the query log, a specific stop-list and a synonym table were constructed to be used by both search engines.

Northeastern University We used several standard Lemur built-in systems (tfidf bm25, tfidf log, kl abs, kl dir, inquery, cos, okapi) and combined their output (metasearch) using the Hedge algorithm.

RMIT Zettair Dirichlet smoothed language model run.

SabIR Standard smart ltu.Lnu run.

University of Amsterdam The University of Amsterdam, in collaboration with the University of Twente, participated with the main aim to compare results of the earlier Terabyte tracks to the Million Query track. Specifically, what is the impact of shallow pooling methods on the (apparent) effectiveness of retrieval techniques? And what is the impact of substantially larger numbers of topics? We submitted a number of runs using different document representations (such as full-text, title fields, or incoming anchor-texts) to increase pool diversity. The initial results show broad agreement in system rankings over various measures on topic sets judged at both Terabyte and Million Query tracks, with runs using the full-text index giving superior results on all measures. There are some noteworthy upsets: measures using the Million Query judged topics show stronger correlation with precision at early ranks.

University of Massachusetts Amherst The base UMass Amherst submissions were a simple query likelihood model and the dependence model approach fielded during the Terabyte track last year. We also tried some simple automatic spelling correction on top of each baseline to deal with errors of that kind. All runs were done using the Indri retrieval system.

University of Melbourne Four types of runs were submitted:

1. A topic-only run using a similarity metric based on a language model with Dirichlet smoothing, as described by Zhai and Lafferty (2004).

2. Submit the query to a public web search engine, retrieve snippet information for the top 5 documents, add unique terms from the snippets to the query, and run the expanded query using the same similarity metric just described.

3. A standard impact-based ranking.

4. A merging of the language modeling and the impact runs.

5.3 Phase II: Relevance judgments and judging

After all runs were submitted, a subset of the topics was judged. The goal was to provide a small number of judgments for a large number of topics. For TREC 2007, over 1,700 queries were judged, a large increase over the more typical 50 queries judged by other tracks in the past.


5.3.1 Judging overview

Judging was done by assessors at NIST and by participants in the track. Non-participants were welcome (encouraged!) to provide judgments too, though very few such judgments occurred. Some of the judgments came from an Information Retrieval class project, and some were provided by hired assessors at UMass. The bulk of the judgments, however, came from the NIST assessors.

The process looked roughly like this from the perspective of someone judging:

1. The assessment system presented 10 queries randomly selected from the evaluation set of 10,000 queries.

2. The assessor selected one of those ten queries to judge. The others were returned to the pool.

3. The assessor provided the description and narrative parts of the query, creating a full TREC topic. This information was used by the assessor to keep focus on what is relevant.

4. The system presented a GOV2 document (Web page) and asked whether it was relevant to the query. Judgments were on a three-way scale to mimic the Terabyte Track from years past: highly relevant, relevant, or not relevant. Consistent with past practice, the distinction between the first two was up to the assessor.

5. The assessor was required to continue judging until 40 documents had been judged. An assessor could optionally continue beyond the 40, but few did.

The system for carrying out those judgments was built at UMass on top of the Drupal content management platform [1]. The same system was used as the starting point for relevance judgments in the Enterprise track.

As in other TREC tracks, sites participating in the Million Query Track were provided a set of queries to run through their retrieval engines, producing ranked lists of up to 1,000 documents from a given corpus for each query. The submitted runs were used as input to the MTC and statAP algorithms for selection of documents to be judged.

Corpus. The corpus was the GOV2 collection, a crawl of the .gov domain in early 2004 [37]. It includes 25 million documents in 426 gigabytes. The documents are a mix of plain text, HTML, and other formats converted to text.

Queries. A total of 10,000 queries were sampled from the logs of a large Internet search engine. They were sampled from a set of queries that had at least one click within the .gov domain, so they are believed to contain at least one relevant document in the corpus. Queries were generally 1-5 words long and were not accompanied by any hints about the intent of the user that originally entered them. The title queries of TREC topics 701–850 were seeded in this set [38].

[1] http://drupal.org


Retrieval runs. Ten sites submitted a total of 24 retrieval runs. The runs used a variety of methods: tf-idf, language modeling, dependence modeling, and model combination; some used query expansion, in one case expanding using an external corpus. Some attempted to leverage the semi-structured nature of HTML by using anchor text, links, and metadata as part of the document representation.

Assessors. Judgments were made by three groups: NIST assessors, sites that submitted runs, and undergraduate work-study students. Upon logging in for the first time, assessors were required to go through a brief training phase to acquaint them with the web-based interface. After at least five training judgments, they entered the full assessment interface. They were presented with a list of 10 randomly-chosen queries from the sample. They selected one query from that list. They were asked to develop the query into a full topic by entering an information need and a narrative describing what types of information a document would have to present in order to be considered relevant and what information would not be considered relevant.

Each query was served by one of three methods (unknown to the assessors): MTC, statMAP, or an alternation of MTC and statMAP. For MTC, document weights were updated after each judgment; this resulted in no noticeable delay to the assessor. StatMAP samples were selected in advance of any judging. The alternations proceeded as though MTC and statMAP were running in parallel; neither was allowed knowledge of the judgments to documents served by the other. If one served a document that had already been judged from the other, it was given the same judgment so that the assessor would not see the document again.

Documents were displayed with query terms highlighted and images included to the extent possible. Assessors could update their topic definitions as they viewed documents, a concession to the fact that the meaning of a query could be difficult to establish without looking at documents. Judgments were made on a tertiary scale: nonrelevant, relevant, or highly relevant. Assessors were not given instructions about the difference between relevant and highly relevant.

Assessors were required to judge at least 40 documents for each topic. After 40 judgments they were given the option of closing the topic and choosing a new query.

5.3.2 Judgments

There were three separate judging phases. The first, by NIST assessors and participating sites, was the longest. It resulted in 69,730 judged documents for 1,692 queries, with 10.62 relevant per topic on average and 25.7% relevant overall. This set comprises three subsets: 429 queries that were served by MTC, 443 served by statAP, and 801 that alternated between methods. Details of these three are shown in Table 5 as “1MQ-MTC”, “1MQ-statAP”, and “1MQ-alt”.

Due to the late discovery of an implementation error, a second judging phase began in October with the undergraduate assessors and statAP judging only. This resulted in an additional 3,974 judgments for 93 queries, of which 21.22% were relevant (8.29 per topic on average). These were folded into the 1MQ-statAP set.

set          topics   judgments   rel/topic    % rel
TB              149     135,352      180.65   19.89%
1MQ-MTC         429      17,523       11.08   27.12%
1MQ-statAP      536      21,887       10.42   25.47%
1MQ-alt         801      33,077       10.32   24.99%
depth10          25       2,357       14.16   15.00%

Table 5: Judgment sets.

The TREC queries in this set had already been judged with some depth. These queries and judgments, details of which are shown in Table 5 as “TB”, were used as a “gold standard” to compare the results of evaluations by MTC and statAP. It should be noted that these queries are not sampled from the same source as the other 10,000 and may not be representative of that space. They are, however, nearer to “truth” than any other set of queries we have.

There were two additional short judging phases with the goal of reinforcing the gold standard. For the first, a pool of depth 10 was judged for queries in the sample of 10,000; this is described in Table 5 as “depth10”. One striking feature of depth10 is that assessors found many fewer relevant documents than in previous judging phases. Second, since some of our runs turned out to be poorly represented in the TB set, we made 533 additional judgments on top-ranked documents from sparsely-judged systems.

5.3.3 Selection of documents for judging

Two approaches to selecting documents were used:

Expected AP method. In this method, documents are selected by how much they inform us about the difference in mean average precision, given all the judgments that were made up to that point [33]. Because average precision is quadratic in the relevance judgments, the amount each relevant document contributes is a function of the total number of judgments made and the ranks they appear at. Nonrelevant documents also contribute to our knowledge: if a document is nonrelevant, it tells us that certain terms cannot contribute anything to average precision. We quantify how much a document will contribute if it turns out to be relevant or nonrelevant, then select the one that we expect to contribute the most. This method is further described below.

Statistical AP method. This is the non-uniform, without-replacement sampling method described in the previous chapter. It has the ability to incorporate additional (non-random) judged documents into the estimation; this fits the MQ setup very well, since the statAP method has access to the documents selected by the MTC method in addition to its own selection.

The two methods we used differ by the aspect of evaluation that they attack. The Minimal Test Collection (MTC) algorithm is designed to induce rankings


of systems by identifying differences between them [33], without regard to the values of the measures. StatAP is a sampling method designed to produce unbiased, minimum-variance estimates of average precision [9]. Both methods are designed to evaluate systems by average precision (AP), which is the official evaluation measure of TREC ad hoc and ad hoc-like tracks.

For each query, one of the following happened:

1. The pages to be judged for the query were selected by the “expected AP method.” A minimum of 40 documents were judged, though the assessor was allowed to continue beyond 40 if so motivated.

2. The pages to be judged for the query were selected by the “statistical evaluation method.” A minimum of 40 documents were judged, though the assessor was allowed to continue beyond 40 if so motivated.

3. The pages to be judged were selected by alternating between the two methods until each had selected 20 pages. If a page was selected by more than one method, it was presented for judgment only once. The process continued until at least 40 pages had been judged (typically 20 per method), though the assessor was allowed to continue beyond 40 if so motivated.

The assignments were made such that option (3) was selected half the time and the other two options each occurred 1/4 of the time. When completed, roughly half of the queries therefore had parallel judgments of 20 or more pages by each method, and the other half had 40 or more judgments by a single method.

In addition, a small pool of 50 queries was randomly selected for multiple judging. With a small random chance, an assessor's ten queries were drawn from that pool rather than the full pool. Whereas in the full pool no query was considered by more than one person, in the multiple judging pool a query could be considered by any or even all assessors, though no assessor was shown the same query more than once.

5.4 Minimal Test Collections (MTC)

We present a brief description of the MTC method, developed by Ben Carterette at UMass Amherst. For a complete description, see [33].

MTC is a greedy on-line algorithm for selecting documents to be judged. Given a particular evaluation measure and any extant relevance judgments, it weighs documents by how informative they are likely to be in determining whether there is a difference in the measure between two systems. The highest-weight document is presented to an assessor for judging; the judgment is used to update the document weights.
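
As a minimal sketch of this greedy loop (the weighting and judging functions are left abstract, and the names are illustrative, not the actual MTC implementation):

```python
def greedy_selection(documents, weight_fn, judge_fn, budget):
    """Repeatedly judge the currently most informative unjudged document.

    weight_fn(doc, judgments) -> float  # how informative a judgment on doc would be now
    judge_fn(doc) -> 0 or 1             # asks the assessor for a relevance judgment
    """
    judgments = {}
    unjudged = set(documents)
    for _ in range(budget):
        if not unjudged:
            break
        # Pick the document whose judgment is expected to be most informative,
        # given everything judged so far.
        best = max(unjudged, key=lambda d: weight_fn(d, judgments))
        judgments[best] = judge_fn(best)
        unjudged.remove(best)
    return judgments
```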

AP is the average of the precision values at the ranks at which relevant documents were retrieved. Letting $x_i$ represent the relevance of the document at rank $i$, precision at rank $k$ is $\mathrm{prec@}k = \frac{1}{k}\sum_{i=1}^{k} x_i$. Average precision is then

$$AP = \frac{1}{R}\sum_{i=1}^{n} x_i \,\mathrm{prec@}i = \frac{1}{R}\sum_{i=1}^{n}\sum_{j=1}^{i} \frac{1}{i}\, x_i x_j$$

where $n$ is the number of documents and $R$ is the number of relevant documents. Average precision is a common and well-understood measure in IR research.

Define the difference in AP as $\Delta AP = AP_1 - AP_2$. Let $x_i$ be the relevance of document $i$, where the index $i$ is arbitrary, unrelated to the rank of the document. We can then express $\Delta AP$ in closed form as

$$\Delta AP = \frac{1}{\sum_{i=1}^{n} x_i} \sum_{i=1}^{n}\sum_{j=i}^{n} c_{ij}\, x_i x_j$$

where $c_{ij}$ is a constant depending on the ranks of documents $i$ and $j$ in the two systems:

$$c_{ij} = \frac{1}{\max\{\mathrm{rank}_1(i), \mathrm{rank}_1(j)\}} - \frac{1}{\max\{\mathrm{rank}_2(i), \mathrm{rank}_2(j)\}}.$$
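
As a hedged illustration of these definitions (a direct computation from ranked lists and judgments, not the MTC code):

```python
def average_precision(ranked_docs, relevant):
    """AP: mean of the precision values at the ranks of the relevant documents."""
    R = len(relevant)
    if R == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank      # precision at this rank
    return total / R

def delta_ap(run1, run2, relevant):
    """Difference in AP between two ranked lists over the same relevance judgments."""
    return average_precision(run1, relevant) - average_precision(run2, relevant)

# Example: one relevant document "d2"; run1 ranks it first, run2 ranks it third.
print(delta_ap(["d2", "d1", "d3"], ["d1", "d3", "d2"], {"d2"}))  # 1.0 - 1/3
```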

If $\Delta AP > 0$ over our judgments (assuming unjudged documents to be nonrelevant), then, making the worst-case assumptions that every unjudged document that would decrease $\Delta AP$ if relevant will be judged relevant, and that every unjudged document that would increase $\Delta AP$ if relevant will be judged nonrelevant, we can determine whether there is any set of judgments that will result in the sign of $\Delta AP$ changing. If not, we have proved the difference.

This offers a guide for selecting documents to judge. To prove that $AP_1 > AP_2$, we pick documents that would benefit $AP_2$ if relevant but are in fact likely to be nonrelevant, or documents that would benefit $AP_1$ if relevant and are likely to be relevant. To be fair to both systems, we simply alternate trying to prove $AP_1 > AP_2$ and $AP_2 > AP_1$. Since AP is quadratic, each judgment influences our knowledge of the benefit of future judgments: knowing that document 1 is nonrelevant, for instance, would tell us that $\frac{1}{2}x_1x_2 + \frac{1}{3}x_1x_3 + \cdots$ is 0.

Expected MAP. In practice the number of judgments it takes to prove a difference in AP is quite large, but the marginal value of a judgment drops rapidly. At some point it becomes highly probable that we know the sign of the difference despite not having yet proved it. Let $X_i$ be a Bernoulli random trial representing the relevance of document $i$, and $p_i = p(X_i = 1)$ the probability that document $i$ is relevant. We can then estimate the expected value of $\Delta AP$ as

$$E[\Delta AP] = \frac{1}{\sum p_i} \sum_{i=1}^{n} \Big( c_{ii}\, p_i + \sum_{j>i} c_{ij}\, p_i p_j \Big).$$

The variance has a closed form as well [33].

It is straightforward to adapt this to the evaluation of a single system over multiple topics. Replace the rank constant $c_{ij}$ with its single-system component


to get E[AP], then sum over topics to get EMAP. In this work we present rankings of systems by EMAP and use ΔMAP in the context of comparing pairs of systems.

ΔMAP, and therefore EMAP, converge to an approximately normal distribution over possible assignments of relevance, and thus can be understood by their expectation and variance. To determine the probability that ΔAP is less than zero, we simply look up the value in a normal distribution table. We refer to this probability as “confidence”.
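
For illustration, a hedged sketch of this confidence computation under the normal approximation (the expectation and variance of ΔMAP are assumed to have been estimated already; this is not the MTC code itself):

```python
from math import erf, sqrt

def confidence(expected_delta, variance):
    """P(ΔMAP < 0) under a normal approximation with the given mean and variance."""
    if variance <= 0:
        return 1.0 if expected_delta < 0 else 0.0
    z = -expected_delta / sqrt(variance)
    return 0.5 * (1.0 + erf(z / sqrt(2)))   # standard normal CDF evaluated at z

# e.g. expected ΔMAP = -0.02 with variance 0.0001 gives confidence ≈ 0.977
print(round(confidence(-0.02, 0.0001), 3))
```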

Calculating EMAP and confidence requires some estimate of the probability of relevance of each document. Carterette [32] described a method for using known relevance judgments and the performance of the systems to estimate the relevance of the unjudged documents they ranked.

5.5 Results

The 24 runs were evaluated over the TB set using trec_eval and over the 1MQ set using EMAP and statMAP. If TB is representative, we should see that EMAP and statMAP agree with each other, as well as with TB, about the relative ordering of systems. Our expectation is that statMAP will present better estimates of MAP, while EMAP is more likely to present a correct ranking of systems.

The left side of Table 6 shows the MAP for our 24 systems over the 149 Terabyte queries, ranked from lowest to highest. The average number of unjudged documents in the top 100 retrieved is also shown. Since some of these systems did not contribute to the Terabyte judgments, they ranked quite a few unjudged documents.

The right side shows EMAP and statMAP over the queries judged for our experiment, in order of increasing MAP over Terabyte queries. It also shows the number of unjudged documents in the top 100. EMAP and statMAP are evaluated over somewhat different sets of queries; statMAP excludes queries judged by MTC and queries for which no relevant documents were found, while EMAP includes all queries, with those that have no relevant documents having some probability that a relevant document may yet be found.

[Three scatter plots, one point per system: MAP over the Terabyte queries versus statMAP (τ = 0.823), MAP over the Terabyte queries versus EMAP (τ = 0.780), and statMAP versus EMAP (τ = 0.869).]

Figure 43: From left, evaluation over Terabyte queries versus statMAP evaluation, evaluation over Terabyte queries versus EMAP evaluation, and statMAP evaluation versus EMAP evaluation.


                      149 Terabyte            1MQ
run name            unjudg     MAP      unjudg    EMAP      statMAP
UAms.AnLM            64.72   0.0278‡     90.75   0.0281     0.0650
UAms.TiLM            61.43   0.0392‡     89.40   0.0205     0.0938
exegyexact            8.81   0.0752‡     13.67   0.0184     0.0517
-------------------------------------------------------------------
umelbexp             61.17   0.1251      91.85   0.0567∗†   0.1436†
ffind07c             22.91   0.1272‡     77.94   0.0440     0.1531
ffind07d             24.07   0.1360      82.11   0.0458     0.1612
sabmq07a1            21.69   0.1376      86.51   0.0494     0.1519
UAms.Sum6            32.74   0.1398‡     81.37   0.0555     0.1816
UAms.Sum8            24.40   0.1621      79.92   0.0580     0.1995
UAms.TeVS            21.11   0.1654      81.35   0.0503     0.1805
hedge0               16.90   0.1708‡     80.44   0.0647     0.2175
-------------------------------------------------------------------
umelbimp             15.40   0.2499      80.83   0.0870     0.2568
umelbstd             11.48   0.2532‡     82.17   0.0877     0.2583
umelbsim             10.38   0.2641‡     80.17   0.1008∗†   0.2891†
hitir                 9.06   0.2873      80.25   0.0888     0.2768
rmitbase              8.32   0.2936      79.28   0.0945     0.2950
indriQLSC             7.34   0.2939      79.18   0.0969     0.3040
LucSynEx             13.02   0.2939      78.23   0.1032∗    0.3184∗
LucSpel0             13.08   0.2940      78.27   0.1031     0.3194∗
LucSyn0              13.08   0.2940      78.27   0.1031     0.3194∗
indriQL               7.12   0.2960‡     78.80   0.0979∗    0.3086
JuruSynE              8.86   0.3135      78.36   0.1080     0.3117
indriDMCSC            9.79   0.3197      80.36   0.0962∗    0.2981∗
indriDM               8.67   0.3238      79.51   0.0981∗    0.3060∗

Table 6: Performance on 149 Terabyte topics, 1,692 partially-judged topics per EMAP, and 1,084 partially-judged queries per statMAP, along with the number of unjudged documents in the top 100 for both sets.

Overall, the rankings by EMAP and statMAP are fairly similar, and both are similar to the “gold standard”. Figure 42 shows a graphical representation of the two rankings compared to the ranking over the Terabyte queries. Figure 43 shows how statMAP, EMAP, and MAP over TB queries correlate. All three methods have identified the same three clusters of systems, separated in Table 6 by horizontal lines; within those clusters there is some variation in the rankings between methods. For the statMAP estimates (Figure 43, left plot), besides the ranking correlation, we note the accuracy in terms of absolute difference with the TB MAP values, indicated by the line corresponding to the main diagonal.

Some of the bigger differences between the methods are noted in Table 6 by a ∗ indicating that the run moved four or more ranks from its position in


the TB ranking, or a † indicating a difference of four or more ranks between EMAP and statMAP. Both methods presented about the same number of such disagreements, though not on the same systems. The biggest disagreements between EMAP and statMAP were on umelbexp and umelbsim, both of which EMAP ranked five places higher than statMAP did. Each method settled on a different “winner”: indriDM for the TB queries, JuruSynE for EMAP, and LucSpel0 and LucSyn0 tied by statMAP. However, these systems are all quite close in performance by all three methods.

Figure 42: EMAP and statMAP evaluation results sorted by evaluation over 149 Terabyte topics.

We also evaluated statistical significance over the TB queries by a one-sided paired t-test at α = 0.05. A run denoted by a ‡ has a MAP significantly less than the next run in the ranking. (Considering the number of unjudged documents, some of these results should be taken with a grain of salt.) Significance is not transitive, so a significant difference between two adjacent runs does not always imply a significant difference between other runs. Both EMAP and statMAP swapped some significant pairs, though they agreed with each other for nearly all such swaps.

An obvious concern about the gold standard is the correlation between the number of unjudged documents and MAP: the tau correlation is −.517, or −.608 when exegyexact (which often retrieved only one document) is excluded. This correlation persists for the number unjudged in the top 10. To ensure that we were not inadvertently ranking systems by the number of judged documents, we selected some of the top-retrieved documents in sparsely-judged systems for additional judgments. A total of 533 additional judgments discovered only 7 new relevant documents for the UAms systems and 4 new relevant documents for the ffind systems, but 58 for umelbexp. The new relevant judgments caused umelbexp to move up one rank. This suggests that while the ranking is fair for most systems, it is likely underestimating umelbexp’s performance.

It is interesting that the three evaluations disagree as much as they do in light of work such as Zobel’s [115]. There are at least three possible reasons for the disagreement: (1) the gold standard queries represent a different space than the rest; (2) the gold standard queries are incompletely judged; and (3) the assessors did not pick queries truly randomly. The fact that EMAP and statMAP agree with each other more than either agrees with the gold standard suggests to us that the gold standard is most useful as a loose guide to the relative differences between systems, but does not meaningfully reflect “truth”


over the larger query sample. But the possibility of biased sampling affects the validity of the other two sets as well: as described above, assessors were allowed to choose from 10 different queries, and it is possible they chose queries for which they could decide on clear intents rather than queries that were unclear. It is difficult to determine how random query selection was. We might hypothesize that, due to order effects, if selection were entirely random we would expect to see the topmost query selected most often, followed by the second-ranked query, followed by the third, and so on, roughly conforming to a log-normal distribution. This in fact is not what happened; instead, assessors chose the top-ranked query slightly more often than the others (13.9% of all clicks), but the rest were roughly equal (slightly under 10%). But this would only disprove random selection if we could guarantee that presentation bias holds in this situation. Nevertheless, it does lend weight to the idea that query selection was not random.

In an attempt to resolve some of these questions, we evaluated systems over the depth10 set using trec_eval. Evaluation results over this set do not correlate well with any previous set, with rank correlations in the 0.6–0.7 range. In particular, the Luc* systems drop from the top tier to the second tier. This is a very interesting result that bears closer investigation, since if the queries are a random sample it disagrees with the notion that two samples of queries can be used to evaluate systems reliably. Our current hypothesis (supported by conversations with the assessors) is that assessors selected these queries less randomly than in other sets, so that they are not a representative sample. Note in Table 5 that the frequency of relevant documents in this set is significantly lower than in any other set.

5.6 Analysis

In this section, we describe a set of analyses performed on the data collected as described above. Our analyses are of two forms: (1) efficiency studies, aimed at determining how quickly one can arrive at accurate evaluation results, and (2) reusability studies, aimed at determining how reusable our evaluation paradigms are in assessing future systems.

5.6.1 Efficiency Studies

The end goal of evaluation is assessing retrieval systems by their overall performance. According to the empirical methodology most commonly employed in IR, retrieval systems are run over a given set of topics, producing a ranked list of documents. The performance of each system per topic is expressed in terms of the average precision of the output list of documents, while the overall quality of a system is captured by averaging its AP values over all topics into its mean average precision. Systems are ranked by their MAP scores.

Hypothetically, if a second set of topics were available, the systems could be run over this new set of topics and new MAP scores (and consequently a new ranking of the systems) would be produced. Naturally, two questions arise: (1)


how do MAP scores or a ranking of systems over different sets of topics compare to each other, and (2) how many topics are needed to guarantee that the MAP scores or a ranking of systems reflect their actual performance?

We describe two efficiency studies, the first based on analysis of variance (ANOVA) and Generalizability Theory, and the second based on an empirical study of the stability of rankings induced by subsets of queries.

ANOVA and Generalizability Theory

Given different sets of topics, one could decompose the amount of variability that occurs in MAP scores (as measured by variance) across all sets of topics and all systems into three components: (a) variance due to actual performance differences among systems (system variance), (b) variance due to the relative difficulty of a particular set of topics (topic variance), and (c) variance due to the fact that different systems consider different sets of topics hard or easy (system-topic interaction variance).

Ideally, one would like the total variance in MAP scores to be due to the actual performance differences between systems, as opposed to the other two sources of variance. In such a case, having the systems run over different sets of topics would result in each system obtaining identical MAP scores over all sets of topics, and thus MAP scores over a single set of topics would be 100% reliable in evaluating the quality of the systems. Note that among the three variance components, only the variances due to the systems and to the system-topic interactions affect the ranking of systems: it is these two components that can alter the relative differences among MAP scores, while the topic variance affects all systems equally, reflecting the overall difficulty of the set of topics.

In practice, as already described, retrieval systems are run over a single given set of topics. The decomposition of the total MAP variance into the aforementioned components in this case can be realized by using tools provided by Generalizability Theory (GT) [22, 24].

We ran two separate GT studies: one over the MAP scores estimated by the MTC method given the set of 429 topics exclusively selected by MTC, and one over the MAP scores estimated by the statAP method over the set of 459 topics exclusively selected by statAP (both methods utilized 40 relevance judgments per topic). For both studies we report (a) the ratio of the variance due to system to the total variance and (b) the ratio of the variance due to system to the variance components that affect the relative MAP scores (i.e., the ranking of systems), both as a function of the number of topics in the topic set. The results of the two studies are illustrated in Figure 44. The solid lines correspond to the ratio of the variance due to system to the total variance and express how fast (in terms of number of topics) we reach stable MAP values over different sets of topics of the same size. As the figure shows, the statAP method eliminates all variance components (other than the system) faster than the MTC method, reaching a ratio of 0.95 with a set of 152 topics, while MTC reaches the same ratio with 170 topics. The dashed lines correspond to the ratio of the variance due to system to the variance due to effects that can alter


the relative MAP scores (rankings) of the systems. The figure shows that the MTC method produces a stable ranking of systems over different sets of topics faster (in terms of number of topics) than the statAP method, reaching a ratio of variance of 0.95 with a set of 40 topics, while statAP reaches the same ratio with 85 topics.

These results further support the claims that the statAP method, by design,aims to estimate the actual MAP scores of the systems, while the MTC method,by design, aims to infer the proper ranking of systems.
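
A hedged sketch of how such stability ratios can be computed once the variance components have been estimated (these are standard generalizability-coefficient formulas; the component estimates themselves would come from an ANOVA over the system-by-topic score matrix, which is not shown here, and the example values are made up):

```python
def stability_ratios(var_system, var_topic, var_interaction, n_topics):
    """Stability of MAP scores and of the system ranking over sets of n_topics topics.

    score stability:   system variance vs. all variance components
    ranking stability: system variance vs. only the components that can reorder systems
    """
    score = var_system / (var_system + (var_topic + var_interaction) / n_topics)
    ranking = var_system / (var_system + var_interaction / n_topics)
    return score, ranking

# Illustrative (made-up) variance components over a 150-topic set:
print(stability_ratios(0.004, 0.010, 0.006, 150))
```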

[Plot of the stability level against the number of topics, for statAP and MTC, for both MAP scores and rankings.]

Figure 44: Stability levels of the MAP scores and the ranking of systems for statAP and MTC as a function of the number of topics.

Ranking stability

Figure 45 shows the τ correlation between both EMAP and statMAP rankings over 1000 queries with 40 judgments each and rankings by both measures over fewer queries and/or fewer judgments. 1000 queries were selected to make the x-axes equal; the two methods cannot use the same queries. These figures assume that the goal of reducing the assessor effort is to reach a ranking that is close to the one that would have been produced by the same method over all available judgments and these 1000 queries.

In all cases the lines rise quickly as queries are added, then flatten. Each of the lines seems to asymptote to a point below τ = 1; without certain judgments it may be impossible to reach the same level. This plot therefore depends to some extent on which documents were judged.

Cost analysis

Empirically, stability depends on the particular documents judged and how those judgments are used to make inferences and estimations. We can study stability empirically by selecting an operating point, then simulating evaluation runs to reach that point. The minimum cost required to reach some operating point, given some parameter such as the number of judgments per query, is a measure of stability.

[Two panels: Kendall’s τ of the MAP ranking against the full-MAP ranking versus the number of random queries, over 50 trials; MTC with 5, 10, 20, and 40 judgments per query (left) and statAP with 40 and 20 judgments per query (right).]

Figure 45: Stability of MTC and statAP

For MTC, we will use Kendall’s τ rank correlation as the operating point. If we want to ensure a τ of at least 0.9 with the ranking over all queries and all judgments, what is the minimum judging effort we need to expend?

Since MTC picks documents in an order, it is possible to simulate increasing numbers of judgments from 1 up to 40. To find the optimal cost point, we simulated increasing judgments, then increasing queries, as in Figure 44, until we first reach a Kendall’s τ of 0.9 between the ranking over the smaller set of queries and judgments and the full set.

As the figure shows, with 5 judgments per query MTC does not quite reach a 0.9 τ correlation with 1000 topics; it finishes at 0.872. With 10 judgments, τ reaches 0.9 with 900 topics (though 0.9 is well within the standard error with as few as 600 topics). With 20 judgments per topic, it only requires 250 topics to reach 0.9. Making an additional 20 judgments per topic does not provide much gain, as τ reaches 0.9 with only 50 fewer topics than with 20 judgments.
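
For reference, a hedged sketch of the Kendall’s τ computation used throughout these comparisons (the simple pairwise form, not the track’s actual tooling):

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two rankings given as {system: score} dictionaries."""
    systems = list(scores_a)
    concordant = discordant = 0
    for s1, s2 in combinations(systems, 2):
        d1 = scores_a[s1] - scores_a[s2]
        d2 = scores_b[s1] - scores_b[s2]
        if d1 * d2 > 0:
            concordant += 1
        elif d1 * d2 < 0:
            discordant += 1
        # tied pairs contribute to neither count in this simple form
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs

# Identical rankings give 1.0; completely reversed rankings give -1.0.
```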


[Plot of assessor time in hours against the number of judgments per query, for MTC and statAP.]

Figure 46: Total assessor cost required to reach a stable ranking. The number of queries required to reach τ = 0.9 is indicated on the plot for both MTC (blue) and statAP (red).

Assessor effort has two primary components: the amount of time spent developing the topic and the amount of time spent making judgments. It took assessors a median time of about 14 minutes to judge 40 documents, and about five minutes to develop the topic [2]. Given that, the total effort needed to do 10 judgments for each of 600 topics is at least $\frac{1}{60}\left(5 \cdot 600 + \frac{14}{40} \cdot 10 \cdot 600\right) = 85$ hours (assuming 0.9 can be reached with 600 topics); with 20 judgments and 250 topics it is less than half that, at 44 hours. With 40 judgments the time rises to 63 hours.
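
A small sketch of this cost model (the per-topic and per-judgment times are the rough estimates quoted above):

```python
def assessor_hours(n_topics, judgments_per_topic,
                   topic_minutes=5.0, minutes_per_judgment=14.0 / 40):
    """Total assessor time in hours: topic development plus judging."""
    minutes = n_topics * (topic_minutes + minutes_per_judgment * judgments_per_topic)
    return minutes / 60.0

print(assessor_hours(600, 10))  # 85.0 hours
print(assessor_hours(200, 40))  # ~63 hours
```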

Figure 46 shows the total effort needed to reach a τ of 0.9, in hours, as a function of the number of judgments. At each increasing judgment level, we found the minimum number of queries needed for a τ of 0.9, then calculated the total cost by the formula above. Note that there is a precipitous drop followed by a gradual increase, with the minimum point at about 20 judgments and 250 topics. This is the least amount of effort that must be expended to ensure that the ranking will not change significantly with more judgments or topics.

For statAP, a Kendall’s τ of 0.9 can be obtained with slightly fewer than 200 queries and 40 judgments per query, for a total of 8,000 judgments (Figure 45, right plot); a slightly lower τ can be obtained using about 400 queries with 20 judgments per query, for the same total of 8,000 judgments. These values correspond to 63 and 80 assessor hours respectively, as shown in Figure 46.

5.6.2 Reusability

Reusability in the sense it is traditionally understood is impossible—we cannotpredict what new systems will do, and as corpora keep getting bigger it will get

[2] We could not measure topic development time precisely, so this is a rough estimate based on the mean time between viewing a new list of 10 queries and saving a topic description.


run               EMAP     conf     statMAP    ±2std
exegyexact       0.0184    0.959     0.0517   ±0.0014
UAmsT07MAnLM     0.0205    1.000     0.0650   ±0.0019
UAmsT07MTiLM     0.0281    1.000     0.0938   ±0.0021
ffind07c         0.0440    1.000     0.1531   ±0.0022
ffind07d         0.0458    0.971     0.1612   ±0.0024
sabmq07a1        0.0494    0.703     0.1519   ±0.0022
UAmsT07MTeVS     0.0503    0.999     0.1805   ±0.0028
UAmsT07MSum6     0.0555    0.663     0.1816   ±0.0028
umelbexp         0.0567    0.688     0.1436   ±0.0037
UAmsT07MSum8     0.0580    0.999     0.1995   ±0.0028
hedge0           0.0647    1.000     0.2175   ±0.0029
umelbimp         0.0870    0.608     0.2568   ±0.0034
umelbstd         0.0877    0.690     0.2583   ±0.0033
hitir2007mq      0.0888    1.000     0.2768   ±0.0032
rmitbase         0.0945    0.842     0.2950   ±0.0031
indriDMCSC       0.0962    0.655     0.2981   ±0.0035
indriQLSC        0.0969    0.993     0.3040   ±0.0033
indriQL          0.0979    0.555     0.3086   ±0.0033
indriDM          0.0981    0.870     0.3060   ±0.0035
umelbsim         0.1008    0.808     0.2891   ±0.0038
LucSyn0          0.1031    0.583     0.3194   ±0.0036
LucSpel0         0.1031    0.681     0.3194   ±0.0036
LucSynEx         0.1032    0.996     0.3184   ±0.0035
JuruSynE         0.1080    NA        0.3117   ±0.0033

Table 7: Confidence estimates for EMAP and statMAP. For EMAP, the confidence is the probability that the system has a lower MAP than the next system. For statMAP, they are confidence intervals.

harder to create test collections that work as well for new systems as they do for old. Instead, evaluation should report a confidence based on the missing judgments. Reusability should be understood in terms of how well the confidence holds up [32].

The two methods have different notions of confidence. As described in Section 5.4, MTC calculates confidence as the probability that the sign of ΔMAP is negative (or positive). Confidences in ΔMAP are over relevance judgments only; they ask what the probability is that there is a difference between two systems on a given set of topics. StatMAP calculates a confidence interval for the value of AP for each query, then a confidence interval for the value of MAP over the sample of queries.

Table 7 shows confidence estimates for the two methods. The left side is EMAP; since displaying all the information MTC provides would require a 24×24 table, we have limited the table to only the confidence between adjacent pairs in the ranking by EMAP. The right side shows statMAP with confidence intervals calculated as in Section 2.2. These are described in more detail below.


MTC analysis

Confidence can be interpreted as the probability that two systems will swap in the ranking given more relevance judgments. If confidence is high, the systems are unlikely to swap; the results of the evaluation can be trusted. If confidence is low, more judgments should be acquired. In that case, MTC can take any existing judgments and produce a list of additional judgments that should be made.

Since we do not have enough systems or judgments to be able to do standard leave-one-out reusability experiments, we instead investigated the ability of the confidence estimate to predict what would happen after more judgments. After the first 20 judgments by MTC over all topics, we calculated confidence between all pairs. We then completed the judgments. Pairs that had high confidence after the first 20 judgments should not have swapped. For MTC we used the same 1000 topics used for the stability experiment above.

The τ correlation between the 20-judgment ranking and the 40-judgment ranking is 0.928, so not many pairs swapped. Of those that did, half had a confidence of less than 0.6. The greatest confidence of any pair that swapped was 0.875; though it was not particularly likely to swap, it was not unimaginable. There were 243 pairs with confidences greater than 0.95 after 20 judgments, and none of them swapped after the next 20.

statMAP confidence interval

Per query, the estimated interval length varies between 0 and ±2.6; for statMAP, assuming query independence, we obtain intervals varying from ±.0014 to ±.0038 for the 24 systems (Table 7). In most cases, if two confidence intervals (centered at the statMAP value) overlap, it is a strong indication that the true MAP values are very close; when they do not, it is a strong indication that the MAP values are significantly different. Empirical tests using previous TREC data show that our estimator slightly underestimates the variance, accounting for about 90% of it, so in practice slightly larger confidence intervals should be used.

For independent queries, the standard deviation of statMAP decreases with the number of queries; with more than 1100 queries, the confidence interval length is very small, and that truly reflects the confidence that the statMAP value is very close to the mean of the estimator. However, while prec@k and R are unbiased estimators, the ratio estimator statMAP is not guaranteed to be unbiased, and so its mean can be slightly different from the true AP value; therefore the overall confidence, especially for a large number of queries, should not be derived solely from the estimated confidence intervals.
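
For illustration, a hedged sketch of assembling a MAP-level interval from per-query estimates under the independence assumption above (the per-query AP estimates and variances are assumed to come from the statAP estimator and are treated here as given):

```python
from math import sqrt

def statmap_interval(ap_estimates, ap_variances, z=2.0):
    """Mean of per-query AP estimates with a +/- z*std half-width, assuming independent queries."""
    q = len(ap_estimates)
    mean_ap = sum(ap_estimates) / q
    var_of_mean = sum(ap_variances) / (q * q)   # variance of the mean of independent estimates
    return mean_ap, z * sqrt(var_of_mean)

# e.g. 1000 queries, each with per-query variance 0.01, give a half-width of about 0.0063.
```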

5.7 Conclusion

The Million Query Track and subsequent analyses show that we can evaluate retrieval systems with greatly reduced effort, beyond what was done for the


track, and down to a few hundred queries with several dozen judgments for each one. Even when this is too few judgments to reliably distinguish between systems, we can still identify the systems that we have the least confidence in and focus on acquiring more judgments for them, thus ensuring future reusability.

The results of our study confirm those found by Sanderson & Zobel [86], Jensen [57], and Carterette & Smucker [34], all of which argued that evaluation over more queries with fewer or noisier judgments is preferable to evaluation over fewer queries with more judgments. There are tradeoffs, of course: failure analysis may be more difficult when judgments are scarce, and there may be limited data for training new algorithms. Exploring and quantifying these tradeoffs are clear directions for future work.


Chapter 6

Hedge online pooling

Consider an awful player who takes five Blackjack experts with him to the casino in the hope that these experts will be able to win him more money than he could on his own. Say he wants to play $1000 every day, for about a year. On the first day, he splits his money equally among all the Blackjack experts, since he does not know whether any expert is better than the others. After each day, he can measure the performance of each expert, and naturally begins to give more money to the experts that are doing well, and less or no money to those that are doing poorly, in an attempt to make as much money as he possibly can.

Similarly, if the underlying experts are making predictions about which stocks will rise in the next trading day, one might invest one’s money in stocks according to the weighted predictions of the underlying experts. If a stock goes up, then each underlying expert which predicted this rise would receive a “gain,” and the investor would also receive a gain in proportion to the money invested. If the stock goes down, then each underlying expert which predicted a rise would suffer a “loss,” and the investor would also suffer a loss in proportion to the money invested.

Such problems are called “combination of expert advice” problems and occur when multiple “experts” make predictions about something that cannot be known in advance. These problems have been intensively studied [108, 35, 48], particularly in the field of Machine Learning. Boosting generally refers to combining the experts in such a way that the overall performance of the player is better than or on par with that obtained by following the advice of only the best (unknown to the player) expert.

Things are similar in our IR setup. We have access to multiple systems (or experts), and each of them offers “advice” consisting of a ranking of documents in response to a given query. Following this analogy, the “player” is a combination of the input ranked lists into one output list, a process we call metasearch.

We consider the problems of metasearch, pooling, and system evaluation, and we show that all three problems can be efficiently and effectively solved with a single technique based on the Hedge algorithm for on-line learning. Our


results from experiments with TREC data demonstrate that: (1) As an algorithm for metasearch, our technique combines ranked lists of documents in a manner whose performance equals or exceeds that of benchmark algorithms such as CombMNZ and Condorcet, and it generalizes these algorithms by seamlessly incorporating user feedback in order to obtain dramatically improved performance. (2) As an algorithm for pooling, our technique generates sets of documents containing far more relevant documents than standard techniques such as TREC-style depth pooling. (3) These pools, when used to evaluate retrieval systems, estimate the performance of retrieval systems and rank these systems in a manner superior to TREC-style depth pools of an equivalent size.

Our unified model for solving these three problems is based on the Hedge algorithm for on-line learning. In the context of these problems, Hedge effectively learns which systems are “better” than others and which documents are “more likely relevant” than others, given on-line relevance feedback. Thus Hedge (1) learns to rank documents in order of relevance (metasearch), (2) learns how to generate document sets likely to contain large fractions of relevant documents (pooling), and (3) efficiently and effectively evaluates the underlying retrieval systems using these pools.

Although the three goals of the algorithm, as stated, are somewhat intertwined, the action of the Hedge algorithm as an online metasearch engine (more on metasearch in Chapter 7) may be seen as enabling those of pooling and system evaluation. The high-quality ranked list produced by the metasearch engine consists of documents ranked in order of expected relevance, and therefore provides a foundation for document pooling, performed either iteratively or in multi-document batches. Documents with a high probability of relevance, in turn, prove to be good discriminators of system quality, and thus pools generated in this manner enable rapid evaluation of system performance.

In this chapter we explain the online pooling and evaluation methodologies and present experimental results regarding system evaluation. Metasearch and related subjects are presented in Chapter 7.

6.1 Online pooling and evaluation

Pools are often used to evaluate retrieval systems in the following manner. The documents within a pool are judged to determine whether they are relevant or not relevant to the given user query or topic. Documents not contained within the pool are assumed to be non-relevant. The ranked lists returned by the retrieval systems are then evaluated using standard measures of performance (such as mean average precision) using this “complete” set of relevance judgments. Since documents not present in the pool are assumed non-relevant, the quality of the assessments produced by such a pool is often in direct proportion to the fraction of relevant documents found in the pool (its recall).

On-line pooling techniques have been proposed which attempt to identify relevant documents as quickly as possible in order to exploit this phenomenon. Cormack et al. [41] proposed a push-to-front strategy for system pooling: arrange the systems in a queue, initially in a random order; then pool (and judge) documents from the first system until a non-relevant document is found, at which point the current system is put at the end of the queue and the next system is pooled.
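
A minimal sketch of this push-to-front idea (names and the judging interface are illustrative; this paraphrases the strategy rather than reproducing the code of [41]):

```python
from collections import deque
import random

def push_to_front_pooling(runs, judge_fn, budget):
    """Pool documents by cycling through systems, demoting a system when it yields a nonrelevant document.

    runs: {system_name: [doc ids in ranked order]}
    judge_fn(doc) -> True if the document is judged relevant
    """
    queue = deque(random.sample(list(runs), len(runs)))   # random initial order
    next_rank = {s: 0 for s in runs}                      # next unpooled rank per system
    judgments = {}
    while queue and len(judgments) < budget:
        system = queue[0]
        ranked = runs[system]
        # Skip documents already judged through another system.
        while next_rank[system] < len(ranked) and ranked[next_rank[system]] in judgments:
            next_rank[system] += 1
        if next_rank[system] >= len(ranked):
            queue.popleft()                               # this system is exhausted
            continue
        doc = ranked[next_rank[system]]
        next_rank[system] += 1
        judgments[doc] = judge_fn(doc)
        if not judgments[doc]:
            queue.rotate(-1)                              # nonrelevant: send system to the back
    return judgments
```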


range the systems in a queue, initially in a random order; then pool (and judge)documents from the first system until a non-relevant document is found, inwhich case the current system is put at the end of the queue and the nextsystem is pooled.

We demonstrate that the Hedge algorithm for on-line learning is ideallysuited to generating efficient pools which effectively evaluate retrieval systems.In effect, the Hedge algorithm learns which documents are likely to be relevant,these documents can be judged and added to the pool, and these relevancejudgments can be used as feedback to improve the learning process, thus gener-ating more relevant documents in subsequent rounds. The quality of the poolsthus generated can be measured in two ways: (1) At what rate are relevantdocuments found (recall percentage as a function of total judgments)? (2) Howwell do these pools evaluate the retrieval systems (score or rank correlations vs.“ground truth”)? In our experiments using TREC data, Hedge found relevantdocuments at rates nearly double that of benchmark techniques such as TREC-style depth pooling. These Hedge pools were found to evaluate the underlyingretrieval systems much better than TREC-style depth pools of an equivalentsize (as measured by Kendall’s τ rank correlation, for example). Finally, theseHedge pools seemed particular effective at properly evaluating the best under-lying systems, a task difficult to achieve using small pools.

Meta-pools can be substantially smaller than traditional pools and yet assess the underlying systems quite effectively. With meta-pools a mere fraction of the size of traditional TREC-style pools, we can rank the underlying systems in a manner nearly identical (Kendall’s τ over 0.9) to that achieved by the far larger TREC-style pools. Furthermore, the assessment scores produced using meta-pools are often linearly correlated (linear correlation coefficients over 0.9) with standard measures of assessed performance such as mean average precision, and thus they can be used to predict the system performance that would be assessed with full TREC-style pools.

6.1.1 Hedge methodology

The intuition for our methodology can be described as follows. Consider a user who submits a given query to multiple search engines and receives a collection of ranked lists in response. How would the user select documents to read in order to satisfy his or her information need? In the absence of any knowledge about the quality of the underlying systems, the user would probably begin by selecting some document which is “highly ranked” by “many” systems; such a document has, in effect, the collective weight of the underlying systems behind it. If the selected document were relevant, the user would begin to “trust” systems which retrieved this document highly (i.e., they would be “rewarded”), while the user would begin to “lose faith” in systems which did not retrieve this document highly (i.e., they would be “punished”). Conversely, if the document were non-relevant, the user would punish systems which retrieved the document highly and reward systems which did not. In subsequent rounds, the user would likely select documents according to his or her faith in the various systems in conjunction with how these systems rank the various documents; in other words, the user would likely pick documents which are ranked highly by trusted systems.


Algorithm Hedge(β)

Parameters: β ∈ [0, 1]; initial weight vector w^1 ∈ [0, 1]^N; number of trials T.

Do for t = 1, 2, . . . , T:

1. Choose the allocation p^t_i = w^t_i / \sum_{j=1}^{N} w^t_j.

2. Receive the loss vector ℓ^t ∈ [0, 1]^N from the environment.

3. Suffer the mixture loss p^t · ℓ^t.

4. Set the new weight vector to w^{t+1}_i = w^t_i β^{ℓ^t_i}.

Figure 47: The Hedge algorithm.


How can the above intuition be quantified and encoded algorithmically? Such questions have been studied in the machine learning community for quite some time and are often referred to as “combination of expert advice” problems. One of the seminal results in this field is the Weighted Majority Algorithm due to Littlestone and Warmuth [67]; in this work, we use a generalization of the Weighted Majority Algorithm called Hedge, due to Freund and Schapire [48].

Hedge is an on-line allocation strategy which solves the combination of expert advice problem as follows (see Figure 47). Hedge is parameterized by a tunable learning rate parameter β ∈ [0, 1] and, in the absence of any a priori knowledge, begins with an initially uniform “weight” w^1_s for each expert s (in our case, w^1_s = 1 for all s). The relative weight associated with an expert corresponds to one’s “faith” in its performance at a given time.

The online algorithm runs episodically: for each round t ∈ {1, . . . , T}, these weights are normalized to form a probability distribution p^t, where

\[ p^t_s = \frac{w^t_s}{\sum_j w^t_j}, \]

and one places p^t_s “faith” in system s after round t. We shall denote by Z^t = \sum_j w^t_j the normalization factor at round t.

This “faith” can be manifested in any number of ways, depending on the problem being solved. To encode the online loss in Hedge, at round t each expert s suffers a loss ℓ^t_s, and the combined expert suffers the weighted average (mixture) loss \sum_s p^t_s \ell^t_s. It is assumed that the losses and/or gains are bounded so that they can be appropriately mapped to the range [0, 1].

Finally, the Hedge algorithm updates its “faith” in each expert according to the losses suffered in the current round: w^{t+1}_s = w^t_s β^{ℓ^t_s}. Thus, the greater the loss an expert suffers in round t, the lower its weight in round t + 1, and the “rate” at which this change occurs is dictated by the tunable parameter β.
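
For concreteness, the following short Python sketch (illustrative only; the uniform initial weights, the β value, and the example losses are arbitrary placeholders, not the settings used in our experiments) implements one round of this update.

import numpy as np

def hedge_round(weights, losses, beta=0.9):
    """One round of Hedge: form the allocation, suffer the mixture loss,
    and update the weights multiplicatively."""
    p = weights / weights.sum()                # allocation p^t
    mixture_loss = float(np.dot(p, losses))    # loss suffered by Hedge
    new_weights = weights * beta ** losses     # w^{t+1}_s = w^t_s * beta^{l^t_s}
    return new_weights, mixture_loss

# Example: three experts, uniform initial weights, one round of losses in [0, 1].
w = np.ones(3)
w, loss = hedge_round(w, np.array([0.2, 0.9, 0.0]))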

One can easily show that, after T rounds, the weight of each expert reflects its overall performance [48]. Ignoring renormalization after each round, we have

\[ w^{T+1}_s = w^T_s \beta^{\ell^T_s} = w^{T-1}_s \beta^{\ell^{T-1}_s} \beta^{\ell^T_s} = \cdots = w^1_s \beta^{\sum_{t \le T} \ell^t_s} = \beta^{L_s}, \]

where L_s = \sum_t \ell^t_s is the cumulative loss suffered by expert s after T rounds.

Over time, the “best” underlying experts (those with the smallest losses) will receive the highest weights, and the cumulative (mixture) loss suffered by Hedge will not be much higher than that of the best underlying expert. Specifically, Freund and Schapire show that the cumulative (mixture) loss suffered by Hedge is bounded by

\[ L_{\text{Hedge}} \le \frac{\min_s\{L_s\} \, \ln(1/\beta) + \ln N}{1 - \beta}, \]

where N is the number of underlying experts. Vladimir Vovk [108, 48] proved a powerful supporting theorem which immediately shows that this bound is essentially optimal, in the following sense: if an online allocation algorithm admits a bound of the form L_{combined} ≤ c · min_s{L_s} + a ln N, then for all β ∈ [0, 1],

\[ c \ge \frac{\ln(1/\beta)}{1 - \beta} \quad \text{or} \quad a \ge \frac{1}{1 - \beta}. \]
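
As an illustrative calculation (numbers chosen for exposition only, not taken from our experiments): with β = 0.5 and N = 100 experts, the bound gives

\[ L_{\text{Hedge}} \le \frac{\ln 2 \cdot \min_s\{L_s\} + \ln 100}{0.5} \approx 1.39 \, \min_s\{L_s\} + 9.2, \]

so the cumulative loss of Hedge exceeds that of the best expert by at most a constant factor plus an additive term logarithmic in the number of experts.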

In terms of information theory, the combined loss of Hedge at round t can be written in terms of the KL divergence between the allocation distribution at round t (p^t) and the one at round t + 1 (p^{t+1}):

\[
KL(p^t \| p^{t+1}) = \sum_s p^t_s \ln\frac{p^t_s}{p^{t+1}_s}
= \sum_s p^t_s \ln\frac{w^t_s / Z^t}{w^{t+1}_s / Z^{t+1}}
= \sum_s p^t_s \ln\frac{w^t_s / Z^t}{w^t_s \beta^{\ell^t_s} / Z^{t+1}}
\]
\[
= \sum_s p^t_s \ln\Big(\frac{1}{\beta^{\ell^t_s}} \cdot Z^{t+1}/Z^t\Big)
= -\sum_s p^t_s \ell^t_s \ln\beta + \sum_s p^t_s \ln(Z^{t+1}/Z^t)
= -\ln\beta \cdot \ell^t_{\text{Hedge}} + \ln(Z^{t+1}/Z^t),
\]

which gives

\[
\ell^t_{\text{Hedge}} = \frac{-KL(p^t \| p^{t+1}) + \ln(Z^{t+1}/Z^t)}{\ln\beta}.
\]

6.2 Hedge application

On a per-query basis, each underlying retrieval system is an “expert” providing “advice” about the relevance of various documents to the given query. We must define a method for selecting likely relevant documents based on system weights and document ranks, and we must also define an appropriate loss that a system should suffer for retrieving a particular relevant or non-relevant document at a specified rank.

While a loss function which converges to some standard measure of performance such as average precision might be desirable, it is known [58] that average precision cannot be written as a sum of per-document contributions (or errors). Specifically, there is no simple rank function f such that

\[ AP = -\sum_d (-1)^{rel(d)} f(r(d)), \]

nor is there any f such that

\[ AP = \sum_d rel(d) \cdot f(r(d)). \]

Instead of designing a loss based on AP, we choose a loss function such that the cumulative loss of a system relates to Total Precision (TP), the average of precision over all ranks (closely related to AP). This rank function was introduced in chapter 3 and used as a sampling prior in chapter 4.

\[ W_s(d) = \frac{1}{2|s|}\left(1 + \frac{1}{r(d)} + \frac{1}{r(d)+1} + \cdots + \frac{1}{|s|}\right) \approx \frac{1}{2|s|}\left(1 + \log\frac{|s|}{r(d)}\right). \]

Our episodic loss is given by

\[ \ell_s = (-1)^{rel(d)} \cdot W_s(d). \]

In the limit of complete relevance judgments, one can show that the total loss of a system converges to the negative of the total precision plus a system-independent constant (chapter 3). For our purposes, this measure demonstrates a close empirical relationship to other popular measures of performance (such as average precision at relevant documents) while having the advantage of being simple and “symmetric” (the magnitude of the loss or gain is independent of relevance). Note that the magnitude of this loss is highest for documents which are highly ranked, as desired and expected.

Given this loss function, we implement a simple pooling strategy designed to maximize the learning rate of the Hedge algorithm. At each iteration, we select the unlabelled document which would maximize the weighted average (mixture) loss if it were non-relevant. Since the loss suffered by a system is large for a highly ranked non-relevant document, this is exactly the unlabelled document with the maximum expectation of relevance, as voted by a weighted linear combination of the systems.
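
A minimal sketch of this pooling loop is given below, assuming the ranked lists are lists of document ids and that `judge` is a stand-in for the human assessor; the handling of documents not retrieved by a system (simply skipped here) and the loss scaling are simplifications, not the exact implementation used in our experiments.

import math

def W(rank, depth):
    """Approximate AP-style weight of a document retrieved at a given rank."""
    return (1.0 + math.log(depth / rank)) / (2.0 * depth)

def hedge_pool(ranked_lists, judge, rounds, beta=0.9):
    """Online pooling: repeatedly select the unjudged document with the
    largest weighted vote, judge it, and reweight the systems."""
    weights = {s: 1.0 for s in ranked_lists}
    judged = {}
    for _ in range(rounds):
        votes = {}
        for s, docs in ranked_lists.items():
            for r, d in enumerate(docs, start=1):
                if d not in judged:
                    votes[d] = votes.get(d, 0.0) + weights[s] * W(r, len(docs))
        if not votes:
            break
        d = max(votes, key=votes.get)        # document with maximum expected relevance
        rel = judge(d)                       # relevance feedback: 1 or 0
        judged[d] = rel
        for s, docs in ranked_lists.items():
            if d in docs:                    # unretrieved documents are skipped in this sketch
                loss = ((-1) ** rel) * W(docs.index(d) + 1, len(docs))
                weights[s] *= beta ** loss   # reward if relevant, punish otherwise
    return judged, weights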

6.3 Hedge results

The Hedge algorithm demonstrated uniformly excellent performance across all TRECs tested (TRECs 3, 5, 6, 7, and 8) on all three tasks: as an online metasearch engine, as a pooling strategy for finding large fractions of relevant documents, and as a mechanism for rapidly evaluating the relative performance of retrieval systems.

In the following discussion, we present results using standard TREC-style pools of depth k and Hedge pools of an equivalent size. Depth-n refers to an evaluation with respect to a TREC-style pool of all documents retrieved by some system at depth n or less, and Hedge-m refers to an evaluation with respect to the pool generated by the Hedge algorithm after judging m documents. Standard TREC routines were used to evaluate the systems with respect to these pools, yielding mean average precision (MAP) scores for the underlying systems as well as the rankings of those systems induced by these scores.
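
The agreement between the ranking induced by a reduced pool and the ranking induced by the full pool is measured throughout this chapter with Kendall’s τ; a minimal sketch of that comparison using SciPy (the MAP values below are placeholders, not measured results):

from scipy.stats import kendalltau

map_full_pool  = [0.31, 0.27, 0.22, 0.19, 0.11]   # MAP per system, full TREC-style pool
map_hedge_pool = [0.30, 0.25, 0.24, 0.18, 0.10]   # MAP per system, small Hedge pool

tau, _ = kendalltau(map_full_pool, map_hedge_pool)
print("Kendall's tau between the induced system rankings:", round(tau, 3))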

In what follows, we show that pseudo-evaluation via online pooling can be quite effective, yielding rankings of the underlying systems which correlate well with rankings produced using actual relevance judgments. Once a metasearch algorithm is chosen, our methodology is essentially fully automatic (little, if any, training is required). Employing our technique over the systems submitted to the TREC 3, 5, 6, 7, and 8 competitions, we achieve ranked orderings of systems which correlate very well with the actual rankings as assessed using relevance judgments in TREC. As measured by Kendall’s τ, our methodology yields correlations of 0.672, 0.694, 0.665, 0.619, and 0.691 on TREC 3, 5, 6, 7, and 8, respectively. These correlations represent performance gains of 39.4%, 15.1%, 15.5%, 40.4%, and 29.4%, respectively, as compared to the best techniques described by Soboroff et al. [89].

Figure 48: Hedge pooling performance, recall (TREC 8).

Figure 48 demonstrates the algorithm’s success in finding relevant documents. The plot also shows the recall performance of depth pooling and the performance of the Cormack push-to-front technique [41], described earlier. The vertical axis corresponds to the percentage of total relevant documents found, and the dashed line indicates the number of relevant documents present in the Depth-n pools for depths 1-10, 15, and 20. Hedge performance far surpasses the discovery rates of the depth pooling method when compared at equivalent numbers of documents judged. Even more indicative of the algorithm’s success is a comparison of the number of judgments required to achieve equivalent recall percentages. For example, examining the TREC 8 curves along the horizontal axis, we see that the Depth-n method requires approximately 104 judgments to match the Hedge-40 return rate, and the Hedge-69 rate (36 percent) is unmatched until Depth-8 (199 judgments). After almost 500 judgments, Depth-20 has found only approximately 55% of the relevant documents, a rate achieved by Hedge in fewer than 150 judgments.

Figure 49: Hedge vs. depth pooling: system evaluation performance (TREC 8).

Figures 49 and 50 compare the system rankings produced by the Hedge algorithm against those of Depth-n pooling at equivalent levels of judged documents, using the Kendall’s τ measure. Again, the dashed line indicates the results of system evaluations performed using standard TREC routines, given Depth-n pools of depths 1-10, 15, and 20. Examination of TREC 8 demonstrates typical performance. At 40 documents, the τ for Hedge is 0.87, compared with 0.73 for Depth-1, a substantial improvement. Likewise, Hedge-69 achieves an accuracy of 0.91, vs. a Depth-2 accuracy of 0.73.

Next, comparing along the horizontal axis the pool depths required to achieve equivalent ordering accuracy, in Figure 51 we see that in order to achieve an accuracy of 0.87 (Hedge-40), the equivalent Depth-3 pool requires 95 judgments. An accuracy of 0.91 (Hedge-69) corresponds to a pool approaching Depth-8 (198 judgments).

Finally, a look at the scatter plots in Figure 49 demonstrates another aspect of the algorithm’s performance in evaluating system orderings which is somewhat obscured by the traditional Kendall’s τ measure. Each pair of plots shows Depth-1 and equivalent Hedge-m predicted ranks vs. the actual TREC rankings. Note in these plots that the rankings proceed from the best systems in the lower left corner to the worst in the upper right.


Figure 50: (top) Hedge-m vs. Depth-n: percent of total relevant documents discovered; (bottom) Hedge-m and Depth-n vs. actual ranks, Kendall’s τ (TRECs 6, 7, and 8).

Figure 51: Hedge: system evaluation performance, Kendall’s τ (TREC 8).

While poor systems tend to be easily identifiable due to their lack of commonality with any other systems, many of the better systems likewise exhibit a great deal of variance in the documents they return. Thus, while poor systems may be well ranked using standard techniques with depth pools as small as Depth-1, the better systems (and, for most purposes, the systems of most interest) tend to be the more difficult to rank correctly. Since the Kendall’s τ measure of accuracy in system ordering treats systems at all rank levels equally, much of the qualitative superiority of algorithms which perform well in ranking the higher-ranked systems is obscured by uniformly good performance in ranking the poorer systems. Examination of the tightened patterns of the Hedge plots in the regions of interest indicates that the algorithm’s performance in evaluating system orderings is somewhat better than the already excellent performance demonstrated in Figure 50 (bottom).

Figure 52: Depth-1 and equivalent Hedge-m rankings vs. actual TREC ranks (TRECs 6, 7, and 8).


Chapter 7

Metasearch

Metasearch is the well-studied process of fusing the ranked lists of documents returned by a collection of systems in response to a given user query in order to obtain a combined list whose quality equals or exceeds that of any of the underlying lists. Many metasearch techniques have been proposed and studied [17, 63, 98, 8, 69].

The motivation for using a metasearch technique in information retrieval is the metasearcher’s ability to combine the results of many good underlying retrieval systems. Users need only enter a query into a single interface to obtain the best results offered by the underlying systems. For instance, http://www.dogpile.com combines the ranked lists of dozens of retrieval systems such as MSN, Yahoo!, Overture, and Google, thus traversing much more of the web [49] and providing the user with the results ranked most highly by the underlying systems.

Metasearch also gives a systematic way of internally incorporating the various search components into the output of a search engine. For example, in the context of web retrieval, many components contribute to search (content, connectivity, popularity, locality, etc.); the final output list is effectively a metasearch combination of all these components.

The fundamental thesis of metasearch, argued by Lee [64] and Robertson [52] among others, is that different retrieval algorithms retrieve many of the same relevant documents, but different irrelevant documents. This hypothesis, verified empirically, naturally leads to majoritarian metasearch algorithms [73] (that is, a document gets a high metasearch score if it is retrieved by many systems). Vogt [97] calls this the “chorus effect”.

Metasearch does not always work, in part because most techniques are majoritarian over the input systems: if one system is duplicated and all the copies are counted as input to metasearch, then the metasearch output will look much more like that system. The question of when and why metasearch works has been studied by Ng and Kantor [73, 72], Croft [43], and Vogt et al. [100, 97]; they roughly agree that it is important that the combined systems embed a notion of relevance in their scores and that they be independent of each other in an intrinsic manner.

7.1 Established metasearch techniques

CombSUM and variants. CombSUM is perhaps the simplest and most intuitive metasearch algorithm. Suppose one wants to combine the ranked lists of a set of search engines (indexed by s); if each system provides a score c_s(d) for each document d, then the CombSUM metasearch score per document is

\[ \mathrm{CombSUM}(d) = \sum_s c_s(d), \]

where c_s is the normalized score (per system). Normalization of scores is necessary because retrieval functions often output scores on different scales; for example, probabilistic models and language models generally produce scores that are probabilities, hence very small, while vector space model scores can be cosine values or other similarity measures, significantly larger than probabilistic scores. Usually, normalization is a linear transformation mapping all scores into [0, 1].

CombMNZ is a variant of CombSUM in which, besides the cumulative score of a document, the number of systems that retrieved the document is also taken into account (hence “MNZ”, or “multiply by the number of non-zero scores”):

\[ \mathrm{CombMNZ}(d) = N_d \sum_s c_s(d), \]

where N_d is the number of systems that retrieved document d. While CombMNZ is an empirical metasearch algorithm, it has proven to be among the very best in practice.
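
Both combination rules take only a few lines of code; the sketch below uses min-max score normalization, which is one common choice rather than necessarily the one used in the experiments reported here.

from collections import defaultdict

def normalize(run):
    """Min-max normalize one system's scores {doc_id: score} into [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in run.items()}

def comb_sum_mnz(runs):
    """runs: a list of {doc_id: score} dictionaries, one per system."""
    total, hits = defaultdict(float), defaultdict(int)
    for run in runs:
        for d, s in normalize(run).items():
            total[d] += s
            hits[d] += 1
    comb_sum = dict(total)
    comb_mnz = {d: hits[d] * total[d] for d in total}  # multiply by number of systems retrieving d
    return comb_sum, comb_mnz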

Borda count. A simplified version of Borda [8] is just CombSUM using ranks instead of scores as input. If r_s(d) is the rank of document d in system s, then

\[ \mathrm{Borda}(d) = -\sum_s f(r_s(d)), \]

where f is a monotonic transformation of the rank values. Ranks do not need normalization, of course. The sign is changed so that the highest Borda scores correspond to the top ranks. Special attention must be given to documents not retrieved by a system, since a rank of “infinity” makes the Borda formula unusable; in practice a conventional high rank is assigned to documents not retrieved.

The Borda count was originally designed as a system for counting votes and electing candidates [23]; it can be used with any point-based count, not only with ranks.
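
A sketch of the rank-based variant, with documents not retrieved by a system assigned a conventional deep rank (the value 1000 here is an arbitrary illustrative choice):

from collections import defaultdict

def borda(ranked_lists, missing_rank=1000):
    """ranked_lists: list of ranked doc-id lists, one per system.
    Uses f(r) = r; unretrieved documents receive a conventional deep rank."""
    score = defaultdict(float)
    all_docs = set().union(*ranked_lists)
    for lst in ranked_lists:
        rank = {d: r for r, d in enumerate(lst, start=1)}
        for d in all_docs:
            score[d] -= rank.get(d, missing_rank)
    return sorted(score, key=score.get, reverse=True)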

Condorcet. A different election system, Condorcet [71], is essentially a head-to-head election mechanism. Systems are treated as voters and documents as candidates. The documents are the vertices of a graph whose edges encode an aggregate preference between documents (for example, an edge from document d to document f could mean that a majority of systems ranked d higher than f). A Condorcet metasearch list is a ranking that respects edge directions at least for consecutive ranks (non-consecutive ranks may conflict with edge direction due to cycles in the graph).

Essentially the metasearch list produced follows a Hamiltonian path; Montague et al. [71] show that, while the general problem of finding Hamiltonian paths is a well-known NP-hard problem, in the IR setup, where all pairs of documents (d, f) have an edge between them, a Hamiltonian path is easy to find.

Probabilistic methods. Aslam et al. [8] proposed a probabilistic model based on Bayesian inference. The system trains over previous queries (with judged documents) and infers a probability of relevance at each rank for each system. When a new query comes in, an overall probability of relevance is computed for each document from the ranks at which each system places it (and from the previously inferred probabilities). Treating this probability as a score, documents are ranked for the metasearch output.

This model uses a questionable assumption: that systems produce independent rankings, i.e., that the probability of relevance of a document at a given rank for a system is independent of where the document is ranked by other systems. While a way around this heavy assumption is not obvious, Yilmaz et al. [13] proposed a framework for dealing with ranking-based constraints per document; such a framework could perhaps be used to eliminate the independence assumption.

Linear Combination. Among others, Bartell [16] and Vogt [97, 99, 100] proposed as a metasearch score a linear combination of the scores of the constituent systems,

\[ \mathrm{LC}(d) = \sum_s w_s c_s(d), \]

where the weights w_s are learned from the judged documents of previous queries. This method is known to work very well in certain cases, but it is not reliable in general.

We consider two benchmark techniques (baselines) for our experiments: CombMNZ and Condorcet, which produce quality ranked lists of documents by fusing the ranked lists provided by a collection of underlying systems. Given ranked lists produced by good but sufficiently different underlying systems, these metasearch techniques can produce fused lists whose quality exceeds that of any of the underlying lists. Given ranked lists produced by possibly correlated systems of varying performance, they will most often produce fused lists whose performance exceeds that of the “average” underlying list but rarely exceeds that of the best underlying list.


7.2 Measure based metasearch

We propose next a simple method for turning any query-level retrieval performance measure into a metasearch algorithm. While the motivation is mainly a better exploration, understanding, evaluation, and/or comparison of IR measures rather than the design of new metasearch algorithms, we also demonstrate good metasearch results in comparison with popular metasearch techniques. In particular, we consider the most commonly reported measures: precision at standard cutoffs (prec@rank), R-precision, and average precision (AP).

The basic idea of our work is that IR measures evaluate the performance of runs by “looking” into the returned list (in different ways) for relevant documents. The better the measure, the more it “looks” at the right ranks, and therefore the better it should be able to identify relevant documents. A ranked-list measure, in a multi-run setting (like TREC), can therefore be used for metasearch by identifying the documents at the (averaged) ranks believed to hold relevant documents. The metasearch performance (itself measured with AP) gives an insight into the quality of the measure used.

Rank weighting scheme derived from measures

We already showed in chapter 3 that most IR measures for ranked lists induce a prior over documents, based on the ranks at which they are retrieved. For example, prec@10 takes into account only the top 10 ranks with uniform weighting; it therefore induces equal weights on those 10 ranked documents and 0 weight on the rest.

Essentially, any measure used for evaluating a ranked list of documents can be written as a mathematical formula that produces the measurement; we shall try to view this formula as a weighted average

\[ \text{measurement} = \sum_d w(d) \cdot rel(d), \]

where the weight w(d) = w(r(d)) is a rank-based function giving each document its “rank importance”. As shown before, this interpretation may be easy for some measures (prec@rank, R-precision) but is definitely challenging for measures that operate not on individual ranks but rather on pairs of ranks or other subsets (AP). We believe this idea can be generalized to most measures.

prec@rank. Let us fix a rank (or cutoff) c and interpret prec@rank as “precision at cutoff c”. Because no recall is involved, it is easy to see that prec@rank essentially puts a uniform distribution of weights over the top c ranks and 0 over all other ranks in the list:

\[ w(d) = \frac{1}{c} \text{ if } r(d) \le c; \qquad w(d) = 0 \text{ if } r(d) > c. \]


R-precision is very similar to prec@rank (with c = R, the number of relevant documents for the query), except that R is unknown since no relevance judgments are revealed. In this work we assume that R is “magically” given; as stated above, our purpose is to investigate the measures and the correlations between them. In an implementation for metasearch purposes, R would need to be estimated or guessed from data.

\[ w(d) = \frac{1}{R} \text{ if } r(d) \le R; \qquad w(d) = 0 \text{ if } r(d) > R. \]

There is a second fundamental difference between R-precision and prec@rank. In standard IR settings, such as TREC, performance is usually averaged over many (50) queries. In that case the prec@rank cutoff c is a constant across queries while R varies, making the R-precision cutoff appropriate for each individual query.

Average precision has been discussed at length in chapter 3. Its weighting scheme over ranks was defined as the “AP-prior” and used in both sampling and online pooling:

\[ w(d) = \frac{1}{2|s|}\left(1 + \frac{1}{r(d)} + \frac{1}{r(d)+1} + \cdots + \frac{1}{|s|}\right). \]

Metasearch

Given a specific weighting function (associated with an IR measure), every document has an associated set of weights (one weight per system). On a document-by-document basis, the sum (or average) of these weights gives an overall weight for the document; ranking the documents by this overall weight, we obtain a majoritarian metasearch algorithm.
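
A sketch of this procedure, instantiated with the prec@c weighting (the cutoff value and the choice to sum rather than average the weights are illustrative; any of the w(d) functions above can be substituted):

from collections import defaultdict

def prec_at_c_weight(rank, c):
    """Rank weight induced by precision at cutoff c."""
    return 1.0 / c if rank <= c else 0.0

def measure_based_metasearch(ranked_lists, c=100):
    """ranked_lists: list of ranked doc-id lists, one per system.
    Rank documents by the sum of their per-system rank weights."""
    score = defaultdict(float)
    for docs in ranked_lists:
        for r, d in enumerate(docs, start=1):
            score[d] += prec_at_c_weight(r, c)
    return sorted(score, key=score.get, reverse=True)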

We test the metasearch associated with AP, R-precision, and prec@rank at standard cutoffs (5, 10, 15, 20, 30, 50, 100, 200, 500, 1000) using TREC 5, 6, 7, 8, and 9 data (Table 8). We also include the performance of the CombMNZ and Condorcet techniques, described above, as baselines for comparison.

First, we see once again that AP and R-precision are strongly correlated. This argument, based on metasearch performance, merely confirms a correlation already well established by statistical, algebraic, and geometric means.

As for the metasearch performance, both AP and R-precision demonstrate good results in comparison with previous methods.

Note that the metasearch based on AP is a valid and easy-to-implement method, but the one based on R-precision is not, because R is unknown and critical for the algorithm to work. While the results in the table are useful for analyzing the R-precision measure, for practical metasearch purposes one has to come up with a guess for R for each query.

TREC   MNZ    COND   AP     Rp     p@10   p@20   p@50   p@100  p@500
5      .294   .307   .300   .308   .265   .280   .288   .294   .258
6      .341   .315   .344   .357   .292   .315   .335   .341   .319
7      .320   .308   .333   .283   .295   .311   .327   .331   .305
8      .350   .343   .370   .254   .286   .313   .351   .357   .323
9      .351   .348   .345   .266   .309   .303   .320   .323   .294

Table 8: Comparison of metasearch performance (numbers are mean average precision) based on different measures. Tests are run separately on TREC 5, 6, 7, 8, and 9 data.

We also show (Table 9) statistics on R over the TREC queries. Because the mean at almost all TRECs (except TREC 9) is around 100, it is expected that prec@rank with cutoff c = 100 does best among all cutoffs, being relatively close to the AP and R-precision numbers. For prec@rank to work reasonably well compared to R-precision, even at the best cutoff (here c = 100), it is necessary that the R values over the queries have relatively small variation (standard deviation). In particular, for TREC 9 the ratio std/mean is significantly higher, hence the drop in performance.

TREC   R mean   R std
5      109.9    137.7
6       92.1    103.1
7       93.4     85.1
8       94.5     80.0
9       52.3     84.1

Table 9: Statistics of R over the 50 queries of each TREC.

In a separate experiment, using Terabyte 05 data [38], we ran standard IR engines from the Lemur Toolkit [2] in order to feed the metasearch algorithm. The underlying systems include:

• two tf-idf retrieval systems;

• three KL-divergence language models (Dirichlet smoothing, Jelinek-Mercer smoothing, and absolute discounting);

• a cosine similarity system;

• an Okapi system.


System                 MAP      p@20
Jelinek-Mercer         0.2257   0.3780
Dirichlet              0.2100   0.4200
TFIDF                  0.1993   0.4250
Okapi                  0.1906   0.4270
log-TFIDF              0.1661   0.4140
Absolute Discounting   0.1575   0.3660
Cosine Similarity      0.0875   0.1960
CombMNZ                0.2399   0.4550
Condorcet              0.2119   0.4200
AP-prior metasearch    0.2297   0.4260

Table 10: Results for input and metasearch systems on the Terabyte05 collection and topics. CombMNZ, Condorcet, and AP-metasearch are metasearch runs over all input systems.

statAP (1153 topics)        0.2175
statR-prec (1153 topics)    0.2266
statPrec@30 (1153 topics)   0.1728
EMAP (1700 topics)          0.0641

Table 11: Performance over MQ topics, where the evaluation is performed using the MQ evaluation methodology. StatAP, statR-prec, and statPrec@30 refer to estimates of average precision, R-precision, and precision@30, averaged over 1153 topics. EMAP refers to the MTC evaluation over 1700 topics; note that the EMAP value is not on the same scale as traditional MAP values [5].

Table 10 illustrates that both AP-metasearch and CombMNZ are able to exceed the performance of the best underlying system. This demonstrates that AP metasearch (also denoted “hedge0”, i.e., Hedge without relevance feedback) is a successful metasearch technique, exceeding the metasearch performance of Condorcet and rivaling the performance of CombMNZ.

7.2.1 AP-based Metasearch at TREC MQ tracks

In the Million Query Track, an AP-based metasearch system was submitted. Because it is essentially the Hedge algorithm at round 0 (with no feedback), we sometimes refer to it as “Hedge0”; it is an automatic metasearch engine. We indexed the GOV2 collection using the Lemur Toolkit; that process took about 3 days using a 2-processor dual-core Opteron machine (2.4 GHz/core).


MAP                  0.1708
R-prec               0.2411
bpref                0.2414
recip-rank           0.6039
retrieved            135075
relevant             26917
relevant retrieved   13944

Table 12: Performance over the 149 Terabyte06 topics, where the evaluation was performed using traditional methods and metrics.

Precision at Recall (149 Terabyte06 topics)    Precision at Rank (149 Terabyte06 topics)
recall   precision                             rank            precision
.00      0.6611                                at 5 docs       0.4174
.10      0.3748                                at 10 docs      0.3826
.20      0.3096                                at 20 docs      0.3453
.50      0.1558                                at 100 docs     0.2644
.70      0.0592                                at 500 docs     0.1473
1.00     0.0035                                at 1000 docs    0.0936

Table 13: Performance over the 149 Terabyte06 topics, where the evaluation was performed using traditional methods and metrics.

Underlying systems

We ran standard IR engines from the Lemur Toolkit in order to feed the metasearch algorithm mentioned above. For each query and retrieval system, we considered the top 1,000 scored documents for that retrieval system. Once all retrieval systems were run against all queries, we ran the AP-based metasearch algorithm [10] on the ranked lists obtained. These models were run on 10,000 topics using the GOV2 collection.

Besides the newer (and more complex) MQ track evaluations using sampling (statAP) and the Minimal Test Collection method (EMAP), the 149 queries common to the Terabyte 06 track were evaluated with standard measures using the judgments available from that track. We refer to this evaluation as “evaluation over 149 Terabyte 06 topics”. Tables 11, 12, and 13 present the performance of hedge0 on the 2007 Million Query Track collection and topics, separately for the MQ evaluation methods and for the 149 Terabyte topics with traditional evaluation methods and metrics. This performance was in line with expectations and previous results.


7.3 Hedge-based metasearch w/ relevance feedback

In the context of a metasearch engine, the fused list produced by CombMNZ or Condorcet would be presented to the user, who would naturally begin processing the documents in rank order to satisfy the desired information need. While the user could naturally and easily provide relevance feedback to the metasearch algorithm, these techniques are not easily or naturally amenable to incorporating such feedback.

We already described the Hedge application in chapter 6. It is easy to see that Hedge has an immediate use as a metasearch technique, specifically one that accounts for user feedback in a theoretically sound manner. We showed in chapter 6 that the documents pooled using Hedge are very useful for obtaining partial evaluations; now we show that the same strategy is also appropriate for selecting documents to output in a metasearch list: instead of returning just the single unlabeled document with the highest mixture loss if it were non-relevant, rank all the unlabeled documents by their mixture loss if non-relevant, and output this list. (In fact, this list is appended as a suffix to the ordered list of documents that the user has already judged at this point.)

Figure 53: Hedge metasearch with feedback

We examine the performance of the Hedge system as an evolving metasearch list. At each iteration, the document chosen to be judged is the one with the highest expectation of relevance, so it is appropriate to build an online metasearch list from these selections. To complete the metasearch list, the remaining documents are likewise ranked by the weighted linear combination.

Given the ranked lists of documents returned by a collection of IR systems in response to a given query, Hedge is capable of matching and often exceeding the performance of the best underlying retrieval system; given relevance feedback, Hedge is capable of “learning” how to optimally combine the input systems, yielding a level of performance which often significantly exceeds that of the best underlying system.

Experimental results

For the experiments presented in this section, the feedback was obtained by consulting the qrel (judgment) files available from TREC. For the experiment on the Terabyte track, we had to generate the feedback ourselves, so we judged about 2500 documents for 50 queries, or 50 documents per query.

TREC   MNZ     COND    Hedge-0   %MNZ     %COND
3      0.423   0.403   0.418     -0.012   +0.037
5      0.294   0.307   0.309     +0.051   +0.006
6      0.341   0.315   0.345     +0.012   +0.095
7      0.320   0.308   0.323     +0.009   +0.049
8      0.350   0.343   0.352     +.0014   +0.026

Table 14: Hedge-0 vs. the metasearch techniques CombMNZ and Condorcet.

As shown in Table 14, with 0 online relevance judgments the Hedge algorithm begins with a baseline MAP score which is equivalent to, or slightly better than, the performance of the well-known CombMNZ and Condorcet metasearch methods in almost all instances.

Figure 54: Hedge-m metasearch performance (MAP vs. number of documents judged), compared with CombMNZ and the best input system, on TRECs 6, 7, and 8.

Figure 54 shows metasearch performance compared with that of CombMNZ and the best input system (dotted lines). Proceeding from the 0 level, Hedge online metasearch results quickly surpass those of the best underlying retrieval system (the upper dashed lines). In TRECs 3, 5, and 7, the best system is equaled within 10 or fewer judgments. TRECs 6 and 8 require somewhat more judgments to reach the performance of the best underlying system; this reflects the fact that in both competitions the best systems are outliers, with few retrieved documents in common with the generic pack, and thus Hedge requires more judgments before discovering them.

7.3.1 Hedge at Terabyte Track 2006

Traditional information retrieval test collections were small, consisting of only several gigabytes of text and a few million documents. Because of this, TREC proposed the development of a much larger, terabyte-scale collection, with the goals of discovering new evaluation methodologies and developing new IR techniques that could be used on much larger document collections. It was expected that many existing retrieval algorithms that performed well on smaller collections would not scale to very large document collections. Although we showed that the Hedge algorithm for on-line learning is an effective technique for metasearch, often matching or exceeding the performance of standard metasearch and IR techniques on small TREC collections, it remained to be shown that Hedge can indeed scale to terabyte-size collections. In experiments using TREC Terabyte Track data, Hedge is shown to continue to outperform the standard metasearch techniques Condorcet and CombMNZ as well as our best underlying system.

In 2004, the annual TREC conference introduced a new track which dealt with the assessment and scalability of information retrieval (IR) techniques for terabyte-scale collections. Since the terabyte track initiative, there has been speculation that in very large collections it would not be possible to find good enough samples of relevant documents to effectively evaluate the participating systems [4]. It has also been observed that standard IR system parameters would need to be re-tuned to match the performance obtained on much smaller corpora. Because of this, the primary goal of the terabyte track is to observe and evaluate how existing IR techniques scale to terabyte-size data [38].

The traditional method of evaluation used at the TREC conferences was to first take the union of the top 100 documents returned by each participating retrieval system [10, 15]. This pool of documents was then evaluated by human assessors who would judge which of the documents were relevant to the given set of queries (the topics) issued by TREC [15]. Upon determining which documents were relevant to the set of queries, the ranked lists returned by the participating retrieval systems would be evaluated using standard IR measures of performance such as precision and recall. Documents that were not in this pool were automatically assumed to be non-relevant [10].

The purpose of TREC’s traditional pooling technique is to reduce the number of human relevance judgments needed to effectively evaluate the retrieval systems entered in the conference. Since a particular document’s relevance to a topic can only be judged by a human, judging the pool places a heavy burden of cost and time. When dealing with terabyte-size corpora, the pool of documents can also be significantly biased, thus not giving certain retrieval algorithms the scores they deserve. This has been illustrated in [38], where it was found that the best-performing retrieval algorithms were using the topic titles in their rankings of documents; because the terabyte track uses such a large collection, the number of documents containing a given query’s title words is enormous compared to the depth of the assessment pool [38].

To demonstrate Hedge’s performance at the terabyte level, Hedge parameters are refined and run against the TREC 2005 Terabyte Track’s queries and data. Because TREC publishes the IDs of the documents retrieved by the participating systems, it is possible to compute IR statistics such as mean average precision and R-precision. After computing these statistics, it is clear that Hedge continues to outperform standard metasearch techniques and the best underlying system at the terabyte level. We ran Condorcet, CombMNZ, and Hedge over ranked lists produced by the Lemur Toolkit’s standard retrieval systems to test whether the Hedge algorithm for online learning scales to very large corpora.

Experimental Setup and Results

We tested the performance of the Hedge algorithm using the queries from the TREC 2005 Terabyte Track. The underlying systems include: 1) two TFIDF retrieval systems; 2) three KL-divergence retrieval models, one with Dirichlet prior smoothing, one with Jelinek-Mercer smoothing, and one with absolute discounting; 3) a cosine similarity model; 4) the Okapi retrieval model; and 5) the INQUERY retrieval method. All of the above retrieval models are provided as standard IR systems by the Lemur Toolkit for Language Modeling and Information Retrieval [2].

These models were run against a collection (GOV2) of Web data crawled from Web sites in the .gov domain during early 2004 by NIST. The collection is 426GB in size and contains 25 million documents [38]. Although this collection is not a full terabyte in size, it is still much larger than the collections used at previous TREC conferences.

For each query and retrieval system, we obtained the top 10,000 scored documents for that retrieval system. Once all retrieval systems were run against all queries, we ran the Hedge algorithm described above to perform metasearch on the ranked lists obtained. We used the TREC 2005 qrel files to provide relevance feedback to the algorithm; if one of our underlying systems retrieved a document that was not included in the TREC 2005 qrel file, we assumed the document to be irrelevant.

After running Hedge, we ran Condorcet and CombMNZ over the ranked lists generated by our underlying systems; the lists generated by Condorcet and CombMNZ serve as baselines against which to compare Hedge’s performance (see Figure 55). We then calculated mean average precision scores for each of the three metasearch systems.


System      2        4        6        8
CombMNZ     0.2332   0.2693   0.2715   0.2399
Condorcet   0.1997   0.2264   0.2302   0.2119
Hedge 0     0.2314   0.2641   0.2687   0.2297
Hedge 10    0.2579   0.2944   0.2991   0.2650
Hedge 50    0.3199   0.3669   0.3652   0.3493

Table 15: Hedge vs. metasearch techniques CombMNZ and Condorcet.

As in chapter 6, the pools were also evaluated by calculating the fraction of the total relevant documents returned by the systems for a query (recall).

Figure 55: Hedge-m: metasearch performance on Terabyte 05 data

We compare Hedge to CombMNZ, Condorcet, and the underlying retrieval systems used as input to our metasearch techniques. Table 16 shows that Hedge, in the absence of relevance feedback (Hedge 0), is able to consistently outperform Condorcet. Although Hedge 0 does not outperform CombMNZ, it differs from CombMNZ’s performance by at most 0.01. Table 15 illustrates that both Hedge 0 and CombMNZ are able to exceed the performance of the best underlying list. This implies that Hedge alone, without any relevance feedback, is in and of itself a successful metasearch engine.

After providing the Hedge algorithm with only ten relevance judgments (Hedge 10), Hedge is able to outperform CombMNZ, Condorcet, and the best underlying system considerably.


System                 MAP      p@20
Jelinek-Mercer         0.2257   0.3780
Dirichlet              0.2100   0.4200
TFIDF                  0.1993   0.4250
Okapi                  0.1906   0.4270
log-TFIDF              0.1661   0.4140
Absolute Discounting   0.1575   0.3660
Cosine Similarity      0.0875   0.1960
CombMNZ                0.2399   0.4550
Condorcet              0.2119   0.4200
Hedge 0                0.2297   0.4260
Hedge 10               0.2650   0.5270
Hedge 50               0.3493   0.8090

Table 16: Results for input and metasearch systems. CombMNZ, Condorcet, and Hedge N were run over all input systems.

Table 16 also shows that Hedge 50 more than doubles the precision at cutoff 20 of the top underlying system. This is because documents that have been judged relevant are placed at the top of the list, whereas documents that have been judged irrelevant are placed at the bottom of the list. Thus, for a simple query where any of the underlying systems returns many (at least 20) relevant documents, and each of those documents is judged and placed at the top of the list, Hedge will receive a precision at cutoff 20 score of 1.

It should be noted that although this behavior boosts our precision at cutoff 20 scores, and thus our average precision, it is in a sense misleading. Because we are changing the ranks of some of the documents, we obtain better scores, but we are not in any way benefiting the user: the user still has to see the irrelevant documents returned by Hedge and judge them accordingly.

Terabyte06 system

For our Terabyte submission to TREC 2006, given the lack of judgments, we manually judged several documents for each query. We chose to run Hedge for 50 rounds (for each query) on top of our underlying IR systems (provided by Lemur, as described above). Therefore, in total, 50 rounds × 50 queries = 2500 documents were judged for relevance.

As a function of the amount of relevance feedback utilized, four different runs were submitted to Terabyte 2006: hedge0 (no judgments), which is essentially an automatic metasearch system; hedge10 (10 judgments per query); hedge30 (30 judgments per query); and hedge50 (50 judgments per query).


System    MAP     R-prec   p@10    p@30    p@100   p@500
hedge0    0.177   0.228    0.378   0.320   0.232   0.104
hedge10   0.239   0.282    0.522   0.394   0.278   0.118
hedge30   0.256   0.286    0.646   0.451   0.290   0.119
hedge50   0.250   0.280    0.682   0.470   0.279   0.115

Table 17: Results for Hedge runs on Terabyte06 queries.

The performance of all four runs is presented in Table 17. The table reports mean average precision (MAP), R-precision, and precision at cutoffs 10, 30, 100, and 500. Against expectations, hedge30 looks slightly better than hedge50, but this is most likely due to the fact that hedge30 was included as a contributor to the TREC pool of judged documents while hedge50 was not.

Judgment disagreement and its impact on Hedge performance. Hedge works as an on-line metasearch algorithm, using user feedback (judged documents) to weight the underlying input systems. It does not have a “search engine” component; i.e., it does not perform traditional retrieval by analyzing documents for relevance to a given query. Therefore its performance is heavily determined by the user feedback, i.e., the quality of the judgments. In what follows, we discuss how well our own judgments (50 per query) match those provided by the TREC qrel file released at the conclusion of TREC 2006. Major disagreements could obviously lead to significant changes in performance. First, we note that there are consistent, large disagreements. The mismatched relevance judgments for query 823 are shown below:

GX000-62-7241305    trecrel=0   hedgerel=1
GX000-14-5445022    trecrel=1   hedgerel=0
GX240-72-4498727    trecrel=1   hedgerel=0
GX060-85-9197519    ABSENT      hedgerel=0
GX240-48-7256267    trecrel=1   hedgerel=0
GX248-73-4320232    trecrel=1   hedgerel=0
GX245-68-14099084   trecrel=0   hedgerel=1
GX227-60-13210050   trecrel=1   hedgerel=0
GX071-71-15063229   trecrel=1   hedgerel=0
GX047-80-14304963   trecrel=1   hedgerel=0
GX217-86-0259964    trecrel=1   hedgerel=0
GX031-42-14513498   trecrel=1   hedgerel=0
GX227-75-10978947   trecrel=1   hedgerel=0
GX004-97-14821140   trecrel=1   hedgerel=0
GX268-65-3825487    ABSENT      hedgerel=0
GX029-22-6233173    trecrel=1   hedgerel=0
GX060-96-11856158   ABSENT      hedgerel=0
GX269-71-3058600    trecrel=1   hedgerel=0
GX271-79-2767287    trecrel=1   hedgerel=0
Query 823: 19 mismatches


We examined a subset of the mismatched relevance judgments and believe that there were judgment errors on both sides. Nevertheless, all judgment disagreements affect measured Hedge performance negatively. For comparison we re-ran hedge30 (30 judgments) using the TREC qrel file for relevance feedback; in doing so, we obtained a mean average precision of 0.33, consistent with performance on Terabyte 2005. This would place the new hedge30 run second among all manual runs, as ordered by MAP (Figure 56).

Figure 56: Terabyte06: hedge30 with TREC qrel judgments. The shell shows trec_eval measurements on top of the published TREC Terabyte06 ranking of manual runs [30]; it would rank second in terms of MAP.

7.4 Query difficulty estimation

We showed how, working with the results returned by many systems for the same query, one can improve retrieval quality by combining all systems using metasearch. Remaining in the same setup, we now turn our attention to the differences among the systems’ lists and show that these differences characterize the query itself.

We consider the issue of query performance, and we propose a novel method for automatically predicting the difficulty of a query. Unlike a number of existing techniques, which are based on examining the ranked lists returned in response to perturbed versions of the query with respect to the given collection, or to perturbed versions of the collection with respect to the given query, our technique is based on examining the ranked lists returned by multiple scoring functions (retrieval engines) with respect to the given query and collection. In essence, we propose that the results returned by multiple retrieval engines will be relatively similar for “easy” queries but more diverse for “difficult” queries. By appropriately employing the Jensen-Shannon divergence to measure the “diversity” of the returned results, we demonstrate a methodology for predicting query difficulty whose performance exceeds existing state-of-the-art techniques on TREC collections, often remarkably so.


7.4.1 Query hardness

The problem of query hardness estimation is to accurately and automatically predict the difficulty of a query, i.e., the likely quality of a ranked list of documents returned in response to that query by a retrieval engine, and to perform such predictions in the absence of relevance judgments and without user feedback. Much recent research has been devoted to the problem of query hardness estimation, and its importance has been recognized by the IR community [112, 111, 31, 6, 62, 45, 113]. An accurate procedure for estimating query hardness could potentially be used in many ways, including the following:

• Users, alerted to the likelihood of poor results, could be prompted to reformulate their query.

• Systems, alerted to the difficult query, could automatically employ enhanced or alternate search strategies tailored to such difficult queries.

• Distributed retrieval systems could more accurately combine their input results if alerted to the difficulty of the query for each underlying (system, collection) pair [111].

In this work, we propose a new method for automatically predicting the difficulty of a given query. Our method is based on the premise that different retrieval engines or scoring functions will retrieve relatively similar ranked lists in response to “easy” queries but more diverse ranked lists in response to “hard” queries. As such, one can automatically predict the difficulty of a given query by simultaneously submitting the query to multiple retrieval engines and appropriately measuring the “diversity” of the ranked-list responses obtained. In order to measure the diversity of a set of ranked lists of documents, we map these rankings to distributions over the document collection, where documents ranked nearer the top of a returned list are naturally associated with higher distribution weights (befitting their importance in the list) and vice versa. Given a set of distributions thus obtained, we employ the well-known Jensen-Shannon divergence [66] to measure the diversity of the distributions corresponding to these ranked lists.
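
A sketch of this computation is given below; the rank-to-weight mapping shown is one natural choice that echoes the AP-prior weighting used earlier in this thesis, and the uniform mixture of the distributions is one standard form of the multi-distribution Jensen-Shannon divergence, so the details may differ from the exact implementation used in our experiments.

import math

def rank_distribution(docs):
    """Map a ranked list of doc ids to a distribution over those documents,
    with higher weight near the top (~ 1 + log(|s|/rank))."""
    n = len(docs)
    w = {d: 1.0 + math.log(n / r) for r, d in enumerate(docs, start=1)}
    z = sum(w.values())
    return {d: v / z for d, v in w.items()}

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence of a set of distributions:
    the average KL divergence of each distribution to their uniform mixture."""
    docs = set().union(*dists)
    m = {d: sum(p.get(d, 0.0) for p in dists) / len(dists) for d in docs}
    return sum(p[d] * math.log(p[d] / m[d]) for p in dists for d in p) / len(dists)

# Higher divergence among the systems' ranked lists suggests a harder query.
lists = [["d1", "d2", "d3"], ["d2", "d4", "d1"]]
print(js_divergence([rank_distribution(l) for l in lists]))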

We extensively tested our methodology using the benchmark TREC collections [104, 105, 106, 107, 102, 37, 38]. To simulate the ranked lists returned by multiple retrieval strategies in response to a given (TREC) query, we chose subsets of the retrieval runs submitted in response to that query in a given TREC. We then predicted query difficulty using the methodology described and compared our estimated query difficulty to the difficulty of that query in that TREC, as measured in a number of standard and new ways. Finally, we compared the quality of our query difficulty estimates to state-of-the-art techniques [112, 45, 113], demonstrating significant, often remarkable, improvements.

Background and Related Work

Existing work on query hardness estimation can be categorized along at least three axes: (1) How is query hardness defined?, (2) How is query hardness predicted?, and (3) How is the quality of the prediction evaluated? In what follows, we describe our work and related work along these dimensions.

One can define query hardness in many ways; for example, queries can be inherently difficult (e.g., ambiguous queries), difficult for a particular collection, or difficult for a particular retrieval engine run over a particular collection. Other notions of query difficulty exist as well. In what follows, we discuss two notions of query hardness, which we shall refer to as system query hardness and collection query hardness.

System query hardness captures the difficulty of a query for a given retrieval system run over a given collection. Here the notion of query hardness is system-specific; it is meant to capture the difficulty of the query for a particular system, run over a given collection. System query hardness is typically measured by the average precision of the ranked list of documents returned by the retrieval system when run over the collection using the query in question.

Examples of work considering system query hardness include (1) Carmel et al. [31] and Yom-Tov et al. [112] who investigate methods for predicting query hardness, testing against the Juru retrieval system, (2) Cronen-Townsend et al. [45] and Zhou and Croft [113] who investigate methods for predicting query hardness, testing against various language modeling systems, and (3) the Robust track at TREC [102] wherein each system attempted to predict its own performance on each given query.

Collection query hardness captures the difficulty of a query with respect to a given collection. Here the notion of query hardness is meant to be largely independent of any specific retrieval system, capturing the inherent difficulty of the query (for the collection) and perhaps applicable to a wide variety of typical systems. Collection query hardness can be measured by some statistic taken over the performance of a wide variety of retrieval systems run over the given collection using the query in question. For example, Carmel et al. [31] consider collection query hardness by comparing the query difficulty predicted by their method to the median average precision taken over all runs submitted in the Terabyte tracks at TREC for a given query.

Our work: We consider both system query hardness and collection query hardness and demonstrate that our proposed methodology is useful in predicting either. In order to test the quality of our methodology for predicting a given system’s performance (system query hardness), one must fix a retrieval system. In this work, we simply choose the system (retrieval run) whose mean average precision was the median among all those submitted to a particular TREC; thus, we consider a “typical” system, one whose performance was neither extremely high nor low. We refer to this measure as the median system AP (med-sys AP).

In order to test the quality of our methodology for predicting collection query hardness, one must fix a measure for assessing the hardness of a query for a given collection. In this work, we consider two statistics taken over all runs submitted to a particular TREC with respect to a given query: (1) the average of the average precisions for all runs submitted in response to a given query and (2) the median of the average precisions for all runs submitted in response to a given query. We refer to the former as query average AP (avgAP) and the latter as query median AP (medAP).

7.4.2 Predicting query hardness

Cronen-Townsend et al. [45] introduced the clarity score, which effectively measures the ambiguity of the query with respect to a collection, and they show that clarity scores are correlated with query difficulty. Clarity scores are computed by assessing the information-theoretic distance between a language model associated with the query and a language model associated with the collection. Subsequently, Zhou and Croft [113] introduced ranking robustness as a measure of query hardness, where ranking robustness effectively measures the stability in ranked results with respect to perturbations in the collection.

Carmel et al. [31] proposed the use of pairwise information-theoretic distances between distributions associated with the collection, the set of relevant documents, and the query as predictors for query hardness. Yom-Tov et al. [112] proposed a method for predicting query hardness by assessing the stability of ranked results with respect to perturbations in the query; subsequently, they showed how to apply these results to the problem of metasearch [111].

In other related work, Amati et al. [6] studied query hardness and robustness in the context of query expansion, Kwok [62] proposed a strategy for selecting the scoring function based on certain properties of a query, and Macdonald et al. [68] investigated query hardness prediction in an intranet environment.

Our work: While fundamentally different from existing techniques, our work is related to the methodologies described above in a number of ways. A number of existing techniques predict query hardness by measuring the stability of ranked results in the presence of perturbations of the query [112] or perturbations of the collection [113]. In a similar spirit, our proposed technique is based on measuring the stability of ranked results in the presence of perturbations of the scoring function, i.e., the retrieval engine itself. We measure the “stability” of the ranked results by mapping each ranked list of documents returned by a different scoring function to a probability distribution and then measuring the diversity among these distributions using the information-theoretic Jensen-Shannon divergence [66]. In a similar spirit, Cronen-Townsend et al. [45] use the related Kullback-Leibler divergence [42] to compute clarity scores, and Carmel et al. [31] use the Jensen-Shannon divergence to compute their query hardness predictor.

Evaluating the quality of query hardness predictions

In order to evaluate the quality of a query hardness prediction methodology, test collections such as the TREC collections are typically used. The system and/or collection hardnesses of a set of queries are measured, and they are compared to predicted values of query hardness. These actual and predicted values are real-valued, and they are typically compared using various parametric and non-parametric statistics. Zhou and Croft [113] and Carmel et al. [31] compute the linear correlation coefficient ρ between the actual and predicted hardness values; ρ is a parametric statistic which measures how well the actual and predicted hardness values fit to a straight line. If the queries are ranked according to the actual and predicted hardness values, then various non-parametric statistics can be computed with respect to these rankings. Cronen-Townsend et al. [45] and Carmel et al. [31] compute the Spearman rank correlation coefficient. Zhou and Croft [113], Yom-Tov et al. [112], and the TREC Robust track [102] all compute and report the Kendall’s τ statistic.

Our work: In the results that follow, we assess the quality of our query hardness predictions using both the linear correlation coefficient ρ and Kendall’s τ.

7.4.3 Query hardness estimation via Metasearch

Our hypothesis is that disparate retrieval engines will return “similar” results with respect to “easy” queries and “dissimilar” results with respect to “hard” queries. As such, for a given query, our methodology essentially consists of three steps: (1) submit the query to multiple scoring functions (retrieval engines), each returning a ranked list of documents, (2) map each ranked list to a distribution over the document collection, where higher weights are naturally associated with top ranked documents and vice versa, and (3) assess the “disparity” (collective distance) among these distributions. We discuss (2) and (3) in the sections that follow, reserving our discussion of (1) for a later section.

From ranked lists to distributions

Many measures exist for assessing the “distance” between two ranked lists, such as the Kendall’s τ and Spearman rank correlation coefficients mentioned earlier. However, these measures do not distinguish differences at the “top” of the lists from equivalent differences at the “bottom” of the lists; in the context of information retrieval, two ranked lists would be considered much more dissimilar if their differences occurred at the “top” rather than the “bottom” of the lists.

To capture this notion, one can focus on the top retrieved documents only. For example, Yom-Tov et al. [112] compute the overlap (size of intersection) among the top N documents in each of two lists. Effectively, the overlap statistic places a uniform 1/N “importance” on each of the top N documents and a zero importance on all other documents. More natural still, in the context of information retrieval, would be weights which are higher at top ranks and smoothly lower at lesser ranks. Recently, we proposed such weights [12, 11], which correspond to the implicit weights which the average precision measure places on each rank, and we use these distribution weights in our present work as well. Over the top c documents of a list, the distribution weight associated with any rank r, 1 ≤ r ≤ c, is given below; all other ranks have distribution weight zero.

weight(r) = \frac{1}{2c}\left(1 + \frac{1}{r} + \frac{1}{r+1} + \cdots + \frac{1}{c}\right). \qquad (6)
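For concreteness, a minimal Python sketch of Equation 6 and of the ranked-list-to-distribution mapping (the function and variable names here are illustrative, not taken from any released code):

```python
def ap_weight(r, c):
    """AP-induced weight of rank r (1-based) when only the top c ranks
    receive non-zero weight; implements Equation 6.  The weights over
    ranks 1..c sum to 1."""
    assert 1 <= r <= c
    return (1.0 + sum(1.0 / k for k in range(r, c + 1))) / (2.0 * c)


def list_to_distribution(ranked_docs, c=20):
    """Map a ranked list of document ids to a distribution over documents;
    documents below rank c simply receive weight zero (they are omitted)."""
    return {doc: ap_weight(r, c)
            for r, doc in enumerate(ranked_docs[:c], start=1)}
```

The weights decrease smoothly with rank, which is exactly the emphasis on the “top” of the list argued for above.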

The Jensen-Shannon divergence among distributions

Using the above distribution weight function, one can map ranked lists to distributions over documents. In order to measure the “disparity” among these lists, we measure the disparity or divergence among the distributions associated with these lists. For two distributions \vec{p} = (p_1, \ldots, p_n) and \vec{q} = (q_1, \ldots, q_n), a natural and well studied “distance” between these distributions is the Kullback-Leibler divergence [42]:

KL(\vec{p} \,\|\, \vec{q}) = \sum_i p_i \log \frac{p_i}{q_i}

However, the KL-divergence suffers two drawbacks: (1) it is not symmetric in its arguments and (2) it does not naturally generalize to measuring the divergence among more than two distributions. We instead employ the related Jensen-Shannon divergence [66]. Given a set of distributions \{\vec{p}_1, \ldots, \vec{p}_m\}, let \vec{p} be the average (centroid) of these distributions. The Jensen-Shannon divergence among these distributions is then defined as the average of the KL-divergences of each distribution to this average distribution:

JS(\vec{p}_1, \ldots, \vec{p}_m) = \frac{1}{m} \sum_j KL(\vec{p}_j \,\|\, \vec{p})

An equivalent and somewhat simpler formulation defined in terms of entropies also exists [66]. In this work, we employ the Jensen-Shannon divergence among the distributions associated with the ranked lists of documents returned by multiple retrieval engines in response to a given query as an estimate of query hardness.
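A minimal sketch of steps (2) and (3), assuming the list_to_distribution helper sketched above (again an illustration, not the thesis’s code):

```python
import math


def kl(p, q):
    """KL(p || q) for distributions given as dicts; assumes q[d] > 0
    wherever p[d] > 0."""
    return sum(p_d * math.log(p_d / q[d]) for d, p_d in p.items() if p_d > 0)


def js_divergence(distributions):
    """Jensen-Shannon divergence: the average KL divergence of each
    distribution to the centroid (average) of all the distributions."""
    docs = set().union(*(d.keys() for d in distributions))
    m = len(distributions)
    centroid = {doc: sum(d.get(doc, 0.0) for d in distributions) / m
                for doc in docs}
    return sum(kl({doc: d.get(doc, 0.0) for doc in docs}, centroid)
               for d in distributions) / m


def predict_hardness(ranked_lists, c=20):
    """Higher divergence among the systems' top-c results is taken as a
    prediction of a harder query."""
    return js_divergence([list_to_distribution(lst, c) for lst in ranked_lists])
```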

7.4.4 Query hardness results

We tested our automatic metasearch-style query hardness estimation methodology extensively on multiple TREC datasets: TREC5, TREC6, TREC7, TREC8, Robust04, Terabyte04, and Terabyte05. The performance of our proposed Jensen-Shannon query hardness estimator is measured against three benchmark query hardness statistics: query average AP (avgAP) and query median AP (medAP), both measures of collection query hardness, and median-system AP (med-sys AP), a measure of system query hardness. When predicting the difficulties of multiple queries in any given TREC, the strength of correlation of our predicted difficulties with actual query difficulties is measured by both Kendall’s τ and linear correlation coefficient ρ. We conclude that even when using few input systems, our method consistently outperforms existing approaches [112, 45, 113], sometimes remarkably so.


Prediction Method    TREC5    TREC6    TREC7    TREC8    Robust04    TB04     TB05
JS (2 systems)       0.334    0.353    0.436    0.443    0.393       0.339    0.288
JS (5 systems)       0.420    0.443    0.468    0.551    0.497       0.426    0.376
JS (10 systems)      0.468    0.444    0.544    0.602    0.502       0.482    0.406
JS (20 systems)      0.465    0.479    0.591    0.613    0.518       0.480    0.423
JS (all systems)     0.469    0.491    0.623    0.615    0.530       0.502    0.440

Table 18: Kendall’s τ (JS vs. query average AP) for all collections using 2, 5, 10, 20, and all input systems. All but the last row report 10-run average performance.

The ad hoc tracks in TRECs 5–8 and the Robust track in 2004 each employ a standard 1,000 documents retrieved per system per query on collections whose sizes are in the range of hundreds of thousands of documents. For these collections, the weight cutoff was fixed at c = 20 in Equation 6; in other words, only the top 20 documents retrieved by each system received a non-zero weight in the distribution corresponding to the retrieved list, as used in the Jensen-Shannon divergence computation. The Terabyte tracks use the GOV2 collection of about 25 million documents, and ranked result lists consist of 10,000 documents each; for this larger collection and these longer lists, the weight cutoff was set at c = 100 in Equation 6. This work leaves open the question of how to optimally set the weight cutoff per system, query, and/or collection.

The baseline statistics query avgAP, query medAP, and the fixed system med-sys AP are computed among all retrieval runs available. The Jensen-Shannon divergence is computed among 2, 5, 10, 20, or all retrieval runs available. When fewer than all of the available runs are used, the actual runs selected are chosen at random, and the entire experiment is repeated 10 times; scatter plots show a typical result among these 10 repetitions, and tables report the average performance over all 10 repetitions. We note that in general, the quality of our query hardness predictions increases rapidly as more system runs are used, with improvements tailing off after the inclusion of approximately 10 systems.
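A sketch of this repetition protocol, assuming the predict_hardness helper sketched earlier and scipy’s Kendall’s τ (a tooling choice of the sketch, not a requirement; the runs_by_query layout is likewise an assumption):

```python
import random
from scipy.stats import kendalltau


def averaged_tau(runs_by_query, actual_ap, k, repetitions=10, c=20):
    """Choose k runs at random, predict hardness for every query, correlate
    with the per-query AP baseline, and average Kendall's tau over the
    repetitions.  runs_by_query maps run id -> {query id: ranked list};
    actual_ap maps query id -> baseline AP value."""
    run_ids, queries = list(runs_by_query), list(actual_ap)
    taus = []
    for _ in range(repetitions):
        chosen = random.sample(run_ids, k)
        js = [predict_hardness([runs_by_query[r][q] for r in chosen], c)
              for q in queries]
        tau, _ = kendalltau(js, [actual_ap[q] for q in queries])
        taus.append(-tau)  # report the positive correlation, as in the tables
    return sum(taus) / len(taus)
```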

We compare our Jensen-Shannon query hardness predictions with all three baseline statistics, for all queries and all collections; in two isolated cases we excluded queries with zero relevant documents. Figures 57 and 58 show a selection of the results as scatter plots, separately for JS estimation using five system runs and for JS estimation using 10 system runs.

Kendall’s τ measures the similarity of two rankings, in our case, the rankings of the queries in terms of a baseline measure (query average AP, query median AP, or median-system AP) and the rankings of the queries in terms of our Jensen-Shannon estimate. Prediction performance as measured by Kendall’s τ is given in Tables 18, 19, and 20, and for visual purposes, we graph Tables 19 and 20 in Figure 59. Note that while our scatter plots seem to indicate negative correlation (high Jensen-Shannon divergence implies low query performance),


[Figure 57 appears here: a 3×3 grid of scatter plots of query JS divergence (y-axis) against query avgAP, query medAP, and med-sys AP (x-axis, left to right) for TREC8, Robust04, and Terabyte04 (top to bottom), each computed from five input systems. Per-panel correlations (ρ, τ): TREC8 (-0.685, -0.543), (-0.686, -0.551), (-0.599, -0.473); Robust04 (-0.674, -0.500), (-0.680, -0.512), (-0.707, -0.530); Terabyte04 (-0.626, -0.435), (-0.638, -0.457), (-0.553, -0.354).]

Figure 57: Query hardness prediction results using five input systems for (top to bottom) TREC8, Robust04 and Terabyte04. Each dot in these scatter plots corresponds to a query. The x-axis is actual query hardness as measured by query average AP (left), query median AP (center), and median-system AP (right). The y-axis is the Jensen-Shannon divergence computed over the ranked results returned by five randomly chosen systems for that query.


[Figure 58 appears here: the same 3×3 grid of scatter plots as Figure 57, computed from 10 input systems. Per-panel correlations (ρ, τ): TREC8 (-0.802, -0.644), (-0.734, -0.593), (-0.663, -0.538); Robust04 (-0.682, -0.508), (-0.687, -0.523), (-0.663, -0.494); Terabyte04 (-0.620, -0.488), (-0.606, -0.497), (-0.536, -0.403).]

Figure 58: Query hardness prediction results using 10 input systems for (top to bottom) TREC8, Robust04 and Terabyte04. Each dot in these scatter plots corresponds to a query. The x-axis is actual query hardness as measured by query average AP (left), query median AP (center), and median-system AP (right). The y-axis is the Jensen-Shannon divergence computed over the ranked results returned by 10 randomly chosen systems for that query.

this indicates positive correlation with the problem as defined (high Jensen-Shannon divergence implies high query difficulty). As such, we report the corresponding “positive” correlations in all tables, and we note the equivalent negative correlations in all scatter plots.
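Both statistics are standard; a minimal sketch using scipy (again a tooling choice of the sketch), with the sign handled as just described:

```python
from scipy.stats import kendalltau, pearsonr


def prediction_quality(js_scores, actual_ap):
    """Correlate per-query JS divergence (predicted hardness) with a per-query
    AP baseline (query avgAP, query medAP, or med-sys AP).  The raw
    correlations are negative (high JS goes with low AP); the corresponding
    positive values are what the tables report."""
    tau, _ = kendalltau(js_scores, actual_ap)
    rho, _ = pearsonr(js_scores, actual_ap)
    return -tau, -rho
```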

Where we could make a direct comparison with prior results (TREC5, TREC8, Robust04, Terabyte04, and Terabyte05), we indicate the performance reported in prior work along with references. For past results measuring system query hardness (i.e., correlations between predicted query hardness and the hardness of the query for a specific system), we compare these prior results against our correlations with the median-system AP, as that would be closest to a fair comparison.


Prediction Method    TREC5    TREC6    TREC7    TREC8    Robust04    TB04     TB05
JS (2 systems)       0.341    0.385    0.452    0.442    0.401       0.338    0.260
JS (5 systems)       0.448    0.475    0.483    0.547    0.510       0.435    0.340
JS (10 systems)      0.483    0.464    0.556    0.585    0.515       0.485    0.366
JS (20 systems)      0.488    0.503    0.610    0.599    0.533       0.496    0.382
JS (all systems)     0.510    0.530    0.634    0.597    0.544       0.520    0.391

Table 19: Kendall’s τ (JS vs. query median AP) for all collections using 2, 5, 10, 20, and all input systems. All but the last row report 10-run average performance.

Prediction Method       TREC5    TREC6    TREC7    TREC8    Robust04    TB04     TB05
[113] clarity           0.311    –        –        –        0.412       0.134    0.171
[113] robust+clarity    0.345    –        –        –        0.460       0.226    0.252
[112] hist. boost.      –        –        –        .439     –           –        –
JS (2 systems)          0.260    0.276    0.384    0.452    0.387       0.298    0.241
JS (5 systems)          0.350    0.370    0.427    0.525    0.472       0.349    0.318
JS (10 systems)         0.355    0.334    0.476    0.552    0.490       0.408    0.359
JS (20 systems)         0.339    0.365    0.516    0.577    0.498       0.403    0.380
JS (all systems)        0.363    0.355    0.509    0.561    0.512       0.427    0.381

Table 20: Kendall’s τ (JS vs. median-system AP) for all collections using 2, 5, 10, 20, and all input systems. All JS rows but the last report 10-run average performance.

Using 10 input system runs for the Jensen-Shannon computation yields improvements over the best previous results of approximately 40% to 50% on average; the complete Tables 20, 22, and 23 show improvements ranging from 7% to 80%.

The linear correlation coefficient ρ effectively measures how well actual and predicted values fit to a straight line; in our case, these actual and predicted values are the hardness of queries in terms of a baseline measure (query average AP, query median AP, or median-system AP) and the hardness of these same queries in terms of our Jensen-Shannon estimate. Prediction performance as measured by the linear correlation coefficient is presented in Tables 21, 22, and 23. Note the substantial improvements over prior results as shown in Tables 22 and 23.


[Figure 59 appears here: two line plots of Kendall’s τ as a function of the number of systems used for the JS computation (x-axis, 0–140), with one curve each for TREC8, Robust04, and Terabyte04; left panel: JS vs. query median AP, right panel: JS vs. med-sys AP.]

Figure 59: Kendall’s τ for JS vs. query median AP (left) and Kendall’s τ for JS vs. median-system AP (right).

Prediction method    TREC5    TREC6    TREC7    TREC8    Robust04    TB04     TB05
JS (2 systems)       0.498    0.456    0.601    0.627    0.555       0.466    0.388
JS (5 systems)       0.586    0.611    0.637    0.736    0.673       0.576    0.516
JS (10 systems)      0.632    0.651    0.698    0.778    0.672       0.642    0.564
JS (20 systems)      0.645    0.677    0.731    0.784    0.688       0.666    0.577
JS (all systems)     0.623    0.698    0.722    0.770    0.695       0.682    0.581

Table 21: Correlation coefficient ρ (JS vs. query average AP) for all collections using 2, 5, 10, 20, and all systems. All but the last row report 10-run average performance.

Prediction method      TREC5    TREC6    TREC7    TREC8    Robust04    TB04     TB05
[31] Juru (TB04+05)    –        –        –        –        –           .476 (TB04+05 combined)
JS (2 systems)         0.495    0.467    0.612    0.622    0.557       0.466    0.366
JS (5 systems)         0.592    0.631    0.654    0.727    0.677       0.586    0.477
JS (10 systems)        0.622    0.678    0.715    0.762    0.676       0.646    0.524
JS (20 systems)        0.630    0.703    0.752    0.769    0.694       0.674    0.541
JS (all systems)       0.600    0.727    0.743    0.755    0.701       0.687    0.543

Table 22: Correlation coefficient ρ (JS vs. query median AP) for all collections using 2, 5, 10, 20, and all systems. All JS rows but the last report 10-run average performance.


Prediction method       TREC5    TREC6    TREC7    TREC8    Robust04    TB04     TB05
[113] clarity           0.366    –        –        –        0.507       0.305    0.206
[113] robust+clarity    0.469    –        –        –        0.613       0.374    0.362
JS (2 systems)          0.425    0.294    0.553    0.595    0.542       0.435    0.338
JS (5 systems)          0.537    0.459    0.609    0.676    0.645       0.490    0.467
JS (10 systems)         0.556    0.469    0.639    0.707    0.659       0.566    0.524
JS (20 systems)         0.562    0.479    0.679    0.724    0.665       0.585    0.545
JS (all systems)        0.567    0.497    0.657    0.702    0.677       0.603    0.541

Table 23: Correlation coefficient ρ (JS vs. median-system AP) for all collections using 2, 5, 10, 20, and all systems. All JS rows but the last report 10-run average performance.


Chapter 8

Conclusions

We presented techniques for solving, in a unified yet principled way, various IR tasks: pooling, evaluation, reusability, metasearch, and query difficulty. At the time of the writing of this document, all of these problems were being intensively debated in the IR community and were also relevant for commercial search engines.

While we provided an implementation (not necessarily the best one) of all of these techniques, the main contributions of this thesis are a view of these problems that exploits their intrinsic connections, and the demonstration that certain techniques are applicable to IR with the potential for great results.

As in any scientific field, IR research builds on previous developments. Much of this thesis is motivated by the need for tools that both handle the continuous growth of digitized information and relate to previous research results.

8.1 Thoughts on IR Evaluation

TREC has been at the very center of IR evaluation since 1992. It is important to note that, although many aspects of evaluation are not yet well understood, TREC has been a real boon for all IR researchers around the world, and also for major companies in the search business.

Several concerns have been raised about TREC evaluation, most notably by Zobel [115] and by Soboroff and Robertson [27, 90]. The main issue is that by using some retrieval systems to get access to the documents, we inevitably bias our judgments towards these systems; more importantly, a later system using the same queries and collection of documents may be hard to evaluate due to a lack of appropriate judgments. Since this is a relative deficit (some new systems may be more evaluable than others), we think that any evaluation methodology that uses incomplete judgments has to be accompanied by a confidence assessment (the sampling method proposed in chapter 4 permits such a confidence).

Another aspect of evaluation with incomplete judgments that is often ignored is the bias of the judged documents towards certain types of documents. Soboroff and colleagues [27] are among the few who analyze this effect; they propose a way to measure the bias towards “title words” documents, that is, documents retrieved simply because they heavily contain the query title terms. Soboroff and Buckley [27, 26] also experiment with various heuristics in order to try to find relevant documents other than the title-words ones.

It seems that we need to somehow ensure that the set of documents pooled for judging is representative of the collection, independent of which systems’ outputs are available at pooling time. No solution has been proposed yet in this direction, but we believe that such a solution is possible via sampling, and that any such solution has to make use of sampling or some flavor of nondeterministic method for selecting documents.

8.2 Sampling remarks

IR evaluation is presently critically dependent on methodologies that allow scalability. The existing methodologies (Average Precision) are well understood and fairly popular: they provide a “common denominator” for the evaluation of IR systems. However, we have known for quite some time that these methods are less and less applicable to the newer, growing collections. It is a deployment challenge to come up with new evaluation methodologies that researchers would adopt quickly, but also a challenge to make the existing metrics work with the new data.

The main problem with the existing metrics is that they require immense human effort (judgment of documents) in order to produce the measurements. Our proposed sampling method provides scalability more than any previous pooling-and-evaluation strategy. We essentially show that with 5% of the current effort, we can get 95% of the benefits; that is, rigorously put, we can predict the correct results with 95% confidence.

In particular, confidence in the estimate under incomplete judgments is a first on the evaluation scene. To our knowledge, no current evaluation methodology comes with well grounded confidence guarantees, other than empirical robustness demonstrations. The confidence greatly affects the reusability of the judged documents in evaluating new, later systems run on the same data. We showed a particular extreme example (Robust Track 05) where a special run is massively under-evaluated because many of its uniquely-retrieved relevant documents are not included in the judged set; even in such a case, the confidence can at least warn of inappropriate estimation.

Sampling as applied to IR evaluation has many other benefits:

• the mathematics of sampling is relatively simple

• it is a very powerful technique

• sampling is relatively easy to put into practice

• sampling can incorporate additional effort


• sampling pools are generally more representative of the collection than the pools obtained with deterministic pooling methods

By analogy with TREC, the sampling strategy is the equivalent of depth-pooling, the evaluation is the equivalent of the trec-eval program, and the sample is the equivalent of the traditional qrel files [107]; the only addition is that every sampled document is also accompanied by a count (with-replacement sampling) or by an inclusion probability (without-replacement sampling). This methodology works very well for small sample sizes (Kendall’s τ = .85 for less than 2% of the pool judged), and it can also be smoothly adapted towards the traditional setup (depth-pooling, trec-eval, qrel) by adding depth-pooling style judged documents; at the very extreme, when all documents are judged, our estimated values are identical to the trec-eval outputs.
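Purely as an illustration of the kind of estimator such a sample supports (the actual estimators for average precision, developed in chapter 4, are more involved), a Horvitz-Thompson-style estimate of the total number of relevant documents from a without-replacement sample might look like:

```python
def estimate_num_relevant(judged_sample):
    """judged_sample: iterable of (relevance, inclusion_probability) pairs
    for the sampled-and-judged documents; relevance is 0 or 1.  Weighting
    each judgment by 1/pi makes the estimate unbiased for the population
    total, exactly as in classical survey sampling."""
    return sum(rel / pi for rel, pi in judged_sample if pi > 0)
```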

The Million Query Track and the subsequent analyses are the kind of large-scale experiment made possible by the sampling technology (Carterette’s MTC evaluation methodology [33] was also used). We show that we can evaluate retrieval systems with greatly reduced effort and give confidence in the results; comparisons with previous evaluation methodologies are presented for reference.

The statistical analysis of variance performed on the MQ track also gives an idea of the budgeting “sweet spot”: how many queries, at various levels of judgments, are necessary for a certain confidence in the overall performance of IR systems. The results of our study confirm that evaluation over more queries with fewer or noisier judgments is preferable to evaluation over fewer queries with more judgments.

8.3 Online metasearch

The idea of combining several ranked lists into a metasearch list is very natural for IR: many search engines use several search components, such as content matching, global popularity, etc., and essentially combine the rankings internally in order to produce a final output.

We found the setup of multiple systems for the same query useful for metasearch quality, but also for many other purposes. We showed how one can use the underlying rankings to select documents and evaluate the performance of the constituent systems, how to characterize the popular measures of performance via metasearch, and how to estimate query difficulty using a generalized distribution distance (the Jensen-Shannon divergence) on the ranking constraints. All of these tasks were solved using sound techniques adopted from Machine Learning or Information Theory.

Hedge  We have shown that the Hedge algorithm for on-line learning can be adapted to simultaneously solve the problems of metasearch, pooling, and system evaluation, in a manner both efficient and highly effective. In the absence of relevance judgments, Hedge produces metasearch lists whose quality equals or exceeds that of benchmark techniques such as CombMNZ and Condorcet; in the presence of relevance judgments, the performance of Hedge increases rapidly and dramatically.

When applied to the problems of pooling and system evaluation, Hedge identifies relevant documents very quickly, and these documents form an excellent and efficient pool for evaluating the quality of retrieval systems.
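For orientation, the core of a Hedge-style update is the standard multiplicative-weights rule sketched below; the loss definition over ranked lists, and the way the weights drive both the metasearch list and the choice of the next document to judge, are specific to the metasearch chapter and appear here only as placeholders:

```python
def hedge_update(system_weights, losses, beta=0.9):
    """One round of the multiplicative-weights (Hedge) update: systems that
    incur loss on the newly judged document (e.g., by ranking a judged
    non-relevant document highly) are down-weighted.  The loss values here
    stand in for whatever loss the application defines."""
    updated = [w * (beta ** loss) for w, loss in zip(system_weights, losses)]
    total = sum(updated)
    return [w / total for w in updated]
```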

We tested metasearch and related technologies on many datasets and problems: the TREC 3–8 ad hoc tracks, the TREC 2006 Terabyte track, and the TREC 2007/2008 Million Query track. On the Terabyte 2006 track, we provided manual feedback ourselves by judging about 2500 documents. Overall, it has been shown that the Hedge algorithm for online learning is highly efficient and effective as a metasearch technique. Our experiments show that even without relevance feedback Hedge is still able to produce metasearch lists which are directly comparable to those of the standard metasearch techniques Condorcet and CombMNZ, and which exceed the performance of the best underlying list. With relevance feedback, Hedge is able to considerably outperform Condorcet and CombMNZ: after 50 relevance judgments, Hedge’s average precision is almost double that of our best underlying system.

Query hardness estimation  Previous work on query hardness has demonstrated that measures of the stability of ranked results returned in response to perturbed versions of the query with respect to the given collection, or to perturbed versions of the collection with respect to the given query, are both correlated with query difficulty, both in general and for specific systems. In this work, we further demonstrate that a measure of the stability of ranked results returned in response to perturbed versions of the scoring function is also correlated with query hardness, often at a level significantly exceeding that of prior techniques. Zhou and Croft [113] and Carmel et al. [31] demonstrate that combining multiple methods for predicting query difficulty yields improvements in the predicted results, and we hypothesize that appropriately combining our proposed method with other query difficulty prediction methods would yield further improvements as well. Finally, this work leaves open the question of how to optimally pick the number and type of scoring functions (retrieval engines) to run in order to most efficiently and effectively predict query hardness.


Bibliography

[1] History of search engines. http://en.wikipedia.org/wiki/Web-search-engine.

[2] The Lemur Toolkit for language modeling and information retrieval. Available at http://www-2.cs.cmu.edu/lemur.

[3] Standard deviation, MATLAB reference.

[4] Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sept. 2001.

[5] J. Allan, B. Carterette, J. A. Aslam, V. Pavlu, B. Dachev, and E. Kanoulas. Overview of the TREC 2007 Million Query Track. In Proceedings of TREC, 2007.

[6] G. Amati, C. Carpineto, and G. Romano. Query difficulty, robustness and selective application of query expansion. In Proceedings of the 25th European Conference on Information Retrieval ECIR 2004, 2004.

[7] E. C. Anderson. Monte Carlo methods and importance sampling. Lecture Notes for Statistical Genetics, October 1999.

[8] J. A. Aslam and M. Montague. Models for metasearch. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 276–284. ACM Press, September 2001.

[9] J. A. Aslam and V. Pavlu. A practical sampling strategy for efficient retrieval evaluation, technical report.

[10] J. A. Aslam, V. Pavlu, and R. Savell. A unified model for metasearch, pooling, and system evaluation. In O. Frieder, J. Hammer, S. Quershi, and L. Seligman, editors, Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 484–491. ACM Press, November 2003.


[11] J. A. Aslam, V. Pavlu, and E. Yilmaz. Measure-based metasearch. In G. Marchionini, A. Moffat, J. Tait, R. Baeza-Yates, and N. Ziviani, editors, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 571–572. ACM Press, August 2005.

[12] J. A. Aslam, V. Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In S. Dumais, E. N. Efthimiadis, D. Hawking, and K. Jarvelin, editors, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 541–548. ACM Press, August 2006.

[13] J. A. Aslam and E. Yilmaz. Inferring document relevance via average precision. In S. Dumais, E. N. Efthimiadis, D. Hawking, and K. Jarvelin, editors, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 601–602. ACM Press, August 2006.

[14] J. A. Aslam, E. Yilmaz, and V. Pavlu. The maximum entropy method for analyzing retrieval measures. In G. Marchionini, A. Moffat, J. Tait, R. Baeza-Yates, and N. Ziviani, editors, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 27–34. ACM Press, August 2005.

[15] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.

[16] B. T. Bartell. Optimizing Ranking Functions: A Connectionist Approach to Adaptive Information Retrieval. PhD thesis, University of California, San Diego, 1994.

[17] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In W. B. Croft and C. van Rijsbergen, editors, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 173–181, July 1994.

[18] N. Belkin, C. Cool, W. Croft, and J. Callan. The effect of multiple query representations on information retrieval system performance. In R. Korfhage, E. Rasmussen, and P. Willett, editors, Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 339–346, June 1993.

[19] N. Belkin, P. Kantor, C. Cool, and R. Quatrain. Combining evidence for information retrieval. In D. Harman, editor, The Second Text REtrieval Conference (TREC-2), pages 35–43. U.S. Government Printing Office, Mar. 1994.


[20] N. Belkin and G. Muresan. Measuring web search effectiveness: Rutgers at interactive TREC.

[21] A. L. Berger, V. D. Pietra, and S. D. Pietra. A maximum entropy approach to natural language processing. Comput. Linguist., 22:39–71, 1996.

[22] D. Bodoff and P. Li. Test theory for assessing IR test collections. In Proceedings of SIGIR, pages 367–374, 2007.

[23] J. C. Borda. Mémoire sur les élections au scrutin, 1781.

[24] R. L. Brennan. Generalizability Theory. Springer-Verlag, New York, 2001.

[25] K. R. W. Brewer and M. Hanif. Sampling With Unequal Probabilities. Springer, New York, 1983.

[26] C. Buckley. Looking at limits and tradeoffs: Sabir research at TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005), 2007.

[27] C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. Bias and the limits of pooling for large collections. Inf. Retr., 10(6):491–508, 2007.

[28] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33–40, 2000.

[29] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 25–32, 2004.

[30] S. Buttcher, C. L. A. Clarke, and I. Soboroff. The TREC 2006 terabyte track. In Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006), 2005.

[31] D. Carmel, E. Yom-Tov, A. Darlow, and D. Pelleg. What makes a query difficult? In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 390–397, New York, NY, USA, 2006. ACM Press.

[32] B. Carterette. Robust test collections for retrieval evaluation. In Proceedings of SIGIR, pages 55–62, 2007.

[33] B. Carterette, J. Allan, and R. Sitaraman. Minimal test collections for retrieval evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 268–275, 2006.


[34] B. Carterette and M. Smucker. Hypothesis testing with incomplete relevance judgments. In Proceedings of CIKM, pages 643–652, 2007.

[35] N. Cesa-Bianchi, Y. Freund, D. Helmbold, D. Haussler, R. Schapire, and M. Warmuth. How to use expert advice. In Proceedings of the 25th Annual ACM Symposium on the Theory of Computing, pages 382–391.

[36] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, pages 310–318, 1996.

[37] C. Clarke, N. Craswell, and I. Soboroff. The TREC terabyte retrieval track. 2004.

[38] C. L. A. Clarke, F. Scholer, and I. Soboroff. The TREC 2005 terabyte track. In Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005), 2005.

[39] W. S. Cooper. Expected search length: A single measure of retrieval effectiveness based on the weak ordering action of retrieval systems. American Documentation, 19(1):30–41, 1968.

[40] W. S. Cooper. On selecting a measure of retrieval effectiveness. Part I. Readings in Information Retrieval, pages 191–204, 1997.

[41] G. V. Cormack, C. R. Palmer, and C. L. A. Clarke. Efficient construction of large test collections. In Croft et al. [44], pages 282–289.

[42] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

[43] W. B. Croft. Combining approaches to information retrieval. In W. B. Croft, editor, Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, chapter 1. Kluwer Academic Publishers, 2000.

[44] W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 1998.

[45] S. Cronen-Townsend, Y. Zhou, and W. Croft. Predicting query performance. In Proceedings of the ACM Conference on Research in Information Retrieval (SIGIR), 2002.

[46] B. Dervin and M. S. Nilan. Information needs and use. In Annual Review of Information Science and Technology, volume 21, pages 3–33, 1986.

[47] T. Fawcett. An introduction to ROC analysis. Pattern Recogn. Lett., 27(8):861–874, 2006.


[48] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, Aug. 1997.

[49] S. Gardiner. Cybercheating: A new twist on an old problem, 2001.

[50] W. R. Greiff and J. Ponte. The maximum entropy approach and probabilistic IR models. ACM Trans. Inf. Syst., 18(3):246–287, 2000.

[51] D. Harman. Overview of the third text REtrieval conference (TREC-3). In D. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 1–19. U.S. Government Printing Office, Apr. 1995.

[52] D. Hawking and S. Robertson. On collection size and retrieval effectiveness. Information Retrieval, 6(1):99–105, 2003.

[53] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.

[54] E. Jaynes. On the rationale of maximum entropy methods. In Proc. IEEE, volume 70, pages 939–952, 1982.

[55] E. T. Jaynes. Information theory and statistical mechanics: Part I. Physical Review 106, pages 620–630, 1957a.

[56] E. T. Jaynes. Information theory and statistical mechanics: Part II. Physical Review 108, page 171, 1957b.

[57] E. C. Jensen. Repeatable Evaluation of Information Retrieval Effectiveness in Dynamic Environments. PhD thesis, Illinois Institute of Technology, 2006.

[58] T. Joachims. A support vector method for multivariate performance measures. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 377–384, New York, NY, USA, 2005. ACM.

[59] Y. Kagolovsky and J. R. Moehr. Current status of the evaluation of information retrieval. J. Med. Syst., 27(5):409–424, 2003.

[60] P. B. Kantor and J. Lee. The maximum entropy principle in information retrieval. In SIGIR ’86: Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval, pages 269–274. ACM Press, 1986.

[61] M. Kendall. Rank correlation methods, 1948.

[62] K. Kwok. An attempt to identify weakest and strongest queries. In ACM SIGIR’05 Query Prediction Workshop, 2005.


[63] J. H. Lee. Combining multiple evidence from different properties of weighting schemes. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 180–188, 1995.

[64] J. H. Lee. Analyses of multiple evidence combination. In N. J. Belkin, A. D. Narasimhalu, and P. Willett, editors, Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–275, July 1997.

[65] D. D. Lewis. Evaluating and optimizing autonomous text classification systems. In SIGIR ’95: Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 246–254. ACM Press, 1995.

[66] J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inform. Theory, 37:145–151, 1991.

[67] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[68] C. Macdonald, B. He, and I. Ounis. Predicting query performance in intranet search. In ACM SIGIR’05 Query Prediction Workshop, 2005.

[69] R. Manmatha, T. Rath, and F. Feng. Modeling score distributions for combining the outputs of search engines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval [4], pages 267–275.

[70] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, July 2008.

[71] M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval. In K. Kalpakis, N. Goharian, and D. Grossman, editors, Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 538–548. ACM Press, November 2002.

[72] K. B. Ng. An Investigation of the Conditions for Effective Data Fusion in Information Retrieval. PhD thesis, School of Communication, Information, and Library Studies, Rutgers University, 1998.

[73] K. B. Ng and P. B. Kantor. An investigation of the preconditions for effective data fusion in IR: A pilot study. In Proceedings of the 61st Annual Meeting of the American Society for Information Science, 1998.

[74] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999.


[75] S. Osinski and D. Weiss. A concept-driven algorithm for clustering search results. Intelligent Systems, IEEE [see also IEEE Intelligent Systems and Their Applications], 20(3):48–54, 2005.

[76] D. Pavlov, A. Popescul, D. M. Pennock, and L. H. Ungar. Mixtures of conditional maximum entropy models. In T. Fawcett and N. Mishra, editors, ICML, pages 584–591. AAAI Press, 2003.

[77] S. J. Phillips, M. Dudik, and R. E. Schapire. A maximum entropy approach to species distribution modeling. In ICML ’04: Twenty-first international conference on Machine learning, New York, NY, USA, 2004. ACM Press.

[78] V. Raghavan, P. Bollmann, and G. S. Jung. A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans. Inf. Syst., 7(3):205–229, 1989.

[79] A. Ratnaparkhi and M. P. Marcus. Maximum entropy models for natural language ambiguity resolution, 1998.

[80] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth and Brooks/Cole, 1988.

[81] J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, second edition, 1995.

[82] S. Robertson. On GMAP: and other transformations. In CIKM ’06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages 78–83, New York, NY, USA, 2006. ACM.

[83] T. Sakai. Evaluating evaluation metrics based on the bootstrap. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 525–532, New York, NY, USA, 2006. ACM.

[84] T. Sakai. Alternatives to bpref. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 71–78, New York, NY, USA, 2007. ACM.

[85] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4):288–297, 1990.

[86] M. Sanderson and J. Zobel. Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of SIGIR, pages 162–169, 2005.

[87] R. Savell. On-line Metasearch, Pooling and System Evaluation. PhD thesis, Dartmouth College, 2005.


[88] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal 27, pages 379–423 & 623–656, 1948.

[89] I. Soboroff, C. Nicholas, and P. Cahan. Ranking retrieval systems without relevance judgments. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval [4], pages 66–73.

[90] I. Soboroff and S. Robertson. Building a filtering test collection for TREC 2002. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 243–250, New York, NY, USA, 2003. ACM.

[91] K. Sparck Jones and C. J. van Rijsbergen. Information retrieval test collections. Journal of Documentation, 32(1):59–75, 1976.

[92] W. L. Stevens. Sampling without replacement with probability proportional to size. Journal of the Royal Statistical Society, Series B (Methodological), 20(2):393–397, 1958.

[93] J. Tague-Sutcliffe and J. Blustein. A statistical analysis of the TREC-3 data. In Proceedings of the Third Text REtrieval Conference (TREC-3), pages 385–398, 1995.

[94] S. K. Thompson. Sampling. Wiley-Interscience, second edition, 2002.

[95] S. Tomlison, D. Oard, J. Baron, and P. Thompson. Overview of the TREC 2007 legal track. In Proceedings of the Sixteenth Text REtrieval Conference (TREC 2007), 2007.

[96] C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979.

[97] C. C. Vogt. Adaptive Combination of Evidence for Information Retrieval. PhD thesis, University of California, San Diego, 1999.

[98] C. C. Vogt. How much more is better? Characterizing the effects of adding more IR systems to a combination. In Content-Based Multimedia Information Access (RIAO), pages 457–475, Apr. 2000.

[99] C. C. Vogt and G. W. Cottrell. Fusion via a linear combination of scores. Information Retrieval, 1(3):151–173, Oct. 1999.

[100] C. C. Vogt, G. W. Cottrell, R. K. Belew, and B. T. Bartell. Using relevance to train a linear mixture of experts. In E. Voorhees and D. Harman, editors, The Fifth Text REtrieval Conference (TREC-5), pages 503–515. U.S. Government Printing Office, 1997.

[101] E. Voorhees. The TREC-8 question answering track report, 1999.


[102] E. M. Voorhees. The TREC robust retrieval track. SIGIR Forum, 39(1):11–20, 2005.

[103] E. M. Voorhees. The TREC 2005 robust track. SIGIR Forum, 40(1):41–48, 2006.

[104] E. M. Voorhees and D. Harman. Overview of the Fifth Text REtrieval Conference (TREC-5). In TREC, 1996.

[105] E. M. Voorhees and D. Harman. Overview of the Sixth Text REtrieval Conference (TREC-6). In TREC, pages 1–24, 1997.

[106] E. M. Voorhees and D. Harman. Overview of the Seventh Text REtrieval Conference (TREC-7). In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 1–24, 1999.

[107] E. M. Voorhees and D. Harman. Overview of the Eighth Text REtrieval Conference (TREC-8). In Proceedings of the Eighth Text REtrieval Conference (TREC-8), pages 1–24, 2000.

[108] V. G. Vovk. A game of prediction with expert advice. In COLT, pages 51–60, 1995.

[109] N. Wu. The Maximum Entropy Method. Springer, New York, 1997.

[110] E. Yilmaz and J. A. Aslam. Estimating average precision with incomplete and imperfect judgments. In P. S. Yu, V. Tsotras, E. Fox, and B. Liu, editors, Proceedings of the Fifteenth ACM International Conference on Information and Knowledge Management, pages 102–111. ACM Press, August 2006.

[111] E. Yom-Tov, S. Fine, D. Carmel, and A. Darlow. Metasearch and Federation using Query Difficulty Prediction. In Predicting Query Difficulty - Methods and Applications, Aug. 2005.

[112] E. Yom-Tov, S. Fine, D. Carmel, and A. Darlow. Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval. In SIGIR, pages 512–519, 2005.

[113] Y. Zhou and W. B. Croft. Ranking robustness: A novel framework to predict query performance. Technical Report IR-532, Center for Intelligent Information Retrieval, University of Massachusetts, Amherst, 2006. URL: http://maroo.cs.umass.edu/pub/web/674.

[114] G. K. Zipf. Human behavior and the principle of least effort: An introduction to human ecology. Hafner Pub. Co.

[115] J. Zobel. How reliable are the results of large-scale retrieval experiments? In Croft et al. [44], pages 307–314.