Building Reliable Test and Training Collections in Information Retrieval

    A dissertation presented

    by

    Evangelos Kanoulas

    to the Faculty of the Graduate School

    of the College of Computer and Information Science

    in partial fulfillment of the requirements for the degree of

    Doctor of Philosophy

    Northeastern University

    Boston, Massachusetts

    December, 2009

Abstract

    Research in Information Retrieval has significantly benefited from the availability of standard test collections and the use of these collections for comparative evaluation of the effectiveness of different retrieval system configurations in controlled laboratory experiments. In an attempt to design large and reliable test collections, decisions regarding the assembly of the document corpus, the selection of topics, the formation of relevance judgments, and the development of evaluation measures are particularly critical and affect both the cost of the constructed test collections and their effectiveness in evaluating retrieval systems. Furthermore, building retrieval systems has recently been viewed as a machine learning task, resulting in the development of a learning-to-rank methodology widely adopted by the community. It is apparent that the design and construction methodology of learning collections, along with the selection of the evaluation measure to be optimized, significantly affects the quality of the resulting retrieval system. In this work we consider the construction of reliable and efficient test and training collections to be used in the evaluation of retrieval systems and in the development of new and effective ranking functions. In the process of building such collections we investigate methods of selecting the appropriate documents and queries to be judged, and we propose evaluation metrics that can better capture the overall effectiveness of the retrieval systems under study.


Contents

    Abstract i

    Contents iii

    List of Figures vii

    List of Tables xi

    1 Introduction 1

    1.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.2 Learning-to-Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.3 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.4 Duality between evaluation and learning-to-rank . . . . . . . . . . . . . . 6

    1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    2 Background Information 9

    2.1 Retrieval Models and Relevance . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    2.2.1 Document corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.2.2 Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    2.2.3 Relevance Judgments . . . . . . . . . . . . . . . . . . . . . . . . . 14

    2.2.4 Reliability of Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 17

    2.3 Learning-to-Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2.3.1 Learning-to-rank algorithms . . . . . . . . . . . . . . . . . . . . . . 19

    2.3.2 Learning-to-rank collections . . . . . . . . . . . . . . . . . . . . . . 19

    2.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    2.4.1 Evaluating evaluation metrics . . . . . . . . . . . . . . . . . . . . . 24

    2.4.2 Evaluation metrics for incomplete relevance judgments . . . . . . . 25

    2.4.3 Multi-graded evaluation metrics . . . . . . . . . . . . . . . . . . . . 25

    2.5 Summary and Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27



    3 Test Collections 29

    3.1 Document Selection Methodology for Efficient Evaluation . . . . . . . . . 29

    3.1.1 Confidence Intervals for infAP . . . . . . . . . . . . . . . . . . . . . 31

    3.1.2 Inferred AP on Nonrandom Judgments . . . . . . . . . . . . . . . . 34

    3.1.3 Inferred AP in TREC Terabyte . . . . . . . . . . . . . . . . . . . . . 37

    3.1.4 Estimation of nDCG with Incomplete Judgments . . . . . . . . . . 40

    3.1.5 Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    3.1.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    3.2 Evaluation over Thousands of Queries . . . . . . . . . . . . . . . . . . . . 45

    3.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    3.2.2 Million Query 2007: Experiment and Results . . . . . . . . . . . . 47

    3.2.3 Million Query 2007: Analysis . . . . . . . . . . . . . . . . . . . . 50

    3.2.4 Million Query 2008: Experiment and Results . . . . . . . . . . . . 51

    3.2.5 Million Query 2008: Analysis . . . . . . . . . . . . . . . . . . . . 54

    3.2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    3.3 Overall Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

    4 Training Collections 63

    4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    4.1.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4.1.2 Document selection . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    4.1.3 Learning-to-rank algorithms . . . . . . . . . . . . . . . . . . . . . . 69

    4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    4.3 Overall Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

    5 Evaluation Metrics 77

    5.1 Gain and Discount Function for nDCG . . . . . . . . . . . . . . . . . . . . 79

    5.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    5.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

    5.1.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

    5.2 Extension of Average Precision to Graded Relevance Judgments . . . . . . 95

    5.2.1 Graded Average Precision (GAP) . . . . . . . . . . . . . . . . . . . 96

    5.2.2 Properties of GAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

    5.2.3 Evaluation Methodology and Results . . . . . . . . . . . . . . . . . 103

    5.2.4 GAP for Learning to Rank . . . . . . . . . . . . . . . . . . . . . . . 108

    5.2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

    5.3 Overall Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110


    6 Conclusions 113

    Bibliography 115

    A Confidence Intervals for Inferred Average Precision 131

    B Variance Decomposition Analysis 133

    C Statistics 135

List of Figures

    1.1 Training and evaluating retrieval systems. . . . . . . . . . . . . . . . . . . . . 6

    3.1 TREC 8, 9 and 10 mean inferred AP along with estimated confidence intervals when relevance judgments are generated by sampling 10, 20 and 30% of the depth-100 pool versus the mean actual AP. . . . . . . . . . . 33

    3.2 Cumulative Distribution Function of the mean infAP values . . . . . . . . . . 34

    3.3 TREC 8 mean xinfAP when relevance judgments are generated according to depth-10, depth-5 and depth-1 pooling combined with an equivalent number of randomly sampled judgments versus mean actual AP. . . . . . . . . . . 39

    3.4 TREC 8 change in Kendall's τ, linear correlation coefficient (ρ), and RMS errors of xinfAP and infAP as the judgment sets are reduced when half of the judgments are generated according to depth pooling and the other half is a random subset of complete judgments, and of inferred AP when the judgments are a random subset of complete judgments. . . . . . . . . . . 40

    3.5 Comparison of extended inferred map, (extended) mean inferred nDCG, inferred map and mean nDCG on random judgments, using Kendall's τ for TREC 8, 9 and 10. . . . . . . . . . . 43

    3.6 Comparison of extended inferred map, (extended) mean inferred nDCG, inferred map and mean nDCG on random judgments, using linear correlation for TREC 8, 9 and 10. . . . . . . . . . . 44

    3.7 Comparison of extended inferred map, (extended) mean inferred nDCG, inferred map and mean nDCG on random judgments, using RMS error for TREC 8, 9 and 10. . . . . . . . . . . 44

    3.8 From left, evaluation over Terabyte queries versus statMAP evaluation, evaluation over Terabyte queries versus EMAP evaluation, and statMAP evaluation versus EMAP evaluation. . . . . . . . . . . 49

    3.9 Stability levels of the MAP scores and the ranking of systems for statAP and MTC as a function of the number of topics. . . . . . . . . . . 51



    3.10 MTC and statAP weighted-MAP correlation . . . . . . . . . . . . . . . . . . . 54

    3.11 Stability level of MAPs and induced ranking for statAP and MTC as a function of the number of queries. . . . . . . . . . . 55

    3.12 Stability levels of MAP scores and induced ranking for statAP and MTC as a function of the number of queries for different levels of relevance incompleteness. . . . . . . . . . . 56

    3.13 St