Is Relevance Associated with Successful Use of Information Retrieval Systems?

William Hersh
Professor and Head
Division of Medical Informatics & Outcomes Research
Oregon Health & Science University
hersh@ohsu.edu
Goal of talk
- Answer the question of whether relevance-based evaluation measures are associated with successful use of information retrieval (IR) systems
- By describing two sets of experiments in different subject domains
- Since the focus of the talk is on one question assessed in different studies, I will necessarily provide only partial details of the studies
For more information on these studies…
- Hersh W et al., Challenging conventional assumptions of information retrieval with real users: Boolean searching and batch retrieval evaluations, Information Processing & Management, 2001, 37: 383-402.
- Hersh W et al., Further analysis of whether batch and user evaluations give the same results with a question-answering task, Proceedings of TREC-9, Gaithersburg, MD, 2000, 407-416.
- Hersh W et al., Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions, Journal of the American Medical Informatics Association, 2002, 9: 283-293.
Outline of talk
- Information retrieval system evaluation
- Text REtrieval Conference (TREC)
- Medical IR
- Methods and results of experiments
  - TREC Interactive Track
  - Medical searching
- Implications
Evaluation of IR systems
- Important not only to researchers but also to users, so we can
  - Understand how to build better systems
  - Determine better ways to teach those who use them
  - Cut through the hype of those promoting them
- There are a number of classifications of evaluation, each with a different focus
Lancaster and Warner (Information Retrieval Today, 1993)
- Effectiveness
  - e.g., cost, time, quality
- Cost-effectiveness
  - e.g., per relevant citation, new citation, document
- Cost-benefit
  - e.g., per benefit to user
Hersh and Hickam (JAMA, 1998)
- Was the system used?
- What was it used for?
- Were users satisfied?
- How well was the system used?
- Why did the system not perform well?
- Did the system have an impact?
Most research has focused on relevance-based measures
- Measure quantities of relevant documents retrieved
- Most common measures of IR evaluation in published research
- Assumptions commonly applied in experimental settings:
  - Documents are relevant or not relevant to the user's information need
  - Relevance is fixed across individuals and time
Recall and precision defined

Recall = (# retrieved and relevant documents) / (# relevant documents in collection)

Precision = (# retrieved and relevant documents) / (# retrieved documents)
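The two definitions above can be sketched as set operations; the document IDs here are hypothetical, just to illustrate the arithmetic:

```python
# Recall and precision for a single search, computed over sets of document IDs.

def recall(retrieved: set, relevant: set) -> float:
    """# retrieved-and-relevant / # relevant documents in collection."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved: set, relevant: set) -> float:
    """# retrieved-and-relevant / # retrieved documents."""
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"d1", "d2", "d3", "d4"}          # 4 documents retrieved
relevant = {"d2", "d4", "d7", "d8", "d9"}     # 5 relevant documents in collection

print(recall(retrieved, relevant))    # 2 of 5 relevant found -> 0.4
print(precision(retrieved, relevant)) # 2 of 4 retrieved are relevant -> 0.5
```

Note how the two measures can move independently: retrieving more documents can raise recall while lowering precision, which is one reason point measures on differently sized retrieval sets are problematic.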
Some issues with relevance-based measures
- Some IR systems return retrieval sets of vastly different sizes, which can be problematic for "point" measures
- Sometimes it is unclear what a "retrieved document" is
  - Surrogate vs. actual document
- Users often perform multiple searches on a topic, with changing needs over time
- There are differing definitions of what is a "relevant document"
What is a relevant document?
- Relevance is intuitive yet hard to define (Saracevic, various)
- Relevance is not necessarily fixed
  - Changes across people and time
- Two broad views
  - Topical – document is on the topic
  - Situational – document is useful to the user in a specific situation (aka psychological relevance; Harter, JASIS, 1992)
Other limitations of recall and precision
- Magnitude of a "clinically significant" difference is unknown
- Serendipity – sometimes we learn from information not relevant to the need at hand
- External validity of results – many experiments test in "batch" mode without real users; it is not clear that results translate to real searchers
Alternatives to recall and precision
- "Task-oriented" approaches that measure how well the user performs an information task with the system
- "Outcomes" approaches that determine whether the system leads to a better outcome or a surrogate for outcome
- Qualitative approaches to assessing the user's cognitive state as they interact with the system
Text REtrieval Conference (TREC)
- Organized by the National Institute of Standards and Technology (NIST)
- Annual cycle consisting of
  - Distribution of test collections and queries to participants
  - Determination of relevance judgments and results
  - Annual conference for participants at NIST (each fall)
- TREC-1 began in 1992 and TREC has continued annually
- Web site: trec.nist.gov
TREC goals
- Assess many different approaches to IR with a common large test collection, set of real-world queries, and relevance judgments
- Provide a forum for academic and industrial researchers to share results and experiences
Organization of TREC
- Began with two major tasks
  - Ad hoc retrieval – standard searching
    - Discontinued with TREC 2001
  - Routing – identify new documents with queries developed for known relevant ones
    - In some ways, a variant of relevance feedback
    - Discontinued with TREC-7
- Has evolved to a number of tracks
  - Interactive, natural language processing, spoken documents, cross-language, filtering, Web, etc.
What has been learned in TREC?
- Approaches that improve performance
  - e.g., passage retrieval, query expansion, 2-Poisson weighting
- Approaches that may not improve performance
  - e.g., natural language processing, stop words, stemming
Do these kinds of experiments really matter?
- Criticisms of batch-mode evaluation from Swanson, Meadow, Saracevic, Hersh, Blair, etc.
- Results that question their findings from the Interactive Track, e.g., Hersh, Belkin, Wu & Wilkinson, etc.
The TREC Interactive Track
- Developed out of interest in how real users might search using TREC queries, documents, etc.
- TREC 6-8 (1997-1999) used an instance recall task
- TREC 9 (2000) and subsequent years used a question-answering task
- Now being folded into the Web track
TREC-8 Interactive Track
- Task for searcher: retrieve instances of a topic in a query
- Performance measured by instance recall
  - Proportion of all instances retrieved by the user
  - Differs from document recall in that multiple documents on the same topic count as one instance
- Used
  - Financial Times collection (1991-1994)
  - Queries derived from the ad hoc collection
  - Six 20-minute topics for each user
  - Balanced design: "experimental" vs. "control"
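The instance recall measure described above can be sketched in a few lines. The instance-to-documents mapping and the saved documents here are hypothetical, just to show how multiple documents on the same instance collapse to one:

```python
# Instance recall: fraction of all known instances covered by the documents
# the searcher saved. A document covers an instance if it describes it.

def instance_recall(saved_docs: set, instance_to_docs: dict) -> float:
    covered = {inst for inst, docs in instance_to_docs.items()
               if docs & saved_docs}  # at least one saved doc per instance
    return len(covered) / len(instance_to_docs)

# Four known instances of a topic, each supported by one or more documents
instance_to_docs = {
    "inst1": {"d1", "d2"},
    "inst2": {"d3"},
    "inst3": {"d4", "d5"},
    "inst4": {"d6"},
}
saved = {"d1", "d2", "d3"}  # two documents on inst1 still count as one instance
print(instance_recall(saved, instance_to_docs))  # 2 of 4 instances -> 0.5
```

Document recall over the same data would be 3/6, illustrating how the two measures diverge when instances are covered by multiple documents.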
TREC-8 sample topic
- Title: Hubble Telescope Achievements
- Description: Identify positive accomplishments of the Hubble telescope since it was launched in 1991
- Instances: In the time allotted, please find as many DIFFERENT positive accomplishments of the sort described above as you can
TREC-9 Interactive Track
- Same general experimental design, with
  - A new task: question-answering
  - A new collection: newswire from TREC disks 1-5
  - New topics: eight questions
Issues in medical IR
- Searching priorities vary by setting
  - In a busy clinical environment, users usually want a quick, short answer
  - Outside the clinical environment, users may be willing to explore in more detail
  - As in other scientific fields, researchers are likely to want more exhaustive information
- The clinical searching task has many similarities to the Interactive Track design, so methods are comparable
Some results of medical IR evaluations (Hersh, 2003)
- In large bibliographic databases (e.g., MEDLINE), recall and precision are comparable to those seen in other domains (e.g., 50%-50%, minimal overlap across searchers)
- Bibliographic databases are not amenable to the busy clinical setting, i.e., not used often, information retrieved not preferred
- Biggest challenges are now in the digital library realm, i.e., interoperability of disparate resources
Methods and results

Research question: Is relevance associated with successful use of information retrieval systems?
TREC Interactive Track and our research question
- Do the results of batch IR studies correspond to those obtained with real users? i.e., do term weighting approaches which work better in batch studies do better for real users?
- Methodology
  - Identify a prior test collection that measures a large batch performance differential over some baseline
  - Use the Interactive Track to see if this difference is maintained with interactive searching and a new collection
  - Verify that the previous batch difference is maintained with the new collection
TREC-8 experiments
- Determine the best-performing measure
  - Use instance recall data from previous years as a batch test collection, with relevance defined as documents containing >1 instance
- Perform user experiments
  - TREC-8 Interactive Track protocol
- Verify that the optimal measure holds
  - Use TREC-8 instance recall data as a batch test collection, similar to the first experiment
IR system used for our TREC-8 (and 9) experiments: MG
- Public domain IR research system
- Described in Witten et al., Managing Gigabytes, 1999
- Experimental version implements all "modern" weighting schemes (e.g., TFIDF, Okapi, pivoted normalization) via Q-expressions, c.f. Zobel and Moffat, SIGIR Forum, 1998
- Simple Web-based front end
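To make the TFIDF vs. Okapi contrast concrete, here is a minimal sketch of the two kinds of term weight being compared. These are textbook formulations with conventional parameter values (k1 = 1.2, b = 0.75), not MG's exact Q-expressions:

```python
import math

# Illustrative term-weighting formulas: a simple TFIDF weight and the
# Okapi BM25 within-document weight, for one term in one document.

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    # raw term frequency times inverse document frequency
    return tf * math.log(n_docs / df)

def okapi_weight(tf: int, df: int, n_docs: int,
                 doc_len: int, avg_doc_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)  # length normalization
    return idf * tf * (k1 + 1) / (tf + norm)

# TFIDF grows linearly in tf; Okapi saturates, so the 10th occurrence of a
# term adds far less than the 1st, and long documents are penalized.
print(okapi_weight(1, 100, 10000, 300, 300))
print(okapi_weight(10, 100, 10000, 300, 300))
```

The saturation and document-length normalization are the main behavioral differences that made Okapi-style weighting outperform plain TFIDF in batch evaluations.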
Experiment 1 – Determine best "batch" performance

MG Q-expression     Common name            Average precision  % improvement
BB-ACB-BAA          TFIDF                  0.2129             -
BD-ACI-BCA (0.5)    Pivoted normalization  0.2853             34%
BB-ACM-BCB (0.275)  Pivoted normalization  0.2821             33%
AB-BFC-BAA          Okapi                  0.3612             70%
AB-BFD-BAA          Okapi                  0.3850             81%
Okapi term weighting performs much better than TFIDF.
Experiment 2 – Did the benefit occur with an interactive task?
- Methods
  - Two user populations: professional librarians and graduate students
  - Using a simple natural language interface: MG system with Web front end
  - With two different term weighting schemes: TFIDF (baseline) vs. Okapi
- Results showed a benefit for the better batch system (Okapi)
Weighting approach  Instance recall
TFIDF               0.33
Okapi               0.39

+18%, BUT...
All differences were due to one query
[Figure: instance recall by topic (408i, 414i, 428i, 431i, 438i, 446i) for Okapi vs. TFIDF, annotated with the Okapi batch benefit per topic: +6.8%, +38.7%, +318.5%, +21.3%, -25.8%, -56.6%]
Experiment 3 – Did batch results hold with TREC-8 data?

Query    Instances  Relevant documents  TFIDF   Okapi   % improvement
408i     24         71                  0.5873  0.6272  6.8%
414i     12         16                  0.2053  0.2848  38.7%
428i     26         40                  0.0546  0.2285  318.5%
431i     40         161                 0.4689  0.5688  21.3%
438i     56         206                 0.2862  0.2124  -25.8%
446i     16         58                  0.0495  0.0215  -56.6%
Average  29         92                  0.2753  0.3239  17.6%

Yes, but still with high variance and without statistical significance.
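The "high variance, no significance" claim can be checked directly from the six per-topic scores in the table. A sketch of a paired t-test, computed by hand to stay dependency-free (the talk does not specify which test was actually used):

```python
import math

# Paired comparison of the six per-topic batch scores (TFIDF vs. Okapi).

tfidf = [0.5873, 0.2053, 0.0546, 0.4689, 0.2862, 0.0495]
okapi = [0.6272, 0.2848, 0.2285, 0.5688, 0.2124, 0.0215]

diffs = [o - t for o, t in zip(okapi, tfidf)]
n = len(diffs)
mean = sum(diffs) / n                                   # mean improvement
sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
t_stat = mean / (sd / math.sqrt(n))

# t is about 1.33 on 5 degrees of freedom: well short of significance,
# despite a positive mean difference, because topic variance is large.
print(f"mean diff = {mean:.4f}, t = {t_stat:.2f}")
```

With only six topics and one topic (428i) supplying most of the gain, even a 17.6% average improvement cannot be distinguished from noise.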
TREC-9 Interactive Track experiments
- Similar to the approach used in TREC-8
- Determine the best-performing weighting measure
  - Use all previous TREC data, since there was no baseline
- Perform user experiments
  - Follow the protocol of the track
  - Use MG
- Verify that the optimal measure holds
  - Use TREC-9 relevance data as a batch test collection, analogous to the first experiment
Determine best "batch" performance

Query set    Collection          Cosine  Okapi (% improvement)  Okapi+PN (% improvement)
303i-446i    FT91-94             0.2281  0.3753 (+65)           0.3268 (+43)
051-200      Disks 1&2           0.1139  0.1063 (-7)            0.1682 (+48)
202-250      Disks 2&3           0.1033  0.1153 (+12)           0.1498 (+45)
351-450      Disks 4&5 minus CR  0.1293  0.1771 (+37)           0.1825 (+41)
001qa-200qa  Disks 4&5 minus CR  0.0360  0.0657 (+83)           0.0760 (+111)
Average improvement                      (+38)                  (+58)
Okapi+PN term weighting performs better than TFIDF.
Interactive experiments – comparing systems

          TFIDF                          Okapi+PN
Question  Searches  #Correct  %Correct   Searches  #Correct  %Correct
1         13        3         23.1%      12        1         8.3%
2         11        0         0.0%       14        5         35.7%
3         13        0         0.0%       12        0         0.0%
4         12        7         58.3%      13        8         61.5%
5         12        9         75.0%      13        11        84.6%
6         15        13        86.7%      10        6         60.0%
7         13        11        84.6%      12        10        83.3%
8         11        0         0.0%       14        0         0.0%
Total     100       43        43.0%      100       41        41.0%

Little difference across systems, but note wide differences across questions.
Do batch results hold with new data?

Question  TFIDF   Okapi+PN  % improvement
1         0.1352  0.0635    -53.0%
2         0.0508  0.0605    19.1%
3         0.1557  0.3000    92.7%
4         0.1515  0.1778    17.4%
5         0.5167  0.6823    32.0%
6         0.7576  1.0000    32.0%
7         0.3860  0.5425    40.5%
8         0.0034  0.0088    158.8%
Mean      0.2696  0.3544    31.5%

Batch results show improved performance, whereas user results do not.
Further analysis (Turpin, SIGIR 2001)
- Okapi searches definitely retrieve more relevant documents
  - Okapi+PN user searches have 62% better MAP
  - Okapi+PN user searches have 101% better precision@5 documents
- But
  - Users do 26% more cycles with TFIDF
  - Users get overall the same results per the experiments
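For reference, the two batch measures cited above (MAP and precision@5) can be computed from a ranked list of relevance judgments. The judgments below are hypothetical; for a single query, average precision is the per-query component of MAP:

```python
# Precision at k and non-interpolated average precision for one ranked list,
# where rels[i] is 1 if the document at rank i+1 is relevant, else 0.

def precision_at_k(rels: list, k: int) -> float:
    return sum(rels[:k]) / k

def average_precision(rels: list, n_relevant: int) -> float:
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant document's rank
    return total / n_relevant     # MAP averages this value over all queries

ranked = [1, 0, 1, 1, 0, 0, 1]   # judgments for the top 7 retrieved documents
print(precision_at_k(ranked, 5))        # 3 relevant in top 5 -> 0.6
print(average_precision(ranked, 4))     # (1/1 + 2/3 + 3/4 + 4/7) / 4
```

Both measures reward relevant documents near the top of the ranking, which is why a system can improve them substantially while interactive users, who reformulate and browse, see no net benefit.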
Possible explanations for our TREC Interactive Track results
- Batch searching results may not generalize
  - User data show a wide variety of differences (e.g., search terms, documents viewed) which may overwhelm system measures
- Or we cannot detect that they do
  - Increase task, query, or system diversity
  - Increase statistical power
Medical IR study design
- Orientation to experiment and system
- Brief training in searching and evidence-based medicine (EBM)
- Collect data on factors of users
- Subjects given questions and asked to search to find and justify an answer
- Statistical analysis to find associations among user factors and successful searching
Experimental design
- Recruited
  - 45 senior medical students
  - 21 second (last) year NP students
- Large-group session
  - Demographic/experience questionnaire
  - Orientation to experiment, OvidWeb
  - Overview of basic MEDLINE and EBM skills
Experimental design (cont.)
- Searching sessions
  - Two hands-on sessions in the library
  - For each of three questions, randomly selected from 20, measured:
    - Pre-search answer with certainty
    - Searching and answering with justification and certainty
    - Logging of system-user interactions
  - User interface questionnaire (QUIS)
Searching questions
- Derived from two sources
  - Medical Knowledge Self-Assessment Program (Internal Medicine board review)
  - Clinical questions collection of Paul Gorman
- Worded to have an answer of either
  - Yes with good evidence
  - Indeterminate evidence
  - No with good evidence
- Answers graded by expert clinicians
Assessment of recall and precision
- Aimed to perform a "typical" recall and precision study and determine if they were associated with successful searching
- Designated "end queries" to have a terminal set for analysis
- Half of all retrieved MEDLINE records judged by three physicians each as definitely relevant, possibly relevant, or not relevant
- Also measured reliability of raters
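One common way to measure rater reliability on a categorical scale like this is a chance-corrected agreement statistic such as Cohen's kappa. A sketch for two raters, on hypothetical judgment data (the talk does not specify which statistic the study used):

```python
from collections import Counter

# Cohen's kappa: observed agreement corrected for agreement expected by
# chance, given each rater's marginal distribution over the categories.

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters' judgments on eight records, using the study's three categories
r1 = ["rel", "rel", "not", "poss", "not", "rel", "not", "not"]
r2 = ["rel", "poss", "not", "poss", "not", "rel", "rel", "not"]
print(round(cohens_kappa(r1, r2), 3))  # 6/8 agreement, chance-corrected -> 0.61
```

With three raters, a generalization such as Fleiss' kappa would be used instead; the two-rater version above shows the chance-correction idea.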
Overall results
- Prior to searching, the rate of correctness (32.1%) was about equal to chance for both groups
  - Rating of certainty was low for both groups
- With searching, medical students increased their rate of correctness to 51.6%, but NP students remained virtually unchanged at 34.7%
Overall results

                      Post-search incorrect          Post-search correct
Pre-search incorrect  All: 133 (41%)                 All: 87 (27%)
                      Med: 81 (36%)  NP: 52 (52%)    Med: 70 (31%)  NP: 17 (17%)
Pre-search correct    All: 41 (13%)                  All: 63 (19%)
                      Med: 27 (12%)  NP: 14 (14%)    Med: 45 (20%)  NP: 18 (18%)

Medical students were better able to convert incorrect into correct answers, whereas NP students were hurt as often as helped by searching.
Recall and precision

Recall and precision were not associated with successful answering of questions and were nearly identical for medical and NP students.

Variable   All  Medical  NP
Recall     18%  18%      20%
Precision  29%  30%      26%

Variable   Incorrect  Correct  p value
Recall     18%        18%      .61
Precision  28%        29%      .99
Conclusions from results
- Medical students improved their ability to answer questions with searching; NP students did not
  - Spatial visualization ability may explain this difference
- Answering questions required >30 minutes whether correct or incorrect
  - This content is not amenable to the clinical setting
- Recall and precision had no relation to successful searching
Limitations of studies
- Domains
  - Many more besides newswire and medicine
- Numbers of users and questions
  - Small and not necessarily representative
- Experimental setting
  - Real-world users may behave differently

But I believe we can conclude
- Although batch evaluations are useful early in system development, their results cannot be assumed to apply to real users
- Recall and precision are important components of searching but not the most important determiners of success
- Further research should investigate what makes documents relevant to users and helps them solve their information problems