Is Relevance Associated with Successful Use of Information Retrieval Systems?

William Hersh
Professor and Head
Division of Medical Informatics & Outcomes Research
Oregon Health & Science University
hersh@ohsu.edu
Goal of talk
- Answer the question of whether relevance-based evaluation measures are associated with successful use of information retrieval (IR) systems
- By describing two sets of experiments in different subject domains
- Since the focus of the talk is on one question assessed in different studies, I will necessarily provide only partial details of the studies
For more information on these studies…
- Hersh W et al., Challenging conventional assumptions of information retrieval with real users: Boolean searching and batch retrieval evaluations, Information Processing & Management, 2001, 37: 383-402.
- Hersh W et al., Further analysis of whether batch and user evaluations give the same results with a question-answering task, Proceedings of TREC-9, Gaithersburg, MD, 2000, 407-416.
- Hersh W et al., Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions, Journal of the American Medical Informatics Association, 2002, 9: 283-293.
Outline of talk
- Information retrieval system evaluation
- Text REtrieval Conference (TREC)
- Medical IR
- Methods and results of experiments
  - TREC Interactive Track
  - Medical searching
- Implications
Evaluation of IR systems
- Important not only to researchers but also to users, so we can
  - Understand how to build better systems
  - Determine better ways to teach those who use them
  - Cut through the hype of those promoting them
- There are a number of classifications of evaluation, each with a different focus
Lancaster and Warner (Information Retrieval Today, 1993)
- Effectiveness
  - e.g., cost, time, quality
- Cost-effectiveness
  - e.g., per relevant citation, new citation, document
- Cost-benefit
  - e.g., per benefit to user
Hersh and Hickam (JAMA, 1998)
- Was the system used?
- What was it used for?
- Were users satisfied?
- How well was the system used?
- Why did the system not perform well?
- Did the system have an impact?
Most research has focused on relevance-based measures
- Measure quantities of relevant documents retrieved
- Most common measures of IR evaluation in published research
- Assumptions commonly applied in experimental settings:
  - Documents are relevant or not relevant to the user's information need
  - Relevance is fixed across individuals and time
Recall and precision defined

Recall = (# retrieved and relevant documents) / (# relevant documents in collection)

Precision = (# retrieved and relevant documents) / (# retrieved documents)
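The two definitions above can be sketched as set operations; the document IDs here are hypothetical, just to illustrate the arithmetic:

```python
# Recall and precision for a single search, computed over sets of document IDs.

def recall(retrieved: set, relevant: set) -> float:
    """# retrieved-and-relevant / # relevant documents in collection."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved: set, relevant: set) -> float:
    """# retrieved-and-relevant / # retrieved documents."""
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"d1", "d2", "d3", "d4"}          # 4 documents retrieved
relevant = {"d2", "d4", "d7", "d8", "d9"}     # 5 relevant documents in collection

print(recall(retrieved, relevant))    # 2 of 5 relevant found -> 0.4
print(precision(retrieved, relevant)) # 2 of 4 retrieved are relevant -> 0.5
```

Note how the two measures can move independently: retrieving more documents can raise recall while lowering precision, which is one reason point measures on differently sized retrieval sets are problematic.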
Some issues with relevance-based measures
- Some IR systems return retrieval sets of vastly different sizes, which can be problematic for "point" measures
- Sometimes it is unclear what a "retrieved document" is
  - Surrogate vs. actual document
- Users often perform multiple searches on a topic, with changing needs over time
- There are differing definitions of what is a "relevant document"
What is a relevant document?
- Relevance is intuitive yet hard to define (Saracevic, various)
- Relevance is not necessarily fixed
  - Changes across people and time
- Two broad views
  - Topical – document is on the topic
  - Situational – document is useful to the user in a specific situation (aka psychological relevance; Harter, JASIS, 1992)
Other limitations of recall and precision
- Magnitude of a "clinically significant" difference is unknown
- Serendipity – sometimes we learn from information not relevant to the need at hand
- External validity of results – many experiments test in "batch" mode without real users; it is not clear that results translate to real searchers
Alternatives to recall and precision
- "Task-oriented" approaches that measure how well the user performs an information task with the system
- "Outcomes" approaches that determine whether the system leads to a better outcome or a surrogate for outcome
- Qualitative approaches to assessing the user's cognitive state as they interact with the system
Text REtrieval Conference (TREC)
- Organized by the National Institute of Standards and Technology (NIST)
- Annual cycle consisting of
  - Distribution of test collections and queries to participants
  - Determination of relevance judgments and results
  - Annual conference for participants at NIST (each fall)
- TREC-1 began in 1992 and TREC has continued annually
- Web site: trec.nist.gov
TREC goals
- Assess many different approaches to IR with a common large test collection, set of real-world queries, and relevance judgments
- Provide a forum for academic and industrial researchers to share results and experiences
Organization of TREC
- Began with two major tasks
  - Ad hoc retrieval – standard searching
    - Discontinued with TREC 2001
  - Routing – identify new documents with queries developed for known relevant ones
    - In some ways, a variant of relevance feedback
    - Discontinued with TREC-7
- Has evolved to a number of tracks
  - Interactive, natural language processing, spoken documents, cross-language, filtering, Web, etc.
What has been learned in TREC?
- Approaches that improve performance
  - e.g., passage retrieval, query expansion, 2-Poisson weighting
- Approaches that may not improve performance
  - e.g., natural language processing, stop words, stemming
Do these kinds of experiments really matter?
- Criticisms of batch-mode evaluation from Swanson, Meadow, Saracevic, Hersh, Blair, etc.
- Results that question their findings from the Interactive Track, e.g., Hersh, Belkin, Wu & Wilkinson, etc.
The TREC Interactive Track
- Developed out of interest in how real users might search using TREC queries, documents, etc.
- TREC 6-8 (1997-1999) used an instance recall task
- TREC 9 (2000) and subsequent years used a question-answering task
- Now being folded into the Web track
TREC-8 Interactive Track
- Task for searcher: retrieve instances of a topic in a query
- Performance measured by instance recall
  - Proportion of all instances retrieved by the user
  - Differs from document recall in that multiple documents on the same topic count as one instance
- Used
  - Financial Times collection (1991-1994)
  - Queries derived from the ad hoc collection
  - Six 20-minute topics for each user
  - Balanced design: "experimental" vs. "control"
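The instance recall measure described above can be sketched in a few lines. The instance-to-documents mapping and the saved documents here are hypothetical, just to show how multiple documents on the same instance collapse to one:

```python
# Instance recall: fraction of all known instances covered by the documents
# the searcher saved. A document covers an instance if it describes it.

def instance_recall(saved_docs: set, instance_to_docs: dict) -> float:
    covered = {inst for inst, docs in instance_to_docs.items()
               if docs & saved_docs}  # at least one saved doc per instance
    return len(covered) / len(instance_to_docs)

# Four known instances of a topic, each supported by one or more documents
instance_to_docs = {
    "inst1": {"d1", "d2"},
    "inst2": {"d3"},
    "inst3": {"d4", "d5"},
    "inst4": {"d6"},
}
saved = {"d1", "d2", "d3"}  # two documents on inst1 still count as one instance
print(instance_recall(saved, instance_to_docs))  # 2 of 4 instances -> 0.5
```

Document recall over the same data would be 3/6, illustrating how the two measures diverge when instances are covered by multiple documents.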
TREC-8 sample topic
- Title: Hubble Telescope Achievements
- Description: Identify positive accomplishments of the Hubble telescope since it was launched in 1991
- Instances: In the time allotted, please find as many DIFFERENT positive accomplishments of the sort described above as you can
TREC-9 Interactive Track
- Same general experimental design, with
  - A new task: question-answering
  - A new collection: newswire from TREC disks 1-5
  - New topics: eight questions
Issues in medical IR
- Searching priorities vary by setting
  - In a busy clinical environment, users usually want a quick, short answer
  - Outside the clinical environment, users may be willing to explore in more detail
  - As in other scientific fields, researchers are likely to want more exhaustive information
- The clinical searching task has many similarities to the Interactive Track design, so methods are comparable
Some results of medical IR evaluations (Hersh, 2003)
- In large bibliographic databases (e.g., MEDLINE), recall and precision are comparable to those seen in other domains (e.g., 50%-50%, minimal overlap across searchers)
- Bibliographic databases are not amenable to the busy clinical setting, i.e., not used often, information retrieved not preferred
- Biggest challenges are now in the digital library realm, i.e., interoperability of disparate resources
Methods and results

Research question: Is relevance associated with successful use of information retrieval systems?
TREC Interactive Track and our research question
- Do the results of batch IR studies correspond to those obtained with real users? i.e., do term weighting approaches which work better in batch studies do better for real users?
- Methodology
  - Identify a prior test collection that measures a large batch performance differential over some baseline
  - Use the Interactive Track to see if this difference is maintained with interactive searching and a new collection
  - Verify that the previous batch difference is maintained with the new collection
TREC-8 experiments
- Determine the best-performing measure
  - Use instance recall data from previous years as a batch test collection, with relevance defined as documents containing >1 instance
- Perform user experiments
  - TREC-8 Interactive Track protocol
- Verify that the optimal measure holds
  - Use TREC-8 instance recall data as a batch test collection, similar to the first experiment
IR system used for our TREC-8 (and 9) experiments: MG
- Public domain IR research system
- Described in Witten et al., Managing Gigabytes, 1999
- Experimental version implements all "modern" weighting schemes (e.g., TFIDF, Okapi, pivoted normalization) via Q-expressions, c.f. Zobel and Moffat, SIGIR Forum, 1998
- Simple Web-based front end
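To make the TFIDF vs. Okapi contrast concrete, here is a minimal sketch of the two kinds of term weight being compared. These are textbook formulations with conventional parameter values (k1 = 1.2, b = 0.75), not MG's exact Q-expressions:

```python
import math

# Illustrative term-weighting formulas: a simple TFIDF weight and the
# Okapi BM25 within-document weight, for one term in one document.

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    # raw term frequency times inverse document frequency
    return tf * math.log(n_docs / df)

def okapi_weight(tf: int, df: int, n_docs: int,
                 doc_len: int, avg_doc_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)  # length normalization
    return idf * tf * (k1 + 1) / (tf + norm)

# TFIDF grows linearly in tf; Okapi saturates, so the 10th occurrence of a
# term adds far less than the 1st, and long documents are penalized.
print(okapi_weight(1, 100, 10000, 300, 300))
print(okapi_weight(10, 100, 10000, 300, 300))
```

The saturation and document-length normalization are the main behavioral differences that made Okapi-style weighting outperform plain TFIDF in batch evaluations.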
Experiment 1 – Determine best "batch" performance

MG Q-expression     Common name            Average precision  % improvement
BB-ACB-BAA          TFIDF                  0.2129             -
BD-ACI-BCA (0.5)    Pivoted normalization  0.2853             34%
BB-ACM-BCB (0.275)  Pivoted normalization  0.2821             33%
AB-BFC-BAA          Okapi                  0.3612             70%
AB-BFD-BAA          Okapi                  0.3850             81%
Okapi term weighting performs much better than TFIDF.
Experiment 2 – Did the benefit occur with an interactive task?
- Methods
  - Two user populations: professional librarians and graduate students
  - Using a simple natural language interface: MG system with Web front end
  - With two different term weighting schemes: TFIDF (baseline) vs. Okapi
- Results showed a benefit for the better batch system (Okapi)
Weighting approach  Instance recall
TFIDF               0.33
Okapi               0.39

+18%, BUT...
All differences were due to one query
[Figure: instance recall by topic (408i, 414i, 428i, 431i, 438i, 446i) for Okapi vs. TFIDF, annotated with the Okapi batch benefit per topic: +6.8%, +38.7%, +318.5%, +21.3%, -25.8%, -56.6%]
Experiment 3 – Did batch results hold with TREC-8 data?

Query    Instances  Relevant documents  TFIDF   Okapi   % improvement
408i     24         71                  0.5873  0.6272  6.8%
414i     12         16                  0.2053  0.2848  38.7%
428i     26         40                  0.0546  0.2285  318.5%
431i     40         161                 0.4689  0.5688  21.3%
438i     56         206                 0.2862  0.2124  -25.8%
446i     16         58                  0.0495  0.0215  -56.6%
Average  29         92                  0.2753  0.3239  17.6%

Yes, but still with high variance and without statistical significance.
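The "high variance, no significance" claim can be checked directly from the six per-topic scores in the table. A sketch of a paired t-test, computed by hand to stay dependency-free (the talk does not specify which test was actually used):

```python
import math

# Paired comparison of the six per-topic batch scores (TFIDF vs. Okapi).

tfidf = [0.5873, 0.2053, 0.0546, 0.4689, 0.2862, 0.0495]
okapi = [0.6272, 0.2848, 0.2285, 0.5688, 0.2124, 0.0215]

diffs = [o - t for o, t in zip(okapi, tfidf)]
n = len(diffs)
mean = sum(diffs) / n                                   # mean improvement
sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
t_stat = mean / (sd / math.sqrt(n))

# t is about 1.33 on 5 degrees of freedom: well short of significance,
# despite a positive mean difference, because topic variance is large.
print(f"mean diff = {mean:.4f}, t = {t_stat:.2f}")
```

With only six topics and one topic (428i) supplying most of the gain, even a 17.6% average improvement cannot be distinguished from noise.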
TREC-9 Interactive Track experiments
- Similar to the approach used in TREC-8
- Determine the best-performing weighting measure
  - Use all previous TREC data, since there was no baseline
- Perform user experiments
  - Follow the protocol of the track
  - Use MG
- Verify that the optimal measure holds
  - Use TREC-9 relevance data as a batch test collection, analogous to the first experiment
Determine best "batch" performance

Query set    Collection          Cosine  Okapi (% improvement)  Okapi+PN (% improvement)
303i-446i    FT91-94             0.2281  0.3753 (+65)           0.3268 (+43)
051-200      Disks 1&2           0.1139  0.1063 (-7)            0.1682 (+48)
202-250      Disks 2&3           0.1033  0.1153 (+12)           0.1498 (+45)
351-450      Disks 4&5 minus CR  0.1293  0.1771 (+37)           0.1825 (+41)
001qa-200qa  Disks 4&5 minus CR  0.0360  0.0657 (+83)           0.0760 (+111)
Average improvement                      (+38)                  (+58)
Okapi+PN term weighting performs better than TFIDF.
Interactive experiments – comparing systems

          TFIDF                          Okapi+PN
Question  Searches  #Correct  %Correct   Searches  #Correct  %Correct
1         13        3         23.1%      12        1         8.3%
2         11        0         0.0%       14        5         35.7%
3         13        0         0.0%       12        0         0.0%
4         12        7         58.3%      13        8         61.5%
5         12        9         75.0%      13        11        84.6%
6         15        13        86.7%      10        6         60.0%
7         13        11        84.6%      12        10        83.3%
8         11        0         0.0%       14        0         0.0%
Total     100       43        43.0%      100       41        41.0%

Little difference across systems, but note wide differences across questions.
Do batch results hold with new data?

Question  TFIDF   Okapi+PN  % improvement
1         0.1352  0.0635    -53.0%
2         0.0508  0.0605    19.1%
3         0.1557  0.3000    92.7%
4         0.1515  0.1778    17.4%
5         0.5167  0.6823    32.0%
6         0.7576  1.0000    32.0%
7         0.3860  0.5425    40.5%
8         0.0034  0.0088    158.8%
Mean      0.2696  0.3544    31.5%

Batch results show improved performance, whereas user results do not.
Further analysis (Turpin, SIGIR 2001)
- Okapi searches definitely retrieve more relevant documents
  - Okapi+PN user searches have 62% better MAP
  - Okapi+PN user searches have 101% better precision@5 documents
- But
  - Users do 26% more cycles with TFIDF
  - Users get overall the same results per the experiments
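For reference, the two batch measures cited above (MAP and precision@5) can be computed from a ranked list of relevance judgments. The judgments below are hypothetical; for a single query, average precision is the per-query component of MAP:

```python
# Precision at k and non-interpolated average precision for one ranked list,
# where rels[i] is 1 if the document at rank i+1 is relevant, else 0.

def precision_at_k(rels: list, k: int) -> float:
    return sum(rels[:k]) / k

def average_precision(rels: list, n_relevant: int) -> float:
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant document's rank
    return total / n_relevant     # MAP averages this value over all queries

ranked = [1, 0, 1, 1, 0, 0, 1]   # judgments for the top 7 retrieved documents
print(precision_at_k(ranked, 5))        # 3 relevant in top 5 -> 0.6
print(average_precision(ranked, 4))     # (1/1 + 2/3 + 3/4 + 4/7) / 4
```

Both measures reward relevant documents near the top of the ranking, which is why a system can improve them substantially while interactive users, who reformulate and browse, see no net benefit.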
Possible explanations for our TREC Interactive Track results
- Batch searching results may not generalize
  - User data show a wide variety of differences (e.g., search terms, documents viewed) which may overwhelm system measures
- Or we cannot detect that they do
  - Increase task, query, or system diversity
  - Increase statistical power
Medical IR study design
- Orientation to experiment and system
- Brief training in searching and evidence-based medicine (EBM)
- Collect data on factors of users
- Subjects given questions and asked to search to find and justify an answer
- Statistical analysis to find associations among user factors and successful searching
Experimental design
- Recruited
  - 45 senior medical students
  - 21 second (last) year NP students
- Large-group session
  - Demographic/experience questionnaire
  - Orientation to experiment, OvidWeb
  - Overview of basic MEDLINE and EBM skills
Experimental design (cont.)
- Searching sessions
  - Two hands-on sessions in the library
  - For each of three questions, randomly selected from 20, measured:
    - Pre-search answer with certainty
    - Searching and answering with justification and certainty
    - Logging of system-user interactions
  - User interface questionnaire (QUIS)
Searching questions
- Derived from two sources
  - Medical Knowledge Self-Assessment Program (Internal Medicine board review)
  - Clinical questions collection of Paul Gorman
- Worded to have an answer of either
  - Yes with good evidence
  - Indeterminate evidence
  - No with good evidence
- Answers graded by expert clinicians
Assessment of recall and precision
- Aimed to perform a "typical" recall and precision study and determine if they were associated with successful searching
- Designated "end queries" to have a terminal set for analysis
- Half of all retrieved MEDLINE records judged by three physicians each as definitely relevant, possibly relevant, or not relevant
- Also measured reliability of raters
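One common way to measure rater reliability on a categorical scale like this is a chance-corrected agreement statistic such as Cohen's kappa. A sketch for two raters, on hypothetical judgment data (the talk does not specify which statistic the study used):

```python
from collections import Counter

# Cohen's kappa: observed agreement corrected for agreement expected by
# chance, given each rater's marginal distribution over the categories.

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters' judgments on eight records, using the study's three categories
r1 = ["rel", "rel", "not", "poss", "not", "rel", "not", "not"]
r2 = ["rel", "poss", "not", "poss", "not", "rel", "rel", "not"]
print(round(cohens_kappa(r1, r2), 3))  # 6/8 agreement, chance-corrected -> 0.61
```

With three raters, a generalization such as Fleiss' kappa would be used instead; the two-rater version above shows the chance-correction idea.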
Overall results
- Prior to searching, the rate of correctness (32.1%) was about equal to chance for both groups
  - Rating of certainty was low for both groups
- With searching, medical students increased their rate of correctness to 51.6%, but NP students remained virtually unchanged at 34.7%
Overall results

                      Post-search incorrect          Post-search correct
Pre-search incorrect  All: 133 (41%)                 All: 87 (27%)
                      Med: 81 (36%)  NP: 52 (52%)    Med: 70 (31%)  NP: 17 (17%)
Pre-search correct    All: 41 (13%)                  All: 63 (19%)
                      Med: 27 (12%)  NP: 14 (14%)    Med: 45 (20%)  NP: 18 (18%)

Medical students were better able to convert incorrect into correct answers, whereas NP students were hurt as often as helped by searching.
Recall and precision

Recall and precision were not associated with successful answering of questions and were nearly identical for medical and NP students.

Variable   All  Medical  NP
Recall     18%  18%      20%
Precision  29%  30%      26%

Variable   Incorrect  Correct  p value
Recall     18%        18%      .61
Precision  28%        29%      .99
Conclusions from results
- Medical students improved their ability to answer questions with searching; NP students did not
  - Spatial visualization ability may explain this difference
- Answering questions required >30 minutes whether correct or incorrect
  - This content is not amenable to the clinical setting
- Recall and precision had no relation to successful searching
Limitations of studies
- Domains
  - Many more besides newswire and medicine
- Numbers of users and questions
  - Small and not necessarily representative
- Experimental setting
  - Real-world users may behave differently

But I believe we can conclude
- Although batch evaluations are useful early in system development, their results cannot be assumed to apply to real users
- Recall and precision are important components of searching but not the most important determiners of success
- Further research should investigate what makes documents relevant to users and helps them solve their information problems