Special Topics in Computer Science

Advanced Topics in Information Retrieval

Chapter 3: Goals: Retrieval Evaluation

Alexander Gelbukh

www.Gelbukh.com

Previous chapter

• Models are needed for formal operations
• Boolean model is the simplest
• Vector model is the best combination of quality and simplicity
  o TF-IDF term weighting
  o This (or similar) weighting is used in all further models
• Many interesting and not well-investigated variations
  o possible future work

Previous chapter: Research issues

• How do people judge relevance?
  o ranking strategies
• How to combine different sources of evidence?
• What interfaces can help users to understand and formulate their information need?
  o user interfaces: an open issue
• Meta-search engines: how to combine results from different Web search engines?
  o These results almost do not intersect
  o How to combine rankings?

To write a paper: Evaluation!

• How do you measure whether a system is good or bad?
• To go in the right direction, you need to know where you want to get to.
• “We can do it this way” vs. “This way it performs better”
  o “I think it is better...”
  o “We do it this way...”
  o “Our method takes into account syntax and semantics...”
  o “I like the results...”
• Criterion of truth. Crucial for any science.
• Enables competition → financial policy → attracts people
  o TREC international competitions

Methodology to write a paper

• Define formally your task and constraints
• Define formally your evaluation criterion (argue if needed)
  o One numerical value is better than several
• Show that your method gives better value than
  o the baseline (the simple obvious way), such as:
    Retrieve all. Retrieve none. Retrieve at random. Use Google.
  o state-of-the-art (the best reported method) in the same setting and same evaluation method!
• and your parameter settings are optimal
  o Consider extreme settings: 0,

... Methodology

• The only valid way of reasoning
• “But we want the clusters to be non-trivial”
  o Add this as a penalty to your criteria or as constraints
• Divide your “acceptability considerations” into:
  o Constraints: yes/no.
  o Evaluation: better/worse.
• Check that your evaluation criteria are well justified
  o “My formula gives it this way”
  o “My result is correct since this is what my algorithm gives”
  o Reason in terms of the user task, not your algorithm / formulas
• Are your good/bad judgments in accord with intuition?

Evaluation? (Possible? How?)

• IR: “user satisfaction”
  o Difficult to model formally
  o Expensive to measure directly (experiments with subjects)
• At least two contradicting parameters
  o Completeness vs. quality
  o No good way to combine into one single numerical value
  o Some “user-defined” “weights of importance” of the two
    Not formal, depend on situation
• Art

Parameters to evaluate

• Performance (in general sense)
  o Speed
  o Space
  o Tradeoff between the two
  o Common for all systems. Not discussed here.
• Retrieval performance (quality?)
  o = goodness of a retrieval strategy
  o A test reference collection: docs and queries.
  o The “correct” set (or ordering) provided by “experts”
  o A similarity measure to compare system output with the “correct” one.

Evaluation: Model User Satisfaction

• User task
  o Batch query processing? Interaction? Mixed?
• Way of use
  o Real-life situation: what factors matter?
  o Interface type
• In this chapter: laboratory settings
  o Repeatability
  o Scalability

Sets (Boolean): Precision & Recall

• Tradeoff (as with time and space)
• Assumes the retrieval results are sets
  o as in Boolean; in Vector, use threshold
• Measures closeness between two sets
• Recall: Of relevant docs, how many (%) were retrieved? Others are lost.
• Precision: Of retrieved docs, how many (%) are relevant? Others are noise.
• Nowadays, with huge collections, Precision is more important!

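Precision & Recall

$$\text{Recall} = \frac{|R_a|}{|R|} \qquad \text{Precision} = \frac{|R_a|}{|A|}$$

where R is the set of relevant docs, A is the set of retrieved docs (the answer set), and R_a is the set of relevant docs that were retrieved.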
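The two set-based ratios can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `precision_recall` and the document identifiers are made up for the example, and returning 0.0 for an empty set is a convention assumed here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # Ra: relevant docs that were retrieved
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 docs retrieved, 3 docs relevant, 2 in common.
p, r = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d2", "d4", "d7"},
)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved docs are relevant (the other two are noise), and one of the three relevant docs is lost.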

Ranked Output (Vector): ?

• “Truth”: ordering built by experts
• System output: guessed ordering
• Ways to compare two rankings: ?
• Building the “truth” set is not possible or too expensive
• So not used (rarely used?) in practice
• One can build the “truth” set automatically
  o Research topic for us?

Ranked Output (Vector) vs. Set

• “Truth”: unordered “relevant” set
• Output: ordered guessing
• Compare ordered set with an unordered one

... Ranked Output vs. set (one query)

• Plot precision vs. recall curve
• In the initial part of the list containing n% of all relevant docs, what is the precision?
  o 11 standard recall levels: 0%, 10%, ..., 90%, 100%.
  o 0%: interpolated
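The raw points of such a curve can be obtained by walking down the ranking and recording recall and precision each time a relevant doc is seen. The sketch below assumes a ranking given as a list of document ids and a known relevant set; `pr_points` and the ids are illustrative names only.

```python
def pr_points(ranking, relevant):
    """Return (recall, precision) measured after each relevant doc in the ranking."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # Recall grows by 1/|relevant| at every relevant doc seen;
            # precision is the share of relevant docs among the first `rank`.
            points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical ranking of five docs, three of which are relevant.
ranking = ["d3", "d9", "d1", "d5", "d2"]
relevant = {"d3", "d1", "d2"}
points = pr_points(ranking, relevant)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

With only three relevant docs the measured recall values are 33%, 67% and 100%, so none of them falls exactly on a standard level except 100%; this is why interpolation is needed.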

... Many queries

• Average precision and recall
• Ranked output: Average precision at each recall level
• To get equal (standard) recall levels, interpolation
  o With 3 relevant docs, there is no 10% level!
  o Interpolated value at level n = maximum known value between n and n + 1
  o If none known, use the nearest known.
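A sketch of interpolation to the 11 standard levels, followed by averaging over queries. Note the assumption: instead of the two-step rule above (maximum between adjacent levels, else nearest known), this code uses the commonly used single rule "maximum precision at any recall greater than or equal to the level", which can give slightly different values. Function names and the sample points are hypothetical.

```python
def interpolated_11pt(points):
    """Interpolate (recall, precision) points of one query to 11 recall levels.

    Assumed rule: precision at level r is the maximum precision observed
    at any recall >= r, or 0.0 if no such point exists.
    """
    levels = [i / 10 for i in range(11)]
    eps = 1e-9  # tolerance for floating-point recall values such as 0.1 * 3
    return [
        max((p for r, p in points if r >= level - eps), default=0.0)
        for level in levels
    ]


def average_over_queries(curves):
    """Average several 11-point curves level by level."""
    return [sum(column) / len(column) for column in zip(*curves)]


# Two hypothetical queries with 3 and 2 relevant docs respectively.
q1 = interpolated_11pt([(1 / 3, 1.0), (2 / 3, 2 / 3), (1.0, 0.6)])
q2 = interpolated_11pt([(0.5, 0.5), (1.0, 0.4)])
mean_curve = average_over_queries([q1, q2])
print([round(v, 2) for v in mean_curve])
```

Because every query is mapped to the same 11 levels, the curves can be averaged position by position even though the queries have different numbers of relevant docs.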

Precision vs. Recall Figures

• Alternative method: document cutoff values
  o Precision at first 5, 10, 15, 20, 30, 50, 100 docs
• Used to compare algorithms.
  o Simple
  o Intuitive
• NOT a one-value comparison!
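Precision at a document cutoff needs no knowledge of the full relevant set beyond the judged top of the list. A minimal sketch, with `precision_at_k` and the sample data as illustrative assumptions; dividing by k even when fewer than k docs were returned is the convention assumed here.

```python
def precision_at_k(ranking, relevant, k):
    """Share of relevant docs among the first k docs of the ranking."""
    relevant = set(relevant)
    # Assumed convention: always divide by k, so a short result list is penalized.
    return sum(doc in relevant for doc in ranking[:k]) / k


# Hypothetical ranking of ten docs, four of which are relevant.
ranking = ["d3", "d9", "d1", "d5", "d2", "d8", "d4", "d6", "d7", "d0"]
relevant = {"d3", "d1", "d2", "d7"}
p_at_5 = precision_at_k(ranking, relevant, 5)
p_at_10 = precision_at_k(ranking, relevant, 10)
print(p_at_5, p_at_10)
```

Each cutoff gives a separate number, so two systems can still disagree on which is better depending on the cutoff chosen.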

[Figure]

Which one is better?

Single-value summaries

• Curves cannot be used for averaging over multiple queries
• We need single-value performance for each query
  o Can be averaged over several queries
  o Histogram for several queries can be made
  o Tables can be made
• Precision at first relevant doc?
• Average precision at (each) seen relevant docs
  o Favors systems that give several relevant docs first
• R-precision
  o precision at R-th retrieved (R = total relevant)
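Both summaries can be sketched as follows. The averaging here follows the wording above and divides by the number of relevant docs actually seen in the ranking; many evaluation tools divide by the total number of relevant docs instead, so treat that choice as an assumption. Function names and data are illustrative.

```python
def average_precision_seen(ranking, relevant):
    """Mean of the precision values measured at each relevant doc seen in the ranking."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant doc
    return total / hits if hits else 0.0


def r_precision(ranking, relevant):
    """Precision at cutoff R, where R is the total number of relevant docs."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(doc in relevant for doc in ranking[:r]) / r


# Hypothetical ranking: relevant docs at ranks 1, 3 and 5.
ranking = ["d3", "d9", "d1", "d5", "d2"]
relevant = {"d3", "d1", "d2"}
ap = average_precision_seen(ranking, relevant)
rp = r_precision(ranking, relevant)
print(f"average precision={ap:.3f} R-precision={rp:.3f}")
```

Moving a relevant doc toward the top raises every later precision value in the sum, which is why the average favors systems that give several relevant docs first.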

Precision histogram

• Two algorithms: A and B
• For each query, plot the difference in R-precision: R(A) − R(B)
• Which is better?
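The data behind such a histogram is one signed difference per query. The sketch below assumes R(·) denotes R-precision and that runs and relevance judgments are stored in dictionaries keyed by query id; all names and values are hypothetical.

```python
def r_precision(ranking, relevant):
    """Precision at cutoff R, where R is the total number of relevant docs."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(doc in relevant for doc in ranking[:r]) / r


def precision_histogram(run_a, run_b, qrels):
    """Per-query difference R(A) - R(B); positive values mean A did better."""
    return {
        query: r_precision(run_a[query], rel) - r_precision(run_b[query], rel)
        for query, rel in qrels.items()
    }


# Hypothetical judgments and two runs over two queries.
qrels = {"q1": {"d1", "d2"}, "q2": {"d3", "d4"}}
run_a = {"q1": ["d1", "d2", "d9"], "q2": ["d9", "d3", "d4"]}
run_b = {"q1": ["d1", "d9", "d2"], "q2": ["d3", "d4", "d9"]}
diffs = precision_histogram(run_a, run_b, qrels)
for query, diff in diffs.items():
    print(query, f"{diff:+.2f}")
```

In this toy case A wins on q1 and loses on q2 by the same margin, so the histogram shows no overall winner even though the two systems behave differently per query.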

Alternative measures for Boolean

• Problems with Precision & Recall measure:
  o Recall cannot be estimated with large collections
  o Two values, but we need one value to compare
  o Designed for batch mode, not interactive. Informativeness!
  o Designed for linear ordering of docs (not weak ordering)
• Alternative measures: combine both in one
• F-measure
• E-measure: user preference, Rec vs. Prec
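The formulas for these two measures did not survive on this slide. The sketch below assumes the standard textbook definitions: F is the harmonic mean of precision P and recall R, F = 2PR / (P + R), and E = 1 − (1 + b²) / (b²/R + 1/P), where b is the user-chosen weight. The function names and the handling of zero values are assumptions of this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed standard definition)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) / (b^2 / recall + 1 / precision); lower is better."""
    if precision == 0 or recall == 0:
        return 1.0  # assumed convention: worst possible score
    return 1 - (1 + b * b) / (b * b / recall + 1 / precision)


f = f_measure(0.6, 0.3)
e_balanced = e_measure(0.6, 0.3, b=1.0)   # equals 1 - F when b = 1
e_precision_only = e_measure(0.6, 0.3, b=0.0)  # equals 1 - precision when b = 0
print(round(f, 3), round(e_balanced, 3), round(e_precision_only, 3))
```

Under this formula, b = 1 makes E the complement of F, b = 0 reduces E to 1 − precision, and as b grows E approaches 1 − recall; b is where the user preference between recall and precision enters.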

User-oriented measures

Definitions:

User-oriented measures

• Coverage ratio
  o Many expected docs
• Novelty ratio
  o Many new docs
• Relative recall: # found / # expected
• Recall effort: # expected / # examined until those are found
• Other:
  o expected search length (good for weak order)
  o satisfaction (considers only relevant docs)
  o frustration (considers only non-relevant docs)
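The definitions of the first two ratios were lost with the figure, so the sketch below assumes the usual textbook ones: coverage is the fraction of the relevant docs already known to the user that the system retrieved, and novelty is the fraction of the retrieved relevant docs that the user did not know before. The function name and sample sets are hypothetical.

```python
def coverage_and_novelty(retrieved, relevant, known_to_user):
    """Return (coverage, novelty) under the assumed textbook definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    known_relevant = set(known_to_user) & relevant       # docs the user expects
    found_known = retrieved & known_relevant              # expected docs that were found
    found_new = (retrieved & relevant) - known_relevant   # relevant docs new to the user
    coverage = len(found_known) / len(known_relevant) if known_relevant else 0.0
    found_total = len(found_known) + len(found_new)
    novelty = len(found_new) / found_total if found_total else 0.0
    return coverage, novelty


# Hypothetical query: the user already knows d1 and d6 to be relevant.
coverage, novelty = coverage_and_novelty(
    retrieved={"d1", "d2", "d3", "d4", "d5"},
    relevant={"d1", "d2", "d3", "d6", "d7"},
    known_to_user={"d1", "d6"},
)
print(f"coverage={coverage:.2f} novelty={novelty:.2f}")
```

A high coverage means the system finds many of the docs the user expected; a high novelty means it also reveals many relevant docs the user had not seen.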

Reference collections

• Texts with queries and relevant docs known
• TREC
  o Text REtrieval Conference. Different in different years
  o Wide variety of topics. Document structure marked up.
  o 6 GB. See NIST website: available at small cost
• Not all relevant docs marked!
  o Pooling method:
  o top 100 docs in ranking of many search engines
  o manually verified
  o Tested and found to be a good approximation to the “real” set
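The pooling step itself is a union of the top-ranked docs of every participating system; only the docs in the pool are then judged by hand. A minimal sketch, with `build_pool`, the system names and the small depth chosen purely for illustration (the slide mentions a depth of 100).

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` docs from every system's ranking for one query."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Hypothetical rankings from three systems for the same query.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d2", "d5", "d1", "d6"],
    "system_c": ["d7", "d1", "d8", "d2"],
}
pool = build_pool(runs, depth=2)  # tiny depth so the effect is visible
print(sorted(pool))
```

Docs outside the pool are never judged and are treated as non-relevant, which is why not all relevant docs end up marked.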

...TREC tasks

• Ad-hoc (conventional: query → answer)
• Routing (ranked filtering of changing collection)
• Chinese ad-hoc
• Filtering (changing collection; no ranking)
• Interactive (no ranking)
• NLP: does it help?
• Cross-language (ad-hoc)
• High precision (only 10 docs in answer)
• Spoken document retrieval (written transcripts)
• Very large corpus (ad-hoc, 20 GB = 7.5 M docs)
• Query task (several query versions; does strategy depend on it?)
• Query transforming
  o Automatic
  o Manual

...TREC evaluation

• Summary table statistics
  o # of requests used in the task
  o # of retrieved docs; # of relevant retrieved and not retrieved
• Recall-precision averages
  o 11 standard points. Interpolated (and not)
• Document level averages
  o Also, can include average R-value
• Average precision histogram
  o By topic.
  o E.g., difference between R-precision of this system and average of all systems

Smaller collections

• Simpler to use
• Can include info that TREC does not
• Can be of specialized type (e.g., include co-citations)
• Less sparse, greater overlap between queries
• Examples:
  o CACM
  o ISI
  o there are others

CACM collection

• Communications of ACM, 1958-1979
• 3204 articles
• Computer science
• Structure info (author, date, citations, ...)
• Stems (only title and abstract)
• Good for algorithms relying on cross-citations
  o If a paper cites another one, they are related
  o If two papers cite the same ones, they are related
• 52 queries with Boolean form and answer sets

ISI collection

• On information sciences
• 1460 docs
• For similarity in terms and cross-citation
• Includes:
  o Stems (title and abstracts)
  o Number of cross-citations
• 35 natural-language queries with Boolean form and answer sets

Cystic Fibrosis (CF) collection

• Medical
• 1239 docs
• MEDLINE data
  o keywords assigned manually!
• 100 requests
• 4 judgments for each doc
  o Good to see agreement
• Degrees of relevance, from 0 to 2
• Good answer set overlap
  o can be used for learning from previous queries

Research issues

• Different types of interfaces; interactive systems:
  o What measures to use?
  o Such as informativeness

Conclusions

• Main measures: Precision & Recall.
  o For sets
  o Rankings are evaluated through initial subsets
• There are measures that combine them into one
  o Involve user-defined preferences
• Many (other) characteristics
  o An algorithm can be good at some and bad at others
  o Averages are used, but are not always meaningful
• Reference collections exist with known answers to evaluate new algorithms

Thank you!

Till ... ??