On the Evaluation of Snippet Selection for Information Retrieval
A. Overwijk, D. Nguyen, C. Hauff,
R.B. Trieschnigg, D. Hiemstra, F.M.G. de Jong
Contents

Properties of a good evaluation method
Evaluation method of WebCLEF
Approach
Results
Analysis
Conclusion
Good evaluation method

Reflects the quality of the system
Reusability
Evaluation method of WebCLEF

Recall: the sum of the character lengths of all spans in the system's response that are linked to nuggets (i.e. aspects the user includes in his article), divided by the total sum of span lengths in the responses for that topic across all submitted runs.

Precision: the number of characters that belong to at least one span linked to a nugget, divided by the total character length of the system's response.
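The two character-based measures above can be sketched in code. This is a minimal illustration, not the official WebCLEF scorer: the function names and the representation of spans as plain substrings are assumptions, and overlapping spans are counted as if disjoint for simplicity.

```python
def precision(response_snippets, linked_spans):
    """Characters belonging to a nugget-linked span, divided by the
    total character length of the system's response.
    Note: overlapping spans are naively summed here (an assumption)."""
    linked_chars = sum(len(span) for span in linked_spans)
    total_chars = sum(len(s) for s in response_snippets)
    return linked_chars / total_chars if total_chars else 0.0

def recall(linked_spans, pool_span_lengths):
    """Linked span length for this run, divided by the total span
    length over the responses of all submitted runs for the topic."""
    linked_chars = sum(len(span) for span in linked_spans)
    total_pool = sum(pool_span_lengths)
    return linked_chars / total_pool if total_pool else 0.0

snippets = ["The Eiffel Tower opened in 1889.", "It is 330 m tall."]
spans = ["opened in 1889"]
p = precision(snippets, spans)
r = recall(spans, [14, 20, 8])  # span lengths pooled from all runs
```

Because recall is normalized by the pool of all submitted runs rather than by a fixed gold standard, the measure depends on which systems participated, a point the analysis below returns to.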
Approach

Better system, better performance scores?
Similar system, same performance scores?
Worse system, lower performance scores?
Better system

Last year's best performing system contains a bug:

our %stopwords = qw(
    's
    a
    …
    zwischen
);

for my $w … {
    next if exists $stopwords{$w};
    …
}
Better system

System                     Precision   Recall
With bug                   0.2018      0.2561
Without bug                0.1328      0.1685
Not filtering stop words   0.1087      0.1380
Similar system

General idea: almost identical snippets should have almost the same precision and recall.

Experiment: remove the last word from every snippet in the output of last year's best performing system.
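The degradation experiment above is straightforward to express in code. A hypothetical sketch (the function name and the snippet representation as plain strings are assumptions):

```python
def strip_last_word(snippet):
    """Drop the final word of a snippet, keeping the rest intact."""
    words = snippet.split()
    return " ".join(words[:-1])

# Degrade every snippet in a run by one word.
run = ["the capital of France is Paris", "Paris hosts the Louvre museum"]
degraded = [strip_last_word(s) for s in run]
# ["the capital of France is", "Paris hosts the Louvre"]
```

Under a robust evaluation method, such a minimally degraded run should score almost the same as the original; the table below shows that it does not.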
Similar system

System              Precision   Recall
Original            0.2018      0.2561
Last word removed   0.0597      0.0758
Worse system

Delivering snippets based on occurrence:
1st snippet = 1st paragraph of 1st document
2nd snippet = 2nd paragraph of 2nd document
...

No different from a search engine, except that the documents are split up into snippets.
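The first-occurrence baseline described above can be sketched as follows. This is a hypothetical illustration: the function name and the representation of documents as lists of paragraph strings are assumptions.

```python
def first_occurrence_run(documents, n_snippets):
    """Naive baseline: snippet i is paragraph i of document i.
    documents: list of documents, each a list of paragraph strings.
    Topics with too few documents or paragraphs yield fewer snippets."""
    snippets = []
    for i in range(n_snippets):
        if i < len(documents) and i < len(documents[i]):
            snippets.append(documents[i][i])
    return snippets

docs = [["d1p1", "d1p2"],
        ["d2p1", "d2p2", "d2p3"],
        ["d3p1", "d3p2", "d3p3"]]
print(first_occurrence_run(docs, 3))  # ['d1p1', 'd2p2', 'd3p3']
```

A baseline this naive should score clearly worse than a tuned system; the per-topic results below show how the evaluation method reacts.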
Worse system

            Original                First occurrence
Topic   Precision   Recall      Precision   Recall
17      0.0389      0.0436      0.0389      0.0436
18      0.1590      0.6190      0.1590      0.6190
21      0.4083      0.6513      0.4083      0.6513
23      0.1140      0.1057      0.1140      0.1057
25      0.4240      0.4041      0.4240      0.4041
26      0.0780      0.1405      0.0780      0.1405
Avg.    0.2018      0.2561      0.0536      0.0680
Analysis

Pool of snippets
Implementation
Assessments
Conclusion

The evaluation method is not sufficient:
Biased towards participating systems
The correctness criterion for a snippet is too strict

Recommendations:
N-grams (e.g. ROUGE)
Multiple assessors per topic
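The n-gram idea behind the ROUGE recommendation can be sketched minimally. This is a simplified ROUGE-N recall, not the official ROUGE toolkit; the tokenization (lowercased whitespace split) and function name are assumptions.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    """Fraction of the reference's n-grams that also occur in the
    candidate (clipped counts), i.e. a simplified ROUGE-N recall."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

score = rouge_n_recall("the cat sat on the mat", "the cat sat")  # 1.0
```

Unlike the strict span-matching above, partial n-gram overlap gives a nearly identical snippet a nearly identical score, which addresses the "last word removed" failure shown earlier.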
Questions