8/10/2019 — Evaluation of Multimedia Retrieval System

Multimedia Indexing & Retrieval
Evaluation of Multimedia Retrieval Systems
12/6/14
Outline
- Introduction
- Test Collections and Evaluation Workshops
- Evaluation Measures
- Significance Tests
- A Case Study: TRECVID
Introduction
- We are going to discuss the tools and methodology for comparing the effectiveness of two or more multimedia retrieval systems in a meaningful way
- A multimedia retrieval system can be evaluated using two strategies:
  1. Without consulting the potential users of the system, e.g. query processing time, query throughput, etc.
  2. With consulting the potential users of the system, e.g. the effectiveness of the retrieved results
Test Collections
- Doing an evaluation involving real people is not only a costly job, it is also difficult to control and hard to replicate
- Methods have been developed to design so-called test collections
- Often, these test collections are created by consulting potential users
- A test collection can be used to evaluate multimedia retrieval systems without the need to consult the users during further evaluations
- A new retrieval method or system can then be evaluated by comparing it to some well-established methods in a controlled experiment
Evaluation Methods in a Controlled Experiment (Hull, 1993)
- A test collection consisting of:
  1. Multimedia data
  2. A task and the data needed for that task, for instance image search using example images as queries
  3. Ground truth data (correct answers)
- One or more suitable evaluation measures that assign values to the effectiveness of the search
- A statistical methodology that determines whether the observed differences in performance between the methods investigated are statistically significant
Glass Box vs Black Box
- Two approaches to testing the effectiveness of a multimedia search system:
  1. Glass box evaluation, i.e., the systematic assessment of every component of a system.
     Example: the word error rate of the speech recognition component, or the shot detector subsystem of a video retrieval system
  2. Black box evaluation, i.e., testing the system as a whole
Test Collections and Evaluation Workshops
- The scientific evaluation of multimedia search systems is a costly and laborious job
- Big companies, for instance Web search companies such as Yahoo and Google, might perform such evaluations themselves
- For smaller companies and research institutions, a good option is often to take a test collection that is prepared by others
- Test collections are often created by the collaborative effort of many research groups in so-called evaluation workshops of conferences, e.g. the Text REtrieval Conference (TREC)
TREC-style Evaluation Cycle
[Figure: the TREC-style evaluation cycle]
Ground Truth Data Preparation
- For black-box evaluations, people are involved in the process of deciding whether a multimedia object is relevant for a certain query
- For glass-box evaluations, for instance an evaluation of a shot boundary detection algorithm, the process of deciding whether a sequence of frames contains a shot boundary involves people as well
Corel Set
- The Corel test collection is a set of stock photographs, which is divided into subsets of images (e.g., Arabian horses, sunsets, or English pub signs)
- The collection has been used to evaluate the effectiveness of content-based image retrieval systems in a large number of publications and has become a de-facto standard in the field
Corel Set (cont.)
- Usually, Corel is used as follows:
  1. From the test collection, take out one image and use it as a query-by-example;
  2. Relevant images for the query are those that come from the same theme; and
  3. Compute precision and recall for a number of queries
- Problems:
  - A single Corel set does not exist
  - Different publications use different CDs and different selections of themes
  - Too simple, since the images come from only one professional photographer (it does not represent realistic settings)
MIREX Workshops
- The Music Information Retrieval Evaluation eXchange (MIREX) test collections consist of data from record labels that allow tracks from their artists' releases to be published
- MIREX uses a glass box approach to the evaluation of music information retrieval by identifying various subtasks, e.g. identification, drum detection, genre classification, melody extraction, onset detection, tempo extraction, key finding, symbolic genre classification, and symbolic melodic similarity
TRECVID: The TREC Video Retrieval Workshops
- The workshop emerged as a special task of the Text Retrieval Conference (TREC) in 2001
- It later continued as an independent workshop collocated with TREC
- The workshop has focused mostly on news videos from US, Chinese, and Arabic sources, e.g. CNN Headline News and ABC World News Tonight
TRECVID tasks (example)
- TRECVID provides several white-box (glass box) evaluation tasks, such as:
  - Shot boundary detection
  - Story segmentation
  - High-level feature extraction
  - Etc.
- The workshop series also provides a black-box evaluation framework: a search task that may include the complete interactive session of a user with the system
CLEF Workshops
- CLEF (Cross-Language Evaluation Forum) is a workshop series mainly focusing on multilingual retrieval, i.e., retrieving data from a collection of documents in many languages, and on cross-language retrieval
- CLEF started as a cross-language retrieval task in TREC in 1997 and became independent from TREC in 2000
ImageCLEF
- Goal: to investigate the effectiveness of combining text and images for retrieval
- ImageCLEF also provides an evaluation task for a medical image collection consisting of medical photographs, X-rays, CT scans, MRIs, etc., provided by the Geneva University Hospital in the Casimage project
INEX Workshops
- The Initiative for the Evaluation of XML Retrieval (INEX) aims at evaluating the retrieval performance of XML retrieval systems
- It focuses on using the structure of the document to extract, relate, and combine the relevance scores of different multimedia fragments
INEX Workshop task (example)
- INEX 2006 had a modest multimedia task that involved data from the Wikipedia
- A search query might, for instance, search for images by combining information from the image's caption text, its links, and the article that contains the image
Evaluation Measures
- The effectiveness of a system or component is often measured by the combination of precision and recall
- Precision is defined as the fraction of the retrieved or detected objects that is actually relevant. Recall is defined as the fraction of the relevant objects that is actually retrieved

  precision = (number of relevant documents retrieved) / (total number of documents retrieved)
  recall = (number of relevant documents retrieved) / (total number of relevant documents)

[Figure: the retrieved documents and the relevant documents as two overlapping sets within the entire document collection]
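The two set-based definitions above can be written directly as a small Python sketch (the document ids are hypothetical):

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))

def recall(retrieved, relevant):
    """Fraction of relevant documents that are retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

# Hypothetical ids: 4 documents retrieved, 3 of them relevant, 6 relevant overall
retrieved = [588, 589, 576, 590]
relevant = [588, 589, 590, 592, 772, 901]
# precision = 3/4 = 0.75, recall = 3/6 = 0.5
```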
Determining Recall is Difficult
- The total number of relevant items is sometimes not available. Two workarounds:
  - Sample across the database and perform relevance judgments on these items
  - Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set
Trade-off between Recall and Precision

[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1). The ideal system sits at the top-right corner. A high-precision/low-recall system returns relevant documents but misses many useful ones; a high-recall/low-precision system returns most relevant documents but includes lots of junk]
Computing Recall/Precision Points
- For a given query, produce the ranked list of retrievals
- Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures
- Mark each document in the ranked list that is relevant according to the gold standard
- Compute a recall/precision pair for each position in the ranked list that contains a relevant document
Computing Recall/Precision Points (example)

Let the total number of relevant docs = 6

 n   doc #   relevant
 1   588     x
 2   589     x
 3   576
 4   590     x
 5   986
 6   592     x
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x
14   990

Check each new recall point:
R=1/6=0.167; P=1/1=1
R=2/6=0.333; P=2/2=1
R=3/6=0.5;   P=3/4=0.75
R=4/6=0.667; P=4/6=0.667
R=5/6=0.833; P=5/13=0.38
One relevant document is missing from the ranked list, so 100% recall is never reached
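The computation above can be reproduced with a short Python sketch (the unretrieved sixth relevant document is given the hypothetical id 901):

```python
def recall_precision_points(ranking, relevant, total_relevant):
    """Return a (recall, precision) pair at each rank holding a relevant document."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772, 901}   # 901: hypothetical id of the unretrieved doc
points = recall_precision_points(ranking, relevant, total_relevant=6)
# ≈ [(0.17, 1.00), (0.33, 1.00), (0.50, 0.75), (0.67, 0.67), (0.83, 0.38)]
```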
Precision at Fixed Recall Levels
- A number of fixed recall levels is chosen, for instance 10 levels: {0.1, 0.2, ..., 1.0}
- The levels correspond to users that are satisfied if they find respectively 10%, 20%, ..., 100% of the relevant documents
- For each of these levels, the corresponding precision is determined by averaging the precision at that level over the tests that are performed
Precision at Fixed Points in the Ranked List
- Recall is not necessarily a good measure of user equivalence
- For instance, if one query has 20 relevant documents while another has 200, a recall of 50% would be a reasonable goal in the first case, but unmanageable for most users in the second case
- As an alternative, choose a number of fixed points in the ranked list, for instance nine points at 5, 10, 15, 20, 30, 100, 200, 500, and 1000 documents retrieved
- These points correspond to users that are willing to read 5, 10, 15, etc. documents of a search
- For each of these points in the ranked list, the precision is determined by averaging the precision at that level over the queries
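Precision at a fixed point in the ranked list can be sketched as follows (the document ids are illustrative):

```python
def precision_at_k(ranking, relevant, k):
    """Precision over the first k documents of the ranked list."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985]
relevant = {588, 589, 590, 592}
# P@5 = 3/5 = 0.6, P@10 = 4/10 = 0.4
```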
Mean Average Precision
- Average precision: the average of the precision values at the points at which each relevant document is retrieved
- Example: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
- Mean average precision: the mean of the average precision values over a set of queries
Combining Precision/Recall: F-Measure
- One measure of performance that takes both recall and precision into account
- The harmonic mean of recall and precision:

  F = 2RP / (R + P) = 2 / (1/R + 1/P)

- Compared to the arithmetic mean, both recall and precision need to be high for the harmonic mean to be high
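A minimal sketch of this behaviour (the precision/recall values are illustrative):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One high and one low value: the arithmetic mean hides the imbalance,
# the harmonic mean does not.
p, r = 1.0, 0.1
# arithmetic mean = 0.55, but F = 2*1.0*0.1/1.1 ≈ 0.18
```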
Significance Tests
- Simply citing percentage improvements of one method over another is helpful
- It does not tell, however, whether the improvements were in fact due to differences between the two methods
- A difference between two methods might simply be due to random variation in performance
- To make significance testing of the differences applicable, a reasonable number of queries is needed
- A technique called cross-validation is used to further prevent biased evaluation results
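Cross-validation at the query level can be sketched as follows (the fold count, seed, and query ids are assumptions, not from the slides): queries are partitioned into folds, the system is tuned on all folds but one, and the held-out fold is used for testing.

```python
import random

def k_fold_splits(query_ids, k=5, seed=42):
    """Partition query ids into k folds; each fold serves once as the test set."""
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test_fold

# 50 queries split into 5 folds of 10 queries each
splits = list(k_fold_splits(range(50), k=5))
```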
Cross Validation (example)
[Figure: cross-validation example]
Significance Test Assumptions
- Significance tests are designed to disprove the null hypothesis H0
- Here, the null hypothesis is that there is no difference between method A and method B
- Rejecting H0 implies accepting the alternative hypothesis H1
- The alternative hypothesis for the retrieval experiments is that either:
  1. Method A consistently outperforms method B, or
  2. Method B consistently outperforms method A
Common Significance Tests
1. The paired t-test: assumes that the errors are normally distributed
2. The paired Wilcoxon signed ranks test: a non-parametric test that assumes that the errors come from a continuous distribution that is symmetric around 0
3. The paired sign test: a non-parametric test that only uses the sign of the difference between method A and method B for each query. The test statistic is the number of times that the least frequent sign occurs
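The paired sign test is simple enough to sketch directly (the per-query scores are hypothetical; ties are dropped, and the p-value comes from a two-sided binomial with p = 0.5):

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided paired sign test on per-query scores; returns (statistic, p-value)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]  # drop ties
    n = len(diffs)
    pos = sum(1 for d in diffs if d > 0)
    s = min(pos, n - pos)                      # count of the least frequent sign
    # P(X <= s) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    p = min(1.0, 2 * sum(comb(n, i) for i in range(s + 1)) / 2 ** n)
    return s, p

# Hypothetical per-query average precision scores for 10 queries
a = [0.20, 0.31, 0.15, 0.40, 0.22, 0.18, 0.35, 0.28, 0.12, 0.25]
b = [0.25, 0.36, 0.20, 0.45, 0.27, 0.23, 0.40, 0.33, 0.17, 0.30]
stat, p = sign_test(a, b)   # B wins on all 10 queries
```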
Significance Test (example)
- We want to compare the effectiveness of two content-based image retrieval systems, A and B
- Given an example query image, each system returns a ranked list of images from a standard benchmark test collection
- For each system, we run 50 queries from the benchmark
- Based on the benchmark's relevance judgments, the mean average precision is 0.208 for system A and 0.241 for system B
- Is this difference significant?
A Case Study: TRECVID
- In 2004, there were four main tasks in TRECVID:
  - Glass box tasks:
    1. Shot boundary detection
    2. Story segmentation
    3. High-level feature extraction
  - Black box task: search
Shot Boundary Detection
- A shot is the complete sequence of frames starting from the moment that the camera was switched on and stopping when it was switched off again, or a subsequence of that selected by the movie editor
- Shot boundaries: the positions in the video stream that contain the transitions from one shot to the next shot
- There are two kinds of boundaries:
  1. Hard cuts: a relatively large difference between the two frames. Easy to detect
  2. Soft cuts: there is a sequence of frames that belongs to both the first shot and the second shot. Harder to detect
TRECVID 2004 Shot Boundary Detection Task
- Goal: to identify the shot boundaries with their location and type (hard or soft) in the given video clips
- The performance of a system is measured by precision and recall of the detected shot boundaries
- The detection criterion requires only a single frame of overlap between a submitted transition and the reference transition
- The TRECVID 2004 test data:
  - 618,409 frames
  - 4806 shot transitions (58% hard cuts, 42% soft cuts, mostly dissolves)
- The best performing systems reached 95% precision and recall on hard cuts, and 80% precision and recall on soft cuts
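The single-frame-overlap criterion can be sketched as follows (representing each transition as a (first_frame, last_frame) interval is an assumption, and the frame numbers are hypothetical):

```python
def evaluate_transitions(submitted, reference):
    """Precision/recall of detected transitions; each transition is a
    (first_frame, last_frame) interval, matched by >= 1 frame of overlap."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    matched_refs = set()
    hits = 0
    for sub in submitted:
        for i, ref in enumerate(reference):
            if i not in matched_refs and overlaps(sub, ref):
                matched_refs.add(i)   # each reference transition matches at most once
                hits += 1
                break
    precision = hits / len(submitted) if submitted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical frame intervals
reference = [(100, 101), (250, 280), (400, 401)]
submitted = [(101, 102), (260, 265), (500, 501), (600, 601)]
p, r = evaluate_transitions(submitted, reference)
# 2 of 4 submissions match -> precision 0.5; 2 of 3 references found -> recall 2/3
```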
Hard Cuts vs Soft Cuts

[Figure: example frame sequences of a hard cut and of a soft cut with a dissolve effect]
Story Segmentation
- A digital video retrieval system with shots as the basic retrieval unit might not be desirable for news shows and news magazines
- The segmentation of a news show into its constituent news items has been studied under the names of story boundary detection and story segmentation
- Story segmentation of multimedia data can potentially exploit the visual, audio, and textual cues present in the data
- Story boundaries often, but not always, occur at shot boundaries
TRECVID 2004 Story Segmentation Task
- Goal: given the story boundary test collection, identify the story boundaries with their location (time) in the given video clip(s)
- The definition of the story segmentation task was based on manual story boundary annotations made by the Linguistic Data Consortium (LDC) for the Topic Detection and Tracking (TDT) project
- Experimental setups:
  - Video + audio (no ASR/CC)
  - Video + audio + ASR/CC
  - ASR/CC (no video + audio)
Story Segmentation (Zhai, 2005)
- The video consists of two news stories and one commercial. The blue circle and the red circle are the news stories, and the green circle represents a miscellaneous story
Story Boundary Evaluation
- A story boundary is expressed as a time offset with respect to the start of the video file in seconds, accurate to the nearest hundredth of a second
- Each reference boundary is expanded with a fuzziness factor of five seconds in each direction
- If a computed boundary does not fall in the evaluation interval of a reference boundary, it is considered a false alarm
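This protocol can be sketched as follows (the boundary times are hypothetical, and matching each reference boundary at most once is an assumption):

```python
def evaluate_story_boundaries(computed, reference, fuzziness=5.0):
    """Match computed boundary times (seconds) against reference boundaries,
    each expanded by +/- fuzziness seconds; unmatched computed boundaries
    are false alarms, unmatched reference boundaries are misses."""
    matched_refs = set()
    false_alarms = 0
    for t in computed:
        hit = None
        for i, ref in enumerate(reference):
            if i not in matched_refs and abs(t - ref) <= fuzziness:
                hit = i
                break
        if hit is None:
            false_alarms += 1
        else:
            matched_refs.add(hit)
    misses = len(reference) - len(matched_refs)
    return false_alarms, misses

# Hypothetical boundary times in seconds
reference = [12.00, 95.40, 181.25]
computed = [14.50, 90.80, 300.00]
fa, miss = evaluate_story_boundaries(computed, reference)
# 14.50 and 90.80 fall inside the +/- 5 s windows; 300.00 is a false alarm
```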
High-level Feature Extraction
- One way to bridge the semantic gap between video content and textual representations is to annotate video footage with high-level features
- The assumption is that these features describe generic concepts, e.g.:
  - Genres: commercial, politics, sports, weather, financial
  - Objects: face, fire, explosion, car, boat, aircraft, map, building
  - Scenes: waterscape, mountain, sky, outdoor, indoor, vegetation
  - Actions: people walking, people in a crowd
Video Search Task
- Goal: to evaluate the system as a whole
- Given the search test collection and a multimedia statement of a user's information need (called a topic in TRECVID)
- Return a ranked list of at most 1000 shots from the test collection that best satisfy the need
- A topic consists of a textual description of the user's need, for instance "Find shots of one or more buildings with flood waters around it/them", and possibly one or more example frames or images
Manual vs Interactive Search Task
[Figure: manual vs interactive search task]