Building Reliable and Reusable Test Collections for Image Retrieval: The Wikipedia Task at ImageCLEF

Theodora Tsikrika, University of Applied Sciences Western Switzerland
Jana Kludas, University of Geneva, Switzerland
Adrian Popescu, CEA, LIST, France

The ImageCLEF Wikipedia image retrieval task aimed to support ad hoc image retrieval evaluation using large-scale collections of Wikipedia images and their user-generated annotations.
Test collections for multimedia in-
formation retrieval, consisting of
multimedia resources, topics, and
associated relevance assessments
(ground truth), enable the reproducible and
comparative evaluation of different approaches,
algorithms, theories, and models through the
use of standardized datasets and common eval-
uation methodologies.1 Such test collections are
typically built in the context of evaluation cam-
paigns that experimentally assess the worth and
validity of new ideas in a laboratory setting
within regular and systematic evaluation cycles.
Over the years, several large-scale evaluation
campaigns have been established at the
international level. Major initiatives in visual
media analysis, indexing, classification, and re-
trieval include the Text Retrieval Conference
Video Retrieval Evaluation (TRECVid), the Pascal
Visual Object Classes challenge, the ImageNet
Large Scale Visual Recognition Challenge, the
MediaEval Benchmarking Initiative for Multi-
media Evaluation, and the Cross-Language Image
Retrieval (ImageCLEF) evaluation campaign.
As part of the ImageCLEF evaluation cam-
paign,2 the Wikipedia image-retrieval task was
introduced to support the reliable benchmark-
ing of multimodal retrieval approaches for ad
hoc image retrieval. The overall goal of the
task was to investigate how well image-retrieval
approaches that exploit textual and visual evi-
dence in order to satisfy a user’s multimedia in-
formation need could deal with large-scale,
heterogeneous, diverse image collections, such
as those encountered on the Web. To build
image collections with such characteristics, we
relied on freely distributable Wikipedia data.
This article first presents the development
and evolution of the Wikipedia image retrieval
task and test collections during the four years
(2008-2011) we organized this task as part of
ImageCLEF.3-6 It covers image collection con-
struction, topic development, ground truth cre-
ation, and applied evaluation measures. Then,
we perform an in-depth analysis to investigate
the reliability and reusability of these test col-
lections. Lastly, we discuss some of the lessons
learned from our experience with running an
image-retrieval benchmark and provide some
guidelines for building similar test collections.
Wikipedia Image-Retrieval at ImageCLEF
ImageCLEF (www.imageclef.org) was intro-
duced in 2003 as part of the Cross-Language
Evaluation Forum (CLEF) with several aims:
• develop the infrastructure necessary to evaluate visual information retrieval systems operating in both monolingual and cross-language contexts,
• provide reliable and reusable resources for such benchmarking purposes, and
• encourage collaboration and interaction among researchers from academia and industry to further advance the field.
To meet these objectives, ImageCLEF has
organized several tasks, including the Wikipedia
image-retrieval task (2008-2011). This is an ad
hoc image-retrieval task whereby retrieval sys-
tems access a collection of images, but the par-
ticular topics that will be investigated cannot
be anticipated. Thus, the task’s overall goal is
to investigate how well multimodal image-
retrieval approaches that combine textual and
visual features can deal with large-scale image
collections that contain highly heterogeneous
items, in terms of their textual descriptions
and visual content. The aim was to simulate
image retrieval in a realistic setting, such as
the Web, where available images cover highly
diverse subjects, have highly varied visual prop-
erties, and might include noisy, user-generated
textual descriptions of varying lengths and
quality.
Image Collections
During the four years the task ran as part of
ImageCLEF, we used two image collections:
the Wikipedia INEX Multimedia collection12
in 2008 and 2009 and the Wikipedia Retrieval
2010 collection5 in 2010 and 2011. The aim
was to create realistic collections; given that
Wikipedia images are diverse and of varying
quality and that their annotations are hetero-
geneous and noisy, we used the collaborative
encyclopedia as our primary data source. All
the content we used is licensed under free soft-
ware licenses, which facilitates distribution
provided that the original license terms are
respected when using the collection. From a
long-term perspective, another advantage of
using Wikipedia is that this resource has
been readily available over the years and that
in the future new and larger collections can
be created at any moment based on the lessons
learned with existing Wikipedia collections.
The Wikipedia INEX Multimedia collection
contains 151,519 images and associated textual
annotations extracted from the English Wikipe-
dia. The Wikipedia Retrieval 2010 collection
consists of 237,434 images selected to cover
similar topics in English, German, and French.
We ensured coverage similarity by retaining
images from articles with versions in all three
languages and at least one image in each ver-
sion. Given that English articles are usually
more developed than those in German and
French, more annotations are available for En-
glish. The rationale for proposing a multilingual
collection is that we wanted to encourage
participants to both test their monolingual
approaches for different languages and develop
multilingual and cross-lingual approaches.
The main differences between the two col-
lections are that the 2010-2011 collection is al-
most 60 percent larger than the 2008-2009
collection and that its images are accompanied
by annotations in multiple languages and links
to the one or more articles that contain the
image to better reproduce the conditions of
Web image search, where images are usually
embedded in relatively long texts. The collections
also incorporate additional visual resources—
visual concepts in 2008 and several image fea-
tures in all four editions—to encourage partici-
pation from groups that specialize in text
retrieval.
Topic Development
Topics were developed in order to respond to
diverse multimedia information needs. The
participants were provided with the topic
title and image examples, while the assessors
were also given an unambiguous description
(narrative) of the type of relevant and irrele-
vant results. There were 75 topics in 2008, 45
in 2009, 70 in 2010, and 50 in 2011.
Topic creation was collaborative in 2008 and
2009, with the participants proposing topics
from which the organizers selected a final list.
Participation in topic creation was mandatory
in 2008 and optional in 2009. In 2010 and
2011, the task organizers selected the topics
after performing a statistical analysis of image
search engine logs: in 2010, we analyzed the queries logged by the Belga News Agency image search portal (www.belga.be), and in 2011, those logged by Exalead (www.exalead.com/search). Mean topic
length varied between 2.64 and 3.10 words per
topic (see Table 1) and was similar to that of
standard Web image queries. We found that
topic creation using log files identified more re-
alistic topics compared to collaborative topic
creation because the resulting topics are closer
to those most interesting to a general user
population.
Following the collections’ structure, only
English topics were provided in 2008 and
2009, while English, German, and French
topics were proposed in 2010 and 2011. Be-
cause we wanted to compare the performance
of the approaches in different languages, the
German and French topics were translated from
English. However, the number of submissions
using French- and German-only topics was rel-
atively low.
To achieve a balanced distribution of topic
difficulty, we ran the topics through the Cross
Modal Search Engine (CMSE, http://dolphin.
unige.ch/cmse) and selected the topic set to
include
• topics with differing numbers of relevant images (as found by the CMSE) and
• a mixture of topics for which the CMSE produced better results when using textual rather than visual queries and vice versa.
Difficult topics usually convey complex semantics (such as "model train scenery"), whereas easy topics have a clearly defined conceptual focus (such as "colored Volkswagen Beetles"). We provided image examples for
each topic to support the investigation of mul-
timodal approaches. To encourage multimo-
dal approaches, we increased the number of
example images in 2011 (to 4.84 versus 1.68,
1.7, and 0.61 in previous years), which allowed
participants to build more detailed visual
query models.
Relevance Assessments
Given the complexity of evaluation with a multilevel relevance scale, binary relevance (relevant versus nonrelevant) is assumed in the Wikipedia image-retrieval task. Each year, the retrieved images
contained in the runs submitted by the partic-
ipants were pooled together using a pool
depth of 100 in 2008, 2010, and 2011, and a
pool depth of 50 in 2009. As Table 1 indicates,
the average pool size per topic varied over the
years, even for the same pool depth. It was
larger in 2010 and 2011 than in 2008 due to
many more runs being submitted and thus
contributing more unique images to the
pools. Also, because the later collection was substantially larger, the runs may have retrieved more diverse images. There was, however, a significant decrease in pool sizes and numbers of relevant images from 2010 to 2011. This might have been due to
having many more topics with named entities.
That is, although such topics are highly repre-
sentative of real Web image searches, they are
not covered in the Wikipedia corpus to the
same extent as in the general Web. It could
also be due to higher numbers of more specific
topics that were introduced to ease the assess-
ment. Furthermore, there appears to be a con-
vergence among the approaches of the various
participants, many of whom shared system
components and were able to refine their sys-
tems by emulating effective methods intro-
duced by fellow participants and interacting
with the ImageCLEF community.
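The pooling step itself is simple to state; the sketch below shows its core in Python, assuming a hypothetical in-memory representation of the submitted runs (topic IDs mapped to ranked lists of image IDs) rather than the actual run-file format used by the task.

```python
from collections import defaultdict

def build_pools(runs, depth=100):
    """Union of the top-`depth` results of every submitted run, per topic.

    `runs` is a list of dicts mapping topic_id -> ranked list of image IDs
    (a hypothetical in-memory view of the submitted run files).
    Returns topic_id -> set of image IDs to be judged.
    """
    pools = defaultdict(set)
    for run in runs:
        for topic_id, ranking in run.items():
            pools[topic_id].update(ranking[:depth])
    return dict(pools)

# Toy usage: with depth=2, the pool for topic01 contains img5, img9, and img7.
run_a = {"topic01": ["img5", "img9", "img2"]}
run_b = {"topic01": ["img9", "img7", "img1"]}
print(build_pools([run_a, run_b], depth=2))
```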
During the first three years, volunteer task
participants and the organizers performed the
assessment during a four-week period after the
runs were submitted using the Web-based inter-
face previously used in the TREC Enterprise
track: 13 groups participated in 2008, seven
groups in 2009, and six groups in 2010.
Table 1. Wikipedia task and collections statistics.

Statistics                          2008      2009      2010      2011
Number of images in collection   151,519   151,519   237,434   237,434
Number of topics                      75        45        70        50
Number of words/topic               2.64       2.7       2.7       3.1
Number of images/topic              0.61       1.7      1.68      4.84
Participants                          12         8        13        11
Runs in pool                          74        57       127       110
  Textual runs                        36        26        48        51
  Visual runs                          5         2         7         2
  Multimodal runs                     33        29        72        57
Pool depth                           100        50       100       100
Average pool size/topic            1,290       545     2,659     1,467
Minimum pool size/topic              753       299     1,421       764
Maximum pool size/topic            1,850       802     3,850     2,327
Number of relevant images/topic     74.6      36.0    252.25      68.8
To ensure consistency in the assessments, a sin-
gle person assessed the pooled images for each
topic. We also made an effort to ensure that
the participants who both created topics and
volunteered as assessors were assigned the
topics they had created; in 2008, this was
achieved in 76 percent of the assignments.
As a result of a continuous drop in the num-
ber of volunteer assessors, we adopted a crowd-
sourcing approach in 2011 using CrowdFlower
(http://crowdflower.com), a general-purpose
platform for managing crowdsourcing tasks
and ensuring high-quality responses. The
assessments were carried out by Amazon Me-
chanical Turk (http://www.mturk.com) workers.
Each worker assignment involved the assess-
ment of five images for a single topic. To pre-
vent spammers, each assignment contained one "gold standard" image among the five images, an image already correctly labeled,
to estimate the workers’ accuracy. If a worker’s
accuracy dropped below a threshold (70 per-
cent), his or her assessments were excluded.
To further ensure accurate responses, each
image was assessed by three workers with the
final assessment obtained through majority
vote.
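The quality-control logic just described can be sketched as follows. The data structures, the gold-label handling, and the accuracy check against the 70 percent threshold are our illustrative reconstruction; the actual checks were performed by the CrowdFlower platform.

```python
from collections import defaultdict

def filter_and_vote(assignments, gold, accuracy_threshold=0.70):
    """Discard low-accuracy workers, then majority-vote the remaining judgments.

    `assignments`: list of (worker_id, image_id, label) tuples with label True/False;
    `gold`: dict image_id -> correct label for the embedded gold-standard images.
    Every assignment is assumed to contain one gold-standard image, as in the task.
    """
    # Estimate each worker's accuracy on the gold-standard images.
    correct, seen = defaultdict(int), defaultdict(int)
    for worker, image, label in assignments:
        if image in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[image])
    trusted = {w for w in seen if correct[w] / seen[w] >= accuracy_threshold}

    # Majority vote over the judgments of trusted workers, per image.
    votes = defaultdict(list)
    for worker, image, label in assignments:
        if worker in trusted and image not in gold:
            votes[image].append(label)
    return {image: sum(v) > len(v) / 2 for image, v in votes.items()}
```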
On average, 26 distinct workers assessed
each topic, with each assessing approximately
200 images. A total of 379 distinct workers
were employed over all topics. Although this
approach risks obtaining inconsistent results for
the same topic, an inspection of the results did
not reveal such issues. The time required per
topic was 27 minutes on average, which made
it possible to complete the ground truth cre-
ation within a few hours, a marked difference from our experience in previous years.
Evaluation Measures
The effectiveness of the submitted runs was
evaluated using the following measures:
mean average precision (MAP), precision at
fixed rank position (P@n, n = 10, 20), and R-
precision (precision at rank position R, where
R is the number of relevant documents).1 Dif-
ferent measures reflect different priorities in
the simulation of an operational setting and
evaluate different aspects of retrieval effective-
ness. For example, MAP focuses on the overall
quality of the entire ranking and on the im-
portance of locating as many relevant items
as possible, whereas P@10 emphasizes the
quality at the top of the ranking and ignores
the rest. We selected MAP as the main evalua-
tion measure for ranking the participants’ sub-
missions given its higher inherent stability,13
informativeness, and discriminative power.14
Furthermore, research has shown that it can
be a better measure to employ even if users
only expect a retrieval system to find a few rel-
evant items among the top ranked ones.15
Finally, given that relevance assessments in
retrieval benchmarks are incomplete due to
pooling, evaluation measures need to account
for unjudged documents. The measures described above treat unjudged documents as nonrelevant.
This might affect the measures’ robustness in
cases of substantially incomplete relevance
assessments.8 To this end, we also use binary
preference (BPref),8 a measure devised to better
assess retrieval effectiveness when large num-
bers of unjudged documents exist.
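For reference, the two measures that the analysis below relies on most can be computed as in this sketch, which assumes binary relevance and the hypothetical data layout of the earlier pooling example; the BPref function follows the Buckley and Voorhees formulation cited above.

```python
def average_precision(ranking, relevant):
    """AP of one ranked list; unjudged and judged-nonrelevant documents both count as misses."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def bpref(ranking, relevant, judged_nonrelevant):
    """Binary preference: relevant documents are penalized only by judged nonrelevant ones."""
    R, N = len(relevant), len(judged_nonrelevant)
    if R == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for doc in ranking:
        if doc in relevant:
            penalty = min(nonrel_seen, R) / min(R, N) if N else 0.0
            score += 1.0 - penalty
        elif doc in judged_nonrelevant:
            nonrel_seen += 1
    return score / R

def mean_average_precision(run, qrels):
    """MAP: mean of per-topic AP; `run` and `qrels` map topic IDs to rankings / relevant sets."""
    return sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)
```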
Analyzing Ranking Stability
We now investigate the reliability and reus-
ability of the test collections built during the
four years the Wikipedia task ran as part of
ImageCLEF, so as to outline some general
best practice guidelines for building robust
large-scale image (multimedia) retrieval bench-
marks with minimum effort. To this end, we
explore several questions:
• Given the highly variable effectiveness of retrieval approaches across topics, what is the minimum number of topics needed to achieve reliable comparisons between different systems for various evaluation measures?
• What is the minimum pool depth necessary to reliably evaluate image-retrieval systems for various evaluation measures?
• How reusable are these test collections for image-retrieval systems that did not participate in the evaluation campaigns and thus did not contribute to the pools?
Although researchers have examined such
questions for textual test collections built
in the context of TREC7-10 and the Initiative
for the Evaluation of XML Retrieval (INEX)11
evaluation campaigns, they have not been
addressed for multimedia datasets.
Different Topic Set Sizes
Given that retrieval approaches have a highly
variable effectiveness across topics, a good ex-
perimental design requires a sufficient number
of topics to help determine whether one re-
trieval approach is better than another. To in-
vestigate the ranking stability of the test
collections built during the four years of the
task, we applied the methodology proposed
by Ellen Voorhees and Chris Buckley,7 which
empirically derives a relationship between
the topic set size, the observed difference be-
tween the values of two runs for an evaluation
measure, and the confidence that can be
placed in the conclusion that one run is better
than another. In particular, their method
derives error rates that quantify the likelihood
that a different set of topics would lead to dif-
ferent conclusions regarding the relative effec-
tiveness of two runs.
In this method, for N topics in each year, two disjoint topic sets of equal size M (M ≤ N/2) are randomly drawn. The runs submitted that year are evaluated against each of these topic sets separately and ranked based on an evaluation measure (such as MAP). Then, the number of pairwise swaps between the two rankings is counted, and the probability of a swap (error rate) is expressed as the frequency of observed swaps out of all possible pairwise swaps. This process is further refined by categorizing the runs (and their swaps) into one of 11 bins based on the difference between the values of the runs for the given evaluation measure: the first bin contains runs with a difference of less than 0.01; the next bin contains runs with a difference of at least 0.01 but less than 0.02; and the last contains all runs with a difference of at least 0.1. The error rate is then computed for every such bin and by varying the topic set size M = 1, ..., ⌊N/2⌋; the experiment is repeated 50 times for each M. These error rates can be directly computed as a function of topic set sizes up to ⌊N/2⌋ and then extrapolated to the full set of topics N using a fitted exponential model.
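The following is a compact sketch of that procedure, assuming the per-topic average precision of every run is already available in memory; the 0.01-wide bins, the 50 trials, and the swap count mirror the description above, but the code is our own reconstruction rather than the original scripts.

```python
import random
from itertools import combinations

def swap_error_rates(per_topic_ap, M, trials=50, n_bins=11):
    """Simplified Voorhees-Buckley swap-rate estimate.

    `per_topic_ap`: dict run_id -> dict topic_id -> average precision.
    For each trial, two disjoint topic sets of size M are drawn; every run pair is
    binned by the absolute MAP difference observed on the first set (0.01-wide bins,
    last bin open-ended), and a swap is counted when the second set orders the pair
    the other way. Returns the per-bin error rates.
    """
    topics = sorted(next(iter(per_topic_ap.values())))
    swaps, totals = [0] * n_bins, [0] * n_bins

    def mean_ap(run, subset):
        return sum(per_topic_ap[run][t] for t in subset) / len(subset)

    for _ in range(trials):
        sample = random.sample(topics, 2 * M)          # requires 2*M <= number of topics
        set_a, set_b = sample[:M], sample[M:]
        for r1, r2 in combinations(per_topic_ap, 2):
            diff_a = mean_ap(r1, set_a) - mean_ap(r2, set_a)
            diff_b = mean_ap(r1, set_b) - mean_ap(r2, set_b)
            bin_idx = min(int(abs(diff_a) * 100), n_bins - 1)
            totals[bin_idx] += 1
            swaps[bin_idx] += int(diff_a * diff_b < 0)  # opposite orderings = a swap
    return [s / t if t else 0.0 for s, t in zip(swaps, totals)]
```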
Figure 1 plots the error rate for 2011 against the topic set size for each of the 11 bins representing the difference between MAP scores. Figure 1a shows the calculated values for M = 1, ..., 25, and Figure 1b the extrapolated curves up to 50 topics. For instance, the error rate for a difference between 0.03 and 0.04 in the MAP scores of two runs when 25 topics are used is approximately 20 percent. This error rate drops significantly when more topics are used.
Figure 1. Error rate on disjoint topic sets for 2011. (a) Error rate calculation for one to 25 topics. (b) Error rate extrapolation for up to 50 topics. The legend for each curve indicates that the absolute difference in mean average precision (MAP) scores is at least the given value and less than the next curve's value.

An interpolation of the error rates gives the absolute differences required in MAP or BPref values for having a given error rate (such as 5 percent) using 50 topics. This can be interpreted as a significance level that assures one run is consistently better than another, independent of the evaluated topic set. Table 2 reports these differences for an error rate of 5 percent for the years 2008 to 2011, calculated by applying the Voorhees and Buckley method for three different types of topic sets of varying size M. The differences for BPref are always
higher than for MAP, which means that BPref
is somewhat less robust to changes in the
topic set. The values for 2009 and 2011 are
lower than for 2008 and 2010, despite having
both fewer topics and fewer relevant docu-
ments per topic on average (see Table 1).
Forty to 50 topics achieve stable rankings inde-
pendent of the topic set, for the given pool
depth. However, only a difference of at least
0.05 in MAP or BPref guarantees that one run
is consistently better than another—that is,
with an error rate of less than 5 percent.
Incomplete Judgments
For large-scale collections, pooling is applied
to the top D documents of each topic and
run to determine which documents to assess,
resulting in incomplete judgments. This leads
to several questions: How do we choose the
pool depth D and evaluation measure to en-
sure stable rankings even though some of the
relevant documents have not been judged?
Can the ground truth derived from pooling
achieve a fair ranking for runs that did not
contribute to the pool?
Unbiased, Incomplete Judgments. First, we
investigated how the decrease of pool depth D
influences the ranking stability for different
evaluation measures. Pools reduced in this way are said to be unbiased because all runs still contribute to them equally.
For each year, we generated different rank-
ings of the runs by considering the judged docu-
ments in pools of depths D = 5, ..., 100. We
compared these rankings to the original rank-
ing of the runs generated by considering all
the judged documents available for that
year—that is, at pool depth 100 for 2008,
2010, and 2011 and at 50 for 2009. We then
measured the correlation between the original
and the generated rankings of runs using Ken-
dall tau, a measure proportional to the number
of pairwise adjacent swaps needed to convert
one ranking into another. We also used the av-
erage precision (AP) tau,16 a correlation coeffi-
cient that gives more weight to errors at the
top end of rankings.
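A sketch of this comparison follows, reusing the hypothetical run and qrels layouts from the earlier examples; the scoring function is passed in so that MAP or BPref can be plugged in.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same run IDs (no ties assumed)."""
    pos_a = {r: i for i, r in enumerate(order_a)}
    pos_b = {r: i for i, r in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        concordant += int(agreement > 0)
        discordant += int(agreement < 0)
    n_pairs = len(order_a) * (len(order_a) - 1) / 2
    return (concordant - discordant) / n_pairs

def ranking_at_depth(runs, qrels, depth, score_fn):
    """Rank runs using only the relevant documents that fall inside depth-`depth` pools.

    `runs`: run_id -> {topic -> ranked doc list}; `qrels`: topic -> set of relevant docs
    judged at the full pool depth; `score_fn(run, qrels)` returns, e.g., MAP.
    """
    pools = {t: set() for t in qrels}
    for run in runs.values():
        for topic, ranking in run.items():
            pools[topic].update(ranking[:depth])
    reduced_qrels = {t: rel & pools[t] for t, rel in qrels.items()}
    return sorted(runs, key=lambda r: score_fn(runs[r], reduced_qrels), reverse=True)
```

Comparing `ranking_at_depth(runs, qrels, 50, score_fn)` against the ranking at the full depth with `kendall_tau` would reproduce one point of Figure 2 under these assumptions.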
Figure 2 shows the correlation for MAP and
BPref over pool depth D for each year. BPref
was introduced to cope with incomplete judg-
ments, yet surprisingly it is less stable than
MAP for 2008 and 2010, but it performs as
well as MAP for the other years. Generally, at
pool depth D = 50, a correlation of approxi-
mately 0.96 or higher is observed for both
MAP and BPref. Therefore, we can conclude
that a pool depth of 50 suffices to produce sta-
ble rankings for the given topic set sizes and rel-
atively few relevant documents per topic.
Table 3 gives an overview of the pool sizes
and numbers of relevant documents over all
topics that are found for pools of depth 50
and 100. At D = 50 the pool size is halved
and still three-fourths of the relevant docu-
ments are found.
Next, we investigated the use of variable per-
topic pool depth following the methodology
proposed by Justin Zobel.9 First, we created and
assessed a pool of D = 30. Then, we estimated
the number of relevant documents that can be
identified by deepening the pool for each topic
from the rate of new arrivals at increased pool
depth. The last rows in Table 3 show that we
can reduce the pool size compared to pools
with D = 50 while retaining the number of rel-
evant documents identified. This variable-
depth pooling approach achieves a correlation
of approximately 0.96 with the original ranking
and thus performs similarly to pooling with
D = 50 (results not shown). Zobel explained
that the documents have to be judged in
pool-depth order, which might bias assessors,9
but this is not a problem when we apply
crowdsourcing.
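In that spirit, the per-topic decision can be approximated roughly as below; the stratum width, the extrapolation from the deepest judged stratum, and the gain threshold are our own illustrative simplifications rather than Zobel's exact procedure.

```python
def predicted_extra_relevant(new_relevant_per_stratum, extra_depth, stratum_size=10):
    """Extrapolate from the arrival rate of relevant images in the deepest judged stratum.

    `new_relevant_per_stratum`: newly found relevant images in each judged pool stratum
    (e.g., ranks 1-10, 11-20, 21-30 for an initial pool of depth 30).
    """
    if not new_relevant_per_stratum:
        return 0.0
    recent_rate = new_relevant_per_stratum[-1] / stratum_size
    return recent_rate * extra_depth

def topics_worth_deepening(strata_per_topic, extra_depth=20, min_gain=5):
    """Deepen only the pools whose predicted gain is at least `min_gain` relevant images."""
    return [topic for topic, strata in strata_per_topic.items()
            if predicted_extra_relevant(strata, extra_depth) >= min_gain]
```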
Table 2. Absolute differences in MAP and BPref scores required to have a 5 percent error rate for each year.*

Statistics                            2008     2009     2010     2011
Number of topics (N)                    75       45       70       50
Pool depth                             100       50      100      100
1 to ⌊N/2⌋ topics
  MAP                               0.0511   0.0355   0.0493   0.0383
  BPref                             0.0572   0.0444   0.0588   0.0402
1 to 25 topics
  MAP                               0.0505        -   0.0505   0.0383
  BPref                             0.0598        -   0.0557   0.0402
1 to 25 topics with fewest relevant
  MAP                               0.0451        -   0.0518   0.0383
  BPref                             0.0574        -   0.0570   0.0402

* Error rates are estimated for three topic subsets and are extrapolated to topic sets of size 50.
Biased, Incomplete Judgments. Next, we
investigated the ranking stability when the
runs of one group are excluded from the pool
creation. Table 4 lists the average results of
the difference in the pool sizes, the difference
in the number of relevant documents
identified, the correlation (Kendall tau) of
MAP and BPref of the new ranking compared
to the original, and the differences in the rank-
ings of the excluded runs (average and maxi-
mum change in the rank of an excluded run,
in both directions).
Figure 2. Correlation measured with Kendall tau and average precision (AP) tau for MAP and BPref over pool depths of five to 100 for each year: (a) 2008, (b) 2009, (c) 2010, (d) 2011.
Table 3. Pool sizes and number of relevant documents found for different pools.

Statistics               2008          2009            2010           2011
Number of topics           75            45              70             50
Pool depth                100            50             100            100
Pool depth 100
  Pool size            97,396             -         186,104         73,346
  Number relevant       5,593             -          17,659          3,440
Pool depth 50
  Pool size      59,012 (61%)        24,272   102,540 (55%)   38,272 (52%)
  Number relevant 4,302 (77%)         1,622    12,762 (72%)    2,698 (78%)
Variable pool depth
  Pool size      41,710 (42%)             -    91,159 (49%)   29,202 (40%)
  Number relevant 4,098 (73%)             -    12,676 (72%)    2,511 (72%)
The correlation coefficients indicate a stable
overall performance for 2011, 2010, and 2008.
We can see a significantly lower correlation for
2009, which could be due to the smaller pool
depth. MAP and BPref perform again similarly
except for 2009, where BPref is more stable.
MAP tends to decrease the ranks of excluded runs, whereas BPref tends to increase them.
This behavior, which researchers also observed
in earlier work,10 indicates that MAP might un-
derestimate the system performance for biased
judgments, whereas BPref might overestimate it.
These results show that the Wikipedia bench-
marking data can fairly rank most runs that
do not contribute to the pool even though a
few single runs might be grossly misjudged.
Nonetheless, Table 5 shows that the reus-
ability of a collection’s data has limits. For
this experiment, we first pooled the textual runs only and then the multimodal runs only, investigating the ranking stability with respect to a significantly novel retrieval approach that did not contribute to the pools. The pool sizes and numbers of relevant documents identified decreased drastically, and the decrease was largest when the multimodal runs were excluded. The correlation
coefficients and the rank differences display the
performance instability for each year. The most
unstable setup is 2009, and the least is 2011,
which is likely influenced by the number of
runs that contributed to the pools and the
pool depths. This indicates that a ground
truth created using numerous runs based on
various approaches will be able to fairly rank
new approaches.
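Both the leave-one-group-out and the leave-one-modality-out analyses reduce to rebuilding the pool without the held-out runs and intersecting the original judgments with it, roughly as sketched below (same hypothetical data layouts as in the earlier sketches; the grouping key is illustrative).

```python
def qrels_without_group(runs_by_group, qrels, excluded_group, depth=100):
    """Judgments as they would look if one group (or modality) had not contributed runs.

    `runs_by_group`: group_id -> list of runs, each run a dict topic -> ranked doc list;
    `qrels`: topic -> set of relevant docs found in the original, full pool.
    """
    pools = {t: set() for t in qrels}
    for group, runs in runs_by_group.items():
        if group == excluded_group:
            continue
        for run in runs:
            for topic, ranking in run.items():
                pools[topic].update(ranking[:depth])
    # Relevant documents contributed only by the excluded runs disappear from the qrels.
    return {t: rel & pools[t] for t, rel in qrels.items()}
```

The held-out runs are then rescored against the reduced judgments, and their rank shifts relative to the original ranking yield figures of the kind reported in Tables 4 and 5.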
Conclusions and Lessons Learned
Any effort to build test collections for similar
image-retrieval tasks should consider several
issues. For example, copyright issues must be
taken into account if the collection is to be
freely distributed to the community. The
same goes for any additional resources, such
as example images added to the topics. Fur-
thermore, all collection users should be fully
aware and respectful of the images’ original li-
cense terms. Before distributing a collection, it
is also important to perform multiple checks
on image integrity so as to make the collection
easy for users to process.
It has become best practice to use image
search engine logs to select candidate topics
to simulate real-world users’ information
needs. Nevertheless, topic sets should contain
topics of varying difficulty (such as textual
and visual multimedia queries) that also have
a varying number of relevant documents in the
collection (broad versus narrow). A baseline re-
trieval system that gives an overview of the col-
lection content can help test the latter. When
the assessment uses pooling, the number of rele-
vant documents should not be too large (no
more than 100, for example) to minimize the
number of unjudged relevant documents that
Table 4. Average results when runs of one group are excluded from pool creation.

Statistics                                   2008      2009      2010      2011
Difference in pool size                     7,548     2,172    10,153     4,396
Difference in number of relevant              239      80.7       538      97.5
Tau MAP                                     0.991     0.939     0.987     0.993
Tau BPref                                   0.977     0.946     0.985     0.989
Average rank difference MAP                 -1.96    -3.183      -2.3     -0.85
Maximum rank difference MAP (up/down)      1↑/19↓    3↑/20↓    3↑/14↓     0↑/6↓
Average rank difference BPref                2.58      2.24      3.30      2.48
Maximum rank difference BPref (up/down)    68↑/2↓    23↑/1↓    38↑/3↓    15↑/0↓
Table 5. Average results when textual or multimodal runs are excluded from pool creation.

Statistics                                     2009      2010      2011
Number of textual/multimodal runs             33/35     42/62     51/52
Difference in pool size                      12,759    95,411    34,134
Difference in number of relevant                443     4,387       720
Tau MAP                                       0.739     0.958     0.959
Tau BPref                                     0.832     0.938     0.942
Absolute average rank difference MAP           6.31      2.32      2.28
Maximum rank difference MAP (up/down)       10↑/27↓    7↑/16↓    5↑/14↓
Absolute average rank difference BPref         3.33      2.74      2.67
Maximum rank difference BPref (up/down)     29↑/11↓    43↑/9↓    18↑/8↓
influence the ranking stability. Our analysis
showed that 40 to 50 topics achieved stable rank-
ings independent of the topic set, where a differ-
ence of at least 0.05 in MAP or BPref indicates
that one run is consistently better than another.
Our experience with using crowdsourcing
for the 2011 assessment was positive. Some
manual work was necessary to create the gold
standard, but after that, the automatic assess-
ment was done quickly and accurately when
more than one assessor was assigned to each
topic. Our analysis showed that a pool depth
of 50 suffices to produce stable rankings
if the topics do not have too many relevant
documents in the collection. Smaller pools
that still identify many relevant documents
can be achieved using variable-depth pooling,9
which works well with crowdsourcing. Our
analysis also indicates that ground truth should
be created by pooling a large number of runs
based on various approaches so as to be able
to fairly rank new approaches.
The test collections constructed in the con-
text of the Wikipedia image-retrieval task at
ImageCLEF are available at www.imageclef.
org/wikidata. These resources are the result of
the collaboration of a large number of partici-
pating research groups that, through forming
a community with multidisciplinary compe-
tencies and sharing expertise, have contrib-
uted to the advancement of image-retrieval
research. MM
Acknowledgments
Theodora Tsikrika was supported by the Euro-
pean Union in the context of the Promise
(contract 258191) and Chorus+ (contract
249008) FP7 projects. Jana Kludas was funded
by the Swiss National Fund (SNF). Adrian
Popescu was supported by the French ANR
(Agence Nationale de la Recherche) via the
PERIPLUS project (ANR-10-CORD-026).
References
1. M. Sanderson, "Test Collection Based Evaluation of Information Retrieval Systems," Foundations and Trends in Information Retrieval, vol. 4, no. 4, 2010, pp. 247-375.
2. H. Müller et al., eds., ImageCLEF: Experimental Evaluation in Visual Information Retrieval, Springer, 2010.
3. T. Tsikrika and J. Kludas, "Overview of the Wikipedia MM Task at ImageCLEF 2008," Evaluating Systems for Multilingual and Multimodal Information Access: Proc. 9th Workshop of the Cross-Language Evaluation Forum (CLEF 2008), Revised Selected Papers, LNCS, Springer, 2009, pp. 539-550.
4. T. Tsikrika and J. Kludas, "Overview of the Wikipedia MM Task at ImageCLEF 2009," Multilingual Information Access Evaluation Vol. II Multimedia Experiments: Proc. 10th Workshop of the Cross-Language Evaluation Forum (CLEF 2009), Revised Selected Papers, LNCS, Springer, 2010, pp. 60-71.
5. A. Popescu, T. Tsikrika, and J. Kludas, "Overview of the Wikipedia Retrieval Task at ImageCLEF 2010," Working Notes for the CLEF 2010 Workshop, 2010; http://clef2010.org/resources/proceedings/clef2010labs_submission_124.pdf.
6. T. Tsikrika, A. Popescu, and J. Kludas, "Overview of the Wikipedia Image Retrieval Task at ImageCLEF 2011," Working Notes for the CLEF 2011 Labs and Workshop, 2011; http://clef2011.org/resources/proceedings/Overview_ImageCLEF_Wikipeida_Retrieval_Clef2011.pdf.
7. E.M. Voorhees and C. Buckley, "The Effect of Topic Set Size on Retrieval Experiment Error," Proc. 25th ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2002, pp. 316-323.
8. C. Buckley and E.M. Voorhees, "Retrieval Evaluation with Incomplete Information," Proc. 27th ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2004, pp. 25-32.
9. J. Zobel, "How Reliable Are the Results of Large-Scale Information Retrieval Experiments?" Proc. 21st ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 1998, pp. 307-314.
10. S. Büttcher et al., "Reliable Information Retrieval Evaluation with Incomplete and Biased Judgments," Proc. 30th ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2007, pp. 63-70.
11. S. Pal, M. Mitra, and J. Kamps, "Evaluation Effort, Reliability and Reusability in XML Retrieval," J. Am. Soc. for Information Science and Technology, vol. 62, no. 2, 2011, pp. 375-394.
12. T. Westerveld and R. van Zwol, "The INEX 2006 Multimedia Track," Advances in XML Information Retrieval: Proc. 5th Int'l Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2006), Revised Selected Papers, LNCS, Springer, 2007, pp. 331-344.
13. C. Buckley and E.M. Voorhees, "Evaluating Evaluation Measure Stability," Proc. 23rd ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2000, pp. 33-40.
14. J.A. Aslam, E. Yilmaz, and V. Pavlu, "The Maximum Entropy Method for Analyzing Retrieval Measures," Proc. 28th ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2005, pp. 27-34.
15. E. Yilmaz and S. Robertson, "On the Choice of Effectiveness Measures for Learning to Rank," Information Retrieval, vol. 13, no. 3, 2010, pp. 271-290.
16. E. Yilmaz, J.A. Aslam, and S. Robertson, "A New Rank Correlation Coefficient for Information Retrieval," Proc. 31st ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2008, pp. 587-594.
Theodora Tsikrika is a post-doctoral researcher at the
University of Applied Sciences Western Switzerland.
Her research interests focus on multimedia informa-
tion retrieval and on the combination of statistical,
social, and semantic evidence for such search pro-
cesses and interactions. Tsikrika has a PhD in com-
puter science from Queen Mary, University of
London. Contact her at [email protected].
Jana Kludas is a post-doctoral researcher in bioinfor-
matics at Aalto University, Finland. Her research
interests include machine learning, data mining,
and computer vision with a focus on information fu-
sion in all its forms and application areas. Kludas has
a PhD in computer science from the University of
Geneva, Switzerland. Contact her at jana.kludas@unige.ch.
Adrian Popescu is a post-doctoral researcher with
CEA, LIST, France. His research interests include in-
formation extraction, image retrieval, and social net-
works. Popescu has a PhD in computer science from
Telecom Bretagne, France. Contact him at adrian.