Building Reliable and Reusable Test Collections for Image Retrieval: The Wikipedia Task at ImageCLEF

Theodora Tsikrika, University of Applied Sciences Western Switzerland
Jana Kludas, University of Geneva, Switzerland
Adrian Popescu, CEA, LIST, France

The ImageCLEF Wikipedia image retrieval task aimed to support ad hoc image retrieval evaluation using large-scale collections of Wikipedia images and their user-generated annotations.
Test collections for multimedia in-
formation retrieval, consisting of
multimedia resources, topics, and
associated relevance assessments
(ground truth), enable the reproducible and
comparative evaluation of different approaches,
algorithms, theories, and models through the
use of standardized datasets and common eval-
uation methodologies.1 Such test collections are
typically built in the context of evaluation cam-
paigns that experimentally assess the worth and
validity of new ideas in a laboratory setting
within regular and systematic evaluation cycles.
Over the years, several large-scale evaluation
campaigns have been established at the
international level. Major initiatives in visual
media analysis, indexing, classification, and re-
trieval include the Text Retrieval Conference
Video Retrieval Evaluation (TRECVid), the Pascal
Visual Object Classes challenge, the ImageNet
Large Scale Visual Recognition Challenge, the
MediaEval Benchmarking Initiative for Multi-
media Evaluation, and the Cross-Language Image
Retrieval (ImageCLEF) evaluation campaign.
As part of the ImageCLEF evaluation cam-
paign,2 the Wikipedia image-retrieval task was
introduced to support the reliable benchmark-
ing of multimodal retrieval approaches for ad
hoc image retrieval. The overall goal of the
task was to investigate how well image-retrieval
approaches that exploit textual and visual evi-
dence in order to satisfy a user’s multimedia in-
formation need could deal with large-scale,
heterogeneous, diverse image collections, such
as those encountered on the Web. To build
image collections with such characteristics, we
relied on freely distributable Wikipedia data.
This article first presents the development
and evolution of the Wikipedia image retrieval
task and test collections during the four years
(2008-2011) we organized this task as part of
ImageCLEF.3-6 It covers image collection con-
struction, topic development, ground truth cre-
ation, and applied evaluation measures. Then,
we perform an in-depth analysis to investigate
the reliability and reusability of these test col-
lections. Lastly, we discuss some of the lessons
learned from our experience with running an
image-retrieval benchmark and provide some
guidelines for building similar test collections.
Wikipedia Image-Retrieval at ImageCLEF
ImageCLEF (www.imageclef.org) was intro-
duced in 2003 as part of the Cross-Language
Evaluation Forum (CLEF) with several aims:
• develop the infrastructure necessary to evaluate visual information retrieval systems operating in both monolingual and cross-language contexts,
• provide reliable and reusable resources for such benchmarking purposes, and
• encourage collaboration and interaction among researchers from academia and industry to further advance the field.
To meet these objectives, ImageCLEF has
organized several tasks, including the Wikipedia
image-retrieval task (2008-2011). This is an ad
hoc image-retrieval task whereby retrieval sys-
tems access a collection of images, but the par-
ticular topics that will be investigated cannot
be anticipated. Thus, the task’s overall goal is
to investigate how well multimodal image-
retrieval approaches that combine textual and
visual features can deal with large-scale image
collections that contain highly heterogeneous
items, in terms of their textual descriptions
and visual content. The aim was to simulate
image retrieval in a realistic setting, such as
the Web, where available images cover highly
diverse subjects, have highly varied visual prop-
erties, and might include noisy, user-generated
textual descriptions of varying lengths and
quality.
Image Collections
During the four years the task ran as part of
ImageCLEF, we used two image collections:
the Wikipedia INEX Multimedia collection12
in 2008 and 2009 and the Wikipedia Retrieval
2010 collection5 in 2010 and 2011. The aim
was to create realistic collections; given that
Wikipedia images are diverse and of varying
quality and that their annotations are hetero-
geneous and noisy, we used the collaborative
encyclopedia as our primary data source. All
the content we used is licensed under free soft-
ware licenses, which facilitates distribution
provided that the original license terms are
respected when using the collection. From a
long-term perspective, another advantage of
using Wikipedia is that this resource has
been readily available over the years and that
in the future new and larger collections can
be created at any moment based on the lessons
learned with existing Wikipedia collections.
The Wikipedia INEX Multimedia collection
contains 151,519 images and associated textual
annotations extracted from the English Wikipe-
dia. The Wikipedia Retrieval 2010 collection
consists of 237,434 images selected to cover
similar topics in English, German, and French.
We ensured coverage similarity by retaining
images from articles with versions in all three
languages and at least one image in each ver-
sion. Given that English articles are usually
more developed than those in German and
French, more annotations are available for En-
glish. The rationale for proposing a multilingual
collection is that we wanted to encourage
participants to both test their monolingual
approaches for different languages and develop
multilingual and cross-lingual approaches.
The main differences between the two col-
lections are that the 2010-2011 collection is al-
most 60 percent larger than the 2008-2009
collection and that its images are accompanied
by annotations in multiple languages and links
to the one or more articles that contain the
image to better reproduce the conditions of
Web image search, where images are usually
embedded in relatively long texts. The collections
also incorporate additional visual resources—
visual concepts in 2008 and several image fea-
tures in all four editions—to encourage partici-
pation from groups that specialize in text
retrieval.
Topic Development
Topics were developed in order to respond to
diverse multimedia information needs. The
participants were provided with the topic
title and image examples, while the assessors
were also given an unambiguous description
(narrative) of the type of relevant and irrele-
vant results. There were 75 topics in 2008, 45
in 2009, 70 in 2010, and 50 in 2011.
Topic creation was collaborative in 2008 and
2009, with the participants proposing topics
from which the organizers selected a final list.
Participation in topic creation was mandatory
in 2008 and optional in 2009. In 2010 and
2011, the task organizers selected the topics
after performing a statistical analysis of image
search engine logs: in 2010, we analyzed the queries logged by the Belga News Agency image search portal (www.belga.be), and in 2011, those logged by Exalead (www.exalead.com/search). Mean topic
length varied between 2.64 and 3.10 words per
topic (see Table 1) and was similar to that of
standard Web image queries. We found that
topic creation using log files identified more re-
alistic topics compared to collaborative topic
creation because the resulting topics are closer
to those most interesting to a general user
population.
Following the collections’ structure, only
English topics were provided in 2008 and
2009, while English, German, and French
topics were proposed in 2010 and 2011. Be-
cause we wanted to compare the performance
of the approaches in different languages, the
German and French topics were translated from
English. However, the number of submissions
using French- and German-only topics was rel-
atively low.
To achieve a balanced distribution of topic
difficulty, we ran the topics through the Cross
Modal Search Engine (CMSE, http://dolphin.
unige.ch/cmse) and selected the topic set to
include
• topics with differing numbers of relevant images (as found by the CMSE) and
• a mixture of topics for which the CMSE produced better results when using textual rather than visual queries and vice versa.
Difficult topics usually convey complex semantics (such as "model train scenery"), whereas easy topics have a clearly defined conceptual focus (such as "colored Volkswagen Beetles"). We provided image examples for
each topic to support the investigation of mul-
timodal approaches. To encourage multimo-
dal approaches, we increased the number of
example images in 2011 (to 4.84 versus 1.68,
1.7, and 0.61 in previous years), which allowed
participants to build more detailed visual
query models.
Relevance Assessments
Given the complexity of evaluation with a multilevel relevance scale, binary relevance (relevant versus nonrelevant) is assumed in the Wikipedia image-retrieval task. Each year, the retrieved images
contained in the runs submitted by the partic-
ipants were pooled together using a pool
depth of 100 in 2008, 2010, and 2011, and a
pool depth of 50 in 2009. As Table 1 indicates,
the average pool size per topic varied over the
years, even for the same pool depth. It was
larger in 2010 and 2011 than in 2008 due to
many more runs being submitted and thus
contributing more unique images to the
pools. Also, because the later collection was substantially larger, the runs may have retrieved more diverse images. There was, however, a significant decrease in pool sizes and numbers of relevant images from 2010 to 2011. This might have been due to
having many more topics with named entities.
That is, although such topics are highly repre-
sentative of real Web image searches, they are
not covered in the Wikipedia corpus to the
same extent as in the general Web. It could
also be due to higher numbers of more specific
topics that were introduced to ease the assess-
ment. Furthermore, there appears to be a con-
vergence among the approaches of the various
participants, many of whom shared system
components and were able to refine their sys-
tems by emulating effective methods intro-
duced by fellow participants and interacting
with the ImageCLEF community.
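The pooling step itself is simple to state; the sketch below shows its core in Python, assuming a hypothetical in-memory representation of the submitted runs (topic IDs mapped to ranked lists of image IDs) rather than the actual run-file format used by the task.

```python
from collections import defaultdict

def build_pools(runs, depth=100):
    """Union of the top-`depth` results of every submitted run, per topic.

    `runs` is a list of dicts mapping topic_id -> ranked list of image IDs
    (a hypothetical in-memory view of the submitted run files).
    Returns topic_id -> set of image IDs to be judged.
    """
    pools = defaultdict(set)
    for run in runs:
        for topic_id, ranking in run.items():
            pools[topic_id].update(ranking[:depth])
    return dict(pools)

# Toy usage: with depth=2, the pool for topic01 contains img5, img9, and img7.
run_a = {"topic01": ["img5", "img9", "img2"]}
run_b = {"topic01": ["img9", "img7", "img1"]}
print(build_pools([run_a, run_b], depth=2))
```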
During the first three years, volunteer task
participants and the organizers performed the
assessment during a four-week period after the
runs were submitted using the Web-based inter-
face previously used in the TREC Enterprise
track: 13 groups participated in 2008, seven
groups in 2009, and six groups in 2010.
Table 1. Wikipedia task and collections statistics.

Statistics                          2008      2009      2010      2011
Number of images in collection   151,519   151,519   237,434   237,434
Number of topics                      75        45        70        50
Number of words/topic               2.64       2.7       2.7       3.1
Number of images/topic              0.61       1.7      1.68      4.84
Participants                          12         8        13        11
Runs in pool                          74        57       127       110
  Textual runs                        36        26        48        51
  Visual runs                          5         2         7         2
  Multimodal runs                     33        29        72        57
Pool depth                           100        50       100       100
Average pool size/topic            1,290       545     2,659     1,467
Minimum pool size/topic              753       299     1,421       764
Maximum pool size/topic            1,850       802     3,850     2,327
Number of relevant images/topic     74.6      36.0    252.25      68.8
To ensure consistency in the assessments, a sin-
gle person assessed the pooled images for each
topic. We also made an effort to ensure that
the participants who both created topics and
volunteered as assessors were assigned the
topics they had created; in 2008, this was
achieved in 76 percent of the assignments.
As a result of a continuous drop in the num-
ber of volunteer assessors, we adopted a crowd-
sourcing approach in 2011 using CrowdFlower
(http://crowdflower.com), a general-purpose
platform for managing crowdsourcing tasks
and ensuring high-quality responses. The
assessments were carried out by Amazon Me-
chanical Turk (http://www.mturk.com) workers.
Each worker assignment involved the assess-
ment of five images for a single topic. To pre-
vent spammers, each assignment contained one "gold standard" image among the five images, an image already correctly labeled,
to estimate the workers’ accuracy. If a worker’s
accuracy dropped below a threshold (70 per-
cent), his or her assessments were excluded.
To further ensure accurate responses, each
image was assessed by three workers with the
final assessment obtained through majority
vote.
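The quality-control logic just described can be sketched as follows. The data structures, the gold-label handling, and the accuracy check against the 70 percent threshold are our illustrative reconstruction; the actual checks were performed by the CrowdFlower platform.

```python
from collections import defaultdict

def filter_and_vote(assignments, gold, accuracy_threshold=0.70):
    """Discard low-accuracy workers, then majority-vote the remaining judgments.

    `assignments`: list of (worker_id, image_id, label) tuples with label True/False;
    `gold`: dict image_id -> correct label for the embedded gold-standard images.
    Every assignment is assumed to contain one gold-standard image, as in the task.
    """
    # Estimate each worker's accuracy on the gold-standard images.
    correct, seen = defaultdict(int), defaultdict(int)
    for worker, image, label in assignments:
        if image in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[image])
    trusted = {w for w in seen if correct[w] / seen[w] >= accuracy_threshold}

    # Majority vote over the judgments of trusted workers, per image.
    votes = defaultdict(list)
    for worker, image, label in assignments:
        if worker in trusted and image not in gold:
            votes[image].append(label)
    return {image: sum(v) > len(v) / 2 for image, v in votes.items()}
```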
On average, 26 distinct workers assessed
each topic, with each assessing approximately
200 images. A total of 379 distinct workers
were employed over all topics. Although this
approach risks obtaining inconsistent results for
the same topic, an inspection of the results did
not reveal such issues. The time required per
topic was 27 minutes on average, which made
it possible to complete the ground truth cre-
ation within a few hours, a marked difference from our experience in previous years.
Evaluation Measures
The effectiveness of the submitted runs was
evaluated using the following measures:
mean average precision (MAP), precision at
fixed rank position (P@n, n = 10, 20), and R-
precision (precision at rank position R, where
R is the number of relevant documents).1 Dif-
ferent measures reflect different priorities in
the simulation of an operational setting and
evaluate different aspects of retrieval effective-
ness. For example, MAP focuses on the overall
quality of the entire ranking and on the im-
portance of locating as many relevant items
as possible, whereas P@10 emphasizes the
quality at the top of the ranking and ignores
the rest. We selected MAP as the main evalua-
tion measure for ranking the participants’ sub-
missions given its higher inherent stability,13
informativeness, and discriminative power.14
Furthermore, research has shown that it can
be a better measure to employ even if users
only expect a retrieval system to find a few rel-
evant items among the top ranked ones.15
Finally, given that relevance assessments in
retrieval benchmarks are incomplete due to
pooling, evaluation measures need to account
for unjudged documents. The measures described above treat unjudged documents as nonrelevant.
This might affect the measures’ robustness in
cases of substantially incomplete relevance
assessments.8 To this end, we also use binary
preference (BPref),8 a measure devised to better
assess retrieval effectiveness when large num-
bers of unjudged documents exist.
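For reference, the two measures that the analysis below relies on most can be computed as in this sketch, which assumes binary relevance and the hypothetical data layout of the earlier pooling example; the BPref function follows the Buckley and Voorhees formulation cited above.

```python
def average_precision(ranking, relevant):
    """AP of one ranked list; unjudged and judged-nonrelevant documents both count as misses."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def bpref(ranking, relevant, judged_nonrelevant):
    """Binary preference: relevant documents are penalized only by judged nonrelevant ones."""
    R, N = len(relevant), len(judged_nonrelevant)
    if R == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for doc in ranking:
        if doc in relevant:
            penalty = min(nonrel_seen, R) / min(R, N) if N else 0.0
            score += 1.0 - penalty
        elif doc in judged_nonrelevant:
            nonrel_seen += 1
    return score / R

def mean_average_precision(run, qrels):
    """MAP: mean of per-topic AP; `run` and `qrels` map topic IDs to rankings / relevant sets."""
    return sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)
```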
Analyzing Ranking Stability
We now investigate the reliability and reus-
ability of the test collections built during the
four years the Wikipedia task ran as part of
ImageCLEF, so as to outline some general
best practice guidelines for building robust
large-scale image (multimedia) retrieval bench-
marks with minimum effort. To this end, we
explore several questions:
• Given the highly variable effectiveness of retrieval approaches across topics, what is the minimum number of topics needed to achieve reliable comparisons between different systems for various evaluation measures?
• What is the minimum pool depth necessary to reliably evaluate image-retrieval systems for various evaluation measures?
• How reusable are these test collections for image-retrieval systems that did not participate in the evaluation campaigns and thus did not contribute to the pools?
Although researchers have examined such
questions for textual test collections built
in the context of TREC7-10 and the Initiative
for the Evaluation of XML Retrieval (INEX)11
evaluation campaigns, they have not been
addressed for multimedia datasets.
Different Topic Set Sizes
Given that retrieval approaches have a highly
variable effectiveness across topics, a good ex-
perimental design requires a sufficient number
of topics to help determine whether one re-
trieval approach is better than another. To in-
vestigate the ranking stability of the test
collections built during the four years of the
task, we applied the methodology proposed
by Ellen Voorhees and Chris Buckley,7 which
empirically derives a relationship between
the topic set size, the observed difference be-
tween the values of two runs for an evaluation
measure, and the confidence that can be
placed in the conclusion that one run is better
than another. In particular, their method
derives error rates that quantify the likelihood
that a different set of topics would lead to dif-
ferent conclusions regarding the relative effec-
tiveness of two runs.
In this method, for N topics in each year, two disjoint topic sets of equal size M (M ≤ N/2) are randomly drawn. The runs submitted that year are evaluated against each of these topic sets separately and ranked based on an evaluation measure (such as MAP). Then, the number of pairwise swaps between the two rankings is counted, and the probability of a swap (error rate) is expressed as the frequency of observed swaps out of all possible pairwise swaps. This process is further refined by categorizing the runs (and their swaps) into one of 11 bins based on the difference between the values of the runs for the given evaluation measure: the first bin contains runs with a difference of less than 0.01; the next bin contains runs with a difference of at least 0.01 but less than 0.02; and the last contains all runs with a difference of at least 0.1. The error rate is then computed for every such bin and by varying the topic set size M = 1, ..., ⌊N/2⌋; the experiment is repeated 50 times for each M. These error rates can be directly computed as a function of topic set sizes up to ⌊N/2⌋ and then extrapolated to the full set of topics N using a fitted exponential model.
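The following is a compact sketch of that procedure, assuming the per-topic average precision of every run is already available in memory; the 0.01-wide bins, the 50 trials, and the swap count mirror the description above, but the code is our own reconstruction rather than the original scripts.

```python
import random
from itertools import combinations

def swap_error_rates(per_topic_ap, M, trials=50, n_bins=11):
    """Simplified Voorhees-Buckley swap-rate estimate.

    `per_topic_ap`: dict run_id -> dict topic_id -> average precision.
    For each trial, two disjoint topic sets of size M are drawn; every run pair is
    binned by the absolute MAP difference observed on the first set (0.01-wide bins,
    last bin open-ended), and a swap is counted when the second set orders the pair
    the other way. Returns the per-bin error rates.
    """
    topics = sorted(next(iter(per_topic_ap.values())))
    swaps, totals = [0] * n_bins, [0] * n_bins

    def mean_ap(run, subset):
        return sum(per_topic_ap[run][t] for t in subset) / len(subset)

    for _ in range(trials):
        sample = random.sample(topics, 2 * M)          # requires 2*M <= number of topics
        set_a, set_b = sample[:M], sample[M:]
        for r1, r2 in combinations(per_topic_ap, 2):
            diff_a = mean_ap(r1, set_a) - mean_ap(r2, set_a)
            diff_b = mean_ap(r1, set_b) - mean_ap(r2, set_b)
            bin_idx = min(int(abs(diff_a) * 100), n_bins - 1)
            totals[bin_idx] += 1
            swaps[bin_idx] += int(diff_a * diff_b < 0)  # opposite orderings = a swap
    return [s / t if t else 0.0 for s, t in zip(swaps, totals)]
```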
Figure 1 plots the error rate for 2011 against the topic set size for each of the 11 bins representing the difference between MAP scores. Figure 1a shows the calculated values for M = 1, ..., 25, and Figure 1b the extrapolated curves up to 50 topics. For instance, the error rate for a difference between 0.03 and 0.04 in the MAP scores of two runs when 25 topics are used is approximately 20 percent. This error rate drops significantly when more topics are used.
Figure 1. Error rate on disjoint topic sets for 2011. (a) Error rate calculation for one to 25 topics. (b) Error rate extrapolation for up to 50 topics. The legend for each curve indicates that the absolute difference in mean average precision (MAP) scores is at least the given value and less than the next curve's value.

An interpolation of the error rates gives the absolute differences required in MAP or BPref values for having a given error rate (such as 5 percent) using 50 topics. This can be interpreted as a significance level that assures one run is consistently better than another, independent of the evaluated topic set. Table 2 reports these differences for an error rate of 5 percent for the years 2008 to 2011, calculated by applying the Voorhees and Buckley method for three different types of topic sets of varying size M. The differences for BPref are always
higher than for MAP, which means that BPref
is somewhat less robust to changes in the
topic set. The values for 2009 and 2011 are
lower than for 2008 and 2010, despite having
both fewer topics and fewer relevant docu-
ments per topic on average (see Table 1).
Forty to 50 topics achieve stable rankings inde-
pendent of the topic set, for the given pool
depth. However, only a difference of at least
0.05 in MAP or BPref guarantees that one run
is consistently better than another—that is,
with an error rate of less than 5 percent.
Incomplete Judgments
For large-scale collections, pooling is applied
to the top D documents of each topic and
run to determine which documents to assess,
resulting in incomplete judgments. This leads
to several questions: How do we choose the
pool depth D and evaluation measure to en-
sure stable rankings even though some of the
relevant documents have not been judged?
Can the ground truth derived from pooling
achieve a fair ranking for runs that did not
contribute to the pool?
Unbiased, Incomplete Judgments. First, we
investigated how the decrease of pool depth D
influences the ranking stability for different
evaluation measures. Pools reduced in this way are said to be unbiased because all runs still contribute to them equally.
For each year, we generated different rank-
ings of the runs by considering the judged docu-
ments in pools of depths D = 5, ..., 100. We
compared these rankings to the original rank-
ing of the runs generated by considering all
the judged documents available for that
year—that is, at pool depth 100 for 2008,
2010, and 2011 and at 50 for 2009. We then
measured the correlation between the original
and the generated rankings of runs using Ken-
dall tau, a measure proportional to the number
of pairwise adjacent swaps needed to convert
one ranking into another. We also used the av-
erage precision (AP) tau,16 a correlation coeffi-
cient that gives more weight to errors at the
top end of rankings.
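A sketch of this comparison follows, reusing the hypothetical run and qrels layouts from the earlier examples; the scoring function is passed in so that MAP or BPref can be plugged in.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same run IDs (no ties assumed)."""
    pos_a = {r: i for i, r in enumerate(order_a)}
    pos_b = {r: i for i, r in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        concordant += int(agreement > 0)
        discordant += int(agreement < 0)
    n_pairs = len(order_a) * (len(order_a) - 1) / 2
    return (concordant - discordant) / n_pairs

def ranking_at_depth(runs, qrels, depth, score_fn):
    """Rank runs using only the relevant documents that fall inside depth-`depth` pools.

    `runs`: run_id -> {topic -> ranked doc list}; `qrels`: topic -> set of relevant docs
    judged at the full pool depth; `score_fn(run, qrels)` returns, e.g., MAP.
    """
    pools = {t: set() for t in qrels}
    for run in runs.values():
        for topic, ranking in run.items():
            pools[topic].update(ranking[:depth])
    reduced_qrels = {t: rel & pools[t] for t, rel in qrels.items()}
    return sorted(runs, key=lambda r: score_fn(runs[r], reduced_qrels), reverse=True)
```

Comparing `ranking_at_depth(runs, qrels, 50, score_fn)` against the ranking at the full depth with `kendall_tau` would reproduce one point of Figure 2 under these assumptions.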
Figure 2 shows the correlation for MAP and
BPref over pool depth D for each year. BPref
was introduced to cope with incomplete judg-
ments, yet surprisingly it is less stable than
MAP for 2008 and 2010, but it performs as
well as MAP for the other years. Generally, at
pool depth D = 50, a correlation of approxi-
mately 0.96 or higher is observed for both
MAP and BPref. Therefore, we can conclude
that a pool depth of 50 suffices to produce sta-
ble rankings for the given topic set sizes and rel-
atively few relevant documents per topic.
Table 3 gives an overview of the pool sizes
and numbers of relevant documents over all
topics that are found for pools of depth 50
and 100. At D = 50 the pool size is halved
and still three-fourths of the relevant docu-
ments are found.
Next, we investigated the use of variable per-
topic pool depth following the methodology
proposed by Justin Zobel.9 First, we created and
assessed a pool of D = 30. Then, we estimated
the number of relevant documents that can be
identified by deepening the pool for each topic
from the rate of new arrivals at increased pool
depth. The last rows in Table 3 show that we
can reduce the pool size compared to pools
with D = 50 while retaining the number of rel-
evant documents identified. This variable-
depth pooling approach achieves a correlation
of approximately 0.96 with the original ranking
and thus performs similarly to pooling with
D = 50 (results not shown). Zobel explained
that the documents have to be judged in
pool-depth order, which might bias assessors,9
but this is not a problem when we apply
crowdsourcing.
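In that spirit, the per-topic decision can be approximated roughly as below; the stratum width, the extrapolation from the deepest judged stratum, and the gain threshold are our own illustrative simplifications rather than Zobel's exact procedure.

```python
def predicted_extra_relevant(new_relevant_per_stratum, extra_depth, stratum_size=10):
    """Extrapolate from the arrival rate of relevant images in the deepest judged stratum.

    `new_relevant_per_stratum`: newly found relevant images in each judged pool stratum
    (e.g., ranks 1-10, 11-20, 21-30 for an initial pool of depth 30).
    """
    if not new_relevant_per_stratum:
        return 0.0
    recent_rate = new_relevant_per_stratum[-1] / stratum_size
    return recent_rate * extra_depth

def topics_worth_deepening(strata_per_topic, extra_depth=20, min_gain=5):
    """Deepen only the pools whose predicted gain is at least `min_gain` relevant images."""
    return [topic for topic, strata in strata_per_topic.items()
            if predicted_extra_relevant(strata, extra_depth) >= min_gain]
```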
Table 2. Absolute differences in MAP and BPref scores required to have a 5 percent error rate for each year.*

Statistics                            2008     2009     2010     2011
Number of topics (N)                    75       45       70       50
Pool depth                             100       50      100      100
1 to ⌊N/2⌋ topics
  MAP                               0.0511   0.0355   0.0493   0.0383
  BPref                             0.0572   0.0444   0.0588   0.0402
1 to 25 topics
  MAP                               0.0505        -   0.0505   0.0383
  BPref                             0.0598        -   0.0557   0.0402
1 to 25 topics with fewest relevant
  MAP                               0.0451        -   0.0518   0.0383
  BPref                             0.0574        -   0.0570   0.0402

* Error rates are estimated for three topic subsets and are extrapolated to topic sets of size 50.
Biased, Incomplete Judgments. Next, we
investigated the ranking stability when the
runs of one group are excluded from the pool
creation. Table 4 lists the average results of
the difference in the pool sizes, the difference
in the number of relevant documents
identified, the correlation (Kendall tau) of
MAP and BPref of the new ranking compared
to the original, and the differences in the rank-
ings of the excluded runs (average and maxi-
mum change in the rank of an excluded run,
in both directions).
Figure 2. Correlation measured with Kendall tau and average precision (AP) tau for MAP and BPref over pool depths of five to 100 for each year: (a) 2008, (b) 2009, (c) 2010, (d) 2011.
Table 3. Pool sizes and number of relevant documents found for different pools.

Statistics               2008          2009            2010           2011
Number of topics           75            45              70             50
Pool depth                100            50             100            100
Pool depth 100
  Pool size            97,396             -         186,104         73,346
  Number relevant       5,593             -          17,659          3,440
Pool depth 50
  Pool size      59,012 (61%)        24,272   102,540 (55%)   38,272 (52%)
  Number relevant 4,302 (77%)         1,622    12,762 (72%)    2,698 (78%)
Variable pool depth
  Pool size      41,710 (42%)             -    91,159 (49%)   29,202 (40%)
  Number relevant 4,098 (73%)             -    12,676 (72%)    2,511 (72%)
The correlation coefficients indicate a stable
overall performance for 2011, 2010, and 2008.
We can see a significantly lower correlation for
2009, which could be due to the smaller pool
depth. MAP and BPref perform again similarly
except for 2009, where BPref is more stable.
MAP tends to decrease the ranks of excluded runs, whereas BPref tends to increase them.
This behavior, which researchers also observed
in earlier work,10 indicates that MAP might un-
derestimate the system performance for biased
judgments, whereas BPref might overestimate it.
These results show that the Wikipedia bench-
marking data can fairly rank most runs that
do not contribute to the pool even though a
few single runs might be grossly misjudged.
Nonetheless, Table 5 shows that the reus-
ability of a collection’s data has limits. For
this experiment, we first pooled the textual runs only and then the multimodal runs only, investigating the ranking stability with respect to a significantly novel retrieval approach that did not contribute to the pools. The pool sizes and numbers of relevant documents identified decreased drastically, and the decrease was largest when the multimodal runs were excluded. The correlation
coefficients and the rank differences display the
performance instability for each year. The most
unstable setup is 2009, and the least is 2011,
which is likely influenced by the number of
runs that contributed to the pools and the
pool depths. This indicates that a ground
truth created using numerous runs based on
various approaches will be able to fairly rank
new approaches.
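Both the leave-one-group-out and the leave-one-modality-out analyses reduce to rebuilding the pool without the held-out runs and intersecting the original judgments with it, roughly as sketched below (same hypothetical data layouts as in the earlier sketches; the grouping key is illustrative).

```python
def qrels_without_group(runs_by_group, qrels, excluded_group, depth=100):
    """Judgments as they would look if one group (or modality) had not contributed runs.

    `runs_by_group`: group_id -> list of runs, each run a dict topic -> ranked doc list;
    `qrels`: topic -> set of relevant docs found in the original, full pool.
    """
    pools = {t: set() for t in qrels}
    for group, runs in runs_by_group.items():
        if group == excluded_group:
            continue
        for run in runs:
            for topic, ranking in run.items():
                pools[topic].update(ranking[:depth])
    # Relevant documents contributed only by the excluded runs disappear from the qrels.
    return {t: rel & pools[t] for t, rel in qrels.items()}
```

The held-out runs are then rescored against the reduced judgments, and their rank shifts relative to the original ranking yield figures of the kind reported in Tables 4 and 5.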
Conclusions and Lessons Learned
Any effort to build test collections for similar
image-retrieval tasks should consider several
issues. For example, copyright issues must be
taken into account if the collection is to be
freely distributed to the community. The
same goes for any additional resources, such
as example images added to the topics. Fur-
thermore, all collection users should be fully
aware and respectful of the images’ original li-
cense terms. Before distributing a collection, it
is also important to perform multiple checks
on image integrity so as to make the collection
easy for users to process.
It has become best practice to use image
search engine logs to select candidate topics
to simulate real-world users’ information
needs. Nevertheless, topic sets should contain
topics of varying difficulty (such as textual
and visual multimedia queries) that also have
a varying number of relevant documents in the
collection (broad versus narrow). A baseline re-
trieval system that gives an overview of the col-
lection content can help test the latter. When
the assessment uses pooling, the number of rele-
vant documents should not be too large (no
more than 100, for example) to minimize the
number of unjudged relevant documents that
Table 4. Average results when runs of one group are excluded from pool creation.

Statistics                                   2008      2009      2010      2011
Difference in pool size                     7,548     2,172    10,153     4,396
Difference in number of relevant              239      80.7       538      97.5
Tau MAP                                     0.991     0.939     0.987     0.993
Tau BPref                                   0.977     0.946     0.985     0.989
Average rank difference MAP                 -1.96    -3.183      -2.3     -0.85
Maximum rank difference MAP (up/down)      1↑/19↓    3↑/20↓    3↑/14↓     0↑/6↓
Average rank difference BPref                2.58      2.24      3.30      2.48
Maximum rank difference BPref (up/down)    68↑/2↓    23↑/1↓    38↑/3↓    15↑/0↓
Table 5. Average results when textual or multimodal runs are excluded from pool creation.

Statistics                                     2009      2010      2011
Number of textual/multimodal runs             33/35     42/62     51/52
Difference in pool size                      12,759    95,411    34,134
Difference in number of relevant                443     4,387       720
Tau MAP                                       0.739     0.958     0.959
Tau BPref                                     0.832     0.938     0.942
Absolute average rank difference MAP           6.31      2.32      2.28
Maximum rank difference MAP (up/down)       10↑/27↓    7↑/16↓    5↑/14↓
Absolute average rank difference BPref         3.33      2.74      2.67
Maximum rank difference BPref (up/down)     29↑/11↓    43↑/9↓    18↑/8↓
influence the ranking stability. Our analysis
showed that 40 to 50 topics achieved stable rank-
ings independent of the topic set, where a differ-
ence of at least 0.05 in MAP or BPref indicates
that one run is consistently better than another.
Our experience with using crowdsourcing
for the 2011 assessment was positive. Some
manual work was necessary to create the gold
standard, but after that, the automatic assess-
ment was done quickly and accurately when
more than one assessor was assigned to each
topic. Our analysis showed that a pool depth
of 50 suffices to produce stable rankings
if the topics do not have too many relevant
documents in the collection. Smaller pools
that still identify many relevant documents
can be achieved using variable-depth pooling,9
which works well with crowdsourcing. Our
analysis also indicates that ground truth should
be created by pooling a large number of runs
based on various approaches so as to be able
to fairly rank new approaches.
The test collections constructed in the con-
text of the Wikipedia image-retrieval task at
ImageCLEF are available at www.imageclef.
org/wikidata. These resources are the result of
the collaboration of a large number of partici-
pating research groups that, through forming
a community with multidisciplinary compe-
tencies and sharing expertise, have contrib-
uted to the advancement of image-retrieval
research. MM
Acknowledgments
Theodora Tsikrika was supported by the Euro-
pean Union in the context of the Promise
(contract 258191) and Chorus+ (contract
249008) FP7 projects. Jana Kludas was funded
by the Swiss National Fund (SNF). Adrian
Popescu was supported by the French ANR
(Agence Nationale de la Recherche) via the
PERIPLUS project (ANR-10-CORD-026).
References
1. M. Sanderson, "Test Collection Based Evaluation of Information Retrieval Systems," Foundations and Trends in Information Retrieval, vol. 4, no. 4, 2010, pp. 247-375.
2. H. Müller et al., eds., ImageCLEF: Experimental Evaluation in Visual Information Retrieval, Springer, 2010.
3. T. Tsikrika and J. Kludas, "Overview of the Wikipedia MM Task at ImageCLEF 2008," Evaluating Systems for Multilingual and Multimodal Information Access: Proc. 9th Workshop of the Cross-Language Evaluation Forum (CLEF 2008), Revised Selected Papers, LNCS, Springer, 2009, pp. 539-550.
4. T. Tsikrika and J. Kludas, "Overview of the Wikipedia MM Task at ImageCLEF 2009," Multilingual Information Access Evaluation Vol. II Multimedia Experiments: Proc. 10th Workshop of the Cross-Language Evaluation Forum (CLEF 2009), Revised Selected Papers, LNCS, Springer, 2010, pp. 60-71.
5. A. Popescu, T. Tsikrika, and J. Kludas, "Overview of the Wikipedia Retrieval Task at ImageCLEF 2010," Working Notes for the CLEF 2010 Workshop, 2010; http://clef2010.org/resources/proceedings/clef2010labs_submission_124.pdf.
6. T. Tsikrika, A. Popescu, and J. Kludas, "Overview of the Wikipedia Image Retrieval Task at ImageCLEF 2011," Working Notes for the CLEF 2011 Labs and Workshop, 2011; http://clef2011.org/resources/proceedings/Overview_ImageCLEF_Wikipeida_Retrieval_Clef2011.pdf.
7. E.M. Voorhees and C. Buckley, "The Effect of Topic Set Size on Retrieval Experiment Error," Proc. 25th ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2002, pp. 316-323.
8. C. Buckley and E.M. Voorhees, "Retrieval Evaluation with Incomplete Information," Proc. 27th ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2004, pp. 25-32.
9. J. Zobel, "How Reliable Are the Results of Large-Scale Information Retrieval Experiments?" Proc. 21st ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 1998, pp. 307-314.
10. S. Büttcher et al., "Reliable Information Retrieval Evaluation with Incomplete and Biased Judgments," Proc. 30th ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2007, pp. 63-70.
11. S. Pal, M. Mitra, and J. Kamps, "Evaluation Effort, Reliability and Reusability in XML Retrieval," J. Am. Soc. for Information Science and Technology, vol. 62, no. 2, 2011, pp. 375-394.
12. T. Westerveld and R. van Zwol, "The INEX 2006 Multimedia Track," Advances in XML Information Retrieval: Proc. 5th Int'l Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2006), Revised Selected Papers, LNCS, Springer, 2007, pp. 331-344.
13. C. Buckley and E.M. Voorhees, "Evaluating Evaluation Measure Stability," Proc. 23rd ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2000, pp. 33-40.
14. J.A. Aslam, E. Yilmaz, and V. Pavlu, "The Maximum Entropy Method for Analyzing Retrieval Measures," Proc. 28th ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2005, pp. 27-34.
15. E. Yilmaz and S. Robertson, "On the Choice of Effectiveness Measures for Learning to Rank," Information Retrieval, vol. 13, no. 3, 2010, pp. 271-290.
16. E. Yilmaz, J.A. Aslam, and S. Robertson, "A New Rank Correlation Coefficient for Information Retrieval," Proc. 31st ACM SIGIR Conf. Research and Development in Information Retrieval, ACM Press, 2008, pp. 587-594.
Theodora Tsikrika is a post-doctoral researcher at the
University of Applied Sciences Western Switzerland.
Her research interests focus on multimedia informa-
tion retrieval and on the combination of statistical,
social, and semantic evidence for such search pro-
cesses and interactions. Tsikrika has a PhD in com-
puter science from Queen Mary, University of
London. Contact her at [email protected].
Jana Kludas is a post-doctoral researcher in bioinfor-
matics at Aalto University, Finland. Her research
interests include machine learning, data mining,
and computer vision with a focus on information fu-
sion in all its forms and application areas. Kludas has
a PhD in computer science from the University of
Geneva, Switzerland. Contact her at jana.kludas@unige.ch.
Adrian Popescu is a post-doctoral researcher with
CEA, LIST, France. His research interests include in-
formation extraction, image retrieval, and social net-
works. Popescu has a PhD in computer science from
Telecom Bretagne, France. Contact him at adrian.