Experimental evaluation in visual information retrieval

Paul Clough

Information School
University of Sheffield (UK)

Second Spanish Conference on Information Retrieval (CERI 2012) Valencia 18-19 June 2012

Areas of research

• Text re-use and plagiarism detection
• Multilingual information access
• Geographical Information Retrieval (GIR)
• Multimedia retrieval (images)
• Evaluation of IR systems
• User interfaces and interaction
• Construction of corpora and evaluation resources

http://ir.shef.ac.uk/cloughie/


Contents

• Part 1 - Evaluating (V)IR systems
  – Visual IR systems
  – The evaluation landscape

• Part 2 - ImageCLEF for VIR evaluation
  – Overview of ImageCLEF
  – Example tasks
  – Main findings and lessons learned

• Part 3 - Addressing some issues in IR evaluation
  – Crowdsourcing for gathering relevance assessments
  – System performance measures and user satisfaction
  – Evaluating beyond single query-response paradigm


Evaluating (V)IR systems


Visual information retrieval

• Visual information retrieval (VIR)
  – Users want to retrieve visual documents rather than texts
  – Take into account visual properties of the data

• Example use cases for VIR systems
  – Researcher searching digital archives
  – Clinicians searching for medical images (e.g. x-rays)
  – Illustrator looking for example photograph
  – Organisations checking for trademark infringements
  – Professionals accessing science databases (e.g. medicine, astronomy, geography)
  – Domestic users browsing their personal photo collections


Retrieval methods

• Description-based
  – Using abstracted features assigned to the image, e.g. metadata, captions, keywords, associated text
  – Often assigned manually, although can be automatic (e.g. object recognition and classification)

• Content-based (CBIR)
  – Using primitive features based on pixels which form the visual content of the image, e.g. colour, shapes, textures
  – Extracted automatically from image

• Combinations of both approaches
  – Investigating fusion of multiple modalities was an objective of ImageCLEF
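As a rough illustration of content-based retrieval with a primitive colour feature, the sketch below builds a coarse RGB histogram per image and ranks images by histogram intersection. The image representation (lists of RGB tuples), the bin count and all function names are assumptions made for illustration; they are not taken from any ImageCLEF system.

```python
from collections import Counter

def colour_histogram(pixels, bins_per_channel=4):
    """Quantise RGB pixels (0-255 per channel) into a normalised histogram."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_id: n / total for bin_id, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of per-bin minima of two normalised histograms."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

def rank_by_colour(query_pixels, collection):
    """Rank image ids in `collection` (id -> pixel list) by similarity to the query."""
    q = colour_histogram(query_pixels)
    scores = {
        img: histogram_intersection(q, colour_histogram(px))
        for img, px in collection.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): a mostly red query against a reddish and a blue image.
red = [(250, 10, 10)] * 8 + [(10, 10, 250)] * 2
blue = [(10, 10, 250)] * 10
query = [(240, 20, 20)] * 10
ranking = rank_by_colour(query, {"red.jpg": red, "blue.jpg": blue})
print(ranking)
```

Real CBIR systems typically combine several such features (colour, shape, texture); a single global histogram is only the simplest case.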


Evaluating IR systems

• Evaluation is systematic determination of merit of something using criteria against a set of standards [Harman, 2011]
• Evaluation is important for designing and developing effective search systems (effective, efficient and usable)
• Focus of evaluation will vary
  – System (i.e. with little or no user involvement)
  – User and user interaction with system
  – User-system interaction with environment
• Traditionally there has been a strong focus on measuring system effectiveness in controlled lab setting
  – Abstraction of reality (e.g. information need to query)
  – Comparative testing of systems (e.g. A vs. B: which is better?)
  – Does not account for contextual and situational factors (user's background and preferences, search task…)



• IR systems are ultimately to be used by people for some purpose and operating in an environment
• What makes an IR system successful?
  – Whether it retrieves 'relevant' documents
  – How quickly it returns results
  – How well it supports user interaction
  – Whether the user is satisfied with the results
  – How easily users can use the system
  – Whether the system helps users carry out tasks
  – Whether system impacts on the wider environment
• Multiple evaluation methods and measures will be used throughout IR system development
  – Evaluate components (e.g. IR system) vs. overall system


Example: evaluating library catalogues

• Evaluation consultant for Search25 project in UK, which is building a federated academic library catalogue search tool
• Initial study of user needs and behaviours using online questionnaire with users of current system (179 responses)

How important are the following factors when using academic library catalogues? (rate on a scale of 1-5, where 1 is not important, and 5 is very important)


Evaluating IR systems

• Evaluation of retrieval systems tends to focus on either the system (algorithms) or the user
• Saracevic (1995) distinguishes six levels of evaluation for information systems that include IR systems

[Figure: Saracevic's six levels of evaluation]
Levels: Engineering; Input; Processing; Output; Use and user; Social
Annotations:
  – IR evaluation: focus of IR evaluation mainly here (test collections and batch-mode lab-style evaluation); ImageCLEF evaluations focused here
  – Interactive IR (IIR) evaluation / Human information behaviour

Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, United States, July 09-13, 1995): SIGIR '95 (pp. 138-146). New York, ACM Press

Evaluation landscape [Kelly, 2009]

[Figure: the evaluation landscape. Labels: Lab-based; Simulated users; Controlled variables; In situ (living lab); Real users, needs and situations; Uncontrolled variables; Inform; Predict]

IR test collections

• Test collections provide re-usable resources to evaluate IR systems in controlled lab setting ('Cranfield style' test collection)
  – Collection of documents
  – Set of representative queries (topics)
  – Set of relevance judgments for each topic
  – Evaluation measures (system performance)
• Comparative system evaluation
• Test collection + measures provides simulation of user in operational setting (if designed carefully)
  – Do results obtained with test collections predict user task success or performance / satisfaction with results?
  – But what about beyond query-response paradigm?
  – How do you integrate contextual and situational factors?
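To make the components of a Cranfield-style test collection concrete, here is a minimal sketch of comparative evaluation: topics, relevance judgments (qrels), two system runs and one measure (mean P@k). The data, document ids and function names are hypothetical and chosen only for illustration.

```python
from statistics import mean

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def evaluate(run, qrels, k=10):
    """Mean P@k over all topics in qrels; topics missing from the run score 0."""
    return mean(
        precision_at_k(run.get(topic, []), relevant, k)
        for topic, relevant in qrels.items()
    )

# Hypothetical toy collection: two topics, judgments stored as sets of doc ids.
qrels = {"t1": {"d1", "d2", "d3"}, "t2": {"d7"}}
system_a = {"t1": ["d1", "d2", "d9", "d3"], "t2": ["d7", "d8"]}
system_b = {"t1": ["d9", "d1", "d8", "d4"], "t2": ["d8", "d6"]}

scores = {
    name: evaluate(run, qrels, k=4)
    for name, run in [("A", system_a), ("B", system_b)]
}
print(scores)  # system A scores higher than system B on this toy data
```

Because the topics and judgments are fixed, any new system can be scored against the same resources, which is what makes the collection re-usable.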


TREC-style evaluation

• Large-scale evaluation to assess worth of new ideas in lab-controlled setting
  – Enables comparative evaluation between systems based on common resources and standardised methodologies (test collections)
• Many large-scale activities run over the years
  – TREC, INEX, CLEF, NTCIR, …
• Also carried out in field of VIR
  – TRECVid, PASCAL Visual Object Classes, MediaEval, ImageCLEF…
• Provided valuable resources and infrastructure to support researchers and shown to advance field of IR


[Figure labels: Results; Search Engine; Judges]

From tutorial on "Low-Cost Evaluation, Reliability and Reusability", RuSSIR 2011, Evangelos Kanoulas

Practical issues

• Gathering a collection of documents
• Generating a suitable set of queries/topics
  – How do I obtain the queries/topics?
  – How many queries/topics do I need?
• Creating the relevance assessments
  – How do I gather the assessments?
  – Who should do the assessments?
  – How many assessments should be made?
  – What are the assessors expected to do?
  – What about finding missing relevant documents?
• Selecting a suitable evaluation measure
• These decisions will affect the quality of the benchmark and impact on the accuracy/usefulness of results


Limitations of test collections

• Simulation/evaluation based on test collections
  – Individual differences between users are typically ignored in test collection setting
  – Result presentation not part of test collection
  – Collections have grown in size but the number of test queries often remains small
  – Limited diversity in tasks (e.g. how about evaluation of navigation, resource finding … explorative search)
  – Ignores longitudinal process of searching
• Alternatives
  – Interactive evaluation, (static) log file analysis, living labs …


ImageCLEF for VIR evaluation


ImageCLEF

• International evaluation campaign for evaluating (cross-language) image retrieval (2003 – now)
  – Part of Cross Language Evaluation Forum (CLEF)
• Comparative evaluation based on common resources and standardised methodologies
• Main objectives of ImageCLEF
  – To develop the necessary infrastructure for the evaluation of visual information retrieval systems (e.g. resources, organised events…)
  – To investigate effectiveness of combining textual and visual features
  – To promote the exchange of ideas towards the further advancement of the field of visual media analysis, indexing, classification and retrieval
• http://www.imageclef.org/

ImageCLEF tasks

• Ad hoc retrieval (since 2003)
  – Multiple querying modalities (e.g. QBVE)
  – Fusion of retrieval methods (TBIR and CBIR)
  – Promoting diversity in results
• Object and concept recognition (since 2005)
  – Object class recognition to identify whether certain objects or concepts from a pre-defined set of classes are contained in an image
  – Image annotation to assign textual labels or descriptions to an image
  – Automatic image classification to classify images into one or many classes
• Interactive image retrieval (since 2003)


Defining suitable tasks

• Where possible, tasks have been informed by operational settings (use cases) and involved expert assessors
  – Tasks for medical image retrieval based on interviews with clinicians, assessments performed by trained medical professionals and datasets realistic of medical domain
  – Many queries for non-medical ad hoc tasks derived from analysing query logs of search systems hosting datasets
• Also attempted to introduce challenging and novel tasks to interest researchers
  – Retrieval and classification on 'large' image datasets
  – Promoting diversity for ad hoc search
  – From image retrieval to case-based retrieval (medical)
• But getting the balance right is hard!


ImageCLEF participation

[Figure: Participation in the ImageCLEF tasks and number of participants by year (2003-2010)]
Figure labels: Historical images; Personal images; News photo archive; Wikipedia; Medical datasets (x-rays etc.)

ImageCLEF datasets

[Figure: Datasets developed in ImageCLEF (2003-2009)]


ImageCLEF 2009

• Variety of retrieval tasks
  – Photographic retrieval
  – Medical image retrieval
  – Interactive retrieval
  – Automatic medical image annotation
  – Large-scale visual concept detection
  – Wikipedia image retrieval
  – (Robot vision task)
• Pre-CLEF workshop
  – Visual retrieval evaluation
  – Sponsored by THESEUS
• 84 groups registered
  – 62 groups submitting results


Example task: promoting diversity

• A system retrieving a spread of results for a user need, or one that retrieves results across interpretations of a query, is said to be a system that promotes a diverse ranking
• Techniques to promote diversity of search results seem to be gaining wide adoption in the commercial web search sector
• However, at the time (2008 and 2009) there were almost no test collections available to evaluate different techniques in a standardised manner
• Few studies considering diversity in image retrieval


The idea of diversity

[Figure: two top-10 rankings for a topic whose relevant images fall into six clusters (sub-topics), C1 to C6]

Ranking 1: ten results, all from cluster C1
• P10 = 1.00
• Cluster Recall at 10 = 1 covered sub-topic / 6 total sub-topics = 0.167

Ranking 2: C1 C1 C1 C1 C1 C2 C3 C4 C5 C6
• P10 = 1.00
• Cluster Recall at 10 = 6 covered sub-topics / 6 total sub-topics = 1.000
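The two measures in the figure can be reproduced with a short sketch. It assumes the first ranking consists of ten results from C1 (which is what 1 covered sub-topic with P10 = 1.00 implies) and that every retrieved image is relevant; the document ids and function names are hypothetical.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def cluster_recall_at_k(ranking, doc_cluster, total_clusters, k=10):
    """Proportion of the topic's clusters represented among the top-k results."""
    covered = {doc_cluster[doc] for doc in ranking[:k] if doc in doc_cluster}
    return len(covered) / total_clusters

def as_ranking(labels, prefix):
    """Give each result a hypothetical document id and remember its cluster."""
    ranking = [f"{prefix}{i}" for i in range(len(labels))]
    return ranking, dict(zip(ranking, labels))

# The two rankings from the figure, written as one cluster label per rank.
labels_1 = ["C1"] * 10
labels_2 = ["C1"] * 5 + ["C2", "C3", "C4", "C5", "C6"]

results = {}
for name, labels in [("ranking 1", labels_1), ("ranking 2", labels_2)]:
    ranking, doc_cluster = as_ranking(labels, name[-1] + "_")
    relevant = set(ranking)  # assumption: all retrieved images are relevant
    results[name] = (
        precision_at_k(ranking, relevant),
        cluster_recall_at_k(ranking, doc_cluster, total_clusters=6),
    )
print(results)
```

Both rankings are perfect on precision, so only cluster recall separates the diverse ranking from the redundant one.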



• Designed task where participants had to present as many diverse results as possible in the top 10 results
  – Belgian news agency (Belga) provided dataset consisting of 498,039 images with unstructured captions (English)
  – 50 topics provided based on manual analysis of logs from Belga (average of 3.96 clusters for each topic)
  – 44 institutions registered (19 submitted runs)
  – Evaluation measures were P@10 and CR@10 (combined using F1)
  – CR@10 is cluster recall at rank 10 and measures the proportion of a topic's clusters that are covered in the top 10 results
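A minimal sketch of combining the two measures, assuming F1 here is the usual harmonic mean of P@10 and CR@10. That assumption is consistent with the reported results: the top run's P@10 of 0.794 and CR@10 of 0.824 give 0.809.

```python
def f1(precision, cluster_recall):
    """Harmonic mean of P@10 and CR@10; 0 when both are 0."""
    if precision + cluster_recall == 0:
        return 0.0
    return 2 * precision * cluster_recall / (precision + cluster_recall)

# Values taken from the three highest-ranked runs in the results table.
for p10, cr10 in [(0.794, 0.824), (0.772, 0.818), (0.8, 0.727)]:
    print(p10, cr10, round(f1(p10, cr10), 3))
```

The harmonic mean rewards runs that balance relevance and diversity: a run with high precision but low cluster recall is pulled towards the lower value.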


Query Part 1 Query Part 2
<title> clinton </title> <title> obama </title>

<clusterTitle> hillary clinton </clusterTitle>
<clusterDesc> Relevant images show photographs of Hillary Clinton. Images of Hillary with other people are relevant if she is shown in the foreground. Images of her in the background are irrelevant. </clusterDesc>
<image> belga26/05859430.jpg </image> <image> belga30/06098170.jpg </image>

<clusterTitle> obama clinton </clusterTitle>
<clusterDesc> Relevant images show photographs of Obama and Clinton. Images of those two with other people are relevant if they are shown in the foreground. Images of them in the background are irrelevant. </clusterDesc>
<image> belga28/06019914.jpg </image> <image> belga28/06019914.jpg </image>

<clusterTitle> bill clinton </clusterTitle>
<clusterDesc> Relevant images show photographs of Bill Clinton. Images of Bill with other people are relevant if he is shown in the foreground. Images of him in the background are irrelevant. </clusterDesc>
<image> belga44/00085275.jpg </image> <image> belga30/06107499.jpg </image>

Example task: promoting diversity

No  Group        Run Name             Query   Modality  P@10    CR@10  F1
1   XEROX-SAS    XRCEXKNND            T-CT-I  TXT-IMG   0.794   0.824  0.809
2   XEROX-SAS    XRCECLUST            T-CT-I  TXT-IMG   0.772   0.818  0.794
3   XEROX-SAS    KNND                 T-CT-I  TXT-IMG   0.8     0.727  0.762
4   INRIA        LEAR5_TI_TXTIMG      T-I     TXT-IMG   0.798   0.729  0.762
5   INRIA        LEAR1_TI_TXTIMG      T-I     TXT-IMG   0.776   0.741  0.758
6   InfoComm     LRI2R_TI_TXT         T-I     TXT       0.848   0.671  0.749
7   XEROX-SAS    XRCE1                T-CT-I  TXT-IMG   0.78    0.711  0.744
8   INRIA        LEAR2_TI_TXTIMG      T-I     TXT-IMG   0.772   0.706  0.737
9   Southampton  SOTON2_T_CT_TXT      T-CT    TXT       0.8240  0.654  0.729
10  Southampton  SOTON2_T_CT_TXT_IMG  T-CT    TXT-IMG   0.746   0.71   0.727

• Cluster information is essential for providing diverse results
• A combination of T-CT-I maximizes diversity
• Using mixed modality achieved the highest F1

Example task: interactive IR (iCLEF)

• Run in conjunction with CLEF interactive task (iCLEF) from 2005 and conducted in the style of TREC interactive task
  – Experiments typically hypothesis-driven, and interfaces studied and compared using controlled user populations under laboratory conditions
  – Participants recruit users to perform experiments and these have provided valuable insights into interactive IR
• But there are problems with interactive tasks
  – User populations typically small in size (e.g. 8 participants)
  – Cost of training users, scheduling and monitoring search sessions high
  – Factors such as the user interface and relevance criteria affect success
  – Hard to produce subsequent comparisons outside experimental setup



• Tried new approach in 2008-09 with different goal [Gonzalo et al., 2008; Gonzalo et al., 2009]
  – To harvest a large search log of users performing multilingual searches on Flickr.com in an online gaming environment
• Organisers provided default multilingual search interface
  – Functions for registering and monitoring users
  – Monolingual and multilingual search functionality
  – Customised logging capturing user-system interaction including explicit success/failure of searches, users' profiles, and post-search questionnaires for every search
• Participants could perform two tasks
  – Generation and analysis of search logs
  – Conduct their own lab-based interactive IR experiments


Example task: interactive IR (iCLEF)

Gonzalo, J., Clough, P. and Karlgren, J. (2009), Overview of iCLEF2008: Search Log Analysis for Multilingual Image Retrieval, In Proceedings of 9th Workshop of the Cross-Language Evaluation Forum (CLEF'08), September 17-19 2008, LNCS 5706, pp. 227-235.


• Total of 2 million lines of log data generated in 2008-09
  – 435 users contributed to logs and generated 6,182 valid search sessions

                                 2008       2009
Subjects/users                   305        130
Log lines                        1,483,806  617,947
Target images                    103        132
Valid search sessions            5,101      2,410
Successful sessions              4,033      2,149
Unsuccessful sessions            1,068      261
Hints asked                      11,044     5,805
Queries in monolingual mode      37,125     13,037
Queries in multilingual mode     36,504     17,872
Manually promoted translations   584        725
Manually penalised translations  215        353
Image descriptions inspected     418        100

Download and use the data: http://nlp.uned.es/iCLEF/


• Logs provide rich source of information for studying multilingual search behaviour
  – Investigating the effects of language skills on search behaviour
  – Discovering actions leading to an abort
  – Observing the switching behaviour of users within a search task
• A community-based, game-like approach is perhaps one way to generate resources to help analyse user-system interactions and searching behaviours
  – But there are limitations to this approach: the logs reflect only a single search task (known-item retrieval) using a pre-defined search interface
• In future could use logs to record behaviour for specific version of user interface with systematic modifications to compare various search assistance functionalities


ImageCLEF: organisational challenges

• Obtaining funding (e.g. for relevance assessments, invited speakers…)
• Obtaining access rights to image datasets
• Motivating participation across multiple domains
• Motivating submission of results (<50% of groups who register actually submit)
• Difficult to get interest from commercial organisations to inform operational settings
• Creating realistic tasks and user models (esp. in TREC-style evaluation event)
• Efficiently creating ground truths (esp. for medical tasks)


Müller, H., Clough, P., Deselaers, T. and Caputo, B. (Eds)(2010) ImageCLEF ‐ Experimental Evaluation of Visual Information Retrieval, Springer: Heidelberg, Germany, ISBN 978‐3‐642‐15180, 495 pages.

Some of the main findings

• Consistent study over the years of combinations of image and textual information
  – 62% of papers submitted to ImageCLEF 2003-2009 proceedings used combinations of CBIR and TBIR
  – Of those using combinations, 60% used approach based on combining multiple results lists rather than using multi-modal indexing
  – Consistent improvements (overall) using CBIR+TBIR, but query dependent
• Multilingual search just as effective as monolingual
• For certain domains (e.g. medicine) use of known resources (e.g. UMLS) helps with indexing and query expansion
• Doing interactive tasks in TREC-style setting is still hard!


Müller, H., Clough, P., Deselaers, T. and Caputo, B. (Eds)(2010) ImageCLEF ‐ Experimental Evaluation of Visual Information Retrieval, Springer: Heidelberg, Germany, ISBN 978‐3‐642‐15180, 495 pages.
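Combining multiple results lists (late fusion) can be sketched as a weighted sum of normalised text-based and content-based scores. This is one common scheme, not necessarily what any particular ImageCLEF participant used; the min-max normalisation, the weight `alpha` and the toy scores are assumptions for illustration.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def late_fusion(text_scores, visual_scores, alpha=0.5):
    """Weighted linear combination of TBIR and CBIR scores.

    Documents missing from one list contribute 0 for that modality.
    """
    t, v = min_max(text_scores), min_max(visual_scores)
    fused = {
        doc: alpha * t.get(doc, 0.0) + (1 - alpha) * v.get(doc, 0.0)
        for doc in set(t) | set(v)
    }
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Hypothetical scores from a text run and a visual run for one query.
text_run = {"img1": 12.0, "img2": 8.0, "img3": 4.0}
visual_run = {"img2": 0.9, "img3": 0.8, "img4": 0.1}
fused_ranking = late_fusion(text_run, visual_run, alpha=0.5)
print(fused_ranking)
```

Since the benefit of CBIR+TBIR is reported as query dependent, a fixed `alpha` is a simplification; the weight could in principle be tuned per query or per collection.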

Contributions of ImageCLEF – impact [Tsikrika et al., 2011]

• Around 70% of citations are from papers not in CLEF proceedings
• 8.62 cites per paper on average


Addressing some issues in IR evaluation


Addressing issues in IR evaluation (using test collections)

• Can we gather relevance assessments efficiently and effectively?
  – Often causes a bottleneck in evaluation and can be very resource intensive
• Does system performance translate to user satisfaction and success?
  – If not then evaluating with test collections is limited
• Can we adapt test collections to deal with multi-query sessions?
  – This would reflect more realistic searching behaviours


Generating relevance assessments

• Relevance assessment is time-consuming and causes bottleneck in IR evaluation
  – Often requires input of domain experts
  – Pooling is commonly used to form sets of documents for assessors to judge
  – Coverage of pools (and depth)
• More efficient approaches to judge relevance?
  – Move-to-Front (MTF) pooling
  – Interactive Search and Judge
  – Use of sampling techniques
  – Use of implicit judgments (e.g. from log data)
  – Crowdsourcing
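Standard depth-k pooling, the baseline that the more efficient approaches try to improve on, can be sketched as follows: for each topic, the pool is the union of the top k documents from every submitted run. The run format and names are hypothetical.

```python
def build_pool(runs, depth=10):
    """Union of the top-`depth` documents from every run, per topic.

    runs: list of {topic_id: [doc ids in ranked order]} mappings.
    Returns {topic_id: set of doc ids to be judged}.
    """
    pool = {}
    for run in runs:
        for topic, ranking in run.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool

# Two hypothetical runs for a single topic, pooled to depth 2.
run_1 = {"t1": ["d1", "d2", "d3"]}
run_2 = {"t1": ["d2", "d4", "d5"]}
pool = build_pool([run_1, run_2], depth=2)
print(pool)
```

Documents outside the pool (here d3 and d5) are never judged and are usually treated as not relevant, which is why pool depth and coverage matter.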


Crowdsourcing

• Crowdsourcing is act of taking job traditionally performed by designated person and outsourcing to undefined, generally large, group of people in form of open call
• Amazon Mechanical Turk (AMT) is example crowdsourcing platform
  – Requester creates Human Intelligence Tasks (HITs)
  – Workers choose to complete HITs
  – Requesters assess results and pay workers
  – Currently > 200,000 workers from many countries
• Crowdsourcing shown to be feasible for relevance assessment [Alonso & Mizzaro, 2009; Kazai, 2011; Carvalho et al., 2011]
• However, previous studies also showed that domain expertise can have effect on judgments [Bailey et al., 2008; Kinney et al., 2008]


Crowdsourcing study at the UK National Archives [Clough et al., 2012]

• Designed crowdsourcing experiment using AMT to gather relevance assessments to measure effectiveness of two competing search engines at UK government National Archives
  – Selected AMT route as there was limited access to search log data
• Compared assessments from AMT with those from domain expert and measured impact on effectiveness scores and rankings of Systems A and B
  – 48 queries selected by domain expert
  – Queries issued to Systems A and B and 10 highest results retrieved/judged
  – Effectiveness assessed at rank 10 (P@10 and DCG@10)
  – HIT consisted of being shown query, description of query intent, 10 retrieved documents (from system A or B) and answering questions
  – Gathered 10 judgments per query-system (960 HITs) over 2 weeks
  – 73 workers produced 924 HITs (after noise removed)


Crowdsourcing study at the UK National Archives

Questionnaire results for expert and crowdsourced worker responses

Q1 - difficulty: 1=V. difficult; 5=V. easy
Q2 - familiarity: 1=V. unfamiliar; 5=V. familiar
Q3 - confidence: 1=Not at all confident; 5=V. confident
Q4 - satisfaction: 1=V. unsatisfied; 5=V. satisfied

Assessor               Queries        Q1    Q2    Q3    Q4
Expert                 All            4.36  4.34  4.25  3.25
                       Informational  4.10  4.13  4.00  3.04
                       Navigational   4.63  4.56  4.50  3.46
                       System A       4.54  4.44  4.44  3.90
                       System B       4.19  4.25  4.06  2.60
Crowd-sourced workers  All            3.47  3.54  4.12  4.13
                       Informational  3.48  3.43  4.04  4.05
                       Navigational   3.47  3.65  4.20  4.21
                       System A       3.42  3.57  4.18  4.18
                       System B       3.52  3.51  4.05  4.08

System    Queries        DCG      P@10
System A  All (N=48)     0.492**  0.285*
          Informational  0.323    -0.029
          Navigational   0.485*   0.467*
System B  All (N=48)     0.601**  0.595**
          Informational  0.563**  0.523**
          Navigational   0.772**  0.786**

Absolute scores between expert and workers differ but ranking of Systems A and B remains stable (A judged the better system).

The judgments between the crowdsourced workers and the expert are more interchangeable for system B than A, despite the resulting differences in absolute scores.

Navigational queries correlate better than informational. The lower correlation between expert and crowdsourced workers for system A, particularly for informational queries, suggests the results of a higher quality search engine are more difficult to assess using crowdsourcing.
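A per-query comparison of this kind can be sketched by correlating expert-based and worker-based scores across queries. The slides do not say which correlation coefficient was used, so Pearson's r below is an assumption, and the scores are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical per-query DCG scores from expert and crowdsourced judgments.
expert_dcg = [1.0, 2.0, 3.0, 4.0]
worker_dcg = [2.0, 1.0, 4.0, 3.0]
r = pearson(expert_dcg, worker_dcg)
print(round(r, 3))
```

A high correlation means the two groups order the queries similarly even when their absolute scores differ, which is the distinction the commentary above draws.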

[Figure: two scatter plots, one for System A (the better system) and one for System B, of expert DCG score (y-axis) against MTurker DCG score (x-axis), with informational and navigational queries marked separately]

AMT cost: $43 for 45 hrs (73 assessors) TNA cost: $106 for 3 hrs 5 mins

Usefulness of using test collections in IR development

• Effectiveness of IR systems typically measured based on the number of "relevant" items found (Precision, Recall, DCG...)
  – Test collection and measure predict user behaviour: if system A scores higher than B on test collection we assume users will prefer system A over B in an operational setting
• But
  – Several past studies have shown that a large increase in system effectiveness did not produce detectable gains for the end user in practice (i.e. not correlated with user satisfaction / success)
  – The real issue in IR system design is not whether P/R goes up, but rather whether it helps users perform search tasks more effectively


Usefulness of using test collections in IR development [Sanderson et al., 2010]

• Experiment conducted to examine relation of system effectiveness with user preference on large scale
• Study involved 296 Amazon Mechanical Turk (AMT) users working with 30 topics comparing user preferences across 19 runs submitted to TREC 2009 Web track
• Sampled range of runs with large and small relative differences in evaluation measures
• Lists of results randomly shown to AMT users side-by-side and asked to make a preference judgment for given search topic
• Total cost of experiment < $60





Usefulness of using test collections in IR development [Sanderson et al., 2010]



• Found clear evidence that effectiveness measured on test collection predicted user preferences for one IR system over another
  – Strength of prediction varied by search type (inf / nav)
• When comparing measures it was found that P@10 poorly modelled user preferences (ERR and nDCG best)
• User preferences between pairs of systems where one had failed to return any relevant item were significantly stronger compared to rankings with at least one relevant document
  – Measures need adjusting to account for this
• Still preliminary work that requires further validation
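One simple way to ask whether a measure predicts user preference is to count, over pairs of runs, how often the run with the higher score is also the one most users preferred. This is a simplified sketch of that idea, not the analysis from the study; the scores, vote counts and names are hypothetical.

```python
def agreement(measure_scores, preferences):
    """Share of run pairs where the higher-scoring run also won the user vote.

    measure_scores: {run_id: effectiveness score}
    preferences: {(run_a, run_b): (votes for run_a, votes for run_b)}
    Pairs tied on score or on votes are skipped.
    """
    agree = total = 0
    for (run_a, run_b), (votes_a, votes_b) in preferences.items():
        score_diff = measure_scores[run_a] - measure_scores[run_b]
        vote_diff = votes_a - votes_b
        if score_diff == 0 or vote_diff == 0:
            continue
        total += 1
        if (score_diff > 0) == (vote_diff > 0):
            agree += 1
    return agree / total if total else float("nan")

# Hypothetical scores for three runs and side-by-side preference votes.
scores = {"r1": 0.6, "r2": 0.4, "r3": 0.5}
votes = {("r1", "r2"): (8, 2), ("r1", "r3"): (4, 6), ("r2", "r3"): (3, 7)}
share = agreement(scores, votes)
print(round(share, 3))
```

Running the same comparison with different measures (P@10, nDCG, ERR) would show which one agrees most often with users.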


Adapting test collections to deal with sessions – TREC Session Track

• TREC Session Track started in 2010 with intention of providing test collections for studying IR over sessions rather than one-shot queries
• 2011's goal was to provide best possible results for mth query in session given prior session data
• Session data consisted of
  – Current query qm
  – Set of past queries in session q1, q2, …, qm-1
  – Ranked list of URLs for each past query
  – Set of clicked URLs/snippets and time spent by user reading them

http://trec.nist.gov/pubs/trec20/papers/SESSION.OVERVIEW.2011.pdf


• Participants ran IR systems over current query under four conditions considered separately
  – RL1: ignoring session data prior to the query
  – RL2: considering only prior queries
  – RL3: considering prior queries and search result URLs
  – RL4: considering all data, including the items clicked on by users and time spent viewing items
• Provided test collection to participants
  – Used ClueWeb09 (Category B – 50M documents)
  – 76 sessions for 62 topics created (re-used from TREC 2007 QA and 2009 Million Query tracks as they have sub-topics)
  – Custom-built IR system developed (based on Yahoo! BOSS) and used to generate session data
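The session data and the four conditions can be sketched as a small data structure plus a filter that exposes only the evidence allowed under each condition. The class and field names are hypothetical and do not reflect the track's actual file format.

```python
from dataclasses import dataclass, field

@dataclass
class PastInteraction:
    query: str
    ranked_urls: list
    clicks: dict = field(default_factory=dict)  # url -> time spent (seconds assumed)

@dataclass
class Session:
    current_query: str   # q_m
    history: list        # PastInteraction objects for q_1 ... q_(m-1)

def visible_data(session, condition):
    """Return the session evidence a system may use under RL1 to RL4."""
    if condition not in ("RL1", "RL2", "RL3", "RL4"):
        raise ValueError(f"unknown condition: {condition}")
    data = {"current_query": session.current_query}
    if condition in ("RL2", "RL3", "RL4"):
        data["past_queries"] = [h.query for h in session.history]
    if condition in ("RL3", "RL4"):
        data["past_results"] = [h.ranked_urls for h in session.history]
    if condition == "RL4":
        data["clicks"] = [h.clicks for h in session.history]
    return data

# Hypothetical two-query session.
session = Session(
    current_query="cheap flights valencia",
    history=[PastInteraction("flights valencia", ["u1", "u2"], {"u2": 35.0})],
)
for rl in ("RL1", "RL2", "RL3", "RL4"):
    print(rl, sorted(visible_data(session, rl)))
```

Each condition adds evidence to the previous one, so comparing a system's results across RL1 to RL4 shows how much each kind of session data helps.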



• Judgments created by NIST assessors
  – For each topic a depth-10 pool was formed from ranked results for past queries q1…qm-1 produced by Yahoo! BOSS and top 10 documents from submitted runs on current query qm
  – Documents judged with respect to the general topic and all sub-topics
• Relevance judgments
  – -2 for spam, 0 for not relevant, 1 for relevant, 2 for highly relevant, 3 for topics that are navigational in nature and the judged page is "key" to satisfying the need
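Graded judgments like these are typically turned into gains for a measure such as nDCG. The sketch below assumes that spam (-2) contributes no gain, like a non-relevant page, and uses a log2 discount; both choices are assumptions rather than the track's official scoring, and the grades shown are made up.

```python
from math import log2

GRADE_LABELS = {
    -2: "spam",
    0: "not relevant",
    1: "relevant",
    2: "highly relevant",
    3: "key page (navigational topic)",
}

def gain(grade):
    """Assumption: spam (-2) is clamped to 0, the same as a non-relevant page."""
    return max(grade, 0)

def dcg(grades, k=10):
    """Discounted cumulative gain over the top-k grades."""
    return sum(gain(g) / log2(rank + 1) for rank, g in enumerate(grades[:k], start=1))

def ndcg(ranking_grades, judged_grades, k=10):
    """DCG of the ranking divided by the DCG of the best ordering of all judged docs."""
    ideal = dcg(sorted(judged_grades, key=gain, reverse=True), k)
    return dcg(ranking_grades, k) / ideal if ideal > 0 else 0.0

# Hypothetical grades: a system's ranking and all judged documents for the topic.
ranking = [2, -2, 1, 0]
all_judged = [3, 2, 1, 0, -2]
print(round(dcg(ranking), 3), round(ndcg(ranking, all_judged), 3))
```

Whether spam should be penalised more heavily than a non-relevant page is a design decision; clamping to zero is simply the most conservative option.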



• Results from TREC 2011 indicate it is possible for systems to use interaction data to improve results over a baseline using no interaction data at all
  – Open questions include: use of sub-topic judgment and how to deal with duplicates

[Figure: RL1 -> RL4 (all sub-topics)]


To sum up


Challenges for IR evaluation

• Tague-Sutcliffe [1996] highlights six issues with IR evaluation
  – Should IR experiments involve real users with real information needs?
  – Must IR evaluation involve actual retrieval processes?
  – What kind of aggregation is appropriate in evaluating different IR systems?
  – What can analysis, as opposed to the experimental or qualitative collection of data, tell us about IR systems?
  – How can interactive IR systems be evaluated?
  – How generalisable are the results of IR systems?


Conclusions

• Evaluating search is very important both in academic and commercial contexts
• Evaluation often performed using test collections, which provide valuable insights into IR algorithms
  – But need to validate the findings based on test collections with users and in realistic settings
  – System evaluation is part of wider evaluation activities
• ImageCLEF focused on system-oriented evaluation and inherits limitations
  – But created variety of realistic tasks and studied user interaction
• Future work considering evaluating wider IR applications (search is one component) and varying search strategies (e.g. browsing) using controlled lab-based experiments


Evaluating Information Access Systems (ELIAS)

• ELIAS is an ESF Research Networking Programme launched in 2011 for a duration of 5 years (http://elias-network.eu/)
• Study living laboratories for the evaluation of information access in the large
• Horizontal dimension
– Domains and application areas
• Vertical dimension
– Fundamental questions, methodological and user simulation issues to be addressed

• Money available to support students/researchers doing evaluation research


Very useful source for test collections

Sanderson, M. “Test Collection Evaluation of Ad-hoc Retrieval Systems”, Foundations and Trends® in Information Retrieval, 2010

69 pages with 276 articles reviewed

Created with Wordle: http://www.wordle.net


References

Carvalho, V. R., Lease, M., & Yilmaz, E. (2011) Crowdsourcing for search evaluation, ACM SIGIR Forum, 44, 17–22.

Alonso, O., & Mizzaro, S. (2009) Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment, In Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation, 15–16.

Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P., & Yilmaz, E. (2008). Relevance assessment: are judges exchangeable and does it matter. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 667–674.

Clough, P., Gonzalo, J., Karlgren, J., Barker, E., Artiles, J. and Peinado, V. (2008), Large‐Scale Interactive Evaluation of Multilingual Information Access Systems ‐ the iCLEF Flickr Challenge, In Proceedings of Workshop on novel methodologies for evaluation in information retrieval, ECIR 2008, 33‐38.

Clough, P., Sanderson, M., Tang, J., Gollins, T. and Warner, A. (2012) Examining the limits of crowdsourcing for relevance assessment, IEEE Internet Computing, 28 Jun. 2012. IEEE Computer Society Digital Library. IEEE Computer Society (doi.ieeecomputersociety.org/10.1109/MIC.2012.95)


References

Gonzalo, J., Clough, P. and Karlgren, J. (2009), Overview of iCLEF2008: Search Log Analysis for Multilingual Image Retrieval, In Proceedings of 9th Workshop of the Cross‐Language Evaluation Forum (CLEF'08), September 17‐19 2008, LNCS 5706, 227‐235.

Harman, D. (2011). Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval, and Services, 3(2), 1–119.

Kazai, G. (2011). In search of quality in crowdsourcing for search engine evaluation. Advances in Information Retrieval, 165–176.

Kelly, D. (2009) Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval.

Kinney, K. A., Huffman, S. B., & Zhai, J. (2008). How evaluator domain expertise affects search result relevance judgments. Proceeding of the 17th ACM conference on Information and knowledge management (pp. 591–598). ACM.

Müller, H., Clough, P., Deselaers, T. and Caputo, B. (Eds)(2010) ImageCLEF ‐ Experimental Evaluation of Visual Information Retrieval, Springer: Heidelberg, Germany, ISBN 978‐3‐642‐15180


References

Tsikrika, T., Seco de Herrera, A.G., & Müller, H. (2011) Assessing the scholarly impact of ImageCLEF. In Proceedings of the Second international conference on Multilingual and multimodal information access evaluation (CLEF'11), Pamela Forner, Julio Gonzalo, Jaana Kekäläinen, Mounia Lalmas, and Maarten de Rijke (Eds.). Springer‐Verlag, Berlin, Heidelberg, 95‐106.

Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, United States, July 09 ‐ 13, 1995): SIGIR '95 (pp. 138‐146). New York, ACM Press.

Sanderson, M., Paramita, M., Clough, P. and Kanoulas, E. (2010) Do user preferences and evaluation measures line up?, In Proceedings of the 33rd Annual ACM SIGIR Conference, Geneva, Switzerland, pp. 555‐562.

Sanderson, M. (2010) Test Collection Evaluation of Ad‐hoc Retrieval Systems, Foundations and Trends in Information Retrieval, 4(2010), 247‐375.

Tague‐Sutcliffe, J.M. (1996) Some perspectives on the evaluation of information retrieval systems, Journal of the American Society for Information Science, 47(1), 1‐3.