
Proceedings of the

SIGIR 2013 Workshop on Modeling User Behavior

for Information Retrieval Evaluation (MUBE 2013)

Charles L. A. Clarke, Luanne Freund, Mark D. Smucker, Emine Yilmaz

(editors)

August 1, 2013

Dublin, Ireland


Preface from Workshop Organizers

Welcome to the SIGIR 2013 Workshop on Modeling User Behavior for Information Retrieval Evaluation (MUBE 2013). The goal of this workshop is to bring together people to discuss new and existing approaches to improving information retrieval evaluation through the modeling of user behavior.

We invited participants to submit 2-page poster papers that could range from position papers to new research results. Each paper was reviewed by at least 3 program committee members. Papers were evaluated in terms of their relevance to the workshop, their correctness, and their potential to generate discussion. The submissions were very high quality, and we accepted 10 of the 11 submitted papers.

The workshop consists of 3 main parts: invited talks, short paper presentations, and breakout groups and their presentations. The invited talks from Ben Carterette and Leif Azzopardi will provide participants with two alternate looks at modeling user behavior for IR evaluation from two of the leading researchers in this area. The short papers will allow participants to see a diverse set of work. Following the invited and short paper talks, the workshop participants will select the topics of discussion for the remainder of the workshop. We will form breakout groups around these discussion topics and each breakout group will present their group's findings at the end of the workshop.

We thank SIGIR for hosting the workshop, and in particular, we thank the conference organizers for their many efforts to make the workshop a success. We thank Ben Carterette and Leif Azzopardi for accepting our invitations to speak. We thank our excellent program committee for writing some of the best reviews we have ever seen for a workshop. We thank the authors for submitting and presenting their work, and finally, we thank all of the participants for what we hope will be a day of lively discussion and learning.

Charles L. A. Clarke
Luanne Freund
Mark D. Smucker
Emine Yilmaz



Program Committee

Charles L. A. Clarke (University of Waterloo) – Chair
Luanne Freund (University of British Columbia) – Chair
Mark D. Smucker (University of Waterloo) – Chair
Emine Yilmaz (University College London) – Chair
Eugene Agichtein (Emory University)
James Allan (University of Massachusetts Amherst)
Javed Aslam (Northeastern University)
Leif Azzopardi (University of Glasgow)
Nicholas Belkin (Rutgers University)
Pia Borlund (Royal School of Library and Information Science)
Ben Carterette (University of Delaware)
Arjen de Vries (Delft University of Technology)
Norbert Fuhr (University of Duisburg-Essen)
Donna Harman (NIST)
Hideo Joho (University of Tsukuba)
Joemon Jose (University of Glasgow)
Jaap Kamps (University of Amsterdam)
Evangelos Kanoulas (Google)
Mounia Lalmas (Yahoo!)
Alistair Moffat (University of Melbourne)
Virgil Pavlu (Northeastern University)
Stephen Robertson (Microsoft Research Cambridge)
Ian Ruthven (University of Strathclyde)
Falk Scholer (RMIT University)
Pavel Serdyukov (Yandex)
Ellen Voorhees (NIST)



Invited Speakers

Ben Carterette, University of Delaware

Talk Title: User Variability and Information Retrieval Evaluation

Abstract: Information retrieval evaluation is highly reliant on averages: we typically evaluate an engine by testing for a significant difference in an average effectiveness computed using relevance judgments that may be averaged over assessors, and perhaps taking one or more parameters estimated as averages from user data. Using averages at every stage of the evaluation process this way presents a false certainty; in reality there is so much potential variability at each point that even long-held conventional wisdom about effectiveness-enhancing techniques must be questioned.

In this talk, we argue for the importance of incorporating something about variability in user behavior into automatic (batch-style) retrieval evaluations. We will present results from three ongoing projects: (1) using user logs to incorporate variability in user behavior; (2) using many preferences-based assessments to incorporate variability about relevance; (3) using different classes of tasks to incorporate variability about topics.

Leif Azzopardi, University of Glasgow

Talk Title: The Assimilation of Users

Abstract: In this talk, I want to discuss a number of issues regarding the simulation of users and how we are "assimilating" users into our evaluations through the models that we create. First, I'll try and provide some definitions about what measures, models and simulations are, and how they relate. Specifically, I argue that simulation has been a central component in most, if not all, evaluations. Then, I'll review some of the different kinds of simulations, the advantages, pitfalls and challenges of performing simulations. It is here that I wonder about what we are trying to achieve, and what we want to do with all our models, simulations and measures. Is it to evaluate or is it to assimilate? I'll argue that we need to look towards developing more explanatory models of user behaviour so that we can obtain a better understanding of users and their interaction with information systems.



Table of Contents

Preface i

Program Committee ii

Invited Speakers iii

Table of Contents iv

Accepted Papers (alphabetized by first author's last name)

Towards Task-Based Snippet Evaluation: Preliminary Results and Challenges.
Mikhail Ageev, Dmitry Lagun, Eugene Agichtein 1

Towards Measures and Models of Findability.
Leif Azzopardi, Colin Wilkie, Tony Russell-Rose 3

Modelling the information seeking user by the decisions they make.
Peter Bruza, Guido Zuccon, Laurianne Sitbon 5

Simulating User Selections of Query Suggestions.
Jiepu Jiang, Daqing He 7

Incorporating Efficiency in Evaluation.
Eugene Kharitonov, Craig Macdonald, Pavel Serdyukov, Iadh Ounis 9

Observing Users to Validate Models.
Falk Scholer, Paul Thomas, Alistair Moffat 11

Markov Modeling for User Interaction in Retrieval.
Vu T. Tran, Norbert Fuhr 13

Personalization for Difficult Queries.
Arjumand Younus, M. Atif Qureshi, Colm O'Riordan, Gabriella Pasi 15

User Judgements of Document Similarity.
Mustafa Zengin and Ben Carterette 17

Evaluating Heterogeneous Information Access (Position Paper).
Ke Zhou, Tetsuya Sakai, Mounia Lalmas, Zhicheng Dou, Joemon M. Jose 19



Towards Task-Based Snippet Evaluation: Preliminary Results and Challenges

Mikhail Ageev∗
Moscow State University
[email protected]

Dmitry Lagun
Emory University
[email protected]

Eugene Agichtein
Emory University
[email protected]

ABSTRACT
Query-biased search result summaries, or "snippets", help users decide whether a result is relevant for their information need, and have become increasingly important for helping searchers with difficult or ambiguous search tasks. However, existing snippet evaluation methods focus on the snippet quality of summarizing the document for the given query, and do not consider the user's search task. We propose a methodology for task-based snippet evaluation, with the aim of directly evaluating how well the snippets help users to satisfy their information need. This includes an open-source infrastructure for collecting controlled yet realistic searcher behavior data, which allows analysis of search session success as a function of snippet generation quality. We also present preliminary results of using this methodology, and identify some of the challenges that remain.

1. BACKGROUND AND APPROACH
While web search engines have been rapidly evolving, one constant in the search result pages has been the presence of some form of a document summary, or "snippet", provided to help a user to select the best result documents. For some queries, the generated snippets already contain the desired information, either by design (e.g., Google's "Instant Answers" or Wolfram Alpha).

Search snippet quality evaluation is a crucial step for development of snippet generation algorithms. Previously known methods for evaluation of snippet quality are either judgement-based [5, 6], or behavior-based (i.e., measuring the relation between click-through rate and relevance of landing pages [3, 4]). The drawbacks of assessor-based methods include potentially high cost, and the disconnection from the original search intent. On the other hand, the behavior-based methods could fail in cases of "good abandonment" [2], i.e., when the snippet directly answers the searcher's query. Distinct from these two approaches, our aim is to create a snippet evaluation methodology that measures directly how well the snippets help searchers satisfy their information needs.

To measure snippet quality in relation to search success, we propose the following requirements:

• The experiments should be performed in a realistic web search setting, for a variety of search tasks, and for different users.

• A search task should be clearly defined, and search success should be directly measurable.

• The snippet generation algorithm should be varied, while other parameters of the experiment (ranking algorithm, search engine interface) should be fixed.

∗Work done at Emory University.

Copyright is held by the author/owner(s). SIGIR 2013 Workshop on Modeling User Behavior for Information Retrieval Evaluation (MUBE 2013), August 1, 2013, Dublin, Ireland.

Next, we define previously known and newly proposed metrics, and describe the experimental infrastructure to collect the appropriate data.

Manual Assessments. Following the literature on snippet quality [5, 4], snippets must satisfy the aspects of Representativeness, Readability, and Judgeability. The snippet quality is evaluated by performing blind paired preference tests. For each query and URL, a pair of snippets produced by two different algorithms are presented on a page in random order, so the assessors do not know which algorithm produced which snippet. The corresponding evaluation metric for manual assessments is the fraction of labels that give preference to system B, compared to system A. The preference ratio metric is evaluated for each criterion (judgeability, readability, representativeness).

Behavior-based: CTR. Following [3, 4], we measure the ratio of clicked relevant and irrelevant documents for the first two SERP positions. The main hypothesis for these metrics is that better snippets attract more clicks on relevant documents, and fewer clicks on irrelevant documents. The first two SERP positions are almost always examined by users.

Our proposal: SSR (Search Success Ratio). We propose two new snippet evaluation metrics, following the QRAV model of search session success [1], as a ratio of successful sessions for a specific snippet generation algorithm. Namely, we propose SSR1: the fraction of answered sessions where the user was satisfied with the session and believed to have found an answer (but the answer may not be subsequently validated to be correct), and SSR2: the fraction of successful sessions where the validated correct answer was found by the searcher.

To measure the CTR and the SSR1 and SSR2 (search success) metrics, we require the collection of realistic yet controlled behavior data, using the mechanism described next.
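As an illustration of how these session-level metrics could be computed, the following sketch aggregates SSR1, SSR2 and per-position CTR values from a list of session records. It is a minimal sketch, not the authors' released code: the record fields (sga, answered, validated, clicks) are hypothetical, and the per-session normalization of the CTR values is one plausible reading of the description above.

```python
from collections import defaultdict

def snippet_metrics(sessions):
    """Aggregate SSR1, SSR2 and ctr@k metrics per snippet generation algorithm (SGA).

    Each session is assumed to be a dict such as:
      {"sga": "lucene_2f", "answered": True, "validated": False,
       "clicks": [{"position": 1, "relevant": True}, ...]}
    """
    stats = defaultdict(lambda: {"sessions": 0, "answered": 0, "validated": 0,
                                 "clicks_rel": defaultdict(int),
                                 "clicks_nonrel": defaultdict(int)})
    for s in sessions:
        st = stats[s["sga"]]
        st["sessions"] += 1
        st["answered"] += int(s["answered"])      # user believed an answer was found
        st["validated"] += int(s["validated"])    # answer later validated as correct
        for c in s["clicks"]:
            key = "clicks_rel" if c["relevant"] else "clicks_nonrel"
            st[key][c["position"]] += 1

    results = {}
    for sga, st in stats.items():
        n = st["sessions"]
        results[sga] = {
            "SSR1": st["answered"] / n,           # believed-successful sessions
            "SSR2": st["validated"] / n,          # validated-successful sessions
            # click-through ratios for the first two SERP positions
            "ctr@1/relevant": st["clicks_rel"][1] / n,
            "ctr@1/notrelevant": st["clicks_nonrel"][1] / n,
            "ctr@2/relevant": st["clicks_rel"][2] / n,
            "ctr@2/notrelevant": st["clicks_nonrel"][2] / n,
        }
    return results
```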

2. EXPERIMENTS
Infrastructure and Participants: To collect the search behavior data, we used the infrastructure created and published in [1], and modified it for our task. The participants played a search contest "game" consisting of 12 search tasks (questions). The stated goal of the game was to submit the highest possible number of correct answers within the allotted time. The participants were instructed that upon finding an answer they need to type the answer, along with the supporting URL, into the corresponding fields of the game interface. Each search session (for one question) was completed by either submitting an answer, or by clicking the "skip question" button to pass to the next question.

The search interface was implemented using a search API of a major commercial search engine to retrieve the results, but the search snippet contents could be generated using one of the available Snippet Generation Algorithms (SGAs), for example using the original ("default") search snippets, or by a different algorithm.



Figure 1: Snippet comparison: (a) manual judgements-based metrics (judgeability, readability, representativeness) for the default and lucene_2f snippets; (b) behavior-based metrics (SSR1, SSR2, ctr@1/relevant, ctr@1/notrelevant, ctr@2/relevant, ctr@2/notrelevant) for the default, lucene_2f, lucene_long and lucene_short snippets.

While the same SGA is used for all queries in a single search session, different SGAs are used for different search sessions, in order to expose the searchers to the different SGAs for different tasks in a systematic manner. The code and the data used for these experiments are available from http://ir.mathcs.emory.edu/intent/.

Experimental Design: We use a randomized block experimental design to eliminate possible biases related to SGAs and the difficulty of the questions. We divided the user-question pairs into groups at random, satisfying the constraints that: (1) each user performed 12 search sessions, divided into 4 blocks of 3 search sessions each, with each block assigned to a specific SGA; (2) the questions are presented in random order; (3) for each task-SGA pair, and each task order, there are approximately the same number of users who performed the search. A sketch of such a block assignment is given below.
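The following is a minimal sketch, not the authors' code, of how such a randomized block assignment could be generated: each user receives 4 blocks of 3 questions, each block tied to one SGA, with question order and SGA-block order randomized per user. The SGA names match those used later in the paper; everything else is illustrative, and a full implementation would additionally balance task-SGA and task-order counts across users (constraint (3)).

```python
import random

SGAS = ["default", "lucene_2f", "lucene_short", "lucene_long"]

def assign_blocks(user_ids, questions, seed=0):
    """Return {user_id: [(question, sga), ...]} with 4 blocks of 3 questions per user."""
    assert len(questions) == 12, "the game uses 12 questions"
    rng = random.Random(seed)
    assignment = {}
    for user in user_ids:
        qs = questions[:]
        rng.shuffle(qs)                # randomize question order per user
        sga_order = SGAS[:]
        rng.shuffle(sga_order)         # randomize which block gets which SGA
        sessions = []
        for block, sga in enumerate(sga_order):
            for q in qs[block * 3:(block + 1) * 3]:
                sessions.append((q, sga))
        assignment[user] = sessions
    return assignment

# Example: 8 hypothetical users and 12 hypothetical questions
plan = assign_blocks([f"u{i}" for i in range(8)], [f"q{i}" for i in range(12)])
```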

To support the SGAs that operate offline (i.e., that require retrieving the full document text), we perform experiments in two stages. First, we ran the experiment with the default snippets provided by the search engine API, and collected the information about the user's queries and the returned SERPs. Then, we downloaded landing pages for the top 10 returned URLs for all queries, and generated snippets for them. At the second (experimental) stage, we presented the newly generated snippets for all of the cached queries. In our experiments, 69% of the user's queries were cached, which we found sufficiently high for subsequent experiments.

Participants: The experiment participants were recruited through Amazon Mechanical Turk, and careful checks were made to ensure good understanding of the proposed task. A total of 109 MTurk participants finished their tasks. After filtering out the users who did not follow the game rules, we obtained 1175 search sessions, performed by 98 users. For each SGA there are about 294 search sessions, and for each question and SGA there are about 24 users. Our data for these users consists of 3,294 queries, 1,598 unique queries, and 2,997 SERP clicks on 662 distinct URLs.

2.1 Preliminary Results
We compared four snippet generation algorithms: the default snippets provided by the search engine API, and three different variants of snippets generated by the open-source search engine Lucene: (1) lucene_2f – each snippet consists of the two fragments that best match the query, with average snippet length the same as for the default algorithm (150 characters); (2) lucene_short – one-fragment snippets with average length 100 characters; (3) lucene_long – one-fragment snippets with average length 300 characters.

Figure 1(a) presents the assessment-based pairwise preference metrics for two algorithms – "default" and "lucene_2f". We have 547 manual judgements for snippet pairs, and they show that assessors clearly prefer the "default" snippets. Figure 1(b) presents behavior-based metrics for the four snippet generation algorithms. The figure shows an agreement between the session success-based metric SSR2 and ctr@1/relevant. But there is also an apparent disagreement between some of the behavioral metrics and the assessor-based preferences. Surprisingly, we did not find a significant correlation between the perceived (assessor-based) snippet quality and searcher success metrics. Furthermore, we found that the algorithm preferences are consistent for different players, but diverge when comparing algorithm preferences for different questions. Using a larger set of questions might alleviate this problem.

3. DISCUSSION
We proposed a novel search snippet evaluation framework that allows us to measure how well the snippets help searchers to satisfy their information needs. While our current results are not conclusive, we believe that behavior/task based evaluation is a promising research direction, in particular for the mobile search setting where the SERP screen is small, and for the so-called "instant answers" setting.

The main challenges in measuring the influence of snippet quality on session success are to distinguish the effect of the snippet generation algorithm from other factors that affect session success, such as: different types of information needs and search task difficulty, users' skills and experience, search engine ranking quality and interface, good abandonment and rich snippet interfaces. Our framework allows us to fix some of the experimental parameters, while experimentally varying the snippet generation algorithms to enable direct evaluation of the effect of changes to snippet generation on search success.

ACKNOWLEDGMENTS
This work was supported by the National Science Foundation grant IIS-1018321, the DARPA grant D11AP00269, and by the Russian Foundation for Basic Research Grant 12-07-31225.

4. REFERENCES
[1] M. Ageev, Q. Guo, D. Lagun, and E. Agichtein. Find it if you can: A game for modeling different types of web search success using interaction data. In Proc. of SIGIR, 2011.
[2] A. Chuklin and P. Serdyukov. Potential good abandonment prediction. In WWW (Companion Volume), 2012.
[3] C. Clarke, E. Agichtein, S. Dumais, and R. White. The influence of caption features on clickthrough patterns in web search. In Proc. of SIGIR. ACM, 2007.
[4] T. Kanungo and D. Orr. Predicting the readability of short web summaries. In Proc. of WSDM, 2009.
[5] S. Liang, S. Devlin, and J. Tait. Evaluating web search result summaries. Advances in Information Retrieval, 2006.
[6] D. Savenkov, P. Braslavski, and M. Lebedev. Search snippet evaluation at Yandex: lessons learned and future directions. Multilingual and Multimodal Information Access Evaluation, 2011.



Towards Measures and Models of Findability

Leif Azzopardi, Colin Wilkie
University of Glasgow
Glasgow, United Kingdom
{Leif.Azzopardi, Colin.Wilkie}@glasgow.ac.uk

Tony Russell-Rose
UX Labs
Guildford, United Kingdom
[email protected]

ABSTRACT
This poster paper outlines our current project, which aims to develop models of how users search and browse for information. These will be used to create measures of findability, providing an estimate of how easily a document can be found within a given collection.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Search Process

Terms: Theory, Experimentation, Human Factors

Keywords: Retrieval Strategies, Simulation

1. PROJECT OVERVIEW
Designing and organizing a website or any other information space is a complex process where the goal is to create structure around content that enables users to complete their tasks (such as finding the relevant information and performing associated transactions) in a seamless and efficient manner. Typically, Information Architecture techniques are applied in an attempt to improve a site's usability and the overall user experience by optimizing "the structural design of an information space to facilitate task completion and intuitive access to content" [6]. Although there are numerous principles and heuristics that have been developed [6], Information Architecture, as a discipline, lacks formal models for evaluating or predicting whether such techniques will improve the usability of a website. In this project, we aim to develop formal models to measure, analyze and evaluate how easily a site can be navigated and the key resources within it can be retrieved. Such a model would provide a way to objectively measure what is colloquially termed "Ambient Findability", i.e. the ease with which a given information object can be found [6]. If the structure of a particular website precludes users from intuitively and easily locating key resources, then in competitive online environments users are likely to abandon the site in favour of alternative sites that provide competing services or information. For example, online retailers need to ensure that users can quickly and easily locate the products and services offered along with related information such as product reviews, help files, and other supporting information. Consequently, the ability to determine or predict the findability of information objects is vital.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright is held by the author/owner(s). SIGIR 2013 Workshop on Modeling User Behavior for Information Retrieval Evaluation (MUBE 2013), August 1, 2013, Dublin, Ireland.

While various measures have been developed to objectively quantify how easily a user can navigate a website, or how easily a user can retrieve a particular page [8, 12, 1, 2], these techniques are often very coarse-grained in nature. They ignore the needs or goals of users in their calculations, which directly influence how easily a user could locate an object. They are essentially agnostic to design, and thus ignore key issues such as layout, location and the visibility of links within web pages, and the user's broader task context. Moreover, they assume each user is equally likely to select a given link (i.e. acting as a random surfer), and ignore well-known patterns and strategies of user behavior (see Figure 1). Thus, they provide little value to practicing Information Architects who require more sophisticated, behavioral measures that accommodate these dimensions. To this end, this project will attempt to model more accurately the interaction of users within websites to provide an estimate of how easily pages can be found based on specific user personas and scenarios (and thus the users' underlying information needs and intentions).

2. BACKGROUND
Intuitively, the more findable content is, the more likely it is to be viewed and consumed, and vice versa, such that: if no one views your content/resource, then no one will buy your products, solicit your services, or cite your papers. The systems, structures and content used to provide relevant information to web users play a major role in shaping what information is findable. For example, when a user visits a website trying to find a particular product, they must either try to navigate to the product given the navigational structure, try to search for the product using the site-search provided, or undertake a combination of searching and browsing. If the structure is counter-intuitive or the site-search is not effective then the user will have difficulties in finding what they want. This often leads to abandoning the search, and the site, in favor of using other sites. As reported in [12], 56% of users were often confused by the structure of a website and had major difficulties in finding what they wanted. Sites that can ensure that the users are able to complete their goals and tasks in an efficient and effective manner are more likely to be successful [7].

Figure 1: Left: Browsing assumed independent of content. Middle Left: Layout and Size affect browsing. Middle Right: Information Need affects browsing and searching. Right: High Level State Transition Diagram modeling a user with an information need, and their interactions within a website.

However, a number of developments have been made recently which do provide ways in which particular aspects of "Ambient Findability" can be measured. One of the first measures employed was coverage, which represents the amount of a site crawled by a search engine [5]. By ensuring that the information is visible to web crawlers, this increases the likelihood of the information being indexed by search systems. Dasgupta et al. [4] referred to this as discoverability, while Upstill et al. [9] referred to it as crawlability.

Once crawled and indexed, the findability then depends on the search engine itself, and how retrievable the search engine makes pages. Here, retrievability measures were created to estimate how easily a document can be retrieved using a search engine [1, 2]. On the other hand, [3, 8, 12] considered findability in terms of browsing (i.e. navigability), and measured the ease with which users can navigate through websites¹.
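To make the notion of retrievability concrete, the sketch below computes a cumulative retrievability score in the spirit of [1, 2]: r(d) = Σ_q o_q · f(rank of d for q, c), where f is 1 if the document is ranked within a cutoff c and 0 otherwise. This is a simplified illustration under stated assumptions (uniform query opportunity o_q, a rank-cutoff access function, a generic search_fn stand-in for any ranker), not the project's implementation; summarizing the resulting distribution with a Gini coefficient is one common way to quantify retrievability bias.

```python
from collections import Counter
from typing import Callable, Iterable, List

def retrievability(queries: Iterable[str],
                   search_fn: Callable[[str], List[str]],
                   cutoff: int = 10) -> Counter:
    """Cumulative retrievability r(d) = sum_q o_q * f(rank(d, q), cutoff).

    Assumes a uniform query opportunity o_q = 1 and a cutoff-based access
    function f: 1 if d is ranked at or above `cutoff` for q, else 0.
    `search_fn(q)` is assumed to return a ranked list of document ids.
    """
    r = Counter()
    for q in queries:
        for rank, doc_id in enumerate(search_fn(q), start=1):
            if rank > cutoff:
                break
            r[doc_id] += 1        # o_q * f(rank, cutoff) with o_q = 1
    return r

def gini(values: List[float]) -> float:
    """Gini coefficient of the retrievability scores (0 = equal, 1 = maximally biased)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```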

3. PROJECT AIMS
In this project, our aims are two-fold:

1. Develop probabilistic models of how users interact within a website through searching and browsing

2. Develop measures that estimate how easily users can find content.

To this end, we have performed an initial analysis looking at the relationship between the usage of a site and various navigability and retrievability measures [11]. This was to evaluate the current baselines. Here we found that navigability measures held the strongest correlation with usage, providing a strong baseline. Our subsequent work will focus on making more sophisticated probabilistic models of interaction that use information needs, information cues, and page features to more accurately model browsing behavior (as shown in Figure 1), because these additional factors affect how users browse and search through a site. For example, if a link is at the top of a webpage it is more likely to be clicked than one at the bottom of the webpage, and if the link is more prominent due to its position, size and colour, it is more likely to be clicked. However, whether users perform such an action also depends upon their information need and their underlying task. Consequently, our models will include each of these factors; a sketch of such a link-selection model is given below.
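As an illustration of the kind of probabilistic browsing model being proposed, the following sketch scores each link on a page by features such as position, size and textual match with the user's information need, and turns the scores into click probabilities with a softmax. The features, weights and scoring function are illustrative assumptions, not the project's model.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Link:
    target: str
    position: float   # 0.0 = top of page, 1.0 = bottom
    size: float       # relative visual prominence (normalized area)
    label: str        # anchor text

def click_probabilities(links: List[Link], need_terms: List[str],
                        w_pos: float = 2.0, w_size: float = 1.0,
                        w_match: float = 3.0) -> Dict[str, float]:
    """Softmax over a linear score of position, size and label/need overlap."""
    def score(link: Link) -> float:
        match = sum(t.lower() in link.label.lower() for t in need_terms)
        return w_pos * (1.0 - link.position) + w_size * link.size + w_match * match

    scores = [score(l) for l in links]
    z = sum(math.exp(s) for s in scores)
    return {l.target: math.exp(s) / z for l, s in zip(links, scores)}

# Example: a user looking for sci-fi novels on a site front page
links = [Link("/books", 0.1, 0.6, "Books"),
         Link("/music", 0.5, 0.4, "Music"),
         Link("/films", 0.9, 0.4, "Films")]
probs = click_probabilities(links, ["books", "novels"])
```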

In [10], we investigated the relationship between retrievability and retrieval performance. Here we found that the retrievability bias of a system is correlated with retrieval performance, and that systems could be tuned using retrievability. We shall extend this work in two ways: (1) create a findability measure that considers both retrieval and navigation within the same model (i.e. combine navigability and retrievability measures), and (2) perform a user study to determine whether this measure reflects the ease with which users can find certain pages by searching and browsing.

¹Note that PageRank and HITS can also be thought of as navigability measures that model a random surfer.

If we can validate our models and measures, then it will be possible to develop tools to help Information Architects analyse websites: providing them with insights into how and what content is findable (and thus used), along with what features (terms/links/etc.) make pages easy or hard to find.

Acknowledgements: This work is supported by the EPSRC Project, Models and Measures of Findability (EP/K000330/1).

4. REFERENCES
[1] L. Azzopardi and V. Vinay. Accessibility in information retrieval. In Proc. of ECIR'08, pages 482–489, Glasgow, 2008.
[2] L. Azzopardi and V. Vinay. Retrievability: an evaluation measure for higher order information access tasks. In Proc. of the 17th ACM CIKM'08, pages 561–570, 2008.
[3] E. H. Chi, P. Pirolli, and J. Pitkow. The scent of a site: a system for analyzing and predicting information scent, usage, and usability of a web site. In Proc. of CHI'00, pages 161–168, 2000.
[4] A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins. The discoverability of the web. In Proc. of WWW'07, pages 421–430, 2007.
[5] S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400(6740):107–107, 1999.
[6] P. Morville. Ambient Findability. O'Reilly Media Inc., Sebastopol, CA, USA, 2005.
[7] J. Palmer. Web site usability, design and performance metrics. Information Systems Research, 13(2):151–167, 2002.
[8] S. Pandit and C. Olston. Navigation-aided retrieval. In Proc. of WWW'07, pages 391–400, 2007.
[9] T. Upstill, N. Craswell, and D. Hawking. Buying bestsellers online: A case study in search & searchability. In 7th Australasian Document Computing Symposium, Sydney, Australia, 2002.
[10] C. Wilkie and L. Azzopardi. Relating retrievability, performance and length. In To appear in SIGIR'13.
[11] C. Wilkie and L. Azzopardi. An initial investigation on the relationship between usage and findability. In Advances in Information Retrieval, volume 7814, pages 808–811, 2013.
[12] Y. Zhou, H. Leung, and P. Winoto. MNav: A Markov model-based web site navigability measure. IEEE Transactions on Software Engineering, 33:869–890, 2007.



Modelling the information seeking user by the decisions they make

Peter Bruza
Queensland University of Technology
Brisbane, Australia
[email protected]

Guido Zuccon
CSIRO Australian e-Health Research Centre
Brisbane, Australia
[email protected]

Laurianne Sitbon
Queensland University of Technology
Brisbane, Australia
[email protected]

ABSTRACT
The article focuses on how the information seeker makes decisions about relevance. It will employ a novel decision theory based on quantum probabilities. This direction derives from mounting research within the field of cognitive science showing that decision theory based on quantum probabilities is superior to standard probability models for modelling human judgements [2, 1]. By quantum probabilities, we mean that the decision event space is modelled as a vector space rather than the usual Boolean algebra of sets. In this way, incompatible perspectives around a decision can be modelled, leading to an interference term which modifies the law of total probability. The interference term is crucial in modifying the probability judgements made by current probabilistic systems so they align better with human judgement. The goal of this article is thus to model the information seeking user as a decision maker. For this purpose, signal detection models will be sketched which are in principle applicable in a wide variety of information seeking scenarios.

1. RELEVANCE JUDGEMENTS AS DECISIONS

Decades of research have uncovered a whole spectrum of human judgement that deviates substantially from what would be normatively correct according to logic and probability theory. Probability judgement errors have been found so consistently that they have names, e.g., the "conjunction fallacy", in which subjects readily judge the conjunction of events A and B to be more likely than either of the individual events, e.g., Pr(A,B) > Pr(A). These findings are barely known, let alone accounted for, in current systems for information retrieval and recommendation, which strictly adhere to the laws of probability. Therefore, they do not always account for how humans make decisions, rendering them potentially less useful.

From a cognitive point of view, the key to explaining the conjunction fallacy is the incompatibility of subspaces. Consider Figure 1. The perspective around deciding document 1's relevance is represented as a two-dimensional vector space where the basis vector R1 corresponds to the decision "document 1 is relevant" and its complement R̄1 corresponds to "document 1 not being relevant". A similar two-dimensional vector space corresponds to the decision perspective around document 2.

Copyright is held by the author/owner(s). SIGIR 2013 Workshop on Modeling User Behavior for Information Retrieval Evaluation (MUBE 2013), August 1, 2013, Dublin, Ireland.

Figure 1: Incompatible subspaces in a decision on relevance (basis vectors R1, R̄1 and R2, R̄2, the angle θ between the perspectives, and the cognitive state Ψ)

Initially, the cognitive state of the decision maker is represented by the vector Ψ, which is suspended between both sets of basis vectors. This situation represents the subject being undecided about whether document 1 or document 2 is relevant. Suppose the subject now decides that document 1 is relevant. This decision is modelled by Ψ "collapsing" onto the basis vector labelled R1. (The probability of the decision corresponds to the square of the length of the projection of the cognitive state Ψ onto the basis vector R1, denoted ‖P1ψ‖².) Observe how the subject is now necessarily uncertain about document 2's relevance, because the basis vector R1 is suspended between the two basis vectors R2 and R̄2 by the angle θ. The hallmark of incompatibility is the state of indecision from one perspective (e.g., the relevance of document 2) when a decision is taken from another (e.g., document 1 is deemed relevant). Incompatibility means that deciding on the relevance of document 1 may mean uncertainty about the relevance of document 2, which necessarily implies the information seeker can't form the joint probability Pr(R1, R2) [1]. (This is crucially different to the situation in standard probability theory, in which events are always compatible, and thus the joint probability is always defined.)

The consequence of incompatibility is an interference term Intf which is central to modelling human decision behaviour in the face of the conjunction fallacy. This term modifies the law of total probability, allowing standard probabilities to be augmented by human judgement [1]. The partial derivation below shows that the interference term Intf appears when the decision whether document 1 is relevant is made in relation to the incompatible subspace corresponding to the relevance of document 2 (represented by the projector $P_2$ and its dual $P_2^{\perp}$):

$$\begin{aligned}
\Pr(R_1) &= \|P_1\psi\|^2 &\quad (1)\\
&= \|(P_1 \cdot I)\psi\|^2 &\quad (2)\\
&= \|(P_1 \cdot (P_2 + P_2^{\perp}))\psi\|^2 &\quad (3)\\
&= \|P_1 P_2 \psi\|^2 + \|P_1 P_2^{\perp}\psi\|^2 + \mathrm{Intf} &\quad (4)
\end{aligned}$$

The intuition behind equation 4 is that the information seeker is undecided about the relevance of document 1 due to the perspective created by document 2. This indecision can be viewed as the subject "oscillating between two minds", which produces wave-like dampening or enhancement of the law of total probability via the interference term Intf [1]: $\Pr(R_1) = \Pr(R_1, R_2) + \Pr(R_1, \bar{R}_2) + \mathrm{Intf}$. When events are compatible, Intf = 0, and therefore the law of total probability holds, meaning that the probabilities adhere to standard probability theory. The interference term Intf is a recent development in the literature on cognitive decision theory [1]. Models that employ it have been labeled "quantum cognition", as the underlying probability theory is derived from quantum physics.

2. MODELLING THE DYNAMICS OF RELEVANCE DECISIONS

We propose to view judgments of relevance as a signal-detection decision task, and thereby open the door to applying decision models developed in cognitive science. The "signal" is a relevant fragment of information (say in a document), which is present amongst irrelevant information, i.e., "noise". We sketch two signal-noise models: a Markov model and a quantum model.

The subject's cognitive state will be modelled as a 7-dimensional vector spanning the grades of relevance {−3, …, +3}, where −3 denotes a highly irrelevant and +3 a highly relevant judgement. It is usual to assume at t0 that each grade is equally likely, but for different information seeking tasks this can be varied. For example, [3] showed that the rank of a document affects the mean relevance grade given. In this case, the values in the initial cognitive state vector can be biased towards higher scores, because when documents are presented in rank order, relevant documents are more likely to be seen first.

In the quantum signal detection model the initial state will be evolved using the Hamiltonian matrix H. The intuition behind H is that it models how the user moves between different grades of relevance, e.g., the entry in cell $h_{ij}$ of H represents amplitude diffusing from relevance grade i to grade j. A parameter σ determines the diffusion rate out of a particular relevance grade, and the parameter µ determines the diffusion rate back into that grade. The matrix H is used to model the dynamics of the graded relevance decision over time as a unitary operator $U(t) = e^{itH}$ [1]. In the Markov signal detection model, standard practice will be adopted: a seven-state intensity matrix K will be used instead of the Hamiltonian H, and the dynamics are provided by the Kolmogorov forward equation $T(t) = e^{tK}$.
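The following sketch, a minimal illustration rather than the authors' implementation, shows how such dynamics could be simulated numerically: a uniform 7-dimensional initial state is evolved with $U(t) = e^{itH}$ (quantum) or $T(t) = e^{tK}$ (Markov), and the probability of each relevance grade is read off at time t. The tridiagonal choices of H and K and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

GRADES = np.arange(-3, 4)                 # relevance grades -3 .. +3

def tridiagonal(n, diag, off):
    """Simple tridiagonal generator used for both H and K in this sketch."""
    m = np.zeros((n, n))
    np.fill_diagonal(m, diag)
    idx = np.arange(n - 1)
    m[idx, idx + 1] = off
    m[idx + 1, idx] = off
    return m

n = len(GRADES)
sigma, mu = 1.0, 0.5                      # illustrative diffusion parameters
H = tridiagonal(n, mu, sigma)             # Hamiltonian for the quantum model
K = tridiagonal(n, 0.0, sigma)            # intensity matrix for the Markov model
np.fill_diagonal(K, -K.sum(axis=0))       # columns sum to zero, so probability is conserved

# Quantum model: probabilities are squared magnitudes of the evolved amplitude vector.
psi0 = np.ones(n) / np.sqrt(n)            # equally likely grades at t0
def quantum_grade_probs(t):
    return np.abs(expm(1j * t * H) @ psi0) ** 2

# Markov model: probabilities evolve directly via the Kolmogorov forward equation.
p0 = np.ones(n) / n
def markov_grade_probs(t):
    return expm(t * K) @ p0

print(dict(zip(GRADES, np.round(quantum_grade_probs(1.0), 3))))
print(dict(zip(GRADES, np.round(markov_grade_probs(1.0), 3))))
```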

3. EXPERIMENTAL DESIGN
Figure 2 illustrates the dynamics: at any time t, the probability of relevance at a particular grade j can be computed depending on the experimental condition (derived from [3]).

Figure 2: Quantum and Markov signal-noise relevance model simulations

The aim of the experimental conditions is to establish how incompatibility in the user's cognitive space influences relevance judgements. A between-subjects design is proposed. In the first condition (A), subjects will be shown a single document d and asked to rate its relevance to the scenario on a 7-point scale. The subject is directed not to consider scores given to preceding documents, as in [3], i.e., relevance judgements are assumed compatible. In the second condition (B), pairs of documents will be shown together, one above the other, and the subject asked to rate first the top document and then the one below (which is the same d as in condition A). They will not be instructed to take a relevance decision in isolation, i.e., allowing for incompatibility. The graded relevance judgement for each document will be recorded and time-stamped for each subject.

In the first condition (A), where there is no order in the decision, the probability of grade j at time t can be computed by $\|P^A_j \cdot U(t)\psi\|^2$, where $P^A_j$ denotes the projector corresponding to relevance grade j in condition A, and U(t) the unitary operator at time t. In the second condition, where order is imposed, first B (a decision taken at time t′) and then A (a decision taken at a later time t), the probability of relevance at grade j, following a grade i decision in condition B, is now given by $\|P^A_j \cdot U(t - t') \cdot P^B_i \cdot U(t')\psi\|^2$. The difference between these two probabilities allows the interference term Intf to be computed [1]. An analogous approach can be taken with the Markov model, but where the dynamics are driven by the operator T(t).

4. REFERENCES
[1] J. Busemeyer and P. Bruza. Quantum cognition and decision. Cambridge University Press, 2012.
[2] J. Busemeyer, E. Pothos, R. Franco, and J. Trueblood. A quantum theoretical explanation for probability judgment errors. Psychological Review, 118(2):193–218, 2011.
[3] M. Eisenberg and C. Barry. Order effects: A study of the possible influence of presentation order on user judgments of document relevance. Journal of the American Society for Information Science, 39(5):293–300, 1988.



Simulating User Selections of Query Suggestions

Jiepu Jiang
School of Information Sciences, University of Pittsburgh
[email protected]

Daqing He
School of Information Sciences, University of Pittsburgh
[email protected]

ABSTRACT
In this paper, we identify that simulating user selections of query suggestions is one of the essential challenges for automatic evaluation of query suggestion algorithms. Some examples are presented to illustrate where the problem lies. A preliminary solution is proposed and discussed.

1. MOTIVATION
Recent studies [3, 11] found that user ratings of query quality are very different from those evaluated using system-oriented metrics (e.g. nDCG@10 of query results). According to these findings, it is very likely that users may not be able to identify the good queries when provided with a list of query suggestions. Therefore, we believe that user selections of query suggestions should be considered in the evaluation of query suggestion algorithms, so as to better understand the effectiveness of query suggestions for practical systems and users.

Existing methods either did not consider this issue or adopted unwarranted assumptions. For example, Dang et al. [1] evaluated a list of query suggestions by the performance of the best query, which implicitly assumes that users can make perfect judgments and adopt the best query. Sheldon et al. [10] evaluated by the average performance of the suggested queries, assuming that users randomly select query suggestions. To illustrate where the problem lies, we present two examples where the existing methods show their limitations.

Example 1. How many suggested queries should be presented?

With more queries being presented, the maximum performance of the queries will probably be enhanced. However, the average performance of the queries will probably decline, because query suggestion algorithms are designed to rank good queries at higher positions. In such cases, the existing evaluation methods come to conflicting conclusions. In order to determine the number of queries to be presented, we need to investigate the target user groups and their ability to judge queries. Users with high judging ability are more likely to avoid selecting ineffective queries and to benefit from a long list of query suggestions, while for users with low judging ability, increasing the number of query suggestions may lead to a decline of search performance.

Example 2. Should we present query suggestions or not?

Query suggestions can lead to a decline of search performance if users adopt suggested queries that underperform those the users could have reformulated themselves. If the quality of the user's own query reformulations lies between the average and the best performance of the suggested queries, the query suggestions are very likely beneficial for users with high judging ability but risky for those with low judging ability. In addition, the performance of query suggestions also depends on the quality of the users' own reformulations. As Kelly et al. found [7], "query suggestions seem to have an advantage when subjects face a cold-start problem and when they exhaust their own ideas for searches", where the users cannot reformulate effective queries.

The rest of the paper discusses a preliminary solution.

2. METHODS
We are particularly interested in a specific time of a search session when a user has issued several queries and is now offered a ranked list of query suggestions. The user can either adopt one suggested query or issue his/her own reformulation for the next round of search. We believe that an evaluation model for query suggestion should examine how helpful a suggested query is for the user at that specific time of the session. We use the following notations in further discussions:

S = {q1, …, qn}: a list of n query suggestions presented to the user.

C: a candidate set of query suggestions actually judged by the user. C may include only part of the queries in S.

q0: the user's own query reformulation in mind, i.e. the user will issue q0 if no query suggestions are offered.

q': the actual query adopted by the user.

u(q): a measure of the utility of the query q.

The evaluation can be achieved by calculating the expected difference between the utility of the follow-up searches with and without the query suggestions, as shown in the left side of Eq(1). Here we simply assume that u(q0) is a constant provided by evaluation datasets, so that the main challenge is to measure E(u(q')), as shown in the right side of Eq(1).

$$E[u(q') - u(q_0)] = E[u(q')] - u(q_0) \qquad (1)$$

Clearly, the optimal performance will be achieved if q' is the best query (by the utility measure u) among S and q0. However, the query actually adopted by the user may not be the real best one. First, the user may not be persistent enough to judge all the query suggestions, since it takes time and effort to judge each query. Second, the user may not be able to identify the best query among those being judged. This is because the user's perceived utility of queries may be different from the actual utility, since the user does not know the search results when judging queries.

Therefore, we further calculate E(u(q')) using a two-step approach, as shown in Eq(2). First, we calculate the probability that the user will judge a possible subset S' of S, i.e. P(S'|S). Second, we calculate the probability that the user will select a query qi among q0 and S' for the next round of search, i.e. P(q'=qi|q0, S'). Finally, we marginalize over all possible selected queries to obtain the expected utility of the selected query.

$$E[u(q')] = \sum_{S' \subseteq S} P(S' \mid S) \sum_{q_i \in \{q_0\} \cup S'} P(q' = q_i \mid q_0, S')\, u(q_i) \qquad (2)$$

2.1 Users' Judging Persistence
To estimate P(S'|S), we adopt a model similar to the browsing model in rank-biased precision (RBP) [9]. We assume that the user judges the queries in S in the sequence in which they are ranked by the system. The user will always judge the first query in S. After judging a query, the user has probability pnext of continuing to judge the next one, or 1 – pnext of stopping and selecting the believed best query among those judged so far. Suppose Sk is {q1, q2, …, qk} (k ≤ n), i.e. the user has judged this subset of queries and stops after judging the kth query qk; we can then calculate P(Sk|S) as in Eq(3):

$$P(S_k = \{q_1, q_2, \ldots, q_k\} \mid S) = (1 - p_{next})\, p_{next}^{\,k-1} \qquad (3)$$

Copyright is held by the author/owner(s). SIGIR 2013 Workshop on Modeling User Behavior for Information Retrieval Evaluation (MUBE 2013), August 1, 2013, Dublin, Ireland.
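Equations (2) and (3) can be combined into a small computation: enumerate the judged prefixes S_k with probability (1 − p_next) p_next^(k−1), and, given a selection-probability model, accumulate the expected utility of the adopted query. The sketch below is an illustrative reading of the model under simplifying assumptions: it gives the final prefix S_n the leftover probability mass so the prefix probabilities sum to one, and it takes the selection probabilities P(q'=q_i | q_0, S') as an input function, such as the tournament estimate described in Section 2.2.

```python
from typing import Callable, Dict, List

def expected_utility(utilities: List[float],   # u(q_1), ..., u(q_n) for the ranked suggestions S
                     u_q0: float,              # utility of the user's own reformulation q_0
                     p_next: float,
                     select_prob: Callable[[float, List[float]], Dict[int, float]]) -> float:
    """E[u(q')] following Eq (2), with judging persistence P(S_k|S) from Eq (3).

    select_prob(u_q0, judged_utilities) is assumed to return a mapping
    {-1: P(select q_0), i: P(select the i-th judged suggestion)}.
    """
    n = len(utilities)
    expected = 0.0
    for k in range(1, n + 1):
        # P(S_k|S): user stops after judging the k-th suggestion; the final
        # prefix absorbs the remaining mass so the probabilities sum to 1.
        p_sk = (1 - p_next) * p_next ** (k - 1) if k < n else p_next ** (n - 1)
        judged = utilities[:k]
        probs = select_prob(u_q0, judged)
        e_select = probs.get(-1, 0.0) * u_q0 + sum(
            probs.get(i, 0.0) * u for i, u in enumerate(judged))
        expected += p_sk * e_select
    return expected
```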



2.2 Users' Judging Ability
For a judged set of queries S' (in which all queries are judged), we assume that the user will compare each query in S', as well as the user's own query q0, to select a believed best one. We use C for the set of candidate queries, including q0 and S'. We have the following intuitions regarding query selection:

Intuition 1: The better the user's judging ability is, the more likely it is that the user can select the best query in C.

Intuition 2: The better a query's quality is, the more likely it is that the user will select the query.

We adopt a parameter pjudge for the user’s judging ability. pjudge is defined as the probability of making a correct pairwise judgment on the utility of two queries (i.e. making a correct statement on which query has the higher utility). pjudge = 0.5 indicates that the user’s judgments are no better than random selection; pjudge > 0.5 indicates that the user has a general capability of judging queries (the higher the value of pjudge, the better the user’s judging ability); pjudge < 0.5 indicates that the user’s judgments are opposite to those measured by u.

We assume that users select their believed best queries through a round-robin tournament process involving pairwise judgments over all possible pairs of candidate queries, as shown in Table 1. We executed the simulated tournament a large number of times and recorded the outcome of each iteration. Finally, we estimate the probability of selecting a query in C by maximum likelihood, i.e. the fraction of iterations in which that query was selected.

Table 1. Query selection tournament algorithm.

Algorithm: query-selection-tournament
Input: C{q1, q2, …, qm}
Output: one selected query
    init array scores[1…m]    // stores the scores of the m queries
    init array winner         // stores the selected query/queries
    for i from 1 to m – 1
        for j from i + 1 to m
            qx = believed better one of qi and qj (a random factor, with
                 probability pjudge being the actual better one of qi and qj)
            scores[x]++       // the judged better query earns 1 point
    winner = the query(s) with the highest score in scores
    if the winner array contains only one query
        return the query in winner
    else                      // more than one "winner"
        return query-selection-tournament(winner)   // recursion on the tied queries
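A runnable version of this tournament, used here only to illustrate how P(select q_i | C) could be estimated by simulation, might look as follows. This is a sketch, not the authors' code: utilities stand in for the "actual better" ordering, and ties are broken by recursing on the tied queries.

```python
import random
from collections import Counter
from typing import List

def tournament_select(utilities: List[float], p_judge: float,
                      rng: random.Random) -> int:
    """Return the index of the query selected by one simulated tournament."""
    def run(cands: List[int]) -> int:
        if len(cands) == 1:
            return cands[0]
        scores = Counter()
        for a in range(len(cands) - 1):
            for b in range(a + 1, len(cands)):
                qi, qj = cands[a], cands[b]
                better = qi if utilities[qi] >= utilities[qj] else qj
                worse = qj if better == qi else qi
                # with probability p_judge the user judges the pair correctly
                winner = better if rng.random() < p_judge else worse
                scores[winner] += 1
        top = max(scores.values())
        tied = [q for q in cands if scores[q] == top]
        return tied[0] if len(tied) == 1 else run(tied)   # recurse on ties

    return run(list(range(len(utilities))))

def selection_probabilities(utilities: List[float], p_judge: float,
                            trials: int = 100_000, seed: int = 0) -> List[float]:
    """Monte Carlo estimate of P(select q_i | C) for each candidate query."""
    rng = random.Random(seed)
    counts = Counter(tournament_select(utilities, p_judge, rng) for _ in range(trials))
    return [counts[i] / trials for i in range(len(utilities))]
```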

Figure 1. Estimated probabilities P(q(ith best)|C).

When a subset S' is determined, the probability of selecting a query in S' does not depend on the original rank of the query in S', but on the rank of the query in C = {q0} ∪ S' by the utility measure u(q). pjudge and the number of queries in C can also affect the selection probability. Let C be a set of 10 queries. Figure 1 shows the probability of selecting the ith best query in C, i.e. P(qith best|C), estimated after running the tournament 100,000 times. Although we have not derived a formula for the query selection probability, we noticed that it can be fitted by an exponential function. Figure 1 shows several examples. The trend lines fit the observed results almost perfectly, with R² > 0.99.

2.3 Experiment Settings
Dataset. Our model uses the user's own query reformulation for evaluation, which can be obtained from a session search dataset [6]. For a static session q(1), q(2), …, q(m), we can evaluate query suggestions for q(x) (x < m), so that the dataset can provide information on q(x+1).

Utility measures. The utility measures adopted in previous works include those based on human ratings [8], search logs [2], and the session-level evaluation metrics [4–6]. However, further studies are required to verify their validity in the evaluation of query suggestion algorithms.

3. DISCUSSION AND FUTURE WORKS
According to Figure 1, users with high judging ability are very likely to be able to identify comparatively better queries. For example, when pjudge = 0.8, the user has an over 80% probability of selecting one of the top 2 queries. However, when pjudge = 0.6 (a comparatively low judging ability), the user has about a 30% chance of selecting queries of below-average quality. This suggests two strategies of query suggestion for different user groups: for users with high judging ability, the algorithm can aim at increasing the upper-bound quality of the suggested queries (finding the best possible query suggestion); for users with low judging ability, it is rather risky to return ineffective queries, and therefore the algorithms should balance between "finding the best possible query suggestion" and "maintaining the overall quality of the query suggestions".

Our future work will focus on the verification and refinement of the proposed model through user experiments, including mainly the validation of utility measures and the following issues:

The current model does not consider the modeling of pjudge. Therefore, one direction of future work is to properly measure pjudge from user experiments and to model pjudge based on various factors. Besides user factors, pjudge may also be affected by the two queries being compared. As reported in [7], users "felt that the query suggestion system helped them in a variety of ways, some of which are not detectable from the log". The current model also assumes that the users' own query reformulations will not be affected by the query suggestions. However, it is possible that the suggested queries may also influence users' own query reformulations.

4. REFERENCES
[1] Dang, V. et al. 2010. Query reformulation using anchor text. In WSDM'10.
[2] Duarte Torres, S. et al. 2012. Query recommendation for children. In CIKM'12.
[3] Hauff, C. et al. 2010. A comparison of user and system query performance predictions. In CIKM'10.
[4] Järvelin, K. et al. 2008. Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions. In ECIR'08.
[5] Jiang, J. et al. 2012. Contextual evaluation of query reformulations in a search session by user simulation. In CIKM'12.
[6] Kanoulas, E. et al. 2011. Evaluating multi-query sessions. In SIGIR'11.
[7] Kelly, D. et al. 2009. A comparison of query and term suggestion features for interactive searching. In SIGIR'09.
[8] Ma, Z. et al. 2012. New assessment criteria for query suggestion. In SIGIR'12.
[9] Moffat, A. et al. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems, 27(1).
[10] Sheldon, D. et al. 2011. LambdaMerge: Merging the Results of Query Reformulations. In WSDM'11.
[11] Wu, W. et al. 2012. User evaluation of query quality. In SIGIR'12.

(Figure 1 data: P(qith best|C), the probability of selecting the ith best query from C, a candidate set of 10 queries, with exponential trend lines — pjudge = 0.4: y = 0.0201e^(0.2481x), R² = 0.9994; pjudge = 0.6: y = 0.3055e^(−0.246x), R² = 0.9992; pjudge = 0.8: y = 1.7393e^(−0.953x), R² = 0.9978.)



Incorporating Efficiency in Evaluation

Eugene Kharitonov†‡, Craig Macdonald‡, Pavel Serdyukov†, Iadh Ounis‡
†Yandex, Moscow, Russia
‡School of Computing Science, University of Glasgow, UK
†{kharitonov, pavser}@yandex-team.ru
‡{craig.macdonald, iadh.ounis}@glasgow.ac.uk

ABSTRACT
The trade-off between retrieval efficiency and effectiveness naturally arises in various domains of information retrieval research. In this paper, we argue that the users' reaction to the system's efficiency should be incorporated into the cascade model of user behaviour, and that an efficiency-aware modification of a state-of-the-art evaluation metric can be derived. We propose to base the study on anonymised browser/toolbar log data, and our preliminary experiments demonstrate that the users' reaction to the delays observed in the dataset is consistent with that reported previously.

1. INTRODUCTIONThe problem of measuring the efficiency and effectiveness

trade-off arises in a variety of retrieval settings. For instance,machine-learned rankers of commercial search engines canleverage thousands of regression trees, operating in space ofhundreds of features [2]. Increasing the number of the re-gression trees is likely to improve the ranking effectiveness,but it inevitably reduces the system efficiency. Indeed, whileimproving effectiveness improves the users’ experience withthe search engine, the overall users’ satisfaction might falldue to the system efficiency decrease. Completely differentapproaches to trade-off the system effectiveness for the sys-tem efficiency are considered in [6, 8, 9]. The central ideabehind that work is to introduce additional tools to trade-offthe retrieval effectiveness for the retrieval efficiency.

Since, from the user's point of view, network and query processing delays are indistinguishable, in this work we use a generalised notion of efficiency that incorporates the time it takes the user's browser to send the HTTP request to the search engine, the time needed for the search engine to process the query, and the time required for the browser to receive the result page over the network.

The question arises of how to measure the user's overall satisfaction with the system performance, incorporating effectiveness and efficiency in a single metric. The commonly used approaches are not experimentally justified to be aligned with the user's satisfaction. For instance, Wang et al. [9] used the harmonic mean of effectiveness and efficiency metrics. Later, the same authors used a linear combination of efficiency and effectiveness metrics [8], as this presents an easier optimisation target for machine learning. However, the contrasting approaches suggest that a founded measure that represents both efficiency and effectiveness remains an open problem. At the same time, the metric used can considerably influence the trade-off optimum, and it is vital to ensure progress in that domain of research. Before discussing the proposed approach to building such a metric, in the next section we describe the dataset used in our research and a preliminary study of the influence of search delays on user clicking behaviour.

2. DATASET AND EXPERIMENTS
As a dataset in our experiments, we use the anonymised user behaviour data obtained from Yandex' Browser and Toolbar (browser.yandex.com/ and element.yandex.com/). This data contains the delays that users experience while accessing Yandex result pages, as well as information about the users' search result clicking behaviour. We believe that studying the influence of the natural delays occurring in the user's everyday interaction with the search engine is promising due to the availability of massive amounts of data. In contrast, previous studies [1, 7] introduced artificial delays into the user's interactions and are thus limited both in time and in user span.

The search result page access data obtained from the browser and the toolbar can be linked to the anonymised query log data, providing us with the data required to model user behaviour in the presence of delays: search result pages with the associated user clicks and delays. The experimental dataset is sampled from the logs spanning the period 13th-27th May 2013. To reduce the sparseness of the user-related statistics, we remove all users with fewer than 5 sessions during the two-week period. In order to reduce the influence of the previous search context, we consider only the first queries in sessions. The resulting dataset includes 3M users, 230K unique query-search result page pairs, and 10M sessions.

Since the perception of the delay might be extremely user-centric (e.g. some users are used to their slow Internet connection), for each user we calculate the median delay they experience while using Yandex and subtract it from all their session delays, thus reducing the personal bias.
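As a minimal sketch (Python with pandas; the column names are illustrative, not the actual log schema), this per-user normalisation amounts to:

    import pandas as pd

    # One row per session; 'delay_ms' is the observed page-load delay.
    sessions = pd.DataFrame({
        "user_id":  [1, 1, 1, 2, 2],
        "delay_ms": [150.0, 180.0, 400.0, 90.0, 120.0],
    })

    # Subtract each user's median delay from all of that user's sessions,
    # reducing the personal bias described above.
    user_median = sessions.groupby("user_id")["delay_ms"].transform("median")
    sessions["delay_centered"] = sessions["delay_ms"] - user_median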

In order to understand how delays influence the users' behaviour, we perform the following experiment. Using the available click log data, we simulate eight search engines with exactly the same effectiveness but different efficiency. The simulation procedure is inspired by the procedure used by Chapelle et al. [3] to evaluate effectiveness measures. For each unique combination of a query and a search result page (SERP), we sort all associated sessions according to the delay time experienced by the user. Next, we split the obtained session list into ten sub-lists of equal length, so that the first sub-list contains the sessions with the smallest delay (i.e. the fastest 10% of sessions), while the last sub-list contains the sessions with the highest delay. We discard the first and the last sub-lists as they are likely to contain various outliers. The rest of the sub-lists are considered as "simulated" search engines. Notably, these sub-lists contain exactly the same query-SERP pairs, as well as the same number of sessions associated with them, with only the delays varied. In Figure 1 we report the relative increase in the abandonment rate (AR) with respect to the first simulated search engine, as the relative delay increases. An increase of approximately 121% in the abandonment rate is reached when the delay time is increased by 540%.
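The binning procedure described above can be sketched as follows (Python; the session representation and field names are assumptions made for illustration, not the authors' implementation):

    from collections import defaultdict

    def simulated_engines(sessions, n_bins=10):
        """sessions: iterable of dicts with keys 'query_serp', 'delay', 'abandoned'.
        Returns the retained bins (the fastest and slowest bins are discarded)."""
        by_pair = defaultdict(list)
        for s in sessions:
            by_pair[s["query_serp"]].append(s)

        bins = [[] for _ in range(n_bins)]
        for pair_sessions in by_pair.values():
            pair_sessions.sort(key=lambda s: s["delay"])
            size = len(pair_sessions) // n_bins
            for b in range(n_bins):
                bins[b].extend(pair_sessions[b * size:(b + 1) * size])
        return bins[1:-1]   # eight "simulated" engines, same SERPs, different delays

    def abandonment_rate(bin_sessions):
        return sum(s["abandoned"] for s in bin_sessions) / len(bin_sessions)

    def relative_ar_increase(bins):
        base = abandonment_rate(bins[0])
        return [100.0 * (abandonment_rate(b) - base) / base for b in bins]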

From the figure it can be observed that the delays do influence the users' abandonment behaviour. Indeed, there is statistically significant evidence that lower delays cause smaller abandonment rates. While such a dependency between user dissatisfaction and higher delays has previously been reported in [1, 7], our results support the validity of the proposed experimental setup, which, as discussed, differs from the one used in previous work. On analysing the figure, we also notice that the increase in the abandonment rate grows approximately linearly with the increase in the delay time.

Overall, our results exhibit the expected trend between delays and user dissatisfaction. In the next section, we discuss our plan to incorporate the research on user behaviour into an evaluation metric.

3. PROPOSED APPROACH
Following the work on state-of-the-art effectiveness evaluation metrics, such as Expected Reciprocal Rank (ERR) [3] and Expected Browsing Utility (EBU) [10], we devise our proposed metric based on user behaviour modelling. As a first step, we aim to model user behaviour in the presence of system efficiency variability. In particular, we incorporate the user's tolerance to the system's delays into the cascade model [4] of clicking behaviour, which underlies ERR. In order to gain insight into how this can be achieved, we propose to train the Dynamic Bayesian Network (DBN) click model with abandonment [4] on the sessions from bins with different delays. As a result, we might expect the following model parameters to depend on the retrieval delay: the probability of examining the first document; the probability of abandoning after examining a non-relevant document (the user's tolerance to irrelevant results); and the probability of clicking on results (readiness to spend time checking whether the landing page is relevant). Furthermore, this delay-aware modification of the cascade model can be used to predict the probability of satisfying the user, given document labels and the system delay. In the final step, the probability of user satisfaction can be used as an evaluation metric, similar to the extension of ERR to address abandonment proposed by Chapelle et al. [3].
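One way to make this direction concrete, purely as an illustration and not as the metric ultimately proposed here, is to let the probability of examining the first result decay with the observed delay; the exponential form and the parameter alpha below are assumptions:

    import math

    def delay_aware_err(relevance_grades, delay_s, alpha=0.1, max_grade=1):
        """ERR-style score where the chance of starting to read the ranking
        decays with the delay (illustrative assumption); the rest follows the
        standard cascade/ERR recursion of Chapelle et al."""
        p_examine_first = math.exp(-alpha * max(delay_s, 0.0))   # assumed form
        score, p_reach = 0.0, p_examine_first
        for rank, g in enumerate(relevance_grades, start=1):
            p_sat = (2 ** g - 1) / (2 ** max_grade)   # ERR satisfaction probability
            score += p_reach * p_sat / rank
            p_reach *= (1.0 - p_sat)
        return score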

Another possible direction might involve a modification of the time-biased gain effectiveness metric proposed in [5]. However, directly accounting for the delay time might be fruitless, as the delays are likely to be considerably smaller than the characteristic time scale considered by this metric (e.g. half of the users continue their search after 224s [5]).

Having proposed a new metric, the question arises of how to evaluate its quality, i.e. to check that it indeed represents the users' preferences and that optimising it leads to higher user satisfaction. We propose to leverage the evaluation approach considered by Chapelle et al. [3]. This approach studies the correlation of the considered metrics with online satisfaction indicators, such as abandonment rate. We believe that the use of the abandonment rate will provide us with a reliable indicator of user satisfaction. A good efficiency-aware metric should outperform its effectiveness-only counterparts, i.e. efficiency-aware ERR should correlate with the online click metrics better than ERR. In addition, we expect it to outperform the heuristic approaches used in the literature [7, 8].

Figure 1: Relative change in the abandonment rate (AR), as a function of the relative change in the delay. 95% error bars are smaller than the symbols. The linear fit corresponds to the line y = 0.028 + 0.229·x, R² = 0.98.

4. CONCLUSIONS
In this work, we argued for the need for an evaluation metric that combines both retrieval efficiency and retrieval effectiveness in a founded and empirically verified manner. We proposed to devise such a metric by incorporating the users' tolerance to delays into the cascade model of user behaviour. Further, this model can be used to build an efficiency-aware modification of the ERR effectiveness metric. We suggested using the search engine click logs as well as the browser/toolbar data as a source of user preference evidence. Finally, our preliminary experiments demonstrated that this data exhibits the same patterns as reported in previous studies and hence can be used to derive metrics such as those proposed in this paper.

5. REFERENCES
[1] J. D. Brutlag, H. Hutchinson, and M. Stone. User preference and search engine latency. In JSM Proceedings, Quality and Productivity Research Section, 2008.
[2] B. B. Cambazoglu, H. Zaragoza, O. Chapelle, J. Chen, C. Liao, Z. Zheng, and J. Degenhardt. Early exit optimizations for additive machine learned ranking systems. WSDM '10.
[3] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. CIKM '09.
[4] O. Chapelle and Y. Zhang. A dynamic bayesian network click model for web search ranking. WWW '09.
[5] M. D. Smucker and C. L. Clarke. Time-based calibration of effectiveness measures. SIGIR '12.
[6] N. Tonellotto, C. Macdonald, and I. Ounis. Efficient and effective retrieval using selective pruning. WSDM '13.
[7] K. Wang, T. Walker, and Z. Zheng. Pskip: estimating relevance ranking quality from web search clickthrough data. SIGKDD '09.
[8] L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. SIGIR '11.
[9] L. Wang, J. Lin, and D. Metzler. Learning to efficiently rank. SIGIR '10.
[10] E. Yilmaz, M. Shokouhi, N. Craswell, and S. Robertson. Expected browsing utility for web search evaluation. CIKM '10.


Observing Users to Validate Models (Extended Abstract)

Falk Scholer, School of Computer Science and Information Technology, RMIT University
Paul Thomas, ICT Centre, CSIRO
Alistair Moffat, Department of Computing and Information Systems, The University of Melbourne

ABSTRACT
User models serve two purposes: to help us understand users, and hence determine how to supply them with effective search services; and as a framework against which to evaluate the quality of those services once they have been developed. In this extended abstract we describe an experiment we have undertaken in which we observe user behaviours, and try to determine whether these behaviours can be connected to search quality metrics via an existing or novel user model. We provide summary evidence that suggests that the answer is a qualified "yes".

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and software; performance evaluation.

General Terms
Experimentation, measurement.

Keywords
Retrieval experiment, evaluation, system measurement.

1. METRICS, MODELS, AND BEHAVIOUR
Every model embodies certain assumptions about the system it represents, and is used to make predictions about the future behaviour of that system. The usefulness of these predictions depends both on the veracity of those initial assumptions, and on the fidelity of the model. In retrieval, metrics such as DCG, RBP, or Prec@k are built atop explicit or implicit models of user behaviour and preferences. The usefulness of these metrics again depends on their assumptions and fidelity.

In their description of the discounted cumulative gain (DCG) effectiveness metric, Järvelin and Kekäläinen [2002] observe that the discounting of relevance contributions is in no small part a response to user behaviour. They argue that, for a variety of reasons, a user is less likely to gain benefit from a relevant document deep in a ranking than they would if they observed the same document earlier, and propose that the relevance of the document at depth i in the ranking be discounted by a factor of 1/log2(i+1) (in the Microsoft version of the formulation). Moffat and Zobel [2008] suggest a different discounting function but retain the same underlying philosophy. Their rank-biased precision (RBP) metric assumes that the user is likely to abandon their review of a search ranking at each presented document with some fixed probability (1 − p); they then derive a geometric discounting function.

Both DCG and RBP presume that user behaviour can be anticipated via a probabilistic one-state model with just three significant transitions: enter the reading state, taken with probability 1.0; remain in the reading state, taken with probability p at each trial; and exit from the reading state, taken with probability 1 − p. With such a probabilistic model, weighted effectiveness metrics (of which RBP is an example) can then be seen as a calculation of the rate at which the user gains utility from their searching activity. Zhang et al. [2010] used click-log data to compare the user behaviour predicted by the models embedded in DCG and RBP, and found that p = 0.73 was a good fit for RBP. Researchers have also measured user behaviour by monitoring gaze locations while users view search rankings [Joachims et al., 2005]. Both types of study provide support for models of behaviour in which documents near the top of the ranking are more likely to be accessed than ones further down.
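For reference, the two static metrics discussed above reduce to a few lines of Python (graded relevance is assumed to be supplied per rank):

    import math

    def rbp(relevances, p=0.73):
        # Rank-biased precision with persistence p; p = 0.73 is the click-log
        # fit reported by Zhang et al. [2010].
        return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))

    def dcg(relevances):
        # DCG with the 1/log2(i+1) discount (Microsoft formulation).
        return sum(r / math.log2(i + 1) for i, r in enumerate(relevances, start=1))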

Other user models have also emerged. The expected reciprocal rank (ERR) metric introduces the notion of adaptivity; that is, that the user's behaviour will be affected by the relevance of the documents that they encounter in the ranking. In contrast, DCG and RBP are static: the user is predicted to act in a certain way, regardless of whether they are overwhelmed by relevant documents or see none at all. Other "knobs" have also been added by researchers seeking to make their models more accurate or to derive a given metric [Clarke et al., 2008, Dupret and Piwowarski, 2010]. Most recently, Smucker and Clarke [2012] have described a metric they call time-biased gain, which models utility not as something that accrues over documents, but over units of seconds or minutes. That is, a long (and hence slow to read) document contributes a lower score than does a shorter one that is equally relevant.

2. AN EXPERIMENT
If metrics are bound with user models, and if we accept that a model is only as useful as its predictions are reliable, there are clearly questions to be asked. Is user behaviour predicted more precisely by some user models than others? Is more complexity in a model (and metric) justified by greater fidelity? Or is user behaviour so varied that there is no such thing as a model user at all?

To investigate this, we used an instrumented search interface built on the anonymised API of a commercial search service to monitor a group of 34 users while they undertook a set of six search tasks of differing complexity. Each user was asked to note any "useful" pages that they found while searching, so as to answer the information need. At the same time, all user actions, including gaze position via eye-tracking hardware, were monitored. In half of the user-topic combinations we showed deliberately degraded result pages, with every second result one that matched some of the query terms but was nevertheless clearly not relevant. This allows us to gauge the extent to which users are sensitive to the appearance of relevant documents in the ranking. Details of the experimental structure are provided in the full paper.

Figure 1: Gaze transitions as ranked answer lists are viewed. The horizontal scale shows the current rank position of the gaze point; the vertical scale shows the proportion of next gaze points that are at positions ..., −2, −1, 0, +1, +2, ... relative to that starting position. Higher probabilities are shown by brighter colors. Except at rank one, jumps of −1 (that is, to the snippet above this one on the result page) are only slightly less frequent than jumps of +1 to the next snippet down the page. The "fold" in the results page occurred between snippets 6 and 7.

3. SUMMARY OF RESULTS
Our preliminary investigations have focused on two facets of user behaviour. First, most of the user models embedded in quality metrics assume that users read result lists top-to-bottom. We sought to understand how users actually progress through rankings. Figure 1 illustrates some of the data collected, after processing sequences of gaze points into fixations and snippet views. In this heatmap, each column represents the observed probability distribution of users' next gaze positions, conditioned on their current gaze location. Jumps of +1 (to the next document down the ranking) are common, but so too are jumps of −1. Jumps of +2 and −2 also take place. Jumps down the list are more likely than not to be followed by jumps back up, and vice versa, although with an overall downward trend. It seems that in a sense users do progress linearly down the ranking, but also frequently compare the snippet just inspected with one(s) seen earlier, perhaps maintaining a mental "best candidate so far" until they reach enough confidence to go ahead and click.
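A minimal sketch of how such a heatmap column can be derived from fixation data (Python; the input format is an assumption, and the paper's actual processing pipeline is not shown here):

    from collections import Counter, defaultdict

    def gaze_jump_distribution(gaze_sequences, max_rank=10):
        """gaze_sequences: one list of 1-based snippet ranks per session, in
        fixation order. Returns, per current rank, the distribution of jumps
        (next fixated rank minus current rank)."""
        jumps = defaultdict(Counter)
        for seq in gaze_sequences:
            for cur, nxt in zip(seq, seq[1:]):
                if 1 <= cur <= max_rank:
                    jumps[cur][nxt - cur] += 1
        distribution = {}
        for rank, counts in jumps.items():
            total = sum(counts.values())
            distribution[rank] = {jump: n / total for jump, n in counts.items()}
        return distribution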

We also asked whether users behave differently when presented with diluted rankings that contain fewer relevant documents. Static effectiveness metrics such as DCG and RBP predict that there should be no difference in behaviour; adaptive models such as ERR suggest we should see a difference. Our preliminary analysis of the data suggests that users are certainly aware of the inserted snippets, because they do not click on them (Figure 2). But nor have we found any real difference in any of their other behaviours that might have been sensitive to snippet quality: there is no increased depth of viewing in the ranked list, and there is no difference in second-page requests or query reformulation [Thomas et al., 2013]. Hence, it is as yet unclear that the additional complexity of this sort of adaptive model can be recouped by more precise prediction of user behaviour.

Figure 2: Clickthroughs by depth, for degraded rankings only, with the inserted irrelevant snippets at ranks 1, 3, 5, and so on. A further 6.6% of clicks were past rank 10.

4. ONGOING WORK
We have constructed an empirical "best fit to data" model that chooses from amongst a wide range of possible factors when seeking to predict the point at which a user abandons a result listing and takes a different action. Results to date suggest that rather than "relevance of the most recently viewed document" as a predictor of exit, a more potent factor is "proportion of anticipated total relevance that has been accrued until this moment", which automatically folds in task type. In the full paper we embed this concept into a model of user behaviour, and hence an alternative effectiveness metric.

Acknowledgment. This work was supported by the Australian Research Council.

References
C. Clarke, M. Kolla, G. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and diversity in information retrieval evaluation. In Proc. SIGIR, pages 659–666, Singapore, 2008.
G. Dupret and B. Piwowarski. A user behavior model for average precision and its generalization to graded judgments. In Proc. SIGIR, pages 531–538, Geneva, Switzerland, 2010.
K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Sys., 20(4):422–446, 2002.
T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proc. SIGIR, pages 154–161, Salvador, Brazil, 2005.
A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Sys., 27(1):2:1–2:27, 2008.
M. D. Smucker and C. L. A. Clarke. Time-based calibration of effectiveness measures. In Proc. SIGIR, pages 95–104, Portland, Oregon, 2012.
P. Thomas, F. Scholer, and A. Moffat. Fading away: Dilution and user behaviour. In Proc. EuroHCIR, Dublin, 2013. To appear.
Y. Zhang, L. A. F. Park, and A. Moffat. Click-based evidence for decaying weight distributions in search effectiveness metrics. Information Retrieval, 13(1):46–69, 2010.


Markov Modeling for User Interaction in Retrieval

Vu T. Tran, Information Engineering, University of Duisburg-Essen, Duisburg, Germany
Norbert Fuhr, Information Engineering, University of Duisburg-Essen, Duisburg, Germany

ABSTRACT
For applying the interactive probability ranking principle (IPRP), we derive Markov models (MM) from observing interactive retrieval via system logging and eye-tracking. Then we discuss various applications of these models: 1) For time-based retrieval measures, we can derive the expected performance based on the MM. 2) By varying single parameters of the model, it is possible to simulate the effect of specific system improvements and their consequences on retrieval quality. 3) Based on the interactive PRP, the system can order the choices offered to the user in an optimum way, and guide the user to more successful searches. 4) While current approaches for simulating interactive IR assume a simplistic, deterministic user behavior, the MMs derived empirically can form the basis for more realistic simulations.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; Search Process

Keywords
Evaluation, Interactive Retrieval, Simulation, User Behavior

1. INTRODUCTION
The Interactive Probability Ranking Principle (IPRP) [1] is a probabilistic framework model for interactive retrieval. This model assumes that a user moves between situations s_i, in each of which the system presents a list of choices about which the user has to decide, and the first accepted choice moves the user to a new situation. Each choice c_ij is associated with three parameters: the effort e_ij for considering this choice, the acceptance probability p_ij, and the benefit a_ij resulting from the acceptance.

Based on the IPRP, we developed a new methodology for analyzing interactive IR [5] using log and eye-tracking data from the INEX 2010 interactive track [3] (12 retrieval sessions, 84 queries). Based on this data, we represent the user's interaction as a Markov model (MM, see Figure 1). After formulating a query, the user looks at one result item after the other, possibly regards its details, and puts items found relevant into the basket (for further explanation of our interface, see [5]). The timings correspond to the effort e_ij for evaluating a choice c_ij, while the transition probabilities give the chances p_ij of accepting it. As a possible approach for quantifying the benefit a_ij of a decision, we regard the time needed for finding the first (next) relevant document (see below). With some limitations, the same methodology can also be applied when only log data is available, e.g. for the 2011 TREC session track (http://trec.nist.gov/data/session.html); see the results in the next section.

Figure 1: Transition probabilities and user efforts for the four states of the Markov model (average efforts: Query 4.9 s, Result Item 2.3 s, Detail 15.3 s, Basket 1.7 s).

2. EVALUATION
Given the MM, we are able to estimate the time for finding the (next) relevant document. In turn, these values can be used for estimating retrieval quality in terms of time-based measures [4].

Let us denote the four MM states by q, r, d and b, the effort in these states by t_q, t_r, t_d and t_b, and the transition probability from state x to state y as p_xy. Then the expected times T_q, T_r and T_d for reaching the basket state from the query, results or details stage, respectively, can be computed via the following linear equation system:

T_q = t_q + p_qr · T_r
T_r = t_r + p_rq · T_q + p_rr · T_r + p_rd · T_d
T_d = t_d + p_dq · T_q + p_dr · T_r

As results, we obtain the times T_q for the query, T_r for the result list, and T_d for the details stage, which are shown in Table 1.

        INEX    TREC
T_q     122.7   285.4
T_r     117.8   255.2
T_d     104.9   204.2

Table 1: Times (in seconds) to basket/relevance for INEX and TREC

Here we have assumed that all result items are of equal quality, i.e. p_rd and p_rr are constant. In order to consider the decreasing quality of result items, we have to refine the MM by introducing a state r_i for each result item, along with a state d_i for the corresponding details. From our observation data, we derived statistics about rank-dependent click-through rates p_ri,d (and the changing values of p_ri,ri+1). This leads to the times displayed in Table 2, based on the following equation system:

T_r = T_r1 + ... + T_rn
T_r1 = t_r + p_r1,q · T_q + p_r1,r2 · T_r2 + p_r1,d · T_d1
T_d1 = t_d + p_dq · T_q + p_d,r2 · T_r2
T_r2 = t_r + p_r2,q · T_q + p_r2,r3 · T_r3 + p_r2,d · T_d2
T_d2 = t_d + p_dq · T_q + p_d,r3 · T_r3
...

T_q     T_r1    T_r2    T_r3    ...     T_r9    T_r10
122.7   117.8   117.9   119.1   ...     122.5   123.5

Table 2: Times to basket/relevance for INEX with varying probabilities

3. SIMULATION
By varying the MM parameters, we can simulate the effect of system changes. As a first example, we consider the effect of ranking quality. As Figure 1 shows, only roughly 4% of the items in the result list are relevant (p_rb + p_rd · p_db). Improving ranking quality would result in increasing the value of p_rd, while p_rr and p_rb would decrease by the same amount (assuming that p_rq remains unchanged). Then we can derive that retrieval improvements of 10%, 20% or 30% would reduce the time to basket T_q by 17%, 27% and 31%, respectively. This shows that even small improvements in terms of ranking quality would have a big effect on the quality of interactive retrieval (while enhancements beyond 30% would have only minor effects). In a similar way, we can, for example, simulate the effect of improving the quality of the result summaries.
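The simulation above can be reproduced by shifting probability mass towards p_rd and re-solving the system. The sketch below (Python/NumPy, reusing the same illustrative parameters as before) takes the extra mass from p_rr only, which is a simplification of the redistribution described in the text:

    import numpy as np

    def time_to_basket(t_q, t_r, t_d, p_qr, p_rq, p_rr, p_rd, p_dq, p_dr):
        A = np.array([[1.0,    -p_qr,       0.0 ],
                      [-p_rq,  1.0 - p_rr, -p_rd],
                      [-p_dq,  -p_dr,       1.0 ]])
        return np.linalg.solve(A, np.array([t_q, t_r, t_d]))[0]   # T_q

    base = dict(t_q=4.9, t_r=2.3, t_d=15.3, p_qr=1.0,
                p_rq=0.03, p_rr=0.85, p_rd=0.11, p_dq=0.02, p_dr=0.74)
    T_q0 = time_to_basket(**base)
    for improvement in (0.1, 0.2, 0.3):       # 10%, 20%, 30% better ranking
        shifted = dict(base)
        shifted["p_rd"] = base["p_rd"] * (1 + improvement)
        shifted["p_rr"] = base["p_rr"] - base["p_rd"] * improvement
        T_q = time_to_basket(**shifted)
        print(f"+{improvement:.0%} ranking quality: T_q reduced by {1 - T_q / T_q0:.0%}")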

Besides studying the effects of small changes to an existing system, the MM derived from such a system could also be used for simulating interactive retrieval with other systems (see e.g. [6]). Currently, these simulations are based on a deterministic user behavior, and thus the results give only upper and lower bounds for retrieval quality, but hardly estimates of the average performance. Based on the MM, we now have an empirical basis for simulating the more realistic stochastic user behavior. Of course, this would require a large number of simulated runs in order to implement the stochastic behavior represented by the MM. In addition, one would need full relevance information for the information needs studied. Also, we still need a model for query reformulation, which remains a research issue. Nevertheless, a stochastic user model is clearly the way towards more realistic simulations of interactive retrieval.

4. GUIDANCE
The core idea of the IPRP is the optimum ranking of choices. In the simple MM studied here, the only choice the user has is when to reformulate a query instead of going to the next result item. Looking at the figures in Table 2, one can see that the expected time for reaching the next relevant document increases as the user goes down the ranked list of result items. Comparing the values T_ri with T_q, it becomes apparent that it does not pay off to go beyond the ninth result item; the user should rather formulate a new query.

However, we should bear in mind that these figures only refer to the average case. Since there is a large variation in the ranking quality of queries, we need query-specific estimates. This requires methods for estimating the probabilities of relevance for a given query, along with the corresponding click-through rates, which is an issue of our current research.

5. OUTLOOK
In this paper, we have shown how the application of MMs to interactive retrieval can be used for evaluation, simulation and user guidance. However, the models regarded so far are very simple. For example, we assume that the quality of the queries is constant, i.e. reformulation does not lead to better answers. Moreover, the possible user interaction is rather limited (e.g., the system could suggest query expansion terms). Considering changes from query to query or new interaction possibilities increases the number of states in the MM significantly; as a consequence, estimating the increased number of parameters requires substantially more observation data, and thus more extensive experimentation or access to logging data of operational systems.

6. REFERENCES
[1] N. Fuhr. A probability ranking principle for interactive information retrieval. Information Retrieval, 11(3):251–265, 2008.
[2] W. R. Hersh, J. Callan, Y. Maarek, and M. Sanderson, editors. The 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, Portland, OR, USA, August 12-16, 2012. ACM, 2012.
[3] N. Pharo, T. Beckers, R. Nordlie, and N. Fuhr. Overview of the INEX 2010 interactive track. In INEX, Lecture Notes in Computer Science, pages 227–235. Springer, 2011.
[4] M. D. Smucker and C. L. A. Clarke. Time-based calibration of effectiveness measures. In Hersh et al. [2], pages 95–104.
[5] V. T. Tran and N. Fuhr. Using eye-tracking with dynamic areas of interest for analyzing interactive information retrieval. In Hersh et al. [2], pages 1165–1166.
[6] R. W. White, I. Ruthven, J. M. Jose, and C. J. V. Rijsbergen. Evaluating implicit feedback models using searcher simulations. ACM Trans. Inf. Syst., 23(3):325–361, July 2005.


Personalization for Difficult Queries

Arjumand Younus†‡, M. Atif Qureshi†‡, Colm O'Riordan†, Gabriella Pasi‡
† Computational Intelligence Research Group, Information Technology, National University of Ireland, Galway, Ireland
‡ Information Retrieval Lab, Informatics, Systems and Communication, University of Milan Bicocca, Milan, Italy
{arjumand.younus, muhammad.qureshi, colm.oriordan}@nuigalway.ie, {arjumand.younus, muhammad.qureshi, pasi}@disco.unimib.it

ABSTRACT
We investigate the correlation between standard query difficulty estimators (query performance predictors) and the performance of a personalized search system which uses real profiles of assessors. Our findings indicate that when query performance predictors denote difficult queries, the user profiles largely help in improving the performance of a personalized search system. We argue for the use of information retrieval evaluation measures that take into account aspects of user profiles.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process

General Terms
Human Factors, Performance

Keywords
Query Performance Prediction, Difficult queries

1. INTRODUCTION
For over a decade, Web search personalization has been viewed as an effective solution for user-tailored information seeking. However, current personalized search systems apply personalization to all queries, improving results for only some queries while actually harming the performance of other queries [7]. This poster presents an investigation into which queries may benefit from personalization, through the use of query performance predictors [1]. As a lesson from our findings, we argue for the use of information retrieval evaluation measures that take into account aspects of user profiles.

The evaluation of Query Performance Prediction (QPP) methods has usually been performed over standard test collections (TREC, ClueWeb, etc.) without focusing on how information from such methods can be used for applications that enhance retrieval. This contribution considers one such application, namely Web search personalization, from a query performance prediction perspective. Specifically, we investigate the relationship between query difficulty (estimated via standard QPP methods in the literature) and their performance over real-world information needs in a personalized search scenario. Instead of using relevance judgements over a test collection, we rely on explicit relevance judgements from real users to gain a better insight into the relationship between query difficulty and the benefits derived from personalization.

Topic: India-Pakistan Relations
Information Need: I have to write a report on my perspective of India-Pakistan relations and whether or not I feel they should forget past differences and move towards a positive healthy relationship. The report should contain an overview of benefits that can be achieved from such relations along with the disadvantages.

Topic: Pakistani Advertisements
Information Need: I am a media studies student researching on the state of advertisements in Pakistan. Is the message in Pakistani advertisements good or are they bad, citing reasons for each viewpoint with some case-studies in point.

Table 1: Information Needs Crafted for Journalists and Social Media Activists

2. EMPIRICAL STUDY AND RESULTS
We performed a study with twenty-five Pakistani journalists and social media activists as assessors, using the Express Tribune blogs¹ as a corpus. We crafted eight information needs that closely match real-world information needs of journalists and social media activists². Table 1 presents examples of two information needs.

In this study, we use a side-by-side evaluation technique [8] to compare search results from a non-personalized and a personalized algorithm over the eight curated information needs. The non-personalized search results were obtained from a system using a BM25 weighting scheme, while for the personalized search results we re-ranked the top 20 results from the BM25 algorithm. The re-ranking was performed using a cosine similarity measure between the top 20 search results obtained from the non-personalized version of the algorithm (i.e., BM25) and the terms obtained from the user profiles of the assessors. The user profile was simply a tf-idf vector of terms (excluding stopwords) from the tweets (Twitter data) of the journalists and social media activists³. From the eight provided information needs, the assessors were asked to choose at least two information needs that closely matched a search they had conducted in the past; some of the assessors chose more than two information needs, with four being the maximum. We gave the assessors freedom in their choice of queries so as to have a wide range of queries, and for each query assessors were asked to provide explicit relevance judgements (i.e., a binary measure for relevant or non-relevant) for the top 10 results from both algorithms. In the end we were able to collect explicit relevance judgements for a total of 59 unique queries⁴.

¹ http://blogs.tribune.com.pk: This is a news blog of Pakistan's first internationally affiliated newspaper.
² We believe that the information needs closely mirrored searches that the assessors conducted in the real world, as they were crafted after an elaborate study of questions posed by journalists and social media activists on Twitter.

Pre/Post Ret. Predictors    P@10_p (personalized)    P@10_np (non-personalized)
MaxIDF                      -0.26                     0.05
SCS                         -0.45*                    0.33**
Sum-SCQ                     -0.38                     0.29*
clarity                     -0.56*                    0.43*
σ100                        -0.59***                  0.48*
σ50%                        -0.71*                    0.58**
Note: *p<.05, **p<.01, ***p<.001

Table 2: Spearman's correlations: QPP vs. Non-Personalized and Personalized Search Performance
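A minimal sketch of the profile-based re-ranking described in this section, assuming scikit-learn's TfidfVectorizer (the function and variable names are illustrative; the authors' exact weighting and preprocessing are not specified beyond tf-idf over tweets):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def personalize(bm25_top20_texts, profile_tweets):
        """Re-rank the BM25 top-20 documents by cosine similarity to a tf-idf
        profile built from an assessor's tweets."""
        vectorizer = TfidfVectorizer(stop_words="english")
        doc_matrix = vectorizer.fit_transform(bm25_top20_texts)
        profile = vectorizer.transform([" ".join(profile_tweets)])
        scores = cosine_similarity(doc_matrix, profile).ravel()
        order = scores.argsort()[::-1]
        return [bm25_top20_texts[i] for i in order]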

As mentioned previously, the aim of the investigation is to analyse the correlation between query difficulty and the benefits such queries can derive from personalization. To this aim, we get an estimate of the benefit that a query can derive from personalization using a comparison between the performance of the personalized vs. the non-personalized algorithm through the following two measures:

Benefit_p = AveragePrecision_personalized / AveragePrecision_non-personalized        (1)

Difference_p = P@10_personalized − P@10_non-personalized        (2)

To estimate the difficulty of a particular query, we used three pre-retrieval QPP methods: Max. Inverse Document Frequency (MaxIDF) [6], Simplified Clarity Score (SCS) [4] and Summed Collection Query Similarity (Sum-SCQ) [9], as well as two post-retrieval methods: clarity score [2] and standard deviation (σ) [3, 5].

Results: Table 2 reports the correlations between the query performance predictors proposed in the literature and the performance of a non-personalized and a personalized search system. To investigate how difficult queries affect the performance of a personalized search system, we examined the correlation between the predictions of each QPP method (which give an estimate of query difficulty) and the measures introduced in Equations 1 and 2, shown in Table 3⁵. Note that this is entirely opposite to the investigations in the literature so far, where query performance predictions (both pre-retrieval and post-retrieval) have been found to be positively correlated with system performance. The highest correlation is obtained for the post-retrieval predictors, and in particular with the standard deviation measure σ50% proposed in [3]. Hence, in a personalized search scenario, when standard QPP methods denote a difficult query, user profiles help in improving system performance.

³ The last 3200 tweets (obtained via the Twitter API) of each assessor were used for the user profile construction.
⁴ Of the 25 assessors, four chose four information needs, eight chose three information needs and thirteen chose two information needs. Of these, seven assessors had an overlap in the queries, which reduces the number of unique queries to 59.
⁵ Precisely, as QPP methods denote a difficult query, the performance of the non-personalized search system degrades while that of the personalized search system improves.

Pre/Post Ret. Predictors    Benefit_p    Difference_p
MaxIDF                      -0.16        -0.06
SCS                         -0.31**      -0.42*
Sum-SCQ                     -0.39*       -0.37**
clarity                     -0.63*       -0.52***
σ100                        -0.72**      -0.65*
σ50%                        -0.88**      -0.69***
Note: *p<.05, **p<.01, ***p<.001

Table 3: Spearman's correlations: Query Difficulty (estimated via QPP methods) vs. Benefit that a Query can Derive from Personalization

Another significant lesson learnt from the relevance assessment study in this paper is that user profiles and preferences play a huge role in modelling users' behaviour over search engines, and this significant aspect has so far been neglected in information retrieval evaluation measures. We show via the metrics introduced in Equations 1 and 2 that the performance of a personalized vs. a non-personalized search system differs significantly when taking user profiles into account, and there is a need for more elaborate metrics that can capture this notion.

3. CONCLUSION
In this poster, we investigated the relationship between query difficulty and the performance of a personalized search system. We found that a personalized search system performs well for difficult queries. We believe this to be due to the fact that when users find it hard to clearly express their information need, a system aware of their preferences largely helps. In future work, we aim to investigate the relationship between queries and personalization in more detail, with more sophisticated forms of personalization.

4. REFERENCES
[1] D. Carmel and E. Yom-Tov. Estimating the query difficulty for information retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2(1):1–89, 2010.
[2] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. SIGIR '02, pages 299–306, 2002.
[3] R. Cummins, J. Jose, and C. O'Riordan. Improved query performance prediction using standard deviation. SIGIR '11, pages 1089–1090, 2011.
[4] B. He and I. Ounis. Query performance prediction. Inf. Syst., 31(7):585–594, Nov. 2006.
[5] J. Perez-Iglesias and L. Araujo. Standard deviation as a query hardness estimator. SPIRE '10, pages 207–212, 2010.
[6] F. Scholer, H. E. Williams, and A. Turpin. Query association surrogates for web search. JASIST, 55(7):637–650, 2004.
[7] J. Teevan, S. T. Dumais, and D. J. Liebling. To personalize or not to personalize: modeling queries with variation in user intent. SIGIR '08, pages 163–170, 2008.
[8] P. Thomas and D. Hawking. Evaluation by comparing result sets in context. CIKM '06, pages 94–101, 2006.
[9] Y. Zhao, F. Scholer, and Y. Tsegay. Effective pre-retrieval query performance prediction using similarity and variability evidence. ECIR '08, pages 52–64, 2008.


User Judgements of Document Similarity

Mustafa Zengin and Ben Carterette
Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA 19716
{zengin, carteret}@udel.edu

ABSTRACT
Cosine similarity is a term-vector-based measure of similarity that has been used widely in information retrieval research. In this study, we collect user judgments of web document similarity in order to investigate the correlation between cosine similarity and users' perception of similarity on web documents. Experimental results demonstrate that it is hard to conclude that cosine similarity correlates strongly with human judgements of similarity.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]; H.3.4 [Systems and Software]: Performance Evaluation

Keywords: document similarity, cosine similarity, users

1. INTRODUCTION
Measures of similarity between documents are widely used in information retrieval: as scores to rank documents, for clustering, for diversity, and more. Most of these measures are based on simple textual features, primarily term counts, document counts, and document lengths. Probably the most widely used is the cosine similarity, a measure of the distance between two weighted vectors in a vocabulary space.

Since most uses of such similarity measures are meant to help users with some task, it is worth asking whether they correspond to the notion of similarity that users actually have. The field of IR has always compared query-document similarity measures to human judgements of relevance (this is the foundation of effectiveness evaluation), but there is very little work comparing document-document similarity measures to human opinion. In this paper we describe an experiment to collect human judgements of document-document similarity and determine the extent to which cosine similarity captures them.

2. USER EXPERIMENTS

2.1 Experimental Design
The experiments were performed on Amazon Mechanical Turk (AMT) [1], an online crowdsourcing labor marketplace where requesters submit Human Intelligence Tasks (HITs) with some constraints and workers complete the tasks for a fee. Each HIT submitted to workers consisted of a set of instructions about the task, five queries with descriptions that clarify their information need, and five pairs of fully-rendered web pages. Users were asked the following questions: (1) Examine the two web pages shown side by side and judge whether each would help a user achieve the stated description of an information need. (2) Rate the two web pages by how similarly they would help a user achieve the stated description of an information need on a 1–5 scale. Three sets of radio buttons were shown to users to collect the answers. The former question had two answer options: "True", which states that the web document helps a user to achieve the stated description of the information need, and "False" otherwise. We provided a rating scale to the users to clarify the similarity ratings for the latter question:

Similarity rating scale
1. Either or both of the two pages do not provide any information for the stated description. Even if two pages are identical, if they are not relevant to the stated description, you should select 1.
2. The two pages give a small amount of similar information for the stated description. They may differ on many points, or one page may be much more substantive than the other, or differ in other ways; they only overlap on a few things.
3. The two pages convey similar information for the stated description. They may differ on some points, or one page may offer more information than the other, but there is some overlap in information relevant to the description.
4. The two pages are mostly equivalent for the stated description; you would be happy to see either one of them in search results. They contain the same information needed to achieve the description, though they may present it in a different order, or along with other information that is not relevant to the description, or with other minor differences.
5. The two pages are essentially equivalent and relevant to the description. They contain exactly the same information, even if they differ in some cosmetic respects (sidebars, top/bottom matter, etc.).

HIT properties
In our experiment we set a completion time limit of 3 hours for each HIT and 7 days for each query batch. We assigned each HIT to 3 different users, and users were paid $0.40 per completed HIT.

Quality control
One of the concerns of requesters about crowdsourcing marketplaces such as AMT is low-quality work. Due to the high cost of manual data review, we used two methods to automate the quality control process. First, we accepted users with the following qualifications: Mechanical Turk masters, 95% or higher HIT approval rate, at least 100 HITs of approved work, and a minimum qualification test score of 30 out of 36. The qualification test consists of 4 simple questions having the same design layout as the actual task. We also used a "trap" question in each HIT which has a simple, obvious answer; workers that answered it incorrectly were likely to have submitted low-quality work.

Table 1: Similarity score distribution for document pairs rated by different workers.
C     s1    s2    s3    s4    s5
s1    50    87    30    18     1
s2    87   132   120    69     8
s3    30   120   236   165    17
s4    18    69   165   186    20
s5     1     8    17    20     6

Table 2: Probability distribution of Table 1.
P     s1      s2      s3      s4      s5
s1    0.269   0.209   0.053   0.039   0.019
s2    0.468   0.317   0.211   0.151   0.154
s3    0.161   0.288   0.415   0.360   0.327
s4    0.097   0.166   0.290   0.406   0.385
s5    0.005   0.019   0.03    0.044   0.115

2.2 Materials
We randomly selected 10 queries from the TREC 2011 Web Track that had been identified as "faceted". For each query we selected 8 web documents that were marked as relevant (i.e., rel, key or nav) by NIST assessors from the top 50 documents of the University of Glasgow's adhoc submission uogTrA45Vm [2]. From the selected web documents, every possible distinct pair was created for each query. The total number of pairs in our experiment set was 280.

3. DATA ANALYSIS
We collected a total of 840 similarity score judgements (3 for each distinct pair) from 45 workers over the 2-week experiment duration. We first investigated the users' score agreement on document pairs. Table 1 shows counts of document pairs for which one worker gave the rating corresponding to the column label and another gave the rating corresponding to the row label (the counts in this table add up to 280 × 6 = 1680, since 3 workers per pair produce 6 possible comparisons of ratings). Overall agreement, calculated as the sum of the diagonal entries divided by the total count, is about 36%, which is about on par with measured human agreement about relevance [4, 3].
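The agreement figure and the conditional probabilities of Table 2 follow directly from the counts in Table 1, e.g. (Python/NumPy):

    import numpy as np

    # Co-occurrence counts from Table 1 (rows and columns are scores 1-5).
    C = np.array([[50,  87,  30,  18,  1],
                  [87, 132, 120,  69,  8],
                  [30, 120, 236, 165, 17],
                  [18,  69, 165, 186, 20],
                  [ 1,   8,  17,  20,  6]])

    agreement = np.trace(C) / C.sum()     # 610 / 1680, roughly 0.36
    conditional = C / C.sum(axis=0)       # Table 2: P(row rating | column rating)
    print(f"overall agreement: {agreement:.2f}")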

Table 2 shows the probability of a worker giving the rating corresponding to the row, given that another worker gave the rating corresponding to the column (columns sum to 1). Having the largest probabilities on the diagonal shows there is some agreement between users in the score 2, score 3 and score 4 cases. Since we only included documents that are distinct and relevant to the given query, the results on the score 1 and score 5 diagonal were as expected. Note also that the probabilities on the diagonal create the largest sums with probabilities in neighboring cells above or below. This shows that even when users do not agree on an exact score on the 5-point scale, their scores are not random.

Figure 1 shows the average cosine similarity (and confidence interval) of document pairs according to the given user scores. Note that there are small increases in cosine similarity from score 2 to score 3 and from score 4 to score 5, but the standard deviations are wide. This suggests that cosine similarity captures something about human judgements, but perhaps not enough that a difference in cosine similarity of as much as 0.4 could be considered meaningful.

Figure 1: Average cosine similarity of document pairs by worker rating.
Figure 2: Proportion of ratings in five bins of cosine similarity (labeled on the y-axis by the upper bound of the bin). Numbers on bars are raw counts of document pairs.

Figure 2 shows the percentage of user scores falling into five cosine similarity bins (0.0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, and 0.8–1.0). The share of score 4 and score 5 in all bins increases starting from the 0.2–0.4 bin, while the share of score 1 decreases from the same bin. Nevertheless, it is difficult to say there is a strong association between cosine similarity and the given user scores.

The overall correlation between user ratings and cosine similarity is 0.152 by Kendall's τ rank correlation, and 0.189 by linear correlation. While these are significant, they are very low.
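These correlations can be computed with SciPy; the two short arrays below are illustrative stand-ins for the 840 collected (rating, cosine similarity) pairs:

    from scipy.stats import kendalltau, pearsonr

    user_ratings = [3, 4, 2, 5, 1, 3]                    # illustrative ratings (1-5)
    cosine_sims  = [0.41, 0.55, 0.22, 0.73, 0.10, 0.35]  # matching cosine similarities

    tau, tau_p = kendalltau(user_ratings, cosine_sims)
    r, r_p = pearsonr(user_ratings, cosine_sims)
    print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f}), linear r = {r:.3f} (p = {r_p:.3f})")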

4. REFERENCES
[1] Amazon Mechanical Turk. http://www.mturk.com
[2] R. McCreadie, C. Macdonald, R. L. T. Santos, and I. Ounis. Experiments with Terrier in Crowdsourcing, Microblog, and Web Tracks. In Proc. TREC 2011.
[3] B. Carterette, P. N. Bennett, D. M. Chickering, and S. T. Dumais. Here or There: Preference Judgements for Relevance. In Proc. ECIR 2008.
[4] E. M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. In Proc. SIGIR 1998.


Evaluating Heterogeneous Information Access (Position Paper)

Ke Zhou, University of Glasgow
Tetsuya Sakai, Microsoft Research Asia
Mounia Lalmas, Yahoo! Labs
Zhicheng Dou, Microsoft Research Asia
Joemon M. Jose, University of Glasgow

ABSTRACT
Information access is becoming increasingly heterogeneous. We need to better understand the more complex user behaviour within this context so as to properly evaluate search systems dealing with heterogeneous information. In this paper, we review the main challenges associated with evaluating search in this context and propose some avenues to incorporate user aspects into evaluation measures.

1. INTRODUCTION
The performance of search systems is evaluated with user-oriented and system-oriented measures. The former are obtained through user studies conducted to examine and reflect upon various aspects of user behaviours. The latter rely on reusable test collections (i.e. document, query and relevance judgements) to assess search quality.

Recent work [5, 9, 11] has attempted to combine those two types of measures, by modelling users and their behaviour within test-collection-based system-oriented evaluation metrics. For example, Smucker et al. [11] proposed a time-biased gain (TBG) framework that explicitly calibrates the time of various (user) aspects in the search process. Another attempt, from Sakai et al. [9], proposed a unified evaluation framework (U-measure) that is free from the linear traversal assumption and can evaluate information access other than ad-hoc retrieval (e.g. multi-document summarisation, diversified search). Recently, Chuklin et al. [5] proposed a common approach to convert click models into system-oriented evaluation measures. Although the above-mentioned evaluation frameworks can potentially handle more complex search tasks, they have all been tested on traditional homogeneous search scenarios (newswire or general web search).

The web environment is becoming increasingly heterogeneous. We now have many search engines, so-called verticals, each targeting a specific type of information (e.g. image, news, video, etc.). Aggregated search [2, 14] is concerned with retrieving search results from a heterogeneous set of search engines, and is a topic of investigation in both the academic community and the commercial world. Due to the heterogeneous nature of information in aggregated search, numerous challenges have arisen.

In this paper, we argue that, compared with traditional homogeneous search, evaluation in the context of heterogeneous information is more challenging and requires taking into account more complex user behaviours and interactions. Specifically, we require evaluation approaches that not only model user behaviours but also adapt to how users interact with a heterogeneous information space.

2. CHALLENGES
There are three main challenges in incorporating user behaviours within an evaluation framework for heterogeneous information access. We discuss each challenge and current research endeavours for each below.

2.1 Non-linear Traversal Browsing
Presenting heterogeneous information is more complex than the typical single ranked list (e.g. ten blue links) employed in homogeneous ranking. There are three main types of presentation designs: (1) results from the different verticals are blended into a single list (of blocks), referred to as blended; (2) results from each vertical are presented in a separate (e.g. horizontally paralleled) panel (tile), referred to as non-linear blended; and (3) vertical results can be accessed in separate tabs, referred to as tabbed. A combination of all three is also possible. In addition, results from different vertical search engines can be grouped together to form a coherent "bundle" for a given aspect of the query (e.g. a bundle composed of a news article along with videos and user comments as a response to the query "football match"). Finally, the results presented on the search page can contain visually salient snippets (e.g. images).

Different presentation strategies and visual saliencies imply different patterns of user interaction. For example, a user could follow a non-linear traversal browsing pattern. Eye-tracking studies [13] and search log analyses [7, 12] have shown that in a blended presentation, users tend to examine results from one vertical first (vertical bias), in particular those with visually salient snippets, and then the results nearby. In addition, when the vertical results are not presented at the top of the search result page, users tend to scan back and re-examine previous web results, either bottom-up or top-down. When results are presented in a non-linear blended style (two parallel panels/columns), a recent eye- and mouse-tracking study [8] showed that users tend to focus first on examining the top results in the first column and then jump to the right panel. For a tabbed presentation of vertical results, user browsing behaviour is still poorly understood.
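
One hypothetical way to operationalise such non-linear patterns, sketched below purely for illustration (it is not a model from the cited studies), is to replace the usual top-to-bottom examination order with one that depends on the page layout: results from a visually salient vertical block are examined first, followed by the organic results closest to that block.

def examination_order(results, salient_vertical=None):
    """Hypothetical non-linear examination order for a blended page.

    results          : list of (position, source) pairs, top to bottom
    salient_vertical : name of a vertical with visually salient snippets

    Plain web results are assumed to be scanned top to bottom; if a
    salient vertical block is present, its results are assumed to be
    examined first, then the neighbouring results above and below it.
    """
    if salient_vertical is None:
        return [pos for pos, _ in results]

    vertical = [pos for pos, src in results if src == salient_vertical]
    others = [pos for pos, src in results if src != salient_vertical]
    if not vertical:
        return [pos for pos, _ in results]

    # Examine results closest to the vertical block before distant ones,
    # mimicking the "vertical first, then nearby results" pattern.
    anchor = sum(vertical) / len(vertical)
    others.sort(key=lambda pos: abs(pos - anchor))
    return vertical + others

page = [(1, "web"), (2, "web"), (3, "image"), (4, "image"), (5, "web")]
print(examination_order(page, salient_vertical="image"))
# [3, 4, 2, 5, 1]: the image block first, then the results around it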

2.2 Diverse Search Tasks
User search tasks are more complex with heterogeneous information access than with traditional homogeneous ranking. A search task's vertical orientation can affect user behaviours. Previous research [12] showed that search tasks differ in the strength of their orientation towards a particular source (vertical) type, and that this affects users' search behaviour (for example, click-through rates). Recently, Zhou et al. [15] found that when assessing vertical relevance, a search task's vertical orientation is more important than the topical relevance of the retrieved results.

Secondly, a search task's complexity can also have a major effect on user behaviours, because users can access results from different verticals, across multiple search sessions, to accomplish their search tasks. Arguello et al. [2] showed that more complex search tasks require significantly more user interaction and more examination of vertical results. Finally, Bron et al. [6] found that users' preference for an aggregated search presentation (blended or tabbed) changes during multi-session search tasks.

2.3 Coherence, Diversity and Personalization
Another important consideration when evaluating heterogeneous information access is coherence. This refers to the degree to which results from different verticals focus on a similar "sense" of the query (can they form a bundle?). Recent research [3] showed that the query senses associated with the blended vertical results can influence user interaction with web search results.

The diversity of the results is another interesting problem: it has been shown to vary considerably, and users often have their own personalized vertical diversity preferences [14]. Finally, Santos et al. [10] showed that for an ambiguous or multi-faceted query, a user's intended information need varies considerably across different verticals.

3. AVENUES OF RESEARCH
We need an approach that models the above-mentioned user behaviours and incorporates them into system-oriented measures to evaluate heterogeneous information access. This requires two main lines of research: (1) understanding and modelling user behaviours, and (2) incorporating these models into the evaluation. We elaborate on each below.

The first line of research aims to give insights into user perspectives and to provide better models of user behaviours. Although there have been studies aiming at better understanding the behaviour of users in aggregated search, the problem of evaluating heterogeneous information access is far from solved. There remains a large gap between understanding user behaviours in this context and incorporating this understanding into evaluation measures. There have been attempts at building models of aggregated search clicks [7, 13] which could be incorporated into measures, e.g. to account for a search task's vertical orientation and the visual saliency of vertical results. However, many aspects still lack investigation (e.g. coherence, diversity). We propose to follow current research endeavours and investigate models that capture user aspects, in particular those poorly accounted for in the evaluation. To achieve this, we must collect data on user behaviours in aggregated search through laboratory experiments, crowdsourcing, or search engine logs.

The second line of research aims to incorporate these new models into a general evaluation framework that can accurately capture the variations in user behaviours. There are a few powerful evaluation frameworks that we could use for this, for instance TBG [11] and U-measure [9], as mentioned in Section 1. Zhou et al. [14] also recently proposed a general evaluation framework to model utility and effort in aggregated search. In addition, we could follow Chuklin et al. [5] and convert aggregated search click models into system-oriented evaluation measures. Preference-based evaluation is another direction worth attention, for instance the approaches of Chandar et al. [4] and Arguello et al. [1].
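
As a deliberately simplified illustration of the click-model route, the sketch below turns per-rank examination probabilities into a utility-style metric in the spirit of Chuklin et al. [5]; the baseline examination probabilities and the vertical attention boost are made-up values for this sketch, not parameters estimated from click logs.

def expected_utility(rels, sources, exam_prob, vertical_boost=1.5):
    """Utility-style metric driven by a (toy) click model.

    rels           : graded relevance per rank (0..1)
    sources        : source type per rank, e.g. "web" or "image"
    exam_prob      : baseline probability of examining each rank
    vertical_boost : multiplicative attention boost for vertical blocks

    Expected utility = sum over ranks of P(examine) * relevance, where
    P(examine) is inflated (and capped at 1) for vertical results to
    reflect the attention bias observed in aggregated search studies.
    """
    utility = 0.0
    for rel, src, p in zip(rels, sources, exam_prob):
        if src != "web":
            p = min(1.0, p * vertical_boost)
        utility += p * rel
    return utility

# Toy example: a relevant image block at rank 3 contributes more than its
# position alone would suggest, because it attracts extra attention.
print(expected_utility(
    rels=[1.0, 0.0, 1.0, 0.5],
    sources=["web", "web", "image", "web"],
    exam_prob=[0.9, 0.6, 0.4, 0.3],
))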

4. CONCLUSIONS
This paper advocates the need to incorporate user behaviours into system-oriented measures for evaluating heterogeneous information access. We listed the challenges involved and proposed some avenues for shaping future research in this direction. A new track at TREC, FedWeb1, is studying information access over heterogeneous information, and is the perfect forum in which to pursue some of the research avenues discussed in this paper.

5. REFERENCES
[1] J. Arguello, F. Diaz, J. Callan, and B. Carterette. A methodology for evaluating aggregated search results. In ECIR 2011.

[2] J. Arguello, W. Wu, D. Kelly, and A. Edwards. Task complexity, vertical display, and user interaction in aggregated search. In SIGIR 2012.

[3] J. Arguello and R. Capra. The effect of aggregated search coherence on search behavior. In CIKM 2012.

[4] P. Chandar and B. Carterette. Preference based evaluation measures for novelty and diversity. In SIGIR 2013.

[5] A. Chuklin, P. Serdyukov, and M. de Rijke. Click model-based information retrieval metrics. In SIGIR 2013.

[6] M. Bron, J. van Gorp, F. Nack, L. B. Baltussen, and M. de Rijke. Aggregated search interface preference in multi-session search tasks. In SIGIR 2013.

[7] D. Chen, W. Chen, H. Wang, Z. Chen, and Q. Yang. Beyond ten blue links: enabling user click modeling in federated web search. In WSDM 2012.

[8] V. Navalpakkam, L. Jentzsch, R. Sayres, S. Ravi, A. Ahmed, and A. Smola. Measurement and modeling of eye-mouse behavior in the presence of nonlinear page layouts. In WWW 2013.

[9] T. Sakai and Z. Dou. Summaries, ranked retrieval and sessions: a unified framework for information access evaluation. In SIGIR 2013.

[10] R. L. T. Santos, C. Macdonald, and I. Ounis. Aggregated search result diversification. In ICTIR 2011.

[11] M. Smucker and C. Clarke. Time-based calibration of effectiveness measures. In SIGIR 2012.

[12] S. Sushmita, H. Joho, M. Lalmas, and R. Villa. Factors affecting click-through behavior in aggregated search interfaces. In CIKM 2010.

[13] C. Wang, Y. Liu, M. Zhang, S. Ma, M. Zheng, J. Qian, and K. Zhang. Incorporating vertical results into search click models. In SIGIR 2013.

[14] K. Zhou, R. Cummins, M. Lalmas, and J. M. Jose. Evaluating aggregated search pages. In SIGIR 2012.

[15] K. Zhou, R. Cummins, M. Lalmas, and J. M. Jose. Which vertical search engines are relevant? Understanding vertical relevance assessments for web queries. In WWW 2013.

1 Federated web search: https://sites.google.com/site/trecfedweb/.
