Evangelos Kanoulas — Advances in Information Retrieval Evaluation

Evaluating Multi-Query Sessions

Evangelos Kanoulas*, Ben Carterette+, Paul Clough*, Mark Sanderson$

* University of Sheffield, UK + University of Delaware, USA

$ RMIT University, Australia

Why sessions?

•  Current evaluation framework
   – Assesses the effectiveness of systems over one-shot queries

•  Users reformulate their initial query

•  Still fine if …
   – optimizing a system for one-shot queries led to optimal performance over an entire session

When was the DuPont Science Essay Contest created?

Initial Query : DuPont Science Essay Contest

Reformulation : When was the DSEC created?

•  e.g. retrieval systems should accumulate information along a session

Why sessions?

Extend the evaluation framework

From one query evaluation

To multi-query sessions evaluation

Construct appropriate test collections

Rethink evaluation measures

What is the appropriate collection?

Test collections we built…

•  Text REtrieval Conference (TREC)
   – sponsored by NIST
   – many competitions; among them Session Track 2010, 2011, …

Test collection we built in 2010…

•  Corpus: ClueWeb09
   – 1 billion web pages (5TB compressed)

•  Queries and Reformulations
   – 150 query pairs: initial query, reformulation
   – 3 types of reformulations (not disclosed to participants)
      •  Specification (52 query pairs)
      •  Generalization (48 query pairs)
      •  Drifting / Parallel Reformulation (50 query pairs)

Some Criticism…

•  Artificial reformulations

•  Short reformulations
   –  just 2 queries

•  No other user interaction data
   –  clicks, dwell times, etc.

•  Reformulations are static (do not depend on the search engine's (SE) response)
   –  The collection does not allow early abandonment
   –  The reformulation itself does not change based on the SE's response

Test Collection in 2011

•  Corpus: ClueWeb09
   –  1 billion web pages (5TB compressed)

•  Queries and Reformulations
   –  Real users searching ClueWeb09
   –  76 sessions of 2 up to 10 reformulations

•  Other interactions
   –  Clicks, dwell times, mouse movements, relevance judgments

•  But… reformulations are still static

Basic test collection

•  A set of information needs
   What do we know about black powder ammunition?

   – A static sequence of m queries

      Initial Query : black powder ammunition
      1st Reformulation : black powder wiki
      2nd Reformulation : gun powder wiki
      …
      (m-1)th Reformulation : history of gunpowder
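One way to picture such a basic test collection entry is as a small record: an information need plus a fixed list of queries. The sketch below is only illustrative. The class name `StaticSession` and its fields are hypothetical and not part of any TREC tooling, and the reformulations elided by "…" on the slide are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StaticSession:
    """One information need with a fixed (static) sequence of queries."""
    information_need: str
    queries: List[str]  # queries[0] is the initial query, the rest are reformulations

    @property
    def reformulations(self) -> List[str]:
        # Everything after the initial query
        return self.queries[1:]


example = StaticSession(
    information_need="What do we know about black powder ammunition?",
    queries=[
        "black powder ammunition",  # initial query
        "black powder wiki",        # 1st reformulation
        "gun powder wiki",          # 2nd reformulation
        "history of gunpowder",     # last reformulation shown on the slide
    ],
)
```

Because the sequence is fixed in advance, every system is evaluated on the same queries regardless of what it returned for the earlier ones.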

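Experiment

[Figure: ranked result lists (ranks 1 to 10, …) for the queries "black powder wiki" and "gun powder wiki"]

Evaluation over a single ranked list

[Figure: ranked result lists (ranks 1 to 10, …) for the queries "black powder wiki" and "gun powder wiki"]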
Experiment

Construct appropriate test collections

Rethink evaluation measures

What is a good system?

How can we measure “goodness”?

Measuring “goodness”

The user steps down a ranked list of documents and observes each one of them until a decision point and either

a)  abandons the search, or

b)  reformulates

While stepping down or sideways, the user accumulates utility

What are the challenges?

Evaluation over a single ranked list

[Figure: ranked result lists (ranks 1 to 10, …) for the queries "black powder wiki" and "gun powder wiki"]

Evaluation over multiple ranked lists

Existing measures

•  Session DCG [Järvelin et al ECIR 2008]
   The user steps down the ranked list until rank k and reformulates
   [Deterministic; no early abandonment]

•  Expected session utility [Yang and Lad ICTIR 2009]
   The user steps down a ranked list of documents until a decision point and reformulates
   [Stochastic; no early abandonment]
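To make the deterministic user of session DCG concrete, here is a minimal sketch. It assumes one common formulation: each ranked list is cut at rank k, gains are discounted by log_b(rank + 1) within a list, and the whole list of the q-th query is further discounted by 1 + log_bq(q). The function names and the default bases `b=2` and `bq=4` are illustrative assumptions, not values taken from the slides.

```python
import math


def dcg_at_k(gains, k, b=2.0):
    """DCG of one ranked list, cut at rank k (assumed discount: log_b(rank + 1))."""
    return sum(g / math.log(rank + 1, b)
               for rank, g in enumerate(gains[:k], start=1))


def session_dcg(session_gains, k, b=2.0, bq=4.0):
    """Sum of per-query DCG@k, each divided by 1 + log_bq(query position)."""
    total = 0.0
    for q, gains in enumerate(session_gains, start=1):
        query_discount = 1.0 + math.log(q, bq)  # equals 1 for the initial query
        total += dcg_at_k(gains, k, b) / query_discount
    return total


# Two queries, each with a relevant document at rank 1 (made-up gains)
score = session_dcg([[1, 0], [1, 0]], k=2)
```

The same relevant document contributes less when it is found only after a reformulation, which is the behaviour the measure is meant to capture. The user always reads exactly k documents per query and never abandons early.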

Evaluating over paths

Optimize -> Model-free measures

Integrate out -> Model-based measures

Evaluation measures

•  Evaluating over paths

•  Model-free measures

•  Model-based measures

Model-free measures

The user is an oracle that knows when to reformulate

Ω(k,j) : paths of length k, ending at reformulation j

Count number of relevant docs on the optimal path ω of length k ending at query j


Rank   Q1   Q2   Q3
1      N    R    R
2      N    R    R
3      N    R    R
4      N    R    R
5      N    R    R
6      N    N    R
7      N    N    R
8      N    N    R
9      N    N    R
10     N    N    R
…      …    …    …

Define :

Precision@k,j Recall@k,j Precision@recall,j

ω(10,3) : length 10, ending at 3rd query
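The oracle's optimal path can be found with a small dynamic program over "documents viewed so far". The sketch below is one possible reading, under assumptions that the slides do not state: relevance is binary, the user views at least one document for every query on the path, and duplicates across lists are ignored. The function names are hypothetical.

```python
def max_relevant_on_path(session_rels, k, j):
    """Most relevant docs on any path of total length k that ends at query j (1-based)."""
    UNREACHABLE = -1
    # best[n] = max relevant docs after the queries processed so far, having viewed n docs
    best = [0] + [UNREACHABLE] * k
    for rels in session_rels[:j]:
        prefix = [0]  # prefix[d] = relevant docs in the top d of this list
        for r in rels:
            prefix.append(prefix[-1] + r)
        nxt = [UNREACHABLE] * (k + 1)
        for seen, count in enumerate(best):
            if count == UNREACHABLE:
                continue
            # Assumption: at least one document is viewed for each query
            for depth in range(1, min(len(rels), k - seen) + 1):
                nxt[seen + depth] = max(nxt[seen + depth], count + prefix[depth])
        best = nxt
    if best[k] == UNREACHABLE:
        raise ValueError("no path of length k ends at query j")
    return best[k]


def precision_at(session_rels, k, j):
    """Precision@k,j: relevant docs on the optimal path, divided by its length."""
    return max_relevant_on_path(session_rels, k, j) / k


# The example session from the table: 1 = R, 0 = N
q1 = [0] * 10
q2 = [1] * 5 + [0] * 5
q3 = [1] * 10
print(max_relevant_on_path([q1, q2, q3], 10, 3))  # 9
print(precision_at([q1, q2, q3], 10, 3))          # 0.9
```

For ω(10,3) the oracle spends one document on Q1, which has nothing relevant, and the remaining nine on relevant documents of Q2 and Q3. Recall@k,j would divide the same count by the total number of relevant documents, which has to be supplied separately.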


[Figure: precision as a function of recall and reformulation for the example session Q1, Q2, Q3 (axes: recall, reformulation, precision)]

[Figure: precision-recall curves in three panels (ranking 1, ranking 2, ranking 3); axes: recall (0.0 to 1.0) and precision (0.0 to 1.0)]

Model-based measures

Probabilistic space of users following different paths

•  Ω is the space of all paths
•  P(ω) is the probability of a user following a path ω in Ω
•  Mω is a measure over a path ω

$$\mathrm{esM} = \sum_{\omega \in \Omega} P(\omega)\, M_\omega$$

[Yang and Lad ICTIR 2009]

Model Browsing Behavior

Position-based models

The chance of observing a document depends on the position of the document in the ranked list.

[Figure: a ranked list, ranks 1 to 10, …]

Rank Biased Precision [Moffat and Zobel, TOIS08]

[Figure: browsing flowchart with the states Query, View Next Item and Stop, next to a ranked list, ranks 1 to 10, …]
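As an illustration of a position-based model, rank-biased precision can be sketched in a few lines. The user views the next item with a fixed persistence probability p, so rank i is observed with probability p^(i-1). The function name and the default `p=0.8` are illustrative choices.

```python
def rank_biased_precision(rels, p=0.8):
    """RBP = (1 - p) * sum_i rel_i * p**(i - 1), with rel_i in [0, 1]."""
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(rels))


# Made-up relevance labels for a ranked list
print(rank_biased_precision([1, 0, 0], p=0.5))  # 0.5
print(rank_biased_precision([1, 1], p=0.5))     # 0.75
```

The probability of reaching a rank depends only on the rank itself, never on what was seen above it.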

Model Browsing Behavior

Cascade-based models

The chance of observing a document depends on the position of the document in the ranked list and the relevance of documents/snippets already viewed.

[Figure: a ranked list, ranks 1 to 10, …]

Expected Reciprocal Rank [Chapelle et al CIKM09]

[Figure: browsing flowchart with the states Query, View Next Item, Relevant? (no / somewhat / highly) and Stop, next to a ranked list, ranks 1 to 10, …]
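A cascade model can be illustrated with expected reciprocal rank. In the sketch below the user stops at a document with a probability that grows with its relevance grade, and only reaches rank r if no earlier document stopped them. Mapping the three answers of the flowchart (no / somewhat / highly) to grades 0, 1, 2 is an assumption made for this example.

```python
def expected_reciprocal_rank(grades, max_grade=2):
    """ERR = sum_r (1/r) * R_r * prod_{i<r} (1 - R_i), with R = (2**g - 1) / 2**max_grade."""
    p_reach = 1.0  # probability that the user reaches the current rank
    score = 0.0
    for rank, g in enumerate(grades, start=1):
        p_stop = (2 ** g - 1) / 2 ** max_grade  # chance the document satisfies the user
        score += p_reach * p_stop / rank
        p_reach *= 1 - p_stop
    return score


# Made-up grades: 0 = no, 1 = somewhat, 2 = highly relevant
print(expected_reciprocal_rank([2]))     # 0.75
print(expected_reciprocal_rank([0, 2]))  # 0.375
```

Unlike the position-based model, a highly relevant document near the top lowers the weight of everything below it.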

$$\mathrm{DEBU}(r) = P(E_r) \cdot P(C \mid R_r)$$

$$\mathrm{EBU} = \sum_{r=1}^{n} \mathrm{DEBU}(r) \cdot R_r$$

Expected Browsing Utility [Yilmaz et al CIKM10]

Probability of a path

[Figure: a path through the ranked lists of Q1, Q2, Q3, with its two components marked (1) and (2)]

Joint probability of
   (1) abandoning at reform 2
   (2) reformulating at rank 3 of first query

Probability of a path = Probability of abandoning at reform 2 × Probability of reformulating at rank 3 of first query

(1) Probability of abandoning the session at reformulation i
    Truncated geometric w/ parameter p_reform

(2) Probability of reformulating at rank j (of 1 to i-1 reform)
    Geometric w/ parameter p_down
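The two distributions can be combined into an expected session measure by brute-force enumeration of paths. The sketch below rests on several assumptions of its own: p_reform and p_down are read as "continue" probabilities, the rank distribution is also truncated at the list length so that path probabilities sum to one, the stopping rank in the last query follows the same distribution as the reformulation ranks, and the per-path measure is simply the number of relevant documents seen. All function names are hypothetical.

```python
from itertools import product


def truncated_geometric(p, n):
    """P(X = x) for x = 1..n, proportional to p**(x - 1) * (1 - p), renormalised."""
    weights = [p ** (x - 1) * (1 - p) for x in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def relevant_on_path(session_rels, depths):
    """Example path measure: relevant docs seen when reading depths[q] docs of query q."""
    return sum(sum(rels[:d]) for rels, d in zip(session_rels, depths))


def expected_session_measure(session_rels, p_reform, p_down, measure):
    """esM = sum over paths of P(path) * M(path), by explicit enumeration."""
    m = len(session_rels)
    p_last_query = truncated_geometric(p_reform, m)  # (1) where the session ends
    expected = 0.0
    for i in range(1, m + 1):
        lists = session_rels[:i]
        rank_dists = [truncated_geometric(p_down, len(rels)) for rels in lists]  # (2)
        for depths in product(*(range(1, len(rels) + 1) for rels in lists)):
            p_path = p_last_query[i - 1]
            for dist, d in zip(rank_dists, depths):
                p_path *= dist[d - 1]
            expected += p_path * measure(session_rels, depths)
    return expected


# Tiny made-up session: two queries, one relevant document each
value = expected_session_measure([[1], [1]], 0.5, 0.5, relevant_on_path)
```

Enumeration is exponential in the number of queries and is only meant to mirror the sum over Ω; it is not an efficient way to compute the measure.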

Evaluation measures

•  Properties

– How do the new measures correlate with previously introduced?

– Do they behave as expected, i.e. do they reward early retrieval of relevant documents?

Correlations

•  TREC 2010 Session track

[Figure: scatter plot, nsDCG vs. esAP (x-axis: nsDCG, y-axis: esAP); Kendall's tau : 0.5247]

[Figure: scatter plot, nsDCG vs. esNDCG (x-axis: nsDCG, y-axis: esNDCG); Kendall's tau : 0.7972]
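For readers unfamiliar with the statistic, Kendall's tau between two measures can be computed from the scores each measure assigns to the same set of systems. The sketch below implements the simple tau-a variant, which ignores ties; whether the slides use this variant is not stated. The scores in the example are made up and are not the TREC results.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Tau-a: (concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of three systems under two measures
measure_a = [0.10, 0.15, 0.20]
measure_b = [0.04, 0.08, 0.06]
tau = kendall_tau(measure_a, measure_b)  # the two measures disagree on one pair of systems
```

A value of 1 means both measures rank the systems identically, and a value of -1 means the rankings are reversed.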

Reward early retrieval

                   esMPC@20   esMRC@20   esMAP
"good"->"good"     0.378      0.036      0.122
"good"->"bad"      0.363      0.034      0.112
"bad"->"good"      0.271      0.023      0.083
"bad"->"bad"       0.254      0.022      0.073

•  TREC9 Query track
   – 50 topics and 23 query sets (formulations)

•  Simulate sessions

Conclusions

•  Extend the evaluation framework to sessions
   –  Built the appropriate test collection
   –  Rethink evaluation measures

•  Basic test collection
•  Model-free and model-based measures

•  Did not talk about:
   –  Duplicate documents
   –  Efficient computation of the measures
