Evaluating Multi-Query Sessions
Evangelos Kanoulas*, Ben Carterette+, Paul Clough*, Mark Sanderson$
* University of Sheffield, UK + University of Delaware, USA
$ RMIT University, Australia
Why sessions?
• Current evaluation framework – Assesses the effectiveness of systems over one-shot queries
• Users reformulate their initial query
• Still fine if … – optimizing system for one-shot queries led to optimal performance over an entire session
When was the DuPont Science Essay Contest created?
Initial Query : DuPont Science Essay Contest
Reformulation : When was the DSEC created?
• e.g. retrieval systems should accumulate information along a session
Why sessions?
Extend the evaluation framework
From one-query evaluation
To multi-query session evaluation
Construct appropriate test collections
Rethink evaluation measures
What is the appropriate collection?
Test collections we built…
• Text REtrieval Conference (TREC) – sponsored by NIST – many competitions; among them Session Track 2010, 2011, …
Test collection we built in 2010…
• Corpus: ClueWeb09 – 1 billion web pages (5TB compressed)
• Queries and Reformulations
  – 150 query pairs: initial query, reformulation
  – 3 types of reformulations (not disclosed to participants)
    • Specification (52 query pairs)
    • Generalization (48 query pairs)
    • Drifting / Parallel Reformulation (50 query pairs)
Some Criticism…
• Artificial reformulations
• Short reformulations – just 2 queries
• No other user interaction data – clicks, dwell times, etc.
• Reformulations are static (do not depend on the search engine's response)
  – The collection does not allow early abandonment
  – The reformulation itself does not change based on the search engine's response
Test Collection in 2011
• Corpus: ClueWeb09 – 1 billion web pages (5TB compressed)
• Queries and Reformulations
  – Real users searching ClueWeb09
  – 76 sessions of 2 up to 10 reformulations
• Other interactions – Clicks, dwell times, mouse movements, relevance judgments
• But… reformulations are still static
Basic test collection
• A set of information needs, e.g. "What do we know about black powder ammunition?"
  – A static sequence of m queries
Initial Query : black powder ammunition
1st Reformulation : black powder wiki
2nd Reformulation : gun powder wiki
…
(m-1)th Reformulation : history of gunpowder
Experiment
[Figure: ranked result lists (ranks 1 to 10, …) for the queries "black powder ammunition", "black powder wiki", and "gun powder wiki"]
Evaluation over a single ranked list
[Figure: ranked result lists (ranks 1 to 10, …) for the queries "black powder ammunition", "black powder wiki", and "gun powder wiki"]
Experiment
Construct appropriate test collections
Rethink evaluation measures
What is a good system?
How can we measure “goodness”?
Measuring “goodness”
The user steps down a ranked list of documents and observes each one of them until a decision point and either
a) abandons the search, or
b) reformulates
While stepping down or sideways, the user accumulates utility
What are the challenges?
Evaluation over a single ranked list
[Figure: ranked result lists (ranks 1 to 10, …) for the queries "black powder ammunition", "black powder wiki", and "gun powder wiki"]
Evaluation over multiple ranked lists
Existing measures
• Session DCG [Järvelin et al ECIR 2008]: The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]
• Expected session utility [Yang and Lad ICTIR 2009]: The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]
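As an illustration of the deterministic "step down to rank k, then reformulate" model behind session DCG, here is a minimal sketch. The slides do not give the formula; the within-list discount log2(rank + 1), the across-query discount 1 + log_bq(q), and the default bq = 4 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    # Within-list discount log2(rank + 1) is an assumption, not taken from the slides.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def session_dcg(session_gains, k, bq=4.0):
    # session_gains[q-1] holds the gains of the ranked list returned for query q.
    # The user is assumed to read exactly k documents per query, then reformulate.
    total = 0.0
    for q, gains in enumerate(session_gains, start=1):
        query_discount = 1.0 + math.log(q, bq)  # later queries count less (assumed form)
        total += dcg_at_k(gains, k) / query_discount
    return total
```

With this discount, documents found after more reformulations contribute less, which is the behavior a session measure is meant to capture.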
Evaluating over paths
Optimize: Model-free measures
Integrate out: Model-based measures
Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
Model-free measures
The user is an oracle that knows when to reformulate
Ω(k,j) : paths of length k, ending at reformulation j
Count number of relevant docs on the optimal path ω of length k ending at query j
Model-free measures
Q1 Q2 Q3
N R R
N R R
N R R
N R R
N R R
N N R
N N R
N N R
N N R
N N R
… … …
Define : Precision@k,j, Recall@k,j, Precision@recall,j
ω(10,3) : path of length 10, ending at the 3rd query
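The oracle path can be found with a small dynamic program. The sketch below counts the relevant documents on the best path of length k ending at query j and derives Precision@k,j from it. It assumes that the user views at least one document for every query before reformulating and that each list is cut off at the ranks supplied; both are assumptions, and the function names are hypothetical.

```python
def optimal_relevant_count(rel_lists, k, j):
    """Max number of relevant docs on a path of k docs ending at query j (1-based).

    rel_lists[q] is the binary relevance of the ranked list for query q+1.
    Assumption: at least one document is viewed per query on the path.
    Returns None if no such path exists.
    """
    NEG = float("-inf")
    best = [NEG] * (k + 1)  # best[n] = max relevant docs using n docs so far
    best[0] = 0
    for q in range(j):
        prefix = [0]
        for r in rel_lists[q]:
            prefix.append(prefix[-1] + r)  # relevant docs in the top-d of list q
        nxt = [NEG] * (k + 1)
        for used in range(k + 1):
            if best[used] == NEG:
                continue
            for depth in range(1, len(prefix)):
                total = used + depth
                if total > k:
                    break
                nxt[total] = max(nxt[total], best[used] + prefix[depth])
        best = nxt
    return None if best[k] == NEG else best[k]

def precision_at(rel_lists, k, j):
    count = optimal_relevant_count(rel_lists, k, j)
    return None if count is None else count / k

# The Q1/Q2/Q3 grid from the slide, truncated at rank 10 (N = 0, R = 1).
grid = [[0] * 10, [1] * 5 + [0] * 5, [1] * 10]
print(precision_at(grid, 10, 3))
```

For the slide's grid, the best path ω(10,3) under these assumptions views one document for Q1, one for Q2, and eight for Q3.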
Model-free measures
[Figure: precision and recall plotted across reformulations for the Q1 to Q3 relevance grid (axes: recall, reformulation, precision)]
Model-free measures
[Figure: precision-recall curves for ranking 1, ranking 2, and ranking 3 of the Q1 to Q3 relevance grid (axes: recall and precision, each from 0.0 to 1.0)]
Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
Model-based measures
Probabilistic space of users following different paths
• Ω is the space of all paths
• P(ω) is the probability of a user following a path ω in Ω
• M_ω is a measure over a path ω
esM = \sum_{\omega \in \Omega} P(\omega) \, M_\omega
[Yang and Lad ICTIR 2009]
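The expectation above is a probability-weighted sum, which can be sketched as follows. The representation of a path as a (probability, measure value) pair and the function name are assumptions made for illustration; the path probabilities and per-path measure values must come from a user model and a path-level measure defined elsewhere.

```python
def expected_session_measure(paths, tol=1e-9):
    """esM = sum over paths of P(path) * M(path).

    paths: iterable of (probability, measure_value) pairs, one per path in Omega.
    """
    paths = list(paths)
    total_prob = sum(p for p, _ in paths)
    if abs(total_prob - 1.0) > tol:
        raise ValueError("path probabilities must sum to 1")
    return sum(p * m for p, m in paths)

# Hypothetical example: three paths with made-up probabilities and measure values.
example = [(0.5, 0.2), (0.3, 0.6), (0.2, 1.0)]
print(expected_session_measure(example))
```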
Model Browsing Behavior
Position-based models
The chance of observing a document depends on the position of the document in the ranked list.
[Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
Rank Biased Precision [Moffat and Zobel, TOIS08]
[Figure: browsing flowchart (Query, View Next Item, Stop) over the ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
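The flowchart corresponds to a user who, after each item, moves to the next one with a fixed persistence probability p and stops otherwise. A minimal sketch of rank-biased precision under that model follows; the default p = 0.8 and the function name are illustrative choices, not values from the slides.

```python
def rank_biased_precision(rels, p=0.8):
    """RBP = (1 - p) * sum_i rel_i * p^(i-1), with ranks i starting at 1.

    rels: relevance values in [0, 1], in rank order.
    p: probability of viewing the next item (persistence); 0.8 is an arbitrary default.
    """
    return (1 - p) * sum(r * p ** i for i, r in enumerate(rels))

print(rank_biased_precision([1, 0, 1], p=0.5))
```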
Model Browsing Behavior
Cascade-based models
The chance of observing a document depends on the position of the document in the ranked list and the relevance of documents/snippets already viewed.
[Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
Expected Reciprocal Rank [Chapelle et al CIKM09]
[Figure: browsing flowchart (Query, View Next Item, Relevant? with branches no / somewhat / highly, Stop) over the ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
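In the cascade model, the chance of stopping at a document grows with its relevance grade. The sketch below computes expected reciprocal rank with the usual grade-to-probability mapping (2^g - 1) / 2^g_max; mapping the flowchart's "no / somewhat / highly" branches to grades 0, 1, 2 is an assumption, as is the function name.

```python
def expected_reciprocal_rank(grades, max_grade):
    """ERR = sum_r (1/r) * R_r * prod_{i<r} (1 - R_i).

    grades: integer relevance grades in rank order (e.g. 0 = no, 1 = somewhat, 2 = highly).
    R_r = (2^g - 1) / 2^max_grade is the probability the user is satisfied at rank r.
    """
    err = 0.0
    p_reach = 1.0  # probability the user reaches the current rank unsatisfied
    for rank, g in enumerate(grades, start=1):
        satisfied = (2 ** g - 1) / 2 ** max_grade
        err += p_reach * satisfied / rank
        p_reach *= 1 - satisfied
    return err

print(expected_reciprocal_rank([0, 1], max_grade=1))
```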
Expected Browsing Utility [Yilmaz et al CIKM10]
DEBU(r) = P(E_r) \cdot P(C \mid R_r)
EBU = \sum_{r=1}^{n} DEBU(r) \cdot R_r
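The two EBU equations combine into a single sum once the examination and click probabilities are available. The sketch below takes those probabilities as inputs and does not estimate them; the argument names and the function name are hypothetical.

```python
def expected_browsing_utility(p_examine, p_click_given_rel, relevance):
    """EBU = sum_r P(E_r) * P(C | R_r) * R_r.

    p_examine[r]: probability the user examines rank r+1, P(E_r).
    p_click_given_rel[r]: probability of a click given the relevance at that rank, P(C | R_r).
    relevance[r]: relevance value R_r of the document at that rank.
    All three inputs are assumed to be supplied by a separately fitted browsing model.
    """
    if not (len(p_examine) == len(p_click_given_rel) == len(relevance)):
        raise ValueError("inputs must have the same length")
    return sum(pe * pc * r
               for pe, pc, r in zip(p_examine, p_click_given_rel, relevance))

print(expected_browsing_utility([1.0, 0.5], [0.8, 0.4], [1, 1]))
```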
Probability of a path
[Figure: Q1 to Q3 relevance grid with a path marked by (1) and (2)]
Joint probability of
(1) abandoning at reform 2
(2) reformulating at rank 3 of first query
Probability of a path
(1) Probability of abandoning at reform 2
× (2) Probability of reformulating at rank 3 of first query
(1) Probability of abandoning the session at reformulation i
Geometric w/ parameter p_reform
(1) Probability of abandoning the session at reformulation i
Truncated Geometric w/ parameter p_reform
Truncated Geometric w/ parameter p_reform
(2) Probability of reformulating at rank j (of 1 to i-1 reform)
Geometric w/ parameter p_down
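The two factors of a path's probability can be sketched as below. The slides name the distributions but not their parameterization, so this sketch assumes that p_reform is the probability of issuing one more reformulation, that p_down is the probability of stepping down one more rank, and that the session has at most m queries. The function names are hypothetical.

```python
def p_abandon_at(i, m, p_reform):
    """Truncated geometric: probability the session ends at query i out of m.

    Assumption: p_reform (0 < p_reform < 1) is the chance of reformulating once more.
    """
    if not 1 <= i <= m:
        raise ValueError("i must be between 1 and m")
    normalizer = 1.0 - p_reform ** m  # renormalize over the m available queries
    return (p_reform ** (i - 1)) * (1.0 - p_reform) / normalizer

def p_reformulate_at_rank(j, p_down):
    """Geometric: probability the user leaves a ranked list after rank j.

    Assumption: p_down is the chance of stepping down to the next rank.
    """
    return (p_down ** (j - 1)) * (1.0 - p_down)

def path_probability(reform_ranks, m, p_reform, p_down):
    """Probability of a path that reformulates at the given ranks, then abandons.

    reform_ranks[q] is the rank at which the user reformulated query q+1, so the
    session is abandoned at query len(reform_ranks) + 1.
    """
    prob = p_abandon_at(len(reform_ranks) + 1, m, p_reform)
    for j in reform_ranks:
        prob *= p_reformulate_at_rank(j, p_down)
    return prob

# The slide's example: reformulate at rank 3 of the first query, abandon at query 2.
print(path_probability([3], m=3, p_reform=0.5, p_down=0.5))
```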
Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
Evaluation measures
• Properties
  – How do the new measures correlate with previously introduced measures?
  – Do they behave as expected, i.e. do they reward early retrieval of relevant documents?
Correlations
• TREC 2010 Session track
[Figure: scatter plot of nsDCG vs. esAP; Kendall's tau : 0.5247]
[Figure: scatter plot of nsDCG vs. esNDCG; Kendall's tau : 0.7972]
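Kendall's tau compares how two measures order the same set of systems. The sketch below computes the simple tau-a variant from paired scores; whether the slides used this variant or a tie-corrected one is not stated, and the scores in the example are made up for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists (one score per system)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both measures order this pair the same way
        elif sign < 0:
            discordant += 1   # the measures disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores of four systems under two measures.
measure_a = [0.10, 0.15, 0.18, 0.20]
measure_b = [0.04, 0.07, 0.06, 0.08]
print(kendall_tau(measure_a, measure_b))
```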
Reward early retrieval
• TREC9 Query track – 50 topics and 23 query sets (formulations)
• Simulate sessions
                  esMPC@20  esMRC@20  esMAP
"good"->"good"    0.378     0.036     0.122
"good"->"bad"     0.363     0.034     0.112
"bad"->"good"     0.271     0.023     0.083
"bad"->"bad"      0.254     0.022     0.073
Conclusions
• Extend the evaluation framework to sessions
  – Built the appropriate test collection
  – Rethink evaluation measures
• Basic test collection
• Model-free and model-based measures
• Did not talk about:
  – Duplicate documents
  – Efficient computation of the measures