A Framework for Result Diversification
Sreenivas Gollapudi, Search Labs, Microsoft Research
Joint work with Aneesh Sharma (Stanford), Samuel Ieong, Alan Halverson, and Rakesh Agrawal (Microsoft Research)
Ambiguous queries
◦ Example: wine 2009
Intuitive definition
◦ Represent a variety of relevant meanings for a given query
Mathematical definitions:
◦ Minimizing query abandonment
Want to represent different user categories
◦ Trade-off between relevance and novelty
Definition of Diversification
Query and document similarities
◦ Maximal Marginal Relevance [CG98]
◦ Personalized re-ranking of results [RD06]
Probability Ranking Principle not optimal [CK06]
◦ Query abandonment
Topical diversification [Z+05, AGHI09]
◦ Needs topical (categorical) information
Loss minimization framework [Z02, ZL06]
◦ “Diminishing returns” for docs w/ the same intent is a specific loss function [AGHI09]
Research on diversification
Express diversity requirements in terms of desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
The framework
Inspired by similar approaches for
◦ Recommendation systems [Andersen et al ’08]
◦ Ranking [Altman, Tennenholtz ’07]
◦ Clustering [Kleinberg ’02]
Map the space of functions – a “basis vector”
Axiomatic approach
Input:
◦ Candidate documents: U = {u1, u2, …, un}, query q
◦ Relevance function: wq(ui)
◦ Distance function: dq(ui, uj) (symmetric, non-metric)
◦ Size k of output result set
Diversification Setup (1/2)
[Figure: candidate documents u1, …, u6; wq(u5) labels the relevance of u5, dq(u2, u4) the distance between u2 and u4]
Output
◦ Diversified set S* of documents (|S*| = k)
◦ Diversification function: f : S × wq × dq → ℝ+
◦ S* = argmax f(S) over sets S with |S| = k
Diversification Setup (2/2)
[Figure: documents u1, …, u6; for k = 3 the selected set is S* = {u1, u2, u6}]
1. Scale-invariance
2. Consistency
3. Richness
4. Strength of
   a) Relevance
   b) Diversity
5. Stability
6. Two technical properties
Axioms
S* = argmaxS f(S, w(·), d(·, ·))
= argmaxS f(S, w΄(·), d΄(·, ·))
◦ w΄(ui) = α · w(ui)
◦ d΄(ui,uj) = α · d(ui,uj)
Scale Invariance Axiom
• No built-in scale for f!
S* = argmaxS f(S, w(·), d(·, ·))
= argmaxS f(S, w΄(·), d΄(·, ·))
◦ w΄(ui) = w(ui) + ai for ui ∈ S*
◦ d΄(ui,uj) = d(ui,uj) + bi for ui and/or uj ∈ S*
Consistency Axiom
• Increasing relevance/ diversity doesn’t hurt!
S*(k) = argmaxS f(S, w(·), d(·, ·),k)
◦ S*(k) ⊆ S*(k+1) for all k
Stability Axiom
• Output set shouldn’t oscillate (change arbitrarily) with size
Impossibility result
Theorem: No function f can satisfy all the axioms simultaneously.
(Scale-invariance, Consistency, Richness, Strength of Relevance/Diversity, Stability, Two technical properties)
Proof via constructive argument
Baseline for what is possible
Mathematical criteria for choosing f
Modular approach: f is independent of specific wq(·) and dq(·, ·)!
Axiomatic characterization– Summary
Express diversity requirements in terms of desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
A Framework for Diversification
Max-sum (avg) objective: f(S) = (k − 1) Σ_{u ∈ S} w(u) + 2λ Σ_{u,v ∈ S} d(u, v)
Diversification objectives
[Figure: documents u1, …, u6]
k = 3: S* = {u1, u2, u6}
k = 4: S* = {u1, u3, u5, u6}
Violates stability! (u2 is dropped when k grows from 3 to 4)
Max-min objective: f(S) = min_{u ∈ S} w(u) + λ min_{u,v ∈ S} d(u, v)
Diversification objectives
[Figure: documents u1, …, u6]
k = 3: S* = {u1, u2, u6}
After perturbing u5: S* = {u1, u5, u6}
Violates consistency and stability!
A taxonomy-based diversification objective
◦ Uses the analogy of marginal utility to determine whether to include more results from an already covered category
◦ Violates stability and one of the technical axioms
Other Diversification objectives
Express diversity requirements in terms of desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
The Framework
Recast as facility dispersion
◦ Max-sum (MaxSumDispersion): maximize Σ_{u,v ∈ S} d(u, v)
◦ Max-min (MaxMinDispersion): maximize min_{u,v ∈ S} d(u, v)
Known approximation algorithms
Lower bounds
Lots of other facility dispersion objectives and algorithms
Algorithms for facility dispersion
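As one concrete illustration, MaxSumDispersion admits a well-known greedy approximation: repeatedly add the farthest remaining pair. The sketch below assumes a symmetric `dist` function and an even k; it is the textbook heuristic, not necessarily the exact algorithm used in the talk:

```python
from itertools import combinations

def max_sum_dispersion(U, dist, k):
    """Greedy 2-approximation for MaxSumDispersion:
    repeatedly pick the remaining pair at maximum distance until |S| = k."""
    U = set(U)
    S = []
    while len(S) < k and len(U) >= 2:
        # farthest remaining pair under dist
        u, v = max(combinations(U, 2), key=lambda p: dist(*p))
        S += [u, v]
        U -= {u, v}
    return S[:k]
```

For diversification, relevance w(·) can be folded into the distance before dispersion is applied, which is what makes the facility-dispersion reduction useful here.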
Algorithm for categorical diversification
S ← ∅
∀c ∈ C, U(c | q) ← P(c | q)
while |S| < k do
  for d ∈ D do
    g(d | q, c) ← Σ_c U(c | q) V(d | q, c)
  end for
  d∗ ← argmax g(d | q, c)
  S ← S ∪ {d∗}
  ∀c ∈ C, U(c | q) ← (1 − V(d∗ | q, c)) U(c | q)
  D ← D \ {d∗}
end while
P(c | q): conditional prob of intent c given query q
g(d | q, c): current prob of d satisfying q, c
Update the utility of a category
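The pseudocode above translates directly to Python; the sketch below mirrors the slide's notation (P maps each intent c to P(c | q); V maps (document, intent) pairs to V(d | q, c) — the dictionary encoding is an assumption):

```python
def diversify(k, D, P, V):
    """Greedy categorical diversification, as in the pseudocode above."""
    S = []
    U = dict(P)                  # U(c|q) initialized to P(c|q)
    D = set(D)
    while len(S) < k and D:
        # g(d|q,c) = sum_c U(c|q) * V(d|q,c): marginal utility of each doc
        g = {d: sum(U[c] * V.get((d, c), 0.0) for c in U) for d in D}
        d_star = max(g, key=g.get)
        S.append(d_star)
        # discount the utility of intents that d* already satisfies
        for c in U:
            U[c] *= 1.0 - V.get((d_star, c), 0.0)
        D.remove(d_star)
    return S
```

Because U(c | q) shrinks multiplicatively for covered intents, a second document for an already-covered intent must be much better to be chosen — the "diminishing returns" behavior noted elsewhere in the deck.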
Intent distribution: P (R |q) = 0.8, P (B |q) = 0.2.
[Worked example: each document carries a quality V(d | q, c) for intents R and B (values such as 0.9, 0.5, 0.4). Initially U(R | q) = 0.8 and U(B | q) = 0.2, so the g(d | q, c) scores favor intent R; after the best R document (V = 0.9) is selected, U(R | q) drops to 0.8 × (1 − 0.9) = 0.08 while U(B | q) stays 0.2, and subsequent picks favor intent B.]
• Actually produces an ordered set of results
• Results not proportional to intent distribution
• Results not according to (raw) quality
• Better results ⇒ fewer results need to be shown
Express diversity requirements in terms of desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
The Framework
Approach
◦ Represent real queries
◦ Scale beyond a few user studies
Problem: Hard to define ground truth
Use disambiguated information sources on the web as the ground truth
Incorporate intent into human judgments
◦ Can exploit the user distribution (need to be careful)
Evaluation Methodologies
Query = Wikipedia disambiguation page title
Large-scale ground truth set
◦ Open source
◦ Growing in size
Wikipedia Disambiguation Pages
Novelty
◦ Coverage of Wikipedia topics
Relevance
◦ Coverage of top Wikipedia results
Metrics Based on Wikipedia Topics
Relevance function:
◦ 1/position
◦ Can use the search engine score
◦ Maybe use query category information
Distance function:
◦ Compute TF-IDF distances
◦ Jaccard similarity score for two documents A and B: J(A, B) = |A ∩ B| / |A ∪ B|
The Relevance and Distance Functions
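A minimal sketch of the Jaccard-based distance over token sets (treating each document as a set of terms is an assumption; the slides do not specify the representation):

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two documents' token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def distance(a_tokens, b_tokens):
    """One way to turn the similarity into a (symmetric, non-metric) distance."""
    return 1.0 - jaccard(a_tokens, b_tokens)
```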
Evaluating Novelty
Topics/categories = list of disambiguation topics
Given a set Sk of results:
◦ For each result, compute a distribution over topics (using our d(·, ·))
◦ Sum confidence over all topics
◦ Threshold to get # topics represented
jaguar.com → Jaguar cat (0.1), Jaguar car (0.9)
wikipedia.org/jaguar → Jaguar cat (0.8), Jaguar car (0.2)
Category confidence
• Jaguar cat: 0.1 + 0.8 = 0.9
• Jaguar car: 0.9 + 0.2 = 1.1
Threshold = 1.0
• Jaguar cat: 0
• Jaguar car: 1
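The jaguar example above can be reproduced mechanically; each result is represented here as a topic-confidence dictionary:

```python
def topics_covered(result_topic_dists, threshold=1.0):
    """Sum per-topic confidence over all results, then count
    the topics whose total confidence reaches the threshold."""
    confidence = {}
    for dist in result_topic_dists:
        for topic, p in dist.items():
            confidence[topic] = confidence.get(topic, 0.0) + p
    return sum(1 for v in confidence.values() if v >= threshold)

results = [
    {"Jaguar cat": 0.1, "Jaguar car": 0.9},   # jaguar.com
    {"Jaguar cat": 0.8, "Jaguar car": 0.2},   # wikipedia.org/jaguar
]
print(topics_covered(results))  # cat: 0.9 < 1.0, car: 1.1 >= 1.0 -> prints 1
```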
Evaluating Relevance
Query – get ranking of search restricted to Wikipedia pages
a(i) = position of Wikipedia topic i in this list
b(i) = position of Wikipedia topic i in the list being tested
Relevance is measured in terms of reciprocal ranks
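The slide does not show the exact reciprocal-rank formula; as a purely illustrative sketch (an assumption, not the authors' definition), one could reward topics that rank high in both the Wikipedia-restricted list (position a(i)) and the list being tested (position b(i)):

```python
def reciprocal_rank_relevance(a, b):
    """Illustrative (assumed) reciprocal-rank score: sum 1/(a(i)*b(i))
    over topics i present in both rankings; a and b map topic -> position."""
    return sum(1.0 / (a[t] * b[t]) for t in a if t in b)
```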
Adding Intent to Human Judgments (Generalizing Relevance Metrics)
Take expectation over distribution of intents
◦ Interpretation: how will the average user feel?
Consider NDCG@k
◦ Classic: NDCG(S; k | c) = DCG(S; k | c) / DCG_ideal(S; k | c)
◦ NDCG-IA depends on intent distribution and intent-specific NDCG:
  NDCG-IA(S; k) = Σ_c P(c | q) · NDCG(S; k | c)
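NDCG-IA is simply the P(c | q)-weighted average of per-intent NDCG; a minimal sketch (the per-intent gain lists are assumed inputs):

```python
import math

def dcg(gains):
    """Discounted cumulative gain: position i contributes gains[i]/log2(i+2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k):
    """Classic NDCG@k: DCG of the first k results over DCG of the ideal order."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

def ndcg_ia(per_intent_gains, intent_probs, k):
    """Intent-aware NDCG: expectation of intent-specific NDCG over P(c|q)."""
    return sum(p * ndcg(per_intent_gains[c], k) for c, p in intent_probs.items())
```

A list that serves only the dominant intent scores well on that intent's NDCG but is penalized, in proportion to P(c | q), on every intent it ignores.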
Evaluation using Mechanical Turk
Created two types of HITs on Mechanical Turk
◦ Query classification: workers are asked to choose among three interpretations
◦ Document rating (under the given interpretation)
Two additional evaluations
◦ MT classification + current ratings
◦ MT classification + MT document ratings
Some Important Questions
When is it right to diversify?
◦ Users have certain expectations about the workings of a search engine
What is the best way to diversify?
◦ Evaluate approaches beyond diversifying the retrieved results
Metrics that capture both relevance and diversity
◦ Some preliminary work suggests that there will be certain trade-offs to make
Questions?
Otherwise, need to encode explicit user model in the metric
◦ Selection only needs k (which is 10)
Later, can rank set according to relevance
◦ Personalize based on clicks
Alternative to stability:
◦ Select sets repeatedly (this loses information)
◦ Could refine selection online, based on user clicks
Why frame diversification as set selection?
[Histogram: novelty difference over 650 ambiguous queries, for Max-sum and Max-min; x-axis: normalized difference in novelty between diversified and original results; y-axis: frequency count]
Novelty Evaluation – Effect of Algorithms
[Histogram: relevance difference over 650 ambiguous queries, for Max-sum and Max-min; x-axis: normalized difference in relevance between diversified and original results; y-axis: frequency count]
Relevance Evaluation – Effect of Algorithms
Product Evaluation – Anecdotal Result
• Results for query cd player
• Relevance: popularity
• Distance: from product hierarchy
Preliminary Results (100 queries)
[Plot: novelty for Max-sum as a function of thresholds and λ; x-axis: thresholds for measuring novelty; y-axis: fractional difference in novelty]
Evaluation using Mechanical Turk
[Bar charts: Diverse vs. Engine 1, Engine 2, Engine 3 on MAP-IA, NDCG-IA, and MRR-IA values at cutoffs 3, 5, …]
Other Measures of Success
Many metrics for relevance
◦ Normalized discounted cumulative gain at k (NDCG@k)
◦ Mean average precision at k (MAP@k)
◦ Mean reciprocal rank (MRR)
Some metrics for diversity
◦ Maximal marginal relevance (MMR) [CG98]
◦ Nugget-based instantiation of NDCG [C+08]
Want a metric that can take into account both relevance and diversity [JK00]
Problem Statement
DIVERSIFY(k)
Given a query q, a set of documents D, distribution P(c | q), quality estimates V(d | q, c), and integer k
Find a set of docs S ⊆ D with |S| = k that maximizes
P(S | q) = Σ_c P(c | q) · ( 1 − Π_{d ∈ S} (1 − V(d | q, c)) )
interpreted as the probability that the set S is relevant to the query over all possible intents
(Find at least one relevant doc; multiple intents)
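The objective can be evaluated directly; this sketch uses the same dictionary conventions as earlier (V maps (document, intent) pairs to V(d | q, c), an assumed encoding):

```python
def p_set_relevant(S, P, V):
    """P(S|q) = sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c))):
    the probability that at least one document in S satisfies the intent."""
    total = 0.0
    for c, pc in P.items():
        miss = 1.0
        for d in S:
            miss *= 1.0 - V.get((d, c), 0.0)   # prob. d fails to satisfy c
        total += pc * (1.0 - miss)
    return total
```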
Discussion of Objective
Makes explicit use of taxonomy
◦ In contrast, similarity-based: [CG98], [CK06], [RKJ08]
Captures both diversification and doc relevance
◦ In contrast, coverage-based: [Z+05], [C+08], [V+08]
Specific form of “loss minimization” [Z02], [ZL06]
“Diminishing returns” for docs w/ the same intent
Objective is order-independent
◦ Assumes that all users read k results
◦ May want to optimize Σ_k P(k) · P(S | q)