A Framework for Result Diversification
Sreenivas Gollapudi, Search Labs, Microsoft Research
Joint work with Aneesh Sharma (Stanford), Samuel Ieong, Alan Halverson, and Rakesh Agrawal (Microsoft Research)
Ambiguous queries
◦ Example: wine 2009
Intuitive definition
◦ Represent a variety of relevant meanings for a given query
Mathematical definitions:
◦ Minimizing query abandonment
Want to represent different user categories
◦ Trade-off between relevance and novelty
Definition of Diversification
Query and document similarities
◦ Maximal Marginal Relevance [CG98]
◦ Personalized re-ranking of results [RD06]
Probability Ranking Principle not optimal [CK06]
◦ Query abandonment
Topical diversification [Z+05, AGHI09]
◦ Needs topical (categorical) information
Loss minimization framework [Z02, ZL06]
◦ “Diminishing returns” for docs w/ the same intent is a specific loss function [AGHI09]
Research on diversification
Express diversity requirements in terms of desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
The framework
Inspired by similar approaches for
◦ Recommendation systems [Andersen et al ’08]
◦ Ranking [Altman, Tennenholtz ’07]
◦ Clustering [Kleinberg ’02]
Map the space of functions – a “basis vector”
Axiomatic approach
Input:
◦ Candidate documents: U = {u1, u2, …, un}, query q
◦ Relevance function: wq(ui)
◦ Distance function: dq(ui, uj) (symmetric, non-metric)
◦ Size k of output result set
Diversification Setup (1/2)
[Figure: candidate documents u1, …, u6; wq(u5) labels the relevance of u5, dq(u2, u4) the distance between u2 and u4]
Output
◦ Diversified set S* of documents (|S*| = k)
◦ Diversification function: f : S × wq × dq → ℝ+
◦ S* = argmax f(S) over sets S with |S| = k
Diversification Setup (2/2)
[Figure: documents u1, …, u6; for k = 3 the selected set is S* = {u1, u2, u6}]
1. Scale-invariance
2. Consistency
3. Richness
4. Strength of
   a) Relevance
   b) Diversity
5. Stability
6. Two technical properties
Axioms
S* = argmaxS f(S, w(·), d(·, ·))
= argmaxS f(S, w΄(·), d΄(·, ·))
◦ w΄(ui) = α · w(ui)
◦ d΄(ui,uj) = α · d(ui,uj)
Scale Invariance Axiom
• No built-in scale for f!
S* = argmaxS f(S, w(·), d(·, ·))
= argmaxS f(S, w΄(·), d΄(·, ·))
◦ w΄(ui) = w(ui) + ai for ui ∈ S*
◦ d΄(ui,uj) = d(ui,uj) + bi for ui and/or uj ∈ S*
Consistency Axiom
• Increasing relevance/ diversity doesn’t hurt!
S*(k) = argmaxS f(S, w(·), d(·, ·),k)
◦ S*(k) ⊆ S*(k+1) for all k
Stability Axiom
• Output set shouldn’t oscillate (change arbitrarily) with size
Impossibility result
Theorem: No function f can satisfy all the axioms simultaneously.
(Scale-invariance, Consistency, Richness, Strength of Relevance/Diversity, Stability, Two technical properties)
Proof via constructive argument
Baseline for what is possible
Mathematical criteria for choosing f
Modular approach: f is independent of specific wq(·) and dq(·, ·)!
Axiomatic characterization– Summary
Express diversity requirements in terms of desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
A Framework for Diversification
Max-sum (avg) objective: f(S) = (k − 1) Σ_{u ∈ S} w(u) + 2λ Σ_{u,v ∈ S} d(u, v)
Diversification objectives
[Figure: documents u1, …, u6]
k = 3: S* = {u1, u2, u6}
k = 4: S* = {u1, u3, u5, u6}
Violates stability! (u2 is dropped when k grows from 3 to 4)
Max-min objective: f(S) = min_{u ∈ S} w(u) + λ min_{u,v ∈ S} d(u, v)
Diversification objectives
[Figure: documents u1, …, u6]
k = 3: S* = {u1, u2, u6}
After perturbing u5: S* = {u1, u5, u6}
Violates consistency and stability!
A taxonomy-based diversification objective
◦ Uses the analogy of marginal utility to determine whether to include more results from an already covered category
◦ Violates stability and one of the technical axioms
Other Diversification objectives
Express diversity requirements in terms of desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
The Framework
Recast as facility dispersion
◦ Max-sum (MaxSumDispersion): maximize Σ_{u,v ∈ S} d(u, v)
◦ Max-min (MaxMinDispersion): maximize min_{u,v ∈ S} d(u, v)
Known approximation algorithms
Lower bounds
Lots of other facility dispersion objectives and algorithms
Algorithms for facility dispersion
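As one concrete illustration, MaxSumDispersion admits a well-known greedy approximation: repeatedly add the farthest remaining pair. The sketch below assumes a symmetric `dist` function and an even k; it is the textbook heuristic, not necessarily the exact algorithm used in the talk:

```python
from itertools import combinations

def max_sum_dispersion(U, dist, k):
    """Greedy 2-approximation for MaxSumDispersion:
    repeatedly pick the remaining pair at maximum distance until |S| = k."""
    U = set(U)
    S = []
    while len(S) < k and len(U) >= 2:
        # farthest remaining pair under dist
        u, v = max(combinations(U, 2), key=lambda p: dist(*p))
        S += [u, v]
        U -= {u, v}
    return S[:k]
```

For diversification, relevance w(·) can be folded into the distance before dispersion is applied, which is what makes the facility-dispersion reduction useful here.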
Algorithm for categorical diversification
S ← ∅
∀c ∈ C, U(c | q) ← P(c | q)
while |S| < k do
  for d ∈ D do
    g(d | q, c) ← Σ_c U(c | q) V(d | q, c)
  end for
  d∗ ← argmax g(d | q, c)
  S ← S ∪ {d∗}
  ∀c ∈ C, U(c | q) ← (1 − V(d∗ | q, c)) U(c | q)
  D ← D \ {d∗}
end while
P(c | q): conditional prob of intent c given query q
g(d | q, c): current prob of d satisfying q, c
Update the utility of a category
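The pseudocode above translates directly to Python; the sketch below mirrors the slide's notation (P maps each intent c to P(c | q); V maps (document, intent) pairs to V(d | q, c) — the dictionary encoding is an assumption):

```python
def diversify(k, D, P, V):
    """Greedy categorical diversification, as in the pseudocode above."""
    S = []
    U = dict(P)                  # U(c|q) initialized to P(c|q)
    D = set(D)
    while len(S) < k and D:
        # g(d|q,c) = sum_c U(c|q) * V(d|q,c): marginal utility of each doc
        g = {d: sum(U[c] * V.get((d, c), 0.0) for c in U) for d in D}
        d_star = max(g, key=g.get)
        S.append(d_star)
        # discount the utility of intents that d* already satisfies
        for c in U:
            U[c] *= 1.0 - V.get((d_star, c), 0.0)
        D.remove(d_star)
    return S
```

Because U(c | q) shrinks multiplicatively for covered intents, a second document for an already-covered intent must be much better to be chosen — the "diminishing returns" behavior noted elsewhere in the deck.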
Intent distribution: P (R |q) = 0.8, P (B |q) = 0.2.
[Worked example: each document carries a quality V(d | q, c) for intents R and B (values such as 0.9, 0.5, 0.4). Initially U(R | q) = 0.8 and U(B | q) = 0.2, so the g(d | q, c) scores favor intent R; after the best R document (V = 0.9) is selected, U(R | q) drops to 0.8 × (1 − 0.9) = 0.08 while U(B | q) stays 0.2, and subsequent picks favor intent B.]
• Actually produces an ordered set of results
• Results not proportional to intent distribution
• Results not according to (raw) quality
• Better results ⇒ fewer results need to be shown
Express diversity requirements in terms of desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
The Framework
Approach
◦ Represent real queries
◦ Scale beyond a few user studies
Problem: Hard to define ground truth
Use disambiguated information sources on the web as the ground truth
Incorporate intent into human judgments
◦ Can exploit the user distribution (need to be careful)
Evaluation Methodologies
Query = Wikipedia disambiguation page title
Large-scale ground truth set
◦ Open source
◦ Growing in size
Wikipedia Disambiguation Pages
Novelty
◦ Coverage of Wikipedia topics
Relevance
◦ Coverage of top Wikipedia results
Metrics Based on Wikipedia Topics
Relevance function:
◦ 1/position
◦ Can use the search engine score
◦ Maybe use query category information
Distance function:
◦ Compute TF-IDF distances
◦ Jaccard similarity score for two documents A and B: J(A, B) = |A ∩ B| / |A ∪ B|
The Relevance and Distance Functions
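A minimal sketch of the Jaccard-based distance over token sets (treating each document as a set of terms is an assumption; the slides do not specify the representation):

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two documents' token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def distance(a_tokens, b_tokens):
    """One way to turn the similarity into a (symmetric, non-metric) distance."""
    return 1.0 - jaccard(a_tokens, b_tokens)
```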
Evaluating Novelty
Topics/categories = list of disambiguation topics
Given a set Sk of results:
◦ For each result, compute a distribution over topics (using our d(·, ·))
◦ Sum confidence over all topics
◦ Threshold to get # topics represented
jaguar.com → Jaguar cat (0.1), Jaguar car (0.9)
wikipedia.org/jaguar → Jaguar cat (0.8), Jaguar car (0.2)
Category confidence
• Jaguar cat: 0.1 + 0.8 = 0.9
• Jaguar car: 0.9 + 0.2 = 1.1
Threshold = 1.0
• Jaguar cat: 0
• Jaguar car: 1
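The jaguar example above can be reproduced mechanically; each result is represented here as a topic-confidence dictionary:

```python
def topics_covered(result_topic_dists, threshold=1.0):
    """Sum per-topic confidence over all results, then count
    the topics whose total confidence reaches the threshold."""
    confidence = {}
    for dist in result_topic_dists:
        for topic, p in dist.items():
            confidence[topic] = confidence.get(topic, 0.0) + p
    return sum(1 for v in confidence.values() if v >= threshold)

results = [
    {"Jaguar cat": 0.1, "Jaguar car": 0.9},   # jaguar.com
    {"Jaguar cat": 0.8, "Jaguar car": 0.2},   # wikipedia.org/jaguar
]
print(topics_covered(results))  # cat: 0.9 < 1.0, car: 1.1 >= 1.0 -> prints 1
```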
Evaluating Relevance
Query – get ranking of search restricted to Wikipedia pages
a(i) = position of Wikipedia topic i in this list
b(i) = position of Wikipedia topic i in the list being tested
Relevance is measured in terms of reciprocal ranks
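The slide does not show the exact reciprocal-rank formula; as a purely illustrative sketch (an assumption, not the authors' definition), one could reward topics that rank high in both the Wikipedia-restricted list (position a(i)) and the list being tested (position b(i)):

```python
def reciprocal_rank_relevance(a, b):
    """Illustrative (assumed) reciprocal-rank score: sum 1/(a(i)*b(i))
    over topics i present in both rankings; a and b map topic -> position."""
    return sum(1.0 / (a[t] * b[t]) for t in a if t in b)
```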
Adding Intent to Human Judgments (Generalizing Relevance Metrics)
Take expectation over distribution of intents
◦ Interpretation: how will the average user feel?
Consider NDCG@k
◦ Classic: NDCG(S; k | c) = DCG(S; k | c) / DCG_ideal(S; k | c)
◦ NDCG-IA depends on intent distribution and intent-specific NDCG:
  NDCG-IA(S; k) = Σ_c P(c | q) · NDCG(S; k | c)
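NDCG-IA is simply the P(c | q)-weighted average of per-intent NDCG; a minimal sketch (the per-intent gain lists are assumed inputs):

```python
import math

def dcg(gains):
    """Discounted cumulative gain: position i contributes gains[i]/log2(i+2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k):
    """Classic NDCG@k: DCG of the first k results over DCG of the ideal order."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

def ndcg_ia(per_intent_gains, intent_probs, k):
    """Intent-aware NDCG: expectation of intent-specific NDCG over P(c|q)."""
    return sum(p * ndcg(per_intent_gains[c], k) for c, p in intent_probs.items())
```

A list that serves only the dominant intent scores well on that intent's NDCG but is penalized, in proportion to P(c | q), on every intent it ignores.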
Evaluation using Mechanical Turk
Created two types of HITs on Mechanical Turk
◦ Query classification: workers are asked to choose among three interpretations
◦ Document rating (under the given interpretation)
Two additional evaluations
◦ MT classification + current ratings
◦ MT classification + MT document ratings
Some Important Questions
When is it right to diversify?
◦ Users have certain expectations about the workings of a search engine
What is the best way to diversify?
◦ Evaluate approaches beyond diversifying the retrieved results
Metrics that capture both relevance and diversity
◦ Some preliminary work suggests that there will be certain trade-offs to make
Questions?
Otherwise, need to encode explicit user model in the metric
◦ Selection only needs k (which is 10)
Later, can rank set according to relevance
◦ Personalize based on clicks
Alternative to stability:
◦ Select sets repeatedly (this loses information)
◦ Could refine selection online, based on user clicks
Why frame diversification as set selection?
[Histogram: novelty difference over 650 ambiguous queries, for Max-sum and Max-min; x-axis: normalized difference in novelty between diversified and original results; y-axis: frequency count]
Novelty Evaluation – Effect of Algorithms
[Histogram: relevance difference over 650 ambiguous queries, for Max-sum and Max-min; x-axis: normalized difference in relevance between diversified and original results; y-axis: frequency count]
Relevance Evaluation – Effect of Algorithms
Product Evaluation – Anecdotal Result
• Results for query cd player
• Relevance: popularity
• Distance: from product hierarchy
Preliminary Results (100 queries)
[Plot: novelty for Max-sum as a function of thresholds and λ; x-axis: thresholds for measuring novelty; y-axis: fractional difference in novelty]
Evaluation using Mechanical Turk
[Bar charts: Diverse vs. Engine 1, Engine 2, Engine 3 on MAP-IA, NDCG-IA, and MRR-IA values at cutoffs 3, 5, …]
Other Measures of Success
Many metrics for relevance
◦ Normalized discounted cumulative gain at k (NDCG@k)
◦ Mean average precision at k (MAP@k)
◦ Mean reciprocal rank (MRR)
Some metrics for diversity
◦ Maximal marginal relevance (MMR) [CG98]
◦ Nugget-based instantiation of NDCG [C+08]
Want a metric that can take into account both relevance and diversity [JK00]
Problem Statement
DIVERSIFY(k)
Given a query q, a set of documents D, distribution P(c | q), quality estimates V(d | q, c), and integer k
Find a set of docs S ⊆ D with |S| = k that maximizes
P(S | q) = Σ_c P(c | q) · ( 1 − Π_{d ∈ S} (1 − V(d | q, c)) )
interpreted as the probability that the set S is relevant to the query over all possible intents
(Find at least one relevant doc; multiple intents)
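The objective can be evaluated directly; this sketch uses the same dictionary conventions as earlier (V maps (document, intent) pairs to V(d | q, c), an assumed encoding):

```python
def p_set_relevant(S, P, V):
    """P(S|q) = sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c))):
    the probability that at least one document in S satisfies the intent."""
    total = 0.0
    for c, pc in P.items():
        miss = 1.0
        for d in S:
            miss *= 1.0 - V.get((d, c), 0.0)   # prob. d fails to satisfy c
        total += pc * (1.0 - miss)
    return total
```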
Discussion of Objective
Makes explicit use of taxonomy
◦ In contrast, similarity-based: [CG98], [CK06], [RKJ08]
Captures both diversification and doc relevance
◦ In contrast, coverage-based: [Z+05], [C+08], [V+08]
Specific form of “loss minimization” [Z02], [ZL06]
“Diminishing returns” for docs w/ the same intent
Objective is order-independent
◦ Assumes that all users read k results
◦ May want to optimize Σ_k P(k) · P(S | q)