
Page 1

Direct Optimization for Web Search Ranking

Olivier Chapelle

SIGIR Workshop: Learning to Rank for Information Retrieval, July 23rd 2009

Page 2

Outline

1 Introduction

2 Continuous approximation

3 Structured output learning

4 Perspectives


Page 4

Web search ranking

Ranking via a relevance function

Given a query q and a document d, estimate the relevance of d to q.

Web search results are sorted by relevance.

Traditional relevance functions (BM25) are hand-designed.

Recently: several machine learning approaches to learn the relevance function.

Learning a relevance function is practical, but there are other possibilities: learn a ranking or learn a preference function.

Page 5

Machine learning for ranking

Training data

1 Binary relevance label (traditional IR)

2 Multiple levels of relevance (Excellent, Good, Bad,...)

3 Pairwise comparisons

Possibility of converting 1 and 2 into 3.

Need for human editors for 1 and 2.

But possibility of using 3 with large amounts of click data → skip-above pairs in [Joachims '02].

Rest of this talk: 2.

Page 6

Information retrieval metrics

Binary relevance labels: average precision, reciprocal rank, winner takes all, AUC (i.e. fraction of misranked pairs), ...

Multiple levels of relevance: Discounted Cumulative Gain at rank p:

DCG_p = ∑_{ranks j} D_p(j) G(s_{r(j)}) = ∑_{documents i} D_p(r⁻¹(i)) G(s_i),

Rank j   D(j)         G(s_{r(j)})
1        1            3
2        1/log2(3)    7
3        1/log2(4)    0
...

where
s_i is the relevance score for doc i, from 0 (Bad) to 4 (Perfect);
r is the ranking function: r(j) = i means document i is at position j;
D_p is the discount function truncated at rank p: D(j) = 1/log2(j+1) if j ≤ p, 0 otherwise (as in the table above);
G is the gain function: G(s) = 2^s − 1.
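To make the definitions concrete, here is a minimal Python sketch of DCG@p under these conventions (function and variable names are ours, not from the talk):

```python
import math

def dcg(grades, p):
    """DCG@p for documents given in ranked order: grades[j] is the
    editorial grade (0 = Bad ... 4 = Perfect) of the document at
    position j+1; gain G(s) = 2^s - 1, discount D(j) = 1/log2(j+1)."""
    return sum((2 ** s - 1) / math.log2(j + 2)
               for j, s in enumerate(grades[:p]))

# The example from the table above: grades 2, 3, 0 at positions 1, 2, 3.
print(dcg([2, 3, 0], p=3))  # 3/log2(2) + 7/log2(3) + 0 ≈ 7.42
```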

Page 7

Features

Given a query and a document, construct a feature vector x_i with 3 types of features:

Query only: type of query, query length, ...

Document only: PageRank, length, spam, ...

Query & document: match score, ...

Set of queries q = 1, . . . , Q.

Set of n triplets (query, document, score): (x_i, s_i), x_i ∈ R^d, s_i ∈ {0, 1, 2, 3, 4}. U_q is the set of indices associated with the q-th query.

Page 8

Approaches to ranking

Pointwise: classification [Li et al. '07], regression → works surprisingly well.

Pairwise: RankSVM; perceptron [Crammer et al. '03]; neural nets [Burges et al. '05], LambdaRank; boosting: RankBoost, GBRank.

Listwise:
    non metric-specific: ListNet, ListMLE;
    metric-specific: AdaRank; structured learning: SVMMAP, [Chapelle et al. '07]; gradient descent: SoftRank, [Chapelle et al. '09].

Page 9

Two approaches for direct optimization of the DCG:

1 Gradient descent on a smooth approximation of the DCG

2 Large margin structured output learning where the loss function is the DCG.

Orthogonal issue: choice of the architecture → for simplicity, linear functions. At the end, we will present non-linear extensions.

Page 10

Outline

1 Introduction

2 Continuous approximation

3 Structured output learning

4 Perspectives

Page 11

Main difficulty for direct optimization (by gradient descent, for instance): the DCG is not continuous, and it is constant almost everywhere → build a continuous approximation of it.

DCG_1 = ∑_i I(i = argmax_j w⊤x_j) G(s_i) ≈ ∑_i [exp(w⊤x_i/σ) / ∑_j exp(w⊤x_j/σ)] G(s_i).

→ "Soft-argmax"; softness controlled by σ.
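A minimal NumPy sketch of this soft-argmax approximation (our naming; X holds the feature vectors of one query's documents, gains the G(s_i)):

```python
import numpy as np

def soft_dcg1(w, X, gains, sigma):
    """Smooth DCG@1: the argmax over documents is replaced by a
    softmax over the scores w.x_i / sigma; as sigma -> 0 this tends
    to the gain of the top-scored document."""
    scores = X @ w / sigma
    p = np.exp(scores - scores.max())  # numerically stable softmax
    return (p / p.sum()) @ gains
```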

Page 12

Generalization for DCG_p:

A(w, σ) := ∑_{j=1}^{p} D(j) [∑_i G(s_i) h_ij / ∑_i h_ij],

with h_ij a "smooth" version of the indicator function "Is x_i at the j-th position in the ranking?":

h_ij = exp( −(w⊤x_i − w⊤x_{r(j)})² / 2σ² ).

σ controls the amount of smoothing: when σ → 0, A(w, σ) → DCG_p. A(w, σ) is continuous but non-differentiable; it is, however, differentiable almost everywhere → no problem for gradient descent.

Approach generalizable to other IR metrics such as MAP.
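A sketch of the smoothed DCG_p, assuming the discount D(j) = 1/log2(j+1) and that gains holds the G(s_i); as before, the names are illustrative:

```python
import numpy as np

def smooth_dcg(w, X, gains, sigma, p):
    """A(w, sigma): h[i, j] is a soft indicator that document i sits
    at rank j, based on the gap between its score and the score of
    the document currently ranked j-th. Assumes p <= number of docs."""
    scores = X @ w
    top = np.sort(scores)[::-1][:p]                 # scores at ranks 1..p
    h = np.exp(-(scores[:, None] - top[None, :]) ** 2 / (2 * sigma ** 2))
    D = 1.0 / np.log2(np.arange(2, p + 2))          # D(1), ..., D(p)
    return float(D @ ((gains @ h) / h.sum(axis=0)))
```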

Page 13

Optimization by gradient descent and annealing:

1. Initialize: w = w0 and a large σ.
2. Starting from w, minimize by (conjugate) gradient descent: λ||w − w0||² − A(w, σ).
3. Divide σ by 2 and go back to 2 (or stop).

w0 is an initial solution such as the one given by pairwise ranking.

[Figure: objective function along one direction t in parameter space, for σ = 0.125, 1, 8, 64; larger σ gives a smoother objective.]
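Putting the two previous pieces together, a sketch of the annealing loop (reusing smooth_dcg from above; maximizing the smooth DCG by minimizing its negative, per the objective on this slide):

```python
import numpy as np
from scipy.optimize import minimize

def anneal(w0, data, lam, sigma0=64.0, stages=10, p=10):
    """Annealed optimization: at each stage, minimize
    lam * ||w - w0||^2 - sum_q A(w, sigma) by conjugate gradient,
    then halve sigma. data is a list of (X, gains) pairs per query."""
    w, sigma = w0.copy(), sigma0
    for _ in range(stages):
        def obj(v):
            a = sum(smooth_dcg(v, X, g, sigma, p) for X, g in data)
            return lam * np.sum((v - w0) ** 2) - a
        w = minimize(obj, w, method="CG").x  # conjugate gradient stage
        sigma /= 2.0                         # sharpen the approximation
    return w
```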

Page 14

Evaluation on web search data

Dataset

Several hundred features.

∼50k (query, url) pairs from an international market.

∼1500 queries randomly split in training / test (80% / 20%).

5 levels of relevance.

[Figure: DCG5 as a function of the smoothing factor σ, for λ = 10^1, ..., 10^5; training set (left) and test set (right).]

DCG can be improved by almost 10% on the training set (left), but by not more than 1% on the test set (right).


Page 16

Evaluation on Letor 3.0

Ohsumed dataset: NDCG

[Figure: NDCG@10 on Ohsumed for SmoothNDCG, RankSVM, Regression, AdaRank-MAP, AdaRank-NDCG, FRank, ListNet, RankBoost, SVMMAP (left); NDCG at positions 1-10 for SmoothNDCG, RankSVM, Regression, AdaRank-NDCG, ListNet (right).]

All datasets: NDCG / MAP

[Figure: NDCG@10 (left) and MAP (right) over all Letor datasets for SmoothNDCG, RankSVM, Regression, AdaRank-MAP, AdaRank-NDCG, FRank, ListNet, RankBoost, SVMMAP.]

Page 17

Outline

1 Introduction

2 Continuous approximation

3 Structured output learning
    Formulation
    Experiments & Extensions

4 Perspectives

Page 18

Structured output learning

Notations

x^q is the set of documents associated with query q; x^q_i is the i-th document.

y^q is a ranking (i.e. a permutation): y^q_i is the rank of the i-th document (obtained by sorting the scores s^q_i).

Learning for structured outputs (Tsochantaridis et al. '04)

Learn a mapping x → y.

Joint feature map: Ψ(x, y).

Prediction rule:

ŷ = argmax_y w⊤Ψ(x, y).

Page 19

We take: Ψ(x^q, y^q) = ∑_i x^q_i A(y^q_i).

A : N → R is a user-defined non-increasing function.

The ranking is given by the order of the w⊤x^q_i because w⊤Ψ(x, y) = ∑_i (w⊤x^q_i) A(y^q_i).

Example:
w⊤x^q_i :   2.5        3.7        −0.5
A(y_i)  :   A(2) = 2   A(1) = 3   A(3) = 1
2.5 × 2 + 3.7 × 3 + (−0.5) × 1 = 15.6 → max

Constraints for correct predictions on the training set:

∀q, ∀y ≠ y^q :  w⊤Ψ(x^q, y^q) − w⊤Ψ(x^q, y) > 0.
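A toy check of this construction with the numbers from the slide (hypothetical 1-d features chosen so that w⊤x_i equals the displayed scores): brute force over permutations confirms that the maximizing y simply sorts documents by score.

```python
import numpy as np
from itertools import permutations

def psi(X, y, A):
    """Joint feature map Psi(x, y) = sum_i A(y_i) x_i,
    where y[i] is the (1-based) rank of document i."""
    return sum(A(r) * x for x, r in zip(X, y))

A = lambda r: {1: 3, 2: 2, 3: 1}[r]          # non-increasing A from the slide
w = np.array([1.0])                          # 1-d toy weight vector
X = [np.array([2.5]), np.array([3.7]), np.array([-0.5])]  # w.x_i = scores

best = max(permutations([1, 2, 3]), key=lambda y: w @ psi(X, y, A))
print(best)  # (2, 1, 3): the highest score gets rank 1;
             # objective 2.5*2 + 3.7*3 + (-0.5)*1 = 15.6
```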

Page 20

SVM-like optimization problem:

min_{w, ξ_q}  (λ/2) ||w||² + ∑_{q=1}^{Q} ξ_q,

under constraints:

∀q, ∀y ≠ y^q :  w⊤Ψ(x^q, y^q) − w⊤Ψ(x^q, y) ≥ ∆(y, y^q) − ξ_q,

where ∆(y, y^q) is the query loss, e.g. the difference between the DCGs of ranking y and of ranking y^q.

At the optimal solution, ξ_q ≥ ∆(ŷ^q, y^q) with ŷ^q = argmax_y w⊤Ψ(x^q, y).

Page 21

Optimization

ξ_q = max_y  ∆(y, y^q) + w⊤Ψ(x^q, y) − w⊤Ψ(x^q, y^q).

Need to find the argmax:

ŷ = argmax_y ∑_i A(y_i) w⊤x^q_i − G(s^q_i) D(y_i).

→ Can be solved efficiently as a linear assignment problem.
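A sketch of this loss-augmented argmax as a linear assignment problem, using scipy's Hungarian-algorithm solver (names and the exact cost layout are our illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def most_violated(scores, gains, A_vals, D_vals):
    """Solve argmax_y sum_i [A(y_i) w.x_i - G(s_i) D(y_i)].
    C[i, j] is the contribution of placing document i at rank j+1;
    scores = w.x_i, A_vals[j] = A(j+1), D_vals[j] = D(j+1)."""
    C = np.outer(scores, A_vals) - np.outer(gains, D_vals)
    rows, cols = linear_sum_assignment(-C)   # maximize => negate costs
    y = np.empty(len(scores), dtype=int)
    y[rows] = cols + 1                       # y[i] = rank of document i
    return y
```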

Page 22

Cutting plane

Strategy used in SVMstruct. Iterate between:

1 Solving the problem on a subset of the constraints.

2 Finding and adding the most violated constraints.

Unconstrained optimization

min_w  (1/2) ||w||² + ∑_q max_y [ ∆(y, y^q) + w⊤Ψ(x^q, y) − w⊤Ψ(x^q, y^q) ].

Convex, but not differentiable:

Subgradient descent

Bundle method
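For the unconstrained form, a subgradient at w comes from the maximizer ŷ of each inner max. A minimal sketch reusing psi and most_violated from the earlier sketches (our naming; lam stands in for the regularization weight):

```python
import numpy as np

def subgradient_step(w, queries, lam, eta):
    """One step of subgradient descent on
    lam/2 ||w||^2 + sum_q max_y [Delta + w.Psi(x,y) - w.Psi(x,y_q)].
    queries: list of (X, gains, y_true, A, A_vals, D_vals) per query."""
    g = lam * w
    for X, gains, y_true, A, A_vals, D_vals in queries:
        scores = np.array([float(w @ x) for x in X])
        y_hat = most_violated(scores, gains, A_vals, D_vals)
        # Subgradient of the max: Psi at the maximizer minus Psi at y_q.
        g = g + psi(X, y_hat, A) - psi(X, y_true, A)
    return w - eta * g
```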


Page 24

Experiments

[Figure: NDCG at truncation levels k = 1..10 for SVMStruct, Regression, RankSVM.]

Normalized DCG for different truncation levels.

λ chosen on a validation set.

A(r) = max(k + 1 − r, 0).

Training time: ∼20 minutes.

About 2% improvement (p-value = 0.03 vs regression, 0.07 vs RankSVM).

Page 25

Ohsumed dataset (Letor distribution):

3 levels of relevance.

25 features.

106 queries split in training / validation / test.

Optimal solution is w = 0 even for small values of λ. Reason: there are a lot of constraints and not a lot of variables → the objective function looks like x → |x|.

[Figure: values of w⊤Ψ(x, y) for three rankings, headed respectively by a Perfect, a Bad, and a Good document.]

Large loss (because Perfect < Bad), but we would like a small one (because Good is at the top).


Page 27

The convex upper bound used above:

max_y  w⊤Ψ(x_i, y) − w⊤Ψ(x_i, y_i) + ∆(y, y_i)

Page 28

Replace the reference ranking by any ȳ with zero loss:

min_{ȳ : ∆(ȳ, y_i) = 0}  max_y  w⊤Ψ(x_i, y) − w⊤Ψ(x_i, ȳ) + ∆(y, ȳ)

Page 29

Relax to a min over all ȳ, penalizing ∆(ȳ, y_i):

min_{ȳ}  max_y  w⊤Ψ(x_i, y) − w⊤Ψ(x_i, ȳ) + ∆(y, ȳ) + ∆(ȳ, y_i)


Page 31

min_{ȳ} max_y  w⊤Ψ(x_i, y) − w⊤Ψ(x_i, ȳ) + ∆(y, ȳ) + ∆(ȳ, y_i)
= min_{ȳ} max_y  w⊤Ψ(x_i, y) − w⊤Ψ(x_i, ȳ) + ∆(y, y_i)
= max_y ( w⊤Ψ(x_i, y) + ∆(y, y_i) ) − max_{ȳ} w⊤Ψ(x_i, ȳ).

1 Smaller than the original loss: take ȳ = y_i.

2 Still an upper bound on the loss: take y = ȳ.

3 Non-convex.

→ This upper bound can be used for any structured output learning problem.

Details available in Tighter bounds for structured estimation [Do et al. '09] and Optimization of ranking measures [Le et al.].
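For tiny document sets, this non-convex bound can be checked by brute force; a sketch reusing psi from above, with a user-supplied delta (e.g. a DCG difference):

```python
import numpy as np
from itertools import permutations

def tighter_bound(w, X, y_true, A, delta):
    """max_y [w.Psi(x,y) + Delta(y, y_true)] - max_y w.Psi(x,y).
    Enumerates all permutations, so only usable for a handful of docs."""
    ys = list(permutations(range(1, len(X) + 1)))
    score = lambda y: float(w @ psi(X, y, A))
    return (max(score(y) + delta(y, y_true) for y in ys)
            - max(score(y) for y in ys))
```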

Page 32

[Figure: NDCG at truncation levels k = 1..10 on the Ohsumed dataset for SVMStruct, RankBoost, RankSVM.]

Ohsumed dataset.

w0 found by regression: serves as starting point and appears in the regularizer ||w − w0||².

Optimization for DCG10.

Page 33

Non-linear extensions

1 The "obvious" kernel trick.

2 Gradient boosted decision trees.

Friedman's functional gradient boosting framework:

min_{f ∈ F}  R(f) = ∑_{j=1}^{n} ℓ(y_j, f(x_j)),   F = { ∑_{i=1}^{N} α_i h_i }.

Typically h_i is a tree and N is infinite.

Cannot do direct optimization on F.

f ← 0
repeat
    g_j ← −∂ℓ(f(x_j), y_j) / ∂f(x_j)                      ▷ functional gradient
    ı̂ ← argmin_{i,λ} ∑_{j=1}^{n} (g_j − λ h_i(x_j))²       ▷ steepest component
    ρ ← argmin_ρ R(f + ρ h_ı̂)                              ▷ line search
    f ← f + η ρ h_ı̂                                        ▷ "shrinkage" when η < 1
until max iterations reached.
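A condensed sketch of this loop with regression trees as the weak learners h_i (scikit-learn's tree fitting stands in for the "steepest component" search; the line search is omitted):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, grad_loss, n_iter=100, eta=0.1, depth=3):
    """Functional gradient boosting: fit a tree to the negative
    functional gradient g_j, then take a shrunken step of size eta."""
    f, trees = np.zeros(len(y)), []
    for _ in range(n_iter):
        g = -grad_loss(f, y)                         # functional gradient
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, g)
        f += eta * tree.predict(X)                   # shrinkage step
        trees.append(tree)
    return trees

# e.g. squared loss: trees = gradient_boost(X, y, lambda f, y: f - y)
```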



Page 36

The objective function can be much more general:

R(f(x_1), . . . , f(x_n)),   g_j = −∂R / ∂f(x_j).

For ranking via structured output learning:

R(f) = ∑_q max_y ( ∆(y, y^q) + ∑_i f(x^q_i) (A(y_i) − A(y^q_i)) ).

Preliminary results are disappointing: with gradient boosted decision trees, there is no difference between regression and structured output learning. → Could be because the loss function matters only when the class of functions is restricted (underfitting).


Page 38

Outline

1 Introduction

2 Continuous approximation

3 Structured output learning

4 Perspectives
    Objective function
    Future directions

Page 39

Choice of the objective function

General consensus on the relative performance of learning to rank methods:

Pointwise < Pairwise < Listwise

True, but ... the differences are very small.

On real web search data using non-linear functions:

Pairwise is ∼0.5-1% better than pointwise;

Listwise is ∼0-0.5% better than pairwise.

Letor datasets are interesting to test some ideas, but validation in a real setting is necessary.


Page 41

Public web search datasets

Internet Mathematics 2009

Dataset released by the Russian search engine Yandex for a competition. Available at: http://company.yandex.ru/grant/2009/en/datasets

9,124 queries / 97,290 judgements (training)

245 features

5 levels of relevance

132 submissions

Yahoo! also plans to organize a similar competition and release datasets. Stay tuned!

Page 42

To improve a ranking system, work in priority on:

1 Feature development

2 Choice of the function class

3 Choice of the objective function to optimize

But 1 and 2 are issues orthogonal to learning to rank.

What are the other interesting problems beyond the choice of the objective function?

Page 43

Sample selection bias

Training and offline test sets typically come from pooling the top results of other ranking functions.

But online test documents come from a "larger" distribution (all the documents from a simple ranking function).

Problem: the learning algorithm does not learn to demote very bad pages (low BM25, spam, ...) because they rarely appear in the training set.

Solution: reweight the training set such that it resembles the online test distribution.

Page 44

Diversity

Output a set of relevant documents which is also diverse.

Need to go beyond learning a relevance function. Structured output learning can be a principled framework for this purpose. But in any case, extra computational load at test time.

Problem: no cheap metric for diversity.

Diversity on content is more important than diversity on topic: a user can always reformulate an ambiguous query.

Page 45

Transfer / multi-task learning

How to leverage the data from one (big) market to another (small) one?

Page 46

Cascade learning

Ideally: rank all existing web pages. In practice: rank only a small subset of them using machine learning.

Instead: build a "cascade" of rankers:

f_1, . . . , f_T : cascade of T rankers.
All documents are applied to f_1.
Discard the bottom documents after each round.
Features and functions of increasing complexity.
Each ranker is learned.

Page 47

Low-level learning

Two different ML philosophies:

1 Design a limited number of high-level features and put an ML algorithm on top of them.

2 Let the ML algorithm work directly on a large number of low-level features.

We have done 1, but 2 has been successful in various domains such as computer vision.

Two ideas:

Learn BM25 by introducing several parameters per word (such as the k in the saturation function).

Define the match score as ∑_{i,j} w_ij q_i d_j and learn the w_ij. See the earlier talk Learning to rank with low rank.
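A sketch of such a bilinear match score; with a low-rank factorization W = U V⊤ (the setting of the low-rank talk referenced above) the score is computable without ever materializing W. Sizes and names are illustrative:

```python
import numpy as np

# Bilinear match score sum_{i,j} w_ij q_i d_j between term vectors.
rng = np.random.default_rng(0)
vocab, k = 10_000, 50                      # vocabulary size, rank
U, V = rng.normal(size=(vocab, k)), rng.normal(size=(vocab, k))
q, d = rng.random(vocab), rng.random(vocab)

score = (q @ U) @ (V.T @ d)  # = q.T (U V.T) d, without the vocab x vocab W
```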

Page 48

Summary

Optimizing ranking measures is difficult, but feasible.

Two types of approaches: convex upper bound or non-convexapproximation.

Only small improvements in real settings (large number of examples, large number of features, non-linear architecture). → The choice of the objective function has a small influence on the overall performance.

Research on learning to rank should focus on new problems.