Direct Optimization for Web Search Ranking
Olivier Chapelle
SIGIR Workshop: Learning to Rank for Information Retrieval, July 23rd 2009
Outline
1 Introduction
2 Continuous approximation
3 Structured output learning
4 Perspectives
Web search ranking
Ranking via a relevance function
Given a query q and a document d, estimate the relevance of d to q.
Web search results are sorted by relevance.
Traditional relevance functions (BM25) are hand-designed.
Recently: several machine learning approaches to learn the relevance function.
Learning a relevance function is practical, but there are other possibilities: learn a ranking or learn a preference function.
Machine learning for ranking
Training data
1 Binary relevance label (traditional IR)
2 Multiple levels of relevance (Excellent, Good, Bad,...)
3 Pairwise comparisons
Possibility of converting 1 and 2 into 3.
Human editors are needed for 1 and 2.
But 3 can be obtained from large amounts of click data → skip-above pairs in [Joachims '02].
Rest of this talk: 2.
Information retrieval metrics
Binary relevance labels: average precision, reciprocal rank, winner takes all, AUC (i.e. fraction of misranked pairs),...

Multiple levels of relevance: Discounted Cumulative Gain at rank p:

DCG_p = Σ_{ranks j} D_p(j) G(s_r(j)) = Σ_{documents i} D_p(r^-1(i)) G(s_i)

Rank j    D(j)          G(s_r(j))
1         1             3
2         1/log2(3)     7
3         1/log2(4)     0
. . .

where
s_i is the relevance score for doc i, from 0 (Bad) to 4 (Perfect);
r is the ranking function: r(j) = i means document i is at position j;
D_p is the discount function truncated at rank p: D(j) = 1/log2(j+1) if j ≤ p, 0 otherwise;
G is the gain function, G(s) = 2^s − 1.
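To make the DCG definition concrete, here is a small NumPy sketch (the function name and toy values are illustrative, not from the talk):

import numpy as np

def dcg(grades, ranking, p=10):
    # DCG_p = sum over the top-p positions j of D(j) * G(s_{r(j)}),
    # with D(j) = 1/log2(j+1) and G(s) = 2^s - 1.
    # grades[i]: editorial grade of document i (0 = Bad ... 4 = Perfect)
    # ranking[j]: index of the document placed at position j+1
    value = 0.0
    for j, doc in enumerate(ranking[:p], start=1):
        value += (2.0 ** grades[doc] - 1.0) / np.log2(j + 1)
    return value

# Toy example matching the table above: grades 2, 3, 0 at positions 1, 2, 3
# give gains 3, 7, 0 with discounts 1, 1/log2(3), 1/log2(4).
print(dcg(np.array([2, 3, 0]), ranking=[0, 1, 2], p=3))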
Features
Given a query and a document, construct a feature vector x_i with 3 types of features:
Query only: type of query, query length,...
Document only: PageRank, length, spam,...
Query & document: match score,...
Set of q = 1, . . . , Q queries.
Set of n triplets (query, document, score): (x_i, s_i), x_i ∈ R^d, s_i ∈ {0, 1, 2, 3, 4}. U_q is the set of indices associated with the q-th query.
Approaches to ranking
Pointwise: classification [Li et al. '07], regression → works surprisingly well.
Pairwise: RankSVM; perceptron [Crammer et al. '03]; neural nets [Burges et al. '05], LambdaRank; boosting: RankBoost, GBRank.
Listwise: non metric-specific: ListNet, ListMLE; metric-specific: AdaRank; structured learning: SVMMAP, [Chapelle et al. '07]; gradient descent: SoftRank, [Chapelle et al. '09].
Two approaches for a direct optimization of the DCG:
1 Gradient descent on a smooth approximation of the DCG
2 Large margin structured output learning where the lossfunction is the DCG.
Orthogonal issue: choice of the architecture → for simplicity, linear functions. At the end, we will present non-linear extensions.
Outline
1 Introduction
2 Continuous approximation
3 Structured output learning
4 Perspectives
Main difficulty for a direct optimization (by gradient descent for instance): the DCG is not continuous and is constant almost everywhere → build a continuous approximation of it.
DCG_1 = Σ_i I(i = argmax_j w^T x_j) G(s_i)
      ≈ Σ_i [ exp(w^T x_i / σ) / Σ_j exp(w^T x_j / σ) ] G(s_i).

→ "Soft-argmax"; softness controlled by σ.
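As an illustration, a minimal NumPy sketch of this soft-argmax approximation of DCG_1 (names are illustrative; X holds one feature vector per document):

import numpy as np

def soft_dcg1(w, X, grades, sigma=1.0):
    # Softmax over the model scores w^T x_j / sigma replaces the hard argmax.
    s = (X @ w) / sigma
    p = np.exp(s - s.max())           # stabilized soft-argmax weights
    p /= p.sum()
    return p @ (2.0 ** grades - 1.0)  # weighted gains G(s_i) = 2^s_i - 1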
Generalization for DCGp:
A(w, σ) := Σ_{j=1..p} D(j) [ Σ_i G(s_i) h_ij / Σ_i h_ij ],

with h_ij a "smooth" version of the indicator "is x_i at the j-th position in the ranking?":

h_ij = exp( −(w·x_i − w·x_r(j))² / (2σ²) ).

σ controls the amount of smoothing: when σ → 0, A(w, σ) → DCG_p. A(w, σ) is continuous but non-differentiable; however, it is differentiable almost everywhere → no problem for gradient descent.
Approach generalizable to other IR metrics such as MAP.
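A corresponding sketch of the smoothed DCG_p, A(w, σ), under the same illustrative conventions as above (not the talk's actual implementation):

import numpy as np

def smooth_dcg(w, X, grades, sigma=1.0, p=10):
    # h_ij: Gaussian similarity between the score of document i and the
    # score of the document currently ranked at position j.
    s = X @ w
    order = np.argsort(-s)                 # r(j): document at position j
    gains = 2.0 ** grades - 1.0
    value = 0.0
    for j in range(min(p, len(order))):
        h = np.exp(-(s - s[order[j]]) ** 2 / (2.0 * sigma ** 2))
        value += (gains @ h) / h.sum() / np.log2(j + 2)   # D(j) weighting
    return value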
Optimization by gradient descent and annealing:
1 - Initialize: w = w0 and σ large.
2 - Starting from w, minimize by (conjugate) gradient descent: λ||w − w0||² − A(w, σ).
3 - Divide σ by 2 and go back to 2 (or stop).
w0 is an initial solution such as the one given by pairwise ranking.
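A sketch of this annealing loop, assuming the smooth_dcg function sketched above and taking the objective to be the regularizer minus the smoothed DCG (so that the DCG approximation is maximized):

import numpy as np
from scipy.optimize import minimize

def anneal(w0, X, grades, lam=1e-3, sigma0=64.0, rounds=10):
    w, sigma = w0.copy(), sigma0
    for _ in range(rounds):
        obj = lambda v: lam * np.sum((v - w0) ** 2) - smooth_dcg(v, X, grades, sigma)
        w = minimize(obj, w, method="CG").x   # (conjugate) gradient descent
        sigma /= 2.0                          # sharpen the approximation
    return w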
[Figure: objective function as a function of t, for σ = 0.125, 1, 8, 64.]
Evaluation on web search data
Dataset
Several hundred features.
∼50k (query,urls) pairs from an international market.
∼1500 queries randomly split in training / test (80% / 20%).
5 levels of relevance.
[Plots: DCG5 as a function of the smoothing factor σ, for λ = 10^1 to 10^5, on the training set (left) and the test set (right).]

DCG can be improved by almost 10% on the training set (left), but not more than 1% on the test set (right).
Evaluation on Letor 3.0
Ohsumed dataset: NDCG
[Bar chart: NDCG@10 on Ohsumed for SmoothNDCG, RankSVM, Regression, AdaRank-MAP, AdaRank-NDCG, FRank, ListNet, RankBoost, SVMMAP. Line plot: NDCG at positions 1 to 10 for SmoothNDCG, RankSVM, Regression, AdaRank-NDCG, ListNet.]

All datasets: NDCG / MAP

[Bar charts over all Letor datasets: NDCG (left) and MAP (right) for the same methods.]
Outline
1 Introduction
2 Continuous approximation
3 Structured output learning: Formulation; Experiments & Extensions
4 Perspectives
Structured output learning
Notations
x_q is a set of documents associated with query q; x_qi is the i-th document.
y_q is a ranking (i.e. a permutation): y_qi is the rank of the i-th document (obtained by sorting the scores s_qi).
Learning for structured outputs (Tsochantaridis et al. ’04)
Learn a mapping x→ y
Joint feature map: Ψ(x, y).
Prediction rule:

ŷ = argmax_y w^T Ψ(x, y).
We take: Ψ(x_q, y_q) = Σ_i x_qi A(y_qi).

A : N → R is a user-defined non-increasing function.

The ranking is given by the order of w^T x_qi, because w^T Ψ(x, y) = Σ_i (w^T x_qi) A(y_qi) is maximized by assigning the largest values of A to the documents with the largest scores.

Example: w^T x_qi = (2.5, 3.7, −0.5) and A(y_qi) = (A(2), A(1), A(3)) = (2, 3, 1), giving w^T Ψ = 2.5·2 + 3.7·3 − 0.5·1 = 15.6 → max over permutations.
Constraints for correct predictions on the training set:
∀q, ∀y ≠ y_q: w^T Ψ(x_q, y_q) − w^T Ψ(x_q, y) > 0.
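A toy sketch (made-up numbers, illustrative non-increasing A) of the feature map and of the fact that sorting by w^T x_qi maximizes w^T Ψ:

import numpy as np

def psi(X, ranks, A):
    # Psi(x, y) = sum_i A(y_i) * x_i, where ranks[i] is the rank of document i
    return sum(A(r) * X[i] for i, r in enumerate(ranks))

w = np.array([1.0, -0.5])
X = np.array([[2.0, 1.0], [3.0, 0.5], [0.0, 1.0]])
A = lambda r: max(4 - r, 0)                               # non-increasing in the rank
ranks = np.empty(len(X), dtype=int)
ranks[np.argsort(-(X @ w))] = np.arange(1, len(X) + 1)    # rank 1 = best score
print(ranks, w @ psi(X, ranks, A))                        # sorting by w^T x maximizes w^T Psi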
SVM-like optimization problem:

min_{w, ξ_q}  (λ/2) ||w||² + Σ_{q=1..Q} ξ_q,

under constraints:

∀q, ∀y ≠ y_q:  w^T Ψ(x_q, y_q) − w^T Ψ(x_q, y) ≥ ∆(y, y_q) − ξ_q,

where ∆(y, y_q) is the query loss, e.g. the difference between the DCGs of rankings y and y_q.

At the optimal solution, ξ_q ≥ ∆(ŷ_q, y_q) with ŷ_q = argmax_y w^T Ψ(x_q, y).
Optimization
ξ_q = max_y  ∆(y, y_q) + w^T Ψ(x_q, y) − w^T Ψ(x_q, y_q).

Need to find the argmax:

ŷ = argmax_y  Σ_i  A(y_i) w^T x_qi − G(s_qi) D(y_i).

→ Can be solved efficiently as a linear assignment problem.
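A sketch of that argmax as a linear assignment, using SciPy's solver (function and argument names are illustrative; A and D are assumed vectorized over positions):

import numpy as np
from scipy.optimize import linear_sum_assignment

def most_violated_ranking(w, Xq, grades, A, D):
    # benefit[i, j] of putting document i at rank position j+1:
    #   A(j+1) * w^T x_qi  -  G(s_qi) * D(j+1)
    positions = np.arange(1, len(Xq) + 1)
    benefit = np.outer(Xq @ w, A(positions)) - np.outer(2.0 ** grades - 1.0, D(positions))
    rows, cols = linear_sum_assignment(-benefit)   # maximize the total benefit
    ranks = np.empty(len(Xq), dtype=int)
    ranks[rows] = positions[cols]                  # y_i = position assigned to doc i
    return ranks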
Cutting plane
Strategy used in SVMstruct. Iterate between:
1 Solve the problem on a subset of constraints.
2 Find and add (the most) violated constraints.
Unconstrained optimization
min_w  ½ ||w||² + Σ_q max_y [ ∆(y, y_q) + w^T Ψ(x_q, y) − w^T Ψ(x_q, y_q) ].
Convex, but not differentiable
Subgradient descent
Bundle method
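Connecting to the subgradient descent option above, a loose sketch of one subgradient of the unconstrained objective (it assumes the psi and most_violated_ranking sketches from the previous slides, plus illustrative A and D):

def subgradient(w, queries, A, D):
    # queries: list of (Xq, grades, y_q) triples.
    # The max over y is attained at the most violated ranking y_bad, so a
    # subgradient of the objective is w + sum_q [Psi(x_q, y_bad) - Psi(x_q, y_q)].
    g = w.copy()
    for Xq, grades, y_q in queries:
        y_bad = most_violated_ranking(w, Xq, grades, A, D)
        g = g + psi(Xq, y_bad, A) - psi(Xq, y_q, A)
    return g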
Experiments
[Plot: NDCG at truncation levels k = 1 to 10, for SVMStruct, Regression, RankSVM.]

Normalized DCG for different truncation levels.
λ chosen on a validation set
A(r) = max(k + 1 − r, 0).
Training time ∼20 minutes.
About 2% improvement (p-value = 0.03 vs regression, 0.07 vs RankSVM).
Ohsumed dataset (Letor distribution):
3 levels of relevance.
25 features
106 queries split in training / validation / test.
Optimal solution is w = 0 even for small values of λ. Reason: there are a lot of constraints and not many variables → the objective looks like x → |x|.
[Illustration: values of w^T Ψ(x, y) for a ranking involving Perfect, Bad and Good documents.]
Large loss (because a Perfect document is ranked below a Bad one), but we would like a small loss (because a Good document is at the top).
A tighter, non-convex upper bound on the loss. Start from the convex structured loss:

max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, y_i) + ∆(y, y_i).

Replace the true ranking y_i by the best ranking with zero loss:

min_{ȳ : ∆(ȳ, y_i) = 0}  max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ȳ) + ∆(y, ȳ).

Turn the constraint on ȳ into a penalty:

min_ȳ  max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ȳ) + ∆(y, ȳ) + ∆(ȳ, y_i)
= min_ȳ  max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ȳ) + ∆(y, y_i)
= max_y ( w^T Ψ(x_i, y) + ∆(y, y_i) ) − max_ȳ  w^T Ψ(x_i, ȳ).

1 Smaller than the original loss: take ȳ = y_i.
2 Still an upper bound on the loss: take y = ŷ.
3 Non-convex.

→ This upper bound can be used for any structured output learning problem.

Details available in Tighter bounds for structured estimation [Do et al. '09] and Optimization of ranking measures [Le et al.].
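For intuition, a brute-force sketch (tiny examples only; all names are illustrative) comparing the original convex bound with this tighter ramp-style bound:

import numpy as np
from itertools import permutations

def hinge_and_ramp(w, Xq, grades, y_true, A, D):
    score = lambda y: sum(A(r) * (Xq[i] @ w) for i, r in enumerate(y))   # w^T Psi
    dcg = lambda y: sum((2.0 ** grades[i] - 1.0) * D(r) for i, r in enumerate(y))
    delta = lambda y: dcg(y_true) - dcg(y)                               # DCG loss
    ys = list(permutations(range(1, len(Xq) + 1)))
    hinge = max(score(y) + delta(y) for y in ys) - score(y_true)         # convex bound
    ramp = max(score(y) + delta(y) for y in ys) - max(score(y) for y in ys)
    return hinge, ramp   # ramp <= hinge always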
[Plot: NDCG at truncation levels k = 1 to 10 on Ohsumed, for SVMStruct, RankBoost, RankSVM.]
Ohsumed dataset
w0 found by regression: it serves as a starting point and appears in the regularizer ||w − w0||².
Optimization for DCG10.
Non-linear extensions
1 The "obvious" kernel trick.
2 Gradient boosted decision trees.

Friedman's functional gradient boosting framework:

min_{f ∈ F}  R(f) = Σ_{j=1..n} ℓ(y_j, f(x_j)),   F = { Σ_{i=1..N} α_i h_i }.

Typically h_i is a tree and N is infinite.

Cannot do direct optimization over F.
f ← 0
repeat
    g_j ← −∂ℓ(f(x_j), y_j) / ∂f(x_j)                 (functional gradient)
    ı ← argmin_{i,λ} Σ_{j=1..n} (g_j − λ h_i(x_j))²   (steepest component)
    ρ ← argmin_ρ R(f + ρ h_ı)                         (line search)
    f ← f + η ρ h_ı                                   ("shrinkage" when η < 1)
until max iterations reached.
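A runnable, simplified sketch of this boosting loop for the squared loss, with small sklearn regression trees as the weak learners h_i (sklearn is assumed available; this is not the talk's system, and the line search is trivial, ρ = 1, for the squared loss):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, rounds=100, eta=0.1, depth=3):
    f, trees = np.zeros(len(y)), []
    for _ in range(rounds):
        g = y - f                                        # negative functional gradient
        h = DecisionTreeRegressor(max_depth=depth).fit(X, g)
        f += eta * h.predict(X)                          # shrinkage with eta < 1
        trees.append(h)
    return trees

def predict(trees, X, eta=0.1):
    return eta * sum(t.predict(X) for t in trees)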
The objective function can be much more general:
R(f(x_1), . . . , f(x_n)),   with g_j = −∂R / ∂f(x_j).

For ranking via structured output learning:

R(f) = Σ_q max_y [ ∆(y, y_q) + Σ_i f(x_qi) (A(y_i) − A(y_qi)) ].

Preliminary results are disappointing: with gradient boosted decision trees, no difference between regression and structured output learning → could be because the loss function matters only when the class of functions is restricted (underfitting).
Outline
1 Introduction
2 Continuous approximation
3 Structured output learning
4 Perspectives: Objective function; Future directions
Choice of the objective function
General consensus on relative performance of learning to rank methods:

Pointwise < Pairwise < Listwise

True, but... the differences are very small.

On real web search data using non-linear functions:

Pairwise is ∼0.5-1% better than pointwise;

Listwise is ∼0-0.5% better than pairwise.

Letor datasets are interesting to test some ideas, but validation in a real setting is necessary.
Public web search datasets
Internet Mathematics 2009
Dataset released by the Russian search engine Yandex for a competition. Available at: http://company.yandex.ru/grant/2009/en/datasets
9,124 queries / 97,290 judgements (training)
245 features
5 levels of relevance
132 submissions
Yahoo! also plans to organize a similar competition and releasedatasets. Stay tuned!
To improve a ranking system, work in priority on:
1 Feature development
2 Choice of the function class
3 Choice of the objective function to optimize
But 1 and 2 are orthogonal issues to learning to rank.
What are the other interesting problems beyond the choice of the objective function?
Sample selection bias
Training and offline test sets typically come from pooling the top results of other ranking functions.

But online test documents come from a "larger" distribution (all the documents returned by a simple ranking function).

Problem: the learning algorithm does not learn to demote very bad pages (low BM25, spam,...) because they rarely appear in the training set.

Solution: reweight the training set so that it resembles the online test distribution.
Diversity
Output a set of relevant documents which is also diverse.
Need to go beyond learning a relevance function. Structured output learning can be a principled framework for this purpose. But in any case, extra computational load at test time.

Problem: no cheap metric for diversity.

Diversity on content is more important than diversity on topic: a user can always reformulate an ambiguous query.
Transfer / multi-task learning
How to leverage the data from one (big) market to another (small) one?
Cascade learning
Ideally: rank all existing web pages. In practice: rank only a small subset of them using machine learning.

Instead: build a "cascade" of rankers:

f_1, . . . , f_T : cascade of T rankers.
All documents are applied to f_1.
Discard the bottom documents after each round.
Features and functions of increasing complexity.
Each ranker is learned.
Low-level learning
Two different ML philosophies:
1 Design a limited number of high-level features and put an ML algorithm on top of them.

2 Let the ML algorithm directly work on a large number of low-level features.

We have done 1, but 2 has been successful in various domains such as computer vision.
Two ideas:
Learn BM25 by introducing several parameters per word (such as the k in the saturation function).

Define the match score as Σ_{i,j} w_ij q_i d_j and learn the w_ij. See the earlier talk Learning to rank with low rank.
Summary
Optimizing ranking measures is difficult, but feasible.
Two types of approaches: convex upper bound or non-convex approximation.

Only small improvements in real settings (large number of examples, large number of features, non-linear architecture) → the choice of the objective function has a small influence on the overall performance.
Research on learning to rank should focus on new problems.