Direct Optimization for Web Search Ranking
Olivier Chapelle
SIGIR Workshop: Learning to Rank for Information Retrieval, July 23rd 2009
Outline
1 Introduction
2 Continuous approximation
3 Structured output learning
4 Perspectives
Web search ranking
Ranking via a relevance function
Given a query q and a document d, estimate the relevance of d to q.
Web search results are sorted by relevance.
Traditional relevance functions (BM25) are hand-designed.
Recently: several machine learning approaches to learn the relevance function.
Learning a relevance function is practical, but there are other possibilities: learn a ranking or learn a preference function.
Machine learning for ranking
Training data
1 Binary relevance label (traditional IR)
2 Multiple levels of relevance (Excellent, Good, Bad,...)
3 Pairwise comparisons
Possibility of converting 1 and 2 into 3.
Human editors are needed for 1 and 2.
But 3 can be obtained from large amounts of click data → skip-above pairs in [Joachims '02].
Rest of this talk: 2.
Information retrieval metrics
Binary relevance labels: average precision, reciprocal rank, winner takes all, AUC (i.e. fraction of misranked pairs),...

Multiple levels of relevance: Discounted Cumulative Gain at rank p:

DCG_p = Σ_{ranks j} D_p(j) G(s_r(j)) = Σ_{documents i} D_p(r^-1(i)) G(s_i)

Rank j    D(j)          G(s_r(j))
1         1             3
2         1/log2(3)     7
3         1/log2(4)     0
. . .

where
s_i is the relevance score for doc i, from 0 (Bad) to 4 (Perfect);
r is the ranking function: r(j) = i means document i is at position j;
D_p is the discount function truncated at rank p: D(j) = 1/log2(j+1) if j ≤ p, 0 otherwise;
G is the gain function, G(s) = 2^s − 1.
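To make the DCG definition concrete, here is a small NumPy sketch (the function name and toy values are illustrative, not from the talk):

import numpy as np

def dcg(grades, ranking, p=10):
    # DCG_p = sum over the top-p positions j of D(j) * G(s_{r(j)}),
    # with D(j) = 1/log2(j+1) and G(s) = 2^s - 1.
    # grades[i]: editorial grade of document i (0 = Bad ... 4 = Perfect)
    # ranking[j]: index of the document placed at position j+1
    value = 0.0
    for j, doc in enumerate(ranking[:p], start=1):
        value += (2.0 ** grades[doc] - 1.0) / np.log2(j + 1)
    return value

# Toy example matching the table above: grades 2, 3, 0 at positions 1, 2, 3
# give gains 3, 7, 0 with discounts 1, 1/log2(3), 1/log2(4).
print(dcg(np.array([2, 3, 0]), ranking=[0, 1, 2], p=3))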
Features
Given a query and a document, construct a feature vector x_i with 3 types of features:
Query only: type of query, query length,...
Document only: PageRank, length, spam,...
Query & document: match score,...
Set of q = 1, . . . , Q queries.
Set of n triplets (query, document, score): (x_i, s_i), x_i ∈ R^d, s_i ∈ {0, 1, 2, 3, 4}. U_q is the set of indices associated with the q-th query.
Approaches to ranking
Pointwise: classification [Li et al. '07], regression → works surprisingly well.
Pairwise: RankSVM; perceptron [Crammer et al. '03]; neural nets [Burges et al. '05], LambdaRank; boosting: RankBoost, GBRank.
Listwise: non metric-specific: ListNet, ListMLE; metric-specific: AdaRank; structured learning: SVMMAP, [Chapelle et al. '07]; gradient descent: SoftRank, [Chapelle et al. '09].
Two approaches for a direct optimization of the DCG:
1 Gradient descent on a smooth approximation of the DCG
2 Large margin structured output learning where the lossfunction is the DCG.
Orthogonal issue: choice of the architecture → for simplicity, linear functions. At the end, we will present non-linear extensions.
Outline
1 Introduction
2 Continuous approximation
3 Structured output learning
4 Perspectives
Main difficulty for a direct optimization (by gradient descent for instance): the DCG is not continuous and is constant almost everywhere → build a continuous approximation of it.
DCG_1 = Σ_i I(i = argmax_j w^T x_j) G(s_i)
      ≈ Σ_i [ exp(w^T x_i / σ) / Σ_j exp(w^T x_j / σ) ] G(s_i).

→ "Soft-argmax"; softness controlled by σ.
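As an illustration, a minimal NumPy sketch of this soft-argmax approximation of DCG_1 (names are illustrative; X holds one feature vector per document):

import numpy as np

def soft_dcg1(w, X, grades, sigma=1.0):
    # Softmax over the model scores w^T x_j / sigma replaces the hard argmax.
    s = (X @ w) / sigma
    p = np.exp(s - s.max())           # stabilized soft-argmax weights
    p /= p.sum()
    return p @ (2.0 ** grades - 1.0)  # weighted gains G(s_i) = 2^s_i - 1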
Generalization for DCGp:
A(w, σ) := Σ_{j=1..p} D(j) [ Σ_i G(s_i) h_ij / Σ_i h_ij ],

with h_ij a "smooth" version of the indicator "is x_i at the j-th position in the ranking?":

h_ij = exp( −(w·x_i − w·x_r(j))² / (2σ²) ).

σ controls the amount of smoothing: when σ → 0, A(w, σ) → DCG_p. A(w, σ) is continuous but non-differentiable; however, it is differentiable almost everywhere → no problem for gradient descent.
Approach generalizable to other IR metrics such as MAP.
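A corresponding sketch of the smoothed DCG_p, A(w, σ), under the same illustrative conventions as above (not the talk's actual implementation):

import numpy as np

def smooth_dcg(w, X, grades, sigma=1.0, p=10):
    # h_ij: Gaussian similarity between the score of document i and the
    # score of the document currently ranked at position j.
    s = X @ w
    order = np.argsort(-s)                 # r(j): document at position j
    gains = 2.0 ** grades - 1.0
    value = 0.0
    for j in range(min(p, len(order))):
        h = np.exp(-(s - s[order[j]]) ** 2 / (2.0 * sigma ** 2))
        value += (gains @ h) / h.sum() / np.log2(j + 2)   # D(j) weighting
    return value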
Optimization by gradient descent and annealing:
1 - Initialize: w = w0 and σ large.
2 - Starting from w, minimize by (conjugate) gradient descent: λ||w − w0||² − A(w, σ).
3 - Divide σ by 2 and go back to 2 (or stop).
w0 is an initial solution such as the one given by pairwise ranking.
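A sketch of this annealing loop, assuming the smooth_dcg function sketched above and taking the objective to be the regularizer minus the smoothed DCG (so that the DCG approximation is maximized):

import numpy as np
from scipy.optimize import minimize

def anneal(w0, X, grades, lam=1e-3, sigma0=64.0, rounds=10):
    w, sigma = w0.copy(), sigma0
    for _ in range(rounds):
        obj = lambda v: lam * np.sum((v - w0) ** 2) - smooth_dcg(v, X, grades, sigma)
        w = minimize(obj, w, method="CG").x   # (conjugate) gradient descent
        sigma /= 2.0                          # sharpen the approximation
    return w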
[Figure: objective function as a function of t, for σ = 0.125, 1, 8, 64.]
Evaluation on web search data
Dataset
Several hundred features.
∼50k (query,urls) pairs from an international market.
∼1500 queries randomly split in training / test (80% / 20%).
5 levels of relevance.
[Plots: DCG5 as a function of the smoothing factor σ, for λ = 10^1 to 10^5, on the training set (left) and the test set (right).]

DCG can be improved by almost 10% on the training set (left), but not more than 1% on the test set (right).
Evaluation on Letor 3.0
Ohsumed dataset: NDCG
[Bar chart: NDCG@10 on Ohsumed for SmoothNDCG, RankSVM, Regression, AdaRank-MAP, AdaRank-NDCG, FRank, ListNet, RankBoost, SVMMAP. Line plot: NDCG at positions 1 to 10 for SmoothNDCG, RankSVM, Regression, AdaRank-NDCG, ListNet.]

All datasets: NDCG / MAP

[Bar charts over all Letor datasets: NDCG (left) and MAP (right) for the same methods.]
Outline
1 Introduction
2 Continuous approximation
3 Structured output learning: Formulation; Experiments & Extensions
4 Perspectives
Structured output learning
Notations
x_q is a set of documents associated with query q; x_qi is the i-th document.
y_q is a ranking (i.e. a permutation): y_qi is the rank of the i-th document (obtained by sorting the scores s_qi).
Learning for structured outputs (Tsochantaridis et al. ’04)
Learn a mapping x→ y
Joint feature map: Ψ(x, y).
Prediction rule:

ŷ = argmax_y w^T Ψ(x, y).
We take: Ψ(x_q, y_q) = Σ_i x_qi A(y_qi).

A : N → R is a user-defined non-increasing function.

The ranking is given by the order of w^T x_qi, because w^T Ψ(x, y) = Σ_i (w^T x_qi) A(y_qi) is maximized by assigning the largest values of A to the documents with the largest scores.

Example: w^T x_qi = (2.5, 3.7, −0.5) and A(y_qi) = (A(2), A(1), A(3)) = (2, 3, 1), giving w^T Ψ = 2.5·2 + 3.7·3 − 0.5·1 = 15.6 → max over permutations.
Constraints for correct predictions on the training set:
∀q, ∀y ≠ y_q: w^T Ψ(x_q, y_q) − w^T Ψ(x_q, y) > 0.
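A toy sketch (made-up numbers, illustrative non-increasing A) of the feature map and of the fact that sorting by w^T x_qi maximizes w^T Ψ:

import numpy as np

def psi(X, ranks, A):
    # Psi(x, y) = sum_i A(y_i) * x_i, where ranks[i] is the rank of document i
    return sum(A(r) * X[i] for i, r in enumerate(ranks))

w = np.array([1.0, -0.5])
X = np.array([[2.0, 1.0], [3.0, 0.5], [0.0, 1.0]])
A = lambda r: max(4 - r, 0)                               # non-increasing in the rank
ranks = np.empty(len(X), dtype=int)
ranks[np.argsort(-(X @ w))] = np.arange(1, len(X) + 1)    # rank 1 = best score
print(ranks, w @ psi(X, ranks, A))                        # sorting by w^T x maximizes w^T Psi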
SVM-like optimization problem:

min_{w, ξ_q}  (λ/2) ||w||² + Σ_{q=1..Q} ξ_q,

under constraints:

∀q, ∀y ≠ y_q:  w^T Ψ(x_q, y_q) − w^T Ψ(x_q, y) ≥ ∆(y, y_q) − ξ_q,

where ∆(y, y_q) is the query loss, e.g. the difference between the DCGs of rankings y and y_q.

At the optimal solution, ξ_q ≥ ∆(ŷ_q, y_q) with ŷ_q = argmax_y w^T Ψ(x_q, y).
Optimization
ξ_q = max_y  ∆(y, y_q) + w^T Ψ(x_q, y) − w^T Ψ(x_q, y_q).

Need to find the argmax:

ŷ = argmax_y  Σ_i  A(y_i) w^T x_qi − G(s_qi) D(y_i).

→ Can be solved efficiently as a linear assignment problem.
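A sketch of that argmax as a linear assignment, using SciPy's solver (function and argument names are illustrative; A and D are assumed vectorized over positions):

import numpy as np
from scipy.optimize import linear_sum_assignment

def most_violated_ranking(w, Xq, grades, A, D):
    # benefit[i, j] of putting document i at rank position j+1:
    #   A(j+1) * w^T x_qi  -  G(s_qi) * D(j+1)
    positions = np.arange(1, len(Xq) + 1)
    benefit = np.outer(Xq @ w, A(positions)) - np.outer(2.0 ** grades - 1.0, D(positions))
    rows, cols = linear_sum_assignment(-benefit)   # maximize the total benefit
    ranks = np.empty(len(Xq), dtype=int)
    ranks[rows] = positions[cols]                  # y_i = position assigned to doc i
    return ranks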
Cutting plane
Strategy used in SVMstruct. Iterate between:
1 Solve the problem on a subset of constraints.
2 Find and add (the most) violated constraints.
Unconstrained optimization
min_w  ½ ||w||² + Σ_q max_y [ ∆(y, y_q) + w^T Ψ(x_q, y) − w^T Ψ(x_q, y_q) ].
Convex, but not differentiable
Subgradient descent
Bundle method
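Connecting to the subgradient descent option above, a loose sketch of one subgradient of the unconstrained objective (it assumes the psi and most_violated_ranking sketches from the previous slides, plus illustrative A and D):

def subgradient(w, queries, A, D):
    # queries: list of (Xq, grades, y_q) triples.
    # The max over y is attained at the most violated ranking y_bad, so a
    # subgradient of the objective is w + sum_q [Psi(x_q, y_bad) - Psi(x_q, y_q)].
    g = w.copy()
    for Xq, grades, y_q in queries:
        y_bad = most_violated_ranking(w, Xq, grades, A, D)
        g = g + psi(Xq, y_bad, A) - psi(Xq, y_q, A)
    return g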
Experiments
[Plot: NDCG at truncation levels k = 1 to 10, for SVMStruct, Regression, RankSVM.]

Normalized DCG for different truncation levels.
λ chosen on a validation set
A(r) = max(k + 1 − r, 0).
Training time ∼20 minutes.
About 2% improvement (p-value = 0.03 vs regression, 0.07 vs RankSVM).
Ohsumed dataset (Letor distribution):
3 levels of relevance.
25 features
106 queries split in training / validation / test.
Optimal solution is w = 0 even for small values of λ. Reason: there are a lot of constraints and not many variables → the objective looks like x → |x|.
[Illustration: values of w^T Ψ(x, y) for a ranking involving Perfect, Bad and Good documents.]
Large loss (because a Perfect document is ranked below a Bad one), but we would like a small loss (because a Good document is at the top).
A tighter, non-convex upper bound on the loss. Start from the convex structured loss:

max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, y_i) + ∆(y, y_i).

Replace the true ranking y_i by the best ranking with zero loss:

min_{ȳ : ∆(ȳ, y_i) = 0}  max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ȳ) + ∆(y, ȳ).

Turn the constraint on ȳ into a penalty:

min_ȳ  max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ȳ) + ∆(y, ȳ) + ∆(ȳ, y_i)
= min_ȳ  max_y  w^T Ψ(x_i, y) − w^T Ψ(x_i, ȳ) + ∆(y, y_i)
= max_y ( w^T Ψ(x_i, y) + ∆(y, y_i) ) − max_ȳ  w^T Ψ(x_i, ȳ).

1 Smaller than the original loss: take ȳ = y_i.
2 Still an upper bound on the loss: take y = ŷ.
3 Non-convex.

→ This upper bound can be used for any structured output learning problem.

Details available in Tighter bounds for structured estimation [Do et al. '09] and Optimization of ranking measures [Le et al.].
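For intuition, a brute-force sketch (tiny examples only; all names are illustrative) comparing the original convex bound with this tighter ramp-style bound:

import numpy as np
from itertools import permutations

def hinge_and_ramp(w, Xq, grades, y_true, A, D):
    score = lambda y: sum(A(r) * (Xq[i] @ w) for i, r in enumerate(y))   # w^T Psi
    dcg = lambda y: sum((2.0 ** grades[i] - 1.0) * D(r) for i, r in enumerate(y))
    delta = lambda y: dcg(y_true) - dcg(y)                               # DCG loss
    ys = list(permutations(range(1, len(Xq) + 1)))
    hinge = max(score(y) + delta(y) for y in ys) - score(y_true)         # convex bound
    ramp = max(score(y) + delta(y) for y in ys) - max(score(y) for y in ys)
    return hinge, ramp   # ramp <= hinge always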
[Plot: NDCG at truncation levels k = 1 to 10 on Ohsumed, for SVMStruct, RankBoost, RankSVM.]
Ohsumed dataset
w0 found by regression: it serves as a starting point and appears in the regularizer ||w − w0||².
Optimization for DCG10.
Non-linear extensions
1 The "obvious" kernel trick.
2 Gradient boosted decision trees.

Friedman's functional gradient boosting framework:

min_{f ∈ F}  R(f) = Σ_{j=1..n} ℓ(y_j, f(x_j)),   F = { Σ_{i=1..N} α_i h_i }.

Typically h_i is a tree and N is infinite.

Cannot do direct optimization over F.
f ← 0
repeat
    g_j ← −∂ℓ(f(x_j), y_j) / ∂f(x_j)                 (functional gradient)
    ı ← argmin_{i,λ} Σ_{j=1..n} (g_j − λ h_i(x_j))²   (steepest component)
    ρ ← argmin_ρ R(f + ρ h_ı)                         (line search)
    f ← f + η ρ h_ı                                   ("shrinkage" when η < 1)
until max iterations reached.
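A runnable, simplified sketch of this boosting loop for the squared loss, with small sklearn regression trees as the weak learners h_i (sklearn is assumed available; this is not the talk's system, and the line search is trivial, ρ = 1, for the squared loss):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, rounds=100, eta=0.1, depth=3):
    f, trees = np.zeros(len(y)), []
    for _ in range(rounds):
        g = y - f                                        # negative functional gradient
        h = DecisionTreeRegressor(max_depth=depth).fit(X, g)
        f += eta * h.predict(X)                          # shrinkage with eta < 1
        trees.append(h)
    return trees

def predict(trees, X, eta=0.1):
    return eta * sum(t.predict(X) for t in trees)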
The objective function can be much more general:
R(f(x_1), . . . , f(x_n)),   with g_j = −∂R / ∂f(x_j).

For ranking via structured output learning:

R(f) = Σ_q max_y [ ∆(y, y_q) + Σ_i f(x_qi) (A(y_i) − A(y_qi)) ].

Preliminary results are disappointing: with gradient boosted decision trees, no difference between regression and structured output learning → could be because the loss function matters only when the class of functions is restricted (underfitting).
Outline
1 Introduction
2 Continuous approximation
3 Structured output learning
4 Perspectives: Objective function; Future directions
Choice of the objective function
General consensus on relative performance of learning to rank methods:

Pointwise < Pairwise < Listwise

True, but... the differences are very small.

On real web search data using non-linear functions:

Pairwise is ∼0.5-1% better than pointwise;

Listwise is ∼0-0.5% better than pairwise.

Letor datasets are interesting to test some ideas, but validation in a real setting is necessary.
Public web search datasets
Internet Mathematics 2009
Dataset released by the Russian search engine Yandex for a competition. Available at: http://company.yandex.ru/grant/2009/en/datasets
9,124 queries / 97,290 judgements (training)
245 features
5 levels of relevance
132 submissions
Yahoo! also plans to organize a similar competition and releasedatasets. Stay tuned!
To improve a ranking system, work in priority on:
1 Feature development
2 Choice of the function class
3 Choice of the objective function to optimize
But 1 and 2 are orthogonal issues to learning to rank.
What are the other interesting problems beyond the choice of the objective function?
Sample selection bias
Training and offline test sets typically come from pooling the top results of other ranking functions.

But online test documents come from a "larger" distribution (all the documents returned by a simple ranking function).

Problem: the learning algorithm does not learn to demote very bad pages (low BM25, spam,...) because they rarely appear in the training set.

Solution: reweight the training set so that it resembles the online test distribution.
Diversity
Output a set of relevant documents which is also diverse.
Need to go beyond learning a relevance function. Structured output learning can be a principled framework for this purpose. But in any case, extra computational load at test time.

Problem: no cheap metric for diversity.

Diversity on content is more important than diversity on topic: a user can always reformulate an ambiguous query.
Transfer / multi-task learning
How to leverage the data from one (big) market to another (small) one?
Cascade learning
Ideally: rank all existing web pages. In practice: rank only a small subset of them using machine learning.

Instead: build a "cascade" of rankers:

f_1, . . . , f_T : cascade of T rankers.
All documents are applied to f_1.
Discard the bottom documents after each round.
Features and functions of increasing complexity.
Each ranker is learned.
Low-level learning
Two different ML philosophies:
1 Design a limited number of high-level features and put an ML algorithm on top of them.

2 Let the ML algorithm directly work on a large number of low-level features.

We have done 1, but 2 has been successful in various domains such as computer vision.
Two ideas:
Learn BM25 by introducing several parameters per word (such as the k in the saturation function).

Define the match score as Σ_{i,j} w_ij q_i d_j and learn the w_ij. See the earlier talk Learning to rank with low rank.
Summary
Optimizing ranking measures is difficult, but feasible.
Two types of approaches: convex upper bound or non-convex approximation.

Only small improvements in real settings (large number of examples, large number of features, non-linear architecture) → the choice of the objective function has a small influence on the overall performance.
Research on learning to rank should focus on new problems.