Page 1

A Support Vector Method for Optimizing Average Precision

SIGIR 2007

Yisong Yue, Cornell University

In Collaboration With:

Thomas Finley, Filip Radlinski, Thorsten Joachims

(Cornell University)

Page 2

Motivation

• Learn to Rank Documents

• Optimize for IR performance measures
  – Mean Average Precision

• Leverage Structural SVMs [Tsochantaridis et al. 2005]

Page 3

MAP vs Accuracy

• Average precision is the average of the precision scores at the rank locations of each relevant document.

• Ex: the ranking (relevant, non-relevant, relevant, non-relevant, relevant) has average precision (1/1 + 2/3 + 3/5)/3 ≈ 0.76.

• Mean Average Precision (MAP) is the mean of the Average Precision scores over a group of queries.

• A machine learning algorithm optimizing for Accuracy might learn a very different model than optimizing for MAP.

• Ex: the ranking (non-relevant, relevant, relevant, relevant, non-relevant) has average precision of about 0.64 (= (1/2 + 2/3 + 3/4)/3), but a max accuracy of 0.8 vs 0.6 for the ranking above; both AP values are checked in the sketch below.
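As a concrete check (a minimal sketch, not from the slides), the following computes average precision for the two example rankings:

```python
# Average precision for a ranked list of relevance labels (1 = relevant).
def average_precision(ranking):
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at this relevant document
    return total / hits

print(average_precision([1, 0, 1, 0, 1]))  # (1/1 + 2/3 + 3/5)/3 ~ 0.76
print(average_precision([0, 1, 1, 1, 0]))  # (1/2 + 2/3 + 3/4)/3 ~ 0.64
```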


Page 4

Recent Related Work

Greedy & Local Search
• [Metzler & Croft 2005] – optimized for MAP using gradient descent; expensive for large numbers of features.
• [Caruana et al. 2004] – iteratively built an ensemble to greedily improve arbitrary performance measures.

Surrogate Performance Measures
• [Burges et al. 2005] – used neural nets optimizing for cross entropy.
• [Cao et al. 2006] – used SVMs optimizing for a modified ROC-Area.

Relaxations
• [Xu & Li 2007] – used boosting with an exponential loss relaxation.

Page 5

Conventional SVMs

• Input examples denoted by x (a high-dimensional point)
• Output targets denoted by y (either +1 or -1)
• SVMs learn a hyperplane w; predictions are sign(w^T x)
• Training involves finding the w which minimizes

$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\xi_i$$

subject to

$$\forall i:\quad y_i\,(w^T x_i) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

• The sum of slacks $\sum_i \xi_i$ upper bounds the accuracy loss (checked numerically below)
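As a quick numerical sanity check (a sketch with made-up data, not from the slides), the smallest feasible slacks ξ_i = max(0, 1 − y_i·w^T x_i) dominate the 0/1 error on every example, so their sum bounds the number of mistakes:

```python
import numpy as np

# Illustrative data: 6 random 3-dimensional examples and a random w.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = np.array([1, -1, 1, 1, -1, -1])
w = rng.normal(size=3)

margins = y * (X @ w)
slacks = np.maximum(0.0, 1.0 - margins)   # smallest feasible slack per example
errors = (margins <= 0).astype(float)     # 0/1 loss (margin <= 0 means error)
assert np.all(slacks >= errors)           # hence sum of slacks >= accuracy loss
```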


Page 7

Adapting to Average Precision

• Let x denote the set of document/query examples for a query
• Let y denote a (weak) ranking (each $y_{ij} \in \{-1, 0, +1\}$)
• Same objective function:

$$\min_{w}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i}\xi_i$$

• Constraints are defined for each incorrect labeling y' over the set of documents x:

$$\forall y' \ne y:\quad w^T\Psi(x, y) \ge w^T\Psi(x, y') + \Delta(y', y) - \xi$$

• The joint discriminant score for the correct labeling must be at least as large as for any incorrect labeling, plus the performance loss.

Page 8

Adapting to Average Precision

• Minimize

$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i}\xi_i$$

subject to

$$\forall y' \ne y:\quad w^T\Psi(x, y) \ge w^T\Psi(x, y') + \Delta(y', y) - \xi$$

where

$$\Psi(x, y') = \sum_{i:\,\mathrm{rel}}\ \sum_{j:\,\lnot\mathrm{rel}} y'_{ij}\,(x_i - x_j)$$

and

$$\Delta(y', y) = 1 - \mathrm{AvgPrec}(y')$$

• Sum of slacks upper bounds the MAP loss.
• After learning w, a prediction is made by sorting on w^T x_i
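The feature map Ψ can be sketched directly from the definition above. This is a minimal illustration: the function name is hypothetical, and the 1/(|rel|·|nonrel|) normalization is an assumption (it does not appear on the slide):

```python
import numpy as np

def joint_feature_map(X_rel, X_non, y_pair):
    """Psi(x, y): sum of y_ij * (x_i - x_j) over relevant/non-relevant pairs.

    y_pair[i, j] is +1 if relevant doc i is ranked above non-relevant doc j,
    and -1 otherwise.
    """
    psi = np.zeros(X_rel.shape[1])
    for i, xi in enumerate(X_rel):
        for j, xj in enumerate(X_non):
            psi += y_pair[i, j] * (xi - xj)
    return psi / (len(X_rel) * len(X_non))   # normalization: an assumption
```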


Page 10

Too Many Constraints!

• For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in front of the non-relevant ones.

• An incorrect labeling would be any other ranking, e.g., one that places a non-relevant document above some relevant documents.

• This ranking has an Average Precision of about 0.8, with Δ(y, y') ≈ 0.2.

• Exponential number of rankings, thus an exponential number of constraints!
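To get a feel for the scale (an illustration, not from the slides): even counting only distinct interleavings, since MAP ignores order within a relevance class, one query already yields an astronomical number of constraints:

```python
from math import comb

# Distinct interleavings of 10 relevant docs among 100 total positions,
# minus the one correct ranking: ~1.7e13 constraints for a single query.
print(comb(100, 10) - 1)
```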

Page 11

Structural SVM Training

• STEP 1: Solve the SVM objective function using only the current working set of constraints.

• STEP 2: Using the model learned in STEP 1, find the most violated constraint from the exponential set of constraints.

• STEP 3: If the constraint returned in STEP 2 is more violated than the most violated constraint in the working set by some small constant ε, add that constraint to the working set.

• Repeat STEP 1-3 until no additional constraints are added. Return the most recent model that was trained in STEP 1.

STEPS 1-3 are guaranteed to loop for at most a polynomial number of iterations. [Tsochantaridis et al. 2005]
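A minimal sketch of this cutting-plane loop, assuming hypothetical subroutines solve_qp (STEP 1), find_most_violated (STEP 2), and delta/score for the loss and joint discriminant:

```python
def cutting_plane(examples, eps=1e-3):
    """Sketch of STEPs 1-3; all helper functions are hypothetical."""
    working_set = []                          # constraints found so far
    w, slack = solve_qp(working_set)          # STEP 1 (trivial at first)
    while True:
        added = False
        for i, (x, y) in enumerate(examples):
            y_bad = find_most_violated(w, x, y)                    # STEP 2
            violation = delta(y, y_bad) + score(w, x, y_bad) - score(w, x, y)
            if violation > slack[i] + eps:                         # STEP 3
                working_set.append((i, y_bad))
                added = True
        if not added:                         # no new constraint: done
            return w
        w, slack = solve_qp(working_set)      # re-solve with the larger set
```

Each pass either adds a constraint violated by more than ε or terminates, which is what bounds the number of iterations polynomially.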

Page 12

Illustrative Example

Original SVM Problem
• Exponential constraints
• Most are dominated by a small set of "important" constraints

Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.


Page 16

Finding Most Violated Constraint

• Structural SVM is an oracle framework.

• Requires a subroutine to find the most violated constraint.
  – Dependent on the formulation of the loss function and the joint feature representation.

• Exponential number of constraints!

• Efficient algorithm in the case of optimizing MAP.

Page 17

Finding Most Violated Constraint

Observation
• MAP is invariant to the order of documents within a relevance class
  – Swapping two relevant (or two non-relevant) documents does not change MAP.
• The joint SVM score is optimized by sorting by document score, w^T x
• Reduces to finding an interleaving between two sorted lists of documents

The most violated constraint maximizes

$$H(y'; w) = \Delta(y', y) + \sum_{i:\,\mathrm{rel}}\ \sum_{j:\,\lnot\mathrm{rel}} y'_{ij}\,\big(w^T x_i - w^T x_j\big)$$
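A brute-force check of the sorting claim (an illustration with made-up scores, feasible only for tiny document sets): among all rankings, the pairwise term is maximized by one that sorts documents by score:

```python
import itertools, random

random.seed(1)
s = [random.gauss(0, 1) for _ in range(5)]    # document scores w^T x
rel = [True, True, False, False, False]       # relevance labels
docs = list(range(5))

def pair_score(order):
    # sum over relevant i, non-relevant j of y_ij * (s_i - s_j),
    # where y_ij = +1 iff doc i is ranked above doc j
    total = 0.0
    for a, i in enumerate(order):
        for b, j in enumerate(order):
            if rel[i] and not rel[j]:
                total += (1 if a < b else -1) * (s[i] - s[j])
    return total

best = max(itertools.permutations(docs), key=pair_score)
by_score = sorted(docs, key=lambda d: -s[d])
assert abs(pair_score(best) - pair_score(tuple(by_score))) < 1e-9
```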

Page 18

Finding Most Violated Constraint

$$H(y'; w) = \Delta(y', y) + \sum_{i:\,\mathrm{rel}}\ \sum_{j:\,\lnot\mathrm{rel}} y'_{ij}\,\big(w^T x_i - w^T x_j\big)$$

• Start with the perfect ranking
• Consider swapping adjacent relevant/non-relevant documents
• Find the best feasible ranking of the non-relevant document
• Repeat for the next non-relevant document
• Never want to swap past the previous non-relevant document
• Repeat until all non-relevant documents have been considered (see the brute-force sketch below)
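For intuition, here is a brute-force version of this search (illustrative only, with hypothetical names; the greedy procedure above finds the same argmax in O(|rel|·|nonrel|) time instead of enumerating interleavings):

```python
from itertools import combinations

def avg_prec(rel_positions, n_rel):
    # rel_positions: ascending 0-based ranks of the relevant documents
    return sum((k + 1) / (pos + 1) for k, pos in enumerate(rel_positions)) / n_rel

def most_violated(rel_scores, nonrel_scores):
    """Brute-force argmax of H(y') = Delta(y', y) + sum_ij y'_ij (s_i - s_j)."""
    rel = sorted(rel_scores, reverse=True)   # within a class, keep docs sorted
    non = sorted(nonrel_scores, reverse=True)
    n = len(rel) + len(non)
    best_h, best_pos = float("-inf"), None
    for rel_pos in combinations(range(n), len(rel)):   # ascending positions
        delta = 1.0 - avg_prec(rel_pos, len(rel))      # Delta(y', y)
        non_pos = [p for p in range(n) if p not in rel_pos]
        # y'_ij = +1 if relevant doc i is ranked above non-relevant doc j
        pair = sum((1 if pi < pj else -1) * (si - sj)
                   for pi, si in zip(rel_pos, rel)
                   for pj, sj in zip(non_pos, non))
        if delta + pair > best_h:
            best_h, best_pos = delta + pair, rel_pos
    return best_h, best_pos

# e.g., two relevant docs scoring (2.0, 1.0), two non-relevant (1.5, 0.5)
print(most_violated([2.0, 1.0], [1.5, 0.5]))
```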

Page 23

Quick Recap

SVM Formulation
• SVMs optimize a tradeoff between model complexity and MAP loss
• Exponential number of constraints (one for each incorrect ranking)
• Structural SVMs find a small subset of important constraints
• Requires a sub-procedure to find the most violated constraint

Find Most Violated Constraint
• Loss function is invariant to re-ordering of relevant documents
• SVM score imposes an ordering of the relevant documents
• Reduces to finding an interleaving of two sorted lists
• Loss function has certain monotonic properties
• Efficient algorithm

Page 24

Experiments

• Used TREC 9 & 10 Web Track corpus.

• Features of document/query pairs computed from the outputs of existing retrieval functions (Indri retrieval functions & TREC submissions).

• Goal is to learn a combination of these outputs which improves mean average precision.
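One simple way such features might be assembled (a hypothetical sketch; the slides do not specify the exact construction) is to stack each base retrieval function's score for a document/query pair into a vector:

```python
import numpy as np

def make_features(base_scores):
    """base_scores: run name -> score assigned to one document/query pair."""
    runs = sorted(base_scores)               # fix a consistent feature order
    return np.array([base_scores[r] for r in runs])

# e.g., scores from two Indri runs and one TREC submission (made-up names)
x = make_features({"indri_lm": 0.7, "indri_okapi": 1.3, "trec_run_A": 2.1})
```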

Page 25

Comparison with Best Base methods

[Bar chart: Mean Average Precision (y-axis, 0.10-0.30) on six datasets (TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, TREC 9 Submissions without best, TREC 10 Submissions without best), comparing SVM-MAP against Base 1, Base 2, and Base 3.]

Page 26

Comparison with other SVM methods

[Bar chart: Mean Average Precision (y-axis, 0.10-0.30) on the same six datasets, comparing SVM-MAP against SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3, and SVM-ACC4.]

Page 27

Moving Forward

• Approach also works (in theory) for other measures.

• Some promising results when optimizing for NDCG (with only 1 level of relevance).

• Currently working on optimizing for NDCG with multiple levels of relevance.

• Preliminary MRR results not as promising.

Page 28

Conclusions

• Principled approach to optimizing average precision (avoids difficult-to-control heuristics).

• Performs at least as well as alternative SVM methods.

• Can be generalized to a large class of rank-based performance measures.

• Software available at http://svmrank.yisongyue.com