The K-armed Dueling Bandits Problem

COLT 2009

Cornell University

Yisong Yue

Josef Broder

Robert Kleinberg

Thorsten Joachims

Multi-armed Bandits

• K bandits (arms / actions / strategies)

• Each time step, algorithm chooses bandit based on prior actions and feedback.

• Only observe feedback from actions taken.

• Partial Feedback Online Learning Problem

Optimizing Retrieval Functions

• Interactively learn the best retrieval function

• Search users provide implicit feedback
– E.g., what they click on

• Seems like a natural bandit problem
– Each retrieval function is a bandit
– Feedback is clicks
– Find the best w.r.t. clickthrough rate
– Assumes clicks → explicit absolute feedback!

What Results do Users View/Click?

[Joachims et al., TOIS 2007]

Team-Game Interleaving

(u=thorsten, q=“svm”)

f1(u,q) → r1

1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machine http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES ... http://www.jiscmail.ac.uk/lists/SUPPORT...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/

f2(u,q) → r2

1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm_light/
3. Support Vector Machine and Kernel ... References http://svm.research.bell-labs.com/SVMrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk

Interleaving(r1,r2)

1. Kernel Machines [T2] http://svm.first.gmd.de/
2. Support Vector Machine [T1] http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine [T2] http://ais.gmd.de/~thorsten/svm_light/
4. An Introduction to Support Vector Machines [T1] http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References [T2] http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ... [T1] http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet [T2] http://svm.research.bell-labs.com/SVT/SVMsvt.html

Interpretation: (r1 > r2) ↔ clicks(T1) > clicks(T2)

• Mix results of f1 and f2

• Relative feedback

• More reliable

[Radlinski, Kurup, Joachims; CIKM 2008]
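The interleaving step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `team_interleave` and `winner` are hypothetical, and the rule that a fair coin decides which team picks first in each round is an assumption modelled on the team-draft idea from the cited CIKM 2008 paper.

```python
import random


def team_interleave(r1, r2, length, rng=random):
    """Merge two rankings; return (merged, team), where team[i] is 'T1' or 'T2'."""
    merged, team, seen = [], [], set()

    def next_unseen(ranking):
        # Highest-ranked result of this ranking that is not yet in the merged list.
        for doc in ranking:
            if doc not in seen:
                return doc
        return None

    while len(merged) < length:
        # Assumption: a fair coin decides which team picks first in this round.
        order = [("T1", r1), ("T2", r2)]
        if rng.random() >= 0.5:
            order.reverse()
        progressed = False
        for name, ranking in order:
            doc = next_unseen(ranking)
            if doc is not None and len(merged) < length:
                merged.append(doc)
                team.append(name)
                seen.add(doc)
                progressed = True
        if not progressed:
            break  # both rankings are exhausted
    return merged, team


def winner(team, clicked_positions):
    """Interpret clicks: the ranking whose team received more clicks wins the duel."""
    c1 = sum(1 for i in clicked_positions if team[i] == "T1")
    c2 = sum(1 for i in clicked_positions if team[i] == "T2")
    return "r1" if c1 > c2 else "r2" if c2 > c1 else "tie"
```

If T2 happens to pick first in every round, this sketch reproduces the T2/T1/T2/... pattern of the interleaved list on the slide.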

Dueling Bandits Problem

• Given K bandits b1, …, bK

• Each iteration: compare (duel) two bandits
– E.g., interleaving two retrieval functions

• Comparison is noisy
– Each comparison result independent

– Comparison probabilities initially unknown

– Comparison probabilities fixed over time

• Total preference ordering, initially unknown

Dueling Bandits Problem

• Want to find best (or good) bandit
– Similar to finding the max w/ noisy comparisons

– Ours is a regret minimization setting

• Choose pair (bt, bt’) to minimize regret:

$$R_T = \sum_{t=1}^{T} \left[ P(b^* > b_t) + P(b^* > b_t') - 1 \right]$$

• (% users who prefer best bandit over chosen ones)
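As a worked illustration of the regret definition, each duel contributes P(b* > bt) + P(b* > bt’) − 1, which is zero only when the best bandit is dueled against itself. The sketch below assumes a hypothetical preference matrix `P` with `P[i][j]` standing for P(bi > bj); the function name and the example numbers are made up for illustration.

```python
def dueling_regret(P, best, duels):
    """Total regret of a sequence of duels.

    P[i][j] is the probability that bandit i beats bandit j,
    best is the index of the best bandit b*,
    duels is a list of (b_t, b_t') index pairs.
    """
    return sum(P[best][b] + P[best][b2] - 1.0 for b, b2 in duels)


# Hypothetical 3-bandit example, indexed from best (0) to worst (2).
P = [
    [0.5, 0.6, 0.8],
    [0.4, 0.5, 0.7],
    [0.2, 0.3, 0.5],
]
no_regret = dueling_regret(P, 0, [(0, 0)])   # dueling b* against itself
some_regret = dueling_regret(P, 0, [(1, 2)])  # dueling two inferior bandits
```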

Related Work

• Existing MAB settings
– [Lai & Robbins, ’85] [Auer et al., ’02]
– PAC setting [Even-Dar et al., ’06]

• Partial monitoring problem
– [Cesa-Bianchi et al., ’06]

• Computing with noisy comparisons
– Finding max [Feige et al., ’97]
– Binary search [Karp & Kleinberg, ’07] [Ben-Or & Hassidim, ’08]

• Continuous-armed Dueling Bandits Problem
– [Yue & Joachims, ’09]

Assumptions

• P(bi > bj) = ½ + εij (distinguishability)

• Strong Stochastic Transitivity
– For three bandits bi > bj > bk : $\epsilon_{ik} \ge \max\{\epsilon_{ij}, \epsilon_{jk}\}$
– Monotonicity property

• Stochastic Triangle Inequality
– For three bandits bi > bj > bk : $\epsilon_{ik} \le \epsilon_{ij} + \epsilon_{jk}$
– Diminishing returns property

• Satisfied by many standard models
– E.g., Logistic/Bradley-Terry
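The two assumptions can be checked numerically for a given preference matrix. The sketch below builds a logistic (Bradley-Terry style) matrix from hypothetical utility scores and tests both inequalities on every triple bi > bj > bk; the function names, the scores, and the tolerance are illustrative assumptions, and bandits are assumed to be indexed from best to worst.

```python
import math
from itertools import combinations


def logistic_prefs(scores):
    """P[i][j] = 1 / (1 + exp(-(s_i - s_j))), a logistic preference model."""
    return [[1.0 / (1.0 + math.exp(-(si - sj))) for sj in scores] for si in scores]


def check_assumptions(P, tol=1e-12):
    """Return (sst_holds, sti_holds) for bandits indexed best to worst."""
    K = len(P)
    eps = [[P[i][j] - 0.5 for j in range(K)] for i in range(K)]
    sst = sti = True
    for i, j, k in combinations(range(K), 3):  # i < j < k, i.e. b_i > b_j > b_k
        if eps[i][k] < max(eps[i][j], eps[j][k]) - tol:
            sst = False  # strong stochastic transitivity violated
        if eps[i][k] > eps[i][j] + eps[j][k] + tol:
            sti = False  # stochastic triangle inequality violated
    return sst, sti


# Hypothetical scores, sorted from best to worst.
logistic_result = check_assumptions(logistic_prefs([2.0, 1.0, 0.5, 0.0]))

# A hand-made matrix that keeps transitivity but breaks the triangle inequality.
bad_P = [
    [0.5, 0.6, 0.9],
    [0.4, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
bad_result = check_assumptions(bad_P)
```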

Naïve Approach

• In deterministic case, O(K) comparisons to find max

• Extend to noisy case:
– Repeatedly compare until confident one is better

• Problem: comparing two awful (but similar) bandits
– Waste comparisons to see which awful bandit is better

– Incur high regret for each comparison

– Also applies to elimination tournaments

$$R_T = O\left(\frac{K}{\epsilon^2} \log T\right)$$

Interleaved Filter

• Choose candidate bandit at random

• Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously
– Maintain mean and confidence interval for each pair of bandits being compared

• …until another bandit is better
– With confidence 1 – δ

• Repeat process with new candidate
– Remove all empirically worse bandits

• Continue until 1 candidate left
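The steps above can be sketched as a short simulation. This is a hedged illustration rather than the authors' code: the function name `interleaved_filter`, the `duel(i, j)` callback, the Hoeffding-style confidence radius, and the rule for picking among several confident challengers are all assumptions. Only δ = 1/(T K²) is taken from the slides.

```python
import math
import random


def interleaved_filter(K, duel, T, rng=random):
    """Return (best_guess, comparisons_used).

    duel(i, j) returns True if bandit i wins one noisy comparison against j.
    """
    delta = 1.0 / (T * K * K)            # delta = K^-2 T^-1
    remaining = list(range(K))
    cand = remaining.pop(rng.randrange(K))  # choose candidate at random
    used = 0
    while remaining and used < T:
        wins = {b: 0 for b in remaining}  # wins of the candidate over b
        n = 0                             # comparisons per pair in this round
        challenger = None
        while challenger is None and remaining and used < T:
            for b in remaining:           # one sweep against all other bandits
                wins[b] += duel(cand, b)
                used += 1
            n += 1
            # Assumed confidence radius (Hoeffding bound at level delta).
            radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
            # Drop bandits the candidate beats with confidence 1 - delta.
            remaining = [b for b in remaining if wins[b] / n - radius <= 0.5]
            # The round ends once some bandit beats the candidate with confidence.
            better = [b for b in remaining if wins[b] / n + radius < 0.5]
            if better:
                challenger = min(better, key=lambda b: wins[b] / n)
        if challenger is None:
            break  # nothing left to compare, or comparison budget exhausted
        # Remove all empirically worse bandits, then repeat with the new candidate.
        remaining = [b for b in remaining
                     if b != challenger and wins[b] / n <= 0.5]
        cand = challenger
    return cand, used
```

A possible usage, with a made-up environment where every better bandit wins a duel with probability 0.8:

```python
env = random.Random(0)
best, used = interleaved_filter(
    5, lambda i, j: env.random() < (0.5 if i == j else 0.8 if i < j else 0.2),
    T=100000, rng=env)
```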

Regret Analysis

• Define a round to be all the time steps for a particular candidate bandit
– Halts round when 1 - δ confidence met
– Treat candidate bandits as random walk
– Takes log K rounds to reach best bandit

• Define a match to be all the comparisons between two bandits in a round
– O(K) total matches in expectation
– Constant fraction of bandits removed each round

Regret Analysis

• O(K) total matches

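• Each match incurs regret $O\left(\frac{1}{\epsilon} \log T\right)$ (ε: preference gap between the best and second-best bandit)
– Depends on $\delta = K^{-2} T^{-1}$

• Finds best bandit w.p. 1-1/T

• Expected regret:

$$E[R_T] \le \left(1 - \frac{1}{T}\right) O\left(\frac{K}{\epsilon} \log T\right) + \frac{1}{T}\, O(T) = O\left(\frac{K}{\epsilon} \log T\right)$$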

Lower Bound Example

• Order bandits b1 > b2 > … > bK

– P(b > b’) = ½ + ε

• Each match takes $\Omega\left(\frac{1}{\epsilon^2} \log T\right)$ comparisons

• Pay Θ(ε) regret for each comparison

• Accumulated regret over all matches is at least $\Omega\left(\frac{K}{\epsilon} \log T\right)$
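One way to read this bound is as a product of three factors. The line below is a sketch of that arithmetic, assuming on the order of K matches have to be played; the Ω and Θ symbols are reconstructed from context and should be read as indicative.

```latex
\underbrace{\Omega(K)}_{\text{matches}}
\;\times\;
\underbrace{\Omega\!\left(\frac{1}{\epsilon^{2}} \log T\right)}_{\text{comparisons per match}}
\;\times\;
\underbrace{\Theta(\epsilon)}_{\text{regret per comparison}}
\;=\;
\Omega\!\left(\frac{K}{\epsilon} \log T\right)
```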

Moving Forward

• Extensions
– Drifting user interests
  • Changing user population / document collection
– Contextualization / side information
  • Cost-sensitive active learning w/ relative feedback

• Live user studies
– Interactive experiments on search service
– Sheds insight; guides future design

Extra Slides

Per-Match Regret

• Number of comparisons in match bi vs bj : $O\left(\frac{1}{\max\{\epsilon_{1i}^2,\, \epsilon_{ij}^2\}} \log T\right)$
– ε1i > εij : round ends before concluding bi > bj
– ε1i < εij : conclude bi > bj before round ends, remove bj

• Pay ε1i + ε1j regret for each comparison
– By triangle inequality ε1i + ε1j ≤ 2*max{ε1i , εij}
– Thus by stochastic transitivity accumulated regret is

$$O\left(\frac{1}{\max\{\epsilon_{1i},\, \epsilon_{ij}\}} \log T\right) = O\left(\frac{1}{\epsilon} \log T\right)$$

Number of Rounds

• Assume all superior bandits have equal prob of defeating candidate
– Worst case scenario under transitivity

• Model this as a random walk
– rj transitions to each ri (i < j) with equal probability

– Compute total number of steps before reaching r1 (i.e., r*)

• Can show O(log K) w.h.p. using Chernoff bound
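The random-walk model is easy to simulate and to solve exactly. In the sketch below (function names are illustrative), rank 1 is the best bandit and a candidate at rank j jumps to a uniformly random rank i < j. The exact expectation follows the recurrence E[j] = 1 + (E[1] + … + E[j−1]) / (j − 1), which grows logarithmically in the starting rank.

```python
import random


def rounds_to_best(start, rng=random):
    """Simulate one walk: from rank j, jump to a uniform rank in 1..j-1."""
    rank, steps = start, 0
    while rank > 1:
        rank = rng.randint(1, rank - 1)  # randint is inclusive on both ends
        steps += 1
    return steps


def expected_rounds(start):
    """Exact expected number of steps from rank `start` down to rank 1."""
    E = [0.0] * (start + 1)   # E[j] = expected steps from rank j; E[1] = 0
    prefix = 0.0              # running sum E[1] + ... + E[j-1]
    for j in range(2, start + 1):
        prefix += E[j - 1]
        E[j] = 1.0 + prefix / (j - 1)
    return E[start]
```

For example, starting from the worst of 1024 bandits, `expected_rounds(1024)` is between 7 and 8, compared with the 1023 steps a one-rank-at-a-time walk would need.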

Total Matches Played

• O(K) matches played in each round
– Naïve analysis yields O(K log K) total

• However, all empirically worse bandits are also removed at the end of each round
– Will not participate in future rounds

• Assume worst case that inferior bandits have ½ chance of being empirically worse
– Can show w.h.p. that O(K) total matches are played over O(log K) rounds

Removing Inferior Bandits

• At conclusion of each round
– Remove any empirically worse bandits

• Intuition:
– High confidence that winner is better than incumbent candidate

– Empirically worse bandits cannot be “much better” than incumbent candidate

– Can show via Hoeffding bound that winner is also better than empirically worse bandits with high confidence

– Preserves 1-1/T confidence overall that we’ll find the best bandit
