Upload
fran
View
38
Download
0
Tags:
Embed Size (px)
DESCRIPTION
The K -armed Dueling Bandits Problem. COLT 2009 Cornell University. Multi-armed Bandits. K bandits (arms / actions / strategies) Each time step, algorithm chooses bandit based on prior actions and feedback. Only observe feedback from actions taken. Partial Feedback Online Learning Problem. - PowerPoint PPT Presentation
Citation preview
The K-armed Dueling Bandits Problem
COLT 2009
Cornell University
YisongYue
JosefBroder
RobertKleinberg
ThorstenJoachims
Multi-armed Bandits
• K bandits (arms / actions / strategies)
• Each time step, algorithm chooses bandit based on prior actions and feedback.
• Only observe feedback from actions taken.
• Partial Feedback Online Learning Problem
Optimizing Retrieval Functions
• Interactively learn the best retrieval function
• Search users provide implicit feedback – E.g., what they click on
• Seems like a natural bandit problem– Each retrieval function is a bandit– Feedback is clicks– Find the best w.r.t. clickthrough rate– Assumes clicks → explicit absolute feedback!
What Results do Users View/Click?
[Joachims et al., TOIS 2007]
Team-Game Interleaving
1. Kernel Machines http://svm.first.gmd.de/
2. Support Vector Machinehttp://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machineshttp://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES ...http://www.jiscmail.ac.uk/lists/SUPPORT...
5. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/
1. Kernel Machines http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine http://ais.gmd.de/~thorsten/svm light/
3. Support Vector Machine and Kernel ... Referenceshttp://svm.research.bell-labs.com/SVMrefs.html
4. Lucent Technologies: SVM demo applet http://svm.research.bell-labs.com/SVT/SVMsvt.html
5. Royal Holloway Support Vector Machine http://svm.dcs.rhbnc.ac.uk
1. Kernel Machines T2http://svm.first.gmd.de/
2. Support Vector Machine T1http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine T2http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines T1http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References T2http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ... T1http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet T2http://svm.research.bell-labs.com/SVT/SVMsvt.html
f1(u,q) r1 f2(u,q) r2
Interleaving(r1,r2)
(u=thorsten, q=“svm”)
Interpretation: (r1 > r2) ↔ clicks(T1) > clicks(T2)
•Mix results of f1 and f2
•Relative feedback•More reliable
[Radlinski, Kurup, Joachims; CIKM 2008]
Dueling Bandits Problem
• Given K bandits b1, …, bK
• Each iteration: compare (duel) two bandits– E.g., interleaving two retrieval functions
• Comparison is noisy– Each comparison result independent
– Comparison probabilities initially unknown
– Comparison probabilities fixed over time
• Total preference ordering, initially unknown
Dueling Bandits Problem
• Want to find best (or good) bandit– Similar to finding the max w/ noisy comparisons
– Ours is a regret minimization setting
• Choose pair (bt, bt’) to minimize regret:
• (% users who prefer best bandit over chosen ones)
T
tttT bbPbbPR
1
1)'*()*(
Related Work
• Existing MAB settings– [Lai & Robbins, ’85] [Auer et al., ’02]– PAC setting [Even-Dar et al., ’06]
• Partial monitoring problem– [Cesa-Bianchi et al., ’06]
• Computing with noisy comparisons– Finding max [Feige et al., ’97]– Binary search [Karp & Kleinberg, ’07] [Ben-Or & Hassidim, ’08]
• Continuous-armed Dueling Bandits Problem– [Yue & Joachims, ’09]
Assumptions
• P(bi > bj) = ½ + εij (distinguishability)
• Strong Stochastic Transitivity– For three bandits bi > bj > bk :
– Monotonicity property
• Stochastic Triangle Inequality– For three bandits bi > bj > bk :
– Diminishing returns property
• Satisfied by many standard models– E.g., Logistic/Bradley-Terry
jkijik , max
jkijik
Naïve Approach
• In deterministic case, O(K) comparisons to find max
• Extend to noisy case:– Repeatedly compare until confident one is better
• Problem: comparing two awful (but similar) bandits– Waste comparisons to see which awful bandit is better
– Incur high regret for each comparison
– Also applies to elimination tournaments
TK
ORT log2
Interleaved Filter
• Choose candidate bandit at random ►
Interleaved Filter
• Choose candidate bandit at random
• Make noisy comparisons (Bernoulli trial)
against all other bandits simultaneously– Maintain mean and confidence interval
for each pair of bandits being compared
►
Interleaved Filter
• Choose candidate bandit at random
• Make noisy comparisons (Bernoulli trial)
against all other bandits simultaneously– Maintain mean and confidence interval
for each pair of bandits being compared
• …until another bandit is better – With confidence 1 – δ
►
Interleaved Filter
• Choose candidate bandit at random
• Make noisy comparisons (Bernoulli trial)
against all other bandits simultaneously– Maintain mean and confidence interval
for each pair of bandits being compared
• …until another bandit is better – With confidence 1 – δ
• Repeat process with new candidate– Remove all empirically worse bandits
►
Interleaved Filter
• Choose candidate bandit at random
• Make noisy comparisons (Bernoulli trial)
against all other bandits simultaneously– Maintain mean and confidence interval
for each pair of bandits being compared
• …until another bandit is better – With confidence 1 – δ
• Repeat process with new candidate– Remove all empirically worse bandits
• Continue until 1 candidate left►
Regret Analysis
• Define a round to be all the time steps for
a particular candidate bandit– Halts round when 1 - δ confidence met– Treat candidate bandits as random walk– Takes log K rounds to reach best bandit
• Define a match to be all the comparisons
between two bandits in a round– O(K) total matches in expectation– Constant fraction of bandits removed
each round
Regret Analysis
• O(K) total matches
• Each match incurs regret– Depends on δ = K-2T-1
• Finds best bandit w.p. 1-1/T
• Expected regret:
TO log1
TK
ORE
TOT
TK
OT
RE
T
T
log
)(1
log1
1
Lower Bound Example
• Order bandits b1 > b2 > … > bK
– P(b > b’) = ½ + ε
• Each match takes comparisons
• Pay Θ(ε) regret for each comparison
• Accumulated regret over all matches is at least
Tlog
12
TK
log
Moving Forward
• Extensions– Drifting user interests
• Changing user population / document collection
– Contextualization / side information• Cost-sensitive active learning w/ relative feedback
• Live user studies– Interactive experiments on search service– Sheds insight; guides future design
Extra Slides
Per-Match Regret
• Number of comparisons in match bi vs bj :
– ε1i > εij : round ends before concluding bi > bj
– ε1i < εij : conclude bi > bj before round ends, remove bj
• Pay ε1i + ε1j regret for each comparison
– By triangle inequality ε1i + ε1j ≤ 2*max{ε1i , εij}
– Thus by stochastic transitivity accumulated regret is
TO
iji
log, max
122
1
TOTO
iji
log1
log, max
1
1
Number of Rounds
• Assume all superior bandits have
equal prob of defeating candidate– Worst case scenario under transitivity
• Model this as a random walk– rj transitions to each ri (i < j) with equal probability
– Compute total number of steps before reaching r1 (i.e., r*)
• Can show O(log K) w.h.p. using Chernoff bound
Total Matches Played
• O(K) matches played in each round– Naïve analysis yields O(K log K) total
• However, all empirically worse bandits
are also removed at the end of each round– Will not participate in future rounds
• Assume worst case that inferior bandits have ½ chance of
being empirically worse– Can show w.h.p. that O(K) total matches are played
over O(log N) rounds
Removing Inferior Bandits
• At conclusion of each round– Remove any empirically worse bandits
• Intuition:– High confidence that winner is better
than incumbent candidate
– Empirically worse bandits cannot be “much better” than
incumbent candidate
– Can show via Hoeffding bound that winner is also better than empirically worse bandits with high confidence
– Preserves 1-1/T confidence overall that we’ll find the best bandit