Bandit Algorithms – Tor Lattimore & Csaba Szepesvári

Page 1:

Bandit Algorithms
Tor Lattimore & Csaba Szepesvári

Page 2:

Bandits

Time:      1 2 3 4 5 6 7 8 9 10 11 12
Left arm:  $1 $0 $1 $1 $0
Right arm: $1 $0

Five rounds to go. Which arm would you play next?

Page 3:

Overview

• What are bandits, and why you should care
• Finite-armed stochastic bandits
• Finite-armed adversarial bandits

Page 4:

What's in a name? A tiny bit of history

First bandit algorithm proposed by Thompson (1933).

Bush and Mosteller (1953) were interested in how mice behaved in a T-maze.

Page 5:

Why care about bandits?

1. Many applications
2. They isolate an important component of reinforcement learning: exploration vs. exploitation
3. Rich and beautiful (we think) mathematically

Page 6:

Applications

• Clinical trials / dose discovery
• Recommendation systems (movies/news/etc.)
• Advert placement
• A/B testing
• Network routing
• Dynamic pricing (e.g., for Amazon products)
• Waiting problems (when to auto-logout your computer)
• Ranking (e.g., for search)
• A component of game-playing algorithms (MCTS)
• Resource allocation
• A way of isolating one interesting part of reinforcement learning

Lots for you to do!

Page 8:

Finite-armed bandits

• $K$ actions
• $n$ rounds
• In each round $t$ the learner chooses an action $A_t \in \{1, 2, \dots, K\}$
• Observes reward $X_t \sim P_{A_t}$, where $P_1, P_2, \dots, P_K$ are unknown distributions
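To make this setup concrete, here is a minimal environment sketch in Python (not part of the slides; the class name GaussianBandit and its pull method are illustrative), assuming unit-variance Gaussian rewards as in the rest of the tutorial.

import numpy as np

class GaussianBandit:
    """K-armed bandit environment with unit-variance Gaussian rewards."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)   # unknown to the learner
        self.rng = rng if rng is not None else np.random.default_rng()

    @property
    def K(self):
        return len(self.means)

    def pull(self, arm):
        # Reward X_t ~ N(mu_arm, 1), as assumed throughout the tutorial
        return self.rng.normal(self.means[arm], 1.0)

The policies sketched later in these notes interact with the environment only through pull, which is exactly the bandit feedback model above.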

Page 9:

Distributional assumptions

While $P_1, P_2, \dots, P_K$ are not known in advance, we make some assumptions:

• $P_i$ is Bernoulli with unknown bias $\mu_i \in [0, 1]$
• $P_i$ is Gaussian with unit variance and unknown mean $\mu_i \in \mathbb{R}$
• $P_i$ is subgaussian
• $P_i$ is supported on $[0, 1]$
• $P_i$ has variance less than one
• ...

As usual, stronger assumptions lead to stronger bounds.

This tutorial: all reward distributions are Gaussian (or subgaussian) with unit variance.

Page 10:

Example: A/B testing

• Business wants to optimize their webpage
• Actions correspond to 'A' and 'B'
• Users arrive at the webpage sequentially
• Algorithm chooses either 'A' or 'B'
• Receives activity feedback (the reward)

Page 11:

Measuring performance – the regret

• Let $\mu_i$ be the mean reward of distribution $P_i$
• $\mu^* = \max_i \mu_i$ is the maximum mean
• The regret is
$$R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right]$$
• Policies for which the regret is sublinear are learning
• Of course we would like to make it as 'small as possible'

Page 12:

Measuring performance – the regret

Let $\Delta_i = \mu^* - \mu_i$ be the suboptimality gap for the $i$th arm and $T_i(n)$ be the number of times arm $i$ is played over all $n$ rounds.

Lemma $R_n = \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]$

Proof Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, X_1, \dots, X_{t-1}, A_t]$. Then
$$
R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right]
    = n\mu^* - \sum_{t=1}^n \mathbb{E}\big[\mathbb{E}_t[X_t]\big]
    = n\mu^* - \sum_{t=1}^n \mathbb{E}[\mu_{A_t}]
    = \sum_{t=1}^n \mathbb{E}[\Delta_{A_t}]
$$
$$
= \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right]
= \mathbb{E}\left[\sum_{t=1}^n \sum_{i=1}^K \mathbf{1}(A_t = i)\,\Delta_i\right]
= \mathbb{E}\left[\sum_{i=1}^K \Delta_i \sum_{t=1}^n \mathbf{1}(A_t = i)\right]
= \mathbb{E}\left[\sum_{i=1}^K \Delta_i T_i(n)\right]
= \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]
$$

Page 16:

A simple policy: Explore-Then-Commit

1. Choose each action m times
2. Find the empirically best action $I \in \{1, 2, \dots, K\}$
3. Choose $A_t = I$ for all remaining rounds

In order to analyse this policy we need to bound the probability of committing to a suboptimal action.
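A minimal sketch of this policy (illustrative, not the authors' code), reusing the hypothetical GaussianBandit environment from the earlier sketch.

import numpy as np

def explore_then_commit(bandit, n, m):
    """Explore-Then-Commit: play each of the K arms m times, then commit to
    the empirically best arm for the remaining rounds.
    Returns the total reward collected (assumes n >= m * bandit.K)."""
    K = bandit.K
    sums = np.zeros(K)
    total = 0.0
    # 1. Exploration: choose each action m times
    for arm in range(K):
        for _ in range(m):
            x = bandit.pull(arm)
            sums[arm] += x
            total += x
    # 2. Find the empirically best action
    best = int(np.argmax(sums / m))
    # 3. Commit for all remaining rounds
    for _ in range(n - m * K):
        total += bandit.pull(best)
    return total

For example, explore_then_commit(GaussianBandit([0.5, 0.0]), n=1000, m=50) runs the two-armed case analysed below.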

Page 18:

A crash course in concentration

Let $Z, Z_1, Z_2, \dots, Z_n$ be a sequence of independent and identically distributed random variables with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 < \infty$.

Empirical mean: $\hat\mu_n = \frac{1}{n}\sum_{t=1}^n Z_t$

How close is $\hat\mu_n$ to $\mu$?

Classical statistics says:
1. (law of large numbers) $\lim_{n\to\infty} \hat\mu_n = \mu$ almost surely
2. (central limit theorem) $\sqrt{n}(\hat\mu_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$
3. (Chebyshev's inequality) $P(|\hat\mu_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}$

We need something nonasymptotic and stronger than Chebyshev's. Not possible without assumptions.

Page 20:

A crash course in concentration

Random variable $Z$ is $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$,
$$M_Z(\lambda) \doteq \mathbb{E}[\exp(\lambda Z)] \le \exp\left(\lambda^2\sigma^2/2\right)$$

Lemma If $Z, Z_1, \dots, Z_n$ are independent and $\sigma$-subgaussian, then
• $aZ$ is $|a|\sigma$-subgaussian for any $a \in \mathbb{R}$
• $\sum_{t=1}^n Z_t$ is $\sqrt{n}\sigma$-subgaussian
• $\hat\mu_n$ is $n^{-1/2}\sigma$-subgaussian

Page 21:

A crash course in concentration

Theorem If $Z_1, \dots, Z_n$ are independent and $\sigma$-subgaussian, then
$$P\left(\hat\mu_n \ge \sqrt{\frac{2\sigma^2\log(1/\delta)}{n}}\right) \le \delta$$

Proof We use Chernoff's method. Let $\varepsilon > 0$ and $\lambda = \varepsilon n/\sigma^2$. Then
$$
P(\hat\mu_n \ge \varepsilon) = P\big(\exp(\lambda\hat\mu_n) \ge \exp(\lambda\varepsilon)\big)
\le \mathbb{E}[\exp(\lambda\hat\mu_n)]\exp(-\lambda\varepsilon) \quad \text{(Markov's)}
\le \exp\left(\sigma^2\lambda^2/(2n) - \lambda\varepsilon\right)
= \exp\left(-n\varepsilon^2/(2\sigma^2)\right)
$$

Page 25:

A crash course in concentration

• Which distributions are $\sigma$-subgaussian? Gaussian, Bernoulli, bounded support.
• And not: exponential, power law
• Comparing Chebyshev's with the subgaussian bound:
  Chebyshev's: $\sqrt{\frac{\sigma^2}{n\delta}}$   Subgaussian: $\sqrt{\frac{2\sigma^2\log(1/\delta)}{n}}$
• Typically $\delta \ll 1/n$ in our use-cases

The results that follow hold when the distribution associated with each arm is 1-subgaussian.

Page 26:

Analysing Explore-Then-Commit

• Standard convention Assume $\mu_1 \ge \mu_2 \ge \dots \ge \mu_K$
• Algorithms are symmetric and do not exploit this fact
• Means that the first arm is optimal
• Remember, Explore-Then-Commit chooses each arm m times
• Then commits to the arm with the largest payoff
• We consider only K = 2

Page 28:

Analysing Explore-Then-Commit

Step 1 Let $\hat\mu_i$ be the average reward of arm $i$ after exploring. The algorithm commits to the wrong arm if
$$\hat\mu_2 \ge \hat\mu_1 \iff \hat\mu_2 - \mu_2 + \mu_1 - \hat\mu_1 \ge \Delta$$

Observation $\hat\mu_2 - \mu_2 + \mu_1 - \hat\mu_1$ is $\sqrt{2/m}$-subgaussian

Step 2 The regret is
$$
R_n = \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right]
    = \mathbb{E}\left[\sum_{t=1}^{2m} \Delta_{A_t}\right] + \mathbb{E}\left[\sum_{t=2m+1}^{n} \Delta_{A_t}\right]
    = m\Delta + (n - 2m)\Delta\, P(\text{commit to the wrong arm})
$$
$$
= m\Delta + (n - 2m)\Delta\, P(\hat\mu_2 - \mu_2 + \mu_1 - \hat\mu_1 \ge \Delta)
\le m\Delta + n\Delta \exp\left(-\frac{m\Delta^2}{4}\right)
$$

Page 30:

Analysing Explore-Then-Commit

$$R_n \le \underbrace{m\Delta}_{(A)} + \underbrace{n\Delta\exp(-m\Delta^2/4)}_{(B)}$$

(A) is monotone increasing in m while (B) is monotone decreasing in m.

Exploration/Exploitation dilemma Exploring too much (m large) makes (A) big, while exploring too little makes (B) large.

Bound minimised by $m = \left\lceil \frac{4}{\Delta^2}\log\left(\frac{n\Delta^2}{4}\right)\right\rceil$, leading to
$$R_n \le \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta}$$
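For completeness, here is the optimisation step behind this choice of m (a standard calculation, only sketched on the slide). Treating m as continuous and setting the derivative of the bound to zero,
$$\frac{\partial}{\partial m}\left(m\Delta + n\Delta e^{-m\Delta^2/4}\right) = \Delta - \frac{n\Delta^3}{4}\,e^{-m\Delta^2/4} = 0 \iff m = \frac{4}{\Delta^2}\log\left(\frac{n\Delta^2}{4}\right),$$
and substituting this m back gives $n\Delta e^{-m\Delta^2/4} = 4/\Delta$; rounding m up to an integer adds at most $\Delta$ to the first term, which yields the bound above.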

Page 31:

Analysing Explore-Then-Commit

Last slide: $R_n \le \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta}$

What happens when $\Delta$ is very small?
$$R_n \le \min\left\{ n\Delta,\; \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta} \right\}$$

[Figure: plot of this regret bound against $\Delta \in [0, 1]$; vertical axis "Regret", 0 to 30]

Page 33:

Analysing Explore-Then-Commit

Does this figure make sense? Why is the regret largest when $\Delta$ is small, but not too small?
$$R_n \le \min\left\{ n\Delta,\; \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta} \right\}$$

[Figure: the same plot of the regret bound against $\Delta \in [0, 1]$]

Small $\Delta$ makes identification hard, but the cost of failure is low.
Large $\Delta$ makes the cost of failure high, but identification easy.
Worst case is when $\Delta \approx \sqrt{1/n}$, with $R_n \approx \sqrt{n}$.

Page 35:

Limitations of Explore-Then-Commit

• Need advance knowledge of the horizon n
• Optimal tuning depends on ∆
• Does not behave well with K > 2
• Issues with using data to adapt the commitment time
• All variants of ETC are at least a factor of 2 from being optimal
• Better approaches now exist, but Explore-Then-Commit is often a good place to start when analysing a bandit problem

Page 36:

Optimism principle

Page 37:

Informal illustration

Visiting a new region. Shall I try the local cuisine?
Optimist: Yes!
Pessimist: No!

Optimism leads to exploration, pessimism prevents it.
Exploration is necessary, but how much?

Page 38:

Optimism principle

• Let $\hat\mu_i(t) = \frac{1}{T_i(t)}\sum_{s=1}^{t} \mathbf{1}(A_s = i)\,X_s$
• Formalise the intuition using confidence intervals
• Optimistic estimate of the mean of an arm = 'largest value it could plausibly be'
• Suggests
$$\text{optimistic estimate} = \hat\mu_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}}$$
• $\delta \in (0, 1)$ determines the level of optimism

Page 39:

Upper confidence bound algorithm

1. Choose each action once
2. Choose the action maximising
$$A_t = \operatorname{argmax}_i\; \hat\mu_i(t-1) + \sqrt{\frac{2\log(t^3)}{T_i(t-1)}}$$
3. Go to 2

Corresponds to $\delta = 1/t^3$. This is quite a conservative choice; more on this later.
The algorithm does not depend on the horizon n (it is anytime).
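A minimal sketch of this rule in Python (illustrative, not the authors' implementation), again reusing the hypothetical GaussianBandit environment from the earlier sketch.

import numpy as np

def ucb(bandit, n):
    """UCB with the index above: after playing each arm once, in round t play
    argmax_i  mean_i + sqrt(2 * log(t^3) / T_i).  Returns the play counts."""
    K = bandit.K
    sums = np.zeros(K)
    counts = np.zeros(K, dtype=int)
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1   # initialisation: choose each action once
        else:
            index = sums / counts + np.sqrt(2.0 * np.log(t**3) / counts)
            arm = int(np.argmax(index))
        x = bandit.pull(arm)
        sums[arm] += x
        counts[arm] += 1
    return counts

With counts in hand, the empirical regret of a run can be computed from the gap decomposition on Page 12.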

Page 40:

Demonstration

Page 41:

Regret of UCB

Theorem The regret of UCB is at most
$$R_n = O\left(\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{\log(n)}{\Delta_i}\right)\right)$$
Furthermore,
$$R_n = O\left(\sqrt{Kn\log(n)}\right)$$

Bounds of the first kind are called problem dependent or instance dependent.
Bounds like the second are called distribution free or worst case.

Page 42:

Regret analysis

Rewrite the regret: $R_n = \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]$

Only need to show that $\mathbb{E}[T_i(n)]$ is not too large for suboptimal arms.

Page 43:

Regret analysis

Key insight Arm i is only played if its index is larger than the index of the optimal arm.

Need to show two things:
(A) The index of the optimal arm is larger than its actual mean with high probability
(B) The index of suboptimal arms falls below the mean of the optimal arm after only a few plays

$$\gamma_i(t-1) = \underbrace{\hat\mu_i(t-1) + \sqrt{\frac{2\log(t^3)}{T_i(t-1)}}}_{\text{index of arm } i \text{ in round } t}$$

Page 44:

Analysis intuition

[Figure: confidence intervals for Arm 1 and Arm 2, showing the true mean and the empirical mean of each arm]

Page 46:

Regret analysis

To make this intuition a reality we decompose the 'pull-count'
$$
\mathbb{E}[T_i(n)] = \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}(A_t = i)\right] = \sum_{t=1}^n P(A_t = i)
= \sum_{t=1}^n P\big(A_t = i \text{ and } (\gamma_1(t-1) \le \mu_1 \text{ or } \gamma_i(t-1) \ge \mu_1)\big)
$$
$$
\le \underbrace{\sum_{t=1}^n P(\gamma_1(t-1) \le \mu_1)}_{\text{index of opt. arm too small?}}
 + \underbrace{\sum_{t=1}^n P(A_t = i \text{ and } \gamma_i(t-1) \ge \mu_1)}_{\text{index of subopt. arm large?}}
$$

Page 47:

Regret analysis

We want to show that $P(\gamma_1(t-1) \le \mu_1)$ is small. Tempting to use the concentration theorem...
$$P(\gamma_1(t-1) \le \mu_1) = P\left(\hat\mu_1(t-1) + \sqrt{\frac{2\log(t^3)}{T_1(t-1)}} \le \mu_1\right) \overset{?}{\le} \frac{1}{t^3}$$

What's wrong with this? $T_1(t-1)$ is a random variable!
$$
P\left(\hat\mu_1(t-1) + \sqrt{\frac{2\log(t^3)}{T_1(t-1)}} \le \mu_1\right)
\le P\left(\exists s < t : \hat\mu_{1,s} + \sqrt{\frac{2\log(t^3)}{s}} \le \mu_1\right)
$$
$$
\le \sum_{s=1}^{t-1} P\left(\hat\mu_{1,s} + \sqrt{\frac{2\log(t^3)}{s}} \le \mu_1\right)
\le \sum_{s=1}^{t-1} \frac{1}{t^3} \le \frac{1}{t^2} .
$$

Page 49:

Regret analysis

$$
\sum_{t=1}^n P(A_t = i \text{ and } \gamma_i(t-1) \ge \mu_1)
= \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}\!\left(A_t = i \text{ and } \hat\mu_i(t-1) + \sqrt{\tfrac{6\log(t)}{T_i(t-1)}} \ge \mu_1\right)\right]
$$
$$
\le \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}\!\left(A_t = i \text{ and } \hat\mu_i(t-1) + \sqrt{\tfrac{6\log(n)}{T_i(t-1)}} \ge \mu_1\right)\right]
\le \mathbb{E}\left[\sum_{s=1}^n \mathbf{1}\!\left(\hat\mu_{i,s} + \sqrt{\tfrac{6\log(n)}{s}} \ge \mu_1\right)\right]
= \sum_{s=1}^n P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right)
$$

Page 50:

Regret analysis

Let $u = \frac{24\log(n)}{\Delta_i^2}$. Then
$$
\sum_{s=1}^n P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right)
\le u + \sum_{s=u+1}^{n} P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right)
\le u + \sum_{s=u+1}^{n} P\left(\hat\mu_{i,s} \ge \mu_i + \frac{\Delta_i}{2}\right)
$$
$$
\le u + \sum_{s=u+1}^{\infty} \exp\left(-\frac{s\Delta_i^2}{8}\right)
\le 1 + u + \frac{8}{\Delta_i^2} .
$$
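Two steps are implicit in the last two inequalities (standard, but worth spelling out): for $s > u = 24\log(n)/\Delta_i^2$ we have $\sqrt{6\log(n)/s} \le \Delta_i/2$, so the event $\hat\mu_{i,s} + \sqrt{6\log(n)/s} \ge \mu_1$ implies $\hat\mu_{i,s} - \mu_i \ge \Delta_i/2$, which by the concentration theorem has probability at most $\exp(-s\Delta_i^2/8)$; and the remaining sum is controlled by an integral comparison,
$$\sum_{s=u+1}^{\infty}\exp\left(-\frac{s\Delta_i^2}{8}\right) \le \int_{u}^{\infty}\exp\left(-\frac{s\Delta_i^2}{8}\right)ds = \frac{8}{\Delta_i^2}\exp\left(-\frac{u\Delta_i^2}{8}\right) \le \frac{8}{\Delta_i^2}.$$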

Page 51:

Regret analysis

Combining the two parts we have
$$\mathbb{E}[T_i(n)] \le 3 + \frac{8}{\Delta_i^2} + \frac{24\log(n)}{\Delta_i^2}$$

So the regret is bounded by
$$R_n = \sum_{i:\Delta_i>0} \Delta_i \mathbb{E}[T_i(n)] \le \sum_{i:\Delta_i>0}\left(3\Delta_i + \frac{8}{\Delta_i} + \frac{24\log(n)}{\Delta_i}\right)$$

Page 52:

Distribution free bounds

Let $\Delta > 0$ be some constant to be chosen later.
$$
R_n = \sum_{i:\Delta_i>0} \Delta_i \mathbb{E}[T_i(n)]
\le n\Delta + \sum_{i:\Delta_i>\Delta} \Delta_i \mathbb{E}[T_i(n)]
\lesssim n\Delta + \sum_{i:\Delta_i>\Delta} \frac{\log(n)}{\Delta_i}
\le n\Delta + \frac{K\log(n)}{\Delta}
\lesssim \sqrt{nK\log(n)}
$$
where in the last line we tuned $\Delta = \sqrt{K\log(n)/n}$.

Page 53:

Improvements

• The constants in the algorithm/analysis can be improved quite significantly:
$$A_t = \operatorname{argmax}_i\; \hat\mu_i(t-1) + \sqrt{\frac{2\log(t)}{T_i(t-1)}}$$
• With this choice:
$$\lim_{n\to\infty} \frac{R_n}{\log(n)} = \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}$$
• The distribution-free regret is also improvable:
$$A_t = \operatorname{argmax}_i\; \hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)}\log\left(1 + \frac{t}{K\,T_i(t-1)}\right)}$$
• With this index we save a log factor in the distribution free bound: $R_n = O(\sqrt{nK})$

Page 54:

Lower bounds

• Two kinds of lower bound: distribution free (worst case) and instance-dependent
• What could an instance-dependent lower bound look like?
• Algorithms that always choose a fixed action?

Page 55:

Worst case lower bound

Theorem For every algorithm and every n and $K \le n$ there exists a K-armed Gaussian bandit such that $R_n \ge \sqrt{(K-1)n}/27$

Proof sketch
• $\mu = (\Delta, 0, \dots, 0)$
• $i = \operatorname{argmin}_{i>1} \mathbb{E}_\mu[T_i(n)]$
• $\mathbb{E}[T_i(n)] \le n/(K-1)$
• $\mu' = (\Delta, 0, \dots, 2\Delta, 0, \dots, 0)$ (the $2\Delta$ in position $i$)
• Environments indistinguishable if $\Delta \approx \sqrt{K/n}$
• Suffers $n\Delta$ regret on one of them

Page 59:

Instance-dependent lower bounds

An algorithm is consistent on a class of bandits $\mathcal{E}$ if $R_n = o(n)$ for all bandits in $\mathcal{E}$.

Theorem If an algorithm is consistent for the class of Gaussian bandits, then
$$\liminf_{n\to\infty} \frac{R_n}{\log(n)} \ge \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}$$

• Consistency rules out stupid algorithms like the algorithm that always chooses a fixed action
• Consistency is asymptotic, so it is not surprising that the lower bound we derive from it is asymptotic
• A non-asymptotic version of consistency leads to non-asymptotic lower bounds

Page 61:

What else is there?

• All kinds of variants of UCB for different noise models: Bernoulli, exponential families, heavy tails, Gaussian with unknown mean and variance, ...
• A twist on UCB that replaces classical confidence bounds with Bayesian confidence bounds – offers empirical improvements
• Thompson sampling: each round sample a mean from the posterior for each arm, choose the arm with the largest sample
• All manner of twists on the setup: non-stationarity, delayed rewards, playing multiple arms each round, moving beyond expected regret (high probability bounds)
• Different objectives: simple regret, risk aversion

Page 62:

The adversarial viewpoint

• Replace random rewards with an adversary
• At the start of the game the adversary secretly chooses losses $y_1, y_2, \dots, y_n$ where $y_t \in [0,1]^K$
• Learner chooses actions $A_t$ and suffers loss $y_{tA_t}$
• Regret is
$$R_n = \underbrace{\mathbb{E}\left[\sum_{t=1}^n y_{tA_t}\right]}_{\text{learner's loss}} - \underbrace{\min_i \sum_{t=1}^n y_{ti}}_{\text{loss of best arm}}$$
• Mission Make the regret small, regardless of the adversary
• There exists an algorithm such that $R_n \le 2\sqrt{Kn}$

Page 64:

The adversarial viewpoint

• The trick is in the definition of regret
• The adversary cannot be too mean
$$R_n = \underbrace{\mathbb{E}\left[\sum_{t=1}^n y_{tA_t}\right]}_{\text{learner's loss}} - \underbrace{\min_i \sum_{t=1}^n y_{ti}}_{\text{loss of best arm}}
\qquad
y = \begin{pmatrix} 1 & \cdots & 1 & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \end{pmatrix}$$
• The following alternative objective is hopeless
$$R'_n = \underbrace{\mathbb{E}\left[\sum_{t=1}^n y_{tA_t}\right]}_{\text{learner's loss}} - \underbrace{\sum_{t=1}^n \min_i y_{ti}}_{\text{loss of best sequence}}$$
• Randomisation is crucial in adversarial bandits

Page 66:

Tackling the adversarial bandit

• Learner chooses distribution $P_t$ over the K actions
• Samples $A_t \sim P_t$
• Observes $Y_t = y_{tA_t}$
• Expected regret is
$$R_n = \max_i \mathbb{E}\left[\sum_{t=1}^n (y_{tA_t} - y_{ti})\right] = \max_{p\in\Delta^K} \mathbb{E}\left[\sum_{t=1}^n \langle P_t - p, y_t\rangle\right]$$
• This looks a lot like online linear optimisation on a simplex
• Only $y_t$ is not observed

Page 67:

Online convex optimisation (linear losses)

• $\mathcal{K} \subset \mathbb{R}^d$ is a convex set
• Adversary secretly chooses $y_1, \dots, y_n \in \mathcal{K}^\circ = \{u : \sup_{x\in\mathcal{K}} |\langle x, u\rangle| \le 1\}$
• Learner chooses $x_t \in \mathcal{K}$
• Suffers loss $\langle x_t, y_t\rangle$ and the regret with respect to $x \in \mathcal{K}$ is
$$R_n(x) = \sum_{t=1}^n \langle x_t - x, y_t\rangle .$$
• How to choose $x_t$? Most simple idea, 'follow-the-leader':
$$x_t = \operatorname{argmin}_{x\in\mathcal{K}} \sum_{s=1}^{t-1} \langle x, y_s\rangle .$$
• Fails miserably: $\mathcal{K} = [-1, 1]$, $y_1 = 1/2$, $y_2 = -1$, $y_3 = 1, \dots$ and $x_1 = ?$, $x_2 = -1$, $x_3 = 1, \dots$, leading to $R_n(0) \approx n$.
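A tiny numerical sketch of this failure mode (illustrative, not from the slides; follow_the_leader_demo is a hypothetical helper that replays the example above on $\mathcal{K} = [-1, 1]$).

import numpy as np

def follow_the_leader_demo(n=20):
    """Follow-the-leader on K = [-1, 1] with losses y = 1/2, -1, 1, -1, 1, ...
    The leader flips between -1 and +1 and always lands on the wrong side, so the
    cumulative loss grows linearly while the comparator x = 0 suffers zero loss."""
    ys = [0.5] + [(-1.0) ** t for t in range(1, n)]   # 1/2, -1, 1, -1, ...
    cum = 0.0          # running sum y_1 + ... + y_{t-1}
    total_loss = 0.0
    for y in ys:
        # argmin over x in [-1, 1] of x * cum: the endpoint opposite in sign to cum
        x = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
        total_loss += x * y
        cum += y
    return total_loss  # roughly n - 1, i.e. regret against x = 0 of about n

print(follow_the_leader_demo())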

Page 68:

Follow the regularised leader

• New idea Add regularisation to stabilise follow-the-leader
• Let F be a convex function, $\eta > 0$ be the learning rate, and
$$x_t = \operatorname{argmin}_{x\in\mathcal{K}} \left( F(x) + \eta \sum_{s=1}^{t-1} \langle x, y_s\rangle \right)$$
• The Bregman divergence induced by F is
$$D_F(x, y) = F(x) - F(y) - \langle\nabla F(y), x - y\rangle$$

[Figure: $D_F(b, a)$ is the gap at $b$ between $F(x)$ and the tangent $F(a) + \nabla F(a)(x - a)$]

Page 69:

Follow the regularised leader

Theorem The regret of follow the regularised leader satisfies
$$
R_n(x) \le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=1}^n \left(\langle x_t - x_{t+1}, y_t\rangle - \frac{1}{\eta} D_F(x_{t+1}, x_t)\right)
\le \frac{F(x) - F(x_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|y_t\|_{t*}^2
$$

Tradeoffs How much to regularise?

Let $z \in [x_t, x_{t+1}]$ be such that $D_F(x_{t+1}, x_t) = \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2 F(z)}$ and $\|\cdot\|_t = \|\cdot\|_{\nabla^2 F(z)}$. Then
$$
\langle x_t - x_{t+1}, y_t\rangle - \frac{D_F(x_{t+1}, x_t)}{\eta}
\le \|y_t\|_{t*}\|x_t - x_{t+1}\|_t - \frac{D_F(x_{t+1}, x_t)}{\eta}
= \|y_t\|_{t*}\sqrt{2 D_F(x_{t+1}, x_t)} - \frac{D_F(x_{t+1}, x_t)}{\eta}
\le \frac{\eta}{2}\|y_t\|_{t*}^2
$$

Page 71:

Let $\Phi_t(x) = F(x)/\eta + \sum_{s=1}^t \langle x, y_s\rangle$.
$$R_n(x) = \sum_{t=1}^n \langle x_t - x, y_t\rangle = \sum_{t=1}^n \langle x_{t+1} - x, y_t\rangle + \sum_{t=1}^n \langle x_t - x_{t+1}, y_t\rangle$$
Then, using $D_{\Phi_t}(\cdot,\cdot) = D_F(\cdot,\cdot)/\eta$ and $x_{t+1} = \operatorname{argmin}_x \Phi_t(x)$:
$$
\sum_{t=1}^n \langle x_{t+1} - x, y_t\rangle = \frac{F(x)}{\eta} + \sum_{t=1}^n \big(\Phi_t(x_{t+1}) - \Phi_{t-1}(x_{t+1})\big) - \Phi_n(x)
$$
$$
= \frac{F(x)}{\eta} - \Phi_0(x_1) + \underbrace{\Phi_n(x_{n+1}) - \Phi_n(x)}_{\le 0} + \sum_{t=0}^{n-1}\big(\Phi_t(x_{t+1}) - \Phi_t(x_{t+2})\big)
$$
$$
\le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=0}^{n-1}\big(\Phi_t(x_{t+1}) - \Phi_t(x_{t+2})\big)
$$
$$
= \frac{F(x) - F(x_1)}{\eta} - \sum_{t=0}^{n-1}\Big(D_{\Phi_t}(x_{t+2}, x_{t+1}) + \underbrace{\langle\nabla\Phi_t(x_{t+1}), x_{t+2} - x_{t+1}\rangle}_{\ge 0}\Big)
$$

Page 72:

Follow the regularised leader for bandits

• Estimate $y_t$ with the unbiased importance-weighted estimator $\hat Y_t$:
$$\hat Y_{ti} = \frac{\mathbf{1}(A_t = i)\, y_{ti}}{P_{ti}}$$
• Then the expected regret satisfies
$$\mathbb{E}[R_n] = \max_i \mathbb{E}\left[\sum_{t=1}^n y_{tA_t} - y_{ti}\right] = \max_i \mathbb{E}\left[\sum_{t=1}^n \langle P_t - e_i, \hat Y_t\rangle\right]$$
• Choosing $P_t = \operatorname{argmin}_p \frac{F(p)}{\eta} + \sum_{s=1}^{t-1}\langle p, \hat Y_s\rangle$ leads to
$$\mathbb{E}[R_n] \le \frac{F(e_i) - F(P_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\hat Y_t\|_{t*}^2$$
• We just need to choose F carefully
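A quick check of the unbiasedness claim (a standard one-line computation, not spelled out on the slide): conditioning on the round-t sampling distribution,
$$\mathbb{E}\big[\hat Y_{ti} \,\big|\, P_t\big] = P(A_t = i \mid P_t)\,\frac{y_{ti}}{P_{ti}} = P_{ti}\,\frac{y_{ti}}{P_{ti}} = y_{ti}.$$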

Page 73:

Page 74:

Follow the regularised leader for bandits

• We showed $\mathbb{E}[R_n] \le \mathbb{E}\left[\frac{F(e_i) - F(P_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\hat Y_t\|_{t*}^2\right]$
• Let's randomly choose the unnormalised negentropy
$$F(p) = \sum_{i=1}^K \big(p_i\log(p_i) - p_i\big)$$
• An 'easy' calculation shows that
$$P_{ti} = \frac{\exp\left(-\eta\sum_{s=1}^{t-1}\hat Y_{si}\right)}{\sum_{j=1}^K \exp\left(-\eta\sum_{s=1}^{t-1}\hat Y_{sj}\right)}$$
• Then $F(e_i) - F(P_1) \le \log(K)$. For the dual norm,
$$\nabla^2 F(p) = \operatorname{diag}(1/p) \implies \|y\|_{t*}^2 = \sum_{i=1}^K p_i y_i^2 \;\text{ for some } p \in [P_t, P_{t+1}]$$
• $\hat Y_{ti}$ is positive and $\hat Y_{ti} = 0$ unless $A_t = i$. So $P_{t+1,A_t} \le P_{tA_t}$ and $\|\hat Y_t\|_{t*}^2 \le P_{tA_t}\hat Y_{tA_t}^2$
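The resulting update is exponential weights on importance-weighted loss estimates (the algorithm widely known as Exp3). Below is a minimal sketch, not the authors' code; the function name and the shape of the losses array are illustrative assumptions.

import numpy as np

def exp3(losses, eta, rng=None):
    """FTRL with the unnormalised negentropy regulariser on importance-weighted
    loss estimates.  `losses` is an (n, K) array of adversarial losses in [0, 1];
    only the entry of the played arm is revealed to the learner each round.
    Returns the realised regret against the best fixed arm in hindsight."""
    rng = rng if rng is not None else np.random.default_rng()
    n, K = losses.shape
    cum_est = np.zeros(K)            # cumulative importance-weighted estimates
    total_loss = 0.0
    for t in range(n):
        # P_t proportional to exp(-eta * estimated cumulative loss)
        logits = -eta * cum_est
        p = np.exp(logits - logits.max())
        p /= p.sum()
        arm = rng.choice(K, p=p)
        y = losses[t, arm]           # the only observed entry of y_t
        total_loss += y
        cum_est[arm] += y / p[arm]   # importance-weighted estimator
    return total_loss - losses.sum(axis=0).min()

The learning rate would be set as on the next slide, e.g. eta = sqrt(2 * log(K) / (n * K)).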

Page 75:

Follow the regularised leader for bandits

• Now we have
$$
\mathbb{E}[R_n] \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^n P_{tA_t}\hat Y_{tA_t}^2\right]
= \frac{\log(K)}{\eta} + \frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^n \frac{y_{tA_t}^2}{P_{tA_t}}\right]
\le \frac{\log(K)}{\eta} + \frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^n \frac{1}{P_{tA_t}}\right]
$$
$$
= \frac{\log(K)}{\eta} + \frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^n \sum_{i=1}^K P_{ti}\cdot\frac{1}{P_{ti}}\right]
= \frac{\log(K)}{\eta} + \frac{\eta n K}{2}
\le \sqrt{2nK\log(K)}
$$
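The last step hides the learning-rate choice; equating the two terms (a standard tuning, not spelled out on the slide) gives
$$\eta = \sqrt{\frac{2\log(K)}{nK}} \;\Longrightarrow\; \frac{\log(K)}{\eta} + \frac{\eta n K}{2} = \sqrt{\frac{nK\log(K)}{2}} + \sqrt{\frac{nK\log(K)}{2}} = \sqrt{2nK\log(K)}.$$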

Page 76:

Adversarial bandits

• Instance-dependence?
• Moving beyond expected regret (high probability bounds)
• Why bother with stochastic bandits?
• Best of both worlds? Bubeck and Slivkins (2012); Seldin and Lugosi (2017); Auer and Chiang (2016)
• Big myth Adversarial bandits do not address nonstationarity

Page 77:

Resources

• Book by Bubeck and Cesa-Bianchi (2012)
• Book by Cesa-Bianchi and Lugosi (2006)
• The Bayesian books by Gittins et al. (2011) and Berry and Fristedt (1985). Both worth reading.
• Our online notes: http://banditalgs.com
• Notes by Aleksandrs Slivkins: http://slivkins.com/work/MAB-book.pdf
• We will soon release a 450 page book ("Bandit Algorithms", to be published by Cambridge)

Page 78:

Historical notes

• First paper on bandits is by Thompson (1933). He proposed an algorithm for two-armed Bernoulli bandits and hand-ran some simulations (Thompson sampling)
• Popularised enormously by Robbins (1952)
• Confidence bounds first used by Lai and Robbins (1985) to derive an asymptotically optimal algorithm
• UCB by Katehakis and Robbins (1995) and Agrawal (1995). Finite-time analysis by Auer et al. (2002)
• Adversarial bandits: Auer et al. (1995)
• Minimax optimal algorithm by Audibert and Bubeck (2009)

Page 79:

References I

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.
Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE.
Auer, P. and Chiang, C. (2016). An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 116–120.
Berry, D. and Fristedt, B. (1985). Bandit problems: sequential allocation of experiments. Chapman and Hall, London; New York.
Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated.
Bubeck, S. and Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. In COLT, pages 42.1–42.23.
Bush, R. R. and Mosteller, F. (1953). A stochastic model with applications to learning. The Annals of Mathematical Statistics, pages 559–585.
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

Page 80:

References II

Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.
Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.
Seldin, Y. and Lugosi, G. (2017). An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In COLT, pages 1743–1759.
Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Page 81:

Random concentration failure

Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed standard Gaussian random variables. For any n we have
$$P\left(\sum_{t=1}^n X_t \ge \sqrt{2n\log(1/\delta)}\right) \le \delta$$

Want to show this can fail if n is replaced by a random variable T.

The law of the iterated logarithm says that
$$\limsup_{n\to\infty} \frac{\sum_{t=1}^n X_t}{\sqrt{2n\log\log(n)}} = 1 \quad \text{almost surely}$$

Let $T = \min\left\{n : \sum_{t=1}^n X_t \ge \sqrt{2n\log(1/\delta)}\right\}$. Then $P(T < \infty) = 1$ and
$$P\left(\sum_{t=1}^T X_t \ge \sqrt{2T\log(1/\delta)}\right) = 1 .$$

Contradiction! (Works if T is independent of $X_1, X_2, \dots$ though.)
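To make the failure concrete, here is a small simulation sketch (not from the slides; the function name and parameters are illustrative). It estimates how often the random walk crosses the boundary at some time up to a finite horizon: any single fixed time crosses with probability at most δ, yet the any-time crossing frequency is typically much larger and, by the argument above, tends to 1 as the horizon grows.

import numpy as np

def stopped_sum_crossing(delta=0.25, horizon=100_000, runs=200, seed=0):
    """Estimate the frequency with which a standard Gaussian random walk exceeds
    sqrt(2 n log(1/delta)) at SOME time n <= horizon."""
    rng = np.random.default_rng(seed)
    ns = np.arange(1, horizon + 1)
    boundary = np.sqrt(2 * ns * np.log(1.0 / delta))
    crossings = 0
    for _ in range(runs):
        walk = np.cumsum(rng.standard_normal(horizon))
        crossings += bool(np.any(walk >= boundary))
    return crossings / runs, delta

freq, delta = stopped_sum_crossing()
print(f"crossing frequency up to the horizon: {freq:.2f} (delta = {delta})")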
