Bandit Algorithms – Tor Lattimore & Csaba Szepesvári

Page 1:

Bandit Algorithms
Tor Lattimore & Csaba Szepesvári

Page 2:

Bandits

Time:      1 2 3 4 5 6 7 8 9 10 11 12
Left arm:  $1 $0 $1 $1 $0
Right arm: $1 $0

Five rounds to go. Which arm would you play next?

Page 3:

Overview

• What are bandits, and why you should care
• Finite-armed stochastic bandits
• Finite-armed adversarial bandits

Page 4:

What's in a name? A tiny bit of history

First bandit algorithm proposed by Thompson (1933).

Bush and Mosteller (1953) were interested in how mice behaved in a T-maze.

Page 5:

Why care about bandits?

1. Many applications
2. They isolate an important component of reinforcement learning: exploration vs. exploitation
3. Rich and beautiful (we think) mathematically

Page 6:

Applications

• Clinical trials / dose discovery
• Recommendation systems (movies/news/etc.)
• Advert placement
• A/B testing
• Network routing
• Dynamic pricing (e.g., for Amazon products)
• Waiting problems (when to auto-logout your computer)
• Ranking (e.g., for search)
• A component of game-playing algorithms (MCTS)
• Resource allocation
• A way of isolating one interesting part of reinforcement learning

Lots for you to do!

Page 8:

Finite-armed bandits

• $K$ actions
• $n$ rounds
• In each round $t$ the learner chooses an action $A_t \in \{1, 2, \dots, K\}$
• Observes reward $X_t \sim P_{A_t}$, where $P_1, P_2, \dots, P_K$ are unknown distributions
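To make this setup concrete, here is a minimal environment sketch in Python (not part of the slides; the class name GaussianBandit and its pull method are illustrative), assuming unit-variance Gaussian rewards as in the rest of the tutorial.

import numpy as np

class GaussianBandit:
    """K-armed bandit environment with unit-variance Gaussian rewards."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)   # unknown to the learner
        self.rng = rng if rng is not None else np.random.default_rng()

    @property
    def K(self):
        return len(self.means)

    def pull(self, arm):
        # Reward X_t ~ N(mu_arm, 1), as assumed throughout the tutorial
        return self.rng.normal(self.means[arm], 1.0)

The policies sketched later in these notes interact with the environment only through pull, which is exactly the bandit feedback model above.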

Page 9:

Distributional assumptions

While $P_1, P_2, \dots, P_K$ are not known in advance, we make some assumptions:

• $P_i$ is Bernoulli with unknown bias $\mu_i \in [0, 1]$
• $P_i$ is Gaussian with unit variance and unknown mean $\mu_i \in \mathbb{R}$
• $P_i$ is subgaussian
• $P_i$ is supported on $[0, 1]$
• $P_i$ has variance less than one
• ...

As usual, stronger assumptions lead to stronger bounds.

This tutorial: all reward distributions are Gaussian (or subgaussian) with unit variance.

Page 10:

Example: A/B testing

• Business wants to optimize their webpage
• Actions correspond to 'A' and 'B'
• Users arrive at the webpage sequentially
• Algorithm chooses either 'A' or 'B'
• Receives activity feedback (the reward)

Page 11:

Measuring performance – the regret

• Let $\mu_i$ be the mean reward of distribution $P_i$
• $\mu^* = \max_i \mu_i$ is the maximum mean
• The regret is
$$R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right]$$
• Policies for which the regret is sublinear are learning
• Of course we would like to make it as 'small as possible'

Page 12:

Measuring performance – the regret

Let $\Delta_i = \mu^* - \mu_i$ be the suboptimality gap for the $i$th arm and $T_i(n)$ be the number of times arm $i$ is played over all $n$ rounds.

Lemma $R_n = \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]$

Proof Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, X_1, \dots, X_{t-1}, A_t]$. Then
$$
R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right]
    = n\mu^* - \sum_{t=1}^n \mathbb{E}\big[\mathbb{E}_t[X_t]\big]
    = n\mu^* - \sum_{t=1}^n \mathbb{E}[\mu_{A_t}]
    = \sum_{t=1}^n \mathbb{E}[\Delta_{A_t}]
$$
$$
= \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right]
= \mathbb{E}\left[\sum_{t=1}^n \sum_{i=1}^K \mathbf{1}(A_t = i)\,\Delta_i\right]
= \mathbb{E}\left[\sum_{i=1}^K \Delta_i \sum_{t=1}^n \mathbf{1}(A_t = i)\right]
= \mathbb{E}\left[\sum_{i=1}^K \Delta_i T_i(n)\right]
= \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]
$$

Page 16:

A simple policy: Explore-Then-Commit

1. Choose each action m times
2. Find the empirically best action $I \in \{1, 2, \dots, K\}$
3. Choose $A_t = I$ for all remaining rounds

In order to analyse this policy we need to bound the probability of committing to a suboptimal action.
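A minimal sketch of this policy (illustrative, not the authors' code), reusing the hypothetical GaussianBandit environment from the earlier sketch.

import numpy as np

def explore_then_commit(bandit, n, m):
    """Explore-Then-Commit: play each of the K arms m times, then commit to
    the empirically best arm for the remaining rounds.
    Returns the total reward collected (assumes n >= m * bandit.K)."""
    K = bandit.K
    sums = np.zeros(K)
    total = 0.0
    # 1. Exploration: choose each action m times
    for arm in range(K):
        for _ in range(m):
            x = bandit.pull(arm)
            sums[arm] += x
            total += x
    # 2. Find the empirically best action
    best = int(np.argmax(sums / m))
    # 3. Commit for all remaining rounds
    for _ in range(n - m * K):
        total += bandit.pull(best)
    return total

For example, explore_then_commit(GaussianBandit([0.5, 0.0]), n=1000, m=50) runs the two-armed case analysed below.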

Page 18:

A crash course in concentration

Let $Z, Z_1, Z_2, \dots, Z_n$ be a sequence of independent and identically distributed random variables with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 < \infty$.

Empirical mean: $\hat\mu_n = \frac{1}{n}\sum_{t=1}^n Z_t$

How close is $\hat\mu_n$ to $\mu$?

Classical statistics says:
1. (law of large numbers) $\lim_{n\to\infty} \hat\mu_n = \mu$ almost surely
2. (central limit theorem) $\sqrt{n}(\hat\mu_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$
3. (Chebyshev's inequality) $P(|\hat\mu_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}$

We need something nonasymptotic and stronger than Chebyshev's. Not possible without assumptions.

Page 20:

A crash course in concentration

Random variable $Z$ is $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$,
$$M_Z(\lambda) \doteq \mathbb{E}[\exp(\lambda Z)] \le \exp\left(\lambda^2\sigma^2/2\right)$$

Lemma If $Z, Z_1, \dots, Z_n$ are independent and $\sigma$-subgaussian, then
• $aZ$ is $|a|\sigma$-subgaussian for any $a \in \mathbb{R}$
• $\sum_{t=1}^n Z_t$ is $\sqrt{n}\sigma$-subgaussian
• $\hat\mu_n$ is $n^{-1/2}\sigma$-subgaussian

Page 21:

A crash course in concentration

Theorem If $Z_1, \dots, Z_n$ are independent and $\sigma$-subgaussian, then
$$P\left(\hat\mu_n \ge \sqrt{\frac{2\sigma^2\log(1/\delta)}{n}}\right) \le \delta$$

Proof We use Chernoff's method. Let $\varepsilon > 0$ and $\lambda = \varepsilon n/\sigma^2$. Then
$$
P(\hat\mu_n \ge \varepsilon) = P\big(\exp(\lambda\hat\mu_n) \ge \exp(\lambda\varepsilon)\big)
\le \mathbb{E}[\exp(\lambda\hat\mu_n)]\exp(-\lambda\varepsilon) \quad \text{(Markov's)}
\le \exp\left(\sigma^2\lambda^2/(2n) - \lambda\varepsilon\right)
= \exp\left(-n\varepsilon^2/(2\sigma^2)\right)
$$

Page 25:

A crash course in concentration

• Which distributions are $\sigma$-subgaussian? Gaussian, Bernoulli, bounded support.
• And not: exponential, power law
• Comparing Chebyshev's with the subgaussian bound:
  Chebyshev's: $\sqrt{\frac{\sigma^2}{n\delta}}$   Subgaussian: $\sqrt{\frac{2\sigma^2\log(1/\delta)}{n}}$
• Typically $\delta \ll 1/n$ in our use-cases

The results that follow hold when the distribution associated with each arm is 1-subgaussian.

Page 26:

Analysing Explore-Then-Commit

• Standard convention Assume $\mu_1 \ge \mu_2 \ge \dots \ge \mu_K$
• Algorithms are symmetric and do not exploit this fact
• Means that the first arm is optimal
• Remember, Explore-Then-Commit chooses each arm m times
• Then commits to the arm with the largest payoff
• We consider only K = 2

Page 28:

Analysing Explore-Then-Commit

Step 1 Let $\hat\mu_i$ be the average reward of arm $i$ after exploring. The algorithm commits to the wrong arm if
$$\hat\mu_2 \ge \hat\mu_1 \iff \hat\mu_2 - \mu_2 + \mu_1 - \hat\mu_1 \ge \Delta$$

Observation $\hat\mu_2 - \mu_2 + \mu_1 - \hat\mu_1$ is $\sqrt{2/m}$-subgaussian

Step 2 The regret is
$$
R_n = \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right]
    = \mathbb{E}\left[\sum_{t=1}^{2m} \Delta_{A_t}\right] + \mathbb{E}\left[\sum_{t=2m+1}^{n} \Delta_{A_t}\right]
    = m\Delta + (n - 2m)\Delta\, P(\text{commit to the wrong arm})
$$
$$
= m\Delta + (n - 2m)\Delta\, P(\hat\mu_2 - \mu_2 + \mu_1 - \hat\mu_1 \ge \Delta)
\le m\Delta + n\Delta \exp\left(-\frac{m\Delta^2}{4}\right)
$$

Page 30:

Analysing Explore-Then-Commit

$$R_n \le \underbrace{m\Delta}_{(A)} + \underbrace{n\Delta\exp(-m\Delta^2/4)}_{(B)}$$

(A) is monotone increasing in m while (B) is monotone decreasing in m.

Exploration/Exploitation dilemma Exploring too much (m large) makes (A) big, while exploring too little makes (B) large.

Bound minimised by $m = \left\lceil \frac{4}{\Delta^2}\log\left(\frac{n\Delta^2}{4}\right)\right\rceil$, leading to
$$R_n \le \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta}$$
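For completeness, here is the optimisation step behind this choice of m (a standard calculation, only sketched on the slide). Treating m as continuous and setting the derivative of the bound to zero,
$$\frac{\partial}{\partial m}\left(m\Delta + n\Delta e^{-m\Delta^2/4}\right) = \Delta - \frac{n\Delta^3}{4}\,e^{-m\Delta^2/4} = 0 \iff m = \frac{4}{\Delta^2}\log\left(\frac{n\Delta^2}{4}\right),$$
and substituting this m back gives $n\Delta e^{-m\Delta^2/4} = 4/\Delta$; rounding m up to an integer adds at most $\Delta$ to the first term, which yields the bound above.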

Page 31:

Analysing Explore-Then-Commit

Last slide: $R_n \le \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta}$

What happens when $\Delta$ is very small?
$$R_n \le \min\left\{ n\Delta,\; \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta} \right\}$$

[Figure: plot of this regret bound against $\Delta \in [0, 1]$; vertical axis "Regret", 0 to 30]

Page 33:

Analysing Explore-Then-Commit

Does this figure make sense? Why is the regret largest when $\Delta$ is small, but not too small?
$$R_n \le \min\left\{ n\Delta,\; \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta} \right\}$$

[Figure: the same plot of the regret bound against $\Delta \in [0, 1]$]

Small $\Delta$ makes identification hard, but the cost of failure is low.
Large $\Delta$ makes the cost of failure high, but identification easy.
Worst case is when $\Delta \approx \sqrt{1/n}$, with $R_n \approx \sqrt{n}$.

Page 35:

Limitations of Explore-Then-Commit

• Need advance knowledge of the horizon n
• Optimal tuning depends on ∆
• Does not behave well with K > 2
• Issues with using data to adapt the commitment time
• All variants of ETC are at least a factor of 2 from being optimal
• Better approaches now exist, but Explore-Then-Commit is often a good place to start when analysing a bandit problem

Page 36:

Optimism principle

Page 37:

Informal illustration

Visiting a new region. Shall I try the local cuisine?
Optimist: Yes!
Pessimist: No!

Optimism leads to exploration, pessimism prevents it.
Exploration is necessary, but how much?

Page 38:

Optimism principle

• Let $\hat\mu_i(t) = \frac{1}{T_i(t)}\sum_{s=1}^{t} \mathbf{1}(A_s = i)\,X_s$
• Formalise the intuition using confidence intervals
• Optimistic estimate of the mean of an arm = 'largest value it could plausibly be'
• Suggests
$$\text{optimistic estimate} = \hat\mu_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}}$$
• $\delta \in (0, 1)$ determines the level of optimism

Page 39:

Upper confidence bound algorithm

1. Choose each action once
2. Choose the action maximising
$$A_t = \operatorname{argmax}_i\; \hat\mu_i(t-1) + \sqrt{\frac{2\log(t^3)}{T_i(t-1)}}$$
3. Go to 2

Corresponds to $\delta = 1/t^3$. This is quite a conservative choice; more on this later.
The algorithm does not depend on the horizon n (it is anytime).
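A minimal sketch of this rule in Python (illustrative, not the authors' implementation), again reusing the hypothetical GaussianBandit environment from the earlier sketch.

import numpy as np

def ucb(bandit, n):
    """UCB with the index above: after playing each arm once, in round t play
    argmax_i  mean_i + sqrt(2 * log(t^3) / T_i).  Returns the play counts."""
    K = bandit.K
    sums = np.zeros(K)
    counts = np.zeros(K, dtype=int)
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1   # initialisation: choose each action once
        else:
            index = sums / counts + np.sqrt(2.0 * np.log(t**3) / counts)
            arm = int(np.argmax(index))
        x = bandit.pull(arm)
        sums[arm] += x
        counts[arm] += 1
    return counts

With counts in hand, the empirical regret of a run can be computed from the gap decomposition on Page 12.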

Page 40:

Demonstration

Page 41:

Regret of UCB

Theorem The regret of UCB is at most
$$R_n = O\left(\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{\log(n)}{\Delta_i}\right)\right)$$
Furthermore,
$$R_n = O\left(\sqrt{Kn\log(n)}\right)$$

Bounds of the first kind are called problem dependent or instance dependent.
Bounds like the second are called distribution free or worst case.

Page 42:

Regret analysis

Rewrite the regret: $R_n = \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]$

Only need to show that $\mathbb{E}[T_i(n)]$ is not too large for suboptimal arms.

Page 43:

Regret analysis

Key insight Arm i is only played if its index is larger than the index of the optimal arm.

Need to show two things:
(A) The index of the optimal arm is larger than its actual mean with high probability
(B) The index of suboptimal arms falls below the mean of the optimal arm after only a few plays

$$\gamma_i(t-1) = \underbrace{\hat\mu_i(t-1) + \sqrt{\frac{2\log(t^3)}{T_i(t-1)}}}_{\text{index of arm } i \text{ in round } t}$$

Page 44:

Analysis intuition

[Figure: confidence intervals for Arm 1 and Arm 2, showing the true mean and the empirical mean of each arm]

Page 46:

Regret analysis

To make this intuition a reality we decompose the 'pull-count'
$$
\mathbb{E}[T_i(n)] = \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}(A_t = i)\right] = \sum_{t=1}^n P(A_t = i)
= \sum_{t=1}^n P\big(A_t = i \text{ and } (\gamma_1(t-1) \le \mu_1 \text{ or } \gamma_i(t-1) \ge \mu_1)\big)
$$
$$
\le \underbrace{\sum_{t=1}^n P(\gamma_1(t-1) \le \mu_1)}_{\text{index of opt. arm too small?}}
 + \underbrace{\sum_{t=1}^n P(A_t = i \text{ and } \gamma_i(t-1) \ge \mu_1)}_{\text{index of subopt. arm large?}}
$$

Page 47:

Regret analysis

We want to show that $P(\gamma_1(t-1) \le \mu_1)$ is small. Tempting to use the concentration theorem...
$$P(\gamma_1(t-1) \le \mu_1) = P\left(\hat\mu_1(t-1) + \sqrt{\frac{2\log(t^3)}{T_1(t-1)}} \le \mu_1\right) \overset{?}{\le} \frac{1}{t^3}$$

What's wrong with this? $T_1(t-1)$ is a random variable!
$$
P\left(\hat\mu_1(t-1) + \sqrt{\frac{2\log(t^3)}{T_1(t-1)}} \le \mu_1\right)
\le P\left(\exists s < t : \hat\mu_{1,s} + \sqrt{\frac{2\log(t^3)}{s}} \le \mu_1\right)
$$
$$
\le \sum_{s=1}^{t-1} P\left(\hat\mu_{1,s} + \sqrt{\frac{2\log(t^3)}{s}} \le \mu_1\right)
\le \sum_{s=1}^{t-1} \frac{1}{t^3} \le \frac{1}{t^2} .
$$

Page 49:

Regret analysis

$$
\sum_{t=1}^n P(A_t = i \text{ and } \gamma_i(t-1) \ge \mu_1)
= \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}\!\left(A_t = i \text{ and } \hat\mu_i(t-1) + \sqrt{\tfrac{6\log(t)}{T_i(t-1)}} \ge \mu_1\right)\right]
$$
$$
\le \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}\!\left(A_t = i \text{ and } \hat\mu_i(t-1) + \sqrt{\tfrac{6\log(n)}{T_i(t-1)}} \ge \mu_1\right)\right]
\le \mathbb{E}\left[\sum_{s=1}^n \mathbf{1}\!\left(\hat\mu_{i,s} + \sqrt{\tfrac{6\log(n)}{s}} \ge \mu_1\right)\right]
= \sum_{s=1}^n P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right)
$$

Page 50:

Regret analysis

Let $u = \frac{24\log(n)}{\Delta_i^2}$. Then
$$
\sum_{s=1}^n P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right)
\le u + \sum_{s=u+1}^{n} P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right)
\le u + \sum_{s=u+1}^{n} P\left(\hat\mu_{i,s} \ge \mu_i + \frac{\Delta_i}{2}\right)
$$
$$
\le u + \sum_{s=u+1}^{\infty} \exp\left(-\frac{s\Delta_i^2}{8}\right)
\le 1 + u + \frac{8}{\Delta_i^2} .
$$
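Two steps are implicit in the last two inequalities (standard, but worth spelling out): for $s > u = 24\log(n)/\Delta_i^2$ we have $\sqrt{6\log(n)/s} \le \Delta_i/2$, so the event $\hat\mu_{i,s} + \sqrt{6\log(n)/s} \ge \mu_1$ implies $\hat\mu_{i,s} - \mu_i \ge \Delta_i/2$, which by the concentration theorem has probability at most $\exp(-s\Delta_i^2/8)$; and the remaining sum is controlled by an integral comparison,
$$\sum_{s=u+1}^{\infty}\exp\left(-\frac{s\Delta_i^2}{8}\right) \le \int_{u}^{\infty}\exp\left(-\frac{s\Delta_i^2}{8}\right)ds = \frac{8}{\Delta_i^2}\exp\left(-\frac{u\Delta_i^2}{8}\right) \le \frac{8}{\Delta_i^2}.$$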

Page 51:

Regret analysis

Combining the two parts we have
$$\mathbb{E}[T_i(n)] \le 3 + \frac{8}{\Delta_i^2} + \frac{24\log(n)}{\Delta_i^2}$$

So the regret is bounded by
$$R_n = \sum_{i:\Delta_i>0} \Delta_i \mathbb{E}[T_i(n)] \le \sum_{i:\Delta_i>0}\left(3\Delta_i + \frac{8}{\Delta_i} + \frac{24\log(n)}{\Delta_i}\right)$$

Page 52:

Distribution free bounds

Let $\Delta > 0$ be some constant to be chosen later.
$$
R_n = \sum_{i:\Delta_i>0} \Delta_i \mathbb{E}[T_i(n)]
\le n\Delta + \sum_{i:\Delta_i>\Delta} \Delta_i \mathbb{E}[T_i(n)]
\lesssim n\Delta + \sum_{i:\Delta_i>\Delta} \frac{\log(n)}{\Delta_i}
\le n\Delta + \frac{K\log(n)}{\Delta}
\lesssim \sqrt{nK\log(n)}
$$
where in the last line we tuned $\Delta = \sqrt{K\log(n)/n}$.

Page 53:

Improvements

• The constants in the algorithm/analysis can be improved quite significantly:
$$A_t = \operatorname{argmax}_i\; \hat\mu_i(t-1) + \sqrt{\frac{2\log(t)}{T_i(t-1)}}$$
• With this choice:
$$\lim_{n\to\infty} \frac{R_n}{\log(n)} = \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}$$
• The distribution-free regret is also improvable:
$$A_t = \operatorname{argmax}_i\; \hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)}\log\left(1 + \frac{t}{K\,T_i(t-1)}\right)}$$
• With this index we save a log factor in the distribution free bound: $R_n = O(\sqrt{nK})$

Page 54:

Lower bounds

• Two kinds of lower bound: distribution free (worst case) and instance-dependent
• What could an instance-dependent lower bound look like?
• Algorithms that always choose a fixed action?

Page 55:

Worst case lower bound

Theorem For every algorithm and every n and $K \le n$ there exists a K-armed Gaussian bandit such that $R_n \ge \sqrt{(K-1)n}/27$

Proof sketch
• $\mu = (\Delta, 0, \dots, 0)$
• $i = \operatorname{argmin}_{i>1} \mathbb{E}_\mu[T_i(n)]$
• $\mathbb{E}[T_i(n)] \le n/(K-1)$
• $\mu' = (\Delta, 0, \dots, 2\Delta, 0, \dots, 0)$ (the $2\Delta$ in position $i$)
• Environments indistinguishable if $\Delta \approx \sqrt{K/n}$
• Suffers $n\Delta$ regret on one of them

Page 59:

Instance-dependent lower bounds

An algorithm is consistent on a class of bandits $\mathcal{E}$ if $R_n = o(n)$ for all bandits in $\mathcal{E}$.

Theorem If an algorithm is consistent for the class of Gaussian bandits, then
$$\liminf_{n\to\infty} \frac{R_n}{\log(n)} \ge \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}$$

• Consistency rules out stupid algorithms like the algorithm that always chooses a fixed action
• Consistency is asymptotic, so it is not surprising that the lower bound we derive from it is asymptotic
• A non-asymptotic version of consistency leads to non-asymptotic lower bounds

Page 61:

What else is there?

• All kinds of variants of UCB for different noise models: Bernoulli, exponential families, heavy tails, Gaussian with unknown mean and variance, ...
• A twist on UCB that replaces classical confidence bounds with Bayesian confidence bounds – offers empirical improvements
• Thompson sampling: each round sample a mean from the posterior for each arm, choose the arm with the largest sample
• All manner of twists on the setup: non-stationarity, delayed rewards, playing multiple arms each round, moving beyond expected regret (high probability bounds)
• Different objectives: simple regret, risk aversion

Page 62:

The adversarial viewpoint

• Replace random rewards with an adversary
• At the start of the game the adversary secretly chooses losses $y_1, y_2, \dots, y_n$ where $y_t \in [0,1]^K$
• Learner chooses actions $A_t$ and suffers loss $y_{tA_t}$
• Regret is
$$R_n = \underbrace{\mathbb{E}\left[\sum_{t=1}^n y_{tA_t}\right]}_{\text{learner's loss}} - \underbrace{\min_i \sum_{t=1}^n y_{ti}}_{\text{loss of best arm}}$$
• Mission Make the regret small, regardless of the adversary
• There exists an algorithm such that $R_n \le 2\sqrt{Kn}$

Page 64:

The adversarial viewpoint

• The trick is in the definition of regret
• The adversary cannot be too mean
$$R_n = \underbrace{\mathbb{E}\left[\sum_{t=1}^n y_{tA_t}\right]}_{\text{learner's loss}} - \underbrace{\min_i \sum_{t=1}^n y_{ti}}_{\text{loss of best arm}}
\qquad
y = \begin{pmatrix} 1 & \cdots & 1 & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \end{pmatrix}$$
• The following alternative objective is hopeless
$$R'_n = \underbrace{\mathbb{E}\left[\sum_{t=1}^n y_{tA_t}\right]}_{\text{learner's loss}} - \underbrace{\sum_{t=1}^n \min_i y_{ti}}_{\text{loss of best sequence}}$$
• Randomisation is crucial in adversarial bandits

Page 66:

Tackling the adversarial bandit

• Learner chooses distribution $P_t$ over the K actions
• Samples $A_t \sim P_t$
• Observes $Y_t = y_{tA_t}$
• Expected regret is
$$R_n = \max_i \mathbb{E}\left[\sum_{t=1}^n (y_{tA_t} - y_{ti})\right] = \max_{p\in\Delta^K} \mathbb{E}\left[\sum_{t=1}^n \langle P_t - p, y_t\rangle\right]$$
• This looks a lot like online linear optimisation on a simplex
• Only $y_t$ is not observed

Page 67:

Online convex optimisation (linear losses)

• $\mathcal{K} \subset \mathbb{R}^d$ is a convex set
• Adversary secretly chooses $y_1, \dots, y_n \in \mathcal{K}^\circ = \{u : \sup_{x\in\mathcal{K}} |\langle x, u\rangle| \le 1\}$
• Learner chooses $x_t \in \mathcal{K}$
• Suffers loss $\langle x_t, y_t\rangle$ and the regret with respect to $x \in \mathcal{K}$ is
$$R_n(x) = \sum_{t=1}^n \langle x_t - x, y_t\rangle .$$
• How to choose $x_t$? Most simple idea, 'follow-the-leader':
$$x_t = \operatorname{argmin}_{x\in\mathcal{K}} \sum_{s=1}^{t-1} \langle x, y_s\rangle .$$
• Fails miserably: $\mathcal{K} = [-1, 1]$, $y_1 = 1/2$, $y_2 = -1$, $y_3 = 1, \dots$ and $x_1 = ?$, $x_2 = -1$, $x_3 = 1, \dots$, leading to $R_n(0) \approx n$.
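A tiny numerical sketch of this failure mode (illustrative, not from the slides; follow_the_leader_demo is a hypothetical helper that replays the example above on $\mathcal{K} = [-1, 1]$).

import numpy as np

def follow_the_leader_demo(n=20):
    """Follow-the-leader on K = [-1, 1] with losses y = 1/2, -1, 1, -1, 1, ...
    The leader flips between -1 and +1 and always lands on the wrong side, so the
    cumulative loss grows linearly while the comparator x = 0 suffers zero loss."""
    ys = [0.5] + [(-1.0) ** t for t in range(1, n)]   # 1/2, -1, 1, -1, ...
    cum = 0.0          # running sum y_1 + ... + y_{t-1}
    total_loss = 0.0
    for y in ys:
        # argmin over x in [-1, 1] of x * cum: the endpoint opposite in sign to cum
        x = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
        total_loss += x * y
        cum += y
    return total_loss  # roughly n - 1, i.e. regret against x = 0 of about n

print(follow_the_leader_demo())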

Page 68:

Follow the regularised leader

• New idea Add regularisation to stabilise follow-the-leader
• Let F be a convex function, $\eta > 0$ be the learning rate, and
$$x_t = \operatorname{argmin}_{x\in\mathcal{K}} \left( F(x) + \eta \sum_{s=1}^{t-1} \langle x, y_s\rangle \right)$$
• The Bregman divergence induced by F is
$$D_F(x, y) = F(x) - F(y) - \langle\nabla F(y), x - y\rangle$$

[Figure: $D_F(b, a)$ is the gap at $b$ between $F(x)$ and the tangent $F(a) + \nabla F(a)(x - a)$]

Page 69:

Follow the regularised leader

Theorem The regret of follow the regularised leader satisfies
$$
R_n(x) \le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=1}^n \left(\langle x_t - x_{t+1}, y_t\rangle - \frac{1}{\eta} D_F(x_{t+1}, x_t)\right)
\le \frac{F(x) - F(x_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|y_t\|_{t*}^2
$$

Tradeoffs How much to regularise?

Let $z \in [x_t, x_{t+1}]$ be such that $D_F(x_{t+1}, x_t) = \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2 F(z)}$ and $\|\cdot\|_t = \|\cdot\|_{\nabla^2 F(z)}$. Then
$$
\langle x_t - x_{t+1}, y_t\rangle - \frac{D_F(x_{t+1}, x_t)}{\eta}
\le \|y_t\|_{t*}\|x_t - x_{t+1}\|_t - \frac{D_F(x_{t+1}, x_t)}{\eta}
= \|y_t\|_{t*}\sqrt{2 D_F(x_{t+1}, x_t)} - \frac{D_F(x_{t+1}, x_t)}{\eta}
\le \frac{\eta}{2}\|y_t\|_{t*}^2
$$

Page 71:

Let $\Phi_t(x) = F(x)/\eta + \sum_{s=1}^t \langle x, y_s\rangle$.
$$R_n(x) = \sum_{t=1}^n \langle x_t - x, y_t\rangle = \sum_{t=1}^n \langle x_{t+1} - x, y_t\rangle + \sum_{t=1}^n \langle x_t - x_{t+1}, y_t\rangle$$
Then, using $D_{\Phi_t}(\cdot,\cdot) = D_F(\cdot,\cdot)/\eta$ and $x_{t+1} = \operatorname{argmin}_x \Phi_t(x)$:
$$
\sum_{t=1}^n \langle x_{t+1} - x, y_t\rangle = \frac{F(x)}{\eta} + \sum_{t=1}^n \big(\Phi_t(x_{t+1}) - \Phi_{t-1}(x_{t+1})\big) - \Phi_n(x)
$$
$$
= \frac{F(x)}{\eta} - \Phi_0(x_1) + \underbrace{\Phi_n(x_{n+1}) - \Phi_n(x)}_{\le 0} + \sum_{t=0}^{n-1}\big(\Phi_t(x_{t+1}) - \Phi_t(x_{t+2})\big)
$$
$$
\le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=0}^{n-1}\big(\Phi_t(x_{t+1}) - \Phi_t(x_{t+2})\big)
$$
$$
= \frac{F(x) - F(x_1)}{\eta} - \sum_{t=0}^{n-1}\Big(D_{\Phi_t}(x_{t+2}, x_{t+1}) + \underbrace{\langle\nabla\Phi_t(x_{t+1}), x_{t+2} - x_{t+1}\rangle}_{\ge 0}\Big)
$$

Page 72:

Follow the regularised leader for bandits

• Estimate $y_t$ with the unbiased importance-weighted estimator $\hat Y_t$:
$$\hat Y_{ti} = \frac{\mathbf{1}(A_t = i)\, y_{ti}}{P_{ti}}$$
• Then the expected regret satisfies
$$\mathbb{E}[R_n] = \max_i \mathbb{E}\left[\sum_{t=1}^n y_{tA_t} - y_{ti}\right] = \max_i \mathbb{E}\left[\sum_{t=1}^n \langle P_t - e_i, \hat Y_t\rangle\right]$$
• Choosing $P_t = \operatorname{argmin}_p \frac{F(p)}{\eta} + \sum_{s=1}^{t-1}\langle p, \hat Y_s\rangle$ leads to
$$\mathbb{E}[R_n] \le \frac{F(e_i) - F(P_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\hat Y_t\|_{t*}^2$$
• We just need to choose F carefully
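A quick check of the unbiasedness claim (a standard one-line computation, not spelled out on the slide): conditioning on the round-t sampling distribution,
$$\mathbb{E}\big[\hat Y_{ti} \,\big|\, P_t\big] = P(A_t = i \mid P_t)\,\frac{y_{ti}}{P_{ti}} = P_{ti}\,\frac{y_{ti}}{P_{ti}} = y_{ti}.$$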

Page 73:

Page 74:

Follow the regularised leader for bandits

• We showed $\mathbb{E}[R_n] \le \mathbb{E}\left[\frac{F(e_i) - F(P_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\hat Y_t\|_{t*}^2\right]$
• Let's randomly choose the unnormalised negentropy
$$F(p) = \sum_{i=1}^K \big(p_i\log(p_i) - p_i\big)$$
• An 'easy' calculation shows that
$$P_{ti} = \frac{\exp\left(-\eta\sum_{s=1}^{t-1}\hat Y_{si}\right)}{\sum_{j=1}^K \exp\left(-\eta\sum_{s=1}^{t-1}\hat Y_{sj}\right)}$$
• Then $F(e_i) - F(P_1) \le \log(K)$. For the dual norm,
$$\nabla^2 F(p) = \operatorname{diag}(1/p) \implies \|y\|_{t*}^2 = \sum_{i=1}^K p_i y_i^2 \;\text{ for some } p \in [P_t, P_{t+1}]$$
• $\hat Y_{ti}$ is positive and $\hat Y_{ti} = 0$ unless $A_t = i$. So $P_{t+1,A_t} \le P_{tA_t}$ and $\|\hat Y_t\|_{t*}^2 \le P_{tA_t}\hat Y_{tA_t}^2$
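The resulting update is exponential weights on importance-weighted loss estimates (the algorithm widely known as Exp3). Below is a minimal sketch, not the authors' code; the function name and the shape of the losses array are illustrative assumptions.

import numpy as np

def exp3(losses, eta, rng=None):
    """FTRL with the unnormalised negentropy regulariser on importance-weighted
    loss estimates.  `losses` is an (n, K) array of adversarial losses in [0, 1];
    only the entry of the played arm is revealed to the learner each round.
    Returns the realised regret against the best fixed arm in hindsight."""
    rng = rng if rng is not None else np.random.default_rng()
    n, K = losses.shape
    cum_est = np.zeros(K)            # cumulative importance-weighted estimates
    total_loss = 0.0
    for t in range(n):
        # P_t proportional to exp(-eta * estimated cumulative loss)
        logits = -eta * cum_est
        p = np.exp(logits - logits.max())
        p /= p.sum()
        arm = rng.choice(K, p=p)
        y = losses[t, arm]           # the only observed entry of y_t
        total_loss += y
        cum_est[arm] += y / p[arm]   # importance-weighted estimator
    return total_loss - losses.sum(axis=0).min()

The learning rate would be set as on the next slide, e.g. eta = sqrt(2 * log(K) / (n * K)).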

Page 75:

Follow the regularised leader for bandits

• Now we have
$$
\mathbb{E}[R_n] \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^n P_{tA_t}\hat Y_{tA_t}^2\right]
= \frac{\log(K)}{\eta} + \frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^n \frac{y_{tA_t}^2}{P_{tA_t}}\right]
\le \frac{\log(K)}{\eta} + \frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^n \frac{1}{P_{tA_t}}\right]
$$
$$
= \frac{\log(K)}{\eta} + \frac{\eta}{2}\mathbb{E}\left[\sum_{t=1}^n \sum_{i=1}^K P_{ti}\cdot\frac{1}{P_{ti}}\right]
= \frac{\log(K)}{\eta} + \frac{\eta n K}{2}
\le \sqrt{2nK\log(K)}
$$
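The last step hides the learning-rate choice; equating the two terms (a standard tuning, not spelled out on the slide) gives
$$\eta = \sqrt{\frac{2\log(K)}{nK}} \;\Longrightarrow\; \frac{\log(K)}{\eta} + \frac{\eta n K}{2} = \sqrt{\frac{nK\log(K)}{2}} + \sqrt{\frac{nK\log(K)}{2}} = \sqrt{2nK\log(K)}.$$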

Page 76:

Adversarial bandits

• Instance-dependence?
• Moving beyond expected regret (high probability bounds)
• Why bother with stochastic bandits?
• Best of both worlds? Bubeck and Slivkins (2012); Seldin and Lugosi (2017); Auer and Chiang (2016)
• Big myth Adversarial bandits do not address nonstationarity

Page 77:

Resources

• Book by Bubeck and Cesa-Bianchi (2012)
• Book by Cesa-Bianchi and Lugosi (2006)
• The Bayesian books by Gittins et al. (2011) and Berry and Fristedt (1985). Both worth reading.
• Our online notes: http://banditalgs.com
• Notes by Aleksandrs Slivkins: http://slivkins.com/work/MAB-book.pdf
• We will soon release a 450 page book ("Bandit Algorithms", to be published by Cambridge)

Page 78:

Historical notes

• First paper on bandits is by Thompson (1933). He proposed an algorithm for two-armed Bernoulli bandits and hand-ran some simulations (Thompson sampling)
• Popularised enormously by Robbins (1952)
• Confidence bounds first used by Lai and Robbins (1985) to derive an asymptotically optimal algorithm
• UCB by Katehakis and Robbins (1995) and Agrawal (1995). Finite-time analysis by Auer et al. (2002)
• Adversarial bandits: Auer et al. (1995)
• Minimax optimal algorithm by Audibert and Bubeck (2009)

Page 79:

References I

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.
Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE.
Auer, P. and Chiang, C. (2016). An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 116–120.
Berry, D. and Fristedt, B. (1985). Bandit problems: sequential allocation of experiments. Chapman and Hall, London; New York.
Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated.
Bubeck, S. and Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. In COLT, pages 42.1–42.23.
Bush, R. R. and Mosteller, F. (1953). A stochastic model with applications to learning. The Annals of Mathematical Statistics, pages 559–585.
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

Page 80:

References II

Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.
Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.
Seldin, Y. and Lugosi, G. (2017). An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In COLT, pages 1743–1759.
Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Page 81:

Random concentration failure

Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed standard Gaussian random variables. For any n we have
$$P\left(\sum_{t=1}^n X_t \ge \sqrt{2n\log(1/\delta)}\right) \le \delta$$

Want to show this can fail if n is replaced by a random variable T.

The law of the iterated logarithm says that
$$\limsup_{n\to\infty} \frac{\sum_{t=1}^n X_t}{\sqrt{2n\log\log(n)}} = 1 \quad \text{almost surely}$$

Let $T = \min\left\{n : \sum_{t=1}^n X_t \ge \sqrt{2n\log(1/\delta)}\right\}$. Then $P(T < \infty) = 1$ and
$$P\left(\sum_{t=1}^T X_t \ge \sqrt{2T\log(1/\delta)}\right) = 1 .$$

Contradiction! (Works if T is independent of $X_1, X_2, \dots$ though.)
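To make the failure concrete, here is a small simulation sketch (not from the slides; the function name and parameters are illustrative). It estimates how often the random walk crosses the boundary at some time up to a finite horizon: any single fixed time crosses with probability at most δ, yet the any-time crossing frequency is typically much larger and, by the argument above, tends to 1 as the horizon grows.

import numpy as np

def stopped_sum_crossing(delta=0.25, horizon=100_000, runs=200, seed=0):
    """Estimate the frequency with which a standard Gaussian random walk exceeds
    sqrt(2 n log(1/delta)) at SOME time n <= horizon."""
    rng = np.random.default_rng(seed)
    ns = np.arange(1, horizon + 1)
    boundary = np.sqrt(2 * ns * np.log(1.0 / delta))
    crossings = 0
    for _ in range(runs):
        walk = np.cumsum(rng.standard_normal(horizon))
        crossings += bool(np.any(walk >= boundary))
    return crossings / runs, delta

freq, delta = stopped_sum_crossing()
print(f"crossing frequency up to the horizon: {freq:.2f} (delta = {delta})")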
