Page 1

Introduction to Bandits: Algorithms and Theory

Jean-Yves Audibert (1,2) & Rémi Munos (3)

1. Université Paris-Est, LIGM, Imagine
2. CNRS / École Normale Supérieure / INRIA, LIENS, Sierra
3. INRIA, Sequential Learning team, France

ICML 2011, Bellevue (WA), USA

Page 2

Outline

- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions

Page 3

Bandit game

Parameters available to the forecaster: the number of arms (or actions) $K$ and the number of rounds $n$.
Unknown to the forecaster: the way the gain vectors $g_t = (g_{1,t}, \dots, g_{K,t}) \in [0,1]^K$ are generated.

For each round $t = 1, 2, \dots, n$:

1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the forecaster receives the gain $g_{I_t,t}$
3. only $g_{I_t,t}$ is revealed to the forecaster

Cumulative regret goal: maximize the cumulative gains obtained. More precisely, minimize
$$R_n = \left(\max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^n g_{i,t}\right) - \mathbb{E}\sum_{t=1}^n g_{I_t,t}$$
where the expectation $\mathbb{E}$ comes from both a possible stochastic generation of the gain vector and a possible randomization in the choice of $I_t$.
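For concreteness, a minimal sketch of this interaction protocol; the policy interface (`select`/`update`) and the Bernoulli environment are illustrative assumptions, not from the slides:

```python
import random

def play_bandit_game(policy, gain_fn, n):
    """Run the bandit protocol: at each round the forecaster picks an arm
    and observes only the gain of the arm it played."""
    total = 0.0
    for t in range(1, n + 1):
        arm = policy.select(t)       # the forecaster chooses I_t
        gain = gain_fn(t, arm)       # the environment produces g_{I_t,t}
        policy.update(arm, gain)     # only g_{I_t,t} is revealed
        total += gain
    return total

class UniformRandomPolicy:
    """Baseline forecaster: picks arms uniformly at random, ignores feedback."""
    def __init__(self, K):
        self.K = K
    def select(self, t):
        return random.randrange(self.K)
    def update(self, arm, gain):
        pass

# Example: a stochastic environment with two Bernoulli arms of means 0.5 and 0.6
means = [0.5, 0.6]
total = play_bandit_game(UniformRandomPolicy(K=2),
                         lambda t, i: float(random.random() < means[i]),
                         n=1000)
```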

Page 4

Stochastic and adversarial environments

- Stochastic environment: the gain vector $g_t$ is sampled from an unknown product distribution $\nu_1 \otimes \dots \otimes \nu_K$ on $[0,1]^K$, that is, $g_{i,t} \sim \nu_i$.
- Adversarial environment: the gain vector $g_t$ is chosen by an adversary (which, at time $t$, knows all the past, but not $I_t$).

Page 5

Numerous variants

- different environments: adversarial, “stochastic”, non-stationary
- different targets: cumulative regret, simple regret, tracking the best expert
- continuous or discrete set of actions
- extension with additional rules: varying set of arms, pay-per-observation, ...

Page 6

Various applications

- Clinical trials (Thompson, 1933)
- Ads placement on webpages
- Nash equilibria (traffic or communication networks, agent simulation, phantom tic-tac-toe, ...)
- Game-playing computers (Go, Urban Rivals, ...)
- Packet routing, itinerary selection
- ...

Page 7

Outline

- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions

Page 8

Stochastic bandit game (Robbins, 1952)

Parameters available to the forecaster: $K$ and $n$.
Parameters unknown to the forecaster: the reward distributions $\nu_1, \dots, \nu_K$ of the arms (with respective means $\mu_1, \dots, \mu_K$).

For each round $t = 1, 2, \dots, n$:

1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the environment draws the gain vector $g_t = (g_{1,t}, \dots, g_{K,t})$ according to $\nu_1 \otimes \dots \otimes \nu_K$
3. the forecaster receives the gain $g_{I_t,t}$

Notation: $i^* = \arg\max_{i=1,\dots,K} \mu_i$, $\mu^* = \max_{i=1,\dots,K} \mu_i$, $\Delta_i = \mu^* - \mu_i$, $T_i(n) = \sum_{t=1}^n \mathbf{1}_{I_t=i}$

Cumulative regret: $\hat{R}_n = \sum_{t=1}^n g_{i^*,t} - \sum_{t=1}^n g_{I_t,t}$

Goal: minimize the expected cumulative regret
$$R_n = \mathbb{E}\hat{R}_n = n\mu^* - \mathbb{E}\sum_{t=1}^n g_{I_t,t} = n\mu^* - \mathbb{E}\sum_{i=1}^K T_i(n)\,\mu_i = \sum_{i=1}^K \Delta_i\,\mathbb{E}\,T_i(n)$$

Page 9

A simple policy: ε-greedy

- Playing the arm with the highest empirical mean does not work
- ε-greedy (sketched in code below): at time $t$,
  - with probability $1-\varepsilon_t$, play the arm with the highest empirical mean
  - with probability $\varepsilon_t$, play a random arm
- Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002):
  - Let $\Delta = \min_{i:\Delta_i>0} \Delta_i$ and consider $\varepsilon_t = \min\left(\frac{6K}{\Delta^2 t}, 1\right)$
  - When $t \geq \frac{6K}{\Delta^2}$, the probability of choosing a suboptimal arm $i$ is bounded by $\frac{C}{\Delta^2 t}$ for some constant $C > 0$
  - As a consequence, $\mathbb{E}[T_i(n)] \leq \frac{C}{\Delta^2}\log n$ and $R_n \leq \sum_{i:\Delta_i>0} \frac{C\Delta_i}{\Delta^2}\log n$ → logarithmic regret

- Drawbacks:
  - requires knowledge of $\Delta$
  - outperformed by the UCB policy in practice
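A minimal Python sketch of ε-greedy with the $\Delta$-dependent schedule above (class and method names are illustrative; this is not code from the tutorial):

```python
import random

class EpsilonGreedy:
    """epsilon-greedy with the schedule eps_t = min(6K / (Delta^2 t), 1).
    Requires knowledge of Delta, one of the drawbacks listed above."""
    def __init__(self, K, Delta):
        self.K = K
        self.Delta = Delta
        self.counts = [0] * K      # T_i(t-1)
        self.means = [0.0] * K     # empirical means

    def select(self, t):
        eps_t = min(6 * self.K / (self.Delta ** 2 * t), 1.0)
        if random.random() < eps_t or 0 in self.counts:
            return random.randrange(self.K)   # explore (or play an unseen arm)
        return max(range(self.K), key=lambda i: self.means[i])  # exploit

    def update(self, arm, gain):
        self.counts[arm] += 1
        self.means[arm] += (gain - self.means[arm]) / self.counts[arm]
```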

Page 10

Optimism in face of uncertainty

- At time $t$, from past observations and some probabilistic argument, you have an upper confidence bound (UCB) on the expected rewards.
- Simple implementation: play the arm having the largest UCB!

Page 11

Why does it make sense?

- Could we keep drawing a wrong arm for a long time? No, since:
  - the more we draw a wrong arm $i$, the closer its UCB gets to the expected reward $\mu_i$,
  - $\mu_i < \mu^* \leq$ UCB on $\mu^*$

Page 12

Illustration of UCB policy

Page 13

Confidence intervals vs sampling times

Page 14

Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)

- Hoeffding's inequality: let $X, X_1, \dots, X_m$ be i.i.d. random variables taking their values in $[0,1]$. For any $\varepsilon > 0$, with probability at least $1-\varepsilon$, we have
$$\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}}$$
- UCB1 policy (sketched in code below): at time $t$, play
$$I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{\hat{\mu}_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}}\right\},$$
where $\hat{\mu}_{i,t-1} = \frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s}$
- Regret bound:
$$R_n \leq \sum_{i\neq i^*} \min\left(\frac{10}{\Delta_i}\log n,\; n\Delta_i\right)$$
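A minimal sketch of the UCB1 index policy as defined above (the class structure is illustrative):

```python
import math

class UCB1:
    """UCB1: play each arm once, then play the arm maximizing
    empirical mean + sqrt(2 log t / T_i(t-1))."""
    def __init__(self, K):
        self.K = K
        self.counts = [0] * K      # T_i(t-1)
        self.means = [0.0] * K     # empirical means

    def select(self, t):
        for i in range(self.K):
            if self.counts[i] == 0:
                return i           # initialization: play each arm once
        return max(range(self.K),
                   key=lambda i: self.means[i]
                   + math.sqrt(2 * math.log(t) / self.counts[i]))

    def update(self, arm, gain):
        self.counts[arm] += 1
        self.means[arm] += (gain - self.means[arm]) / self.counts[arm]
```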

Page 15

Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)

- Hoeffding's inequality and the UCB1 policy: as on page 14.
- UCB1 is an anytime policy (it does not need to know $n$ to be implemented)

Page 16

Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)

- Hoeffding's inequality and the UCB1 policy: as on page 14.
- UCB1 corresponds to $2\log t = \frac{\log(\varepsilon^{-1})}{2}$, hence $\varepsilon = 1/t^4$
- Critical confidence level $\varepsilon = 1/t$ (Lai & Robbins, 1985; Agrawal, 1995; Burnetas & Katehakis, 1996; Audibert, Munos, Szepesvári, 2009; Honda & Takemura, 2010)

Page 17

Better confidence bounds imply smaller regret

- Hoeffding's inequality, $\frac{1}{t}$-confidence bound:
$$\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log t}{2m}}$$
- Bernstein's inequality, $\frac{1}{t}$-confidence bound:
$$\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(t)\,\mathrm{Var}\,X}{m}} + \frac{\log t}{3m}$$
- Empirical Bernstein's inequality, $\frac{1}{t}$-confidence bound:
$$\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(t)\,\widehat{\mathrm{Var}}\,X}{m}} + \frac{8\log t}{3m}$$
(Audibert, Munos, Szepesvári, 2009; Maurer, 2009; Audibert, 2010)
- The asymptotic confidence bound leads to catastrophe:
$$\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\mathrm{Var}\,X}{m}}\,x \quad\text{with } x \text{ s.t. } \int_x^{+\infty} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\,du = \frac{1}{t}$$

Page 18

Better confidence bounds imply smaller regret

Hoeffding-based UCB:
- confidence bound: $\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}}$
- regret bound: $R_n \leq \sum_{i\neq i^*} \min\left(\frac{c}{\Delta_i}\log n,\; n\Delta_i\right)$

Empirical Bernstein-based UCB:
- confidence bound: $\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(\varepsilon^{-1})\,\widehat{\mathrm{Var}}\,X}{m}} + \frac{8\log(\varepsilon^{-1})}{3m}$
- regret bound: $R_n \leq \sum_{i\neq i^*} \min\left(c\left(\frac{\mathrm{Var}\,\nu_i}{\Delta_i}+1\right)\log n,\; n\Delta_i\right)$
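A sketch of an empirical-Bernstein-based index built from the $\frac{1}{t}$-confidence bound above (the idea behind UCB-V of Audibert, Munos, Szepesvári, 2009); the helper names are illustrative and the exact constants vary across papers:

```python
import math

def empirical_bernstein_index(rewards, t):
    """Upper confidence bound on the mean from the empirical Bernstein
    inequality at confidence level 1/t."""
    m = len(rewards)
    mean = sum(rewards) / m
    var = sum((x - mean) ** 2 for x in rewards) / m   # empirical variance
    log_t = math.log(t)
    return mean + math.sqrt(2 * log_t * var / m) + 8 * log_t / (3 * m)

def select_arm(history, t):
    """history[i] holds the rewards observed so far from arm i."""
    for i, rewards in enumerate(history):
        if not rewards:
            return i                                  # play each arm once
    return max(range(len(history)),
               key=lambda i: empirical_bernstein_index(history[i], t))
```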

Page 19

Tuning the exploration: simple vs difficult bandit problems

- UCB1(ρ) policy: at time $t$, play
$$I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{\hat{\mu}_{i,t-1} + \sqrt{\frac{\rho\log t}{T_i(t-1)}}\right\}$$

[Figures: expected regret of UCB1(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms. Left: Dirac(0.6) and Ber(0.5). Right: Ber(0.6) and Ber(0.5).]

Page 20

Tuning the exploration parameter: from theory to practice

- Theory:
  - for ρ < 0.5, UCB1(ρ) has polynomial regret
  - for ρ > 0.5, UCB1(ρ) has logarithmic regret
- Practice: ρ = 0.2 seems to be the best default value for $n < 10^8$

[Figures: expected regret of UCB1(ρ) as a function of ρ, for n = 1000. Left: K = 3 arms, Ber(0.6), Ber(0.5) and Ber(0.5). Right: K = 5 arms, Ber(0.7), Ber(0.6), Ber(0.5), Ber(0.4) and Ber(0.3).]

Page 21

Deviations of UCB1 regret

- UCB1 policy: at time $t$, play
$$I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{\hat{\mu}_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}}\right\}$$
- An inequality of the form $\mathbb{P}(\hat{R}_n > \mathbb{E}\hat{R}_n + \gamma) \leq c\,e^{-c\gamma}$ does not hold!
- If the smallest reward observable from the optimal arm is smaller than the mean reward of the second best arm, then the regret of UCB1 satisfies: for any $C > 0$, there exists $C' > 0$ such that for any $n \geq 2$,
$$\mathbb{P}\big(\hat{R}_n > \mathbb{E}\hat{R}_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}}$$
(Audibert, Munos, Szepesvári, 2009)

Page 22

Anytime UCB policies have a heavy-tailed regret

- For some difficult bandit problems, the regret of UCB1 satisfies: for any $C > 0$, there exists $C' > 0$ such that for any $n \geq 2$,
$$\mathbb{P}\big(\hat{R}_n > \mathbb{E}\hat{R}_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}} \qquad (A)$$
- UCB-Horizon (UCB-H) policy (contrasted with UCB1 in the sketch below): at time $t$, play
$$I_t \in \arg\max_{i\in\{1,\dots,K\}} \left\{\hat{\mu}_{i,t-1} + \sqrt{\frac{2\log n}{T_i(t-1)}}\right\}$$
(Audibert, Munos, Szepesvári, 2009)
- UCB-H satisfies $\mathbb{P}(\hat{R}_n > \mathbb{E}\hat{R}_n + C\log n) \leq \frac{C}{n}$ for some $C$
- (A) = unavoidable for anytime policies (Salomon, Audibert, 2011)
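A small sketch contrasting the two indices; `horizon_n` stands for the known horizon $n$ required by UCB-H (the function names are illustrative):

```python
import math

def ucb1_index(mean, count, t):
    """Anytime UCB1 index: the exploration term grows with the current time t."""
    return mean + math.sqrt(2 * math.log(t) / count)

def ucb_h_index(mean, count, horizon_n):
    """UCB-Horizon index: the exploration term uses the known horizon n,
    trading the anytime property for lighter regret tails."""
    return mean + math.sqrt(2 * math.log(horizon_n) / count)
```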

Page 23

Comparison of UCB1 (solid lines) and UCB-H (dotted lines)

Page 24

Comparison of UCB1(ρ) and UCB-H(ρ) in expectation

[Figures: expected regret of GCL∗, UCB1(ρ) and UCB-H(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms.]

Left: Dirac(0.6) vs Bernoulli(0.5) Right: Bernoulli(0.6) vs Dirac(0.5)

Page 25

Comparison of UCB1(ρ) and UCB-H(ρ) in deviations

- For n = 1000 and K = 2 arms: Bernoulli(0.6) and Dirac(0.5)

[Figures: regret distribution of GCL∗, UCB1(0.2), UCB-H(0.2), UCB1(0.5) and UCB-H(0.5), as a function of the regret level r.]

Left: smoothed probability mass function. Right: tail distribution of the regret.

Page 26

Knowing the horizon: theory and practice

- Theory: use UCB-H to avoid heavy tails of the regret
- Practice: theory is right. Besides, thanks to this robustness, the expected regret of UCB-H(ρ) is consistently smaller than that of UCB1(ρ), but the gain is small.

Page 27

Knowing µ∗

- Hoeffding-based GCL∗ policy (sketched in code below): play each arm once, then play
$$I_t \in \arg\min_{i\in\{1,\dots,K\}} T_i(t-1)\,\big(\mu^* - \hat{\mu}_{i,t-1}\big)_+^2$$
- Underlying ideas:
  - compare the p-values of the $K$ tests $H_0 = \{\mu_i = \mu^*\}$, $i \in \{1,\dots,K\}$
  - the p-values are estimated using Hoeffding's inequality:
$$\mathbb{P}_{H_0}\Big(\hat{\mu}_{i,t-1} \leq \hat{\mu}^{(\mathrm{obs})}_{i,t-1}\Big) \lesssim \exp\Big(-2\,T_i(t-1)\,\big(\mu^* - \hat{\mu}^{(\mathrm{obs})}_{i,t-1}\big)_+^2\Big)$$
  - play the arm for which we have the Greatest Confidence Level that it is the optimal arm
- Advantages:
  - logarithmic expected regret
  - anytime policy
  - regret with a subexponential right tail
  - parameter-free policy!
  - outperforms any other Hoeffding-based algorithm!
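A minimal sketch of the Hoeffding-based GCL∗ index when $\mu^*$ is known (function names are illustrative):

```python
def hoeffding_gcl_index(count, mean, mu_star):
    """Index T_i(t-1) * (mu* - empirical mean)_+^2: a small index means a
    large estimated p-value for H0: mu_i = mu*, i.e. high confidence that
    arm i could be the optimal one."""
    gap = max(mu_star - mean, 0.0)          # (mu* - mean)_+
    return count * gap * gap

def gcl_select(counts, means, mu_star):
    """Play each arm once, then play the arm with the smallest index."""
    K = len(counts)
    for i in range(K):
        if counts[i] == 0:
            return i
    return min(range(K),
               key=lambda i: hoeffding_gcl_index(counts[i], means[i], mu_star))
```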

Page 28

From Chernoff’s inequality to KL-based algorithms

- Let $K(p, q)$ be the Kullback-Leibler divergence between Bernoulli distributions of respective parameters $p$ and $q$
- Let $X_1, \dots, X_T$ be i.i.d. random variables of mean $\mu$, taking their values in $[0,1]$. Let $\bar{X} = \frac{1}{T}\sum_{i=1}^T X_i$. For any $\gamma > 0$,
$$\mathbb{P}\big(\bar{X} \leq \mu - \gamma\big) \leq \exp\big(-T\,K(\mu-\gamma, \mu)\big).$$
In particular, we have
$$\mathbb{P}\big(\bar{X} \leq \bar{X}^{(\mathrm{obs})}\big) \leq \exp\Big(-T\,K\big(\min(\bar{X}^{(\mathrm{obs})}, \mu), \mu\big)\Big).$$
- If $\mu^*$ is known, using the same idea of comparing the p-values of the tests $H_0 = \{\mu_i = \mu^*\}$, $i \in \{1,\dots,K\}$, we get the Chernoff-based GCL∗ policy: play each arm once, then play
$$I_t \in \arg\min_{i\in\{1,\dots,K\}} T_i(t-1)\,K\big(\min(\hat{\mu}_{i,t-1}, \mu^*), \mu^*\big)$$
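A sketch of the Bernoulli KL divergence and the Chernoff-based GCL∗ index built on it (the small clipping constant is a numerical-safety assumption, not part of the slides):

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence K(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_gcl_index(count, mean, mu_star):
    """Index T_i(t-1) * K(min(mean, mu*), mu*); the arm with the smallest
    index (largest estimated p-value of H0: mu_i = mu*) is played."""
    return count * bernoulli_kl(min(mean, mu_star), mu_star)

def chernoff_gcl_select(counts, means, mu_star):
    for i, c in enumerate(counts):
        if c == 0:
            return i                        # play each arm once
    return min(range(len(counts)),
               key=lambda i: chernoff_gcl_index(counts[i], means[i], mu_star))
```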

Page 29

Back to unknown µ∗

- When $\mu^*$ is unknown, the principle

  "play the arm for which we have the greatest confidence level that it is the optimal arm"

  is replaced by

  "be optimistic in face of uncertainty":

  - an arm $i$ is represented by the highest mean of a distribution $\nu$ for which the hypothesis $H_0 = \{\nu_i = \nu\}$ has a p-value greater than $\frac{1}{t^\beta}$ (critical $\beta = 1$, as usual)
  - the arm with the highest index (= UCB) is played

Page 30

KL-based algorithms when µ∗ is unknown

- Approximating the p-value using Sanov's theorem is tightly linked to the DMED policy, which satisfies
$$\limsup_{n\to+\infty} \frac{R_n}{\log n} \leq \sum_{i\neq i^*} \frac{\Delta_i}{\inf_{\nu:\,\mathbb{E}_{X\sim\nu} X \geq \mu^*} K(\nu_i, \nu)}$$
for distributions with finite support (included in $[0,1]$) (Burnetas & Katehakis, 1996; Honda & Takemura, 2010).
It matches the lower bound
$$\liminf_{n\to+\infty} \frac{R_n}{\log n} \geq \sum_{i\neq i^*} \frac{\Delta_i}{\inf_{\nu:\,\mathbb{E}_{X\sim\nu} X \geq \mu^*} K(\nu_i, \nu)}$$
(Lai & Robbins, 1985; Burnetas & Katehakis, 1996)
- Approximating the p-value using a non-asymptotic version of Sanov's theorem leads to KL-UCB (Cappé & Garivier, COLT 2011) and the K-strategy (Maillard, Munos, Stoltz, COLT 2011)
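For intuition, a sketch of a KL-UCB-style index for Bernoulli rewards, computed by bisection; the exact exploration rate in Cappé & Garivier (2011) includes an additional log log t term, so this simplified threshold is an illustrative assumption:

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, count, t, iters=50):
    """Largest q >= mean with count * K(mean, q) <= log t, found by bisection
    (K(mean, .) is increasing on [mean, 1])."""
    budget = math.log(t) / count
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```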

Page 31

Outline

- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions

Page 32

Adversarial bandit

Parameters: the number of arms K and the number of rounds n

For each round $t = 1, 2, \dots, n$:

1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$, possibly with the help of an external randomization
2. the adversary chooses a gain vector $g_t = (g_{1,t}, \dots, g_{K,t}) \in [0,1]^K$
3. the forecaster receives and observes only the gain $g_{I_t,t}$

Goal: maximize the cumulative gains obtained. We consider the regret:
$$R_n = \left(\max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^n g_{i,t}\right) - \mathbb{E}\sum_{t=1}^n g_{I_t,t}$$

- In full information, step 3 is replaced by: the forecaster receives $g_{I_t,t}$ and observes the full gain vector $g_t$
- In both settings, the forecaster should use an external randomization to have $o(n)$ regret.

Page 33

Adversarial setting in full information: an optimal policy

- Cumulative reward on $[1, t-1]$: $G_{i,t-1} = \sum_{s=1}^{t-1} g_{i,s}$
- Follow-the-leader, $I_t \in \arg\max_{i\in\{1,\dots,K\}} G_{i,t-1}$, is a bad policy
- An "optimal" policy (sketched in code below) is obtained by considering
$$p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}}$$
- For this policy, $R_n \leq \frac{n\eta}{8} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{8\log K}{n}}$, we have $R_n \leq \sqrt{\frac{n\log K}{2}}$

(Littlestone, Warmuth, 1994; Long, 1996; Bylanger, 1997; Cesa-Bianchi, 1999)
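A sketch of this exponentially weighted forecaster for the full-information setting (the class structure and the max-shift for numerical stability are illustrative choices):

```python
import math
import random

class ExpWeights:
    """Exponentially weighted forecaster: p_{i,t} proportional to exp(eta * G_{i,t-1})."""
    def __init__(self, K, eta):
        self.eta = eta
        self.G = [0.0] * K                  # cumulative gains G_{i,t-1}

    def probabilities(self):
        m = max(self.G)                     # shift by the max for numerical stability
        w = [math.exp(self.eta * (g - m)) for g in self.G]
        total = sum(w)
        return [x / total for x in w]

    def select(self):
        p = self.probabilities()
        return random.choices(range(len(p)), weights=p)[0]

    def update(self, gain_vector):
        """Full information: the whole gain vector g_t is observed."""
        for i, g in enumerate(gain_vector):
            self.G[i] += g
```

With $\eta = \sqrt{8\log K / n}$ this matches the tuning above.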

Page 34

Proof of the regret bound

$$p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}}$$

$$\begin{aligned}
\mathbb{E}\sum_t g_{I_t,t} &= \mathbb{E}\sum_t \sum_i p_{i,t}\,g_{i,t}\\
&= \mathbb{E}\sum_t \left(-\frac{1}{\eta}\log\sum_i p_{i,t}\,e^{\eta\left(g_{i,t} - \sum_j p_{j,t} g_{j,t}\right)} + \frac{1}{\eta}\log\sum_i p_{i,t}\,e^{\eta g_{i,t}}\right)\\
&= \mathbb{E}\sum_t \left(-\frac{1}{\eta}\log \mathbb{E}\,e^{\eta(V_t - \mathbb{E}V_t)} + \frac{1}{\eta}\log\frac{\sum_i e^{\eta G_{i,t}}}{\sum_i e^{\eta G_{i,t-1}}}\right) \qquad \text{where } \mathbb{P}(V_t = g_{i,t}) = p_{i,t}\\
&\geq \mathbb{E}\left(-\sum_t \frac{\eta}{8}\right) + \frac{1}{\eta}\,\mathbb{E}\log\frac{\sum_j e^{\eta G_{j,n}}}{\sum_j e^{\eta G_{j,0}}} \qquad \text{(Hoeffding's lemma, since } V_t \in [0,1])\\
&\geq -\frac{n\eta}{8} + \frac{1}{\eta}\,\mathbb{E}\log\frac{e^{\eta\max_j G_{j,n}}}{K} = -\frac{n\eta}{8} - \frac{\log K}{\eta} + \mathbb{E}\max_j G_{j,n}
\end{aligned}$$

Page 35

Adapting the exponentially weighted forecaster

- In the bandit setting, $G_{i,t-1}$, $i = 1, \dots, K$, are not observed. Trick: estimate them.
- Precisely, $G_{i,t-1}$ is estimated by $\tilde{G}_{i,t-1} = \sum_{s=1}^{t-1} \tilde{g}_{i,s}$ with
$$\tilde{g}_{i,s} = 1 - \frac{1 - g_{i,s}}{p_{i,s}}\,\mathbf{1}_{I_s=i}.$$
Note that $\mathbb{E}_{I_s\sim p_s}\,\tilde{g}_{i,s} = 1 - \sum_{k=1}^K p_{k,s}\,\frac{1 - g_{i,s}}{p_{i,s}}\,\mathbf{1}_{k=i} = g_{i,s}$
$$p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta \tilde{G}_{i,t-1}}}{\sum_{k=1}^K e^{\eta \tilde{G}_{k,t-1}}}$$
- For this policy, $R_n \leq \frac{nK\eta}{2} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{2\log K}{nK}}$, we have $R_n \leq \sqrt{2nK\log K}$

(Auer, Cesa-Bianchi, Freund, Schapire, 1995)
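A sketch of this bandit adaptation (essentially the Exp3 idea) using the gain estimator above; `select` must be called before `update` so that the estimate uses the probabilities the arm was actually drawn from:

```python
import math
import random

class Exp3Style:
    """Exponentially weighted forecaster on the estimated gains
    g_tilde_{i,s} = 1 - (1 - g_{i,s}) / p_{i,s} * 1{I_s = i}."""
    def __init__(self, K, eta):
        self.K = K
        self.eta = eta
        self.G = [0.0] * K                  # estimated cumulative gains
        self.p = [1.0 / K] * K

    def probabilities(self):
        m = max(self.G)
        w = [math.exp(self.eta * (g - m)) for g in self.G]
        s = sum(w)
        return [x / s for x in w]

    def select(self):
        self.p = self.probabilities()
        return random.choices(range(self.K), weights=self.p)[0]

    def update(self, arm, gain):
        # the estimate equals g_{i,s} in expectation over I_s ~ p_s
        for i in range(self.K):
            est = 1.0 if i != arm else 1.0 - (1.0 - gain) / self.p[i]
            self.G[i] += est
```

With $\eta = \sqrt{2\log K / (nK)}$ this matches the tuning above.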

Page 36

Implicitly Normalized Forecaster (Audibert, Bubeck, 2010)

Let $\psi : \mathbb{R}_-^* \to \mathbb{R}_+^*$ be increasing, convex, twice continuously differentiable, and such that $[\frac{1}{K}, 1] \subset \psi(\mathbb{R}_-^*)$.

Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.

For each round $t = 1, 2, \dots$:

- $I_t \sim p_t$
- Compute $p_{t+1} = (p_{1,t+1}, \dots, p_{K,t+1})$ where
$$p_{i,t+1} = \psi\big(\tilde{G}_{i,t} - C_t\big)$$
where $C_t$ is the unique real number such that $\sum_{i=1}^K p_{i,t+1} = 1$ (the normalization step is sketched in code below).
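A sketch of the normalization step for the polynomial potential $\psi(x) = (-\eta x)^{-q}$ of the next page, solving for $C_t$ by bisection; the bracketing interval is a derived assumption, not taken from the slides:

```python
import math

def psi(x, eta, q):
    """Polynomial potential psi(x) = (-eta * x)^(-q), defined for x < 0."""
    return (-eta * x) ** (-q)

def inf_probabilities(G, eta, q=2.0, iters=60):
    """Compute p_i = psi(G_i - C) with C chosen so that the p_i sum to 1.
    With this psi, the sum is decreasing in C, so C is found by bisection."""
    K = len(G)
    lo = max(G)                             # the sum blows up as C -> max(G)+
    hi = max(G) + K ** (1.0 / q) / eta      # here each term <= 1/K, so the sum <= 1
    for _ in range(iters):
        C = (lo + hi) / 2
        if sum(psi(g - C, eta, q) for g in G) > 1.0:
            lo = C
        else:
            hi = C
    C = (lo + hi) / 2
    p = [psi(g - C, eta, q) for g in G]
    s = sum(p)
    return [x / s for x in p]               # final renormalization for safety
```

For $\tilde{G} = (0, \dots, 0)$ this returns the uniform distribution $p_1$, as required.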

Page 37

Minimax policy

- $\psi(x) = \exp(\eta x)$ with $\eta > 0$: this corresponds exactly to the exponentially weighted forecaster
- $\psi(x) = (-\eta x)^{-q}$ with $q > 1$ and $\eta > 0$: this is a new policy which is minimax optimal: for $q = 2$ and $\eta = \sqrt{\frac{2}{n}}$, we have
$$R_n \leq 2\sqrt{2nK}$$
(Audibert, Bubeck, 2010; Audibert, Bubeck, Lugosi, 2011)
while for any strategy, we have
$$\sup R_n \geq \frac{1}{20}\sqrt{nK}$$
(Auer, Cesa-Bianchi, Freund, Schapire, 1995)

Page 38

Outline

- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions
