Thompson Sampling: an asymptotically optimal finite-time analysis
Emilie Kaufmann, Nathaniel Korda and Remi Munos
ALT, October 30th, 2012
E.Kaufmann, N.Korda, R.Munos (INRIA) Thompson Sampling ALT, October 30th, 2012 1 / 23
The multi-armed bandit problem
1 The multi-armed bandit problem
2 From UCB to Thompson Sampling
3 Finite-time analysis of Thompson Sampling
4 A closer look at the fundamental deviation result
5 Some perspectives
E.Kaufmann, N.Korda, R.Munos (INRIA) Thompson Sampling ALT, October 30th, 2012 2 / 23
The multi-armed bandit problem
The stochastic MAB with Bernoulli rewards
K independent arms.

µ1, . . . , µK unknown parameters

(Ya,t)t is i.i.d. with distribution B(µa)

The parameter of the best arm is µ∗ = max_{a=1..K} µa

At time t, the forecaster chooses arm At and gets reward Rt = Y_{At,t}.
Goal: design a strategy (At) minimizing the cumulative regret:

R(T) := Tµ∗ − E[ ∑_{t=1}^T Rt ] = ∑_{a∈A} (µ∗ − µa) E[Na,T]
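As a quick illustration (not from the talk), the Bernoulli environment and the regret formula above can be sketched in Python; the class and function names are ours:

```python
import random

class BernoulliBandit:
    """K independent Bernoulli arms with unknown means mu_1, ..., mu_K."""
    def __init__(self, means, seed=0):
        self.means = means
        self.rng = random.Random(seed)

    def pull(self, a):
        # Reward Y_{a,t} ~ B(mu_a): 1 with probability mu_a, else 0
        return 1 if self.rng.random() < self.means[a] else 0

def regret(means, draws):
    # R(T) = sum_a (mu* - mu_a) * E[N_{a,T}]
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, draws))

# A policy that splits T = 100 draws evenly over arms (0.9, 0.5)
# incurs regret (0.9 - 0.5) * 50 = 20.
print(regret([0.9, 0.5], [50, 50]))
```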
The multi-armed bandit problem Asymptotically optimal bandit algorithm
Asymptotically optimal bandit algorithms
Lai and Robbins' lower bound on the regret of a consistent policy:

µa < µ∗ ⇒ liminf_{T→∞} E[Na,T] / ln(T) ≥ 1 / K(µa, µ∗)

or equivalently

liminf_{T→∞} E[R(T)] / ln(T) ≥ ∑_{a: µa<µ∗} (µ∗ − µa) / K(µa, µ∗)

with

K(p, q) := p ln(p/q) + (1 − p) ln((1 − p)/(1 − q)).

A bandit algorithm is asymptotically optimal if

µa < µ∗ ⇒ limsup_{T→∞} E[Na,T] / ln(T) ≤ 1 / K(µa, µ∗)
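The Bernoulli divergence K(p, q) appearing in these bounds is straightforward to compute; a small sketch (function name ours), with clipping so boundary values stay finite:

```python
import math

def kl_bernoulli(p, q):
    """K(p, q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)), with clipping so
    the boundary cases p or q in {0, 1} stay finite."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Lai-Robbins constant for a suboptimal arm mu_a = 0.5 vs mu* = 0.9:
# a consistent policy must draw it about ln(T)/K(0.5, 0.9) times.
print(1 / kl_bernoulli(0.5, 0.9))
```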
From UCB to Thompson Sampling / Frequentist algorithms
Some successful frequentist algorithms
A family of optimistic index policies is based on an upper confidence bound for the empirical mean of the rewards:

UCB [Auer et al. 02] and variants:

E[Na,T] ≤ (K1 / (2(µa − µ∗)²)) ln(T) + K2, with K1 > 1.

KL-UCB [Cappé, Garivier, Maillard, Stoltz, Munos 11] uses the index:

ua,t = argmax_{x > Sa,t/Na,t} { K(Sa,t/Na,t, x) ≤ (ln(t) + c ln(ln(t))) / Na,t }

For all ε > 0, there exists a constant Kε such that:

E[Na,T] ≤ ((1 + ε) / K(µa, µ∗)) ln(T) + Kε
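Since K(p, ·) is increasing on [p, 1], the KL-UCB index can be computed by bisection; a sketch under these definitions (function names and the exploration constant c = 3 are our choices):

```python
import math

def kl_bernoulli(p, q):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(s, n, t, c=3.0):
    """Largest x >= S/N with n * K(S/N, x) <= ln(t) + c ln(ln(t)),
    found by bisection (K(p, .) is increasing on [p, 1])."""
    p_hat = s / n
    threshold = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / n
    lo, hi = p_hat, 1.0
    for _ in range(60):  # 60 halvings: precision far below float noise
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# After 5 successes in 10 draws at time t = 100, the index sits
# strictly above the empirical mean 0.5 and below 1.
print(kl_ucb_index(5, 10, 100))
```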
From UCB to Thompson Sampling / Bayesian algorithms
A Bayesian view on the MAB
Imagine we are given independent priors on the parameters of each arm:

µa ∼ i.i.d. U([0, 1])

(Ya,t)t is i.i.d. conditionally on µa, with distribution B(µa)

The posterior on arm a at time t is

πa,t = Beta(Sa,t + 1, Na,t − Sa,t + 1).

Bayesian algorithms use this posterior πa,t to choose At.

⇒ We still focus on frequentist guarantees (asymptotic optimality) for Bayesian algorithms.
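The posterior update above is just bookkeeping on the Beta parameters; a minimal sketch (function names ours):

```python
def posterior(s, n):
    """Beta posterior after n draws with s successes, starting from
    the uniform prior U([0,1]) = Beta(1, 1)."""
    return (s + 1, n - s + 1)

def posterior_mean(s, n):
    # Mean of Beta(s+1, n-s+1); shrinks the empirical mean toward 1/2.
    return (s + 1) / (n + 2)

# 7 successes in 10 draws: posterior Beta(8, 4), mean 8/12.
print(posterior(7, 10), posterior_mean(7, 10))
```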
From UCB to Thompson Sampling / Bayesian algorithms
A Bayesian Upper Confidence Bound algorithm
Bayes-UCB [Kaufmann et al. 12] is the index policy associated with

qa,t := Q(1 − 1/(t (ln t)^c), πa,t)

This Bayesian algorithm is asymptotically optimal.
Figure: UCB versus Bayes-UCB
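A sketch of the Bayes-UCB index under these definitions. To stay dependency-free we approximate the Beta quantile Q by a Monte Carlo estimate (an exact Beta inverse CDF would be used in practice); the function name and sample size are ours:

```python
import math
import random

def bayes_ucb_index(s, n, t, c=5, n_samples=20000, seed=0):
    """q_{a,t} = Q(1 - 1/(t (ln t)^c), Beta(s+1, n-s+1)): the posterior
    quantile used as the Bayes-UCB index, estimated from random samples."""
    rng = random.Random(seed)
    level = 1.0 - 1.0 / (t * math.log(t) ** c)
    draws = sorted(rng.betavariate(s + 1, n - s + 1) for _ in range(n_samples))
    return draws[min(int(level * n_samples), n_samples - 1)]

# After 5 successes in 10 draws at t = 100 the index exceeds the
# posterior mean 6/12 but stays below 1.
print(bayes_ucb_index(5, 10, 100))
```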
From UCB to Thompson Sampling / Bayesian algorithms
Thompson Sampling: a new kind of optimism?
A very simple algorithm:
∀a ∈ {1..K}, θa,t ∼ πa,t
At = argmax_a θa,t
Recent interest in this algorithm:

partial analyses proposed [Granmo 2010], [May, Korda, Lee, Leslie 2011]

extensive numerical study beyond the Bernoulli case [Chapelle, Li 2011]

first logarithmic upper bound on the regret [Agrawal, Goyal 2012]
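The two lines above are the whole algorithm; a runnable sketch for the Bernoulli case (the function name and the simulation setup are ours):

```python
import random

def thompson_sampling(means, T, seed=0):
    """Thompson Sampling with uniform priors: sample
    theta_{a,t} ~ Beta(S_a + 1, N_a - S_a + 1), play argmax_a theta_{a,t}."""
    rng = random.Random(seed)
    K = len(means)
    S = [0] * K  # successes per arm
    N = [0] * K  # draws per arm
    for _ in range(T):
        theta = [rng.betavariate(S[a] + 1, N[a] - S[a] + 1) for a in range(K)]
        a = max(range(K), key=theta.__getitem__)
        reward = 1 if rng.random() < means[a] else 0
        S[a] += reward
        N[a] += 1
    return N

# On a two-armed problem the optimal arm quickly dominates the draws.
print(thompson_sampling([0.8, 0.3], T=2000))
```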
Finite-time analysis of Thompson Sampling / Main result
An optimal regret bound for Thompson Sampling
Assume the first arm is the unique optimal arm and let ∆a = µ1 − µa.

Known result [Agrawal, Goyal, 2012]:

E[R(T)] ≤ C ( ∑_{a=2}^K 1/∆a ) ln(T) + oµ(ln(T))

Our improvement:

Theorem 2. ∀ε > 0,

E[R(T)] ≤ (1 + ε) ( ∑_{a=2}^K ∆a / K(µa, µ∗) ) ln(T) + oµ,ε(ln(T))
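Numerically, Theorem 2's leading constant always improves on the known result's ∑ 1/∆a: by Pinsker's inequality K(p, q) ≥ 2(p − q)², so ∆a/K(µa, µ∗) ≤ 1/(2∆a). A sketch (the example means are ours):

```python
import math

def kl_bernoulli(p, q):
    # Bernoulli KL divergence K(p, q); means below are strictly in (0, 1)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

means = [0.9, 0.8, 0.5, 0.4]   # the first arm is optimal
mu_star = max(means)
gaps = [mu_star - mu for mu in means if mu < mu_star]

known_constant = sum(1 / d for d in gaps)                        # sum_a 1/Delta_a
optimal_constant = sum((mu_star - mu) / kl_bernoulli(mu, mu_star)
                       for mu in means if mu < mu_star)          # Theorem 2

print(known_constant, optimal_constant)  # Theorem 2's constant is smaller
```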
Finite-time analysis of Thompson Sampling / Proof: Step 1
Step 1: Decomposition
We adapt an analysis that works for optimistic index policies At = argmax_a la,t:

E[Na,T] ≤ ∑_{t=1}^T P(l1,t < µ1) + ∑_{t=1}^T P(la,t ≥ l1,t > µ1, At = a)

where the first sum is o(ln(T)) and the second is ln(T)/K(µa, µ1) + o(ln(T)).

⇒ This does NOT work for Thompson Sampling.

Our decomposition for Thompson Sampling is

E[Na,T] ≤ ∑_{t=1}^T P(θ1,t ≤ µ1 − √(6 ln t / N1,t)) + ∑_{t=1}^T P(θa,t > µ1 − √(6 ln t / N1,t), At = a)

where the second sum is denoted (∗).
Finite-time analysis of Thompson Sampling / Proof: Step 2
Step 2: Linking quantiles to other known indices
We introduce the following quantile:

qa,t := Q(1 − 1/(t ln(T)), πa,t)

and the corresponding KL-UCB index:

ua,t := argmax_{x > Sa,t/Na,t} { K(Sa,t/Na,t, x) ≤ (ln(t) + ln(ln(T))) / Na,t }

We know from previous work [Kaufmann et al.] that

qa,t < ua,t
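The inequality qa,t < ua,t can be checked numerically; a sketch for one configuration (function names ours, and the Beta quantile is a Monte Carlo approximation, so this is an illustration of the lemma, not a proof of it):

```python
import math
import random

def kl_bernoulli(p, q):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def quantile_q(s, n, t, T, n_samples=20000, seed=0):
    # q_{a,t} = Q(1 - 1/(t ln T), Beta(s+1, n-s+1)), Monte Carlo estimate
    rng = random.Random(seed)
    level = 1.0 - 1.0 / (t * math.log(T))
    draws = sorted(rng.betavariate(s + 1, n - s + 1) for _ in range(n_samples))
    return draws[min(int(level * n_samples), n_samples - 1)]

def index_u(s, n, t, T):
    # u_{a,t}: largest x >= S/N with n K(S/N, x) <= ln t + ln ln T (bisection)
    p_hat = s / n
    threshold = (math.log(t) + math.log(math.log(T))) / n
    lo, hi = p_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# One configuration: 3 successes in 20 draws at t = 50, horizon T = 100.
q, u = quantile_q(3, 20, 50, 100), index_u(3, 20, 50, 100)
print(q, u)  # the quantile sits below the KL-UCB index
```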
Finite-time analysis of Thompson Sampling / Proof: Step 2
Step 2: Linking quantiles to other known indices
Introducing the quantile qa,t:

∑_{t=1}^T P(θa,t > µ1 − √(6 ln t / N1,t), At = a)
  ≤ ∑_{t=1}^T P(qa,t > µ1 − √(6 ln t / N1,t), At = a) + ∑_{t=1}^T P(θa,t > qa,t)

where the last sum is bounded by 2.

Then, with the KL-UCB index ua,t:

∑_{t=1}^T P(θa,t > µ1 − √(6 ln t / N1,t), At = a)
  ≤ ∑_{t=1}^T P(ua,t > µ1 − √(6 ln t / N1,t), At = a) + 2
Finite-time analysis of Thompson Sampling / Proof: Step 2
Final decomposition
The final decomposition is:

E[Na,T] ≤ ∑_{t=1}^T P(θ1,t ≤ µ1 − √(6 ln t / N1,t)) + ∑_{t=1}^T P(ua,t > µ1 − √(6 ln t / N1,t), At = a) + 2

where the first sum is term A and the second is term B.
Finite-time analysis of Thompson Sampling / Proof: Step 3
Step 3: One extra ingredient for bounding terms A and B

We state a fundamental deviation result:

Proposition 1. There exist constants b = b(µ1, µ2) ∈ (0, 1) and Cb < ∞ such that:

∑_{t=1}^∞ P(N1,t ≤ t^b) ≤ Cb.
A closer look at the fundamental deviation result
Understanding the deviation result
Recall the result: there exist constants b = b(µ1, µ2) ∈ (0, 1) and Cb < ∞ such that

∑_{t=1}^∞ P(N1,t ≤ t^b) ≤ Cb.

Where does it come from?

{N1,t ≤ t^b} ⊂ {there exists a time range of length at least t^(1−b) − 1 with no draw of arm 1}

(If arm 1 is drawn at most t^b times among the first t rounds, the pigeonhole principle yields a gap of that length between two consecutive draws.)
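A quick simulation consistent with the proposition (the setup, the exponent b = 0.3 and the cutoff t ≥ 100 are our choices; this is an illustration, not a proof): track N1,t along a Thompson Sampling run and check it stays above t^b.

```python
import random

def ts_optimal_arm_counts(means, T, seed=0):
    """Run Thompson Sampling (uniform priors) and record N_{1,t},
    the number of draws of the optimal arm, after each round t."""
    rng = random.Random(seed)
    K = len(means)
    best = max(range(K), key=means.__getitem__)
    S, N = [0] * K, [0] * K
    counts = []
    for _ in range(T):
        theta = [rng.betavariate(S[a] + 1, N[a] - S[a] + 1) for a in range(K)]
        a = max(range(K), key=theta.__getitem__)
        S[a] += 1 if rng.random() < means[a] else 0
        N[a] += 1
        counts.append(N[best])
    return counts

counts = ts_optimal_arm_counts([0.8, 0.3], T=2000)
# Count rounds t >= 100 at which the deviation N_{1,t} <= t^0.3 occurs.
violations = sum(1 for t, c in enumerate(counts, start=1)
                 if t >= 100 and c <= t ** 0.3)
print(violations)
```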
A closer look at the fundamental deviation result
[Figure: sketch of posterior samples against the levels µ2, µ2 + δ and µ1]
Assume that:

on Ij = [τj, τj + ⌈t^(1−b) − 1⌉] there is no draw of arm 1

there exists Jj ⊂ Ij such that ∀s ∈ Jj, ∀a ≠ 1, θa,s ≤ µ2 + δ

Then:

∀s ∈ Jj, θ1,s ≤ µ2 + δ (since arm 1 is not drawn, its sample cannot be the largest)

⇒ This only happens with small probability.
Some perspectives
Conclusion and perspectives
Thompson Sampling in the Bernoulli setting:

has the same theoretical guarantees as known optimal algorithms (KL-UCB, Bayes-UCB)

and displays excellent empirical performance

The proof we give:

is close to the analysis of optimistic bandit algorithms

also gives a deviation result on the number of draws of optimal arms

Can Thompson Sampling be extended to more general settings?

Contextual bandits ([Agrawal, Goyal, Thompson Sampling for Contextual Bandits with Linear Payoffs, Sept. 2012])

Model-based Bayesian reinforcement learning
Some perspectives
[Figure: regret Rn as a function of time n (log scale) for a 10-arm problem; panels for UCB, UCB-V, DMED, KL-UCB, Bayes-UCB and Thompson Sampling]
Thompson Sampling outperforms other optimal algorithms
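A minimal reproduction of this qualitative finding (a two-arm toy problem with UCB1 as the baseline; the constants and the setup are ours, not the talk's 10-arm experiment):

```python
import math
import random

def run(policy, means, T, seed=0):
    """Run a bandit policy and return the cumulative pseudo-regret."""
    rng = random.Random(seed)
    K = len(means)
    S, N = [0] * K, [0] * K
    mu_star = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        a = policy(S, N, t, rng)
        S[a] += 1 if rng.random() < means[a] else 0
        N[a] += 1
        regret += mu_star - means[a]
    return regret

def ucb1(S, N, t, rng):
    # UCB1: play each arm once, then maximize mean + sqrt(2 ln t / N_a)
    for a in range(len(N)):
        if N[a] == 0:
            return a
    return max(range(len(N)),
               key=lambda a: S[a] / N[a] + math.sqrt(2 * math.log(t) / N[a]))

def thompson(S, N, t, rng):
    # Thompson Sampling with uniform priors
    return max(range(len(N)),
               key=lambda a: rng.betavariate(S[a] + 1, N[a] - S[a] + 1))

means = [0.8, 0.3]
print(run(ucb1, means, 5000), run(thompson, means, 5000))
```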
Some perspectives
Any questions?