Thompson Sampling: an asymptotically optimal finite-time analysis

Emilie Kaufmann, Nathaniel Korda and Rémi Munos (INRIA)

ALT, October 30th, 2012

1 The multi-armed bandit problem

2 From UCB to Thompson Sampling

3 Finite-time analysis of Thompson Sampling

4 A closer look at the fundamental deviation result

5 Some perspectives

The stochastic MAB with Bernoulli rewards

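K independent arms.

$\mu_1, \dots, \mu_K$ unknown parameters

$(Y_{a,t})_t$ is i.i.d. with distribution $\mathcal{B}(\mu_a)$

The parameter of the best arm is $\mu^* = \max_{a=1,\dots,K} \mu_a$

At time $t$, the forecaster chooses arm $A_t$ and gets reward $R_t = Y_{A_t,t}$.

Goal: design a strategy $(A_t)$ minimizing the cumulative regret, where $\mathcal{A}$ is the set of arms and $N_{a,T}$ is the number of draws of arm $a$ up to time $T$:

$$R(T) := T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \sum_{a \in \mathcal{A}} (\mu^* - \mu_a)\, \mathbb{E}[N_{a,T}]$$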

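Asymptotically optimal bandit algorithms

Lai and Robbins' lower bound on the regret of a consistent policy:

$$\mu_a < \mu^* \;\Rightarrow\; \liminf_{T\to\infty} \frac{\mathbb{E}[N_{a,T}]}{\ln T} \geq \frac{1}{K(\mu_a, \mu^*)}$$

or equivalently

$$\liminf_{T\to\infty} \frac{\mathbb{E}[R(T)]}{\ln(T)} \geq \sum_{a:\,\mu_a<\mu^*} \frac{\mu^* - \mu_a}{K(\mu_a, \mu^*)}$$

with

$$K(p, q) := p \ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}.$$

A bandit algorithm is asymptotically optimal if

$$\mu_a < \mu^* \;\Rightarrow\; \limsup_{T\to\infty} \frac{\mathbb{E}[N_{a,T}]}{\ln T} \leq \frac{1}{K(\mu_a, \mu^*)}$$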
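To make the constant in the lower bound concrete, here is a minimal Python sketch that evaluates the Bernoulli divergence K(p, q) and the sum over suboptimal arms. The function names `bernoulli_kl` and `lai_robbins_constant` and the clamping constant are illustrative assumptions, not taken from the talk.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """Bernoulli Kullback-Leibler divergence K(p, q).

    Arguments are clamped away from 0 and 1 (an assumption made here
    only to avoid log(0) in floating point).
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lai_robbins_constant(mus):
    """Sum over suboptimal arms of (mu* - mu_a) / K(mu_a, mu*).

    This is the factor multiplying ln(T) in the regret lower bound.
    """
    best = max(mus)
    return sum((best - mu) / bernoulli_kl(mu, best) for mu in mus if mu < best)
```

For instance, `lai_robbins_constant([0.5, 0.3])` returns the single term 0.2 / K(0.3, 0.5); an asymptotically optimal algorithm has regret growing like this constant times ln(T).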

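Some successful frequentist algorithms

A family of optimistic index policies based on an upper confidence bound for the empirical mean of the rewards:

UCB [Auer et al. 02] and variants:

$$\mathbb{E}[N_{a,T}] \leq \frac{K_1}{2(\mu_a - \mu^*)^2} \ln T + K_2, \quad \text{with } K_1 > 1.$$

KL-UCB [Cappé, Garivier, Maillard, Stoltz, Munos 11] uses the index:

$$u_{a,t} = \operatorname*{argmax}_{x > \frac{S_{a,t}}{N_{a,t}}} \left\{ K\left(\frac{S_{a,t}}{N_{a,t}}, x\right) \leq \frac{\ln(t) + c \ln\ln(t)}{N_{a,t}} \right\}$$

For all $\varepsilon > 0$, there exists a constant $K_\varepsilon$ such that:

$$\mathbb{E}[N_{a,T}] \leq \frac{1 + \varepsilon}{K(\mu_a, \mu^*)} \ln T + K_\varepsilon$$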
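The two bounds can be compared through Pinsker's inequality, a standard fact that is not stated on the slides and is added here only as a reading aid:

```latex
K(\mu_a, \mu^*) \;\geq\; 2(\mu_a - \mu^*)^2
\quad\Longrightarrow\quad
\frac{1+\varepsilon}{K(\mu_a, \mu^*)} \;\leq\; \frac{1+\varepsilon}{2(\mu_a - \mu^*)^2}
```

Since $K_1 > 1$, choosing $\varepsilon \leq K_1 - 1$ shows that the leading constant of the KL-UCB bound is never larger than that of the UCB bound.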

A Bayesian view on the MAB

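Imagine we are given independent priors on the parameters of each arm:

$$\mu_a \overset{\text{i.i.d.}}{\sim} \mathcal{U}([0, 1])$$

$(Y_{a,t})_t$ is i.i.d. conditionally on $\mu_a$ with distribution $\mathcal{B}(\mu_a)$

The posterior on arm $a$ at time $t$ is

$$\pi_{a,t} = \mathrm{Beta}\left(S_{a,t} + 1,\ N_{a,t} - S_{a,t} + 1\right),$$

where $S_{a,t}$ is the sum of the rewards obtained from arm $a$ and $N_{a,t}$ the number of draws of arm $a$ up to time $t$.

Bayesian algorithms use this posterior $\pi_{a,t}$ to choose $A_t$.

⇒ We still focus on frequentist guarantees (asymptotic optimality) for Bayesian algorithms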

A Bayesian Upper Confidence Bound algorithm

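Bayes-UCB [Kaufmann et al. 12] is the index policy associated with

$$q_{a,t} := Q\left(1 - \frac{1}{t\,(\ln t)^c},\ \pi_{a,t}\right)$$

where $Q(\alpha, \pi)$ denotes the quantile of order $\alpha$ of the distribution $\pi$.

This Bayesian algorithm is asymptotically optimal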

Figure: UCB versus Bayes-UCB

Thompson Sampling: a new kind of optimism?

A very simple algorithm:

$$\forall a \in \{1, \dots, K\}, \quad \theta_{a,t} \sim \pi_{a,t}$$

$$A_t = \operatorname{argmax}_a\, \theta_{a,t}$$

Recent interest in this algorithm:

partial analysis proposed [Granmo 2010] [May, Korda, Lee, Leslie 2011]

extensive numerical study beyond the Bernoulli case [Chapelle, Li 2011]

first logarithmic upper bound on the regret [Agrawal, Goyal 2012]
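The sampling rule above can be sketched in a few lines of Python. This is a minimal illustration for Bernoulli arms with uniform priors, not the authors' experimental code; the function name `thompson_sampling`, its arguments, and the use of a seeded `random.Random` are assumptions made for the example. The returned regret is the single-run quantity obtained by replacing E[N_{a,T}] with the observed draw counts.

```python
import random


def thompson_sampling(mus, horizon, seed=0):
    """Run Bernoulli Thompson Sampling with uniform priors on each arm.

    mus: true (normally unknown) arm means, used here only to simulate rewards.
    Returns (draws, regret) where draws[a] is the number of pulls of arm a
    and regret = sum_a (mu* - mu_a) * draws[a] for this single run.
    """
    rng = random.Random(seed)
    k = len(mus)
    successes = [0] * k  # S_{a,t}: sum of rewards collected from arm a
    draws = [0] * k      # N_{a,t}: number of draws of arm a

    for _ in range(horizon):
        # Sample theta_{a,t} from the posterior Beta(S + 1, N - S + 1).
        thetas = [
            rng.betavariate(successes[a] + 1, draws[a] - successes[a] + 1)
            for a in range(k)
        ]
        # Play the arm with the largest sample.
        arm = max(range(k), key=thetas.__getitem__)
        reward = 1 if rng.random() < mus[arm] else 0
        draws[arm] += 1
        successes[arm] += reward

    best = max(mus)
    regret = sum((best - mu) * n for mu, n in zip(mus, draws))
    return draws, regret
```

A call such as `thompson_sampling([0.9, 0.1], 2000)` returns the draw counts and the regret of one simulated run; averaging over many seeds would approximate the expected regret R(T).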

An optimal regret bound for Thompson Sampling

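Assume the first arm is the unique optimal arm and $\Delta_a = \mu_1 - \mu_a$.

Known result [Agrawal, Goyal 2012]:

$$\mathbb{E}[R(T)] \leq C \left( \sum_{a=2}^{K} \frac{1}{\Delta_a} \right) \ln(T) + o_{\mu}(\ln(T))$$

Our improvement:

Theorem 2. $\forall \varepsilon > 0$,

$$\mathbb{E}[R(T)] \leq (1 + \varepsilon) \left( \sum_{a=2}^{K} \frac{\Delta_a}{K(\mu_a, \mu^*)} \right) \ln(T) + o_{\mu,\varepsilon}(\ln(T))$$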

Step 1: Decomposition

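We adapt an analysis working for optimistic index policies:

$$A_t = \operatorname{argmax}_a\, l_{a,t}$$

$$\mathbb{E}[N_{a,T}] \leq \underbrace{\sum_{t=1}^{T} P\left(l_{1,t} < \mu_1\right)}_{o(\ln(T))} + \underbrace{\sum_{t=1}^{T} P\left(l_{a,t} \geq l_{1,t} > \mu_1,\ A_t = a\right)}_{\ln(T)/K(\mu_a, \mu_1) + o(\ln(T))}$$

⇒ Does NOT work for Thompson Sampling

Our decomposition for Thompson Sampling is

$$\mathbb{E}[N_{a,T}] \leq \sum_{t=1}^{T} P\left(\theta_{1,t} \leq \mu_1 - \sqrt{\frac{6 \ln t}{N_{1,t}}}\right) + \underbrace{\sum_{t=1}^{T} P\left(\theta_{a,t} > \mu_1 - \sqrt{\frac{6 \ln t}{N_{1,t}}},\ A_t = a\right)}_{(*)}$$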

Step 2: Linking quantiles to other known indices

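We introduce the following quantile:

$$q_{a,t} := Q\left(1 - \frac{1}{t \ln(T)},\ \pi_{a,t}\right)$$

And the corresponding KL-UCB index

$$u_{a,t} := \operatorname*{argmax}_{x > \frac{S_{a,t}}{N_{a,t}}} \left\{ K\left(\frac{S_{a,t}}{N_{a,t}}, x\right) \leq \frac{\ln(t) + \ln(\ln(T))}{N_{a,t}} \right\}$$

We know from previous work [Kaufmann et al.] that

$$q_{a,t} < u_{a,t}$$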
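The inequality between the posterior quantile and the KL-UCB index can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, both indices are found by plain bisection, and the Beta CDF uses the classical identity P(Beta(s+1, n-s+1) ≤ x) = P(Binomial(n+1, x) ≥ s+1), valid for integer s ≤ n.

```python
import math


def bernoulli_kl(p, q):
    """Bernoulli KL divergence K(p, q), clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def beta_posterior_cdf(x, s, n):
    """CDF at x of Beta(s + 1, n - s + 1), computed as P(Binomial(n + 1, x) >= s + 1)."""
    m = n + 1
    return sum(
        math.comb(m, j) * x**j * (1.0 - x) ** (m - j) for j in range(s + 1, m + 1)
    )


def bisect_increasing(f, target, lo, hi, iters=60):
    """Approximate the largest x in [lo, hi] with f(x) <= target, f non-decreasing."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo


def posterior_quantile(s, n, t, horizon):
    """q_{a,t}: quantile of order 1 - 1/(t ln T) of the Beta posterior."""
    level = 1.0 - 1.0 / (t * math.log(horizon))
    return bisect_increasing(lambda x: beta_posterior_cdf(x, s, n), level, 0.0, 1.0)


def kl_ucb_index(s, n, t, horizon):
    """u_{a,t}: largest x above s/n with K(s/n, x) <= (ln t + ln ln T) / n."""
    mean = s / n
    budget = (math.log(t) + math.log(math.log(horizon))) / n
    return bisect_increasing(lambda x: bernoulli_kl(mean, x), budget, mean, 1.0)
```

With s rewards out of n draws at time t and horizon T, comparing `posterior_quantile(s, n, t, T)` against `kl_ucb_index(s, n, t, T)` gives a quick sanity check of the ordering for specific values.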

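Introducing the quantile $q_{a,t}$:

$$\sum_{t=1}^{T} P\left(\theta_{a,t} > \mu_1 - \sqrt{\frac{6 \ln t}{N_{1,t}}},\ A_t = a\right) \leq \sum_{t=1}^{T} P\left(q_{a,t} > \mu_1 - \sqrt{\frac{6 \ln t}{N_{1,t}}},\ A_t = a\right) + \underbrace{\sum_{t=1}^{T} P\left(\theta_{a,t} > q_{a,t}\right)}_{\leq 2}$$

Then the KL-UCB index $u_{a,t}$:

$$\sum_{t=1}^{T} P\left(\theta_{a,t} > \mu_1 - \sqrt{\frac{6 \ln t}{N_{1,t}}},\ A_t = a\right) \leq \sum_{t=1}^{T} P\left(u_{a,t} > \mu_1 - \sqrt{\frac{6 \ln t}{N_{1,t}}},\ A_t = a\right) + 2$$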
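The constant 2 bounding the sum of P(θ_{a,t} > q_{a,t}) can be checked directly. A short derivation, assuming T ≥ 3 (so that ln T ≥ 1) and using that the Beta posterior is continuous, so the sample θ_{a,t} exceeds the quantile of order 1 − 1/(t ln T) with probability exactly 1/(t ln T):

```latex
\sum_{t=1}^{T} P\left(\theta_{a,t} > q_{a,t}\right)
  = \sum_{t=1}^{T} \frac{1}{t \ln T}
  \leq \frac{1 + \ln T}{\ln T}
  \leq 2
```

The middle step uses the harmonic-sum bound $\sum_{t=1}^{T} 1/t \leq 1 + \ln T$.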

Final decomposition

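The final decomposition is:

$$\mathbb{E}[N_{a,T}] \leq \underbrace{\sum_{t=1}^{T} P\left(\theta_{1,t} \leq \mu_1 - \sqrt{\frac{6 \ln t}{N_{1,t}}}\right)}_{A} + \underbrace{\sum_{t=1}^{T} P\left(u_{a,t} > \mu_1 - \sqrt{\frac{6 \ln t}{N_{1,t}}},\ A_t = a\right)}_{B} + 2$$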

Step 3: One extra ingredient for bounding terms A and B

We state a fundamental deviation result:

Proposition 1. There exist constants $b = b(\mu_1, \mu_2) \in (0, 1)$ and $C_b < \infty$ such that:

$$\sum_{t=1}^{\infty} P\left(N_{1,t} \leq t^b\right) \leq C_b.$$

Understanding the deviation result

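Recall the result:

There exist constants $b = b(\mu_1, \mu_2) \in (0, 1)$ and $C_b < \infty$ such that

$$\sum_{t=1}^{\infty} P\left(N_{1,t} \leq t^b\right) \leq C_b.$$

Where does it come from?

$$\left\{N_{1,t} \leq t^b\right\} \subseteq \left\{\text{there exists a time range of length at least } t^{1-b} - 1 \text{ with no draw of arm 1}\right\}$$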

[Figure: plot over [0, 1] marking µ2, µ2 + δ and µ1]

Assume that:

on $I_j = [\tau_j,\ \tau_j + \lceil t^{1-b} - 1 \rceil]$ there is no draw of arm 1

there exists $J_j \subset I_j$ such that $\forall s \in J_j$, $\forall a \neq 1$, $\theta_{a,s} \leq \mu_2 + \delta$

Then:

$$\forall s \in J_j, \quad \theta_{1,s} \leq \mu_2 + \delta$$

⇒ This only happens with small probability

Conclusion and perspectives

Thompson Sampling in the Bernoulli setting:

has the same theoretical guarantees as known optimal algorithms (KL-UCB, Bayes-UCB)

and displays excellent empirical performance

The proof we give:

is close to the analysis of optimistic bandit algorithms

also gives a deviation result on the number of draws of optimal arms

Can Thompson Sampling be extended to more general settings?

Contextual bandit ([Agrawal, Goyal, Thompson Sampling for Contextual Bandits with Linear Payoffs, Sept. 2012])

Model-based Bayesian reinforcement learning

Figure: Regret Rn as a function of time n (on a log scale) for a 10-arm problem. Panels: UCB, UCB-V, DMED, KL-UCB, Bayes-UCB, Thompson.

Thompson Sampling outperforms other optimal algorithms

Any questions?
