Thompson Sampling: an asymptotically optimal finite-time analysis
Emilie Kaufmann, Nathaniel Korda and Remi Munos
ALT, October 30th, 2012
E.Kaufmann, N.Korda, R.Munos (INRIA) Thompson Sampling ALT, October 30th, 2012 1 / 23
The multi-armed bandit problem
1 The multi-armed bandit problem
2 From UCB to Thompson Sampling
3 Finite-time analysis of Thompson Sampling
4 A closer look at the fundamental deviation result
5 Some perspectives
E.Kaufmann, N.Korda, R.Munos (INRIA) Thompson Sampling ALT, October 30th, 2012 2 / 23
The multi-armed bandit problem
The stochastic MAB with Bernoulli rewards
K independent arms.

µ1, . . . , µK unknown parameters

(Ya,t)t is i.i.d. with distribution B(µa)

The parameter of the best arm is µ∗ = max_{a=1..K} µa

At time t, the forecaster chooses arm At and gets reward Rt = Y_{At,t}.
Goal: design a strategy (At) minimizing the cumulative regret:

R(T) := Tµ∗ − E[ ∑_{t=1}^T Rt ] = ∑_{a∈A} (µ∗ − µa) E[Na,T]
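As a quick illustration (not from the talk), the Bernoulli environment and the regret formula above can be sketched in Python; the class and function names are ours:

```python
import random

class BernoulliBandit:
    """K independent Bernoulli arms with unknown means mu_1, ..., mu_K."""
    def __init__(self, means, seed=0):
        self.means = means
        self.rng = random.Random(seed)

    def pull(self, a):
        # Reward Y_{a,t} ~ B(mu_a): 1 with probability mu_a, else 0
        return 1 if self.rng.random() < self.means[a] else 0

def regret(means, draws):
    # R(T) = sum_a (mu* - mu_a) * E[N_{a,T}]
    mu_star = max(means)
    return sum((mu_star - mu) * n for mu, n in zip(means, draws))

# A policy that splits T = 100 draws evenly over arms (0.9, 0.5)
# incurs regret (0.9 - 0.5) * 50 = 20.
print(regret([0.9, 0.5], [50, 50]))
```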
The multi-armed bandit problem Asymptotically optimal bandit algorithm
Asymptotically optimal bandit algorithms
Lai and Robbins' lower bound on the regret of a consistent policy:

µa < µ∗ ⇒ liminf_{T→∞} E[Na,T] / ln(T) ≥ 1 / K(µa, µ∗)

or equivalently

liminf_{T→∞} E[R(T)] / ln(T) ≥ ∑_{a: µa<µ∗} (µ∗ − µa) / K(µa, µ∗)

with

K(p, q) := p ln(p/q) + (1 − p) ln((1 − p)/(1 − q)).

A bandit algorithm is asymptotically optimal if

µa < µ∗ ⇒ limsup_{T→∞} E[Na,T] / ln(T) ≤ 1 / K(µa, µ∗)
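The Bernoulli divergence K(p, q) appearing in these bounds is straightforward to compute; a small sketch (function name ours), with clipping so boundary values stay finite:

```python
import math

def kl_bernoulli(p, q):
    """K(p, q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)), with clipping so
    the boundary cases p or q in {0, 1} stay finite."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Lai-Robbins constant for a suboptimal arm mu_a = 0.5 vs mu* = 0.9:
# a consistent policy must draw it about ln(T)/K(0.5, 0.9) times.
print(1 / kl_bernoulli(0.5, 0.9))
```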
From UCB to Thompson Sampling / Frequentist algorithms
Some successful frequentist algorithms
A family of optimistic index policies is based on an upper confidence bound for the empirical mean of the rewards:

UCB [Auer et al. 02] and variants:

E[Na,T] ≤ (K1 / (2(µa − µ∗)²)) ln(T) + K2, with K1 > 1.

KL-UCB [Cappé, Garivier, Maillard, Stoltz, Munos 11] uses the index:

ua,t = argmax_{x > Sa,t/Na,t} { K(Sa,t/Na,t, x) ≤ (ln(t) + c ln(ln(t))) / Na,t }

For all ε > 0, there exists a constant Kε such that:

E[Na,T] ≤ ((1 + ε) / K(µa, µ∗)) ln(T) + Kε
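Since K(p, ·) is increasing on [p, 1], the KL-UCB index can be computed by bisection; a sketch under these definitions (function names and the exploration constant c = 3 are our choices):

```python
import math

def kl_bernoulli(p, q):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(s, n, t, c=3.0):
    """Largest x >= S/N with n * K(S/N, x) <= ln(t) + c ln(ln(t)),
    found by bisection (K(p, .) is increasing on [p, 1])."""
    p_hat = s / n
    threshold = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / n
    lo, hi = p_hat, 1.0
    for _ in range(60):  # 60 halvings: precision far below float noise
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# After 5 successes in 10 draws at time t = 100, the index sits
# strictly above the empirical mean 0.5 and below 1.
print(kl_ucb_index(5, 10, 100))
```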
From UCB to Thompson Sampling / Bayesian algorithms
A Bayesian view on the MAB
Imagine we are given independent priors on the parameters of each arm:

µa ∼ i.i.d. U([0, 1])

(Ya,t)t is i.i.d. conditionally on µa, with distribution B(µa)

The posterior on arm a at time t is

πa,t = Beta(Sa,t + 1, Na,t − Sa,t + 1).

Bayesian algorithms use this posterior πa,t to choose At.

⇒ We still focus on frequentist guarantees (asymptotic optimality) for Bayesian algorithms.
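The posterior update above is just bookkeeping on the Beta parameters; a minimal sketch (function names ours):

```python
def posterior(s, n):
    """Beta posterior after n draws with s successes, starting from
    the uniform prior U([0,1]) = Beta(1, 1)."""
    return (s + 1, n - s + 1)

def posterior_mean(s, n):
    # Mean of Beta(s+1, n-s+1); shrinks the empirical mean toward 1/2.
    return (s + 1) / (n + 2)

# 7 successes in 10 draws: posterior Beta(8, 4), mean 8/12.
print(posterior(7, 10), posterior_mean(7, 10))
```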
From UCB to Thompson Sampling / Bayesian algorithms
A Bayesian Upper Confidence Bound algorithm
Bayes-UCB [Kaufmann et al. 12] is the index policy associated with

qa,t := Q(1 − 1/(t (ln t)^c), πa,t)

This Bayesian algorithm is asymptotically optimal.
Figure: UCB versus Bayes-UCB
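A sketch of the Bayes-UCB index under these definitions. To stay dependency-free we approximate the Beta quantile Q by a Monte Carlo estimate (an exact Beta inverse CDF would be used in practice); the function name and sample size are ours:

```python
import math
import random

def bayes_ucb_index(s, n, t, c=5, n_samples=20000, seed=0):
    """q_{a,t} = Q(1 - 1/(t (ln t)^c), Beta(s+1, n-s+1)): the posterior
    quantile used as the Bayes-UCB index, estimated from random samples."""
    rng = random.Random(seed)
    level = 1.0 - 1.0 / (t * math.log(t) ** c)
    draws = sorted(rng.betavariate(s + 1, n - s + 1) for _ in range(n_samples))
    return draws[min(int(level * n_samples), n_samples - 1)]

# After 5 successes in 10 draws at t = 100 the index exceeds the
# posterior mean 6/12 but stays below 1.
print(bayes_ucb_index(5, 10, 100))
```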
From UCB to Thompson Sampling / Bayesian algorithms
Thompson Sampling: a new kind of optimism?
A very simple algorithm:
∀a ∈ {1..K}, θa,t ∼ πa,t
At = argmax_a θa,t
Recent interest in this algorithm:

partial analyses proposed [Granmo 2010], [May, Korda, Lee, Leslie 2011]

extensive numerical study beyond the Bernoulli case [Chapelle, Li 2011]

first logarithmic upper bound on the regret [Agrawal, Goyal 2012]
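The two lines above are the whole algorithm; a runnable sketch for the Bernoulli case (the function name and the simulation setup are ours):

```python
import random

def thompson_sampling(means, T, seed=0):
    """Thompson Sampling with uniform priors: sample
    theta_{a,t} ~ Beta(S_a + 1, N_a - S_a + 1), play argmax_a theta_{a,t}."""
    rng = random.Random(seed)
    K = len(means)
    S = [0] * K  # successes per arm
    N = [0] * K  # draws per arm
    for _ in range(T):
        theta = [rng.betavariate(S[a] + 1, N[a] - S[a] + 1) for a in range(K)]
        a = max(range(K), key=theta.__getitem__)
        reward = 1 if rng.random() < means[a] else 0
        S[a] += reward
        N[a] += 1
    return N

# On a two-armed problem the optimal arm quickly dominates the draws.
print(thompson_sampling([0.8, 0.3], T=2000))
```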
Finite-time analysis of Thompson Sampling / Main result
An optimal regret bound for Thompson Sampling
Assume the first arm is the unique optimal arm and let ∆a = µ1 − µa.

Known result [Agrawal, Goyal, 2012]:

E[R(T)] ≤ C ( ∑_{a=2}^K 1/∆a ) ln(T) + oµ(ln(T))

Our improvement:

Theorem 2. ∀ε > 0,

E[R(T)] ≤ (1 + ε) ( ∑_{a=2}^K ∆a / K(µa, µ∗) ) ln(T) + oµ,ε(ln(T))
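Numerically, Theorem 2's leading constant always improves on the known result's ∑ 1/∆a: by Pinsker's inequality K(p, q) ≥ 2(p − q)², so ∆a/K(µa, µ∗) ≤ 1/(2∆a). A sketch (the example means are ours):

```python
import math

def kl_bernoulli(p, q):
    # Bernoulli KL divergence K(p, q); means below are strictly in (0, 1)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

means = [0.9, 0.8, 0.5, 0.4]   # the first arm is optimal
mu_star = max(means)
gaps = [mu_star - mu for mu in means if mu < mu_star]

known_constant = sum(1 / d for d in gaps)                        # sum_a 1/Delta_a
optimal_constant = sum((mu_star - mu) / kl_bernoulli(mu, mu_star)
                       for mu in means if mu < mu_star)          # Theorem 2

print(known_constant, optimal_constant)  # Theorem 2's constant is smaller
```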
Finite-time analysis of Thompson Sampling / Proof: Step 1
Step 1: Decomposition
We adapt an analysis that works for optimistic index policies At = argmax_a la,t:

E[Na,T] ≤ ∑_{t=1}^T P(l1,t < µ1) + ∑_{t=1}^T P(la,t ≥ l1,t > µ1, At = a)

where the first sum is o(ln(T)) and the second is ln(T)/K(µa, µ1) + o(ln(T)).

⇒ This does NOT work for Thompson Sampling.

Our decomposition for Thompson Sampling is

E[Na,T] ≤ ∑_{t=1}^T P(θ1,t ≤ µ1 − √(6 ln t / N1,t)) + ∑_{t=1}^T P(θa,t > µ1 − √(6 ln t / N1,t), At = a)

where the second sum is denoted (∗).
Finite-time analysis of Thompson Sampling / Proof: Step 2
Step 2: Linking quantiles to other known indices
We introduce the following quantile:

qa,t := Q(1 − 1/(t ln(T)), πa,t)

and the corresponding KL-UCB index:

ua,t := argmax_{x > Sa,t/Na,t} { K(Sa,t/Na,t, x) ≤ (ln(t) + ln(ln(T))) / Na,t }

We know from previous work [Kaufmann et al.] that

qa,t < ua,t
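The inequality qa,t < ua,t can be checked numerically; a sketch for one configuration (function names ours, and the Beta quantile is a Monte Carlo approximation, so this is an illustration of the lemma, not a proof of it):

```python
import math
import random

def kl_bernoulli(p, q):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def quantile_q(s, n, t, T, n_samples=20000, seed=0):
    # q_{a,t} = Q(1 - 1/(t ln T), Beta(s+1, n-s+1)), Monte Carlo estimate
    rng = random.Random(seed)
    level = 1.0 - 1.0 / (t * math.log(T))
    draws = sorted(rng.betavariate(s + 1, n - s + 1) for _ in range(n_samples))
    return draws[min(int(level * n_samples), n_samples - 1)]

def index_u(s, n, t, T):
    # u_{a,t}: largest x >= S/N with n K(S/N, x) <= ln t + ln ln T (bisection)
    p_hat = s / n
    threshold = (math.log(t) + math.log(math.log(T))) / n
    lo, hi = p_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# One configuration: 3 successes in 20 draws at t = 50, horizon T = 100.
q, u = quantile_q(3, 20, 50, 100), index_u(3, 20, 50, 100)
print(q, u)  # the quantile sits below the KL-UCB index
```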
Finite-time analysis of Thompson Sampling / Proof: Step 2
Step 2: Linking quantiles to other known indices
Introducing the quantile qa,t:

∑_{t=1}^T P(θa,t > µ1 − √(6 ln t / N1,t), At = a)
  ≤ ∑_{t=1}^T P(qa,t > µ1 − √(6 ln t / N1,t), At = a) + ∑_{t=1}^T P(θa,t > qa,t)

where the last sum is bounded by 2.

Then, with the KL-UCB index ua,t:

∑_{t=1}^T P(θa,t > µ1 − √(6 ln t / N1,t), At = a)
  ≤ ∑_{t=1}^T P(ua,t > µ1 − √(6 ln t / N1,t), At = a) + 2
Finite-time analysis of Thompson Sampling / Proof: Step 2
Final decomposition
The final decomposition is:

E[Na,T] ≤ ∑_{t=1}^T P(θ1,t ≤ µ1 − √(6 ln t / N1,t)) + ∑_{t=1}^T P(ua,t > µ1 − √(6 ln t / N1,t), At = a) + 2

where the first sum is term A and the second is term B.
Finite-time analysis of Thompson Sampling / Proof: Step 3
Step 3: One extra ingredient for bounding terms A and B

We state a fundamental deviation result:

Proposition 1. There exist constants b = b(µ1, µ2) ∈ (0, 1) and Cb < ∞ such that:

∑_{t=1}^∞ P(N1,t ≤ t^b) ≤ Cb.
A closer look at the fundamental deviation result
Understanding the deviation result
Recall the result: there exist constants b = b(µ1, µ2) ∈ (0, 1) and Cb < ∞ such that

∑_{t=1}^∞ P(N1,t ≤ t^b) ≤ Cb.

Where does it come from?

{N1,t ≤ t^b} ⊂ {there exists a time range of length at least t^(1−b) − 1 with no draw of arm 1}

(If arm 1 is drawn at most t^b times among the first t rounds, the pigeonhole principle yields a gap of that length between two consecutive draws.)
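A quick simulation consistent with the proposition (the setup, the exponent b = 0.3 and the cutoff t ≥ 100 are our choices; this is an illustration, not a proof): track N1,t along a Thompson Sampling run and check it stays above t^b.

```python
import random

def ts_optimal_arm_counts(means, T, seed=0):
    """Run Thompson Sampling (uniform priors) and record N_{1,t},
    the number of draws of the optimal arm, after each round t."""
    rng = random.Random(seed)
    K = len(means)
    best = max(range(K), key=means.__getitem__)
    S, N = [0] * K, [0] * K
    counts = []
    for _ in range(T):
        theta = [rng.betavariate(S[a] + 1, N[a] - S[a] + 1) for a in range(K)]
        a = max(range(K), key=theta.__getitem__)
        S[a] += 1 if rng.random() < means[a] else 0
        N[a] += 1
        counts.append(N[best])
    return counts

counts = ts_optimal_arm_counts([0.8, 0.3], T=2000)
# Count rounds t >= 100 at which the deviation N_{1,t} <= t^0.3 occurs.
violations = sum(1 for t, c in enumerate(counts, start=1)
                 if t >= 100 and c <= t ** 0.3)
print(violations)
```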
A closer look at the fundamental deviation result
[Figure: sketch of posterior samples against the levels µ2, µ2 + δ and µ1]
Assume that:

on Ij = [τj, τj + ⌈t^(1−b) − 1⌉] there is no draw of arm 1

there exists Jj ⊂ Ij such that ∀s ∈ Jj, ∀a ≠ 1, θa,s ≤ µ2 + δ

Then:

∀s ∈ Jj, θ1,s ≤ µ2 + δ (since arm 1 is not drawn, its sample cannot be the largest)

⇒ This only happens with small probability.
Some perspectives
Conclusion and perspectives
Thompson Sampling in the Bernoulli setting:

has the same theoretical guarantees as known optimal algorithms (KL-UCB, Bayes-UCB)

and displays excellent empirical performance

The proof we give:

is close to the analysis of optimistic bandit algorithms

also gives a deviation result on the number of draws of optimal arms

Can Thompson Sampling be extended to more general settings?

Contextual bandits ([Agrawal, Goyal, Thompson Sampling for Contextual Bandits with Linear Payoffs, Sept. 2012])

Model-based Bayesian reinforcement learning
Some perspectives
[Figure: regret Rn as a function of time n (log scale) for a 10-arm problem; panels for UCB, UCB-V, DMED, KL-UCB, Bayes-UCB and Thompson Sampling]
Thompson Sampling outperforms other optimal algorithms
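A minimal reproduction of this qualitative finding (a two-arm toy problem with UCB1 as the baseline; the constants and the setup are ours, not the talk's 10-arm experiment):

```python
import math
import random

def run(policy, means, T, seed=0):
    """Run a bandit policy and return the cumulative pseudo-regret."""
    rng = random.Random(seed)
    K = len(means)
    S, N = [0] * K, [0] * K
    mu_star = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        a = policy(S, N, t, rng)
        S[a] += 1 if rng.random() < means[a] else 0
        N[a] += 1
        regret += mu_star - means[a]
    return regret

def ucb1(S, N, t, rng):
    # UCB1: play each arm once, then maximize mean + sqrt(2 ln t / N_a)
    for a in range(len(N)):
        if N[a] == 0:
            return a
    return max(range(len(N)),
               key=lambda a: S[a] / N[a] + math.sqrt(2 * math.log(t) / N[a]))

def thompson(S, N, t, rng):
    # Thompson Sampling with uniform priors
    return max(range(len(N)),
               key=lambda a: rng.betavariate(S[a] + 1, N[a] - S[a] + 1))

means = [0.8, 0.3]
print(run(ucb1, means, 5000), run(thompson, means, 5000))
```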
Some perspectives
Any questions?