



Near-optimal Regret Bounds for Thompson Sampling¹

SHIPRA AGRAWAL, Microsoft Research²

NAVIN GOYAL, Microsoft Research

Thompson Sampling (TS) is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated that it has favorable empirical performance compared to state-of-the-art methods. In this paper, a novel and almost tight martingale-based regret analysis for Thompson Sampling is presented. Our technique simultaneously yields both problem-dependent and problem-independent bounds: (1) the first near-optimal problem-independent bound of $O(\sqrt{NT \ln T})$ on the expected regret, and (2) the optimal problem-dependent bound of $(1+\epsilon)\sum_{i} \frac{\ln T}{d(\mu_i,\mu_1)}\Delta_i + O\!\left(\frac{N}{\epsilon^2}\right)$ on the expected regret (this bound was first proven by [Kaufmann et al. 2012b]).

Our technique is conceptually simple, and easily extends to distributions other than the Beta distribution used in the original TS algorithm. For the version of TS that uses Gaussian priors, we prove a problem-independent bound of $O(\sqrt{NT \ln N})$ on the expected regret, and show the optimality of this bound by providing a matching lower bound. This is the first lower bound on the performance of a natural version of Thompson Sampling that is away from the general lower bound of $\Omega(\sqrt{NT})$ for the multi-armed bandit problem.

1. INTRODUCTION

The Multi-Armed Bandit problem (MAB) models the exploration/exploitation trade-off inherent in sequential decision problems. Many versions and generalizations of MAB have been studied in the literature; in this paper we will consider a basic and well-studied version of this problem: the stochastic multi-armed bandit problem. Among the many algorithms available for stochastic MAB, some popular ones include the Upper Confidence Bound (UCB) family of algorithms (e.g., [Lai and Robbins 1985; Auer et al. 2002], and more recently [Audibert and Bubeck 2009; Garivier and Cappé 2011; Maillard et al. 2011; Kaufmann et al. 2012a]), which have good theoretical guarantees, and the algorithm of [Gittins 1989], which gives the optimal strategy in a Bayesian setting with known priors and geometrically time-discounted rewards. In one of the earliest works on stochastic MAB, [Thompson 1933] proposed a natural randomized Bayesian algorithm to minimize regret. The basic idea is to assume a simple prior distribution on the parameters of the reward distribution of every arm, and at any time step, play an arm according to its posterior probability of being the best arm. This algorithm is known as Thompson Sampling (TS), and it is a member of the family of randomized probability matching algorithms. TS is a natural algorithm: the same idea has been rediscovered many times independently in the context of reinforcement learning, e.g., in [Wyatt 1997; Strens 2000; Ortega and Braun 2010].

Recently, TS has attracted considerable attention. Several studies (e.g., [Granmo 2010; Scott 2010; Graepel et al. 2010; Chapelle and Li 2011; May and Leslie 2011; Kaufmann et al. 2012b]) have empirically demonstrated the efficacy of TS. Despite being easy to implement, competitive with state-of-the-art methods, and widely used in practice, TS lacked a strong theoretical analysis until very recently. [Granmo 2010; May et al. 2011] provide weak guarantees, namely, a bound of o(T) on the expected regret in time T. Significant progress was made in the more recent work of [Agrawal and Goyal 2012] and [Kaufmann et al. 2012b]. In [Agrawal and Goyal 2012], the first logarithmic bound on the expected regret of TS was proven. [Kaufmann et al. 2012b] provided a bound

¹ A preliminary version of this paper was presented at AISTATS 2013 [Agrawal and Goyal 2013a].
² Present address: Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027.


that matches the asymptotic lower bound of [Lai and Robbins 1985] for this problem. However, both of these bounds were problem-dependent, i.e., the regret bounds are logarithmic in the time horizon T when the problem parameters, namely the mean rewards for each arm and their differences, are assumed to be constants. The problem-independent bounds implied by these works were far from optimal. Obtaining a problem-independent bound that is close to the lower bound of $\Omega(\sqrt{NT})$ was also posed as an open problem by [Li and Chapelle 2012].

In this paper, we give a regret analysis for TS that provides both optimal problem-dependent and near-optimal problem-independent regret bounds. Our novel martingale-based analysis technique is conceptually simple and arguably simpler than the previous work. Our technique easily extends to distributions other than the Beta distribution, and it also extends to the more general contextual bandits setting [Agrawal and Goyal 2013b].

Before stating our results, we describe the stochastic multi-armed bandit problem and TS formally.

1.1. The stochastic multi-armed bandit problem

We consider the stochastic multi-armed bandit problem: We are given a slot machine with N arms; at each time step t = 1, 2, 3, . . ., one of the N arms must be chosen to be played. Each arm i, when played, yields a random real-valued reward according to some fixed unknown distribution associated with arm i, with support in [0, 1]. The random rewards obtained from playing an arm repeatedly are i.i.d. and independent of the plays of the other arms. The reward is observed immediately after playing the arm.

An algorithm for stochastic MAB must decide which arm to play at each time step t, based on the outcomes of the previous t − 1 plays. Let µi denote the (unknown) expected reward for arm i. A popular goal in designing algorithms for stochastic MAB is to maximize the expected total reward at time T, i.e., $E[\sum_{t=1}^{T}\mu_{i(t)}]$, where i(t) is the arm played in step t, and the expectation is over the random choices of i(t) made by the algorithm. It is common to work with the equivalent measure of expected total regret: the amount we lose because of not playing the optimal arm in each step. To formally define regret, let us introduce some notation. Let µ∗ := maxi µi, and ∆i := µ∗ − µi. Let ki(t) denote the number of times arm i has been played up to step t − 1; thus ki(t) is a random variable. Then the expected total regret in time T is given by

$$E[R(T)] = E\left[\sum_{t=1}^{T}\big(\mu^* - \mu_{i(t)}\big)\right] = \sum_i \Delta_i \cdot E[k_i(T+1)].$$
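As a small illustration of this decomposition (a sketch with made-up means and play counts, not values from the paper), the regret of a run can be computed directly from the per-arm play counts:

```python
# Sketch: expected regret via the decomposition E[R(T)] = sum_i Delta_i * E[k_i(T+1)].
# The means and play counts below are illustrative placeholders, not values from the paper.
mu = [0.9, 0.8, 0.5]                 # true means; the first arm is optimal
plays = [970, 20, 10]                # hypothetical play counts of each arm after T = 1000 steps
mu_star = max(mu)
deltas = [mu_star - m for m in mu]   # gaps Delta_i
regret = sum(d * k for d, k in zip(deltas, plays))
print(regret)                        # 0.0*970 + 0.1*20 + 0.4*10 = 6.0
```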

To define the different notions of expected regret used in this paper, let E[R(T, Θ)] denote the expected regret for a MAB instance Θ, which is fully specified by the number of arms and the distributions for the arms. Fixing the problem instance fixes the reward distributions, and therefore the values of the means µi, i = 1, . . . , N. Then, problem-dependent bounds on regret are bounds on E[R(T, Θ)] for every problem instance Θ, in terms of T, µi, i = 1, . . . , N, and possibly other distribution parameters associated with Θ. Problem-independent bounds are bounds on the worst-case expected regret as a function of the number of arms N and time T, i.e.,

$$\max_{\Theta}\ E[R(T,\Theta)],$$

where the maximization is over MAB instances with N arms. These are the notions of regret considered in most classic works on the UCB algorithm for the multi-armed bandit problem, e.g., [Auer et al. 2002; Lai and Robbins 1985], and therefore our bounds are directly comparable with those available for UCB.


[Auer et al. 2002] provides an $O\!\left(\sum_{i:\mu_i<\mu^*}\frac{\log T}{\Delta_i}\right)$ problem-dependent regret bound and an $O(\sqrt{NT\log T})$ problem-independent regret bound for UCB.

A related notion of regret considered by many recent works on TS, e.g., [Bubeck and Liu 2014; Russo and Van Roy 2015, 2014; Russo et al. 2013], is Bayesian regret. Bayesian regret is the expected regret over a (known) prior f over the problem instances. Using the terminology above, Bayesian regret is defined as

$$E_{\Theta\sim f}\big[E[R(T,\Theta)\mid\Theta]\big].$$

For priors on reward distributions with [0, 1] support, clearly, this is a weaker notion of regret than the worst-case problem-independent regret, in that any bound on the worst-case regret implies the same bound on Bayesian regret for any such prior f, but not vice versa. However, we must note that these works allow more general priors and some of these even accommodate contexts and further complex information structures. Bayesian regret bounds in those more complex settings are incomparable to the worst-case regret bounds presented here.

1.2. Thompson Sampling

As mentioned before, TS is a natural algorithm for stochastic MAB. The basic idea behind TS is to assume a simple prior distribution on the underlying parameters of the reward distribution of every arm, and at every time step, play an arm according to its posterior probability of being the best arm. While TS is a specific algorithm due to Thompson, in this paper we will use TS more generally to refer to a class of algorithms that have a similar structure and include the original algorithm of Thompson as a special case. The general structure of TS involves the following elements (this description of TS follows closely that of [Chapelle and Li 2011]):

(1) a set ψ of parameters µ;
(2) an assumed prior distribution P(µ) on these parameters;
(3) past observations D, consisting of the rewards r observed for the arms played in the past time steps;
(4) an assumed likelihood function P(r|µ), which gives the probability of a reward given a parameter µ ∈ ψ;
(5) a posterior distribution P(µ|D) ∝ P(D|µ)P(µ), where P(D|µ) is the likelihood function.

The notation P(·) above denotes a probability density function (or probability mass function for discrete random variables). TS maintains a posterior distribution for the underlying parameters µi, i.e., the expected reward of every arm i. In each round, TS plays an arm according to its posterior probability of being the best arm, that is, the posterior probability of having the highest value of µi. A simple way to achieve this is to produce a sample from the posterior distribution of every arm, and play the arm that produces the largest sample. Below we describe two versions of TS, one using Beta priors and a Bernoulli likelihood function, and one using Gaussian priors and a Gaussian likelihood.

We emphasize that the Beta-prior/Bernoulli-likelihood model and the Gaussian-prior/Gaussian-likelihood model for rewards are only used below to design the Thompson Sampling algorithms. Our analysis of these algorithms allows these models to be completely unrelated to the actual reward distribution. The assumptions on the actual reward distribution are only those mentioned in Section 1.1, namely that the rewards are in the range [0, 1] and are generated i.i.d. upon playing an arm. In the description of TS using Beta priors and Bernoulli likelihood, for simplicity we begin with the description of the algorithm for the Bernoulli bandit problem, i.e., when the


rewards are either 0 or 1, but as we explain later, the algorithm and its analysis extend to any distribution of rewards with [0, 1] support.

Thompson Sampling using Beta priors and Bernoulli likelihood. Consider the Bernoulli bandit problem, i.e., when the rewards are either 0 or 1, and the likelihood of reward 1 for arm i (the probability of success) is µi. Using Beta priors is convenient for Bernoulli rewards because if the prior is a Beta(α, β) distribution, then after observing a Bernoulli trial, the posterior distribution is simply Beta(α + 1, β) or Beta(α, β + 1), depending on whether the trial resulted in a success or a failure, respectively.

TS initially assumes arm i to have prior Beta(1, 1) on µi, which is natural because Beta(1, 1) is the uniform distribution on (0, 1). Informally, this choice captures the fact that initially we have no knowledge about the µi. At time t, having observed Si(t) successes (reward = 1) and Fi(t) failures (reward = 0) in ki(t) = Si(t) + Fi(t) plays of arm i, the algorithm updates the distribution on µi to Beta(Si(t) + 1, Fi(t) + 1). The algorithm then generates independent samples from these posterior distributions of the µi's, and plays the arm with the largest sample value.

ALGORITHM 1: Thompson Sampling using Beta priors

For each arm i = 1, . . . , N set Si = 0, Fi = 0.
foreach t = 1, 2, . . . do
    For each arm i = 1, . . . , N, sample θi(t) from the Beta(Si + 1, Fi + 1) distribution.
    Play arm i(t) := arg maxi θi(t) and observe reward rt.
    If rt = 1, then S_{i(t)} := S_{i(t)} + 1, else F_{i(t)} := F_{i(t)} + 1.
end
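For concreteness, here is a minimal Python sketch of Algorithm 1 for Bernoulli rewards (our own illustrative code, not part of the paper); the names `thompson_sampling_beta` and `pull` are ours, with `pull(i)` standing in for the environment returning a 0/1 reward for arm i.

```python
import random

def thompson_sampling_beta(pull, N, T):
    """Sketch of Algorithm 1: TS with Beta(1,1) priors on Bernoulli arms.

    pull(i) must return a reward in {0, 1} for arm i; N is the number of arms,
    T the horizon. Returns the per-arm play counts.
    """
    S = [0] * N  # number of successes (reward 1) observed for each arm
    F = [0] * N  # number of failures (reward 0) observed for each arm
    for t in range(1, T + 1):
        # Sample theta_i(t) ~ Beta(S_i + 1, F_i + 1) independently for each arm.
        theta = [random.betavariate(S[i] + 1, F[i] + 1) for i in range(N)]
        i_t = max(range(N), key=lambda i: theta[i])  # play the arm with the largest sample
        r_t = pull(i_t)
        if r_t == 1:
            S[i_t] += 1
        else:
            F[i_t] += 1
    return [S[i] + F[i] for i in range(N)]

# Example: three Bernoulli arms with means 0.9, 0.8, 0.5.
mu = [0.9, 0.8, 0.5]
counts = thompson_sampling_beta(lambda i: 1 if random.random() < mu[i] else 0, N=3, T=1000)
print(counts)  # the optimal arm should receive the bulk of the plays
```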

We have provided the details of TS with Beta priors for the Bernoulli bandit problem. A simple extension of this algorithm to general reward distributions with support [0, 1] is described in Agrawal and Goyal [2012]. In this extension, on observing a reward rt ∈ [0, 1], we toss a coin with bias rt, and use the 0/1 outcome to update the Beta distribution as above. It is easy to show that any expected regret bound proven for Algorithm 1 also holds for this extension [Agrawal and Goyal 2012].
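A sketch of this Bernoulli trick follows (the wrapper name `binarize` and the argument `pull_01` are our own): it converts a [0, 1]-valued reward into a 0/1 reward that can be fed to the Beta-prior sketch above.

```python
import random

def binarize(pull_01):
    """Wrap a [0,1]-valued reward function so the Beta-prior sketch can consume it.

    On reward r in [0,1], toss a coin with bias r and return the 0/1 outcome;
    the Beta posterior is then updated exactly as in the Bernoulli case.
    """
    def pull_binary(i):
        r = pull_01(i)                       # real-valued reward in [0, 1]
        return 1 if random.random() < r else 0
    return pull_binary

# Usage with the sketch above: counts = thompson_sampling_beta(binarize(pull_01), N, T)
```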

Thompson Sampling using Gaussian priors and likelihood. As before, let ki(t) denote the number of plays of arm i until time t − 1, and let i(t) denote the arm played at time t. Let ri(t) denote the reward of arm i at time t; define the empirical mean $\hat{\mu}_i(t) := \frac{\sum_{\tau=1,\ldots,t-1:\,i(\tau)=i} r_i(\tau)}{k_i(t)+1}$. Note that µ̂i(1) = 0. To derive TS with Gaussian priors, assume that the likelihood of reward ri(t) at time t, given parameter µi, is given by the pdf of the Gaussian distribution N(µi, 1). Then, assuming that the prior for µi at time t is given by $\mathcal{N}\!\left(\hat{\mu}_i(t), \frac{1}{k_i(t)+1}\right)$, and arm i is played at time t with reward r, it is easy to compute the posterior distribution Pr(µi|ri(t)) ∝ Pr(ri(t)|µi) Pr(µi) as the Gaussian distribution $\mathcal{N}\!\left(\hat{\mu}_i(t+1), \frac{1}{k_i(t+1)+1}\right)$. In TS with Gaussian priors, for each arm i, we will generate an independent sample θi(t) from the distribution $\mathcal{N}\!\left(\hat{\mu}_i(t), \frac{1}{k_i(t)+1}\right)$ at time t. The arm with the maximum value of θi(t) will be played.

1.3. Our results

In this article, we bound the finite-time expected regret of TS. Henceforth we will assume that the first arm is the unique optimal arm, i.e., µ∗ = µ1 > maxi≠1 µi. Assuming that the first arm is an optimal arm is a matter of convenience for stating the results and for the analysis; the algorithms do not use this assumption. The assumption of a unique optimal arm is also without loss of generality, since adding more arms with µi = µ∗ can only decrease the expected regret; details of this argument were provided in [Agrawal and Goyal 2012].


ALGORITHM 2: Thompson Sampling using Gaussian priors

For each arm i = 1, . . . , N set ki = 0, µ̂i = 0.
foreach t = 1, 2, . . . do
    For each arm i = 1, . . . , N, sample θi(t) independently from the N(µ̂i, 1/(ki + 1)) distribution.
    Play arm i(t) := arg maxi θi(t) and observe reward rt.
    Set $\hat{\mu}_{i(t)} := \frac{\hat{\mu}_{i(t)}\,k_{i(t)} + r_t}{k_{i(t)}+2}$, and $k_{i(t)} := k_{i(t)} + 1$.
end
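The following is a minimal Python sketch of Algorithm 2 (our own illustrative code, not part of the paper); instead of the incremental update in the box, it keeps per-arm reward sums and computes the empirical mean as $\hat{\mu}_i = \frac{\text{sum of rewards}}{k_i+1}$, following Definition 2.2. The names `thompson_sampling_gaussian` and `pull` are ours.

```python
import math
import random

def thompson_sampling_gaussian(pull, N, T):
    """Sketch of Algorithm 2: TS with Gaussian priors.

    pull(i) must return a reward in [0, 1] for arm i. For each arm we keep the
    play count k_i and the reward sum, and sample from N(hat_mu_i, 1/(k_i + 1)).
    """
    k = [0] * N             # number of plays of each arm
    reward_sum = [0.0] * N  # total reward collected from each arm
    for t in range(1, T + 1):
        # Sample theta_i(t) ~ N(hat_mu_i, 1/(k_i + 1)) independently for each arm.
        theta = [random.gauss(reward_sum[i] / (k[i] + 1), math.sqrt(1.0 / (k[i] + 1)))
                 for i in range(N)]
        i_t = max(range(N), key=lambda i: theta[i])
        r_t = pull(i_t)
        reward_sum[i_t] += r_t
        k[i_t] += 1
    return k
```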


THEOREM 1.1. Fix ǫ ∈ (0, 1). For the N-armed stochastic bandit problem with the µi satisfying the assumptions in the previous paragraph, Thompson Sampling using Beta priors has expected regret

$$E[R(T)] \;\le\; (1+\epsilon)\sum_{i=2}^{N}\frac{\ln T}{d(\mu_i,\mu_1)}\,\Delta_i \;+\; O\!\left(\frac{N}{\epsilon^2}\right)$$

at time T, where $d(\mu_i,\mu_1) := \mu_i\log\frac{\mu_i}{\mu_1} + (1-\mu_i)\log\frac{1-\mu_i}{1-\mu_1}$. The big-Oh notation assumes µi, ∆i, i = 1, . . . , N to be constants.

THEOREM 1.2. For the N-armed stochastic bandit problem, Thompson Sampling using Beta priors has expected regret

$$E[R(T)] \;\le\; O\big(\sqrt{NT\ln T}\big)$$

at time T. The big-Oh notation hides only the absolute constants.

THEOREM 1.3. For the N-armed stochastic bandit problem, Thompson Sampling using Gaussian priors has expected regret

$$E[R(T)] \;\le\; O\big(\sqrt{NT\ln N}\big)$$

at time T ≥ N. The big-Oh notation hides only absolute constants.

THEOREM 1.4. There exists an instance of the N-armed stochastic bandit problem for which Thompson Sampling using Gaussian priors has expected regret

$$E[R(T)] \;\ge\; \Omega\big(\sqrt{NT\ln N}\big)$$

at time T ≥ N. Here Ω hides only absolute constants.

1.4. Related work

Let us contrast our bounds with the previous work. Consider first the problem-dependent regret bounds, i.e., regret bounds that depend on the problem parameters N and µi, ∆i, i = 1, . . . , N. [Lai and Robbins 1985] essentially proved an asymptotic lower bound of

$$\left[\sum_{i=2}^{N}\frac{\Delta_i}{d(\mu_i,\mu_1)} + o(1)\right]\ln T$$

for any algorithm for this problem. They also gave algorithms asymptotically achieving this guarantee. [Auer et al. 2002] gave the UCB1 algorithm, which achieves a finite-time regret bound of

$$\left[8\sum_{i=2}^{N}\frac{1}{\Delta_i}\right]\ln T + \left(1+\frac{\pi^2}{3}\right)\left(\sum_{i=2}^{N}\Delta_i\right).$$

More recently, Kaufmann et al. [2012a] gave the Bayes-UCB algorithm, and [Garivier and Cappé 2011] and Maillard et al. [2011] gave UCB-like algorithms,


which achieve the lower bound of Lai and Robbins [1985]. Our regret bound in Theorem 1.1 achieves the lower bound of Lai and Robbins [1985], and matches the upper bound provided by [Kaufmann et al. 2012b] for TS.

Theorems 1.2 and 1.3 show that TS with Beta and Gaussian distributions achieves a problem-independent regret bound of $O(\sqrt{NT\ln T})$ and $O(\sqrt{NT\ln N})$, respectively. This is the first analysis for TS that matches the $\Omega(\sqrt{NT})$ problem-independent lower bound (see Section 3.3 of [Bubeck and Cesa-Bianchi 2012]) for the stochastic MAB within logarithmic factors. The problem-dependent bounds can be used to derive problem-independent bounds. However, the previous work on TS implied only suboptimal problem-independent bounds: the results of Agrawal and Goyal [2012] implied a problem-independent bound of $O(N^{1/5}T^{4/5})$. In [Kaufmann et al. 2012b], the additive problem-dependent term was not explicitly calculated, which makes it difficult to derive the implied problem-independent bound, but a preliminary examination suggests that it would involve an even higher power of T.

To compare with other existing algorithms for this problem, note that the best known problem-independent bound for the expected regret of UCB1 is $O(\sqrt{NT\ln T})$ (see [Bubeck and Cesa-Bianchi 2012]). Our regret bound of $O(\sqrt{NT\ln N})$ for TS with Gaussian priors is an improvement over the bound for UCB1. More recently, [Audibert and Bubeck 2009] gave an algorithm MOSS, inspired by UCB1, with regret $O(\sqrt{NT})$ that matches the $\Omega(\sqrt{NT})$ problem-independent lower bound for the multi-armed bandit problem. However, their algorithm needs to know the time horizon T. It is unclear whether an $O(\sqrt{NT})$ regret can be achieved by an algorithm that does not know the time horizon. Interestingly, Theorem 1.4 shows that this is unachievable for TS with Gaussian priors, as there is a lower bound of $\Omega(\sqrt{NT\ln N})$ on its expected regret. This is the first lower bound for TS that differs from the general lower bound for the problem.

Much follow-up work on the theoretical properties of Thompson Sampling has appeared since our work was first made public as [Agrawal and Goyal 2013a] (conference version). We mention some of this work: [Korda et al. 2013] study TS for the special case of the exponential family of distributions, [Kocak et al. 2014] for spectral bandits, [Agrawal and Goyal 2013b; Russo and Van Roy 2014; Li 2013] for contextual bandits, and [Russo et al. 2013] for reinforcement learning.

2. PROOFS OF UPPER BOUNDS

In this section, we prove Theorems 1.1, 1.2 and 1.3. The proofs of the three theorems follow similar steps, and diverge only towards the end of the analysis.

Proof Outline: Our proof uses a martingale-based analysis. Essentially, we prove that, conditioned on any history of execution in the preceding steps, the probability of playing any suboptimal arm i at the current step can be bounded by a linear function of the probability of playing the optimal arm at the current step. This is proven in Lemma 2.8, which forms the core of our analysis. Further, we show that the coefficient in this linear function decreases exponentially fast with the number of plays of the optimal arm (Lemma 2.9). This allows us to bound the number of plays of every suboptimal arm, which in turn bounds the regret. The differences between the analyses for obtaining the logarithmic problem-dependent bound of Theorem 1.1 and the problem-independent bounds of Theorems 1.2 and 1.3 are technical, and occur only towards the end of the proof.


We recall some of the definitions introduced earlier and introduce some new ones.

Definition 2.1 ($F^B_{n,p}$, $f^B_{n,p}$, $F^{beta}_{\alpha,\beta}$). $F^B_{n,p}(\cdot)$ denotes the cdf and $f^B_{n,p}(\cdot)$ denotes the probability mass function of the binomial distribution with parameters n, p, and $F^{beta}_{\alpha,\beta}(\cdot)$ denotes the cdf of the beta distribution with parameters α, β.

Definition 2.2 (Quantities ki(t), i(t), Si(t), µ̂i(t)). i(t) denotes the arm played at time t, ki(t) denotes the number of plays of arm i until (and including) time t − 1, and Si(t) denotes the number of successes among the ki(t) plays of arm i until time t − 1 for the Bernoulli bandit case (in other words, Si(t) is the number of times arm i gave reward 1). Finally, the empirical mean µ̂i(t) for arm i at time t is defined by $\hat{\mu}_i(t) := \frac{\sum_{\tau=1,\ldots,t-1:\,i(\tau)=i} r_i(\tau)}{k_i(t)+1}$, where ri(τ) is the reward for arm i at time τ (note that µ̂i(t) = 0 when ki(t) = 0). For Bernoulli bandits, $\hat{\mu}_i(t) = \frac{S_i(t)}{k_i(t)+1}$.

Definition 2.3 (Quantities θi(t)). θi(t) denotes a sample generated independently for each arm i from the posterior distribution at time t. For Algorithm 1, this is generated from the posterior distribution Beta(Si(t) + 1, ki(t) − Si(t) + 1). For Algorithm 2, this is generated from the posterior distribution N(µ̂i(t), 1/(ki(t) + 1)).

Definition 2.4 (Quantities xi, yi). For each arm i ≠ 1, we will choose two thresholds xi and yi such that µi < xi < yi < µ1. The specific choice of these thresholds will depend on whether we are proving the problem-dependent bound or the problem-independent bound, and will be described at the appropriate points in the proof.

Definition 2.5 (Events Eµi(t) and Eθi(t)). For i ≠ 1, Eµi(t) is the event that µ̂i(t) ≤ xi, and Eθi(t) is the event that θi(t) ≤ yi.

Intuitively, Eµi(t) and Eθi(t) are the events that the estimate µ̂i(t) and the sample value θi(t), respectively, are not too far above the mean µi. As we show later, these events will hold with high probability.

Definition 2.6 (History Ft). For t = 1, 2, . . ., define Ft as the history of plays until time t, i.e., the sequence

$$\mathcal{F}_t = \big\{\, i(\tau),\, r_{i(\tau)}(\tau),\ \tau = 1,\ldots,t \,\big\},$$

where i(τ) denotes the arm played at time τ, and $r_{i(\tau)}(\tau)$ denotes the reward observed for arm i(τ) at time τ. Define F0 = ∅.

By definition, F0 ⊆ F1 ⊆ · · · ⊆ FT−1. Also by definition, for every arm i, the quantities Si(t) (this is only defined for the case of Bernoulli rewards), ki(t), µ̂i(t), the distribution of θi(t), and whether or not Eµi(t) is true, are determined by the history of plays until time t − 1, i.e., by Ft−1.

Definition 2.7. Define pi,t as the probability

$$p_{i,t} := \Pr\big(\theta_1(t) > y_i \mid \mathcal{F}_{t-1}\big).$$

Note that pi,t is a random variable determined by Ft−1; we do not explicitly indicate this dependence by using notation such as pi,t(Ft−1), for brevity.

We prove the following lemma for Thompson Sampling, independent of the type of priors (e.g., Beta or Gaussian) used.


LEMMA 2.8. For all t, all i ≠ 1, and all instantiations Ft−1 of Ft−1, we have

$$\Pr\big(i(t)=i,\ E^{\mu}_i(t),\ E^{\theta}_i(t) \mid \mathcal{F}_{t-1}\big) \;\le\; \frac{(1-p_{i,t})}{p_{i,t}}\,\Pr\big(i(t)=1,\ E^{\mu}_i(t),\ E^{\theta}_i(t) \mid \mathcal{F}_{t-1}\big).$$

PROOF. Recall that whether or not Eµi(t) is true is determined by the instantiation Ft−1 of Ft−1. Assume that the history Ft−1 is such that Eµi(t) is true (otherwise the probability on the left hand side is 0 and the inequality is trivially true). It then suffices to prove that for all such Ft−1 we have

$$\Pr\big(i(t)=i \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big) \;\le\; \frac{(1-p_{i,t})}{p_{i,t}}\,\Pr\big(i(t)=1 \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big). \qquad (1)$$

We will use the observation that since Eθi(t) is the event that θi(t) ≤ yi, given Eθi(t) we have i(t) = i only if θj(t) ≤ yi for all j. Therefore, for any i ≠ 1,

$$\begin{aligned}
\Pr\big(i(t)=i \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big) &\le \Pr\big(\theta_j(t) \le y_i,\ \forall j \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big)\\
&= \Pr\big(\theta_1(t) \le y_i \mid \mathcal{F}_{t-1}=F_{t-1}\big)\cdot\Pr\big(\theta_j(t) \le y_i,\ \forall j\ne 1 \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big)\\
&= (1-p_{i,t})\cdot\Pr\big(\theta_j(t) \le y_i,\ \forall j\ne 1 \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big).
\end{aligned}$$

The first equality holds because, given Ft−1 = Ft−1 (and hence Sj(t), kj(t), µ̂j(t) and the distributions of θj(t) for all j), θ1(t) is independent of all the other θj(t) and of the events Eθj(t), j ≠ 1. Similarly,

$$\begin{aligned}
\Pr\big(i(t)=1 \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big) &\ge \Pr\big(\theta_1(t) > y_i \ge \theta_j(t),\ \forall j\ne 1 \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big)\\
&= \Pr\big(\theta_1(t) > y_i \mid \mathcal{F}_{t-1}=F_{t-1}\big)\cdot\Pr\big(\theta_j(t) \le y_i,\ \forall j\ne 1 \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big)\\
&= p_{i,t}\cdot\Pr\big(\theta_j(t) \le y_i,\ \forall j\ne 1 \mid E^{\theta}_i(t),\ \mathcal{F}_{t-1}=F_{t-1}\big).
\end{aligned}$$

Combining the above two inequalities, we get (1).

Now we are ready to prove the upper bounds on regret in Theorems 1.1, 1.2, and 1.3.

2.1. Proof of Theorem 1.1

We can decompose the expected number of plays of a suboptimal arm i ≠ 1 as follows:

$$E[k_i(T)] = \sum_{t=1}^{T}\Pr(i(t)=i) = \sum_{t=1}^{T}\Pr\big(i(t)=i,\ E^{\mu}_i(t),\ E^{\theta}_i(t)\big) + \sum_{t=1}^{T}\Pr\big(i(t)=i,\ E^{\mu}_i(t),\ \overline{E^{\theta}_i(t)}\big) + \sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\mu}_i(t)}\big). \qquad (2)$$

Next, we bound each of the above terms. For the first term, applying Lemma 2.8 and some algebraic manipulations using properties of conditional expectations, we have

$$\begin{aligned}
\sum_{t=1}^{T}\Pr\big(i(t)=i, E^{\mu}_i(t), E^{\theta}_i(t)\big) &= \sum_{t=1}^{T} E\Big[\Pr\big(i(t)=i, E^{\mu}_i(t), E^{\theta}_i(t) \mid \mathcal{F}_{t-1}\big)\Big]\\
&\le \sum_{t=1}^{T} E\left[\frac{(1-p_{i,t})}{p_{i,t}}\Pr\big(i(t)=1, E^{\theta}_i(t), E^{\mu}_i(t) \mid \mathcal{F}_{t-1}\big)\right]\\
&= \sum_{t=1}^{T} E\left[E\left[\frac{(1-p_{i,t})}{p_{i,t}}\,I\big(i(t)=1, E^{\theta}_i(t), E^{\mu}_i(t)\big) \,\Big|\, \mathcal{F}_{t-1}\right]\right]\\
&= \sum_{t=1}^{T} E\left[\frac{(1-p_{i,t})}{p_{i,t}}\,I\big(i(t)=1, E^{\theta}_i(t), E^{\mu}_i(t)\big)\right]. \qquad (3)
\end{aligned}$$

The second equality above uses that pi,t is fixed given Ft−1. Now, let τk denote the time step at which arm 1 is played for the kth time, for k ≥ 1, and let τ0 = 0. (Note that τk > T for k > k1(T); also, τT ≥ T.) Observe that pi,t = Pr(θ1(t) > yi|Ft−1) changes only when the distribution of θ1(t) changes, that is, only on the time step after each play of the first arm. Thus, pi,t is the same at all time steps t ∈ {τk + 1, . . . , τk+1}, for every k. Using this observation, we can decompose the above term in the following way:

$$\begin{aligned}
\sum_{t=1}^{T} E\left[\frac{(1-p_{i,t})}{p_{i,t}}\,I\big(i(t)=1, E^{\theta}_i(t), E^{\mu}_i(t)\big)\right] &= \sum_{k=0}^{T-1} E\left[\frac{(1-p_{i,\tau_k+1})}{p_{i,\tau_k+1}}\sum_{t=\tau_k+1}^{\tau_{k+1}} I\big(i(t)=1, E^{\theta}_i(t), E^{\mu}_i(t)\big)\right]\\
&\le \sum_{k=0}^{T-1} E\left[\frac{(1-p_{i,\tau_k+1})}{p_{i,\tau_k+1}}\right]. \qquad (4)
\end{aligned}$$

We prove the following lemma to bound the sum of $\frac{1}{p_{i,\tau_k+1}}$.

LEMMA 2.9. Let τk denote the time step at which the kth trial of the first arm happens. Then, for i ≠ 1, we have³

$$E\left[\frac{1}{p_{i,\tau_k+1}} - 1\right] \;\le\;
\begin{cases}
\dfrac{3}{\Delta'_i} & \text{for } k < \dfrac{8}{\Delta'_i},\\[2mm]
\Theta\!\left(e^{-\Delta'^2_i k/2} + \dfrac{1}{(k+1)\Delta'^2_i}\,e^{-D_i k} + \dfrac{1}{e^{\Delta'^2_i k/4}-1}\right) & \text{for } k \ge \dfrac{8}{\Delta'_i},
\end{cases}$$

where $\Delta'_i = \mu_1 - y_i$ and $D_i = y_i\ln\frac{y_i}{\mu_1} + (1-y_i)\ln\frac{1-y_i}{1-\mu_1}$.

PROOF. The proof of this inequality uses careful numerical estimates and appears in Appendix B.

Substituting the bound from Lemma 2.9 into (4), we obtain the following bound on the first term on the right hand side of (2).

LEMMA 2.10. For i ≠ 1,

$$\sum_{t=1}^{T}\Pr\big(i(t)=i, E^{\mu}_i(t), E^{\theta}_i(t)\big) \;\le\; \frac{24}{\Delta'^2_i} + \sum_{j\ge 8/\Delta'_i}\Theta\!\left(e^{-\Delta'^2_i j/2} + \frac{1}{(j+1)\Delta'^2_i}\,e^{-D_i j} + \frac{1}{e^{\Delta'^2_i j/4}-1}\right).$$

³ Here, for two functions f(x), g(x) taking positive values and with the same domain, we say that f(x) = Θ(g(x)) if there exist absolute constants b, c > 0 such that for all x in the domains of f and g we have bg(x) ≤ f(x) ≤ cg(x).


To bound the remaining two terms in (2), we use the fact that as the number of plays of arm i increases, the probability of violating the events Eµi(t) and Eθi(t) decreases exponentially. More precisely, we prove the following lemmas.

LEMMA 2.11. For i ≠ 1,

$$\sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\mu}_i(t)}\big) \;\le\; \frac{1}{d(x_i,\mu_i)} + 1.$$

PROOF. Let τk denote the time at which the kth trial of arm i happens. Set τ0 = 0. (Note that τk > T for k > ki(T). Note also that T ≤ τT.) Recall that the event $\overline{E^{\mu}_i(t)}$ is the event that µ̂i(t) > xi. We have

$$\sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\mu}_i(t)}\big) \;\le\; \sum_{k=0}^{T-1}\Pr\big(\overline{E^{\mu}_i(\tau_{k+1})}\big).$$

Each summand on the right hand side of the inequality above is a fixed number even though the random variable τk+1 appears in it. This is because the distribution of µ̂i(τk+1) depends only on k and not on τk+1. At time τk+1, for k ≥ 1, $\hat{\mu}_i(\tau_{k+1}) = \frac{S_i(\tau_{k+1})}{k+1} \le \frac{S_i(\tau_{k+1})}{k}$, where the latter is simply the average of the outcomes of k i.i.d. plays of arm i, each of which is a Bernoulli trial with mean µi. Using the Chernoff-Hoeffding bounds (Fact 1), we obtain $\Pr(\hat{\mu}_i(\tau_{k+1}) > x_i) \le \Pr\left(\frac{S_i(\tau_{k+1})}{k} > x_i\right) \le e^{-k\,d(x_i,\mu_i)}$. Substituting, we get

$$\sum_{k=0}^{T-1}\Pr\big(\overline{E^{\mu}_i(\tau_{k+1})}\big) = \sum_{k=0}^{T-1}\Pr\big(\hat{\mu}_i(\tau_{k+1}) > x_i\big) \;\le\; 1 + \sum_{k=1}^{T-1}\exp\big(-k\,d(x_i,\mu_i)\big) \;\le\; 1 + \frac{1}{d(x_i,\mu_i)}.$$

The last inequality uses the fact that d(xi, µi) > 0.

LEMMA 2.12. For i ≠ 1,

$$\sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big) \;\le\; L_i(T) + 1,$$

where $L_i(T) = \frac{\ln T}{d(x_i,y_i)}$.

PROOF. We decompose the probability term into two parts, based on whether or not ki(t) is large (ki(t) > Li(T)):

$$\begin{aligned}
\sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big) &= \sum_{t=1}^{T}\Pr\big(i(t)=i,\ k_i(t)\le L_i(T),\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big)\\
&\quad + \sum_{t=1}^{T}\Pr\big(i(t)=i,\ k_i(t)> L_i(T),\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big). \qquad (5)
\end{aligned}$$

The first term in the above decomposition is bounded by $E\left[\sum_{t=1}^{T} I\big(i(t)=i,\ k_i(t)\le L_i(T)\big)\right]$, which is bounded trivially by Li(T). What remains is to bound the second term by 1. To bound the second term, we demonstrate that if ki(t) is large and the event Eµi(t) is satisfied, then the probability that the event Eθi(t) is violated is small. Recall that Eθi(t) is defined as the event that θi(t) ≤ yi, and Eµi(t) is the event that µ̂i(t) ≤ xi. Then,

$$\begin{aligned}
\sum_{t=1}^{T}\Pr\big(i(t)=i,\ k_i(t)> L_i(T),\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big) &= \sum_{t=1}^{T} E\Big[I\big(i(t)=i,\ k_i(t)> L_i(T),\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big)\Big]\\
&= E\left[\sum_{t=1}^{T} E\Big[I\big(i(t)=i,\ k_i(t)> L_i(T),\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big) \,\Big|\, \mathcal{F}_{t-1}\Big]\right]\\
&= E\left[\sum_{t=1}^{T} I\big(k_i(t)> L_i(T),\ E^{\mu}_i(t)\big)\,\Pr\big(i(t)=i,\ \overline{E^{\theta}_i(t)} \mid \mathcal{F}_{t-1}\big)\right]\\
&\le E\left[\sum_{t=1}^{T} I\big(k_i(t)> L_i(T),\ \hat{\mu}_i(t)\le x_i\big)\,\Pr\big(\theta_i(t)> y_i \mid \mathcal{F}_{t-1}\big)\right]. \qquad (6)
\end{aligned}$$

The third equality above uses the fact that ki(t) and Eµi(t) are determined by the history Ft−1.

Now, by definition, Si(t) = µ̂i(t)(ki(t) + 1), and therefore θi(t) is a Beta(µ̂i(t)(ki(t) + 1) + 1, (1 − µ̂i(t))(ki(t) + 1)) distributed random variable. A Beta(α, β) random variable is stochastically dominated by Beta(α′, β′) if α′ ≥ α, β′ ≤ β. Therefore, if µ̂i(t) ≤ xi, the distribution of θi(t) is stochastically dominated by Beta(xi(ki(t) + 1) + 1, (1 − xi)(ki(t) + 1)). Therefore, given an instantiation Ft−1 of Ft−1 such that µ̂i(t) ≤ xi and ki(t) > Li(T), we have

$$\Pr\big(\theta_i(t)> y_i \mid \mathcal{F}_{t-1}=F_{t-1}\big) \;\le\; 1 - F^{beta}_{x_i(k_i(t)+1)+1,\,(1-x_i)(k_i(t)+1)}(y_i).$$

Now, using Fact 3 along with the Chernoff-Hoeffding bounds (Fact 1), we obtain that for any fixed ki(t) > Li(T),

$$1 - F^{beta}_{x_i(k_i(t)+1)+1,\,(1-x_i)(k_i(t)+1)}(y_i) \;=\; F^{B}_{k_i(t)+1,\,y_i}\big(x_i(k_i(t)+1)\big) \;\le\; e^{-(k_i(t)+1)\,d(x_i,y_i)} \;\le\; e^{-L_i(T)\,d(x_i,y_i)},$$

which is smaller than $\frac{1}{T}$ because $L_i(T) = \frac{\ln T}{d(x_i,y_i)}$. Substituting, we get that for any instantiation Ft−1 of Ft−1 such that µ̂i(t) ≤ xi and ki(t) > Li(T),

$$\Pr\big(\theta_i(t)> y_i \mid \mathcal{F}_{t-1}=F_{t-1}\big) \;\le\; \frac{1}{T}.$$

For other instantiations of Ft−1, the indicator term $I\big(k_i(t)> L_i(T),\ \hat{\mu}_i(t)\le x_i\big)$ in (6) will be 0. Summing over t, this bounds the second term in (5) by 1, completing the proof of the lemma.

Putting it all together: Substituting the results from Lemma 2.10, Lemma 2.11 and Lemma 2.12 into (2), we obtain

$$E[k_i(T)] \;\le\; \frac{24}{\Delta'^2_i} + \sum_{j\ge 8/\Delta'_i}\Theta\!\left(e^{-\Delta'^2_i j/2} + \frac{1}{(j+1)\Delta'^2_i}\,e^{-D_i j} + \frac{1}{e^{\Delta'^2_i j/4}-1}\right) + L_i(T) + 1 + \frac{1}{d(x_i,\mu_i)} + 1. \qquad (7)$$


To obtain the problem-dependent bound of Theorem 1.1, for 0 < ǫ ≤ 1, we set xi ∈ (µi, µ1) such that d(xi, µ1) = d(µi, µ1)/(1 + ǫ), and set yi ∈ (xi, µ1) such that d(xi, yi) = d(xi, µ1)/(1 + ǫ) = d(µi, µ1)/(1 + ǫ)² (see footnote 4). This gives

$$L_i(T) = \frac{\ln T}{d(x_i,y_i)} = (1+\epsilon)^2\,\frac{\ln T}{d(\mu_i,\mu_1)}.$$

⁴ This way of choosing the thresholds, in order to obtain bounds in terms of the KL-divergences d(µi, µ1) rather than the ∆i's, is inspired by [Garivier and Cappé 2011; Maillard et al. 2011; Kaufmann et al. 2012a].

Also, by some simple algebraic manipulations of the equality d(xi, µ1) = d(µi, µ1)/(1 + ǫ), we can obtain

$$x_i - \mu_i \;\ge\; \frac{\epsilon}{(1+\epsilon)}\cdot\frac{d(\mu_i,\mu_1)}{\ln\!\left(\frac{\mu_1(1-\mu_i)}{\mu_i(1-\mu_1)}\right)},$$

giving $\frac{1}{d(x_i,\mu_i)} \le \frac{1}{2(x_i-\mu_i)^2} = O\!\left(\frac{1}{\epsilon^2}\right)$. Here the big-Oh is hiding functions of the µi's and ∆i's. Substituting in (7), we get

$$\begin{aligned}
E[k_i(T)] &\le \frac{24}{\Delta'^2_i} + \sum_{j\ge 8/\Delta'_i}\Theta\!\left(e^{-\Delta'^2_i j/2} + \frac{1}{(j+1)\Delta'^2_i}\,e^{-D_i j} + \frac{1}{e^{\Delta'^2_i j/4}-1}\right) + L_i(T) + 1 + \frac{1}{d(x_i,\mu_i)} + 1\\
&\le \frac{24}{\Delta'^2_i}\left(\frac{1}{\Delta'^2_i} + \frac{1}{\Delta'^2_i D_i} + \frac{1}{\Delta'^4_i}\right) + (1+\epsilon)^2\,\frac{\ln T}{d(\mu_i,\mu_1)} + O\!\left(\frac{1}{\epsilon^2}\right)\\
&= O(1) + (1+\epsilon)^2\,\frac{\ln T}{d(\mu_i,\mu_1)} + O\!\left(\frac{1}{\epsilon^2}\right).
\end{aligned}$$

The big-Oh above hides dependence on the µi's and ∆i's. This gives the expected regret bound

$$\begin{aligned}
E[R(T)] &= \sum_i \Delta_i\,E[k_i(T)]\\
&\le \sum_i (1+\epsilon)^2\,\frac{\ln T}{d(\mu_i,\mu_1)}\,\Delta_i + O\!\left(\frac{N}{\epsilon^2}\right)\\
&\le \sum_i (1+\epsilon')\,\frac{\ln T}{d(\mu_i,\mu_1)}\,\Delta_i + O\!\left(\frac{N}{\epsilon'^2}\right),
\end{aligned}$$

where ǫ′ = 3ǫ, and the big-Oh above hides the µi's and ∆i's in addition to the absolute constants. This completes the proof of Theorem 1.1.

2.2. Proof of Theorem 1.2

The proof of the $O(\sqrt{NT\ln T})$ problem-independent bound of Theorem 1.2 is basically the same as the proof of Theorem 1.1, except for the choice of the xi's and yi's. Here, we pick $x_i = \mu_i + \frac{\Delta_i}{3}$ and $y_i = \mu_1 - \frac{\Delta_i}{3}$, so that $\Delta'^2_i = (\mu_1 - y_i)^2 = \frac{\Delta_i^2}{9}$, and, using Pinsker's inequality, $d(x_i,\mu_i) \ge 2(x_i-\mu_i)^2 = \frac{2\Delta_i^2}{9}$ and $d(x_i,y_i) \ge 2(y_i-x_i)^2 \ge \frac{2\Delta_i^2}{9}$. Then $L_i(T) = \frac{\ln T}{d(x_i,y_i)} \le \frac{9\ln T}{2\Delta_i^2}$,



and $\frac{1}{d(x_i,\mu_i)} \le \frac{9}{2\Delta_i^2}$. Then, substituting these bounds in (7), we get

$$\begin{aligned}
E[k_i(T)] &\le \frac{24}{\Delta'^2_i} + \sum_{j\ge 8/\Delta'_i}^{T-1}\Theta\!\left(e^{-\Delta'^2_i j/2} + \frac{1}{(j+1)\Delta'^2_i}\,e^{-D_i j} + \frac{1}{e^{\Delta'^2_i j/4}-1}\right) + L_i(T) + 1 + \frac{1}{d(x_i,\mu_i)} + 1\\
&\le \sum_{j\ge 8/\Delta'_i}^{T-1}\Theta\!\left(e^{-\Delta'^2_i j/2} + \frac{1}{(j+1)\Delta'^2_i} + \frac{1}{j\Delta'^2_i}\right) + O\!\left(\frac{\ln T}{\Delta_i^2}\right)\\
&= \Theta\!\left(\frac{1}{\Delta'^2_i} + \frac{\ln T}{\Delta'^2_i}\right) + O\!\left(\frac{\ln T}{\Delta_i^2}\right)\\
&= O\!\left(\frac{\ln T}{\Delta_i^2}\right).
\end{aligned}$$

Therefore, for every arm i with $\Delta_i \ge \sqrt{\frac{N\ln T}{T}}$, the expected regret is bounded by $\Delta_i E[k_i(T)] = O\!\left(\sqrt{\frac{T\ln T}{N}}\right)$. For arms with $\Delta_i \le \sqrt{\frac{N\ln T}{T}}$, the total expected regret is bounded by $\sqrt{NT\ln T}$. This gives a total regret bound of $O(\sqrt{NT\ln T})$, completing the proof of Theorem 1.2.

2.3. Proof of Theorem 1.3

The regret analysis of TS with Gaussian priors follows essentially the same steps as the analysis of the version with Beta priors. Here, we choose $x_i = \mu_i + \frac{\Delta_i}{3}$, $y_i = \mu_1 - \frac{\Delta_i}{3}$, and $L_i(T) = \frac{2\ln(T\Delta_i^2)}{(y_i-x_i)^2} = \frac{18\ln(T\Delta_i^2)}{\Delta_i^2}$. We prove lemmas similar to Lemmas 2.10-2.12 to bound the three terms in (2).

To obtain a bound on the first term, observe that the derivation of Lemma 2.8 is independent of the type of priors used; therefore the derivation of (4) holds as is for Gaussian priors. We prove the following lemma corresponding to Lemma 2.9.

LEMMA 2.13. Let τj denote the time of the jth play of the first arm. Then

$$E\left[\frac{1}{p_{i,\tau_j+1}} - 1\right] \;\le\;
\begin{cases}
e^{11} + 5 & \forall j,\\[1mm]
\dfrac{4}{T\Delta_i^2} & j > 4L_i(T),
\end{cases}$$

where $L_i(T) = \frac{18\ln(T\Delta_i^2)}{\Delta_i^2}$.

PROOF. From Definitions 2.3 and 2.7, recall that pi,t denotes the probability of θ1(t) exceeding yi, given Ft−1, and that for the algorithm with Gaussian priors θ1(t) has distribution $\mathcal{N}\!\left(\hat{\mu}_1(t), \frac{1}{k_1(t)+1}\right)$.

Given Fτj, let Θj denote a $\mathcal{N}\!\left(\hat{\mu}_1(\tau_j+1), \frac{1}{j+1}\right)$ distributed Gaussian random variable. Let Gj be the geometric random variable denoting the number of consecutive independent trials until a sample of Θj becomes greater than yi. Then observe that $p_{i,\tau_j+1} = \Pr(\Theta_j > y_i \mid \mathcal{F}_{\tau_j})$ and

$$E\left[\frac{1}{p_{i,\tau_j+1}} - 1\right] = E\big[E[G_j \mid \mathcal{F}_{\tau_j}]\big] = E[G_j].$$

We will bound the expected value of Gj by a constant for all j.

Consider any integer r ≥ 1. Let $z = \sqrt{\ln r}$ and let the random variable MAXr denote the maximum of r independent samples of Θj. We abbreviate $\hat{\mu}_1(\tau_j+1)$ to $\hat{\mu}_1$ in the following. Then,

$$\begin{aligned}
\Pr(G_j < r) &\ge \Pr(\mathrm{MAX}_r > y_i)\\
&\ge \Pr\left(\mathrm{MAX}_r > \hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \ge y_i\right)\\
&= E\left[E\left[I\!\left(\mathrm{MAX}_r > \hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \ge y_i\right) \,\middle|\, \mathcal{F}_{\tau_j}\right]\right]\\
&= E\left[I\!\left(\hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \ge y_i\right)\Pr\!\left(\mathrm{MAX}_r > \hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \,\middle|\, \mathcal{F}_{\tau_j}\right)\right]. \qquad (8)
\end{aligned}$$

The following anti-concentration bound can be derived for a Gaussian r.v. Z with mean µ and standard deviation σ, using Formula 7.1.13 from [Abramowitz and Stegun 1964]:

$$\Pr(Z > \mu + x\sigma) \;\ge\; \frac{1}{\sqrt{2\pi}}\,\frac{x}{x^2+1}\,e^{-x^2/2}.$$

For any instantiation Fτj of Fτj, since Θj is a Gaussian $\mathcal{N}\!\left(\hat{\mu}_1, \frac{1}{j+1}\right)$ distributed r.v., this gives

$$\begin{aligned}
\Pr\left(\mathrm{MAX}_r > \hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \,\middle|\, \mathcal{F}_{\tau_j}=F_{\tau_j}\right) &\ge 1 - \left(1 - \frac{1}{\sqrt{2\pi}}\,\frac{z}{z^2+1}\,e^{-z^2/2}\right)^r\\
&= 1 - \left(1 - \frac{1}{\sqrt{2\pi}}\,\frac{\sqrt{\ln r}}{\ln r + 1}\,\frac{1}{\sqrt{r}}\right)^r\\
&\ge 1 - e^{-\frac{r}{\sqrt{4\pi r \ln r}}}.
\end{aligned}$$

Observe that for large r (in particular, for any r ≥ e¹¹), $e^{-\sqrt{\frac{r}{2\pi\ln r}}} \le \frac{1}{r^2}$. Therefore, for any r ≥ e¹¹,

$$\Pr\left(\mathrm{MAX}_r > \hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \,\middle|\, \mathcal{F}_{\tau_j}=F_{\tau_j}\right) \;\ge\; 1 - \frac{1}{r^2}. \qquad (9)$$

Substituting back in (8), we have for any r ≥ e¹¹,

$$\Pr(G_j < r) \;\ge\; E\left[I\!\left(\hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \ge y_i\right)\left(1 - \frac{1}{r^2}\right)\right] = \left(1 - \frac{1}{r^2}\right)\Pr\left(\hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \ge y_i\right). \qquad (10)$$

Next, we apply the Chernoff-Hoeffding bounds to lower bound the probability $\Pr\left(\hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \ge y_i\right)$. Recall from Definition 2.2 that $\hat{\mu}_1(t) = \frac{\sum_{s=1,\ldots,t-1:\,i(s)=1} r_1(s)}{k_1(t)+1}$. Using the Chernoff bounds for t = τj + 1, k1(t) = j:

$$\Pr\left(\hat{\mu}_1 + \frac{1}{j+1} + \frac{x}{\sqrt{j+1}} \ge \mu_1\right) \;\ge\; 1 - e^{-2x^2}.$$

Here, the term 1/(j + 1) was added to µ̂1 to adjust for the fact that µ̂1 is not simply the average of the past j observations; instead, it is the sum of the past j observations divided by j + 1 (Definition 2.2). Now, we use $x := z - \frac{1}{\sqrt{j+1}} \ge z - 1$ to get

$$\Pr\left(\hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \ge \mu_1\right) \;\ge\; 1 - e^{-2(z-1)^2} \;\ge\; 1 - e^{-2z^2+4z} \;\ge\; 1 - \frac{1}{r^2}\,e^{4\sqrt{\ln r}}.$$

Using yi ≤ µ1, this gives

$$\Pr\left(\hat{\mu}_1 + \frac{z}{\sqrt{j+1}} \ge y_i\right) \;\ge\; 1 - \frac{1}{r^2}\,e^{4\sqrt{\ln r}}. \qquad (11)$$

Observe that for large r (in particular, for any r ≥ e⁶⁴), $\frac{1}{r^2}e^{4\sqrt{\ln r}} \le \frac{1}{r^{1.5}}$. Therefore, substituting, for any r ≥ e⁶⁴,

$$\Pr(G_j < r) \;\ge\; 1 - \frac{1}{r^2} - \frac{1}{r^{1.5}}.$$

This gives

$$E[G_j] = \sum_{r=0}^{T}\Pr(G_j \ge r) \;\le\; e^{64} + \sum_{r\ge 1}\left(\frac{1}{r^2} + \frac{1}{r^{1.5}}\right) \;\le\; e^{64} + 2 + 2.7.$$

This proves a constant bound on E[Gj] for all j.

Next, we derive a tighter bound for large j. Consider j > 4Li(T). Since Θj is a $\mathcal{N}\!\left(\hat{\mu}_1(\tau_j+1), \frac{1}{j+1}\right)$ distributed random variable, for j > 4Li(T), using the upper bound in Fact 4 for z = ∆i/6, we obtain for any instantiation Fτj of the history Fτj,

$$\Pr\left(\Theta_j > \hat{\mu}_1(\tau_j+1) - \frac{\Delta_i}{6} \,\middle|\, \mathcal{F}_{\tau_j}=F_{\tau_j}\right) \;\ge\; 1 - \frac{1}{2}e^{-4L_i(T)\Delta_i^2/72} \;\ge\; 1 - \frac{1}{T\Delta_i^2}.$$

Define At−1 as the event that $\hat{\mu}_1(t) - \frac{\Delta_i}{6} \ge y_i := \mu_1 - \frac{\Delta_i}{3}$. Note that whether this event is true or not is determined by the history Ft−1. Now, consider instantiations Fτj of Fτj such that Aτj is true. For such Fτj, from the above we have

$$\Pr\big(\Theta_j > y_i \mid \mathcal{F}_{\tau_j}=F_{\tau_j}\big) \;\ge\; 1 - \frac{1}{T\Delta_i^2}.$$

Let us use the notation $\mathcal{F}_{\tau_j}/A_{\tau_j}$ to indicate the random variable Fτj conditioned on Aτj being true. Then,

$$E\left[\frac{1}{p_{i,\tau_j+1}}\right] = E\left[\frac{1}{\Pr(\Theta_j > y_i \mid \mathcal{F}_{\tau_j})}\right] \;\le\; E\left[\frac{1}{\Pr\big(\Theta_j > y_i \mid \mathcal{F}_{\tau_j}/A_{\tau_j}\big)\Pr(A_{\tau_j})}\right] \;\le\; E\left[\frac{1}{\left(1-\frac{1}{T\Delta_i^2}\right)\Pr(A_{\tau_j})}\right].$$

Now, for any t ≥ τj + 1, we have k1(t) ≥ j ≥ 4Li(T), and using the Chernoff-Hoeffding bounds (Fact 2), we get

$$\Pr\left(\hat{\mu}_1(t) \ge \mu_1 - \frac{\Delta_i}{6}\right) \;\ge\; 1 - e^{-2k_1(t)\Delta_i^2/36} \;\ge\; 1 - \frac{1}{T\Delta_i^2},$$

giving

$$\Pr(A_{\tau_j}) \;\ge\; 1 - \frac{1}{T\Delta_i^2}.$$

Substituting, we get, for j ≥ 4Li(T),

$$E\left[\frac{1}{p_{i,\tau_j+1}}\right] - 1 \;\le\; \frac{1}{\left(1-\frac{1}{T\Delta_i^2}\right)^2} - 1 \;\le\; \frac{4}{T\Delta_i^2}.$$

Substituting the bound from Lemma 2.13 into (4), we obtain the following bound on the first term on the right hand side of (2).

LEMMA 2.14.

$$\sum_{t=1}^{T}\Pr\big(i(t)=i, E^{\mu}_i(t), E^{\theta}_i(t)\big) \;\le\; (e^{64}+4)\,\big(4L_i(T)\big) + \frac{4}{\Delta_i^2}.$$

The proof of Lemma 2.11 can be easily adapted to Gaussian priors, so this lemma holds as is for this case. Here, $x_i = \mu_i + \frac{\Delta_i}{3}$; therefore, using Pinsker's inequality, $d(x_i,\mu_i) \ge 2(x_i-\mu_i)^2 = \frac{2\Delta_i^2}{9}$.

LEMMA 2.15.

$$\sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\mu}_i(t)}\big) \;\le\; \frac{1}{d(x_i,\mu_i)} + 1 \;\le\; \frac{9}{2\Delta_i^2} + 1.$$

Corresponding to Lemma 2.12, we prove the following lemma.

LEMMA 2.16.

$$\sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big) \;\le\; L_i(T) + \frac{1}{\Delta_i^2},$$

where $L_i(T) = \frac{18\ln(T\Delta_i^2)}{\Delta_i^2}$.

PROOF. The proof of this lemma is similar to the proof of Lemma 2.12. We decompose each summand into two parts, based on whether or not ki(t) is large (ki(t) > Li(T)):

$$\begin{aligned}
\sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big) &= \sum_{t=1}^{T}\Pr\big(i(t)=i,\ k_i(t)\le L_i(T),\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big)\\
&\quad + \sum_{t=1}^{T}\Pr\big(i(t)=i,\ k_i(t)> L_i(T),\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big). \qquad (12)
\end{aligned}$$

The first term in the above decomposition is bounded by $E\left[\sum_{t=1}^{T} I\big(i(t)=i,\ k_i(t)\le L_i(T)\big)\right]$, which is bounded trivially by Li(T). What remains is to bound the second term by 1/∆i². To this end, we show that if ki(t) is large and the event Eµi(t) is satisfied, then the probability that the event Eθi(t) is violated is small. Recall that Eθi(t) is defined as the event that θi(t) ≤ yi, and Eµi(t) is the event that µ̂i(t) ≤ xi. Then,

$$\begin{aligned}
\sum_{t=1}^{T}\Pr\big(i(t)=i,\ k_i(t)> L_i(T),\ \overline{E^{\theta}_i(t)},\ E^{\mu}_i(t)\big) &\le E\left[\sum_{t=1}^{T}\Pr\big(i(t)=i,\ \overline{E^{\theta}_i(t)} \mid k_i(t)> L_i(T),\ E^{\mu}_i(t),\ \mathcal{F}_{t-1}\big)\right]\\
&\le E\left[\sum_{t=1}^{T}\Pr\big(\theta_i(t)> y_i \mid k_i(t)> L_i(T),\ \hat{\mu}_i(t)\le x_i,\ \mathcal{F}_{t-1}\big)\right].
\end{aligned}$$

Now, θi(t) is a $\mathcal{N}\!\left(\hat{\mu}_i(t), \frac{1}{k_i(t)+1}\right)$ distributed Gaussian random variable. An N(m, σ²) distributed r.v. (i.e., a Gaussian random variable with mean m and variance σ²) is stochastically dominated by an N(m′, σ²) distributed r.v. if m′ ≥ m. Therefore, given µ̂i(t) ≤ xi, the distribution of θi(t) is stochastically dominated by $\mathcal{N}\!\left(x_i, \frac{1}{k_i(t)+1}\right)$. That is,

$$\Pr\big(\theta_i(t)> y_i \mid k_i(t)> L_i(T),\ \hat{\mu}_i(t)\le x_i,\ \mathcal{F}_{t-1}\big) \;\le\; \Pr\!\left(\mathcal{N}\!\left(x_i, \tfrac{1}{k_i(t)+1}\right) > y_i \,\middle|\, \mathcal{F}_{t-1},\ k_i(t)> L_i(T)\right).$$

Here we slightly abuse notation for readability: Pr(N(m, σ²) > yi) denotes the probability that a random variable distributed as N(m, σ²) takes a value greater than yi.

Now, using the concentration of the Gaussian distribution (Fact 4), we obtain that for any fixed ki(t) > Li(T),

$$\Pr\!\left(\mathcal{N}\!\left(x_i, \tfrac{1}{k_i(t)+1}\right) > y_i\right) \;\le\; \frac{1}{2}e^{-\frac{(k_i(t)+1)(y_i-x_i)^2}{2}} \;\le\; \frac{1}{2}e^{-\frac{L_i(T)(y_i-x_i)^2}{2}},$$

which is smaller than $\frac{1}{T\Delta_i^2}$ because $L_i(T) = \frac{2\ln(T\Delta_i^2)}{(y_i-x_i)^2}$. Substituting, we get

$$\Pr\big(\theta_i(t)> y_i \mid k_i(t)> L_i(T),\ \hat{\mu}_i(t)\le x_i,\ \mathcal{F}_{t-1}\big) \;\le\; \frac{1}{T\Delta_i^2}.$$

Summing over t = 1, . . . , T, we get a bound of 1/∆i² on the second term in (12), completing the proof of the lemma.

Substituting the bounds from Lemmas 2.14-2.16 in (2), we get

$$E[k_i(T)] \;\le\; (e^{64}+4)\,\frac{72\ln(T\Delta_i^2)}{\Delta_i^2} + \frac{4}{\Delta_i^2} + \frac{18\ln(T\Delta_i^2)}{\Delta_i^2} + \frac{1}{\Delta_i^2} + \frac{9}{2\Delta_i^2} + 1.$$

Thus the expected regret due to arm i is upper bounded by $\Delta_i E[k_i(T)] \le \frac{19}{2\Delta_i} + (e^{64}+5)\,\frac{72\ln(T\Delta_i^2)}{\Delta_i} + \Delta_i$. The above is decreasing in ∆i for $\Delta_i \ge \frac{e}{\sqrt{T}}$. Therefore, for every arm i with $\Delta_i \ge e\sqrt{\frac{N\ln N}{T}}$, the expected regret is bounded by

$$O\!\left(\sqrt{\frac{T\ln N}{N}} + 1\right).$$


For arms with $\Delta_i \le e\sqrt{\frac{N\ln N}{T}}$, the total regret is bounded by $e\sqrt{NT\ln N}$. This bounds the total regret by $O(N + \sqrt{NT\ln N})$, or $O(\sqrt{NT\ln N})$ assuming T ≥ N. This proves Theorem 1.3.

3. PROOF OF THE LOWER BOUND

In this section we prove Theorem 1.4. To this end, we construct a problem instance such that TS has regret $\Omega(\sqrt{NT\ln N})$ at time T. Let each arm i, when played, produce a reward of µi. That is, the reward distribution of every arm is a one-point distribution. Set $\mu_1 := \Delta := \sqrt{\frac{N\ln N}{T}}$, and µ2 := 0, . . . , µN := 0.
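To make the construction concrete, the following is a small simulation sketch of this deterministic-reward instance under TS with Gaussian priors (our own illustrative code; it is only an empirical illustration, not part of the proof):

```python
import math
import random

# Lower-bound instance: arm 1 pays mu_1 = sqrt(N ln N / T) deterministically, all other arms pay 0.
N, T = 20, 100000
delta = math.sqrt(N * math.log(N) / T)
mu = [delta] + [0.0] * (N - 1)

# Run Algorithm 2 (TS with Gaussian priors) on this instance.
k = [0] * N
reward_sum = [0.0] * N
regret = 0.0
for t in range(T):
    theta = [random.gauss(reward_sum[i] / (k[i] + 1), math.sqrt(1.0 / (k[i] + 1))) for i in range(N)]
    i_t = max(range(N), key=lambda i: theta[i])
    reward_sum[i_t] += mu[i_t]   # deterministic (one-point) reward
    k[i_t] += 1
    if i_t != 0:
        regret += delta          # every play of a suboptimal arm costs Delta

print(regret, math.sqrt(N * T * math.log(N)))  # regret should be on the order of sqrt(NT ln N)
```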

Note that µ̂i(t), i ≠ 1, will always be 0, since µ̂i(1) = 0 and these arms always produce reward 0 when played. For arm 1, $\hat{\mu}_1(t) = \frac{k_1(t)\mu_1}{k_1(t)+1} \le \mu_1$. Every time an arm other than arm 1 is played, there is a regret of ∆. Let Ft−1 represent the history of plays and outcomes until time t as defined earlier, which includes ki(t), µ̂i(t), i = 1, . . . , N. Define At−1 as the event that $\sum_{i\ne 1} k_i(t) \le \frac{c\sqrt{NT\ln N}}{\Delta}$ for a fixed constant c (to be specified later). Note that whether the event At−1 is true is determined by Ft−1.

Now, if At−1 is not true, then the regret until time t is at least $c\sqrt{NT\ln N}$. Therefore, for any t ≤ T we can assume that $\Pr(A_{t-1}) \ge \frac{1}{2}$; otherwise, the expected regret until time t is

$$E[R(t)] \;\ge\; E\big[R(t) \mid \overline{A_{t-1}}\big]\cdot\frac{1}{2} \;\ge\; \frac{1}{2}\,c\sqrt{NT\ln N} \;=\; \Omega\big(\sqrt{NT\ln N}\big).$$

We will show that, given any instantiation of the history Ft−1 such that the event At−1 is true, the probability of playing a suboptimal arm is at least a constant, so that the regret is $\Omega(T\Delta) = \Omega(\sqrt{NT\ln N})$. For this, we show that with constant probability, θ1(t) will be smaller than µ1, and θi(t) for some suboptimal arm i will be larger than µ1.

Now, given any history Ft−1 with any value of k1(t), θ1(t) is a Gaussian r.v. with mean $\hat{\mu}_1(t) = \frac{k_1(t)\mu_1}{k_1(t)+1} \le \mu_1$; therefore, by the symmetry of the Gaussian distribution,

$$\Pr\big(\theta_1(t) \le \mu_1 \mid \mathcal{F}_{t-1}\big) \;\ge\; \frac{1}{2}.$$

Also, given any instantiation Ft−1 = Ft−1, the θi(t)'s for i ≠ 1 are independent Gaussian distributed random variables with mean 0 and variance 1/(ki(t) + 1); therefore, using the anti-concentration inequality for Gaussian random variables provided by Fact 4, we get

$$\begin{aligned}
\Pr\big(\exists i \ne 1,\ \theta_i(t) > \mu_1 \mid \mathcal{F}_{t-1}=F_{t-1}\big) &= \Pr\big(\exists i \ne 1,\ (\theta_i(t)-0)\sqrt{k_i(t)+1} > \Delta\sqrt{k_i(t)+1} \mid \mathcal{F}_{t-1}=F_{t-1}\big)\\
&\ge 1 - \prod_{i\ne 1}\left(1 - \frac{1}{8\sqrt{\pi}}\,e^{-(k_i(t)+1)\frac{7\Delta^2}{2}}\right).
\end{aligned}$$

Now, given an instantiation Ft−1 of Ft−1 such that At−1 is true, we have $\sum_{i\ne 1} k_i(t) \le \frac{c\sqrt{NT\ln N}}{\Delta}$, so that the right hand side in the above inequality is minimized when $k_i(t) = \frac{c\sqrt{NT\ln N}}{(N-1)\Delta}$ for all i ≠ 1. Then, substituting $\Delta = \sqrt{\frac{N\ln N}{T}}$ and choosing the constant c appropriately, we get

$$\Pr\big(\exists i,\ \theta_i(t) > \mu_1 \mid \mathcal{F}_{t-1}=F_{t-1}\big) \;\ge\; 1 - \prod_{i\ne 1}\left(1 - e^{-\ln N}\right) \;=\; 1 - \left(1 - \frac{1}{N}\right)^{N-1}$$


for any Ft−1 such that At−1 is true.

Let us use the notation $\mathcal{F}_{t-1}/A_{t-1}$ to indicate the random variable Ft−1 conditioned on At−1 being true. Then, to summarize, for any t, the probability of playing a suboptimal arm at time t satisfies

$$\begin{aligned}
\Pr\big(\exists i \ne 1,\ i(t)=i\big) &\ge \Pr\big(\theta_1(t) \le \mu_1,\ \exists i,\ \theta_i(t) > \mu_1\big)\\
&= E\Big[\Pr\big(\theta_1(t) \le \mu_1,\ \exists i,\ \theta_i(t) > \mu_1 \mid \mathcal{F}_{t-1}\big)\Big]\\
&\ge E\Big[\Pr\big(\theta_1(t) \le \mu_1,\ \exists i,\ \theta_i(t) > \mu_1 \mid \mathcal{F}_{t-1}/A_{t-1}\big)\Big]\cdot\Pr(A_{t-1})\\
&= E\Big[\Pr\big(\theta_1(t) \le \mu_1 \mid \mathcal{F}_{t-1}/A_{t-1}\big)\cdot\Pr\big(\exists i,\ \theta_i(t) > \mu_1 \mid \mathcal{F}_{t-1}/A_{t-1}\big)\Big]\cdot\Pr(A_{t-1})\\
&\ge \frac{1}{2}\cdot\left(1 - \left(1 - \frac{1}{N}\right)^{N-1}\right)\cdot\frac{1}{2} \;\ge\; p,
\end{aligned}$$

for some constant p ∈ (0, 1). Therefore the regret in time T is at least $Tp\Delta = \Omega(\sqrt{NT\ln N})$. This proves Theorem 1.4.

Conclusions. In this paper, we proved optimal problem-dependent regret bounds for Thompson Sampling for the stochastic MAB problem with Bernoulli arms. Further, we provided near-optimal problem-dependent and problem-independent regret bounds for the general MAB problem with bounded rewards. Specifically, our technique yields the first problem-independent regret upper bound of $O(\sqrt{NT\ln T})$ for the version of TS with Beta priors, and an upper bound of $O(\sqrt{NT\ln N})$ for the version of TS with Gaussian priors along with a matching lower bound. The availability of strong anti-concentration bounds for the Gaussian distribution allowed us to derive these tight upper and lower bounds for the version of TS with Gaussian priors; a similar lower bound may exist for TS with Beta priors.

In addition to near-optimal regret bounds, an important contribution of this paper is a simple proof technique that is easily adapted to provide optimal or near-optimal problem-dependent and problem-independent bounds, and to handle different prior distributions. The basic techniques presented in this work have also been adapted to prove Thompson Sampling regret bounds for the contextual bandits problem in subsequent work [Agrawal and Goyal 2013b].

Acknowledgement. We thank the anonymous referees for careful reading and suggestions that have improved the presentation.


REFERENCES

M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, 1964.

S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), 2012.

S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson Sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2013a.

S. Agrawal and N. Goyal. Thompson Sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013b.

J.-Y. Audibert and S. Bubeck. Minimax Policies for Adversarial and Stochastic Bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. CoRR, 2012.

S. Bubeck and C. Liu. Prior-free and prior-dependent regret bounds for Thompson Sampling. In 48th Annual Conference on Information Sciences and Systems, CISS 2014, Princeton, NJ, USA, March 19-21, 2014, pages 1-9, 2014.

O. Chapelle and L. Li. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems (NIPS) 24, pages 2249-2257, 2011.

A. Garivier and O. Cappé. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011.

J. C. Gittins. Multi-armed Bandit Allocation Indices. Wiley Interscience Series in Systems and Optimization. John Wiley and Son, 1989.

T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 13-20, 2010.

O.-C. Granmo. Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton. International Journal of Intelligent Computing and Cybernetics (IJICC), 3(2):207-234, 2010.

E. Jerabek. Dual weak pigeonhole principle, Boolean complexity, and derandomization. Annals of Pure and Applied Logic, 129(1-3):1-37, October 2004.

E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian Upper Confidence Bounds for Bandit Problems. In Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2012a.

E. Kaufmann, N. Korda, and R. Munos. Thompson Sampling: An Optimal Finite Time Analysis. In International Conference on Algorithmic Learning Theory (ALT), 2012b.

T. Kocak, M. Valko, R. Munos, and S. Agrawal. Spectral Thompson Sampling. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1911-1917, 2014.

N. Korda, E. Kaufmann, and R. Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems (NIPS) 26, pages 1448-1456, 2013.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.


L. Li. Generalized Thompson Sampling for Contextual Bandits. CoRR, abs/1310.7163, 2013. URL http://arxiv.org/abs/1310.7163.

L. Li and O. Chapelle. Open Problem: Regret Bounds for Thompson Sampling. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), 2012.

O.-A. Maillard, R. Munos, and G. Stoltz. Finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011.

B. C. May and D. S. Leslie. Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:02, Statistics Group, Department of Mathematics, University of Bristol, 2011.

B. C. May, N. Korda, A. Lee, and D. S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:01, Statistics Group, Department of Mathematics, University of Bristol, 2011.

P. A. Ortega and D. A. Braun. A Minimum Relative Entropy Principle for Learning and Acting. Journal of Artificial Intelligence Research, 38:475–511, 2010.

D. Russo and B. Van Roy. Learning to Optimize via Posterior Sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

D. Russo and B. Van Roy. An Information-Theoretic Analysis of Thompson Sampling. Journal of Machine Learning Research (to appear), 2015.

D. Russo, I. Osband, and B. Van Roy. (More) Efficient Reinforcement Learning via Posterior Sampling. In Advances in Neural Information Processing Systems (NIPS) 26, 2013.

S. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.

M. J. A. Strens. A Bayesian Framework for Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 943–950, 2000.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

J. Wyatt. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1997.


A. SOME WELL-KNOWN INEQUALITIES

FACT 1 (CHERNOFF–HOEFFDING BOUND). Let $X_1, \ldots, X_n$ be independent 0-1 r.v.s with $E[X_i] = p_i$ (not necessarily equal). Let $X = \frac{1}{n}\sum_i X_i$ and $\mu = E[X] = \frac{1}{n}\sum_{i=1}^n p_i$. Then, for any $0 < \lambda < 1 - \mu$,
\[
\Pr(X \ge \mu + \lambda) \le \exp\bigl(-n\, d(\mu+\lambda, \mu)\bigr),
\]
and, for any $0 < \lambda < \mu$,
\[
\Pr(X \le \mu - \lambda) \le \exp\bigl(-n\, d(\mu-\lambda, \mu)\bigr),
\]
where $d(a, b) = a \ln\frac{a}{b} + (1-a)\ln\frac{1-a}{1-b}$.

FACT 2 (CHERNOFF–HOEFFDING BOUND). Let $X_1, \ldots, X_n$ be random variables with common range $[0,1]$ and such that $E[X_t \mid X_1, \ldots, X_{t-1}] = \mu$. Let $S_n = X_1 + \cdots + X_n$. Then, for all $a \ge 0$,
\[
\Pr(S_n \ge n\mu + a) \le e^{-2a^2/n}, \qquad \Pr(S_n \le n\mu - a) \le e^{-2a^2/n}.
\]
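As a quick numerical sanity check (not part of the original analysis; all parameter values below are arbitrary), the following script estimates the tail probabilities in Facts 1 and 2 by Monte Carlo and prints them next to the corresponding bounds.

\begin{verbatim}
# Monte Carlo illustration of Facts 1 and 2 (arbitrary parameter choices).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 200000

# Fact 1: independent 0-1 variables with unequal means p_i.
p = np.linspace(0.2, 0.7, n)
mu, lam = p.mean(), 0.1
X = (rng.random((trials, n)) < p).mean(axis=1)   # column i is Bernoulli(p_i)

def d(a, b):
    # d(a, b) = a ln(a/b) + (1-a) ln((1-a)/(1-b))
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

print("Fact 1:", (X >= mu + lam).mean(), "<=", np.exp(-n * d(mu + lam, mu)))
print("       ", (X <= mu - lam).mean(), "<=", np.exp(-n * d(mu - lam, mu)))

# Fact 2: [0,1]-valued variables with conditional mean mu2 (here i.i.d. uniforms).
mu2, a = 0.5, 5.0
S = rng.random((trials, n)).sum(axis=1)
print("Fact 2:", (S >= n * mu2 + a).mean(), "<=", np.exp(-2 * a**2 / n))
print("       ", (S <= n * mu2 - a).mean(), "<=", np.exp(-2 * a**2 / n))
\end{verbatim}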

FACT 3.
\[
F^{beta}_{\alpha,\beta}(y) = 1 - F^{B}_{\alpha+\beta-1,\,y}(\alpha-1),
\]
for all positive integers $\alpha, \beta$.
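Fact 3 is the standard relation between the Beta and Binomial cdfs. Reading $F^{beta}_{\alpha,\beta}$ as the cdf of the Beta$(\alpha,\beta)$ distribution and $F^{B}_{n,y}$ as the cdf of the Binomial$(n,y)$ distribution (these readings match how the symbols are used in the proof of Lemma 2.9 below), it can be spot-checked numerically, e.g.:

\begin{verbatim}
# Numerical spot check of Fact 3 at a few arbitrary (alpha, beta, y) triples.
from scipy.stats import beta, binom

for a, b, y in [(1, 1, 0.3), (2, 5, 0.5), (7, 3, 0.42), (10, 10, 0.9)]:
    lhs = beta.cdf(y, a, b)                      # F^beta_{a,b}(y)
    rhs = 1.0 - binom.cdf(a - 1, a + b - 1, y)   # 1 - F^B_{a+b-1, y}(a - 1)
    print(a, b, y, lhs, rhs)
    assert abs(lhs - rhs) < 1e-9
\end{verbatim}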

Formula 7.1.13 from [Abramowitz and Stegun 1964] can be used to derive the following concentration inequality for Gaussian distributed random variables.

FACT 4. [Abramowitz and Stegun 1964] For a Gaussian distributed random variable $Z$ with mean $m$ and variance $\sigma^2$, for any $z$,
\[
\frac{1}{4\sqrt{\pi}}\, e^{-7z^2/2} \;<\; \Pr\bigl(|Z - m| > z\sigma\bigr) \;\le\; \frac{1}{2}\, e^{-z^2/2}.
\]
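The anti-concentration (left) inequality of Fact 4 is the direction exploited in the lower-bound argument for Gaussian priors. As a purely numerical illustration (not from the original text), the snippet below compares it with the exact two-sided Gaussian tail at a few values of $z$:

\begin{verbatim}
# Compare the exact tail Pr(|Z - m| > z*sigma) = 2*Q(z) with the
# anti-concentration lower bound (1/(4*sqrt(pi))) * exp(-7*z^2/2).
import math
from scipy.stats import norm

for z in [0.5, 1.0, 1.5, 2.0, 3.0]:
    tail = 2 * norm.sf(z)
    lower = math.exp(-7 * z * z / 2) / (4 * math.sqrt(math.pi))
    print(f"z={z}: {lower:.3e} < {tail:.3e}")
    assert lower < tail
\end{verbatim}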

B. THOMPSON SAMPLING WITH BETA DISTRIBUTION

Proof of Lemma 2.9

Let $k_1(t) = j$ and $S_1(t) = s$. Let $y = y_i$. Then, $p_{i,t} = \Pr(\theta_1(t) > y) = F^B_{j+1,y}(s)$ (using Fact 3).

Let $\tau_j + 1$ denote the time step after the $j$th play of arm 1. Then, $k_1(\tau_j + 1) = j$, and
\[
E\left[\frac{1}{p_{i,\tau_j+1}}\right] = \sum_{s=0}^{j} \frac{f_{j,\mu_1}(s)}{F^B_{j+1,y}(s)}.
\]
Let $\Delta' = \mu_1 - y$. In the derivation below, we abbreviate $F^B_{j+1,y}(s)$ as $F_{j+1,y}(s)$.


Case $j < \frac{8}{\Delta'}$. Let $R = \frac{\mu_1(1-y)}{y(1-\mu_1)}$ and $D = y\ln\frac{y}{\mu_1} + (1-y)\ln\frac{1-y}{1-\mu_1}$. Note that $\mu_1 \ge y$, so that $R \ge 1$.
\begin{align*}
\sum_{s=0}^{j} \frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)} &\le \frac{1}{1-y}\sum_{s=0}^{j} \frac{f_{j,\mu_1}(s)}{F_{j,y}(s)}\\
&\le \frac{1}{1-y}\sum_{s=0}^{\lfloor yj\rfloor} \frac{f_{j,\mu_1}(s)}{f_{j,y}(s)} + \frac{1}{1-y}\sum_{s=\lceil yj\rceil}^{j} 2 f_{j,\mu_1}(s)\\
&= \frac{1}{1-y}\sum_{s=0}^{\lfloor yj\rfloor} R^s\, \frac{(1-\mu_1)^j}{(1-y)^j} + \frac{1}{1-y}\sum_{s=\lceil yj\rceil}^{j} 2 f_{j,\mu_1}(s)\\
&= \frac{1}{1-y}\left(\frac{R^{\lfloor yj\rfloor+1}-1}{R-1}\right)\frac{(1-\mu_1)^j}{(1-y)^j} + \frac{1}{1-y}\sum_{s=\lceil yj\rceil}^{j} 2 f_{j,\mu_1}(s)\\
&\le \frac{1}{1-y}\left(\frac{R}{R-1}\right) R^{yj}\, \frac{(1-\mu_1)^j}{(1-y)^j} + \frac{2}{\Delta'}\\
&= \frac{\mu_1}{\Delta'}\, e^{-Dj} + \frac{2}{\Delta'}\\
&\le \frac{3}{\Delta'}. \tag{13}
\end{align*}
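As an illustrative spot check of (13) (with one arbitrary choice of $\mu_1$ and $y$, not taken from the paper), the snippet below evaluates the sum exactly with scipy and verifies that it stays below $3/\Delta'$ throughout the regime $j < 8/\Delta'$:

\begin{verbatim}
# Spot check of (13): sum_s f_{j,mu1}(s) / F^B_{j+1,y}(s) <= 3/Delta'
# for j < 8/Delta', at one arbitrary choice of mu1 and y.
import numpy as np
from scipy.stats import binom

mu1, y = 0.6, 0.5
delta_p = mu1 - y                       # Delta' = mu1 - y
for j in range(1, int(8 / delta_p)):    # regime j < 8/Delta'
    s = np.arange(j + 1)
    total = np.sum(binom.pmf(s, j, mu1) / binom.cdf(s, j + 1, y))
    assert total <= 3 / delta_p, (j, total)
print("(13) verified for mu1 =", mu1, ", y =", y, ", all j <", int(8 / delta_p))
\end{verbatim}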

Case $j \ge \frac{8}{\Delta'}$. We will divide the sum $\mathrm{Sum}(0, j) = \sum_{s=0}^{j} \frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)}$ into four partial sums and prove that
\begin{align*}
\mathrm{Sum}(0, \lfloor yj\rfloor - 1) &\le \Theta\left(e^{-Dj}\,\frac{1}{j+1}\cdot\frac{1}{\Delta'^2}\right) + \Theta\bigl(e^{-2\Delta'^2 j}\bigr),\\
\mathrm{Sum}(\lfloor yj\rfloor, \lfloor yj\rfloor) &\le 3e^{-Dj},\\
\mathrm{Sum}\Bigl(\lceil yj\rceil, \bigl\lfloor \mu_1 j - \tfrac{\Delta'}{2} j\bigr\rfloor\Bigr) &\le \Theta\bigl(e^{-\Delta'^2 j/2}\bigr),\\
\mathrm{Sum}\Bigl(\bigl\lceil \mu_1 j - \tfrac{\Delta'}{2} j\bigr\rceil, j\Bigr) &\le 1 + \frac{1}{e^{\Delta'^2 j/4} - 1}.
\end{align*}

Together, the above estimates will prove the required bound. We use the following bounds on the cdf of the Binomial distribution [Jerabek 2004, Prop. A.4]. For $s \le y(j+1) - \sqrt{(j+1)y(1-y)}$,
\[
F_{j+1,y}(s) = \Theta\left(\frac{y(j+1-s)}{y(j+1)-s}\binom{j+1}{s} y^s (1-y)^{j+1-s}\right). \tag{14}
\]
For $s \ge y(j+1) - \sqrt{(j+1)y(1-y)}$,
\[
F_{j+1,y}(s) = \Theta(1). \tag{15}
\]


Bounding $\mathrm{Sum}(0, \lfloor yj\rfloor - 1)$. Using the bounds just given, for any $s$,
\begin{align*}
\frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)} &\le \Theta\left(\frac{f_{j,\mu_1}(s)}{\frac{y(j+1-s)}{y(j+1)-s}\binom{j+1}{s} y^s (1-y)^{j+1-s}}\right) + \Theta(1)\, f_{j,\mu_1}(s)\\
&= \Theta\left(\left(1 - \frac{s}{y(j+1)}\right) R^s\, \frac{(1-\mu_1)^j}{(1-y)^{j+1}}\right) + \Theta(1)\, f_{j,\mu_1}(s).
\end{align*}
This gives
\[
\mathrm{Sum}(0, \lfloor yj\rfloor - 1) \le \Theta\left(\frac{(1-\mu_1)^j}{(1-y)^{j+1}}\sum_{s=0}^{\lfloor yj\rfloor-1}\left(1 - \frac{s}{y(j+1)}\right) R^s\right) + \Theta(1)\sum_{s=0}^{\lfloor yj\rfloor-1} f_{j,\mu_1}(s). \tag{16}
\]

We now bound the first expression on the RHS.
\begin{align*}
\frac{(1-\mu_1)^j}{(1-y)^{j+1}}\sum_{s=0}^{\lfloor yj\rfloor-1}\left(1 - \frac{s}{y(j+1)}\right) R^s
&= \frac{(1-\mu_1)^j}{(1-y)^{j+1}}\left(\frac{R^{\lfloor yj\rfloor}-1}{R-1} - \frac{1}{y(j+1)}\left(\frac{(\lfloor yj\rfloor - 1)R^{\lfloor yj\rfloor}}{R-1} - \frac{R^{\lfloor yj\rfloor}-R}{(R-1)^2}\right)\right)\\
&\le \frac{(1-\mu_1)^j}{(1-y)^{j+1}}\left(\frac{1}{y(j+1)}\cdot\frac{R^{\lfloor yj\rfloor}}{(R-1)^2} + \frac{y(j+1)-\lfloor yj\rfloor + 1}{y(j+1)}\cdot\frac{R^{\lfloor yj\rfloor}}{R-1}\right)\\
&\le \frac{(1-\mu_1)^j}{(1-y)^{j+1}}\cdot\frac{3}{y(j+1)}\cdot\frac{R^{\lfloor yj\rfloor+1}}{(R-1)^2}\\
&\le e^{-Dj}\,\frac{3}{y(1-y)(j+1)}\cdot\frac{R}{(R-1)^2}.
\end{align*}

The last inequality uses
\[
\frac{(1-\mu_1)^j}{(1-y)^j}\, R^{\lfloor yj\rfloor} \;\le\; \frac{(1-\mu_1)^j}{(1-y)^j}\, R^{yj} \;=\; e^{-Dj}.
\]

Now, $R - 1 = \frac{\mu_1(1-y)}{y(1-\mu_1)} - 1 = \frac{\mu_1-y}{y(1-\mu_1)}$, and $\frac{R}{R-1} = \frac{\mu_1(1-y)}{\mu_1-y}$. Therefore,
\[
\frac{1}{y(1-y)(j+1)}\cdot\frac{R}{(R-1)^2} \;=\; \frac{1}{y(1-y)(j+1)}\cdot\frac{\mu_1(1-y)}{\mu_1-y}\cdot\frac{y(1-\mu_1)}{\mu_1-y} \;=\; \frac{1}{j+1}\cdot\frac{\mu_1(1-\mu_1)}{(\mu_1-y)^2}.
\]

Substituting, we get
\[
\frac{(1-\mu_1)^j}{(1-y)^{j+1}}\sum_{s=0}^{\lfloor yj\rfloor-1}\left(1 - \frac{s}{y(j+1)}\right) R^s \;\le\; 3\, e^{-Dj}\,\frac{1}{j+1}\cdot\frac{\mu_1(1-\mu_1)}{(\mu_1-y)^2}.
\]


Substituting in (16),
\begin{align*}
\mathrm{Sum}(0, \lfloor yj\rfloor - 1) &\le \Theta\left(e^{-Dj}\,\frac{1}{j+1}\cdot\frac{1}{\Delta'^2}\right) + \Theta(1)\sum_{s=0}^{\lfloor yj\rfloor-1} f_{j,\mu_1}(s)\\
&\le \Theta\left(e^{-Dj}\,\frac{1}{j+1}\cdot\frac{1}{\Delta'^2}\right) + \Theta\bigl(e^{-2(\mu_1-y)^2 j}\bigr),
\end{align*}
where the last inequality bounds the tail $\sum_{s=0}^{\lfloor yj\rfloor-1} f_{j,\mu_1}(s)$ using the Chernoff-Hoeffding bound (Fact 2).

Bounding $\mathrm{Sum}(\lfloor yj\rfloor, \lfloor yj\rfloor)$. We use $\frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)} \le \frac{f_{j,\mu_1}(s)}{f_{j+1,y}(s)} = \left(1 - \frac{s}{j+1}\right) R^s\, \frac{(1-\mu_1)^j}{(1-y)^{j+1}}$ to get
\begin{align*}
\mathrm{Sum}(\lfloor yj\rfloor, \lfloor yj\rfloor) = \frac{f_{j,\mu_1}(\lfloor yj\rfloor)}{F_{j+1,y}(\lfloor yj\rfloor)}
&\le \left(1 - \frac{yj-1}{j+1}\right) R^{yj}\, \frac{(1-\mu_1)^j}{(1-y)^{j+1}}\\
&\le \frac{1-y+\frac{2}{j+1}}{1-y}\, R^{yj}\, \frac{(1-\mu_1)^j}{(1-y)^{j}}\\
&\le 3 e^{-Dj}. \tag{17}
\end{align*}
The last inequality uses $j \ge \frac{1}{\Delta'} \ge \frac{1}{1-y}$.

Bounding $\mathrm{Sum}\bigl(\lceil yj\rceil, \lfloor \mu_1 j - \frac{\Delta'}{2} j\rfloor\bigr)$. Now, if $j > \frac{1}{\Delta'}$, then
\[
\sqrt{(j+1)y(1-y)} > \sqrt{y} > y,
\]
so $y(j+1) - \sqrt{(j+1)y(1-y)} < yj \le \lceil yj\rceil$. Therefore, using the bounds by [Jerabek 2004] given in (15), $F_{j+1,y}(s) = \Theta(1)$ for $s \ge \lceil yj\rceil$. Using this observation, we derive the following:
\begin{align*}
\mathrm{Sum}\Bigl(\lceil yj\rceil, \bigl\lfloor \mu_1 j - \tfrac{\Delta'}{2} j\bigr\rfloor\Bigr) &= \sum_{s=\lceil yj\rceil}^{\lfloor \mu_1 j - \frac{\Delta'}{2} j\rfloor} \frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)}\\
&= \Theta\left(\sum_{s=\lceil yj\rceil}^{\lfloor \mu_1 j - \frac{\Delta'}{2} j\rfloor} f_{j,\mu_1}(s)\right)\\
&\le \Theta\Bigl(e^{-2\bigl(\mu_1 j - \lfloor \mu_1 j - \frac{\Delta'}{2} j\rfloor\bigr)^2/j}\Bigr)\\
&= \Theta\bigl(e^{-\Delta'^2 j/2}\bigr), \tag{18}
\end{align*}
where the inequality follows using the Chernoff-Hoeffding bounds (Fact 2).

Bounding $\mathrm{Sum}\bigl(\lceil \mu_1 j - \frac{\Delta'}{2} j\rceil, j\bigr)$. For $s \ge \lceil \mu_1 j - \frac{\Delta'}{2} j\rceil = \lceil yj + \frac{\Delta'}{2} j\rceil$, again using the Chernoff-Hoeffding bounds from Fact 2,
\begin{align*}
F_{j+1,y}(s) &\ge 1 - e^{-2\left(yj + \frac{\Delta'}{2} j - y(j+1)\right)^2/(j+1)}\\
&\ge 1 - e^{2\Delta'} e^{-\Delta'^2 j/2}\\
&\ge 1 - e^{\Delta'^2 j/4} e^{-\Delta'^2 j/2}\\
&= 1 - e^{-\Delta'^2 j/4}.
\end{align*}


The last inequality uses $j \ge \frac{8}{\Delta'}$.

\begin{align*}
\mathrm{Sum}\Bigl(\bigl\lceil \mu_1 j - \tfrac{\Delta'}{2} j\bigr\rceil, j\Bigr) &= \sum_{s=\lceil \mu_1 j - \frac{\Delta'}{2} j\rceil}^{j} \frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)}\\
&\le \frac{1}{1 - e^{-\Delta'^2 j/4}}\\
&= 1 + \frac{1}{e^{\Delta'^2 j/4} - 1}. \tag{19}
\end{align*}

Combining, we get for $j \ge \frac{8}{\Delta'}$,
\[
E\left[\frac{1}{p_{i,\tau_j+1}}\right] \;\le\; 1 + \Theta\left(e^{-\Delta'^2 j/2} + \frac{1}{(j+1)\Delta'^2}\, e^{-Dj} + \frac{1}{e^{\Delta'^2 j/4} - 1}\right).
\]
