Trading off Rewards and Errors in Multi-Armed Bandits

Anonymous Author 1, Anonymous Author 2, Anonymous Author 3
Unknown Institution 1, Unknown Institution 2, Unknown Institution 3

Abstract

In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm, whose performance is provably close to the best possible tradeoff strategy. Finally, we demonstrate our approach on real-world educational data and show that ForcingBalance returns useful information about the arms without compromising the overall reward.

1 Introduction

We consider sequential, interactive systems in which a learner aims at optimizing an objective function whose parameters are initially unknown and need to be estimated over time. We adopt the multi-armed bandit (MAB) framework, where the learner has access to a finite set of distributions (arms), each one characterized by an expected value (reward). The learner does not know the distributions beforehand and can only obtain a random sample by selecting an arm. The most common objective in MAB is to minimize the regret, i.e., the difference between the reward of the

Preliminary work. Under review by AISTATS 2017. Do not distribute.

arm with the highest mean and the reward of the arms pulled by the learner. Since the arm means are unknown, this requires balancing exploration of the arms and exploitation of the mean estimates. An alternative setting is pure exploration, where the learner's performance is only evaluated upon the termination of the process, and its learning performance is allowed to be arbitrarily bad in terms of rewards accumulated over time. In best-arm identification [Even-Dar et al., 2006, Audibert et al., 2010], the learner selects arms to find the optimal arm either with very high probability or in a small number of steps. In active exploration [Antos et al., 2010, Carpentier et al., 2011], the objective is to estimate the value of all arms as accurately as possible. This setting, which is related to active learning and experimental optimal design, is particularly relevant whenever accurate predictions of the arms' values are needed to support decisions at a later time.

The previous objectives have been studied separately. However, they do not address the increasingly prevalent situation where users participate in research studies (e.g., for education or health) that are designed to collect reliable data and compute accurate estimates of the performance of the available options. In such situations, the subjects/users themselves rarely care about the underlying research questions, but may reasonably wish to gain their own benefit, such as students seeking to learn new material, or patients seeking to find improved care for their condition. In order to serve these individuals and gather generalizable knowledge at the same time, we formalize this situation as a multi-objective bandit problem, where a designer seeks to trade off cumulative regret minimization (providing good direct reward for participants) with informing scientific knowledge about the strengths and limitations of the various conditions (active exploration to estimate all arm means). This tradeoff is especially needed in high-stakes domains, such as medical testing or education, or when running experiments in online settings, where poor experience may lead to users leaving the system permanently. A similar tradeoff happens when we apply bandits to A/B testing. Here, the designer may want to retain the ability to set a desired level of accuracy in estimating the value of different alter-


natives (e.g., to justify decisions that are to be taken posterior to the experiment) while still maximizing the reward.

A natural initial question is whether these two different objectives, reward maximization and accurate arm estimation, or other alternative objectives, like best-arm identification, are mutually compatible: can one always recover the best of all objectives? Unfortunately, in general the answer is negative. Bubeck et al. [2009] have already shown that any algorithm with sub-linear regret cannot be optimal for identifying the best arm. Though it may not be possible to be simultaneously optimal for both active exploration and reward maximization, we wish to carefully trade off between these two objectives. How to properly balance multiple objectives in MAB is a mostly unexplored question. Bui et al. [2011] introduce the committing bandits, where a given horizon is divided into an experimentation phase, when the learner is free to explore all the arms but still pays a cost, and a commit phase, when the learner must choose one single arm that will be pulled until the end of the horizon. Lattimore [2015] analyzes the problem where the learner wants to minimize the regret simultaneously w.r.t. two special arms. He shows that if the regret w.r.t. one arm is bounded by a small quantity B, then the regret w.r.t. the other arm scales at least as 1/B, which reveals the difficulty of balancing two objectives at the same time. Drugan and Nowé [2013] formalize the multi-objective bandit problem where each arm is characterized by multiple values and the learner should maximize a multi-objective function constructed over the values of each arm. They derive variations of UCB to minimize the regret w.r.t. the full Pareto frontier obtained for different multi-objective functions. Finally, Sani et al. [2012] study strategies having a small regret versus the arm with the best mean-variance tradeoff. In this case, they show that it is not always possible to achieve a small regret w.r.t. the arm with the best mean-variance.

In this paper, we study the tradeoff between cumulative reward and accuracy of estimation of the arms' values (i.e., reward maximization and active exploration), which was first introduced by Liu et al. [2014]. Their work presented a heuristic algorithm for balancing this tradeoff and promising empirical results on an education simulation. In the present paper, we take a more rigorous approach and make several new contributions. 1) We propose and justify a new objective function for the integration of rewards and estimation errors (Sect. 2), which provides a simple way for a designer to weigh directly between them. 2) We introduce the ForcingBalance algorithm that optimizes the objective function when the arm distributions are unknown (Sect. 3). Despite its simplicity, we prove that Forcing-

Balance incurs a regret that asymptotically matches the minimax rate for cumulative regret minimization and the performance of active exploration algorithms (Sect. 4). This is very encouraging, as it shows that balancing a tradeoff between rewards and errors is not fundamentally more difficult than either of these separate objectives. Interestingly, we also show that a simple extension of UCB is not sufficient to achieve good performance. 3) Our analysis only requires strong convexity and smoothness of the objective function, and therefore our algorithm and proof technique can be easily extended. 4) We provide empirical simulations on both synthetic and educational data from Liu et al. [2014] that support our analysis (Sect. 5).

2 Balancing Rewards and Errors

We consider a MAB of $K$ arms with distributions $\{\nu_i\}_{i=1}^K$, each characterized by mean $\mu_i$ and variance $\sigma_i^2$. For technical convenience we consider distributions with bounded support in $[0,1]$. All the following results extend to the general case of sub-Gaussian distributions (used in the experiments). We denote the $s$-th i.i.d. sample drawn from $\nu_i$ by $X_{i,s}$ and we define $[K] = \{1,\dots,K\}$. As discussed in the introduction, we study the combination of two objectives: reward maximization and estimation error minimization. Given a fixed sequence of $n$ arms $I_n = (I_1, I_2, \dots, I_n)$, where $I_t \in [K]$ is the arm pulled at time $t$, the average reward is defined as

$$\rho(I_n) = \mathbb{E}\Big[\frac{1}{n}\sum_{t=1}^{n} X_{I_t, T_{I_t,t}}\Big] = \frac{1}{n}\sum_{i=1}^{K} T_{i,n}\,\mu_i, \qquad (1)$$

where $T_{i,t} = \sum_{s=1}^{t-1}\mathbb{I}\{I_s = i\}$ is the number of times arm $i$ is selected up to step $t-1$. The sequence maximizing $\rho$ simply selects the arm with the largest mean for all $n$ steps. On the other hand, the estimation error is measured as

$$\varepsilon(I_n) = \frac{1}{K}\sum_{i=1}^{K}\sqrt{n\,\mathbb{E}\big[(\hat\mu_{i,n}-\mu_i)^2\big]} = \frac{1}{K}\sum_{i=1}^{K}\sqrt{\frac{n\,\sigma_i^2}{T_{i,n}}}, \qquad (2)$$

where $\hat\mu_{i,n}$ is the empirical average of the $T_{i,n}$ samples. Similar functions were used by Carpentier et al. [2011, 2015]. Notice that (2) multiplies the root mean-square error by $\sqrt{n}$. This is to allow the user to specify a direct tradeoff between (1) and (2) regardless of how their average magnitude varies as a function of $n$.¹ Optimizing $\varepsilon$ requires selecting all the arms with a frequency proportional to their standard deviations. More precisely, each arm should be pulled proportionally to $\sigma_i^{2/3}$. We define the tradeoff objective function balancing the two functions above as a convex combination:

$$f_w(I_n;\{\nu_i\}_i) = w\,\rho(I_n) - (1-w)\,\varepsilon(I_n) = w\sum_{i=1}^{K}\frac{T_{i,n}}{n}\,\mu_i - \frac{(1-w)}{K}\sum_{i=1}^{K}\frac{\sigma_i}{\sqrt{T_{i,n}/n}}, \qquad (3)$$

where $w \in [0,1]$ is a weight parameter and the objective is to find the sequence of pulls $I_n$ which maximizes $f_w$. For $w = 1$ we recover the reward maximization problem, while for $w = 0$ the problem reduces to minimizing the average estimation error. In the rest of the paper, we are interested in the case $w \in (0,1)$, since the extreme cases have already been studied. Using the root mean-square error for $\varepsilon(I_n)$ gives $f_w$ a scale-invariance property: rescaling the distributions equally impacts $\rho$ and $\varepsilon$. Furthermore, $f_w$ can be equivalently obtained as a Lagrangian relaxation of a constrained optimization problem where we intend to maximize the reward subject to a desired level of estimation accuracy. In this case, the parameter $w$ is directly related to the value of the constraint. Liu et al. [2014] proposed a similar tradeoff function where the estimation error is measured by Hoeffding confidence intervals, which disregard the variance of the arms and only depend on the number of pulls. In addition, in their objective, the optimal allocation radically changes with the horizon $n$, where a short horizon forces the learner to be more explorative, while longer horizons allow the learner to be more greedy in accumulating rewards.

¹This choice also "equalizes" the standard regret bounds for the two separate objectives, so that the minimax regret in terms of $\rho$ and the known upper bounds on the regret w.r.t. $\varepsilon$ are both $\widetilde{O}(1/\sqrt{n})$.

[Figure 1: Function $f_w$ and optimal solution $\lambda^*$ for different values of $w$ (red line) for a MAB with $K = 2$, $\mu_1 = 1$, $\mu_2 = 2$, $\sigma_1^2 = \sigma_2^2 = 1$. For small $w$ the problem reduces to optimizing the average estimation error. Since the arms have the same variance, $\lambda^*$ is an even allocation over the two arms. As $w$ increases, the $\rho$ component in $f_w$ becomes more relevant and the optimal allocation selects arm 2 more often, until $w = 1$ when all the resources are allocated to arm 2.]

Overall, their tradeoff reduces to a mixture between a completely uniform allocation (which minimizes the confidence intervals) and a UCB strategy that maximizes the cumulative reward. While their algorithm demonstrated encouraging empirical performance, no formal analysis was provided. In contrast, $f_w$ is stable over time and it allows us to compare the performance of a learning algorithm to a static optimal allocation. We later show that $f_w$ also enjoys properties such as smoothness and strong concavity that are particularly convenient in the analysis. Besides the mathematical advantages, we notice that without normalizing $\varepsilon$ by $n$, as $w$ tends to 0 we would never be able to recover the optimal strategy for error minimization, since $\rho(I_n)$ would always dominate $f_w$, thus making the impact of tuning $w$ difficult to interpret.
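For concreteness, the two components of $f_w$ for a fixed allocation can be evaluated directly. The following minimal Python sketch (the function names rho, eps, and f_w are ours, introduced only for illustration) treats an allocation as a vector of pull proportions $\lambda_i = T_{i,n}/n$:

import numpy as np

def rho(lam, mu):
    # Average reward of allocation lam (Eq. 1 with T_{i,n}/n = lam_i).
    return float(np.dot(lam, mu))

def eps(lam, sigma):
    # Average estimation error of allocation lam (Eq. 2, rescaled by sqrt(n)).
    return float(np.mean(sigma / np.sqrt(lam)))

def f_w(lam, mu, sigma, w):
    # Tradeoff objective of Eq. 3 for a fixed allocation lam.
    return w * rho(lam, mu) - (1.0 - w) * eps(lam, sigma)

# Two arms with equal variance: the error term alone favors an even split.
mu, sigma = np.array([1.0, 2.0]), np.array([1.0, 1.0])
print(f_w(np.array([0.5, 0.5]), mu, sigma, w=0.5))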

Given a horizon $n$, finding the optimal $I_n$ requires solving a difficult discrete optimization problem, thus we study its continuous relaxation²

$$f_w(\lambda;\{\nu_i\}_i) = w\sum_{i=1}^{K}\lambda_i\,\mu_i - \frac{(1-w)}{K}\sum_{i=1}^{K}\frac{\sigma_i}{\sqrt{\lambda_i}}, \qquad (4)$$

where $\lambda \in \mathcal{D}_K$ belongs to the $K$-dimensional simplex, i.e., $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$. As a result, $\lambda$ defines an allocation of arms and $f_w(\lambda;\{\nu_i\}_i)$ is its asymptotic performance if arms are repeatedly chosen according to $\lambda$. We define the optimal allocation and its performance as $\lambda^* = \arg\max_{\lambda\in\mathcal{D}_K} f_w(\lambda;\{\nu_i\}_i)$ and $f^* = f_w(\lambda^*;\{\nu_i\}_i)$ respectively. Since $f_w$ is concave and $\mathcal{D}_K$ is convex, $\lambda^*$ always exists and it is unique whenever $w < 1$ (and there is at least a non-zero variance) or when the largest mean is distinct from the second largest mean. Although a closed-form solution cannot be computed in general, intuitively $\lambda^*$ favors arms with large means and large variances, since allocating a large portion of the resources to them contributes to maximizing $f_w$ by increasing the reward $\rho$ and reducing the error $\varepsilon$. The parameter $w$ defines the sensitivity of $\lambda^*$ to the arm parameters, such that for large $w$, $\lambda^*$ tends to concentrate on the arm with the largest mean, while for small $w$, $\lambda^*$ allocates arms proportionally to their standard deviations. Fig. 1, Sect. 5.1 and App. C provide additional examples illustrating the sensitivity of $\lambda^*$ to the parameters in $f_w$. Let $I^*_n$ be the optimal discrete solution to Eq. 3. Then, it is easy to show that the difference between the two solutions rapidly shrinks to 0 with $n$. In fact,³ for any arm $i$, $|T^*_{i,n}/n - \lambda^*_i| \leq 1/n$, and according to Lem. 4 (stated later), this guarantees that the value of $\lambda^*$ ($f^*$) differs from the optimum of $f_w(I_n)$ by $1/n^2$.

²A more accurate definition of $f_w$ over the simplex requires completing it with $f_w(\lambda) = -\infty$ whenever there exists a component $\lambda_i = 0$ associated with a non-zero variance $\sigma_i^2$.

³Consider a real number $r \in [0,1]$ and $R_n$ any rounding of $rn$ (e.g., $R_n = \lfloor rn\rfloor$); then $|R_n - rn| \leq 1$. If we use $\hat r_n = R_n/n$ as a fractional approximation of $r$ with resolution $n$, then we obtain that $|\hat r_n - r| \leq 1/n$.
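Since $\lambda^*$ has no closed form in general, it can be approximated numerically. A minimal sketch using SciPy's SLSQP solver over the (restricted) simplex follows; the function optimal_allocation and its defaults are our own illustrative choices, not part of the paper:

import numpy as np
from scipy.optimize import minimize

def optimal_allocation(mu, sigma, w, lam_min=1e-6):
    # Approximate lambda* = argmax_{lam in D_K} f_w(lam; {nu_i}) of Eq. 4.
    K = len(mu)
    def neg_f(lam):
        return -(w * np.dot(lam, mu) - (1.0 - w) * np.mean(sigma / np.sqrt(lam)))
    constraints = ({'type': 'eq', 'fun': lambda lam: np.sum(lam) - 1.0},)
    bounds = [(lam_min, 1.0)] * K
    res = minimize(neg_f, x0=np.full(K, 1.0 / K), bounds=bounds,
                   constraints=constraints, method='SLSQP')
    return res.x

# Two-arm problem of Fig. 1: as w grows, the allocation concentrates on arm 2.
mu, sigma = np.array([1.0, 2.0]), np.array([1.0, 1.0])
for w in (0.0, 0.5, 0.9):
    print(w, np.round(optimal_allocation(mu, sigma, w), 3))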

In the following, we consider the restricted simplex $\mathcal{D}_K = \{\lambda : \lambda_i \geq \lambda_{\min},\ \sum_i \lambda_i = 1\}$ with $\lambda_{\min} > 0$, on which $f_w$ is always bounded and can be characterized as follows.

Lemma 1. Let $\sigma_{\max} = \max_i \sigma_i$ and $\sigma_{\min} = \min_i \sigma_i > 0$ be the largest and smallest standard deviations. The function $f_w(\lambda;\{\nu_i\})$ is $\alpha$-strongly concave everywhere in $\mathcal{D}_K$ with $\alpha = \frac{3(1-w)\,\sigma_{\min}}{4K}$ and it is $\beta$-smooth in $\mathcal{D}_K$ with $\beta = \frac{3(1-w)\,\sigma_{\max}}{4K\,\lambda_{\min}^{5/2}}$.

Finally, we define the performance of a learning algorithm. Let $\tilde\lambda_n$ be the empirical frequency of pulls, i.e., $\tilde\lambda_{i,n} = T_{i,n}/n$. We define its regret w.r.t. the value of the optimal allocation as $R_n(\tilde\lambda_n) = f^* - f_w(\tilde\lambda_n;\{\nu_i\}_i)$. The previous equation defines the pseudo-regret of a strategy $\tilde\lambda_n$, since in Eq. 2 the second equality is true for fixed allocations. This is similar to the definition of Carpentier et al. [2015], where the difference between true and pseudo-regret is discussed in detail.

3 The ForcingBalance Algorithm

Why naïve UCB fails. One of the most successful approaches to bandits is the optimism-in-face-of-uncertainty principle, where we construct confidence bounds for the parameters and select the arm maximizing an upper bound on the objective function. This approach was successfully applied in both regret minimization (see e.g., Auer et al. [2002]) and active exploration (see e.g., Carpentier et al. [2011]). As such, a first natural approach to our problem is to construct an upper bound on $f_w$ as (see Prop. 1 for the definition of the confidence bounds)

$$f^{UB}_w(\lambda;\{\hat\nu_{i,n}\}) = w\sum_{i=1}^{K}\lambda_i\Big(\hat\mu_{i,n} + \sqrt{\tfrac{\log(1/\delta_n)}{2T_{i,n}}}\Big) - \frac{(1-w)}{K}\sum_{i=1}^{K}\frac{1}{\sqrt{\lambda_i}}\Big(\hat\sigma_{i,n} - \sqrt{\tfrac{2\log(2/\delta_n)}{T_{i,n}}}\Big). \qquad (5)$$

At each step $n$, we compute the allocation $\hat\lambda^{UB}_n$ maximizing $f^{UB}_w$ and select arms accordingly (e.g., by pulling an arm at random from $\hat\lambda^{UB}_n$). Although the confidence bounds guarantee that for any $\lambda$, $f^{UB}_w(\lambda;\{\hat\nu_{i,n}\}) \geq f_w(\lambda;\{\nu_i\})$ w.h.p., this approach is intrinsically flawed and it would perform poorly. While for large values of $w$ the algorithm reduces to UCB, for small values of $w$ the algorithm tends to allocate arms to balance the estimation errors on the basis of lower bounds on the variances, and thus arms with small lower bounds are selected less. Since small lower bounds may be associated with arms with large confidence intervals, and thus poorly estimated variances, this behavior would prevent the algorithm from correcting its estimates and improving its performance over time (see App. C for additional discussion and empirical simulations). Constructing lower bounds on $f_w$ suffers from the same issue. This suggests that a straightforward (naïve) application of a UCB-like strategy fails in this context. As a result, we take a different approach and propose a forcing algorithm inspired by the GAFS-MAX algorithm introduced by Antos et al. [2010] for active exploration.⁴

1: Input: forcing parameter $\eta$, weight $w$
2: for $t = 1, \dots, n$ do
3:   $U_t = \arg\min_i T_{i,t}$
4:   if $T_{U_t,t} < \eta\sqrt{t}$ then
5:     Select arm $I_t = U_t$ (forcing)
6:   else
7:     Compute optimal estimated allocation $\hat\lambda_t = \arg\max_{\lambda\in\mathcal{D}_K} f_w(\lambda;\{\hat\nu_{i,t}\}_i)$
8:     Select arm (tracking) $I_t = \arg\max_{i=1,\dots,K} \hat\lambda_{i,t} - \tilde\lambda_{i,t}$
9:   end if
10:  Pull arm $I_t$, observe $X_{I_t,t}$, update $\hat\nu_{I_t}$.
11: end for

Figure 2: The ForcingBalance algorithm.

Forced sampling. The ForcingBalance algorithm is illustrated in Fig. 2. It receives as input an exploration parameter $\eta > 0$ and the restricted simplex $\mathcal{D}_K$ defined by $\lambda_{\min}$. At each step $t$, the algorithm first checks the number of pulls of each arm and selects any arm with less than $\eta\sqrt{t}$ samples. If all arms have been sufficiently pulled, the allocation $\hat\lambda_t$ is computed using the empirical estimates of the arms' means and variances, $\hat\mu_{i,t} = \frac{1}{T_{i,t}}\sum_{s=1}^{T_{i,t}} X_{i,s}$ and $\hat\sigma^2_{i,t} = \frac{1}{2T_{i,t}(T_{i,t}-1)}\sum_{s,s'=1}^{T_{i,t}}\big(X_{i,s} - X_{i,s'}\big)^2$. Notice that the optimization is done over the restricted simplex $\mathcal{D}_K$ and $\hat\lambda_t$ can be computed efficiently. Once the allocation $\hat\lambda_t$ is computed, an arm is selected. A straightforward option is either to directly implement the optimal estimated allocation by pulling an arm drawn at random from it, or to allocate the arms proportionally to $\hat\lambda_t$ over a short phase. Both solutions may not be effective, since the final performance is evaluated according to the actual allocation realized over all $n$ steps (i.e., $\tilde\lambda_{i,n} = T_{i,n}/n$) and not $\hat\lambda_n$. Consequently, even when $\hat\lambda_n$ is an accurate approximation of $\lambda^*$, the regret may not be small.⁵ ForcingBalance explicitly tracks the allocation $\hat\lambda_n$ by selecting the arm $I_t$ that is under-pulled the most so far. This tracking step allows us to force $\tilde\lambda_n$ to stay close to $\hat\lambda_n$ (and its performance) at each step. The tracking step is slightly different from GAFS-MAX, which selects the arm with the largest ratio between $\hat\lambda_{i,n}$ and $\tilde\lambda_{i,t}$. We show in the analysis that the proposed tracking rule is more efficient.

⁴Variations on the forcing or forced-sampling approach have been used in many settings, including standard bandits [Yakowitz and Lai, 1995, Szepesvári, 2008], linear bandits [Goldenshluger and Zeevi, 2013], contextual bandits [Langford and Zhang, 2007], and experimental optimal design [Wiens and Li, 2014].

⁵Consider the case of 3 arms where, after $t$ steps, the empirical allocation $\tilde\lambda_t$ is $(0.5, 0.1, 0.4)$ and the estimated allocation $\hat\lambda_t$ is $(0.5, 0.4, 0.1)$. In the following steps, the most effective way to reduce the regret is not to use $\hat\lambda_t$, but to pull arm 2 more than 40% of the time, in order to close the gap between $\tilde\lambda_t$ and $\hat\lambda_t$ as fast as possible.
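To make the forcing/tracking interplay concrete, here is a minimal simulation sketch of the loop in Fig. 2. It is our own illustration, not the authors' implementation: it assumes Gaussian rewards, uses the plain plug-in variance estimator instead of the pairwise estimator above, and reuses the illustrative optimal_allocation function from the sketch after Eq. 4.

import numpy as np

def forcing_balance(mu, sigma, w, n, eta=1.0, lam_min=1e-6, rng=None):
    # Minimal sketch of the ForcingBalance loop of Fig. 2 (illustrative only).
    rng = rng or np.random.default_rng(0)
    K = len(mu)
    pulls = np.zeros(K, dtype=int)      # T_{i,t}
    sums = np.zeros(K)                  # running sums of samples
    sq_sums = np.zeros(K)               # running sums of squared samples
    for t in range(1, n + 1):
        u = int(np.argmin(pulls))
        if pulls[u] < eta * np.sqrt(t):             # forcing step
            i = u
        else:                                       # tracking step
            mu_hat = sums / pulls
            var_hat = np.maximum(sq_sums / pulls - mu_hat ** 2, 1e-12)
            lam_hat = optimal_allocation(mu_hat, np.sqrt(var_hat), w, lam_min)
            i = int(np.argmax(lam_hat - pulls / t)) # most under-pulled arm
        x = rng.normal(mu[i], sigma[i])             # observe X_{I_t,t}
        pulls[i] += 1
        sums[i] += x
        sq_sums[i] += x ** 2
    return pulls / n                                # empirical allocation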

The parameter $\eta$ defines the amount of exploration forced by the algorithm. A large $\eta$ forces all arms to be pulled many times. While this guarantees accurate estimates $\hat\mu_{i,t}$ and $\hat\sigma^2_{i,t}$ and an optimal estimated allocation $\hat\lambda_t$ that rapidly converges to $\lambda^*$, the algorithm would perform the tracking step very rarely and thus $\tilde\lambda_t$ would not track $\hat\lambda_t$ fast enough. In the next section, we show that any value of $\eta$ in a wide range (e.g., $\eta = 1$) guarantees a small regret. The other parameter is $\lambda_{\min}$, which defines a restriction on the set of allocations that can be learned. From an algorithmic point of view, $\lambda_{\min} = 0$ is a viable choice, since $f_w$ is strongly concave and always admits at least one solution in $\mathcal{D}_K$ (the full simplex). Nonetheless, we show next that $\lambda_{\min}$ needs to be strictly positive to guarantee uniform convergence of $f_w$ for true and estimated parameters, which is a critical property to ensure regret bounds.

4 Theoretical Guarantees

In this section, we derive an upper bound on the regret of ForcingBalance with explicit dependence on its parameters and the characteristics of $f_w$. We start with high-probability confidence intervals for the mean and the standard deviation (see Thm. 10 of Maurer and Pontil [2009]).

Proposition 1. Fix $\delta \in (0,1)$. For any $n > 0$ and any arm $i \in [K]$,

$$\big|\hat\mu_{i,n} - \mu_i\big| \leq \sqrt{\frac{\log(1/\delta_n)}{2T_{i,n}}}, \qquad \big|\hat\sigma_{i,n} - \sigma_i\big| \leq \sqrt{\frac{2\log(2/\delta_n)}{T_{i,n}}},$$

with probability $1-\delta$, where $\delta_n = \delta/(4Kn(n+1))$.
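The widths of Prop. 1 are straightforward to compute; a direct transcription (the function name is ours):

import numpy as np

def confidence_widths(T, n, K, delta):
    # Mean and standard-deviation confidence widths of Prop. 1 at step n.
    T = np.asarray(T, dtype=float)
    delta_n = delta / (4.0 * K * n * (n + 1))
    mean_width = np.sqrt(np.log(1.0 / delta_n) / (2.0 * T))
    std_width = np.sqrt(2.0 * np.log(2.0 / delta_n) / T)
    return mean_width, std_width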

The accuracy of the estimates translates into the difference between the estimated and true function $f$ (we drop the dependence on $w$ for readability).

Lemma 2. Let $\hat\nu_i$ be an empirical distribution characterized by mean $\hat\mu_i$ and standard deviation $\hat\sigma_i$ such that $|\hat\mu_i - \mu_i| \leq \varepsilon^\mu_i$ and $|\hat\sigma_i - \sigma_i| \leq \varepsilon^\sigma_i$. Then for any fixed $\lambda \in \mathcal{D}_K$ we have

$$\big|f(\lambda;\{\nu_i\}) - f(\lambda;\{\hat\nu_i\})\big| \leq w\,\max_i \varepsilon^\mu_i + \frac{1-w}{\min_i\sqrt{\lambda_i}}\,\max_i \varepsilon^\sigma_i.$$

This lemma shows that the accuracy in estimating $f$ is affected by the largest error in estimating the mean or the variance of any arm. This is due to the fact that $\lambda$ may give a high weight to a poorly estimated arm (i.e., $\lambda_i$ may be large for large $\varepsilon_i$). As a result, if $\varepsilon^\mu_i$ and $\varepsilon^\sigma_i$ are defined as in Prop. 1, the lemma requires that all arms are pulled often enough to guarantee an accurate estimation of $f$. Furthermore, the upper bound scales inversely with the minimum proportion $\min_i \lambda_i$. This shows the need of restricting the possible $\lambda$s to allocations with a non-zero lower bound on $\min_i \lambda_i$, which is guaranteed by the use of the restricted simplex $\mathcal{D}_K$ in the algorithm. Finally, notice that here we consider a fixed allocation $\lambda$, while later we need to deal with a (possibly) random choice of $\lambda$, which requires a union bound over a cover of $\mathcal{D}_K$ (see Cor. 1). The next two lemmas show how the difference in performance translates into the difference of allocations and vice versa.

Lemma 3. If an allocation $\lambda \in \mathcal{D}_K$ is such that $\big|f^* - f(\lambda;\{\nu_i\})\big| \leq \varepsilon_f$, then for any arm $i \in [K]$, $|\lambda_i - \lambda^*_i| \leq \sqrt{\tfrac{2K}{\alpha}}\,\sqrt{\varepsilon_f}$, where $\alpha$ is the strong-concavity parameter of $f_w$ (Lem. 1).

Lemma 4. The performance of an allocation $\lambda \in \mathcal{D}_K$ compared to the optimal allocation $\lambda^*$ is such that $f(\lambda^*;\{\nu_i\}) - f(\lambda;\{\nu_i\}) \leq \frac{3\beta}{2}\,\|\lambda - \lambda^*\|_2^2$.

In both cases, the bounds depend on the shape of $f$ through the parameters of strong concavity $\alpha$ and smoothness $\beta$, which in turn depend on the constrained simplex $\mathcal{D}_K$ and the choice of $\lambda_{\min}$. Before stating the regret bound, we need to introduce an assumption on $\lambda^*$.

Assumption 1. Let $\lambda^*_{\min} = \min_i \lambda^*_i$ be the smallest proportion over the arms in the optimal allocation and let $\mathcal{D}_K$ be the restricted simplex used in the algorithm. We assume that the weight parameter $w$ and the distributions $\{\nu_i\}_i$ are such that $\lambda^*_{\min} \geq \lambda_{\min}$, that is, $\lambda^* \in \mathcal{D}_K$.

Notice that whenever all arms have non-zero variance and $w < 1$, $\lambda^*_{\min} > 0$ and there always exists a non-zero $\lambda_{\min}$ (and thus a set $\mathcal{D}_K$) for which the assumption can be verified. In general, the larger and more similar the variances and the smaller $w$, the bigger $\lambda^*_{\min}$ and the less restrictive the assumption. The choice of $\lambda_{\min}$ also affects the final regret bound.

Theorem 1. We consider a MAB with $K \geq 2$ arms with means $\{\mu_i\}$ and variances $\{\sigma^2_i\}$. Under Asm. 1, ForcingBalance run with a forcing parameter $\eta$ and a simplex $\mathcal{D}_K$ restricted to $\lambda_{\min}$ suffers a regret

$$R_n(\tilde\lambda_n) \leq \begin{cases} 1 & \text{if } n \leq n_0 \\[4pt] 43\,K^{5/2}\,\beta\,\sqrt{\dfrac{\log(2/\delta_n)}{\eta\,\lambda_{\min}}}\; n^{-1/4} & \text{if } n_0 < n \leq n_2 \\[8pt] 153\,K^{5/2}\,\beta\,\sqrt{\dfrac{\log(2/\delta_n)}{\lambda_{\min}\,\lambda^*_{\min}}}\; n^{-1/2} & \text{if } n > n_2, \end{cases}$$

with probability $1-\delta$ (where $\delta_n = \delta/(4Kn(n+1))$) and

$$n_0 = K\big(K\eta^2 + \eta\sqrt{K} + 1\big), \qquad n_2 = \frac{C\,K^{10}\,\log^2(1/\delta_n)}{(\lambda^*_{\min})^{8}\,\alpha^4\,\lambda^2_{\min}},$$

where $C$ is a suitable numerical constant.

Remark 1 (dependence on n). The previous bound reveals the existence of three phases. For $n \leq n_0$, we are in a fully explorative phase where the pulls are always triggered by the forcing condition; the allocation $\tilde\lambda_n$ is uniform over arms and it can be arbitrarily bad w.r.t. $\lambda^*$. In the second phase, the algorithm interleaves forcing and tracking, but the estimates $\{\hat\nu_{i,n}\}$ are not accurate enough to guarantee that $\hat\lambda_n$ performs well. In particular, we can only guarantee that all arms are selected $\eta\sqrt{n}$ times, which implies the regret decreases very slowly, as $\widetilde{O}(n^{-1/4})$. Fortunately, as the estimates become more accurate, $\hat\lambda_n$ approaches $\lambda^*$, and after $n_2$ steps the algorithm successfully tracks $\lambda^*$ and achieves the asymptotic regret of $\widetilde{O}(n^{-1/2})$. This regret matches the minimax rate for regret minimization and the rate of active exploration algorithms (e.g., GAFS-MAX). This shows that operating a trade-off between rewards and errors is not fundamentally more difficult than optimizing either objective individually. While in this analysis the second and third phases are sharply separated (and $n_2$ may be large), in practice the performance gradually improves as $\hat\lambda$ approaches $\lambda^*$.

Remark 2 (dependence on parameters). $\lambda_{\min}$ has a major impact on the bound. The smaller its value, the higher the regret (both explicitly and through the smoothness $\beta$). At the same time, the larger $\lambda_{\min}$, the stricter Asm. 1, which limits the validity of Thm. 1. A possible compromise is to set $\lambda_{\min}$ to an appropriate decreasing function of $n$, thus making Asm. 1 always verified (for a large enough $n$), at the cost of worsening the asymptotic rate of the regret. In the experiments, we run ForcingBalance with $\lambda_{\min} = 0$ without the regret being negatively affected. We indeed conjecture that we can always set $\lambda_{\min} = 0$ (for which Asm. 1 is always verified), while the bound could be refined by replacing $\lambda_{\min}$ (the ForcingBalance parameter) with $\lambda^*_{\min}$ (the minimum optimal allocation). Nonetheless, we point out that this would require significantly changing the structure of the proof, as Lemma 2 does not hold anymore when $\lambda_{\min} = 0$.

Remark 3 (dependence on the problem). The remaining terms in the bound depend on the number of arms $K$, $w$, $\sigma^2_{\min}$ (through $\alpha$), and $\lambda^*_{\min}$. By the definition of $\alpha$, we notice that as $w$ tends to 1 (pure reward maximization), the bound gets worse. This is expected, since the proof relies on the strong concavity of $f_w$ to relate the accuracy in estimating $f_w$ to the accuracy of the allocations (see Lem. 3). Finally, the regret has an inverse dependence on $\lambda^*_{\min}$, which shows that if the optimal allocation requires an arm to be selected only a very limited fraction of time, the problem is more challenging and the regret increases. This may happen in a range of different configurations, such as large values of $w$ or when one arm has very high mean and variance, which leads to a $\lambda^*$ highly concentrated on one single arm and a very small $\lambda^*_{\min}$. A very similar dependence is already present in previous results for active exploration (see e.g., Carpentier et al. [2011]).

Remark 4 (proof). A sketch and the complete proof are reported in App. B. While the proof shares a similar structure with GAFS-MAX's, in GAFS-MAX we have access to an explicit form of the optimal allocation $\lambda^*$ and the proof directly measures the difference between allocations. Here, we have to rely on Lemmas 3 and 4 to relate allocations to objective functions and vice versa. In this sense, our analysis is a generalization of the proof of GAFS-MAX and it can be applied to any strongly convex and smooth objective function.

5 Experiments

We evaluate the empirical performance of ForcingBalance on synthetic data and a more challenging problem directly derived from an educational application. Additional experiments are in the appendix.

5.1 Synthetic Data

We consider a MAB with $K = 5$ arms with means and variances given in Fig. 4. While $\rho(\lambda)$ is optimized by always pulling arm 5, $\varepsilon(\lambda)$ is minimized by an allocation selecting more often arm 4, which has the largest variance (for $w = 0$, the optimal allocation $\lambda^*_4$ is over 0.41). For $w = 0.9$ (i.e., more weight to cumulative reward than estimation error) the optimal allocation $\lambda^*$ is clearly biased towards arm 5 and only partially towards arm 4, while all other arms are pulled only a limited fraction of time (well below 2%). We run ForcingBalance with $\eta = 1$ and $\lambda_{\min} = 0$ and average over 200 runs.
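A minimal script reproducing this synthetic setup with the illustrative sketches introduced earlier (the arm parameters are those of Fig. 4; the horizon and number of runs are reduced here for brevity) might read:

import numpy as np

# Arms of Fig. 4 (means and variances); w = 0.9 as in the text.
mu = np.array([1.0, 1.5, 2.0, 4.0, 5.0])
sigma = np.sqrt(np.array([0.05, 0.1, 0.2, 4.0, 0.5]))
w, n = 0.9, 500

lam_star = optimal_allocation(mu, sigma, w)            # sketch after Eq. 4
runs = [forcing_balance(mu, sigma, w, n, eta=1.0)      # sketch of Fig. 2
        for _ in range(20)]
print("lambda*      :", np.round(lam_star, 4))
print("mean lambda~ :", np.round(np.mean(runs, axis=0), 4))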

[Figure 3: Rescaled regret (left), allocation errors (center), and Pareto frontier (right) for the setting in Fig. 4.]

Figure 4: Arm means, variances, and optimal allocation for $w = 0.9$.
         $\mu$    $\sigma^2$   $\lambda^*$
Arm 1    1.0      0.05         0.0073
Arm 2    1.5      0.1          0.01
Arm 3    2.0      0.2          0.014
Arm 4    4.0      4.0          0.0794
Arm 5    5.0      0.5          0.8893

Dependence on n. In Fig. 3-(left) we report the average and the 0.95-quantile of the rescaled regret $\widetilde R_n = \sqrt{n}\,R_n$. From Thm. 1 we expect the rescaled regret to increase as $\sqrt{n}$ in the first exploration phase, then to increase as $n^{1/4}$ in the second phase, and finally to converge to a constant (i.e., when the actual regret enters the asymptotic regime of $\widetilde O(n^{-1/2})$). From the plot we see that this is mostly verified by the empirical regret, although there is a transient phase during which the rescaled regret decreases with $n$, which suggests that the actual regret may decrease at a faster rate, at least initially. This behavior may be captured in the theoretical analysis by replacing the use of Hoeffding bounds with Bernstein concentration inequalities, which may reveal faster rates (up to $\widetilde O(1/n)$) whenever $n$ and the standard deviations are small.

Tracking. In Fig. 3-(center), we study the behavior of the estimated allocation $\hat\lambda$ and the actual allocation $\tilde\lambda$ (we show $\hat\lambda_4$ and $\tilde\lambda_4$) w.r.t. the optimal allocation ($\lambda^*_4 = 0.0794$). In the initial phase, $\hat\lambda$ is basically uniform ($1/K$) since the algorithm is always in forcing mode. After the exploration phase, $\hat\lambda$ is computed on estimates that are already quite accurate, and it rapidly converges to $\lambda^*$. At the same time, $\tilde\lambda$ keeps tracking the estimated optimal allocation and it also tends to converge to $\lambda^*$, but with a slightly longer delay. We further study the tracking rule in the appendix.

Pareto frontier. In Fig. 3-(right) we study the performance of the optimal allocation $\lambda^*$ for varying weights $w$. We report the Pareto frontier in terms of average reward $\rho(\lambda)$ and average estimation error $\varepsilon(\lambda)$. The optimal allocation smoothly changes from focusing on arm 4 to being almost completely concentrated on arm 5 ($\lambda^*_4 = 0.41$ and $\lambda^*_5 = 0.20$ for $w = 0.0$, and $\lambda^*_4 = 0.0484$ and $\lambda^*_5 = 0.9326$ for $w = 0.95$). As a result, we move from an allocation with very low estimation error but poor reward to a strategy with large reward but poor estimation. We also report the Pareto frontier of ForcingBalance for different values of $n$. In this setting ForcingBalance is more effective in approaching the performance of $\lambda^*$ for small values of $w$. This is consistent with the fact that for $w = 0$, $\lambda^*_{\min} = 0.097$, while it decreases to 0.004 for $w = 0.95$, which increases the regret as illustrated by Thm. 1.
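The Pareto frontier of the optimal allocation can be traced by sweeping $w$ and recording the pair $(\rho(\lambda^*_w), \varepsilon(\lambda^*_w))$; a short sketch, again based on the illustrative optimal_allocation function:

import numpy as np

def pareto_frontier(mu, sigma, weights):
    # (reward, estimation error) of the optimal allocation for each weight w.
    points = []
    for w in weights:
        lam = optimal_allocation(mu, sigma, w)     # sketch after Eq. 4
        points.append((float(lam @ mu), float(np.mean(sigma / np.sqrt(lam)))))
    return points

mu = np.array([1.0, 1.5, 2.0, 4.0, 5.0])           # arms of Fig. 4
sigma = np.sqrt(np.array([0.05, 0.1, 0.2, 4.0, 0.5]))
weights = [0.0, 0.3, 0.6, 0.9, 0.95]
for w, (r, e) in zip(weights, pareto_frontier(mu, sigma, weights)):
    print(f"w={w:.2f}  reward={r:.3f}  error={e:.3f}")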

5.2 Educational Data

Treefrog Treasure is an educational math game in which players navigate through a world of number lines. Players must find and jump through the appropriate fraction on each number line. To analyze the effectiveness of our algorithm when parameters are drawn from a real-world setting, we use data from an experiment in Treefrog Treasure to estimate the means and variances of a 64-arm experiment. Each arm corresponds to a different experimental condition: after a tutorial, 34,197 players each received a pair of number lines with different properties, followed by the same (randomized) distribution of number lines thereafter. We measured how many number lines students solved conditioned on the type of this initial pair; the hope is to learn which type of number line encourages player persistence on a wide variety of number lines afterwards. There were a total of K = 64 conditions, formed by choosing between 2 representations of the target fraction, 2 representations of the label fractions on the lines themselves, adding or withholding tick marks at regular intervals on the number line, adding or removing hinting animations if the problem was answered incorrectly, and 1-4 different rates of backoff hints that would progressively offer more and more detailed hints as the player made mistakes. The details of both the experiments and the experimental conditions are taken from Liu et al. [2014], though we emphasize that we measure a different outcome in this paper (player persistence as opposed to chance of correct answer).

We run ForcingBalance, standard UCB, GAFS-MAX (adapted to minimize the average estimation error), and a uniform baseline (Unif) over $n = 25{,}000$ steps and 100 runs. Both ForcingBalance and GAFS-MAX use $\eta = 1$, and $w$ is set to 0.6 to give priority to the accuracy of the estimates and to 0.95 to favor the players' experience and entertainment. We study the performance according to the average reward $\rho(\lambda)$ (normalized by the largest mean), the estimation error $\varepsilon(\lambda)$ (normalized by the largest standard deviation), the rescaled regret $\sqrt{n}\,R_n$, the relative discounted cumulative gain (RelDCG), and the RankErr, which measure how well arms are ranked on the basis of their means.⁶ Small values of RelDCG and RankErr mean that arms are estimated well enough to correctly rank them and can allow the experiment designer to later reliably remove the worst-performing arms. The results are reported in Fig. 6. Since UCB, GAFS-MAX, and Unif do not depend on $w$, their performance is constant except for the regret, which is computed w.r.t. different $\lambda^*$s. As expected, UCB achieves the highest reward but it performs very poorly in estimating the arms' means and in ranking them. GAFS-MAX does not collect much reward but is very accurate in the estimation of the means. On the other hand, ForcingBalance balances the two objectives and achieves the smallest regret. We notice that ForcingBalance preserves a very good estimation accuracy without compromising too much the average reward (for $w = 0.95$). Here, effectively balancing between the two objectives allows us to rank the different game settings in the right order while providing players with a good experience. Had we used UCB, the outcome for players would have been better, but the designers would have less insight into how the different ways of providing number lines affect player behavior for when they need to design the next game (high RankErr). Alternatively, using GAFS-MAX would give the designer excellent insight into how different number lines affect players; however, if some conditions are too difficult, we might have caused many players to quit. ForcingBalance provides useful feedback to the designer without compromising the players' experience (the RankErr is close to GAFS-MAX but the reward is higher). This is more evident when moving to $w = 0.95$, where ForcingBalance significantly improves the reward w.r.t. GAFS-MAX without losing much accuracy in ranking the arms.

[Figure 5: Treefrog Treasure, a math game about number lines.]

Figure 6: Results on the educational dataset.
Alg.          $\varepsilon(\lambda)/\sigma_{\max}$   $\rho(\lambda)/\mu_{\max}$   $R_n$     RelDCG   RankErr
$w = 0.95$
$\lambda^*$   6.549      0.9405     -         -        -
Force         6.708      0.9424     1.878     0.1871   5.935
UCB           11.03      0.9712     95.15     1.119    8.629
GAFS          5.859      0.9183     17.79     0.1268   5.117
Unif          5.861      0.9168     20.49     0.132    5.25
$w = 0.6$
$\lambda^*$   5.857      0.9189     -         -        -
Force         5.859      0.92       0.4437    0.1227   5.178
UCB           11.03      0.9712     1343      1.119    8.629
GAFS          5.859      0.9183     1.314     0.1268   5.117
Unif          5.861      0.9168     3.482     0.132    5.25

⁶Let $\pi^*$ be the true ranking and $\hat\pi$ the estimated ranking (i.e., $\hat\pi(k)$ returns the identity of the arm ranked at position $k$). The DCG is computed as $\mathrm{DCG}_\pi = \sum_{k=1}^{K}\frac{\mu_{\pi(k)}}{\log(k+1)}$, and we then compute $\mathrm{RelDCG} = (\mathrm{DCG}_{\pi^*} - \mathrm{DCG}_{\hat\pi})/\mathrm{DCG}_{\pi^*}$, while $\mathrm{RankErr} = \frac{1}{K}\sum_{i=1}^{K}|\pi^*(i) - \hat\pi(i)|$.
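The two ranking metrics of footnote 6 can be computed from the estimated means as follows; this is a literal transcription of the formulas in the footnote (function name ours):

import numpy as np

def ranking_metrics(mu_true, mu_hat):
    # RelDCG and RankErr of footnote 6 for a true and an estimated ranking.
    mu_true = np.asarray(mu_true, dtype=float)
    K = len(mu_true)
    pi_star = np.argsort(-mu_true)            # pi*(k): arm at rank k by true mean
    pi_hat = np.argsort(-np.asarray(mu_hat))  # pi^(k): arm at rank k by estimate
    discount = 1.0 / np.log(np.arange(1, K + 1) + 1.0)
    dcg_star = float(np.sum(mu_true[pi_star] * discount))
    dcg_hat = float(np.sum(mu_true[pi_hat] * discount))
    rel_dcg = (dcg_star - dcg_hat) / dcg_star
    rank_err = float(np.mean(np.abs(pi_star - pi_hat)))
    return rel_dcg, rank_err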

6 Conclusions

We studied the tradeoff between rewards and estimation errors. We proposed a new formulation of the problem, introduced a variant of a forced-sampling algorithm, derived bounds on its regret, and validated our results on synthetic and educational data.

There are a number of interesting directions for future work. 1) An active exploration strategy tends to pull all arms a number of times linear in $t$, while minimizing regret requires selecting sub-optimal arms a sublinear number of times. It would be interesting to prove an explicit incompatibility result between maximizing $\rho(\lambda)$ and minimizing $\varepsilon(\lambda)$, similar to the result of Bubeck et al. [2009] for simple and cumulative regret. 2) While a straightforward application of the UCB principle fails in this case, alternative formulations, such as using upper bounds on both means and variances, could overcome the limitations of $(\mu,\sigma)$-Naive-UCB. Nonetheless, the resulting function $f_w(\cdot;\{\tilde\nu_{i,n}\})$ is neither an upper nor a lower bound on the true function $f_w(\cdot;\{\nu_i\})$, and the regret analysis could be considerably more difficult than for ForcingBalance. Furthermore, it would be interesting to study how a Thompson sampling approach could be adapted to this case. 3) Finally, alternative tradeoffs can be formulated (e.g., simple vs. cumulative regret). Notice that the current model, algorithm, and analysis could be easily extended to any strongly-convex and smooth function defined over some parameters of the arms' distributions.


References

András Antos, Varun Grover, and Csaba Szepesvári. Active learning in heteroscedastic noise. Theoretical Computer Science, 411:2712–2728, June 2010.

J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Proceedings of the Twenty-Third Annual Conference on Learning Theory (COLT'10), pages 41–53, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer, 2003.

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT'09), 2009.

Loc X. Bui, Ramesh Johari, and Shie Mannor. Committing bandits. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1557–1565, 2011.

Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT'11), pages 189–203, 2011.

Alexandra Carpentier, Rémi Munos, and András Antos. Adaptive strategy for stratified Monte Carlo sampling. Journal of Machine Learning Research, 16:2231–2271, 2015.

Madalina M. Drugan and Ann Nowé. Designing multi-objective multi-armed bandits algorithms: A study. In IJCNN, pages 1–8. IEEE, 2013.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013. doi: 10.1214/11-SSY032.

John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.

Tor Lattimore. The Pareto regret frontier for bandits. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 208–216, 2015.

Yun-En Liu, Travis Mandel, Emma Brunskill, and Zoran Popovic. Trading off scientific knowledge and user learning with multi-armed bandits. In Proceedings of the 7th International Conference on Educational Data Mining (EDM), 2014.

A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, pages 115–124, 2009.

Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-averse multi-arm bandits. In Proceedings of the Twenty-Sixth Annual Conference on Neural Information Processing Systems (NIPS'12), 2012.

Csaba Szepesvári. Learning theory of optimal decision making. Lecture notes of the 2008 Machine Learning Summer School, 2008. URL https://www.ualberta.ca/~szepesva/Talks/MLSS-IleDeRe-day1.pdf.

Douglas P. Wiens and Pengfei Li. V-optimal designs for heteroscedastic regression. Journal of Statistical Planning and Inference, 145:125–138, 2014.

S. Yakowitz and T.-L. Lai. The nonparametric bandit approach to machine learning. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pages 568–572, 1995.


A Technical Lemmas

Proof of Lemma 1. We study the function $g(\cdot) = -f_w(\cdot)$. For any $i = 1,\dots,K$ we have

$$\frac{\partial g(\lambda)}{\partial\lambda_i} = -w\,\mu_i - \frac{1-w}{2K}\,\frac{\sigma_i}{\lambda_i^{3/2}},$$

and thus the Hessian of $g$ at $\lambda$ is

$$[H_g]_{i,j} = \frac{\partial^2 g(\lambda)}{\partial\lambda_i\,\partial\lambda_j} = \begin{cases} 0 & \text{if } i \neq j \\[4pt] \dfrac{3(1-w)\,\sigma_i}{4K\,\lambda_i^{5/2}} & \text{otherwise}, \end{cases}$$

which is a diagonal matrix. Thus it is easy to show that for any $\lambda \in \mathcal{D}_K$, the Hessian is bounded as $\alpha I_K \preceq H_g \preceq \beta I_K$, which implies that the function $g$ is $\alpha$-strongly convex and $\beta$-smooth with parameters

$$\alpha = \frac{3(1-w)\,\sigma_{\min}}{4K}, \qquad \beta = \frac{3(1-w)\,\sigma_{\max}}{4K\,\lambda_{\min}^{5/2}}.$$
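The closed form of the diagonal Hessian can be checked numerically; a small finite-difference sketch (our own, for illustration only, with the linear reward term dropped since it does not affect second derivatives):

import numpy as np

def hessian_diag(lam, sigma, w, K):
    # Closed-form diagonal of H_g from the proof of Lemma 1.
    return 3.0 * (1.0 - w) * sigma / (4.0 * K * lam ** 2.5)

w, sigma = 0.3, np.array([0.5, 1.0, 2.0])
K, lam, h, i = 3, np.array([0.2, 0.3, 0.5]), 1e-4, 1

def g(lam):
    # g = -f_w up to the linear (mu) term, which has zero Hessian.
    return (1.0 - w) / K * np.sum(sigma / np.sqrt(lam))

e = np.zeros(K); e[i] = h
finite_diff = (g(lam + e) - 2.0 * g(lam) + g(lam - e)) / h ** 2
print(finite_diff, hessian_diag(lam, sigma, w, K)[i])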

Proof of Lemma 2. The statement is obtained by the series of inequalities

$$\big|f(\lambda;\{\nu_i\}) - f(\lambda;\{\hat\nu_i\})\big| = \Big|w\sum_{i=1}^{K}\lambda_i(\mu_i - \hat\mu_i) + \frac{1-w}{K}\sum_{i=1}^{K}\frac{1}{\sqrt{\lambda_i}}(\hat\sigma_i - \sigma_i)\Big|$$
$$\leq w\sum_{i=1}^{K}\lambda_i\big|\mu_i - \hat\mu_i\big| + \frac{1-w}{K}\sum_{i=1}^{K}\frac{1}{\sqrt{\lambda_i}}\big|\hat\sigma_i - \sigma_i\big| \leq w\,\max_i \varepsilon^\mu_i + \frac{1-w}{\min_i\sqrt{\lambda_i}}\,\max_i \varepsilon^\sigma_i,$$

where in the last step we used that $\sum_i\lambda_i = 1$.

We also derive a simple corollary of Lemma 2, which extends the previous result to any (random) choice of allocation $\lambda \in \mathcal{D}_K$.

Corollary 1. After $n$ steps, let $\{\hat\nu_{i,n}\}_i$ be the empirical distributions obtained after pulling each arm $T_{i,n}$ times. If we define $\delta_n = \delta/(4Kn^2(n+1))$, then

$$\mathbb{P}\bigg[\forall n > 0,\ \forall\lambda\in\mathcal{D}_K,\ \big|f(\lambda;\{\nu_i\}) - f(\lambda;\{\hat\nu_i\})\big| \leq \max_i\sqrt{\frac{2K\log(2/\delta_n)}{\lambda_{\min}\,T_{i,n}}}\bigg] \geq 1-\delta.$$

Proof. The proof is identical to the one of Lemma 2, together with a union bound over a covering of the simplex $\mathcal{D}_K$ and Prop. 1 for the concentration of $\hat\mu$ and $\hat\sigma$. We first notice that any covering of the unrestricted simplex also covers $\mathcal{D}_K$. We sketch how to construct an $\epsilon$-cover of a $K$-dimensional simplex. For any integer $n = \lceil 1/\epsilon\rceil$, we can design a discretization $\mathcal{D}^{(n)}_K$ of the simplex defined by any possible (fractional) distribution $\hat\lambda = (\lambda_1,\dots,\lambda_K)$ such that for any $\lambda_i$ there exists an integer $j$ such that $\lambda_i = j/n$. $\mathcal{D}^{(n)}_K$ is then an $\epsilon$-cover in $\ell_1$-norm, since for any distribution $\lambda \in \mathcal{D}_K$ there exists a distribution $\hat\lambda \in \mathcal{D}^{(n)}_K$ such that $\|\lambda - \hat\lambda\|_1 \leq 1/n \leq \epsilon$. The cardinality of $\mathcal{D}^{(n)}_K$ is (loosely) upper-bounded by $n^K$ ($n$ possible integers for each component $\lambda_i$). Upper-bounding the result of Lemma 2 by $\max\{\max_i\varepsilon^\mu_i;\ \max_i\varepsilon^\sigma_i\}/\sqrt{\lambda_{\min}}$ (since we focus on $\lambda$ in the restricted simplex $\mathcal{D}_K$) and following standard techniques in statistical learning theory (see e.g., Thm. 4 by Bousquet et al. [2003]), we obtain the final statement.
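The discretization $\mathcal{D}^{(n)}_K$ used in this covering argument can be enumerated explicitly for small $K$ and $n$; a short illustrative sketch (ours):

from itertools import combinations_with_replacement
from fractions import Fraction

def simplex_cover(K, n):
    # All fractional distributions (j_1/n, ..., j_K/n) summing to 1, i.e. D_K^(n).
    points = []
    for cuts in combinations_with_replacement(range(n + 1), K - 1):
        parts, prev = [], 0
        for c in cuts:              # cuts are already sorted
            parts.append(c - prev)
            prev = c
        parts.append(n - prev)
        points.append(tuple(Fraction(p, n) for p in parts))
    return points

print(len(simplex_cover(3, 4)))     # C(n+K-1, K-1) = 15 grid points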


Proof of Lemma 3. Let $g(\cdot) = -f_w(\cdot)$. For any pair of allocations $\lambda, \lambda' \in \mathcal{D}_K$, from Taylor's theorem there exists an allocation $\lambda''$ such that

$$g(\lambda) = g(\lambda') + \nabla g(\lambda')^\top(\lambda - \lambda') + \frac{1}{2}(\lambda - \lambda')^\top H_g(\lambda'')(\lambda - \lambda').$$

Given the bound from Lemma 1 (strong convexity) and taking $\lambda' = \lambda^*$ (by Asm. 1, $\lambda^* \in \mathcal{D}_K$) we have

$$g(\lambda) \geq g(\lambda^*) + \nabla g(\lambda^*)^\top(\lambda - \lambda^*) + \frac{\alpha}{2}\,\|\lambda - \lambda^*\|_2^2.$$

Since $\lambda^*$ is the optimal allocation over $\mathcal{D}_K$, the gradient $\nabla g(\lambda^*)$ at $\lambda^*$ in the direction towards $\lambda$ is nonnegative, and thus

$$\frac{\alpha}{2}\,\|\lambda - \lambda^*\|_2^2 \leq f(\lambda^*) - f(\lambda).$$

Given that $\|\lambda - \lambda^*\|_\infty \leq \|\lambda - \lambda^*\|_2$, we finally obtain

$$\max_{i=1,\dots,K}|\lambda_i - \lambda^*_i| \leq \sqrt{\frac{2}{\alpha}\big(f(\lambda^*) - f(\lambda)\big)}.$$

Proof of Lemma 4. Let $g(\cdot) = -f_w(\cdot)$. For any pair of allocations $\lambda, \lambda' \in \mathcal{D}_K$, from Taylor's theorem there exists an allocation $\lambda''$ such that

$$g(\lambda) = g(\lambda') + \nabla g(\lambda')^\top(\lambda - \lambda') + \frac{1}{2}(\lambda - \lambda')^\top H_g(\lambda'')(\lambda - \lambda').$$

Given the bound from Lemma 1 (smoothness) and taking $\lambda' = \lambda^*$ (by Asm. 1, $\lambda^* \in \mathcal{D}_K$) we have

$$g(\lambda) \leq g(\lambda^*) + \nabla g(\lambda^*)^\top(\lambda - \lambda^*) + \frac{\beta}{2}\,\|\lambda - \lambda^*\|_2^2.$$

Consider the term $\nabla g(\lambda)^\top(\lambda - \lambda^*)$. By convexity of $g$, the gradient of $g$ at $\lambda$ in the direction towards the optimum $\lambda^*$ is negative. As a result we get

$$g(\lambda) \leq g(\lambda^*) + \big(\nabla g(\lambda^*) - \nabla g(\lambda)\big)^\top(\lambda - \lambda^*) + \frac{\beta}{2}\,\|\lambda - \lambda^*\|_2^2.$$

Using the Cauchy-Schwarz inequality and the fact that for twice differentiable functions the boundedness of the Hessian (i.e., the smoothness of the function) implies that the gradient of $g$ is Lipschitz with coefficient $\beta$, we obtain

$$g(\lambda) \leq g(\lambda^*) + \beta\,\|\lambda - \lambda^*\|_2^2 + \frac{\beta}{2}\,\|\lambda - \lambda^*\|_2^2.$$

Substituting $g$ with $f$ we obtain the desired statement

$$f_w(\lambda^*;\{\nu_i\}) - f_w(\lambda;\{\nu_i\}) \leq \frac{3\beta}{2}\,\|\lambda - \lambda^*\|_2^2.$$

We introduce another useful intermediate lemma that states the quality of the estimated optimal allocation.

Lemma 5. Let $\hat\nu_{i,n}$ be the empirical distribution characterized by mean $\hat\mu_{i,n}$ and standard deviation $\hat\sigma_{i,n}$ estimated using $T_{i,n}$ samples. If $\hat\lambda_n = \arg\max_{\lambda\in\mathcal{D}_K} f(\lambda;\{\hat\nu_i\})$ is the estimated optimal allocation, then

$$f^* - f(\hat\lambda_n;\{\nu_i\}) \leq 2\,\max_i\sqrt{\frac{2K\log(2/\delta_n)}{\lambda_{\min}\,T_{i,n}}}$$

with probability $1-\delta$, where $\delta_n = \delta/(4Kn^2(n+1))$.


Proof of Lemma 5. The statement follows from the series of inequalities

$$f^* - f(\hat\lambda_n;\{\nu_i\}) = f(\lambda^*;\{\nu_i\}) - f(\lambda^*;\{\hat\nu_i\}) + f(\lambda^*;\{\hat\nu_i\}) - f(\hat\lambda_n;\{\hat\nu_i\}) + f(\hat\lambda_n;\{\hat\nu_i\}) - f(\hat\lambda_n;\{\nu_i\}) \leq 2\sup_{\lambda\in\mathcal{D}_K}\big|f(\lambda;\{\nu_i\}) - f(\lambda;\{\hat\nu_i\})\big|,$$

where the difference between the third and fourth terms is upper-bounded by 0, since $\hat\lambda_n$ is the optimizer of $f(\cdot;\{\hat\nu_i\})$. The final statement follows from Corollary 1.

Lemma 6. Consider a function $h(n) = o(n)$ monotonically increasing with $n$. If the forcing condition is

$$T_{i,n} < h(n) + 1, \qquad (6)$$

then for any $n \geq n_0$,

$$T_{i,n} \geq h(n), \qquad (7)$$

with $n_0 = \min\{n : \exists\rho\in\mathbb{N},\ n = \rho K + 1,\ \text{s.t.}\ \rho \geq h(\rho K) + 1\}$ corresponding to the end of the uniform exploration phase.

Proof. The proof of this lemma generalizes Lemma 11 of Antos et al. [2010].

Step 1. We consider a step $n$ such that (7) holds. We recall that since $T_{i,n}$ is an integer, then $T_{i,n} \geq \lceil h(n)\rceil$. We define $\Delta_n$ as the largest number of steps after $n$ in which (7) still holds, that is

$$\Delta_n = \max\big\{\Delta\in\mathbb{N} : T_{i,n+\Delta} \geq T_{i,n} \geq \lceil h(n)\rceil \geq h(n+\Delta)\big\}.$$

If for $n-1$ we have $\lceil h(n-1)\rceil = h(n-1)$, then $\Delta_{n-1} = 0$, since $h(n-1) < h(n)$ by the definition of $h(n)$, and we say that $n$ is a reset step. We use $\tilde n_l$ with $l\in\mathbb{N}$ to denote the sequence of all reset steps. We define the $l$-th phase as $P_l = \{\tilde n_l, \dots, \tilde n_l + \Delta_{\tilde n_l}\}$ and we notice that if there exists a step $n' \in P_l$ such that $T_{i,n'}$ satisfies (7), then for any other step $n'' \in P_l$ we have

$$T_{i,n''} \geq T_{i,n'} \overset{(a)}{\geq} \lceil h(n')\rceil \overset{(b)}{\geq} \lceil h(\tilde n_l)\rceil \overset{(c)}{\geq} h(\tilde n_l + \Delta_{\tilde n_l}) \overset{(d)}{\geq} h(n''),$$

where (a) follows from (7), (b) holds since $n' \geq \tilde n_l$, (c) by the definition of $\Delta_n$, and (d) by the fact that $h(n)$ is monotonically increasing in $n$ and $\tilde n_l + \Delta_{\tilde n_l} \geq n''$. Finally, we also notice that $\Delta_{\tilde n_l}$ is a non-decreasing function of $l$, and thus $P_l$ becomes longer and longer over time. At this point, we have that (7) is consistent within each phase $P_l$, and thus we need to show that the forced exploration guarantees that the condition is also preserved across phases.

Step 2. We study the initial phase of the algorithm. The forcing condition determines a first phase in which all arms are explored uniformly in an arbitrary (but fixed) order (for the sake of simplicity, let us consider the natural ordering $\{1,\dots,K\}$). Let $n = \rho K$ for some $\rho\in\mathbb{N}$ during the uniform exploration phase; then at the beginning of step $n$ arm $K$ is pulled and at the end of the step all arms have $T_{i,n+1} = \rho$ samples. The end of the exploration phase corresponds to the smallest value of $\rho$ such that step $n = \rho K + 1$ satisfies $T_{i,n} = \rho \geq h(n) + 1 = h(\rho K) + 1$, so that the forcing condition is not triggered any more. We also notice that $T_{i,n} \geq h(\rho K)$ satisfies (7) and that $n = \rho K + 1$ is a reset step (i.e., $\lceil n-1\rceil = \rho K$), and thus we denote by $\tilde n_1 = \rho K + 1$ the beginning of the first phase $P_1$; by Step 1, we obtain that for all $n' \in P_1$, $T_{i,n'} \geq h(\rho K + 1 + \Delta_{\rho K})$ (i.e., (7) keeps holding). This is the base for the induction.

Step 3. We assume that (7) holds at the step $\tilde n_l$ at the beginning of phase $P_l$ for all arms; then by Step 1, (7) also holds at any other step until $\tilde n_l + \Delta_{\tilde n_l}$, independently of whether the arms are pulled or not. We study what happens at the beginning of the successive phase starting at $\tilde n_l + \Delta_{\tilde n_l} + 1$. We first consider all arms $i$ for which $T_{i,\tilde n_l} = \lceil h(\tilde n_l)\rceil$; then we have $T_{i,\tilde n_l} < h(\tilde n_l) + 1$, which implies that the forced exploration is triggered on these arms. Since there are potentially $K$ arms in this situation, it may take as long as $K$ steps before updating them all. Thus, if $\Delta_{\tilde n_l} > K$, then

$$T_{i,\tilde n_l + \Delta_{\tilde n_l} + 1} = T_{i,\tilde n_{l+1}} \geq T_{i,\tilde n_l + K} \overset{(a)}{\geq} \lceil h(\tilde n_l)\rceil + 1 \overset{(b)}{\geq} h(\tilde n_l + \Delta_{\tilde n_l}) + 1 \overset{(c)}{>} h(\tilde n_l + \Delta_{\tilde n_l} + 1),$$

where in (a) we use the fact that arm $i$ is forced to be pulled, (b) follows from the definition of $\Delta_n$, and (c) is by the sub-linearity of $h(n)$. Then we focus on the arms for which $T_{i,\tilde n_l} \geq \lceil h(\tilde n_l)\rceil + 1$, for which the forcing condition is not met (at least at the beginning). Even if during phase $P_l$ these arms are never pulled, we have

$$T_{i,\tilde n_l + \Delta_{\tilde n_l} + 1} = T_{i,\tilde n_{l+1}} \geq T_{i,\tilde n_l} \geq \lceil h(\tilde n_l)\rceil + 1 > h(\tilde n_l + \Delta_{\tilde n_l} + 1),$$

where the arguments are as above. This concludes the inductive step, showing that if (7) holds in a phase $P_l$ then it holds in $P_{l+1}$ as well, as soon as $\Delta_{\tilde n_l} > K$. As a result, Step 2, together with the condition on $n$ for the end of the exploration phase, and Step 3 prove the statement.

Corollary 2. If $h(n) = \eta\sqrt{n}$, then for any $n \geq n_0 = K(K\eta^2 + \eta\sqrt{K} + 1)$ and all arms, $T_{i,n} \geq \eta\sqrt{n}$.

Proof. We just need to derive the length of the exploration phase $n_0 = \min\{n : \exists \rho \in \mathbb{N},\ n = \rho K + 1, \text{ s.t. } \rho \geq h(\rho K) + 1\}$, i.e., the smallest $\rho$ such that
$$\rho \geq \eta\sqrt{\rho K} + 1.$$
Solving for $\rho$ and upper-bounding the condition provides $\rho \geq \eta^2 K + \eta\sqrt{K} + 1$, which gives $n_0 = K(K\eta^2 + \eta\sqrt{K} + 1)$.
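As a quick numerical sanity check of Corollary 2 (illustrative only, not part of the proof), the following sketch simulates just the forcing rule, i.e., pull any arm whose count satisfies $T_{i,n} < \eta\sqrt{n} + 1$ (ties broken by arm index) and an arbitrary arm otherwise; the values of $K$, $\eta$, and the horizon are arbitrary choices.

```python
import math

def forced_exploration_counts(K=3, eta=2.0, horizon=2000):
    """Simulate only the forcing rule: at step n, if some arm has
    T_i < eta*sqrt(n) + 1, pull the smallest-index such arm;
    otherwise pull an arbitrary arm (here, arm 0)."""
    T = [0] * K
    n0 = K * (K * eta**2 + eta * math.sqrt(K) + 1)  # threshold from Corollary 2
    violations = []
    for n in range(1, horizon + 1):
        under = [i for i in range(K) if T[i] < eta * math.sqrt(n) + 1]
        T[under[0] if under else 0] += 1
        # Corollary 2 predicts T_i >= eta*sqrt(n) for all arms once n >= n0.
        if n >= n0 and min(T) < eta * math.sqrt(n):
            violations.append(n)
    return n0, violations

n0, violations = forced_exploration_counts()
print(f"n0 = {n0:.1f}, violations after n0: {len(violations)}")
```

The check only exercises the special case $h(n) = \eta\sqrt{n}$; Lemma 6 itself holds for any sub-linear, monotonically increasing $h$.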

Lemma 7. We assume that there exists a value $n_1$ after which $\hat{\lambda}_n$ is constantly a good approximation of $\lambda^*$, i.e., there exists a monotonically decreasing function $\omega(n)$ such that for any step $n \geq n_1$
$$\max_{i=1,\ldots,K}\bigl|\hat{\lambda}_{i,n} - \lambda^*_i\bigr| \leq \omega(n).$$
Furthermore, we assume that for any $n \geq n_1$ the following condition holds
$$2n\omega(n) \geq \eta\sqrt{n} + 1. \qquad (8)$$
Then for any arm $i$
$$-(K-1)\max\Bigl\{\frac{n_1}{n},\ 2\omega(n) + \frac{1}{n}\Bigr\} \leq \tilde{\lambda}_{i,n} - \lambda^*_i \leq \max\Bigl\{\frac{n_1}{n}(1 - \lambda^*_i),\ 2\omega(n) + \frac{1}{n}\Bigr\}.$$

Proof. This lemma follows from similar arguments as Lemma 4 of Antos et al. [2010]. Nonetheless, given the use of a slightly different tracking rule, we provide the full proof here.

We study the error $\varepsilon_{i,n} = \tilde{\lambda}_{i,n} - \lambda^*_i$. Since $T_{i,n+1} = T_{i,n} + \mathbb{I}\{I_n = i\}$, we have
$$\varepsilon_{i,n+1} = \frac{T_{i,n} + \mathbb{I}\{I_n = i\}}{n+1} - \frac{n+1}{n+1}\lambda^*_i = \frac{n}{n+1}\Bigl(\frac{T_{i,n}}{n} - \lambda^*_i\Bigr) + \frac{\mathbb{I}\{I_n = i\} - \lambda^*_i}{n+1} = \frac{n}{n+1}\varepsilon_{i,n} + \frac{\mathbb{I}\{I_n = i\} - \lambda^*_i}{n+1}.$$
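This recursion can be verified numerically against the direct definition $\varepsilon_{i,n} = T_{i,n}/n - \lambda^*_i$; the sketch below uses an arbitrary pull sequence and an arbitrary $\lambda^*$, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_steps = 3, 500
lam_star = np.array([0.5, 0.3, 0.2])
T = np.zeros(K)
for n in range(1, n_steps + 1):
    i = rng.integers(K)                              # arbitrary pull at step n
    eps_n = T / n - lam_star                         # epsilon_{i,n}
    T[i] += 1                                        # T_{i,n+1} = T_{i,n} + I{I_n = i}
    eps_direct = T / (n + 1) - lam_star              # epsilon_{i,n+1} from the definition
    eps_recursive = n / (n + 1) * eps_n + (np.eye(K)[i] - lam_star) / (n + 1)
    assert np.allclose(eps_direct, eps_recursive)    # the two expressions coincide
```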

Then we need to study the arm selection rule at step $n$ to understand the evolution of the error and its relationship with the error of $\hat{\lambda}$. We have
$$\mathbb{I}\{I_n = i\} \leq \mathbb{I}\Bigl\{T_{i,n} < \eta\sqrt{n} + 1 \ \text{ or } \ i = \arg\min_j\bigl(\tilde{\lambda}_{j,n} - \hat{\lambda}_{j,n}\bigr)\Bigr\}.$$
We study the tracking condition. Let $i = \arg\min_j(\tilde{\lambda}_{j,n} - \hat{\lambda}_{j,n})$; then
$$\tilde{\lambda}_{i,n} = \hat{\lambda}_{i,n} + \min_j\bigl(\tilde{\lambda}_{j,n} - \hat{\lambda}_{j,n}\bigr) = \lambda^*_i + \hat{\lambda}_{i,n} - \lambda^*_i + \min_j\bigl(\tilde{\lambda}_{j,n} - \lambda^*_j + \lambda^*_j - \hat{\lambda}_{j,n}\bigr)$$
$$\leq \lambda^*_i + \hat{\lambda}_{i,n} - \lambda^*_i + \min_j\bigl(\tilde{\lambda}_{j,n} - \lambda^*_j\bigr) + \max_j\bigl(\lambda^*_j - \hat{\lambda}_{j,n}\bigr) \leq \lambda^*_i + 2\max_j\bigl|\lambda^*_j - \hat{\lambda}_{j,n}\bigr| \leq \lambda^*_i + 2\omega(n),$$


where we use the fact that since $\sum_j\bigl(\tilde{\lambda}_{j,n} - \lambda^*_j\bigr) = 0$ and all proportions are bigger than zero, the minimum over $\bigl(\tilde{\lambda}_{j,n} - \lambda^*_j\bigr)$ is nonpositive. We now study the forcing condition $T_{i,n} = n\tilde{\lambda}_{i,n} < \eta\sqrt{n} + 1$. For $n \geq n_1$, we have
$$\tilde{\lambda}_{i,n} - \lambda^*_i \leq \frac{\eta\sqrt{n} + 1}{n} \leq 2\omega(n),$$
where the last step follows from the property (8) of $\omega$ assumed in the statement. Then we can simplify the condition under which arm $i$ is selected as
$$\mathbb{I}\{I_n = i\} \leq \mathbb{I}\bigl\{\varepsilon_{i,n} \leq 2\omega(n)\bigr\}.$$
We proceed by defining $E_{i,n} = n\varepsilon_{i,n}$ and the corresponding process $E_{i,n+1} = E_{i,n} + \mathbb{I}\{I_n = i\} - \lambda^*_i$. We also introduce
$$\tilde{E}_{i,n_1} = E_{i,n_1}, \qquad \tilde{E}_{i,n+1} = \tilde{E}_{i,n} + \mathbb{I}\bigl\{\tilde{E}_{i,n} \leq 2n\omega(n)\bigr\} - \lambda^*_i,$$
which follows the same dynamics as $E_{i,n}$ except that the looser arm-selection condition is used. From Lemma 5 of Antos et al. [2010], we have $E_{i,n} \leq \tilde{E}_{i,n}$ for any $n \geq n_1$. It is easy to see that $\tilde{E}_{i,n}$ satisfies the inequality
$$\tilde{E}_{i,n} \leq \max\bigl\{E_{i,n_1},\ 2n\omega(n) + 1\bigr\} \leq \max\bigl\{n_1,\ 2n\omega(n) + 1\bigr\},$$
where the last inequality follows from the fact that $E_{i,n_1} \leq n_1(1 - \lambda^*_i)$. Dividing by $n$, we obtain the upper bound
$$\varepsilon_{i,n} \leq \max\Bigl\{\frac{n_1}{n}(1 - \lambda^*_i),\ 2\omega(n) + \frac{1}{n}\Bigr\}.$$
From the upper bound, we obtain a lower bound on $\varepsilon_{i,n}$ as
$$\varepsilon_{i,n} = -\sum_{j \neq i}\varepsilon_{j,n} \geq -(K-1)\max_j \varepsilon_{j,n} \geq -(K-1)\max\Bigl\{\frac{n_1}{n},\ 2\omega(n) + \frac{1}{n}\Bigr\},$$
which concludes the proof.
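For concreteness, the arm-selection step analysed above (forced exploration combined with the tracking rule $I_n = \arg\min_j(\tilde{\lambda}_{j,n} - \hat{\lambda}_{j,n})$) can be sketched as follows; the estimated optimal allocation $\hat{\lambda}_n$ is taken as an input, since computing it requires optimizing the estimated objective, which is not reproduced here.

```python
import math
import numpy as np

def select_arm(T, lam_hat, n, eta):
    """One arm-selection step: forced exploration first, tracking otherwise.

    T       -- pull counts T_{i,n} (length-K array)
    lam_hat -- estimated optimal allocation hat(lambda)_n (sums to 1)
    n       -- current step index (n >= 1)
    eta     -- forcing parameter
    """
    T = np.asarray(T, dtype=float)
    # Forcing condition: any arm below the eta*sqrt(n) + 1 threshold is pulled first.
    under = np.flatnonzero(T < eta * math.sqrt(n) + 1)
    if under.size > 0:
        return int(under[0])
    # Tracking rule: pull the arm whose empirical allocation tilde(lambda)_{j,n}
    # lags the estimated optimal allocation the most.
    lam_tilde = T / n
    return int(np.argmin(lam_tilde - lam_hat))
```

Drawing $I_n$ at random from $\hat{\lambda}_n$ instead of tracking is the variant compared against in Fig. 8.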

B Proof of Theorem 1

In this section we report the full proof of the main regret theorem, whose complete statement is as follows.

Theorem 1. We consider a MAB with $K \geq 2$ arms characterized by distributions $\{\nu_i\}$ with means $\{\mu_i\}$ and variances $\{\sigma^2_i\}$. Consider ForcingBalance with a parameter $\eta \leq 21$ and a simplex restricted to $\lambda_{\min}$. Given a tradeoff parameter $w$ and under Asm. 1, ForcingBalance suffers a regret
$$R_n(\tilde{\lambda}) \leq \begin{cases} 1 & \text{if } n \leq n_0,\\[4pt] 43\,K^{5/2}\,\Sigma\,\sqrt{\dfrac{\log(2/\delta_n)}{\eta\lambda_{\min}}}\ n^{-1/4} & \text{if } n_0 < n \leq n_2,\\[4pt] 153\,K^{5/2}\,\Sigma\,\sqrt{\dfrac{\log(2/\delta_n)}{\lambda_{\min}\lambda^*_{\min}}}\ n^{-1/2} & \text{if } n > n_2,\end{cases}$$
with probability $1-\delta$ (where $\delta_n = \delta/(4Kn(n+1))$) and
$$n_0 = K(K\eta^2 + \eta\sqrt{K} + 1), \qquad n_2 = \frac{C}{(\lambda^*_{\min})^{8}}\,\frac{K^{10}}{\alpha^{4}\lambda_{\min}^{4}}\,\log(1/\delta_n)^{2},$$
where $C$ is a suitable numerical constant.

Sketch of the proof. In the active exploration problem, Antos et al. [2010] rely on the fact that minimizing $\varepsilon(\lambda)$ has a closed-form solution w.r.t. the parameters of the problem (i.e., the variances of the arms), and errors in estimating the parameters are directly translated into deviations between $\hat{\lambda}_{i,n}$ and $\lambda^*$. In our case, $\lambda^*$ has no closed-form solution and we need to explicitly “translate” errors in estimating $f_w$ into deviations between allocations and vice versa. Furthermore, ForcingBalance uses a slightly different tracking strategy (see Lem. 7 in the supplement) and we prove the effect of the forced exploration for a more general condition (see Lem. 6 in the supplement). The proof proceeds in the following steps. We first exploit the forced exploration of the algorithm to guarantee that each arm is pulled at least $\widetilde{O}(\sqrt{n})$ times by step $n$. Through Prop. 1, Lemma 2, and Lemma 3 we obtain that the allocation $\hat{\lambda}_n$ converges to $\lambda^*$ at a rate $\widetilde{O}(1/n^{1/8})$. We show that the tracking step is executed often enough (w.r.t. the forced exploration) and is efficient enough to propagate the errors of $\hat{\lambda}_n$ to the actual allocation $\tilde{\lambda}_n$, which also approaches $\lambda^*$ at a rate $\widetilde{O}(1/n^{1/8})$. Unfortunately, this does not translate into a satisfactory regret bound, but it shows that for $n$ big enough, $T_{i,n}$ is only a fraction away from the desired number of pulls $n\lambda^*_i$, which provides a more refined lower bound $T_{i,n} = \widetilde{\Omega}(n)$. In this second phase ($n \geq n_2$), the estimates $\hat{\nu}_{i,n}$ are much more accurate (Prop. 1) and, through Lemma 2 and Lemma 3, the accuracy of $\hat{\lambda}$ and $\tilde{\lambda}$ improves to $\widetilde{O}(1/n^{1/4})$. At this point, we apply Lemma 4 and translate the guarantee on $\tilde{\lambda}$ into the final regret bound. While in Prop. 1 we use simple Chernoff-Hoeffding bounds, Bernstein bounds (which take into account the impact of the variance on the concentration inequality) could significantly improve the final result. While the asymptotic rate would remain unaffected, we conjecture that this more refined analysis would show that whenever arms have very small variance the regret $R_n$ decreases as $\widetilde{O}(1/n)$ before converging to the asymptotic rate.

We now proceed with the formal proof. We start with a technical lemma.

Lemma 8. If $\eta \leq 21$, then
$$2n\sqrt{\frac{20K}{\alpha\lambda_{\min}}}\left(\frac{K\log(1/\delta_n)}{2\eta\sqrt{n}}\right)^{1/4} \geq \eta\sqrt{n} + 1$$
for any $n \geq 4$.

Proof. We study the value of $n_1$ needed to satisfy Eq. 8 for the function $\omega(n)$ defined later on in Step 5 of the final proof. This corresponds to finding the minimal value of $n \in \mathbb{N}$ such that
$$2n\sqrt{\frac{20K}{\alpha\lambda_{\min}}}\left(\frac{K\log(1/\delta_n)}{2\eta\sqrt{n}}\right)^{1/4} \geq \eta\sqrt{n} + 1.$$
We proceed by successive (often loose) simplifications of the previous expression. Since $K \geq 2$ and $n \geq 1$, we have that $\delta_n \leq \delta/16$. If we choose $\delta < 1/2$, then $\delta_n \leq 1/32$ and $\log(1/\delta_n) > 3$. As a result, it is sufficient to satisfy
$$2n\sqrt{\frac{40}{\alpha\lambda_{\min}}}\left(\frac{3}{\eta\sqrt{n}}\right)^{1/4} \geq \eta\sqrt{n} + 1,$$
for which, in turn, it is enough that
$$\frac{16n}{\sqrt{\alpha\lambda_{\min}}}\left(\frac{1}{\eta\sqrt{n}}\right)^{1/4} \geq \eta\sqrt{n} + 1 \quad\Longleftrightarrow\quad \frac{16n}{\sqrt{\alpha\lambda_{\min}}} - \eta^{5/4}n^{5/8} - \eta^{1/4}n^{1/8} \geq 0.$$
We further bound the first multiplicative term using the fact that $\lambda_{\min} \leq 1/K \leq 1/2$ and $\alpha = 2(1-w)\sigma^2_{\min}/K \leq 1/4$, so that $16n/\sqrt{\alpha\lambda_{\min}} \geq 16\sqrt{8}\,n \geq 45n$; then the condition $2n\omega(n) \geq \eta\sqrt{n} + 1$ is implied by
$$45n - \eta^{5/4}n^{5/8} - \eta^{1/4}n^{1/8} \geq 0.$$
Using the condition that $45 \geq \eta^{5/4}$ and the fact that $\eta^{1/4} \leq \eta^{5/4}$, the last condition is in turn implied by
$$n - n^{5/8} - n^{1/8} \geq 0,$$
which is satisfied for any $n \geq 4$.
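The last inequality can be checked numerically, for instance (an illustrative check over a finite range only):

```python
# n - n**(5/8) - n**(1/8) >= 0 holds from n = 4 onwards (checked here up to 10**6);
# at n = 3 the left-hand side is still negative, so the threshold n >= 4 is tight.
assert all(n - n**0.625 - n**0.125 >= 0 for n in range(4, 10**6))
```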


Proof of Theorem 1. Step 1 (accuracy of empirical estimates). In Alg. 2 we explicitly force all arms to be pulled a minimum number of times. In particular, from Lemma 6 we have that for any $n \geq n_0 = K(K\eta^2 + \eta\sqrt{K} + 1)$, $T_{i,n} \geq \eta\sqrt{n}$. From Proposition 1 (here we already use the $\delta_n$ and the form of the confidence intervals of Corollary 1), if $\delta_n = \delta/(4Kn(n+1))$, then for any arm $i$ we have
$$\bigl|\hat{\mu}_{i,n} - \mu_i\bigr| \leq \sqrt{\frac{\log(1/\delta_n)}{2\eta\sqrt{n}}}, \qquad \bigl|\hat{\sigma}^2_{i,n} - \sigma^2_i\bigr| \leq \sqrt{\frac{2K\log(2/\delta_n)}{2\eta\sqrt{n}}},$$
with probability at least $1-\delta$.

Step 2 (accuracy of function estimate). Using Corollary 1 we can bound the error on the function $f$ when using the estimates $\hat{\nu}_{i,n}$ instead of the true parameters $\nu_i$. In fact, for any $\lambda \in \mathcal{D}_K$ we have
$$\bigl|f(\lambda;\{\nu_i\}) - f(\lambda;\{\hat{\nu}_{i,n}\})\bigr| \leq \sqrt{\frac{2K\log(2/\delta_n)}{\lambda_{\min}\eta\sqrt{n}}}.$$

Step 3 (performance of estimated optimal allocation). We can derive the performance of the allocation $\hat{\lambda}_n$ computed on the basis of the estimates obtained after $n$ samples. From Lemma 5 we have
$$\bigl|f(\lambda^*;\{\nu_i\}) - f(\hat{\lambda}_n;\{\nu_i\})\bigr| \leq 2\sqrt{\frac{2K\log(2/\delta_n)}{\lambda_{\min}\eta\sqrt{n}}}.$$

Step 4 (from performance to allocation). From Lemma 3 we have that a loss in performance in terms of the function $f$ implies the similarity of the estimated allocation to $\lambda^*$. For any arm $i = 1,\ldots,K$ we have
$$|\hat{\lambda}_{i,n} - \lambda^*_i| \leq \sqrt{4}\left(\frac{2K\log(2/\delta_n)}{\lambda_{\min}\eta\sqrt{n}}\right)^{1/4}.$$

Step 5 (tracking). The algorithm is designed so that the actual allocation $\tilde{\lambda}_n$ (i.e., the fraction of pulls allocated to each arm up to step $n$) tracks the estimated optimal allocation $\hat{\lambda}_n$. Since the difference between $\hat{\lambda}_n$ and $\lambda^*$ is bounded as above, we can use Lemma 7 to bound the accuracy of $\tilde{\lambda}$. We define
$$\omega(n) := \sqrt{4}\left(\frac{2K\log(2/\delta_n)}{\lambda_{\min}\eta\sqrt{n}}\right)^{1/4};$$
then from Lemma 8 we have that for any $n \geq 4$ (and $\eta \leq 21$) the condition in Eq. 8 ($2n\omega(n) \geq \eta\sqrt{n} + 1$) is satisfied and we can apply Lemma 7. In particular, $n_1 = \max\{5, n_0\} = n_0$ guarantees both conditions in the lemma, and this implies that for any arm $i = 1,\ldots,K$
$$|\tilde{\lambda}_{i,n} - \lambda^*_i| \leq \eta(n) := (K-1)\max\Bigl\{\frac{n_0}{n},\ 2\omega(n) + \frac{1}{n}\Bigr\}. \qquad (9)$$
If we stopped at this point, the regret could be bounded using Lemma 4 as
$$f(\lambda^*;\{\nu_i\}) - f(\tilde{\lambda}_n;\{\nu_i\}) \leq 34\,K^{5/2}\,\Sigma\,\sqrt{\frac{\log(1/\delta_n)}{\eta\lambda_{\min}}}\ n^{-1/4},$$
which decreases to zero very slowly.

Step 6 (linear pulls). From Eq. 9, we can then easily derive a much stronger guarantee on the number of pulls allocated to any arm $i$. Let $n_2 = \min\{n \in \mathbb{N} : \eta(n) \leq \min_i \lambda^*_i/2\}$; then for any $n \geq n_2$ we have
$$|\tilde{\lambda}_{i,n} - \lambda^*_i| \leq \lambda^*_i/2,$$


which implies that
$$T_{i,n} \geq n\lambda^*_i/2.$$

Step 7 (regret bound). At this point we can reproduce Steps 1, 2, and 3 using $T_{i,n} \geq n\lambda^*_i/2 \geq n\lambda^*_{\min}/2$ samples and we obtain that for any $n \geq n_2$
$$\bigl|f(\lambda^*;\{\nu_i\}) - f(\hat{\lambda}_n;\{\nu_i\})\bigr| \leq 4\sqrt{\frac{K\log(2/\delta_n)}{n\lambda^*_{\min}\lambda_{\min}}}.$$
Unfortunately, this guarantee on the performance of the estimated optimal allocation does not directly translate into a regret bound on $\tilde{\lambda}_n$ (i.e., the actual distribution implemented by the algorithm up to step $n$). We first apply the same idea as in Step 4 and obtain
$$|\hat{\lambda}_{i,n} - \lambda^*_i| \leq \omega'(n) := \sqrt{8}\left(\frac{K\log(2/\delta_n)}{2n\lambda^*_{\min}\lambda_{\min}}\right)^{1/4}.$$
By applying a similar argument as in Lemma 8, we obtain that $n_1 = \max\{4, n_0\} = n_0$, and the tracking argument in Step 5 gives us
$$|\tilde{\lambda}_{i,n} - \lambda^*_i| \leq \eta'(n) := (K-1)\max\Bigl\{\frac{n_0}{n},\ 2\omega'(n) + \frac{1}{n}\Bigr\}.$$
At this point we just need to apply Lemma 4 to the difference between $\tilde{\lambda}$ and $\lambda^*$ and obtain
$$f(\lambda^*;\{\nu_i\}) - f(\tilde{\lambda}_n;\{\nu_i\}) \leq \frac{3\Sigma}{2}\,\|\tilde{\lambda}_n - \lambda^*\|_2^2 \leq \frac{3\Sigma K^2}{2}\max\Bigl\{\frac{n_0^2}{n^2},\ 9\bigl(\omega'(n)\bigr)^2\Bigr\},$$
where we used $\|\tilde{\lambda}_n - \lambda^*\|_2 \leq \sqrt{K}\max_i|\tilde{\lambda}_{i,n} - \lambda^*_i|$. Using the definition of $\omega'(n)$ gives the final statement
$$f(\lambda^*;\{\nu_i\}) - f(\tilde{\lambda}_n;\{\nu_i\}) \leq 153\,K^{5/2}\,\Sigma\,\sqrt{\frac{\log(2/\delta_n)}{n\lambda^*_{\min}\lambda_{\min}}}.$$

Step 8 (condition on $n$). From the definition of $n_2$, we have that $n_2$ is at most a value $n$ such that $\eta(n) \leq \lambda^*_i/2$. We consider the worst case for $\lambda^*_i$, that is $\lambda^*_{\min}$, and we bound separately the two possible terms in the max in the definition of $\eta(n)$. The first term should satisfy
$$(K-1)\frac{n_0}{n} \leq \frac{\lambda^*_i}{2} \ \Rightarrow\ n \geq \frac{2n_0(K-1)}{\lambda^*_{\min}}.$$
For the second term we have a much more slowly decreasing function $\omega(n)$ and the condition is
$$K\sqrt{4}\left(\frac{2K\log(2/\delta_n)}{\lambda_{\min}\eta\sqrt{n}}\right)^{1/4} \leq \frac{\lambda^*_i}{2} \ \Rightarrow\ n^{1/8} \geq \frac{4}{\lambda^*_{\min}}\,\frac{K^{5/4}}{\sqrt{\alpha}}\left(\frac{\log(2/\delta_n)}{\lambda_{\min}}\right)^{1/4},$$
which is clearly more constraining than the first one. Then we define
$$n_2 = \frac{4^8}{(\lambda^*_{\min})^8}\,\frac{K^{10}}{\alpha^4}\,\frac{\log^2(1/\delta_n)}{\lambda_{\min}^2}.$$
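To get a sense of the magnitude of this threshold, the explicit expression for $n_2$ can be evaluated numerically; the parameter values below are arbitrary illustrative choices (not taken from the experiments), with $\alpha = 2(1-w)\sigma^2_{\min}/K$ as in the proof of Lemma 8 and the $\log(1/\delta_n)$ term treated as a fixed input, since $\delta_n$ itself depends on $n$.

```python
def n2_threshold(K, w, sigma_min_sq, lam_min, lam_star_min, log_term):
    """Evaluate n_2 = 4^8/(lam*_min)^8 * K^10/alpha^4 * log_term^2/lam_min^2,
    with alpha = 2*(1-w)*sigma_min^2/K and log_term standing for log(1/delta_n)."""
    alpha = 2 * (1 - w) * sigma_min_sq / K
    return (4**8 / lam_star_min**8) * (K**10 / alpha**4) * (log_term**2 / lam_min**2)

# Example with arbitrary values: K=2, w=0.4, sigma_min^2=0.25.
print(f"n2 ~ {n2_threshold(2, 0.4, 0.25, 0.05, 0.3, 10.0):.2e}")
```

Since the simplifications in the proof are deliberately loose, this threshold is very conservative; empirically the asymptotic regime is reached much earlier (cf. Fig. 7, where the rescaled regret flattens within a few thousand steps).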


Exp.   $\lambda^*$           $\mu$        $\sigma^2$
1      0.57 (balanced)       (1.5, 1)     (1, 1)
2      0.56 (balanced)       (2, 1)       (1, 2)
3      0.28 (unbalanced)     (1.1, 1)     (0.1, 2)
4      0.85 (unbalanced)     (3, 1)       (0.1, 0.1)

Table 1: Optimal allocation for a MAB with $K = 2$ arms, different values of means and variances, and $w = 0.4$.

C Supplementary Results

Optimal allocation. We consider four different MABs with $K = 2$ arms, with means and variances reported in Table 1, and $w = 0.4$ (i.e., slightly more weight on the estimation errors). In the first two settings, we notice that $\lambda^*$ is an almost balanced allocation over the two arms, with a slight preference for arm 1. In the first case, this is due to the fact that the variance of the arms is exactly the same, which suggests that a perfectly even allocation would guarantee equal estimation accuracy of the means. Nonetheless, since arm 1 has a larger mean, this moves $\lambda^*$ towards it. On the other hand, in the second case both means and variances are unbalanced, but while $\sigma^2_2 > \sigma^2_1$ suggests that arm 2 should be pulled more (to compensate for the larger variance), $\mu_1 > \mu_2$ requires arm 1 to be pulled much more than arm 2. As a result, $\lambda^*$ is still balanced. In the third and fourth settings, $\lambda^*$ recommends selecting one arm much more than the other. While in the third setting this is due to a strong unbalance in the variances, in the fourth setting it is induced by the difference in the means.

Comparison with Naive-UCB. Before reporting empirical results on the straightforward UCB-like algorithm (called ($\mu$,$\sigma$)-Naive-UCB) illustrated in Sect. 3, where an upper bound on the function $f_w$ is constructed at each step, we first provide a preliminary example. Consider the case (very extreme, for illustrative purposes) $w = 0$, $\sigma_1 = 1$, $\sigma_2 = 2$, for which $\lambda^* \approx [0.38, 0.62]$. Assume that after pulling each arm twice we have $\hat{\sigma}_1 = 2$, $\hat{\sigma}_2 = 0.1$. Using ($\mu$,$\sigma$)-Naive-UCB, the estimated optimal allocation would be very close to $[1, 0]$ (i.e., only arm 1 is pulled). Then ($\mu$,$\sigma$)-Naive-UCB keeps selecting arm 1, while arm 2, whose lower bound on the variance remains very small (since $1/\sqrt{T_2}$ is large), is almost never pulled, thus preventing the estimate from converging and the algorithm from achieving a small regret. This shows that the algorithm incurs a constant regret with a fixed probability.

In Fig. 7 we report the rescaled regret $\tilde{R}_n = \sqrt{n}\,R_n$ for both ForcingBalance and ($\mu$,$\sigma$)-Naive-UCB. More in detail, we define an upper bound on $f_w$ as in Eq. 5, but we threshold the lower bounds on the variance at a constant (0.01 in the experiments). Then we compute $\hat{\lambda}_n$ as the optimal allocation of $f^{\mathrm{UB}}_w$ and use the same tracking arm selection as in ForcingBalance. On the other hand, ($\mu$,$\sigma$)-Naive-UCB does not use any forced exploration, and relies instead on the optimism-in-the-face-of-uncertainty principle to sufficiently explore all arms. The comparison is reported for settings 2 and 3 of Table 1.

For both settings ForcingBalance performs as expected and its rescaled regret eventually converges to a constant (in this case the constant is very small since we have only two arms). On the other hand, ($\mu$,$\sigma$)-Naive-UCB achieves very contrasting results. In setting 2 (Fig. 7-left), in a first phase ($\mu$,$\sigma$)-Naive-UCB suffers a regret decreasing at a slower rate than $\widetilde{O}(1/\sqrt{n})$, since the rescaled regret is increasing. While in ForcingBalance this phase is limited to the initial exploration of all arms, in ($\mu$,$\sigma$)-Naive-UCB it is due to the fact that the algorithm underestimates the variances of the arms and tends to be more aggressive in selecting arms with larger mean (arm 1 in this case), which corresponds to very limited exploration of the other arm. The only residual sources of exploration are the fact that the lower bounds on the variance are capped at a small constant (when negative), which encourages partial exploration, and the upper-confidence bounds on the means, which induce a UCB-like strategy where the two arms are explored to identify the best one. As a result, there is a long phase of poor performance, until enough exploration is achieved to obtain estimates which allow an accurate estimate of $\lambda^*$ and thus a regret which decreases again as $O(1/\sqrt{n})$. In setting 3 (Fig. 7-right), the optimal allocation is very biased towards the second arm (see Table 1). However, ($\mu$,$\sigma$)-Naive-UCB targets the first arm more, attracted by its high mean upper bound (i.e., $\tilde{\lambda}_1 \gg \tilde{\lambda}_2$, in contrast with $\lambda^*$). As a result, its regret is much higher in this case than in the previous one. In fact, within the horizon of $n = 5000$ the regret is constant (and thus the rescaled regret increases as $\widetilde{O}(\sqrt{n})$) and the algorithm does not seem to be able to recover from bad estimates of the variance.

Tracking performance. Finally, we investigate the effect of the tracking strategy of ForcingBalance by comparing it with an arm selection where $I_t$ is drawn at random from $\hat{\lambda}_t$. This version of ForcingBalance does not try to compensate for the difference between the current allocation $\tilde{\lambda}_t$ and the desired allocation $\hat{\lambda}_t$. We report the error in estimating $\lambda^*$ in $\ell_1$-norm for both $\hat{\lambda}_t$ and $\tilde{\lambda}_t$ in the two configurations of ForcingBalance. We can see in Fig. 8-(left/center) that while $\hat{\lambda}$ is not affected by the tracking rule, $\tilde{\lambda}$ is significantly slower in converging to $\lambda^*$ when the algorithm does not compensate for the mismatch between the estimated optimal and the actual allocation. Furthermore, this directly translates into a much higher regret, as illustrated in Fig. 8-(right).


Figure 7: Rescaled regret $\tilde{R}_n = \sqrt{n}\,R_n$ (95% quantiles) for ForcingBalance and ($\mu$,$\sigma$)-Naive-UCB on settings 2 and 3 of Table 1. Notice the difference in the scale of the x and y axes.


Figure 8: Performance of ForcingBalance with (top) and without (bottom) the tracking step for the setting in Table 4. From left to right: $\ell_1$ error in approximating $\lambda^*$, error in approximating $\lambda^*_4$, and rescaled regret $\tilde{R}_n = \sqrt{n}\,R_n$ (95% quantiles).