
Near Minimax Optimal Players for the Finite-Time 3-Expert Prediction Problem

Yasin Abbasi-Yadkori, Adobe Research
Peter L. Bartlett, UC Berkeley
Victor Gabillon, Queensland University of Technology

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract

We study minimax strategies for the online prediction problem with expert advice. It has been conjectured that a simple adversary strategy, called COMB, is near optimal in this game for any number of experts. Our results and new insights make progress in this direction by showing that, up to a small additive term, COMB is minimax optimal in the finite-time three-expert problem. In addition, we provide for this setting a new near minimax optimal COMB-based learner. Prior to this work, in this problem, learners obtaining the optimal multiplicative constant in their regret rate were known only when K = 2 or K → ∞. We characterize, when K = 3, the regret of the game as $\sqrt{8T/(9\pi)} \pm \log^2(T)$, which gives for the first time the optimal constant in the leading ($\sqrt{T}$) term of the regret.

1 Introduction

This paper studies the online prediction problem with expert advice. This is a fundamental problem of machine learning that has been studied for decades, going back at least to the work of Hannan [12] (see [4] for a survey). As it studies prediction under adversarial data, the designed algorithms are known to be robust and are commonly used as building blocks of more complicated machine learning algorithms with numerous applications. Thus, elucidating the yet unknown optimal strategies has the potential to significantly improve the performance of these higher-level algorithms, in addition to providing insight into a classic prediction problem. The problem is a repeated two-player zero-sum game between an adversary and a learner. At each of the T rounds, the adversary decides the quality/gain of K experts' advice, while simultaneously the learner decides to follow the advice of one of the experts. The objective of the adversary is to maximize the regret of the learner, defined as the difference between the total gain of the best fixed expert and the total gain of the learner.

Open Problems and our Main Results. Previously this game has been solved asymptotically as both T and K tend to ∞: asymptotically, the upper bound on the performance of the state-of-the-art Multiplicative Weights Algorithm (MWA) for the learner matches the optimal multiplicative constant of the asymptotic minimax optimal regret rate $\sqrt{(T/2)\log K}$ [3]. However, for finite K, this asymptotic quantity actually overestimates the finite-time value of the game. Moreover, Gravin et al. [10] proved a matching lower bound $\sqrt{(T/2)\log K}$ on the regret of the classic version of MWA, additionally showing that the optimal learner does not belong to an extended MWA family. Already, Cover [5] proved that the value of the game is of order $\sqrt{T/(2\pi)}$ when K = 2, meaning that the regret of a MWA learner is 47% larger than that of the optimal learner in this case. Therefore the question of optimality remains open for non-asymptotic K, which are the typical cases in applications.

In studying a related setting with K = 3, where T is sampled from a geometric distribution with parameter δ, Gravin et al. [9] conjectured that, for any K, a simple adversary strategy, called the COMB adversary, is asymptotically optimal (T → ∞, or when δ → 0), and also highly competitive for finite, fixed T. The COMB strategy sorts the experts based on their cumulative gains and, with probability one half, assigns gain one to each expert in an odd position and gain zero to each expert in an even position. With probability one half, the zeros and ones are swapped. The simplicity and elegance of this strategy, combined with its almost optimal performance, makes it very appealing and calls for a more extensive study of its properties.
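To make the description above concrete, here is a minimal Python sketch that samples one COMB gain vector for an arbitrary number of experts (the function name and interface are ours, not from the paper):

```python
import numpy as np

def comb_gains(cum_gains, rng):
    """Sample one COMB gain vector: sort experts by cumulative gain, give gain 1
    to the experts in odd sorted positions (1st, 3rd, ...) and gain 0 to the
    others, then swap the zeros and ones with probability one half."""
    order = np.argsort(-np.asarray(cum_gains, dtype=float))  # best expert first
    gains = np.zeros(len(order), dtype=int)
    gains[order[::2]] = 1                 # odd sorted positions get gain 1
    if rng.random() < 0.5:                # with probability 1/2, swap 0s and 1s
        gains = 1 - gains
    return gains

rng = np.random.default_rng(0)
print(comb_gains([5, 3, 3], rng))
```

For K = 3 this is exactly the gain distribution denoted C = {2}{13} later in the paper: either the middle expert alone, or the leading and lagging experts together, each with probability one half.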

Our results and new insights make progress in this direction by showing that, for any fixed T and up to small additive terms, COMB is minimax optimal in the finite-time three-expert problem. Additionally, and with similar guarantees, we provide for this setting a new near minimax optimal COMB-based learner. For K = 3, the regret of a MWA learner is 39% larger than that of our new optimal learner.¹ In this paper we also characterize, when K = 3, the regret of the game as $\sqrt{8T/(9\pi)} \pm \log^2(T)$, which gives for the first time the optimal constant in the leading ($\sqrt{T}$) term of the regret. Note that the state-of-the-art non-asymptotic lower bound in [15] on the value of this problem is non-informative here, as the lower bound for the case of K = 3 is a negative quantity.

Related Works and Challenges. For the case of K = 3, Gravin et al. [9] proved the exact minimax optimality of a COMB-related adversary in the geometrical setting, i.e. where T is not fixed in advance but rather sampled from a geometric distribution with parameter δ. However the connection between the geometrical setting and the original finite-time setting is not well understood, even asymptotically (possibly due to the large variance of geometric distributions with small δ). Addressing this issue, in Section 7 of [8], Gravin et al. formulate the "Finite vs Geometric Regret" conjecture, which states that the value of the game in the geometrical setting, $V_\alpha$, and the value of the game in the finite-time setting, $V_T$, verify $V_T = \frac{2}{\sqrt{\pi}}\, V_{\alpha=1/T}$. We resolve here the conjecture for K = 3.

Analyzing the finite-time expert problem raises new challenges compared to the geometric setting. In the geometric setting, at any time (round) t of the game, the expected number of remaining rounds before the end of the game is constant (it does not depend on the current time t). This simplifies the problem to the point that, when K = 3, there exists an exactly minimax optimal adversary that ignores the time t and the parameter δ. As noted in [9], and noticeable from solving exactly small instances of the game with a computer, in the finite-time case the exact optimal adversary seems to depend in a complex manner on time and state. It is therefore natural to compromise for a simpler adversary that is optimal up to a small additive error term. Actually, based on the observation of the restricted computer-based solutions, the additive error term of COMB seems to vanish with larger T.

Tightly controlling the errors made by COMB is a new challenge with respect to [9], where the solution to the optimality equations led directly to the exact optimal adversary. The existence of such equations in the geometric setting crucially relies on the fact that the value-to-go of a given policy in a given state does not depend on the current time t (because geometric distributions are memoryless). To control the errors in the finite-time setting, our new approach solves the game by backward induction, showing the approximate greediness of COMB with respect to itself (read Section 2.1 for an overview of our new proof techniques and their organization). We use a novel exchangeability property, new connections to random walks, and a close relation that we develop between COMB and a TWIN-COMB strategy. Additional connections with new related optimal strategies and random walks are used to compute the value of the game (Theorem 2). We discuss in Section 6 how our new techniques have more potential to extend to an arbitrary number of arms than those of [9].

Additionally, we show how the approximate greediness of COMB with respect to itself is key to proving that a learner based directly on the COMB adversary is itself quasi-minimax-optimal. This is the first work to extend, to the approximate case, approaches used to design exactly optimal players in related works. In [2] a probability matching learner is proven optimal under the assumption that the adversary is limited to a fixed cumulative loss for the best expert. In [14] and [1], the optimal learner relies on estimating the value-to-go of the game through rollouts of the optimal adversary's plays. The results in these papers were limited to games where the optimal adversary only plays canonical unit vectors, while our result holds for general gain vectors. Note also that a probability matching learner is optimal in [9].

Notation: Let [a : b] = {a, a + 1, . . . , b} with a, b ∈ N, a ≤ b, and [a] = [1 : a]. For a vector w ∈ R^n, n ∈ N, $\|w\|_\infty = \max_{k\in[n]} |w_k|$. A vector indexed by both a time t and a specific element index k is $w_{t,k}$. An undiscounted Markov Decision Process (MDP) [13, 16] M is a 4-tuple ⟨S, A, r, p⟩. S is the state space, A is the set of actions, r : S × A → R is the reward function, and the transition model p(·|s, a) gives the probability distribution over the next state when action a is taken in state s. A state is denoted by s, or s_t if it is taken at time t. An action is denoted by a or a_t.

¹ [19] also provides an upper bound that is suboptimal when K = 3, even after optimization of its parameters.


2 The Game

We consider a game, composed of T rounds, between two players, called a learner and an adversary. At each time/round t the learner chooses an index $I_t \in [K]$ from a distribution $p_t$ on the K arms. Simultaneously, the adversary assigns a binary gain to each of the arms/experts, possibly at random from a distribution $A_t$, and we denote the vector of these gains by $g_t \in \{0,1\}^K$. The adversary and the learner then observe $I_t$ and $g_t$. For simplicity we use the notation $g_{[t]} = (g_s)_{s=1,\dots,t}$. The value of one realization of such a game is the cumulative regret defined as
$$R_T = \Big\|\sum_{t=1}^{T} g_t\Big\|_\infty - \sum_{t=1}^{T} g_{t,I_t}.$$
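As a concrete illustration of this protocol and of the regret $R_T$ just defined, here is a minimal Python sketch of one realization of the game (the function names and interfaces are ours; the adversary and learner are passed in as callables):

```python
import numpy as np

def play_game(T, K, adversary, learner, seed=0):
    """Simulate one realization of the T-round game and return the regret R_T.
    `adversary(state, t, rng)` returns a gain vector in {0,1}^K and
    `learner(state, t, rng)` returns the index of the followed expert; both only
    observe the cumulative gains dealt so far (the state)."""
    rng = np.random.default_rng(seed)
    state = np.zeros(K, dtype=int)            # cumulative gains of the experts
    learner_gain = 0
    for t in range(1, T + 1):
        g = adversary(state.copy(), t, rng)   # both players move simultaneously
        i = learner(state.copy(), t, rng)
        learner_gain += g[i]
        state += g                            # s_{t+1} = s_t + g_t
    return state.max() - learner_gain         # ||sum_t g_t||_inf - sum_t g_{t,I_t}

# Example: the COMB adversary (see the sketch above) against a uniform learner.
# regret = play_game(100, 3, lambda s, t, rng: comb_gains(s, rng),
#                    lambda s, t, rng: rng.integers(3))
```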

A state $s \in S = (\mathbb{N} \cup \{0\})^K$ is a K-dimensional vector such that the k-th element is the cumulative sum of gains dealt by the adversary on arm k before the current time t. Here the state does not include t, but is typically denoted for a specific time t as $s_t$ and computed as $s_t = \sum_{t'=1}^{t-1} g_{t'}$. This definition is motivated by the fact that there exist minimax strategies for both players that rely solely on the state and time information, as opposed to the complete history of plays, $g_{[t]} \cup I_{[t]}$. In state s, the set of leading experts, i.e., those with maximum cumulative gain, is $X(s) = \{k \in [K] : s_k = \|s\|_\infty\}$. We use π to denote the (possibly non-stationary) strategy/policy used by the adversary, i.e., for any input state s and time t it outputs the gain distribution π(s, t) played by the adversary at time t in state s. Similarly we use p to denote the strategy of the learner. As the state depends only on the adversary's plays, we can sample a state s at time t from π.

Given an adversary π and a learner p, the expected regret of the game, $V^T_{p,\pi}$, is $V^T_{p,\pi} = E_{g_{[T]}\sim\pi,\, I_{[T]}\sim p}[R_T]$. The learner tries to minimize the expected regret while the adversary tries to maximize it. The value of the game is the minimax value $V_T$ defined by
$$V_T = \min_p \max_\pi V^T_{p,\pi} = \max_\pi \min_p V^T_{p,\pi}.$$
In this work, we are interested in the search for optimal minimax strategies, which are adversary strategies $\pi^*$ such that $V_T = \min_p V^T_{p,\pi^*}$ and learner strategies $p^*$ such that $V_T = \max_\pi V^T_{p^*,\pi}$.

2.1 Summary of our Approach to Obtain the Near Greediness of COMB

Most of our material is new. First, Section 3 recalls that Gravin et al. [9] have shown that the search for the optimal adversary $\pi^*$ can be restricted to the finite family of balanced strategies (defined in the next section). When K = 3, the action space of a balanced adversary is limited to seven stochastic actions (gain distributions), denoted by B3 = {W, C, V, 1, 2, {}, {123}} (see Section 5.1 for their description). The COMB adversary repeats the gain distribution C at each time and in any state.

In Section 4 we provide an explicit formulation of the problem as finding $\pi^*$ inside an MDP with a specific reward function. Interestingly, we observe that another adversary, which we call TWIN-COMB and denote by $\pi_W$, which repeats the distribution W, has the same value as $\pi_C$ (Section 5.1). To control the errors made by COMB, the proof uses a novel and intriguing exchangeability property (Section 5.2). This exchangeability property holds thanks to the surprising role played by the TWIN-COMB strategy. For any distribution A ∈ B3 there exists a distribution D, a mixture of C and W, such that for almost all states, playing A and then D is the same as playing W and then A in terms of the expected reward and the probabilities over the next states after these two steps. Using Bellman operators, this can be concisely written as: for any (value) function f : S → R, in (almost) any state s, we have $[T_A[T_D f]](s) = [T_W[T_A f]](s)$. We solve the MDP with a backward induction in time from t = T. We show that playing C at time t is almost greedy with respect to playing $\pi_C$ in later rounds t' > t. The greedy error is defined as the difference in expected reward between always playing $\pi_C$ and playing the best (greedy) first action before playing COMB. Bounding how these errors accumulate through the rounds relates the value of COMB to the value of $\pi^*$ (Lemma 16).

To illustrate the main ideas, let us first make two simplifying (but unrealistic) assumptions at time t: COMB has been proven greedy w.r.t. itself in rounds t' > t, and the exchangeability holds in all states. Then we would argue at time t that, by the exchangeability property, instead of optimizing the greedy action w.r.t. COMB as $\max_{A\in B_3} AC\cdots C$, we can study the optimizer of $\max_{A\in B_3} WAC\cdots C$. Then we use the induction property to conclude that C is the solution of the previous optimization problem.

Unfortunately, the exchangeability property does not hold in one specific state, denoted by $s_\alpha$. What saves us though is that we can directly compute the error of greedification of any gain distribution with respect to COMB in $s_\alpha$ and show that it diminishes exponentially fast as $T - t$, the number of rounds remaining, increases (Lemma 7). This helps us to control how the errors accumulate during the induction. From one given state $s_t \neq s_\alpha$ at time t, first, we use the exchangeability property once when trying to assess the 'quality' of an action A as a greedy action w.r.t. COMB. This leads us to consider the quality of playing A in possibly several new states $\{s_{t+1}\}$ at time t+1 reached following TWIN-COMB in s. We use our exchangeability property repeatedly, starting from the state $s_t$, until a subsequent state reaches $s_\alpha$, say at time $t_\alpha$, where we can substitute the exponentially decreasing greedy error computed at this time $t_\alpha$ in $s_\alpha$. Here the subsequent states are the states reached after having played TWIN-COMB repeatedly starting from the state $s_t$. If $s_\alpha$ is never reached, we use the fact that COMB is an optimal action everywhere else in the last round. The problem is then to determine at which time $t_\alpha$, starting from any state at time t and following a TWIN-COMB strategy, we hit $s_\alpha$ for the first time. This is translated into a classical gambler's ruin problem, which concerns the hitting times of a simple random walk (Section 5.3). Similarly, the value of the game is computed using the study of the expected number of equalizations of a simple random walk (Theorem 2).

3 Solving for the Adversary Directly

In this section, we recall the results from [9] that, for arbitrary K, permit us to directly search for the minimax optimal adversary in the restricted set of balanced adversaries while ignoring the learner.

Definition 1. A gain distribution A is balanced if there exists a constant $c_A$, the mean gain of A, such that $\forall k \in [K],\ c_A = E_{g|A}[g_k]$. A balanced adversary uses exclusively balanced gain distributions.

Lemma 1 (Claim 5 in [9]). There exists a minimax optimal balanced adversary.

Use $\mathcal{B}$ to denote the set of all balanced adversary strategies and B to denote the set of all balanced gain distributions. Interestingly, as demonstrated in [9], a balanced adversary π inflicts the same regret on every learner: if $\pi \in \mathcal{B}$, then $\exists V^\pi_T \in \mathbb{R} : \forall p,\ V^T_{p,\pi} = V^\pi_T$ (see Lemma 10). Therefore, given an adversary strategy π, we can define the value-to-go $\tilde{V}^\pi_{t_0}(s)$ associated with π from time $t_0$ in state s,
$$\tilde{V}^\pi_{t_0}(s) = E_{s_{T+1}}\big[\|s_{T+1}\|_\infty\big] - \sum_{t=t_0}^{T} E_{s_t}\big[c_{\pi(s_t,t)}\big], \qquad s_{t+1} \sim p(\cdot|s_t, \pi(s_t,t)),\ s_{t_0} = s.$$

Another reduction comes from the fact that every balanced gain distribution can be seen as a convex combination of a finite set of balanced distributions [9, Claims 2 and 3]. We call this limited set the atomic gain distributions. Therefore the search for $\pi^*$ can be limited to this set. The set of convex combinations of the m distributions $A_1, \dots, A_m$ is denoted by $\Delta(A_1, \dots, A_m)$.
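For illustration, the balanced property of Definition 1 is easy to check numerically; the sketch below uses a hypothetical dictionary representation of a gain distribution (mapping gain vectors to probabilities), which is our own convention and not from the paper:

```python
import numpy as np

def is_balanced(distribution, K, tol=1e-12):
    """Check Definition 1: a distribution is balanced if every expert has the
    same expected gain c_A. Returns (is_balanced, c_A)."""
    means = np.zeros(K)
    for g, prob in distribution.items():
        means += prob * np.asarray(g, dtype=float)
    return bool(np.allclose(means, means[0], atol=tol)), float(means[0])

# The COMB distribution C = {2}{13} for K = 3 (experts listed in sorted order):
C = {(0, 1, 0): 0.5, (1, 0, 1): 0.5}
print(is_balanced(C, 3))   # (True, 0.5): its mean gain is c_C = 1/2
```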

4 Reformulation as a Markovian Decision Problem

In this section we formulate, for arbitrary K, the maximization problem over balanced adversaries as an undiscounted MDP problem ⟨S, A, r, p⟩. The state space S was defined in Section 2 and the action space is the set of atomic balanced distributions, as discussed in Section 3. The transition model is defined by p(·|s, D), which is a probability distribution over states given the current state s and a balanced distribution over gains D. In this model, the transition dynamics are deterministic and entirely controlled by the adversary's action choices. However, the adversary is forced to choose stochastic actions (balanced gain distributions). The maximization problem can therefore also be thought of as designing a balanced random walk on states so as to maximize a sum of rewards (that are yet to be defined). First, we define $P_A$, the transition probability operator with respect to a gain distribution A. Given a function f : S → R, $P_A$ returns
$$[P_A f](s) = E[f(s') \mid s' \sim p(\cdot|s, A)] = E_{g|s,A}[f(s+g)],$$
where g is sampled in state s according to A. Given A in s, the per-step regret is denoted by $r_A(s)$ and defined as
$$r_A(s) = E_{s'|s,A}\|s'\|_\infty - \|s\|_\infty - c_A.$$


Given an adversary strategy π, starting in s at time $t_0$, the cumulative per-step regret is
$$V^\pi_{t_0}(s) = \sum_{t=t_0}^{T} E\big[r_{\pi(\cdot,t)}(s_t) \,\big|\, s_{t+1} \sim p(\cdot|s_t, \pi(s_t,t)),\ s_{t_0} = s\big].$$
The action-value function of π at (s, D) and t is the expected sum of rewards received by starting from s, taking action D, and then following π:
$$Q^\pi_t(s, D) = E\Big[\sum_{t'=t}^{T} r_{A_{t'}}(s_{t'}) \,\Big|\, s_t = s,\ A_t = D,\ s_{t'+1} \sim p(\cdot|s_{t'}, A_{t'}),\ A_{t'+1} = \pi(s_{t'+1}, t'+1)\Big].$$
The Bellman operator of A, $T_A$, is $[T_A f](s) = r_A(s) + [P_A f](s)$, with $[T_{\pi(s,t)} V^\pi_{t+1}](s) = V^\pi_t(s)$.

This per-step regret, $r_A(s)$, depends on s and A and not on the time step t. Removing the time from the picture permits a simplified view of the problem that leads to a natural formulation of the exchangeability property that is independent of the time t. Crucially, this decomposition of the regret into per-step regrets is such that maximizing $V^\pi_{t_0}(s)$ over adversaries π is equivalent, for all times $t_0$ and states s, to maximizing over adversaries the original value of the game, the value-to-go $\tilde{V}^\pi_{t_0}(s)$ (Lemma 2).

Lemma 2. For any adversary strategy π and any state s and time $t_0$, $\tilde{V}^\pi_{t_0}(s) = V^\pi_{t_0}(s) + \|s\|_\infty$.

The proof of Lemma 2 is in Section 8. In the following, our focus will be on maximizing $V^\pi_t(s)$ in any state s. We now show some basic properties of the per-step regret that hold for an arbitrary number of experts K and discuss their implications. The proofs are in Section 9.
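As a quick illustration of the per-step regret $r_A(s)$ just defined, the sketch below evaluates it for a gain distribution given in the same hypothetical dictionary form as above (our own convention); for a balanced A, every expert's mean gain equals $c_A$:

```python
import numpy as np

def per_step_regret(state, distribution):
    """r_A(s) = E_{s'|s,A} ||s'||_inf - ||s||_inf - c_A for a balanced gain
    distribution A given as {gain tuple: probability}; experts are assumed to be
    listed in the same order as in the state vector."""
    s = np.asarray(state, dtype=float)
    exp_next_max, c_A = 0.0, 0.0
    for g, prob in distribution.items():
        exp_next_max += prob * (s + np.asarray(g)).max()
        c_A += prob * float(np.mean(g))   # the common mean gain when A is balanced
    return exp_next_max - s.max() - c_A

C = {(0, 1, 0): 0.5, (1, 0, 1): 0.5}      # COMB for K = 3, experts sorted by gain
print(per_step_regret((5, 5, 3), C))      # 0.5: two leading experts ('reward wall')
print(per_step_regret((6, 4, 3), C))      # 0.0: unique leading expert (cf. Lemma 3 below)
```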

Lemma 3. Let A ∈ B. For all s, t, we have $0 \le r_A(s) \le 1$. Furthermore, if |X(s)| = 1, then $r_A(s) = 0$.

Lemma 3 shows that a state s in which the reward is not zero contains at least two equal leading experts, |X(s)| > 1. Therefore the goal of maximizing the reward can be rephrased as finding a policy that visits the states with |X(s)| > 1 as often as possible, while still taking into account that the per-step reward increases with |X(s)|. The set of states with |X(s)| > 1 is called the 'reward wall'.

Lemma 4. In any state s with |X(s)| = 2, for any balanced gain distribution D such that with probability one exactly one of the leading experts receives a gain of 1, $r_D(s) = \max_{A\in B} r_A(s)$.

5 The Case of K = 3

5.1 Notations in the 3-Experts Case, the COMB and the TWIN-COMB Adversaries

First we define the state space in the 3-expert case. The experts are sorted with respect to their cumulative gains and are named, in decreasing order, the leading expert, the middle expert and the lagging expert. As mentioned in [9], in our search for the minimax optimal adversary, it is sufficient for any K to describe our state only using the $d_{ij}$ that denote the difference between the cumulative gains of consecutive sorted experts i and j = i+1. Here, i denotes the expert with i-th largest cumulative gain, and hence $d_{ij} \ge 0$ for all i < j. Therefore one notation for a state, that will be used throughout this section, is $s = (x, y) = (d_{12}, d_{23})$. We distinguish four types of states C1, C2, C3, C4, as detailed below in Figure 1, which also shows their location on the 2d grid of states. C4 contains only the state denoted $s_\alpha = (0, 0)$.

    s ∈ C1:  d12 > 0, d23 > 0
    s ∈ C2:  d12 = 0, d23 > 0
    s ∈ C3:  d12 > 0, d23 = 0
    s ∈ C4:  d12 = 0, d23 = 0

    Atomic A        Symbol   c_A
    {1}{23}         W        1/2
    {2}{13}         C        1/2
    {3}{12}         V        1/2
    {1}{2}{3}       1        1/3
    {12}{13}{23}    2        2/3

Figure 1: 4 types of states (left), their location on the 2d grid of states (center), and the 5 atomic A (right)

Concerning the action space, the gain distributions use brackets. The arms in the same bracket receive gains together, and each group receives gains with equal probability. For instance, {1}{2}{3} exclusively deals a gain to expert 1 (leading expert) with probability 1/3, expert 2 (middle expert) with probability 1/3, and expert 3 (lagging expert) with probability 1/3, whereas {1}{23} means dealing a gain to expert 1 alone with probability 1/2 and to experts 2 and 3 together with probability 1/2. As discussed in Section 3, we are searching for a $\pi^*$ using mixtures of atomic balanced distributions. When K = 3 there are seven atomic distributions, denoted by B3 = {V, 1, 2, C, W, {}, {123}} and described in Figure 1 (right). Moreover, in Figure 2, we report in detail, in a table (left) and an illustration (right) on the 2-D state grid, the properties of the COMB gain distribution C. The remaining atomic distributions are similarly reported in the appendix in Figures 5 to 8.

    s     r_C(s)   Distribution of the next state s' ~ p(·|s, C), with s = (x, y)
    C1    0        P(s' = (x-1, y+1)) = P(s' = (x+1, y-1)) = 1/2
    C2    1/2      P(s' = (x+1, y))   = P(s' = (x+1, y-1)) = 1/2
    C3    0        P(s' = (x, y+1))   = P(s' = (x-1, y+1)) = 1/2
    C4    1/2      P(s' = (x, y+1))   = P(s' = (x+1, y))   = 1/2

Figure 2: The per-step regret and transition probabilities of the gain distribution C

In the case of three experts, the COMB distribution simply plays {2}{13} in any state. We use W to denote the gain distribution {1}{23} and refer to it as the TWIN-COMB distribution. The COMB and TWIN-COMB strategies (as opposed to the distributions) repeat their respective gain distributions in any state and at any time; they are denoted $\pi_C$ and $\pi_W$ respectively. Lemma 5 shows that the COMB strategy $\pi_C$, the TWIN-COMB strategy $\pi_W$, and therefore any mixture of both, have the same expected cumulative per-step regret. The proof is reported in Section 11.

Lemma 5. For all states s and times t, we have $V^{\pi_C}_t(s) = V^{\pi_W}_t(s)$.
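To make the reduced dynamics concrete, the following sketch transcribes the transition table of Figure 2 and computes the cumulative per-step regret $V^{\pi_C}_1(s_\alpha)$ of the COMB strategy by backward induction (an analogous table for W, taken from Figure 7 in the appendix, can be plugged in to check Lemma 5 numerically); the function names are ours:

```python
def comb_transition(state):
    """Per-step regret and next-state distribution of the COMB distribution C on
    the reduced state s = (d12, d23), transcribed from the table in Figure 2."""
    x, y = state
    if x > 0 and y > 0:   # class C1
        return 0.0, [((x - 1, y + 1), 0.5), ((x + 1, y - 1), 0.5)]
    if x == 0 and y > 0:  # class C2
        return 0.5, [((x + 1, y), 0.5), ((x + 1, y - 1), 0.5)]
    if x > 0 and y == 0:  # class C3
        return 0.0, [((x, y + 1), 0.5), ((x - 1, y + 1), 0.5)]
    return 0.5, [((x, y + 1), 0.5), ((x + 1, y), 0.5)]   # class C4: s_alpha = (0, 0)

def comb_value(T):
    """V^{pi_C}_1(s_alpha): expected cumulative per-step regret of always playing C,
    computed by backward induction over a grid that contains all reachable states."""
    values = {}                                 # values at time t+1 (empty at t = T)
    for t in range(T, 0, -1):
        nxt, values = values, {}
        for x in range(T + 2):
            for y in range(T + 2):
                r, succ = comb_transition((x, y))
                values[(x, y)] = r + sum(p * nxt.get(s, 0.0) for s, p in succ)
    return values[(0, 0)]

print(comb_value(2))   # 0.75: reward 1/2 at s_alpha, then the walk is in C2 with prob. 1/2
```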

5.2 The Exchangeability Property

Lemma 6. Let A ∈ B3. There exists D ∈ Δ(C, W) such that for any $s \neq s_\alpha$ and any f : S → R,
$$[T_A[T_D f]](s) = [T_W[T_A f]](s).$$

Proof. If A = W, A = {} or A = {123}, use D = W. If A = C, use Lemmas 11 and 12.

Case 1. A = V: V is equal to C in C3 ∪ C4, and if s' ∼ p(·|s, W) with s ∈ C3 then s' ∈ C3 ∪ C4. So when s ∈ C3 we reuse the case A = C above. When s ∈ C1 ∪ C2, we consider two cases.

Case 1.1. s ≠ (0, 1): We choose D = W, which is {1}{23}. If s' ∼ p(·|s, V) with s ∈ C2 then s' ∈ C2. Similarly, if s' ∼ p(·|s, V) with s ∈ C1 then s' ∈ C1 ∪ C3. Moreover, D modifies similarly the coordinates (d12, d23) of s ∈ C1 and s ∈ C3. Therefore the effect, in terms of transition probability and reward, of D is the same whether it is applied before or after the actions chosen by V. If s' ∼ p(·|s, D) with s ∈ C1 ∪ C2 then s' ∈ C1 ∪ C2. Moreover, V modifies similarly the coordinates (d12, d23) of s ∈ C1 and s ∈ C2. Therefore the effect, in terms of transition probability, of V is the same whether it is applied before or after the action D. In terms of reward, notice that in the states s ∈ C1 ∪ C2, V has 0 per-step regret and using V does not make s' leave or enter the reward wall.

Case 1.2. $s_t$ = (0, 1): We can choose D = W. One can check from the tables in Figures 7 and 8 that exchangeability holds. Additionally, we provide an illustration of the exchangeability equality on the 2d-grid of Figure 1: starting from the state s = (0, 1), we show on the grid the effect of the gain distribution V (in dashed red) followed (left picture) or preceded (right picture) by the gain distribution D (in plain blue). The illustration shows that V·D and D·V lead to the same final states with equal probabilities. The rewards are displayed on top of the pictures; their color corresponds to the actions, the probabilities are in italic, and the rewards are in roman.

Case 2 & 3. A = 1 & A = 2: The proof is similar and is reported in Section 12 of the appendix.


5.3 Approximate Greediness of COMB, Minimax Players and Regret

The greedy error of the gain distribution D in state s at time t is
$$\varepsilon^D_{s,t} = \max_{A\in B_3} Q^{\pi_C}_t(s, A) - Q^{\pi_C}_t(s, D).$$
Let $\varepsilon^D_t = \max_{s\in S} \varepsilon^D_{s,t}$ denote the maximum greedy error of the gain distribution D at time t. The COMB greedy error in $s_\alpha$ is controlled by the following lemma, proved in Section 13.1. Missing proofs from this section are in the appendix, in Section 13.2.

Lemma 7. For any t ∈ [T] and gain distribution D ∈ {W, C, V, 1}, $\varepsilon^D_{s_\alpha,t} \le \frac{1}{6}\left(\frac{1}{2}\right)^{T-t}$.

Figure 3: Numbering of the TWIN-COMB (top) and $\pi_G$ (bottom) random walks on the 2d grid

The following proposition shows how we can index the states in the 2d-grid as a one-dimensional line over which the TWIN-COMB strategy behaves very similarly to a simple random walk. Figure 3 (top) illustrates this random walk on the 2d-grid and the indexing scheme.

Proposition 1. Index a state s = (x, y) by $i_s = x + 2y$, irrespective of the time. Then for any state $s \neq s_\alpha$ and $s' \sim p(\cdot|s, W)$ we have that $P(i_{s'} = i_s - 1) = P(i_{s'} = i_s + 1) = \frac{1}{2}$.
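The claim of Proposition 1 is easy to check mechanically from the TWIN-COMB transition table (Figure 7 in the appendix); the sketch below transcribes that table and verifies the ±1 behaviour of the index $i_s$ on a small portion of the grid (function name ours):

```python
def twin_comb_next(state):
    """Next-state distribution of the TWIN-COMB distribution W = {1}{23} on the
    reduced state s = (d12, d23), transcribed from the table in Figure 7."""
    x, y = state
    if x > 0:                     # classes C1 (y > 0) and C3 (y = 0)
        return [((x + 1, y), 0.5), ((x - 1, y), 0.5)]
    if y > 0:                     # class C2
        return [((x + 1, y), 0.5), ((x + 1, y - 1), 0.5)]
    return [((0, 1), 0.5), ((1, 0), 0.5)]   # class C4: s_alpha = (0, 0)

# Proposition 1: for s != s_alpha the index i_s = d12 + 2*d23 moves by -1 or +1,
# each with probability 1/2, under one step of W.
for s in [(x, y) for x in range(5) for y in range(5) if (x, y) != (0, 0)]:
    i_s = s[0] + 2 * s[1]
    moves = sorted(nx + 2 * ny - i_s for (nx, ny), _ in twin_comb_next(s))
    assert moves == [-1, 1], (s, moves)
print("Proposition 1 verified on the 5x5 grid")
```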

Consider a random walk that starts from state $s_0 = s$ and is generated by the TWIN-COMB strategy, $s_{t+1} \sim p(\cdot|s_t, W)$. Define the random variable $T_{\alpha,s} = \min\{t \in \mathbb{N}\cup\{0\} : s_t = s_\alpha\}$. This random variable is the number of steps of the random walk before hitting $s_\alpha$ for the first time. Then, let $P_\alpha(s, t)$ be the probability that $s_\alpha$ is reached after t steps: $P_\alpha(s, t) = P(T_{\alpha,s} = t)$. Lemma 8 controls the COMB greedy error in $s_t$ in relation to $P_\alpha(s, t)$. Lemma 9 derives a state-independent upper bound on $P_\alpha(s, t)$.

Lemma 8. For any time t ∈ [T] and state s,
$$\varepsilon^C_{s,t} \le \sum_{t'=t}^{T} P_\alpha(s, t'-t)\,\frac{1}{6}\left(\frac{1}{2}\right)^{T-t'}.$$

Proof. If s = sα, this is a direct application of Lemma 7 as Pα(sα, t′) = 0 for t′ > 0.

When s ≠ sα, the following proof is by induction.

Initialization: Let t = T. At the last round only the last per-step regret matters (for all states s, $Q^{\pi_C}_T(s, D) = r_D(s)$). As $s \neq s_\alpha$, s is such that |X(s)| ≤ 2, and then $r_C(s) = \max_{A\in B} r_A(s)$ by Lemma 4 and Lemma 3. Therefore the statement holds.

Induction: Let t < T . We assume the statement is true at time t+ 1. We distinguish two cases.

For all gain distributions D ∈ B3,

$$\begin{aligned}
Q^{\pi_C}_t(s, D) &\stackrel{(a)}{=} [T_D[T_E V^{\pi_C}_{t+2}]](s) \stackrel{(b)}{=} [T_W[T_D V^{\pi_C}_{t+2}]](s) = [T_W Q^{\pi_C}_{t+1}(\cdot, D)](s)\\
&\stackrel{(c)}{\ge} \Big[T_W \max_{A\in B_3} Q^{\pi_C}_{t+1}(\cdot, A)\Big](s) - \sum_{t_1=t+1}^{T} \Big[P_W P_\alpha(\cdot, t_1-t-1)\,\frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t_1}\Big](s)\\
&\stackrel{(d)}{\ge} \max_{A\in B_3} [T_W Q^{\pi_C}_{t+1}(\cdot, A)](s) - \sum_{t_1=t+1}^{T} \frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t_1} [P_W P_\alpha(\cdot, t_1-t-1)](s)\\
&\stackrel{(b)}{=} \max_{A\in B_3} Q^{\pi_C}_t(s, A) - \sum_{t_1=t+1}^{T} \frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t_1} [P_W P_\alpha(\cdot, t_1-t-1)](s)\\
&\stackrel{(e)}{=} \max_{A\in B_3} Q^{\pi_C}_t(s, A) - \sum_{t_1=t}^{T} \frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t_1} P_\alpha(s, t_1-t),
\end{aligned}$$


where in (a) E is any distribution in Δ(C, W), and this step holds because of Lemma 5; (b) holds because of the exchangeability property of Lemma 6; (c) is true by induction and by the monotonicity of the Bellman operator; in (d) the max operators change from being specific to each next state s' at time t+1 to being a single max operator that has to choose one optimal gain distribution in state s at time t; (e) holds by definition, as for any $t_2$, $[P_W P_\alpha(\cdot, t_2)](s) = E_{s'\sim p(\cdot|s,W)}[P_\alpha(s', t_2)] = E_{s'\sim p(\cdot|s,W)}[P(T_{\alpha,s'} = t_2)] = P_\alpha(s, t_2+1)$, where the last equality holds because $s \neq s_\alpha$.

Lemma 9. For t > 0 and any s,
$$P_\alpha(s, t) \le \frac{2}{t}\sqrt{\frac{2}{\pi}}.$$

Proof. Using the connection between the TWIN-COMB strategy and a simple random walk in Proposition 1, a formula can be found for $P_\alpha(s, t)$ from the classical "Gambler's ruin" problem, where one wants to know the probability that the gambler reaches ruin (here state $s_\alpha$) at any time t given an initial capital in dollars (here $i_s$ as defined in Proposition 1). The gambler has an equal probability to win or lose one dollar at each round and has no upper bound on his capital during the game. Using [7] (Chapter XIV, Equation 4.14) or [18] we have
$$P_\alpha(s, t) = \frac{i_s}{t}\binom{t}{\frac{t+i_s}{2}} 2^{-t},$$
where the binomial coefficient is 0 if t and $i_s$ are not of the same parity. The technical Lemma 14 completes the proof.
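For illustration, the hitting-time formula used in this proof is straightforward to evaluate and to cross-check against a simulated ±1 random walk (a minimal sketch; the function names are ours):

```python
import math
import numpy as np

def hitting_prob(i_s, t):
    """P_alpha(s, t) for a state with index i_s = d12 + 2*d23: probability that a
    simple +/-1 random walk started at i_s first hits 0 after exactly t steps."""
    if i_s == 0:
        return 1.0 if t == 0 else 0.0
    if t < i_s or (t - i_s) % 2:          # needs t >= i_s and the same parity
        return 0.0
    return i_s / t * math.comb(t, (t + i_s) // 2) * 0.5 ** t

def mc_hitting_prob(i_s, t, n=200_000, seed=0):
    """Monte Carlo check of the same probability."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n):
        pos = i_s
        for step in range(1, t + 1):
            pos += 1 if rng.random() < 0.5 else -1
            if pos == 0:
                hits += step == t
                break
    return hits / n

print(hitting_prob(3, 7), mc_hitting_prob(3, 7))   # both close to 9/128 = 0.0703
```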

We now state our main result, connecting the value of the COMB adversary to the value of the game.

Theorem 1. Let K = 3. The regret of the COMB strategy against any learner p, $\min_p V^T_{p,\pi_C}$, satisfies
$$\min_p V^T_{p,\pi_C} \ge V_T - 12\log^2(T+1).$$

We also characterize the minimax regret of the game.

Theorem 2. Let K = 3. For even T, we have that
$$\left| V_T - \binom{T+2}{T/2+1}\frac{T/2+1}{3\cdot 2^{T}} \right| \le 12\log^2(T+1), \qquad \text{with} \qquad \binom{T+2}{T/2+1}\frac{T/2+1}{3\cdot 2^{T}} \sim \sqrt{\frac{8T}{9\pi}}.$$
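As a sanity check of the asymptotic equivalence stated in Theorem 2, the exact expression and its $\sqrt{8T/(9\pi)}$ equivalent can be compared numerically (a small sketch, names ours):

```python
import math

def game_value_formula(T):
    """Exact expression of Theorem 2 (even T) and its asymptotic equivalent."""
    assert T % 2 == 0
    exact = math.comb(T + 2, T // 2 + 1) * (T // 2 + 1) / (3 * 2 ** T)
    return exact, math.sqrt(8 * T / (9 * math.pi))

for T in (10, 100, 1000, 10000):
    exact, asym = game_value_formula(T)
    print(T, round(exact, 4), round(asym, 4), round(exact / asym, 4))
# The ratio exact/asymptotic decreases towards 1 as T grows, as claimed.
```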

In Figure 4 we introduce a COMB-based learner that is denoted by $p_C$. Here a state is represented by a vector of 3 integers. The three arms/experts are ordered as (1), (2), (3), breaking ties arbitrarily.

    p_{t,(1)}(s) = V^{\pi_C}_{t+1}(s + e_{(1)}) - V^{\pi_C}_t(s)
    p_{t,(2)}(s) = V^{\pi_C}_{t+1}(s + e_{(2)}) - V^{\pi_C}_t(s)
    p_{t,(3)}(s) = 1 - p_{t,(1)}(s) - p_{t,(2)}(s)

Figure 4: A COMB learner, $p_C$

We connect the value of the COMB-based learner to the value of the game.

Theorem 3. Let K = 3. The regret of the COMB-based learner against any adversary π, $\max_\pi V^T_{p_C,\pi}$, satisfies
$$\max_\pi V^T_{p_C,\pi} \le V_T + 36\log^2(T+1).$$

Similarly to [2] and [14], this strategy can be efficiently computed using rollouts/simulations from the COMB adversary in order to estimate the value $V^{\pi_C}_t(s)$ of $\pi_C$ in s at time t.
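The following sketch illustrates one way such a rollout-based learner could be implemented. It estimates the cumulative per-step regret of $\pi_C$ by Monte Carlo (an unbiased rollout estimate is $\|s_{T+1}\|_\infty - \|s_t\|_\infty - (T-t+1)/2$, since $c_C = 1/2$ at every step) and plugs the estimates into the formulas of Figure 4. This is only an illustrative reading of the construction, with our own function names; in particular, whether Figure 4 uses the cumulative per-step regret or the value-to-go is not something this sketch settles.

```python
import numpy as np

def comb_gains(cum_gains, rng):
    """COMB gain distribution: gain 1 to the odd sorted positions, swapped w.p. 1/2."""
    order = np.argsort(-np.asarray(cum_gains, dtype=float))
    g = np.zeros(len(order), dtype=int)
    g[order[::2]] = 1
    return 1 - g if rng.random() < 0.5 else g

def rollout_value(state, t, T, n_rollouts, rng):
    """Monte Carlo estimate of V^{pi_C}_t(s) (cumulative per-step regret): each
    rollout contributes ||s_{T+1}||_inf - ||s_t||_inf - (T - t + 1)/2."""
    total = 0.0
    for _ in range(n_rollouts):
        s = np.array(state, dtype=int)
        for _u in range(t, T + 1):
            s += comb_gains(s, rng)
        total += s.max() - np.max(state) - (T - t + 1) / 2
    return total / n_rollouts

def comb_learner(state, t, T, rng, n_rollouts=1000):
    """COMB-based learner of Figure 4 (sketch): probabilities for the leading and
    middle experts from estimated value differences; the rest goes to the lagging
    expert. In practice the noisy estimates would be clipped/renormalized."""
    state = np.asarray(state, dtype=int)
    order = np.argsort(-state)                      # experts sorted by cumulative gain
    v_now = rollout_value(state, t, T, n_rollouts, rng)
    p = np.zeros(3)
    for rank in (0, 1):                             # leading and middle experts
        e = np.zeros(3, dtype=int)
        e[order[rank]] = 1
        p[order[rank]] = rollout_value(state + e, t + 1, T, n_rollouts, rng) - v_now
    p[order[2]] = 1.0 - p.sum()                     # lagging expert
    return p
```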

6 Discussion and Future Work

The main objective is to generalize our new proof techniques to higher dimensions. In our case, the MDP formulation and all the results in Section 4 already hold for general K. Interestingly, Lemmas 3 and 4 show that the COMB distribution is the balanced distribution with the highest per-step regret in all the states s such that |X(s)| ≤ 2, for arbitrary K. Then, assuming an ideal exchangeability property that gives $\max_{A\in B} AC\cdots C = \max_{A\in B} CC\cdots CA$, a distribution would be greedy w.r.t. the COMB strategy at an early round of the game if it maximizes the per-step regret at the last round of the game. The COMB policy specifically tends to visit almost exclusively states with |X(s)| ≤ 2, states where COMB itself is the maximizer of the per-step regret (Lemma 3). This would give that COMB is greedy w.r.t. itself and therefore optimal. To obtain this result for larger K, we will need to extend the exchangeability property to higher K and therefore understand how the COMB and TWIN-COMB families extend to higher dimensions. One could also borrow ideas from the link with PDE approaches made in [6].


Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). We would like to thank Nate Eldredge for pointing us to the results in [18] and Wouter Koolen for pointing us at [19]!

References

[1] Jacob Abernethy and Manfred K. Warmuth. Repeated games against budgeted adversaries. In Advances in Neural Information Processing Systems (NIPS), pages 1–9, 2010.

[2] Jacob Abernethy, Manfred K. Warmuth, and Joel Yellin. Optimal strategies from random walks. In 21st Annual Conference on Learning Theory (COLT), pages 437–446, 2008.

[3] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.

[4] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[5] Thomas M. Cover. Behavior of sequential predictors of binary sequences. In 4th Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pages 263–272, 1965.

[6] Nadeja Drenska. A PDE approach to mixed strategies prediction with expert advice. http://www.gtcenter.org/Downloads/Conf/Drenska2708.pdf. (Extended abstract).

[7] William Feller. An Introduction to Probability Theory and its Applications, volume 2. John Wiley & Sons, 2008.

[8] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Towards optimal algorithms for prediction with expert advice. arXiv preprint arXiv:1603.04981, 2014.

[9] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Towards optimal algorithms for prediction with expert advice. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 528–547, 2016.

[10] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Tight lower bounds for multiplicative weights algorithmic families. In 44th International Colloquium on Automata, Languages, and Programming (ICALP), volume 80, pages 48:1–48:14, 2017.

[11] Charles Miller Grinstead and James Laurie Snell. Introduction to Probability. American Mathematical Society, 2012.

[12] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[13] Ronald A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.

[14] Haipeng Luo and Robert E. Schapire. Towards minimax online learning with unknown time horizon. In Proceedings of The 31st International Conference on Machine Learning (ICML), pages 226–234, 2014.

[15] Francesco Orabona and Dávid Pál. Optimal non-asymptotic lower bound on the minimax regret of learning with expert advice. arXiv preprint arXiv:1511.02176, 2015.

[16] Martin L. Puterman. Markov Decision Processes. Wiley, New York, 1994.

[17] Pantelimon Stanica. Good lower and upper bounds on binomial coefficients. Journal of Inequalities in Pure and Applied Mathematics, 2(3):30, 2001.

[18] Remco van der Hofstad and Michael Keane. An elementary proof of the hitting time theorem. The American Mathematical Monthly, 115(8):753–756, 2008.

[19] Vladimir Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences (JCSS), 56(2):153–173, 1998.


We use e1, . . . eK ∈ {0, 1}K to denote the K canonical basis unit vectors.

7 A Short Proof on the Regret of Balanced Adversaries

Lemma 10. A balanced adversary inflicts the same regret on any learner: if $\pi \in \mathcal{B}$, then $\exists V^\pi_T \in \mathbb{R} : \forall p,\ V^T_{p,\pi} = V^\pi_T$.

Proof. Indeed, the expected regret of a balanced adversary against any learner can be written as
$$\begin{aligned}
E_{p,\pi}[R_T] &= E_{g_t, I_t \sim p,\pi}\Big[\Big\|\sum_{t=1}^{T} g_t\Big\|_\infty - \sum_{t=1}^{T} e_{I_t}\cdot g_t\Big] &(1)\\
&= E_{s_{T+1}\sim\pi}\|s_{T+1}\|_\infty - \sum_{t=1}^{T} E_{s_t\sim\pi}\, E_{g_t,I_t\sim\pi,p|s_t}\,(g_t)_{I_t} &(2)\\
&\stackrel{(a)}{=} E_{s_{T+1}\sim\pi}\|s_{T+1}\|_\infty - \sum_{t=1}^{T} E_{s_t\sim\pi}\, c_{\pi(s_t,t)}, &(3)
\end{aligned}$$
where (a) is because π is balanced, which means that $E_{g_t|s_t}[(g_t)_k] = c_{\pi(s_t,t)}$ for each k.

8 Equivalence Between Maximizing the Value and Maximizing the Cumulative Per-Step Regret

Proof of Lemma 2. In the following, the expectations over states are all taken with respect to the strategy π.
$$\begin{aligned}
V^\pi_{t_0}(s) + \|s\|_\infty &= \sum_{t=t_0}^{T} E_{s_t|s,\pi}\, r_{\pi(\cdot,t)}(s_t) + \|s\|_\infty &(4)\\
&= \sum_{t=t_0}^{T} E_{s_t|s}\Big[E_{s_{t+1}|s_t}\|s_{t+1}\|_\infty - \|s_t\|_\infty - c_{\pi(s_t,t)}\Big] + \|s\|_\infty &(5)\\
&= \sum_{t=t_0}^{T}\Big(E_{s_{t+1}|s}\|s_{t+1}\|_\infty - E_{s_t|s}\|s_t\|_\infty\Big) - \sum_{t=t_0}^{T} E_{s_t|s}\, c_{\pi(s_t,t)} + \|s\|_\infty &(6)\\
&= E_{s_{T+1}|s}\|s_{T+1}\|_\infty - \|s\|_\infty - \sum_{t=t_0}^{T} E_{s_t|s}\, c_{\pi(s_t,t)} + \|s\|_\infty &(7)\\
&= E_{s_{T+1}|s}\|s_{T+1}\|_\infty - \sum_{t=t_0}^{T} E_{s_t|s}\, c_{\pi(s_t,t)} &(8)\\
&= \tilde{V}^\pi_{t_0}(s). &(9)
\end{aligned}$$

9 Proofs of Basic Properties of the Per-Step Regret

First we notice that the per-step regret can be written as
$$r_A(s) = P_A\big(\exists k : k \in X(s) \text{ and } g_k = 1\big) - c_A, \qquad (10)$$
where the notation $P_A$ denotes the fact that the gain vector g is sampled from A.

Proof of Lemma 3. Step 1) $r_A(s) \le 1$: We have
$$E_{s'|s,A}\|s'\|_\infty - \|s\|_\infty \le 1$$
by definition, as the maximum cumulative gain cannot increase by more than 1 in one round. Moreover, $c_A \ge 0$ as the adversary only deals non-negative gains. Therefore $r_A(s) = E_{s'|s,A}\|s'\|_\infty - \|s\|_\infty - c_A \le 1$.

    s     r_1(s)   Distribution of the next state s' ~ p(·|s, 1), with s = (x, y)
    C1    0        P(s' = (x+1, y)) = P(s' = (x, y-1)) = P(s' = (x-1, y+1)) = 1/3
    C2    1/3      P(s' = (x+1, y)) = 2/3,  P(s' = (x, y-1)) = 1/3
    C3    0        P(s' = (x+1, y)) = 1/3,  P(s' = (x-1, y+1)) = 2/3
    C4    2/3      P(s' = (x+1, y)) = 1

Figure 5: The per-step regret and transition probabilities of the gain distribution 1

Step 2) $0 \le r_A(s)$: We follow the argument in [9]. We write
$$P_A\big(\exists k : k \in X(s) \text{ and } g_k = 1\big) \stackrel{(a)}{\ge} P_A\big(g_{k'_s} = 1\big) \stackrel{(b)}{=} E_{g|A}\big[g_{k'_s}\big] \stackrel{(c)}{=} c_A,$$
where in (a) $k'_s$ is any arm in X(s), (b) holds because, as $g \in \{0,1\}^K$, for all k ∈ [K], $E_{g|s,A}[g_k] = P_A(g_k = 1)$, and (c) holds because, as A is balanced, for all k, $c_A = E_{g|s}[g_k]$. Therefore
$$r_A(s) = P_A\big(\exists k : k \in X(s) \text{ and } g_k = 1\big) - c_A \ge 0.$$

Step 3) $r_A(s) = 0$ if |X(s)| = 1: This result can be proven by following the same steps as in the previous case, but now the inequality (a) turns into an equality as |X(s)| = 1.

Proof of Lemma 4. We have exactly two (leading) arms that have equal maximal cumulative gain at time t. The adversary is designing a balanced gain distribution on the K arms. Let $p_{11}$ denote the probability that both leading arms are allocated a gain of 1, and let $p_{01}, p_{10}, p_{00}$ be similarly defined. Therefore $p_{00} + p_{01} + p_{10} + p_{11} = 1$. Also, the balanced property forces the expected gains of both leading experts to be equal: $c_A = p_{10} + p_{11} = p_{01} + p_{11}$, which gives $p_{10} = p_{01}$. A balanced gain distribution A therefore has the following per-step regret at state s (see Equation 10):
$$r_A(s) = P_A\big(\exists k : k \in X(s) \text{ and } g_k = 1\big) - c_A = p_{01} + p_{10} + p_{11} - (p_{10} + p_{11}) = p_{01} = p_{10}.$$
Finding the balanced gain distribution with maximal per-step regret, $\arg\max_{A\in B} r_A(s)$, means maximizing $p_{10} = p_{01}$ subject to the constraint $p_{00} + p_{01} + p_{10} + p_{11} = 1$, which is also $p_{00} + 2p_{10} + p_{11} = 1$. This is solved by having $p_{10} = \frac{1}{2}$ and $p_{00} = p_{11} = 0$, which is a valid balanced distribution that satisfies $P_D(|\{k : k \in X(s) \text{ and } g_k = 1\}| = 1) = 1$.

10 Gain Distributions Illustrations

In this section we report the dynamics of the MDP for all the balanced gain distributions in B3 and all the states. We also illustrate them on the 2d-grid of states as introduced in Figure 1. We first report the dynamics of the MDP grouped by gain distribution (Section 10.1) and then grouped by state class (Section 10.2).

10.1 Grouped by Gain Distributions

We detail in tables, and illustrate on the 2d state grid, the properties of the atomic gain distributions V, 1, 2, W in Figures 5 to 8. Note that the table and the illustration for C are in Figure 2 of the main paper.

10.2 Grouped by State Class

In Figure 9, we illustrate for each state class the effect of each action in terms of the next state reached in the 2d-grid. For instance, if we look at the state in class C4, we can see that the actions 1, 2 and 3 lead us to the state (1, 0).


    s     r_2(s)   Distribution of the next state s' ~ p(·|s, 2), with s = (x, y)
    C1    0        P(s' = (x+1, y-1)) = P(s' = (x, y+1)) = P(s' = (x-1, y)) = 1/3
    C2    1/3      P(s' = (x+1, y-1)) = 2/3,  P(s' = (x, y+1)) = 1/3
    C3    0        P(s' = (x-1, y)) = 1/3,  P(s' = (x, y+1)) = 2/3
    C4    1/3      P(s' = (x, y+1)) = 1

Figure 6: The per-step regret and transition probabilities of the gain distribution 2

    s     r_W(s)   Distribution of the next state s' ~ p(·|s, W), with s = (x, y)
    C1    0        P(s' = (x+1, y)) = P(s' = (x-1, y)) = 1/2
    C2    1/2      P(s' = (x+1, y)) = P(s' = (x+1, y-1)) = 1/2
    C3    0        P(s' = (x+1, y)) = P(s' = (x-1, y)) = 1/2
    C4    1/2      P(s' = (x, y+1)) = P(s' = (x+1, y)) = 1/2

Figure 7: The per-step regret and transition probabilities of the gain distribution W

    s     r_V(s)   Distribution of the next state s' ~ p(·|s, V), with s = (x, y)
    C1    0        P(s' = (x, y+1)) = 1/2,  P(s' = (x, y-1)) = 1/2
    C2    0        P(s' = (x, y+1)) = 1/2,  P(s' = (x, y-1)) = 1/2
    C3    0        P(s' = (x, y+1)) = 1/2,  P(s' = (x-1, y+1)) = 1/2
    C4    1/2      P(s' = (x, y+1)) = 1/2,  P(s' = (x+1, y)) = 1/2

Figure 8: The per-step regret and transition probabilities of the gain distribution V

Figure 9: The actions illustrated for each state class


11 Proofs of Section 5.1

Given two adversary gain distributions $A_1, A_2$, played respectively at times t and t+1, the two-step regret in state s is $r_{A_1 A_2}(s) = E_{s'|s,A_1}[r_{A_2}(s')] + r_{A_1}(s)$.

Proof of Lemma 5. The proof is by induction. If t = T, $V^{\pi_C}_t(s) = r_C(s) = r_W(s) = V^{\pi_W}_t(s)$ for any state s. Now, given a time t, we assume that for all times t'' > t and all states s at time t'', we have $V^{\pi_C}_{t''}(s) = V^{\pi_W}_{t''}(s)$.

Case 1. At time t, if s ∈ C2 or s ∈ C4 then we have p(·|s, C) = p(·|s, W) by Lemma 11 and, looking at the gain distribution tables, $r_C(s) = r_W(s)$. Therefore we have, using the induction property,
$$V^{\pi_C}_t(s) = r_C(s) + [P_C V^{\pi_C}_{t+1}](s) = r_W(s) + [P_W V^{\pi_W}_{t+1}](s) = V^{\pi_W}_t(s).$$

Case 2. At time t, if s ∈ C1 or s ∈ C3 then we have
$$V^{\pi_C}_t(s) = r_C(s) + [P_C V^{\pi_C}_{t+1}](s) \stackrel{(a)}{=} r_{CW}(s) + [P_C[P_W V^{\pi_C}_{t+2}]](s) \stackrel{(b)}{=} r_{WC}(s) + [P_W[P_C V^{\pi_C}_{t+2}]](s) \stackrel{(a)}{=} r_W(s) + [P_W V^{\pi_W}_{t+1}](s) = V^{\pi_W}_t(s),$$
where (a) holds by induction, and (b) follows from the exchangeability between $P_W$ and $P_C$ (proved in Lemma 11) and also from the fact that $r_{CW}(s) = r_{WC}(s)$ (proved in Lemma 12). The exact statements and proofs of these two lemmas are reported in the next subsection.

11.1 Lemma 11 and Lemma 12

We prove a first exchangeability result between the COMB gain distribution C and the TWIN-COMB gain distribution W.

Lemma 11. If s ∈ C2 or s ∈ C4,
$$p(\cdot|s, W) = p(\cdot|s, C).$$
If s ∈ C1 or s ∈ C3,
$$E_{s'\sim s,C}\, p(\cdot|s', W) = E_{s'\sim s,W}\, p(\cdot|s', C).$$

Proof. If s ∈ C2 or s ∈ C4 then we have p(·|s, C) = p(·|s, W). This can be seen by directly reading the tables in Figures 2 and 5 to 8.

Let us now focus on the cases s ∈ C1 or s ∈ C3. To prove that the gain distributions C and W can be exchanged, we show that the order does not matter for any pair of possible 'outcomes' of the gains of each distribution. Recall that the outcomes are 1 or 23 for W, and 2 or 13 for C. To follow the upcoming reasoning one can use the illustrations of the effect of the actions in Figure 9 and the 2d-grid in Figure 1. In the following we will refer to the action 1 as the outcome of the gain distribution W = {1}{23} that happens by definition half of the time and that means that expert number 1, i.e. the leading expert, receives a gain of one while the middle and lagging experts receive zero. Similarly, for instance, action 23 refers to a gain of one dealt to both the lagging and middle experts while the leading expert receives zero.

Case of the exchangeability of 1 with 13: The action 1 applied to a state s ∈ C1 leads to a state s' ∈ C1, and similarly a state s ∈ C3 leads to a state s' ∈ C3. Therefore the effect of 13 is the same whether it is applied before or after the action 1. The action 13 applied to a state s ∈ C1 ∪ C3 leads to a state s' ∈ C1 ∪ C3. Moreover the action 1 has the same effect whether it is in C1 or C3 (incrementing d12 by one and leaving d23 the same). Therefore the effect of 1 is the same whether it is applied before or after the action 13.

Case of the exchangeability of 2 with 23:

Case 2.1 d12 > 1: The action 2 has the same effect whether it is applied in C1 or C3. The same goes for action 23. Moreover, d12 > 1 ensures that after applying either 2 or 23 we stay in C1 ∪ C3. The exchangeability holds.


Case 2.2 d12 = 1: One can check from the tables that the effects of playing 2 then 23, or 23 then 2, from a state in C1 or C3 cancel out, and the final state is the original state in all cases.

Case of the exchangeability of 2 with 1: The action 1 applied to a state s ∈ C1 leads to a state s' ∈ C1, and similarly a state s ∈ C3 leads to a state s' ∈ C3. Therefore the effect of 2 is the same whether it is applied before or after the action 1. Furthermore, 1 is an action that has the same effect in any state. Therefore the effect of 1 is the same whether it is applied before or after the action 2.

Case of the exchangeability of 23 with 13: The action 23 applied to a state s ∈ C1 leads to a state s' ∈ C1 or C2. The action 23 applied to a state s ∈ C3 leads to a state s' ∈ C3 or C4. Moreover the action 13 has the same effect whether it is in C1 or C2, and also has the same effect whether it is in C3 or C4. Therefore the effect of 13 is the same whether it is applied before or after the action 23. The action 13 applied to a state s ∈ C1 ∪ C3 leads to a state s' ∈ C1 ∪ C3. Moreover the action 23 has the same effect whether it is in C1 or C3. Therefore the effect of 23 is the same whether it is applied before or after the action 13.

Lemma 12. For all states s we have $r_{CW}(s) = r_{WC}(s)$.

Proof. First, we have $r_C(s) = r_W(s)$ for all states s. Then we have
$$\begin{aligned}
E_{s'|s,C}[r_W(s')] &= E_{s'|s,C}\Big[\tfrac{1}{2}\mathbb{1}\{|X(s')|>1\}\Big] = \tfrac{1}{2}P_{s'|s,C}\big(|X(s')| > 1\big) &(11)\\
&\stackrel{(a)}{=} \tfrac{1}{2}P_{s'|s,W}\big(|X(s')| > 1\big) = E_{s'|s,W}\Big[\tfrac{1}{2}\mathbb{1}\{|X(s')|>1\}\Big] = E_{s'|s,W}[r_C(s')], &(12)
\end{aligned}$$
where (a) holds as follows. In the case s ∈ C4 ∪ C2 it is obvious since, as stated in Lemma 11, p(·|s, C) = p(·|s, W). If s = (x, y) ∈ C3 ∪ C1, let the state s' = (x', y') be generated with the COMB gain distribution C and the state s'' = (x'', y'') be generated with the TWIN-COMB gain distribution W; we have that P(x' = x − 1) = P(x'' = x − 1) = 1/2, so P(|X(s')| > 1) = P(|X(s'')| > 1), and these probabilities are either equal to 0 (if d12 > 1) or to 1/2 (if d12 = 1).

Therefore $r_{CW}(s) = E_{s'|s,C}[r_W(s')] + r_C(s) = E_{s'|s,W}[r_C(s')] + r_W(s) = r_{WC}(s)$.

12 Proof of the Exchangeability Property

End of the Proof of Lemma 6. Case 2. A = 1:

Case 2.1 s ∈ C2: We can choose D = {1}{2}{23}{13}, the distribution that mixes C and W with equal probabilities. One can check from the tables that the exchangeability holds.

Case 2.1.1 s ≠ (0, 1): One can check from the tables in Figures 7 and 8 that the exchangeability holds. We provide a visual illustration of the exchangeability equality below.

15

Page 16: Near Minimax Optimal Players for the Finite-Time 3-Expert …victorgabillon.nfshost.com/publications/minimaxFull/nips... · 2018-01-31 · Near Minimax Optimal Players for the Finite-Time

Case 2.1.2 s = (0, 1): One can check from the table that the exchangeability holds. We also provide a visual illustration of the exchangeability equality below.

Case 2.2 s ∈ C1 ∪ C3: We can choose D = {1}{2}{23}{13}, the distribution that mixes C and W with equal probabilities.

Case 2.2.1 d12 > 1: Following a reasoning very similar to that of Case 1.1, the result holds.

Case 2.2.2 d12 = 1:

Case 2.2.2.1 d23 > 0: One can check from the table that the exchangeability holds. We also provide a visual illustration of the exchangeability equality below.

Case 2.2.2.2 d23 = 0: One can check from the table that the exchangeability holds. We also provide a visual illustration of the exchangeability equality below.

Case 3. A = 2: This case uses a structure of arguments very similar to that of Case 2.


13 Proofs for Controlling the Accumulation of the Greedy Errors

13.1 The COMB Greedy Error in sα, Proof of Lemma 7

Let the value of playing the COMB strategy from the states $s_A = (1, 0)$, $s_B = (0, 1)$, $s_C = (1, 1)$ at time t be $V^A_t = V^{\pi_C}_t(s_A)$, $V^B_t = V^{\pi_C}_t(s_B)$ and $V^C_t = V^{\pi_C}_t(s_C)$. Lemma 13 relates $V^A_t$ to $V^B_t$.

Lemma 13. For all t ∈ [T] we have $V^B_t = V^A_t + \frac{1}{3} + \frac{1}{6}\left(-\frac{1}{2}\right)^{T-t}$.

Proof. From Figure 2, for all t ∈ [T−1] we have $V^A_t = \frac{V^B_{t+1}+V^C_{t+1}}{2}$ and $V^B_t = \frac{1}{2} + \frac{V^A_{t+1}+V^C_{t+1}}{2}$. Therefore, for all t ∈ [T−1], $V^B_t - V^A_t = \frac{V^A_{t+1}-V^B_{t+1}}{2} + \frac{1}{2}$, and $V^B_T - V^A_T = \frac{1}{2} - 0$. Solving this recurrence for the value of $V^B_t - V^A_t$, we have for all t ∈ [T], $V^B_t - V^A_t = \frac{1}{3} + \frac{1}{6}\left(-\frac{1}{2}\right)^{T-t}$.

The greedy action in $s_\alpha$ w.r.t. playing COMB in later rounds is $\arg\max_{A\in B_3} Q^{\pi_C}_t(s_\alpha, A)$, which is equal to $\arg\max_{A\in\{C,1,2\}} Q^{\pi_C}_t(s_\alpha, A)$, as W, C, V are equivalent in $s_\alpha$ and $Q^{\pi_C}_t(s_\alpha, \{\}) = Q^{\pi_C}_t(s_\alpha, \{123\}) = V^{\pi_C}_{t+1}(s_\alpha) \le V^{\pi_C}_t(s_\alpha) = Q^{\pi_C}_t(s_\alpha, C)$. Moreover, from Figures 2, 5 and 6 we have
$$Q^{\pi_C}_t(s_\alpha, C) = \frac{1}{2} + \frac{V^A_{t+1} + V^B_{t+1}}{2}, \qquad Q^{\pi_C}_t(s_\alpha, 1) = \frac{2}{3} + V^A_{t+1}, \qquad Q^{\pi_C}_t(s_\alpha, 2) = \frac{1}{3} + V^B_{t+1}.$$
Combining these equalities with Lemma 13 leads to Lemma 7.
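For illustration, the recursion used in the proof of Lemma 13 is easy to iterate numerically and compare against the closed form (a minimal sketch, names ours):

```python
def check_lemma_13(T):
    """Iterate V^B_t - V^A_t = 1/2 - (V^B_{t+1} - V^A_{t+1})/2, obtained from the
    transitions of Figure 2, and compare with the closed form of Lemma 13."""
    diff = 0.5                                   # V^B_T - V^A_T = 1/2 - 0
    for t in range(T - 1, 0, -1):
        diff = 0.5 - diff / 2
        closed_form = 1 / 3 + (1 / 6) * (-0.5) ** (T - t)
        assert abs(diff - closed_form) < 1e-12, (t, diff, closed_form)
    return True

print(check_lemma_13(50))   # True: the recursion matches the closed form
```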

13.2 Bounding All the Errors

Lemma 14. For any two integers i ≤ n of the same parity, $i\binom{n}{\frac{n+i}{2}}2^{-n} \le 2\sqrt{\frac{2}{\pi}}$.

Proof of Lemma 14. Case 1. $i \le \sqrt{n}$: we have
$$i\binom{n}{\frac{n+i}{2}}2^{-n} \stackrel{(a)}{\le} \sqrt{n}\binom{n}{\lceil\frac{n}{2}\rceil}2^{-n} \stackrel{(b)}{\le} \sqrt{n}\,\sqrt{\frac{2}{\pi}}\,\frac{2^{n+1}}{\sqrt{n+1}}\,2^{-n} \le 2\sqrt{\frac{2}{\pi}}.$$
Here (a) uses the fact that the central binomial coefficient $\binom{n}{\lceil n/2\rceil}$ is the largest of the binomial coefficients of the shape $\binom{n}{m}$ with m an integer, and (b) uses the upper bound in Theorem 2.6 in [17], which gives $\binom{n}{\lceil n/2\rceil} \le \sqrt{\frac{2}{\pi}}\,\frac{2^{n+1}}{\sqrt{n+1}}$.

Case 2. $i \ge \sqrt{n}$: We now want to show that the mapping Φ from i to $i\binom{n}{\frac{n+i}{2}}$ is non-increasing for $i \ge \sqrt{n}$. Indeed, this would prove that for all $i \ge \sqrt{n}$, $\Phi(i) \le \Phi(\sqrt{n})$. As we have already proven in Case 1 above that $\Phi(\sqrt{n})2^{-n} \le 2\sqrt{\frac{2}{\pi}}$, this would prove the desired inequality for Case 2 also.

To prove the non-increase, we study the ratio of the values at i and i+2 (the expression is zero at i+1):
$$\frac{(i+2)\binom{n}{\frac{n+i}{2}+1}}{i\binom{n}{\frac{n+i}{2}}} = \Big(1+\frac{2}{i}\Big)\frac{(\frac{n+i}{2})!\,(\frac{n-i}{2})!}{(\frac{n+i}{2}+1)!\,(\frac{n-i}{2}-1)!} = \Big(1+\frac{2}{i}\Big)\frac{\frac{n-i}{2}}{\frac{n+i}{2}+1} \le \Big(1+\frac{2}{\sqrt{n}}\Big)\frac{\frac{n-\sqrt{n}}{2}}{\frac{n+\sqrt{n}}{2}+1} = \Big(1+\frac{2}{\sqrt{n}}\Big)\frac{1-\frac{1}{\sqrt{n}}}{1+\frac{1}{\sqrt{n}}+\frac{2}{n}} \le \frac{1+\frac{1}{\sqrt{n}}}{1+\frac{1}{\sqrt{n}}} = 1.$$

Lemma 15. For t ∈ [T] and any s,
$$\sum_{t'=t}^{T} P_\alpha(s, t'-t)\,\frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t'} \le 2\sqrt{\frac{2}{\pi}}\,\frac{2\log(T-t+1)+4}{T-t+1}.$$

Proof of Lemma 15. Case 1. s = sα: We have Pα(s, t′ − t) = 1 for t′ = t and Pα(s, t′ − t) = 0for t′ > t. Therefore the bound holds in this case.


Case 2. $s \neq s_\alpha$: We have $P_\alpha(s, 0) = 0$, which gives $\sum_{t'=t}^{T} P_\alpha(s, t'-t)\frac16\left(\frac12\right)^{T-t'} = \sum_{t'=t+1}^{T} P_\alpha(s, t'-t)\frac16\left(\frac12\right)^{T-t'}$. Let $T_1 = T - \lceil \log(T-t+1)\rceil$, so that we have
$$\frac{1}{2^{T-T_1}} \le \frac{1}{T-t+1}. \qquad (13)$$

We have that
$$\sum_{t'=t+1}^{T_1+1} \frac{1}{t'-t}\left(\frac12\right)^{T-t'} \le \sum_{t'=t+1}^{T_1+1}\left(\frac12\right)^{T-t'} \le \frac{1}{2^{T-T_1-2}} \overset{(a)}{\le} \frac{4}{T-t+1}\,,$$
where (a) uses Equation 13.

Moreover, we have
$$\sum_{t'=T_1+2}^{T} \frac{1}{t'-t}\left(\frac12\right)^{T-t'} \le \sum_{t'=T_1+2}^{T} \frac{1}{t'-t} \le (T-T_1-1)\frac{1}{T_1+2-t} \overset{(a)}{\le} \log(T-t+1)\frac{1}{T_1+2-t} \overset{(b)}{\le} \frac{2\log(T-t+1)}{T-t+1}\,,$$
where (a) uses the definition of $T_1$ above, and (b) is because $T_1+1-T \ge -\log(T-t+1) \ge -\frac{T-t+1}{2}$ and therefore $T_1+2-t = T_1+1-T+T+1-t \ge \frac{T-t+1}{2}$.

Therefore we have
\begin{align*}
\sum_{t'=t+1}^{T} P_\alpha(s, t'-t)\left(\frac12\right)^{T-t'} &\overset{(a)}{\le} \sum_{t'=t+1}^{T} \frac{2}{t'-t}\sqrt{\frac{2}{\pi}}\left(\frac12\right)^{T-t'} \\
&= 2\sqrt{\frac{2}{\pi}}\left(\sum_{t'=t+1}^{T_1+1}\frac{1}{t'-t}\left(\frac12\right)^{T-t'} + \sum_{t'=T_1+2}^{T}\frac{1}{t'-t}\left(\frac12\right)^{T-t'}\right) \\
&\le 2\sqrt{\frac{2}{\pi}}\,\big(2\log(T-t+1)+4\big)\,\frac{1}{T-t+1}\,,
\end{align*}
where (a) holds by Lemma 9.
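The combination of the two partial-sum bounds above is easy to check numerically. The following sketch (an illustration of ours, not part of the proof) verifies that $\sum_{t'=t+1}^{T}\frac{1}{t'-t}\left(\frac12\right)^{T-t'} \le \frac{2\log(T-t+1)+4}{T-t+1}$ over a range of horizons, i.e., the inequality above before multiplying by $2\sqrt{2/\pi}$ and before applying Lemma 9.

```python
# Numerical check (illustration only) of the partial-sum bound used in Lemma 15:
# sum_{t'=t+1}^{T} (1/(t'-t)) * (1/2)^{T-t'}  <=  (2*log(T-t+1) + 4) / (T-t+1).
from math import log

for T in range(2, 200):
    for t in range(1, T):
        lhs = sum((0.5 ** (T - tp)) / (tp - t) for tp in range(t + 1, T + 1))
        rhs = (2 * log(T - t + 1) + 4) / (T - t + 1)
        assert lhs <= rhs, (T, t, lhs, rhs)

print("partial-sum bound of Lemma 15 verified for all 1 <= t < T < 200")
```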

The optimal Bellman operator is denoted by $T^*$ and is defined as $[T^* f](s) = \max_{D\in B}\, r_D(s) + [P_D f](s)$. If we define the optimal cumulative per-step regret as $V^*_t(s) = \max_{\pi\in B} V^\pi_t(s)$, we have that $[T^* V^*_{t+1}](s) = V^*_t(s)$.
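For concreteness, here is a minimal generic sketch of such an optimal Bellman backup. The state space, action set, reward table and transition kernel below are hypothetical placeholders (not the paper's game); the sketch only illustrates the operator $[T^* f](s) = \max_{D\in B}\, r_D(s) + [P_D f](s)$ used in Lemma 16.

```python
# Generic sketch of the optimal Bellman backup [T* f](s) = max_D { r_D(s) + [P_D f](s) }.
# All concrete objects below are hypothetical placeholders, not the paper's game.
from typing import Callable, Dict, Hashable, List

State = Hashable
Action = Hashable

def bellman_optimal_backup(
    f: Dict[State, float],                              # the function to back up, e.g. V*_{t+1}
    actions: List[Action],                              # the action set B
    r: Callable[[State, Action], float],                # r_D(s)
    P: Callable[[State, Action], Dict[State, float]],   # P_D(. | s)
) -> Dict[State, float]:
    """Return s -> max_D { r(s, D) + sum_{s'} P(s'|s, D) f(s') }."""
    return {
        s: max(
            r(s, D) + sum(p * f.get(s2, 0.0) for s2, p in P(s, D).items())
            for D in actions
        )
        for s in f
    }

# Toy usage on a two-state chain (purely illustrative).
f_next = {"a": 0.0, "b": 1.0}
r = lambda s, D: 0.1 if D == "move" else 0.0
P = lambda s, D: {"b": 1.0} if D == "move" else {s: 1.0}
print(bellman_optimal_backup(f_next, ["stay", "move"], r, P))   # {'a': 1.1, 'b': 1.1}
```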

Lemma 16. For any state $s$ and any time $t$, we have $V^*_t(s) \le V^{\pi_C}_t(s) + \sum_{t'=t}^{T}\varepsilon^C_{t'}$.

Proof of Lemma 16. The proof is by backward induction on time t.

Initialization $t = T$: We use Lemma 8 and Lemma 15.

Induction: We write
\begin{align*}
V^*_t(s) &= [T^* V^*_{t+1}](s) \\
&= \max_{D\in B}\, r_D(s) + [P_D V^*_{t+1}](s) \\
&\overset{(a)}{\le} \max_{D\in B}\, r_D(s) + [P_D V^{\pi_C}_{t+1}](s) + \sum_{t'=t+1}^{T}\varepsilon^C_{t'} \\
&= \max_{D\in B}\, Q^{\pi_C}_t(s, D) + \sum_{t'=t+1}^{T}\varepsilon^C_{t'} \\
&\overset{(b)}{=} Q^{\pi_C}_t(s, C) + \sum_{t'=t+1}^{T}\varepsilon^C_{t'} + \varepsilon^C_{s,t} \\
&\overset{(b)}{\le} Q^{\pi_C}_t(s, C) + \sum_{t'=t+1}^{T}\varepsilon^C_{t'} + \varepsilon^C_{t} \\
&= V^{\pi_C}_t(s) + \sum_{t'=t}^{T}\varepsilon^C_{t'}\,,
\end{align*}
where (a) holds by induction and (b) holds by definition of the errors $\varepsilon^C_{s,t}$ and $\varepsilon^C_t$.

Proof of Theorem 3. We have
\begin{align*}
\min_p V^T_{p,\pi_C} &\overset{(a)}{=} V^T_{\pi_C} \overset{(b)}{=} V^{\pi_C}_1(s_\alpha) \overset{(c)}{\ge} V^*_1(s_\alpha) - \sum_{t=1}^{T}\varepsilon^C_t \overset{(d)}{\ge} V^*_1(s_\alpha) - \sum_{t=1}^{T} 2\sqrt{\frac{2}{\pi}}\,\big(2\log(T-t+1)+4\big)\,\frac{1}{T-t+1} \\
&\ge V^*_1(s_\alpha) - 2\sqrt{\frac{2}{\pi}}\,\big(2\log(T+1)+4\big)\sum_{t=1}^{T}\frac{1}{T-t+1} \ge V^*_1(s_\alpha) - 12\log^2(T+1)\,,
\end{align*}
where we use in (a) that $\pi_C$ is balanced, in (b) Lemma 2, in (c) Lemma 16, and in (d) Lemmas 8 and 15.
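The last step, bounding the accumulated greedy errors by $12\log^2(T+1)$, can be checked numerically. The sketch below (ours) assumes that $\log$ denotes the natural logarithm and tests horizons $2 \le T \le 2000$:

```python
# Sketch: compare the accumulated greedy-error bound with 12*log(T+1)^2.
# Assumptions (ours, not the paper's): natural logarithm and T >= 2.
from math import log, sqrt, pi

c = 2 * sqrt(2 / pi)
for T in range(2, 2001):
    acc = sum(c * (2 * log(T - t + 1) + 4) / (T - t + 1) for t in range(1, T + 1))
    assert acc <= 12 * log(T + 1) ** 2, (T, acc)

print("accumulated greedy errors <= 12*log(T+1)^2 for all 2 <= T <= 2000")
```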

14 Proofs for the Minimax Regret of the Game

Proof of Theorem 2. A new adversary, GREEDY-COMB, is denoted $\pi_G$. We prove that the value of COMB and the value of GREEDY-COMB do not differ by more than an additive numerical constant. As we have proved that COMB is almost minimax optimal, GREEDY-COMB is then also almost optimal. However, computing the value of this new strategy is easier, using classical results on the number of passages through the origin of a random walk. The GREEDY-COMB adversary takes the same actions as TWIN-COMB in all states except in state $s_\alpha$, where, at all times $t$, the gain distribution played by GREEDY-COMB is 1. Lemma 17 below proves that the value of GREEDY-COMB differs from the value of TWIN-COMB by no more than a small constant: $V^T_{\pi_G} \ge V^T_{\pi_C} - 1/3$. This is shown by a backward induction over time, accumulating the errors that appear in state $s_\alpha$ (see Lemma 13), and noting that these errors decrease exponentially with $T-t$.

Starting from $s_\alpha$, the random walk followed by GREEDY-COMB is a simple random walk on the line $d_{23} = 0$, as illustrated in Figure 3 (bottom). Therefore the value of this adversary is $V^T_{\pi_G} = \frac23 H_G$, where $H_G$ is the expected number of times the random walk of GREEDY-COMB hits the reward wall. Indeed, each time the wall is hit, GREEDY-COMB earns $2/3$, as its gain distribution on the wall is 1. We notice that computing $H_G$ is equivalent to computing the expected number of equalizations (passages through 0) of a random walk that starts at 0 and increments its value by $+1$ with probability one half and decrements it by $-1$ with probability one half. Therefore, following the classical result in [11, Example 12.3 in Section 12], we have $H_G = \frac12\binom{T+2}{T/2+1}\left(\frac T2+1\right)2^{-T} - 1 \sim \sqrt{\frac{2T}{\pi}}$. Finally we have $V_T - 12\log^2(T+1) \overset{(a)}{\le} V^T_{\pi_C} \le V^T_{\pi_G} + 1/3$, where (a) holds by Theorem 3. Moreover, we have $V_T = \max_{\pi\in B} V^T_\pi \ge V^T_{\pi_G}$.
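The closed form for $H_G$ can be cross-checked against a direct computation of the expected number of equalizations, $\sum_{k\le T}\mathbb P(S_k = 0)$ for a simple $\pm 1$ random walk (a sketch of ours for even $T$; the identification of $H_G$ with the expected number of equalizations is taken from the argument above):

```python
# Sketch: check H_G = (1/2) * C(T+2, T/2+1) * (T/2+1) * 2^{-T} - 1 against the expected
# number of equalizations sum_{k <= T} P(S_k = 0) of a simple +/-1 random walk,
# and against the asymptotic equivalent sqrt(2T/pi). Even horizons T only.
from math import comb, sqrt, pi

def h_closed(T):
    return 0.5 * comb(T + 2, T // 2 + 1) * (T // 2 + 1) * 2.0 ** (-T) - 1

for T in range(2, 201, 2):
    direct = sum(comb(k, k // 2) * 2.0 ** (-k) for k in range(2, T + 1, 2))
    assert abs(h_closed(T) - direct) < 1e-9, (T, h_closed(T), direct)

T = 200
print(f"H_G(T={T}) = {h_closed(T):.3f}   vs   sqrt(2T/pi) = {sqrt(2 * T / pi):.3f}")
```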

Lemma 17. For all $T > 0$, $V^T_{\pi_G} \ge V^T_{\pi_C} - \frac13$.


Proof of Lemma 17. We will actually prove that for all $s$ and all $t \le T$, $V^{\pi_G}_t(s) \ge V^{\pi_C}_t(s) - \sum_{t'=t}^{T}\frac16\left(\frac12\right)^{T-t'}$, which implies the claim of the lemma by looking at the special case $t = 1$, $s = s_\alpha$ and using Lemma 2. The proof is by backward induction on time $t$.

Initialization $t = T$: The two policies $\pi_G$ and $\pi_C$ differ only in state $s_\alpha$, where we use Lemma 7.

Induction: We write
$$V^{\pi_G}_t(s) = Q^{\pi_G}_t(s, \pi_G(s)) \ge Q^{\pi_C}_t(s, \pi_G(s)) - \sum_{t'=t+1}^{T}\frac16\left(\frac12\right)^{T-t'}, \qquad (14)$$
where the last inequality is by induction. We distinguish two cases.

Case 1. $s \neq s_\alpha$: $Q^{\pi_C}_t(s, \pi_G(s)) = Q^{\pi_C}_t(s, \pi_C(s)) = V^{\pi_C}_t(s)$ and Equation 14 gives the induction step.

Case 2. $s = s_\alpha$: Using Lemma 7 we have $Q^{\pi_C}_t(s_\alpha, \pi_G(s_\alpha)) + \frac16\left(\frac12\right)^{T-t} = Q^{\pi_C}_t(s_\alpha, 1) + \frac16\left(\frac12\right)^{T-t} \ge \max_{A\in B_3} Q^{\pi_C}_t(s_\alpha, A) \ge Q^{\pi_C}_t(s_\alpha, \pi_C(s_\alpha)) = V^{\pi_C}_t(s_\alpha)$.

15 Proofs for the COMB-based Learner

Definition 2. An ordering $(\cdot)$ is a bijective function that maps an order $k \in [K]$ to its arm $(k) \in [K]$. For a state $s$, a valid increasing ordering $(\cdot)$ is an ordering such that for all pairs of orders $i, j \in [K]$ with $i < j$ we have $s_{(i)} \ge s_{(j)}$.

Proposition 2. Let $g \in \{0,1\}^K$; for any state $s$, $X(s) \cap X(s+g) \neq \emptyset$.

Proof. If there exists an arm $k \in X(s)$ such that $g_k = 1$, then $k \in X(s+g)$ and therefore $X(s) \cap X(s+g) \neq \emptyset$. Otherwise, for all arms $k \in X(s)$ we have $g_k = 0$, so actually $X(s) \subset X(s+g)$ as $\|s\|_\infty = \|s+g\|_\infty$. Again we have $X(s) \cap X(s+g) \neq \emptyset$.

For a set of deactivated arms $I \subsetneq [K]$, we define $X_I(s) = \{k \in [K]\setminus I : s_k = \max_{j\in[K]\setminus I} s_j\}$. Therefore we also have the following proposition.

Proposition 3. Let $g \in \{0,1\}^K$; for any state $s$ and any set of arms $I \subsetneq [K]$, $X_I(s) \cap X_I(s+g) \neq \emptyset$.

Proposition 4. Let $g \in \{0,1\}^K$; for any state $s$ there exists an ordering $(\cdot)$ that is valid simultaneously for both states $s$ and $s+g$.

Proof. To construct the ordering, follow these steps. Let the first step be $i = 1$, with $A_1 = [K]$ and $I_1 = \emptyset$. Iteratively, at each step $i$, we take $k_i \in X_{I_i}(s) \cap X_{I_i}(s+g)$, where $X_{I_i}(s) \cap X_{I_i}(s+g) \neq \emptyset$ by Proposition 3. We set $(i) \triangleq k_i$ and pass to the next iteration ($i \leftarrow i+1$) by 'deactivating' arm $k_i$, that is, $I_{i+1} = I_i \cup \{k_i\}$. Note that by construction this ordering satisfies at every step the defining property of a valid increasing ordering for both $s$ and $s+g$.
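A direct transcription of this construction into code, together with a brute-force check of Proposition 4 for $K = 3$, is sketched below (the function names are ours):

```python
# Sketch of the construction in the proof of Proposition 4: build one ordering that is a
# valid increasing ordering simultaneously for s and s + g (function names are ours).
from itertools import product

def argmax_set(s, deactivated):
    """X_I(s): arms outside the deactivated set I achieving the maximum of s."""
    active = [k for k in range(len(s)) if k not in deactivated]
    m = max(s[k] for k in active)
    return {k for k in active if s[k] == m}

def common_ordering(s, g):
    t = [si + gi for si, gi in zip(s, g)]
    order, deactivated = [], set()
    for _ in range(len(s)):
        # Proposition 3 guarantees this intersection is non-empty.
        k = min(argmax_set(s, deactivated) & argmax_set(t, deactivated))
        order.append(k)
        deactivated.add(k)
    return order

def is_valid_increasing(order, s):
    return all(s[order[i]] >= s[order[i + 1]] for i in range(len(order) - 1))

# Brute-force check for K = 3, small integer states and all g in {0,1}^3.
for s in product(range(4), repeat=3):
    for g in product(range(2), repeat=3):
        o = common_ordering(list(s), list(g))
        t = [si + gi for si, gi in zip(s, g)]
        assert is_valid_increasing(o, s) and is_valid_increasing(o, t)

print("Proposition 4 construction verified for all states in {0..3}^3 and g in {0,1}^3")
```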

The following lemma, which holds for any $K$, will be used in showing that $p_C$ defines a probability.

Lemma 18. Let $g \in \{0,1\}^K$; for any state $s$ and for all $t$,
$$V^{\pi_C}_t(s+g) - V^{\pi_C}_t(s) \ge 0, \qquad\text{and}\qquad V^{\pi_C}_t(s+\mathbf 1) - V^{\pi_C}_t(s) = 1\,.$$

Proof. The proof is by backward induction on time.

Initialization: At $t = T+1$ we have, for any $s$ and any $g \in \{0,1\}^K$,
$$V^{\pi_C}_{T+1}(s+g) - V^{\pi_C}_{T+1}(s) = \|s+g\|_\infty - \|s\|_\infty \overset{(a)}{=} (s_j + g_j) - s_i \overset{(b)}{\ge} g_i \ge \min_{k\in[K]} g_k\,,$$
where in (a) $i = \arg\max_{k\in[K]} s_k$ and $j = \arg\max_{k\in[K]} (s+g)_k$, and (b) uses $s_j + g_j \ge s_i + g_i$, which holds by definition of $j$. Note that the inequalities turn into equalities when $g = \mathbf 1$.


Induction step: Assuming that for all $t' > t$, for any $s$ and any $g \in \{0,1\}^K$, $V^{\pi_C}_{t'}(s+g) - V^{\pi_C}_{t'}(s) \ge \min_{k\in[K]} g_k$, we have
\begin{align*}
V^{\pi_C}_t(s+g) - V^{\pi_C}_t(s) &\overset{(a)}{=} \frac12\Big(-1 + V^{\pi_C}_{t+1}(s+g+e_{(2)}) + V^{\pi_C}_{t+1}(s+g+e_{(1)}+e_{(3)})\Big) \\
&\quad - \frac12\Big(-1 + V^{\pi_C}_{t+1}(s+e_{(2)}) + V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)})\Big) \overset{(b)}{\ge} \min_{k\in[K]} g_k\,,
\end{align*}
where (a) holds because, using Proposition 4, there exists a common valid ordering $(\cdot)$ for both states $s$ and $s+g$, and (b) is by induction. Note that the inequality turns into an equality when $g = \mathbf 1$.

We now turn to the analysis of our COMB-based learner, specifically designed for the case $K = 3$.

Proposition 5. In all states and at all times, the COMB-based learner defines a probability distribution.

Proof. We need to show that for any state $s$, any time $t$, $\sum_{k=1}^{3} p_{t,k}(s) = 1$ and, for any expert $k$, $p_{t,k}(s) \ge 0$. For all $t$ and any state $s$,
\begin{align*}
\sum_{k=1}^{3} p_{t,k}(s) &= p_{t,(1)}(s) + p_{t,(2)}(s) + p_{t,(3)}(s) \\
&= p_{t,(1)}(s) + p_{t,(2)}(s) + 1 - p_{t,(1)}(s) - p_{t,(2)}(s) = 1\,.
\end{align*}

We have moreover
\begin{align*}
&p_{t,(1)}(s) + p_{t,(2)}(s) \\
&= V^{\pi_C}_{t+1}(s+e_{(1)}) - V^{\pi_C}_t(s) + V^{\pi_C}_{t+1}(s+e_{(2)}) - V^{\pi_C}_t(s) \\
&\overset{(a)}{=} V^{\pi_C}_{t+1}(s+e_{(1)}) - \frac12\big(V^{\pi_C}_{t+1}(s+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) - 1\big) \\
&\quad + V^{\pi_C}_{t+1}(s+e_{(2)}) - \frac12\big(V^{\pi_C}_{t+1}(s+e_{(2)}) + V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)}) - 1\big) \\
&= 1 + \frac12 V^{\pi_C}_{t+1}(s+e_{(1)}) - \frac12 V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) + \frac12 V^{\pi_C}_{t+1}(s+e_{(2)}) - \frac12 V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)}) \\
&= 1 - \frac12\big(V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)}) - V^{\pi_C}_{t+1}(s+e_{(1)})\big) - \frac12\big(V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) - V^{\pi_C}_{t+1}(s+e_{(2)})\big) \\
&\overset{(b)}{\le} 1 - \frac12\min_{k\in[K]}\big(s+e_{(1)}+e_{(3)} - (s+e_{(1)})\big)_k - \frac12\min_{k\in[K]}\big(s+e_{(2)}+e_{(3)} - (s+e_{(2)})\big)_k = 1\,,
\end{align*}
where (a) is because of Lemma 5 and (b) is using Lemma 18.

Therefore $p_{t,(3)}(s) = 1 - p_{t,(1)}(s) - p_{t,(2)}(s) \ge 0$. Concerning the positiveness of $p_{t,(1)}(s)$, we have
\begin{align*}
p_{t,(1)}(s) &= V^{\pi_C}_{t+1}(s+e_{(1)}) - V^{\pi_C}_t(s) \\
&= V^{\pi_C}_{t+1}(s+e_{(1)}) - \frac12\big(V^{\pi_C}_{t+1}(s+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) - 1\big) \\
&= \frac12\big(V^{\pi_C}_{t+1}(s+e_{(1)}) - V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) + 1\big) \\
&= \frac12\big(V^{\pi_C}_{t+1}(s+e_{(1)}) - V^{\pi_C}_{t+1}(s) - 1 + V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(2)}+e_{(3)}) - V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) + 1\big) \\
&\overset{(a)}{\ge} \frac12\min_{k\in[K]}\big(s+e_{(1)} - s\big)_k + \frac12\min_{k\in[K]}\big(s+e_{(1)}+e_{(2)}+e_{(3)} - (s+e_{(2)}+e_{(3)})\big)_k = 0\,,
\end{align*}
where (a) is using Lemma 18.

A symmetric argument proves the positiveness of $p_{t,(2)}(s)$.
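Proposition 5 can also be verified numerically for small horizons. The sketch below reflects our reading of the definitions used in this section (not the paper's code): the $K = 3$ value recursion $V^{\pi_C}_t(s) = \frac12\big(V^{\pi_C}_{t+1}(s+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) - 1\big)$ with terminal value $\|s\|_\infty$, and the learner weights $p_{t,(1)}(s) = V^{\pi_C}_{t+1}(s+e_{(1)}) - V^{\pi_C}_t(s)$, $p_{t,(2)}(s) = V^{\pi_C}_{t+1}(s+e_{(2)}) - V^{\pi_C}_t(s)$, $p_{t,(3)}(s) = 1 - p_{t,(1)}(s) - p_{t,(2)}(s)$.

```python
# Sketch (our reading of the definitions in this section, not the paper's code):
# COMB value recursion for K = 3 and the induced COMB-based learner weights.
from functools import lru_cache
from itertools import product

T = 6  # small horizon

def order(s):
    """Indices of s sorted by decreasing value: a valid increasing ordering."""
    return sorted(range(3), key=lambda k: -s[k])

def add(s, arms):
    v = list(s)
    for k in arms:
        v[k] += 1
    return tuple(v)

@lru_cache(maxsize=None)
def V(t, s):
    if t == T + 1:
        return max(s)                       # terminal value ||s||_inf
    o = order(s)
    # With probability 1/2 reward the leading expert, with probability 1/2 the other two.
    return 0.5 * (V(t + 1, add(s, [o[0]])) + V(t + 1, add(s, [o[1], o[2]])) - 1)

eps = 1e-9
for t in range(1, T + 1):
    for s in product(range(4), repeat=3):
        o = order(s)
        p1 = V(t + 1, add(s, [o[0]])) - V(t, s)
        p2 = V(t + 1, add(s, [o[1]])) - V(t, s)
        p3 = 1 - p1 - p2
        assert min(p1, p2, p3) >= -eps and abs(p1 + p2 + p3 - 1) < eps

print("COMB-based learner weights form a probability distribution on all tested states")
```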

Proof of Theorem 3. We define here, given an adversary strategy $\pi$ and an adversary gain distribution $A$, the action-value-to-go $\bar Q^\pi_{t_0}(s, A)$ associated with $\pi$ and $A$ from time $t_0$ in state $s$:
$$\bar Q^\pi_{t_0}(s, A) = \mathbb E_{s_{T+1}}\big[\|s_{T+1}\|_\infty\big] - c_A - \sum_{t=t_0+1}^{T}\mathbb E_{s_t}\big[c_{\pi(s_t,t)}\big], \qquad (15)$$
$$s_{t+1} \sim P(\cdot \mid s_t, \pi(s_t,t)), \qquad s_{t_0+1} \sim P(\cdot \mid s_{t_0}, A), \qquad s_{t_0} = s. \qquad (16)$$
Similarly to Lemma 2, we have, for any adversary strategy $\pi$, any adversary gain distribution $A$, and any state $s$ and time $t_0$, $\bar Q^\pi_{t_0}(s, A) = Q^\pi_{t_0}(s, A) + \|s\|_\infty$ (the bar distinguishes the gain-based quantity defined in Equation 15 from the regret-based $Q^\pi_{t_0}$). The proof is by backward induction.

Initialization: $t = T+1$. At time $T+1$ the game is over and neither player has any influence on its value, so for all $s$, $V^{T+1}_p(s) = V^{\pi_C}_{T+1}(s) = \|s\|_\infty$.

Induction: We assume that for all $t' > t$, $V^{t'}_p(s) \le V^{\pi_C}_{t'}(s) + 3\sum_{r=t'}^{T}\varepsilon^C_r$.

Given a state $s$ at time $t$, we look at the best response of the adversary, that is, we consider all possible responses:
\begin{align*}
V^t_p(s) = \max\Big\{\ & V^{t+1}_p(s) & &\{\} \\
& -p_t(s)\cdot e_{(k)} + V^{t+1}_p(s+e_{(k)}),\ k\in[K] & &\{1\},\ \{2\},\ \{3\} \\
& -p_t(s)\cdot(e_{(k)}+e_{(j)}) + V^{t+1}_p(s+e_{(k)}+e_{(j)}),\ k\neq j\in[K] & &\{12\},\ \{23\},\ \{13\} \\
& -p_t(s)\cdot(e_{(1)}+e_{(2)}+e_{(3)}) + V^{t+1}_p(s+e_{(1)}+e_{(2)}+e_{(3)}) & &\{123\}\ \Big\} \\
\overset{(a)}{\le} 3\sum_{r=t+1}^{T}\varepsilon^C_r + \max\Big\{\ & V^{\pi_C}_{t+1}(s) \\
& -p_t(s)\cdot e_{(k)} + V^{\pi_C}_{t+1}(s+e_{(k)}) \\
& -p_t(s)\cdot(e_{(k)}+e_{(j)}) + V^{\pi_C}_{t+1}(s+e_{(k)}+e_{(j)}) \\
& -p_t(s)\cdot(e_{(1)}+e_{(2)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(2)}+e_{(3)})\ \Big\},
\end{align*}
where (a) is by induction.

We provide a bound for the eight possible terms of the previous equation.

Case 1. Adversary response $e_{(1)}$ or $e_{(2)}$: For $k \in \{1, 2\}$, since by definition $p_{t,(k)}(s) = V^{\pi_C}_{t+1}(s+e_{(k)}) - V^{\pi_C}_t(s)$, we have
$$-p_t(s)\cdot e_{(k)} + V^{\pi_C}_{t+1}(s+e_{(k)}) = V^{\pi_C}_t(s) - V^{\pi_C}_{t+1}(s+e_{(k)}) + V^{\pi_C}_{t+1}(s+e_{(k)}) = V^{\pi_C}_t(s)\,.$$

Case 2. Adversary response $e_{(3)}$: We have
\begin{align*}
&-p_t(s)\cdot e_{(3)} + V^{\pi_C}_{t+1}(s+e_{(3)}) \\
&= -1 + V^{\pi_C}_{t+1}(s+e_{(1)}) - V^{\pi_C}_t(s) + V^{\pi_C}_{t+1}(s+e_{(2)}) - V^{\pi_C}_t(s) + V^{\pi_C}_{t+1}(s+e_{(3)}) \\
&= -1 - 2V^{\pi_C}_t(s) + V^{\pi_C}_{t+1}(s+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(2)}) + V^{\pi_C}_{t+1}(s+e_{(3)}) \\
&\overset{(a)}{=} -1 - 2V^{\pi_C}_t(s) + 3Q^{\pi_C}_t(s, 1) + 1 \\
&\overset{(b)}{\le} -1 - 2V^{\pi_C}_t(s) + 3\big(V^{\pi_C}_t(s) + \varepsilon^C_t\big) + 1 \\
&= V^{\pi_C}_t(s) + 3\varepsilon^C_t\,,
\end{align*}
where (a) is because, by definition of the gain distribution 1, we have $Q^{\pi_C}_t(s, 1) = \frac13\big(V^{\pi_C}_{t+1}(s+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(2)}) + V^{\pi_C}_{t+1}(s+e_{(3)})\big) - \frac13$, and (b) is by definition of $\varepsilon^C_t$.


Case 3. Adversary response $e_{(2)}+e_{(1)}$: We have
\begin{align*}
&-p_t(s)\cdot(e_{(2)}+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(1)}) \\
&= 2V^{\pi_C}_t(s) - V^{\pi_C}_{t+1}(s+e_{(2)}) - V^{\pi_C}_{t+1}(s+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(1)}) \\
&\overset{(a)}{=} 2V^{\pi_C}_t(s) - V^{\pi_C}_{t+1}(s+e_{(2)}) - V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)}) - V^{\pi_C}_{t+1}(s+e_{(1)}) - V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) \\
&\quad + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(1)}) \\
&\overset{(b)}{=} -2V^{\pi_C}_t(s) - 2 + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(1)}) \\
&\overset{(c)}{=} -2V^{\pi_C}_t(s) - 2 + 3Q^{\pi_C}_t(s, 2) + 2 \\
&\overset{(d)}{\le} -2V^{\pi_C}_t(s) + 3\big(V^{\pi_C}_t(s) + \varepsilon^C_t\big) \\
&= V^{\pi_C}_t(s) + 3\varepsilon^C_t\,,
\end{align*}
where (a) only introduces (by adding and subtracting) the terms $V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)})$ and $V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)})$, (b) uses the fact that, by definition of the COMB strategy, $V^{\pi_C}_t(s) = \frac12\big(V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(1)}) - 1\big)$, as well as, by definition of the TWIN-COMB strategy together with Lemma 5 proving the equivalence between $\pi_C$ and $\pi_W$, $V^{\pi_C}_t(s) = \frac12\big(V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(2)}) - 1\big)$, (c) is because, by definition of the gain distribution 2, we have $Q^{\pi_C}_t(s, 2) = \frac13\big(V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(3)}+e_{(2)})\big) - \frac23$, and (d) is by definition of $\varepsilon^C_t$.

Case 4. Adversary response $e_{(2)}+e_{(3)}$ (the same argument applies to $e_{(1)}+e_{(3)}$): We have
\begin{align*}
&-p_t(s)\cdot(e_{(2)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) \\
&= -\big(1 - p_{t,(1)}(s) - p_{t,(2)}(s) + p_{t,(2)}(s)\big) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) \\
&= -1 + p_{t,(1)}(s) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) \\
&= -1 - V^{\pi_C}_t(s) + V^{\pi_C}_{t+1}(s+e_{(1)}) + V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) \\
&\overset{(a)}{=} -1 - V^{\pi_C}_t(s) + 2V^{\pi_C}_t(s) + 1 \\
&= V^{\pi_C}_t(s)\,,
\end{align*}
where (a) uses the fact that, by definition of the COMB strategy, $V^{\pi_C}_t(s) = \frac12\big(V^{\pi_C}_{t+1}(s+e_{(2)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(1)}) - 1\big)$.

Case 5. Adversary response $e_{(1)}+e_{(2)}+e_{(3)}$: We have
$$-p_t(s)\cdot(e_{(1)}+e_{(2)}+e_{(3)}) + V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(2)}+e_{(3)}) = -1 + V^{\pi_C}_{t+1}(s+e_{(1)}+e_{(2)}+e_{(3)}) \overset{(a)}{\le} V^{\pi_C}_t(s) + \varepsilon^C_t\,,$$
where (a) is by definition of $\varepsilon^C_t$.

Case 6. Adversary response with the zero vector: We have
$$-p_t(s)\cdot \mathbf 0 + V^{\pi_C}_{t+1}(s) = V^{\pi_C}_{t+1}(s) \overset{(a)}{\le} V^{\pi_C}_t(s) + \varepsilon^C_t\,,$$
where (a) is by definition of $\varepsilon^C_t$.
