

arXiv:2107.08346v1 [cs.LG] 18 Jul 2021

Policy Optimization in Adversarial MDPs:

Improved Exploration via Dilated Bonuses

Haipeng Luo*

[email protected]

Chen-Yu Wei*

[email protected]

Chung-Wei Lee

[email protected]

University of Southern California

*Equal contribution.

Abstract

Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however,

theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs)

that bypass the challenge of global exploration. To eliminate the need of such assumptions, in this work, we develop

a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the

power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit

feedback, improving and generalizing the state-of-the-art. Specifically, in the tabular case, we obtain O(√T) regret where T is the number of episodes, improving the O(T^{2/3}) regret bound by Shani et al. [2020]. When the number of states is infinite, under the assumption that the state-action values are linear in some low-dimensional features, we obtain O(T^{2/3}) regret with the help of a simulator, matching the result of Neu and Olkhovskaya [2020] while importantly removing the need of an exploratory policy that their algorithm requires. When a simulator is unavailable, we further consider a linear MDP setting and obtain O(T^{14/15}) regret, which is the first result for linear MDPs with

adversarial losses and bandit feedback.

1 Introduction

Policy optimization methods are among the most widely-used methods in reinforcement learning. Their empirical success

has been demonstrated in various domains such as computer games [Schulman et al., 2017] and robotics [Levine and Koltun,

2013]. However, due to its local-search nature, global optimality guarantees of policy optimization often rely on

unrealistic assumptions to ensure global exploration (see e.g., [Abbasi-Yadkori et al., 2019, Agarwal et al., 2020b,

Neu and Olkhovskaya, 2020, Wei et al., 2021]), making it theoretically less appealing compared to other methods.

Motivated by this issue, a line of recent works [Cai et al., 2020, Shani et al., 2020, Agarwal et al., 2020a, Zanette et al., 2021] equip policy optimization with global exploration by adding exploration bonuses to the update, and prove favorable guarantees even without making extra exploratory assumptions. Moreover, they all demonstrate some robustness aspect of policy optimization (such as being able to handle adversarial losses or a certain degree of model misspecification). Despite this important progress, however, many limitations remain, including worse regret rates compared with the best value-based or model-based approaches [Shani et al., 2020, Agarwal et al., 2020a, Zanette et al., 2021], or the requirement of full-information feedback on the entire loss function (as opposed to the more realistic bandit feedback) [Cai et al., 2020].

To address these issues, in this work, we propose a new type of exploration bonuses, called dilated bonuses, which satisfy a certain dilated Bellman equation and provably lead to improved exploration compared to existing works (Section 3). We apply this general idea to advance the state-of-the-art of policy optimization for learning finite-horizon episodic MDPs with adversarial losses and bandit feedback. More specifically, our main results are:

• First, in the tabular setting, addressing the main open question left in [Shani et al., 2020], we improve their O(T^{2/3}) regret to the optimal O(√T) regret. This shows that policy optimization, which performs local optimization, is as capable as other occupancy-measure-based global optimization algorithms [Jin et al., 2020a, Lee et al., 2020] in terms of global exploration. Moreover, our algorithm is computationally more efficient than those global methods, since they require solving some convex optimization problem in each episode. (Section 4)

• Second, to further deal with large-scale problems, we consider a linear function approximation setting where the state-action values are linear in some known low-dimensional features and a simulator is available, the same setting considered by [Neu and Olkhovskaya, 2020]. We obtain the same O(T^{2/3}) regret while importantly removing their exploratory assumption. (Section 5)

• Finally, to remove the need for a sampling oracle, we further consider linear MDPs, a special case where the transition kernel is also linear in the features. To our knowledge, the only existing works that consider adversarial losses in this setup are [Cai et al., 2020], which obtains O(√T) regret but requires full-information feedback on the loss functions, and [Neu and Olkhovskaya, 2021] (an updated version of [Neu and Olkhovskaya, 2020]), which obtains O(√T) regret under bandit feedback but requires perfect knowledge of the transition as well as an exploratory assumption. We propose the first algorithm for the most challenging setting with bandit feedback and unknown transition, which achieves O(T^{14/15}) regret without any exploratory assumption. (Section 6)

We emphasize that unlike the tabular setting (where we improve existing regret rates of policy optimization), in

the two adversarial linear function approximation settings with bandit feedback that we consider, researchers have

not been able to show any sublinear regret for policy optimization without exploratory assumptions before our work,

which shows the critical role of our proposed dilated bonuses. In fact, there are simply no existing algorithms with

sublinear regret at all for these two settings, be it policy-optimization-type or not. This shows the advantage of policy

optimization over other approaches, when combined with our dilated bonuses.

Related work. In the tabular setting, except for [Shani et al., 2020], most algorithms apply the occupancy-measure-based framework to handle adversarial losses (e.g., [Rosenberg and Mansour, 2019, Jin et al., 2020a, Chen et al., 2021, Chen and Luo, 2021]), which as mentioned is computationally expensive. For stochastic losses, there are many other approaches, such as model-based ones [Jaksch et al., 2010, Dann and Brunskill, 2015, Azar et al., 2017, Fruit et al., 2018, Zanette and Brunskill, 2019] and value-based ones [Jin et al., 2018, Dong et al., 2019].

Theoretical studies for linear function approximation have gained increasing interest recently [Yang and Wang,

2020, Zanette et al., 2020, Jin et al., 2020b]. Most of them study stochastic/stationary losses, with the exception

of [Cai et al., 2020, Neu and Olkhovskaya, 2020, 2021]. Our algorithm for the linear MDP setting bears some similarity to those of [Agarwal et al., 2020a, Zanette et al., 2021], which consider stationary losses. However, in each episode,

their algorithms first execute an exploratory policy (from a policy cover), and then switch to the policy suggested by

the policy optimization algorithm, which inevitably leads to linear regret when facing adversarial losses.

2 Problem Setting

We consider an MDP specified by a state space X (possibly infinite), a finite action space A, and a transition function

P with P (·|x, a) specifying the distribution of the next state after taking action a in state x. In particular, we focus on

the finite-horizon episodic setting in which X admits a layer structure and can be partitioned into X0, X1, . . . , XH for

some fixed parameter H , where X0 contains only the initial state x0, XH contains only the terminal state xH , and for

any x ∈ Xh, h = 0, . . . , H − 1, P (·|x, a) is supported on Xh+1 for all a ∈ A (that is, transition is only possible from

Xh to Xh+1). An episode refers to a trajectory that starts from x0 and ends at xH following some series of actions and

the transition dynamics. The MDP may further be associated with a loss function ℓ : X × A → [0, 1], so that ℓ(x, a) specifies the loss suffered when selecting action a in state x.

A policy π for the MDP is a mapping X → ∆(A), where ∆(A) denotes the set of distributions over A and π(a|x) is the probability of choosing action a in state x. Given a loss function ℓ and a policy π, the expected total loss of π is given by
$$V^\pi(x_0; \ell) = \mathbb{E}\left[\sum_{h=0}^{H-1} \ell(x_h, a_h) \,\middle|\, a_h \sim \pi(\cdot|x_h),\ x_{h+1} \sim P(\cdot|x_h, a_h)\right].$$
It can also be defined via the Bellman equation involving the state value function $V^\pi(x; \ell)$ and the state-action value function $Q^\pi(x, a; \ell)$ (a.k.a. Q-function), defined as below: $V^\pi(x_H; \ell) = 0$,
$$Q^\pi(x, a; \ell) = \ell(x, a) + \mathbb{E}_{x' \sim P(\cdot|x,a)}\left[V^\pi(x'; \ell)\right], \qquad V^\pi(x; \ell) = \mathbb{E}_{a \sim \pi(\cdot|x)}\left[Q^\pi(x, a; \ell)\right].$$
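To make the layered Bellman recursion concrete, the value functions above can be computed with a single backward pass when the transition is known. The sketch below is illustrative only and uses our own array names (P, loss, pi), which are not part of the paper.

```python
import numpy as np

def evaluate_policy(P, loss, pi):
    """Backward evaluation of Q^pi and V^pi in a layered MDP with known transition.

    P[h]    : (S_h, A, S_{h+1}) transition probabilities from layer h to layer h+1
    loss[h] : (S_h, A) losses at layer h
    pi[h]   : (S_h, A) action probabilities at layer h
    Returns per-layer arrays Q[h] of shape (S_h, A) and V[h] of shape (S_h,).
    """
    H = len(loss)
    V_next = np.zeros(P[-1].shape[2])      # V(x_H; l) = 0 at the terminal layer
    Q, V = [None] * H, [None] * H
    for h in reversed(range(H)):
        Q[h] = loss[h] + P[h] @ V_next     # Q(x, a) = l(x, a) + E_{x' ~ P}[V(x')]
        V[h] = (pi[h] * Q[h]).sum(axis=1)  # V(x) = E_{a ~ pi(.|x)}[Q(x, a)]
        V_next = V[h]
    return Q, V
```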


We study online learning in such a finite-horizon MDP with unknown transition, bandit feedback, and adversarial

losses. The learning proceeds through T episodes. Ahead of time, an adversary arbitrarily decides T loss functions

ℓ1, . . . , ℓT , without revealing them to the learner. Then in each episode t, the learner decides a policy πt based on

all information received prior to this episode, executes πt starting from the initial state x0, and generates and observes a trajectory $\{(x_{t,h}, a_{t,h}, \ell_t(x_{t,h}, a_{t,h}))\}_{h=0}^{H-1}$. Importantly, the learner does not observe any other information about ℓt

(a.k.a. bandit feedback).¹ The goal of the learner is to minimize the regret, defined as
$$\text{Reg} = \sum_{t=1}^{T} V_t^{\pi_t}(x_0) - \min_{\pi}\sum_{t=1}^{T} V_t^{\pi}(x_0),$$
where we use $V_t^{\pi}(x)$ as a shorthand for $V^{\pi}(x; \ell_t)$ (and similarly $Q_t^{\pi}(x, a)$ as a shorthand for $Q^{\pi}(x, a; \ell_t)$). Without further structure, the best existing regret bound is $O(H|X|\sqrt{|A|T})$ [Jin et al., 2020a], with an extra $\sqrt{|X|}$ factor compared to the best existing lower bound [Jin et al., 2018].

¹Full-information feedback, on the other hand, refers to the easier setting where the entire loss function ℓt is revealed to the learner at the end of episode t.

Occupancy measures. For a policy π and a state x, we define $q^{\pi}(x)$ to be the probability (or probability measure when |X| is infinite) of visiting state x within an episode when following π. When it is necessary to highlight the dependence on the transition, we write it as $q^{P,\pi}(x)$. Further define $q^{\pi}(x, a) = q^{\pi}(x)\pi(a|x)$ and $q_t(x, a) = q^{\pi_t}(x, a)$. Finally, we use $q^\star$ as a shorthand for $q^{\pi^\star}$, where $\pi^\star \in \operatorname{argmin}_{\pi}\sum_{t=1}^{T} V_t^{\pi}(x_0)$ is one of the optimal policies.

Note that by definition, we have $V^{\pi}(x_0; \ell) = \sum_{x,a} q^{\pi}(x, a)\ell(x, a)$. In fact, we will overload the notation and let $V^{\pi}(x_0; b) = \sum_{x,a} q^{\pi}(x, a)b(x, a)$ for any function $b: X \times A \to \mathbb{R}$ (even though it might not correspond to a real loss function).
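As a quick sanity check of the identity $V^{\pi}(x_0;\ell) = \sum_{x,a} q^{\pi}(x,a)\ell(x,a)$, occupancy measures can be computed by a forward pass. This is only an illustrative sketch under a known transition, with hypothetical array names (P, pi) and with x0 taken to be state 0 of layer 0.

```python
import numpy as np

def occupancy_measures(P, pi):
    """Forward computation of q^pi(x, a) for each layer of a layered MDP.

    P[h]  : (S_h, A, S_{h+1}) transition, pi[h] : (S_h, A) policy.
    Returns a list q with q[h] of shape (S_h, A).
    """
    q_state = np.zeros(pi[0].shape[0])
    q_state[0] = 1.0                                   # all mass on the initial state x_0
    q = []
    for h in range(len(pi)):
        q.append(q_state[:, None] * pi[h])             # q(x, a) = q(x) * pi(a|x)
        q_state = np.einsum('xa,xay->y', q[h], P[h])   # push the distribution to layer h+1
    return q

# Identity check: sum((q[h] * loss[h]).sum() for h in range(len(loss)))
# equals V^pi(x_0; loss) computed by backward evaluation.
```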

Other notations. We denote by $\mathbb{E}_t[\cdot]$ and $\mathrm{Var}_t[\cdot]$ the expectation and variance conditioned on everything prior to episode t. For a matrix Σ and a vector z (of appropriate dimension), $\|z\|_{\Sigma}$ denotes the quadratic norm $\sqrt{z^\top\Sigma z}$. The notation O(·) hides all logarithmic factors.

3 Dilated Exploration Bonuses

In this section, we start with a general discussion on designing exploration bonuses (not specific to policy optimization),

and then introduce our new dilated bonuses for policy optimization. For simplicity, the exposition in this section

assumes a finite state space, but the idea generalizes to an infinite state space.

When analyzing the regret of an algorithm, very often we run into the following form:
$$\text{Reg} = \sum_{t=1}^{T} V_t^{\pi_t}(x_0) - \sum_{t=1}^{T} V_t^{\pi^\star}(x_0) \le o(T) + \sum_{t=1}^{T}\sum_{x,a} q^\star(x, a)b_t(x, a) = o(T) + \sum_{t=1}^{T} V^{\pi^\star}(x_0; b_t), \tag{1}$$

for some function $b_t(x, a)$ usually related to some estimation error or variance that can be prohibitively large. For example, in policy optimization, the algorithm performs local search in each state, essentially using a multi-armed bandit algorithm and treating $Q_t^{\pi_t}(x, a)$ as the loss of action a in state x. Since $Q_t^{\pi_t}(x, a)$ is unknown, however, the algorithm has to use some estimator of $Q_t^{\pi_t}(x, a)$ instead, whose bias and variance both contribute to the $b_t$ function. Usually, $b_t(x, a)$ is large for a rarely-visited state-action pair (x, a) and is inversely related to $q_t(x, a)$, which is exactly why most analyses rely on the assumption that some distribution mismatch coefficient related to $q^\star(x,a)/q_t(x,a)$ is bounded (see e.g., [Agarwal et al., 2020b, Wei et al., 2020]).

On the other hand, an important observation is that while $V^{\pi^\star}(x_0; b_t)$ can be prohibitively large, its counterpart with respect to the learner's policy, $V^{\pi_t}(x_0; b_t)$, is usually nicely bounded. For example, if $b_t(x, a)$ is inversely related to $q_t(x, a)$ as mentioned, then $V^{\pi_t}(x_0; b_t) = \sum_{x,a} q_t(x, a)b_t(x, a)$ is small no matter how small $q_t(x, a)$ could be for some (x, a). This observation, together with the linearity property $V^{\pi}(x_0; \ell_t - b_t) = V^{\pi}(x_0; \ell_t) - V^{\pi}(x_0; b_t)$, suggests that we treat $\ell_t - b_t$ as the loss function of the problem, or in other words, add a (negative) bonus to each state-action pair, which intuitively encourages exploration due to underestimation. Indeed, assuming for a moment that Eq. (1) still roughly holds even if we treat $\ell_t - b_t$ as the loss function:
$$\sum_{t=1}^{T} V^{\pi_t}(x_0; \ell_t - b_t) - \sum_{t=1}^{T} V^{\pi^\star}(x_0; \ell_t - b_t) \lesssim o(T) + \sum_{t=1}^{T} V^{\pi^\star}(x_0; b_t). \tag{2}$$

Then by linearity and rearranging, we have
$$\text{Reg} = \sum_{t=1}^{T} V_t^{\pi_t}(x_0) - \sum_{t=1}^{T} V_t^{\pi^\star}(x_0) \lesssim o(T) + \sum_{t=1}^{T} V^{\pi_t}(x_0; b_t). \tag{3}$$

Due to the switch from π⋆ to πt in the last term compared to Eq. (1), this is usually enough to prove a desirable regret

bound without making extra assumptions.
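For completeness, the rearrangement behind Eq. (3) uses nothing but the linearity of $V^{\pi}(x_0;\cdot)$: expanding Eq. (2) and moving the two bonus terms to the right-hand side gives
$$\sum_{t=1}^{T} V^{\pi_t}(x_0;\ell_t) - \sum_{t=1}^{T} V^{\pi^\star}(x_0;\ell_t) \;\lesssim\; o(T) + \sum_{t=1}^{T} V^{\pi^\star}(x_0;b_t) + \sum_{t=1}^{T} V^{\pi_t}(x_0;b_t) - \sum_{t=1}^{T} V^{\pi^\star}(x_0;b_t) \;=\; o(T) + \sum_{t=1}^{T} V^{\pi_t}(x_0;b_t).$$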

The caveat of this discussion is the assumption of Eq. (2). Indeed, after adding the bonuses, which themselves contribute additional bias and variance, one should expect that $b_t$ on the right-hand side of Eq. (2) becomes something larger, breaking the desired cancellation effect needed to achieve Eq. (3); the definition of $b_t$ essentially becomes circular in this sense.

Dilated Bonuses for Policy Optimization To address this issue, we take a closer look at the policy optimization

algorithm specifically. As mentioned, policy optimization decomposes the problem into individual multi-armed bandit

problems in each state and then performs local optimization. This is based on the well-known performance difference

lemma [Kakade and Langford, 2002]:

$$\text{Reg} = \sum_{x} q^\star(x)\sum_{t=1}^{T}\sum_{a}\big(\pi_t(a|x) - \pi^\star(a|x)\big)Q_t^{\pi_t}(x, a),$$
showing that in each state x, the learner is facing a bandit problem with $Q_t^{\pi_t}(x, a)$ being the loss for action a. Correspondingly, incorporating the bonuses $b_t$ for policy optimization means subtracting the bonus $Q^{\pi_t}(x, a; b_t)$ from $Q_t^{\pi_t}(x, a)$ for each action a in each state x. Recall that $Q^{\pi_t}(x, a; b_t)$ satisfies the Bellman equation $Q^{\pi_t}(x, a; b_t) = b_t(x, a) + \mathbb{E}_{x'\sim P(\cdot|x,a)}\mathbb{E}_{a'\sim\pi_t(\cdot|x')}\big[Q^{\pi_t}(x', a'; b_t)\big]$. To resolve the issue mentioned earlier, we propose to replace this bonus function $Q^{\pi_t}(x, a; b_t)$ with its dilated version $B_t(x, a)$ satisfying the following dilated Bellman equation:
$$B_t(x, a) = b_t(x, a) + \Big(1 + \frac{1}{H}\Big)\mathbb{E}_{x'\sim P(\cdot|x,a)}\mathbb{E}_{a'\sim\pi_t(\cdot|x')}\big[B_t(x', a')\big] \tag{4}$$
(with $B_t(x_H, a) = 0$ for all a). The only difference compared to the standard Bellman equation is the extra $(1 + \frac{1}{H})$ factor, which slightly increases the weight for deeper layers and thus intuitively induces more exploration for those layers. Due to the extra bonus compared to $Q^{\pi_t}(x, a; b_t)$, the regret bound also increases accordingly. In all our applications, this extra amount of regret turns out to be of the form $\frac{1}{H}\sum_{t=1}^{T}\sum_{x,a} q^\star(x)\pi_t(a|x)B_t(x, a)$, leading to

$$\sum_{x} q^\star(x)\sum_{t=1}^{T}\sum_{a}\big(\pi_t(a|x) - \pi^\star(a|x)\big)\big(Q_t^{\pi_t}(x, a) - B_t(x, a)\big) \le o(T) + \sum_{t=1}^{T} V^{\pi^\star}(x_0; b_t) + \frac{1}{H}\sum_{t=1}^{T}\sum_{x,a} q^\star(x)\pi_t(a|x)B_t(x, a). \tag{5}$$

With some direct calculation, one can show that this is enough to obtain a regret bound that is only a constant factor larger than the desired bound in Eq. (3)! This is summarized in the following lemma.

Lemma 3.1. If Eq. (5) holds with $B_t$ defined in Eq. (4), then $\text{Reg} \le o(T) + 3\sum_{t=1}^{T} V^{\pi_t}(x_0; b_t)$.

The high-level idea of the proof is to show that the bonuses added to layer h are enough to cancel the large bias/variance terms (including those coming from the bonus itself) from layer h + 1. Therefore, cancellation happens in a layer-by-layer manner except for layer 0, where the total amount of bonus can be shown to be at most $(1 + \frac{1}{H})^H\sum_{t=1}^{T} V^{\pi_t}(x_0; b_t) \le 3\sum_{t=1}^{T} V^{\pi_t}(x_0; b_t)$.


Recalling again that $V^{\pi_t}(x_0; b_t)$ is usually nicely bounded, we thus arrive at a favorable regret guarantee without making extra assumptions. Of course, since the transition is unknown, we cannot compute $B_t$ exactly. However, Lemma 3.1 is robust enough to handle either a good approximate version of $B_t$ (see Lemma B.1) or a version where Eq. (4) and Eq. (5) only hold in expectation (see Lemma B.2), which is enough for us to handle unknown transitions. In the next three sections, we apply this general idea to different settings, showing what $b_t$ and $B_t$ are concretely in each case.
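To illustrate the dilated Bellman equation Eq. (4), the sketch below computes $B_t$ by a backward pass over the layers when the transition is known; the paper's algorithms instead use optimistic estimates of P (Section 4) or samples (Sections 5 and 6). The array names (P, pi, b) are ours.

```python
import numpy as np

def dilated_bonus(P, pi, b):
    """Backward recursion of the dilated Bellman equation, Eq. (4), for a known transition.

    P[h] : (S_h, A, S_{h+1}) transition, pi[h] : (S_h, A) policy, b[h] : (S_h, A) bonus b_t.
    Returns B with B[h] of shape (S_h, A); B is 0 at the terminal layer.
    """
    H = len(b)
    B_next = np.zeros(P[-1].shape[2])          # B_t(x_H, a) = 0
    B = [None] * H
    for h in reversed(range(H)):
        EB = P[h] @ B_next                     # E_{x'~P(.|x,a)} E_{a'~pi(.|x')}[B(x', a')]
        B[h] = b[h] + (1.0 + 1.0 / H) * EB     # the extra (1 + 1/H) dilation factor
        B_next = (pi[h] * B[h]).sum(axis=1)    # E_{a~pi(.|x)}[B(x, a)] for layer h
    return B
```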

4 The Tabular Case

In this section, we study the tabular case where the number of states is finite. We propose a policy optimization

algorithm with O(√T) regret, improving the O(T^{2/3}) regret of [Shani et al., 2020]. See Algorithm 1 for the complete

pseudocode.

Algorithm design. First, to handle unknown transition, we follow the common practice (dating back to [Jaksch et al.,

2010]) to maintain a confidence set of the transition, which is updated whenever the visitation count of a certain state-

action pair is doubled. We call the period between two model updates an epoch, and use Pk to denote the confidence

set for epoch k, formally defined in Eq. (10).

In episode t, the policy πt is defined via the standard multiplicative weight algorithm (also connected to Natural Policy Gradient [Kakade, 2001, Agarwal et al., 2020b, Wei et al., 2021]), but importantly with the dilated bonuses incorporated, such that $\pi_t(a|x) \propto \exp\big(-\eta\sum_{\tau=1}^{t-1}(\widehat{Q}_\tau(x, a) - B_\tau(x, a))\big)$. Here, η is a step size parameter, $\widehat{Q}_\tau(x, a)$ is an importance-weighted estimator of $Q_\tau^{\pi_\tau}(x, a)$ defined in Eq. (7), and $B_\tau(x, a)$ is the dilated bonus defined in Eq. (9). More specifically, for a state x in layer h, $\widehat{Q}_t(x, a)$ is defined as $\frac{L_{t,h}\,\mathbb{1}_t(x,a)}{\overline{q}_t(x,a)+\gamma}$, where $\mathbb{1}_t(x, a)$ is the indicator of whether (x, a) is visited during episode t; $L_{t,h}$ is the total loss suffered by the learner from layer h to the end of the episode; $\overline{q}_t(x, a) = \max_{\widehat{P}\in\mathcal{P}_k} q^{\widehat{P},\pi_t}(x, a)$ is the largest plausible value of $q_t(x, a)$ within the confidence set, which can be computed efficiently using the COMP-UOB procedure of [Jin et al., 2020a] (see also Appendix C.1); and finally γ is a parameter used to control the maximum magnitude of $\widehat{Q}_t(x, a)$. To get a sense of this estimator, consider the special case where γ = 0 and the transition is known, so that we can set $\mathcal{P}_k = \{P\}$ and thus $\overline{q}_t = q_t$. Then, since the expectation of $L_{t,h}$ conditioned on (x, a) being visited is $Q_t^{\pi_t}(x, a)$ and the expectation of $\mathbb{1}_t(x, a)$ is $q_t(x, a)$, we know that $\widehat{Q}_t(x, a)$ is an unbiased estimator of $Q_t^{\pi_t}(x, a)$. The extra complication is simply due to the transition being unknown, forcing us to use $\overline{q}_t$ and γ > 0 to make sure that $\widehat{Q}_t(x, a)$ is an optimistic underestimator, an idea similar to [Jin et al., 2020a].
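A minimal sketch of the importance-weighted estimator in Eq. (7), for a single (x, a) in layer h; the argument names are ours, not the paper's.

```python
def q_estimate(L_th, visited, q_upper, gamma):
    """Optimistic importance-weighted Q estimator of Eq. (7) (sketch).

    L_th    : total loss suffered from layer h to the end of episode t
    visited : 1 if (x, a) was visited at layer h in episode t, else 0
    q_upper : upper occupancy bound, max over the confidence set of q^{P, pi_t}(x, a)
    gamma   : implicit-exploration parameter capping the estimator's magnitude
    """
    # With gamma = 0 and q_upper equal to the true q_t(x, a), the estimator is unbiased,
    # since E[L_th * visited] = q_t(x, a) * Q_t^{pi_t}(x, a).
    return L_th * visited / (q_upper + gamma)
```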

Next, we explain the design of the dilated bonus $B_t$. Following the discussion in Section 3, we first figure out what the corresponding $b_t$ function in Eq. (1) is, by analyzing the regret bound without using any bonuses. The concrete form of $b_t$ turns out to be Eq. (8), whose value at (x, a) is independent of a and thus written as $b_t(x)$ for simplicity. Note that Eq. (8) depends on the occupancy measure lower bound $\underline{q}_t(x, a) = \min_{\widehat{P}\in\mathcal{P}_k} q^{\widehat{P},\pi_t}(x, a)$, the counterpart of $\overline{q}_t(x, a)$, which can also be computed efficiently using a procedure similar to COMP-UOB (see Appendix C.1). Once again, to get a sense of this, consider the special case with a known transition, so that we can set $\mathcal{P}_k = \{P\}$ and thus $\overline{q}_t = \underline{q}_t = q_t$. Then, one sees that $b_t(x)$ is simply upper bounded by $\mathbb{E}_{a\sim\pi_t(\cdot|x)}\big[3\gamma H / q_t(x, a)\big] = 3\gamma H|A| / q_t(x)$, which is inversely related to the probability of visiting state x, matching the intuition we provided in Section 3 (that $b_t(x)$ is large if x is rarely visited). The extra complication of Eq. (8) is again just due to the unknown transition.

With bt(x) ready, the final form of the dilated bonus Bt is defined following the dilated Bellman equation of Eq. (4),

except that since P is unknown, we once again apply optimism and find the largest possible value within the confidence

set (see Eq. (9)). This can again be efficiently computed; see Appendix C.1. This concludes the complete algorithm

design.
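The policy update of Eq. (6) at a single state is just a softmax over the accumulated Q estimates minus the accumulated dilated bonuses. A minimal numerically-stable sketch, with our own function and argument names:

```python
import numpy as np

def mw_policy(Q_hat_sum, B_sum, eta):
    """Multiplicative-weight update of Eq. (6) at one state x (sketch).

    Q_hat_sum : (A,) array, sum over tau < t of the Q estimates at x
    B_sum     : (A,) array, sum over tau < t of the dilated bonuses at x
    eta       : step size
    Returns pi_t(.|x) as a probability vector over actions.
    """
    logits = -eta * (Q_hat_sum - B_sum)
    logits -= logits.max()           # shift for numerical stability; leaves the softmax unchanged
    w = np.exp(logits)
    return w / w.sum()
```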

Algorithm 1 Policy Optimization with Dilated Bonuses (Tabular Case)

Parameters: $\delta \in (0,1)$, $\eta = \min\{1/(24H^3),\, 1/\sqrt{|X||A|HT}\}$, $\gamma = 2\eta H$.
Initialization: set epoch index $k = 1$ and confidence set $\mathcal{P}_1$ as the set of all transition functions. For all $(x, a, x')$, initialize counters $N_0(x, a) = N_1(x, a) = 0$, $N_0(x, a, x') = N_1(x, a, x') = 0$.
for $t = 1, 2, \ldots, T$ do
  Step 1: Compute and execute policy. Execute $\pi_t$ for one episode, where
  $$\pi_t(a|x) \propto \exp\Big(-\eta\sum_{\tau=1}^{t-1}\big(\widehat{Q}_\tau(x, a) - B_\tau(x, a)\big)\Big), \tag{6}$$
  and obtain trajectory $\{(x_{t,h}, a_{t,h}, \ell_t(x_{t,h}, a_{t,h}))\}_{h=0}^{H-1}$.
  Step 2: Construct Q-function estimators. For all $h \in \{0, \ldots, H-1\}$ and $(x, a) \in X_h\times A$,
  $$\widehat{Q}_t(x, a) = \frac{L_{t,h}}{\overline{q}_t(x, a) + \gamma}\,\mathbb{1}_t(x, a), \tag{7}$$
  with $L_{t,h} = \sum_{i=h}^{H-1}\ell_t(x_{t,i}, a_{t,i})$, $\overline{q}_t(x, a) = \max_{\widehat{P}\in\mathcal{P}_k} q^{\widehat{P},\pi_t}(x, a)$, and $\mathbb{1}_t(x, a) = \mathbb{1}\{x_{t,h} = x,\ a_{t,h} = a\}$.
  Step 3: Construct bonus functions. For all $(x, a) \in X\times A$,
  $$b_t(x) = \mathbb{E}_{a\sim\pi_t(\cdot|x)}\left[\frac{3\gamma H + H\big(\overline{q}_t(x, a) - \underline{q}_t(x, a)\big)}{\overline{q}_t(x, a) + \gamma}\right], \tag{8}$$
  $$B_t(x, a) = b_t(x) + \Big(1 + \frac{1}{H}\Big)\max_{\widehat{P}\in\mathcal{P}_k}\mathbb{E}_{x'\sim\widehat{P}(\cdot|x,a)}\mathbb{E}_{a'\sim\pi_t(\cdot|x')}\big[B_t(x', a')\big], \tag{9}$$
  where $\underline{q}_t(x, a) = \min_{\widehat{P}\in\mathcal{P}_k} q^{\widehat{P},\pi_t}(x, a)$ and $B_t(x_H, a) = 0$ for all a.
  Step 4: Update model estimation. $\forall h < H$: $N_k(x_{t,h}, a_{t,h}) \mathrel{+\!\!\leftarrow} 1$, $N_k(x_{t,h}, a_{t,h}, x_{t,h+1}) \mathrel{+\!\!\leftarrow} 1$.²
  if $\exists h,\ N_k(x_{t,h}, a_{t,h}) \ge \max\{1, 2N_{k-1}(x_{t,h}, a_{t,h})\}$ then
    Increment epoch index $k \mathrel{+\!\!\leftarrow} 1$ and copy counters: $N_k(x, a) \leftarrow N_{k-1}(x, a)$, $N_k(x, a, x') \leftarrow N_{k-1}(x, a, x')$ for all $(x, a, x')$.
    Compute the empirical transition $\overline{P}_k(x'|x, a) = \frac{N_k(x, a, x')}{\max\{1, N_k(x, a)\}}$ and the confidence set
    $$\mathcal{P}_k = \Big\{\widehat{P}:\ \big|\widehat{P}(x'|x, a) - \overline{P}_k(x'|x, a)\big| \le \mathrm{conf}_k(x'|x, a),\ \forall (x, a, x') \in X_h\times A\times X_{h+1},\ h = 0, 1, \ldots, H-1\Big\}, \tag{10}$$
    where $\mathrm{conf}_k(x'|x, a) = 4\sqrt{\frac{\overline{P}_k(x'|x, a)\ln\frac{T|X||A|}{\delta}}{\max\{1, N_k(x, a)\}}} + \frac{28\ln\frac{T|X||A|}{\delta}}{3\max\{1, N_k(x, a)\}}$.

²We use $y \mathrel{+\!\!\leftarrow} z$ as a shorthand for the increment operation $y \leftarrow y + z$.

Regret analysis. The regret guarantee of Algorithm 1 is presented below:

Theorem 4.1. Algorithm 1 ensures that with probability $1 - O(\delta)$, $\text{Reg} = O\big(H^2|X|\sqrt{|A|T} + H^4\big)$.

Again, this improves the $O(T^{2/3})$ regret of [Shani et al., 2020]. It almost matches the best existing upper bound for this problem, which is $O(H|X|\sqrt{|A|T})$ [Jin et al., 2020a]. While it is unclear to us whether this small gap can be closed using policy optimization, we point out that our algorithm is arguably more efficient than that of [Jin et al., 2020a], which performs global convex optimization over the set of all plausible occupancy measures in each episode.

The complete proof of this theorem is deferred to Appendix C. Here, we only sketch an outline of proving Eq. (5), which, according to the discussion in Section 3, is the most important part of the analysis. Specifically, we decompose the left-hand side of Eq. (5), $\sum_x q^\star(x)\sum_t\langle\pi_t(\cdot|x) - \pi^\star(\cdot|x),\ Q_t^{\pi_t}(x, \cdot) - B_t(x, \cdot)\rangle$, as BIAS-1 + BIAS-2 + REG-TERM, where

• BIAS-1 $= \sum_x q^\star(x)\sum_t\langle\pi_t(\cdot|x),\ Q_t^{\pi_t}(x, \cdot) - \widehat{Q}_t(x, \cdot)\rangle$ measures the amount of underestimation of $\widehat{Q}_t$ related to $\pi_t$, which can be bounded by $\sum_t\sum_{x,a} q^\star(x)\pi_t(a|x)\Big(\frac{2\gamma H + H(\overline{q}_t(x,a) - \underline{q}_t(x,a))}{\overline{q}_t(x,a)+\gamma}\Big) + O(H/\eta)$ with high probability (Lemma C.1);

• BIAS-2 $= \sum_x q^\star(x)\sum_t\langle\pi^\star(\cdot|x),\ \widehat{Q}_t(x, \cdot) - Q_t^{\pi_t}(x, \cdot)\rangle$ measures the amount of overestimation of $\widehat{Q}_t$ related to $\pi^\star$, which can be bounded by $O(H/\eta)$ since $\widehat{Q}_t$ is an underestimator (Lemma C.2);

• REG-TERM $= \sum_x q^\star(x)\sum_t\langle\pi_t(\cdot|x) - \pi^\star(\cdot|x),\ \widehat{Q}_t(x, \cdot) - B_t(x, \cdot)\rangle$ is directly controlled by the multiplicative weight update, and is bounded by $\sum_t\sum_{x,a} q^\star(x)\pi_t(a|x)\Big(\frac{\gamma H}{\overline{q}_t(x,a)+\gamma} + \frac{B_t(x,a)}{H}\Big) + O(H/\eta)$ with high probability (Lemma C.3).

Combining all of the above with the definition of $b_t$ proves the key Eq. (5) (with the o(T) term being $O(H/\eta)$).
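For each fixed x and t (dropping the arguments and the $q^\star(x)$ weights for brevity), the decomposition above is just the elementary identity
$$\langle \pi_t - \pi^\star,\ Q^{\pi_t}_t - B_t\rangle = \underbrace{\langle \pi_t,\ Q^{\pi_t}_t - \widehat{Q}_t\rangle}_{\textsc{bias-1}} + \underbrace{\langle \pi^\star,\ \widehat{Q}_t - Q^{\pi_t}_t\rangle}_{\textsc{bias-2}} + \underbrace{\langle \pi_t - \pi^\star,\ \widehat{Q}_t - B_t\rangle}_{\textsc{reg-term}},$$
which can be verified by expanding the three inner products; weighting by $q^\star(x)$ and summing over x and t gives the three terms bounded by Lemmas C.1–C.3.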

5 The Linear-Q Case

In this section, we move on to the more challenging setting where the number of states might be infinite, and function approximation is used to generalize the learner's experience to unseen states. We consider the most basic linear function approximation scheme, where for any π, the Q-function $Q_t^\pi(x, a)$ is linear in some known feature vector $\phi(x, a)$, formally stated below.

Assumption 1 (Linear-Q). Let $\phi(x, a) \in \mathbb{R}^d$ be a known feature vector of the state-action pair (x, a). We assume that for any episode t, policy π, and layer h, there exists an unknown weight vector $\theta_{t,h}^\pi \in \mathbb{R}^d$ such that for all $(x, a) \in X_h\times A$, $Q_t^\pi(x, a) = \phi(x, a)^\top\theta_{t,h}^\pi$. Without loss of generality, we assume $\|\phi(x, a)\| \le 1$ for all (x, a) and $\|\theta_{t,h}^\pi\| \le \sqrt{d}H$ for all t, h, π.

For justification on the last condition on norms, see [Wei et al., 2021, Lemma 8]. This linear-Q assumption has

been made in several recent works with stationary losses [Abbasi-Yadkori et al., 2019, Wei et al., 2021] and also

in [Neu and Olkhovskaya, 2020] with the same adversarial losses.³ It is weaker than the linear MDP assumption (see Section 6), as it does not impose explicit structural requirements on the loss and transition functions. Due to this generality,

however, our algorithm also requires access to a simulator to obtain samples drawn from the transition, formally stated

below.

Assumption 2 (Simulator). The learner has access to a simulator, which takes a state-action pair (x, a) ∈ X × A as

input, and generates a random outcome of the next state x′ ∼ P (·|x, a).

Note that this assumption is also made by [Neu and Olkhovskaya, 2020] and by earlier works with stationary losses (see e.g., [Azar et al., 2012, Sidford et al., 2018]).⁴ In this setting, we propose a new policy optimization algorithm with O(T^{2/3}) regret. See Algorithm 2 for the pseudocode.

Algorithm design. The algorithm still follows the multiplicative weight update Eq. (11) in each state $x \in X_h$ (for some h), but now with $\phi(x, a)^\top\widehat{\theta}_{t,h}$ as an estimator of $Q_t^{\pi_t}(x, a) = \phi(x, a)^\top\theta_{t,h}^{\pi_t}$, and BONUS(t, x, a) as the dilated bonus $B_t(x, a)$. Specifically, the construction of the weight estimator $\widehat{\theta}_{t,h}$ follows the idea of [Neu and Olkhovskaya, 2020] (which itself is based on the linear bandit literature) and is defined in Eq. (12) as $\widehat{\Sigma}_{t,h}^{+}\phi(x_{t,h}, a_{t,h})L_{t,h}$. Here, $\widehat{\Sigma}_{t,h}^{+}$ is an ε-accurate estimator of $(\gamma I + \Sigma_{t,h})^{-1}$, where γ is a small parameter and $\Sigma_{t,h} = \mathbb{E}_t[\phi(x_{t,h}, a_{t,h})\phi(x_{t,h}, a_{t,h})^\top]$ is the covariance matrix for layer h under policy $\pi_t$; $L_{t,h} = \sum_{i=h}^{H-1}\ell_t(x_{t,i}, a_{t,i})$ is again the loss suffered by the learner from layer h onward, whose conditional expectation is $Q_t^{\pi_t}(x_{t,h}, a_{t,h}) = \phi(x_{t,h}, a_{t,h})^\top\theta_{t,h}^{\pi_t}$. Therefore, as γ and ε approach 0, one sees that $\widehat{\theta}_{t,h}$ is indeed an unbiased estimator of $\theta_{t,h}^{\pi_t}$. We adopt the GEOMETRICRESAMPLING procedure (see Algorithm 4) of [Neu and Olkhovskaya, 2020] to compute $\widehat{\Sigma}_{t,h}^{+}$, which requires calling the simulator multiple times.

³The assumption in [Neu and Olkhovskaya, 2020] is stated slightly differently (e.g., their feature vectors are independent of the action). However, it is straightforward to verify that the two versions are equivalent.
⁴The simulator required by Neu and Olkhovskaya [2020] is in fact slightly weaker than ours and those from earlier works: it only needs to be able to generate a trajectory starting from x0 for any policy.
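Given an estimate $\widehat{\Sigma}^+_{t,h}$ of $(\gamma I + \Sigma_{t,h})^{-1}$, the weight estimator of Eq. (12) is a single matrix-vector product. A minimal sketch with our own argument names:

```python
def q_weight_estimate(Sigma_plus, phi_visited, L_th):
    """Weight estimator of Eq. (12) for layer h (sketch).

    Sigma_plus  : (d, d) estimate of (gamma * I + Sigma_{t,h})^{-1}
    phi_visited : (d,) feature vector of the (x_{t,h}, a_{t,h}) actually visited
    L_th        : loss suffered from layer h to the end of the episode
    """
    # E_t[phi * L_th] = Sigma_{t,h} theta^{pi_t}_{t,h}, so as gamma and eps tend to 0
    # the product below becomes a (nearly) unbiased estimate of theta^{pi_t}_{t,h}.
    return Sigma_plus @ (phi_visited * L_th)
```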


Algorithm 2 Policy Optimization with Dilated Bonuses (Linear-Q Case)

Parameters: $\gamma, \beta, \eta, \epsilon \in (0, \frac{1}{2})$, $M = \big\lceil\frac{24\ln(dHT)}{\epsilon^2\gamma^2}\big\rceil$, $N = \big\lceil\frac{2}{\gamma}\ln\frac{1}{\epsilon\gamma}\big\rceil$.
for $t = 1, 2, \ldots, T$ do
  Step 1: Interact with the environment. Execute $\pi_t$, which is defined such that for each $x \in X_h$,
  $$\pi_t(a|x) \propto \exp\Big(-\eta\sum_{\tau=1}^{t-1}\big(\phi(x, a)^\top\widehat{\theta}_{\tau,h} - \text{BONUS}(\tau, x, a)\big)\Big), \tag{11}$$
  and obtain trajectory $\{(x_{t,h}, a_{t,h}, \ell_t(x_{t,h}, a_{t,h}))\}_{h=0}^{H-1}$.
  Step 2: Construct covariance matrix inverse estimators. Collect MN trajectories using the simulator and $\pi_t$. Let $\mathcal{T}_t$ be the set of trajectories. Compute
  $$\{\widehat{\Sigma}_{t,h}^{+}\}_{h=0}^{H-1} = \text{GEOMETRICRESAMPLING}(\mathcal{T}_t, M, N, \gamma). \quad \text{(see Algorithm 4)}$$
  Step 3: Construct Q-function weight estimators. For $h = 0, \ldots, H-1$, compute
  $$\widehat{\theta}_{t,h} = \widehat{\Sigma}_{t,h}^{+}\phi(x_{t,h}, a_{t,h})L_{t,h}, \quad \text{where } L_{t,h} = \sum_{i=h}^{H-1}\ell_t(x_{t,i}, a_{t,i}). \tag{12}$$

Algorithm 3 BONUS(t, x, a)

if BONUS(t, x, a) has been called before then return the value of BONUS(t, x, a) calculated last time.
Let h be such that $x \in X_h$. if h = H then return 0.
Compute $\pi_t(\cdot|x)$, defined in Eq. (11) (which involves recursive calls to BONUS for smaller t).
Get a sample of the next state $x' \leftarrow \text{SIMULATOR}(x, a)$.
Compute $\pi_t(\cdot|x')$ (again, defined in Eq. (11)), and sample an action $a' \sim \pi_t(\cdot|x')$.
return $\beta\|\phi(x, a)\|^2_{\widehat{\Sigma}_{t,h}^{+}} + \mathbb{E}_{j\sim\pi_t(\cdot|x)}\big[\beta\|\phi(x, j)\|^2_{\widehat{\Sigma}_{t,h}^{+}}\big] + \big(1 + \frac{1}{H}\big)\text{BONUS}(t, x', a')$.

Next, we explain the design of the dilated bonus. Again, following the general principle discussed in Section 3, we identify $b_t(x, a)$ in this case as $\beta\|\phi(x, a)\|^2_{\widehat{\Sigma}_{t,h}^{+}} + \mathbb{E}_{j\sim\pi_t(\cdot|x)}\big[\beta\|\phi(x, j)\|^2_{\widehat{\Sigma}_{t,h}^{+}}\big]$ for some parameter β > 0. Further following the dilated Bellman equation Eq. (4), we thus define BONUS(t, x, a) recursively as in the last line of Algorithm 3, where we replace the expectation $\mathbb{E}_{(x',a')}[\text{BONUS}(t, x', a')]$ with a single sample for efficient implementation.

However, even more care is needed to actually implement the algorithm. First, since the state space is potentially

infinite, one cannot actually calculate and store the value of BONUS(t, x, a) for all (x, a), but can only calculate them

on-the-fly when needed. Moreover, unlike the estimators for Qπt

t (x, a), which can be succinctly represented and stored

via the weight estimator θt,h, this is not possible for BONUS(t, x, a) due to the lack of any structure. Even worse,

the definition of BONUS(t, x, a) itself depends on πt(·|x) and also πt(·|x′) for the afterstate x′, which, according to

Eq. (11), further depends on BONUS(τ, x, a) for τ < t, resulting in a complicated recursive structure. This is also why

we present it as a procedure in Algorithm 3 (instead of Bt(x, a)). In total, this leads to (TAH)O(H) number of calls to

the simulator. Whether this can be improved is left as a future direction.
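A minimal memoized sketch of the recursion in Algorithm 3. The helpers pi, simulator, features, layer_of, and Sigma_plus are hypothetical stand-ins for quantities the algorithm maintains (they are not the paper's interface), and the recursive dependence of pi on earlier BONUS values is assumed to be handled inside pi.

```python
import numpy as np

_cache = {}

def bonus(t, x, a, *, pi, simulator, features, layer_of, Sigma_plus, beta, H, num_actions):
    """Dilated bonus BONUS(t, x, a) with memoization (sketch of Algorithm 3)."""
    if (t, x, a) in _cache:                    # reuse values computed in earlier calls
        return _cache[(t, x, a)]
    h = layer_of(x)
    if h == H:                                 # terminal layer: no bonus
        return 0.0
    S = Sigma_plus[t][h]
    sq_norm = lambda f: beta * f @ S @ f       # beta * ||phi||^2 in the Sigma_plus norm
    p = pi(t, x)                               # pi_t(.|x)
    x_next = simulator(x, a)                   # one sampled next state
    a_next = np.random.choice(num_actions, p=pi(t, x_next))
    val = sq_norm(features(x, a)) \
        + sum(p[j] * sq_norm(features(x, j)) for j in range(num_actions)) \
        + (1.0 + 1.0 / H) * bonus(t, x_next, a_next, pi=pi, simulator=simulator,
                                  features=features, layer_of=layer_of,
                                  Sigma_plus=Sigma_plus, beta=beta, H=H,
                                  num_actions=num_actions)
    _cache[(t, x, a)] = val
    return val
```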

Regret guarantee. By showing that Eq. (5) holds in expectation for our algorithm, we obtain the following regret guarantee (see Appendix E for the proof).

Theorem 5.1. Under Assumption 1 and Assumption 2, with appropriate choices of the parameters γ, β, η, ε, Algorithm 2 ensures $\mathbb{E}[\text{Reg}] = O\big(H^2(dT)^{2/3}\big)$ (the dependence on |A| is only logarithmic).

This matches the $O(T^{2/3})$ regret of [Neu and Olkhovskaya, 2020, Theorem 1], without the need of their assumption, which essentially says that the learner is given an exploratory policy to start with.⁵ To our knowledge, this is the first no-regret algorithm for the linear-Q setting (with adversarial losses and bandit feedback) when no exploratory assumptions are made.

Algorithm 4 GEOMETRICRESAMPLING($\mathcal{T}, M, N, \gamma$)

Denote the MN trajectories in $\mathcal{T}$ as $\{(x_{i,0}, a_{i,0}, \ldots, x_{i,H-1}, a_{i,H-1})\}_{i=1,\ldots,MN}$. Let $c = \frac{1}{2}$.
for $m = 1, \ldots, M$ do
  for $n = 1, \ldots, N$ do
    $i = (m-1)N + n$.
    For all h, compute $Y_{n,h} = \gamma I + \phi(x_{i,h}, a_{i,h})\phi(x_{i,h}, a_{i,h})^\top$.
    For all h, compute $Z_{n,h} = \prod_{j=1}^{n}(I - cY_{j,h})$.
  For all h, set $\widehat{\Sigma}_h^{+(m)} = cI + c\sum_{n=1}^{N} Z_{n,h}$.
For all h, set $\widehat{\Sigma}_h^{+} = \frac{1}{M}\sum_{m=1}^{M}\widehat{\Sigma}_h^{+(m)}$.
return $\widehat{\Sigma}_h^{+}$ for all $h = 0, \ldots, H-1$.

6 The Linear MDP Case

To remove the need of a simulator, we further consider the linear MDP case, a special case of the linear-Q setting. It is

equivalent to Assumption 1 plus the extra assumption that the transition function also has a low-rank structure, formally

stated below.

Assumption 3 (Linear MDP). The MDP satisfies Assumption 1, and for any h and $x' \in X_{h+1}$, there exists an unknown weight vector $\nu_h^{x'} \in \mathbb{R}^d$ such that $P(x'|x, a) = \phi(x, a)^\top\nu_h^{x'}$ for all $(x, a) \in X_h\times A$.

There has been a surge of works studying this setting, with [Cai et al., 2020] being the closest to ours. They achieve O(√T)

regret but require full-information feedback of the loss functions, and there are no existing results for the bandit feedback

setting, except for a concurrent work [Neu and Olkhovskaya, 2021] which assumes perfect knowledge of the transition

and an exploratory condition. We propose the first algorithm with sublinear regret for this problem with unknown

transition and bandit feedback, shown in Algorithm 5. The structure of Algorithm 5 is similar to that of Algorithm 2,

but importantly with the following modifications.

A succinct representation of dilated bonuses. Our definition of $b_t$ remains the same as in the linear-Q case. However, due to the low-rank transition structure of linear MDPs, we are now able to efficiently construct estimators of $B_t(x, a)$ even for unseen state-action pairs using function approximation, bypassing the requirement of a simulator. Specifically, observe that according to Eq. (4), for each $x \in X_h$, under Assumption 3, $B_t(x, a)$ can be written as $b_t(x, a) + \phi(x, a)^\top\Lambda_{t,h}^{\pi_t}$, where $\Lambda_{t,h}^{\pi_t} = \big(1 + \frac{1}{H}\big)\int_{x'\in X_{h+1}}\mathbb{E}_{a'\sim\pi_t(\cdot|x')}\big[B_t(x', a')\big]\nu_h^{x'}\,dx'$ is a vector independent of (x, a). Thus, following the same idea of using $\widehat{\theta}_{t,h}$ to estimate $\theta_{t,h}^{\pi_t}$ as we did in Algorithm 2, we can construct $\widehat{\Lambda}_{t,h}$ to estimate $\Lambda_{t,h}^{\pi_t}$ as well, thus succinctly representing $B_t(x, a)$ for all (x, a).

Epoch schedule. Recall that estimating $\theta_{t,h}^{\pi_t}$ (and thus also $\Lambda_{t,h}^{\pi_t}$) requires constructing the covariance matrix inverse estimate $\widehat{\Sigma}_{t,h}^{+}$. Due to the lack of a simulator, another important change of the algorithm is to construct $\widehat{\Sigma}_{t,h}^{+}$ from online samples. To do so, we divide the entire horizon (or more accurately the last $T - T_0$ rounds, since the first $T_0$ rounds are reserved for another purpose discussed next) into epochs of equal length W, and only update the policy optimization algorithm at the beginning of each epoch. We index an epoch by k, and thus $\theta_{t,h}^{\pi_t}$, $\Lambda_{t,h}^{\pi_t}$, $\widehat{\Sigma}_{t,h}^{+}$ are now denoted by $\theta_{k,h}^{\pi_k}$, $\Lambda_{k,h}^{\pi_k}$, $\widehat{\Sigma}_{k,h}^{+}$. Within an epoch, we keep executing the same policy $\pi_k$ (up to a small exploration probability $\delta_e$) and collect W trajectories, which are then used to construct $\widehat{\Sigma}_{k,h}^{+}$ as well as estimates of $\theta_{k,h}^{\pi_k}$ and $\Lambda_{k,h}^{\pi_k}$. To decouple their dependence, we uniformly at random partition these W trajectories into two sets S and S′ of equal size, and use data from S to construct $\widehat{\Sigma}_{k,h}^{+}$ in Step 2 via the same GEOMETRICRESAMPLING procedure, and data from S′ to construct $\widehat{\theta}_{k,h}$ and $\widehat{\Lambda}_{k,h}$ in Step 3 and Step 4, respectively.

5Under an even stronger assumption that every policy is exploratory, they also improve the regret to O(√T ); see [Neu and Olkhovskaya, 2020,

Theorem 2].


Algorithm 5 Policy Optimization with Dilated Bonuses (Linear MDP Case)

Parameters: $\gamma, \beta, \eta, \epsilon, \delta_e \in (0, \frac{1}{2})$, $\delta$, $M = \big\lceil\frac{96\ln(dHT)}{\epsilon^2\gamma^2}\big\rceil$, $N = \big\lceil\frac{2}{\gamma}\ln\frac{1}{\epsilon\gamma}\big\rceil$, $W = 2MN$, $\alpha = \frac{\delta_e}{6\beta}$, $M_0 = \lceil\alpha^2 dH^2\rceil$, $N_0 = \frac{100M_0^4\log(T/\delta)}{\alpha^2}$, $T_0 = M_0N_0$.
Construct a mixture policy $\pi_{\mathrm{cov}}$ and its estimated covariance matrices (which requires interacting with the environment for the first $T_0$ rounds using Algorithm 6):
$$\pi_{\mathrm{cov}},\ \{\widehat{\Sigma}^{\mathrm{cov}}_h\}_{h=0,\ldots,H-1} \leftarrow \text{POLICYCOVER}(M_0, N_0, \alpha, \delta).$$
Define the known state set $\mathcal{K} = \big\{x \in X:\ \forall a \in A,\ \|\phi(x, a)\|^2_{(\widehat{\Sigma}^{\mathrm{cov}}_h)^{-1}} \le \alpha$, where h is such that $x \in X_h\big\}$.
for $k = 1, 2, \ldots, (T - T_0)/W$ do
  Step 1: Interact with the environment. Define $\pi_k$ as follows: for $x \in X_h$,
  $$\pi_k(a|x) \propto \exp\Big(-\eta\sum_{\tau=1}^{k-1}\big(\phi(x, a)^\top\widehat{\theta}_{\tau,h} - \phi(x, a)^\top\widehat{\Lambda}_{\tau,h} - b_\tau(x, a)\big)\Big), \tag{13}$$
  where $b_\tau(x, a) = \Big(\beta\|\phi(x, a)\|^2_{\widehat{\Sigma}^{+}_{\tau,h}} + \beta\,\mathbb{E}_{a'\sim\pi_\tau(\cdot|x)}\big[\|\phi(x, a')\|^2_{\widehat{\Sigma}^{+}_{\tau,h}}\big]\Big)\mathbb{1}[x \in \mathcal{K}]$.
  Randomly partition $\{T_0 + (k-1)W + 1, \ldots, T_0 + kW\}$ into two parts S and S′ such that $|S| = |S'| = W/2$.
  for $t = T_0 + (k-1)W + 1, \ldots, T_0 + kW$ do
    Draw $Y_t \sim \text{BERNOULLI}(\delta_e)$.
    if $Y_t = 1$ then
      if $t \in S$ then execute $\pi_{\mathrm{cov}}$.
      else draw $h^*_t$ uniformly from $\{0, \ldots, H-1\}$; execute $\pi_{\mathrm{cov}}$ in steps $0, \ldots, h^*_t - 1$ and $\pi_k$ in steps $h^*_t, \ldots, H-1$.
    else execute $\pi_k$.
    Collect trajectory $\{(x_{t,h}, a_{t,h}, \ell_t(x_{t,h}, a_{t,h}))\}_{h=0}^{H-1}$.
  Step 2: Construct inverse covariance matrix estimators. Let $\mathcal{T}_k = \{(x_{t,0}, a_{t,0}, \ldots, x_{t,H-1}, a_{t,H-1})\}_{t\in S}$ (the trajectories in S) and compute
  $$\{\widehat{\Sigma}^{+}_{k,h}\}_{h=0}^{H-1} = \text{GEOMETRICRESAMPLING}(\mathcal{T}_k, M, N, \gamma). \tag{14}$$
  Step 3: Construct Q-function weight estimators. Compute for all h (with $L_{t,h} = \sum_{i=h}^{H-1}\ell_t(x_{t,i}, a_{t,i})$):
  $$\widehat{\theta}_{k,h} = \widehat{\Sigma}^{+}_{k,h}\Big(\frac{1}{|S'|}\sum_{t\in S'}\big((1 - Y_t) + Y_t H\,\mathbb{1}[h = h^*_t]\big)\phi(x_{t,h}, a_{t,h})L_{t,h}\Big). \tag{15}$$
  Step 4: Construct bonus function weight estimators. Compute for all h:
  $$\widehat{\Lambda}_{k,h} = \widehat{\Sigma}^{+}_{k,h}\Big(\frac{1}{|S'|}\sum_{t\in S'}\big((1 - Y_t) + Y_t H\,\mathbb{1}[h = h^*_t]\big)\phi(x_{t,h}, a_{t,h})D_{t,h}\Big), \tag{16}$$
  where $D_{t,h} = \sum_{i=h+1}^{H-1}\big(1 + \frac{1}{H}\big)^{i-h}b_k(x_{t,i}, a_{t,i})$.

Exploration with a policy cover. Unfortunately, some technical difficulty arises when bounding the estimation error and the variance of $\widehat{\Lambda}_{k,h}$. Specifically, they can be large if the magnitude of the bonus term $b_k(x, a)$ is large for some (x, a); furthermore, since $\widehat{\Lambda}_{k,h}$ is constructed from empirical samples, its variance can be even larger in those directions of the feature space that are rarely visited. Overall, due to the combined effect of these two facts, we are unable to prove any sublinear regret with only the ideas described so far.

To address this issue, we adopt the idea of a policy cover, recently introduced in [Agarwal et al., 2020a, Zanette et al., 2021]. Specifically, we spend the first $T_0$ rounds finding an exploratory (mixture) policy $\pi_{\mathrm{cov}}$ (called the policy cover) that tends to reach all possible directions of the feature space. This is done via the procedure POLICYCOVER (Algorithm 6), to be discussed in detail soon, which also returns $\widehat{\Sigma}^{\mathrm{cov}}_h$ for each layer h, an estimator of the true covariance matrix $\Sigma^{\mathrm{cov}}_h$ of the policy cover $\pi_{\mathrm{cov}}$. POLICYCOVER guarantees that with high probability, for any policy π and any h, we have
$$\Pr_{x_h\sim\pi}\Big[\exists a,\ \|\phi(x_h, a)\|^2_{(\widehat{\Sigma}^{\mathrm{cov}}_h)^{-1}} \ge \alpha\Big] \le O\Big(\frac{dH}{\alpha}\Big), \tag{17}$$

where $x_h \in X_h$ is sampled from executing π; see Lemma D.4. This motivates us to focus only on states x such that $\|\phi(x, a)\|^2_{(\widehat{\Sigma}^{\mathrm{cov}}_h)^{-1}} \le \alpha$ for all a (where h is the layer to which x belongs). This does not incur much regret because no policy visits other states often enough. We call such a state a known state and denote by $\mathcal{K}$ the set of all known states. To implement the idea above, we simply introduce an indicator $\mathbb{1}[x \in \mathcal{K}]$ in the definition of $b_k$ (that is, no bonus at all for unknown states).
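Membership in the known set K is a simple quadratic-form test. A minimal sketch (the helper names features and actions are ours):

```python
def is_known(x, features, actions, Sigma_cov_inv_h, alpha):
    """Return True if x is a 'known' state for its layer h, i.e.
    ||phi(x, a)||^2 in the (Sigma_cov_h)^{-1} norm is at most alpha for every action a."""
    return all(features(x, a) @ Sigma_cov_inv_h @ features(x, a) <= alpha for a in actions)
```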

The benefit of doing so is that the aforementioned issue of $b_k(x, a)$ having a large magnitude is alleviated, as long as we explore using $\pi_{\mathrm{cov}}$ with some small probability in each episode. Specifically, in each episode of epoch k, with probability $1 - \delta_e$ we execute the policy $\pi_k$ suggested by policy optimization, and otherwise we explore using $\pi_{\mathrm{cov}}$. The way we explore differs slightly between episodes in S and those in S′ (recall that an epoch is partitioned evenly into S and S′, where S is used to estimate $\widehat{\Sigma}^{+}_{k,h}$ and S′ is used to estimate $\theta_{k,h}^{\pi_k}$ and $\Lambda_{k,h}^{\pi_k}$). For an episode in S, we simply explore by executing $\pi_{\mathrm{cov}}$ for the entire episode, so that $\widehat{\Sigma}^{+}_{k,h}$ is an estimate of the inverse of $\gamma I + \delta_e\Sigma^{\mathrm{cov}}_h + (1 - \delta_e)\mathbb{E}_{(x_h, a)\sim\pi_k}[\phi(x_h, a)\phi(x_h, a)^\top]$, and thus by its definition $b_k(x, a)$ is bounded by roughly $\frac{\alpha\beta}{\delta_e}$ for all (x, a) (this improves over the trivial bound $\frac{\beta}{\gamma}$ by our choice of parameters; see Lemma F.1). On the other hand, for an episode in S′, we first draw a step $h^*_t$ uniformly at random, then execute $\pi_{\mathrm{cov}}$ for the first $h^*_t$ steps and continue with $\pi_k$ for the rest. This leads to a slightly different form of the estimators $\widehat{\theta}_{k,h}$ and $\widehat{\Lambda}_{k,h}$ compared to Eq. (12) (see Eq. (15) and Eq. (16), where the definition of $D_{t,h}$ is in light of Eq. (4)), which is important to ensure their (almost) unbiasedness. This also concludes the description of Step 1.

We note that the idea of dividing states into known and unknown parts is related to those of [Agarwal et al., 2020a,

Zanette et al., 2021]. However, our case is more challenging because we are only allowed to mix a small amount of πcov

into our policy in order to get sublinear regret against an adversary, while their algorithms can always start by executing

πcov in each episode to maximally explore the feature space.

Constructing the policy cover. Finally, we describe how Algorithm 6 finds a policy cover $\pi_{\mathrm{cov}}$. It is a procedure very similar to Algorithm 1 of [Wang et al., 2020]. Note that the focus of Wang et al. [2020] is reward-free exploration in linear MDPs, but it turns out that the same idea can be used for our purpose, and it is also related to the exploration strategy introduced in [Agarwal et al., 2020a, Zanette et al., 2021].

More specifically, POLICYCOVER interacts with the environment for $T_0 = M_0N_0$ rounds. At the beginning of episode $(m-1)N_0 + 1$ for every $m = 1, \ldots, M_0$, it computes a policy $\pi_m$ using the LSVI-UCB algorithm of [Jin et al., 2020b], but with a fake reward function, Eq. (18), ignoring the true loss feedback from the environment. This fake reward function is designed to encourage the learner to explore unseen state-action pairs and to eventually ensure Eq. (17). For this purpose, we could have set the fake reward for (x, a) to be $\mathbb{1}\big[\|\phi(x, a)\|^2_{\Gamma^{-1}_{m,h}} \ge \frac{\alpha}{2M_0}\big]$. However, for technical reasons the analysis requires the reward function to be Lipschitz, and thus we approximate this indicator using a ramp function (with a large slope T). With $\pi_m$ in hand, the algorithm then interacts with the environment for $N_0$ episodes, collecting trajectories to construct a good estimator of the covariance matrix of $\pi_m$. The design of the fake reward function and this extra step of covariance estimation are the only differences compared to Algorithm 1 of [Wang et al., 2020]. At the end of the procedure, POLICYCOVER constructs $\pi_{\mathrm{cov}}$ as a uniform mixture of $\{\pi_m\}_{m=1,\ldots,M_0}$. This means that whenever we execute $\pi_{\mathrm{cov}}$, we first sample $m \in [M_0]$ uniformly at random, and then execute the (pure) policy $\pi_m$.
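The ramp-smoothed exploration reward used in Eq. (18) is easy to state in code. The sketch below parametrizes the threshold (which the algorithm sets proportional to α/M0) rather than hard-coding it, since only the Lipschitz shape matters for the discussion here; the argument names are ours.

```python
def ramp(y, z):
    """ramp_z(y): 0 for y <= -z, 1 for y >= 0, and y/z + 1 in between (slope 1/z)."""
    if y <= -z:
        return 0.0
    if y >= 0.0:
        return 1.0
    return y / z + 1.0

def fake_reward(phi_xa, Gamma_inv, threshold, T):
    """Exploration reward in the spirit of Eq. (18): a slope-T ramp that smoothly
    approximates the indicator of ||phi(x, a)||^2_{Gamma^{-1}} >= threshold."""
    return ramp(phi_xa @ Gamma_inv @ phi_xa - threshold, 1.0 / T)
```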

Regret guarantee. With all these elements, we successfully remove the need for a simulator and prove the following regret guarantee.

Theorem 6.1. Under Assumption 3, Algorithm 5 with appropriate choices of the parameters ensures $\mathbb{E}[\text{Reg}] = O\big(d^2H^4T^{14/15}\big)$.


Algorithm 6 POLICYCOVER($M_0, N_0, \alpha, \delta$)

Let $\xi = 60dH\sqrt{\log(T/\delta)}$.
Let $\Gamma_{1,h}$ be the identity matrix in $\mathbb{R}^{d\times d}$ for all h.
for $m = 1, \ldots, M_0$ do
  Let $V_m(x_H) = 0$.
  for $h = H-1, H-2, \ldots, 0$ do
    For all $(x, a) \in X_h\times A$, compute
    $$Q_m(x, a) = \min\big\{r_m(x, a) + \xi\|\phi(x, a)\|_{\Gamma^{-1}_{m,h}} + \phi(x, a)^\top\theta_{m,h},\ H\big\}, \qquad V_m(x) = \max_{a'} Q_m(x, a'),$$
    $$\pi_m(a|x) = \mathbb{1}\Big[a = \operatorname*{argmax}_{a'} Q_m(x, a')\Big] \quad \text{(break ties in the argmax arbitrarily)},$$
    with
    $$r_m(x, a) = \mathrm{ramp}_{\frac{1}{T}}\Big(\|\phi(x, a)\|^2_{\Gamma^{-1}_{m,h}} - \frac{\alpha}{2M_0}\Big), \tag{18}$$
    $$\theta_{m,h} = \Gamma^{-1}_{m,h}\cdot\frac{1}{N_0}\sum_{t=1}^{(m-1)N_0}\phi(x_{t,h}, a_{t,h})V_m(x_{t,h+1}),$$
    where $\mathrm{ramp}_z(y) = 0$ if $y \le -z$, $1$ if $y \ge 0$, and $\frac{y}{z} + 1$ if $-z < y < 0$.
  for $t = (m-1)N_0 + 1, \ldots, mN_0$ do
    Execute $\pi_m$ in episode t and collect trajectory $\{x_{t,h}, a_{t,h}\}_{h=0}^{H-1}$.
  Compute $\Gamma_{m+1,h} = \Gamma_{m,h} + \frac{1}{N_0}\sum_{t=(m-1)N_0+1}^{mN_0}\phi(x_{t,h}, a_{t,h})\phi(x_{t,h}, a_{t,h})^\top$.
Let $\pi_{\mathrm{cov}} = \text{UNIFORM}\big(\{\pi_m\}_{m=1}^{M_0}\big)$ and $\widehat{\Sigma}^{\mathrm{cov}}_h = \frac{1}{M_0}\Gamma_{M_0+1,h}$ for all h.
return $\pi_{\mathrm{cov}}$ and $\{\widehat{\Sigma}^{\mathrm{cov}}_h\}_{h=0,\ldots,H-1}$.

Although our regret rate is significantly worse than that in the full-information setting [Cai et al., 2020], in the

stochastic setting [Zanette et al., 2021], or in the case when the transition is known [Neu and Olkhovskaya, 2021], we

emphasize again that our algorithm is the first with a provable sublinear regret guarantee for this challenging adversarial

setting with bandit feedback and unknown transition.

7 Conclusions and Future Directions

In this work, we propose the general idea of dilated bonuses and demonstrate how it leads to improved exploration and

regret bounds for policy optimization in various settings. One future direction is to further improve our results in the

function approximation setting, including reducing the number of simulator calls in the linear-Q setting and improving

the regret bound for the linear MDP setting (which is currently far from optimal). A potential idea for the latter is

to reuse data across different epochs, an idea adopted by several recent works [Zanette et al., 2021, Lazic et al., 2021]

for different problems. Another key future direction is to investigate whether the idea of dilated bonuses is applicable

beyond the finite-horizon setting (e.g., whether it is applicable to the more general stochastic shortest path model or the infinite-horizon setting).

Acknowledgments We thank Gergely Neu and Julia Olkhovskaya for discussions on the technical details of their

GEOMETRICRESAMPLING procedure.

References

Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret

bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–

3702. PMLR, 2019.

Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. PC-PG: Policy cover directed exploration for provable

policy gradient learning. arXiv preprint arXiv:2007.08459, 2020a.

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy

gradient methods in Markov decision processes. In Conference on Learning Theory, pages 64–66. PMLR, 2020b.

Mohammad Gheshlaghi Azar, Rémi Munos, and Bert Kappen. On the sample complexity of reinforcement learning

with a generative model. In International Conference on Machine Learning, 2012.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In

International Conference on Machine Learning, pages 263–272. PMLR, 2017.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with

supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, 2011.

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In Interna-

tional Conference on Machine Learning, pages 1283–1294. PMLR, 2020.

Liyu Chen and Haipeng Luo. Finding the stochastic shortest path with low regret: The adversarial cost and unknown

transition case. In International Conference on Machine Learning, 2021.

Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and

known transition. In Conference On Learning Theory, 2021.

Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. arXiv

preprint arXiv:1510.08906, 2015.

Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311, 2019.

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-

exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1578–1586. PMLR,

2018.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of

Machine Learning Research, 11(4), 2010.

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in

neural information processing systems, pages 4863–4873, 2018.

Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial Markov decision processes

with bandit feedback and unknown transition. In International Conference on Machine Learning, 2020a.

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear

function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020b.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.


Sham M Kakade. A natural policy gradient. Advances in neural information processing systems, 14, 2001.

Nevena Lazic, Dong Yin, Yasin Abbasi-Yadkori, and Csaba Szepesvari. Improved regret bound and experience replay

in regularized policy iteration. arXiv preprint arXiv:2102.12611, 2021.

Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Bias no more: high-probability data-dependent

regret bounds for adversarial bandits and MDPs. Advances in Neural Information Processing Systems, 2020.

Sergey Levine and Vladlen Koltun. Guided policy search. In International conference on machine learning, pages 1–9.

PMLR, 2013.

Haipeng Luo. Lecture 2, introduction to online learning, 2017. Available at

https://haipeng-luo.net/courses/CSCI699/lecture2.pdf.

Lingsheng Meng and Bing Zheng. The optimal perturbation bounds of the Moore–Penrose inverse under the Frobenius

norm. Linear algebra and its applications, 432(4):956–963, 2010.

Gergely Neu and Julia Olkhovskaya. Online learning in MDPs with linear function approximation and bandit feedback.

arXiv preprint arXiv:2007.01612v1, 2020.

Gergely Neu and Julia Olkhovskaya. Online learning in MDPs with linear function approximation and bandit feedback.

arXiv preprint arXiv:2007.01612v2, 2021.

Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial Markov decision processes. In Pro-

ceedings of the 36th International Conference on Machine Learning, 2019.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algo-

rithms. arXiv preprint arXiv:1707.06347, 2017.

Lior Shani, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor. Optimistic policy optimization with bandit feedback.

In International Conference on Machine Learning, pages 8604–8613. PMLR, 2020.

Aaron Sidford, Mengdi Wang, Xian Wu, Lin F Yang, and Yinyu Ye. Near-optimal time and sample complexities for

solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems,

pages 5192–5202, 2018.

Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12

(4):389–434, 2012.

Ruosong Wang, Simon S Du, Lin F Yang, and Ruslan Salakhutdinov. On reward-free reinforcement learning with linear

function approximation. arXiv preprint arXiv:2006.11274, 2020.

Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement

learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine

Learning, pages 10170–10180. PMLR, 2020.

Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, and Rahul Jain. Learning infinite-horizon average-reward MDPs

with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3007–

3015. PMLR, 2021.

Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In

International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without

domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–

7312. PMLR, 2019.

Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret

bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statis-

tics, pages 1954–1964. PMLR, 2020.

Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimization and exploration with

linear function approximation. arXiv preprint arXiv:2103.12923, 2021.


Appendix

A Auxiliary Lemmas

In this section, we list auxiliary lemmas that are useful in our analysis. First, we show some concentration inequalities.

Lemma A.1 ((A special form of) Freedman's inequality, Theorem 1 of [Beygelzimer et al., 2011]). Let $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_n$ be a filtration, and let $X_1, \ldots, X_n$ be real random variables such that $X_i$ is $\mathcal{F}_i$-measurable, $\mathbb{E}[X_i|\mathcal{F}_{i-1}] = 0$, $|X_i| \le b$, and $\sum_{i=1}^{n}\mathbb{E}[X_i^2|\mathcal{F}_{i-1}] \le V$ for some fixed $b \ge 0$ and $V \ge 0$. Then for any $\delta \in (0, 1)$, we have with probability at least $1 - \delta$,
$$\sum_{i=1}^{n} X_i \le \frac{V}{b} + b\log(1/\delta).$$

Throughout the appendix, we let Ft be the σ-algebra generated by the observations before episode t.

Lemma A.2 (Adapted from Lemma 11 of [Jin et al., 2020a]). For all x, a, let {zt(x, a)}Tt=1 be a sequence of functions

where zt(x, a) ∈ [0, R] isFt-measurable. Let Zt(x, a) ∈ [0, R] be a random variable such that Et[Zt(x, a)] = zt(x, a).Then with probability at least 1− δ,

T∑

t=1

x,a

(1t(x, a)Zt(x, a)

qt(x, a) + γ− qt(x, a)zt(x, a)

qt(x, a)

)≤ RH

2γln

H

δ.

Lemma A.3 (Matrix Azuma, Theorem 7.1 of [Tropp, 2012]). Consider an adapted sequence {Xk}nk=1 of self-adjoint

matrices in dimension d, and a fixed sequence {Ak}nk=1 of self-adjoint matrices that satisfy

Ek[Xk] = 0 and X2k � A2

k almost surely

Define the variance parameter

σ2 =

∥∥∥∥∥1

n

n∑

k=1

A2k

∥∥∥∥∥op

.

Then, for all τ > 0,

Pr

∥∥∥∥∥1

n

n∑

k=1

Xk

∥∥∥∥∥op

≥ τ

≤ de−nτ2/8σ2

.

Next, we show a classic regret bound for the exponential weight algorithm, which can be found, for example, in

[Luo, 2017].

Lemma A.4 (Regret bound of exponential weight, extracted from Theorem 1 of [Luo, 2017]). Let η > 0, and let

πt ∈ ∆(A) and ℓt ∈ RA satisfy the following for all t ∈ [T ] and a ∈ A:

π1(a) =1

|A| ,

πt+1(a) =πt(a)e

−ηℓt(a)

∑a′∈A πt(a′)e−ηℓt(a′)

,

|ηℓt(a)| ≤ 1.

Then for any π⋆ ∈ ∆(A),

T∑

t=1

a∈A

(πt(a)− π⋆(a))ℓt(a) ≤ln |A|η

+ η

T∑

t=1

a∈A

πt(a)ℓt(a)2.

15

Page 16: Policy Optimization in Adversarial MDPs: Improved

B Proofs Omitted in Section 3

In this section, we prove Lemma 3.1. In fact, we prove two generalized versions of it. Lemma B.1 states that the lemma

holds even when we replace the definition of Bt(x, a) by an upper bound of the right hand side of Eq. (4). (Note that

Lemma 3.1 is clearly a special case with P = P .)

Lemma B.1. Let bt(x, a) be a non-negative loss function, and P be a transition function. Suppose that the following

holds for all x, a:

Bt(x, a) = bt(x, a) +

(1 +

1

H

)Ex′∼P (·|x,a)Ea′∼πt(·|x′) [Bt(x

′, a′)] (19)

≥ bt(x, a) +

(1 +

1

H

)Ex′∼P (·|x,a)Ea′∼πt(·|x′) [Bt(x

′, a′)]

with Bt(xH , a) , 0, and suppose that Eq. (5) holds. Then

Reg ≤ o(T ) + 3

T∑

t=1

V πt(x0; bt).

where V π is the state value function under the transition function P and policy π.

Proof of Lemma B.1. By rearranging Eq. (5), we see that

Reg ≤ o(T ) +

T∑

t=1

x,a

q⋆(x)π⋆(a|x)bt(x, a)︸ ︷︷ ︸

term1

+1

H

T∑

t=1

x,a

q⋆(x)πt(a|x)Bt(x, a)

︸ ︷︷ ︸term2

+

T∑

t=1

x,a

q⋆(x)(πt(a|x) − π⋆(a|x)

)Bt(x, a)

︸ ︷︷ ︸term3

.

We first focus on term3, and focus on a single layer 0 ≤ h ≤ H − 1 and a single t:

x∈Xh

a∈A

q⋆(x) (πt(a|x)− π⋆(a|x))Bt(x, a)

=∑

x∈Xh

a∈A

q⋆(x)πt(a|x)Bt(x, a)−∑

x∈Xh

a∈A

q⋆(x)π⋆(a|x)Bt(x, a)

=∑

x∈Xh

a∈A

q⋆(x)πt(a|x)Bt(x, a)

−∑

x∈Xh

a∈A

q⋆(x)π⋆(a|x)(bt(x, a) +

(1 +

1

H

)Ex′∼P (·|x,a)Ea′∼πt(·|x′) [Bt(x

′, a′)]

)

≤∑

x∈Xh

a∈A

q⋆(x)πt(a|x)Bt(x, a)

−∑

x∈Xh

a∈A

q⋆(x)π⋆(a|x)(bt(x, a) +

(1 +

1

H

)Ex′∼P (·|x,a)Ea′∼πt(·|x′) [Bt(x

′, a′)]

)

=∑

x∈Xh

a∈A

q⋆(x)πt(a|x)Bt(x, a)−∑

x∈Xh+1

a∈A

q⋆(x)πt(a|x)Bt(x, a)

−∑

x∈Xh

a∈A

q⋆(x)π⋆(a|x)bt(x, a)−1

H

x∈Xh+1

a∈A

q⋆(x)πt(a|x)Bt(x, a),

16

Page 17: Policy Optimization in Adversarial MDPs: Improved

where the last step uses the fact∑

x∈Xh

∑a∈A q⋆(x)π⋆(a|x)P (x′|x, a) = q⋆(x′) (and then changes the notation

(x′, a′) to (x, a)). Now summing this over h = 0, 1, . . . , H − 1 and t = 1, . . . , T , and combining with term1 and

term2, we get

term1 + term2 + term3 =

(1 +

1

H

) T∑

t=1

a

πt(a|x0)Bt(x0, a).

Finally, we relate∑

a πt(a|x0)Bt(x0, a) to V πt(x0; bt). Below, we show by induction that for x ∈ Xh and any a,

a∈A

πt(a|x)Bt(x, a) ≤(1 +

1

H

)H−h−1

V πt(x; bt).

When h = H − 1,∑

a πt(a|x)Bt(x, a) =∑

a πt(a|x)bt(x, a) = V πt(x; bt). Suppose that the hypothesis holds for all

x ∈ Xh. Then for any x ∈ Xh−1,

a∈A

πt(a|x)Bt(x, a) =∑

a

πt(a|x)(bt(x, a) +

(1 +

1

H

)Ex′∼P (·|x,a)Ea′∼πt(·|x′) [Bt(x

′, a′)])

≤∑

a

πt(a|x)(bt(x, a) +

(1 +

1

H

)H−h

Ex′∼P (·|x,a)

[V πt(x′; bt)

] )(induction hypothesis)

≤(1 +

1

H

)H−h∑

a

πt(a|x)(bt(x, a) + Ex′∼P (·|x,a)

[V πt(x′; bt)

] )(bt(x, a) ≥ 0)

=

(1 +

1

H

)H−h

V πt(x; bt),

finishing the induction. Applying the relation on x = x0 and noticing that(1 + 1

H

)H ≤ e < 3 finishes the proof.

Besides Lemma B.1, we also show Lemma B.2 below, which guarantees that Lemma 3.1 holds even if Eq. (4) and

Eq. (5) only hold in expectation.

Lemma B.2. Let bt(x, a) be a non-negative loss function that is fixed at the beginning of episode t, and let πt be fixed

at the beginning of episode t. Let Bt(x, a) be a randomized bonus function that satisfies the following for all x, a:

Et [Bt(x, a)] = bt(x, a) +

(1 +

1

H

)Ex′∼P (·|x,a)Ea′∼πt(·|x′)Et

[Bt(x

′, a′)]

(20)

with Bt(xH , a) , 0, and suppose that the following holds (simply taking expectations on Eq. (5)):

E

[∑

x

q⋆(x)

T∑

t=1

a

(πt(a|x)− π⋆(a|x)

)(Qπt

t (x, a)−Bt(x, a))]

≤ o(T ) + E

[T∑

t=1

V π⋆

(x0; bt)

]+

1

HE

[T∑

t=1

x,a

q⋆(x)πt(a|x)Bt(x, a)

]. (21)

Then

E [Reg] ≤ o(T ) + 3E

[T∑

t=1

V πt(x0; bt)

].

Proof. The proof of this lemma follows that of Lemma B.1 line-by-line (with P = P ), except that we take expectations

in all steps.

17

Page 18: Policy Optimization in Adversarial MDPs: Improved

C Details Omitted in Section 4

In this section, we first discuss the implementation details of Algorithm 1 in Section C.1, then we give the complete

proof of Theorem 4.1 in Section C.2.

C.1 Implementation Details

The COMP-UOB procedure is the same as Algorithm 3 of [Jin et al., 2020a], which shows how to efficiently compute

an upper occupancy bound. We include the algorithm in Algorithm 7 for completeness. As Algorithm 1 also needs

COMP-LOB, which computes a lower occupancy bound, we provide its complete pseudocode in Algorithm 8 as well.

Fix a state x. Define f(x) to be the maximum and minimum probability of visiting x starting from state x for

COMP-UOB and COMP-LOB, respectively. Then the two algorithms almost have the same procedure to find f(x) by

solving the optimization in Eq. (22) subject to P in the confidence set P via a greedy approach in Algorithm 9. The

difference is that COMP-UOB sets OPTIMIZE to be max while COMP-LOB sets OPTIMIZE to be min, and thus in

Algorithm 9, {f(x)}x∈Xkis sorted in an ascending and a descending order, respectively.

Finally, we point out that the bonus function Bt(s, a) defined in Eq. (9) can clearly also be computed using a greedy

procedure similar to Algorithm 9. This concludes that the entire algorithm can be implemented efficiently.

f(x) =∑

a∈A

πt(a|x)

OPTIMIZE

P (·|x,a)

x′∈Xk(x)+1

P (x′|x, a)f(x′)

(22)

Algorithm 7 COMP-UOB (Algorithm 3 of [Jin et al., 2020a])

Input: a policy πt, a state-action pair (x, a) and a confidence set P of the form

{P :

∣∣∣P (x′|x, a)− P (x′|x, a)∣∣∣ ≤ ǫ(x′|x, a), ∀(x, a, x′)

}

Initialize: for all x ∈ Xk(x), set f(x) = 1{x = x}.for k = k(x)− 1 to 0 do

for ∀x ∈ Xk doCompute f(x) based on :

f(x) =∑

a∈A

πt(a|x) · GREEDY(f, P (·|x, a), ǫ(·|x, a),max

)

Return: πt(a|x)f(x0).

Algorithm 8 COMP-LOB

Input: a policy πt, a state-action pair (x, a) and a confidence set P of the form

{P :

∣∣∣P (x′|x, a)− P (x′|x, a)∣∣∣ ≤ ǫ(x′|x, a), ∀(x, a, x′)

}

Initialize: for all x ∈ Xk(x), set f(x) = 1{x = x}.for k = k(x)− 1 to 0 do

for ∀x ∈ Xk doCompute f(x) based on :

f(x) =∑

a∈A

πt(a|x) · GREEDY(f, P (·|x, a), ǫ(·|x, a),min

)

Return: πt(a|x)f(x0).

18

Page 19: Policy Optimization in Adversarial MDPs: Improved

Algorithm 9 GREEDY

Input: f : X → [0, 1], a distribution p over n states of layer k , positive numbers {ǫ(x)}x∈Xk, objective OPTIMIZE

(max for COMP-UOB and min for COMP-LOB).

Initialize: j− = 1, j+ = n, sort {f(x)}x∈Xkand find σ such that

f(σ(1)) ≤ f(σ(2)) ≤ · · · ≤ f(σ(n))

for OPTIMIZE = max, and

f(σ(1)) ≥ f(σ(2)) ≥ · · · ≥ f(σ(n))

for OPTIMIZE = min.

while j− < j+ do

x− = σ(j−), x+ = σ(j+)δ− = min{p(x−), ǫ(x−)}δ+ = min{1− p(x+), ǫ(x+)}p(x−)← p(x−)−min{δ−, δ+}p(x+)← p(x+) + min{δ−, δ+}if δ− ≤ δ+ then

ǫ(x+)← ǫ(x+)− δ−

j− ← j− + 1

else

ǫ(x−)← ǫ(x−)− δ+

j+ ← j+ − 1

Return:∑n

j=1 p(σ(j))f(σ(j))

C.2 Omitted Proofs

To prove Theorem 4.1, as discussed in the analysis sketch of Section 4, we decompose the left-hand side of Eq. (5) as:

T∑

t=1

x

q⋆(x) 〈πt(·|x)− π⋆(·|x), Qπt

t (x, ·)−Bt(x, ·)〉

=

T∑

t=1

x

q⋆(x)⟨πt(·|x), Qπt

t (x, ·) − Qt(x, ·)⟩

︸ ︷︷ ︸BIAS-1

+

T∑

t=1

x

q⋆(x)⟨π⋆(·|x), Qt(x, ·) −Qπt

t (x, ·)⟩

︸ ︷︷ ︸BIAS-2

+T∑

t=1

x

q⋆(x)⟨πt(·|x)− π⋆(·|x), Qt(x, ·)−Bt(x, ·)

︸ ︷︷ ︸REG-TERM

. (23)

We bound each term in a corresponding lemma. Specifically, We show a high probability bound of BIAS-1 in Lemma C.1,

a high probability bound of BIAS-2 in Lemma C.2, and a high-probability bound of REG-TERM in Lemma C.3. Finally,

we show how to combine all terms with the definition of bt in Theorem C.5, which is a restatement of Theorem 4.1.

Lemma C.1 (BIAS-1). With probability at least 1− 5δ,

BIAS-1 ≤ O(H

η

)+

T∑

t=1

x,a

q⋆(x)πt(a|x)

2γH +H

(qt(x, a)− q

t(x, a)

)

qt(x, a) + γ

.

Proof. In the proof, we assume that P ∈ Pk for all k, with holds with probability at least 1 − 4δ as already shown in

[Jin et al., 2020a, Lemma 2]. Under this event, qt(x, a) ≤ qt(x, a) ≤ qt(x, a) for all t, x, a.

19

Page 20: Policy Optimization in Adversarial MDPs: Improved

Let Yt =∑

x∈X q⋆(x)⟨πt(·|x), Qt(x, ·)

⟩. First, we decompose BIAS-1 as

T∑

t=1

(Et[Yt]− Yt) +

(∑

x

q⋆(x) 〈πt(·|x), Qπt

t (x, ·)〉 − Et[Yt]

). (24)

We will bound the first Martingale sequence using Freedman’s inequality. Note that we have

Vart[Yt] ≤ Et

(∑

x

q⋆(x)⟨πt(·|x), Qt(x, ·)

⟩)2

≤ Et

[(∑

x,a

q⋆(x)πt(a|x))(

x,a

q⋆(x)πt(a|x)Qt(x, a)2

)](Cauchy-Schwarz)

= H∑

x,a

q⋆(x)πt(a|x)L2t,hEt[1t(x, a)]

(qt(x, a) + γ)2(∑

x,a q⋆(x)πt(a|x) = H)

≤ H∑

x,a

q⋆(x)πt(a|x)qt(x, a)H

2

(qt(x, a) + γ)2(Lt,h ≤ H and Et[1t(x, a)] = qt(s, a))

≤∑

x,a

q⋆(x)πt(a|x)H3

qt(x, a) + γ(qt(s, a) ≤ qt(x, a))

and |Yt| ≤ H supx,a |Q(x, a)| ≤ H2

γ .

Moreover, for every t, the second term in Eq. (24) can be bounded as

x

q⋆(x) 〈πt(·|x), Qπt

t (x, ·)〉 − Et

[∑

x

q⋆(x)⟨πt(·|x), Qt(x, ·)

⟩]

=∑

x,a

q⋆(x)πt(a|x)Qπt

t (x, a)

(1− qt(x, a)

qt(x, a) + γ

)

≤∑

x,a

q⋆(x)πt(a|x)H(qt(x, a)− qt(x, a) + γ

qt(x, a) + γ

)(Qt(x, a) ≤ H)

≤∑

x,a

q⋆(x)πt(a|x)H(qt(x, a)− q

t(x, a) + γ

qt(x, a) + γ

). (q

t(x, a) ≤ qt(x, a))

Combining them, and using Freedman’s inequality (Lemma A.1), we have that with probability at least 1− 5δ,

BIAS-1 =

T∑

t=1

x

q⋆(x)⟨πt(·|x), Qπt

t (x, ·) − Qt(x, ·)⟩

≤T∑

t=1

x,a

q⋆(x)πt(a|x)H

(qt(x, a) − q

t(x, a)

)+ γ

qt(x, a) + γ

H2

T∑

t=1

x,a

q⋆(x)πt(a|x)H3

qt(x, a) + γ+

H2

γln

1

δ

≤ O(H

η

)+

T∑

t=1

x,a

q⋆(x)πt(a|x)

2γH +H

(qt(x, a)− q

t(x, a)

)

qt(x, a) + γ

,

where we use γ = 2ηH .

Next, we bound BIAS-2.

20

Page 21: Policy Optimization in Adversarial MDPs: Improved

Lemma C.2 (BIAS-2). With probability at least 1− 5δ, BIAS-2 ≤ O(

).

Proof. We invoke Lemma A.2 with zt(x, a) = q⋆(x)π⋆(a|x)Qπt

t (x, a) and

Zt(x, a) = q⋆(x)π⋆(a|x) (1t(x, a)Lt(x, a) + (1− 1t(x, a))Qπt

t (x, a)) .

Then we get that with probability at least 1− δ (recalling the definition Qt(x, a) =Lt,h

qt(x,a)+γ1t(x, a)),

T∑

t=1

x,a

q⋆(x)π⋆(a|x)(Qt(x, a)−

qt(x, a)

qt(x, a)Qπt

t (x, a)

)≤ H2

2γln

H

δ, (25)

Since with probability at least 1 − 4δ, qt(x, a) ≤ qt(x, a) for all t, x, a (by [Jin et al., 2020a, Lemma 2]), Eq. (25)

further implies that with probability at least 1− 5δ,

BIAS-2 =

T∑

t=1

x,a

q⋆(x)π⋆(x, a)(Qt(x, a)−Qπt

t (x, a))≤ H2

2γln

H

δ.

Noting that γ = 2ηH finishes the proof.

We continue to bound REG-TERM.

Lemma C.3 (REG-TERM). With probability at least 1− 5δ,

REG-TERM ≤ O(H

η

)+

T∑

t=1

x,a

q⋆(x)πt(a|x)(

γH

qt(x, a) + γ+

Bt(x, a)

H

).

Proof. The algorithm runs individual exponential weight updates on each state with loss vectors Qt(x, ·)−Bt(x, ·), so

we can apply standard results for exponential weight updates. Specifically, we can apply Lemma A.4 on each state x,

and get

T∑

t=1

⟨πt(·|x) − π⋆(·|x), Qt(x, ·)−Bt(x, ·)

⟩≤ ln |A|

η+ η

T∑

t=1

a∈A

πt(a|x)(Qt(x, a)−Bt(x, a)

)2. (26)

The condition required by Lemma A.4 (i.e., η|Qt(x, a)− Bt(x, a)| ≤ 1) is verified in Lemma C.4. Summing Eq. (26)

over states with weights q⋆(x), we get

REG-TERM ≤ H ln |A|η

+ ηT∑

t=1

x,a

q⋆(x)πt(a|x)(Qt(x, a)−Bt(x, a)

)2

≤ H ln |A|η

+ 2ηT∑

t=1

x,a

q⋆(x)πt(a|x)Qt(x, a)2 + 2η

T∑

t=1

x,a

q⋆(x)πt(a|x)Bt(x, a)2. (27)

Below, we focus on the last two terms on the right-hand side of Eq. (27). First, we have

T∑

t=1

x,a

q⋆(x)πt(a|x)Qt(x, a)2 ≤ 2η

T∑

t=1

x,a

q⋆(x)πt(a|x)H2

1t(x, a)

(qt(x, a) + γ)2

= 2ηH2T∑

t=1

x,a

q⋆(x)πt(a|x)qt(x, a) + γ

· 1t(x, a)

qt(x, a) + γ

≤ 2ηH2T∑

t=1

x,a

q⋆(x)πt(a|x)qt(x, a) + γ

· qt(x, a)qt(x, a)

+ 2ηH2 ×Hγ ln H

δ

≤ H

4ηln

H

δ+

T∑

t=1

x,a

q⋆(x)πt(a|x)γH

qt(x, a) + γ,

21

Page 22: Policy Optimization in Adversarial MDPs: Improved

where the third step happens with probability at least 1−δ by Lemma A.2 with zt(x, a) = Zt(x, a) =q⋆(x)πt(a|x)qt(x,a)+γ ≤ 1

γ ,

and the last step uses γ = 2ηH and qt(x, a) ≤ qt(x, a) (which happens with probability at least 1−4δ). For the second

term in Eq. (27), note that

T∑

t=1

a∈A

πt(a|x)Bt(x, a)2 ≤ 1

H

T∑

t=1

a∈A

πt(a|x)Bt(x, a)

due to the fact ηBt(x, a) ≤ 12H by Lemma C.4. Combining everything finishes the proof.

In Lemma C.3, as required by Lemma A.4, we control the magnitude of ηQt(x, a) and ηBt(x, a) by setting γ and

η properly, shown in the following technical lemma.

Lemma C.4. ηQt(x, a) ≤ 12 and ηBt(x, a) ≤ 1

2H .

Proof. Recall that γ = 2ηH and η ≤ 124H3 . Thus,

ηQt(x, a) ≤ηH

γ=

ηH

2ηH=

1

2,

ηbt(x, a) =3ηγH + ηH(qt(x, a)− q

t(x, a))

qt(x, a) + γ≤ 3ηH + ηH ≤ 1

6H2.

By the definition of Bt(x, a) in Eq. (9), we have

ηBt(x, a) ≤ H

(1 +

1

H

)H

η supx′,a′

bt(x′, a′) ≤ 3H × 1

6H2=

1

2H.

This finishes the proof.

Now we are ready to prove Theorem 4.1. For convenience, we state the theorem again here and show the proof.

Theorem C.5. Algorithm 1 ensures that with probability 1−O(δ), Reg = O(|X |H2

√AT +H4

).

Proof. Combining BIAS-1, BIAS-2, REG-TERM, we get that with probability at least 1−O(δ),BIAS-1 + BIAS-2 + REG-TERM

≤ O(H

η

)+

T∑

t=1

x,a

q⋆(x)πt(a|x)(3γH +H(qt(x, a)− q

t(x, a))

qt(x, a) + γ+

1

HBt(x, a)

)

= O(H

η

)+

T∑

t=1

x,a

q⋆(x)π⋆(a|x)bt(x, a) +1

H

T∑

t=1

x,a

q⋆(x)πt(a|x)Bt(x, a),

which is of the form specified in Eq. (5). By the definition of Bt(x, a) in Eq. (9), we see that Eq. (19) also holds with

probability at least 1−O(δ) for all t, x, a.

Therefore, by Lemma B.1, we can bound the regret as (let Pt be the optimistic transition function chosen in Eq. (9)

at episode t)

Reg = O(H

η+

T∑

t=1

x,a

qPt,πt(x, a)bt(x, a)

)

= O(H

η+

T∑

t=1

x,a

qPt,πt(x, a)H(qt(x, a) − q

t(x, a)) + γH

qt(x, a) + γ

)

= O(H

η+

T∑

t=1

x,a

(H(qt(x, a)− q

t(x, a)) + ηH2

))(qPt,πt(x, a) ≤ qt(x, a) and γ = 2ηH)

≤ O(H

η+ |X |H2

√AT + η|X ||A|H2T

),

where the last inequality is due to [Jin et al., 2020a, Lemma 4]. Plugging in the specified value for η, the regret can be

further upper bounded by O(|X |H2

√AT +H4

).

22

Page 23: Policy Optimization in Adversarial MDPs: Improved

D Analysis for Auxiliary Procedures

In this section, we analyze two important auxiliary procedures for the linear function approximation settings: GEOMET-

RICRESAMPLING and POLICYCOVER.

D.1 The Guarantee of GEOMETRICRESAMPLING

The GEOMETRICRESAMPLING algorithm is shown in Algorithm 4, which is almost the same as that in Neu and Olkhovskaya

[2020] except that we repeat the same procedure for M times and average the outputs (see the extra outer loop). This

extra step is added to deal with some technical difficulties in the analysis. The following lemma summarizes some

useful guarantees of this procedure. For generality, we present the lemma assuming a lower bound on the minimum

eigenvalue λ of the covariance matrix, but it will simply be 0 in all our applications of this lemma in this work.

Lemma D.1. Let π be a policy (possibly a mixture policy) with a covariance matrix Σh = Eπ [φ(xh, ah)φ(xh, ah)⊤] �

λI for layer h and some constant λ ≥ 0. Further let ǫ > 0 and γ ≥ 0 be two parameters satisfying 0 < γ + λ < 1.

Define M =⌈24 ln(dHT )

ǫ2 min{

1γ2 ,

4λ2 ln

2 1ǫλ

}⌉and N =

⌈2

γ+λ ln 1ǫ(γ+λ)

⌉. Let T be a set of MN trajectories

generated by π. Then GEOMETRICRESAMPLING (Algorithm 4) with input (T ,M,N, γ) ensures the following for all

h:

∥∥∥Σ+h

∥∥∥op≤ min

{1

γ,2

λln

1

ǫλ

}. (28)

∥∥∥E[Σ+

h

]− (γI +Σh)

−1∥∥∥

op≤ ǫ, (29)

∥∥∥Σ+h − (γI +Σh)

−1∥∥∥

op≤ 2ǫ, (30)

∥∥∥Σ+hΣh

∥∥∥op≤ 1 + 2ǫ, (31)

where ‖·‖op represents the spectral norm and the last two properties Eq. (30) and Eq. (31) hold with probability at least

1− 1T 3 .

Proof. To prove Eq. (28), notice that each one of Σ+(m)h , m = 1, . . . ,M , is a sum of N + 1 terms. Furthermore, the

n-th term of them (cZn,h in Algorithm 4) has an operator norm upper bounded by c(1 − cγ)n. Therefore,

∥∥∥Σ+(m)h

∥∥∥op≤

N∑

n=0

c(1− cγ)n ≤ min

{1

γ, c(N + 1)

}≤ min

{1

γ,2

λln

1

ǫλ

}(32)

by the definition of N and that c = 12 . Since Σ+

h is an average of Σ+(m)h , this implies Eq. (28).

To show Eq. (29), observe that Et[Yn,h] = γI +Σh and {Yn,h}Nn=1 are independent. Therefore, we a have

E

[Σ+

t,h

]= E

+(m)t,h

]= cI + c

N∑

i=1

(I − c (γI +Σt,h))i

= (γI +Σt,h)−1(I − (I − c (γI +Σt,h))

N+1)

where the last step uses the formula:(I +

∑Ni=1 A

i)= (I −A)−1(I −AN+1) with A = I − c(γI +Σt,h). Thus,

∥∥∥Et

[Σ+

h

]− (γI +Σh)

−1∥∥∥

op=∥∥∥(γI +Σh)

−1(I − c (γI +Σh))

N+1∥∥∥

op

≤ (1 − c(γ + λ))N+1

γ + λ≤ e−(N+1)c(γ+λ)

γ + λ≤ ǫ,

where the first inequality is by 0 ≺ I − c(γI + I) � I − c(γI +Σh) � I − c(γ + λ)I , and the last inequality is by our

choice of N and that c = 12 .

23

Page 24: Policy Optimization in Adversarial MDPs: Improved

To show Eq. (30), we only further need

∥∥∥Σ+h − E

[Σ+

h

]∥∥∥op≤ ǫ

and combine it with Eq. (29). This can be shown by applying Lemma A.3 with Xk = Σ+(k)h − E

+(k)h

], Ak =

min{

1γ ,

2λ ln 1

ǫλ

}I (recall Eq. (32) and thus X2

k � A2k), σ = min

{1γ ,

2λ ln 1

ǫλ

}, τ = ǫ, and n = M . This gives the

following statement: the event

∥∥∥Σ+h − Et

[Σ+

h

]∥∥∥op

> ǫ holds with probability less than

d exp

(−M × ǫ2 × 1

8×max

{γ2,

λ2

4 ln2 1ǫλ

})≤ 1

d2H3T 3≤ 1

HT 3

by our choice of M . The conclusion follows by a union bound over h.

To prove Eq. (31), observe that with Eq. (30), we have

∥∥∥Σ+hΣh

∥∥∥op≤∥∥∥(γI +Σh)

−1 Σh

∥∥∥op+∥∥∥(Σ+

h − (γI +Σh)−1)Σh

∥∥∥op≤ 1 + 2ǫ

since ‖Σh‖op ≤ 1.

D.2 The Guarantee of POLICYCOVER

In this section, we analyze Algorithm 6, which returns a policy cover and its estimated covariance matrices. The final

guarantee of the policy cover is provided in Lemma D.4, but we need to establish a couple of useful lemmas before

introducing that. Note that Algorithm 6 bears some similarity with [Wang et al., 2020, Algorithm 1] (except for the

design of the reward function rt), and thus the analysis is also similar to theirs.

We first define the following definitions, using notations defined in Algorithm 6 and Assumption 3.

Definition 1. For any π and m, define V πm to be the state value function for π with respect to reward function rm.

Precisely, this means V πm(xH) = 0 and for (x, a) ∈ Xh×A, h = H − 1, . . . , 0: V π

m(x) =∑

a π(a|x)Qπm(x, a) where

Qπm(x, a) = rm(x, a) + φ(x, a)⊤θπm,h and θπm,h =

x′∈Xh+1

V πm(x′)νx

h dx′.

Furthermore, let π∗m be the optimal policy satisfying π∗

m = argmaxπ Vπm(x) for all x, and define shorthands V ∗

m(x) =

Vπ∗

mm (x), Q∗

m(x, a) = Qπ∗

mm (x, a), and θ∗m,h = θ

π∗

m

m,h.

The following lemma characterizes the optimistic nature of Algorithm 6.

Lemma D.2. With probability at least 1− δ, for all h, all (x, a) ∈ Xh ×A, and all π, Algorithm 6 ensures

0 ≤ Qm(x, a)−Qπm(x, a) ≤ Ex′∼P (·|x,a)

[Vm(x′)− V π

m(x′)]+ 2ξ‖φ(x, a)‖Γ−1

m,h.

Proof. The proof mostly follows that of [Wei et al., 2021, Lemma 4]. For notational convenience, denote φ(xτ,h, aτ,h)as φτ,h, and x′ ∼ P (·|xτ,h, aτ,h) as x′ ∼ (τ, h). We then have

θm,h − θπm,h

= Γ−1m,h

1

N0

(m−1)N0∑

τ=1

φτ,hVm(xτ,h+1)

− Γ−1

m,h

θπm,h +

1

N0

(m−1)N0∑

τ=1

φτ,hφ⊤τ,hθ

πm,h

= Γ−1m,h

1

N0

(m−1)N0∑

τ=1

φτ,hVm(xτ,h+1)

− Γ−1

m,h

1

N0

(m−1)N0∑

τ=1

φτ,hEx′∼(τ,h) [Vπm(x′)]

− Γ−1

m,hθπm,h

24

Page 25: Policy Optimization in Adversarial MDPs: Improved

= Γ−1m,h

1

N0

(m−1)N0∑

τ=1

φτ,hEx′∼(τ,h)

[Vm(x′)− V π

m(x′)]+ ζm,h − Γ−1

t,hθπt,h

(define ζm,h = 1N0

Γ−1m,h

∑(m−1)N0

τ=1

(Vm(xτ,h+1)− Ex′∼(τ,h)Vm(x′)

))

= Γ−1m,h

1

N0

(m−1)N0∑

τ=1

φτ,hφ⊤τ,h

x′∈Xh+1

νx′

h

(Vm(x′)− V π

m(x′))dx′

+ ζm,h − Γ−1

m,hθπm,h

=

x′∈Xh+1

νx′

h

(Vm(x′)− V π

m(x′))dx′ + ζm,h − Γ−1

m,hθπm,h − Γ−1

m,h

x′∈Xh+1

νx′

h

(Vm(x′)− V π

m(x′))dx′.

Therefore, for x ∈ Xh,

Qm(x, a)−Qπm(x, a)

= φ(x, a)⊤(θm,h − θπm,h

)+ ξ‖φ(x, a)‖Γ−1

m,h

= φ(x, a)⊤∫

x′∈Xh+1

νx′

h

(Vm(x′)− V π

m(x′))dx′ + φ(x, a)⊤ζm,h︸ ︷︷ ︸

term1

+ξ‖φ(x, a)‖Γ−1m,h

−φ(x, a)⊤Γ−1m,h

x′∈Xh+1

νx′

h

(Vm(x′)− V π

m(x′))dx′

︸ ︷︷ ︸term2

−φ(x, a)⊤Γ−1m,hθ

πm,h︸ ︷︷ ︸

term3

= Ex′∼p(·|x,a)[Vm(x′)− V π

m(x′)]+ ξ ‖φ(x, a)‖Γ−1

m,h+ term1 + term2 + term3. (33)

It remains to bound |term1 + term2 + term3|. To do so, we follow the exact same arguments as in [Wei et al., 2021,

Lemma 4] to bound each of the three terms.

Bounding term1. First we have |term1| ≤ ‖ζm,h‖Γm,h‖φ(x, a)‖Γ−1

m,h. To bound ‖ζm,h‖Γm,h

, we use the exact

same argument of [Wei et al., 2021, Lemma 4] to arrive at (with probability at least 1− δ)

‖ζm,h‖Γm,h=

∥∥∥∥∥∥1

N0

(m−1)N0∑

τ=1

(Vm(xτ,h+1)− Ex′∼(τ,h)Vm(x′)

)∥∥∥∥∥∥Γ−1m,h

≤ 2H

√d

2log(M0 + 1) + log

δ+√8M2

0 ε2, (34)

where Nε is the ε-cover of the function class that Vm(·) lies in. Notice that for all m, Vm(·) can be expressed as the

following:

Vm(x) = min

{max

a

{ramp 1

T

(‖φ(x, a)‖2Z −

α

M0

)+ ξ‖φ(x, a)‖Z + φ(x, a)⊤θ

}, H

}

for some positive definite matrix Z ∈ Rd×d with 1

1+M0I � Z � I and vector θ ∈ R

d with ‖θ‖ ≤ supm,τ,h

∥∥∥Γ−1m,h

∥∥∥op×

M0‖φτ,h‖H ≤M0H . Therefore, we can write the class of functions that Vm(·) lies in as the following set:

V =

{min

{max

a

{ramp 1

T

(‖φ(x, a)‖2Z −

α

M0

)+ ξ‖φ(x, a)‖Z + φ(x, a)⊤θ

}, H

}:

θ ∈ Rd : ‖θ‖ ≤M0H, Z ∈ R

d×d :1

1 +M0I � Z � I

}.

Now we apply Lemma 12 of [Wei et al., 2021] to V , with the following choices of parameters: P = d2 + d, ε = 1T 3 ,

B = M0H , and L = T + ξ√1 +M0 + 1 ≤ 3T (without loss of generality, we assume that T is large enough so that

25

Page 26: Policy Optimization in Adversarial MDPs: Improved

the last inequality holds). The value of the Lipschitzness parameter L is according to the following calculation that is

similar to [Wei et al., 2021]: for any ∆Z = ǫeie⊤j ,

1

|ǫ|

∣∣∣∣√φ(x, a)⊤(Z +∆Z)φ(x, a)−

√φ(x, a)⊤Zφ(x, a)

∣∣∣∣

≤∣∣φ(x, a)⊤eie⊤j φ(x, a)

∣∣√φ(x, a)⊤Zφ(x, a)

(√u+ v −√u ≤ |v|√

u)

≤φ(x, a)⊤

(12eie

⊤i + 1

2eje⊤j

)φ(x, a)

√φ(x, a)⊤Zφ(x, a)

≤ φ(x, a)⊤φ(x, a)√φ(x, a)⊤Zφ(x, a)

≤√

1

λmin(Z)≤√1 +M0;

1|ǫ|∣∣‖φ(x, a)‖2Z+∆Z − ‖φ(x, a)‖2Z

∣∣ = |e⊤i φ(x, a)φ(x, a)⊤ej | ≤ 1; and that ramp 1T(·) has a slope of T (this is why

we need to use the ramp function to approximate an indication function that is not Lipschitz). Overall, this leads to

logNε ≤ 20(d2 + d) logT . Using this fact in Eq. (34), we get

‖ζm,h‖Γm,h≤ 20H

√d2 log

(T

δ

)≤ 1

3ξ,

and thus |term1| ≤ ξ3‖φ(x, a)‖Γ−1

m,h.

Bounding term2 and term3. This is exactly the same as [Wei et al., 2021, Lemma 4], and we omit the details. In

summary, we can also prove |term2| ≤ ξ3 ‖φ(x, a)‖Γ−1

m,hand term3 ≤ ξ

3 ‖φ(x, a)‖Γ−1m,h

.

In sum, we can bound

|term1 + term2 + term3| ≤ |term1|+ |term2|+ |term3| ≤ ξ‖φ(x, a)‖Γ−1m,h

for all m,h and (s, a) with probability at least 1− δ.

Combining this with Eq. (33), we get

Qm(x, a) −Qπm(x, a) ≤ Ex′∼p(·|x,a)

[Vm(x′)− V π

m(x′)]+ 2ξ‖φ(x, a)‖Γ−1

m,h, (35)

Qm(x, a) −Qπm(x, a) ≥ Ex′∼p(·|x,a)

[Vm(x′)− V π

m(x′)], (36)

where Eq. (35) proves the second inequality in the lemma. To prove the first inequality in the lemma, we use and

induction to show that Vm(x) ≥ V πm(x) for all x, which combined with Eq. (36) finishes the proof. Recall that we

define Vm(xH) = V πm(xH) = 0. Assume that Vm(x) ≥ V π

m(x) holds for x ∈ Xh+1. Then by Eq. (36), Qm(x, a) −Qπ

m(x, a) ≥ 0 for all (x, a) ∈ Xh×A. Thus, Vm(x)−V πm(x) = maxa Qm(x, a)−∑a π(a|x)Qπ

m(x, a) ≥ 0, finishing

the induction.

The next lemma provides a “regret guarantee” for Algorithm 6 with respect to the fake rewards.

Lemma D.3. With probability at least 1− 2δ, Algorithm 6 ensures

M0∑

m=1

V ∗m(x0)−

M0∑

m=1

V πmm (x0) = O

(d3/2H2

√M0

).

Proof. For any t ∈ [(m− 1)N0 + 1,mN0] and any h,

Vm(xt,h)− V πmm (xt,h)

= maxa

Qm(xt,h, a)−Qπmm (xt,h, at,h) (πm is a deterministic policy)

= Qm(xt,h, at,h)−Qπmm (xt,h, at,h)

26

Page 27: Policy Optimization in Adversarial MDPs: Improved

≤ Ex′∼(xt,h,at,h)

[Vm(x′)− V πm

m (x′)]+ 2ξ ‖φ(xt,h, at,h)‖Γ−1

m,h(Lemma D.2)

= Vm(xt,h+1)− V πmm (xt,h+1) + et,h + 2ξ ‖φ(xt,h, at,h)‖Γ−1

m,h. (define et,h to be the difference)

Thus,

Vm(x0)− V πmm (x0) ≤

h

(2ξ ‖φ(xt,h, at,h)‖Γ−1

m,h+ et,h

).

Summing over t, and using the fact V ∗m(x0) ≤ Vm(x0) (from Lemma D.2), we get

1

M0

M0∑

m=1

(V ∗m(x0)− V πm

m (x0))

≤ 1

M0N0

M0N0∑

t=1

h

(2ξ ‖φ(xt,h, at,h)‖Γ−1

m,h+ et,h

)

≤ 2ξ√M0N0

h

√√√√M0N0∑

t=1

‖φ(xt,h, at,h)‖2Γ−1m,h

+1

M0N0

M0N0∑

t=1

h

et,h. (Cauchy-Schwarz inequality)

Further using the fact∑M0N0

t=1 ‖φ(xt,h, at,h)‖2Γ−1m,h

= N0

∑M0

m=1

⟨Γm+1,h − Γm,h,Γ

−1m,h

⟩= O (N0d) (see e.g., [Jin et al.,

2020b, Lemma D.2]), we bound the first term above by O(ξH√d/M0

)= O

(H2√d3/M0

). For the second term,

note that∑M0N0

t=1 et,h is the sum of a martingale difference sequence. By Azuma’s inequality, the entire second term is

thus of order O(

H2 log(1/δ)√M0N0

)with probability at least 1− δ. This finishes the proof.

Finally, we are ready to show the guarantee of the returned policy cover. Recall our definition of known state set:

K ={x ∈ X : ∀a ∈ A, ‖φ(x, a)‖2

(Σcovh

)−1 ≤ α where h is such that x ∈ Xh

}.

Lemma D.4. For any h = 0, . . . , H − 1, with probability at least 1 − 4δ (over the randomness in the first T0 rounds),

the covariance matrices Σcovh returned by Algorithm 6 satisfies that for any policy π,

Prxh∼π

[xh /∈ K] ≤ O(dH

α

).

where xh ∈ Xh is sampled from executing π.

Proof. We define an auxiliary policy π′ which only differs from π for unknown states in layer h. Specifically, for

x ∈ Xh not in K, let a be such that ‖φ(x, a)‖2(Σcov

h)−1≥ α (which must exist by the definition of K), then π′(a′|x) =

1[a′ = a] for all a′ ∈ A. By doing so, we have

Prxh∼π

[xh /∈ K]

= Pr(xh,a)∼π′

[‖φ(xh, a)‖2(Σcov

h)−1 ≥ α

]

= Pr(xh,a)∼π′

[‖φ(xh, a)‖2Γ−1

M0+1,h

≥ α

M0

]

≤ 1

M0

M0∑

m=1

Pr(xh,a)∼π′

[‖φ(xh, a)‖2Γ−1

m,h

≥ α

M0

](Γm,h � ΓM0+1,h)

≤ 1

M0

M0∑

m=1

E(xh,a)∼π′

[ramp 1

T

(‖φ(x, a)‖2

Γ−1m,h

− α

M0

)](1[y ≥ 0] ≤ rampz(y))

27

Page 28: Policy Optimization in Adversarial MDPs: Improved

≤ 1

M0

M0∑

m=1

V π′

m (x0) (rewards rm(·, ·) are non-negative)

≤ 1

M0

M0∑

m=1

V πmm (x0) +

1

M0× O

(d3/2H2

√M0

)(Lemma D.3)

≤ 1

M0N0

M0N0∑

t=1

H−1∑

h=0

rm(xt,h, at,h) + O(

H√M0N0

)+ O

(d3/2H2

√M0

)(by Azuma’s inequality)

≤ 1

M0N0× 1

αM0

M0N0∑

t=1

H−1∑

h=0

‖φ(xt,h, at,h)‖2Γ−1m,h

+ O(d3/2H2

√M0

)(rampz(y − y′) ≤ y

y′for y > 0, y′ > z > 0)

≤ 1

M0N0× 1

αM0

× O (N0dH) + O(d3/2H2

√M0

)(same calculation as done in the proof of Lemma D.3)

≤ O(dH

α+

d3/2H2

√M0

).

Finally, using the definition of M0 finishes the proof.

E Details Omitted in Section 5

In this section, we analyze Algorithm 2 and prove Theorem 5.1. In the analysis, we require that πt(a|x) and Bt(x, a)are defined for all x, a, t, but in Algorithm 2, they are only explicitly defined if the learner has ever visited state x.

Below, we construct a virtual process that is equivalent to Algorithm 2, but with all πt(a|x) and Bt(x, a) well-defined.

Imagine a virtual process where at the end of episode t (the moment when Σ+t has been defined), BONUS(t, x, a)

is called once for every (x, a), in an order from layer H − 1 to layer 0. Observe that within BONUS(t, x, a), other

BONUS(t′, x′, a′) might be called, but either t′ < t, or x′ is in a later layer. Therefore, in this virtual process, every

recursive call will soon be returned in the third line of Algorithm 3 because they have been called previously and the

values of them are already determined. Given that BONUS(t, x, a) are all called once, at the beginning of episode t+1,

πt+1 will be well-defined for all states since it only depends on BONUS(t′, x′, a′) with t′ ≤ t and other quantities that

are well-defined before episode t+ 1.

Comparing the virtual process and the real process, we see that the virtual process calculates all entries of BONUS(t, x, a),

while the real process only calculates a subset of them that are necessary for constructing πt and Σ+t . However, they

define exactly the same policies as long as the random seeds we use for each entry of BONUS(t, x, a) are the same

for both processes. Therefore, we can define Bt(x, a) unambiguously as the value returned by BONUS(t, x, a) in the

virtual process, and πt(a|x) as shown in (11) with BONUS(τ, x, a) replaced by Bτ (x, a).Now, we follow the exactly same regret decomposition as described in Section 4, with the new definition of

Qt(x, a) , φ(x, a)⊤ θt,h (for x ∈ Xh) and Bt(x, a) described above:

T∑

t=1

H−1∑

h=0

Exh∼π⋆ [〈πt(·|xh)− π⋆(·|xh), Qπt

t (xh, ·)−Bt(xh, ·)〉]

=T∑

t=1

H−1∑

h=0

Exh∼π⋆

[⟨πt(·|xh), Q

πt

t (xh, ·)− Qt(xh, ·)⟩]

︸ ︷︷ ︸BIAS-1

+T∑

t=1

H−1∑

h=0

Exh∼π⋆

[⟨π⋆(·|xh), Qt(xh, ·)−Qπt

t (xh, ·)⟩]

︸ ︷︷ ︸BIAS-2

+

T∑

t=1

H−1∑

h=0

Exh∼π⋆

[⟨πt(·|xh)− π⋆(·|xh), Qt(xh, ·)−Bt(xh, ·)

⟩]

︸ ︷︷ ︸REG-TERM

.

We then bound E[BIAS-1 + BIAS-2] and E[REG-TERM] in Lemma E.1 and Lemma E.2 respectively.

28

Page 29: Policy Optimization in Adversarial MDPs: Improved

Lemma E.1. If β ≤ H , then E[BIAS-1 + BIAS-2] is upper bounded by

β

4E

[T∑

t=1

H−1∑

h=0

Exh∼π⋆

[∑

a

(πt(a|xh) + π⋆(a|xh)

)‖φ(xh, a)‖2Σ+

t,h

]]+O

(γdH3T

β+ ǫH2T

).

Proof of Lemma E.1. Consider a specific (t, x, a). Let h be such that x ∈ Xh. Then we proceed as

Et

[Qπt

t (x, a) − Qt(x, a)]

= φ(x, a)⊤(θπt

t,h − Et

[θt,h

])

= φ(x, a)⊤(θπt

t,h − Et

[Σ+

t,h

]Et [φ(xt,h, at,h)Lt,h]

)(definition of θt,h)

= φ(x, a)⊤(θπt

t,h − (γI +Σt,h)−1

Et [φ(xt,h, at,h)Lt,h])+O(ǫH)

(by Eq. (29) of Lemma D.1 and that ‖φ(x, a)‖ ≤ 1 for all x, a and Lt,h ≤ H)

= φ(x, a)⊤(θπt

t,h − (γI +Σt,h)−1

Σt,hθπt

t,h

)+O(ǫH) (E[Lt,h] = φ(xt,h, at,h)

⊤θπt

t,h)

= γφ(x, a)⊤(γI +Σt,h)−1θπt

t,h +O(ǫH) (θπt

t,h = (γI +Σt,h)−1

(γI +Σt,h) θπt

t,h)

≤ γ‖φ(x, a)‖2(γI+Σt,h)−1‖θπt

t,h‖2(γI+Σt,h)−1 +O(ǫH) (Cauchy-Schwarz inequality)

≤ β

4‖φ(x, a)‖2(γI+Σt,h)−1 +

γ2

β‖θπt

t,h‖2(γI+Σt,h)−1 +O(ǫH) (AM-GM inequality)

≤ β

4Et

[‖φ(x, a)‖2

Σ+t,h

]+

γdH2

β+O (ǫ(H + β)) (37)

where in the last inequality we use Eq. (29) again and also ‖θπt,h‖2 ≤ dH2 according to Assumption 1. Taking expec-

tation over x and summing over t, a with weights πt(a|x), we get

E[BIAS-1] ≤ β

4E

[T∑

t=1

H−1∑

h=0

Exh∼π⋆

[∑

a

πt(a|xh)‖φ(xh, a)‖2Σ+t,h

]]+O

(γdH3T

β+ ǫH2T

). (using β ≤ H)

By the same argument, we can show that Et[Qt(x, a) − Qπt

t (x, a)] is also upper bounded by the right-hand side of

Eq. (37), and thus

E[BIAS-2] ≤ β

4E

[T∑

t=1

H−1∑

h=0

Exh∼π⋆

[∑

a

π⋆(a|xh)‖φ(xh, a)‖2Σ+t,h

]]+O

(γdH3T

β+ ǫH2T

).

Summing them up finishes the proof.

Lemma E.2. If ηβ ≤ γ12H2 and η ≤ γ

2H , then E[REG-TERM] is upper bounded by

H ln |A|η

+ 2ηH2E

[T∑

t=1

H−1∑

h=0

Exh∼π⋆

[∑

a

πt(a|xh)‖φ(xh, a)‖2Σ+t,h

]]

+1

HE

[T∑

t=1

H−1∑

h=0

Exh∼π⋆

[∑

a

πt(a|xh)Bt(x, a)

]]+O

(ηǫH3T +

ηH3

γ2T 2

).

Proof of Lemma E.2. Again, we will apply the regret bound of the exponential weight algorithm Lemma A.4 to each

state. We start by checking the required condition: η|φ(x, a)⊤ θτ,h −Bt(x, a)| ≤ 1. This can be seen by that

η∣∣∣φ(x, a)⊤ θτ,h

∣∣∣ = η∣∣∣φ(x, a)⊤Σ+

t,hφ(xt,h, at,h)Lt,h

∣∣∣

≤ η ×∥∥∥Σ+

t,h

∥∥∥op× Lt,h ≤

ηH

γ≤ 1

2, (Eq. (28) and the condition η ≤ γ

2H )

29

Page 30: Policy Optimization in Adversarial MDPs: Improved

and that by the definition of BONUS(t, x, a), we have

ηBt(x, a) ≤ η ×H

(1 +

1

H

)H

× 2β supx,a,h‖φ(x, a)‖2

Σ+t,h

≤ 6ηβH

γ≤ 1

2H, (38)

where the last inequality is by Eq. (28) again and the condition ηβ ≤ γ12H2 .

Thus, by Lemma A.4, we have for any x,

E

[T∑

t=1

a

(πt(a|x)− π⋆(a|x)) Qt(x, a)

]

≤ ln |A|η

+ 2ηE

[T∑

t=1

a

πt(a|x)Qt(x, a)2

]+ 2ηE

[T∑

t=1

a

πt(a|x)Bt(x, a)2

]. (39)

The last term in Eq. (39) can be upper bounded by E

[1H

∑Tt=1

∑a πt(a|x)Bt(x, a)

]because ηBt(x, a) ≤ 1

2H as we

verified in Eq. (38). To bound the second term in Eq. (39), we use the following: for (x, a) ∈ Xh ×A,

Et

[Qt(x, a)

2]≤ H2

Et

[φ(x, a)⊤Σ+

t,hφ(xt,h, at,h)φ(xt,h, at,h)⊤Σ+

t,hφ(x, a)]

= H2Et

[φ(x, a)⊤Σ+

t,hΣt,hΣ+t,hφ(x, a)

]

≤ H2Et

[φ(x, a)⊤Σ+

t,hΣt,h (γI +Σt,h)−1

φ(x, a)]+O

(ǫH2 +

H2

γ2T 3

)(∗)

≤ H2φ(x, a)⊤ (γI +Σt,h)−1

Σt,h (γI +Σt,h)−1

φ(x, a) +O(ǫH2 +

H2

γ2T 3

)(by Eq. (29))

≤ H2φ(x, a)⊤ (γI +Σt,h)−1 φ(x, a) +O

(ǫH2 +

H2

γ2T 3

)

≤ H2Et

[φ(x, a)⊤Σ+

t,hφ(x, a)]+O

(ǫH2 +

H2

γ2T 3

)(by Eq. (29) again)

= H2Et

[‖φ(x, a)‖2

Σ+t,h

]+O

(ǫH2 +

H2

γ2T 3

)

where (∗) is because by Eq. (30) and Eq. (31), ‖(γI +Σt,h)−1 − Σ+

t,h‖op ≤ 2ǫ and ‖Σ+t,hΣt,h‖op ≤ 1 + 2ǫ hold with

probability 1− 1T 3 ; for the remaining probability, we upper bound H2φ(x, a)⊤Σ+

t,hΣt,hΣ+t,hφ(x, a) by H2

γ2 . Combining

them with Eq. (39) and taking expectation over states finishes the proof.

With Lemma E.1 and Lemma E.2, we can now prove Theorem 5.1.

Proof of Theorem 5.1. Combining Lemma E.1 and Lemma E.2, we get (under the required conditions of the parame-

ters):

E [BIAS-1 + BIAS-2 + REG-TERM]

≤ O(H ln |A|

η+

γdH3T

β+ ǫH2T + ηǫH3T +

ηH3

γ2T 2

)

+

(2ηH2 +

β

4

)E

[T∑

t=1

H−1∑

h=0

Exh∼π⋆

[∑

a

(πt(a|xh) + π⋆(a|xh)

)‖φ(xh, a)‖2Σ+

t,h

]]

+1

HE

[T∑

t=1

H−1∑

h=0

Exh∼π⋆

[∑

a

πt(a|xh)Bt(xh, a)

]].

We see that Eq. (21) is satisfied in expectation as long as we have 2ηH2 + β4 ≤ β and define bt(x, a) ,

β‖φ(x, a)‖2Σ+

t,h

+ β∑

a′ πt(a′|x)‖φ(x, a′)‖2

Σ+t,h

(for x ∈ Xh). By the definition of Algorithm 3, Eq. (20) is also

30

Page 31: Policy Optimization in Adversarial MDPs: Improved

satisfied with this choice of bt(x, a). Therefore, we can apply Lemma B.2 to obtain a regret bound. To simply the

presentation, we first pick ǫ = 1H3T so that all ǫ-related terms becomeO(1). Then we have

E[Reg]

= O(H

η+

γdH3T

β+

ηH3

γ2T 2+ E

[T∑

t=1

H−1∑

h=0

E(xh,a)∼πt[bt(x, a)]

])

= O(H

η+

γdH3T

β+

ηH3

γ2T 2+ βE

[T∑

t=1

H−1∑

h=0

E(xh,a)∼πt

[‖φ(x, a)‖2

Σ+t,h

]])

= O(H

η+

γdH3T

β+

ηH3

γ2T 2+ βE

[T∑

t=1

H−1∑

h=0

E(xh,a)∼πt

[‖φ(x, a)‖2(γI+Σt,h)−1

]])(Eq. (29) and β ≤ H)

= O(H

η+

γdH3T

β+

ηH3

γ2T 2+ βdHT

),

where the last step uses the fact

Et

[∑

h

E(xh,a)∼πt

[‖φ(x, a)‖2(γI+Σt,h)−1

]]≤ Et

[∑

h

E(xh,a)∼πt

[‖φ(x, a)‖2

Σ−1t,h

]]

=∑

h

⟨Σt,h,Σ

−1t,h

⟩= dH. (40)

Finally, choosing the parameters under the specified constraints as:

γ = (dT )−23 , β = H(dT )−

13 , ǫ =

1

H3T,

η = min

2H,3β

8H2,

γ

12βH2

},

we further bound the regret by O(H2(dT )

23 +H4(dT )

13

).

F Details Omitted in Section 6

In this section, we analyze our algorithm for linear MDPs. First, we show the main benefit of exploring with the policy

cover, that is, it ensures a small magnitude for bt(x, a), as shown below.

Lemma F.1. If γ ≥ 36β2

δeand βǫ ≤ 1

8 , then bk(x, a) ≤ 1 for all (x, a) and all k (with high probability).

Proof. According to the definition of bk(x, a) (in Algorithm 5), it suffices to show that for x ∈ K, β‖φ(x, a)‖2Σ+

k,h

≤ 12

for any a. To do so, note that the GEOMETRICRESAMPLING procedure ensures that Σ+k,h is an estimation of the inverse

of γI +Σmixk,h, where

Σmixk,h = δeΣ

covh + (1− δe)E(xh,a)∼πk

[φ(xh, a)φ(xh, a)⊤] (41)

and Σcovh = 1

M0

∑M0

m=1 E(xh,a)∼πm

[φ(xh, a)φ(xh, a)

⊤] is the covariance matrix of the policy cover πcov. By Eq. (30),

we have with probability at least 1− 1/T 3,

β‖φ(x, a)‖2Σ+

k,h

≤ β‖φ(x, a)‖2(γI+Σmixk,h

)−1 + 2βǫ ≤ β‖φ(x, a)‖2(γI+Σmixk,h

)−1 +1

4.

The first term can be further bounded as βδe‖φ(x, a)‖2( γ

δeI+Σcov

h)−1 ≤ β

δe‖φ(x, a)‖2

( 1M0

I+Σcovh

)−1 , where the last step is

because γδeM0 ≥ γ

δe× δ2e

36β2 ≥ 1 by our condition. Finally, we show that 1M0

I + Σcovh and Σcov

h are close. Recall the

definition of the latter:

Σcovh =

1

M0I +

1

M0N0

M0∑

m=1

mN0∑

t=(m−1)N0+1

φ(xt,h, at,h)φ(xt,h, at,h)⊤.

31

Page 32: Policy Optimization in Adversarial MDPs: Improved

We now apply Lemma A.3 with n = N0 and

Xk =1

M0

M0∑

m=1

φ(xτ(m,k),h, aτ(m,k),h)φ(xτ(m,k),h, aτ(m,k),h)⊤ − 1

M0

M0∑

m=1

E(xh,a)∼πm

[φ(xh, a)φ(xh, a)

⊤] ,

for k = 1, . . . , N0, where τ(m, k) , (m − 1)N0 + k. Note that X2k � I . Therefore, we can pick Ak = I and σ = 1.

By Lemma A.3, we have with probability at least 1− δ,

∥∥∥∥Σcovh −

1

M0I − Σcov

h

∥∥∥∥op

≤√

8 log(d/δ)

N0.

Following the same proof as [Meng and Zheng, 2010, Theorem 2.1], we have

∥∥∥∥∥

(1

M0I +Σcov

h

)−1

−(Σcov

h

)−1∥∥∥∥∥

op

≤M20

∥∥∥∥Σcovh −

1

M0I − Σcov

h

∥∥∥∥op

≤M20

√8 log(d/δ)

N0≤ α

2.

(by our choice of N0 and M0)

Consequently, for any vector φ with ‖φ‖ ≤ 1, we have

∣∣∣∣‖φ‖2(

1M0

I+Σcovh

)−1 − ‖φ‖2(Σcov

h )−1

∣∣∣∣ ≤∥∥∥∥∥

(1

M0I +Σcov

h

)−1

−(Σcov

h

)−1∥∥∥∥∥

op

≤ α

2.

Therefore, combining everything we have

β‖φ(x, a)‖2Σ+

k,h

≤ β

δe

(‖φ(x, a)‖2

(Σcovh )

−1 +α

2

)+

1

4≤ 3βα

2δe+

1

4≤ 1

2,

where the last two steps use the fact x ∈ K and the value of α. This finishes the proof.

Next, we define the following notations for convenience due to the epoch schedule of our algorithm, and then

proceed to prove the main theorem.

Definition 2.

ℓk(x, a) =1

W

T0+kW∑

t=T0+(k−1)W+1

ℓt(x, a)

k (x, a) = Qπ(x, a; ℓk)

θπ

k,h is such that Qπ

k (x, a) = φ(x, a)⊤θπ

k,h

Bk(x, a) = bk(x, a) +

(1 +

1

H

)Ex′∼P (·|x,a)Ea′∼πk(·|x′)[Bk(x

′, a′)]

Bk(x, a) = bk(x, a) + φ(x, a)⊤Λk,h (for x ∈ Xh)

Proof of Theorem 6.1. We first analyze the regret of policy optimization after the first T0 rounds. Our goal is again to

prove Eq. (21) which in this case bounds

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[∑

a

(πk(a|x)− π⋆(a|x)

) (Q

πk

k (x, a) −Bk(x, a))]

.

The first step is to separate known states and unknown states. For unknown states, we have

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x /∈ K]

a

(πk(a|x) − π⋆(a|x)

) (Q

πk

k (x, a)−Bk(x, a))]

32

Page 33: Policy Optimization in Adversarial MDPs: Improved

≤ (T − T0)He

W

h

EXh∋x∼π⋆ [1[x /∈ K]] = O(dH3T

αW

),

where the first step is by the facts 0 ≤ Qπk

k (x, a) ≤ H and 0 ≤ Bk(x, a) ≤ (1 + 1H )H × H ≤ He (Lemma F.1),

and the second step applies Lemma D.4. For known states, we apply a similar decomposition as previous analysis, but

since we also use function approximation for bonus Bt(x, a), we need to account for its estimation error, which results

in two extra bias terms:

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x ∈ K]

a

(πk(a|x) − π⋆(a|x)

) (Q

πk

k (x, a)−Bk(x, a))]

=

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x ∈ K]

a

πk(a|x)(Q

πk

k (x, a)− Qk(x, a))]

︸ ︷︷ ︸BIAS-1

+

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x ∈ K]

a

π⋆(a|x)(Qk(x, a)−Q

πk

k (x, a))]

︸ ︷︷ ︸BIAS-2

+

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x ∈ K]

a

πk(a|x)(Bk(x, ·)−Bk(x, ·)

)]

︸ ︷︷ ︸BIAS-3

+

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x ∈ K]

a

π⋆(a|x)(Bk(x, a)− Bk(x, a)

)]

︸ ︷︷ ︸BIAS-4

+

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x ∈ K]

a

(πk(·|x) − π⋆(·|x)

)(Qk(x, ·)− Bk(x, ·)

)]

︸ ︷︷ ︸REG-TERM

.

Now we combine the bounds in Lemma F.2, Lemma F.3, and Lemma F.4 (included after this proof). Suppose that

the conditions on the parameters specified in Lemma F.4 hold. We get

E[BIAS-1 + BIAS-2 + BIAS-3 + BIAS-4 + REG-TERM]

= O(H

η+

ηǫH4T

W+

ηH4

γ2T 2W+

γdH3T

βW+

ǫH3T

W

)

+

2+ 2ηH3

)∑

k

h

EXh∋x∼π⋆

[1[x ∈ K]

a

(π⋆(a|x) + πk(a|x))‖φ(x, a)‖2Σ+k,h

]

+1

H

k

h

EXh∋x∼π⋆

[∑

a

πk(a|x)Bk(x, a)

]

≤ O(H

η+

ηǫH4T

W+

ηH4

γ2T 2W+

γdH3T

βW+

ǫH3T

W

)

+∑

k

V π⋆

(x0; bk) +1

H

k

h

EXh∋x∼π⋆

[∑

a

πk(a|x)Bk(x, a)

]

where the last inequality is because β2 + 2ηH3 ≤ β (implied by η

β ≤ 120H4 , a condition specified in Lemma F.4).

Combining two cases and applying Lemma B.2, we thus have

E

(T−T0)/W∑

k=1

V πk(x0; ℓk)

(T−T0)/W∑

k=1

V π⋆

(x0; ℓk)

33

Page 34: Policy Optimization in Adversarial MDPs: Improved

≤ O

H

η+

ηǫH4T

W+

ηH4

γ2T 2W+

γdH3T

βW+

ǫH3T

W+ βE

(T−T0)/W∑

k=1

h

E(xh,a)∼πk

[‖φ(xh, a)‖2Σ+

k,h

]+

dH3T

αW

= O(H

η+

ηǫH4T

W+

ηH4

γ2T 2W+

γdH3T

βW+

ǫH3T

W+

βdHT

W+

dH3T

αW

). (by similar calculation as Eq. (40))

Finally, to get the overall regret, it remains to multiply the bound above by W , add the trivial bound HT0 =

2HM0N0 = O(

δ8ed10H11

β8

)for the initial T0 rounds, and consider the exploration probability δe, which leads to

E[Reg] = O(HW

η+ ηǫH4T +

ηH4

γ2T 2+

γdH3T

β+ ǫH3T + βdHT +

dH3T

α+

d10H11δ8eβ8

+ δeHT

)

= O(

H

ηǫ2γ3+

ηH4

γ2T 2+

γdH3T

β+ ǫH3T + βdHT +

dH3Tβ

δe+

d10H11δ8eβ8

+ δeHT

)

where we use the specified value of M , N , W = 2MN , α, and that ηH ≤ 1 (so that the second term ηǫH4T is

absorbed by the fifth term ǫH3T ).

Considering the constraints in Lemma F.4 and Lemma F.1, we choose γ = max{16ηH4, 4β2

δe

}. This gives the

following simplified regret

O(

1

ǫ2η4H11+

1

ηH4T 2+

ηdH7T

β+ ǫH3T + βdHT +

dH3Tβ

δe+

d10H11δ8eβ8

+ δeHT

).

Choosing δe optimally, and supposing η ≥ 1T , the above is simplified to

O(

1

ǫ2η4H11+

ηdH7T

β+ ǫH3T + βdHT +H2

√dβT + d2H

35/9T8/9

)

= O(

1

ǫ2η4H11+

ηdH7T

β+ ǫH3T +H2

√dβT + d2H

35/9T8/9

). (choosing β ≤ H2

d )

Picking optimal parameters in the last expression, we get O(d2H4T 14/15

).

Lemma F.2.

E[BIAS-1 + BIAS-2]

≤ β

4E

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x ∈ K]

a

(π⋆(a|x) + πk(a|x)

)‖φ(x, a)‖2Σ+

k,h

]+O

(γdH3T

βW+

ǫH3T

W

).

Proof. The proof of this lemma is similar to that of Lemma E.1, except that we replace T by (T −T0)/W , and consider

the averaged loss ℓk in an epoch instead of the single episode loss ℓt:

Ek

[Q

πk

k (x, a)− Qk(x, a)]

= φ(x, a)⊤(θπk

k,h − Ek

[θk,h

])

= φ(x, a)⊤

θ

πk

k,h − Ek

[Σ+

k,h

]Ek

1

|S′k|∑

t∈S′

k

((1 − Yt) + YtH1[h = h∗t ])φ(xt,h, at,h)Lt,h

(S′k is the S′ in Algorithm 5 within epoch k)

= φ(x, a)⊤

θ

πk

k,h −(γI +Σmix

k,h

)−1Ek

1

|S′k|∑

t∈S′

k

((1− Yt) + YtH1[h = h∗t ])φ(xt,h, at,h)Lt,h

+O(ǫH2)

(by Lemma D.1 and that ‖φ(x, a)‖ ≤ 1 for all x, a and Lt,h ≤ H ; Σmixk,h is defined in Eq. (41))

34

Page 35: Policy Optimization in Adversarial MDPs: Improved

= φ(x, a)⊤

θ

πk

k,h −(γI +Σmix

k,h

)−1Ek

1

|S′k|∑

t∈S′

k

Σmixk,hθ

πk

t,h

+O(ǫH2)

= φ(x, a)⊤

θ

πk

k,h −(γI +Σmix

k,h

)−1Ek

1

W

kW∑

t=(k−1)W+1

Σmixk,hθ

πk

t,h

+O(ǫH2)

(S′k is randomly chosen from epoch k)

= φ(x, a)⊤(θπk

k,h −(γI +Σmix

k,h

)−1Σmix

k,hθπk

k,h

)+O(ǫH2)

= γφ(x, a)⊤(γI +Σmix

k,h

)−1θπk

k,h +O(ǫH2

)

≤ β

4‖φ(x, a)‖2(γI+Σmix

t,h)−1 +

γ2

β

∥∥∥θπk

k,h

∥∥∥2

(γI+Σmixk,h

)−1+O(ǫH2) (AM-GM inequality)

≤ β

4Ek

[‖φ(x, a)‖2Σ+

k,h

]+

γdH2

β+O

(ǫH2

).

The same bound also holds for Ek

[Qk(x, a)−Q

πk

k (x, a)]

by the same reasoning. Taking expectation over x, summing

over k, h and a (with weights πk(a|x) and π⋆(a|x) respectively) finishes the proof.

Lemma F.3.

E[BIAS-3 + BIAS-4]

≤ β

4E

(T−T0)/W∑

k=1

h

EXh∋x∼π⋆

[1[x ∈ K]

a

(π⋆(a|x) + πk(a|x)

)‖φ(x, a)‖2Σ+

k,h

]+O

(γdH3T

βW+

ǫH3T

W

).

Proof. The proof is almost identical to that of the previous lemma. The only difference is that Lt,h is replaced by Dt,h

and θπk

t,h is replaced by Λπk

k,h (recall the definition of Λπk

k,h in Section 6). Note that bt(x, a) ∈ [0, 1] (Lemma F.1), so

Dt,h ∈ [0, He], which is also the same order for Lt,h. Therefore, we get the same bound as in the previous lemma.

Lemma F.4. Let ηγ ≤ 1

16H4 and ηβ ≤ 1

40H4 . Then

E[REG-TERM] = O(H

η+

ηǫH4T

W+

ηH4

γ2T 2W

)

+ 2ηH3E

k,h

EXh∋x∼π⋆

[1[x ∈ K]

a

πk(x, a)‖φ(x, a)‖2Σ+k,h

]+

1

HE

k,h

EXh∋x∼π⋆

[∑

a

πk(x, a)Bk(x, a)

] .

Proof. We first check the condition for Lemma A.4: η∣∣∣Qk(x, a)− Bt(x, a)

∣∣∣ ≤ 1. In our case,

η∣∣∣Qk(x, a)

∣∣∣ = η

∣∣∣∣∣φ(x, a)⊤Σ+

k,h

(1

|S′|∑

t∈S′

((1 − Yt) + YtH1[h = h∗t ])φ(xt,h, at,h)Lt,h

)∣∣∣∣∣

≤ η × ‖Σ+k,h‖op ×H × sup

t∈S′

Lt,h

≤ η × 1

γ×H2 (by Lemma D.1)

≤ 1

2(by the condition specified in the lemma)

and

η∣∣∣Bk(x, a)

∣∣∣ ≤ η |bk(x, a)|+ η

∣∣∣∣∣φ(x, a)⊤Σ+

k,h

(1

|S′|∑

t∈S′

((1 − Yt) + YtH1[h = h∗t ])φ(xt,h, at,h)Dt,h

)∣∣∣∣∣

35

Page 36: Policy Optimization in Adversarial MDPs: Improved

≤ η + η × ‖Σ+k,h‖op ×H × sup

t∈S′

Dt,h

≤ η + η × ‖Σ+k,h‖op ×H × (H − 1)

(1 +

1

H

)H

(Lemma F.1)

≤ η +3ηH2

γ≤ 4ηH2

γ

≤ 1

2H. (by the condition specified in the lemma)

Now we derive an upper bound for Ek

[Qk(x, a)

2]:

Ek

[Qk(x, a)

2]

≤ Ek

1

|S′k|∑

t∈S′

k

H2φ(x, a)⊤Σ+k,h

(((1− Yt) + YtH1[h = h∗

t ])2φ(xt,h, at,h)φ(xt,h, at,h)

⊤)Σ+

k,hφ(x, a)

(∗)

= Ek

[H2φ(x, a)⊤Σ+

k,h

((1− δe)Σk,h + δeHΣcov

h

)Σ+

k,hφ(x, a)]

≤ H3Ek

[φ(x, a)⊤Σ+

k,hΣmixk,hΣ

+k,hφ(x, a)

]

≤ H3Ek

[φ(x, a)⊤Σ+

k,hΣmixk,h(γI +Σmix

k,h)−1φ(x, a)

]+ O

(ǫH3 +

H3

γ2T 3

)(Lemma D.1)

≤ H3φ(x, a)⊤(γI +Σmixk,h)

−1Σmixk,h(γI +Σmix

k,h)−1φ(x, a) + O

(ǫH3 +

H3

γ2T 3

)(Lemma D.1)

≤ H3φ(x, a)⊤(γI +Σmixk,h)

−1φ(x, a) + O(ǫH3 +

H3

γ2T 3

)

= H3Ek

[‖φ(x, a)‖2

Σ+k,h

]+ O

(ǫH3 +

H3

γ2T 3

), (42)

where in (∗) we use(

1|S′

k|∑

t∈S′

kvt

)2≤ 1

|S′

k|∑

t∈S′

kv2t with vt = φ(x, a)⊤Σ+

k,h ((1− Yt) + YtH1[h = h∗t ])φ(xt,h, at,h)Lt,h.

Next, we bound Et

[Bt(x, a)

2]:

Ek

[Bk(x, a)

2]

≤ 2Ek

[bk(x, a)

2]+ 2Ek

[(φ(x, a)⊤Λk,h)

2]

≤ 2Ek[bk(x, a)] + 18H3Ek

[‖φ(x, a)‖2

Σ+k,h

]+ O

(ǫH3 +

H3

γ2T 3

)

≤ 20H3

βbk(x, a) + O

(ǫH3 +

H3

γ2T 3

),

where in the second inequality we bound Ek

[(φ(x, a)⊤Λk,h)

2]

similarly as we bound Ek

[Qk(x, a)

2]

in Eq. (42),

except that we replace the upper bound H for Lt,h by the upper bound for Dt,h: H(1 + 1

H

)Hsupt,x,a bt(x, a) ≤ 3H

(since bt(x, a) ≤ 1 by Lemma F.1).

Thus, by Lemma A.4, we have

E[REG-TERM]

≤ O(H

η

)+ 2η

k,h

EXh∋x∼π⋆

[1[x ∈ K]

a

πk(a|x)(Qk(x, a)2 + Bk(x, a)

2)

]

≤ O(H

η+

ηǫH4T

W+

ηH4

γ2T 2W

)

36

Page 37: Policy Optimization in Adversarial MDPs: Improved

+ 2ηH3E

k,h

EXh∋x∼π⋆

[1[x ∈ K]

a

πk(x, a)‖φ(x, a)‖2Σ+k,h

]+

40ηH3

βE

k,h

EXh∋x∼π⋆

[∑

a

πk(a|x)bk(x, a)]

≤ O(H

η+

ηǫH4T

W+

ηH4

γ2T 2W

)

+ 2ηH3E

k,h

EXh∋x∼π⋆

[1[x ∈ K]

a

πk(x, a)‖φ(x, a)‖2Σ+k,h

]+

1

HE

k,h

EXh∋x∼π⋆

[∑

a

πk(a|x)BONUSk(x, a)

]

where in the last inequality we use the conditions specified in the lemma and that Bk(x, a) ≥ bk(x, a).

37