Analysis of Reinforcement Learning Algorithms
Adriano L. Abrantes and Avinash N. Madavan
University of Illinois, Urbana-Champaign
Outline
- A brief introduction to reinforcement learning
- Introducing temporal difference (TD) learning
- Analysis of TD(0) algorithms
  - Understanding the deterministic problem
  - Evaluating the i.i.d. case
- A projected algorithm for TD(0)
- An alternative algorithm for TD(0)
- Extension to TD(λ)
What is reinforcement learning?
Goal: Maximize reward in a potentially unknown game.

Examples:
- Tic-tac-toe: easily enumerable, known state-action space.
- Chess/Go: enumerable, known state-action space.
- StarCraft: finite action space, an essentially innumerable state space, and imperfect information.

Existing techniques:
- Use complete knowledge to compute a min/max response (game theory) ... requires perfect information.
- Play games to completion and use the outcome to evaluate moves (Monte Carlo) ... requires a complete trajectory.
Formulating the reinforcement learning problem
Goal: Find a policy that maximizes the reward, R, in an environment driven by an MDP with state space S and transition kernel P.
Define:

V := value function, i.e., expected total accumulated reward.
µ := policy, i.e., an oracle that, given a state, returns an action.

Applying this notation, we can express the value function at state s ∈ S, under a fixed policy µ with discount factor γ, as

    Vµ(s) := E[ ∑_{k=0}^∞ γ^k R(s_k) | s_0 = s ] = R(s) + γ ∑_{s′∈S} P(s′|s) Vµ(s′).

Remark. The discount factor prioritizes better actions early on to prevent "saving the best for last" scenarios.
Understanding the reinforcement learning problem
Goal: Find a policy that maximizes the reward, R, in an environment driven by an MDP with state space S and transition kernel P.

For simplicity, assume n = |S| is finite. Then P is a matrix P and

    Vµ(s_i) = R(s_i) + γ ∑_{s_j∈S} P_ij Vµ(s_j).

Remark. For known P, this is a dynamic programming problem (see the sketch below).

Problems within reinforcement learning:
- How do we find the optimal policy? (exploration/exploitation)
- How do we tell if a policy is good?
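For a known, finite P the Bellman equation is linear in Vµ, so policy evaluation reduces to a linear solve. A minimal sketch, assuming a made-up 3-state chain (all numbers are illustrative):

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0, 2.0])            # hypothetical per-state rewards
P = np.array([[0.5, 0.5, 0.0],           # hypothetical transition matrix
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])

# Bellman equation in matrix form: V = R + gamma * P V,
# i.e., (I - gamma * P) V = R, directly solvable when P is known.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)   # exact value of each state under the fixed policy
```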
Mathematical preliminaries
The next slides cover some of the preliminary concepts we require to study convergence of the TD(0) algorithm. In particular, we will study

- approximation of the value function
- definition of a suitable inner product/norm
- the basic temporal difference (TD) algorithm
- necessary assumptions
Approximating the value function

Assume that we are given a set of feature vectors, {φ_k}_{k=1}^d. Then Vµ(s) can be approximated as a linear function,

    Vµ(s) ≈ V_θ(s) := φ(s)ᵀθ = [φ_1(s) φ_2(s) ... φ_d(s)] θ.

Collecting all the value functions in a vector, we have

    Vµ = [Vµ(s_1), ..., Vµ(s_n)]ᵀ ≈ V_θ = [φ(s_1)ᵀ; ...; φ(s_n)ᵀ] θ = Φθ.

Letting Π_D be the projection onto {Φx | x ∈ R^d}, we want to solve the projected Bellman equation,

    V_θ(s_i) = Π_D [ R(s_i) + γ ∑_{s_j∈S} P_ij V_θ(s_j) ].
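In the finite-state case this projection has the closed form Π_D = Φ(ΦᵀDΦ)⁻¹ΦᵀD, the D-weighted least-squares projection onto span(Φ). A small numpy sketch, with made-up sizes and a uniform stationary distribution assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2                          # assumed sizes: 5 states, 2 features
Phi = rng.standard_normal((n, d))    # rows are the feature vectors phi(s)^T
pi = np.full(n, 1.0 / n)             # assumed stationary distribution
D = np.diag(pi)

# D-weighted projection onto span(Phi): Pi_D = Phi (Phi^T D Phi)^{-1} Phi^T D
Pi_D = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

V = rng.standard_normal(n)           # an arbitrary value vector
V_proj = Pi_D @ V                    # its best D-norm approximation in span(Phi)
```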
Defining a suitable inner product space
For a symmetric positive definite matrix A, we define the inner product

    ⟨x, y⟩_A = xᵀAy

and the associated norm

    ‖x‖_A = √(xᵀAx).

Let D = diag(π(s_1), ..., π(s_n)) denote the matrix whose elements are given by the stationary distribution π(·). Then

    ‖V − V′‖_D = √( ∑_{s∈S} π(s)(V(s) − V′(s))² )

measures the mean-square difference between the value predictions under V and V′ in steady state.
Defining a suitable inner product space
We apply these definitions to define a norm on the space of parameter vectors:

    ‖V_θ − V_θ′‖_D = √( ∑_{s∈S} π(s)(φ(s)ᵀ(θ − θ′))² ) = ‖θ − θ′‖_Σ,

where

    Σ := ΦᵀDΦ = ∑_{s∈S} π(s)φ(s)φ(s)ᵀ

is the steady-state feature covariance matrix, for which we assume

    ω := λ_min(Σ) > 0.
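Σ and ω are directly computable once Φ and π are known; a minimal sketch reusing the assumed quantities from the previous snippet:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2                          # assumed sizes, as before
Phi = rng.standard_normal((n, d))
pi = np.full(n, 1.0 / n)             # assumed stationary distribution

# Steady-state feature covariance: Sigma = Phi^T D Phi = sum_s pi(s) phi(s) phi(s)^T
Sigma = Phi.T @ np.diag(pi) @ Phi
omega = np.linalg.eigvalsh(Sigma).min()   # omega = lambda_min(Sigma)
assert omega > 0   # holds when the features are linearly independent
```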
A graphical interpretation of TD(0) learning

[Figure: an initial state s_0, actions a_1, a_2, a_3 (selected by the policy µ), and final states s_1, s_2, s_3; labeled "Initial State / Actions / Final State".]
The basic TD(0) algorithm

Recall the Bellman equation,

    V_θ(s_i) = R(s_i) + γ ∑_{s_j∈S} P_ij V_θ(s_j).

At each iteration, minimize the sample Bellman loss,

    ‖V_θ(s_k) − (R(s_k) + γV_{θ_k}(s′_k))‖² = ‖φ(s_k)ᵀθ − R(s_k) − γφ(s′_k)ᵀθ_k‖²,    (1)

where the term in parentheses is the empirical (θ_k-based) target.

The TD(0) algorithm with linear approximation. For a given θ_0 and positive step-size sequence {α_k}, update θ as

    θ_{k+1} = θ_k − α_k [φ(s_k)ᵀθ_k − R(s_k) − γφ(s′_k)ᵀθ_k] φ(s_k),

where the term in brackets times φ(s_k) is the gradient of (1).
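A minimal runnable sketch of this update; `phi`, `reward`, and `sample_transition` are assumed caller-supplied hooks standing in for the features, the reward function, and the (unknown) transition kernel:

```python
import numpy as np

def td0(phi, reward, sample_transition, theta0, alphas, gamma, s0):
    """TD(0) with linear function approximation (sketch).

    phi(s): feature vector of state s; reward(s): R(s);
    sample_transition(s): draws s' ~ P(.|s) from the environment.
    """
    theta, s = theta0.astype(float), s0
    for alpha in alphas:
        s_next = sample_transition(s)
        # TD error: R(s_k) + gamma * phi(s'_k)^T theta_k - phi(s_k)^T theta_k
        delta = reward(s) + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * phi(s)   # step along the sample gradient of (1)
        s = s_next
    return theta
```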
Some basic assumptions
- We assume that the underlying Markov chain is irreducible and aperiodic, so that there are constants m > 0 and ρ ∈ (0, 1) such that

      sup_{s∈S} d_TV(P(s_k = · | s_0 = s), π) ≤ mρ^k,  ∀k ∈ N_0 := {0, 1, ...},

  and the mixing time may be written as

      τ(ε) = min{k ∈ N_0 | mρ^k ≤ ε}.

- Rewards are uniformly bounded, i.e.,

      ∃ r_max s.t. |R(s)| ≤ r_max  ∀s ∈ S.

- Feature vectors are linearly independent and normalized, i.e.,

      ‖φ(s)‖ ≤ 1,  ∀s ∈ S.
The challenge in analyzing TD
- The main challenge in TD analysis stems from the fact that the updates g_k(θ_k) are not true stochastic gradients with respect to any fixed objective.
- At time k, g_k(θ_k) pulls V_{θ_{k+1}}(s_k) towards the θ_k-based target

      R(s_k, s_{k+1}) + γV_{θ_k}(s_{k+1}).

Will this circular process converge?
The TD limit point
- Tsitsiklis and Van Roy (1997) characterize the TD limit point θ∗ as the unique solution to the projected Bellman equation:

      Φθ∗ = Π_D T_µ Φθ∗.

- Bhandari et al. (2018) approach TD analysis starting with simplified settings and incrementally extend the analysis to more complex cases:
  - noiseless case, or mean-path TD
  - i.i.d. noise, or the i.i.d. observation model
  - Markov noise
  - TD(λ)
Gradient descent on a value function loss
Consider the cost function

    f(θ) = ½‖V_θ∗ − V_θ‖²_D = ½‖θ∗ − θ‖²_Σ

and suppose we had access to ∇f(·) to perform gradient descent with the update

    θ_{k+1} = θ_k − α∇f(θ_k).

Then, we would write

    ‖θ∗ − θ_{k+1}‖₂² = ‖θ∗ − θ_k‖₂² + 2α∇f(θ_k)ᵀ(θ∗ − θ_k) + α²‖∇f(θ_k)‖₂².    (2)
Gradient descent on a value function loss

Using

    ∇f(θ)ᵀ(θ∗ − θ) = −‖θ∗ − θ‖²_Σ = −‖V_θ∗ − V_θ‖²_D    (3)

and

    ‖∇f(θ)‖₂ ≤ ‖V_θ∗ − V_θ‖_D,    (4)

we can set α = 1 in (2) to establish

    ‖θ∗ − θ_{k+1}‖₂² ≤ ‖θ∗ − θ_k‖₂² − ‖V_θ∗ − V_θ_k‖²_D
                     ≤ (1 − ω)‖θ∗ − θ_k‖₂² ≤ ... ≤ (1 − ω)^{k+1}‖θ∗ − θ_0‖₂²,    (5)

where the second inequality uses ‖V_θ∗ − V_θ_k‖²_D ≥ ω‖θ∗ − θ_k‖₂².

Moreover, if we work with the averaged iterate θ̄_K = K⁻¹ ∑_{k=0}^{K−1} θ_k, we get a bound that does not depend on ω:

    ‖V_θ∗ − V_θ̄_K‖²_D ≤ (1/K) ∑_{k=0}^{K−1} ‖V_θ∗ − V_θ_k‖²_D ≤ ‖θ∗ − θ_0‖₂²/K.    (6)
The deterministic case (mean-path TD)
The mean-path update direction is

    g(θ) := ∑_{s,s′∈S} π(s)P(s′|s)(R(s, s′) + γφ(s′)ᵀθ − φ(s)ᵀθ)φ(s).

Lemma 2 [Tsitsiklis and Van Roy (1997)].

    g(θ)ᵀ(θ∗ − θ) > 0  ∀θ ≠ θ∗.

Lemma 3. For any θ ∈ R^d,

    g(θ)ᵀ(θ∗ − θ) ≥ (1 − γ)‖V_θ∗ − V_θ‖²_D.
The deterministic case (mean-path TD)
Lemma 4. For any θ ∈ R^d,

    ‖g(θ)‖₂ ≤ 2‖V_θ∗ − V_θ‖_D.

Consider the expansion

    ‖θ∗ − θ_{k+1}‖₂² = ‖θ∗ − θ_k‖₂² − 2αg(θ_k)ᵀ(θ∗ − θ_k) + α²‖g(θ_k)‖₂²,

where Lemma 3 lower-bounds the cross term and Lemma 4 upper-bounds the last term. Setting α = (1 − γ)/4 yields

    ‖θ∗ − θ_{k+1}‖₂² ≤ ‖θ∗ − θ_k‖₂² − ((1 − γ)²/4)‖V_θ∗ − V_θ_k‖²_D.
The deterministic case (mean-path TD)
Theorem 1. Consider a sequence of parameters (θ_0, θ_1, ...) obeying the recursion

    θ_{k+1} = θ_k + αg(θ_k),  k ∈ {0, 1, 2, ...},

where α = (1 − γ)/4. Then

    ‖θ∗ − θ_T‖₂² ≤ exp{−((1 − γ)²ω/4)T}‖θ∗ − θ_0‖₂²    (7)

and, for the averaged iterate θ̄_T,

    ‖V_θ∗ − V_θ̄_T‖²_D ≤ 4‖θ∗ − θ_0‖₂²/(T(1 − γ)²),    (8)

where (7) is analogous to (5) and (8) is analogous to (6).
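Mean-path TD replaces each random sample by its stationary expectation, so g(θ) can be computed exactly for a finite MDP. A sketch under assumed array conventions (rows of Phi are φ(s)ᵀ, R holds rewards R(s, s′)):

```python
import numpy as np

def mean_path_g(theta, Phi, P, pi, R, gamma):
    """Exact mean-path TD direction g(theta) for a finite MDP (sketch)."""
    g = np.zeros_like(theta, dtype=float)
    n = Phi.shape[0]
    for s in range(n):
        for s_next in range(n):
            td_err = R[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
            g += pi[s] * P[s, s_next] * td_err * Phi[s]
    return g

def mean_path_td(Phi, P, pi, R, gamma, theta0, num_iters):
    """Deterministic recursion theta_{k+1} = theta_k + alpha * g(theta_k)."""
    alpha = (1.0 - gamma) / 4.0          # step size from Theorem 1
    theta = theta0.astype(float)
    for _ in range(num_iters):
        theta += alpha * mean_path_g(theta, Phi, P, pi, R, gamma)
    return theta
```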
The case with i.i.d. observations
Assume that the tuples O_k = (s_k, r_k, s′_k) observed by the TD algorithm are i.i.d. samples from the stationary distribution:

    P[(s_k, r_k, s′_k) = (s, R(s, s′), s′)] = π(s)P(s′|s).

To analyze this case, we extend Lemma 4 of Bhandari et al. to obtain a bound on the expected norm of the stochastic gradient:

Lemma 5. For any fixed θ ∈ R^d,

    E[‖g_k(θ)‖₂²] ≤ 2σ² + 8‖V_θ − V_θ∗‖²_D,

where σ² = E[‖g_k(θ∗)‖₂²].
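Under this observation model each update uses a fresh pair (s, s′) drawn from the stationary distribution, independent of the past. A minimal sketch (array conventions as in the mean-path snippet):

```python
import numpy as np

def td0_iid(Phi, P, pi, R, gamma, alphas, theta0, seed=0):
    """TD(0) under the i.i.d. observation model (sketch).

    Each step draws s ~ pi and s' ~ P(.|s) independently of the past,
    which is exactly the assumption used in the i.i.d. analysis.
    """
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float)
    n = len(pi)
    for alpha in alphas:
        s = rng.choice(n, p=pi)
        s_next = rng.choice(n, p=P[s])
        delta = R[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta += alpha * delta * Phi[s]
    return theta
```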
The case with i.i.d. observations
Expanding the square, we have

    ‖θ∗ − θ_{k+1}‖₂² = ‖θ∗ − θ_k‖₂² − 2α_k g_k(θ_k)ᵀ(θ∗ − θ_k) + α_k²‖g_k(θ_k)‖₂².    (9)

Thus, taking the expectation,

    E[‖θ∗ − θ_{k+1}‖₂²]
      = E[‖θ∗ − θ_k‖₂²] − 2α_k E[g_k(θ_k)ᵀ(θ∗ − θ_k)] + α_k² E[‖g_k(θ_k)‖₂²]
      = E[‖θ∗ − θ_k‖₂²] − 2α_k E[E[g_k(θ_k)ᵀ(θ∗ − θ_k) | θ_k]] + α_k² E[E[‖g_k(θ_k)‖₂² | θ_k]]
      ≤ E[‖θ∗ − θ_k‖₂²] − (2α_k(1 − γ) − 8α_k²) E[‖V_θ∗ − V_θ_k‖²_D] + 2α_k²σ²
      ≤ E[‖θ∗ − θ_k‖₂²] − α_k(1 − γ) E[‖V_θ∗ − V_θ_k‖²_D] + 2α_k²σ²,    (10)

where the last inequality holds for α_k ≤ (1 − γ)/8. (10) is used along with different step-size choices to provide finite-time TD bounds under the i.i.d. assumption.
The case with i.i.d. observations
Theorem 2. Suppose TD is applied under the i.i.d. observation model. Then

(a) For any T ≥ (8/(1 − γ))² and a constant step-size α = 1/√T,

    E[‖V_θ∗ − V_θ̄_T‖²_D] ≤ (‖θ∗ − θ_0‖₂² + 2σ²)/(√T(1 − γ)).

(b) For any constant step-size α ≤ ω(1 − γ)/8,

    E[‖θ∗ − θ_T‖₂²] ≤ exp{−α(1 − γ)ωT}‖θ∗ − θ_0‖₂² + α(2σ²/((1 − γ)ω)).

(c) For a decaying step-size α_k = β/(λ + k) with β = 2/((1 − γ)ω) and λ = 16/((1 − γ)²ω),

    E[‖θ∗ − θ_T‖₂²] ≤ ν/(λ + T),  where ν = max{8σ²/((1 − γ)²ω²), 16‖θ∗ − θ_0‖₂²/((1 − γ)²ω)}.
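The Theorem 2(c) schedule is straightforward to generate; a sketch, with ω and γ assumed known and the call at the end purely illustrative:

```python
import numpy as np

def theorem2c_steps(omega, gamma, num_steps):
    """Decaying step sizes alpha_k = beta / (lambda + k) from Theorem 2(c)."""
    beta = 2.0 / ((1.0 - gamma) * omega)
    lam = 16.0 / ((1.0 - gamma) ** 2 * omega)
    return beta / (lam + np.arange(num_steps))

alphas = theorem2c_steps(omega=0.1, gamma=0.9, num_steps=1000)  # made-up values
```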
On the i.i.d. observations assumption

The approach based on i.i.d. observations is unrealistic: the observed tuples O_k = (s_k, r_k = R(s_k, s_{k+1}), s_{k+1}) do not stem from a Markov chain sample path, and therefore the analysis ignores the possibly strong dependence between θ_k and O_k.

Let

    h(θ, O_k) := g_k(θ) = (r_k + γφ(s_{k+1})ᵀθ − φ(s_k)ᵀθ)φ(s_k)

and note that we defined

    g(θ) = E[h(θ, O_k)].

If we consider θ_k to be a function of {O_1, ..., O_{k−1}}, then

    g(θ) ≠ E[h(θ, O_k) | θ_k = θ]

and there is bias in the gradient evaluation:

    E[h(θ_k, O_k) − g(θ_k)] ≠ 0.
A projected TD(0) algorithm
- To control the gradient bias, Bhandari et al. work with a projected version of TD:

      θ_{k+1} = Π_{2,R}(θ_k + α_k g_k(θ_k)),

  where

      Π_{2,R}(θ) = argmin_{θ′:‖θ′‖₂≤R} ‖θ − θ′‖₂.

- The projection results in bounds on the gradient norms:

      ‖g_k(θ)‖₂ ≤ r_max + 2‖θ‖₂,  ∀θ ∈ R^d.

  Letting G := r_max + 2R, we then have

      ‖g_k(θ)‖₂ ≤ G,  ∀θ ∈ Θ_R = {θ ∈ R^d : ‖θ‖₂ ≤ R}.
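The projection onto the ℓ₂-ball has a simple closed form (rescale whenever the norm exceeds R), so a projected TD(0) step is a one-line change over the basic update. A sketch with assumed feature-vector inputs:

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Euclidean projection onto {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def projected_td0_step(theta, phi_s, phi_s_next, r, alpha, gamma, radius):
    """One projected TD(0) update: theta <- Pi_{2,R}(theta + alpha * g_k(theta))."""
    delta = r + gamma * phi_s_next @ theta - phi_s @ theta   # TD error
    return project_l2_ball(theta + alpha * delta * phi_s, radius)
```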
Analysis of Projected TD(0)

Define the "gradient evaluation error"

    ζ_k(θ) := (g_k(θ) − g(θ))ᵀ(θ − θ∗),  ∀θ ∈ Θ_R.

Then

    ‖θ∗ − θ_{k+1}‖₂² = ‖θ∗ − Π_{2,R}(θ_k + α_k g_k(θ_k))‖₂²
      = ‖Π_{2,R}(θ∗) − Π_{2,R}(θ_k + α_k g_k(θ_k))‖₂²
      ≤ ‖θ∗ − θ_k − α_k g_k(θ_k)‖₂²
      = ‖θ∗ − θ_k‖₂² − 2α_k g_k(θ_k)ᵀ(θ∗ − θ_k) + α_k²‖g_k(θ_k)‖₂²
      ≤ ‖θ∗ − θ_k‖₂² − 2α_k g_k(θ_k)ᵀ(θ∗ − θ_k) + α_k²G²
      = ‖θ∗ − θ_k‖₂² − 2α_k g(θ_k)ᵀ(θ∗ − θ_k) + 2α_k ζ_k(θ_k) + α_k²G²
      ≤ ‖θ∗ − θ_k‖₂² − 2α_k(1 − γ)‖V_θ∗ − V_θ_k‖²_D + 2α_k ζ_k(θ_k) + α_k²G².    (11)
Analysis of Projected TD(0)
Taking the expectation of (11) and assuming a fixed α:

    E[‖θ∗ − θ_{k+1}‖₂²]
      ≤ E[‖θ∗ − θ_k‖₂²] − 2α(1 − γ)E[‖V_θ∗ − V_θ_k‖²_D] + E[2αζ_k(θ_k)] + α²G²
      ≤ E[‖θ∗ − θ_k‖₂²] − 2α(1 − γ)E[‖V_θ∗ − V_θ_k‖²_D] + α²(5 + 6τ(α))G²,

where the last inequality follows from an upper bound on the gradient bias:

    E[2αζ_k(θ_k)] ≤ α²(4 + 6τ(α))G².
Finite-time bounds on Projected TD(0)
Theorem 3. Suppose the projected TD(0) algorithm is applied with parameter R ≥ ‖θ∗‖₂ and mixing-time function τ(·). Set G = r_max + 2R. Then

(a) With a constant step-size α = 1/√K,

    E[‖V_θ∗ − V_θ̄_K‖²_D] ≤ (‖θ∗ − θ_0‖₂² + G²(9 + 12τ(1/√K)))/(2√K(1 − γ)).

(b) For any constant step-size α ≤ 1/(2ω(1 − γ)),

    E[‖θ∗ − θ_K‖₂²] ≤ e^{−2α(1−γ)ωK}‖θ∗ − θ_0‖₂² + α(G²(9 + 12τ(α))/(2(1 − γ)ω)).

(c) For a decaying step-size α_k = 1/(ω(k + 1)(1 − γ)),

    E[‖V_θ∗ − V_θ̄_K‖²_D] ≤ G²(9 + 24τ(α_K))(1 + log K)/(K(1 − γ)²ω).
An alternative TD(0) algorithm
We will now apply a control-theoretic Lyapunov drift analysis.

Theorem. For any k ≥ τ and α such that κ_1ατ + αγ_max ≤ 0.05, we have the following finite-time bound:

    E[‖θ_k − θ∗‖²] ≤ κ_Q (1 − 0.9α/γ_max)^{k−τ} (1.5‖θ_0 − θ∗‖ + 0.5r_max)² + (κ_2 κ_Q/0.9)ατ,

where

    κ_1 = 62γ_max(1 + r_max),  κ_2 = 2(55γ_max(1 + r_max)³ + γ_max r²_max).

- The values γ_min, γ_max, κ_Q denote the minimum eigenvalue, maximum eigenvalue, and condition number of the matrix Q defining the Lyapunov function.
- Assume, without loss of generality, that θ∗ = 0.
Understanding the continuous-time dynamics
Lemma [Tsitsiklis and Van Roy (1997)]. Under a diminishing step-size scheme, the discrete-time dynamics track the ODE

    θ̇ = −ΦᵀD[I − γP]Φθ + ΦᵀD E[R],

where R := [R(s_1) ... R(s_n)]ᵀ.

The Lyapunov function is chosen with standard control-theory techniques as

    W(θ_k) = θ_kᵀQθ_k,

where Q satisfies the Lyapunov equation of the continuous-time dynamics, i.e.,

    AᵀQ + QA = −I,  A = −ΦᵀD[I − γP]Φ.

Under the regularity assumptions on φ, A is Hurwitz.
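Given A, the Lyapunov equation is a standard solve; scipy's continuous-Lyapunov solver handles AᵀQ + QA = −I directly. A sketch with a stand-in A assumed to be Hurwitz:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
d = 3
A = -np.eye(d) + 0.1 * rng.standard_normal((d, d))  # stand-in, assumed Hurwitz

# solve_continuous_lyapunov(a, q) solves a X + X a^T = q;
# with a = A.T this is exactly A^T Q + Q A = -I.
Q = solve_continuous_lyapunov(A.T, -np.eye(d))

eigs = np.linalg.eigvalsh((Q + Q.T) / 2)   # symmetrize against round-off
gamma_min, gamma_max = eigs[0], eigs[-1]
kappa_Q = gamma_max / gamma_min            # condition number used in the bound
```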
Understanding Lyapunov drift in continuous time

The unconditioned drift function at steady state satisfies

    E[W(θ_{k+1}) − W(θ_k)] = E[θ_{k+1}ᵀQθ_{k+1} − θ_kᵀQθ_k] = 0.

Taking the Taylor expansion with an appropriate choice of θ̃ yields

    E[∇ᵀW(θ_k)(θ_{k+1} − θ_k) + ½(θ_{k+1} − θ_k)ᵀ∇²W(θ̃)(θ_{k+1} − θ_k)] = 0,

where the Hessian ∇²W is constant (proportional to Q).

Select W(θ_k) according to Stein's method,

    ∇ᵀW(θ_k)E[θ_{k+1} − θ_k | θ_k] = −‖θ_k‖².

Requiring Stein's equation to hold for all θ_k results in the Lyapunov equation from before,

    AᵀQ + QA = −I.
A series of useful lemmas

1. Bound the error during Markov chain mixing.

   Lemma.

       ‖θ_τ − θ_0‖ ≤ 2ατ‖θ_0‖ + 2ατ r_max,
       ‖θ_τ − θ_0‖ ≤ 4ατ‖θ_τ‖ + 4ατ r_max,
       ‖θ_τ − θ_0‖² ≤ 32α²τ²‖θ_τ‖² + 32α²τ² r²_max.

2. Bound the one-step error.

   Lemma. For all k ≥ 0,

       ‖θ_{k+1} − θ_k‖²_Q ≤ 2α²γ_max‖θ_k‖² + 2α²γ_max r²_max.

3. Bound Stein's equation for the drift process. Recall that in steady-state continuous time,

       ∇ᵀW(θ_k)E[θ_{k+1} − θ_k | θ_k] = 0.

   Lemma. For any k ≥ τ,

       |E[θ_kᵀQ(Aθ_k − (1/α)(θ_{k+1} − θ_k)) | θ_{k−τ}, s_{k−τ}, s′_{k−τ}]|
         ≤ κ_1 ατ E[‖θ_k‖² | θ_{k−τ}] + κ_2 ατ,

   where Aθ_k − (1/α)(θ_{k+1} − θ_k) is the drift from steady state and

       κ_1 = 62γ_max(1 + r_max),  κ_2 = 55γ_max(1 + r_max)³.

4. Bound the discrete-time Lyapunov drift.

   Lemma. For any k ≥ τ and α such that κ_1ατ + αγ_max ≤ 0.05,

       E[W(θ_{k+1})] ≤ (1 − 0.9α/γ_max)E[W(θ_k)] + κ_2α²τ,

   where κ_2 is updated to 2(κ_2 + γ_max r²_max), with κ_2 = 55γ_max(1 + r_max)³ taken from the previous lemma.

5. Taking the summation over k yields the result,

       E[‖θ_k‖²] ≤ κ_Q (1 − 0.9α/γ_max)^{k−τ} (1.5‖θ_0‖ + 0.5r_max)² + (κ_2 κ_Q/0.9)ατ.
A conceptual comparison of the algorithms
Information theoretic approach (1st paper)

- Lyapunov function: W_1 = ‖V_θ − V_θ∗‖²_D = ‖θ − θ∗‖²_Σ.
  - Σ is the steady-state feature covariance matrix, ΦᵀDΦ.
- Bounds the total gradient noise.

Control theoretic approach (2nd paper)

- Lyapunov function: W_2 = ‖θ − θ∗‖²_Q.
  - Q comes from solving the Lyapunov equation for the steady-state continuous-time dynamics,

        AᵀQ + QA = −I,  A = −ΦᵀD(I − γP)Φ.

- Bounds the Lyapunov drift.
A comparison of the algorithm guarantees
For the constant step-size case,
Information theoretic approach (1st paper)

    E[‖θ∗ − θ_k‖²] ≤ (1 − α(1 − γ)ω)^k ‖θ∗ − θ_0‖² + α(2σ²/((1 − γ)ω))

Control theoretic approach (2nd paper)

    E[‖θ_k − θ∗‖²] ≤ κ_Q (1 − 0.9α/γ_max)^{k−τ} (1.5‖θ_0 − θ∗‖ + 0.5r_max)² + (κ_2 κ_Q/0.9)ατ
The TD(λ) algorithm
Definition (eligibility trace). An eligibility trace is a geometrically weighted average of the feature vectors at all the previously visited states, given by

    ψ_k = (γλ)ψ_{k−1} + φ(s_k).

Notice that for λ = 0 this only updates based on the current state, whereas for λ = 1 it accumulates the discounted feature vectors of all past states.

The TD(λ) algorithm with linear approximation. For a given θ_0 and positive step-size sequence {α_k}, update θ as

    θ_{k+1} = θ_k − α_k [φ(s_k)ᵀθ_k − R(s_k) − γφ(s′_k)ᵀθ_k] ψ_k.

Remark. Both styles of analysis can be extended to provide finite-time guarantees for the TD(λ) case.
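A minimal sketch of TD(λ), reusing the assumed `phi`, `reward`, and `sample_transition` hooks from the TD(0) sketch; the only change is that the update is applied along the eligibility trace ψ instead of φ(s_k):

```python
import numpy as np

def td_lambda(phi, reward, sample_transition, theta0, alphas, gamma, lam, s0):
    """TD(lambda) with linear approximation and an eligibility trace (sketch)."""
    theta, s = theta0.astype(float), s0
    psi = np.zeros_like(theta)                 # eligibility trace, psi_{-1} = 0
    for alpha in alphas:
        s_next = sample_transition(s)
        psi = (gamma * lam) * psi + phi(s)     # psi_k = (gamma*lam) psi_{k-1} + phi(s_k)
        delta = reward(s) + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * psi           # lam = 0 recovers TD(0)
        s = s_next
    return theta
```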
References
1. Jalaj Bhandari, Daniel Russo, and Raghav Singal. "A finite time analysis of temporal difference learning with linear function approximation". arXiv preprint arXiv:1806.02450, 2018.
2. R. Srikant and Lei Ying. "Finite-time error bounds for linear stochastic approximation and TD learning". arXiv preprint arXiv:1902.00923, 2019.
3. John N. Tsitsiklis and Benjamin Van Roy. "Analysis of temporal-difference learning with function approximation". In: Advances in Neural Information Processing Systems, 1997, pp. 1075–1081.
Thank you!