Analysis of Reinforcement Learning Algorithms
Adriano L. Abrantes and Avinash N. Madavan
University of Illinois, Urbana-Champaign
Outline
- A brief introduction to reinforcement learning
- Introducing temporal difference (TD) learning
- Analysis of TD(0) algorithms
  - Understanding the deterministic problem
  - Evaluating the i.i.d. case
- A projected algorithm for TD(0)
- An alternative algorithm for TD(0)
- Extension to TD(λ)
What is reinforcement learning?
Goal: Maximize reward in a potentially unknown game.

Examples:
- Tic-tac-toe: easily enumerable, known state-action space.
- Chess/Go: enumerable, known state-action space.
- StarCraft: finite action space, an essentially innumerable state space, and imperfect information.

Existing techniques:
- Use complete knowledge to compute a min/max response (game theory) ... requires perfect information.
- Play games to completion and use the outcome to evaluate moves (Monte Carlo) ... requires a complete trajectory.
Formulating the reinforcement learning problem
Goal: Find a policy that maximizes the reward, R, in an environment driven by an MDP with state space S and transition kernel P.
Define:

V := value function, i.e., expected total accumulated reward.
µ := policy, i.e., an oracle that, given a state, returns an action.

Applying this notation, we can express the value function at state s ∈ S, under a fixed policy µ with discount factor γ, as

    Vµ(s) := E[ ∑_{k=0}^∞ γ^k R(s_k) | s_0 = s ] = R(s) + γ ∑_{s′∈S} P(s′|s) Vµ(s′).

Remark. The discount factor prioritizes better actions early on to prevent "saving the best for last" scenarios.
Understanding the reinforcement learning problem
Goal: Find a policy that maximizes the reward, R, in an environment driven by an MDP with state space S and transition kernel P.

For simplicity, assume n = |S| is finite. Then P is a matrix P and

    Vµ(s_i) = R(s_i) + γ ∑_{s_j∈S} P_ij Vµ(s_j).

Remark. For known P, this is a dynamic programming problem (see the sketch below).

Problems within reinforcement learning:
- How do we find the optimal policy? (exploration/exploitation)
- How do we tell if a policy is good?
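For a known, finite P the Bellman equation is linear in Vµ, so policy evaluation reduces to a linear solve. A minimal sketch, assuming a made-up 3-state chain (all numbers are illustrative):

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, 0.0, 2.0])            # hypothetical per-state rewards
P = np.array([[0.5, 0.5, 0.0],           # hypothetical transition matrix
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])

# Bellman equation in matrix form: V = R + gamma * P V,
# i.e., (I - gamma * P) V = R, directly solvable when P is known.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)   # exact value of each state under the fixed policy
```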
Mathematical preliminaries
The next slides cover some of the preliminary concepts we require to study convergence of the TD(0) algorithm. In particular, we will study

- approximation of the value function
- definition of a suitable inner product/norm
- the basic temporal difference (TD) algorithm
- necessary assumptions
Approximating the value function

Assume that we are given a set of feature vectors, {φ_k}_{k=1}^d. Then Vµ(s) can be approximated as a linear function,

    Vµ(s) ≈ V_θ(s) := φ(s)ᵀθ = [φ_1(s) φ_2(s) ... φ_d(s)] θ.

Collecting all the value functions in a vector, we have

    Vµ = [Vµ(s_1), ..., Vµ(s_n)]ᵀ ≈ V_θ = [φ(s_1)ᵀ; ...; φ(s_n)ᵀ] θ = Φθ.

Letting Π_D be the projection onto {Φx | x ∈ R^d}, we want to solve the projected Bellman equation,

    V_θ(s_i) = Π_D [ R(s_i) + γ ∑_{s_j∈S} P_ij V_θ(s_j) ].
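In the finite-state case this projection has the closed form Π_D = Φ(ΦᵀDΦ)⁻¹ΦᵀD, the D-weighted least-squares projection onto span(Φ). A small numpy sketch, with made-up sizes and a uniform stationary distribution assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2                          # assumed sizes: 5 states, 2 features
Phi = rng.standard_normal((n, d))    # rows are the feature vectors phi(s)^T
pi = np.full(n, 1.0 / n)             # assumed stationary distribution
D = np.diag(pi)

# D-weighted projection onto span(Phi): Pi_D = Phi (Phi^T D Phi)^{-1} Phi^T D
Pi_D = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

V = rng.standard_normal(n)           # an arbitrary value vector
V_proj = Pi_D @ V                    # its best D-norm approximation in span(Phi)
```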
Defining a suitable inner product space
For a symmetric positive definite matrix A, we define the inner product

    ⟨x, y⟩_A = xᵀAy

and the associated norm

    ‖x‖_A = √(xᵀAx).

Let D = diag(π(s_1), ..., π(s_n)) denote the matrix whose elements are given by the stationary distribution π(·). Then

    ‖V − V′‖_D = √( ∑_{s∈S} π(s)(V(s) − V′(s))² )

measures the mean-square difference between the value predictions under V and V′ in steady state.
Defining a suitable inner product space
We apply these definitions to define a norm on the space of parameter vectors:

    ‖V_θ − V_θ′‖_D = √( ∑_{s∈S} π(s)(φ(s)ᵀ(θ − θ′))² ) = ‖θ − θ′‖_Σ,

where

    Σ := ΦᵀDΦ = ∑_{s∈S} π(s)φ(s)φ(s)ᵀ

is the steady-state feature covariance matrix, for which we assume

    ω := λ_min(Σ) > 0.
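Σ and ω are directly computable once Φ and π are known; a minimal sketch reusing the assumed quantities from the previous snippet:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2                          # assumed sizes, as before
Phi = rng.standard_normal((n, d))
pi = np.full(n, 1.0 / n)             # assumed stationary distribution

# Steady-state feature covariance: Sigma = Phi^T D Phi = sum_s pi(s) phi(s) phi(s)^T
Sigma = Phi.T @ np.diag(pi) @ Phi
omega = np.linalg.eigvalsh(Sigma).min()   # omega = lambda_min(Sigma)
assert omega > 0   # holds when the features are linearly independent
```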
A graphical interpretation of TD(0) learning

[Figure: an initial state s_0, actions a_1, a_2, a_3 (selected by the policy µ), and final states s_1, s_2, s_3; labeled "Initial State / Actions / Final State".]
The basic TD(0) algorithm

Recall the Bellman equation,

    V_θ(s_i) = R(s_i) + γ ∑_{s_j∈S} P_ij V_θ(s_j).

At each iteration, minimize the sample Bellman loss,

    ‖V_θ(s_k) − (R(s_k) + γV_{θ_k}(s′_k))‖² = ‖φ(s_k)ᵀθ − R(s_k) − γφ(s′_k)ᵀθ_k‖²,    (1)

where the term in parentheses is the empirical (θ_k-based) target.

The TD(0) algorithm with linear approximation. For a given θ_0 and positive step-size sequence {α_k}, update θ as

    θ_{k+1} = θ_k − α_k [φ(s_k)ᵀθ_k − R(s_k) − γφ(s′_k)ᵀθ_k] φ(s_k),

where the term in brackets times φ(s_k) is the gradient of (1).
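A minimal runnable sketch of this update; `phi`, `reward`, and `sample_transition` are assumed caller-supplied hooks standing in for the features, the reward function, and the (unknown) transition kernel:

```python
import numpy as np

def td0(phi, reward, sample_transition, theta0, alphas, gamma, s0):
    """TD(0) with linear function approximation (sketch).

    phi(s): feature vector of state s; reward(s): R(s);
    sample_transition(s): draws s' ~ P(.|s) from the environment.
    """
    theta, s = theta0.astype(float), s0
    for alpha in alphas:
        s_next = sample_transition(s)
        # TD error: R(s_k) + gamma * phi(s'_k)^T theta_k - phi(s_k)^T theta_k
        delta = reward(s) + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * phi(s)   # step along the sample gradient of (1)
        s = s_next
    return theta
```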
Some basic assumptions
- We assume that the underlying Markov chain is irreducible and aperiodic, so that there are constants m > 0 and ρ ∈ (0, 1) such that

      sup_{s∈S} d_TV(P(s_k = · | s_0 = s), π) ≤ mρ^k,  ∀k ∈ N_0 := {0, 1, ...},

  and the mixing time may be written as

      τ(ε) = min{k ∈ N_0 | mρ^k ≤ ε}.

- Rewards are uniformly bounded, i.e.,

      ∃ r_max s.t. |R(s)| ≤ r_max  ∀s ∈ S.

- Feature vectors are linearly independent and normalized, i.e.,

      ‖φ(s)‖ ≤ 1,  ∀s ∈ S.
The challenge in analyzing TD
- The main challenge in TD analysis stems from the fact that the updates g_k(θ_k) are not true stochastic gradients with respect to any fixed objective.
- At time k, g_k(θ_k) pulls V_{θ_{k+1}}(s_k) towards the θ_k-based target

      R(s_k, s_{k+1}) + γV_{θ_k}(s_{k+1}).

Will this circular process converge?
The TD limit point
- Tsitsiklis and Van Roy (1997) characterize the TD limit point θ∗ as the unique solution to the projected Bellman equation:

      Φθ∗ = Π_D T_µ Φθ∗.

- Bhandari et al. (2018) approach TD analysis starting with simplified settings and incrementally extend the analysis to more complex cases:
  - noiseless case, or mean-path TD
  - i.i.d. noise, or the i.i.d. observation model
  - Markov noise
  - TD(λ)
Gradient descent on a value function loss
Consider the cost function

    f(θ) = ½‖V_θ∗ − V_θ‖²_D = ½‖θ∗ − θ‖²_Σ

and suppose we had access to ∇f(·) to perform gradient descent with the update

    θ_{k+1} = θ_k − α∇f(θ_k).

Then, we would write

    ‖θ∗ − θ_{k+1}‖₂² = ‖θ∗ − θ_k‖₂² + 2α∇f(θ_k)ᵀ(θ∗ − θ_k) + α²‖∇f(θ_k)‖₂².    (2)
Gradient descent on a value function loss

Using

    ∇f(θ)ᵀ(θ∗ − θ) = −‖θ∗ − θ‖²_Σ = −‖V_θ∗ − V_θ‖²_D    (3)

and

    ‖∇f(θ)‖₂ ≤ ‖V_θ∗ − V_θ‖_D,    (4)

we can set α = 1 in (2) to establish

    ‖θ∗ − θ_{k+1}‖₂² ≤ ‖θ∗ − θ_k‖₂² − ‖V_θ∗ − V_θ_k‖²_D
                     ≤ (1 − ω)‖θ∗ − θ_k‖₂² ≤ ... ≤ (1 − ω)^{k+1}‖θ∗ − θ_0‖₂²,    (5)

where the second inequality uses ‖V_θ∗ − V_θ_k‖²_D ≥ ω‖θ∗ − θ_k‖₂².

Moreover, if we work with the averaged iterate θ̄_K = K⁻¹ ∑_{k=0}^{K−1} θ_k, we get a bound that does not depend on ω:

    ‖V_θ∗ − V_θ̄_K‖²_D ≤ (1/K) ∑_{k=0}^{K−1} ‖V_θ∗ − V_θ_k‖²_D ≤ ‖θ∗ − θ_0‖₂²/K.    (6)
The deterministic case (mean-path TD)
The mean-path update direction is

    g(θ) := ∑_{s,s′∈S} π(s)P(s′|s)(R(s, s′) + γφ(s′)ᵀθ − φ(s)ᵀθ)φ(s).

Lemma 2 [Tsitsiklis and Van Roy (1997)].

    g(θ)ᵀ(θ∗ − θ) > 0  ∀θ ≠ θ∗.

Lemma 3. For any θ ∈ R^d,

    g(θ)ᵀ(θ∗ − θ) ≥ (1 − γ)‖V_θ∗ − V_θ‖²_D.
The deterministic case (mean-path TD)
Lemma 4. For any θ ∈ R^d,

    ‖g(θ)‖₂ ≤ 2‖V_θ∗ − V_θ‖_D.

Consider the expansion

    ‖θ∗ − θ_{k+1}‖₂² = ‖θ∗ − θ_k‖₂² − 2αg(θ_k)ᵀ(θ∗ − θ_k) + α²‖g(θ_k)‖₂²,

where Lemma 3 lower-bounds the cross term and Lemma 4 upper-bounds the last term. Setting α = (1 − γ)/4 yields

    ‖θ∗ − θ_{k+1}‖₂² ≤ ‖θ∗ − θ_k‖₂² − ((1 − γ)²/4)‖V_θ∗ − V_θ_k‖²_D.
The deterministic case (mean-path TD)
Theorem 1. Consider a sequence of parameters (θ_0, θ_1, ...) obeying the recursion

    θ_{k+1} = θ_k + αg(θ_k),  k ∈ {0, 1, 2, ...},

where α = (1 − γ)/4. Then

    ‖θ∗ − θ_T‖₂² ≤ exp{−((1 − γ)²ω/4)T}‖θ∗ − θ_0‖₂²    (7)

and, for the averaged iterate θ̄_T,

    ‖V_θ∗ − V_θ̄_T‖²_D ≤ 4‖θ∗ − θ_0‖₂²/(T(1 − γ)²),    (8)

where (7) is analogous to (5) and (8) is analogous to (6).
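Mean-path TD replaces each random sample by its stationary expectation, so g(θ) can be computed exactly for a finite MDP. A sketch under assumed array conventions (rows of Phi are φ(s)ᵀ, R holds rewards R(s, s′)):

```python
import numpy as np

def mean_path_g(theta, Phi, P, pi, R, gamma):
    """Exact mean-path TD direction g(theta) for a finite MDP (sketch)."""
    g = np.zeros_like(theta, dtype=float)
    n = Phi.shape[0]
    for s in range(n):
        for s_next in range(n):
            td_err = R[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
            g += pi[s] * P[s, s_next] * td_err * Phi[s]
    return g

def mean_path_td(Phi, P, pi, R, gamma, theta0, num_iters):
    """Deterministic recursion theta_{k+1} = theta_k + alpha * g(theta_k)."""
    alpha = (1.0 - gamma) / 4.0          # step size from Theorem 1
    theta = theta0.astype(float)
    for _ in range(num_iters):
        theta += alpha * mean_path_g(theta, Phi, P, pi, R, gamma)
    return theta
```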
The case with i.i.d. observations
Assume that the tuples O_k = (s_k, r_k, s′_k) observed by the TD algorithm are i.i.d. samples from the stationary distribution:

    P[(s_k, r_k, s′_k) = (s, R(s, s′), s′)] = π(s)P(s′|s).

To analyze this case, we extend Lemma 4 of Bhandari et al. to obtain a bound on the expected norm of the stochastic gradient:

Lemma 5. For any fixed θ ∈ R^d,

    E[‖g_k(θ)‖₂²] ≤ 2σ² + 8‖V_θ − V_θ∗‖²_D,

where σ² = E[‖g_k(θ∗)‖₂²].
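Under this observation model each update uses a fresh pair (s, s′) drawn from the stationary distribution, independent of the past. A minimal sketch (array conventions as in the mean-path snippet):

```python
import numpy as np

def td0_iid(Phi, P, pi, R, gamma, alphas, theta0, seed=0):
    """TD(0) under the i.i.d. observation model (sketch).

    Each step draws s ~ pi and s' ~ P(.|s) independently of the past,
    which is exactly the assumption used in the i.i.d. analysis.
    """
    rng = np.random.default_rng(seed)
    theta = theta0.astype(float)
    n = len(pi)
    for alpha in alphas:
        s = rng.choice(n, p=pi)
        s_next = rng.choice(n, p=P[s])
        delta = R[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta += alpha * delta * Phi[s]
    return theta
```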
The case with i.i.d. observations
Expanding the square, we have

    ‖θ∗ − θ_{k+1}‖₂² = ‖θ∗ − θ_k‖₂² − 2α_k g_k(θ_k)ᵀ(θ∗ − θ_k) + α_k²‖g_k(θ_k)‖₂².    (9)

Thus, taking the expectation,

    E[‖θ∗ − θ_{k+1}‖₂²]
      = E[‖θ∗ − θ_k‖₂²] − 2α_k E[g_k(θ_k)ᵀ(θ∗ − θ_k)] + α_k² E[‖g_k(θ_k)‖₂²]
      = E[‖θ∗ − θ_k‖₂²] − 2α_k E[E[g_k(θ_k)ᵀ(θ∗ − θ_k) | θ_k]] + α_k² E[E[‖g_k(θ_k)‖₂² | θ_k]]
      ≤ E[‖θ∗ − θ_k‖₂²] − (2α_k(1 − γ) − 8α_k²) E[‖V_θ∗ − V_θ_k‖²_D] + 2α_k²σ²
      ≤ E[‖θ∗ − θ_k‖₂²] − α_k(1 − γ) E[‖V_θ∗ − V_θ_k‖²_D] + 2α_k²σ²,    (10)

where the last inequality holds for α_k ≤ (1 − γ)/8. (10) is used along with different step-size choices to provide finite-time TD bounds under the i.i.d. assumption.
The case with i.i.d. observations
Theorem 2. Suppose TD is applied under the i.i.d. observation model. Then

(a) For any T ≥ (8/(1 − γ))² and a constant step-size α = 1/√T,

    E[‖V_θ∗ − V_θ̄_T‖²_D] ≤ (‖θ∗ − θ_0‖₂² + 2σ²)/(√T(1 − γ)).

(b) For any constant step-size α ≤ ω(1 − γ)/8,

    E[‖θ∗ − θ_T‖₂²] ≤ exp{−α(1 − γ)ωT}‖θ∗ − θ_0‖₂² + α(2σ²/((1 − γ)ω)).

(c) For a decaying step-size α_k = β/(λ + k) with β = 2/((1 − γ)ω) and λ = 16/((1 − γ)²ω),

    E[‖θ∗ − θ_T‖₂²] ≤ ν/(λ + T),  where ν = max{8σ²/((1 − γ)²ω²), 16‖θ∗ − θ_0‖₂²/((1 − γ)²ω)}.
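The Theorem 2(c) schedule is straightforward to generate; a sketch, with ω and γ assumed known and the call at the end purely illustrative:

```python
import numpy as np

def theorem2c_steps(omega, gamma, num_steps):
    """Decaying step sizes alpha_k = beta / (lambda + k) from Theorem 2(c)."""
    beta = 2.0 / ((1.0 - gamma) * omega)
    lam = 16.0 / ((1.0 - gamma) ** 2 * omega)
    return beta / (lam + np.arange(num_steps))

alphas = theorem2c_steps(omega=0.1, gamma=0.9, num_steps=1000)  # made-up values
```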
On the i.i.d. observations assumption

The approach based on i.i.d. observations is unrealistic: the observed tuples O_k = (s_k, r_k = R(s_k, s_{k+1}), s_{k+1}) do not stem from a Markov chain sample path, and therefore the analysis ignores the possibly strong dependence between θ_k and O_k.

Let

    h(θ, O_k) := g_k(θ) = (r_k + γφ(s_{k+1})ᵀθ − φ(s_k)ᵀθ)φ(s_k)

and note that we defined

    g(θ) = E[h(θ, O_k)].

If we consider θ_k to be a function of {O_1, ..., O_{k−1}}, then

    g(θ) ≠ E[h(θ, O_k) | θ_k = θ]

and there is bias in the gradient evaluation:

    E[h(θ_k, O_k) − g(θ_k)] ≠ 0.
A projected TD(0) algorithm
- To control the gradient bias, Bhandari et al. work with a projected version of TD:

      θ_{k+1} = Π_{2,R}(θ_k + α_k g_k(θ_k)),

  where

      Π_{2,R}(θ) = argmin_{θ′:‖θ′‖₂≤R} ‖θ − θ′‖₂.

- The projection results in bounds on the gradient norms:

      ‖g_k(θ)‖₂ ≤ r_max + 2‖θ‖₂,  ∀θ ∈ R^d.

  Letting G := r_max + 2R, we then have

      ‖g_k(θ)‖₂ ≤ G,  ∀θ ∈ Θ_R = {θ ∈ R^d : ‖θ‖₂ ≤ R}.
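The projection onto the ℓ₂-ball has a simple closed form (rescale whenever the norm exceeds R), so a projected TD(0) step is a one-line change over the basic update. A sketch with assumed feature-vector inputs:

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Euclidean projection onto {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def projected_td0_step(theta, phi_s, phi_s_next, r, alpha, gamma, radius):
    """One projected TD(0) update: theta <- Pi_{2,R}(theta + alpha * g_k(theta))."""
    delta = r + gamma * phi_s_next @ theta - phi_s @ theta   # TD error
    return project_l2_ball(theta + alpha * delta * phi_s, radius)
```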
Analysis of Projected TD(0)

Define the "gradient evaluation error"

    ζ_k(θ) := (g_k(θ) − g(θ))ᵀ(θ − θ∗),  ∀θ ∈ Θ_R.

Then

    ‖θ∗ − θ_{k+1}‖₂² = ‖θ∗ − Π_{2,R}(θ_k + α_k g_k(θ_k))‖₂²
      = ‖Π_{2,R}(θ∗) − Π_{2,R}(θ_k + α_k g_k(θ_k))‖₂²
      ≤ ‖θ∗ − θ_k − α_k g_k(θ_k)‖₂²
      = ‖θ∗ − θ_k‖₂² − 2α_k g_k(θ_k)ᵀ(θ∗ − θ_k) + α_k²‖g_k(θ_k)‖₂²
      ≤ ‖θ∗ − θ_k‖₂² − 2α_k g_k(θ_k)ᵀ(θ∗ − θ_k) + α_k²G²
      = ‖θ∗ − θ_k‖₂² − 2α_k g(θ_k)ᵀ(θ∗ − θ_k) + 2α_k ζ_k(θ_k) + α_k²G²
      ≤ ‖θ∗ − θ_k‖₂² − 2α_k(1 − γ)‖V_θ∗ − V_θ_k‖²_D + 2α_k ζ_k(θ_k) + α_k²G².    (11)
Analysis of Projected TD(0)
Taking the expectation of (11) and assuming a fixed α:

    E[‖θ∗ − θ_{k+1}‖₂²]
      ≤ E[‖θ∗ − θ_k‖₂²] − 2α(1 − γ)E[‖V_θ∗ − V_θ_k‖²_D] + E[2αζ_k(θ_k)] + α²G²
      ≤ E[‖θ∗ − θ_k‖₂²] − 2α(1 − γ)E[‖V_θ∗ − V_θ_k‖²_D] + α²(5 + 6τ(α))G²,

where the last inequality follows from an upper bound on the gradient bias:

    E[2αζ_k(θ_k)] ≤ α²(4 + 6τ(α))G².
Finite-time bounds on Projected TD(0)
Theorem 3. Suppose the projected TD(0) algorithm is applied with parameter R ≥ ‖θ∗‖₂ and mixing-time function τ(·). Set G = r_max + 2R. Then

(a) With a constant step-size α = 1/√K,

    E[‖V_θ∗ − V_θ̄_K‖²_D] ≤ (‖θ∗ − θ_0‖₂² + G²(9 + 12τ(1/√K)))/(2√K(1 − γ)).

(b) For any constant step-size α ≤ 1/(2ω(1 − γ)),

    E[‖θ∗ − θ_K‖₂²] ≤ e^{−2α(1−γ)ωK}‖θ∗ − θ_0‖₂² + α(G²(9 + 12τ(α))/(2(1 − γ)ω)).

(c) For a decaying step-size α_k = 1/(ω(k + 1)(1 − γ)),

    E[‖V_θ∗ − V_θ̄_K‖²_D] ≤ G²(9 + 24τ(α_K))(1 + log K)/(K(1 − γ)²ω).
An alternative TD(0) algorithm
We will now apply a control-theoretic Lyapunov drift analysis.

Theorem. For any k ≥ τ and α such that κ_1ατ + αγ_max ≤ 0.05, we have the following finite-time bound:

    E[‖θ_k − θ∗‖²] ≤ κ_Q (1 − 0.9α/γ_max)^{k−τ} (1.5‖θ_0 − θ∗‖ + 0.5r_max)² + (κ_2 κ_Q/0.9)ατ,

where

    κ_1 = 62γ_max(1 + r_max),  κ_2 = 2(55γ_max(1 + r_max)³ + γ_max r²_max).

- The values γ_min, γ_max, κ_Q denote the minimum eigenvalue, maximum eigenvalue, and condition number of the matrix Q defining the Lyapunov function.
- Assume, without loss of generality, that θ∗ = 0.
Understanding the continuous-time dynamics
Lemma [Tsitsiklis and Van Roy (1997)]. Under a diminishing step-size scheme, the discrete-time dynamics track the ODE

    θ̇ = −ΦᵀD[I − γP]Φθ + ΦᵀD E[R],

where R := [R(s_1) ... R(s_n)]ᵀ.

The Lyapunov function is chosen with standard control-theory techniques as

    W(θ_k) = θ_kᵀQθ_k,

where Q satisfies the Lyapunov equation of the continuous-time dynamics, i.e.,

    AᵀQ + QA = −I,  A = −ΦᵀD[I − γP]Φ.

Under the regularity assumptions on φ, A is Hurwitz.
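Given A, the Lyapunov equation is a standard solve; scipy's continuous-Lyapunov solver handles AᵀQ + QA = −I directly. A sketch with a stand-in A assumed to be Hurwitz:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
d = 3
A = -np.eye(d) + 0.1 * rng.standard_normal((d, d))  # stand-in, assumed Hurwitz

# solve_continuous_lyapunov(a, q) solves a X + X a^T = q;
# with a = A.T this is exactly A^T Q + Q A = -I.
Q = solve_continuous_lyapunov(A.T, -np.eye(d))

eigs = np.linalg.eigvalsh((Q + Q.T) / 2)   # symmetrize against round-off
gamma_min, gamma_max = eigs[0], eigs[-1]
kappa_Q = gamma_max / gamma_min            # condition number used in the bound
```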
Understanding Lyapunov drift in continuous time

The unconditioned drift function at steady state satisfies

    E[W(θ_{k+1}) − W(θ_k)] = E[θ_{k+1}ᵀQθ_{k+1} − θ_kᵀQθ_k] = 0.

Taking the Taylor expansion with an appropriate choice of θ̃ yields

    E[∇ᵀW(θ_k)(θ_{k+1} − θ_k) + ½(θ_{k+1} − θ_k)ᵀ∇²W(θ̃)(θ_{k+1} − θ_k)] = 0,

where the Hessian ∇²W is constant (proportional to Q).

Select W(θ_k) according to Stein's method,

    ∇ᵀW(θ_k)E[θ_{k+1} − θ_k | θ_k] = −‖θ_k‖².

Requiring Stein's equation to hold for all θ_k results in the Lyapunov equation from before,

    AᵀQ + QA = −I.
A series of useful lemmas

1. Bound the error during Markov chain mixing.

   Lemma.

       ‖θ_τ − θ_0‖ ≤ 2ατ‖θ_0‖ + 2ατ r_max,
       ‖θ_τ − θ_0‖ ≤ 4ατ‖θ_τ‖ + 4ατ r_max,
       ‖θ_τ − θ_0‖² ≤ 32α²τ²‖θ_τ‖² + 32α²τ² r²_max.

2. Bound the one-step error.

   Lemma. For all k ≥ 0,

       ‖θ_{k+1} − θ_k‖²_Q ≤ 2α²γ_max‖θ_k‖² + 2α²γ_max r²_max.

3. Bound Stein's equation for the drift process. Recall that in steady-state continuous time,

       ∇ᵀW(θ_k)E[θ_{k+1} − θ_k | θ_k] = 0.

   Lemma. For any k ≥ τ,

       |E[θ_kᵀQ(Aθ_k − (1/α)(θ_{k+1} − θ_k)) | θ_{k−τ}, s_{k−τ}, s′_{k−τ}]|
         ≤ κ_1 ατ E[‖θ_k‖² | θ_{k−τ}] + κ_2 ατ,

   where Aθ_k − (1/α)(θ_{k+1} − θ_k) is the drift from steady state and

       κ_1 = 62γ_max(1 + r_max),  κ_2 = 55γ_max(1 + r_max)³.

4. Bound the discrete-time Lyapunov drift.

   Lemma. For any k ≥ τ and α such that κ_1ατ + αγ_max ≤ 0.05,

       E[W(θ_{k+1})] ≤ (1 − 0.9α/γ_max)E[W(θ_k)] + κ_2α²τ,

   where κ_2 is updated to 2(κ_2 + γ_max r²_max), with κ_2 = 55γ_max(1 + r_max)³ taken from the previous lemma.

5. Taking the summation over k yields the result,

       E[‖θ_k‖²] ≤ κ_Q (1 − 0.9α/γ_max)^{k−τ} (1.5‖θ_0‖ + 0.5r_max)² + (κ_2 κ_Q/0.9)ατ.
A conceptual comparison of the algorithms
Information theoretic approach (1st paper)

- Lyapunov function: W_1 = ‖V_θ − V_θ∗‖²_D = ‖θ − θ∗‖²_Σ.
  - Σ is the steady-state feature covariance matrix, ΦᵀDΦ.
- Bounds the total gradient noise.

Control theoretic approach (2nd paper)

- Lyapunov function: W_2 = ‖θ − θ∗‖²_Q.
  - Q comes from solving the Lyapunov equation for the steady-state continuous-time dynamics,

        AᵀQ + QA = −I,  A = −ΦᵀD(I − γP)Φ.

- Bounds the Lyapunov drift.
A comparison of the algorithm guarantees
For the constant step-size case,
Information theoretic approach (1st paper)

    E[‖θ∗ − θ_k‖²] ≤ (1 − α(1 − γ)ω)^k ‖θ∗ − θ_0‖² + α(2σ²/((1 − γ)ω))

Control theoretic approach (2nd paper)

    E[‖θ_k − θ∗‖²] ≤ κ_Q (1 − 0.9α/γ_max)^{k−τ} (1.5‖θ_0 − θ∗‖ + 0.5r_max)² + (κ_2 κ_Q/0.9)ατ
The TD(λ) algorithm
Definition (eligibility trace). An eligibility trace is a geometrically weighted average of the feature vectors at all the previously visited states, given by

    ψ_k = (γλ)ψ_{k−1} + φ(s_k).

Notice that for λ = 0 this only updates based on the current state, whereas for λ = 1 it accumulates the discounted feature vectors of all past states.

The TD(λ) algorithm with linear approximation. For a given θ_0 and positive step-size sequence {α_k}, update θ as

    θ_{k+1} = θ_k − α_k [φ(s_k)ᵀθ_k − R(s_k) − γφ(s′_k)ᵀθ_k] ψ_k.

Remark. Both styles of analysis can be extended to provide finite-time guarantees for the TD(λ) case.
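A minimal sketch of TD(λ), reusing the assumed `phi`, `reward`, and `sample_transition` hooks from the TD(0) sketch; the only change is that the update is applied along the eligibility trace ψ instead of φ(s_k):

```python
import numpy as np

def td_lambda(phi, reward, sample_transition, theta0, alphas, gamma, lam, s0):
    """TD(lambda) with linear approximation and an eligibility trace (sketch)."""
    theta, s = theta0.astype(float), s0
    psi = np.zeros_like(theta)                 # eligibility trace, psi_{-1} = 0
    for alpha in alphas:
        s_next = sample_transition(s)
        psi = (gamma * lam) * psi + phi(s)     # psi_k = (gamma*lam) psi_{k-1} + phi(s_k)
        delta = reward(s) + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * delta * psi           # lam = 0 recovers TD(0)
        s = s_next
    return theta
```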
References
1. Jalaj Bhandari, Daniel Russo, and Raghav Singal. "A finite time analysis of temporal difference learning with linear function approximation". arXiv preprint arXiv:1806.02450, 2018.
2. R. Srikant and Lei Ying. "Finite-time error bounds for linear stochastic approximation and TD learning". arXiv preprint arXiv:1902.00923, 2019.
3. John N. Tsitsiklis and Benjamin Van Roy. "Analysis of temporal-difference learning with function approximation". In: Advances in Neural Information Processing Systems, 1997, pp. 1075–1081.
Thank you!