
Lecture 12: Policy Optimization II

Bolei Zhou

The Chinese University of Hong Kong

[email protected]

February 21, 2019

IERG6130: Reinforcement Learning


Today’s Plan

1 Reinforce your understanding of policy optimization

2 Improving the policy gradient with a critic


Policy Optimization

1 In policy-based RL, we parameterize the policy function πθ and optimize θ∗ = arg maxθ Eτ∼πθ[R(τ)]

2 So the gradient is

∇θEτ∼πθ[R(τ)] = Eτ[R(τ) ∇θ log π(τ; θ)]

3 Consider the generic form in terms of the score-function gradient estimator (a small numerical check follows below)

∇θEx∼p(x;θ)[f(x)] = Ex[f(x) ∇θ log p(x; θ)]

4 R(τ) could be any reasonable reward function on τ
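
As a sanity check of this estimator, here is a minimal numpy sketch; the choice p(x; θ) = N(θ, 1) and f(x) = x^2 is our own toy example, not from the slides (for this choice ∇θE[f(x)] = 2θ).

import numpy as np

# Score-function estimate of d/dtheta E_{x~N(theta,1)}[f(x)] for f(x) = x^2.
# The true value is 2*theta, so the Monte-Carlo estimate should land near it.
theta = 1.5
x = np.random.default_rng(0).normal(loc=theta, scale=1.0, size=100_000)  # x ~ p(x; theta)
score = x - theta                          # d/dtheta log N(x; theta, 1)
grad_estimate = np.mean((x ** 2) * score)  # average of f(x) * score over the samples
print(grad_estimate, 2 * theta)            # estimate should be close to 3.0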


Review of Policy Gradient

1 After decomposing τ we derived that

∇θEτ[R(τ)] = Eτ[ (∑_{t=0}^{T−1} rt) (∑_{t=0}^{T−1} ∇θ log πθ(at|st)) ]

2 After applying causality (an action cannot affect the rewards received before it), we have

∇θEτ∼πθ[R] = Eτ[ ∑_{t=0}^{T−1} Gt · ∇θ log πθ(at|st) ]

1 Gt = ∑_{t′=t}^{T−1} rt′ is the return of the trajectory from step t onward (a small sketch of this computation follows below)
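
A minimal sketch of the reward-to-go computation for a single trajectory (undiscounted, matching the formula above); the reward values are made up.

import numpy as np

def rewards_to_go(rewards):
    """G_t = r_t + r_{t+1} + ... + r_{T-1}, computed in one backward pass."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        g[t] = running
    return g

print(rewards_to_go([1.0, 0.0, 0.0, 2.0]))   # [3. 2. 2. 2.]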


REINFORCE: MC Policy Gradient Algorithm
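
The algorithm box on this slide did not survive the extraction. As a stand-in, here is a minimal sketch of the Monte-Carlo policy-gradient update on a degenerate one-step problem (a two-armed bandit with a softmax policy); the setup is our own, not from the slides.

import numpy as np

# REINFORCE on a two-armed bandit: each "trajectory" is a single action,
# so the return G is just that action's reward.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])          # arm 1 pays more on average
theta = np.zeros(2)                        # softmax policy logits
alpha = 0.1                                # step size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)             # sample an action from pi_theta
    G = rng.normal(true_means[a], 0.1)     # Monte-Carlo return of this episode
    grad_log_pi = -probs                   # d/dtheta log pi_theta(a) for a softmax policy ...
    grad_log_pi[a] += 1.0                  # ... equals onehot(a) - probs
    theta += alpha * G * grad_log_pi       # REINFORCE update: theta += alpha * G * grad log pi

print(softmax(theta))                      # probability mass should concentrate on arm 1

In the full episodic algorithm the same update is applied at every time step of a sampled trajectory, with G replaced by the reward-to-go Gt from the previous slide.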


Policy Gradient is On-Policy RL

1 ∇θJ(θ) = Eτ∼πθ [∇θ log πθ(τ)r(τ)]

2 In the REINFORCE algorithm the sampling is on-policy: trajectories {τ} are sampled from πθ itself

3 On-policy learning can be extremely inefficient, since every gradient step requires fresh samples from the current policy


Off-Policy Learning using Importance Sampling

1 θ∗ = arg maxθ J(θ) where J(θ) = Eτ∼πθ [r(τ)]

2 Let’s say we have a behavior policy π from which we can generate samples; then

J(θ) = Eτ∼π[ (πθ(τ)/π(τ)) r(τ) ]

3 Then we can derive the off-policy policy gradient from this importance-sampled objective (a small importance-sampling sketch follows below)

1 Levine & Koltun (2013). Guided policy search: deep RL with importance sampled policy gradient. ICML
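
A minimal sketch of the importance-sampling identity itself, with two 1-D Gaussians standing in for the target policy πθ and the behavior policy π, and f standing in for r(τ); all of these choices are our own.

import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)                 # samples from the behavior q = N(0, 1)
w = gauss_pdf(x, 0.5, 1.0) / gauss_pdf(x, 0.0, 1.0)    # importance weights p(x) / q(x), p = N(0.5, 1)
estimate = np.mean(w * x ** 2)                         # E_q[(p/q) f] with f(x) = x^2
print(estimate)                                        # ~1.25 = E_{x~N(0.5,1)}[x^2]

In the off-policy policy gradient the weight is the ratio of trajectory probabilities πθ(τ)/π(τ), where the unknown dynamics terms cancel and only the per-step action probabilities remain.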


Overcoming Non-differentiable Computation

1 Another interesting application of policy gradient is that it allows us to optimize through non-differentiable computation

2 During training we produce several samples (several branches of the computation), and then encourage the samples that eventually led to good outcomes (here measured, for example, by the loss at the end), as the sketch below illustrates
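
A minimal PyTorch sketch of this idea (the three-branch setup and its losses are our own toy example): gradients cannot flow through the sampling step, so we minimize the score-function surrogate log π(a) · loss(a) instead.

import torch

logits = torch.zeros(3, requires_grad=True)       # parameters of the sampling distribution
opt = torch.optim.SGD([logits], lr=0.1)
branch_loss = torch.tensor([1.0, 0.8, 0.1])       # "loss at the end" of each branch (made up)

for step in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()                             # non-differentiable: no gradient flows through here
    surrogate = dist.log_prob(a) * branch_loss[a] # score-function surrogate for the expected loss
    opt.zero_grad()
    surrogate.backward()
    opt.step()                                    # low-loss branches get sampled more often

print(torch.softmax(logits, dim=0))               # most of the mass should end up on branch 2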


Reducing Variance by a Baseline

1 The original update

∇θEτ∼πθ[R] = Eτ[ ∑_{t=0}^{T−1} Gt · ∇θ log πθ(at|st) ]

1 Gt = ∑_{t′=t}^{T−1} rt′ is the return of a Monte-Carlo trajectory, which might have high variance

2 We subtract a baseline b(st) from the return to reduce the variance (see the numerical check after this list)

∇θEτ∼πθ[R] = Eτ[ ∑_{t=0}^{T−1} (Gt − b(st)) · ∇θ log πθ(at|st) ]

3 A good baseline is the expected return from step t onward

b(st) = E[rt + rt+1 + ... + rT−1]
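
A quick numerical check of points 2 and 3, reusing the 1-D Gaussian toy example from earlier (again our own choice, with "return" f(x) = x): the baseline leaves the estimated gradient unchanged but shrinks its variance.

import numpy as np

rng = np.random.default_rng(0)
theta = 1.0
x = rng.normal(theta, 1.0, size=200_000)   # samples from the "policy" N(theta, 1)
score = x - theta                          # grad_theta log N(x; theta, 1)
G = x                                      # "return"; the true gradient of E[G] is 1
b = G.mean()                               # constant baseline, roughly the expected return

plain = G * score
baselined = (G - b) * score
print(plain.mean(), baselined.mean())      # both ~1.0: the expectation is unchanged
print(plain.var(), baselined.var())        # ~3.0 vs ~2.0: the variance drops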


Reducing Variance by a Baseline

∇θEτ∼πθ[R] = Eτ[ ∑_{t=0}^{T−1} (Gt − bw(st)) · ∇θ log πθ(at|st) ]

1 We can prove that Eτ[∇θ log πθ(at|st) b(st)] = 0 (a short derivation is given below)

2 Baseline b(s) can reduce variance, without changing the expectation

3 bw(s) also has a parameter vector w to learn, so we have two sets of parameters, w and θ
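
A short derivation of point 1 (a standard argument, not spelled out on the slide): for any fixed state st,

Eat∼πθ(·|st)[∇θ log πθ(at|st) b(st)] = b(st) ∑_a πθ(a|st) ∇θ log πθ(a|st)
                                     = b(st) ∑_a ∇θ πθ(a|st)
                                     = b(st) ∇θ ∑_a πθ(a|st)
                                     = b(st) ∇θ 1 = 0

so subtracting b(st) changes no expectation, only the variance.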


Vanilla Policy Gradient Algorithm with Baseline


Reducing Variance Using a Critic

1 The update is

∇θEπθ[R] = Eπθ[ ∑_{t=0}^{T−1} Gt · ∇θ log πθ(at|st) ]

2 In practice, Gt is a Monte-Carlo sample of the return, which is an unbiased but noisy estimate of Qπθ(st, at)

3 Instead we can use a critic to estimate the action-value function,

Qw(s, a) ≈ Qπθ(s, a)

4 Then the update becomes

∇θEπθ[R] = Eπθ[ ∑_{t=0}^{T−1} Qw(st, at) · ∇θ log πθ(at|st) ]


Reducing Variance Using a Critic

∇θEπθ[R] = Eπθ[ ∑_{t=0}^{T−1} Qw(st, at) · ∇θ log πθ(at|st) ]

1 It becomes Actor-Critic Policy Gradient
1 Actor: the policy function used to generate the action
2 Critic: the value function used to evaluate the reward of the actions

2 Actor-critic algorithms maintain two sets of parameters
1 Actor: updates policy parameters θ, in the direction suggested by the critic
2 Critic: updates action-value function parameters w


Estimating the Action-Value Function

1 The critic is solving a familiar problem: policy evaluation
1 How good is policy πθ for the current parameter θ?

2 Policy evaluation was explored in previous lectures, e.g.
1 Monte-Carlo policy evaluation
2 Temporal-Difference learning
3 Least-squares policy evaluation


Action-Value Actor-Critic Algorithm

1 Using a linear value function approximation: Qw(s, a) = ψ(s, a)ᵀw
1 Critic: update w by a linear TD(0)
2 Actor: update θ by policy gradient

Algorithm 1 Simple QAC

1: for each step do
2:   generate sample s, a, r, s′, a′ following πθ
3:   δ = r + γQw(s′, a′) − Qw(s, a)   # TD error
4:   w ← w + β δ ψ(s, a)
5:   θ ← θ + α ∇θ log πθ(s, a) Qw(s, a)
6: end for
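
A runnable sketch of Algorithm 1 on a made-up two-state MDP (action 1 switches state, action 0 stays, and a reward of 1 is collected whenever the agent is in state 1); the environment, the one-hot features ψ, and the hyperparameters are all our own assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha, beta = 0.9, 0.01, 0.1

def psi(s, a):                              # one-hot feature psi(s, a) for the linear critic
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

theta = np.zeros((n_states, n_actions))     # actor: softmax policy logits per state
w = np.zeros(n_states * n_actions)          # critic: Q_w(s, a) = psi(s, a) @ w

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

s = 0
a = rng.choice(n_actions, p=pi(s))
for step in range(20_000):
    r = 1.0 if s == 1 else 0.0              # reward for the current state
    s_next = s if a == 0 else 1 - s         # action 1 switches state, action 0 stays
    a_next = rng.choice(n_actions, p=pi(s_next))
    delta = r + gamma * psi(s_next, a_next) @ w - psi(s, a) @ w   # TD error (line 3)
    w = w + beta * delta * psi(s, a)                              # critic update (line 4)
    grad_log_pi = -pi(s)
    grad_log_pi[a] += 1.0                                         # grad_theta[s] log pi(a|s)
    theta[s] = theta[s] + alpha * grad_log_pi * (psi(s, a) @ w)   # actor update (line 5)
    s, a = s_next, a_next

print(pi(0), pi(1))   # should come to prefer a=1 in state 0 and a=0 in state 1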


Reducing the Variance of Actor-Critic by a Baseline

1 Recall Q-function / state-action-value function:

Qπ,γ(s, a) = Eπ[r1 + γr2 + ...|s1 = s, a1 = a]

2 State value function can serve as a great baseline

V π,γ(s) = Eπ[r1 + γr2 + ... | s1 = s] = Ea∼π[Qπ,γ(s, a)]

3 Advantage function: combining Q with the baseline V (a tiny numerical example follows below)

Aπ,γ(s, a) = Qπ,γ(s, a)− V π,γ(s)

4 Then the policy gradient becomes:

∇θJ(θ) = Eπθ [∇θ log πθ(s, a)Aπ,γ(s, a)]
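
A tiny numerical example of these definitions for a single state with three actions (the numbers are made up):

import numpy as np

pi = np.array([0.2, 0.5, 0.3])        # pi(a|s)
q = np.array([1.0, 2.0, 0.0])         # Q(s, a)
v = pi @ q                            # V(s) = E_{a~pi}[Q(s, a)] = 1.2
adv = q - v                           # A(s, a) = Q(s, a) - V(s)
print(v, adv, pi @ adv)               # the expected advantage under pi is 0

The last value being zero is the baseline property again: under πθ the expected advantage is zero, so only better-than-average actions get pushed up.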


N-step estimators

1 So far we have used Monte-Carlo estimates of the return
2 We can also use TD methods for the policy gradient update, or any intermediate blend between TD and MC methods
3 Consider the following n-step returns for n = 1, 2, ∞

n = 1 (TD):  G(1)t = rt+1 + γ v(st+1)
n = 2:       G(2)t = rt+1 + γ rt+2 + γ^2 v(st+2)
n = ∞ (MC):  G(∞)t = rt+1 + γ rt+2 + ... + γ^{T−t−1} rT

4 Then the advantage estimators become

A(1)t = rt+1 + γ v(st+1) − v(st)
A(2)t = rt+1 + γ rt+2 + γ^2 v(st+2) − v(st)
A(∞)t = rt+1 + γ rt+2 + ... + γ^{T−t−1} rT − v(st)

A(1)t has low variance but high bias; A(∞)t has high variance but low bias (a small sketch of these estimators follows below).
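
A small sketch that computes these estimators for one trajectory; the rewards, value estimates, and indexing convention (rewards[k] holds r_{k+1}, values[k] holds v(s_k)) are our own.

import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A^(n)_t = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n v(s_{t+n}) - v(s_t);
    n=None gives the Monte-Carlo (n = infinity) estimate."""
    T = len(rewards)
    if n is None or t + n >= T:
        n = T - t                                   # run to the end of the trajectory
    ret = sum(gamma ** k * rewards[t + k] for k in range(n))
    if t + n < T:                                   # bootstrap with v unless the episode ended
        ret += gamma ** n * values[t + n]
    return ret - values[t]

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]                 # r_1 ... r_T (made up)
values  = [0.5, 0.4, 0.8, 0.3, 0.9]                 # v(s_0) ... v(s_{T-1}) (made up)
for n in (1, 2, None):
    print(n, n_step_advantage(rewards, values, t=0, n=n))

Intermediate n trades off the bias of bootstrapping from v against the variance of summing many noisy rewards.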


Extension on Policy Gradient

1 State-of-the-art RL methods are almost all policy-based
1 DDPG: T. P. Lillicrap et al. (2015). Continuous control with deep reinforcement learning
2 TRPO: Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization: deep RL with natural policy gradient and adaptive step size
3 PPO: Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient


Different Schools of Reinforcement Learning

1 Value-based RL: solve RL through dynamic programming
1 Classic RL and control theory
2 Representative algorithms: Deep Q-learning and its variants
3 Representative people: Richard Sutton (no more than 20 pages on PG out of the 500-page textbook) and David Silver, from DeepMind

2 Policy-based RL: solve RL mainly through learning
1 Machine learning and deep learning
2 Representative algorithms: PG and its variants TRPO, PPO, and others
3 Representative people: Pieter Abbeel, Sergey Levine, and John Schulman, from Berkeley and OpenAI


Policy gradient code example

1 Code example of policy gradient: https://github.com/metalbubble/RLexample/blob/master/pg-pong.py

2 Educational blog by Karpathy: http://karpathy.github.io/2016/05/31/rl/

3 Episodes/trajectories are generated by playing the game, and PG then encourages the good actions and penalizes the bad ones (a small sketch of the reward weighting follows below)
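
A sketch of how the per-step rewards of a Pong episode are typically turned into per-action weights before the PG update, in the spirit of the linked example (this is our own re-implementation; details in pg-pong.py may differ):

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Discounted reward-to-go, restarting the sum at each rally boundary
    (in Pong a nonzero reward means a point was scored and the rally ended)."""
    out = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0                 # Pong-specific reset at the end of a rally
        running = running * gamma + rewards[t]
        out[t] = running
    return out

r = np.array([0, 0, 0, 1, 0, 0, -1], dtype=float)    # a won rally followed by a lost one
w = discount_rewards(r)
w = (w - w.mean()) / (w.std() + 1e-8)                # normalize to reduce variance
print(w)   # actions before the win get positive weights, actions before the loss negative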


Next Lecture

1 Next Tuesday: deeper into actor-critic algorithms

2 Next Thursday: guest lecture by Sirui Xie
3 MOVED TO THE WEEK AFTER NEXT: our first student-led seminar (2 sessions)
1 Session 1: Hao Li, Zijian Lei: Visual Navigation
2 Session 2: Rui Xu, Xinge Zhu: Generalization in RL
3 20-min presentation + 5-min Q&A
