
Lecture 12: Policy Optimization II

Bolei Zhou

The Chinese University of Hong Kong

[email protected]

February 21, 2019

IERG6130: Reinforcement Learning


Today’s Plan

1 Reinforce your understanding of policy optimization

2 Improving the policy gradient with a critic


Policy Optimization

1 In policy-based RL, we parameterize the policy function πθ and optimize θ∗ = arg maxθ Eτ∼πθ[R(τ)]

2 So the gradient is

∇θEτ∼πθ[R(τ)] = Eτ[R(τ) ∇θ log π(τ; θ)]

3 Consider the generic form in terms of the score-function gradient estimator (a small numerical check follows below)

∇θEx∼p(x;θ)[f(x)] = Ex[f(x) ∇θ log p(x; θ)]

4 R(τ) could be any reasonable reward function on τ
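
As a sanity check of this estimator, here is a minimal numpy sketch; the choice p(x; θ) = N(θ, 1) and f(x) = x^2 is our own toy example, not from the slides (for this choice ∇θE[f(x)] = 2θ).

import numpy as np

# Score-function estimate of d/dtheta E_{x~N(theta,1)}[f(x)] for f(x) = x^2.
# The true value is 2*theta, so the Monte-Carlo estimate should land near it.
theta = 1.5
x = np.random.default_rng(0).normal(loc=theta, scale=1.0, size=100_000)  # x ~ p(x; theta)
score = x - theta                          # d/dtheta log N(x; theta, 1)
grad_estimate = np.mean((x ** 2) * score)  # average of f(x) * score over the samples
print(grad_estimate, 2 * theta)            # estimate should be close to 3.0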


Review of Policy Gradient

1 After decomposing τ we derived that

∇θEτ[R(τ)] = Eτ[ (∑_{t=0}^{T−1} rt) (∑_{t=0}^{T−1} ∇θ log πθ(at|st)) ]

2 After applying causality (an action cannot affect the rewards received before it), we have

∇θEτ∼πθ[R] = Eτ[ ∑_{t=0}^{T−1} Gt · ∇θ log πθ(at|st) ]

1 Gt = ∑_{t′=t}^{T−1} rt′ is the return of the trajectory from step t onward (a small sketch of this computation follows below)
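
A minimal sketch of the reward-to-go computation for a single trajectory (undiscounted, matching the formula above); the reward values are made up.

import numpy as np

def rewards_to_go(rewards):
    """G_t = r_t + r_{t+1} + ... + r_{T-1}, computed in one backward pass."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        g[t] = running
    return g

print(rewards_to_go([1.0, 0.0, 0.0, 2.0]))   # [3. 2. 2. 2.]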


REINFORCE: MC Policy Gradient Algorithm
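
The algorithm box on this slide did not survive the extraction. As a stand-in, here is a minimal sketch of the Monte-Carlo policy-gradient update on a degenerate one-step problem (a two-armed bandit with a softmax policy); the setup is our own, not from the slides.

import numpy as np

# REINFORCE on a two-armed bandit: each "trajectory" is a single action,
# so the return G is just that action's reward.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])          # arm 1 pays more on average
theta = np.zeros(2)                        # softmax policy logits
alpha = 0.1                                # step size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)             # sample an action from pi_theta
    G = rng.normal(true_means[a], 0.1)     # Monte-Carlo return of this episode
    grad_log_pi = -probs                   # d/dtheta log pi_theta(a) for a softmax policy ...
    grad_log_pi[a] += 1.0                  # ... equals onehot(a) - probs
    theta += alpha * G * grad_log_pi       # REINFORCE update: theta += alpha * G * grad log pi

print(softmax(theta))                      # probability mass should concentrate on arm 1

In the full episodic algorithm the same update is applied at every time step of a sampled trajectory, with G replaced by the reward-to-go Gt from the previous slide.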


Policy Gradient is On-Policy RL

1 ∇θJ(θ) = Eτ∼πθ [∇θ log πθ(τ)r(τ)]

2 In the REINFORCE algorithm the sampling is on-policy: trajectories {τ} are sampled from πθ itself

3 On-policy learning can be extremely inefficient, since every gradient step requires fresh samples from the current policy


Off-Policy Learning using Importance Sampling

1 θ∗ = arg maxθ J(θ) where J(θ) = Eτ∼πθ [r(τ)]

2 Let’s say we have a behavior policy π from which we can generate samples; then

J(θ) = Eτ∼π[ (πθ(τ)/π(τ)) r(τ) ]

3 Then we can derive the off-policy policy gradient from this importance-sampled objective (a small importance-sampling sketch follows below)

1 Levine & Koltun (2013). Guided policy search: deep RL with importance sampled policy gradient. ICML
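
A minimal sketch of the importance-sampling identity itself, with two 1-D Gaussians standing in for the target policy πθ and the behavior policy π, and f standing in for r(τ); all of these choices are our own.

import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)                 # samples from the behavior q = N(0, 1)
w = gauss_pdf(x, 0.5, 1.0) / gauss_pdf(x, 0.0, 1.0)    # importance weights p(x) / q(x), p = N(0.5, 1)
estimate = np.mean(w * x ** 2)                         # E_q[(p/q) f] with f(x) = x^2
print(estimate)                                        # ~1.25 = E_{x~N(0.5,1)}[x^2]

In the off-policy policy gradient the weight is the ratio of trajectory probabilities πθ(τ)/π(τ), where the unknown dynamics terms cancel and only the per-step action probabilities remain.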


Overcoming Non-differentiable Computation

1 Another interesting application of policy gradient is that it allows us to optimize through non-differentiable computation

2 During training we produce several samples (several branches of the computation), and then encourage the samples that eventually led to good outcomes (here measured, for example, by the loss at the end), as the sketch below illustrates
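
A minimal PyTorch sketch of this idea (the three-branch setup and its losses are our own toy example): gradients cannot flow through the sampling step, so we minimize the score-function surrogate log π(a) · loss(a) instead.

import torch

logits = torch.zeros(3, requires_grad=True)       # parameters of the sampling distribution
opt = torch.optim.SGD([logits], lr=0.1)
branch_loss = torch.tensor([1.0, 0.8, 0.1])       # "loss at the end" of each branch (made up)

for step in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()                             # non-differentiable: no gradient flows through here
    surrogate = dist.log_prob(a) * branch_loss[a] # score-function surrogate for the expected loss
    opt.zero_grad()
    surrogate.backward()
    opt.step()                                    # low-loss branches get sampled more often

print(torch.softmax(logits, dim=0))               # most of the mass should end up on branch 2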


Reducing Variance by a Baseline

1 The original update

∇θEτ∼πθ[R] = Eτ[ ∑_{t=0}^{T−1} Gt · ∇θ log πθ(at|st) ]

1 Gt = ∑_{t′=t}^{T−1} rt′ is the return of a Monte-Carlo trajectory, which might have high variance

2 We subtract a baseline b(st) from the return to reduce the variance (see the numerical check after this list)

∇θEτ∼πθ[R] = Eτ[ ∑_{t=0}^{T−1} (Gt − b(st)) · ∇θ log πθ(at|st) ]

3 A good baseline is the expected return from step t onward

b(st) = E[rt + rt+1 + ... + rT−1]
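
A quick numerical check of points 2 and 3, reusing the 1-D Gaussian toy example from earlier (again our own choice, with "return" f(x) = x): the baseline leaves the estimated gradient unchanged but shrinks its variance.

import numpy as np

rng = np.random.default_rng(0)
theta = 1.0
x = rng.normal(theta, 1.0, size=200_000)   # samples from the "policy" N(theta, 1)
score = x - theta                          # grad_theta log N(x; theta, 1)
G = x                                      # "return"; the true gradient of E[G] is 1
b = G.mean()                               # constant baseline, roughly the expected return

plain = G * score
baselined = (G - b) * score
print(plain.mean(), baselined.mean())      # both ~1.0: the expectation is unchanged
print(plain.var(), baselined.var())        # ~3.0 vs ~2.0: the variance drops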


Reducing Variance by a Baseline

∇θEτ∼πθ[R] = Eτ[ ∑_{t=0}^{T−1} (Gt − bw(st)) · ∇θ log πθ(at|st) ]

1 We can prove that Eτ[∇θ log πθ(at|st) b(st)] = 0 (a short derivation is given below)

2 Baseline b(s) can reduce variance, without changing the expectation

3 bw(s) also has a parameter vector w to learn, so we have two sets of parameters, w and θ
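
A short derivation of point 1 (a standard argument, not spelled out on the slide): for any fixed state st,

Eat∼πθ(·|st)[∇θ log πθ(at|st) b(st)] = b(st) ∑_a πθ(a|st) ∇θ log πθ(a|st)
                                     = b(st) ∑_a ∇θ πθ(a|st)
                                     = b(st) ∇θ ∑_a πθ(a|st)
                                     = b(st) ∇θ 1 = 0

so subtracting b(st) changes no expectation, only the variance.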


Vanilla Policy Gradient Algorithm with Baseline


Reducing Variance Using a Critic

1 The update is

∇θEπθ[R] = Eπθ[ ∑_{t=0}^{T−1} Gt · ∇θ log πθ(at|st) ]

2 In practice, Gt is a Monte-Carlo sample of the return, which is an unbiased but noisy estimate of Qπθ(st, at)

3 Instead we can use a critic to estimate the action-value function,

Qw(s, a) ≈ Qπθ(s, a)

4 Then the update becomes

∇θEπθ[R] = Eπθ[ ∑_{t=0}^{T−1} Qw(st, at) · ∇θ log πθ(at|st) ]


Reducing Variance Using a Critic

∇θEπθ[R] = Eπθ[ ∑_{t=0}^{T−1} Qw(st, at) · ∇θ log πθ(at|st) ]

1 It becomes Actor-Critic Policy Gradient
1 Actor: the policy function used to generate the action
2 Critic: the value function used to evaluate the reward of the actions

2 Actor-critic algorithms maintain two sets of parameters
1 Actor: updates policy parameters θ, in the direction suggested by the critic
2 Critic: updates action-value function parameters w


Estimating the Action-Value Function

1 The critic is solving a familiar problem: policy evaluation
1 How good is policy πθ for the current parameter θ?

2 Policy evaluation was explored in previous lectures, e.g.
1 Monte-Carlo policy evaluation
2 Temporal-Difference learning
3 Least-squares policy evaluation


Action-Value Actor-Critic Algorithm

1 Using a linear value function approximation: Qw(s, a) = ψ(s, a)ᵀw
1 Critic: update w by a linear TD(0)
2 Actor: update θ by policy gradient

Algorithm 1 Simple QAC

1: for each step do
2:   generate sample s, a, r, s′, a′ following πθ
3:   δ = r + γQw(s′, a′) − Qw(s, a)   # TD error
4:   w ← w + β δ ψ(s, a)
5:   θ ← θ + α ∇θ log πθ(s, a) Qw(s, a)
6: end for
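
A runnable sketch of Algorithm 1 on a made-up two-state MDP (action 1 switches state, action 0 stays, and a reward of 1 is collected whenever the agent is in state 1); the environment, the one-hot features ψ, and the hyperparameters are all our own assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha, beta = 0.9, 0.01, 0.1

def psi(s, a):                              # one-hot feature psi(s, a) for the linear critic
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

theta = np.zeros((n_states, n_actions))     # actor: softmax policy logits per state
w = np.zeros(n_states * n_actions)          # critic: Q_w(s, a) = psi(s, a) @ w

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

s = 0
a = rng.choice(n_actions, p=pi(s))
for step in range(20_000):
    r = 1.0 if s == 1 else 0.0              # reward for the current state
    s_next = s if a == 0 else 1 - s         # action 1 switches state, action 0 stays
    a_next = rng.choice(n_actions, p=pi(s_next))
    delta = r + gamma * psi(s_next, a_next) @ w - psi(s, a) @ w   # TD error (line 3)
    w = w + beta * delta * psi(s, a)                              # critic update (line 4)
    grad_log_pi = -pi(s)
    grad_log_pi[a] += 1.0                                         # grad_theta[s] log pi(a|s)
    theta[s] = theta[s] + alpha * grad_log_pi * (psi(s, a) @ w)   # actor update (line 5)
    s, a = s_next, a_next

print(pi(0), pi(1))   # should come to prefer a=1 in state 0 and a=0 in state 1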


Reducing the Variance of Actor-Critic by a Baseline

1 Recall Q-function / state-action-value function:

Qπ,γ(s, a) = Eπ[r1 + γr2 + ...|s1 = s, a1 = a]

2 State value function can serve as a great baseline

V π,γ(s) = Eπ[r1 + γr2 + ... | s1 = s] = Ea∼π[Qπ,γ(s, a)]

3 Advantage function: combining Q with the baseline V (a tiny numerical example follows below)

Aπ,γ(s, a) = Qπ,γ(s, a)− V π,γ(s)

4 Then the policy gradient becomes:

∇θJ(θ) = Eπθ [∇θ log πθ(s, a)Aπ,γ(s, a)]
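
A tiny numerical example of these definitions for a single state with three actions (the numbers are made up):

import numpy as np

pi = np.array([0.2, 0.5, 0.3])        # pi(a|s)
q = np.array([1.0, 2.0, 0.0])         # Q(s, a)
v = pi @ q                            # V(s) = E_{a~pi}[Q(s, a)] = 1.2
adv = q - v                           # A(s, a) = Q(s, a) - V(s)
print(v, adv, pi @ adv)               # the expected advantage under pi is 0

The last value being zero is the baseline property again: under πθ the expected advantage is zero, so only better-than-average actions get pushed up.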


N-step estimators

1 So far we have used Monte-Carlo estimates of the return
2 We can also use TD methods for the policy gradient update, or any intermediate blend between TD and MC methods
3 Consider the following n-step returns for n = 1, 2, ∞

n = 1 (TD):  G(1)t = rt+1 + γ v(st+1)
n = 2:       G(2)t = rt+1 + γ rt+2 + γ^2 v(st+2)
n = ∞ (MC):  G(∞)t = rt+1 + γ rt+2 + ... + γ^{T−t−1} rT

4 Then the advantage estimators become

A(1)t = rt+1 + γ v(st+1) − v(st)
A(2)t = rt+1 + γ rt+2 + γ^2 v(st+2) − v(st)
A(∞)t = rt+1 + γ rt+2 + ... + γ^{T−t−1} rT − v(st)

A(1)t has low variance but high bias; A(∞)t has high variance but low bias (a small sketch of these estimators follows below).
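
A small sketch that computes these estimators for one trajectory; the rewards, value estimates, and indexing convention (rewards[k] holds r_{k+1}, values[k] holds v(s_k)) are our own.

import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A^(n)_t = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n v(s_{t+n}) - v(s_t);
    n=None gives the Monte-Carlo (n = infinity) estimate."""
    T = len(rewards)
    if n is None or t + n >= T:
        n = T - t                                   # run to the end of the trajectory
    ret = sum(gamma ** k * rewards[t + k] for k in range(n))
    if t + n < T:                                   # bootstrap with v unless the episode ended
        ret += gamma ** n * values[t + n]
    return ret - values[t]

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]                 # r_1 ... r_T (made up)
values  = [0.5, 0.4, 0.8, 0.3, 0.9]                 # v(s_0) ... v(s_{T-1}) (made up)
for n in (1, 2, None):
    print(n, n_step_advantage(rewards, values, t=0, n=n))

Intermediate n trades off the bias of bootstrapping from v against the variance of summing many noisy rewards.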


Extension on Policy Gradient

1 State-of-the-art RL methods are almost all policy-based
1 DDPG: T. P. Lillicrap et al. (2015). Continuous control with deep reinforcement learning
2 TRPO: Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization: deep RL with natural policy gradient and adaptive step size
3 PPO: Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient


Different Schools of Reinforcement Learning

1 Value-based RL: solve RL through dynamic programming
1 Classic RL and control theory
2 Representative algorithms: Deep Q-learning and its variants
3 Representative people: Richard Sutton (no more than 20 pages on PG out of the 500-page textbook) and David Silver, from DeepMind

2 Policy-based RL: solve RL mainly through learning
1 Machine learning and deep learning
2 Representative algorithms: PG and its variants TRPO, PPO, and others
3 Representative people: Pieter Abbeel, Sergey Levine, and John Schulman, from Berkeley and OpenAI


Policy gradient code example

1 Code example of policy gradient: https://github.com/metalbubble/RLexample/blob/master/pg-pong.py

2 Educational blog by Karpathy: http://karpathy.github.io/2016/05/31/rl/

3 Episodes/trajectories are generated by playing the game, and PG then encourages the good actions and penalizes the bad ones (a small sketch of the reward weighting follows below)
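
A sketch of how the per-step rewards of a Pong episode are typically turned into per-action weights before the PG update, in the spirit of the linked example (this is our own re-implementation; details in pg-pong.py may differ):

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Discounted reward-to-go, restarting the sum at each rally boundary
    (in Pong a nonzero reward means a point was scored and the rally ended)."""
    out = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0                 # Pong-specific reset at the end of a rally
        running = running * gamma + rewards[t]
        out[t] = running
    return out

r = np.array([0, 0, 0, 1, 0, 0, -1], dtype=float)    # a won rally followed by a lost one
w = discount_rewards(r)
w = (w - w.mean()) / (w.std() + 1e-8)                # normalize to reduce variance
print(w)   # actions before the win get positive weights, actions before the loss negative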


Next Lecture

1 Next Tuesday: deeper into actor-critic algorithms

2 Next Thursday: guest lecture by Sirui Xie
3 MOVED TO THE WEEK AFTER NEXT: our first student-led seminar (2 sessions)
1 Session 1: Hao Li, Zijian Lei: Visual Navigation
2 Session 2: Rui Xu, Xinge Zhu: Generalization in RL
3 20-min presentation + 5-min Q&A
