The Mathematical Foundations of Policy Gradient Methods Sham M. Kakade University of Washington & Microsoft Research



Page 1

The Mathematical Foundations of Policy Gradient Methods

Sham M. Kakade
University of Washington & Microsoft Research

Page 2

Reinforcement (interactive) learning (RL):

Page 3

Setting: Markov decision processes

S states, start with s_0 ∼ d_0
A actions
dynamics model P(s'|s, a)
reward function r(s)
discount factor γ

[Sutton & Barto '18]

Stochastic policy π: s_t → a_t

Standard objective: find π which maximizes

V^π(s_0) = E[ r(s_0) + γ r(s_1) + γ² r(s_2) + … ]

where the distribution of s_t, a_t is induced by π.


Markov Decision Processes: a framework for RL

• A policy: π : States → Actions

• We execute π to obtain a trajectory: s_0, a_0, r_0, s_1, a_1, r_1, …

• Total γ-discounted reward:

V^π(s_0) = E[ ∑_{t=0}^∞ γ^t r_t │ s_0, π ]

Goal: Find a policy that maximizes our value, V^π(s_0).

Page 4

Challenges in RL

1. Exploration (the environment may be unknown)

2. Credit assignment problem (due to delayed rewards)

3. Large state/action spaces:
hand state: joint angles/velocities; cube state: configuration; actions: forces applied to actuators

Dexterous Robotic Hand Manipulation, OpenAI, Oct 15, 2019

Page 5

Values, State-Action Values, and Advantages

• Expectation with respect to sampled trajectories under π
• Have S states and A actions.
• Effective “horizon” is 1/(1 − γ) time steps.

V^π(s_0) = E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) │ s_0, π ]

Q^π(s_0, a_0) = E[ ∑_{t=0}^∞ γ^t r(s_t, a_t) │ s_0, a_0, π ]

A^π(s, a) = Q^π(s, a) − V^π(s)

Page 6

The “Tabular” Dynamic Programming approach

• Table: ‘bookkeeping’ for dynamic programming (with known rewards/dynamics)

1. Estimate the state-action value Q^π(s, a) for every entry in the table.

2. Update the policy π and go to step 1. (A code sketch follows the table below.)

• Generalization: how can we deal with this infinite table, using sampling/supervised learning?

State s: (joint angles, …, cube config, …) | Action a: (forces at joints) | Q^π(s, a): state-action value, the “one-step look-ahead value” using π

(31°, 12°, …, 8134, …) | (1.2 Newton, 0.1 Newton, …) | 8 units of reward

⋮ | ⋮ | ⋮
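As a concrete illustration of this bookkeeping, here is a minimal Python sketch of the tabular approach when the model is known: fill in the Q table for the current policy, improve the policy from the table, and repeat. The names (P, r, policy_iteration) are illustrative, not from the tutorial.

import numpy as np

# Tabular "evaluate Q, then update pi" loop, assuming a known model:
# P[s, a, s'] = transition probability, r[s, a] = reward, gamma = discount factor.
def policy_iteration(P, r, gamma, iters=100):
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)              # deterministic policy: state -> action
    for _ in range(iters):
        # Step 1 (evaluation): solve V = r_pi + gamma * P_pi V, then fill the Q table.
        P_pi = P[np.arange(S), pi]           # S x S transition matrix under pi
        r_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        Q = r + gamma * P @ V                # one entry per (state, action)
        # Step 2 (improvement): update pi greedily from the table and go to step 1.
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi, Q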

Page 7

This Tutorial: Mathematical Foundations of Policy Gradient Methods

§ Part I: Basics
  A. Derivation and Estimation
  B. Preconditioning and the Natural Policy Gradient

§ Part II: Convergence and Approximation
  A. Convergence: this is a non-convex problem!
  B. Approximation: how to think about the role of deep learning?

Page 8

Part-1: Basics

Page 9

State-Action Visitation Measures!

• This helps to clean up notation!

• “Occupancy frequency” of being in state s (and taking action a) after following π starting at s_0:

d^π_{s_0}(s) = (1 − γ) E[ ∑_{t=0}^∞ γ^t I(s_t = s) │ s_0, π ]

• d^π_{s_0} is a probability distribution.

• With this notation:

V^π(s_0) = 1/(1 − γ) · E_{s∼d^π_{s_0}, a∼π}[ r(s, a) ]
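A small Monte Carlo sketch of this occupancy measure, assuming a generic env.reset()/env.step() simulator interface (an assumption for illustration, not something specified in the tutorial): accumulate the discounted indicator of each visited state and normalize by 1 − γ.

import numpy as np

def estimate_visitation(env, policy, gamma, num_states, episodes=1000, horizon=500):
    # d^pi_{s0}(s) = (1 - gamma) * E[ sum_t gamma^t * 1{s_t = s} | s0, pi ]
    d = np.zeros(num_states)
    for _ in range(episodes):
        s = env.reset()                      # assumed: returns the start state s0 ~ d0
        discount = 1.0
        for _ in range(horizon):             # truncate the infinite sum at a long horizon
            d[s] += discount
            a = policy(s)                    # assumed: samples an action from pi(.|s)
            s, _, done = env.step(a)         # assumed: returns (next_state, reward, done)
            discount *= gamma
            if done:
                break
    d *= (1.0 - gamma) / episodes
    return d                                 # approximately a probability distribution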

Page 10

Direct Policy Optimization over Stochastic Policies

• π_θ(a|s) is the probability of action a given s, parameterized by θ:

π_θ(a|s) ∝ exp(f_θ(s, a))

• Softmax policy class: f_θ(s, a) = θ_{s,a}
• Linear policy class: f_θ(s, a) = θ ⋅ φ(s, a), where φ(s, a) ∈ R^d
• Neural policy class: f_θ(s, a) is a neural network
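All three classes share the same softmax-over-f_θ form, so they can be written almost verbatim in code. A minimal sketch (the feature map phi is a hypothetical placeholder):

import numpy as np

def softmax_policy(theta_table, s):
    # Softmax (tabular) class: f_theta(s, a) = theta[s, a], one parameter per (s, a) pair.
    logits = theta_table[s]
    p = np.exp(logits - logits.max())        # subtract max for numerical stability
    return p / p.sum()

def linear_policy(theta, phi, s, num_actions):
    # Linear class: f_theta(s, a) = theta . phi(s, a) with phi(s, a) in R^d (phi assumed given).
    logits = np.array([theta @ phi(s, a) for a in range(num_actions)])
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Neural class: f_theta(s, a) is the output of a network; the softmax over its logits is identical.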

Page 11

In practice, policy gradient methods rule…

• Why do we like them?

• They easily deal with large state/action spaces (through the neural net parameterization)

• We can estimate the gradient using only simulation of our current policy π_θ (the expectation is under the states/actions visited under π_θ).

• They directly optimize the cost function of interest!

They are the most effective method for obtaining state-of-the-art results.

θ ← θ + η ∇V^θ(s_0)

Page 12

Two (equal) expressions for the policy gradient!

∇V^π(s_0) = 1/(1 − γ) · E_{s∼d^π, a∼π}[ Q^π(s, a) ∇log π_θ(a|s) ]

∇V^π(s_0) = 1/(1 − γ) · E_{s∼d^π, a∼π}[ A^π(s, a) ∇log π_θ(a|s) ]

(some shorthand notation above)

• Where do these expressions come from?

• How do we compute them?
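The two expressions are equal because the score function has zero mean under π_θ: subtracting the action-independent baseline V^π(s) from Q^π(s, a) does not change the expectation. A short numeric check of that zero-mean property for a softmax distribution (the five logits are arbitrary illustrative numbers):

import numpy as np

theta = np.random.default_rng(1).normal(size=5)        # tabular softmax logits for one state
pi = np.exp(theta) / np.exp(theta).sum()
# For the softmax, grad_theta log pi(a) = e_a - pi, so its expectation under pi is zero:
score_mean = sum(pi[a] * (np.eye(5)[a] - pi) for a in range(5))
print(np.allclose(score_mean, 0.0))                    # True: any baseline term drops out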

Page 13

Example: an important special case!

• Remember the softmax policy class (a “tabular” parameterization):

π_θ(a|s) ∝ exp(θ_{s,a})

• Complete class with S·A params: one parameter per state-action pair, so it contains the optimal policy.

• Expression for the softmax class:

∂V^θ(s_0) / ∂θ_{s,a} = d^θ_{s_0}(s) · π_θ(a|s) · A^θ(s, a)

• Intuition: increase θ_{s,a} if the ‘weighted’ advantage is large.

Page 14

Part-1A: Derivations and Estimation

Page 15

General Derivation

∇V^{π_θ}(s_0)

= ∇ ∑_{a_0} π_θ(a_0|s_0) Q^{π_θ}(s_0, a_0)

= ∑_{a_0} ( ∇π_θ(a_0|s_0) ) Q^{π_θ}(s_0, a_0) + ∑_{a_0} π_θ(a_0|s_0) ∇Q^{π_θ}(s_0, a_0)

= ∑_{a_0} π_θ(a_0|s_0) ( ∇log π_θ(a_0|s_0) ) Q^{π_θ}(s_0, a_0)
  + ∑_{a_0} π_θ(a_0|s_0) ∇( r(s_0, a_0) + γ ∑_{s_1} P(s_1|s_0, a_0) V^{π_θ}(s_1) )

= ∑_{a_0} π_θ(a_0|s_0) ( ∇log π_θ(a_0|s_0) ) Q^{π_θ}(s_0, a_0) + γ ∑_{a_0, s_1} π_θ(a_0|s_0) P(s_1|s_0, a_0) ∇V^{π_θ}(s_1)

= E[ Q^{π_θ}(s_0, a_0) ∇log π_θ(a_0|s_0) ] + γ E[ ∇V^{π_θ}(s_1) ].

Unrolling this recursion over time gives the discounted-visitation expressions on the previous slide.

Page 16

SL vs RL: How do we obtain gradients?

• In supervised learning, how do we compute the gradient of our loss, ∇L(θ)?

θ ← θ + η ∇L(θ)

• Hint: can we compute our loss?

• In reinforcement learning, how do we compute the policy gradient ∇V^θ(s_0)?

θ ← θ + η ∇V^θ(s_0)

∇V^π(s_0) = 1/(1 − γ) · E_{s,a}[ Q^π(s, a) ∇log π_θ(a|s) ]

Page 17

Monte Carlo Estimation

• Sample a trajectory: execute π_θ and obtain s_0, a_0, r_0, s_1, a_1, r_1, …

Q̂(s_t, a_t) = ∑_{t′=0}^∞ γ^{t′} r(s_{t′+t}, a_{t′+t})

∇V̂^θ = ∑_{t=0}^∞ γ^t Q̂(s_t, a_t) ∇log π_θ(a_t|s_t)

• Lemma [Glynn ’90, Williams ’92]: this gives an unbiased estimate of the gradient:

E[ ∇V̂^θ ] = ∇V^θ(s_0)

This is the “likelihood ratio” method.
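A minimal sketch of this likelihood-ratio estimator for one sampled (truncated) trajectory; the grad_log_prob callable is an assumed helper returning ∇log π_θ(a|s):

import numpy as np

def reinforce_gradient(trajectory, grad_log_prob, gamma):
    # trajectory: list of (s_t, a_t, r_t); grad_log_prob(s, a) -> gradient of log pi_theta(a|s).
    states, actions, rewards = zip(*trajectory)
    T = len(rewards)
    # Q_hat(s_t, a_t) = sum_{t' >= 0} gamma^{t'} r_{t'+t}: discounted reward-to-go from time t.
    Q_hat = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        Q_hat[t] = running
    # grad_hat = sum_t gamma^t * Q_hat(s_t, a_t) * grad log pi_theta(a_t|s_t)
    return sum((gamma ** t) * Q_hat[t] * grad_log_prob(states[t], actions[t]) for t in range(T))

Averaging this over many sampled trajectories gives an unbiased estimate of ∇V^θ(s_0), up to the truncation of the infinite sums at the episode length.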

Page 18

Back to the softmax policy class…

π_θ(a|s) ∝ exp(θ_{s,a})

• Expression for the softmax class:

∂V^θ(s_0) / ∂θ_{s,a} = d^θ_{s_0}(s) · π_θ(a|s) · A^θ(s, a)

• What might be making gradient estimation difficult here?
(hint: when does gradient descent “effectively” stop?)

Page 19

Part-1B: Preconditioning and the Natural Policy Gradient

Page 20

A closer look at Natural Policy Gradient (NPG)

• Practice: (almost) all methods are gradient based, usually variants of:
Natural Policy Gradient [K. ‘01]; TRPO [Schulman ‘15]; PPO [Schulman ‘17]

• NPG warps the distance metric (using the Fisher information metric) to stretch out the corners, so the update moves ‘more’ near the boundaries. The update is:

F(θ) = E_{s∼d^π, a∼π}[ ∇log π_θ(a|s) ∇log π_θ(a|s)^T ]

θ ← θ + η F(θ)^{-1} ∇V^θ(s_0)
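A sample-based sketch of this update, with state-action pairs assumed to be drawn (approximately) from the visitation measure under π_θ; the helper names are illustrative, and constants such as 1/(1 − γ) are absorbed into η here:

import numpy as np

def npg_step(theta, samples, grad_log_prob, advantage, eta, reg=1e-3):
    d = theta.shape[0]
    F = np.zeros((d, d))          # Fisher information estimate: E[ g g^T ], g = grad log pi(a|s)
    pg = np.zeros(d)              # vanilla policy gradient direction: E[ g * A^theta(s, a) ]
    for s, a in samples:
        g = grad_log_prob(theta, s, a)
        F += np.outer(g, g)
        pg += g * advantage(s, a)
    F /= len(samples)
    pg /= len(samples)
    # Precondition with the (regularized) Fisher matrix: theta <- theta + eta * F^{-1} grad V
    return theta + eta * np.linalg.solve(F + reg * np.eye(d), pg)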

Page 21

TRPO (Trust Region Policy Optimization)

• TRPO [Schulman ‘15] (related: PPO [Schulman ‘17]): move while staying “close” in KL to the previous policy:

θ_{t+1} = argmax_θ V^θ(s_0)

s.t. E_{s∼d^{θ_t}}[ KL( π_θ(⋅|s) ∥ π_{θ_t}(⋅|s) ) ] ≤ δ

• NPG = TRPO: they are first-order equivalent (and have the same practical behavior).

Page 22

NPG intuition. But first….

• NPG as preconditioning:

θ ← θ + η F(θ)^{-1} ∇V^θ(s_0)

OR

θ ← θ + (η/(1 − γ)) · E[ ∇log π_θ(a|s) ∇log π_θ(a|s)^T ]^{-1} E[ ∇log π_θ(a|s) A^θ(s, a) ]

• What does the following problem remind you of?

E[ X X^T ]^{-1} E[ X Y ]

• What is NPG trying to approximate?

Page 23

Equivalent Update Rule (for the softmax)

• Take the best linear fit of Q^θ in “policy space” features: this gives

W*_{s,a} = A^θ(s, a)

• Using the NPG update rule:

θ_{s,a} ← θ_{s,a} + (η/(1 − γ)) A^θ(s, a)

• And so an equivalent update rule to NPG is:

π_θ(a|s) ← π_θ(a|s) exp( (η/(1 − γ)) A^θ(s, a) ) / Z

• What algorithm does this remind you of?

Questions: convergence? General case/approximation?

Page 24

But does gradient descent even work in RL??

(figure: the optimization landscape in Supervised Learning vs. Reinforcement Learning)

What about approximation?

Stay tuned!!

Page 25

Part-2: Convergence and Approximation

Page 26

The Optimization Landscape

Supervised Learning:
• Gradient descent tends to ‘just work’ in practice and is not sensitive to initialization.
• Saddle points are not a problem…

Reinforcement Learning:
• Local search depends on initialization in many real problems, due to “very” flat regions.
• Gradients can be exponentially small in the “horizon”.

Page 27

RL and the vanishing gradient problem

Reinforcement Learning:
• The random init. has “very” flat regions in real problems (lack of ‘exploration’).
• Lemma [Agarwal, Lee, K., Mahajan 2019]: with random init, all k-th order gradients are exponentially small in magnitude (2^{−Ω(H)}) for up to k < H / ln H orders, where H = 1/(1 − γ).
• This is a landscape/optimization issue (also a statistical issue if we used random init).

Inset (prior work): The Explore/Exploit Tradeoff [Thrun ’92]. Random search does not find the reward quickly (figure: an MDP with start state s_0).
(theory) Balancing the explore/exploit tradeoff: [Kearns & Singh ’02] E3 is a near-optimal algorithm. Sample complexity: [K. ’03, Azar ’17]. Model free: [Strehl et al. ’06; Dann & Brunskill ’15; Szita & Szepesvari ’10; Lattimore et al. ’14; Jin et al. ’18].

Page 28

Part 2: Understanding the convergence properties of the (NPG) policy gradient methods!

§ A: Convergence — let’s look at the tabular/“softmax” case
§ B: Approximation — “linear” policies and neural nets

Page 29

NPG: back to the “soft” policy iteration interpretation

• Remember the softmax policy class

π_θ(a|s) ∝ exp(θ_{s,a})

has S·A params.

• At iteration t, the NPG update rule

θ_{t+1} ← θ_t + η F(θ_t)^{-1} ∇V^{(t)}(s_0)

is equivalent to a “soft” (exact) policy iteration update rule:

π_{t+1}(a|s) ← π_t(a|s) exp( (η/(1 − γ)) A^{(t)}(s, a) ) / Z

• What happens for this non-convex update rule?
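For the tabular/softmax case this equivalence is easy to state as code: the NPG iterate can be maintained directly as a policy table via exponentiated weights. A sketch, assuming a routine that returns the exact advantage table (hypothetical name):

import numpy as np

def soft_policy_iteration(pi, advantage, eta, gamma, iters):
    # pi: S x A table of probabilities; advantage(pi) -> S x A table of A^pi(s, a).
    for _ in range(iters):
        A = advantage(pi)
        # pi_{t+1}(a|s) proportional to pi_t(a|s) * exp( eta/(1-gamma) * A^{(t)}(s, a) )
        logits = np.log(pi) + (eta / (1.0 - gamma)) * A
        logits -= logits.max(axis=1, keepdims=True)      # stabilize before exponentiating
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)              # per-state normalizer Z(s)
    return pi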

Page 30

Part-2A: Global Convergence

Page 31

Provable Global Convergence of NPG

Theorem [Agarwal, Lee, K., Mahajan 2019]
For the softmax policy class, with η = (1 − γ)² log A, we have after T iterations:

V^{(T)}(s_0) ≥ V^⋆(s_0) − 2 / ( (1 − γ)² T )

• Dimension-free iteration complexity! (No dependence on S, A.) Also a “FAST RATE”!
• Even though the problem is non-convex, a mirror descent analysis applies.
Analysis idea from [Even-Dar, K., Mansour 2009]
• What about approximate/sampled gradients and large state spaces?

Page 32

Notes: Potentials and Progress?

Page 33

But first, the “Performance Difference Lemma”

• Lemma [K. ’02]: a characterization of the performance gap between any two policies:

V^π(s_0) − V^{π′}(s_0) = E_{s_0, a_0, s_1, a_1, … ∼ π}[ ∑_{t=0}^∞ γ^t A^{π′}(s_t, a_t) │ s_0 ]

= 1/(1 − γ) · E_{s∼d^π_{s_0}, a∼π}[ A^{π′}(s, a) ]
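The lemma is easy to sanity-check numerically on a tiny random MDP, computing every quantity exactly from the model; the following sketch (all names illustrative) prints True:

import numpy as np

def exact_quantities(P, r, pi, gamma, s0):
    # P: S x A x S, r: S x A, pi: S x A stochastic policy; everything exact, no sampling.
    S, A, _ = P.shape
    P_pi = np.einsum('sab,sa->sb', P, pi)                 # state-to-state transitions under pi
    r_pi = (r * pi).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V
    Adv = Q - V[:, None]
    # d^pi_{s0}(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s0, pi)
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, np.eye(S)[s0])
    return V, Adv, d

rng = np.random.default_rng(0)
S, A, gamma, s0 = 4, 3, 0.9, 0
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)       # policy pi
pi2 = rng.random((S, A)); pi2 /= pi2.sum(axis=1, keepdims=True)    # policy pi'
V, _, d = exact_quantities(P, r, pi, gamma, s0)
V2, Adv2, _ = exact_quantities(P, r, pi2, gamma, s0)
lhs = V[s0] - V2[s0]
rhs = (d * (pi * Adv2).sum(axis=1)).sum() / (1 - gamma)
print(np.isclose(lhs, rhs))                                        # the performance difference lemma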

Page 34

Mirror Descent Gives a Proof! (even though it is non-convex)

E_{s∼d*}[ KL(π*_s ∥ π^{(t)}_s) − KL(π*_s ∥ π^{(t+1)}_s) ]

= E_{s∼d*} ∑_a π*(a|s) log( π^{(t+1)}(a|s) / π^{(t)}(a|s) )

= E_{s∼d*}[ ∑_a π*(a|s) (η/(1 − γ)) A^{(t)}(s, a) − ∑_a π*(a|s) log Z_t(s) ]

= η( V^{π*}(s_0) − V^{(t)}(s_0) ) − E_{s∼d*} log Z_t(s)

Page 35

Notes: are we making progress?

Page 36

Re-arranging:

V^{π*}(s_0) − V^{(t)}(s_0) = (1/η) E_{s∼d*}[ KL(π*_s ∥ π^{(t)}_s) − KL(π*_s ∥ π^{(t+1)}_s) + log Z_t(s) ]

Page 37

Understanding progress:

V^{π*}(s_0) − V^{(T−1)}(s_0)

≤ (1/T) ∑_{t=0}^{T−1} ( V^{π*}(s_0) − V^{(t)}(s_0) )

≤ (1/(ηT)) E_{s∼d*}[ KL(π*_s ∥ π^{(0)}_s) − KL(π*_s ∥ π^{(T)}_s) ] + (1/(ηT)) ∑_{t=0}^{T−1} E_{s∼d*} log Z_t(s)

≤ log|A| / (ηT) + (1/(ηT)) ∑_{t=0}^{T−1} E_{s∼d*} log Z_t(s)

Page 38

A slow rate proof sketch…

Page 39

The key lemma for the fast rate…

E_{s∼μ} log Z_t(s) ≤ … ≤ (η/(1 − γ)) E_{s∼μ}[ V^{(t+1)}(s) − V^{(t)}(s) ]

(the intermediate steps are filled in on the last slide, “Some details for the fast rate!”)

Page 40

The fast rate proof!

V^{π*}(s_0) − V^{(T−1)}(s_0)

≤ log|A| / (ηT) + (1/(ηT)) ∑_{t=0}^{T−1} E_{s∼d*} log Z_t(s)

≤ log|A| / (ηT) + (1/((1 − γ)T)) ∑_{t=0}^{T−1} ( V^{(t+1)}(d*) − V^{(t)}(d*) )

= log|A| / (ηT) + ( V^{(T)}(d*) − V^{(0)}(d*) ) / ((1 − γ)T)

≤ log|A| / (ηT) + 1 / ((1 − γ)² T).

Page 41

Part-2B: Approximation (and statistics)

Page 42

Remember our policy classes:

• π_θ(a|s) is the probability of action a given s, parameterized by θ:

π_θ(a|s) ∝ exp(f_θ(s, a))

• Softmax policy class: f_θ(s, a) = θ_{s,a}
• Linear policy class: f_θ(s, a) = θ ⋅ φ(s, a), where φ(s, a) ∈ R^d
• Neural policy class: f_θ(s, a) is a neural network

Page 43

OpenAI: dexterous hand manipulation not far off?

Trained with “domain randomization”

Basically: the measure s_0 ∼ μ was diverse.

Page 44

max_{π∈Π} E_{s∼μ}[ V^π(s) ]

(inset repeated from the earlier explore/exploit slide: random search does not find the reward quickly)

Policy search algorithms: exploration and start state-measures

• Idea: reweighting by a diverse distribution μ handles the “vanishing gradient” problem.
• There is a sense in which this reweighting is related to a “condition number”.
• Related theory:
  • [K. & Langford ‘02], [K. ‘03]: Conservative Policy Iteration (CPI) has the strongest provable guarantees, in terms of μ along with the error of a ‘supervised learning’ black box.
  • Other ‘reductions to SL’: [Bagnell et al. ‘04], [Scherer & Geist ‘14], [Geist et al. ‘19], etc.
  • Helpful for imitation learning: [Ross et al. 2011]; [Ross & Bagnell 2014]; [Sun et al. 2017]

Page 45

NPG for the linear policy class

• Now:

π_θ(a|s) ∝ exp(θ ⋅ φ_{s,a})

• Take the best linear fit in “policy space” features:

W* = argmin_W E_{s_0∼μ} E_{s,a∼d^θ_{s_0}}[ ( W ⋅ φ_{s,a} − A^θ(s, a) )² ]

• μ is our start-state distribution, hopefully with “coverage”.

• Define Â^θ(s, a) = W* ⋅ φ_{s,a}; the NPG update is then equivalent to:

π_θ(a|s) ← π_θ(a|s) exp( (η/(1 − γ)) Â^θ(s, a) ) / Z

• This is like a soft “approximate” policy iteration step.

Page 46

Sample Based NPG, linear case

• Sample trajectories: at iteration t, start with s_0 ∼ μ, then follow π_t.
• Now do regression on this sampled data:

Ŵ_t = argmin_W Ê_{s,a}[ ( W ⋅ φ_{s,a} − A^{(t)}(s, a) )² ]

• Define: Â_t(s, a) = Ŵ_t ⋅ φ_{s,a}

• And so an equivalent update rule to NPG is:

π_{t+1}(a|s) ← π_t(a|s) exp( (η/(1 − γ)) Â_t(s, a) ) / Z
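A sketch of one such iteration with linear features: the regression is an ordinary least squares fit on sampled (s, a) pairs with advantage estimates, and, because π_θ(a|s) ∝ exp(θ ⋅ φ_{s,a}), the multiplicative policy update amounts to shifting the parameters by (η/(1 − γ)) Ŵ_t. Names and the sampling routine are illustrative assumptions.

import numpy as np

def npg_linear_iteration(theta, samples, phi, eta, gamma, reg=1e-6):
    # samples: list of (s, a, A_hat), with (s, a) drawn from the visitation measure under
    # pi_theta (started from s0 ~ mu) and A_hat an estimate of A^{(t)}(s, a).
    Phi = np.array([phi(s, a) for s, a, _ in samples])     # n x d feature matrix
    A_hat = np.array([adv for _, _, adv in samples])
    # Regression: W_t = argmin_W  E_hat[ ( W . phi_{s,a} - A^{(t)}(s, a) )^2 ]
    W = np.linalg.solve(Phi.T @ Phi + reg * np.eye(Phi.shape[1]), Phi.T @ A_hat)
    # Multiplying pi by exp( eta/(1-gamma) * W . phi_{s,a} ) and renormalizing is the same as
    # the parameter shift below, since pi_theta(a|s) is proportional to exp(theta . phi_{s,a}).
    return theta + (eta / (1.0 - gamma)) * W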

Page 47

Guarantees: NPG for linear policy classes

Theorem [Agarwal, Lee, K., Mahajan 2019]
A: # actions. H: horizon, H = 1/(1 − γ). After T iterations, for all s_0, the NPG algorithm satisfies:

V^{π_T}(s_0) ≥ V*(s_0) − H √( 2 log A / T ) − √( H³ κ ε )

• (realizability) Suppose that A^π(s, a) is a linear function in φ_{s,a}.
• (supervised learning error) Suppose we have bounded regression error, say due to sampling:

Ê[ ( A^{(t)}(s, a) − Ŵ_t ⋅ φ(s, a) )² ] ≤ ε

• (relative condition number) With respect to the optimal state-action measure d* (starting from s_0):

κ = max_x  E_{s,a∼d*}[ (φ_{s,a} ⋅ x)² ] / E_{s,a∼ν}[ (φ_{s,a} ⋅ x)² ]

where ν is the state-action measure used for the regression (induced by s_0 ∼ μ).

Page 48

Sample Based NPG, neural case

• Now:

π_θ(a|s) ∝ exp(f_θ(s, a))

• Sampling: at iteration t, sample s_0 ∼ μ and follow π_t.
• Supervised learning/regression:

Ŵ_t = argmin_W Ê_{s,a}[ ( W ⋅ ∇f_{θ_t}(s, a) − A^{(t)}(s, a) )² ]

• Define: Â_t(s, a) = Ŵ_t ⋅ ∇f_{θ_t}(s, a)

• The NPG update is:

π_{t+1}(a|s) ← π_t(a|s) exp( (η/(1 − γ)) Â_t(s, a) ) / Z
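The same sketch for the neural case, now regressing on the gradient features ∇f_θ(s, a); grad_f is an assumed callable returning that gradient, and the parameter shift below matches the multiplicative update above only to first order in η (a sketch, not the exact update).

import numpy as np

def npg_neural_iteration(theta, samples, grad_f, eta, gamma, reg=1e-6):
    # samples: list of (s, a, A_hat); grad_f(theta, s, a) -> grad_theta f_theta(s, a).
    G = np.array([grad_f(theta, s, a) for s, a, _ in samples])   # n x d "tangent" features
    A_hat = np.array([adv for _, _, adv in samples])
    # W_t = argmin_W  E_hat[ ( W . grad f_theta(s, a) - A^{(t)}(s, a) )^2 ]
    W = np.linalg.solve(G.T @ G + reg * np.eye(G.shape[1]), G.T @ A_hat)
    # To first order, f_{theta + delta}(s, a) ~ f_theta(s, a) + delta . grad f_theta(s, a),
    # so this shift approximates multiplying pi by exp( eta/(1-gamma) * A_hat_t(s, a) ).
    return theta + (eta / (1.0 - gamma)) * W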

Page 49

Guarantees: NPG for neural policy classes

Theorem [Agarwal, Lee, K., Mahajan 2019]
A: # actions. H: horizon. After T iterations, for all s_0, the NPG algorithm satisfies:

V^{π_T}(s_0) ≥ V*(s_0) − H √( 2 log A / T ) − √( H³ κ ε )

• (realizability) Suppose that A^π(s, a) is a linear function in ∇f_θ(s, a).
• (supervised learning error) Suppose we have bounded regression error, say due to sampling:

E_{s,a∼ν}[ ( A^π(s, a) − Ŵ_t ⋅ ∇f_θ(s, a) )² ] ≤ ε

• (relative condition number) With respect to the optimal state-action measure d* (starting from s_0):

κ = max_x  E_{s,a∼d*}[ (∇f_θ(s, a) ⋅ x)² ] / E_{s,a∼ν}[ (∇f_θ(s, a) ⋅ x)² ]

Related: NTK + TRPO analysis [Liu et al. ‘19]

Page 50

Thank you!

• Today: mathematical foundations of policy gradient methods.

• With “coverage”, policy gradients have the strongest theoretical guarantees and are practically effective!

• New directions/not discussed:
  • design of good exploratory distributions μ…
  • relations to transfer learning and “distribution shift”

RL is a very relevant area, both now and in the future! With some basics, please participate…

Page 51

Some details for the fast rate!

V^{(t+1)}(μ) − V^{(t)}(μ)

= (1/(1 − γ)) E_{s∼d^{(t+1)}_μ} ∑_a π^{(t+1)}(a|s) A^{(t)}(s, a)

= (1/η) E_{s∼d^{(t+1)}_μ} ∑_a π^{(t+1)}(a|s) log( π^{(t+1)}(a|s) Z_t(s) / π^{(t)}(a|s) )

= (1/η) E_{s∼d^{(t+1)}_μ} KL(π^{(t+1)}_s ∥ π^{(t)}_s) + (1/η) E_{s∼d^{(t+1)}_μ} log Z_t(s)

≥ (1/η) E_{s∼d^{(t+1)}_μ} log Z_t(s)

≥ ((1 − γ)/η) E_{s∼μ} log Z_t(s).