
RL Tutorial
CSC2515 Fall 2019

Adamo Young

University of Toronto

November 17, 2019


Markov Decision Process (MDP)

$S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, $R_t \in \mathcal{R}$; timesteps $t \in \mathbb{N}$ (discrete)

Finite: $\mathcal{S}$ and $\mathcal{A}$ both have a finite number of elements (discrete)

Fully observable: $S_t$ is known exactly

Trajectory: a sequence of states, actions, and rewards $S_0, A_0, R_1, S_1, A_1, R_2, \dots$


Markov Property

Dynamics of the MDP are completely captured by the current state and the proposed action

States include all information about the past that makes a difference for the future

$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
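To make the four-argument dynamics function concrete, here is a minimal sketch (my own illustration, not from the slides) that stores $p(s', r \mid s, a)$ for a small hypothetical two-state MDP as a lookup table and samples one environment step from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state MDP: p(s', r | s, a) stored as a lookup table.
# Each entry maps (s, a) -> list of (probability, next_state, reward).
dynamics = {
    ("high", "search"): [(0.7, "high", 2.0), (0.3, "low", 2.0)],
    ("high", "wait"):   [(1.0, "high", 1.0)],
    ("low", "search"):  [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("low", "wait"):    [(1.0, "low", 1.0)],
}

def sample_step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    probs = [p for p, _, _ in outcomes]
    idx = rng.choice(len(outcomes), p=probs)
    _, next_state, reward = outcomes[idx]
    return next_state, reward

print(sample_step("high", "search"))
```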


MDP Generalizations

Continuous timesteps

Continuous states/actions

Partially observable (POMDP): probability distribution over $S_t$

Don't worry about them for now


Episodic vs Continuing Tasks

Episodic Task
  • $t = 0, 1, \dots, T$
  • $T$ (length of episode) can vary from episode to episode
  • Example: Pacman

Continuing Task
  • No upper limit on $t$
  • Example: learning to walk


Expected Return and Unified Task Notation

$G_t$ is the return (the discounted sum of future rewards)

$G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

$\gamma$ is the discount factor, $0 < \gamma \le 1$

Episodic Tasks
  • $R_t = 0$ for all $t > T$, so the sum is finite
  • Often choose $\gamma = 1$
  • Always converges since the sum is finite

Continuing Tasks
  • Infinite sum
  • $\gamma < 1$ (future rewards are valued less)
  • If $R_t$ is bounded, the infinite sum converges

Either way, we can ensure that $G_t$ is well-defined
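As a quick, hedged illustration (mine, not from the slides): for an episodic task the return can be computed from the finite list of rewards observed after time $t$, accumulating from the end of the episode backwards.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed right-to-left over the remaining rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards R_1, ..., R_T after time t = 0; with gamma = 0.9 this gives 1 + 0.9**3 * 10 = 8.29.
print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))
```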


Policies

Policy $\pi$: a mapping from states to actions

$\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}$; can be deterministic or probabilistic (a sketch of both follows below)

Example: deterministic wall-following maze navigation policy
  • Turn right if possible
  • If you can't turn right, go straight if possible
  • If you can't turn right or go straight, turn left if possible
  • Otherwise, turn around
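A small sketch (hypothetical states and probabilities, not from the slides) contrasting a deterministic policy, stored as a state-to-action table, with a probabilistic one that samples from $\pi(a \mid s)$.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["right", "straight", "left", "turn_around"]

# Deterministic policy: each state maps to exactly one action.
deterministic_pi = {"corner": "right", "corridor": "straight", "dead_end": "turn_around"}

# Probabilistic policy: each state maps to a distribution over actions.
stochastic_pi = {
    "corner":   [0.7, 0.2, 0.1, 0.0],
    "corridor": [0.1, 0.8, 0.1, 0.0],
    "dead_end": [0.0, 0.0, 0.0, 1.0],
}

def act(state, deterministic=True):
    if deterministic:
        return deterministic_pi[state]
    return actions[rng.choice(len(actions), p=stochastic_pi[state])]

print(act("corner"), act("corner", deterministic=False))
```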


Value Functions

State-value function: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$
  • Intuitively, how "good" a state $s$ is under policy $\pi$
  • $v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$

Action-value function: $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
  • Intuitively, how "good" an action $a$ is in state $s$ under policy $\pi$
  • $q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$


MDP Illustration

Note: in this case we are using $p(s' \mid s, a)$ to model the environment instead of $p(s', r \mid s, a)$

This is because there is no ambiguity in the reward $r$ given $s, a, s'$ (but in general there could be)

Even with only two states and three possible actions, the MDP can be pretty complicated to model


Some Potentially Useful Value Identities

Recall $\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}$
Recall $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$
Recall $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

$v_\pi(s) = \mathbb{E}_\pi[q_\pi(s, a) \mid S_t = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, q_\pi(s, a)$

$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$

$q_\pi(s, a) = \mathbb{E}_{p(s', r \mid s, a)}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$

Common theme: value functions are recursive; expand one step to see a useful pattern (a worked sketch follows below)
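Here is a minimal sketch (my own, reusing the hypothetical two-state MDP from earlier with a uniform-random policy) of how the recursive identities get used: iterate $v_\pi(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$ until it settles, then read off $q_\pi$ from the one-step expansion.

```python
# Iterative policy evaluation on a tiny hypothetical MDP with a uniform-random policy.
# dynamics: (s, a) -> list of (probability, next_state, reward)
dynamics = {
    ("high", "search"): [(0.7, "high", 2.0), (0.3, "low", 2.0)],
    ("high", "wait"):   [(1.0, "high", 1.0)],
    ("low", "search"):  [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("low", "wait"):    [(1.0, "low", 1.0)],
}
states, actions, gamma = ["high", "low"], ["search", "wait"], 0.9
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform policy pi(a|s) = 0.5

v = {s: 0.0 for s in states}
for _ in range(200):  # repeated one-step expansions converge to v_pi
    v = {s: sum(pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
                for a in actions)
         for s in states}

# q_pi from the one-step expansion q_pi(s, a) = E[R + gamma * v_pi(S')].
q = {(s, a): sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
     for s in states for a in actions}
print(v)
print(q)
```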


Solving RL Problems

Two main classes of solutions

Tabular Solution Methods
  • Provide exact solutions
  • Do not scale very well with the size of the search space

Approximate Solution Methods
  • Use function approximators (e.g. neural networks) for the value and/or policy functions
  • Policy Gradient Methods fall under this category


Introducing Policy Gradient

Parameterize the policy: $\pi(a \mid s, \theta) \doteq \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$
  • For example, $\pi$ could be a neural network with weights $\theta$

Measure policy performance with a scalar performance measure $J(\theta)$

Maximize $J(\theta)$ with gradient ascent on $\theta$

General update rule: $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla_\theta J}(\theta_t)$
  • $\alpha$ is the step size
  • $\widehat{\nabla_\theta J}(\theta_t)$ is an approximation to $\nabla_\theta J(\theta_t)$

We are increasing the probability of trajectories with large rewards and decreasing the probability of trajectories with small rewards

For simplicity, we'll only consider episodic tasks for the remainder of the presentation (a sketch of one possible parameterization follows below)
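One concrete (illustrative, not from the slides) choice of parameterization is a tabular softmax policy, $\pi(a \mid s, \theta) = \exp(\theta_{s,a}) / \sum_b \exp(\theta_{s,b})$; the sketch below also computes $\nabla_\theta \log \pi(a \mid s, \theta)$, which the REINFORCE update will need later.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(.|s, theta) for a tabular softmax parameterization; theta has shape [n_states, n_actions]."""
    prefs = theta[s] - theta[s].max()     # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi(a|s, theta): (1[b = a] - pi(b|s)) in row s, zero elsewhere."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax_policy(theta, s)
    grad[s, a] += 1.0
    return grad

theta = np.zeros((4, 2))                  # 4 hypothetical states, 2 actions
print(softmax_policy(theta, 0))           # uniform at initialization
print(grad_log_pi(theta, 0, 1))
```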


The Policy Gradient Theorem

$J(\theta) \doteq v_\pi(s_0)$, where $s_0$ is the initial state of a trajectory and $\pi$ is a policy parameterized by $\theta$

The Policy Gradient Theorem states that

$\nabla J(\theta) = \nabla_\theta v_\pi(s_0) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$

$\mu(s) \doteq \dfrac{\eta(s)}{\sum_{s'} \eta(s')}$ is the on-policy distribution under $\pi$
  • $\eta(s)$ is the expected number of visits to state $s$
  • $\mu(s)$ is the fraction of time spent in state $s$

The constant of proportionality is the average length of an episode

See p. 325 of Sutton and Barto's RL book (Second Edition) for a concise proof


REINFORCE Update

$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$

$= \mathbb{E}_\pi\!\left[\sum_a q_\pi(S_t, a)\, \nabla \pi(a \mid S_t, \theta)\right]$

$= \mathbb{E}_\pi\!\left[\sum_a \pi(a \mid S_t, \theta)\, q_\pi(S_t, a)\, \dfrac{\nabla \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)}\right]$

$= \mathbb{E}_\pi\!\left[\mathbb{E}_\pi\!\left[q_\pi(S_t, A_t)\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]\right] = \mathbb{E}_\pi\!\left[q_\pi(S_t, A_t)\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$


REINFORCE Update (Continued)

$\nabla J(\theta) \propto \mathbb{E}_\pi\!\left[q_\pi(S_t, A_t)\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$

$= \mathbb{E}_\pi\!\left[\mathbb{E}_\pi[G_t \mid S_t, A_t]\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$

$= \mathbb{E}_\pi\!\left[\mathbb{E}_\pi\!\left[G_t\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} \,\middle|\, S_t, A_t\right]\right]$

$= \mathbb{E}_\pi\!\left[G_t\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$

REINFORCE update: $\theta_{t+1} = \theta_t + \alpha\, G_t\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}$


Quick Aside: Monte Carlo

A way of approximating an integral/expectation

$\mathbb{E}_{p(x)}[f(x)] = \int_{x \in \mathcal{X}} p(x)\, f(x)\, dx$

If $p(x)$ and/or $f(x)$ is complicated, the integral is difficult to solve analytically

Simple idea: sample $N$ values $x^{(1)}, \dots, x^{(N)}$ from the distribution $p(x)$ and take the average

$\mathbb{E}_{p(x)}[f(x)] \approx \dfrac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$

The larger $N$ is, the better the estimate
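A tiny sketch of the idea (my example, not from the slides): estimate $\mathbb{E}_{p(x)}[f(x)]$ for $x \sim \mathcal{N}(0, 1)$ and $f(x) = x^2$, whose true value is $1$, and watch the estimate improve with $N$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sample, n):
    xs = sample(n)              # draw N samples x^(i) ~ p(x)
    return np.mean(f(xs))       # average f over the samples

for n in [10, 1000, 100000]:
    est = mc_expectation(lambda x: x ** 2, lambda k: rng.standard_normal(k), n)
    print(n, est)               # approaches E[x^2] = 1 as N grows
```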


REINFORCE Algorithm

Generate entire trajectories and compute $G_t$ from them (a sketch follows below)

Gradient estimates only need to be proportional to the true gradient, since the constant of proportionality can be absorbed into the step size $\alpha$
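The algorithm box on this slide is an image in the original deck; below is a minimal sketch of the same loop, assuming the tabular softmax policy from earlier and a made-up 5-state corridor task (actions left/right, reward +1 only on reaching the rightmost state).

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA, ALPHA = 5, 2, 1.0, 0.1   # actions: 0 = left, 1 = right

def env_step(s, a):
    """Hypothetical corridor: the episode ends with reward +1 at the right end."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def policy(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta = np.zeros((N_STATES, N_ACTIONS))
for episode in range(500):
    # Generate an entire trajectory S0, A0, R1, S1, A1, ... following pi(.|., theta).
    s, traj, done = 0, [], False
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s2, r, done = env_step(s, a)
        traj.append((s, a, r))
        s = s2
    # For each step t, compute G_t and apply
    # theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t, theta).
    g = 0.0
    for t in reversed(range(len(traj))):
        s, a, r = traj[t]
        g = r + GAMMA * g
        grad_log = -policy(theta, s)
        grad_log[a] += 1.0
        theta[s] += ALPHA * (GAMMA ** t) * g * grad_log

print(policy(theta, 0))   # should now strongly prefer action 1 (move right)
```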


REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

$J(\theta) \doteq v_\pi(s_0) = \mathbb{E}_\tau[G_0(\tau)] = \sum_\tau p(\tau \mid \pi_\theta)\, G_0(\tau)$
  • $p(\tau \mid \pi_\theta)$ is the probability of the trajectory under policy $\pi_\theta$
  • $G_0(\tau)$ is the sum of the rewards given trajectory $\tau$

Recall $\nabla_x \log f(x) = \dfrac{\nabla_x f(x)}{f(x)}$ by the chain rule

We can use the log-derivative trick to derive an expression for $\nabla J(\theta)$

In statistics, this estimator is called the Score Function Estimator


REINFORCE as a Score Function Gradient Estimator

$\nabla J(\theta) = \nabla \sum_\tau p(\tau \mid \pi_\theta)\, G_0(\tau)$

$= \sum_\tau \nabla p(\tau \mid \pi_\theta)\, G_0(\tau)$

$= \sum_\tau \dfrac{p(\tau \mid \pi_\theta)}{p(\tau \mid \pi_\theta)}\, \nabla p(\tau \mid \pi_\theta)\, G_0(\tau)$

$= \sum_\tau p(\tau \mid \pi_\theta)\, \nabla \log p(\tau \mid \pi_\theta)\, G_0(\tau)$

$= \mathbb{E}_\tau[\nabla \log p(\tau \mid \pi_\theta)\, G_0(\tau)]$

$\widehat{\nabla J}(\theta) = \dfrac{1}{N} \sum_{i=1}^{N} \nabla \log p(\tau^{(i)} \mid \pi_\theta)\, G_0(\tau^{(i)})$ (Monte Carlo approximation over sampled trajectories $\tau^{(i)} \sim p(\tau \mid \pi_\theta)$)
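The same estimator outside of RL (my example, not from the slides): for $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, the score is $\nabla_\theta \log p(x \mid \theta) = x - \theta$ and the true gradient of $\mathbb{E}[x^2] = \theta^2 + 1$ is $2\theta$, so the Monte Carlo score-function estimate can be checked against it.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000

x = rng.normal(theta, 1.0, size=n)   # x^(i) ~ p(x | theta) = N(theta, 1)
f = x ** 2                           # f(x) = x^2
score = x - theta                    # grad_theta log p(x | theta) for a unit-variance Gaussian

grad_estimate = np.mean(score * f)   # score function (REINFORCE-style) estimator
print(grad_estimate, 2 * theta)      # the estimate should be close to the analytic gradient
```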


Properties of the REINFORCE / Score Function Gradient Estimator

Pros
  • Unbiased: $\mathbb{E}\big[\widehat{\nabla J}(\theta)\big] = \nabla J(\theta)$
  • Very general (the reward does not need to be continuous or differentiable)

Cons
  • High variance (noisy)
  • Requires many sample trajectories to estimate the gradient


Reducing REINFORCE Estimator Variance

Recall $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$

Let $b(s)$ be a baseline (some function of $s$ that does not depend on $\theta$):

$\sum_a b(s)\, \nabla \pi(a \mid s, \theta) = b(s)\, \nabla \sum_a \pi(a \mid s, \theta) = b(s)\, \nabla 1 = 0$

Thus $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \big(q_\pi(s, a) - b(s)\big)\, \nabla \pi(a \mid s, \theta)$

New update rule: $\theta_{t+1} = \theta_t + \alpha\, \big(G_t - b(S_t)\big)\, \dfrac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$

Natural choice for $b(s)$: an estimate of the state value, $\hat{v}(S_t, w)$, parameterized by weights $w$
  • We want to adjust $\theta$ based on how much better (or worse) the path is than the average score of paths starting in the current state ($\hat{v}$)


REINFORCE with Baseline Algorithm

Two different step sizes, $\alpha^\theta$ and $\alpha^w$

$\hat{v}$ is also estimated with Monte Carlo (a sketch follows below)
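A minimal sketch of the with-baseline variant (same hypothetical corridor setup as the earlier REINFORCE sketch): the baseline $\hat v(s, w)$ is kept as a table, updated by Monte Carlo with step size $\alpha^w$, while the policy update uses $G_t - \hat v(S_t, w)$ in place of $G_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 5, 2, 1.0
ALPHA_THETA, ALPHA_W = 0.1, 0.1                      # two different step sizes

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def policy(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta = np.zeros((N_STATES, N_ACTIONS))              # policy parameters
w = np.zeros(N_STATES)                               # tabular baseline v_hat(s, w)

for episode in range(500):
    s, traj, done = 0, [], False
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s2, r, done = env_step(s, a)
        traj.append((s, a, r))
        s = s2
    g = 0.0
    for t in reversed(range(len(traj))):
        s, a, r = traj[t]
        g = r + GAMMA * g
        delta = g - w[s]                             # how much better G_t is than the baseline
        w[s] += ALPHA_W * delta                      # Monte Carlo update of the baseline
        grad_log = -policy(theta, s)
        grad_log[a] += 1.0
        theta[s] += ALPHA_THETA * (GAMMA ** t) * delta * grad_log
```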


REINFORCE with Baseline vs without


Extensions: Actor-Critic

Use bootstrapping to update the state-value estimate $\hat{v}(s, w)$ for a state $s$ from the estimated values of subsequent states

Introduces bias, but reduces variance

Can use the one-step or $n$-step return (as opposed to the full returns in REINFORCE)


One-step Actor-Critic Algorithm
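The pseudocode box on this slide is also an image; here is a minimal one-step actor-critic sketch in the same hypothetical corridor setup, where the critic's bootstrapped target $R_{t+1} + \gamma \hat v(S_{t+1}, w)$ replaces the full Monte Carlo return.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 5, 2, 1.0
ALPHA_THETA, ALPHA_W = 0.1, 0.1

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def policy(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta, w = np.zeros((N_STATES, N_ACTIONS)), np.zeros(N_STATES)

for episode in range(500):
    s, done, I = 0, False, 1.0                         # I accumulates gamma^t
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s2, r, done = env_step(s, a)
        target = r + (0.0 if done else GAMMA * w[s2])  # one-step bootstrapped target
        delta = target - w[s]                          # TD error: biased, lower variance
        w[s] += ALPHA_W * delta                        # critic update
        grad_log = -policy(theta, s)
        grad_log[a] += 1.0
        theta[s] += ALPHA_THETA * I * delta * grad_log # actor update
        I *= GAMMA
        s = s2
```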


OpenAI Gym

Free Python library of environments for reinforcement learning

Easy to use, good for benchmarking

Environments include
  • Classic control problems
  • Atari games
  • 3D simulations (Mujoco)
  • Robotics

Check out the docs and environments
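A minimal interaction-loop sketch, assuming the Gym API as it was around the time of this tutorial (reset returning only the observation, step returning a 4-tuple); newer Gym/Gymnasium releases changed these signatures.

```python
import gym

env = gym.make("CartPole-v1")            # a classic control environment
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # random actions, just to show the interface
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode return:", total_reward)
```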


REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoy's CSC411 course webpage

Uses policy gradient with REINFORCE

Note: requires TensorFlow 1.x (not 2.x)


References

Sutton and Barto, "Reinforcement Learning: An Introduction" (Second Edition), 2018

Xuchan (Jenny) Bao, Intro to RL and Policy Gradient, CSC421 Tutorial, 2019 (link)

Michael Guerzhoy, CSC411 Project 4: Reinforcement Learning with Policy Gradients, 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 2: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Markov Decision Process (MDP)

St isin SAt isin ARt isin RTimesteps t isin N (discrete)

Finite SA both have finite number of elements (discrete)

Fully observable St is known exactly

Trajectory sequence of state-action-reward S0A0R1 S1A1

Adamo Young (University of Toronto) RL Tutorial November 17 2019 2 28

Markov Property

Dynamics of the MDP are completely captured in the current stateand the proposed action

States include all information about the past that make a differencefor the future

p(s prime r |s a) PrSt = s primeRt = r |Stminus1 = sAtminus1 = a

Adamo Young (University of Toronto) RL Tutorial November 17 2019 3 28

MDP Generalizations

Continuous timesteps

Continuous statesactions

Partially observable (POMDP) probability distribution over St

Donrsquot worry about them for now

Adamo Young (University of Toronto) RL Tutorial November 17 2019 4 28

Episodic vs Continuing Tasks

Episodic Task

t = 0 1 T

T (length of episode) canvary from episode to episode

Example Pacman

Continuing Task

No upper limit on t

Example learning to walk

Adamo Young (University of Toronto) RL Tutorial November 17 2019 5 28

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 3: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Markov Property

Dynamics of the MDP are completely captured in the current stateand the proposed action

States include all information about the past that make a differencefor the future

p(s prime r |s a) PrSt = s primeRt = r |Stminus1 = sAtminus1 = a

Adamo Young (University of Toronto) RL Tutorial November 17 2019 3 28

MDP Generalizations

Continuous timesteps

Continuous statesactions

Partially observable (POMDP) probability distribution over St

Donrsquot worry about them for now

Adamo Young (University of Toronto) RL Tutorial November 17 2019 4 28

Episodic vs Continuing Tasks

Episodic Task

t = 0 1 T

T (length of episode) canvary from episode to episode

Example Pacman

Continuing Task

No upper limit on t

Example learning to walk

Adamo Young (University of Toronto) RL Tutorial November 17 2019 5 28

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 4: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

MDP Generalizations

Continuous timesteps

Continuous statesactions

Partially observable (POMDP) probability distribution over St

Donrsquot worry about them for now

Adamo Young (University of Toronto) RL Tutorial November 17 2019 4 28

Episodic vs Continuing Tasks

Episodic Task

t = 0 1 T

T (length of episode) canvary from episode to episode

Example Pacman

Continuing Task

No upper limit on t

Example learning to walk

Adamo Young (University of Toronto) RL Tutorial November 17 2019 5 28

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 5: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Episodic vs Continuing Tasks

Episodic Task

t = 0 1 T

T (length of episode) canvary from episode to episode

Example Pacman

Continuing Task

No upper limit on t

Example learning to walk

Adamo Young (University of Toronto) RL Tutorial November 17 2019 5 28

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 6: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

∇J(θ) = ∇ Στ p(τ|πθ) G0(τ)

     = Στ ∇p(τ|πθ) G0(τ)

     = Στ (p(τ|πθ) / p(τ|πθ)) ∇p(τ|πθ) G0(τ)

     = Στ p(τ|πθ) ∇ log p(τ|πθ) G0(τ)

     = Eτ[ ∇ log p(τ|πθ) G0(τ) ]

Monte Carlo approximation, using N trajectories τ(i) sampled from p(τ|πθ):

∇J(θ) ≈ (1/N) Σi=1..N ∇ log p(τ(i)|πθ) G0(τ(i))
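A quick sanity check of the score-function identity on a case with a known answer (an illustrative choice, not from the slides): for x ~ N(θ, 1), ∇θ log p(x|θ) = x − θ, and ∇θ E[x²] = 2θ.

    import numpy as np

    rng = np.random.default_rng(1)
    theta, N = 1.5, 100_000
    x = rng.normal(theta, 1.0, size=N)        # samples from p(·|θ), the analogue of trajectories
    score = x - theta                         # ∇θ log p(x|θ) for a unit-variance Gaussian
    print(np.mean(score * x ** 2))            # ≈ 2θ = 3.0, but noticeably noisy for small N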

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of the REINFORCE / Score Function Gradient Estimator

Pros
  - Unbiased: E[∇̂J(θ)] = ∇J(θ)
  - Very general (the reward does not need to be continuous or differentiable)

Cons
  - High variance (noisy)
  - Requires many sample trajectories to estimate the gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall ∇J(θ) ∝ Σs µ(s) Σa qπ(s, a) ∇π(a|s, θ)

Let b(s) be a baseline (some function of s that does not depend on θ). Then

Σa b(s) ∇π(a|s, θ) = b(s) ∇ Σa π(a|s, θ) = b(s) ∇1 = 0

Thus ∇J(θ) ∝ Σs µ(s) Σa (qπ(s, a) − b(s)) ∇π(a|s, θ)

New update rule: θt+1 = θt + α (Gt − b(St)) ∇π(At|St, θt) / π(At|St, θt)

Natural choice for b(s): an estimate v̂(St, w) of the state value, parameterized by weights w
  - We want to adjust θ based on how much better (or worse) the path is than the average score of paths starting in the current state (v̂)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes: αθ for the policy weights and αw for the value-estimate weights

v̂ is also estimated with Monte Carlo
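A hedged sketch of how the baseline changes the earlier REINFORCE loop (reusing softmax_policy, theta, gamma, n_states, and traj from that sketch; v_hat, alpha_theta, and alpha_w are new, illustrative names): only the per-step update differs.

    v_hat = np.zeros(n_states)                      # Monte Carlo estimate of v(s), the weights w
    alpha_theta, alpha_w = 0.05, 0.1

    # ... after generating traj = [(s, a, r), ...] exactly as before:
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        delta = G - v_hat[s]                        # Gt − b(St): better or worse than expected?
        v_hat[s] += alpha_w * delta                 # nudge the baseline toward the observed return
        probs = softmax_policy(theta, s)
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log_pi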

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions: Actor-Critic

Use bootstrapping to update the state-value estimate v̂(s, w) for a state s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use the one-step or n-step return (as opposed to the full return in REINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm
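The algorithm figure is not reproduced here; as a stand-in, a minimal sketch of the one-step (TD) actor-critic update, reusing the names from the sketches above (again omitting the γ^t factor of the full algorithm):

    s, done = 0, False
    while not done:
        probs = softmax_policy(theta, s)
        a = rng.choice(n_actions, p=probs)
        s_next, r, done = env_step(s, a)
        target = r if done else r + gamma * v_hat[s_next]
        delta = target - v_hat[s]                       # one-step TD error: the critic's signal
        v_hat[s] += alpha_w * delta                     # critic: bootstrap toward r + γ v̂(s')
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log_pi   # actor: policy gradient step
        s = s_next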

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free Python library of environments for reinforcement learning

Easy to use, good for benchmarking

Environments include
  - Classic control problems
  - Atari games
  - 3D simulations (MuJoCo)
  - Robotics

Check out the docs and environments
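A minimal usage sketch of the environment loop (this is the classic pre-0.26 Gym interface, matching the TensorFlow 1.x era of this tutorial; newer gym/gymnasium releases changed the reset/step return values):

    import gym

    env = gym.make("CartPole-v1")               # a classic control environment
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()      # random policy, just to show the interaction loop
        obs, reward, done, info = env.step(action)
        total_reward += reward
    env.close()
    print(total_reward)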

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoy's CSC411 course webpage

Uses policy gradient with REINFORCE

Note: requires TensorFlow 1.x (not 2.x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto, "Reinforcement Learning: An Introduction" (Second Edition), 2018

Xuchan (Jenny) Bao, "Intro to RL and Policy Gradient", CSC421 Tutorial, 2019 (link)

Michael Guerzhoy, "CSC411 Project 4: Reinforcement Learning with Policy Gradients", 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28


Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 15: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 16: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 17: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 18: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 19: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 20: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 21: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 22: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 23: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 24: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 25: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 26: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 27: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 28: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods