
RL Tutorial
CSC2515 Fall 2019

Adamo Young

University of Toronto

November 17, 2019


Markov Decision Process (MDP)

$S_t \in \mathcal{S}$, $A_t \in \mathcal{A}$, $R_t \in \mathcal{R}$; timesteps $t \in \mathbb{N}$ (discrete)

Finite: $\mathcal{S}$ and $\mathcal{A}$ both have a finite number of elements (discrete)

Fully observable: $S_t$ is known exactly

Trajectory: a sequence of states, actions, and rewards $S_0, A_0, R_1, S_1, A_1, R_2, \dots$


Markov Property

Dynamics of the MDP are completely captured by the current state and the proposed action

States include all information about the past that makes a difference for the future

$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
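To make the four-argument dynamics function concrete, here is a minimal sketch (my own illustration, not from the slides) that stores $p(s', r \mid s, a)$ for a small hypothetical two-state MDP as a lookup table and samples one environment step from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state MDP: p(s', r | s, a) stored as a lookup table.
# Each entry maps (s, a) -> list of (probability, next_state, reward).
dynamics = {
    ("high", "search"): [(0.7, "high", 2.0), (0.3, "low", 2.0)],
    ("high", "wait"):   [(1.0, "high", 1.0)],
    ("low", "search"):  [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("low", "wait"):    [(1.0, "low", 1.0)],
}

def sample_step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    probs = [p for p, _, _ in outcomes]
    idx = rng.choice(len(outcomes), p=probs)
    _, next_state, reward = outcomes[idx]
    return next_state, reward

print(sample_step("high", "search"))
```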


MDP Generalizations

Continuous timesteps

Continuous states/actions

Partially observable (POMDP): probability distribution over $S_t$

Don't worry about them for now


Episodic vs Continuing Tasks

Episodic Task
  • $t = 0, 1, \dots, T$
  • $T$ (length of episode) can vary from episode to episode
  • Example: Pacman

Continuing Task
  • No upper limit on $t$
  • Example: learning to walk


Expected Return and Unified Task Notation

$G_t$ is the return (the discounted sum of future rewards)

$G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

$\gamma$ is the discount factor, $0 < \gamma \le 1$

Episodic Tasks
  • $R_t = 0$ for all $t > T$, so the sum is finite
  • Often choose $\gamma = 1$
  • Always converges since the sum is finite

Continuing Tasks
  • Infinite sum
  • $\gamma < 1$ (future rewards are valued less)
  • If $R_t$ is bounded, the infinite sum converges

Either way, we can ensure that $G_t$ is well-defined
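As a quick, hedged illustration (mine, not from the slides): for an episodic task the return can be computed from the finite list of rewards observed after time $t$, accumulating from the end of the episode backwards.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed right-to-left over the remaining rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards R_1, ..., R_T after time t = 0; with gamma = 0.9 this gives 1 + 0.9**3 * 10 = 8.29.
print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))
```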


Policies

Policy $\pi$: a mapping from states to actions

$\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}$; can be deterministic or probabilistic (a sketch of both follows below)

Example: deterministic wall-following maze navigation policy
  • Turn right if possible
  • If you can't turn right, go straight if possible
  • If you can't turn right or go straight, turn left if possible
  • Otherwise, turn around
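A small sketch (hypothetical states and probabilities, not from the slides) contrasting a deterministic policy, stored as a state-to-action table, with a probabilistic one that samples from $\pi(a \mid s)$.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["right", "straight", "left", "turn_around"]

# Deterministic policy: each state maps to exactly one action.
deterministic_pi = {"corner": "right", "corridor": "straight", "dead_end": "turn_around"}

# Probabilistic policy: each state maps to a distribution over actions.
stochastic_pi = {
    "corner":   [0.7, 0.2, 0.1, 0.0],
    "corridor": [0.1, 0.8, 0.1, 0.0],
    "dead_end": [0.0, 0.0, 0.0, 1.0],
}

def act(state, deterministic=True):
    if deterministic:
        return deterministic_pi[state]
    return actions[rng.choice(len(actions), p=stochastic_pi[state])]

print(act("corner"), act("corner", deterministic=False))
```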


Value Functions

State-value function: $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$
  • Intuitively, how "good" a state $s$ is under policy $\pi$
  • $v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$

Action-value function: $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
  • Intuitively, how "good" an action $a$ is in state $s$ under policy $\pi$
  • $q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$


MDP Illustration

Note: in this case we are using $p(s' \mid s, a)$ to model the environment instead of $p(s', r \mid s, a)$

This is because there is no ambiguity in the reward $r$ given $s, a, s'$ (but in general there could be)

Even with only two states and three possible actions, the MDP can be pretty complicated to model


Some Potentially Useful Value Identities

Recall $\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}$
Recall $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$
Recall $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

$v_\pi(s) = \mathbb{E}_\pi[q_\pi(s, a) \mid S_t = s] = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s)\, q_\pi(s, a)$

$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$

$q_\pi(s, a) = \mathbb{E}_{p(s', r \mid s, a)}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$

Common theme: value functions are recursive; expand one step to see a useful pattern (a worked sketch follows below)
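Here is a minimal sketch (my own, reusing the hypothetical two-state MDP from earlier with a uniform-random policy) of how the recursive identities get used: iterate $v_\pi(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$ until it settles, then read off $q_\pi$ from the one-step expansion.

```python
# Iterative policy evaluation on a tiny hypothetical MDP with a uniform-random policy.
# dynamics: (s, a) -> list of (probability, next_state, reward)
dynamics = {
    ("high", "search"): [(0.7, "high", 2.0), (0.3, "low", 2.0)],
    ("high", "wait"):   [(1.0, "high", 1.0)],
    ("low", "search"):  [(0.6, "low", 2.0), (0.4, "high", -3.0)],
    ("low", "wait"):    [(1.0, "low", 1.0)],
}
states, actions, gamma = ["high", "low"], ["search", "wait"], 0.9
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform policy pi(a|s) = 0.5

v = {s: 0.0 for s in states}
for _ in range(200):  # repeated one-step expansions converge to v_pi
    v = {s: sum(pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
                for a in actions)
         for s in states}

# q_pi from the one-step expansion q_pi(s, a) = E[R + gamma * v_pi(S')].
q = {(s, a): sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
     for s in states for a in actions}
print(v)
print(q)
```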


Solving RL Problems

Two main classes of solutions

Tabular Solution Methods
  • Provide exact solutions
  • Do not scale very well with the size of the search space

Approximate Solution Methods
  • Use function approximators (e.g. neural networks) for the value and/or policy functions
  • Policy Gradient Methods fall under this category


Introducing Policy Gradient

Parameterize the policy: $\pi(a \mid s, \theta) \doteq \Pr\{A_t = a \mid S_t = s, \theta_t = \theta\}$
  • For example, $\pi$ could be a neural network with weights $\theta$

Measure policy performance with a scalar performance measure $J(\theta)$

Maximize $J(\theta)$ with gradient ascent on $\theta$

General update rule: $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla_\theta J}(\theta_t)$
  • $\alpha$ is the step size
  • $\widehat{\nabla_\theta J}(\theta_t)$ is an approximation to $\nabla_\theta J(\theta_t)$

We are increasing the probability of trajectories with large rewards and decreasing the probability of trajectories with small rewards

For simplicity, we'll only consider episodic tasks for the remainder of the presentation (a sketch of one possible parameterization follows below)
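One concrete (illustrative, not from the slides) choice of parameterization is a tabular softmax policy, $\pi(a \mid s, \theta) = \exp(\theta_{s,a}) / \sum_b \exp(\theta_{s,b})$; the sketch below also computes $\nabla_\theta \log \pi(a \mid s, \theta)$, which the REINFORCE update will need later.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(.|s, theta) for a tabular softmax parameterization; theta has shape [n_states, n_actions]."""
    prefs = theta[s] - theta[s].max()     # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi(a|s, theta): (1[b = a] - pi(b|s)) in row s, zero elsewhere."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax_policy(theta, s)
    grad[s, a] += 1.0
    return grad

theta = np.zeros((4, 2))                  # 4 hypothetical states, 2 actions
print(softmax_policy(theta, 0))           # uniform at initialization
print(grad_log_pi(theta, 0, 1))
```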


The Policy Gradient Theorem

$J(\theta) \doteq v_\pi(s_0)$, where $s_0$ is the initial state of a trajectory and $\pi$ is a policy parameterized by $\theta$

The Policy Gradient Theorem states that

$\nabla J(\theta) = \nabla_\theta v_\pi(s_0) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$

$\mu(s) \doteq \dfrac{\eta(s)}{\sum_{s'} \eta(s')}$ is the on-policy distribution under $\pi$
  • $\eta(s)$ is the expected number of visits to state $s$
  • $\mu(s)$ is the fraction of time spent in state $s$

The constant of proportionality is the average length of an episode

See p. 325 of Sutton and Barto's RL book (Second Edition) for a concise proof


REINFORCE Update

$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$

$= \mathbb{E}_\pi\!\left[\sum_a q_\pi(S_t, a)\, \nabla \pi(a \mid S_t, \theta)\right]$

$= \mathbb{E}_\pi\!\left[\sum_a \pi(a \mid S_t, \theta)\, q_\pi(S_t, a)\, \dfrac{\nabla \pi(a \mid S_t, \theta)}{\pi(a \mid S_t, \theta)}\right]$

$= \mathbb{E}_\pi\!\left[\mathbb{E}_\pi\!\left[q_\pi(S_t, A_t)\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]\right] = \mathbb{E}_\pi\!\left[q_\pi(S_t, A_t)\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$


REINFORCE Update (Continued)

$\nabla J(\theta) \propto \mathbb{E}_\pi\!\left[q_\pi(S_t, A_t)\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$

$= \mathbb{E}_\pi\!\left[\mathbb{E}_\pi[G_t \mid S_t, A_t]\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$

$= \mathbb{E}_\pi\!\left[\mathbb{E}_\pi\!\left[G_t\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)} \,\middle|\, S_t, A_t\right]\right]$

$= \mathbb{E}_\pi\!\left[G_t\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}\right]$

REINFORCE update: $\theta_{t+1} = \theta_t + \alpha\, G_t\, \dfrac{\nabla \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}$


Quick Aside: Monte Carlo

A way of approximating an integral/expectation

$\mathbb{E}_{p(x)}[f(x)] = \int_{x \in \mathcal{X}} p(x)\, f(x)\, dx$

If $p(x)$ and/or $f(x)$ is complicated, the integral is difficult to solve analytically

Simple idea: sample $N$ values $x^{(1)}, \dots, x^{(N)}$ from the distribution $p(x)$ and take the average

$\mathbb{E}_{p(x)}[f(x)] \approx \dfrac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$

The larger $N$ is, the better the estimate
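A tiny sketch of the idea (my example, not from the slides): estimate $\mathbb{E}_{p(x)}[f(x)]$ for $x \sim \mathcal{N}(0, 1)$ and $f(x) = x^2$, whose true value is $1$, and watch the estimate improve with $N$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sample, n):
    xs = sample(n)              # draw N samples x^(i) ~ p(x)
    return np.mean(f(xs))       # average f over the samples

for n in [10, 1000, 100000]:
    est = mc_expectation(lambda x: x ** 2, lambda k: rng.standard_normal(k), n)
    print(n, est)               # approaches E[x^2] = 1 as N grows
```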


REINFORCE Algorithm

Generate entire trajectories and compute $G_t$ from them (a sketch follows below)

Gradient estimates only need to be proportional to the true gradient, since the constant of proportionality can be absorbed into the step size $\alpha$
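The algorithm box on this slide is an image in the original deck; below is a minimal sketch of the same loop, assuming the tabular softmax policy from earlier and a made-up 5-state corridor task (actions left/right, reward +1 only on reaching the rightmost state).

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA, ALPHA = 5, 2, 1.0, 0.1   # actions: 0 = left, 1 = right

def env_step(s, a):
    """Hypothetical corridor: the episode ends with reward +1 at the right end."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def policy(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta = np.zeros((N_STATES, N_ACTIONS))
for episode in range(500):
    # Generate an entire trajectory S0, A0, R1, S1, A1, ... following pi(.|., theta).
    s, traj, done = 0, [], False
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s2, r, done = env_step(s, a)
        traj.append((s, a, r))
        s = s2
    # For each step t, compute G_t and apply
    # theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t, theta).
    g = 0.0
    for t in reversed(range(len(traj))):
        s, a, r = traj[t]
        g = r + GAMMA * g
        grad_log = -policy(theta, s)
        grad_log[a] += 1.0
        theta[s] += ALPHA * (GAMMA ** t) * g * grad_log

print(policy(theta, 0))   # should now strongly prefer action 1 (move right)
```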


REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

$J(\theta) \doteq v_\pi(s_0) = \mathbb{E}_\tau[G_0(\tau)] = \sum_\tau p(\tau \mid \pi_\theta)\, G_0(\tau)$
  • $p(\tau \mid \pi_\theta)$ is the probability of the trajectory under policy $\pi_\theta$
  • $G_0(\tau)$ is the sum of the rewards given trajectory $\tau$

Recall $\nabla_x \log f(x) = \dfrac{\nabla_x f(x)}{f(x)}$ by the chain rule

We can use the log-derivative trick to derive an expression for $\nabla J(\theta)$

In statistics, this estimator is called the Score Function Estimator


REINFORCE as a Score Function Gradient Estimator

$\nabla J(\theta) = \nabla \sum_\tau p(\tau \mid \pi_\theta)\, G_0(\tau)$

$= \sum_\tau \nabla p(\tau \mid \pi_\theta)\, G_0(\tau)$

$= \sum_\tau \dfrac{p(\tau \mid \pi_\theta)}{p(\tau \mid \pi_\theta)}\, \nabla p(\tau \mid \pi_\theta)\, G_0(\tau)$

$= \sum_\tau p(\tau \mid \pi_\theta)\, \nabla \log p(\tau \mid \pi_\theta)\, G_0(\tau)$

$= \mathbb{E}_\tau[\nabla \log p(\tau \mid \pi_\theta)\, G_0(\tau)]$

$\widehat{\nabla J}(\theta) = \dfrac{1}{N} \sum_{i=1}^{N} \nabla \log p(\tau^{(i)} \mid \pi_\theta)\, G_0(\tau^{(i)})$ (Monte Carlo approximation over sampled trajectories $\tau^{(i)} \sim p(\tau \mid \pi_\theta)$)
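The same estimator outside of RL (my example, not from the slides): for $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, the score is $\nabla_\theta \log p(x \mid \theta) = x - \theta$ and the true gradient of $\mathbb{E}[x^2] = \theta^2 + 1$ is $2\theta$, so the Monte Carlo score-function estimate can be checked against it.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000

x = rng.normal(theta, 1.0, size=n)   # x^(i) ~ p(x | theta) = N(theta, 1)
f = x ** 2                           # f(x) = x^2
score = x - theta                    # grad_theta log p(x | theta) for a unit-variance Gaussian

grad_estimate = np.mean(score * f)   # score function (REINFORCE-style) estimator
print(grad_estimate, 2 * theta)      # the estimate should be close to the analytic gradient
```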


Properties of the REINFORCE / Score Function Gradient Estimator

Pros
  • Unbiased: $\mathbb{E}\big[\widehat{\nabla J}(\theta)\big] = \nabla J(\theta)$
  • Very general (the reward does not need to be continuous or differentiable)

Cons
  • High variance (noisy)
  • Requires many sample trajectories to estimate the gradient


Reducing REINFORCE Estimator Variance

Recall $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$

Let $b(s)$ be a baseline (some function of $s$ that does not depend on $\theta$):

$\sum_a b(s)\, \nabla \pi(a \mid s, \theta) = b(s)\, \nabla \sum_a \pi(a \mid s, \theta) = b(s)\, \nabla 1 = 0$

Thus $\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \big(q_\pi(s, a) - b(s)\big)\, \nabla \pi(a \mid s, \theta)$

New update rule: $\theta_{t+1} = \theta_t + \alpha\, \big(G_t - b(S_t)\big)\, \dfrac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$

Natural choice for $b(s)$: an estimate of the state value, $\hat{v}(S_t, w)$, parameterized by weights $w$
  • We want to adjust $\theta$ based on how much better (or worse) the path is than the average score of paths starting in the current state ($\hat{v}$)


REINFORCE with Baseline Algorithm

Two different step sizes, $\alpha^\theta$ and $\alpha^w$

$\hat{v}$ is also estimated with Monte Carlo (a sketch follows below)
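A minimal sketch of the with-baseline variant (same hypothetical corridor setup as the earlier REINFORCE sketch): the baseline $\hat v(s, w)$ is kept as a table, updated by Monte Carlo with step size $\alpha^w$, while the policy update uses $G_t - \hat v(S_t, w)$ in place of $G_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 5, 2, 1.0
ALPHA_THETA, ALPHA_W = 0.1, 0.1                      # two different step sizes

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def policy(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta = np.zeros((N_STATES, N_ACTIONS))              # policy parameters
w = np.zeros(N_STATES)                               # tabular baseline v_hat(s, w)

for episode in range(500):
    s, traj, done = 0, [], False
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s2, r, done = env_step(s, a)
        traj.append((s, a, r))
        s = s2
    g = 0.0
    for t in reversed(range(len(traj))):
        s, a, r = traj[t]
        g = r + GAMMA * g
        delta = g - w[s]                             # how much better G_t is than the baseline
        w[s] += ALPHA_W * delta                      # Monte Carlo update of the baseline
        grad_log = -policy(theta, s)
        grad_log[a] += 1.0
        theta[s] += ALPHA_THETA * (GAMMA ** t) * delta * grad_log
```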


REINFORCE with Baseline vs without


Extensions: Actor-Critic

Use bootstrapping to update the state-value estimate $\hat{v}(s, w)$ for a state $s$ from the estimated values of subsequent states

Introduces bias, but reduces variance

Can use the one-step or $n$-step return (as opposed to the full returns in REINFORCE)


One-step Actor-Critic Algorithm
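The pseudocode box on this slide is also an image; here is a minimal one-step actor-critic sketch in the same hypothetical corridor setup, where the critic's bootstrapped target $R_{t+1} + \gamma \hat v(S_{t+1}, w)$ replaces the full Monte Carlo return.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GAMMA = 5, 2, 1.0
ALPHA_THETA, ALPHA_W = 0.1, 0.1

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def policy(theta, s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

theta, w = np.zeros((N_STATES, N_ACTIONS)), np.zeros(N_STATES)

for episode in range(500):
    s, done, I = 0, False, 1.0                         # I accumulates gamma^t
    while not done:
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s2, r, done = env_step(s, a)
        target = r + (0.0 if done else GAMMA * w[s2])  # one-step bootstrapped target
        delta = target - w[s]                          # TD error: biased, lower variance
        w[s] += ALPHA_W * delta                        # critic update
        grad_log = -policy(theta, s)
        grad_log[a] += 1.0
        theta[s] += ALPHA_THETA * I * delta * grad_log # actor update
        I *= GAMMA
        s = s2
```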


OpenAI Gym

Free Python library of environments for reinforcement learning

Easy to use, good for benchmarking

Environments include
  • Classic control problems
  • Atari games
  • 3D simulations (Mujoco)
  • Robotics

Check out the docs and environments
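A minimal interaction-loop sketch, assuming the Gym API as it was around the time of this tutorial (reset returning only the observation, step returning a 4-tuple); newer Gym/Gymnasium releases changed these signatures.

```python
import gym

env = gym.make("CartPole-v1")            # a classic control environment
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # random actions, just to show the interface
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode return:", total_reward)
```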


REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoy's CSC411 course webpage

Uses policy gradient with REINFORCE

Note: requires TensorFlow 1.x (not 2.x)


References

Sutton and Barto, "Reinforcement Learning: An Introduction" (Second Edition), 2018

Xuchan (Jenny) Bao, Intro to RL and Policy Gradient, CSC421 Tutorial, 2019 (link)

Michael Guerzhoy, CSC411 Project 4: Reinforcement Learning with Policy Gradients, 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 2: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Markov Decision Process (MDP)

St isin SAt isin ARt isin RTimesteps t isin N (discrete)

Finite SA both have finite number of elements (discrete)

Fully observable St is known exactly

Trajectory sequence of state-action-reward S0A0R1 S1A1

Adamo Young (University of Toronto) RL Tutorial November 17 2019 2 28

Markov Property

Dynamics of the MDP are completely captured in the current stateand the proposed action

States include all information about the past that make a differencefor the future

p(s prime r |s a) PrSt = s primeRt = r |Stminus1 = sAtminus1 = a

Adamo Young (University of Toronto) RL Tutorial November 17 2019 3 28

MDP Generalizations

Continuous timesteps

Continuous statesactions

Partially observable (POMDP) probability distribution over St

Donrsquot worry about them for now

Adamo Young (University of Toronto) RL Tutorial November 17 2019 4 28

Episodic vs Continuing Tasks

Episodic Task

t = 0 1 T

T (length of episode) canvary from episode to episode

Example Pacman

Continuing Task

No upper limit on t

Example learning to walk

Adamo Young (University of Toronto) RL Tutorial November 17 2019 5 28

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 3: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Markov Property

Dynamics of the MDP are completely captured in the current stateand the proposed action

States include all information about the past that make a differencefor the future

p(s prime r |s a) PrSt = s primeRt = r |Stminus1 = sAtminus1 = a

Adamo Young (University of Toronto) RL Tutorial November 17 2019 3 28

MDP Generalizations

Continuous timesteps

Continuous statesactions

Partially observable (POMDP) probability distribution over St

Donrsquot worry about them for now

Adamo Young (University of Toronto) RL Tutorial November 17 2019 4 28

Episodic vs Continuing Tasks

Episodic Task

t = 0 1 T

T (length of episode) canvary from episode to episode

Example Pacman

Continuing Task

No upper limit on t

Example learning to walk

Adamo Young (University of Toronto) RL Tutorial November 17 2019 5 28

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 4: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

MDP Generalizations

Continuous timesteps

Continuous statesactions

Partially observable (POMDP) probability distribution over St

Donrsquot worry about them for now

Adamo Young (University of Toronto) RL Tutorial November 17 2019 4 28

Episodic vs Continuing Tasks

Episodic Task

t = 0 1 T

T (length of episode) canvary from episode to episode

Example Pacman

Continuing Task

No upper limit on t

Example learning to walk

Adamo Young (University of Toronto) RL Tutorial November 17 2019 5 28

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 5: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Episodic vs Continuing Tasks

Episodic Task

t = 0 1 T

T (length of episode) canvary from episode to episode

Example Pacman

Continuing Task

No upper limit on t

Example learning to walk

Adamo Young (University of Toronto) RL Tutorial November 17 2019 5 28

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 6: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Expected Return and Unified Task Notation

Gt is the expected return (expected sum of rewards)

Gt infinsumk=0

γkRt+k+1

γ is the discount factor 0 lt γ le 1

Episodic TasksI Rt = 0 for all t gt T finite sumI Often choose γ = 1I Always converges since sum is finite

Continuing TasksI Infinite sumI γ lt 1 (future rewards are valued less)I If Rt is bounded the infinite sum converges

Either way we can insure that Gt is well-defined

Adamo Young (University of Toronto) RL Tutorial November 17 2019 6 28

Policies

Policy π a mapping from state to actions

π(a|s) PrAt = a|St = sCan be deterministic or probabilistic

Example deterministic wall-following maze navigation policyI Turn right if possibleI If you canrsquot turn right go straight if possibleI If you canrsquot turn right or go straight turn left if possibleI Otherwise turn around

Adamo Young (University of Toronto) RL Tutorial November 17 2019 7 28

Value Functions

State-value function vπ(s) Eπ [Gt |St = s]I Intuitively how rdquogoodrdquo a state s is under policy π

I vπ(s) = Eπ

[ infinsumk=0

γkRt+k+1

∣∣∣∣∣St = s

]Action-value function qπ(s a) Eπ [Gt |St = sAt = a]

I Intuitively how rdquogoodrdquo an action a is in state s under policy π

I qπ(s a) = Eπ

[suminfink=0 γ

kRt+k+1

∣∣∣∣∣St = sAt = a

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 8 28

MDP Illustration

Note in this case we are using p(s prime|s a) to model the environmentinstead of p(s prime r |s a)

This is because there is no ambiguity in reward r given s a s prime (but ingeneral there could be)

Even only two states and three possible actions can be prettycomplicated to model

Adamo Young (University of Toronto) RL Tutorial November 17 2019 9 28

Some Potentially Useful Value Identities

Recall π(a|s) PrAt = a|St = sRecall vπ(s) Eπ [Gt |St = s]

Recall qπ(s a) Eπ [Gt |St = sAt = a]

vπ(s) = Eπ [qπ(s a)|St = s] =sum

aisinA(s)

π(a|s)qπ(s a)

qπ(s a) = Eπ [Rt+1 + γGt+1|St = sAt = a]

qπ(s a) = Ep(sprimer |sa) [Rt+1 + γvπ(St+1)|St = sAt = a]

Common theme value functions are recursive expand one step to seea useful pattern

Adamo Young (University of Toronto) RL Tutorial November 17 2019 10 28

Solving RL Problems

Two main classes of solutions

Tabular Solution MethodsI Provide exact solutionsI Do not scale very well with search space

Approximate Solution MethodsI Use function approximators (ie neural networks) for value andor

policy functionsI Policy Gradient Methods fall under this category

Adamo Young (University of Toronto) RL Tutorial November 17 2019 11 28

Introducing Policy Gradient

Parameterize policy π(a|s θ) PrAt = a|St = s θt = θI For example π could be a neural network with weights θ

Measure policy performance with scalar performance measure J(θ)

Maximize J(θ) with gradient ascent on θ

General update rule θt+1 = θt + αnablaθJ(θt)I α is step sizeI nablaθJ(θt) is an approximation to nablaθJ(θt)

We are increasing the probability of trajectories with large rewardsand decreasing the probability of trajectories with small rewards

For simplicity wersquoll only consider episodic tasks for the remainder ofthe presentation

Adamo Young (University of Toronto) RL Tutorial November 17 2019 12 28

The Policy Gradient Theorem

J(θ) vπ(s0) where s0 is initial state of a trajectory and π is a policyparameterized by θ

The Policy Gradient Theorem states that

nablaJ(θ) = nablaθvπ(s0) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

micro(s) η(s)sumsprime η(s prime)

the on-policy distribution under π

I η(s) is the expected number of visits to state sI micro(s) is the fraction of time spent in state s

Constant of proportionality is the average length of an episode

See p 325 of Sutton RL Book (Second Edition) for concise proof

Adamo Young (University of Toronto) RL Tutorial November 17 2019 13 28

REINFORCE Update

nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

= Eπ

[suma

qπ(St a)nablaπ(a|St θ)

]

= Eπ

[suma

π(a|St θ)qπ(St a)nablaπ(a|St θ)

π(a|St θ)

]

= Eπ[Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]]= Eπ

[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]

Adamo Young (University of Toronto) RL Tutorial November 17 2019 14 28

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

∇J(θ) = ∇ Στ p(τ|πθ) G0(τ)

     = Στ ∇p(τ|πθ) G0(τ)

     = Στ (p(τ|πθ) / p(τ|πθ)) ∇p(τ|πθ) G0(τ)

     = Στ p(τ|πθ) ∇ log p(τ|πθ) G0(τ)

     = Eτ[ ∇ log p(τ|πθ) G0(τ) ]

Monte Carlo approximation, using N trajectories τ(i) sampled from p(τ|πθ):

∇J(θ) ≈ (1/N) Σi=1..N ∇ log p(τ(i)|πθ) G0(τ(i))
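A quick sanity check of the score-function identity on a case with a known answer (an illustrative choice, not from the slides): for x ~ N(θ, 1), ∇θ log p(x|θ) = x − θ, and ∇θ E[x²] = 2θ.

    import numpy as np

    rng = np.random.default_rng(1)
    theta, N = 1.5, 100_000
    x = rng.normal(theta, 1.0, size=N)        # samples from p(·|θ), the analogue of trajectories
    score = x - theta                         # ∇θ log p(x|θ) for a unit-variance Gaussian
    print(np.mean(score * x ** 2))            # ≈ 2θ = 3.0, but noticeably noisy for small N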

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of the REINFORCE / Score Function Gradient Estimator

Pros
  - Unbiased: E[∇̂J(θ)] = ∇J(θ)
  - Very general (the reward does not need to be continuous or differentiable)

Cons
  - High variance (noisy)
  - Requires many sample trajectories to estimate the gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall ∇J(θ) ∝ Σs µ(s) Σa qπ(s, a) ∇π(a|s, θ)

Let b(s) be a baseline (some function of s that does not depend on θ). Then

Σa b(s) ∇π(a|s, θ) = b(s) ∇ Σa π(a|s, θ) = b(s) ∇1 = 0

Thus ∇J(θ) ∝ Σs µ(s) Σa (qπ(s, a) − b(s)) ∇π(a|s, θ)

New update rule: θt+1 = θt + α (Gt − b(St)) ∇π(At|St, θt) / π(At|St, θt)

Natural choice for b(s): an estimate v̂(St, w) of the state value, parameterized by weights w
  - We want to adjust θ based on how much better (or worse) the path is than the average score of paths starting in the current state (v̂)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes: αθ for the policy weights and αw for the value-estimate weights

v̂ is also estimated with Monte Carlo
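A hedged sketch of how the baseline changes the earlier REINFORCE loop (reusing softmax_policy, theta, gamma, n_states, and traj from that sketch; v_hat, alpha_theta, and alpha_w are new, illustrative names): only the per-step update differs.

    v_hat = np.zeros(n_states)                      # Monte Carlo estimate of v(s), the weights w
    alpha_theta, alpha_w = 0.05, 0.1

    # ... after generating traj = [(s, a, r), ...] exactly as before:
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        delta = G - v_hat[s]                        # Gt − b(St): better or worse than expected?
        v_hat[s] += alpha_w * delta                 # nudge the baseline toward the observed return
        probs = softmax_policy(theta, s)
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log_pi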

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions: Actor-Critic

Use bootstrapping to update the state-value estimate v̂(s, w) for a state s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use the one-step or n-step return (as opposed to the full return in REINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm
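The algorithm figure is not reproduced here; as a stand-in, a minimal sketch of the one-step (TD) actor-critic update, reusing the names from the sketches above (again omitting the γ^t factor of the full algorithm):

    s, done = 0, False
    while not done:
        probs = softmax_policy(theta, s)
        a = rng.choice(n_actions, p=probs)
        s_next, r, done = env_step(s, a)
        target = r if done else r + gamma * v_hat[s_next]
        delta = target - v_hat[s]                       # one-step TD error: the critic's signal
        v_hat[s] += alpha_w * delta                     # critic: bootstrap toward r + γ v̂(s')
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log_pi   # actor: policy gradient step
        s = s_next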

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free Python library of environments for reinforcement learning

Easy to use, good for benchmarking

Environments include
  - Classic control problems
  - Atari games
  - 3D simulations (MuJoCo)
  - Robotics

Check out the docs and environments
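A minimal usage sketch of the environment loop (this is the classic pre-0.26 Gym interface, matching the TensorFlow 1.x era of this tutorial; newer gym/gymnasium releases changed the reset/step return values):

    import gym

    env = gym.make("CartPole-v1")               # a classic control environment
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()      # random policy, just to show the interaction loop
        obs, reward, done, info = env.step(action)
        total_reward += reward
    env.close()
    print(total_reward)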

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoy's CSC411 course webpage

Uses policy gradient with REINFORCE

Note: requires TensorFlow 1.x (not 2.x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto, "Reinforcement Learning: An Introduction" (Second Edition), 2018

Xuchan (Jenny) Bao, "Intro to RL and Policy Gradient", CSC421 Tutorial, 2019 (link)

Michael Guerzhoy, "CSC411 Project 4: Reinforcement Learning with Policy Gradients", 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28


Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 15: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE Update (Continued)

nablaJ(θ) prop Eπ[qπ(St At)

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ [Gt |St At ]

nablaπ(At |St θ)

π(At |St θ)

]= Eπ

[Eπ

[Gtnablaπ(At |St θ)

π(At |St θ)

∣∣∣∣∣St At

]]

= Eπ[Gtnablaπ(At |St θ)

π(At |St θ)

]

Reinforce update θt+1 = θt + αGtnablaπ(At |St θ)

π(At |St θ)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 15 28

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 16: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Quick Aside Monte Carlo

Way of approximating an integralexpectation

Ep(x) [f (x)] =

intxisinX

p(x)f (x)dx

If p(x) andor f (x) is complicated the integral is difficult to solveanalytically

Simple idea sample N values from the domain of the distribution Xand take the average

Ep(x) [f (x)] asymp 1

N

Nsumi=1

p(x (i))f (x (i))

The larger N is the better the estimate

Adamo Young (University of Toronto) RL Tutorial November 17 2019 16 28

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 17: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE Algorithm

Generate entire trajectories and compute Gt from it

Gradient estimates only need to be proportional to the true gradientsince constant of proportionality can be absorbed in the step size α

Adamo Young (University of Toronto) RL Tutorial November 17 2019 17 28

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 18: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE as a Score Function Gradient Estimator

We can rewrite J(θ) in terms of an expectation over trajectories τ

J(θ) vπ(s0) = Eτ [G0(τ)] =sum

τ p(τ |πθ)G0(τ)I p(τ |πθ) is the probability of the trajectory under policy πθI G0(τ) is the sum of the rewards given trajectory τ

Recall nablax log f (x) =nablax f (x)

f (x)by chain rule

We can use the log-derivative trick to derive an expression for nablaJ(θ)

In statistics this estimator is called the Score Function Estimator

Adamo Young (University of Toronto) RL Tutorial November 17 2019 18 28

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 19: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE as a Score Function Gradient Estimator

nablaJ(θ) = nablasumτ

p(τ |πθ)G0(τ)

=sumτ

nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)

p(τ |πθ)nablap(τ |πθ)G0(τ)

=sumτ

p(τ |πθ)nabla log p(τ |πθ)G0(τ)

= Eτ [nabla log p(τ |πθ)G0(τ)]

nablaJ(θ) =1

N

Nsumi=1

log p(τ (i)|πθ)G0(τ (i))

(Monte Carlo approximation)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 19 28

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 20: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Properties of REINFORCEScore Function GradientEstimator

ProsI Unbiased E

[nablaJ(θ)

]= J(θ)

I Very general (reward does not need to be continuousdifferentiable)

ConsI High variance (noisy)I Requires many sample trajectories to estimate gradient

Adamo Young (University of Toronto) RL Tutorial November 17 2019 20 28

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 21: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Reducing REINFORCE Estimator Variance

Recall nablaJ(θ) propsums

micro(s)suma

qπ(s a)nabla(π(a|s θ))

Let b(s) be a baseline (some function of s that does not depend on θ)suma

b(s)nablaπ(a|s θ) = b(s)nablasuma

π(a|s θ) = b(s)nabla1 = 0

Thus nablaJ(θ) propsums

micro(s)suma

(qπ(s a)minus b(s))nabla(π(a|s θ))

New update rule θt+1 = θt + α (Gt minus b(St))nablaπ(At |St θt)π(At |St θt)

Natural choice for b(s) an estimate of the state value v(St w)parameterized by weights w

I Want to adjust θ based on how much better (or worse) the path isthan the average score of paths starting in the current state (v)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 21 28

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 22: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE with Baseline Algorithm

Two different step sizes αθ and αw

v is also estimated with Monte Carlo

Adamo Young (University of Toronto) RL Tutorial November 17 2019 22 28

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 23: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE with Baseline vs without

Adamo Young (University of Toronto) RL Tutorial November 17 2019 23 28

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 24: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

Extensions Actor-Critic

Use bootstrapping to update the state-value estimate v(sw) for astate s from the estimated values of subsequent states

Introduces bias but reduces variance

Can use one-step or n-step return (as opposed to full returns inREINFORCE)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 24 28

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 25: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

One-step Actor-Critic Algorithm

Adamo Young (University of Toronto) RL Tutorial November 17 2019 25 28

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 26: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

OpenAI Gym

Free python library of environments for reinforcement learning

Easy to use good for benchmarking

Environments includeI Classic control problemsI Atari gamesI 3D simulations (Mujoco)I Robotics

Check out the docs and environments

Adamo Young (University of Toronto) RL Tutorial November 17 2019 26 28

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 27: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

REINFORCE bipedal walker demo

OpenAI Gym Environment

Train a 2D walker to navigate a landscape

Different levels of difficulty

Code taken from Michael Guerzhoyrsquos CSC411 course webpage

Uses policy gradient with REINFORCE

Note requires tensorflow 1x (not 2x)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 27 28

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods
Page 28: RL Tutorial - CSC2515 Fall 2019rgrosse/courses/csc2515_2019/tutorials/tut9/R… · RL Tutorial CSC2515 Fall 2019 Adamo Young University of Toronto November 17, 2019 Adamo Young (University

References

Sutton and Barto rdquoReinforcement Learning An Introduction (SecondEdition)rdquo 2018

Xuchan (Jenny) Bao Intro to RL and Policy Gradient CSC421Tutorial 2019 (link)

Michael Guerzhoy CSC411 Project 4 Reinforcement Learning withPolicy Gradients 2017 (link)

Adamo Young (University of Toronto) RL Tutorial November 17 2019 28 28

  • General RL Overview
    • Agent-Environment Interface
      • Policy Gradient Methods