CSE 190: Reinforcement Learning: An Introduction

Chapter 6: Temporal Difference Learning
Acknowledgment: A good number of these slides are cribbed from Rich Sutton
CSE 190: Reinforcement Learning, Lecture on Chapter 6
Monte Carlo is important in practice
• When there are just a few possibilities to value, out of a large state space, Monte Carlo is a big win
• Backgammon, Go, …
• But it is still somewhat inefficient, because it figures out state values in isolation from other states (no bootstrapping)
Chapter 6: Temporal Difference Learning
• Introduce Temporal Difference (TD) learning
• As usual, we focus first on policy evaluation, or prediction, methods
• Then extend to control methods
Objectives of this chapter:
TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

Simple every-visit Monte Carlo method:
    V(s_t) ← V(s_t) + α[ R_t − V(s_t) ]
    (target: R_t, the actual return after time t)

The simplest TD method, TD(0):
    V(s_t) ← V(s_t) + α[ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
    (target: r_{t+1} + γ V(s_{t+1}), an estimate of the return)
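The two updates are one line each in code. This is an illustrative sketch, not code from the slides; V is assumed to be a dict of state values, and alpha and gamma are the step-size and discount parameters:

```python
def mc_update(V, s_t, R_t, alpha):
    """Every-visit Monte Carlo: move V(s_t) toward the actual return R_t."""
    V[s_t] += alpha * (R_t - V[s_t])

def td0_update(V, s_t, r_next, s_next, alpha, gamma):
    """TD(0): move V(s_t) toward the bootstrapped target r_{t+1} + gamma*V(s_{t+1})."""
    V[s_t] += alpha * (r_next + gamma * V[s_next] - V[s_t])
```

The only difference is the target: an observed return versus a one-step estimate of it.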
TD Prediction
Note that:
V^π(s) = E_π{ R_t | s_t = s }
       = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }
       = E_π{ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s }
       = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }
Hence the two targets are the same - in expectation. In practice, the results may differ due to finite training sets.
Simple Monte Carlo
    V(s_t) ← V(s_t) + α[ R_t − V(s_t) ]
    where R_t is the actual return following state s_t.

[Backup diagram: an entire sampled episode from s_t down to a terminal state T; the complete return R_t backs up to s_t.]
Simplest TD Method
    V(s_t) ← V(s_t) + α[ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

[Backup diagram: a single sampled transition s_t → r_{t+1}, s_{t+1}; the estimate V(s_{t+1}) backs up to s_t.]
cf. Dynamic Programming
    V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }

[Backup diagram: a full expected backup from s_t over all actions and all successor states s_{t+1}, with rewards r_{t+1}.]
Comparing MC and Dynamic Programming

• Bootstrapping: update involves an estimate from a nearby state
  • MC does not bootstrap:
    • Each state is estimated independently
  • DP bootstraps:
    • The estimate for each state depends on estimates of nearby states
• Sampling: update involves an actual return
  • MC samples:
    • Learns from experience in the environment and gets an actual return R
  • DP does not sample:
    • Learns by propagating constraints, iterating the Bellman equations until convergence - without interaction with the environment!
TD methods bootstrap and sample
• Bootstrapping: update involves an estimate
  • MC does not bootstrap
  • DP bootstraps
  • TD bootstraps
• Sampling: update involves an actual return
  • MC samples
  • DP does not sample
  • TD samples
Policy Evaluation with TD(0)
• Note the on-line nature: updates as you go.
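A minimal tabular version, as a sketch: the env.reset/env.step interface, episode count, and step size here are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

def td0_policy_evaluation(policy, env, n_episodes=100, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction: update V online, after every transition."""
    V = defaultdict(float)                    # V(terminal) stays 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # update as you go
            s = s_next
    return V
```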
Example: Driving Home
State                 Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office         0                   30                     30
reach car, raining     5                   35                     40
exit highway          20                   15                     35
behind truck          30                   10                     40
home street           40                    3                     43
arrive home           43                    0                     43
Value of a state: The remaining time to get home, or time to go
Driving Home
Changes recommended by Monte Carlo methods (α=1)

Changes recommended by TD methods (α=1)
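The two sets of changes can be recomputed directly from the table above. With α=1, each MC change moves a prediction all the way to the final outcome (43 minutes), while each TD change moves it to the next prediction; a sketch:

```python
# Predicted total time at each state, from the table (elapsed + predicted-to-go);
# the last entry, 43, is the actual outcome
predictions = [30, 40, 35, 40, 43, 43]

# Monte Carlo changes (alpha = 1): every prediction jumps to the final outcome
mc_changes = [predictions[-1] - p for p in predictions[:-1]]

# TD changes (alpha = 1): every prediction jumps to the next prediction
# (the reward, elapsed time, is already folded into the predicted totals)
td_changes = [predictions[i + 1] - predictions[i] for i in range(len(predictions) - 1)]
```

Note that TD starts learning (e.g., the +10 at "leaving office") before the trip is over; MC must wait for the final 43.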
Advantages of TD Learning
• TD methods do not require a model of the environment, only experience
• TD, but not MC, methods can be fully incremental
  • You can learn before knowing the final outcome
    • Less memory
    • Less peak computation
  • You can learn without the final outcome
    • From incomplete sequences
• Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
Random Walk Example
Values learned by TD(0) after various numbers of episodes
What happened on the first episode?
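The example is easy to reproduce; a sketch of the walk and the TD(0) updates (five states A-E, start in the middle, +1 only for exiting right, true values 1/6 ... 5/6; the step size and episode count are illustrative):

```python
import random

def random_walk_td0(n_episodes=100, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) on the 5-state random walk; true values are 1/6, 2/6, ..., 5/6."""
    rng = random.Random(seed)
    V = {s: 0.5 for s in range(1, 6)}   # states 1..5 stand for A..E, init 0.5
    V[0] = V[6] = 0.0                   # terminal states on both ends
    for _ in range(n_episodes):
        s = 3                           # every episode starts in the middle (C)
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))    # step left or right, p = 1/2
            r = 1.0 if s_next == 6 else 0.0     # +1 only for exiting right
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```

After enough episodes the estimates approach the true values 1/6 through 5/6.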
TD and MC on the Random Walk
Data averaged over 100 sequences of episodes
Optimality of TD(0)
Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer!
Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.
You are the Predictor

Suppose you observe the following 8 episodes:
(NOTE: these are not from the previous example!)

1. A, 0, B, 0, terminate
2. B, 1, terminate
3. B, 1, terminate
4. B, 1, terminate
5. B, 1, terminate
6. B, 1, terminate
7. B, 1, terminate
8. B, 0, terminate
V(A)?  V(B)?
You are the Predictor
V(A)?
You are the Predictor
• The prediction that best matches the training data is V(A) = 0
  • This minimizes the mean-squared error on the training set
  • This is what a batch Monte Carlo method gets
• If we consider the sequentiality of the problem, then we would set V(A) = .75
  • This is correct for the maximum-likelihood estimate of a Markov model generating the data
  • i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
  • This is called the certainty-equivalence estimate
  • This is what TD(0) gets
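Both batch answers can be computed from the eight episodes. A sketch (undiscounted returns, as in the example; the data layout is an assumption of this sketch):

```python
# The 8 observed episodes, each a list of (state, reward-on-leaving) pairs
episodes = [[("A", 0), ("B", 0)]] + 6 * [[("B", 1)]] + [[("B", 0)]]

# Batch Monte Carlo: a state's value is the average of the observed
# (undiscounted) returns that followed visits to it
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
V_mc = {s: sum(rs) / len(rs) for s, rs in returns.items()}

# Certainty-equivalence: the maximum-likelihood Markov model says A goes to B
# with probability 1 and reward 0, so V(A) = 0 + V(B), and V(B) = 6/8
V_ce = {"B": V_mc["B"], "A": 0 + V_mc["B"]}
```

Batch MC gives V(A) = 0 (the single observed return from A), while the certainty-equivalence answer is V(A) = 0.75.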
Learning An Action-Value Function
Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:
    Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0. (No action out of a terminal state!)
Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
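A sketch of the resulting algorithm, with an ε-greedy policy standing in for "greedy plus exploration" (the environment interface and hyperparameters are illustrative assumptions):

```python
import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=100, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    """On-policy Sarsa: a' is chosen by the same eps-greedy policy being improved."""
    rng = random.Random(seed)
    Q = defaultdict(float)                    # Q[(s, a)]; terminal pairs stay 0

    def eps_greedy(s):
        if rng.random() < eps:
            return rng.choice(actions)                    # explore
        return max(actions, key=lambda a: Q[(s, a)])      # exploit

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2) if not done else None
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])     # update on (s, a, r, s', a')
            s, a = s2, a2
    return Q
```

The name comes from the quintuple the update uses: state, action, reward, next state, next action.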
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
Results of Sarsa on the Windy Gridworld
Q-Learning: Off-Policy TD Control

One-step Q-learning:
    Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Compare to Sarsa:
    Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

• This tiny difference in the update makes a BIG difference!
• Q-learning learns Q*, the optimal Q-values, with probability 1, no matter what policy is being followed (as long as it is exploratory and visits each state-action pair infinitely often, and the learning rates get smaller at the appropriate rate).
• Hence this is "off-policy": it finds π* while following another policy…
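The "tiny difference" is literally one term in the target, visible when the two updates are written side by side (an illustrative sketch; Q is assumed to be any mapping from (state, action) pairs to values):

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma, terminal=False):
    """Off-policy: the target uses the greedy value at s2, whatever action follows."""
    boot = 0.0 if terminal else max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * boot - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, terminal=False):
    """On-policy: the target uses the action a2 the behavior policy actually takes."""
    boot = 0.0 if terminal else Q[(s2, a2)]
    Q[(s, a)] += alpha * (r + gamma * boot - Q[(s, a)])
```

When the behavior policy takes an exploratory (non-greedy) next action, the two targets diverge - which is exactly what the cliff-walking example on the next slides exploits.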
Cliffwalking
ε-greedy, ε = 0.1
Actor-Critic Methods
• Explicit representation of policy as well as value function
• Minimal computation to select actions
• Can learn an explicit stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models
Actor-Critic Details
Suppose we take action a_t from state s_t, reaching state s_{t+1}.

The critic evaluates the action using the TD error:
    δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

If actions are determined by preferences p(s,a) as follows:
    π_t(s,a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_b e^{p(s,b)}
(a softmax policy), then you can update the actor's preferences like this:
    p(s_t, a_t) ← p(s_t, a_t) + β δ_t
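One combined actor-critic step, following the three equations above (a sketch; the dict-based representation and the step sizes alpha and beta are assumptions):

```python
import math

def softmax_policy(p, s, actions):
    """pi(s,a) proportional to exp(p(s,a)), over the actions available in s."""
    exps = {a: math.exp(p.get((s, a), 0.0)) for a in actions}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic_step(V, p, s, a, r, s2, alpha, beta, gamma):
    """Critic computes the TD error; both critic and actor update from it."""
    delta = r + gamma * V.get(s2, 0.0) - V.get(s, 0.0)   # TD error
    V[s] = V.get(s, 0.0) + alpha * delta                 # critic update
    p[(s, a)] = p.get((s, a), 0.0) + beta * delta        # actor: adjust preference
    return delta
```

A positive δ_t raises the preference for the action just taken, making it more probable under the softmax; a negative δ_t lowers it.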
Dopamine Neurons and TD Error

W. Schultz et al., Université de Fribourg
Can we apply these methods to continuous tasks without discounting?

Yes: Average Reward Per Time Step

Average expected reward per time step under policy π:
    ρ^π = lim_{n→∞} (1/n) Σ_{t=1}^n E_π{ r_t }    (the same for each state, if ergodic)

Value of a state relative to ρ^π:
    Ṽ^π(s) = Σ_{k=1}^∞ E_π{ r_{t+k} − ρ^π | s_t = s }

Value of a state-action pair relative to ρ^π:
    Q̃^π(s,a) = Σ_{k=1}^∞ E_π{ r_{t+k} − ρ^π | s_t = s, a_t = a }

These values encode the transient deviations away from the average reward...
R-Learning
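A sketch of the standard tabular R-learning updates as given in Sutton and Barto: Q-values are updated toward r − ρ + max_a' Q(s',a'), and the average-reward estimate ρ is updated (with its own step size β) only after greedy actions. The step sizes and the dict representation here are assumptions:

```python
def r_learning_update(Q, rho, s, a, r, s2, actions, alpha=0.1, beta=0.01):
    """One tabular R-learning step; returns the updated average-reward estimate rho."""
    delta = r - rho + max(Q[(s2, b)] for b in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    # Update rho only when the action taken was greedy in s
    if Q[(s, a)] == max(Q[(s, b)] for b in actions):
        rho += beta * delta
    return rho
```

Note there is no γ anywhere: values are relative to ρ, not discounted.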
Access-Control Queuing Task

• n servers
• Customers have four different priorities, which pay reward of 1, 2, 4, or 8 if served
• At each time step, the customer at the head of the queue is either accepted (assigned to a server) or removed from the queue
• The proportion of randomly distributed high-priority customers in the queue is h
• A busy server becomes free with probability p on each time step
• Statistics of arrivals and departures are unknown

n = 10, h = .5, p = .06

Apply R-learning
Afterstates

• Usually, a state-value function evaluates states in which the agent can take an action.
• But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
• Why is this useful?
• What is this in general?
Summary
• TD prediction
• Introduced one-step tabular model-free TD methods
• Extended prediction to control by employing some form of GPI
  • On-policy control: Sarsa
  • Off-policy control: Q-learning and R-learning
• These methods bootstrap and sample, combining aspects of DP and MC methods
Questions
• What can I tell you about RL?
• What is common to all three classes of methods - DP, MC, TD?
• What are the principal strengths and weaknesses of each?
• In what sense is our RL view complete?
• In what senses is it incomplete?
  • What are the principal things missing?
• The broad applicability of these ideas…
• What does the term bootstrapping refer to?
• What is the relationship between DP and learning?
END
The Book
• Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem
• Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning
• Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies
Unified View