
CSE 190: Reinforcement Learning: An Introduction

Chapter 6: Temporal Difference Learning

Acknowledgment: A good number of these slides are cribbed from Rich Sutton


Monte Carlo is important in practice

• When there are just a few possibilities to value, out of a large state space, Monte Carlo is a big win

• Backgammon, Go, …

• But it is still somewhat inefficient, because it figures out state values in isolation from other states (no bootstrapping)


Chapter 6: Temporal Difference Learning

Objectives of this chapter:

• Introduce Temporal Difference (TD) learning

• As usual, we focus first on policy evaluation, or prediction, methods

• Then extend to control methods


TD Prediction

Policy Evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^{\pi}$.

Simple every-visit Monte Carlo method:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

target: the actual return after time $t$

The simplest TD method, TD(0):

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

target: an estimate of the return

Page 2: Monte Carlo is important in practice CSE 190 ...cseweb.ucsd.edu/~gary/190-RL/Lecture_Ch_6.pdf · A good number of these slides are cribbed from Rich Sutton CSE 190: Reinforcement


TD Prediction

Note that:

$$
\begin{aligned}
V^{\pi}(s) &= E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\Big\{ \textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\Big|\; s_t = s \Big\} \\
&= E_{\pi}\Big\{ r_{t+1} + \gamma \textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \;\Big|\; s_t = s \Big\} \\
&= E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}
\end{aligned}
$$

Hence the two targets are the same in expectation. In practice, the results may differ due to finite training sets.


TD Prediction

Policy Evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^{\pi}$.

Simple every-visit Monte Carlo method:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

target: the actual return after time $t$

The simplest TD method, TD(0):

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

target: an estimate of the return


Simple Monte Carlo

$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

where $R_t$ is the actual return following state $s_t$.

[Backup diagram: from $s_t$, an entire episode is sampled down to a terminal state T; the complete return $R_t$ is backed up to $V(s_t)$.]


Simplest TD Method

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

[Backup diagram: from $s_t$, only one step is sampled, yielding $r_{t+1}$ and $s_{t+1}$; the estimate $V(s_{t+1})$ stands in for the rest of the return.]



cf. Dynamic Programming

$V(s_t) \leftarrow E_{\pi}\{ r_{t+1} + \gamma V(s_{t+1}) \}$

[Backup diagram: a full backup from $s_t$ over all actions and all possible successor states $s_{t+1}$ and rewards $r_{t+1}$, weighted by their probabilities.]


Comparing MC and Dynamic Programming

• Bootstrapping: update involves an estimate from a nearby state
  • MC does not bootstrap: each state is independently estimated
  • DP bootstraps: estimates for each state depend on nearby states

• Sampling: update involves an actual return
  • MC samples: learns from experience in the environment, gets an actual return R
  • DP does not sample: learns by spreading constraints until convergence, through iterating the Bellman equations, without interaction with the environment!


TD methods bootstrap and sample

• Bootstrapping: update involves an estimate
  • MC does not bootstrap
  • DP bootstraps
  • TD bootstraps

• Sampling: update involves an actual return
  • MC samples
  • DP does not sample
  • TD samples


Policy Evaluation with TD(0)

• Note the on-line nature: updates as you go.
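The tabular TD(0) algorithm box on this slide did not survive extraction. Below is a minimal sketch of what it computes, assuming a hypothetical episodic environment exposing `reset() -> state` and `step(action) -> (next_state, reward, done)`, and a fixed `policy(state)` to evaluate (none of these names come from the slides).

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation: estimate V(s) for a fixed policy.

    `env` and `policy` are assumed interfaces, not part of the slides:
    env.reset() -> state, env.step(a) -> (next_state, reward, done).
    """
    V = defaultdict(float)                   # V(s), initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                    # follow the policy being evaluated
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])   # TD target
            V[s] += alpha * (target - V[s])  # on-line: update without waiting for the episode's end
            s = s_next
    return V
```

Note the on-line character the slide highlights: each V(s) is updated as soon as the next reward and state are observed.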



Example: Driving Home

State                   Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office                     0                      30                     30
reach car, raining                 5                      35                     40
exit highway                      20                      15                     35
behind truck                      30                      10                     40
home street                       40                       3                     43
arrive home                       43                       0                     43

(The parenthesized values on the slide, 5, 15, 10, 10, and 3, are the minutes actually elapsed between successive states.)

Value of a state: The remaining time to get home, or time to go


Driving Home

Changes recommended by Monte Carlo methods (α=1)

Changes recommended by TD methods (α=1)
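To make the two panels concrete with the numbers from the table: with α = 1, TD moves each state's predicted total time to the next state's predicted total, so leaving office goes from 30 to 40, reach car from 40 to 35, exit highway from 35 to 40, and behind truck from 40 to 43. Monte Carlo instead waits for the actual outcome and moves every earlier prediction all the way to 43.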


Advantages of TD Learning

• TD methods do not require a model of the environment, only experience

• TD, but not MC, methods can be fully incremental
  • You can learn before knowing the final outcome
    • Less memory
    • Less peak computation
  • You can learn without the final outcome
    • From incomplete sequences

• Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?


Random Walk Example

Values learned by TD(0) after various numbers of episodes. What happened on the first episode?

[Figure: the random walk MRP, with rewards labeled on the transitions, and the value estimates learned by TD(0).]
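A small runnable sketch of this experiment, assuming the usual five-state random walk: states A–E, every episode starts in the center state C, each step moves left or right with equal probability, all rewards are 0 except +1 for stepping off the right end, and values are initialized to 0.5 as in the figure.

```python
import random

STATES = ["A", "B", "C", "D", "E"]

def td0_random_walk(num_episodes, alpha=0.1, gamma=1.0):
    V = {s: 0.5 for s in STATES}             # intermediate initialization, as in the figure
    for _ in range(num_episodes):
        i = STATES.index("C")                # every episode starts in the center
        while True:
            j = i + random.choice((-1, 1))   # step left or right with equal probability
            if j < 0:                        # off the left end: reward 0, terminal
                r, v_next, done = 0.0, 0.0, True
            elif j >= len(STATES):           # off the right end: reward +1, terminal
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[STATES[j]], False
            V[STATES[i]] += alpha * (r + gamma * v_next - V[STATES[i]])
            if done:
                break
            i = j
    return V

print(td0_random_walk(100))   # compare to the true values 1/6, 2/6, 3/6, 4/6, 5/6
```

As for the slide's question: on the first episode only the value of the state adjacent to the terminal transition changes, because every interior TD error is 0 + 0.5 − 0.5 = 0 under this initialization.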



TD and MC on the Random Walk

Data averaged over 100 sequences of episodes.


Optimality of TD(0)

Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.

Compute updates according to TD(0), but only update estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.

Constant-α MC also converges under these conditions, but to a different answer!


Random Walk under Batch Updating

After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.


You are the Predictor

Suppose you observe the following 8 episodes (NOTE: these are not from the previous example!):

1. A, 0, B, 0, terminate
2. B, 1, terminate
3. B, 1, terminate
4. B, 1, terminate
5. B, 1, terminate
6. B, 1, terminate
7. B, 1, terminate
8. B, 0, terminate

V(A)? V(B)?



You are the Predictor

V(A)?


You are the Predictor

• The prediction that best matches the training data is V(A) = 0
  • This minimizes the mean-squared error on the training set
  • This is what a batch Monte Carlo method gets

• If we consider the sequentiality of the problem, then we would set V(A) = 0.75
  • This is correct for the maximum-likelihood estimate of a Markov model generating the data
  • i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
  • This is called the certainty-equivalence estimate
  • This is what TD(0) gets
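Where those two numbers come from, using only the eight episodes above: B is visited eight times and six of those visits end with reward 1, so V(B) = 6/8 = 0.75 under either method. The best-fit Markov model says A always transitions to B with reward 0, so the certainty-equivalence estimate (and batch TD(0)) gives V(A) = 0 + V(B) = 0.75. Batch Monte Carlo instead averages the returns actually observed from A, and the single episode through A returned 0, hence V(A) = 0.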


Learning An Action-Value Function

Estimate $Q^{\pi}$ for the current behavior policy $\pi$.

After every transition from a nonterminal state $s_t$, do this:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$

If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$ (no action out of a terminal state!).


Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
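A minimal tabular Sarsa sketch, assuming the same hypothetical `env.reset()`/`env.step()` interface as in the TD(0) sketch above and keeping the policy soft with ε-greedy action selection (the particular exploration scheme is my choice, not dictated by the slide):

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa: on-policy TD control (named for the quintuple s, a, r, s', a')."""
    Q = defaultdict(float)                               # Q[(state, action)]

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)                # explore
        return max(actions, key=lambda a: Q[(s, a)])     # exploit the current estimate

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)                  # the action that will actually be taken
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```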



Windy Gridworld

undiscounted, episodic, reward = –1 until goal


Results of Sarsa on the Windy Gridworld


Q-Learning: Off-Policy TD Control

One-step Q-learning:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

Compare to Sarsa:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$

• This tiny difference in the update makes a BIG difference!

• Q-learning learns Q*, the optimal Q-values, with probability 1, no matter what policy is being followed (as long as it is exploratory and visits each state-action pair infinitely often, and the learning rates get smaller at the appropriate rate).

• Hence this is "off-policy" because it finds π* while following another policy…
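In code, the only line that changes relative to the Sarsa sketch above is the target:

```python
# Sarsa (on-policy): back up the value of the action actually taken next
target = r + (0.0 if done else gamma * Q[(s_next, a_next)])

# Q-learning (off-policy): back up the greedy value of the next state,
# regardless of which action the behavior policy goes on to take
target = r + (0.0 if done else gamma * max(Q[(s_next, a)] for a in actions))
```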


Q-Learning: Off-Policy TD Control

One-step Q-learning:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$



Cliffwalking

ε-greedy, ε = 0.1


Actor-Critic Methods

• Explicit representation of policy as well as value function

• Minimal computation to select actions

• Can learn an explicit stochastic policy

• Can put constraints on policies

• Appealing as psychological and neural models


Actor-Critic Details

Suppose we take action $a_t$ from state $s_t$, reaching state $s_{t+1}$.

The critic evaluates the action using the TD error:

$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$

If actions are determined by preferences $p(s,a)$ as follows (a softmax policy):

$\pi_t(s,a) = \Pr\{ a_t = a \mid s_t = s \} = \dfrac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$

then you can update the actor's preferences like this:

$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$
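A minimal sketch putting the two updates together, assuming tabular states and actions, the hypothetical env interface used in the earlier sketches, α as the critic step size, and β as the actor step size from the slide:

```python
import math
import random
from collections import defaultdict

def actor_critic(env, actions, num_episodes, alpha=0.1, beta=0.1, gamma=1.0):
    """One-step actor-critic: TD(0) critic plus a softmax-in-preferences actor."""
    V = defaultdict(float)            # critic: state-value estimates
    p = defaultdict(float)            # actor: action preferences p(s, a)

    def softmax_action(s):
        weights = [math.exp(p[(s, a)]) for a in actions]
        return random.choices(actions, weights=weights)[0]

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = softmax_action(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            V[s] += alpha * delta          # critic update
            p[(s, a)] += beta * delta      # actor update: make a more (or less) likely in s
            s = s_next
    return V, p
```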


Dopamine Neurons and TD Error

W. Schultz et al., Université de Fribourg



Can we apply these methods to continuous tasks without discounting?

Yes: Average Reward Per Time Step

Average expected reward per time step under policy $\pi$:

$\rho^{\pi} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E_{\pi}\{ r_t \}$   (the same for each state if ergodic)

Value of a state relative to $\rho^{\pi}$:

$\tilde{V}^{\pi}(s) = \sum_{k=1}^{\infty} E_{\pi}\{ r_{t+k} - \rho^{\pi} \mid s_t = s \}$

Value of a state-action pair relative to $\rho^{\pi}$:

$\tilde{Q}^{\pi}(s,a) = \sum_{k=1}^{\infty} E_{\pi}\{ r_{t+k} - \rho^{\pi} \mid s_t = s, a_t = a \}$

These code values as the transients away from the average...


R-Learning
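The R-learning algorithm box on this slide did not survive extraction. As a hedged reconstruction of the updates (a paraphrase of the textbook's tabular R-learning, not text from the slide): keep an estimate ρ of the average reward, use r − ρ in place of a discounted return, and adjust ρ only on greedy steps.

```python
import random
from collections import defaultdict

def r_learning(env, actions, num_steps, alpha=0.1, beta=0.01, epsilon=0.1):
    """R-learning sketch: off-policy control for the average-reward setting."""
    Q = defaultdict(float)
    rho = 0.0                                              # estimated average reward per step
    s = env.reset()
    for _ in range(num_steps):
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        s_next, r, _ = env.step(a)                         # continuing task: termination ignored
        best_next = max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
        if Q[(s, a)] == max(Q[(s, x)] for x in actions):   # the chosen action was greedy
            rho += beta * (r - rho + best_next - Q[(s, a)])
        s = s_next
    return Q, rho
```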


Access-Control Queuing Task

• n servers

• Customers have four different priorities, which pay reward of 1, 2, 4, or 8 if served

• At each time step, the customer at the head of the queue is accepted (assigned to a server) or removed from the queue :-(

• Proportion of randomly distributed high-priority customers in the queue is h

• A busy server becomes free with probability p on each time step

• Statistics of arrivals and departures are unknown

n = 10, h = 0.5, p = 0.06

Apply R-learning


Afterstates

• Usually, a state-value function evaluates states in which the agent can take an action.

• But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.

• Why is this useful?

• What is this in general?
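For example, in tic-tac-toe two different (position, move) pairs can produce the same resulting board; valuing that resulting board directly, as an afterstate, lets both experiences update a single shared estimate rather than two separate action values.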



Summary

• TD prediction: introduced one-step tabular model-free TD methods

• Extend prediction to control by employing some form of GPI
  • On-policy control: Sarsa
  • Off-policy control: Q-learning and R-learning

• These methods bootstrap and sample, combining aspects of DP and MC methods


Questions

• What can I tell you about RL?

• What is common to all three classes of methods – DP, MC, TD?

• What are the principal strengths and weaknesses of each?

• In what sense is our RL view complete?

• In what senses is it incomplete?
  • What are the principal things missing?

• The broad applicability of these ideas…

• What does the term bootstrapping refer to?

• What is the relationship between DP and learning?

END

The Book

• Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem

• Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning

• Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies



Unified View