CSE 190: Reinforcement Learning: An Introduction

Chapter 6: Temporal Difference Learning
Acknowledgment: A good number of these slides are cribbed from Rich Sutton
CSE 190: Reinforcement Learning, Lecture on Chapter 6
Monte Carlo is important in practice
• When there are just a few possibilities to value, out of a large state space, Monte Carlo is a big win
• Backgammon, Go, …
• But it is still somewhat inefficient, because it figures out state values in isolation from other states (no bootstrapping)
Chapter 6: Temporal Difference Learning
• Introduce Temporal Difference (TD) learning
• As usual, we focus first on policy evaluation, or prediction, methods
• Then extend to control methods
Objectives of this chapter:
TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

Simple every-visit Monte Carlo method:
    V(s_t) ← V(s_t) + α[ R_t − V(s_t) ]
    (target: R_t, the actual return after time t)

The simplest TD method, TD(0):
    V(s_t) ← V(s_t) + α[ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
    (target: r_{t+1} + γ V(s_{t+1}), an estimate of the return)
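The two updates are one line each in code. This is an illustrative sketch, not code from the slides; V is assumed to be a dict of state values, and alpha and gamma are the step-size and discount parameters:

```python
def mc_update(V, s_t, R_t, alpha):
    """Every-visit Monte Carlo: move V(s_t) toward the actual return R_t."""
    V[s_t] += alpha * (R_t - V[s_t])

def td0_update(V, s_t, r_next, s_next, alpha, gamma):
    """TD(0): move V(s_t) toward the bootstrapped target r_{t+1} + gamma*V(s_{t+1})."""
    V[s_t] += alpha * (r_next + gamma * V[s_next] - V[s_t])
```

The only difference is the target: an observed return versus a one-step estimate of it.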
TD Prediction
Note that:
V^π(s) = E_π{ R_t | s_t = s }
       = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }
       = E_π{ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s }
       = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }
Hence the two targets are the same - in expectation. In practice, the results may differ due to finite training sets.
Simple Monte Carlo
    V(s_t) ← V(s_t) + α[ R_t − V(s_t) ]
    where R_t is the actual return following state s_t.

[Backup diagram: an entire sampled episode from s_t down to a terminal state T; the complete return R_t backs up to s_t.]
Simplest TD Method
    V(s_t) ← V(s_t) + α[ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

[Backup diagram: a single sampled transition s_t → r_{t+1}, s_{t+1}; the estimate V(s_{t+1}) backs up to s_t.]
cf. Dynamic Programming
    V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }

[Backup diagram: a full expected backup from s_t over all actions and all successor states s_{t+1}, with rewards r_{t+1}.]
Comparing MC and Dynamic Programming

• Bootstrapping: update involves an estimate from a nearby state
  • MC does not bootstrap:
    • Each state is estimated independently
  • DP bootstraps:
    • The estimate for each state depends on estimates of nearby states
• Sampling: update involves an actual return
  • MC samples:
    • Learns from experience in the environment and gets an actual return R
  • DP does not sample:
    • Learns by propagating constraints, iterating the Bellman equations until convergence - without interaction with the environment!
TD methods bootstrap and sample
• Bootstrapping: update involves an estimate
  • MC does not bootstrap
  • DP bootstraps
  • TD bootstraps
• Sampling: update involves an actual return
  • MC samples
  • DP does not sample
  • TD samples
Policy Evaluation with TD(0)
• Note the on-line nature: updates as you go.
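A minimal tabular version, as a sketch: the env.reset/env.step interface, episode count, and step size here are illustrative assumptions, not from the slides.

```python
from collections import defaultdict

def td0_policy_evaluation(policy, env, n_episodes=100, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction: update V online, after every transition."""
    V = defaultdict(float)                    # V(terminal) stays 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # update as you go
            s = s_next
    return V
```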
Example: Driving Home
State                 Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office         0                   30                     30
reach car, raining     5                   35                     40
exit highway          20                   15                     35
behind truck          30                   10                     40
home street           40                    3                     43
arrive home           43                    0                     43
Value of a state: The remaining time to get home, or time to go
Driving Home
Changes recommended by Monte Carlo methods (α=1)

Changes recommended by TD methods (α=1)
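The two sets of changes can be recomputed directly from the table above. With α=1, each MC change moves a prediction all the way to the final outcome (43 minutes), while each TD change moves it to the next prediction; a sketch:

```python
# Predicted total time at each state, from the table (elapsed + predicted-to-go);
# the last entry, 43, is the actual outcome
predictions = [30, 40, 35, 40, 43, 43]

# Monte Carlo changes (alpha = 1): every prediction jumps to the final outcome
mc_changes = [predictions[-1] - p for p in predictions[:-1]]

# TD changes (alpha = 1): every prediction jumps to the next prediction
# (the reward, elapsed time, is already folded into the predicted totals)
td_changes = [predictions[i + 1] - predictions[i] for i in range(len(predictions) - 1)]
```

Note that TD starts learning (e.g., the +10 at "leaving office") before the trip is over; MC must wait for the final 43.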
Advantages of TD Learning
• TD methods do not require a model of the environment, only experience
• TD, but not MC, methods can be fully incremental
  • You can learn before knowing the final outcome
    • Less memory
    • Less peak computation
  • You can learn without the final outcome
    • From incomplete sequences
• Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
Random Walk Example
Values learned by TD(0) after various numbers of episodes
What happened on the first episode?
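The example is easy to reproduce; a sketch of the walk and the TD(0) updates (five states A-E, start in the middle, +1 only for exiting right, true values 1/6 ... 5/6; the step size and episode count are illustrative):

```python
import random

def random_walk_td0(n_episodes=100, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) on the 5-state random walk; true values are 1/6, 2/6, ..., 5/6."""
    rng = random.Random(seed)
    V = {s: 0.5 for s in range(1, 6)}   # states 1..5 stand for A..E, init 0.5
    V[0] = V[6] = 0.0                   # terminal states on both ends
    for _ in range(n_episodes):
        s = 3                           # every episode starts in the middle (C)
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))    # step left or right, p = 1/2
            r = 1.0 if s_next == 6 else 0.0     # +1 only for exiting right
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```

After enough episodes the estimates approach the true values 1/6 through 5/6.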
TD and MC on the Random Walk
Data averaged over 100 sequences of episodes
Optimality of TD(0)
Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer!
Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.
You are the Predictor

Suppose you observe the following 8 episodes:
(NOTE: these are not from the previous example!)

1. A, 0, B, 0, terminate
2. B, 1, terminate
3. B, 1, terminate
4. B, 1, terminate
5. B, 1, terminate
6. B, 1, terminate
7. B, 1, terminate
8. B, 0, terminate
V(A)?  V(B)?
You are the Predictor
V(A)?
You are the Predictor
• The prediction that best matches the training data is V(A) = 0
  • This minimizes the mean-squared error on the training set
  • This is what a batch Monte Carlo method gets
• If we consider the sequentiality of the problem, then we would set V(A) = .75
  • This is correct for the maximum-likelihood estimate of a Markov model generating the data
  • i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
  • This is called the certainty-equivalence estimate
  • This is what TD(0) gets
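Both batch answers can be computed from the eight episodes. A sketch (undiscounted returns, as in the example; the data layout is an assumption of this sketch):

```python
# The 8 observed episodes, each a list of (state, reward-on-leaving) pairs
episodes = [[("A", 0), ("B", 0)]] + 6 * [[("B", 1)]] + [[("B", 0)]]

# Batch Monte Carlo: a state's value is the average of the observed
# (undiscounted) returns that followed visits to it
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
V_mc = {s: sum(rs) / len(rs) for s, rs in returns.items()}

# Certainty-equivalence: the maximum-likelihood Markov model says A goes to B
# with probability 1 and reward 0, so V(A) = 0 + V(B), and V(B) = 6/8
V_ce = {"B": V_mc["B"], "A": 0 + V_mc["B"]}
```

Batch MC gives V(A) = 0 (the single observed return from A), while the certainty-equivalence answer is V(A) = 0.75.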
Learning An Action-Value Function
Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:
    Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0. (No action out of a terminal state!)
Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
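A sketch of the resulting algorithm, with an ε-greedy policy standing in for "greedy plus exploration" (the environment interface and hyperparameters are illustrative assumptions):

```python
import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=100, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    """On-policy Sarsa: a' is chosen by the same eps-greedy policy being improved."""
    rng = random.Random(seed)
    Q = defaultdict(float)                    # Q[(s, a)]; terminal pairs stay 0

    def eps_greedy(s):
        if rng.random() < eps:
            return rng.choice(actions)                    # explore
        return max(actions, key=lambda a: Q[(s, a)])      # exploit

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2) if not done else None
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])     # update on (s, a, r, s', a')
            s, a = s2, a2
    return Q
```

The name comes from the quintuple the update uses: state, action, reward, next state, next action.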
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
Results of Sarsa on the Windy Gridworld
Q-Learning: Off-Policy TD Control

One-step Q-learning:
    Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Compare to Sarsa:
    Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

• This tiny difference in the update makes a BIG difference!
• Q-learning learns Q*, the optimal Q-values, with probability 1, no matter what policy is being followed (as long as it is exploratory and visits each state-action pair infinitely often, and the learning rates get smaller at the appropriate rate).
• Hence this is "off-policy": it finds π* while following another policy…
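The "tiny difference" is literally one term in the target, visible when the two updates are written side by side (an illustrative sketch; Q is assumed to be any mapping from (state, action) pairs to values):

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma, terminal=False):
    """Off-policy: the target uses the greedy value at s2, whatever action follows."""
    boot = 0.0 if terminal else max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * boot - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, terminal=False):
    """On-policy: the target uses the action a2 the behavior policy actually takes."""
    boot = 0.0 if terminal else Q[(s2, a2)]
    Q[(s, a)] += alpha * (r + gamma * boot - Q[(s, a)])
```

When the behavior policy takes an exploratory (non-greedy) next action, the two targets diverge - which is exactly what the cliff-walking example on the next slides exploits.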
Cliffwalking
ε-greedy, ε = 0.1
Actor-Critic Methods
• Explicit representation of policy as well as value function
• Minimal computation to select actions
• Can learn an explicit stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models
Actor-Critic Details
Suppose we take action a_t from state s_t, reaching state s_{t+1}.

The critic evaluates the action using the TD error:
    δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

If actions are determined by preferences p(s,a) as follows:
    π_t(s,a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_b e^{p(s,b)}
(a softmax policy), then you can update the actor's preferences like this:
    p(s_t, a_t) ← p(s_t, a_t) + β δ_t
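One combined actor-critic step, following the three equations above (a sketch; the dict-based representation and the step sizes alpha and beta are assumptions):

```python
import math

def softmax_policy(p, s, actions):
    """pi(s,a) proportional to exp(p(s,a)), over the actions available in s."""
    exps = {a: math.exp(p.get((s, a), 0.0)) for a in actions}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic_step(V, p, s, a, r, s2, alpha, beta, gamma):
    """Critic computes the TD error; both critic and actor update from it."""
    delta = r + gamma * V.get(s2, 0.0) - V.get(s, 0.0)   # TD error
    V[s] = V.get(s, 0.0) + alpha * delta                 # critic update
    p[(s, a)] = p.get((s, a), 0.0) + beta * delta        # actor: adjust preference
    return delta
```

A positive δ_t raises the preference for the action just taken, making it more probable under the softmax; a negative δ_t lowers it.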
Dopamine Neurons and TD Error

W. Schultz et al., Université de Fribourg
Can we apply these methods to continuous tasks without discounting?

Yes: Average Reward Per Time Step

Average expected reward per time step under policy π:
    ρ^π = lim_{n→∞} (1/n) Σ_{t=1}^n E_π{ r_t }    (the same for each state, if ergodic)

Value of a state relative to ρ^π:
    Ṽ^π(s) = Σ_{k=1}^∞ E_π{ r_{t+k} − ρ^π | s_t = s }

Value of a state-action pair relative to ρ^π:
    Q̃^π(s,a) = Σ_{k=1}^∞ E_π{ r_{t+k} − ρ^π | s_t = s, a_t = a }

These values encode the transient deviations away from the average reward...
R-Learning
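A sketch of the standard tabular R-learning updates as given in Sutton and Barto: Q-values are updated toward r − ρ + max_a' Q(s',a'), and the average-reward estimate ρ is updated (with its own step size β) only after greedy actions. The step sizes and the dict representation here are assumptions:

```python
def r_learning_update(Q, rho, s, a, r, s2, actions, alpha=0.1, beta=0.01):
    """One tabular R-learning step; returns the updated average-reward estimate rho."""
    delta = r - rho + max(Q[(s2, b)] for b in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    # Update rho only when the action taken was greedy in s
    if Q[(s, a)] == max(Q[(s, b)] for b in actions):
        rho += beta * delta
    return rho
```

Note there is no γ anywhere: values are relative to ρ, not discounted.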
Access-Control Queuing Task

• n servers
• Customers have four different priorities, which pay reward of 1, 2, 4, or 8 if served
• At each time step, the customer at the head of the queue is either accepted (assigned to a server) or removed from the queue
• The proportion of randomly distributed high-priority customers in the queue is h
• A busy server becomes free with probability p on each time step
• Statistics of arrivals and departures are unknown

n = 10, h = .5, p = .06

Apply R-learning
Afterstates

• Usually, a state-value function evaluates states in which the agent can take an action.
• But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
• Why is this useful?
• What is this in general?
Summary
• TD prediction
• Introduced one-step tabular model-free TD methods
• Extended prediction to control by employing some form of GPI
  • On-policy control: Sarsa
  • Off-policy control: Q-learning and R-learning
• These methods bootstrap and sample, combining aspects of DP and MC methods
Questions
• What can I tell you about RL?
• What is common to all three classes of methods - DP, MC, TD?
• What are the principal strengths and weaknesses of each?
• In what sense is our RL view complete?
• In what senses is it incomplete?
  • What are the principal things missing?
• The broad applicability of these ideas…
• What does the term bootstrapping refer to?
• What is the relationship between DP and learning?
END
The Book
• Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem
• Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning
• Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies
Unified View