Reinforcement Learning: Learning from Interaction
Winter School on Machine Learning and Vision, 2010
B. Ravindran
Many slides adapted from Sutton and Barto
Learning to Control
• So far looked at two models of learning
– Supervised: Classification, Regression, etc.
– Unsupervised: Clustering, etc.
• How did you learn to cycle?
– Neither of the above – Trial and error!
– Falling down hurts!
Can You hear me now?
Reinforcement Learning
• A trial-and-error learning paradigm
– Rewards and Punishments
• Not just an algorithm but a new paradigm in itself
• Learn about a system
– behaviour
– control
from minimal feedback
• Inspired by behavioural psychology
RL Framework
• Learn from close interaction
• Stochastic environment
• Noisy delayed scalar evaluation
• Maximize a measure of long term performance
[Figure: the agent-environment loop. The agent sends an action to the environment, which returns a state and an evaluation.]
Not Supervised Learning!
• Very sparse “supervision”
• No target output provided
• No error gradient information available
• Action chooses next state
• Explore to estimate gradient
– Trial and error learning
[Figure: supervised learning loop. An input goes to the agent, which produces an output; the output is compared with a target to give an error signal.]
Not Unsupervised Learning
• Sparse “supervision” available
• Pattern detection not primary goal
[Figure: an input goes to the agent, which produces an activation; an evaluation signal is also shown.]
TD-Gammon (Tesauro 1992, 1994, 1995, ...)
• White has just rolled a 5 and a 2, so he can move one of his pieces 5 steps and one (possibly the same piece) 2 steps
• Objective is to advance all pieces to points 19-24
• Hitting
• 30 pieces, 24 locations implies an enormous number of configurations
• Effective branching factor of 400
The Agent-Environment Interface
Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$
Agent observes state at step $t$: $s_t \in S$
produces action at step $t$: $a_t \in A(s_t)$
gets resulting reward: $r_{t+1} \in \Re$
and resulting next state: $s_{t+1}$
$\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$
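To make the interface concrete, here is a minimal Python sketch of the interaction loop. The toy `Environment` class, its `reset`/`step` methods, and the random action choice are illustrative assumptions, not something specified in the slides.

```python
import random

class Environment:
    """A toy environment used only to illustrate the agent-environment interface."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Toy dynamics: action 1 moves forward, action 0 stays put.
        reward = 1.0 if action == 1 else 0.0
        self.state = min(self.state + action, 5)
        done = self.state == 5            # episode ends in a terminal state
        return self.state, reward, done

env = Environment()
s = env.reset()
for t in range(20):                        # discrete time steps t = 0, 1, 2, ...
    a = random.choice([0, 1])              # agent produces a_t given s_t
    s_next, r, done = env.step(a)          # environment returns r_{t+1}, s_{t+1}
    print(f"t={t}: s={s}, a={a}, r={r}, s'={s_next}")
    s = s_next
    if done:
        break
```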
The Agent Learns a Policy

Policy at step $t$, $\pi_t$: a mapping from states to action probabilities
$\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$
• Reinforcement learning methods specify how the agent changes its policy as a result of experience.
• Roughly, the agent’s goal is to get as much reward as it can over the long run.
Goals and Rewards
• Is a scalar reward signal an adequate notion of a goal?—maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal must be outside the agent’s direct control—thus outside the agent.
• The agent must be able to measure success:
– explicitly;
– frequently during its lifespan.
Returns
What do we want to maximize?
Suppose the sequence of rewards after step $t$ is: $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$
In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,
where $T$ is a final time step at which a terminal state is reached, ending an episode.
The Markov Property
• “The state” at step $t$ means whatever information is available to the agent at step $t$ about its environment.
• The state can include immediate “sensations”, highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property:
$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.
Markov Decision Processes
• If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
– state and action sets
– one-step “dynamics” defined by transition probabilities:
$P_{ss'}^a = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ for all $s, s' \in S$, $a \in A(s)$
– reward expectations:
$R_{ss'}^a = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$ for all $s, s' \in S$, $a \in A(s)$
$M = \langle S, A, P, R \rangle$
Recycling Robot: An Example Finite MDP
• At each step, robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions made on basis of current energy level: high, low.
• Reward = number of cans collected
Recycling Robot MDP
$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{search}, \text{wait}\}$
$A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$
$R^{\text{search}}$ = expected number of cans while searching
$R^{\text{wait}}$ = expected number of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$
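As a rough illustration, the tuple $M = \langle S, A, P, R \rangle$ for the recycling robot can be written down as plain Python dictionaries. The numerical probabilities and reward values below are made up for the sketch (the slides only give the sets and the constraint $R^{\text{search}} > R^{\text{wait}}$), so treat every number as an assumption.

```python
# Finite MDP M = (S, A, P, R) for the recycling robot, with assumed numbers.
S = ["high", "low"]
A = {"high": ["search", "wait"],
     "low":  ["search", "wait", "recharge"]}

# P[(s, a)] maps next state s' -> Pr{s_{t+1} = s' | s_t = s, a_t = a}  (assumed values)
P = {
    ("high", "search"):   {"high": 0.7, "low": 0.3},
    ("high", "wait"):     {"high": 1.0},
    ("low",  "search"):   {"low": 0.6, "high": 0.4},   # "high" here = battery died, rescued
    ("low",  "wait"):     {"low": 1.0},
    ("low",  "recharge"): {"high": 1.0},
}

# R[(s, a, s')] = expected reward in cans; -3.0 models being rescued (all assumed)
R_SEARCH, R_WAIT = 2.0, 1.0                 # satisfies R_search > R_wait
R = {
    ("high", "search", "high"): R_SEARCH, ("high", "search", "low"): R_SEARCH,
    ("high", "wait",   "high"): R_WAIT,
    ("low",  "search", "low"):  R_SEARCH, ("low",  "search", "high"): -3.0,
    ("low",  "wait",   "low"):  R_WAIT,
    ("low",  "recharge", "high"): 0.0,
}
```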
Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent’s policy:
State-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{T} r_{t+k+1} \,\middle|\, s_t = s\right\}$
• The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following $\pi$:
Action-value function for policy $\pi$:
$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{T} r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$
Bellman Equation for a Policy $\pi$

The basic idea:
$R_t = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + \cdots$
$= r_{t+1} + \left(r_{t+2} + r_{t+3} + r_{t+4} + \cdots\right)$
$= r_{t+1} + R_{t+1}$

So: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + V^\pi(s_{t+1}) \mid s_t = s\}$

Or, without the expectation operator:
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$
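Because the Bellman equation for a fixed policy is linear in $V^\pi$, a small finite MDP can be evaluated with a direct linear solve. The helper below is a sketch under my own naming conventions: it takes $P$ and $R$ in the dictionary format assumed in the earlier recycling-robot sketch, a stochastic policy `pi`, and a discount `gamma` (an assumption added here so the linear system stays well conditioned; the slides use undiscounted episodic returns).

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, states, gamma=0.95):
    """Solve V = r_pi + gamma * P_pi V exactly (small MDPs only).

    P[(s, a)]    : dict next-state -> probability
    R[(s, a, s')]: expected immediate reward
    pi[s]        : dict action -> probability of taking it in s
    """
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((n, n))
    r_pi = np.zeros(n)
    for s in states:
        for a, pa in pi[s].items():
            for s_next, p in P[(s, a)].items():
                P_pi[idx[s], idx[s_next]] += pa * p
                r_pi[idx[s]] += pa * p * R[(s, a, s_next)]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return dict(zip(states, V))
```

For example, it could be called with the recycling-robot dictionaries from the earlier sketch and a uniform-random policy over each state's actions.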
Optimal Value Functions
• For finite MDPs, policies can be partially ordered:
$\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$
• There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^*$.
• Optimal policies share the same optimal state-value function:
$V^*(s) = \max_\pi V^\pi(s)$ for all $s \in S$
Bellman Optimality Equation

The value of a state under an optimal policy must equal the expected return for the best action from that state:
$V^*(s) = \max_{a \in A(s)} E\{r_{t+1} + V^*(s_{t+1}) \mid s_t = s, a_t = a\}$
$= \max_{a \in A(s)} \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^*(s') \right]$

$V^*$ is the unique solution of this system of nonlinear equations.
Bellman Optimality Equation

Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy:
$Q^*(s,a) = E\{r_{t+1} + \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a\}$
$= \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + \max_{a'} Q^*(s', a') \right]$

$Q^*$ is the unique solution of this system of nonlinear equations.
Dynamic Programming
• DP is the solution method of choice for MDPs
– Requires complete knowledge of system dynamics ($P$ and $R$)
– Expensive and often not practical
– Curse of dimensionality
– Guaranteed to converge!
• RL methods: online approximate dynamic programming
– No knowledge of $P$ and $R$
– Sample trajectories through state space
– Some theoretical convergence analysis available
Policy Evaluation

Policy Evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.

Recall:
State-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} r_{t+k+1} \,\middle|\, s_t = s\right\}$

Bellman equation for $V^\pi$:
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$
– a system of $|S|$ simultaneous linear equations
– solve iteratively
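The same computation can be done by sweeping the Bellman equation repeatedly instead of solving the linear system. Below is a minimal iterative policy evaluation sketch, reusing the assumed dictionary format from the earlier sketches and an assumed discount factor and tolerance.

```python
def evaluate_policy_iterative(P, R, pi, states, gamma=0.95, theta=1e-8):
    """Iterative policy evaluation: sweep the Bellman equation until the
    largest change in V falls below the tolerance theta (names assumed)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pa * p * (R[(s, a, s_next)] + gamma * V[s_next])
                        for a, pa in pi[s].items()
                        for s_next, p in P[(s, a)].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```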
Policy Improvement

Suppose we have computed $V^\pi$ for a deterministic policy $\pi$.
For a given state $s$, would it be better to do an action $a \ne \pi(s)$?

The value of doing $a$ in state $s$ is:
$Q^\pi(s,a) = E\{r_{t+1} + V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$

It is better to switch to action $a$ for state $s$ if and only if
$Q^\pi(s,a) > V^\pi(s)$
Policy Improvement Cont.

Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:
$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$

Then $V^{\pi'} \ge V^\pi$.
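A sketch of this greedy improvement step under the same assumed MDP representation (the helper name and the discount factor are my additions, not from the slides):

```python
def improve_policy(P, R, V, states, actions, gamma=0.95):
    """Greedy policy improvement: pi'(s) = argmax_a sum_s' P [R + gamma V(s')].
    Returns a deterministic policy as a dict state -> action."""
    pi_new = {}
    for s in states:
        def q(a):
            # one-step lookahead value of taking a in s and following V afterwards
            return sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                       for s_next, p in P[(s, a)].items())
        pi_new[s] = max(actions[s], key=q)
    return pi_new
```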
Policy Improvement Cont.
What if $V^{\pi'} = V^\pi$?
i.e., for all $s \in S$, $V^\pi(s) = \max_a \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$?

But this is the Bellman Optimality Equation.
So $V^{\pi'} = V^*$ and both $\pi$ and $\pi'$ are optimal policies.
Policy Iteration
$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$
(alternating policy evaluation and policy improvement, i.e. “greedification”)
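Putting the two previous sketches together gives the policy iteration loop itself. The wrapper below assumes the `evaluate_policy_iterative` and `improve_policy` helpers sketched above and an arbitrary deterministic starting policy.

```python
def policy_iteration(P, R, states, actions, gamma=0.95):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    pi = {s: actions[s][0] for s in states}          # arbitrary initial deterministic policy
    while True:
        # evaluation step: treat the deterministic policy as a distribution
        pi_dist = {s: {pi[s]: 1.0} for s in states}
        V = evaluate_policy_iterative(P, R, pi_dist, states, gamma)
        # improvement ("greedification") step
        pi_new = improve_policy(P, R, V, states, actions, gamma)
        if pi_new == pi:
            return pi, V
        pi = pi_new
```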
Value Iteration
Recall the Bellman optimality equation:
$V^*(s) = \max_{a \in A(s)} E\{r_{t+1} + V^*(s_{t+1}) \mid s_t = s, a_t = a\}$
$= \max_{a \in A(s)} \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^*(s') \right]$

We can convert it to a full value iteration backup:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V_k(s') \right]$

Iterate until “convergence”.
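A minimal value iteration sketch in the same assumed representation, sweeping the Bellman optimality backup until the largest change falls below a tolerance:

```python
def value_iteration(P, R, states, actions, gamma=0.95, theta=1e-8):
    """Value iteration: repeatedly apply the Bellman optimality backup."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                            for s_next, p in P[(s, a)].items())
                        for a in actions[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                  # "convergence" in practice
            return V
```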
Generalized Policy Iteration
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
A geometric metaphor for convergence of GPI (figure not shown).
Dynamic Programming
$V(s_t) \leftarrow E_\pi\{r_{t+1} + V(s_{t+1})\}$

[Backup diagram: a full DP backup from $s_t$ branches over every action and every possible next state $s_{t+1}$ with reward $r_{t+1}$, including terminal states.]
Simplest TD Method
$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + V(s_{t+1}) - V(s_t)\right]$

[Backup diagram: TD(0) backs up from a single sampled transition $s_t \to s_{t+1}$ with reward $r_{t+1}$.]
RL Algorithms – Prediction
• Policy Evaluation (the prediction problem): for a given policy, compute the state-value function.
• No knowledge of P and R, but access to the real system, or a “sample” model assumed.
• Uses “bootstrapping” and sampling
The simplest TD method, TD(0):
$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + V(s_{t+1}) - V(s_t)\right]$
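A tabular TD(0) prediction sketch of this update. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`) and the `policy(s)` callable are illustrative assumptions, not an API given in the slides.

```python
from collections import defaultdict

def td0(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction of V for a fixed policy."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha [r + V(s') - V(s)]
            s = s_next
    return V
```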
Advantages of TD
• TD methods do not require a model of the environment, only experience
• TD methods can be fully incremental
– You can learn before knowing the final outcome
• Less memory
• Less peak computation
– You can learn without the final outcome
• From incomplete sequences
RL Algorithms – Control
SARSA

After every transition from a nonterminal state $s_t$, do this:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.
Q-learning

One-step Q-learning:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$
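The two control updates differ only in their bootstrap target: SARSA uses the action actually taken in $s_{t+1}$, Q-learning uses the max over next actions. Below is a tabular Q-learning sketch with an ε-greedy behaviour policy; the environment interface matches the earlier assumed TD(0) sketch, and a comment marks where SARSA would differ.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """One-step Q-learning; env.reset()/env.step() as in the TD(0) sketch."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s_next, r, done = env.step(a)
            # Q-learning bootstraps on max over next actions; SARSA would instead
            # use Q[(s_next, a_next)] for the action actually chosen next.
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```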
Cliffwalking
ε-greedy, ε = 0.1
Actor-Critic Methods
• Explicit representation of policy as well as value function
• Minimal computation to select actions
• Can learn an explicit stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models
Actor-Critic Details

TD-error is used to evaluate actions:
$\delta_t = r_{t+1} + V(s_{t+1}) - V(s_t)$

If actions are determined by preferences, $p(s,a)$, as follows:
$\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \dfrac{e^{p(s,a)}}{\sum_b e^{p(s,b)}},$

then you can update the preferences like this:
$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$
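A tabular sketch of one actor-critic update built from these equations. The function names, step sizes, and the dictionary representation of $V$ and $p$ are my own assumptions for illustration.

```python
import math
import random
from collections import defaultdict

def softmax_policy(p, s, actions):
    """Sample an action with pi(s,a) = exp(p(s,a)) / sum_b exp(p(s,b))."""
    weights = [math.exp(p[(s, a)]) for a in actions]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]

def actor_critic_step(V, p, s, a, r, s_next, done, alpha=0.1, beta=0.1, gamma=1.0):
    """One actor-critic update: the critic updates V with the TD error,
    the actor nudges the preference of the taken action by beta * delta."""
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
    V[s] += alpha * delta                                      # critic update
    p[(s, a)] += beta * delta                                  # actor (preference) update
    return delta

# Usage sketch: tabular critic values and action preferences start at zero.
V, p = defaultdict(float), defaultdict(float)
```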
Applications of RL
• Robot navigation
• Adaptive control
– Helicopter pilot!
• Combinatorial optimization
– VLSI placement and routing, elevator dispatching
• Game playing
– Backgammon – world’s best player!
• Computational Neuroscience– Modeling of reward processes
Other Topics
• Different measures of return
– Average rewards, Discounted Returns
• Policy gradient approaches
– Directly perturb policies
• Generalization [Saturday]
– Function approximation
– Temporal abstraction
• Least squares methods
– Better use of data
– More suited for “off-line” RL
References
Mitchell, T. Machine Learning. McGraw Hill. 1997.
Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education. 2000.
Sutton, R.S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press. 1998.
Dayan, P. and Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press. 2001.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific. 1996.