Reinforcement Learning: Learning from Interaction
Winter School on Machine Learning and Vision, 2010
B. Ravindran
Many slides adapted from Sutton and Barto
Learning to Control
• So far looked at two models of learning
– Supervised: Classification, Regression, etc.
– Unsupervised: Clustering, etc.
• How did you learn to cycle?
– Neither of the above – Trial and error!
– Falling down hurts!
Can You hear me now?
Reinforcement Learning
• A trial-and-error learning paradigm
– Rewards and Punishments
• Not just an algorithm but a new paradigm in itself
• Learn about a system
– behaviour
– control
from minimal feedback
• Inspired by behavioural psychology
RL Framework
• Learn from close interaction
• Stochastic environment
• Noisy delayed scalar evaluation
• Maximize a measure of long term performance
[Figure: the agent-environment loop. The agent sends an action to the environment, which returns a state and an evaluation.]
Not Supervised Learning!
• Very sparse “supervision”
• No target output provided
• No error gradient information available
• Action chooses next state
• Explore to estimate gradient
– Trial and error learning
[Figure: supervised learning loop. An input goes to the agent, which produces an output; the output is compared with a target to give an error signal.]
Not Unsupervised Learning
• Sparse “supervision” available
• Pattern detection not primary goal
[Figure: an input goes to the agent, which produces an activation; an evaluation signal is also shown.]
TD-Gammon (Tesauro 1992, 1994, 1995, ...)
• White has just rolled a 5 and a 2, so he can move one of his pieces 5 steps and one (possibly the same piece) 2 steps
• Objective is to advance all pieces to points 19-24
• Hitting
• 30 pieces, 24 locations implies an enormous number of configurations
• Effective branching factor of 400
The Agent-Environment Interface
Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$
Agent observes state at step $t$: $s_t \in S$
produces action at step $t$: $a_t \in A(s_t)$
gets resulting reward: $r_{t+1} \in \Re$
and resulting next state: $s_{t+1}$
$\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$
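To make the interface concrete, here is a minimal Python sketch of the interaction loop. The toy `Environment` class, its `reset`/`step` methods, and the random action choice are illustrative assumptions, not something specified in the slides.

```python
import random

class Environment:
    """A toy environment used only to illustrate the agent-environment interface."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Toy dynamics: action 1 moves forward, action 0 stays put.
        reward = 1.0 if action == 1 else 0.0
        self.state = min(self.state + action, 5)
        done = self.state == 5            # episode ends in a terminal state
        return self.state, reward, done

env = Environment()
s = env.reset()
for t in range(20):                        # discrete time steps t = 0, 1, 2, ...
    a = random.choice([0, 1])              # agent produces a_t given s_t
    s_next, r, done = env.step(a)          # environment returns r_{t+1}, s_{t+1}
    print(f"t={t}: s={s}, a={a}, r={r}, s'={s_next}")
    s = s_next
    if done:
        break
```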
The Agent Learns a Policy

Policy at step $t$, $\pi_t$: a mapping from states to action probabilities
$\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$
• Reinforcement learning methods specify how the agent changes its policy as a result of experience.
• Roughly, the agent’s goal is to get as much reward as it can over the long run.
Goals and Rewards
• Is a scalar reward signal an adequate notion of a goal?—maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal must be outside the agent’s direct control—thus outside the agent.
• The agent must be able to measure success:
– explicitly;
– frequently during its lifespan.
Returns
What do we want to maximize?
Suppose the sequence of rewards after step $t$ is: $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$
In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,
where $T$ is a final time step at which a terminal state is reached, ending an episode.
The Markov Property
• “The state” at step $t$ means whatever information is available to the agent at step $t$ about its environment.
• The state can include immediate “sensations”, highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property:
$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.
Markov Decision Processes
• If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
– state and action sets
– one-step “dynamics” defined by transition probabilities:
$P_{ss'}^a = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ for all $s, s' \in S$, $a \in A(s)$
– reward expectations:
$R_{ss'}^a = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$ for all $s, s' \in S$, $a \in A(s)$
$M = \langle S, A, P, R \rangle$
Recycling Robot: An Example Finite MDP
• At each step, robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions made on basis of current energy level: high, low.
• Reward = number of cans collected
Recycling Robot MDP
$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{search}, \text{wait}\}$
$A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$
$R^{\text{search}}$ = expected number of cans while searching
$R^{\text{wait}}$ = expected number of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$
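As a rough illustration, the tuple $M = \langle S, A, P, R \rangle$ for the recycling robot can be written down as plain Python dictionaries. The numerical probabilities and reward values below are made up for the sketch (the slides only give the sets and the constraint $R^{\text{search}} > R^{\text{wait}}$), so treat every number as an assumption.

```python
# Finite MDP M = (S, A, P, R) for the recycling robot, with assumed numbers.
S = ["high", "low"]
A = {"high": ["search", "wait"],
     "low":  ["search", "wait", "recharge"]}

# P[(s, a)] maps next state s' -> Pr{s_{t+1} = s' | s_t = s, a_t = a}  (assumed values)
P = {
    ("high", "search"):   {"high": 0.7, "low": 0.3},
    ("high", "wait"):     {"high": 1.0},
    ("low",  "search"):   {"low": 0.6, "high": 0.4},   # "high" here = battery died, rescued
    ("low",  "wait"):     {"low": 1.0},
    ("low",  "recharge"): {"high": 1.0},
}

# R[(s, a, s')] = expected reward in cans; -3.0 models being rescued (all assumed)
R_SEARCH, R_WAIT = 2.0, 1.0                 # satisfies R_search > R_wait
R = {
    ("high", "search", "high"): R_SEARCH, ("high", "search", "low"): R_SEARCH,
    ("high", "wait",   "high"): R_WAIT,
    ("low",  "search", "low"):  R_SEARCH, ("low",  "search", "high"): -3.0,
    ("low",  "wait",   "low"):  R_WAIT,
    ("low",  "recharge", "high"): 0.0,
}
```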
Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent’s policy:
State-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{T} r_{t+k+1} \,\middle|\, s_t = s\right\}$
• The value of a state-action pair is the expected return starting from that state, taking that action, and thereafter following $\pi$:
Action-value function for policy $\pi$:
$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{T} r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$
Bellman Equation for a Policy $\pi$

The basic idea:
$R_t = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + \cdots$
$= r_{t+1} + \left(r_{t+2} + r_{t+3} + r_{t+4} + \cdots\right)$
$= r_{t+1} + R_{t+1}$

So: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + V^\pi(s_{t+1}) \mid s_t = s\}$

Or, without the expectation operator:
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$
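Because the Bellman equation for a fixed policy is linear in $V^\pi$, a small finite MDP can be evaluated with a direct linear solve. The helper below is a sketch under my own naming conventions: it takes $P$ and $R$ in the dictionary format assumed in the earlier recycling-robot sketch, a stochastic policy `pi`, and a discount `gamma` (an assumption added here so the linear system stays well conditioned; the slides use undiscounted episodic returns).

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, states, gamma=0.95):
    """Solve V = r_pi + gamma * P_pi V exactly (small MDPs only).

    P[(s, a)]    : dict next-state -> probability
    R[(s, a, s')]: expected immediate reward
    pi[s]        : dict action -> probability of taking it in s
    """
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((n, n))
    r_pi = np.zeros(n)
    for s in states:
        for a, pa in pi[s].items():
            for s_next, p in P[(s, a)].items():
                P_pi[idx[s], idx[s_next]] += pa * p
                r_pi[idx[s]] += pa * p * R[(s, a, s_next)]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return dict(zip(states, V))
```

For example, it could be called with the recycling-robot dictionaries from the earlier sketch and a uniform-random policy over each state's actions.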
Optimal Value Functions
• For finite MDPs, policies can be partially ordered:
$\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$
• There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^*$.
• Optimal policies share the same optimal state-value function:
$V^*(s) = \max_\pi V^\pi(s)$ for all $s \in S$
Bellman Optimality Equation

The value of a state under an optimal policy must equal the expected return for the best action from that state:
$V^*(s) = \max_{a \in A(s)} E\{r_{t+1} + V^*(s_{t+1}) \mid s_t = s, a_t = a\}$
$= \max_{a \in A(s)} \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^*(s') \right]$

$V^*$ is the unique solution of this system of nonlinear equations.
Bellman Optimality Equation

Similarly, the optimal value of a state-action pair is the expected return for taking that action and thereafter following the optimal policy:
$Q^*(s,a) = E\{r_{t+1} + \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a\}$
$= \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + \max_{a'} Q^*(s', a') \right]$

$Q^*$ is the unique solution of this system of nonlinear equations.
Dynamic Programming
• DP is the solution method of choice for MDPs
– Requires complete knowledge of system dynamics ($P$ and $R$)
– Expensive and often not practical
– Curse of dimensionality
– Guaranteed to converge!
• RL methods: online approximate dynamic programming
– No knowledge of $P$ and $R$
– Sample trajectories through state space
– Some theoretical convergence analysis available
Policy Evaluation

Policy Evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.

Recall:
State-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} r_{t+k+1} \,\middle|\, s_t = s\right\}$

Bellman equation for $V^\pi$:
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$
– a system of $|S|$ simultaneous linear equations
– solve iteratively
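The same computation can be done by sweeping the Bellman equation repeatedly instead of solving the linear system. Below is a minimal iterative policy evaluation sketch, reusing the assumed dictionary format from the earlier sketches and an assumed discount factor and tolerance.

```python
def evaluate_policy_iterative(P, R, pi, states, gamma=0.95, theta=1e-8):
    """Iterative policy evaluation: sweep the Bellman equation until the
    largest change in V falls below the tolerance theta (names assumed)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pa * p * (R[(s, a, s_next)] + gamma * V[s_next])
                        for a, pa in pi[s].items()
                        for s_next, p in P[(s, a)].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```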
Policy Improvement

Suppose we have computed $V^\pi$ for a deterministic policy $\pi$.
For a given state $s$, would it be better to do an action $a \ne \pi(s)$?

The value of doing $a$ in state $s$ is:
$Q^\pi(s,a) = E\{r_{t+1} + V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$

It is better to switch to action $a$ for state $s$ if and only if
$Q^\pi(s,a) > V^\pi(s)$
Policy Improvement Cont.

Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:
$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$

Then $V^{\pi'} \ge V^\pi$.
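A sketch of this greedy improvement step under the same assumed MDP representation (the helper name and the discount factor are my additions, not from the slides):

```python
def improve_policy(P, R, V, states, actions, gamma=0.95):
    """Greedy policy improvement: pi'(s) = argmax_a sum_s' P [R + gamma V(s')].
    Returns a deterministic policy as a dict state -> action."""
    pi_new = {}
    for s in states:
        def q(a):
            # one-step lookahead value of taking a in s and following V afterwards
            return sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                       for s_next, p in P[(s, a)].items())
        pi_new[s] = max(actions[s], key=q)
    return pi_new
```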
Policy Improvement Cont.
What if $V^{\pi'} = V^\pi$?
i.e., for all $s \in S$, $V^\pi(s) = \max_a \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^\pi(s') \right]$?

But this is the Bellman Optimality Equation.
So $V^{\pi'} = V^*$ and both $\pi$ and $\pi'$ are optimal policies.
Policy Iteration
$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$
(alternating policy evaluation and policy improvement, i.e. “greedification”)
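Putting the two previous sketches together gives the policy iteration loop itself. The wrapper below assumes the `evaluate_policy_iterative` and `improve_policy` helpers sketched above and an arbitrary deterministic starting policy.

```python
def policy_iteration(P, R, states, actions, gamma=0.95):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    pi = {s: actions[s][0] for s in states}          # arbitrary initial deterministic policy
    while True:
        # evaluation step: treat the deterministic policy as a distribution
        pi_dist = {s: {pi[s]: 1.0} for s in states}
        V = evaluate_policy_iterative(P, R, pi_dist, states, gamma)
        # improvement ("greedification") step
        pi_new = improve_policy(P, R, V, states, actions, gamma)
        if pi_new == pi:
            return pi, V
        pi = pi_new
```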
Value Iteration
Recall the Bellman optimality equation:
$V^*(s) = \max_{a \in A(s)} E\{r_{t+1} + V^*(s_{t+1}) \mid s_t = s, a_t = a\}$
$= \max_{a \in A(s)} \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V^*(s') \right]$

We can convert it to a full value iteration backup:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P_{ss'}^a \left[ R_{ss'}^a + V_k(s') \right]$

Iterate until “convergence”.
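A minimal value iteration sketch in the same assumed representation, sweeping the Bellman optimality backup until the largest change falls below a tolerance:

```python
def value_iteration(P, R, states, actions, gamma=0.95, theta=1e-8):
    """Value iteration: repeatedly apply the Bellman optimality backup."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                            for s_next, p in P[(s, a)].items())
                        for a in actions[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                  # "convergence" in practice
            return V
```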
Generalized Policy Iteration
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
A geometric metaphor for convergence of GPI (figure not shown).
Dynamic Programming
$V(s_t) \leftarrow E_\pi\{r_{t+1} + V(s_{t+1})\}$

[Backup diagram: a full DP backup from $s_t$ branches over every action and every possible next state $s_{t+1}$ with reward $r_{t+1}$, including terminal states.]
Simplest TD Method
$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + V(s_{t+1}) - V(s_t)\right]$

[Backup diagram: TD(0) backs up from a single sampled transition $s_t \to s_{t+1}$ with reward $r_{t+1}$.]
RL Algorithms – Prediction
• Policy Evaluation (the prediction problem): for a given policy, compute the state-value function.
• No knowledge of P and R, but access to the real system, or a “sample” model assumed.
• Uses “bootstrapping” and sampling
The simplest TD method, TD(0):
$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + V(s_{t+1}) - V(s_t)\right]$
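A tabular TD(0) prediction sketch of this update. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`) and the `policy(s)` callable are illustrative assumptions, not an API given in the slides.

```python
from collections import defaultdict

def td0(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction of V for a fixed policy."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha [r + V(s') - V(s)]
            s = s_next
    return V
```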
Advantages of TD
• TD methods do not require a model of the environment, only experience
• TD methods can be fully incremental
– You can learn before knowing the final outcome
• Less memory
• Less peak computation
– You can learn without the final outcome
• From incomplete sequences
RL Algorithms – Control
SARSA

After every transition from a nonterminal state $s_t$, do this:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.
Q-learning

One-step Q-learning:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$
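The two control updates differ only in their bootstrap target: SARSA uses the action actually taken in $s_{t+1}$, Q-learning uses the max over next actions. Below is a tabular Q-learning sketch with an ε-greedy behaviour policy; the environment interface matches the earlier assumed TD(0) sketch, and a comment marks where SARSA would differ.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """One-step Q-learning; env.reset()/env.step() as in the TD(0) sketch."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s_next, r, done = env.step(a)
            # Q-learning bootstraps on max over next actions; SARSA would instead
            # use Q[(s_next, a_next)] for the action actually chosen next.
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```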
Cliffwalking
ε-greedy, ε = 0.1
Actor-Critic Methods
• Explicit representation of policy as well as value function
• Minimal computation to select actions
• Can learn an explicit stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models
Actor-Critic Details

TD-error is used to evaluate actions:
$\delta_t = r_{t+1} + V(s_{t+1}) - V(s_t)$

If actions are determined by preferences, $p(s,a)$, as follows:
$\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \dfrac{e^{p(s,a)}}{\sum_b e^{p(s,b)}},$

then you can update the preferences like this:
$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$
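A tabular sketch of one actor-critic update built from these equations. The function names, step sizes, and the dictionary representation of $V$ and $p$ are my own assumptions for illustration.

```python
import math
import random
from collections import defaultdict

def softmax_policy(p, s, actions):
    """Sample an action with pi(s,a) = exp(p(s,a)) / sum_b exp(p(s,b))."""
    weights = [math.exp(p[(s, a)]) for a in actions]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]

def actor_critic_step(V, p, s, a, r, s_next, done, alpha=0.1, beta=0.1, gamma=1.0):
    """One actor-critic update: the critic updates V with the TD error,
    the actor nudges the preference of the taken action by beta * delta."""
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
    V[s] += alpha * delta                                      # critic update
    p[(s, a)] += beta * delta                                  # actor (preference) update
    return delta

# Usage sketch: tabular critic values and action preferences start at zero.
V, p = defaultdict(float), defaultdict(float)
```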
Applications of RL
• Robot navigation
• Adaptive control
– Helicopter pilot!
• Combinatorial optimization
– VLSI placement and routing, elevator dispatching
• Game playing
– Backgammon – world’s best player!
• Computational Neuroscience– Modeling of reward processes
Other Topics
• Different measures of return
– Average rewards, Discounted Returns
• Policy gradient approaches
– Directly perturb policies
• Generalization [Saturday]
– Function approximation
– Temporal abstraction
• Least squares methods
– Better use of data
– More suited for “off-line” RL
References
Mitchell, T. Machine Learning. McGraw Hill. 1997.
Russell, S. J. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson Education. 2000.
Sutton, R.S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press. 1998.
Dayan, P. and Abbott, L. F. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press. 2001.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific. 1996.