
Page 1

Stochastic Optimal Control – part 2
discrete time, Markov Decision Processes,

Reinforcement Learning

Marc Toussaint

Machine Learning & Robotics Group – TU Berlin
[email protected]

ICML 2008, Helsinki, July 5th, 2008

• Why stochasticity?

• Markov Decision Processes

• Bellman optimality equation, Dynamic Programming, Value Iteration

• Reinforcement Learning: learning from experience

1/21

Page 2

Why consider stochasticity?

1) the system is inherently stochastic

2) the true system is actually deterministic, but

a) the system is described at a level of abstraction/simplification which makes the model approximate and stochastic
b) sensors/observations are noisy, we never know the exact state
c) we can handle only a part of the whole system
– partial knowledge → uncertainty
– decomposed planning; factored state representation

[figure: diagram of two agents interacting with the world]

• probabilities are a tool to represent information and uncertainty
– there are many sources of uncertainty

2/21

Page 3

Machine Learning models of stochastic processes

• Markov Processes: defined by random variables $x_0, x_1, \dots$ and transition probabilities $P(x_{t+1} \mid x_t)$

[figure: graphical model of the chain $x_0 \rightarrow x_1 \rightarrow x_2$]

• non-Markovian Processes
– higher-order Markov Processes, autoregression models
– structured models (hierarchical, grammars, text models)
– Gaussian processes (both discrete and continuous time)
– etc.

• continuous-time processes
– stochastic differential equations

3/21

Page 4

Markov Decision Processes

• a Markov Process on the random variables of states $x_t$, actions $a_t$, and rewards $r_t$

[figure: dynamic Bayesian network of the MDP with states $x_0, x_1, x_2$, actions $a_0, a_1, a_2$, rewards $r_0, r_1, r_2$, and policy $\pi$]

$P(x_{t+1} \mid a_t, x_t)$   transition probability   (1)

$P(r_t \mid a_t, x_t)$   reward probability   (2)

$P(a_t \mid x_t) = \pi(a_t \mid x_t)$   policy   (3)

• we will assume stationarity, i.e. no explicit dependency on time
– $P(x' \mid a, x)$ and $P(r \mid a, x)$ are invariable properties of the world
– the policy $\pi$ is a property of the agent
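To ground the algorithms on the later slides, here is a minimal tabular encoding of such a stationary MDP. The array layout (P[a, x, x'] for transitions, R[a, x] for expected rewards) and the example numbers are illustrative assumptions, not part of the slides.

```python
import numpy as np

# A tiny tabular MDP: 2 states, 2 actions (numbers are purely illustrative).
n_states, n_actions = 2, 2

# P[a, x, x'] = P(x' | a, x); each row over x' sums to 1.
P = np.array([
    [[0.9, 0.1],    # action 0 taken in state 0
     [0.2, 0.8]],   # action 0 taken in state 1
    [[0.5, 0.5],    # action 1 taken in state 0
     [0.0, 1.0]],   # action 1 taken in state 1
])

# R[a, x] = E[r | a, x], the expected immediate reward.
R = np.array([
    [0.0, 1.0],     # action 0 in states 0, 1
    [0.5, 0.0],     # action 1 in states 0, 1
])

gamma = 0.9  # discount factor
```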

4/21

Page 5

optimal policies

• value (expected discounted return) of policy $\pi$ when started in $x$:

$$V^\pi(x) = E\{r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x; \pi\}$$

(cf. the cost function $C(x_0, a_{0:T}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, a_t)$)

• optimal value function:

$$V^*(x) = \max_\pi V^\pi(x)$$

• a policy $\pi^*$ is optimal iff

$$\forall x:\ V^{\pi^*}(x) = V^*(x)$$

(simultaneously maximizing the value in all states)

• There always exists (at least one) optimal deterministic policy!
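For a fixed deterministic policy on a tabular MDP, $V^\pi$ can be computed exactly by solving the linear system $V^\pi = R^\pi + \gamma P^\pi V^\pi$. A minimal sketch, assuming the hypothetical P[a, x, x'] / R[a, x] array layout introduced earlier:

```python
import numpy as np

def policy_value(P, R, gamma, policy):
    """Exact V^pi for a deterministic tabular policy.

    P[a, x, x'] = P(x' | a, x), R[a, x] = E[r | a, x],
    policy[x] = action chosen in state x.
    Solves V = R_pi + gamma * P_pi V as a linear system.
    """
    n_states = P.shape[1]
    states = np.arange(n_states)
    P_pi = P[policy, states, :]   # P_pi[x, x'] = P(x' | pi(x), x)
    R_pi = R[policy, states]      # R_pi[x]     = E[r | pi(x), x]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
```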

5/21

Page 6

Bellman optimality equation

$$
\begin{aligned}
V^\pi(x) &= E\{r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x; \pi\} \\
&= E\{r_0 \mid x_0 = x; \pi\} + \gamma\, E\{r_1 + \gamma r_2 + \cdots \mid x_0 = x; \pi\} \\
&= R(\pi(x), x) + \gamma \textstyle\sum_{x'} P(x' \mid \pi(x), x)\; E\{r_1 + \gamma r_2 + \cdots \mid x_1 = x'; \pi\} \\
&= R(\pi(x), x) + \gamma \textstyle\sum_{x'} P(x' \mid \pi(x), x)\; V^\pi(x')
\end{aligned}
$$

• Bellman optimality equation

$$V^*(x) = \max_a \Big[ R(a, x) + \gamma \textstyle\sum_{x'} P(x' \mid a, x)\, V^*(x') \Big]$$

$$\pi^*(x) = \operatorname*{argmax}_a \Big[ R(a, x) + \gamma \textstyle\sum_{x'} P(x' \mid a, x)\, V^*(x') \Big]$$

(if $\pi$ selected an action other than $\operatorname{argmax}_a[\cdot]$, it would not be optimal: the policy $\pi'$ that equals $\pi$ everywhere except $\pi'(x) = \operatorname{argmax}_a[\cdot]$ would be better)

• this is the principle of optimality in the stochastic case (related to Viterbi, max-product algorithm)

6/21

Page 7

Dynamic Programming

• Bellman optimality equation

$$V^*(x) = \max_a \Big[ R(a, x) + \gamma \textstyle\sum_{x'} P(x' \mid a, x)\, V^*(x') \Big]$$

• Value Iteration (initialize $V_0(x) = 0$, iterate $k = 0, 1, \dots$)

$$\forall x:\ V_{k+1}(x) = \max_a \Big[ R(a, x) + \gamma \textstyle\sum_{x'} P(x' \mid a, x)\, V_k(x') \Big]$$

– stopping criterion: $\max_x |V_{k+1}(x) - V_k(x)| \le \epsilon$
(see script for proof of convergence)

• once it has converged, choose the policy

$$\pi_k(x) = \operatorname*{argmax}_a \Big[ R(a, x) + \gamma \textstyle\sum_{x'} P(x' \mid a, x)\, V_k(x') \Big]$$
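A compact sketch of value iteration for the tabular case, again assuming the hypothetical P[a, x, x'] / R[a, x] layout; the stopping criterion is the one from the slide.

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """V_{k+1}(x) = max_a [ R(a,x) + gamma * sum_x' P(x'|a,x) V_k(x') ]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, x] = R(a, x) + gamma * sum_x' P(x'|a,x) V(x')
        Q = R + gamma * (P @ V)       # P @ V has shape (n_actions, n_states)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) <= eps:
            break
        V = V_new
    policy = Q.argmax(axis=0)         # greedy policy w.r.t. the converged values
    return V_new, policy
```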

7/21

Page 8

maze example

• a typical example of a value function in navigation

[online demo – or switch to Terran Lane's lecture...]

8/21

Page 9

comments

• Bellman’s principle of optimality is the core of the methods

• it refers to the recursive reasoning about what makes a path optimal
– the recursive property of the optimal value function

• related to Viterbi, max-product algorithm

9/21

Page 10

Learning from experience

• Reinforcement Learning problem: the models $P(x' \mid a, x)$ and $P(r \mid a, x)$ are not known; only exploration (interaction with the world) is allowed

[figure: overview diagram of RL approaches – experience $\{x_t, a_t, r_t\}$ feeds model learning (yielding an MDP model $P, R$), which Dynamic Programming turns into a value/Q-function $V, Q$, from which a policy update yields the policy; TD learning and Q-learning go directly from experience to $V, Q$; policy search and EM-based policy optimization go directly from experience to the policy]

10/21

Page 11

Model learning

• trivial on a direct discrete representation: use the experience data to estimate the model (see the sketch below)

$$\hat P(x' \mid a, x) \propto \#(x' \leftarrow x \mid a)$$

– for non-direct representations: Machine Learning methods

• use DP to compute the optimal policy for the estimated model

• Exploration-Exploitation is not a Dilemma – possible solutions: E3 algorithm, Bayesian RL (see later)
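A minimal sketch of this count-based estimate from experience tuples (x, a, r, x'); the uniform fallback for unvisited state-action pairs is an arbitrary illustrative choice. The resulting estimates can then be handed to the value-iteration sketch above.

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate P_hat(x'|a,x) and R_hat(a,x) from a list of (x, a, r, x') tuples."""
    counts = np.zeros((n_actions, n_states, n_states))
    r_sum = np.zeros((n_actions, n_states))
    for x, a, r, x_next in transitions:
        counts[a, x, x_next] += 1
        r_sum[a, x] += r
    visits = counts.sum(axis=2)                       # #(x, a) visit counts
    P_hat = np.full((n_actions, n_states, n_states), 1.0 / n_states)
    R_hat = np.zeros((n_actions, n_states))
    for a in range(n_actions):
        for x in range(n_states):
            if visits[a, x] > 0:
                P_hat[a, x] = counts[a, x] / visits[a, x]
                R_hat[a, x] = r_sum[a, x] / visits[a, x]
    return P_hat, R_hat
```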

11/21

Page 12

Temporal Difference

• recall Value Iteration

$$\forall x:\ V_{k+1}(x) = \max_a \Big[ R(a, x) + \gamma \textstyle\sum_{x'} P(x' \mid a, x)\, V_k(x') \Big]$$

• Temporal Difference learning (TD): given experience $(x_t, a_t, r_t, x_{t+1})$

$$
\begin{aligned}
V_{\text{new}}(x_t) &= (1-\alpha)\, V_{\text{old}}(x_t) + \alpha\, [r_t + \gamma V_{\text{old}}(x_{t+1})] \\
&= V_{\text{old}}(x_t) + \alpha\, [r_t + \gamma V_{\text{old}}(x_{t+1}) - V_{\text{old}}(x_t)]\,.
\end{aligned}
$$

• TD is a stochastic variant of Dynamic Programming → one can prove convergence with probability 1 (see Q-learning in the script)

• reinforcement:
– more reward than expected ($r_t + \gamma V_{\text{old}}(x_{t+1}) > V_{\text{old}}(x_t)$) → increase $V(x_t)$
– less reward than expected ($r_t + \gamma V_{\text{old}}(x_{t+1}) < V_{\text{old}}(x_t)$) → decrease $V(x_t)$
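A sketch of this tabular TD(0) update; the environment interface in the usage comment is hypothetical.

```python
import numpy as np

def td0_update(V, x, r, x_next, alpha, gamma):
    """One TD(0) step: V(x) <- V(x) + alpha * [r + gamma * V(x') - V(x)]."""
    td_error = r + gamma * V[x_next] - V[x]
    V[x] += alpha * td_error
    return V

# usage (hypothetical environment `env` with reset()/step(a), fixed policy `pi`):
#   x = env.reset()
#   while not done:
#       x_next, r, done = env.step(pi(x))
#       td0_update(V, x, r, x_next, alpha=0.1, gamma=0.9)
#       x = x_next
```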

12/21

Page 13

Q-learning convergence with prob 1

• Q-learning

$$Q^\pi(a, x) = E\{r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x, a_0 = a; \pi\}$$

$$Q^*(a, x) = R(a, x) + \gamma \textstyle\sum_{x'} P(x' \mid a, x)\, \max_{a'} Q^*(a', x')$$

$$\forall a, x:\ Q_{k+1}(a, x) = R(a, x) + \gamma \textstyle\sum_{x'} P(x' \mid a, x)\, \max_{a'} Q_k(a', x')$$

$$Q_{\text{new}}(x_t, a_t) = (1-\alpha)\, Q_{\text{old}}(x_t, a_t) + \alpha\, [r_t + \gamma \max_a Q_{\text{old}}(x_{t+1}, a)]$$

• Q-learning is a stochastic approximation of Q-value iteration:
– Q-VI is deterministic: $Q_{k+1} = T(Q_k)$
– Q-learning is stochastic: $Q_{k+1} = (1-\alpha) Q_k + \alpha [T(Q_k) + \eta_k]$
– $\eta_k$ has zero mean!
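The corresponding tabular Q-learning update; the ε-greedy action selection is a common exploration heuristic added here for illustration, not something specified on the slide. Note the code indexes the table as Q[x, a] while the slides write Q(a, x).

```python
import numpy as np

def q_update(Q, x, a, r, x_next, alpha, gamma):
    """Q(x,a) <- (1-alpha) Q(x,a) + alpha * [r + gamma * max_a' Q(x',a')]."""
    target = r + gamma * Q[x_next].max()
    Q[x, a] += alpha * (target - Q[x, a])
    return Q

def epsilon_greedy(Q, x, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(Q[x].argmax())
```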

13/21

Page 14

Q-learning impact

• Q-Learning (Watkins, 1988) is the first provably convergent direct adaptive optimal control algorithm

• Great impact on the field of Reinforcement Learning
– smaller representation than models
– automatically focuses attention on where it is needed, i.e., no sweeps through the state space
– though it does not solve the exploration-versus-exploitation issue
– ε-greedy, optimistic initialization, etc.

14/21

Page 15

Eligibility traces

• Temporal Difference:

$$V_{\text{new}}(x_0) = V_{\text{old}}(x_0) + \alpha\, [r_0 + \gamma V_{\text{old}}(x_1) - V_{\text{old}}(x_0)]$$

• longer reward sequence: $r_0\ r_1\ r_2\ r_3\ r_4\ r_5\ r_6\ r_7$

temporal credit assignment: think further backwards, receiving $r_3$ also tells us something about $V(x_0)$

$$V_{\text{new}}(x_0) = V_{\text{old}}(x_0) + \alpha\, [r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V_{\text{old}}(x_3) - V_{\text{old}}(x_0)]$$

• online implementation: remember where you've been recently (the "eligibility trace") and update those values as well:

$$e(x_t) \leftarrow e(x_t) + 1$$

$$\forall x:\ V_{\text{new}}(x) = V_{\text{old}}(x) + \alpha\, e(x)\, [r_t + \gamma V_{\text{old}}(x_{t+1}) - V_{\text{old}}(x_t)]$$

$$\forall x:\ e(x) \leftarrow \gamma \lambda\, e(x)$$
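A sketch of this online TD(λ) update with an accumulating trace, transcribing the three update rules above.

```python
import numpy as np

def td_lambda_step(V, e, x, r, x_next, alpha, gamma, lam):
    """One online TD(lambda) step with an accumulating eligibility trace.

    e(x_t) <- e(x_t) + 1
    for all x: V(x) <- V(x) + alpha * e(x) * [r + gamma * V(x') - V(x_t)]
    for all x: e(x) <- gamma * lambda * e(x)
    """
    e[x] += 1.0
    td_error = r + gamma * V[x_next] - V[x]
    V += alpha * e * td_error
    e *= gamma * lam
    return V, e
```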

• core topic of the Sutton & Barto book
– a great improvement

15/21

Page 16

comments

• again, Bellman’s principle of optimality is the core of the methods

– TD(λ), Q-learning, and eligibility traces are all methods that converge to a function obeying the Bellman optimality equation

16/21

Page 17

E3: Explicit Explore or Exploit

• (John Langford) from observed data construct two MDPs:

(1) MDP_known includes the sufficiently often visited states and executed actions, with (rather exact) estimates of P and R
(a model which captures what you know)

(2) MDP_unknown = MDP_known, except that the reward is 1 for all actions which leave the known states and 0 otherwise
(a model which captures the optimism of exploration)

• the algorithm (a simplified skeleton follows below):

(1) If the last x is not in Known: choose the least previously used action

(2) Else:
(a) [seek exploration] if V_unknown > ε, act according to V_unknown until the state is unknown (or t mod T = 0), then go to (1)
(b) [exploit] else act according to V_known
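A highly simplified skeleton of this case distinction; the data structures, the threshold, and how the two plans are obtained (e.g. by running value iteration on MDP_known and MDP_unknown) are assumptions for illustration, not the full E3 algorithm.

```python
import numpy as np

def e3_choose_action(x, known, action_counts, pi_explore, pi_exploit, V_explore, eps):
    """One decision step of a simplified E3-style loop.

    known:         set of sufficiently often visited ("Known") states
    action_counts: action_counts[x, a] = how often action a was tried in state x
    pi_explore:    greedy policy computed on MDP_unknown (reward 1 for leaving Known)
    pi_exploit:    greedy policy computed on MDP_known
    V_explore:     value of pi_explore in MDP_unknown
    eps:           exploration threshold
    """
    if x not in known:
        # balanced wandering: take the least previously used action
        return int(np.argmin(action_counts[x]))
    if V_explore[x] > eps:
        # attempted exploration: escaping the known part still looks promising
        return int(pi_explore[x])
    # otherwise exploit the (rather exact) model of the known part
    return int(pi_exploit[x])
```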

17/21

Page 18

E3 – Theory

• for any (unknown) MDP:
– the total number of actions and the computation time required by E3 are $\text{poly}(|X|, |A|, T^*, \tfrac{1}{\epsilon}, \ln\tfrac{1}{\delta})$
– performance guarantee: with probability at least $(1-\delta)$, the expected return of E3 will exceed $V^* - \epsilon$

• details
– actual return: $\tfrac{1}{T} \sum_{t=1}^{T} r_t$
– let $T^*$ denote the (unknown) mixing time of the MDP
– one key insight: even the optimal policy will take time $O(T^*)$ to achieve an actual return that is near-optimal

• straightforward & intuitive approach!
– the exploration-exploitation dilemma is not a dilemma!
– cf. active learning, information seeking, curiosity, variance analysis

18/21

Page 19

Bayesian Reinforcement Learning

• initially, we don't know the MDP
– but based on experience we can estimate the MDP

• parametrize the MDP by a parameter $\theta$
(e.g., direct parametrization: $P(x' \mid a, x) = \theta_{x'ax}$)
– given experience $x_0 a_0\, x_1 a_1 \cdots$ we can estimate the posterior

$$b(\theta) = P(\theta \mid x_{0:t}, a_{0:t})$$

• given a “posterior belief $b$ about the world”, plan to maximize the policy value in this distribution of worlds

$$V^*(x, b) = \max_a \Big[ R(a, x) + \gamma \textstyle\sum_{x'} \int_\theta P(x' \mid a, x; \theta)\, b(\theta)\, d\theta \; V^*(x', b') \Big]$$
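For the direct parametrization $P(x' \mid a, x) = \theta_{x'ax}$, a Dirichlet prior per state-action pair gives a closed-form posterior from transition counts; choosing a Dirichlet here is a standard modelling assumption, not stated on the slide.

```python
import numpy as np

class DirichletBelief:
    """Posterior b(theta) over transition parameters: one Dirichlet per (a, x)."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[a, x, x'] are the Dirichlet concentration parameters
        self.alpha = np.full((n_actions, n_states, n_states), prior)

    def update(self, x, a, x_next):
        """Bayesian update after observing the transition x --a--> x'."""
        self.alpha[a, x, x_next] += 1.0

    def mean_model(self):
        """Posterior-mean transition model E[theta | data]."""
        return self.alpha / self.alpha.sum(axis=2, keepdims=True)
```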

(see last year’s ICML tutorial, old theory from Operations Research)

19/21

Page 20

comments

• there is no fundamentally unsolved exploration-exploitation dilemma!

– but an efficiency issue

20/21

Page 21

further topics

• function approximation:
– Laplacian eigenfunctions as a value-function representation (Mahadevan)
– Gaussian processes (Carl Rasmussen, Yaakov Engel)

• representations:
– macro (hierarchical) policies, abstractions, options (Precup, Sutton, Singh, Dietterich, Parr)
– predictive state representations (PSRs; Littman, Sutton, Singh)

• partial observability (POMDPs)

21/21

Page 22

• we conclude with a demo from Andrew Ng
– helicopter flight
– actions: standard remote control
– reward functions: hand-designed for specific tasks

[rolls] [inverted flight]

22/21

Page 23

appendix: discrete time continuous state control

• same framework as MDPs, different conventional notation:

  control                                                    MDPs
  u_t  control                                               a_t  action
  x_{t+1} = x_t + f(t, x_t, u_t) + ξ_t   (discrete system)
  dx = f(t, x, u) dt + dξ                (continuous system)  P(x_{t+1} | a_t, x_t)  transition prob.
  φ(x_T)  final cost;  R(t, x_t, u_t)  local cost            P(r_t | a_t, x_t)  reward prob.
  C(x_0, u_{0:T})  expected trajectory cost                  V^π(x)  policy value
  J(t, x)  optimal cost-to-go                                V^*(x)  optimal value function

• discrete time stochastic controlled system

$$x_{t+1} = f(x_t, u_t) + \xi\,, \qquad \xi \sim N(0, Q)$$

$$P(x_{t+1} \mid u_t, x_t) = N(x_{t+1} \mid f(x_t, u_t),\ Q)$$

• the objective is to minimize the expectation of the cost

$$C(x_{0:T}, u_{0:T}) = \sum_{t=0}^{T} R(t, x_t, u_t)\,.$$

23/21

Page 24

appendix: discrete time continuous state control

• just as in the MDP case, the value function obeys the Bellman optimality equation

$$J_t(x) = \min_u \Big[ R(t, x, u) + \int_{x'} P(x' \mid u, x)\, J_{t+1}(x')\, dx' \Big]$$

• two types of optimal control problems:

open-loop: find a control sequence $u^*_{1:T}$ that minimizes the expected cost

closed-loop: find a control law $\pi^*: (t, x) \mapsto u_t$ (which exploits the true state observation in each time step and maps it to a feedback control signal) that minimizes the expected cost

24/21

Page 25

appendix: Linear-quadratic-Gaussian (LQG) case

• consider a linear control process with Gaussian noise and quadratic costs,

$$P(x_t \mid x_{t-1}, u_t) = N(x_t \mid A x_{t-1} + B u_t,\ Q)\,,$$

$$C(x_{1:T}, u_{1:T}) = \sum_{t=1}^{T} x_t^\top R\, x_t + u_t^\top H\, u_t\,.$$

• assume we know the exact cost-to-go $J_t(x)$ at time $t$ and that it has the form $J_t(x) = x^\top V_t x$. Then,

$$
\begin{aligned}
J_{t-1}(x) &= \min_u \Big[ x^\top R x + u^\top H u + \int_y N(y \mid Ax + Bu,\ Q)\; y^\top V_t\, y \; dy \Big] \\
&= \min_u \Big[ x^\top R x + u^\top H u + (Ax + Bu)^\top V_t (Ax + Bu) + \mathrm{tr}(V_t Q) \Big] \\
&= \min_u \Big[ x^\top R x + u^\top (H + B^\top V_t B)\, u + 2 u^\top B^\top V_t A x + x^\top A^\top V_t A x + \mathrm{tr}(V_t Q) \Big]
\end{aligned}
$$

minimization yields

$$0 = 2 (H + B^\top V_t B)\, u^* + 2 B^\top V_t A x \;\Rightarrow\; u^* = -(H + B^\top V_t B)^{-1} B^\top V_t A\, x$$

$$J_{t-1}(x) = x^\top V_{t-1} x\,, \qquad V_{t-1} = R + A^\top V_t A - A^\top V_t B\, (H + B^\top V_t B)^{-1} B^\top V_t A$$

• this is called the Riccati equation

25/21
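The backward Riccati recursion from the last line as a short numerical sketch, in the slide's notation (R is the state cost, H the control cost); the horizon handling and the terminal condition are assumptions. The noise covariance Q only adds the constant tr(V_t Q) to the cost-to-go and does not affect the gains.

```python
import numpy as np

def lqr_backward(A, B, R, H, T, V_T=None):
    """Backward Riccati recursion (slide notation):

    V_{t-1} = R + A^T V_t A - A^T V_t B (H + B^T V_t B)^{-1} B^T V_t A
    with feedback  u*_t = -K_t x,  K_t = (H + B^T V_t B)^{-1} B^T V_t A.
    """
    n = A.shape[0]
    V = np.zeros((n, n)) if V_T is None else V_T   # terminal cost-to-go (assumed zero)
    gains = []
    for _ in range(T):
        S = H + B.T @ V @ B
        K = np.linalg.solve(S, B.T @ V @ A)        # K_t
        V = R + A.T @ V @ A - A.T @ V @ B @ K      # V_{t-1}
        gains.append(K)
    return V, gains[::-1]                          # V_0 and the gains in time order
```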