10703 Deep Reinforcement Learning and Control, Russ Salakhutdinov. Solving known MDPs: Dynamic Programming. Slides borrowed from Katerina Fragkiadaki.


Page 1

10703 Deep Reinforcement Learning and Control

Russ Salakhutdinov

Solving known MDPs: Dynamic Programming

Slides borrowed from Katerina Fragkiadaki

Page 2

A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$

•  $\mathcal{S}$ is a finite set of states

•  $\mathcal{A}$ is a finite set of actions

•  $T(s' \mid s, a)$ is a state transition probability function

•  $r(s, a)$ is a reward function

•  $\gamma \in [0, 1]$ is a discount factor

Markov Decision Process (MDP)!

Page 3

Solving MDPs!

•  Prediction: given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$ and a policy $\pi$, find the state and action value functions $v_\pi$ and $q_\pi$.

•  Optimal control: given an MDP, find the optimal policy $\pi_*$ (aka the planning problem).

•  Compare this with the learning problem, where information about the rewards/dynamics is missing.

•  We still consider finite MDPs (finite $\mathcal{S}$ and $\mathcal{A}$) with known dynamics.

Page 4

Outline!

•  Policy evaluation

•  Policy iteration

•  Value iteration

•  Asynchronous DP

Page 5

Policy Evaluation!

Policy evaluation: for a given policy $\pi$, compute the state value function $v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$,

where $v_\pi$ is implicitly given by the Bellman expectation equation

$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v_\pi(s') \right),$

a system of $|\mathcal{S}|$ simultaneous linear equations.

Page 6

MDPs to MRPs!

MDP under a fixed policy becomes Markov Reward Process (MRP)

where

Page 7

Back Up Diagram!

MDP

Page 8

Back Up Diagram!

MDP

Page 9

Matrix Form!

The Bellman expectation equation can be written concisely using the induced MRP:

$v_\pi = r^\pi + \gamma P^\pi v_\pi,$

with direct solution

$v_\pi = (I - \gamma P^\pi)^{-1} r^\pi$

of complexity $O(|\mathcal{S}|^3)$ (see the sketch below).
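As a concrete illustration of the direct solution, here is a minimal NumPy sketch; the names P_pi, r_pi, and gamma are illustrative, not notation taken from the slides.

```python
import numpy as np

def evaluate_policy_direct(P_pi, r_pi, gamma):
    """Solve the linear system v = r^pi + gamma * P^pi v exactly.

    P_pi : (n, n) state-transition matrix of the induced MRP
    r_pi : (n,) expected immediate reward per state under the policy
    Returns v_pi at O(n^3) cost.
    """
    n = len(r_pi)
    # Solve (I - gamma * P^pi) v = r^pi rather than forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```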

Page 10

Iterative Methods: Recall the Bellman Equation!

$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v_\pi(s') \right)$

Page 11

Iterative Methods: Backup Operation!

Given a value function estimate $v_k$ at iteration k, we back it up to obtain the estimate $v_{k+1}$ at iteration k+1:

$v_{k+1}(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a|s) \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v_k(s') \right)$

Page 12

A sweep consists of applying the backup operation to every state in $\mathcal{S}$.

Applying the backup operator iteratively gives the sequence $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_\pi$ (a code sketch follows).

Iterative Methods: Sweep!
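A minimal sketch of iterative policy evaluation by synchronous sweeps, assuming a transition tensor T[s, a, s'], a reward table r[s, a], and a policy table pi[s, a]; these array names and the tolerance are illustrative assumptions.

```python
import numpy as np

def policy_evaluation(T, r, pi, gamma, tol=1e-8):
    """Iterative policy evaluation: repeated synchronous sweeps of the Bellman backup.

    T  : (n_states, n_actions, n_states) transition probabilities T(s'|s,a)
    r  : (n_states, n_actions) expected immediate rewards r(s,a)
    pi : (n_states, n_actions) policy probabilities pi(a|s)
    """
    v = np.zeros(r.shape[0])
    while True:
        # q_k(s,a) = r(s,a) + gamma * sum_s' T(s'|s,a) v_k(s')
        q = r + gamma * T @ v
        # One full sweep: v_{k+1}(s) = sum_a pi(a|s) q_k(s,a) for every state s
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```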

Page 13

A Small-Grid World!

•  An undiscounted episodic task

•  Nonterminal states: 1, 2, … , 14

•  Terminal state: one, shown as the shaded square

•  Actions that would take the agent off the grid leave the state unchanged

•  Reward is -1 until the terminal state is reached

R = -1, γ = 1 (a code sketch of this gridworld follows)
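The gridworld above can be encoded in the same tabular form and evaluated with the policy_evaluation sketch; this is an illustrative encoding that treats the shaded corner cells (states 0 and 15) as absorbing terminal cells, which is one common reading of this example.

```python
import numpy as np

def small_gridworld():
    """4x4 gridworld: corner cells 0 and 15 terminal, reward -1 per step elsewhere."""
    n_states, n_actions = 16, 4                 # actions: up, down, left, right
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    T = np.zeros((n_states, n_actions, n_states))
    r = np.full((n_states, n_actions), -1.0)
    for s in range(n_states):
        if s in (0, 15):                        # terminal cells: absorbing, zero reward
            T[s, :, s] = 1.0
            r[s, :] = 0.0
            continue
        row, col = divmod(s, 4)
        for a, (dr, dc) in enumerate(moves):
            nr, nc = row + dr, col + dc
            if not (0 <= nr < 4 and 0 <= nc < 4):
                nr, nc = row, col               # off-grid moves leave the state unchanged
            T[s, a, 4 * nr + nc] = 1.0
    return T, r

T, r = small_gridworld()
pi = np.full((16, 4), 0.25)                     # equiprobable random policy
v = policy_evaluation(T, r, pi, gamma=1.0)      # undiscounted episodic evaluation
```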

Page 14

•  An undiscounted episodic task

•  Nonterminal states: 1, 2, …, 14

•  Terminal state: one, shown as the shaded square

•  Actions that would take the agent off the grid leave the state unchanged

•  Reward is -1 until the terminal state is reached

Policy π: the equiprobable random policy

Iterative Policy Evaluation for the random policy

Page 15

Iterative Policy Evaluation for the random policy (continued)

Page 16

Iterative Policy Evaluation for the random policy (continued)

Page 17

An operator $F$ on a normed vector space $\mathcal{X}$ is a $\gamma$-contraction, for $0 < \gamma < 1$, provided $\lVert F(x) - F(y) \rVert \le \gamma \lVert x - y \rVert$ for all $x, y \in \mathcal{X}$.

Contraction Mapping Theorem!

Page 18

An operator $F$ on a normed vector space $\mathcal{X}$ is a $\gamma$-contraction, for $0 < \gamma < 1$, provided $\lVert F(x) - F(y) \rVert \le \gamma \lVert x - y \rVert$ for all $x, y \in \mathcal{X}$.

Theorem (Contraction Mapping). For a $\gamma$-contraction $F$ in a complete normed vector space $\mathcal{X}$:

•  iterated application of $F$ converges to a unique fixed point in $\mathcal{X}$

•  at a linear convergence rate determined by $\gamma$

Remark: in general we only need a complete metric (vs. normed) space.

Contraction Mapping Theorem!

Page 19

Value Function Space!

•  Consider the vector space $\mathcal{V}$ of value functions

•  There are $|\mathcal{S}|$ dimensions

•  Each point in this space fully specifies a value function $v(s)$

•  We will show that the Bellman backup brings value functions closer in this space

•  And therefore the backup must converge to a unique solution

Page 20

Value Function ∞-Norm!

•  We will measure the distance between state-value functions $u$ and $v$ by the $\infty$-norm,

•  i.e. the largest difference between state values: $\lVert u - v \rVert_\infty = \max_{s \in \mathcal{S}} |u(s) - v(s)|$

Page 21

Bellman Expectation Backup is a Contraction!

•  Define the Bellman expectation backup operator $T^\pi$,

$T^\pi(v) = r^\pi + \gamma P^\pi v$

•  This operator is a $\gamma$-contraction, i.e. it brings value functions closer by a factor of at least $\gamma$ (see the derivation below):

$\lVert T^\pi(u) - T^\pi(v) \rVert_\infty \le \gamma \, \lVert u - v \rVert_\infty$
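The contraction step is a one-line argument in the ∞-norm; a sketch of the standard proof, written in the matrix notation of the earlier slide:

```latex
\begin{aligned}
\lVert T^{\pi}(u) - T^{\pi}(v) \rVert_\infty
  &= \lVert (r^{\pi} + \gamma P^{\pi} u) - (r^{\pi} + \gamma P^{\pi} v) \rVert_\infty \\
  &= \gamma \, \lVert P^{\pi} (u - v) \rVert_\infty \\
  &\le \gamma \, \lVert u - v \rVert_\infty ,
\end{aligned}
```

where the last step uses that $P^{\pi}$ is row-stochastic, so $\lVert P^{\pi} x \rVert_\infty \le \lVert x \rVert_\infty$.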

Page 22

Convergence of Iter. Policy Evaluation and Policy Iteration!

•  The Bellman expectation backup operator $T^\pi$ has a unique fixed point

•  $v_\pi$ is a fixed point of $T^\pi$ (by the Bellman expectation equation)

•  By the contraction mapping theorem,

•  iterative policy evaluation converges to $v_\pi$

Page 23

Policy Improvement!

•  Suppose we have computed $v_\pi$ for a deterministic policy $\pi$

•  For a given state $s$, would it be better to take an action $a \ne \pi(s)$?

•  It is better to switch to action $a$ in state $s$ if and only if $q_\pi(s, a) > v_\pi(s)$

•  And we can compute $q_\pi(s, a)$ from $v_\pi$ by:

$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v_\pi(s')$

Page 24

Policy Improvement! (continued)

Page 25

Policy Improvement Cont.!

•  Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:

$\pi'(s) = \arg\max_{a} q_\pi(s, a) = \arg\max_{a} \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$

•  What if the policy $\pi'$ is unchanged from $\pi$?

•  Then the policy must be optimal (a code sketch of this greedification step follows).
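A minimal sketch of the greedification step, reusing the assumed arrays T and r from the earlier sketches:

```python
import numpy as np

def greedy_policy(T, r, v, gamma):
    """Return a deterministic policy that is greedy with respect to v.

    q(s,a) = r(s,a) + gamma * sum_s' T(s'|s,a) v(s'); pick argmax_a in every state.
    """
    q = r + gamma * T @ v             # shape (n_states, n_actions)
    return np.argmax(q, axis=1)       # pi'(s) = argmax_a q_pi(s, a)
```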

Page 26

Policy Iteration!

Alternate policy evaluation and policy improvement (“greedification”).

Page 27

Policy Iteration!
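A minimal sketch of the full policy iteration loop, built from the policy_evaluation and greedy_policy helpers above; the initialization and stopping test are illustrative assumptions.

```python
import numpy as np

def policy_iteration(T, r, gamma):
    """Alternate full policy evaluation with greedy policy improvement."""
    n_states, n_actions = r.shape
    policy = np.zeros(n_states, dtype=int)         # arbitrary initial deterministic policy
    while True:
        pi = np.eye(n_actions)[policy]             # one-hot table pi(a|s) for evaluation
        v = policy_evaluation(T, r, pi, gamma)     # policy evaluation
        new_policy = greedy_policy(T, r, v, gamma) # policy improvement ("greedification")
        if np.array_equal(new_policy, policy):     # unchanged => greedy w.r.t. its own value => optimal
            return policy, v
        policy = new_policy
```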

Page 28

•  An undiscounted episodic task

•  Nonterminal states: 1, 2, …, 14

•  Terminal state: one, shown as the shaded square

•  Actions that take the agent off the grid leave the state unchanged

•  Reward is -1 until the terminal state is reached

Iterative Policy Evaluation for the Small Gridworld

R = -1, γ = 1

Policy π: the equiprobable random policy

Page 29

Iterative Policy Evaluation for the Small Gridworld (continued)

Page 30

•  Does policy evaluation need to converge to $v_\pi$?

•  Or should we introduce a stopping condition,

•  e.g. $\epsilon$-convergence of the value function?

•  Or simply stop after k iterations of iterative policy evaluation?

•  For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy

•  Why not update the policy every iteration, i.e. stop after k = 1?

•  This is equivalent to value iteration (next section)

Generalized Policy Iteration!

Page 31

Generalized Policy Iteration!

Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, independent of their granularity.

A geometric metaphor for convergence of GPI:

Page 32

Principle of Optimality!

•  Any optimal policy can be subdivided into two components:

•  An optimal first action

•  Followed by an optimal policy from successor state

•  Theorem (Principle of Optimality)

•  A policy $\pi$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if

•  for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$

Page 33

•  If we know the solution to the subproblems $v_*(s')$

•  then the solution $v_*(s)$ can be found by one-step lookahead

•  The idea of value iteration is to apply these updates iteratively

•  Intuition: start with final rewards and work backwards

•  Still works with loopy, stochastic MDPs

Value Iteration!

$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v_*(s') \right)$

Page 34

Example: Shortest Path!

(Example from Lecture 3: Planning by Dynamic Programming, Value Iteration in MDPs.)

Shortest-path gridworld: a 4×4 grid with the goal state g in the top-left corner and reward -1 per step. Value iteration sweeps V1 through V7, shown row by row:

Problem: g . . . / . . . . / . . . . / . . . .
V1:  0  0  0  0 /  0  0  0  0 /  0  0  0  0 /  0  0  0  0
V2:  0 -1 -1 -1 / -1 -1 -1 -1 / -1 -1 -1 -1 / -1 -1 -1 -1
V3:  0 -1 -2 -2 / -1 -2 -2 -2 / -2 -2 -2 -2 / -2 -2 -2 -2
V4:  0 -1 -2 -3 / -1 -2 -3 -3 / -2 -3 -3 -3 / -3 -3 -3 -3
V5:  0 -1 -2 -3 / -1 -2 -3 -4 / -2 -3 -4 -4 / -3 -4 -4 -4
V6:  0 -1 -2 -3 / -1 -2 -3 -4 / -2 -3 -4 -5 / -3 -4 -5 -5
V7:  0 -1 -2 -3 / -1 -2 -3 -4 / -2 -3 -4 -5 / -3 -4 -5 -6

Page 35

Value Iteration!

•  Problem: find the optimal policy $\pi_*$

•  Solution: iterative application of the Bellman optimality backup

•  $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_*$

•  Using synchronous backups:

•  At each iteration k + 1

•  For all states $s \in \mathcal{S}$

•  Update $v_{k+1}(s)$ from $v_k(s')$

Page 36

Value Iteration (2)!

$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v_k(s') \right)$
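The same update as a synchronous NumPy sweep, in a minimal sketch with the same assumed T and r arrays as before:

```python
import numpy as np

def value_iteration(T, r, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup over all states until convergence."""
    v = np.zeros(r.shape[0])
    while True:
        # v_{k+1}(s) = max_a ( r(s,a) + gamma * sum_s' T(s'|s,a) v_k(s') )
        v_new = np.max(r + gamma * T @ v, axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            policy = np.argmax(r + gamma * T @ v_new, axis=1)  # greedy policy from the converged values
            return v_new, policy
        v = v_new
```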

Page 37

Bellman Optimality Backup is a Contraction!

•  Define the Bellman optimality backup operator $T^*$,

$(T^*(v))(s) = \max_{a \in \mathcal{A}} \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v(s') \right)$

•  This operator is a $\gamma$-contraction, i.e. it brings value functions closer by a factor of at least $\gamma$ (similar to the previous proof):

$\lVert T^*(u) - T^*(v) \rVert_\infty \le \gamma \, \lVert u - v \rVert_\infty$

Page 38

Convergence of Value Iteration!

•  The Bellman optimality backup operator $T^*$ has a unique fixed point

•  $v_*$ is a fixed point of $T^*$ (by the Bellman optimality equation)

•  By the contraction mapping theorem,

•  value iteration converges to $v_*$

Page 39

•  Algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$

•  Complexity $O(m n^2)$ per iteration, for $m$ actions and $n$ states

•  Could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$, with complexity $O(m^2 n^2)$ per iteration

Synchronous Dynamic Programming Algorithms!

Problem    | Bellman Equation                                         | Algorithm
Prediction | Bellman Expectation Equation                             | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration
Control    | Bellman Optimality Equation                              | Value Iteration

Page 40

Efficiency of DP!

•  Finding an optimal policy via DP is polynomial in the number of states

•  BUT the number of states is often astronomical, e.g. it often grows exponentially with the number of state variables (what Bellman called “the curse of dimensionality”).

•  In practice, classical DP can be applied to problems with a few million states.

Page 41

Asynchronous DP!

•  All the DP methods described so far require exhaustive sweeps of the entire state set.

•  Asynchronous DP does not use sweeps. Instead it works like this:

•  Repeat until convergence criterion is met:

•  Pick a state at random and apply the appropriate backup

•  Still need lots of computation, but does not get locked into hopelessly long sweeps

•  Guaranteed to converge if all states continue to be selected

•  Can you select states to backup intelligently? YES: an agent’s experience can act as a guide.

Page 42

Asynchronous Dynamic Programming!

•  Three simple ideas for asynchronous dynamic programming:

•  In-place dynamic programming

•  Prioritized sweeping

•  Real-time dynamic programming

Page 43

•  Synchronous value iteration stores two copies of the value function: for all $s$ in $\mathcal{S}$,

$v_{\text{new}}(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v_{\text{old}}(s') \right)$, then $v_{\text{old}} \leftarrow v_{\text{new}}$

•  In-place value iteration only stores one copy of the value function (see the sketch below): for all $s$ in $\mathcal{S}$,

$v(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v(s') \right)$

In-Place Dynamic Programming!
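A sketch of the in-place variant: a single value array is overwritten state by state, so later backups within the same sweep already see the newer values (array names and the sweep limit are illustrative).

```python
import numpy as np

def in_place_value_iteration(T, r, gamma, max_sweeps=10000, tol=1e-8):
    """Asynchronous (in-place) value iteration: one value array, updated state by state."""
    n_states = r.shape[0]
    v = np.zeros(n_states)
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(n_states):
            # Backup of a single state, written straight back into v.
            new_vs = np.max(r[s] + gamma * T[s] @ v)
            delta = max(delta, abs(new_vs - v[s]))
            v[s] = new_vs
        if delta < tol:
            break
    return v
```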

Page 44

Prioritized Sweeping!

•  Use the magnitude of the Bellman error to guide state selection, e.g.

$\left| \max_{a \in \mathcal{A}} \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, v(s') \right) - v(s) \right|$

•  Back up the state with the largest remaining Bellman error

•  Requires knowledge of the reverse dynamics (predecessor states)

•  Can be implemented efficiently by maintaining a priority queue, as in the sketch below
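A sketch of prioritized sweeping with a priority queue keyed on the Bellman error; the predecessor sets encode the reverse dynamics mentioned above. Python's heapq is a min-heap, so priorities are negated, and the lazy re-push of stale entries is a simplification.

```python
import heapq
import numpy as np

def prioritized_sweeping(T, r, gamma, n_backups=10000, theta=1e-6):
    """Repeatedly back up the state with the largest Bellman error."""
    n_states = r.shape[0]
    v = np.zeros(n_states)
    # Reverse dynamics: predecessors[s2] = states s with T(s2|s,a) > 0 for some a.
    predecessors = [set(np.nonzero(T[:, :, s2].sum(axis=1))[0]) for s2 in range(n_states)]

    def bellman_error(s):
        return abs(np.max(r[s] + gamma * T[s] @ v) - v[s])

    heap = [(-bellman_error(s), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:
            break                                   # all remaining errors are below threshold
        v[s] = np.max(r[s] + gamma * T[s] @ v)      # back up the highest-error state
        for p in predecessors[s]:                   # its predecessors' errors may have changed
            heapq.heappush(heap, (-bellman_error(p), p))
    return v
```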

Page 45

Real-time Dynamic Programming!

•  Idea: update only states that are relevant to the agent

•  Use agent’s experience to guide the selection of states

•  After each time-step $(S_t, A_t, R_{t+1})$,

•  back up the state $S_t$

Page 46

Sample Backups!

•  In subsequent lectures we will consider sample backups

•  Using sample rewards and sample transitions

•  Instead of the reward function $r(s,a)$ and transition dynamics $T(s'|s,a)$

•  Advantages:

•  Model-free: no advance knowledge of MDP required

•  Breaks the curse of dimensionality through sampling

•  Cost of a backup is constant, independent of the number of states $n = |\mathcal{S}|$

Page 47

Approximate Dynamic Programming!

•  Approximate the value function $\hat{v}(s, \mathbf{w})$

•  using a function approximator with parameters $\mathbf{w}$ (e.g. a neural network)

•  Apply dynamic programming to $\hat{v}(\cdot, \mathbf{w})$

•  e.g. Fitted Value Iteration repeats at each iteration k:

•  Sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$

•  For each state $s \in \tilde{\mathcal{S}}$, estimate the target value using the Bellman optimality equation: $\tilde{v}_k(s) = \max_{a \in \mathcal{A}} \left( r(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s'|s,a)\, \hat{v}(s', \mathbf{w}_k) \right)$

•  Train the next value function $\hat{v}(\cdot, \mathbf{w}_{k+1})$ using the targets $\{\langle s, \tilde{v}_k(s) \rangle\}$ (see the sketch below)
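A minimal sketch of fitted value iteration with a linear function approximator $\hat{v}(s, \mathbf{w}) = \phi(s)^\top \mathbf{w}$ fit by ridge regression; the feature map phi, the sampling scheme, and the use of a known tabular model (T, r) for the targets are all illustrative assumptions.

```python
import numpy as np

def fitted_value_iteration(T, r, gamma, phi, n_samples=100, n_iters=100, ridge=1e-3, seed=0):
    """Fitted value iteration: sample states, compute Bellman optimality targets, refit weights."""
    rng = np.random.default_rng(seed)
    n_states = r.shape[0]
    Phi_all = np.stack([phi(s) for s in range(n_states)])    # (n_states, n_features)
    w = np.zeros(Phi_all.shape[1])
    for _ in range(n_iters):
        # Sample a subset of states S~.
        S = rng.choice(n_states, size=min(n_samples, n_states), replace=False)
        v_hat = Phi_all @ w                                   # current approximate values
        # Target v~_k(s) = max_a ( r(s,a) + gamma * sum_s' T(s'|s,a) v_hat(s') )
        targets = np.max(r[S] + gamma * T[S] @ v_hat, axis=1)
        # Train the next value function: ridge regression of targets on features.
        Phi = Phi_all[S]
        w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(Phi.shape[1]), Phi.T @ targets)
    return w
```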