Lecture 3: Planning by Dynamic Programming
Joseph Modayil
josephmodayil.com/UCLRL/DP.pdf


Page 1: Lecture 3: Planning by Dynamic Programming

Lecture 3: Planning by Dynamic Programming

Joseph Modayil

Page 2: Lecture 3: Planning by Dynamic Programming

Outline

1 Introduction

2 Policy Evaluation

3 Policy Iteration

4 Value Iteration

5 Extensions to Dynamic Programming

6 Contraction Mapping

Reference: Sutton & Barto, chapter 4

Page 3: Lecture 3: Planning by Dynamic Programming

Introduction

Motivation: Solving an MDP

[Figure: a 10×10 Snakes and Ladders board with squares numbered 1 to 100, annotated with "Action=roll".]

Consider a modified game of Snakes and Ladders.

States: squares of the board (1 = start, 100 = terminal)

Two actions:

Flip a coin (move 1 or 2)
Roll a die (move 1–6)

Reward is -1 per step, γ is 1

Transitions: move # squares, climb up ladders, slide down snakes

What is the meaning of value here?

What can happen from state 1?

What action is best in each square?
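
The one-step dynamics described above can be written down directly. Below is a rough Python sketch; the positions of the snakes and ladders are not given on the slide, so the `jumps` table is purely hypothetical, and treating square 100 as absorbing (clamping overshoots) is my own assumption.

```python
# Rough sketch of the one-step dynamics described above. The snake/ladder
# positions are NOT given on the slide; `jumps` is purely hypothetical.
N_SQUARES = 100
jumps = {4: 25, 33: 49, 42: 63, 56: 38, 71: 20}   # hypothetical ladders and snakes

def step_distribution(square, action):
    """Return {next_square: probability} for 'flip' (move 1 or 2) or 'roll' (move 1-6)."""
    moves = [1, 2] if action == "flip" else [1, 2, 3, 4, 5, 6]
    dist = {}
    for m in moves:
        nxt = min(square + m, N_SQUARES)          # assumption: clamp at the terminal square
        nxt = jumps.get(nxt, nxt)                 # climb up ladders, slide down snakes
        dist[nxt] = dist.get(nxt, 0.0) + 1.0 / len(moves)
    return dist

# Reward is -1 per step until square 100 is reached, and gamma = 1.
```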

Page 4: Lecture 3: Planning by Dynamic Programming

Introduction

Motivation: Solving an MDP (2)

[Figure: the Snakes and Ladders board annotated with the optimal value and optimal action for each square, e.g. (-14.6) flip, (-15.7) flip, (-18.1) roll, ..., improving to (-2.9) roll near the terminal square. An inset shows the possible successor squares of state 1 under Roll and Flip.]

Page 5: Lecture 3: Planning by Dynamic Programming

Introduction

What is Dynamic Programming?

Dynamic: sequential or temporal component to the problem

Programming: optimising a "program", i.e. a policy

c.f. linear programming

A method for solving complex problems

By breaking them down into subproblems

Solve the subproblems
Combine solutions to subproblems

Page 6: Lecture 3: Planning by Dynamic Programming

Introduction

Requirements for Dynamic Programming

Dynamic Programming is a very general solution method for problems which have two properties:

Optimal substructure

Optimal solution to a problem can be composed from optimal solutions to subproblems

Overlapping subproblems

Subproblems recur many times
Solutions can be cached and reused

Markov decision processes satisfy both properties

Bellman equation gives recursive decomposition
Value function stores and reuses solutions

Page 7: Lecture 3: Planning by Dynamic Programming

Introduction

Planning by Dynamic Programming

Dynamic programming assumes full knowledge of the MDP

It is used for planning in an MDP

For prediction:

Input: MDP 〈S,A,P,R, γ〉 and policy π
or: MRP 〈S,Pπ,Rπ, γ〉

Output: value function vπ

Or for control:

Input: MDP 〈S,A,P,R, γ〉
Output: optimal value function v∗

and: optimal policy π∗

Page 8: Lecture 3: Planning by Dynamic Programming

Introduction

Other Applications of Dynamic Programming

Dynamic programming is used to solve many other problems, e.g.

Scheduling algorithms

String algorithms (e.g. sequence alignment)

Graph algorithms (e.g. shortest path algorithms)

Graphical models (e.g. Viterbi algorithm)

Bioinformatics (e.g. lattice models)

Page 9: Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Iterative Policy Evaluation

Iterative Policy Evaluation

Problem: evaluate a given fixed policy π

Solution: iterative application of Bellman expectation backup

V1 → V2 → ...→ vπ

Using synchronous backups,

At each iteration k + 1
For all states s ∈ S
Update Vk+1(s) from Vk(s′)
where s′ is a successor state of s

Convergence to vπ will be proven at the end of the lecture

Page 10: Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Iterative Policy Evaluation

Iterative Policy Evaluation (2)

[Backup diagram: Vk+1(s) at the root state s; actions a and rewards r lead to successor states s′ valued at Vk(s′).]

$$V_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(s, a) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V_k(s') \right)$$
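
As a concrete illustration, here is a minimal NumPy sketch of this synchronous backup. The array layout (P[a, s, s'], R[a, s], pi[s, a]) and the function name are my own choices, not notation from the slides.

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, n_sweeps=100):
    """Synchronous Bellman expectation backups: V_1 -> V_2 -> ... -> v_pi.

    P:  (A, S, S) array, P[a, s, s'] = transition probability P^a_{ss'}
    R:  (A, S)    array, R[a, s]     = expected immediate reward R^a_s
    pi: (S, A)    array, pi[s, a]    = probability the policy picks a in s
    """
    V = np.zeros(P.shape[1])
    for _ in range(n_sweeps):
        Q = R + gamma * P @ V                    # Q[a, s]: one full backup for every (a, s)
        V = np.einsum("sa,as->s", pi, Q)         # expectation over the policy
    return V
```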

Page 11: Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Example: Small Gridworld

Small Gridworld

Undiscounted episodic MDP

γ = 1

All episodes terminate in the absorbing terminal state

Nonterminal states 1, ..., 14

One terminal state (shown twice as shaded squares)

Actions that would take the agent off the grid leave the state unchanged

Reward is -1 until the terminal state is reached

Page 12: Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Example: Small Gridworld

Iterative Policy Evaluation in Small Gridworld

[Figure: Vk for the random policy (left) and the greedy policy w.r.t. Vk (right), shown as 4×4 grids with the terminal corners fixed at 0.0:

k = 0:   all values 0.0   (greedy policy: random policy)
k = 1:   all nonterminal values -1.0
k = 2:   rows  0.0 -1.7 -2.0 -2.0 | -1.7 -2.0 -2.0 -2.0 | -2.0 -2.0 -2.0 -1.7 | -2.0 -2.0 -1.7  0.0
k = 3:   rows  0.0 -2.4 -2.9 -3.0 | -2.4 -2.9 -3.0 -2.9 | -2.9 -3.0 -2.9 -2.4 | -3.0 -2.9 -2.4  0.0
k = 10:  rows  0.0 -6.1 -8.4 -9.0 | -6.1 -7.7 -8.4 -8.4 | -8.4 -8.4 -7.7 -6.1 | -9.0 -8.4 -6.1  0.0
k = ∞:   rows  0.0 -14. -20. -22. | -14. -18. -20. -20. | -20. -20. -18. -14. | -22. -20. -14.  0.0   (greedy policy: optimal policy)]
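
Running the same kind of synchronous sweep on the small gridworld reproduces these numbers; the encoding below (state 0 as the merged terminal state, states 1–14 in reading order) is my own.

```python
import numpy as np

# Small gridworld: 4x4 grid; the two shaded corners are a single terminal
# state (index 0), the 14 nonterminal squares are states 1..14 in reading
# order; reward is -1 per step and gamma = 1.
cells = [(r, c) for r in range(4) for c in range(4)]
terminal = {(0, 0), (3, 3)}
nonterminal = [cell for cell in cells if cell not in terminal]
index = {cell: i + 1 for i, cell in enumerate(nonterminal)}
index.update({cell: 0 for cell in terminal})

S, A = 15, 4
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]              # up, down, left, right
P, R = np.zeros((A, S, S)), np.zeros((A, S))
P[:, 0, 0] = 1.0                                        # terminal state absorbs
for cell in nonterminal:
    s = index[cell]
    R[:, s] = -1.0
    for a, (dr, dc) in enumerate(moves):
        r, c = cell[0] + dr, cell[1] + dc
        nxt = (r, c) if 0 <= r < 4 and 0 <= c < 4 else cell   # off-grid: stay put
        P[a, s, index[nxt]] = 1.0

pi = np.full((S, A), 0.25)                              # uniform random policy
V = np.zeros(S)
for _ in range(1000):                                   # synchronous sweeps, gamma = 1
    V = np.einsum("sa,as->s", pi, R + P @ V)
print(np.round(V[1:], 1))   # ~[-14 -20 -22 -14 -18 -20 -20 -20 -20 -18 -14 -22 -20 -14]
```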

Page 13: Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Example: Small Gridworld

Iterative Policy Evaluation in Small Gridworld (2)

[Figure repeated from the previous slide: Vk for the random policy and the greedy policy w.r.t. Vk, now highlighting k = 3, 10, ∞. The greedy policy is already optimal by k = 3.]

Page 14: Lecture 3: Planning by Dynamic Programming

Policy Iteration

Policy Improvement

Policy Improvement

Consider a deterministic policy, a = π(s)

We can improve the policy by acting greedily

$$\pi'(s) = \operatorname*{argmax}_{a \in \mathcal{A}} q_\pi(s, a)$$

This improves the value from any state s over one step,

$$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$$

It therefore improves the value function, $v_{\pi'}(s) \ge v_\pi(s)$

$$\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s \right] = v_{\pi'}(s)
\end{aligned}$$

Page 15: Lecture 3: Planning by Dynamic Programming

Policy Iteration

Policy Improvement

Policy Improvement (2)

If improvements stop,

$$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$$

Then the Bellman optimality equation has been satisfied

$$v_\pi(s) = \max_{a \in \mathcal{A}} q_\pi(s, a)$$

Therefore $v_\pi(s) = v_*(s)$ for all $s \in \mathcal{S}$, so π is an optimal policy

Page 16: Lecture 3: Planning by Dynamic Programming

Policy Iteration

Policy Improvement

Policy Iteration

Policy evaluation: estimate vπ (iterative policy evaluation)

Policy improvement: generate π′ ≥ π (greedy policy improvement)
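
A compact sketch of this loop, under the same array conventions as the earlier policy-evaluation sketch; it assumes γ < 1, or that every policy eventually reaches a terminal state, so that the evaluation step is well behaved.

```python
import numpy as np

def policy_iteration(P, R, gamma, eval_sweeps=1000):
    """Policy iteration: iterative policy evaluation + greedy policy improvement.
    P: (A, S, S) transition probabilities; R: (A, S) expected rewards."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # deterministic policy a = pi(s)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: repeated Bellman expectation backups for pi
        P_pi = P[policy, np.arange(n_states)]        # (S, S) dynamics under pi
        R_pi = R[policy, np.arange(n_states)]        # (S,)  rewards under pi
        for _ in range(eval_sweeps):
            V = R_pi + gamma * P_pi @ V
        # Policy improvement: pi'(s) = argmax_a q_pi(s, a)
        Q = R + gamma * P @ V                        # Q[a, s]
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V                         # improvements stopped: pi is optimal
        policy = new_policy
```

Reducing eval_sweeps gives the modified policy iteration discussed later; in the limit of a single sweep each update becomes essentially a Bellman optimality backup, although the stopping test would then have to be on the value function rather than on policy stability.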

Page 17: Lecture 3: Planning by Dynamic Programming

Policy Iteration

Example: Jack’s Car Rental

Jack’s Car Rental

States: Two locations, maximum of 20 cars at each

Actions: Move up to 5 cars overnight (-$2 each)

Reward: $10 for each car rented (must be available), γ = 0.9

Transitions: Cars returned and requested randomly

Poisson distribution: n returns/requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$

1st location: average requests = 3, average returns = 3
2nd location: average requests = 4, average returns = 2
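
The Poisson probabilities above are easy to tabulate; a tiny sketch (the function name is my own):

```python
import math

def poisson_pmf(n, lam):
    """P(N = n) = lam^n / n! * e^(-lam), as on the slide."""
    return lam ** n / math.factorial(n) * math.exp(-lam)

# e.g. probability of exactly 3 rental requests at location 1 (lambda = 3)
print(round(poisson_pmf(3, 3), 3))    # ~0.224
```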

Page 18: Lecture 3: Planning by Dynamic Programming

Policy Iteration

Example: Jack’s Car Rental

Policy Iteration in Jack’s Car Rental

Page 19: Lecture 3: Planning by Dynamic Programming

Policy Iteration

Extensions to Policy Iteration

Modified Policy Iteration

Does policy evaluation need to converge to vπ?

Or should we introduce a stopping condition

e.g. ε-convergence of value function

Or simply stop after k iterations of iterative policy evaluation?

For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy

Why not update the policy every iteration, i.e. stop after k = 1?

This is equivalent to value iteration (next section)

Page 20: Lecture 3: Planning by Dynamic Programming

Policy Iteration

Extensions to Policy Iteration

Generalised Policy Iteration

Policy evaluation: estimate vπ, using any policy evaluation algorithm

Policy improvement: generate π′ ≥ π, using any policy improvement algorithm

Page 21: Lecture 3: Planning by Dynamic Programming

Value Iteration

Deterministic Value Iteration

Deterministic Value Iteration

If we know the solution to subproblems v∗(s ′)

Then it is easy to construct the solution to v∗(s)

$$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + v_*(s') \right)$$

The idea of value iteration is to apply these updates iteratively

e.g. Starting at the goal (horizon) and working backwards

Page 22: Lecture 3: Planning by Dynamic Programming

Value Iteration

Deterministic Value Iteration

Example: Shortest Path

[Figure: a 4×4 gridworld with goal g in the top-left corner. Successive value functions V1 to V7 are shown next to the problem grid; each sweep pushes the values back one more step from the goal, converging to

  0  -1  -2  -3
 -1  -2  -3  -4
 -2  -3  -4  -5
 -3  -4  -5  -6  ]
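
The converged grid can be reproduced by running the deterministic update from the previous slide on this gridworld; the encoding below is my own.

```python
import numpy as np

# 4x4 shortest-path gridworld: goal g in the top-left corner, reward -1 per move.
V = np.zeros((4, 4))
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
for k in range(7):                                    # compute V_1 ... V_7
    V_new = np.zeros((4, 4))
    for r in range(4):
        for c in range(4):
            if (r, c) == (0, 0):
                continue                              # goal state stays at 0
            best = -np.inf
            for dr, dc in moves:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < 4 and 0 <= nc < 4):
                    nr, nc = r, c                      # off-grid moves leave the state unchanged
                best = max(best, -1 + V[nr, nc])       # R = -1, no discounting
            V_new[r, c] = best
    V = V_new
print(V)    # V_7: rows 0 -1 -2 -3 | -1 -2 -3 -4 | -2 -3 -4 -5 | -3 -4 -5 -6
```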

Page 23: Lecture 3: Planning by Dynamic Programming

Value Iteration

Value Iteration in MDPs

Value Iteration in MDPs

MDPs don’t usually have a finite horizon

They are typically loopy

So there is no “end” to work backwards from

However, we can still propagate information backwards

Using the Bellman optimality equation to back up V(s) from V(s′)

Each subproblem is “easier” due to discount factor γ

Iterate until convergence

Page 24: Lecture 3: Planning by Dynamic Programming

Value Iteration

Value Iteration in MDPs

Optimality in MDPs

An optimal policy π∗ must provide both

An optimal first action a∗ from any state s,

Followed by an optimal policy from successor state s ′

$$v_*(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$$

Page 25: Lecture 3: Planning by Dynamic Programming

Value Iteration

Value Iteration in MDPs

Value Iteration

Problem: find optimal policy π

Solution: iterative application of Bellman optimality backup

V1 → V2 → ...→ v∗

Using synchronous backups

At each iteration k + 1
For all states s ∈ S
Update Vk+1(s) from Vk(s′)

Convergence to v∗ will be proven later

Unlike policy iteration, there is no explicit policy

Intermediate value functions may not correspond to any policy

Page 26: Lecture 3: Planning by Dynamic Programming

Value Iteration

Value Iteration in MDPs

Value Iteration (2)

[Backup diagram: Vk+1(s) at the root state s; actions a and rewards r lead to successor states s′ valued at Vk(s′).]

$$V_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V_k(s') \right)$$
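
A minimal NumPy sketch of this backup, iterated to convergence, using the same (P, R) array layout as the earlier sketches; a greedy policy is read off from the final values.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup until the value function stops changing.
    P: (A, S, S) transition probabilities; R: (A, S) expected rewards."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * P @ V                 # Q[a, s] = R^a_s + gamma * sum_s' P^a_ss' V(s')
        V_new = Q.max(axis=0)                 # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)    # values and a greedy policy w.r.t. them
        V = V_new
```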

Page 27: Lecture 3: Planning by Dynamic Programming

Value Iteration

Summary of DP Algorithms

Synchronous Dynamic Programming Algorithms

Problem      Bellman Equation                                           Algorithm
Prediction   Bellman Expectation Equation                               Iterative Policy Evaluation
Control      Bellman Expectation Equation + Greedy Policy Improvement   Policy Iteration
Control      Bellman Optimality Equation                                Value Iteration

Algorithms are based on state-value function vπ(s) or v∗(s)

Complexity O(mn²) per iteration, for m actions and n states

Could also apply to action-value function qπ(s, a) or q∗(s, a)

Complexity O(m²n²) per iteration

Page 28: Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming

Asynchronous Dynamic Programming

DP methods described so far used synchronous backups

i.e. all states are backed up in parallel

Asynchronous DP backs up states individually, in any order

For each selected state, apply the appropriate backup

Can significantly reduce computation

Guaranteed to converge if all states continue to be selected

Page 29: Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming

Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming:

In-place dynamic programming

Prioritised sweeping

Real-time dynamic programming

Page 30: Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming

In-Place Dynamic Programming

Synchronous value iteration stores two copies of the value function

for all s in S:

$$V_{new}(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V_{old}(s') \right)$$

$$V_{old} \leftarrow V_{new}$$

In-place value iteration only stores one copy of the value function

for all s in S:

$$V(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V(s') \right)$$
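
A sketch of the in-place variant: because the single array V is overwritten state by state, each backup within a sweep already sees the freshest values of the states updated before it.

```python
import numpy as np

def in_place_value_iteration(P, R, gamma, n_sweeps=100):
    """Only one value array V is stored; later backups in a sweep reuse
    values already updated earlier in the same sweep."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            V[s] = max(R[a, s] + gamma * P[a, s] @ V for a in range(n_actions))
    return V
```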

Page 31: Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming

Prioritised Sweeping

Use the magnitude of the Bellman error to guide state selection, e.g.

$$\left| \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V_k(s') \right) - V_k(s) \right|$$

Backup the state with the largest remaining Bellman error

Update the Bellman error of affected states after each backup

Requires knowledge of reverse dynamics (predecessor states)

Can be implemented efficiently by maintaining a priority queue
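
A sketch of prioritised sweeping using Python's heapq as the priority queue; the reverse dynamics are precomputed from P, and stale heap entries are simply tolerated (popping one triggers a harmless extra backup). Names and details are my own.

```python
import heapq
import numpy as np

def prioritised_sweeping(P, R, gamma, n_backups=10000, tol=1e-8):
    """Always back up the state with the largest Bellman error, then
    re-prioritise its predecessor states. P: (A, S, S); R: (A, S)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    predecessors = [set() for _ in range(n_states)]          # reverse dynamics
    for a in range(n_actions):
        for s in range(n_states):
            for s2 in np.nonzero(P[a, s])[0]:
                predecessors[s2].add(s)

    def bellman_error(s):
        backup = max(R[a, s] + gamma * P[a, s] @ V for a in range(n_actions))
        return abs(backup - V[s])

    heap = [(-bellman_error(s), s) for s in range(n_states)]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(n_backups):
        if not heap:
            break
        neg_err, s = heapq.heappop(heap)                      # stale entries are harmless
        if -neg_err < tol:
            break                                             # every remaining error is tiny
        V[s] = max(R[a, s] + gamma * P[a, s] @ V for a in range(n_actions))
        for p in predecessors[s]:                             # these states' errors changed
            heapq.heappush(heap, (-bellman_error(p), p))
    return V
```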

Page 32: Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming

Real-Time Dynamic Programming

Idea: only update states that are relevant to the agent

Use the agent's experience to guide the selection of states

After each time-step St, At, Rt+1

Backup the state St

$$V(S_t) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_{S_t} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{S_t s'} V(s') \right)$$
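
A sketch of this idea, assuming the planner can sample next states from the known model and that greedy behaviour eventually reaches the terminal state; the function name and arguments are mine.

```python
import numpy as np

def real_time_dp(P, R, gamma, start, terminal, n_episodes=100, seed=0):
    """Back up only the states the agent actually visits while acting greedily."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s = start
        while s != terminal:
            q = R[:, s] + gamma * P[:, s] @ V          # q[a] for the current state S_t
            V[s] = np.max(q)                           # Bellman optimality backup at S_t only
            a = int(np.argmax(q))                      # act greedily
            s = rng.choice(n_states, p=P[a, s])        # sample S_{t+1} from the model
    return V
```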

Page 33: Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Full-width and sample backups

Full-Width Backups

DP uses full-width backups

For each backup (sync or async)

Every successor state and action is considered

Using knowledge of the MDP transitions and reward function

DP is effective for medium-sized problems (millions of states)

For large problems DP suffers from Bellman's curse of dimensionality

Number of states n = |S| grows exponentially with the number of state variables

Even one backup can be too expensive

[Backup diagram: Vk+1(s) at the root state s; actions a and rewards r lead to successor states s′ valued at Vk(s′).]

Page 34: Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Approximate Dynamic Programming

Approximate Dynamic Programming

Approximate the value function

Using a function approximator $V_\theta(s) = v(s; \theta)$, with a parameter vector $\theta \in \mathbb{R}^n$

The estimated value function at iteration k is $V_k = V_{\theta_k}$

Use dynamic programming to compute $V_{\theta_{k+1}}$ from $V_{\theta_k}$

e.g. Fitted Value Iteration repeats at each iteration k:

Sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$

For each sample state $s \in \tilde{\mathcal{S}}$, compute the target value using the Bellman optimality equation,

$$\tilde{V}_k(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V_{\theta_k}(s') \right)$$

Train the next value function $V_{\theta_{k+1}}$ using targets $\{\langle s, \tilde{V}_k(s) \rangle\}$
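
One possible instantiation of fitted value iteration, using a linear approximator V_θ(s) = φ(s)·θ and a least-squares fit as the "training" step; the slides leave the approximator and the training procedure open, so these are illustrative choices of my own.

```python
import numpy as np

def fitted_value_iteration(P, R, gamma, features, sample_states, n_iters=50):
    """Fitted value iteration with a linear approximator V_theta(s) = features(s) @ theta.
    features: function mapping a state index to a feature vector of length d."""
    n_actions, n_states, _ = P.shape
    Phi_all = np.stack([features(s) for s in range(n_states)])   # (S, d) feature matrix
    Phi = Phi_all[np.asarray(sample_states)]                     # features of sampled states
    theta = np.zeros(Phi_all.shape[1])
    for _ in range(n_iters):
        V = Phi_all @ theta                                      # V_theta_k at every state
        targets = np.array([np.max(R[:, s] + gamma * P[:, s] @ V)
                            for s in sample_states])             # Bellman optimality targets
        theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)    # "train" theta_{k+1}
    return theta
```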

Page 35: Lecture 3: Planning by Dynamic Programming

Contraction Mapping

Some Technical Questions

How do we know that value iteration converges to v∗?

Or that iterative policy evaluation converges to vπ?

And therefore that policy iteration converges to v∗?

Is the solution unique?

How fast do these algorithms converge?

These questions are resolved by the contraction mapping theorem

Page 36: Lecture 3: Planning by Dynamic Programming

Contraction Mapping

Value Function Space

Consider the vector space V over value functions

There are |S| dimensions

Each point in this space fully specifies a function V (s)

What does a Bellman backup do to points in this space?

We will show that it brings value functions closer

And therefore the backups must converge on a unique solution

Page 37: Lecture 3: Planning by Dynamic Programming

Contraction Mapping

Value Function ∞-Norm

We will measure the distance between state-value functions U and V by the ∞-norm

i.e. the largest difference between state values,

$$\|U - V\|_\infty = \max_{s \in \mathcal{S}} |U(s) - V(s)|$$

Define the Bellman expectation backup operator $T^\pi$,

$$T^\pi(V) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi V$$

where $\mathcal{R}^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{R}^a_s$ and $(\mathcal{P}^\pi V)(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V(s')$.
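
In matrix form the operator is one line of NumPy; the sketch below just assembles R^π and P^π from the definitions above, using the same array layout as the earlier sketches.

```python
import numpy as np

def bellman_expectation_operator(V, pi, P, R, gamma):
    """T^pi(V) = R^pi + gamma * P^pi V, with pi: (S, A), P: (A, S, S), R: (A, S)."""
    R_pi = np.einsum("sa,as->s", pi, R)       # R^pi(s)     = sum_a pi(a|s) R^a_s
    P_pi = np.einsum("sa,ast->st", pi, P)     # P^pi(s, s') = sum_a pi(a|s) P^a_ss'
    return R_pi + gamma * P_pi @ V
```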

Page 38: Lecture 3: Planning by Dynamic Programming

Contraction Mapping

Bellman Expectation Backup is a Contraction

The Bellman expectation backup operator is a γ-contraction, i.e. it makes value functions closer by at least γ,

$$\begin{aligned}
\|T^\pi(U) - T^\pi(V)\|_\infty &= \max_s \left| (\mathcal{R}^\pi + \gamma \mathcal{P}^\pi U)(s) - (\mathcal{R}^\pi + \gamma \mathcal{P}^\pi V)(s) \right| \\
&= \max_s \left| \gamma \sum_a \pi(a|s) \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \left( U(s') - V(s') \right) \right| \\
&\le \max_s \gamma \sum_a \pi(a|s) \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \left| U(s') - V(s') \right| \\
&\le \max_s \gamma \sum_a \pi(a|s) \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \|U - V\|_\infty \\
&\le \gamma \|U - V\|_\infty \left( \max_s \sum_a \pi(a|s) \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \right) \\
&\le \gamma \|U - V\|_\infty
\end{aligned}$$

Page 39: Lecture 3: Planning by Dynamic Programming

Contraction Mapping

Contraction Mapping Theorem

Theorem (Contraction Mapping Theorem)

For any metric space V that is complete (i.e. contains its limit points) under an operator T(V), where T is a γ-contraction,

T converges to a unique fixed point

At a linear convergence rate of γ

Page 40: Lecture 3: Planning by Dynamic Programming

Contraction Mapping

Convergence of Iter. Policy Evaluation and Policy Iteration

The Bellman expectation operator Tπ has a unique fixed point

vπ is a fixed point of Tπ (by Bellman expectation equation)

By contraction mapping theorem

Iterative policy evaluation converges on vπ

Policy iteration converges on v∗

Page 41: Lecture 3: Planning by Dynamic Programming

Contraction Mapping

Bellman Optimality Backup is a Contraction

Define the Bellman optimality backup operator $T^*$,

$$T^*(V) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a V \right)$$

This operator is a γ-contraction, i.e. it makes value functions closer by at least γ (similar to the previous proof),

$$\|T^*(U) - T^*(V)\|_\infty \le \gamma \|U - V\|_\infty$$
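
A quick numerical sanity check of this inequality on a randomly generated MDP (not a proof, just an illustration; the MDP below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_states, gamma = 3, 6, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                   # make each P[a, s, :] a distribution
R = rng.standard_normal((n_actions, n_states))

def T_star(V):
    return np.max(R + gamma * P @ V, axis=0)        # Bellman optimality backup

U, V = rng.standard_normal(n_states), rng.standard_normal(n_states)
lhs = np.max(np.abs(T_star(U) - T_star(V)))         # ||T*(U) - T*(V)||_inf
rhs = gamma * np.max(np.abs(U - V))                 # gamma * ||U - V||_inf
print(lhs <= rhs)                                   # expected: True
```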

Page 42: Lecture 3: Planning by Dynamic Programming

Contraction Mapping

Convergence of Value Iteration

The Bellman optimality operator T ∗ has a unique fixed point

v∗ is a fixed point of T ∗ (by Bellman optimality equation)

By contraction mapping theorem

Value iteration converges on v∗