1 Introduction to Markov Decision Processes (MDP)

1. Decision Making Problem

• Multi-stage decision problems with a single decision maker

  – Competitive MDP: more than one decision maker

• Open-loop vs. closed-loop problems

  – Open-loop: front end, plan-act

  – Closed-loop: observe-act-observe, with a policy that depends on the observations

• Short-term vs. long-term decisions

  – Myopic control / greedy strategies consider only the short term

  – MDPs balance short-term and long-term objectives

2. What is MDP?

• The process is observed, perfectly or imperfectly (the imperfect case requires Bayesian analysis), at decision epochs (typically discrete) over a horizon (finite or infinite)

• Actions are taken after observation, e.g., order more inventory, continue an investment, etc.

• Rewards are received, based on the state, the time of the action, and the action taken

• A probability distribution governs the state transition, e.g., the probability of having j units at time t+1, given that we had i units at time t and ordered k units

• The state succinctly summarizes the impact of the past decisions on subsequent decisions

3. Features for MDP:

• Key ingredients of sequential decision making model

– A set of decision epochs

– A set of system states

– A set of available actions

– A set of state and action dependent immediate reward or cost

– A set of state and action dependent transition probabilities

• Apart from mild separability assumptions, the dynamic programming framework is very general. The objective function is separable, e.g., f(x1, x2, x3) = f1(x1) + f2(x2) + f3(x3) or f(x1, x2, x3) = f1(x1) + f2(x1, x2) + f3(x1, x2, x3)

• No assumptions on cost functions (linear, nonlinear, etc.)

• In summary, every separable nonlinear program can be formulated as a dynamic program (DP)

• Limitations of Markov Decision Processes: the curse of dimensionality, i.e., exponential growth of the state space

• What questions can we answer?


– When does an optimal policy exist?

– When does it have a particular form or structure?

– How can we efficiently compute the optimal policy? Value iteration, policy iteration, or linear programming

4. Markov Decision Processes Application Problem

• Inventory Management:

  (1) Determining optimal reorder points and reorder levels.

  Decision epoch: weekly review.
  State: product inventory level at the time of review.
  Action: amount of stock to order.
  Transition probability: depends on how much is ordered and the random demand for that week.
  A decision rule specifies the quantity to be ordered as a function of the stock on hand at the time of review.
  A policy consists of a sequence of such restocking functions.

  (2) Determining reorder points for a complex product network.

• Maintenance and Replacement Problems:

  (1) Bus engine maintenance and replacement: the decision maker periodically inspects the condition of the equipment and, based on its age and condition, decides on the extent of maintenance or replacement. Costs are associated with maintenance and with operating the machine in its current condition. The objective is to balance these two (maintenance and operating costs) so as to minimize a measure of long-term operating costs.

  (2) Highway pavement maintenance: minimize long-run average costs subject to road-quality requirements.

• Communication Models: a wide range of computer, manufacturing, and communication systems can be modeled by networks of interrelated queues and servers; the decision is how to control the channels

• Behavioral ecology:

  (1) Animal behavior – bird nestling problem: the state is the health of the female and of the brood; the objective is to maximize a weighted average of the probability of nestling survival and the probability of the female's survival to the next breeding season; the model finds the optimal behavioral strategy (stay at the nest to protect the young, hunt to supplement the food supply brought by the male, or desert the nest).

  (2) Gambling model – find the optimal digit locations to maximize the probability of winning.


2 Markov Decision Processes (MDP) Model Formulation

A decision maker's goal is to choose a sequence of actions which causes the system to perform optimally with respect to some predetermined criteria. An MDP has five elements: decision epochs, states, actions, transition probabilities, and rewards.

• Decision epochs T

• States S

• Actions As

• Transition probability p

• Rewards/costs rt(s, a)

1. Decision Epochs: let T be the set of decision epochs.

   Discrete: finite, T = {0, 1, 2, ..., N}, or infinite, T = {0, 1, 2, ...}.
   Continuous: T = [0, N] or T = [0, ∞).

   Our focus is on discrete time (primarily the infinite horizon).

2. States and actions

• We observe the system in a state at a decision epoch

• Set of all possible states: S

• In a state s ∈ S, the decision maker selects an action a ∈ As, where As is the set of feasible actions in state s ∈ S

• The set of all feasible actions is A = ∪s∈S As

• We primarily deal with discrete sets of states and actions (finite or countably infinite)

• Actions chosen randomly or deterministically according to a selected ”policy”

3. Rewards and Transition probabilities

At time t ∈ T , system is in state s ∈ S, and the decision maker selects an action a ∈ As,then,

• Receive a reward rt(s, a) ∈ R, which could be profit or cost;

• System moves to state j ∈ S with conditional probability pt(j|s, a)

4. MDP is defined by {T, S,As, Pt(·|s, a), rt(s, a)}; finite horizon vs. infinite horizon.

5. Decision Rules and Policies

   Decision rules: functions mapping from the state space to actions

• Deterministic Markovian decision rules: dt(·) : S −→ A, with dt(j) ∈ Aj for j ∈ S


• History dependent decision rules. History of the process: ht = {s1, a1, s2, a2, · · · , st−1, at−1, st}, with ht ∈ Ht

Given the history of the process, the mapping is dt : Ht −→ A_{st}

• Randomized decision rules: the action is not selected with certainty; a probability distribution q_{dt}(·) is used to select an action

• HR–set of history dependent randomized decision rules

• HD–set of history dependent deterministic decision rules

• MR–set of Markovian randomized decision rules

• MD–set of Markovian deterministic decision rules

Polices/contingency plan/strategy:

• A policy Π is a sequence of decision rules: Π = (d1, d2, · · · ) with di ∈ D^K_i, i ∈ T, where K denotes the class of decision rule (HR, HD, MR, or MD)

• It is a stationary policy if dt = d for all t ∈ T :

  Π = (d, d, · · · ) or Π = d∞

• ΠSD = set of stationary deterministic policies

• ΠSR = set of stationary randomized policies

• For finite horizon problems, stationary policies are generally not optimal. For infinite horizon problems (with stationary data), stationary policies are optimal.

Machine maintenance example

Inventory control example

6. One period MDP problem

• Why important? Multi-period problems can be decomposed into single-period problems

• N = 2, T = {1, 2}, S = {1, 2, ..., n}; the objective is to maximize the sum of the immediate reward and the expected terminal reward

• Value of a policy Π (the decision rule is denoted by d(s), which can be used interchangeably with as):

v(s) = r1(s, d1(s)) + Σ_{j=1}^{n} p1(j|s, d1(s)) · v(j)

• What is the best action to select in state s?

max_{a∈As} {r1(s, a) + Σ_{j=1}^{n} p1(j|s, a) · v(j)}

Then

d*_1(s) = arg max_{a∈As} {r1(s, a) + Σ_{j=1}^{n} p1(j|s, a) · v(j)}
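A minimal sketch of this one-period computation in Python (the rewards, transition probabilities, and terminal values below are hypothetical placeholders, not data from these notes; every action is assumed feasible in every state):

import numpy as np

def one_period_best_action(r, P, v_terminal):
    """For each state s, pick the action maximizing r1(s,a) + sum_j p1(j|s,a) * v(j)."""
    # Q[s, a] = immediate reward + expected terminal reward
    Q = r + P @ v_terminal           # contracts the last axis of P with v_terminal
    d_star = Q.argmax(axis=1)        # optimal decision rule d*_1(s)
    v_star = Q.max(axis=1)           # optimal one-period value
    return d_star, v_star

# Hypothetical 3-state, 2-action example
r = np.array([[1.0, 0.5], [0.0, 2.0], [1.5, 1.0]])        # r[s, a]
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
              [[0.2, 0.2, 0.6], [0.4, 0.4, 0.2]]])         # P[s, a, j]
v_term = np.array([0.0, 1.0, 3.0])                         # terminal values v(j)
print(one_period_best_action(r, P, v_term))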


7. Example of equipment replacement

• current machine is one year old

• Planning for 3 years

• Expense

Age          0    1    2    3    4
Maintenance  $0   $1   $2   $3   -
Salvage      -    $8   $7   $6   $5

• New machine is $10.

• Assume salvage at the end of year 3.
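A small backward-induction sketch for this example is given below, under one plausible reading of the cost structure (an assumption, not taken from the notes): maintenance is paid on the machine's age at the start of each year; replacing salvages the old machine and costs $10 plus the age-0 maintenance; the machine on hand at the end of year 3 is salvaged. The exact accounting used in class may differ.

# Assumed cost structure for the 3-year equipment replacement example.
maint = {0: 0, 1: 1, 2: 2, 3: 3}        # maintenance cost by machine age at start of year
salvage = {1: 8, 2: 7, 3: 6, 4: 5}      # salvage value by machine age
NEW_COST = 10
N = 3                                   # planning horizon in years

def solve():
    # U[t][age] = maximal net value from the start of year t with a machine of this age.
    U = {N + 1: {age: salvage[age] for age in salvage}}   # terminal salvage at end of year 3
    policy = {}
    for t in range(N, 0, -1):
        U[t], policy[t] = {}, {}
        for age in range(1, t + 1):     # starting machine is 1 year old, so age <= t in year t
            keep = -maint[age] + U[t + 1][age + 1]
            replace = salvage[age] - NEW_COST - maint[0] + U[t + 1][1]
            U[t][age] = max(keep, replace)
            policy[t][age] = "keep" if keep >= replace else "replace"
    return U, policy

U, policy = solve()
print("Value starting year 1 with a 1-year-old machine:", U[1][1])
for t in range(1, N + 1):
    print(f"year {t}:", policy[t])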


3 MDP applications and examples

1. Two state MDP

• Assumption: stationary rewards and stationary transition probabilities, i.e., the rewards and transition probabilities do not change with time; there are two states: s1 and s2.

• MDP formulation: 5 modeling components (decision epochs, states, actions, rewards and transition probabilities)

2. Single Product Inventory Control

• Assumptions

– No delivery lead time, instantaneous delivery

– All orders (demands) are filled at the end of the month

– No backlogging (No negative inventory allowed)

– stationary demand (No seasonality)

– Warehouse capacity: M units

• question to be answered: For a given stock level at month t, how much do I order?

• MDP formulation: 5 modeling components (decision epochs, states, actions, rewards and transition probabilities)

• State transition: St+1 = [St + at − Dt]^+

• cost of placing orders: o(u)

• inventory holding cost: h(u)

• stochastic demand (stationary, does not change with time): p(Dt = j) = pj

• revenue of selling j units of products: f(j)

• expected profit for one period:

– First step:

  rt(St, at, St+1) = f(St + at − St+1) − o(at) − h(St + at)

– Then take the expectation over St+1:

  rt(St, at) = Σ_j ( f(St + at − max(0, St + at − j)) − o(at) − h(St + at) ) pj
             = −o(at) − h(St + at) + Σ_j f(St + at − max(0, St + at − j)) pj

  If St + at > j, then f(St + at − max(0, St + at − j)) = f(j).
  If St + at ≤ j, then f(St + at − max(0, St + at − j)) = f(St + at).

– Finite horizon terminal reward: rN(s) = g(s)

– Transition probabilities:

  pt(j|s, a) = 0,          if M ≥ j > s + a;
               p_{s+a−j},  if M ≥ s + a ≥ j > 0;
               q_{s+a},    if M ≥ s + a and j = 0,

  where q_{s+a} = Σ_{k ≥ s+a} pk is the probability that demand is at least s + a.
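As a sketch, these transition probabilities can be tabulated as follows; the demand distribution below is a hypothetical placeholder, and all demands of at least s + a are mapped to the state j = 0:

import numpy as np

M = 5                                   # warehouse capacity
demand_p = np.array([0.25, 0.5, 0.25])  # assumed P(D = 0), P(D = 1), P(D = 2)

def transition_row(s, a):
    """p(j | s, a) for the inventory model: next state j = max(0, s + a - D)."""
    assert 0 <= s + a <= M, "the order cannot exceed the warehouse capacity"
    row = np.zeros(M + 1)
    for d, pd in enumerate(demand_p):
        j = max(0, s + a - d)
        row[j] += pd                    # demands >= s + a all land in j = 0
    return row

# Example: 2 units on hand, order 1 more
print(transition_row(2, 1))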

3. Shortest Route and Critical Path Models; Sequential Allocation Model

4. Optimal Stopping Problem


Practice problems: 1. Suppose we have a machine that is either running or broken down. If it runs throughout one week, it makes a gross profit of $100. If it fails during the week, gross profit is zero. If it is running at the start of the week and we perform preventive maintenance, the probability of failure is 0.7. However, maintenance will cost $20. When the machine is broken down at the start of the week, it may either be repaired at a cost of $40, in which case it will fail during the week with a probability of 0.4, or it may be replaced at a cost of $150 by a new machine that is guaranteed to run through its first week of operation. Find the optimal repair, replacement, and maintenance policy that maximizes total profit over three weeks, assuming a new machine at the start of the first week.

Practice problems: 2. At the start of each week, a worker receives a wage offer of w units per week. He may either accept the offer and work at that wage for the entire week, or instead seek alternative employment. If he decides to work in the current week, then at the start of the next week, with probability p, he will have the same wage offer available; with probability 1 − p, he will be unemployed and unable to seek employment during that week. If he seeks alternative employment, he receives no income in the current week and obtains a wage offer of w′ for the subsequent week according to a transition probability pt(w′|w). Assume his utility when receiving wage w is Φt(w).


4 Finite Horizon MDP

1. Introduction

• The solution of a finite-horizon MDP depends on the ”optimality equations”

• The solution is found by analyzing a sequence of smaller, inductively defined problems

• Principle of optimality: "An optimal policy has the property that whatever the initial state and decision are, the remaining decisions constitute an optimal policy with regard to the state resulting from the first decision"

Shortest path problem example: using backwards induction.

Optimality Equations:

Ut(s) = max_{a∈As} {rt(s, a) + Σ_{j∈S} pt(j|s, a) Ut+1(j)},  for t = 1, 2, ..., N − 1

UN(s) = rN(s),  ∀s ∈ S

An infinite horizon problem cannot be solved by working backwards, since there is no terminal reward.
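As a sketch, the finite-horizon optimality equations above can be solved by backward induction roughly as follows (the array layout and the random test instance are illustrative assumptions, not part of the notes; every action is assumed feasible in every state):

import numpy as np

def backward_induction(r, P, r_terminal):
    """Solve U_t(s) = max_a { r_t(s,a) + sum_j p_t(j|s,a) U_{t+1}(j) }, U_N(s) = r_N(s).

    r: (N-1, S, A) rewards r_t(s, a) for t = 1, ..., N-1
    P: (N-1, S, A, S) transition probabilities p_t(j | s, a)
    r_terminal: (S,) terminal rewards r_N(s)
    Returns the value functions U and an optimal Markov deterministic policy.
    """
    n_periods, n_states, n_actions = r.shape
    U = np.zeros((n_periods + 1, n_states))
    policy = np.zeros((n_periods, n_states), dtype=int)
    U[-1] = r_terminal
    for t in range(n_periods - 1, -1, -1):        # backward recursion
        Q = r[t] + P[t] @ U[t + 1]                # Q[s, a]
        policy[t] = Q.argmax(axis=1)
        U[t] = Q.max(axis=1)
    return U, policy

# Illustrative random instance: 3 decision epochs, 4 states, 2 actions
rng = np.random.default_rng(0)
r = rng.uniform(size=(3, 4, 2))
P = rng.dirichlet(np.ones(4), size=(3, 4, 2))     # each row sums to 1 over next states
U, policy = backward_induction(r, P, rng.uniform(size=4))
print(U[0], policy[0])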

2. Optimality Criteria: how do we select the best policy Π* = (d*_1, d*_2, ..., d*_N)?

• A fixed policy induces a stochastic process of states visited. For a given d(·), we are generating a Markov chain over the states.

• Let xt and yt be the random variables describing the state and action selected at time t; then {(xt, yt)}_{t=1}^{N} is a stochastic process for a given policy Π = (d1, d2, ..., dN)

Sample path: (s, d1(s)) → (s′, d2(s′)) → · · · , i.e., (X, Y ) → (X′, Y ′) → · · ·

And {r1(x1, y1), r2(x2, y2), ..., rN−1(xN−1, yN−1), rN(xN)} is also a stochastic process

• Define Rt = rt(xt, yt); a given policy Π provides a probability distribution over the reward stream: pΠ(R1, R2, ..., RN)

Typically, it is difficult to determine a measure that maintains an ordering of stochastic elements, so we use the expected value as one measure.

We say that policy Π∗ is preferred to policy Π1 if

E^{Π*}[f(R1, R2, ..., RN)] ≥ E^{Π1}[f(R1, R2, ..., RN)]

Other types of measures: mean-variance trade-off, risk-reward measures, time-sensitive measures (time value of money). In summary, more complex measures are available.


• Expected total reward criterion: for Π ∈ Π^HD (history dependent deterministic), the value of policy Π is

V^Π_N(s) = E^Π_s { Σ_{t=1}^{N−1} rt(xt, dt(ht)) + rN(xN) }

This is the value of an N-period problem starting in state s at time 1 under policy Π.
Assume |rt(s, a)| ≤ M for all (s, a) ∈ S × A (bounded rewards); then V^Π_N(s) exists.

• We seek an optimal policy Π* ∈ Π^HR (the most general case), where V^{Π*}_N(s) ≥ V^Π_N(s) for all Π ∈ Π^HR. The value of the MDP is V*_N, the supremum of the policy values; it may not be achievable.

3. Optimality Equations and the Principle of Optimality

• How do we find the optimal policy?

  1. By analyzing a sequence of smaller, inductively defined problems.
  2. Using the principle of optimality: an optimal policy has the property that whatever the initial state and decision are, the remaining decisions constitute an optimal policy with regard to the state resulting from the first decision (any subsequence must be optimal if the whole is optimal).
  3. Take the N-period problem and solve a series of N one-period problems.

• Define

U^Π_t(ht) = E^Π_{ht} { Σ_{n=t}^{N−1} rn(xn, yn) + rN(xN) }

Given that the decision maker has seen history ht at time t, U^Π_t(ht) is the expected remaining reward under policy Π from time t onward, where Π = {d1, d2, ..., dt−1, dt, dt+1, ..., dN} and yn = dn(hn).

If we know ht, then we know xt, yt = dt(ht), and rt(xt, yt), so

U^Π_t(ht) = rt(xt, yt) + E^Π_{ht} { Σ_{n=t+1}^{N−1} rn(xn, yn) + rN(xN) }

          = rt(xt, yt) + Σ_{j∈S} pt(j|st, dt(ht)) E^Π_{ht+1} { Σ_{n=t+1}^{N−1} rn(xn, yn) + rN(xN) }

where ht+1 = {ht, dt(ht), j}. Therefore

U^Π_t(ht) = rt(xt, yt) + Σ_{j∈S} pt(j|st, dt(ht)) U^Π_{t+1}(ht+1)

This is the core of Dynamic programming.

This is the recursive way to find the value of MDP.


• How to compute? By the principle of optimality, decompose to get the optimality equations.

Ut(ht) = max_{a∈A_{st}} { rt(st, a) + Σ_{j∈S} pt(j|st, a) Ut+1(ht+1) }

UN(hN) = rN(sN)

These equations inductively compute the optimal value.

UN(hN) is given; compute UN−1(hN−1) for all hN−1 (sometimes this is not feasible due to the large number of histories hN−1); then, with UN−1(hN−1) given, compute UN−2(hN−2) for all hN−2, and so on. This is called backward recursion.

4. Optimality of Deterministic Markov Policies

   Conditions under which there exists an optimal policy which is deterministic and Markovian. Use backward induction to determine the structure of an optimal policy. When the immediate rewards and transition probabilities depend on the past only through the current state of the system (as we have assumed), the optimal value function depends on the history only through the current state of the system.

1. Theorem 4.4.1: existence of a deterministic history dependent optimal policy.
2. Theorem 4.4.2: existence of a deterministic Markovian optimal policy.

3. If there are K states with L actions per state, there are (L^K)^(N−1) feasible policies, each requiring (N − 1)LK multiplications to evaluate; backward induction requires only (N − 1)LK^2 multiplications.
4. The reward functions may be complicated to compute. There is a lot of research on computation reduction.

5. Backward Induction: based on the theorems in Section 4.4, there exists a Markovian deterministic optimal policy. So instead of finding u_{ht}, we only need to find u_{st}.

6. Examples problems revisited


Practice problem on stock call option:

Suppose the current price of some stock is $30 per share, and each day its price increases by $0.10 with probability 0.6, remains the same with probability 0.1, and decreases by $0.10 with probability 0.3. Find the value of a call option to purchase 100 shares of this stock at $31 any time in the next 30 days by finding an optimal policy for exercising this option. Assume a transaction cost of $50.


7. Optimality of Monotone Policies (4.7)

• Structured policies appeal to decision makers. A typical structure is a control-limit policy: dt(s) = a1 when s < s* and dt(s) = a2 when s ≥ s*, where s* is the control limit (a state).

• We only need to find s* at each t, instead of d*_t(s) for each t and s. Let the state set X and action set Y be partially ordered sets and g(x, y) : X × Y −→ R. Then g is superadditive if, for x+ ≥ x− and y+ ≥ y−,

  g(x+, y+) − g(x+, y−) ≥ g(x−, y+) − g(x−, y−)

  (monotone increasing differences, e.g., g(30, E) − g(30, w) ≥ g(20, E) − g(20, w))

• Lemma 4.7.1: if g is superadditive and, for each x ∈ X, max_{y∈Y} g(x, y) exists, then

  f(x) = max{ y′ ∈ argmax_{y∈Y} g(x, y) }

  is monotone nondecreasing in x. Superadditivity is a sufficient but not a necessary condition. (A small numerical illustration of this lemma appears after this list.)

• Conditions under which monotone policies are optimal.

Define qt(k|s, a) = Σ_{j=k}^{∞} pt(j|s, a). If qt(k|s, a) is nondecreasing in s for all k ∈ S, a ∈ A, then for any state-dependent nondecreasing sequence

ut(j + 1) ≥ ut(j)  for all j ∈ S,

and for any decision rule dt ∈ D^MD,

Σ_{j=0}^{∞} pt(j|s′, dt(s′)) ut+1(j) ≥ Σ_{j=0}^{∞} pt(j|s, dt(s)) ut+1(j)

for s′ ≥ s, s ∈ S. That is, the expected future reward from s′ is at least as large as that from s.

If the optimal value function Ut(·) is nondecreasing and qt(k|s, a) is nondecreasing, then we prefer to be in a "higher" state.

Revisit the stock example.

• Monotone policies: if s′ ≥ s, then d*(s′) ≥ d*(s).

  Proposition 4.7.3: suppose the max of the optimality equation is attained and
  1. rt(s, a) is nondecreasing in s for all a ∈ A, t = 1, 2, ..., N − 1;
  2. rN(s) is nondecreasing in s;
  3. qt(k|s, a) (the probability, given state s and action a, of moving to a state at least k) is nondecreasing in s for all k ∈ S, a ∈ A, i.e., higher states are more likely to transition to higher states.
  Then U*_t(s) is nondecreasing in s for all t = 1, 2, ..., N − 1.


• How do we show that a monotone optimal policy exists? Optimal value function structure ⇒ optimal policy structure. Use Theorem 4.7.4: a list of conditions/checklist.

• Backward Induction Algorithm for MDP with Optimal Monotone Policy
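The following is the small numerical illustration of Lemma 4.7.1 referenced above; the function g below is an arbitrary assumption chosen to have increasing differences (superadditivity):

import numpy as np

def is_superadditive(g):
    """Check g(x+,y+) - g(x+,y-) >= g(x-,y+) - g(x-,y-) for all x+ >= x-, y+ >= y-."""
    nx, ny = g.shape
    return all(g[x2, y2] - g[x2, y1] >= g[x1, y2] - g[x1, y1]
               for x1 in range(nx) for x2 in range(x1, nx)
               for y1 in range(ny) for y2 in range(y1, ny))

def largest_maximizer(g):
    """f(x) = max{ y in argmax_y g(x, y) }; nondecreasing in x when g is superadditive."""
    return np.array([np.flatnonzero(row == row.max()).max() for row in g])

x = np.arange(5)[:, None]
y = np.arange(4)[None, :]
g = x * y - (y - 2.0) ** 2          # assumed example with increasing differences
print(is_superadditive(g))          # True
print(largest_maximizer(g))         # a monotone nondecreasing selection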


5 Infinite Horizon MDP

1. Introduction to the infinite horizon problem

1. Why? The horizon length is often unknown or stochastic.

2. Assume time homogeneous data: pt(j|s, a) = p(j|s, a) and rt(s, a) = r(s, a).

3. For non-homogeneous data over an infinite horizon, one usually forecasts for T periods, implements the decision for the current planning horizon, and reforecasts for T more periods (Turnpike Theorem).

2. Value of a policy for the infinite horizon problem

1. For the infinite horizon problem, an optimal policy is always stationary. A policy Π* = (d, d, d, ...) = d∞ with dt = d is a stationary policy.

2. For a given policy, we receive an infinite stream of rewards r(xt, yt), with xt being the state and yt the action taken under the policy d∞.

3. For a given policy d∞, the process becomes a Markov reward chain/process.

Define p(j|i) = p(j|i, d(i)) = pij (since the policy is stationary, we do not need to specify it in the reward function and transition probabilities). When will this process converge? How can we measure optimality?

• Expected total reward: the limit may be ∞ or −∞, or may not exist

• Expected total discounted reward: the limit exists if sup_{s,a} |r(s, a)| = M < ∞

• Average reward/cost criterion: exists only if the lim sup equals the lim inf as N → ∞

3. Criteria to measure a policy for infinite MDP problem

1. Expected total reward: v^{π*}(s) ≥ v^π(s) for each s ∈ S and all π ∈ Π^HR

2. Expected total discounted reward: v^{π*}_λ(s) ≥ v^π_λ(s) for each s ∈ S and all π ∈ Π^HR

3. Average reward/cost criterion: g^{π*}(s) ≥ g^π(s) for each s ∈ S and all π ∈ Π^HR

4. Markov Policies

1. We now restrict our attention from Π^HR to Π^MR, and eventually to Π^MD.

2. Theorem 5.5.1: proves by induction that an equivalent Markovian policy exists.

5. Vector notations for MDP


6 Discounted MDP

We study infinite-horizon Markov decision processes with the expected total discounted reward optimality criterion. This approach is widely used and applied, for example to account for the time value of money or technological change.

Assumptions:

• Stationary and bounded reward

• Stationary transition probabilities

• Discounted future rewards: 0 ≤ λ < 1

• Discrete state space

1. How to evaluate a policy?

For a nonstationary policy Π^1 = {d1, d2, ...}, we know

V^{Π^1}_λ(s) = E^{Π^1}_s { Σ_{t=1}^{∞} λ^{t−1} r(xt, yt) }

If we define Π^2 = {d2, d3, ...} ∈ Π^MD, or in general Π^n = {dn, dn+1, ...} ∈ Π^MD, and V^{Π^1}_λ(s) is the expected discounted reward of being in state s under policy Π^1, then

V^{Π^1}_λ(s) = r(s, d1(s)) + Σ_{j∈S} p(j|s, d1(s)) [ λ r(j, d2(j)) + Σ_{k∈S} p(k|j, d2(j)) [ λ^2 · · · ] ]

or

V^{Π^1}_λ = r_{d1} + λ P_{d1} V^{Π^2}_λ

If the policy is not stationary, i.e., di ≠ dj, then this is nearly impossible to compute without special structure.

If we have stationary data, why would the decision be different in the same state at a later time? The future looks the same, so we should follow the same decision rule.

We only look at stationary policy for infinite horizon problems.

Define Π = d∞ = {d, d, d...}

V^{d∞}_λ(s) = rd(s) + λ Σ_{j∈S} p(j|s, d(s)) V^{d∞}_λ(j)

or

V^{d∞}_λ = rd + λ Pd V^{d∞}_λ

In summary, we want the solution of

V = rd + λPdV ⇒ (I − λPd)V = rd


If (I − λPd)^{−1} exists, the system has a unique solution.

From linear algebra, if lim_{n→∞} ‖(λPd)^n‖^{1/n} < 1, then (I − λPd)^{−1} exists and

(I − λPd)^{−1} = Σ_{t=0}^{∞} λ^t Pd^t

If U ≥ 0, then (I − λPd)^{−1} U ≥ 0; if the rewards are nonnegative, then the value of the policy is nonnegative, and

v^{d∞}_λ = (I − λPd)^{−1} rd = Σ_{t=0}^{∞} λ^t Pd^t rd = rd + λ Pd rd + (λPd)^2 rd + · · ·
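As a small sketch (the rewards and transition matrix under the fixed decision rule below are illustrative assumptions), the value of a stationary policy can be computed either by the direct linear solve or by truncating this series:

import numpy as np

lam = 0.9
r_d = np.array([1.0, 0.0, 2.0])            # assumed r_d(s) under a fixed decision rule d
P_d = np.array([[0.5, 0.5, 0.0],
                [0.1, 0.8, 0.1],
                [0.0, 0.3, 0.7]])          # assumed P_d(j | s) under the same rule

# Direct solve of (I - lam * P_d) V = r_d
v_exact = np.linalg.solve(np.eye(3) - lam * P_d, r_d)

# Truncated Neumann series: r_d + lam*P_d r_d + (lam*P_d)^2 r_d + ...
v_series = np.zeros(3)
term = r_d.copy()
for _ in range(500):
    v_series += term
    term = lam * P_d @ term

print(v_exact)
print(v_series)                            # agrees with v_exact to high precision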

Now define

Ld V ≡ rd + λ Pd V,

the linear transformation defined by d; it is a function of V, an operator on V.

If we find V′ ∈ V such that

Ld V′ = V′  ⇒  rd + λ Pd V′ = V′  ⇒  V′ = (I − λPd)^{−1} rd,

then V′ = V^{d∞}_λ.

Therefore, in order to find the value of the stationary policy Π = d∞, we need to find the fixed point of the operator Ld.

A fixed point of a function f is a point y such that f(y) = y.

For finite horizon under stationary assumption:

Ut(s) = sup_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) Ut+1(j) },  for t = 1, 2, · · · , N − 1

UN(s) = rN(s)

What is the optimality equation for infinite horizon problem?

For infinite horizon case:

V(s) = sup_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) V(j) }

for s ∈ S.

From now on, we focus on Π^MD, the Markovian deterministic (and, for the infinite horizon, stationary) policies.


Define the operator LV ≡ max_{d∈D} { rd + λ Pd V }. Then for s ∈ S,

LV(s) = max_{d∈D^MD} { rd(s) + λ Σ_{j∈S} p(j|s, d(s)) V(j) }    (1)

      = max_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) V(j) }    (2)

The optimality equation is

LV(s) = max_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) V(j) } = V(s)

Then we want v′ ∈ V such that Lv′ = v′. In other words, we need to find the fixed point of L.

Numerical example 6.1.1.

2. Optimality Equation

We will show the following:

(a) The optimality equation has a unique solution in V.
(b) The value of the discounted MDP satisfies the optimality equation.
(c) The optimality equation characterizes stationary optimal policies.
(d) Optimal policies exist under reasonable conditions on the states, actions, rewards, and transition probabilities.

For infinite horizon optimality, the Bellman equation is:

v(s) = sup_{a∈As} { r(s, a) + Σ_{j∈S} λ p(j|s, a) v(j) }

Proposition 6.2.1 provides the justification for restricting the policy space from Markov randomized to Markov deterministic. Recall that in Section 5.5 we proved that, for each fixed initial state, we may restrict attention to Markov policies rather than history dependent policies.

Properties of the optimality equations (Theorem 6.2.2): for v ∈ V,

(a) if v ≥ Lv, then v ≥ v*_λ;
(b) if v ≤ Lv, then v ≤ v*_λ;
(c) if v = Lv, then v = v*_λ.

If v ≥ Lv, then Lv ≥ L^2 v, and hence v ≥ Lv ≥ L^2 v ≥ · · · .

We need to prove that L^k v → v*_λ as k → ∞; in other words, for bounded v ∈ V, v*_λ = lim_{k→∞} L^k v, where v*_λ is the optimal value obtained under the operator L.


Given v, Lv is the optimal value of a one-period problem with terminal value v,
L^2 v is the optimal value of a two-period problem with terminal value λ^2 v,
· · ·
L^n v is the optimal value of an n-period problem with terminal value λ^n v.

As n → ∞, λ^n v → 0, so the terminal value becomes irrelevant.

L^∞ v ≡ v*_λ, which is the solution of the infinite horizon problem.

Definition of a normed linear space: define V as the set of bounded real-valued functions on S, with the norm ‖V‖ = max_{s∈S} |V(s)|. V is closed under addition and scalar multiplication, so (V, ‖ · ‖) is a normed linear space.

Properties of a normed linear space: if v1, v2, v3 ∈ V, then

(1) v1 + v2 = v2 + v1
(2) v1 + (v2 + v3) = (v1 + v2) + v3
(3) there exists a unique vector 0 such that v + 0 = v
(4) for each v ∈ V, there exists a unique vector −v ∈ V such that v + (−v) = 0

Scalar multiplication: for α, β ∈ R,

(1) α(βv) = (αβ)v
(2) 1 · v = v for all v
(3) (α + β)v = αv + βv
(4) α(v1 + v2) = αv1 + αv2

For a normed linear space, the triangle inequality holds: ‖v1 + v2‖ ≤ ‖v1‖ + ‖v2‖.

Definition of a contraction mapping: T : U → U is a contraction mapping if there exists a λ, 0 ≤ λ < 1, such that

‖Tv − Tu‖ ≤ λ ‖v − u‖

for all u and v in U.

Show that if T is a contraction mapping, then ‖T^k v − T^k v′‖ ≤ λ^k ‖v − v′‖ → 0.
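As an aside, one can check numerically for a random instance (the sizes and discount factor below are assumptions) that the Bellman operator L is a contraction with modulus λ in the sup norm:

import numpy as np

rng = np.random.default_rng(1)
S, A, lam = 4, 3, 0.9
r = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, j] = p(j | s, a)

def L(v):
    """Bellman operator: (Lv)(s) = max_a { r(s,a) + lam * sum_j p(j|s,a) v(j) }."""
    return (r + lam * P @ v).max(axis=1)

# Numerically check the contraction property ||Lv - Lu|| <= lam * ||v - u||
for _ in range(1000):
    v, u = rng.normal(size=S), rng.normal(size=S)
    assert np.abs(L(v) - L(u)).max() <= lam * np.abs(v - u).max() + 1e-12
print("sup-norm contraction with modulus", lam, "verified on random samples")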

Theorem 6.2.3 (Banach fixed point theorem): let T be a contraction mapping on a Banach space V. Then

(1) there exists a unique v* ∈ V such that Tv* = v*;
(2) lim_{k→∞} ‖T^k v − v*‖ = 0 for any v ∈ V.

We know we can find the optimal value, but what is the optimal policy? We say d* is a conserving decision rule if


rd* + λ Pd* v*_λ = v*_λ, where v*_λ is the optimal value vector.

Componentwise, r(s, d*(s)) + λ Σ_{j∈S} p(j|s, d*(s)) v*_λ(j) = v*_λ(s)

In other words, d∗ satisfies the optimality equations. Is d∗ unique?

3. Value iteration

Value iteration is the most widely used and best understood algorithm for solving discounted Markov decision problems.

(1) Other names for value iteration: successive approximation, backward induction, pre-Jacobi iteration.
(2) Advantage 1: conceptual simplicity.
(3) Advantage 2: easy to code and implement.
(4) Advantage 3: similarity to other areas of applied mathematics.

Bellman’s Equation for Infinite Horizon MDP:

V(s) = max_{a∈A} { r(s, a) + λ Σ_{j∈S} p(j|s, a) V(j) }

Optimality equation:

Lv(s) = max_{a∈A} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j) }, the component-wise form of the operator;

Lv = max_{d∈D} { rd + λ Pd v }, the vector form of the operator.

We want to find v* such that v* = Lv*.

If v ≥ Lv, then Lv ≥ L^2 v, and hence v ≥ Lv ≥ L^2 v ≥ · · · .

For two bounded vectors X, Y : if X ≤ Y , then LX ≤ LY . This means the operator L is monotone.

Value iteration/successive approximation is not only used for MDPs; it is used generally in optimization over function spaces.

Find a solution for the system:

v(s) = max_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j) }

or

v(s) = Lv(s)


We want to find the fixed point of the operator L; v*_λ is the optimal value, and the optimal stationary deterministic policy (d*)^∞ is defined by

d*(s) = argmax_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v*_λ(j) }

Π* = (d*, d*, · · · )

By the fixed point theorem, given v^0 ∈ V, let v^n = L^n v^0; then v^n → v*_λ as n → ∞.

How long until it converges? What is the stopping condition?

Value iteration:

1. Select v^0 ∈ V and ε > 0; set n = 0.

2. Let v^{n+1} = Lv^n, i.e., v^{n+1}(s) = max_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v^n(j) }.

3. If ‖v^{n+1} − v^n‖ = max_{s∈S} |v^{n+1}(s) − v^n(s)| < ε(1 − λ)/(2λ), go to step 4; otherwise, let n = n + 1 and go to step 2.

4. Set

   dε(s) = argmax_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v^{n+1}(j) },

   i.e., Ldε v^{n+1} = Lv^{n+1}; then (dε)^∞ is ε-optimal (Theorem 6.3.1).
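A direct translation of these four steps into Python might look like the following sketch (the random test instance is an assumption for illustration):

import numpy as np

def value_iteration(r, P, lam, eps=1e-6):
    """Value iteration for a discounted MDP.

    r: (S, A) rewards, P: (S, A, S) transition probabilities, lam: discount in (0, 1).
    Returns an eps-optimal value vector and the corresponding greedy decision rule.
    """
    v = np.zeros(r.shape[0])                           # step 1
    while True:
        v_new = (r + lam * P @ v).max(axis=1)          # step 2: v_{n+1} = L v_n
        if np.abs(v_new - v).max() < eps * (1 - lam) / (2 * lam):   # step 3
            break
        v = v_new
    d_eps = (r + lam * P @ v_new).argmax(axis=1)       # step 4: eps-optimal greedy rule
    return v_new, d_eps

# Hypothetical 3-state, 2-action instance
rng = np.random.default_rng(2)
r = rng.uniform(size=(3, 2))
P = rng.dirichlet(np.ones(3), size=(3, 2))
print(value_iteration(r, P, lam=0.95))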

Page 62 has a nice graph on the convergence of d1 and d2. This type of analysis is practical for finite-state or structured problems.

- Value iteration is not the best method for solving MDPs, but it is useful for proving structural properties (e.g., control-limit policies).
- Value iteration generates a sequence of value vectors in V.

4. Policy Iteration Algorithm

Let us explore an algorithm that generates a sequence of stationary policies in Π^MD, instead of a sequence of value vectors in V as in the value iteration algorithm.

Proposition 6.4.1: let d1 and d2 be two stationary policies such that

L_{d2} v^{d1}_λ = max_{d∈D^MD} { Ld v^{d1}_λ }

Then

v^{d1}_λ ≤ v^{d2}_λ

Hence define a sequence: let d0 ∈ D^MD and solve

r_{d0} + λ P_{d0} V = V,

which can be done by analytically solving the equation or by applying L_{d0} over and over again:

V^{d0}_λ = (I − λ P_{d0})^{−1} r_{d0}


Let V^0 = V^{d0}_λ, and V^{n+1} = (I − λ P_{d_{n+1}})^{−1} r_{d_{n+1}}, where L_{d_{n+1}} V^n = LV^n. Then we can conclude that V^{n+1} ≥ V^n by the proposition above.

The concept behind this algorithm is that instead of computing a sequence of values, we generate an improving sequence of policies. This search is over a discrete set and typically takes fewer steps.

Policy iteration:

1. Set n = 0; select d0 ∈ D = D^MD.

2. Policy evaluation: V^n is the solution of V = r_{dn} + λ P_{dn} V, i.e., V^n = (I − λ P_{dn})^{−1} r_{dn}.

3. Policy improvement: select d_{n+1} such that L_{d_{n+1}} V^n = LV^n (with V^n = V^{dn}_λ), i.e.,

   d_{n+1}(s) ∈ argmax_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) V^n(j) },

   setting d_{n+1}(s) = dn(s) if possible.

4. If d_{n+1} = dn, stop; otherwise, set n = n + 1 and go to step 2.
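The algorithm can be sketched in Python as follows (the test instance is an illustrative assumption; step 2 uses a direct linear solve for the policy evaluation):

import numpy as np

def policy_iteration(r, P, lam):
    """Policy iteration for a discounted MDP with rewards r (S, A) and transitions P (S, A, S)."""
    n_states = r.shape[0]
    d = np.zeros(n_states, dtype=int)                  # step 1: arbitrary initial rule d_0
    while True:
        # Step 2 (policy evaluation): solve (I - lam * P_d) V = r_d
        P_d = P[np.arange(n_states), d]                # P_d[s, j] = p(j | s, d(s))
        r_d = r[np.arange(n_states), d]
        V = np.linalg.solve(np.eye(n_states) - lam * P_d, r_d)
        # Step 3 (policy improvement): greedy rule with respect to V
        Q = r + lam * P @ V
        d_new = Q.argmax(axis=1)
        keep = np.isclose(Q[np.arange(n_states), d], Q.max(axis=1))
        d_new[keep] = d[keep]                          # set d_{n+1}(s) = d_n(s) if possible
        # Step 4: stop when the policy repeats
        if np.array_equal(d_new, d):
            return d, V
        d = d_new

rng = np.random.default_rng(3)
r = rng.uniform(size=(4, 3))
P = rng.dirichlet(np.ones(4), size=(4, 3))
print(policy_iteration(r, P, lam=0.9))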

- With finite states and finite actions, policy iteration terminates in a finite number of steps (see the small example on page 65).

Policy iteration variants:
- Policy iteration is computationally expensive.
- The exact value is not necessary; we can compute a close approximation.
- Updates may be performed only for a selected subset of states Sk ⊆ S.
- Maintain a sequence of value vectors and policies:

(V0, d0), (V1, d1), · · · , (Vk, dk), (Vk+1, dk+1), · · ·

We can generate this sequence in one of two ways:

(1) update Vk by v_{k+1}(s) = L_{dk} v_k(s) for s ∈ Sk; for the other states, the values stay the same.

(2) update dk by d_{k+1}(s) = argmax_{a∈As} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v_k(j) } for s ∈ Sk; for the other states, the decision rule stays the same.

For the variants:

(a) Let Sk = S, begin with (V0, −), and perform (2) then (1). What is this algorithm? V0 ⇒ apply L: find d0 such that L_{d0} V0 = LV0 ⇒ V1 = L_{d0} V0 = LV0.

This is value iteration.

(b) Begin with (−, d0) and perform (1) (to convergence) then (2). What is this algorithm? This becomes policy iteration.

(c) Perform mk iterations of (1) followed by (2). This is modified policy iteration, which is so far the best method for solving MDPs. The most important issue is the selection of mk: fixed? increasing? conditional?
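A sketch of modified policy iteration with a fixed mk (the choice mk = 5 and the test instance are arbitrary assumptions; the stopping rule mirrors the value-iteration one, and the rewards are kept nonnegative so the standard convergence conditions hold):

import numpy as np

def modified_policy_iteration(r, P, lam, m_k=5, eps=1e-6):
    """Modified policy iteration: m_k partial evaluation sweeps between improvements."""
    n_states = r.shape[0]
    v = np.zeros(n_states)
    while True:
        # Improvement step (2): greedy decision rule for the current v
        Q = r + lam * P @ v
        d = Q.argmax(axis=1)
        v_new = Q.max(axis=1)
        if np.abs(v_new - v).max() < eps * (1 - lam) / (2 * lam):
            return d, v_new
        # Partial evaluation step (1): m_k applications of L_d instead of an exact solve
        P_d = P[np.arange(n_states), d]
        r_d = r[np.arange(n_states), d]
        v = v_new
        for _ in range(m_k):
            v = r_d + lam * P_d @ v

rng = np.random.default_rng(4)
r = rng.uniform(size=(4, 3))
P = rng.dirichlet(np.ones(4), size=(4, 3))
print(modified_policy_iteration(r, P, lam=0.9))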


5. Linear Programming and MDPs

How do we find max{3, 5, 7, 9} with linear programming?

Please refer to section 6.9 in the textbook and in-class lecture notes.
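For the warm-up question, max{3, 5, 7, 9} is the optimal value of the linear program "minimize v subject to v ≥ 3, v ≥ 5, v ≥ 7, v ≥ 9"; the discounted-MDP LP generalizes this with one constraint per state-action pair, v(s) ≥ r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j). A quick sketch of the warm-up LP (assuming scipy is available):

from scipy.optimize import linprog

# max{3, 5, 7, 9} as an LP: minimize v subject to v >= each number,
# i.e., -v <= -number. The MDP LP has one such constraint per (s, a).
numbers = [3, 5, 7, 9]
res = linprog(c=[1.0],
              A_ub=[[-1.0]] * len(numbers),
              b_ub=[-x for x in numbers],
              bounds=[(None, None)])
print(res.x[0])   # 9.0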

6. Optimality of Structured Policies

Please refer to section 6.11 in the textbook and in-class lecture notes.
