
Reinforcement Learning Algorithms for Semi-Markov Decision Processes with Average Reward

Yanjie Li
Harbin Institute of Technology, Shenzhen Graduate School
Email: [email protected]

Abstract—In this paper, we study reinforcement learning (RL) algorithms for SMDPs with average reward from the perspective of performance sensitivity analysis. We present performance sensitivity analysis results for average-reward SMDPs and, on this basis, study two RL algorithms. One is a relative value iteration (RVI) RL algorithm, which avoids estimating the optimal average reward during learning. The other is a policy gradient estimation algorithm, which extends the policy gradient estimation algorithm for discrete-time Markov decision processes (MDPs) to SMDPs and requires only half the storage of the existing algorithm.

I. INTRODUCTION

Semi-Markov decision processes (SMDPs) are fundamental models for sequential decision-making problems in stochastic environments. Compared with Markov decision processes (MDPs), SMDPs allow more general sojourn-time distributions and thus have wider applications in many practical systems [1], [2]. Policy iteration (PI), value iteration (VI), linear programming (LP) and reinforcement learning (RL) [3], [4], [5], [6], [7], [8], [9], [10] are the main solution methods for SMDPs with average reward.

Recent work shows that many results in the performance optimization of stochastic systems can be derived and explained from a sensitivity point of view [11]. With the two performance sensitivity formulas, the performance difference formula and the performance derivative formula, policy iteration algorithms, policy gradient algorithms and RL algorithms can be developed for MDPs [11]. This sensitivity-based idea has been extended to SMDPs [12], providing a unified framework for policy iteration and sensitivity analysis of SMDPs with average- and discounted-reward criteria. This paper continues the sensitivity-based line of work on SMDPs in [12]: using the performance sensitivity perspective, we study RL algorithms for SMDPs.

RL algorithms have been successfully applied to MDPs [14], [15], [16], [17], [18], [19], [30], [21], [22], [23], [24]. Recently, RL algorithms have also been extended to SMDPs. These algorithms can be divided into two classes: value function based RL algorithms [7], [8], [9], [10] and policy gradient algorithms [27]. A policy iteration based RL algorithm was proposed in [25]. For average-reward SMDPs, the existing value function based RL algorithms generally need to estimate the optimal average reward [8], [9], [10], [25]. This estimate may converge slowly, since it has to wait until the learned policy approaches the optimal policy. A relative value iteration (RVI) RL algorithm for MDPs was considered in [23], which avoids estimating the optimal average reward in the MDP setting. However, the extension of RVI RL to SMDPs is not straightforward; the main difficulty is that the optimal average reward would have to be known beforehand [10]. Policy gradient methods are generally viewed as RL algorithms that search in policy space and are well studied for MDP problems [19], [26], [21], [22]. Value function approximation can be combined with policy gradient methods to obtain the advantages of both gradient estimation and value function approximation [28], [29], [30], [31]. Inspired by the work in [21], a policy gradient method for SMDPs with application to call admission control was introduced in [27].

In this paper, we develop an RVI RL algorithm for SMDPs based on the results of performance sensitivity analysis. In the new algorithm, the optimal average reward can be dropped directly by using relative value iteration, as in the MDP case, so it does not have to be estimated during learning. The RVI algorithm may exhibit a good convergence property. In addition, we propose a policy gradient algorithm for SMDPs based on the performance derivative formula. The new policy gradient algorithm requires half the storage of the algorithm in [27]: $2K+1$ memory units compared with $4K+2$, where $K$ is the number of parameters in the parameterized policy.

II. SEMI-MARKOV DECISION PROCESSES

Consider an SMDP [4] on state space $S = \{1, 2, \ldots, S\}$ with a finite action space denoted by $A$. Let $\tau_0, \tau_1, \ldots, \tau_n, \ldots$, with $\tau_0 = 0$, be the decision epochs and let $X_n$, $n = 0, 1, 2, \ldots$, denote the state at decision epoch $\tau_n$. At each decision epoch $\tau_n$, if the system is in state $X_n = i \in S$, an action $A_n = a$ is taken from an available action set $A(i) \subset A$ according to the current policy. As a consequence of choosing $a$, the next decision epoch occurs within $t$ time units and the state at that epoch equals $j$ with probability $p(j,t|i,a)$, that is,
$$p(j,t|i,a) = P\big(X_{n+1} = j,\ \tau_{n+1} - \tau_n \le t \mid X_n = i, A_n = a\big).$$
The probabilities $p(j,t|i,a)$, $i,j \in S$, $a \in A(i)$, are called the semi-Markov kernel. We refer to $\{X_0, X_1, \ldots\}$ as the embedded Markov chain of the SMDP.

Let $p(j|i,a)$ denote the probability that the embedded Markov chain occupies state $j$ at the next decision epoch when action $a$ is chosen in state $i$ at the current decision epoch; then $p(j|i,a) = p(j,\infty|i,a)$. Let $F(t|i,a)$ denote the probability that the next decision epoch occurs within $t$ time units after the current one, given that action $a$ is chosen from $A(i)$ in state $i$. Then
$$F(t|i,a) = \sum_{j\in S} p(j,t|i,a). \quad (1)$$

To avoid the possibility of an infinite number of decision epochs within a finite time, the following assumption is needed [4].

Assumption 1: There exist $\varepsilon > 0$ and $\delta > 0$ such that $F(\delta|i,a) \le 1 - \varepsilon$ for all $i \in S$ and $a \in A(i)$.

Between any two consecutive decision epochs $\tau_n$ and $\tau_{n+1}$, the system state may vary; this evolution is called the natural process and is denoted by $W_s$, $\tau_n \le s \le \tau_{n+1}$. At each decision epoch $\tau_n$, the system earns a fixed reward $f(X_n, A_n)$ and then accumulates additional reward at rate $c(W_s, X_n, A_n)$ until $\tau_{n+1}$. Let $r(i,a)$ denote the expected total reward between two decision epochs, given that the system occupies state $i$ and action $a$ is taken at the first of these epochs. Then
$$r(i,a) = f(i,a) + E^a_i\left\{\int_{\tau_n}^{\tau_{n+1}} c(W_s, i, a)\,ds\right\}, \quad (2)$$
where $E^a_i$ denotes the expectation with respect to the distribution $F(t|i,a)$ and the probability distribution of the natural process under action $a$. For each $i \in S$ and $a \in A(i)$, define $\tau(i,a)$ by
$$\tau(i,a) = E^a_i\{\tau_{n+1} - \tau_n\} = \int_0^\infty t\, F(dt|i,a), \quad (3)$$
which is the expected length of time until the next decision epoch, given that action $a$ is taken in state $i$ at the current decision epoch.

Let $\Pi_m$ be the set of all stationary Markov policies. A policy $\mu \in \Pi_m$ chooses an action $a$ from $A(i)$ with probability $\mu(a|i)$ whenever the state is $i \in S$ at a decision epoch; thus $\sum_{a\in A(i)} \mu(a|i) = 1$. For a given stationary Markov policy $\mu \in \Pi_m$, the SMDP evolves according to the semi-Markov kernel $p(j,t|i,\mu) = \sum_{a\in A(i)} \mu(a|i)\, p(j,t|i,a)$, and the embedded Markov chain evolves according to the transition probability matrix $P^\mu = [p(j|i,\mu)]$ with
$$p(j|i,\mu) = \sum_{a\in A(i)} \mu(a|i)\, p(j|i,a). \quad (4)$$
Moreover, given that the system occupies state $i$ at the current decision epoch, the expected total reward and the expected length of time until the next decision epoch under a policy $\mu \in \Pi_m$ are
$$r^\mu(i) = \sum_{a\in A(i)} \mu(a|i)\, r(i,a), \qquad \tau^\mu(i) = \sum_{a\in A(i)} \mu(a|i)\, \tau(i,a), \quad (5)$$
respectively. Let $r^\mu$ and $\tau^\mu$ denote the corresponding column vectors.

When a policy is deterministic, i.e., it chooses one action with probability 1 in every state, we denote it by $v$, a mapping $v: S \to A$ that specifies for each state $i$ an action $v(i) \in A(i)$ taken with probability 1. Under a deterministic policy $v$, the quantities $p(j,t|i,\mu)$, $p(j|i,\mu)$, $r^\mu(i)$, and $\tau^\mu(i)$ become $p(j,t|i,v(i))$, $p(j|i,v(i))$, $r(i,v(i))$, and $\tau(i,v(i))$, respectively. Let $\Pi_d$ denote the set of all deterministic stationary policies; clearly $\Pi_d \subset \Pi_m$.

We assume that the embedded Markov chain is ergodic under any policy $\mu \in \Pi_m$. Let $\pi^\mu = (\pi^\mu(1), \pi^\mu(2), \ldots, \pi^\mu(S))$ denote the (row) vector of steady-state probabilities of the embedded Markov chain; then $\pi^\mu P^\mu = \pi^\mu$ and $\pi^\mu e = 1$, where $e$ is the column vector of all ones. Let $\sigma_s$ denote the number of decision epochs up to time $s$. The infinite-horizon average reward is defined as [4]
$$\eta^\mu = \frac{\pi^\mu r^\mu}{\pi^\mu \tau^\mu}. \quad (6)$$
If a policy $\mu^*$ satisfies $\eta^{\mu^*} \ge \eta^\mu$ for all policies $\mu \in \Pi_m$, we call it an average-reward optimal policy and the corresponding average reward $\eta^{\mu^*}$ is called the optimal average reward.
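To make these definitions concrete, here is a minimal sketch (Python/NumPy; the three-state, two-action model data are invented for illustration and are not the example used later in this paper) that builds $P^\mu$, $r^\mu$ and $\tau^\mu$ from the primitives via (4) and (5), computes the steady-state distribution of the embedded chain, and evaluates the average reward (6).

```python
import numpy as np

# Hypothetical 3-state, 2-action SMDP primitives (illustrative numbers only).
p = np.array([  # p[a, i, j] = p(j | i, a): embedded-chain transition probabilities
    [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4]],
    [[0.1, 0.7, 0.2], [0.4, 0.4, 0.2], [0.5, 0.2, 0.3]],
])
r   = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])   # r[i, a]: expected reward between epochs
tau = np.array([[1.0, 2.0], [1.5, 1.0], [2.0, 1.2]])   # tau[i, a]: expected sojourn time

mu = np.array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]])    # mu[i, a]: randomized stationary policy

# Equations (4) and (5): policy-averaged transition matrix, reward and sojourn time.
P_mu   = np.einsum('ia,aij->ij', mu, p)
r_mu   = np.einsum('ia,ia->i', mu, r)
tau_mu = np.einsum('ia,ia->i', mu, tau)

# Steady-state distribution of the embedded chain: pi P = pi, pi e = 1.
def stationary(P):
    n = P.shape[0]
    A_ = np.vstack([P.T - np.eye(n), np.ones(n)])
    b  = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A_, b, rcond=None)[0]

pi_mu = stationary(P_mu)

# Equation (6): average reward of the SMDP under mu.
eta_mu = pi_mu @ r_mu / (pi_mu @ tau_mu)
print("pi:", pi_mu, " eta:", eta_mu)
```

Applying the same routine to every deterministic policy would allow a brute-force search for the optimal average reward in small examples.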

III. PERFORMANCE SENSITIVITY ANALYSIS AND OPTIMALITY EQUATION

For the SMDP introduced in Section II, an equivalent infinitesimal generator $A^\mu = [A^\mu(j|i)]$, $i,j \in S$, was defined in [12] as
$$A^\mu = \Gamma^\mu[P^\mu - I],$$
where $\Gamma^\mu$ is the diagonal matrix with diagonal entries $\frac{1}{\tau^\mu(1)}, \ldots, \frac{1}{\tau^\mu(S)}$ and $I$ is the identity matrix. Based on the infinitesimal generator, the average-reward performance difference between two policies $w \in \Pi_m$ and $\mu \in \Pi_m$ was given by the following lemma [12].

Lemma 1: For any policies $w \in \Pi_m$ and $\mu \in \Pi_m$,
$$\eta^w - \eta^\mu = p^w\left[\Gamma^w r^w - \Gamma^\mu r^\mu + (A^w - A^\mu) g^\mu\right], \quad (7)$$
where $p^w$ is the steady-state distribution of $A^w$, i.e., $p^w A^w = 0$, $p^w e = 1$, and $g^\mu$ is the performance potential satisfying the Poisson equation
$$A^\mu g^\mu = -\Gamma^\mu r^\mu + \eta^\mu e. \quad (8)$$

On the basis of the performance difference formula (7), a policy iteration algorithm was derived in [12] by noting that $p^w > 0$ (i.e., every component of $p^w$ is positive). Moreover, the following optimality condition can be obtained in the same way as its MDP counterpart in [11, p. 187].

Lemma 2: A policy $v^*$ is average-reward optimal if and only if, for any policy $v \in \Pi_d$,
$$\Gamma^{v^*} r^{v^*} + A^{v^*} g^{v^*} \ge \Gamma^v r^v + A^v g^{v^*}. \quad (9)$$

From the Poisson equation (8), we have $\Gamma^{v^*} r^{v^*} + A^{v^*} g^{v^*} = \eta^{v^*} e$. Thus, the optimality condition (9) can be rewritten as the following Bellman optimality equation.

Theorem 1: A policy $v^*$ is average-reward optimal if and only if
$$0 = \max_{v\in\Pi_d}\left\{\Gamma^v r^v + A^v g^{v^*} - \eta^{v^*} e\right\} = \max_{v\in\Pi_d}\left\{\Gamma^v r^v + \Gamma^v(P^v - I)g^{v^*} - \eta^{v^*} e\right\}. \quad (10)$$

From the above discussion, the equivalent infinitesimal generator $A^v$ plays an important role in the optimality equation. This analysis is based on the equivalent continuous-time Markov chain. Next, we introduce another sensitivity result, based on the embedded Markov chain [13], which provides an intuitive explanation for the policy iteration and optimality equations in [4] and [5].

Theorem 2: For any policies $w \in \Pi_m$ and $\mu \in \Pi_m$,
$$\eta^w - \eta^\mu = \frac{\pi^w}{\pi^w \tau^w}\left\{(r^w - r^\mu) - (\tau^w - \tau^\mu)\eta^\mu + (P^w - P^\mu)g^\mu\right\}. \quad (11)$$

On the basis of the performance difference formula (11), the following optimality condition can be obtained in the same way as Lemma 2.

Lemma 3: A policy $v^*$ is average-reward optimal if and only if, for any policy $v \in \Pi_d$,
$$r^{v^*} - \eta^{v^*}\tau^{v^*} + P^{v^*} g^{v^*} \ge r^v - \eta^{v^*}\tau^v + P^v g^{v^*}. \quad (12)$$

By using a variation of the Poisson equation (8) at the optimal policy $v^*$, i.e., $r^{v^*} - \eta^{v^*}\tau^{v^*} + P^{v^*} g^{v^*} = g^{v^*}$, we obtain another form of the Bellman optimality equation.

Theorem 3: A policy $v^*$ is average-reward optimal if and only if
$$g^{v^*} = \max_{v\in\Pi_d}\left\{r^v - \eta^{v^*}\tau^v + P^v g^{v^*}\right\}. \quad (13)$$

Equation (13) has been widely used in value function based RL for SMDPs with average reward [8], [9], [10]. The application of the Bellman optimality equation (10) in RL has not been considered. In the next section, we propose a new RL algorithm based on equation (10).
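As a numerical sanity check on Theorem 2, the sketch below (again with invented model data, not taken from the paper) solves the embedded-chain form of the Poisson equation, $g^\mu = r^\mu - \eta^\mu\tau^\mu + P^\mu g^\mu$, for one policy and verifies that the right-hand side of the difference formula (11) reproduces $\eta^w - \eta^\mu$ for a second policy.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an ergodic transition matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def potential(P, r, tau):
    """Average reward eta and potential g solving g = r - eta*tau + P g (with pi g = 0)."""
    pi = stationary(P)
    eta = pi @ r / (pi @ tau)
    n = P.shape[0]
    # (I - P + e pi) is invertible for ergodic P; the solution satisfies pi g = 0.
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), r - eta * tau)
    return eta, g, pi

# Two hypothetical policies mu and w on a 3-state SMDP (embedded-chain data).
P_mu = np.array([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4]])
r_mu = np.array([1.0, 0.5, 2.0]); tau_mu = np.array([1.0, 1.5, 2.0])
P_w  = np.array([[0.1, 0.7, 0.2], [0.4, 0.4, 0.2], [0.5, 0.2, 0.3]])
r_w  = np.array([2.0, 1.5, 0.3]); tau_w  = np.array([2.0, 1.0, 1.2])

eta_mu, g_mu, _  = potential(P_mu, r_mu, tau_mu)
eta_w,  _,  pi_w = potential(P_w,  r_w,  tau_w)

# Right-hand side of the difference formula (11).
rhs = pi_w @ ((r_w - r_mu) - (tau_w - tau_mu) * eta_mu + (P_w - P_mu) @ g_mu) / (pi_w @ tau_w)
print(eta_w - eta_mu, rhs)   # the two numbers agree
```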

IV. RVI REINFORCEMENT LEARNING

An important difference between (10) and (13) is that in (13) the optimal average reward $\eta^{v^*}$ is coupled with the expected sojourn time $\tau^v$, whereas in (10) the term $\eta^{v^*} e$ stands alone. This feature of (10) makes it possible to obtain an RVI RL algorithm. The next theorem provides a variation of the Bellman equation (10).

Theorem 4: A policy $v^*$ is average-reward optimal if and only if
$$\tilde g^{v^*} = \max_{v\in\Pi_d}\left\{\Gamma^v r^v + \lambda\Gamma^v P^v \tilde g^{v^*} + (I - \lambda\Gamma^v)\tilde g^{v^*} - \eta^{v^*} e\right\}, \quad (14)$$
where $\lambda$ is a constant and $\tilde g^{v^*} = \lambda^{-1} g^{v^*}$.

Proof: From the Bellman equation (10), we have
$$\eta^{v^*} e = \max_{v\in\Pi_d}\left\{\Gamma^v r^v + \lambda\Gamma^v\left[P^v - I\right]\tilde g^{v^*}\right\}. \quad (15)$$
Adding $\tilde g^{v^*}$ to both sides of (15) yields (14), since $\tilde g^{v^*}$ and $\eta^{v^*}$ do not depend on $v$.
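When the model is known, the fixed point of (14) can be computed by ordinary relative value iteration on the transformed quantities. The sketch below is one way to do this (invented model data; it assumes $0 < \lambda < \min_{i,a}\tau(i,a)$ so that the weight $1 - \lambda/\tau(i,a)$ stays positive, together with the usual unichain/aperiodicity conditions for relative value iteration): it iterates the right-hand side of (14) with the unknown $\eta^{v^*}$ dropped and subtracts the value at a reference state, which then converges to $\eta^{v^*}$.

```python
import numpy as np

# Hypothetical SMDP model (illustrative): p[a, i, j], r[i, a], tau[i, a].
p = np.array([
    [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4]],
    [[0.1, 0.7, 0.2], [0.4, 0.4, 0.2], [0.5, 0.2, 0.3]],
])
r   = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])
tau = np.array([[1.0, 2.0], [1.5, 1.0], [2.0, 1.2]])

S, A = r.shape
lam = 0.9                      # assumed to satisfy 0 < lam < min tau(i, a)
i_ref = 0                      # reference state for the relative iteration

g = np.zeros(S)
for _ in range(2000):
    # Right-hand side of (14) for every (i, a), with the unknown eta* dropped.
    Tg = np.empty((S, A))
    for a in range(A):
        Tg[:, a] = (r[:, a] / tau[:, a]
                    + (lam / tau[:, a]) * (p[a] @ g)
                    + (1.0 - lam / tau[:, a]) * g)
    Tg_max = Tg.max(axis=1)
    eta_est = Tg_max[i_ref]    # converges to the optimal average reward eta*
    g = Tg_max - eta_est       # relative value iteration keeps the iterates bounded

v_star = Tg.argmax(axis=1)     # greedy policy from the last iterate
print("estimated eta*:", eta_est, " greedy actions:", v_star)
```

The learning algorithm developed next replaces these model quantities with sample-based estimates.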

Equation (14) suggests an RVI RL algorithm. We define an optimal Q-factor $\tilde Q^{v^*}(i,a)$ for each state-action pair $(i,a)$, $i \in S$, $a \in A(i)$, as follows:
$$\tilde Q^{v^*}(i,a) = \frac{r(i,a)}{\tau(i,a)} + \frac{\lambda}{\tau(i,a)}\sum_{j\in S} p(j|i,a)\,\tilde g^{v^*}(j) + \left(1 - \frac{\lambda}{\tau(i,a)}\right)\tilde g^{v^*}(i) - \eta^{v^*}, \quad (16)$$

which differs from the classical Q-factor for SMDPs defined as [8], [9], [10]
$$Q^{v^*}(i,a) = r(i,a) - \eta^{v^*}\tau(i,a) + \sum_{j\in S} p(j|i,a)\, g^{v^*}(j). \quad (17)$$

Taking the maximum over the action set $A(i)$ on both sides of (16) and using (14), we have
$$\tilde g^{v^*}(i) = \max_{a\in A(i)} \tilde Q^{v^*}(i,a) \quad \text{for any } i \in S. \quad (18)$$

Substituting (18) into (16), we obtain the optimality equation for the Q-factors:
$$\tilde Q^{v^*}(i,a) = \frac{r(i,a)}{\tau(i,a)} + \frac{\lambda}{\tau(i,a)}\sum_{j\in S} p(j|i,a)\max_{a'\in A(j)} \tilde Q^{v^*}(j,a') + \left(1 - \frac{\lambda}{\tau(i,a)}\right)\max_{a'\in A(i)} \tilde Q^{v^*}(i,a') - \eta^{v^*}. \quad (19)$$

If we can estimate the optimal Q-factors $\tilde Q^{v^*}(i,a)$, the optimal policy can be obtained by
$$v^*(i) \in \arg\max_{a\in A(i)} \tilde Q^{v^*}(i,a).$$

Define
$$m(n,i,a) = \sum_{l=0}^{n} I_{(i,a)}(X_l, A_l),$$
where $I_{(\cdot)}(\cdot)$ is the indicator function; thus $m(n,i,a)$ is the number of times the state-action pair $(i,a)$ has occurred up to decision epoch $n$. Let $R(X_n, A_n)$ denote the observed total reward between decision epochs $\tau_n$ and $\tau_{n+1}$, i.e., $R(X_n, A_n) = f(X_n, A_n) + \int_{\tau_n}^{\tau_{n+1}} c(W_s, X_n, A_n)\,ds$. Then we can design the following Q-learning algorithm:

$$\tilde Q(X_n, A_n) := \tilde Q(X_n, A_n) + \gamma\big(m(n, X_n, A_n)\big)\,\delta_n,$$
$$\delta_n = \frac{R(X_n, A_n)}{\tau(X_n, A_n)} + \frac{\lambda}{\tau(X_n, A_n)}\max_{a\in A(X_{n+1})}\tilde Q(X_{n+1}, a) + \left(1 - \frac{\lambda}{\tau(X_n, A_n)}\right)\max_{a\in A(X_n)}\tilde Q(X_n, a) - \bar\eta - \tilde Q(X_n, A_n), \quad (20)$$

where $\bar\eta$ is an estimate of the optimal average reward and $\gamma(n)$ are the learning rates, which satisfy
$$\sum_{n=0}^{\infty}\gamma(n) = \infty, \qquad \sum_{n=0}^{\infty}\gamma^2(n) < \infty.$$

Estimating the optimal average reward $\bar\eta$ is undesirable because its convergence to the optimal average reward might be slow. From the definition (16), the Q-factors are determined only up to an additive constant, so only their relative values matter for the choice of policy; in particular, the optimal policy does not change if the same constant is added to all Q-factors. We can therefore simply drop the term $\bar\eta$ in (20). This is the advantage of the Q-factor (16): in the classical Q-factor (17), the term $\eta^{v^*}\tau(i,a)$ cannot be omitted, since the optimal average reward $\eta^{v^*}$ is coupled with the expected sojourn time $\tau(i,a)$. Simply removing $\bar\eta$ from (20), however, can cause numerical instability, as the iterates may drift unboundedly.

We may choose a reference state-action pair $(i^*, a^*)$, apply relative value iteration, and obtain
$$\delta_n = \frac{R(X_n, A_n)}{\tau(X_n, A_n)} + \frac{\lambda}{\tau(X_n, A_n)}\max_{a}\tilde Q_n(X_{n+1}, a) + \left(1 - \frac{\lambda}{\tau(X_n, A_n)}\right)\max_{a}\tilde Q_n(X_n, a) - \tilde Q_n(X_n, A_n) - \tilde Q_n(i^*, a^*),$$
where the reference state $i^*$ and reference action $a^*$ can be chosen arbitrarily. Another way is to add a G-adjustment step [11] to keep the Q-factors from growing to infinity: if $\max_{i\in S, a\in A(i)} |\tilde Q_n(i,a)| > G$, set $\tilde Q_n(i,a) := \tilde Q_n(i,a) - \tilde Q_n(i^*, a^*)$ for all $i\in S$, $a\in A(i)$, where $G$ is a large constant.

Note that since $\tau(X_n, A_n)$ appears in the denominator, we cannot simply replace it with the observed sojourn time $\tau_{n+1} - \tau_n$ during learning; doing so would lead to a large variance. We therefore estimate $\tau(i,a)$ for each $i \in S$ and $a \in A(i)$, which is easily done with the sample mean $\frac{1}{m(n,i,a)}\sum_{l=0}^{n}(\tau_{l+1} - \tau_l)\, I_{(i,a)}(X_l, A_l)$. If each state-action pair occurs infinitely often, these estimates are consistent.
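A minimal sketch of the resulting RVI Q-learning loop is given below. The simulator `sample_smdp`, its placeholder model, the epsilon-greedy exploration and the step-size schedule are illustrative assumptions, not part of the algorithm specification; only the $\delta_n$ update mirrors the relative-value form of (20).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder model used only inside the toy simulator (not the paper's example).
P   = np.array([[[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4]],
                [[0.1, 0.7, 0.2], [0.4, 0.4, 0.2], [0.5, 0.2, 0.3]]])   # P[a, i, j]
TAU = np.array([[1.0, 2.0], [1.5, 1.0], [2.0, 1.2]])                    # mean sojourn times
RWD = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])                    # mean rewards

def sample_smdp(i, a):
    """Hypothetical simulator: returns (next state, sojourn time, observed reward R)."""
    j = int(rng.choice(3, p=P[a, i]))
    return j, rng.exponential(TAU[i, a]), RWD[i, a] + 0.1 * rng.standard_normal()

S, A, lam = 3, 2, 0.9
i_ref, a_ref = 0, 0                       # reference state-action pair (i*, a*)
Q = np.zeros((S, A))
tau_hat = np.ones((S, A))                 # running estimates of tau(i, a)
count = np.zeros((S, A))                  # visit counts m(n, i, a)
x = 0
for n in range(200000):
    a = int(rng.integers(A)) if rng.random() < 0.1 else int(Q[x].argmax())  # eps-greedy
    y, sojourn, R = sample_smdp(x, a)
    count[x, a] += 1
    tau_hat[x, a] += (sojourn - tau_hat[x, a]) / count[x, a]   # sample mean of sojourn times
    tau_ia = max(tau_hat[x, a], lam)      # illustrative safeguard: keep 1 - lam/tau >= 0
    gamma = 10.0 / (100.0 + count[x, a])  # step size gamma(m(n, i, a))
    w = lam / tau_ia
    delta = (R / tau_ia + w * Q[y].max() + (1.0 - w) * Q[x].max()
             - Q[x, a] - Q[i_ref, a_ref])                      # relative-value form of (20)
    Q[x, a] += gamma * delta
    x = y

print("greedy policy:", Q.argmax(axis=1))
```

The G-adjustment variant would simply replace the subtraction of $\tilde Q_n(i^*,a^*)$ in `delta` with an occasional shift of the whole Q-table.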

V. POLICY GRADIENT ESTIMATION

We assume that the Markov policy $\mu$ is parameterized and denoted by $\mu(a|i,\theta)$, $i\in S$, $a\in A(i)$, where $\theta$ is the parameter. All quantities associated with the policy are then functions of $\theta$; the transition matrix and steady-state probability of the embedded Markov chain, the average reward, the performance potential, and the expected total reward and expected sojourn time between two decision epochs are denoted by $P(\theta)$, $\pi(\theta)$, $\eta(\theta)$, $g(\theta)$, $r(\theta)$ and $\tau(\theta)$, respectively. We assume that $\mu(a|i,\theta)$ is differentiable with respect to $\theta$ for every $i\in S$, $a\in A(i)$. Two parameter values $\theta_0$ and $\theta_1$ correspond to two different policies $\mu(a|i,\theta_0)$ and $\mu(a|i,\theta_1)$. Letting $\theta_1$ tend to $\theta_0$ in the difference formula (11), we obtain the performance gradient at $\theta = \theta_0$:

$$\frac{d\eta(\theta_0)}{d\theta} = \frac{\pi(\theta_0)}{\pi(\theta_0)\tau(\theta_0)}\left\{\frac{dr(\theta_0)}{d\theta} - \frac{d\tau(\theta_0)}{d\theta}\eta(\theta_0) + \frac{dP(\theta_0)}{d\theta}\, g(\theta_0)\right\}. \quad (21)$$

According to (4) and (5), the performance gradient (21) can be rewritten as
$$\frac{d\eta(\theta_0)}{d\theta} = \sum_{i\in S}\frac{\pi(i,\theta_0)}{\pi(\theta_0)\tau(\theta_0)}\sum_{a\in A(i)}\frac{d\mu(a|i,\theta_0)}{d\theta}\, Q^\mu(i,a), \quad (22)$$

where
$$Q^\mu(i,a) = r(i,a) - \tau(i,a)\eta(\theta_0) + \sum_{j\in S} p(j|i,a)\, g(j,\theta_0)$$
is the classical Q-factor of the SMDP under policy $\mu$. It is easy to show that
$$Q^\mu(i,a) = E^\mu_i\left\{\sum_{n=0}^{\infty}\left[r(X_n,A_n) - \tau(X_n,A_n)\eta(\theta_0)\right]\,\Big|\, A_0 = a\right\}.$$

Define $\nabla\mu(a|i,\theta) = \frac{d\mu(a|i,\theta)}{d\theta}$. From (22) and the basic formula in [22], we have
$$\frac{d\eta(\theta_0)}{d\theta} = \frac{1}{\pi(\theta_0)\tau(\theta_0)}\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1}\frac{\nabla\mu(A_n|X_n,\theta_0)}{\mu(A_n|X_n,\theta_0)}\, Q^\mu(X_n,A_n). \quad (23)$$

To ensure the existence of $\frac{\nabla\mu(a|i,\theta_0)}{\mu(a|i,\theta_0)}$, we make a standard assumption: for any $i\in S$, $a\in A(i)$, if $\mu(a|i,\theta) = 0$, then $\nabla\mu(a|i,\theta) = 0$.

Formula (23) is similar to the basic formula for MDPs in [22], except for the constant $\frac{1}{\pi(\theta_0)\tau(\theta_0)}$ and the Q-factors $Q^\mu(i,a)$. Following the same idea as in [22], we may replace $Q^\mu(X_n,A_n)$ in (23) by any sample-path-based estimate $\hat Q(X_n,A_n,X_{n+1},A_{n+1},\ldots)$ with $E[\hat Q(X_n,A_n,X_{n+1},A_{n+1},\ldots)|X_n,A_n] \approx Q^\mu(X_n,A_n)$. In this way, different sample-path-based estimates of $Q^\mu(X_n,A_n)$, including the discounted approximation, perturbation realization factors, and approximation by truncation [22], lead to different gradient estimation algorithms. Here we consider only the discounted approximation. Define the discounted Q-factor
$$Q^\mu_\beta(i,a) = E^\mu_i\left\{\sum_{n=0}^{\infty}\beta^n\left[r(X_n,A_n) - \tau(X_n,A_n)\eta(\theta_0)\right]\,\Big|\, A_0 = a\right\},$$
where $\beta$ is a discount factor with $0 < \beta < 1$. It is easy to show that $Q^\mu_\beta(i,a)$ converges to $Q^\mu(i,a)$ as $\beta$ approaches 1. Thus, we may let

$$\hat Q(X_n,A_n,X_{n+1},A_{n+1},\ldots) = \sum_{l=n}^{\infty}\beta^{l-n}\left[r(X_l,A_l) - \tau(X_l,A_l)\eta(\theta_0)\right]. \quad (24)$$

When $\beta$ is close to 1, the above approximation is a good estimate of $Q^\mu(X_n,A_n)$. The discount factor $\beta$ must be chosen carefully to balance the bias and variance of the estimate [21]. Substituting (24) into (23), we have
$$\frac{d\eta(\theta_0)}{d\theta} \approx C(\theta_0)\lim_{N\to\infty}\frac{1}{N}\sum_{l=0}^{N-1}\left[r(X_l,A_l) - \tau(X_l,A_l)\eta(\theta_0)\right]\sum_{n=0}^{l}\beta^{l-n}\frac{\nabla\mu(A_n|X_n,\theta_0)}{\mu(A_n|X_n,\theta_0)}, \quad (25)$$
where $C(\theta_0) = \frac{1}{\pi(\theta_0)\tau(\theta_0)}$. Note that $\eta(\theta_0)$ in (25) is generally unknown. Let

$$\eta_n = \frac{1}{\tau_n}\left\{\int_0^{\tau_n} c(W_s, X_{\sigma_s}, A_{\sigma_s})\,ds + \sum_{l=0}^{n-1} f(X_l, A_l)\right\}.$$

From ergodicity, we have $\lim_{n\to\infty}\eta_n = \eta(\theta_0)$ w.p.1. Thus, when $n$ is large enough, we may use $\eta_n$ as an estimate of $\eta(\theta_0)$. Moreover, $\eta_n$ can be computed recursively:
$$\eta_{n+1} = \eta_n + \frac{1}{\tau_{n+1}}\left[R(X_n,A_n) - (\tau_{n+1} - \tau_n)\eta_n\right]. \quad (26)$$

With (25) and (26), we can develop the following gradient estimation algorithm for SMDPs (GSMDP), which yields an estimate $\Delta_n$ of the performance gradient direction based on a sample path of the semi-Markov decision process. (Since $C(\theta_0)$ is a positive constant, omitting it does not affect the gradient direction.)

GSMDP Algorithm:
1) Given $\theta_0$ and a sample path $\{X_0, A_0, \tau_0, R(X_0,A_0), X_1, A_1, \tau_1, R(X_1,A_1), \ldots\}$ under policy $\mu(\cdot|\cdot,\theta_0)$, choose a discount factor $0 < \beta < 1$.
2) Set $Z_0 = 0$, $\eta_0 = 0$, $\Delta_0 = 0$ and $n = 0$.
3) For state $X_n$ and action $A_n$ at decision epoch $\tau_n$, with total reward $R(X_n,A_n)$ between decision epochs $\tau_n$ and $\tau_{n+1}$, update
$$Z_{n+1} = \beta Z_n + \frac{\nabla\mu(A_n|X_n,\theta_0)}{\mu(A_n|X_n,\theta_0)},$$
$$\eta_{n+1} = \eta_n + \frac{1}{\tau_{n+1}}\left[R(X_n,A_n) - (\tau_{n+1}-\tau_n)\eta_n\right],$$
$$\Delta_{n+1} = \Delta_n + \gamma_{n+1}\left\{\left[R(X_n,A_n) - \eta_{n+1}(\tau_{n+1}-\tau_n)\right]Z_{n+1} - \Delta_n\right\},$$

where $Z_{n+1}$ evaluates $\sum_{l=0}^{n}\beta^{n-l}\frac{\nabla\mu(A_l|X_l,\theta_0)}{\mu(A_l|X_l,\theta_0)}$. From the above algorithm, we see that the GSMDP algorithm is equivalent to the policy gradient algorithm for discrete-time MDPs in [21], [22] with a new reward function $\bar r(i,a) = r(i,a) - \tau(i,a)\eta$. The new reward function cannot be observed directly, but the stochastic observation between decision epochs $\tau_n$ and $\tau_{n+1}$, namely $R(X_n,A_n) - \eta_{n+1}(\tau_{n+1}-\tau_n)$, can be used in its place. From (23) and (25), $\Delta_n$ in the GSMDP algorithm converges to $\sum_{i\in S,\, a\in A(i)}\pi(i,\theta_0)\nabla\mu(a|i,\theta_0)\,Q^\mu_\beta(i,a)$, which is an approximation of the gradient direction, under the condition that the Markov chain $\{(X_n, A_n, \tau_{n+1}-\tau_n),\ n=0,1,\ldots\}$ is positive Harris [27].

The GSMDP algorithm extends the policy gradient algorithms in [21], [22] to SMDPs and can easily be applied to continuous-time Markov processes. Compared with the algorithm in [27], the new algorithm requires only half the memory. In [27], four quantities ($z^k$, $\Delta_c^k$, $\Delta_\tau^k$ and $\eta_\tau^k\Delta_c^k - \eta_c^k\Delta_\tau^k$) need to be estimated (computed) and stored for each parameter (cf. (9)-(13) in [27]), plus two average rewards ($\eta_c^k$ and $\eta_\tau^k$); the algorithm in [27] therefore needs $4K+2$ memory units when there are $K$ parameters. In the GSMDP algorithm, only two quantities ($Z^k$ and $\Delta^k$) need to be estimated for each parameter, plus one average reward ($\eta_n$), so it needs only $2K+1$ memory units. When the policy has many parameters, this saving in storage is very attractive.

VI. EXPERIMENTAL RESULTS

Consider the three-state SMDP in [13]. For this example, the optimal policy is $v^*(1)=3$, $v^*(2)=1$, $v^*(3)=2$ and the optimal average reward is $\eta^{v^*} = 2.0189$. We apply the RVI RL algorithm to learn the optimal Q-factors. With $\lambda = 0.9$, the simulation result is shown in Fig. 1(a); the learned policy is $v^*$. To compare our algorithm with the RL algorithm based on the classical Q-factor (17), we also simulate the RL algorithm with estimation of the optimal average reward; the results are shown in Fig. 1(b). Comparing Fig. 1(a) and Fig. 1(b), the RVI RL algorithm has a good convergence rate.

Moreover, we parameterize and randomize the policy using softmax functions. Consider the gradient estimation at $\theta = 0$ with $\beta = 0.9$ in the GSMDP algorithm; Fig. 1(c) shows the simulation results. Applying a gradient-based optimization algorithm with step size 1 to the three-state SMDP, the optimal average reward is attained, as shown in Fig. 1(d). In each gradient-based iteration, the GSMDP algorithm estimates the performance gradient from a sample path with 10000 state transitions.

[Fig. 1. Simulation results. (a) RVI reinforcement learning. (b) RL with estimation of the optimal average reward. (c) Gradient estimation at $\theta = 0$ with the GSMDP algorithm. (d) Optimization result: average reward versus gradient-based iterations.]

VII. CONCLUSION

In this paper, by analyzing two performance difference formulas, we presented two equivalent forms of the Bellman optimality equation. The Bellman optimality equation based on the equivalent infinitesimal generator led to an RVI RL algorithm, and experimental results showed that the RVI algorithm has a good convergence property. As a natural consequence of the performance difference formula based on the embedded Markov chain, a new performance derivative formula was established, and with it we proposed an on-line policy gradient estimation algorithm.


The sensitivity-based analysis approach of this paper fits well within the recently developed sensitivity-based learning and optimization framework [11] and provides new insight into SMDPs. Further research includes a convergence proof for the RVI RL algorithm, applications to practical systems, and hierarchical learning theory for Markov systems, for which the SMDP is the main mathematical framework.

ACKNOWLEDGMENT

The authors would like to thank the support from the NSFC (No. 61004036), the Doctoral Fund of the Ministry of Education (No. 20102302120071) and the Shenzhen Basic Research Project (No. JC201005260179A).

REFERENCES

[1] J. Janssen, Semi-Markov Models: Theory and Applications, New York: Springer, 1999.
[2] J. Janssen and R. Manca, Applied Semi-Markov Processes, New York: Springer, 2005.
[3] R. Howard, Dynamic Probabilistic Systems, Volume II: Semi-Markov and Decision Processes, New York: Wiley, 1971.
[4] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, New York: Wiley, 1994.
[5] H. C. Tijms, Stochastic Models: An Algorithmic Approach, New York: John Wiley & Sons, 1994.
[6] D. P. Bertsekas, Dynamic Programming and Optimal Control, Belmont, Massachusetts: Athena Scientific, 1995.
[7] S. J. Bradtke and M. O. Duff, "Reinforcement learning methods for continuous-time Markov decision problems," Proceedings of the Conference on Advances in Neural Information Processing Systems, vol. 7, pp. 393–400, 1995.
[8] S. Mahadevan, N. Marchalleck, T. Das, and A. Gosavi, "Self-improving factory simulation using continuous-time average-reward reinforcement learning," Proceedings of the International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, pp. 202–210, 1997.
[9] T. K. Das, A. Gosavi, S. Mahadevan, and N. Marchalleck, "Solving semi-Markov decision problems using average reward reinforcement learning," Management Science, vol. 45, no. 4, pp. 560–574, 1999.
[10] A. Gosavi, "Reinforcement learning for long-run average cost," European Journal of Operational Research, vol. 155, pp. 654–674, 2004.
[11] X. R. Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach, New York: Springer, 2007.
[12] X. R. Cao, "Semi-Markov decision problems and performance sensitivity analysis," IEEE Transactions on Automatic Control, vol. 48, no. 5, pp. 758–769, 2003.
[13] Y. J. Li and F. Cao, "Infinite-horizon gradient estimation for semi-Markov decision processes," 8th Asian Control Conference, Kaohsiung, 2011.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press, 1998.
[15] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Belmont, Massachusetts: Athena Scientific, 1996.
[16] R. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[17] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[18] S. Mahadevan, "Average reward reinforcement learning: foundations, algorithms, and empirical results," Machine Learning, vol. 22, pp. 159–196, 1996.
[19] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229–256, 1992.
[20] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Proceedings of the Conference on Advances in Neural Information Processing Systems, vol. 12, pp. 1057–1063, 2000.
[21] J. Baxter and P. L. Bartlett, "Infinite-horizon policy-gradient estimation," Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.
[22] X. R. Cao, "A basic formula for online policy gradient algorithms," IEEE Transactions on Automatic Control, vol. 50, no. 5, pp. 696–699, 2005.
[23] J. Abounadi, D. Bertsekas, and V. S. Borkar, "Learning algorithms for Markov decision processes with average cost," SIAM Journal on Control and Optimization, vol. 40, no. 3, pp. 681–698, 2001.
[24] D. Y. Dong, C. L. Chen, H. X. Li, and T. J. Tarn, "Quantum reinforcement learning," IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 38, no. 5, pp. 1207–1219, 2008.
[25] A. Gosavi, "A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis," Machine Learning, vol. 55, no. 1, pp. 5–29, 2004.
[26] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," IEEE Transactions on Automatic Control, vol. 46, no. 2, pp. 191–209, 2001.
[27] S. S. Singh, V. B. Tadic, and A. Doucet, "A policy gradient method for semi-Markov decision processes with application to call admission control," European Journal of Operational Research, vol. 178, pp. 808–818, 2007.
[28] H. Kimura, K. Miyazaki, and S. Kobayashi, "Reinforcement learning in POMDPs with function approximation," Proceedings of the International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, pp. 152–160, 1997.
[29] L. C. Baird and A. W. Moore, "Gradient descent for general reinforcement learning," Proceedings of the Conference on Advances in Neural Information Processing Systems, pp. 968–974, 1998.
[30] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Proceedings of the Conference on Advances in Neural Information Processing Systems, pp. 1057–1063, 1999.
[31] V. R. Konda and J. N. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
