Reinforcement Learning Algorithms for Semi-Markov Decision Processes with Average Reward

Yanjie Li
Harbin Institute of Technology Shenzhen Graduate School
Email: [email protected]
Abstract—In this paper, we study reinforcement learning (RL) algorithms for SMDPs with average reward from the perspective of performance sensitivity analysis. We present results on performance sensitivity analysis for SMDPs with average reward. On this basis, two RL algorithms for average-reward SMDPs are studied. One is a relative value iteration (RVI) RL algorithm, which avoids estimating the optimal average reward during learning. The other is a policy gradient estimation algorithm, which extends the policy gradient estimation algorithm for discrete-time Markov decision processes (MDPs) to SMDPs and requires only half the storage of the existing algorithm.
I. INTRODUCTION
Semi-Markov decision processes (SMDPs) are fundamental models for describing sequential decision-making problems in stochastic environments. Compared with Markov decision processes (MDPs), SMDPs allow more general sojourn-time distributions and thus have wider applications in many practical systems [1], [2]. Generally, policy iteration (PI), value iteration (VI), linear programming (LP) and reinforcement learning (RL) [3], [4], [5], [6], [7], [8], [9], [10] are the main solution methods for SMDPs with average reward.
Recent studies show that many results in the performance optimization of stochastic systems can be derived and explained from a sensitivity point of view [11]. With the performance sensitivity formulas, namely the performance difference formula and the performance derivative formula, policy iteration algorithms, policy gradient algorithms and RL algorithms can be developed for MDPs [11]. This sensitivity-based idea has been extended to SMDPs [12], which provides a unified framework for policy iteration and sensitivity analysis of SMDPs with average- and discounted-reward performance. This paper continues the sensitivity-based approach to SMDPs in [12]. From this performance sensitivity perspective, we study RL algorithms for SMDPs.
RL algorithms have been successfully applied to MDPs [14], [15], [16], [17], [18], [19], [30], [21], [22], [23], [24]. Recently, RL algorithms have also been extended to SMDPs. These algorithms can be divided into two classes: value-function-based RL algorithms [7], [8], [9], [10] and policy gradient algorithms [27]. A policy-iteration-based RL algorithm was proposed in [25]. For average-reward SMDPs, the existing value-function-based RL algorithms generally need to estimate the optimal average reward [8], [9], [10], [25]. The convergence of this estimate to the optimal average reward may be slow because it has to wait until the learned policy approaches the optimal policy. A relative value iteration (RVI) RL algorithm for MDPs was considered in [23], which avoids estimating the optimal average reward. However, the extension of RVI RL to SMDPs is not straightforward; the main difficulty is that the optimal average reward must be known beforehand [10]. Policy gradient methods are generally viewed as special RL algorithms in policy search space and are well studied for MDP problems [19], [26], [21], [22]. Value function approximation can be combined with policy gradient methods to gain the advantages of both gradient estimation and value function approximation [28], [29], [30], [31]. Inspired by the work in [21], a policy gradient method for SMDPs with application to call admission control was introduced in [27].
In this paper, we develop a RVI RL algorithm for SMDPs based on the results of performance sensitivity analysis. In the new algorithm, the optimal average reward can be directly omitted by relative value iteration, as in the MDP case, so we do not need to estimate the optimal average reward in the process of learning. The RVI algorithm may exhibit a good convergence property. In addition, we propose a policy gradient algorithm for SMDPs based on the performance derivative formula. The new policy gradient algorithm requires half the storage of the algorithm in [27]: 2K+1 memory units in comparison with 4K+2, where K is the number of parameters in the parameterized policy.
II. SEMI-MARKOV DECISION PROCESSES
Consider an SMDP [4] on state space S = {1, 2, . . . , S} with a finite action space A. Let τ_0, τ_1, . . . , τ_n, . . ., with τ_0 = 0, be the decision epochs and let X_n, n = 0, 1, 2, . . ., denote the state at decision epoch τ_n. At each decision epoch τ_n, if the system is in state X_n = i ∈ S, an action A_n = a is taken from an available action set A(i) ⊂ A according to the current policy. As a consequence of choosing a, the next decision epoch occurs within t time units and the system state at that epoch equals j with probability p(j, t|i, a) = P(X_{n+1} = j, τ_{n+1} − τ_n ≤ t | X_n = i, A_n = a). The probabilities p(j, t|i, a), i, j ∈ S, a ∈ A(i), are called the semi-Markov kernel. We refer to {X_0, X_1, . . .} as
the embedded Markov chain of the SMDP. Let p(j|i, a) denote the probability that the embedded Markov chain occupies state j at the subsequent decision epoch when a is chosen in state i at the current decision epoch. Then we have p(j|i, a) = p(j, ∞|i, a). Let F(t|i, a) denote the probability that the next decision epoch occurs within t time units after the current decision epoch, given that action a is chosen from A(i) in state i at the current decision epoch. Then we have

F(t|i, a) = Σ_{j∈S} p(j, t|i, a).   (1)
To avoid the possibility of an infinite number of decision epochs within finite time, the following assumption is needed [4]:

Assumption 1: There exist ε > 0 and δ > 0 such that F(δ|i, a) ≤ 1 − ε for all i ∈ S and a ∈ A(i).

Between any two consecutive decision epochs τ_n and τ_{n+1}, the system state may vary. This evolution is called the natural process, denoted by W_s, τ_n ≤ s ≤ τ_{n+1}. At each decision epoch τ_n, the system generates a fixed reward f(X_n, A_n) and accumulates additional reward at rate c(W_s, X_n, A_n) until τ_{n+1}. Let r(i, a) denote the expected total reward between two decision epochs, given that the system occupies state i and action a is taken at the first decision epoch. Then we have

r(i, a) = f(i, a) + E_i^a {∫_{τ_n}^{τ_{n+1}} c(W_s, i, a) ds},   (2)
where E_i^a denotes the expectation with respect to the distribution F(t|i, a) and the probability distribution of the natural process under action a. For each i ∈ S and a ∈ A(i), define τ(i, a) by

τ(i, a) = E_i^a {τ_{n+1} − τ_n} = ∫_0^∞ t F(dt|i, a),   (3)

which denotes the expected length of time until the next decision epoch, given that action a is taken in state i at the current decision epoch.
Let Π_m be the set of all stationary Markov policies. A policy μ ∈ Π_m chooses an action a from A(i) with probability μ(a|i) whenever the state is i ∈ S at a decision epoch; thus Σ_{a∈A(i)} μ(a|i) = 1. For a given stationary Markov policy μ ∈ Π_m, the SMDP evolves according to the semi-Markov kernel p(j, t|i, μ) = Σ_{a∈A(i)} μ(a|i) p(j, t|i, a) and the embedded Markov chain evolves according to the transition probability matrix P^μ = [p(j|i, μ)] with

p(j|i, μ) = Σ_{a∈A(i)} μ(a|i) p(j|i, a).   (4)
Moreover, given that the system occupies state i at the current decision epoch, the expected total reward and the expected length of time between two decision epochs under a policy μ ∈ Π_m are

r^μ(i) = Σ_{a∈A(i)} μ(a|i) r(i, a),   τ^μ(i) = Σ_{a∈A(i)} μ(a|i) τ(i, a),   (5)

respectively. Let r^μ and τ^μ denote the corresponding column vectors.

A policy μ is deterministic if it chooses an action with probability 1 in every state. In this case we denote the policy by v, a mapping v : S → A; that is, for any state i, v specifies an action v(i) := a ∈ A(i) with probability 1. Under a deterministic policy v, p(j, t|i, μ), p(j|i, μ), r^μ(i) and τ^μ(i) become p(j, t|i, v(i)), p(j|i, v(i)), r(i, v(i)) and τ(i, v(i)), respectively. Let Π_d denote the set of all such deterministic stationary policies. Obviously, Π_d ⊂ Π_m.

We assume that the embedded Markov chain is ergodic under every policy μ ∈ Π_m. Let π^μ = (π^μ(1), π^μ(2), . . . , π^μ(S)) denote the (row) vector of steady-state probabilities of the embedded Markov chain; then π^μ P^μ = π^μ and π^μ e = 1, where e denotes a column vector of all ones. Let σ_s denote the number of decision epochs up to time s. The infinite-horizon average reward is defined as [4]

η^μ = (π^μ r^μ) / (π^μ τ^μ).   (6)
If a policy μ∗ satisfies η^{μ∗} ≥ η^μ for all policies μ ∈ Π_m, we call it an average-reward optimal policy, and the corresponding average reward η^{μ∗} is called the optimal average reward.
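To make the quantities in (4)-(6) concrete, the following minimal numpy sketch computes the average reward η^μ of a randomized policy from the per-action model. The array names (P_a, r_a, tau_a, mu) are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def average_reward(P_a, r_a, tau_a, mu):
    """Average reward eta^mu = (pi^mu r^mu) / (pi^mu tau^mu), cf. (4)-(6).

    P_a[a]     : S x S embedded transition matrices p(j|i, a)
    r_a[a, i]  : expected total rewards r(i, a) between decision epochs
    tau_a[a, i]: expected sojourn times tau(i, a)
    mu[i, a]   : randomized stationary policy mu(a|i)
    """
    S = mu.shape[0]
    P = np.einsum('ia,aij->ij', mu, P_a)    # transition matrix, equation (4)
    r = np.einsum('ia,ai->i', mu, r_a)      # r^mu, equation (5)
    tau = np.einsum('ia,ai->i', mu, tau_a)  # tau^mu, equation (5)
    # Steady state of the embedded chain: pi P = pi, pi e = 1.
    A = np.vstack([P.T - np.eye(S), np.ones(S)])
    b = np.zeros(S + 1); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return (pi @ r) / (pi @ tau)            # eta^mu, equation (6)
```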
III. PERFORMANCE SENSITIVITY ANALYSIS AND OPTIMALITY EQUATION
For the SMDP introduced in Section II, an equivalent infinitesimal generator A^μ = [A^μ(j|i)], i, j ∈ S, was defined in [12] as

A^μ = Γ^μ [P^μ − I],

where Γ^μ is a diagonal matrix with diagonal components 1/τ^μ(1), . . . , 1/τ^μ(S) and I is the identity matrix. Based on this infinitesimal generator, the average-reward performance difference under two policies w ∈ Π_m and μ ∈ Π_m was given by the following lemma [12].

Lemma 1: For any policies w ∈ Π_m and μ ∈ Π_m,

η^w − η^μ = p^w [Γ^w r^w − Γ^μ r^μ + (A^w − A^μ) g^μ],   (7)

where p^w is the steady-state distribution of A^w, i.e., p^w A^w = 0, p^w e = 1, and g^μ is the performance potential satisfying the Poisson equation

A^μ g^μ = −Γ^μ r^μ + η^μ e.   (8)
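As a numerical illustration of Lemma 1, the following sketch solves the Poisson equation (8) for a given policy. Since A^μ e = 0, the potential g^μ is determined only up to an additive constant; the normalization p^μ g^μ = 0 used below is one common choice, an assumption rather than the paper's prescription.

```python
import numpy as np

def potential(P, r, tau):
    """Solve the Poisson equation (8), A g = -Gamma r + eta e, for a policy
    given by its embedded transition matrix P, reward vector r, and expected
    sojourn times tau."""
    S = len(r)
    Gamma = np.diag(1.0 / tau)
    A = Gamma @ (P - np.eye(S))          # equivalent infinitesimal generator
    # Steady-state distribution p of A: p A = 0, p e = 1.
    M = np.vstack([A.T, np.ones(S)])
    b = np.zeros(S + 1); b[-1] = 1.0
    p = np.linalg.lstsq(M, b, rcond=None)[0]
    eta = p @ Gamma @ r                  # left-multiply (8) by p and use pA = 0
    # A is singular (A e = 0); pin g down with the normalization p g = 0.
    M2 = np.vstack([A, p])
    b2 = np.append(-Gamma @ r + eta * np.ones(S), 0.0)
    g = np.linalg.lstsq(M2, b2, rcond=None)[0]
    return g, eta, p
```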
On the basis of the performance difference formula (7), a policy iteration algorithm was derived in [12] by noting that p^w > 0, i.e., every component of p^w is positive. Moreover, the following optimality-condition lemma can be obtained in the same way as its MDP version in [11, p. 187].

Lemma 2: A policy v∗ is the average-reward optimal policy if and only if, for any policy v ∈ Π_d,

Γ^{v∗} r^{v∗} + A^{v∗} g^{v∗} ≥ Γ^v r^v + A^v g^{v∗}.   (9)
From the Poisson equation (8), we have Γ^{v∗} r^{v∗} + A^{v∗} g^{v∗} = η^{v∗} e. Thus, the optimality inequality (9) can be rewritten as the following Bellman optimality equation.
Theorem 1: A policy v∗ is the average-reward optimal policy if and only if

0 = max_{v∈Π_d} {Γ^v r^v + A^v g^{v∗} − η^{v∗} e}
  = max_{v∈Π_d} {Γ^v r^v + Γ^v (P^v − I) g^{v∗} − η^{v∗} e}.   (10)
From the above discussion, we find that the equivalent infinitesimal generator A^v plays an important role in the optimality equation. This analysis is based on the equivalent continuous-time Markov chain. Next, we introduce another sensitivity analysis result, based on the embedded Markov chain [13], which provides an intuitive explanation for the policy iteration and optimality equations in [4] and [5].
Theorem 2: For any policies w ∈ Π_m and μ ∈ Π_m,

η^w − η^μ = (π^w / (π^w τ^w)) {(r^w − r^μ) − (τ^w − τ^μ) η^μ + (P^w − P^μ) g^μ}.   (11)
On the basis of the performance difference formula (11), the following optimality condition can be obtained similarly to Lemma 2.
Lemma 3: A policy v∗ is the average-reward optimal policy if and only if, for any policy v ∈ Π_d,

r^{v∗} − η^{v∗} τ^{v∗} + P^{v∗} g^{v∗} ≥ r^v − η^{v∗} τ^v + P^v g^{v∗}.   (12)
By using the variation of the Poisson equation (8) with the optimal policy v∗, i.e., r^{v∗} − η^{v∗} τ^{v∗} + P^{v∗} g^{v∗} = g^{v∗}, we obtain another form of the Bellman optimality equation.
Theorem 3: A policy v∗ is the average-reward optimal policy if and only if

g^{v∗} = max_{v∈Π_d} {r^v − η^{v∗} τ^v + P^v g^{v∗}}.   (13)
Equation (13) has been widely applied in value-function-based RL for SMDPs with average reward [8], [9], [10]. The application of the Bellman optimality equation (10) in RL has not been considered. In the next section, we propose a new RL algorithm based on equation (10).
IV. RVI REINFORCEMENT LEARNING
An important difference between (10) and (13) is that in (13) the optimal average reward η^{v∗} is coupled with the expected sojourn time τ^v, while in (10) the optimal average reward η^{v∗} appears as a separate term. This property of (10) facilitates an RVI RL algorithm. The next theorem provides a variation of the Bellman equation (10).
Theorem 4: A policy v∗ is the average-reward optimal policy if and only if

g̃^{v∗} = max_{v∈Π_d} {Γ^v r^v + λ Γ^v P^v g̃^{v∗} + (I − λΓ^v) g̃^{v∗} − η^{v∗} e},   (14)

where λ is a constant and g̃^{v∗} = λ^{−1} g^{v∗}.
Proof: From the Bellman equation (10), we have

η^{v∗} e = max_{v∈Π_d} {Γ^v r^v + λ Γ^v [P^v − I] g̃^{v∗}}.   (15)

Adding g̃^{v∗} to both sides of (15), we obtain (14) by noting that g̃^{v∗} and η^{v∗} do not depend on v.
Equation (14) suggests an RVI RL algorithm. We define an optimal Q-factor Q̃^{v∗}(i, a) for each state-action pair (i, a), i ∈ S, a ∈ A(i), as follows:

Q̃^{v∗}(i, a) = r(i, a)/τ(i, a) + (λ/τ(i, a)) Σ_{j∈S} p(j|i, a) g̃^{v∗}(j) + (1 − λ/τ(i, a)) g̃^{v∗}(i) − η^{v∗},   (16)
which is different from the classical Q-factor for SMDPs, defined as [8], [9], [10]:

Q^{v∗}(i, a) = r(i, a) − η^{v∗} τ(i, a) + Σ_{j∈S} p(j|i, a) g^{v∗}(j).   (17)
Taking the maximum over the action space A(i) on both sides of (16) and using (14), we have

g̃^{v∗}(i) = max_{a∈A(i)} Q̃^{v∗}(i, a)   for any i ∈ S.   (18)
Substituting (18) into (16), we obtain the optimality equation for Q-factors:

Q̃^{v∗}(i, a) = r(i, a)/τ(i, a) + (λ/τ(i, a)) Σ_{j∈S} p(j|i, a) max_{a′∈A(j)} Q̃^{v∗}(j, a′) + (1 − λ/τ(i, a)) max_{a′∈A(i)} Q̃^{v∗}(i, a′) − η^{v∗}.   (19)
If we can estimate the optimal Q-factors Q̃^{v∗}(i, a), the optimal policy can be obtained by

v∗(i) ∈ arg max_{a∈A(i)} Q̃^{v∗}(i, a).
Define

m(n, i, a) = Σ_{l=0}^{n} I_{(i,a)}(X_l, A_l),

where I_{(·)}(·) is an indicator function, so that m(n, i, a) is the number of times the state-action pair (i, a) occurs up to time n. Let R(X_n, A_n) denote the observed total reward between decision epochs τ_n and τ_{n+1}, i.e., R(X_n, A_n) = f(X_n, A_n) + ∫_{τ_n}^{τ_{n+1}} c(W_s, X_n, A_n) ds. Then we can design the following Q-learning algorithm:

Q̃(X_n, A_n) := Q̃(X_n, A_n) + γ(m(n, X_n, A_n)) δ_n,

δ_n = R(X_n, A_n)/τ(X_n, A_n) + (λ/τ(X_n, A_n)) max_{a∈A(X_{n+1})} Q̃(X_{n+1}, a) + (1 − λ/τ(X_n, A_n)) max_{a∈A(X_n)} Q̃(X_n, a) − η̄ − Q̃(X_n, A_n),   (20)
where η̄ is an estimate of the optimal average reward and the γ(n) are learning rates satisfying

Σ_{n=0}^{∞} γ(n) = ∞,   Σ_{n=0}^{∞} γ²(n) < ∞.
The estimation of the optimal average reward η̄ is undesirable because its convergence to the optimal average reward may be slow. From the definition (16), the Q-factors are determined only up to an additive constant, so only the relative values of the Q-factors play a role in the choice of policy. In particular, the optimal policy does not change if a constant is added to all Q-factors. Thus, we can directly remove the term η̄ in (20). This property shows the advantage of the Q-factor (16). In the classical Q-factor (17), the term η^{v∗} τ(i, a) cannot be omitted since the optimal average reward η^{v∗} is coupled with the expected time length τ(i, a).

However, simply removing η̄ from (20) can result in numerical instability. We may instead choose a reference state-action pair (i∗, a∗), apply relative value iteration and obtain

δ_n = R(X_n, A_n)/τ(X_n, A_n) + (λ/τ(X_n, A_n)) max_a Q̃_n(X_{n+1}, a) + (1 − λ/τ(X_n, A_n)) max_a Q̃_n(X_n, a) − Q̃_n(X_n, A_n) − Q̃_n(i∗, a∗),
where the reference state i∗ and reference action a∗ can be chosen arbitrarily. Another way to keep the Q-factors from growing without bound is to add a G-adjustment step [11]:

if max_{i∈S, a∈A(i)} |Q̃_n(i, a)| > G, set Q̃_n(i, a) := Q̃_n(i, a) − Q̃_n(i∗, a∗) for all i ∈ S, a ∈ A(i),

where G is a large constant.
Note that since τ(X_n, A_n) appears in the denominator, we cannot directly replace it by the stochastic sojourn time τ_{n+1} − τ_n in the process of learning; doing so would lead to a large variance. Thus, we need to estimate τ(i, a) for each i ∈ S and a ∈ A(i). These estimates are easily implemented as (1/m(n, i, a)) Σ_{l=0}^{n} (τ_{l+1} − τ_l) I_{(i,a)}(X_l, A_l). If each state-action pair occurs infinitely often, the estimates are consistent.
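Putting the pieces of this section together, a minimal sketch of the resulting RVI Q-learning algorithm might look as follows. The environment interface env.step, the ε-greedy exploration and the learning rate γ(m) = 1/m are illustrative assumptions; the update itself follows (20) with the relative-value term Q̃_n(i∗, a∗) in place of η̄ and with τ(i, a) estimated online as described above.

```python
import numpy as np

def rvi_q_learning(env, num_states, num_actions, lam=0.9, num_steps=100_000,
                   ref=(0, 0), eps=0.1):
    """Sketch of RVI Q-learning for average-reward SMDPs, following (20) with
    the relative-value-iteration term Q(i*, a*) substituted for eta-bar.

    Assumed interface: env.step(i, a) -> (j, sojourn, reward), i.e. the next
    state, the sojourn time tau_{n+1} - tau_n, and the total reward R(i, a).
    """
    Q = np.zeros((num_states, num_actions))
    tau_hat = np.ones((num_states, num_actions))        # estimates of tau(i, a)
    m = np.zeros((num_states, num_actions), dtype=int)  # visit counts m(n, i, a)
    i = 0
    for _ in range(num_steps):
        # epsilon-greedy exploration (an assumption; any scheme that visits all
        # state-action pairs infinitely often keeps the estimates consistent)
        a = np.random.randint(num_actions) if np.random.rand() < eps \
            else int(np.argmax(Q[i]))
        j, sojourn, reward = env.step(i, a)
        m[i, a] += 1
        # running estimate of tau(i, a); using the raw sojourn time in the
        # denominator of (20) would give a large variance
        tau_hat[i, a] += (sojourn - tau_hat[i, a]) / m[i, a]
        t = tau_hat[i, a]
        delta = (reward / t
                 + (lam / t) * Q[j].max()
                 + (1.0 - lam / t) * Q[i].max()
                 - Q[i, a]
                 - Q[ref])                  # relative value iteration term
        Q[i, a] += delta / m[i, a]          # learning rate gamma = 1/m(n, i, a)
        i = j
    return Q
```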
V. POLICY GRADIENT ESTIMATION
We assume that the Markov policy μ is parameterized and denoted by μ(a|i, θ), i ∈ S, a ∈ A(i), where θ is the parameter. All quantities associated with the policy are then functions of θ. Therefore, the transition matrix and steady-state probability of the embedded Markov chain, the average reward, the performance potential, and the expected total reward and expected sojourn time between two decision epochs are denoted by P(θ), π(θ), η(θ), g(θ), r(θ) and τ(θ), respectively. We assume that μ(a|i, θ), i ∈ S, a ∈ A(i), is differentiable with respect to θ. If the parameter θ takes two different values θ_0 and θ_1, there are two different policies μ(a|i, θ_0) and μ(a|i, θ_1). Letting θ_1 tend to θ_0, from the difference formula (11) we obtain the performance gradient at θ = θ_0:
dη(θ_0)/dθ = (π(θ_0) / (π(θ_0)τ(θ_0))) {dr(θ_0)/dθ − (dτ(θ_0)/dθ) η(θ_0) + (dP(θ_0)/dθ) g(θ_0)}.   (21)
According to (4) and (5), the performance gradient (21) can be rewritten as

dη(θ_0)/dθ = Σ_{i∈S} (π(i, θ_0) / (π(θ_0)τ(θ_0))) Σ_{a∈A(i)} (dμ(a|i, θ_0)/dθ) Q^μ(i, a),   (22)

where

Q^μ(i, a) = r(i, a) − τ(i, a) η(θ_0) + Σ_{j∈S} p(j|i, a) g(j, θ_0),
which is the classical Q-factor of the SMDP under policy μ. We can easily prove that

Q^μ(i, a) = E_i^μ {Σ_{n=0}^{∞} [r(X_n, A_n) − τ(X_n, A_n) η(θ_0)] | A_0 = a}.
Define ∇μ(a|i, θ) = dμ(a|i, θ)/dθ. From (22) and the basic formula in [22], we have

dη(θ_0)/dθ = (1/(π(θ_0)τ(θ_0))) lim_{N→∞} (1/N) Σ_{n=0}^{N−1} (∇μ(A_n|X_n, θ_0)/μ(A_n|X_n, θ_0)) Q^μ(X_n, A_n).   (23)
To ensure that ∇μ(a|i, θ_0)/μ(a|i, θ_0) is well defined, we make the standard assumption: for any i ∈ S, a ∈ A(i), if μ(a|i, θ) = 0, then ∇μ(a|i, θ) = 0.
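For concreteness, here is a sketch of a softmax parameterization; the exact form used in Section VI is not spelled out in the paper, so the tabular parameterization below is an assumption. A softmax policy has μ(a|i, θ) > 0 everywhere, so the assumption above is satisfied trivially, and the likelihood ratio ∇μ/μ reduces to the familiar score function ∇ log μ.

```python
import numpy as np

def softmax_policy(theta, i):
    """mu(.|i, theta) for a tabular softmax: one parameter theta[i, a] per
    state-action pair (an assumed parameterization)."""
    z = theta[i] - theta[i].max()    # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def score(theta, i, a):
    """grad mu(a|i, theta) / mu(a|i, theta) = grad log mu(a|i, theta).
    For the softmax, component (i, b) equals 1{b = a} - mu(b|i, theta)."""
    g = np.zeros_like(theta)
    g[i] = -softmax_policy(theta, i)
    g[i, a] += 1.0
    return g
```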
Formula (23) is similar to the basic formula for MDPs in [22] except for the constant factor 1/(π(θ_0)τ(θ_0)) and the Q-factors Q^μ(i, a). Using the same idea as in [22], we may use any sample-path-based estimate Q̂(X_n, A_n, X_{n+1}, A_{n+1}, . . .) with E[Q̂(X_n, A_n, X_{n+1}, A_{n+1}, . . .)|X_n, A_n] ≈ Q^μ(X_n, A_n) to replace Q^μ(X_n, A_n) in (23). In this way, using different sample-path-based estimates of Q^μ(X_n, A_n), including the discounted approximation, perturbation realization factors and approximation by truncation [22], we may obtain different gradient estimation algorithms. Here, we consider only the discounted approximation. Define the discounted Q-factor

Q_β^μ(i, a) = E_i^μ {Σ_{n=0}^{∞} β^n [r(X_n, A_n) − τ(X_n, A_n) η(θ_0)] | A_0 = a},
where β is a discount factor with 0 < β < 1. We can easily prove that Q_β^μ(i, a) converges to Q^μ(i, a) as β approaches 1. Thus, we may let
Q̂(X_n, A_n, X_{n+1}, A_{n+1}, . . .) = Σ_{l=n}^{∞} β^{l−n} [r(X_l, A_l) − τ(X_l, A_l) η(θ_0)].   (24)
When β is close to 1, the above approximation is a good estimate of Q^μ(X_n, A_n). The discount factor β needs to be chosen carefully to balance the bias and variance of the estimate [21]. Substituting (24) into (23), we have
dη(θ_0)/dθ ≈ C(θ_0) lim_{N→∞} (1/N) Σ_{l=0}^{N−1} [r(X_l, A_l) − τ(X_l, A_l) η(θ_0)] Σ_{n=0}^{l} β^{l−n} ∇μ(A_n|X_n, θ_0)/μ(A_n|X_n, θ_0),   (25)

where C(θ_0) = 1/(π(θ_0)τ(θ_0)). Note that η(θ_0) in (25) is generally unknown. Let
η_n = (1/τ_n) {∫_0^{τ_n} c(W_s, X_{σ_s}, A_{σ_s}) ds + Σ_{l=0}^{n−1} f(X_l, A_l)}.
From ergodicity, we have

lim_{n→∞} η_n = η(θ_0)   w.p.1.
Thus, when n is large enough, we may use η_n as an estimate of η(θ_0). For η_n, we have the iteration

η_{n+1} = η_n + (1/τ_{n+1}) [R(X_n, A_n) − (τ_{n+1} − τ_n) η_n].   (26)
With (25) and (26), we can develop the following gradient estimation algorithm for SMDPs (GSMDP), which yields an estimate Δ_n of the performance gradient direction based on a sample path of the semi-Markov decision process. (Since C(θ_0) is a positive constant, omitting it does not affect the gradient direction.)
GSMDP Algorithm:
1) Given θ_0 and a sample path {X_0, A_0, τ_0, R(X_0, A_0), X_1, A_1, τ_1, R(X_1, A_1), . . .} under policy μ(·|·, θ_0), choose a discount factor 0 < β < 1.
2) Set Z_0 = 0, η_0 = 0, Δ_0 = 0 and n = 0.
3) For state X_n and action A_n at decision epoch τ_n, with total reward R(X_n, A_n) between decision epochs τ_n and τ_{n+1}, do

Z_{n+1} = β Z_n + ∇μ(A_n|X_n, θ_0)/μ(A_n|X_n, θ_0),

η_{n+1} = η_n + (1/τ_{n+1}) [R(X_n, A_n) − (τ_{n+1} − τ_n) η_n],

Δ_{n+1} = Δ_n + γ_{n+1} {[R(X_n, A_n) − η_{n+1}(τ_{n+1} − τ_n)] Z_{n+1} − Δ_n},

where Z_{n+1} evaluates Σ_{l=0}^{n} β^{n−l} ∇μ(A_l|X_l, θ_0)/μ(A_l|X_l, θ_0). A Python sketch of these steps is given below.

From the above algorithm, we can see that the GSMDP algorithm is equivalent to the policy gradient algorithm for discrete-time MDPs in [21], [22] with a new reward function r̄(i, a) = r(i, a) − τ(i, a)η. The new reward function cannot be obtained directly, but we can use the stochastic observation between decision epochs τ_n and τ_{n+1}, R(X_n, A_n) − η_{n+1}(τ_{n+1} − τ_n), to replace it. From (23) and (25), Δ_n in the GSMDP algorithm converges to Σ_{i∈S, a∈A(i)} π(i, θ_0) ∇μ(a|i, θ_0) Q_β^μ(i, a), which is an approximation of the gradient direction, under the condition that the Markov chain {(X_n, A_n, τ_{n+1} − τ_n), n = 0, 1, . . .} is positive Harris [27].
The GSMDP algorithm extends the policy gradient algorithm in [21], [22] to SMDPs and can be easily applied to continuous-time Markov processes. Compared with the algorithm in [27], the new algorithm requires only half the memory. In [27], four quantities (z^k, Δ_c^k, Δ_τ^k and η_τ^k Δ_c^k − η_c^k Δ_τ^k) need to be estimated (computed) and stored for each parameter (cf. (9)-(13) in [27]), plus two average rewards (η_c^k and η_τ^k). Thus, the algorithm in [27] needs 4K+2 memory units when there are K parameters. In the GSMDP algorithm, only two quantities (Z^k and Δ^k) need to be estimated for each parameter, plus one average reward (η_n), so the GSMDP algorithm needs only 2K+1 memory units. When the policy has many parameters, this saving of storage is very desirable.
VI. EXPERIMENTAL RESULTS
Consider the three-state SMDP in [13]. For this example, the optimal policy is v∗(1) = 3, v∗(2) = 1, v∗(3) = 2 and the optimal average reward is η^{v∗} = 2.0189. We apply the RVI RL algorithm to learn the optimal Q-factors. In the RVI algorithm, we set λ = 0.9; the simulation result is shown in Fig. 1(a). From the simulation, the learned policy is the optimal policy v∗. To compare our algorithm with the RL algorithm based on the classical Q-factor (17), we simulate the RL algorithm with estimation of the optimal average reward. The simulation results are described in Fig. 1(b). Comparing the simulation results in Fig. 1(a) and Fig. 1(b), the RVI RL algorithm has a good convergence rate.

Moreover, we parameterize and randomize the policy using softmax functions. Consider the gradient estimation at θ = 0 and set β = 0.9 in the GSMDP algorithm. Fig. 1(c) describes the simulation results. Applying the gradient-based optimization algorithm with step size equal to 1 to the three-state SMDP, the optimal average reward of the system can be obtained, as described in Fig. 1(d). In each gradient-based iteration, we use the GSMDP algorithm to estimate the performance gradient by simulating a sample path with 10000 state transitions.
VII. CONCLUSION
In this paper, by analyzing two performance difference formulas, we presented two equivalent forms of the Bellman optimality equation. The Bellman optimality equation based on the equivalent infinitesimal generator yields an RVI RL algorithm. Experimental results showed that the RVI algorithm has a good convergence property. As a natural consequence of the performance difference formula based on the embedded Markov chain, a new performance derivative formula was established.
Fig. 1. Simulation results: (a) RVI reinforcement learning (Q-values vs. learning steps); (b) RL with estimation of the optimal average reward; (c) gradient estimation at θ = (0 0 0 0 0 0 0) with the GSMDP algorithm (performance derivatives dη/dθ); (d) optimization result (average reward vs. gradient-based iterations).
With the performance derivative formula, we proposed an on-line policy gradient estimation algorithm. The sensitivity-based analysis in this paper fits well with the recently developed sensitivity-based learning and optimization framework [11] and provides new insight into SMDPs. Further research includes a convergence proof for the RVI RL algorithm, applications in practical systems, and hierarchical learning theory for Markov systems, where the SMDP is the main mathematical framework for the hierarchy.
ACKNOWLEDGMENT

The authors would like to thank the support from the NSFC (No. 61004036), the Doctoral Fund of the Ministry of Education (No. 20102302120071) and the Shenzhen Basic Research Project (No. JC201005260179A).
REFERENCES
[1] J. Janssen, Semi-Markov Models: Theory and Applications, New York: Springer, 1999.
[2] J. Janssen and R. Manca, Applied Semi-Markov Processes, New York: Springer, 2005.
[3] R. Howard, Dynamic Probabilistic Systems, Volume II: Semi-Markov and Decision Processes, New York: Wiley, 1971.
[4] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, New York: Wiley, 1994.
[5] H. C. Tijms, Stochastic Models: An Algorithmic Approach, New York: John Wiley & Sons, 1994.
[6] D. P. Bertsekas, Dynamic Programming and Optimal Control, Belmont, Massachusetts: Athena Scientific, 1995.
[7] S. J. Bradtke and M. O. Duff, "Reinforcement learning methods for continuous-time Markov decision problems," Proceedings of the Conference on Advances in Neural Information Processing Systems, vol. 7, pp. 393–400, 1995.
[8] S. Mahadevan, N. Marchalleck, T. Das, and A. Gosavi, "Self-improving factory simulation using continuous-time average-reward reinforcement learning," Proceedings of the International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, pp. 202–210, 1997.
[9] T. K. Das, A. Gosavi, S. Mahadevan, and N. Marchalleck, "Solving semi-Markov decision problems using average reward reinforcement learning," Management Science, vol. 45, no. 4, pp. 560–574, 1999.
[10] A. Gosavi, "Reinforcement learning for long-run average cost," European Journal of Operational Research, vol. 155, pp. 654–674, 2004.
[11] X. R. Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach, New York: Springer, 2007.
[12] X. R. Cao, "Semi-Markov decision problems and performance sensitivity analysis," IEEE Transactions on Automatic Control, vol. 48, no. 5, pp. 758–769, 2003.
[13] Y. J. Li and F. Cao, "Infinite-horizon gradient estimation for semi-Markov decision processes," 8th Asian Control Conference, Kaohsiung, 2011.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press, 1998.
[15] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Belmont, Massachusetts: Athena Scientific, 1996.
[16] R. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[17] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[18] S. Mahadevan, "Average reward reinforcement learning: foundations, algorithms, and empirical results," Machine Learning, vol. 22, pp. 159–196, 1996.
[19] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229–256, 1992.
[20] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Proceedings of the Conference on Advances in Neural Information Processing Systems, vol. 12, pp. 1057–1063, 2000.
[21] J. Baxter and P. L. Bartlett, "Infinite-horizon policy-gradient estimation," Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.
[22] X. R. Cao, "A basic formula for online policy gradient algorithms," IEEE Transactions on Automatic Control, vol. 50, no. 5, pp. 696–699, May 2005.
[23] J. Abounadi, D. Bertsekas, and V. S. Borkar, "Learning algorithms for Markov decision processes with average cost," SIAM Journal on Control and Optimization, vol. 40, no. 3, pp. 681–698, 2001.
[24] D. Y. Dong, C. L. Chen, H. X. Li, and T. J. Tarn, "Quantum reinforcement learning," IEEE Transactions on Systems, Man and Cybernetics, Part B, vol. 38, no. 5, pp. 1207–1219, 2008.
[25] A. Gosavi, "A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis," Machine Learning, vol. 55, no. 1, pp. 5–29, 2004.
[26] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," IEEE Transactions on Automatic Control, vol. 46, no. 2, pp. 191–209, 2001.
[27] S. S. Singh, V. B. Tadic, and A. Doucet, "A policy gradient method for semi-Markov decision processes with application to call admission control," European Journal of Operational Research, vol. 178, pp. 808–818, 2007.
[28] H. Kimura, K. Miyazaki, and S. Kobayashi, "Reinforcement learning in POMDPs with function approximation," Proceedings of the International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, pp. 152–160, 1997.
[29] L. C. Baird and A. W. Moore, "Gradient descent for general reinforcement learning," Proceedings of the Conference on Advances in Neural Information Processing Systems, pp. 968–974, 1998.
[30] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Proceedings of the Conference on Advances in Neural Information Processing Systems, pp. 1057–1063, 1999.
[31] V. R. Konda and J. N. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.