State-Continuity Approximation of Markov Decision Processes via Finite Element Analysis for Autonomous System Planning

Junhong Xu, Kai Yin, Lantao Liu

arXiv:1903.00948v1 [cs.RO] 3 Mar 2019

Abstract— Motion planning under uncertainty for an autonomous system can be formulated as a Markov Decision Process. In this paper, we propose a solution to this decision-theoretic planning problem using a continuous approximation of the underlying discrete value function and leveraging finite element methods. This approach allows us to obtain an accurate and continuous form of the value function even with a small number of states from a very low-resolution state space. We achieve this by taking advantage of the second-order Taylor expansion to approximate the value function, where the value function is modeled as a boundary-conditioned partial differential equation which can be naturally solved using a finite element method. We have validated our approach via extensive simulations, and the evaluations reveal that our solution provides continuous value functions, leading to better path results in terms of path smoothness, travel distance, and time costs, even with a smaller state space.

I. INTRODUCTION

Many autonomous vehicles that operate in flow fields, e.g., aerial and marine vehicles, can be easily disturbed by environmental disturbances. For example, autonomous marine vehicles might experience ocean currents as in Fig. 1. The uncertain motion behavior in an externally disturbed environment can be modeled using the decision-theoretic planning framework whose substrate is the Markov Decision Process (MDP) [1]. However, the uncertainties in many scenarios are mixed due to environmental and system dynamics. To obtain a highly accurate solution, it is desirable to increase the number of states and to impose complex probability structures for modeling the MDP. Despite many successful achievements [2], [3], [4], existing work typically does not synthetically integrate environmental uncertainty into a computationally efficient planning model that possesses structural insights.

In this work, we treat the state space of the MDP as a continuous form, but tackle it from a very different perspective than the majority of continuous MDP frameworks. The essence of this work lies in that the state space is mapped from a discrete version to a continuous format, where the "mapping" mechanism is achieved by modeling the discrete-state value function through a partial differential equation (PDE) which is then solved by a finite element method.

In greater detail, we model the MDP as an infinite horizon problem, approximate the value function of the MDP by its Taylor expansion [5], [6], and show that it satisfies a diffusion-type PDE.

J. Xu and L. Liu are with the School of Informatics, Computing, and Engineering at Indiana University, Bloomington, IN 47408, USA. E-mail: {xu14, lantao}@iu.edu. K. Yin is with HomeAway, Inc. E-mail: [email protected]. J.X. and K.Y. contributed equally.

Fig. 1: Ocean currents can cause significant disturbances for autonomous marine vehicles. The swirl-patterned currents near South Africa are Agulhas Rings, which are usually strong. (Source: NASA.)

We then apply a Finite Element Method (FEM) based on (a subset of) the states of the MDP to construct a solution that naturally extends over the entire spatial space. Combining these procedures, we propose an approximate policy iteration algorithm to obtain the final policy and a continuous value function. Our framework in principle allows us to compute the policy on any subset of the entire planning domain (region). Because our proposed methods have deep mathematical roots in diffusion approximation and the theory of FEM, the final results have theoretical guarantees. Finally, we validated our method in the scenario of navigating a marine vehicle in the ocean. Our simulation results show that the proposed approach produces results superior to the MDP solution on the discrete states while requiring less computational effort.

II. RELATED WORK

Path planning in the presence of state uncertainty can be viewed as a decision-theoretic planning problem, which is usually modeled as an MDP [7]. Most existing methods that solve MDPs in the literature use tabular methods, i.e., converting the continuous state space into a set of discrete indices [1]. To model wind field disturbances, an MDP has been used to formulate the task of minimum-energy planning [8]. Similar work [9], [10], [11] addresses tasks of minimum time and energy planning under uncertain ocean current estimations as an MDP problem.

In addition to formulating motion uncertainty via MDPs, environmental uncertainty should also be integrated into planning. Environmental model uncertainty has been combined with stochastic planning algorithms such as Rapidly-Exploring Random Trees (RRT) [12]. Prior knowledge of the motion model can be leveraged in a linear quadratic Gaussian controller to account for other sources of uncertainty [13].



To extend the discrete-time actions to be continuous, Semi-Markov Decision Processes (SMDPs) [14], [15] have been designed based on the temporal abstraction concept so that actions can be performed in a more flexible manner. However, without relying on function approximation, the basic form of the state space is still assumed to be discrete and tabular. Although the framework of continuous stochastic control can provide continuous value functions and continuous actions, it requires Brownian-motion-driven dynamics [16]. Nevertheless, the Hamilton-Jacobi-Bellman equation for infinite horizon stochastic control shares a similar diffusion-type equation with our approach.

Research on continuous representations of value functions has been conducted in the literature. Fitted value iteration (FVI) is one of the popular sampling-based methods that approximate continuous value functions [17], [18]. It requires batches of samples to approximate the true value function via regression methods. However, the behavior of the algorithm is not well understood, and without carefully choosing a function approximator, the algorithm may diverge or converge very slowly [19], [20]. In contrast to the sampling methods, we approximate the value function using a second-order Taylor expansion and solve the resulting partial differential equation via a finite element method [21].

III. PROBLEM DESCRIPTION AND FORMULATION

This section describes the ocean environment modeling with disturbance and the vehicle motion under uncertainty, followed by decision-theoretic planning via Markov Decision Processes.

A. Ocean Environment Model

We model the ocean environment as a 2-D surface (the xy-plane). A vector field is used to represent the environmental disturbance caused by ocean currents at each location of the 2-D ocean surface. It is given by $\mathbf{v}_d(\mathbf{x}, t) = [v_{dx}(\mathbf{x}, t), v_{dy}(\mathbf{x}, t)]^T$, where $\mathbf{x} = [x, y]^T$ is the location, $t$ is the time, and $v_{dx}$ and $v_{dy}$ are estimations of the ocean current speeds along the $x$ and $y$ axes given a location vector $\mathbf{x}$ and a time $t$. Note that $v_{dx}$ and $v_{dy}$ can take various forms. For example, regression models can be fitted if historical data of ocean currents is available; ocean flow forecasts can be used [22]; or we can leverage an ocean circulation model studied by oceanographers [23].

The resulting $v_{dx}$ and $v_{dy}$ are not accurate enough to reflect the true dynamics of the ocean disturbance, since they are associated with estimation uncertainties [24]. For effective and accurate planning, these uncertainties need to be considered by taking account of environmental noise. For example, we may assume that the noise follows a Gaussian distribution:

$$\tilde{\mathbf{v}}_d(\mathbf{x}, t) = \mathbf{v}_d + \mathbf{w}(\mathbf{x}, t), \qquad (1)$$

where $\tilde{\mathbf{v}}_d(\mathbf{x}, t)$ denotes the velocity vector after introducing uncertainty and $\mathbf{w}(\mathbf{x}, t) = [w_x(\mathbf{x}, t), w_y(\mathbf{x}, t)]$ is the uncertainty vector at location $\mathbf{x}$ and time $t$. To reduce the complexity of modeling and computation, we adopt existing approximation methods [9], [10], [11] and assume the uncertainties along the two dimensions $v_{dx}$ and $v_{dy}$ are independent:

$$w_x(\mathbf{x}, t) \sim \mathcal{N}(0, \sigma_x^2(\mathbf{x}, t)), \qquad w_y(\mathbf{x}, t) \sim \mathcal{N}(0, \sigma_y^2(\mathbf{x}, t)), \qquad (2)$$

where $\mathcal{N}(\cdot, \cdot)$ denotes a Gaussian distribution, and $\sigma_x^2$ and $\sigma_y^2$ are the noise variances of the respective components. These two components are independent with respect to both space and time.
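To make the noise model concrete, the following minimal sketch (ours, not the authors' implementation) draws a disturbed velocity sample according to Eqs. (1) and (2); the estimate `v_d` and the variance fields `sigma_x2` and `sigma_y2` are assumed to be user-supplied callables.

```python
import numpy as np

def sample_disturbed_current(x, t, v_d, sigma_x2, sigma_y2, rng=None):
    """Sample a noisy current velocity per Eqs. (1)-(2).

    v_d(x, t) -> length-2 array [v_dx, v_dy]: deterministic current estimate.
    sigma_x2(x, t), sigma_y2(x, t): noise variances of the two components.
    """
    rng = rng or np.random.default_rng()
    mean = np.asarray(v_d(x, t), dtype=float)
    # Independent zero-mean Gaussian noise on each axis (Eq. 2).
    w = rng.normal(0.0, np.sqrt([sigma_x2(x, t), sigma_y2(x, t)]))
    return mean + w  # Eq. (1): estimate plus uncertainty vector
```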

B. Vehicle Motion Model

We formulate the motion model of an autonomous underwater vehicle (AUV) that operates in the 2-D plane environment. The net velocity of the vehicle with respect to the world frame (the actual motion of the vehicle) is given by $\mathbf{v}_{net}(\mathbf{x}, t) = [v_{dx}(\mathbf{x}, t) + v\cos\psi, \; v_{dy}(\mathbf{x}, t) + v\sin\psi]^T$, where $v$ is the speed of the vehicle relative to the ocean currents and $\psi$ is the heading angle of the vehicle relative to the world frame. We assume that the AUV has a maximum speed $v_{max}$ constrained by its capacity. For example, a marine glider can move at speeds comparable to those of typical ocean currents. The AUV motion is described by

$$\dot{x}(\mathbf{x}, t) = v_{net}^{x}(\mathbf{x}, t), \quad \dot{v} = a, \qquad \dot{y}(\mathbf{x}, t) = v_{net}^{y}(\mathbf{x}, t), \quad \dot{\psi} = \beta, \qquad (3)$$

where $v_{net}^{x}$ and $v_{net}^{y}$ are the $x$ and $y$ components of the net velocity vector, respectively; $a$ and $\beta$ are the control inputs for acceleration (fuel) and steering angular speed, respectively.
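For illustration only, a forward-Euler step of Eq. (3) could look as follows (our sketch; `current` is a hypothetical callable returning the local disturbance, and the updates of $v$ and $\psi$ via the controls $a$ and $\beta$ are kept alongside):

```python
import numpy as np

def motion_step(x, psi, v, a, beta, current, dt):
    """One forward-Euler step of the kinematics in Eq. (3).

    x: length-2 position; psi: heading; v: speed relative to the current;
    a, beta: acceleration and steering-rate control inputs;
    current(x): length-2 ocean current velocity [v_dx, v_dy] at x.
    """
    v_net = np.asarray(current(x)) + v * np.array([np.cos(psi), np.sin(psi)])
    return x + dt * v_net, psi + dt * beta, v + dt * a
```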

C. Infinite Horizon Markov Decision Process

We consider planning with a minimum time cost under the influence of uncertain ocean disturbances. This problem is usually formulated as an infinite time-horizon Markov Decision Process (MDP).

We represent the infinite discrete-time horizon MDP as a 4-tuple $(\mathbb{S}, \mathbb{A}, \mathcal{T}, R)$. The discrete spatial state space $\mathbb{S}$ and the action space $\mathbb{A}$ are finite. A state $s \in \mathbb{S}$ denotes a spatial point $(x, y) \in \mathbb{R}^2$ on the plane in our problem setting. We denote the entire autonomous vehicle planning region (or domain) by $\Omega$. Thus, we have $\mathbb{S} \subset \Omega$. The action depends on the state; that is, for each state $s \in \mathbb{S}$, we have a feasible action set $\mathbb{A}(s) \subset \mathbb{A}$. The entire set of feasible state-action tuples is $\mathbb{F} := \{(s, a) \in \mathbb{S} \times \mathbb{A}\}$. There is a probability transition law $\mathcal{T}(s, a; \cdot)$ on $\mathbb{S}$ for all $(s, a) \in \mathbb{F}$; that is, $\mathcal{T}(s, a; s')$ specifies the probability of transitioning to the state $s'$ given the current state $s$ and the taken action $a$, constrained by the system dynamics. The final element $R: \mathbb{F} \to \mathbb{R}$ is a real-valued reward function that depends on the state and action.

We consider the class of deterministic Markov policies [15], denoted by $\Pi$, i.e., the mapping $\pi: \mathbb{S} \to \mathbb{A}$ depends on the current state, with $\pi(s) \in \mathbb{A}(s)$ for a given state. For a given initial state $s_0$ and starting time $t_0$, the expected discounted total reward is represented as

$$v^{\pi}(s_0) = \mathbb{E}^{\pi}_{s_0}\left[\sum_{k=0}^{\infty} \gamma^k R(s_k, a_k)\right], \qquad (4)$$


where $\gamma \in [0, 1)$ is the discount factor that discounts the reward at a geometrically decaying rate. The aim is to find a policy $\pi^*$ that maximizes the total reward from the starting time $t_0$ at the initial state $s_0$, i.e.,

$$\pi^*(s_0) = \arg\max_{\pi \in \Pi} v^{\pi}(s_0). \qquad (5)$$

Accordingly, the optimal value is denoted by $v^*(s_0)$. In practice, the continuous 2-D ocean surface is discretized into a number of grid cells, where the center of a cell represents a state. The actions are defined by the vehicle moving at the maximum speed $v_{max}$ towards a desired heading direction in the fixed world frame, $\mathbb{A} = \{N, NE, E, SE, S, SW, W, NW\}$, where each element is a pair $(\psi, v_{max})$. Because these actions only describe the desired heading angle and moving speed, they cannot be used directly to control the vehicle; low-level controllers that integrate the vehicle dynamics are used instead, with the control inputs $a$ and $\beta$ from the preceding section.

The transition probability is modeled by a bivariate Gaussian distribution

$$\tilde{\mathcal{T}}(s, a; s') = \mathcal{N}(\mu(s), \mathrm{diag}[\sigma_x^2 dt, \sigma_y^2 dt]), \qquad (6)$$

where $\tilde{\mathcal{T}}$ represents the unnormalized transition probability, $\mu(s) = [x + v_{net}^{x}(s)dt, \; y + v_{net}^{y}(s)dt]^T$, and $dt$ denotes the time of executing the action $a$. We have assumed that within a short time window the ocean disturbances are constant and only depend on the state $s$. In addition, the noise scaling parameters $\sigma_x$ and $\sigma_y$ are the same at all locations. Transition probabilities are only assigned to the states $s'$ that are "connected" with the current state $s$ (two states are connected only if a robot can transit from one state to the other directly without passing through any other state). The transition probability is then normalized by $\mathcal{T}(s, a; \cdot) = \tilde{\mathcal{T}}(s, a; \cdot) / \sum_{s'} \tilde{\mathcal{T}}(s, a; s')$.
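As a concrete sketch of this construction (ours; `neighbors` and `v_net` are hypothetical helpers returning the connected successor cells and the net velocity of Section III-B):

```python
import numpy as np

def transition_probabilities(s, a, neighbors, v_net, sigma_x, sigma_y, dt):
    """Normalized transition probabilities following Eq. (6).

    s: length-2 state location; a: action (desired heading at v_max);
    neighbors(s): (n, 2) array of directly connected successor states.
    """
    mu = s + v_net(s, a) * dt                      # mean of the bivariate Gaussian
    var = np.array([sigma_x ** 2 * dt, sigma_y ** 2 * dt])
    succ = np.asarray(neighbors(s), dtype=float)
    # Unnormalized Gaussian densities, evaluated only at connected states.
    dens = np.exp(-0.5 * np.sum((succ - mu) ** 2 / var, axis=1))
    return succ, dens / dens.sum()                 # normalize as in the text
```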

IV. METHODOLOGY

The solution to the infinite horizon MDP has values on the discrete spatial states (i.e., points) only. Although a discrete MDP reduces the computational effort of evaluating an infinite number of states over the entire space, it is still a large-scale instance for certain real applications. Moreover, it is desirable to have a solution on the entire continuous spatial space. In what follows, we employ methods from the literature (e.g., [5], [6]) to approximate the value function of the discrete MDP by its Taylor expansion, and show that it satisfies a diffusion-type PDE. Given that the parameters of this diffusion-type PDE have values on the discrete states only, we then apply a Finite Element Method (FEM) to construct a solution that naturally extends over the entire spatial space. We also show that under certain situations the FEM allows us to obtain the solution from fewer discrete states. Finally, we propose a policy iteration algorithm based on the diffusion-type PDE and the FEM to obtain the final policy and a continuous value function. Because our proposed methods have deep mathematical roots in diffusion approximation and the theory of FEM, the final results have theoretical guarantees.

A. Diffusion-Type Approximate Value Function

Under certain conditions [15], [25], it is well known that the optimal solution satisfies the following recursive relationship:

$$v(s) = \max_{a \in \mathbb{A}(s)} \left\{ R(s, a) + \gamma \, \mathbb{E}_a[v(s') \mid s] \right\}, \qquad (7)$$

where $\mathbb{E}_a[v(s') \mid s] = \sum_{s'} \mathcal{T}(s, a; s')\, v(s')$. This is the Bellman equation, which serves as the basis for solving the problem posed by Eq. (5) and Eq. (4). Popular algorithms to obtain the solution include policy and value iteration as well as linear-programming-based methods [15].
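For reference, one synchronous sweep of the tabular Bellman backup in Eq. (7) can be sketched as below (our illustration; `T` is assumed to map each state and action to a list of (successor, probability) pairs):

```python
def bellman_backup(v, T, R, gamma):
    """One synchronous sweep of Eq. (7) over all discrete states.

    v: dict state -> value; T: dict state -> dict action -> [(s', prob)];
    R(s, a): reward; gamma: discount factor.
    """
    return {
        s: max(
            R(s, a) + gamma * sum(p * v[sp] for sp, p in T[s][a])
            for a in T[s]
        )
        for s in T
    }
```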

Following the procedures in [5], we subtract $v(s)$ from both sides and then take the Taylor expansion of the value function around $s$ up to second order (assuming such expansions exist):

$$0 = \max_{a \in \mathbb{A}(s)} \left\{ R(s, a) + \gamma \left( \mathbb{E}_a[v(s') \mid s] - v(s) \right) - (1 - \gamma)\, v(s) \right\}$$
$$\;\; = \max_{a \in \mathbb{A}(s)} \left\{ R(s, a) + \gamma \left( \mu_s \cdot \nabla v(s) + \frac{1}{2} \nabla \cdot \sigma_s \nabla v(s) \right) - (1 - \gamma)\, v(s) \right\}, \qquad (8)$$

where

$$(\mu_s)_i = \sum_{s'} \mathcal{T}(s, a; s')(s_i - s'_i), \qquad (\sigma_s)_{i,j} = \sum_{s'} \mathcal{T}(s, a; s')(s_i - s'_i)(s_j - s'_j), \qquad i, j \in \{x, y\},$$

$s = (x, y) \triangleq (s_x, s_y)$, the operator $\nabla \triangleq (\partial/\partial x, \partial/\partial y)$, and the notation $\cdot$ in the last equation indicates an inner product. We note that similar approximations can also be found in the stochastic control literature [16]. We omit the rigorous derivations here and refer readers to the aforementioned literature.
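These coefficients can be computed directly from the one-step transition probabilities. A minimal sketch (ours), following the definitions below Eq. (8) and keeping the paper's $(s - s')$ sign convention:

```python
import numpy as np

def local_coefficients(s, successors, probs):
    """Drift vector mu_s and diffusion matrix sigma_s for one state-action pair.

    s: length-2 state; successors: (n, 2) connected states s';
    probs: (n,) transition probabilities T(s, a; s').
    """
    d = s - np.asarray(successors, dtype=float)   # rows are (s - s')
    mu = probs @ d                                # (mu_s)_i  = sum_s' T (s_i - s'_i)
    sigma = (probs[:, None] * d).T @ d            # (sigma_s)_ij = sum_s' T (s_i - s'_i)(s_j - s'_j)
    return mu, sigma
```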

Under the optimal policy, Eq. (8) is actually a diffusion-type partial differential equation. The above procedure indicates that the solution to the control problem in Eq. (8) approximates the value function of the original problem at the discrete states in $\mathbb{S}$. In order to obtain the solution to this diffusion-type PDE, we need to impose reasonable boundary conditions. Since path planning should be done within a region, the value function should not have values outside this region. This means that the directional derivative of the value function with respect to the unit normal vector at the boundary of the planning region must be zero. In addition, we constrain the value function at the goal state to be zero to ensure that there is a unique solution.

The above analysis inspires us to perform a policy iteration algorithm to obtain the policy [5]:
• In the policy evaluation stage, we solve a diffusion-type PDE with imposed boundary conditions to obtain $v(s)$;
• In the policy improvement stage, we search for a policy that maximizes the right-hand side of Eq. (8), with $v(s)$ obtained from the previous policy evaluation stage.

B. Partial Differential Equation Representation

Now let us look at the PDE involved in the policy evaluation stage. Let us further denote the entire continuous spatial planning region (called the domain) by $\Omega$, its boundary by $\partial\Omega$, and the goal state by $s_g$. We assume that the discrete spatial space $\mathbb{S} \subset \Omega$. We also use $\mathbf{n}$ to denote the unit vector normal to $\partial\Omega$ pointing outward. Under the policy $\pi$, we aim to solve the following diffusion-type PDE:

$$-R(s, \pi(s)) = \gamma \left( \mu_s^{\pi} \cdot \nabla v(s) + \frac{1}{2} \nabla \cdot \sigma_s^{\pi} \nabla v(s) \right) - (1 - \gamma)\, v(s), \qquad (9)$$

with boundary conditions

$$\frac{\partial v(s)}{\partial \mathbf{n}} \triangleq \nabla v(s) \cdot \mathbf{n} = 0 \quad \text{on } \partial\Omega, \qquad (10a)$$
$$v(s_g) = 0, \qquad (10b)$$

where $\mu_s^{\pi}$ and $\sigma_s^{\pi}$ indicate that $\mu_s$ and $\sigma_s$ are obtained under the policy $\pi$. Condition (10a) is a type of homogeneous Neumann condition, and (10b) can be thought of as a Dirichlet condition [26]. In practice, we assume that solutions to the above diffusion-type PDE (9) with these boundary conditions exist. The relevant conditions for a well-posed PDE can be found in [26].

We emphasize that Eq. (9) is obtained on the discrete spatial states in $\mathbb{S} \subset \Omega$, i.e., the parameters $\mu_s$, $\sigma_s$, and $R(s, \cdot)$ take values on $\mathbb{S}$ only. This is because the Taylor expansions (8) are performed only on the discrete states $s \in \mathbb{S}$. We need to construct a continuous solution with the desired theoretical guarantees on the entire domain $\Omega$. In the next section, we introduce a finite element method that achieves exactly this objective in our setting.

We make a final comment in this section. The problem here relates to numerical methods for differential equations in an opposite sense: in numerical methods, we have a well-defined differential equation from the beginning, then discretize the domain to obtain the parameter values at each discrete point, and finally compute the solution. In our problem, we obtain the parameters at each discrete point first, assume a solution exists, and finally construct a continuous solution on the entire space.

C. A Finite Element Method

The finite element method is the name of a general technique for constructing numerical solutions to differential equations with boundary value problems. The method consists of dividing the domain $\Omega$ into a finite number of simple subdomains, the finite element meshes, and then using the variational form (also called the weak form) of the original differential equation to construct the solution through "finite element functions" over the meshes [27]. There is an extremely rich collection of approaches and rigorous theories about finite element methods, and we cannot elaborate on the details in this section. Our purpose here is merely to leverage a standard method to obtain the numerical solution of the preceding PDE, and to demonstrate later that the solution can even be obtained based on a subset of $\mathbb{S}$ with high precision.

Fig. 2: Examples of triangle meshes on $\mathbb{S}$ and on a subset of $\mathbb{S}$. (a) Grid of states; (b) meshes; (c) meshes on a subset of states.

We sketch the main ideas involved in finite element methods (Galerkin methods in particular) in a heuristic way, and refer the readers to the literature [28], [27] for a full account of the theory. The variational or weak form of the problem amounts to multiplying both sides of Eq. (9) by a function $w(s)$ (called a test function) with suitable properties; using integration by parts and the boundary conditions (10), we have

$$-\int R\, w\, ds = \gamma \int (\mu_s^{\pi} \cdot \nabla v)\, w\, ds - \frac{\gamma}{2} \int \sigma_s^{\pi} \nabla v \cdot \nabla w\, ds - (1 - \gamma) \int v\, w\, ds. \qquad (11)$$

Here the test function $w(s) = 0$ on $\partial\Omega$ as well as at $s_g$. It can be shown that the solution to this variational form is also the solution to the original form.

In a standard finite element approach, the next procedure partitions the domain $\Omega$ into suitable elements. Typically, triangle meshes are applied because they are general enough to handle all kinds of domain geometry. In our problem setting, there is an additional requirement: the partition must be based on the discrete state points $\mathbb{S}$. For example, Fig. 2a shows an $8 \times 8$ grid of states, where a dark dot corresponds to a state point, i.e., the set of these dots is $\mathbb{S}$ for the infinite horizon MDP. Fig. 2b shows the triangle meshes used for the finite element method, where each vertex of a triangle mesh corresponds to a state point.

Finally, we represent the test function and the solution by finite linear combinations of so-called basis functions, i.e., $w = \sum_i c_i \phi_i$, $v = \sum_i a_i \phi_i$. This study uses the Lagrange interpolation polynomials, each of which is constructed based on the vertices of a triangle element. Because the coefficients $c_i$ are arbitrary, this along with condition (10a) leads to a coupled system of linear algebraic equations of the form $K\mathbf{a} = F$. Each entry of the matrix $K$ corresponds to the basis-function integrals on the right-hand side of Eq. (11); when we choose $N$ degree-one Lagrange interpolation polynomials, $K$ is of size $N \times N$. $F$ corresponds to the basis-function integral on the left-hand side of Eq. (11), i.e., the reward term. Solving this linear system gives us the estimates of $a_i$, i.e., the approximate solution $v$.
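To make the assembly concrete, below is a minimal piecewise-linear (P1) Galerkin sketch of the system $K\mathbf{a} = F$ for the weak form (11). It is our own simplification, not the authors' code: it assumes a scalar isotropic diffusion coefficient and element-wise constant $\mu$, $\sigma$, and $R$ (supplied by a hypothetical `coeffs(e)` helper), enforces the Dirichlet condition (10b) by pinning the goal vertex, and relies on the Neumann condition (10a) being natural in the weak form (so it needs no extra code).

```python
import numpy as np

def assemble_and_solve(nodes, elements, coeffs, goal_idx, gamma):
    """P1 Galerkin assembly of K a = F for the weak form (11), simplified to
    scalar sigma and element-wise constant coefficients.

    nodes: (N, 2) vertex coordinates; elements: (M, 3) vertex indices;
    coeffs(e) -> (mu (2,), sigma (scalar), R (scalar)) on element e.
    """
    N = len(nodes)
    K = np.zeros((N, N))
    F = np.zeros(N)
    for e, tri in enumerate(elements):
        p = nodes[tri]                                 # (3, 2) triangle vertices
        # Gradients of the barycentric basis functions (constant per element).
        B = np.hstack([np.ones((3, 1)), p])
        area = 0.5 * abs(np.linalg.det(B))
        grads = np.linalg.inv(B)[1:, :].T              # rows: grad(phi_i), shape (3, 2)
        mu, sigma, R = coeffs(e)
        stiff = area * grads @ grads.T                 # int grad(phi_i) . grad(phi_j)
        mass = area / 12.0 * (np.ones((3, 3)) + np.eye(3))   # int phi_i phi_j
        conv = area / 3.0 * np.tile(grads @ mu, (3, 1))      # int phi_i (mu . grad(phi_j))
        # Eq. (11) with all terms on one side of the linear system:
        local = 0.5 * gamma * sigma * stiff - gamma * conv + (1 - gamma) * mass
        K[np.ix_(tri, tri)] += local
        F[tri] += R * area / 3.0                       # load from int R phi_i
    # Dirichlet condition (10b): pin the goal vertex to zero.
    K[goal_idx, :] = 0.0
    K[goal_idx, goal_idx] = 1.0
    F[goal_idx] = 0.0
    return np.linalg.solve(K, F)                       # nodal values a_i of v
```

With the nodal values in hand, $v$ can be evaluated anywhere in $\Omega$ by barycentric interpolation within the containing triangle, which is what makes the resulting value function continuous.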

We make two notes here. First, we may use a subset of $\mathbb{S}$ to perform the finite element method; this works because the finite element method approximates the solution with high precision. Fig. 2c provides an example of using a subset of states: the red dots are the selected states and the red triangle elements are used in the finite element method. We demonstrate this in the numerical examples in a later section. Second, different types of applications and equations may require different designs of meshes. Due to limited space, we will discuss how to choose meshes in future work.

Algorithm 1 Policy Iteration with State-Continuity Approximation and Finite Element Methods
Input: Transition function $\mathcal{T}$, reward function $R$, discount $\gamma$, spatial states $\mathbb{S}$, and goal state $s_g$.
Output: Policy $\pi$ and continuous value function $v(s)$.
1: Initialize $\pi(s), \forall s \in \mathbb{S}$, and set $i = 0$.
2: repeat
3:   Step 1. Finite element based policy evaluation.
4:     Calculate $\mu_s^{\pi}$ and $\sigma_s^{\pi}$ on a subset $\mathbb{S}' \subseteq \mathbb{S}$;
5:     Obtain the continuous value function $v^i$ by solving the variational form (11) with the finite element method described in Section IV-C;
6:   Step 2. Policy improvement.
7:     Update the policy $\pi$ on $\mathbb{S}$ (or any subset of the domain $\Omega$ including $\mathbb{S}$) using the value function $v^i$ according to
       $\pi(s) = \arg\max_{a \in \mathbb{A}(s)} \left\{ R(s, a) + \gamma \left( \mu_s \cdot \nabla v^i(s) + \frac{1}{2} \nabla \cdot \sigma_s \nabla v^i(s) \right) - (1 - \gamma)\, v^i(s) \right\}$;
8:   Set $i := i + 1$.
9: until $\pi(s)$ does not change for all $s \in \mathbb{S}$

D. Approximate Policy Iteration Algorithm

We summarize our approximate policy iteration algorithm in Algorithm 1.

In the policy evaluation step, we may perform the finite element method on $\mathbb{S}$, or on a subset $\mathbb{S}' \subset \mathbb{S}$ if the computational cost is a concern, as discussed in the preceding section. It is worth noting that the obtained value function is continuous on the entire planning region, i.e., the domain $\Omega$. Therefore, it is possible to obtain the policy $\pi$ on any subset $\Omega'$ of $\Omega$ where $\mathbb{S} \subseteq \Omega'$. This also implies that we can, in principle, obtain a continuous policy on the entire domain $\Omega$.
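Putting the pieces together, the outer loop of Algorithm 1 has the following shape (our sketch; `evaluate_fem` stands for the Section IV-C solve, e.g. the `assemble_and_solve` sketch above, and `improve` performs the argmax in step 7 using interpolated derivatives of $v$):

```python
def approximate_policy_iteration(states, init_action, evaluate_fem, improve, max_iters=100):
    """Skeleton of Algorithm 1: FEM policy evaluation plus pointwise improvement."""
    policy = {s: init_action for s in states}            # arbitrary initialization
    v = None
    for _ in range(max_iters):
        v = evaluate_fem(policy)                         # policy evaluation (weak form (11))
        new_policy = {s: improve(s, v) for s in states}  # policy improvement (step 7)
        if new_policy == policy:                         # stop when unchanged on S
            break
        policy = new_policy
    return policy, v
```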

V. EXPERIMENTS

In this section, we present experimental results for planning with the objective of minimizing the time cost to a designated goal position under uncertain ocean disturbances.

We compare the proposed approach with two baseline methods: the classic MDP policy iteration, and a goal-oriented planner which maintains the maximum vehicle speed $v_{max}$ and always keeps the vehicle heading angle $\psi$ towards the goal direction. The latter method has been widely used in navigating underwater vehicles due to its "effective but lightweight" properties [29]. The evaluation is based on two criteria: (i) the time cost of the trajectory, and (ii) the approximation error of the value function.

Fig. 4: Trajectory comparisons of 10 trials under gyre disturbances. Panels (a), (d), (g): classic policy iteration; (b), (e), (h): approximate policy iteration; (c), (f), (i): goal-oriented planner. The top row demonstrates the paths under weak gyre disturbances, where $A = 0.5$ and the standard deviations of the disturbance velocity vector are set to $\sigma_x^2 = \sigma_y^2 = 1$ km/h. The middle row shows the paths under stronger and more uncertain disturbances, where $A = 1.5$ and $\sigma_x^2 = \sigma_y^2 = 3$ km/h. The bottom row demonstrates the paths with random obstacles (e.g., oil platforms or islets) in the environment.

We first consider a wind-driven gyre model, which is commonly used in analyzing flow patterns in large-scale recirculation regions in the ocean [30]. The velocity components are given by $v_{dx}(\mathbf{x}) = -\pi A \sin(\pi \frac{x}{s}) \cos(\pi \frac{y}{s})$ and $v_{dy}(\mathbf{x}) = \pi A \cos(\pi \frac{x}{s}) \sin(\pi \frac{y}{s})$, respectively, where $A$ denotes the strength of the current and $s$ determines the size of the gyres. The dimension of the ocean surface is set to 40 km × 40 km. The surface is discretized into 2 km × 2 km cells, yielding $20 \times 20 = 400$ states. We set the maximum vehicle speed $v_{max} = 3$ km/h as a fixed value in all experiments. To match the large spatial dimension, the action execution time (resolution) $dt$ is set to 1 h for the MDP policies.
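The gyre field itself is a two-line formula; for reference, a vectorized sketch (ours):

```python
import numpy as np

def gyre_velocity(x, y, A, s):
    """Wind-driven gyre velocity components [30] used in the simulations."""
    vx = -np.pi * A * np.sin(np.pi * x / s) * np.cos(np.pi * y / s)
    vy = np.pi * A * np.cos(np.pi * x / s) * np.sin(np.pi * y / s)
    return vx, vy
```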

Because we are interested in minimizing the travel time, we impose a cost for each action execution. The reward function is designed as follows:

$$R(s, a, s') = \begin{cases} 0 & \text{if } s' \text{ is the goal, or } s \text{ is the goal or an obstacle,} \\ -1 & \text{if } s' \text{ is an obstacle,} \\ -0.1 & \text{otherwise.} \end{cases} \qquad (12)$$

Fig. 5: The time costs and trajectory lengths of the three methods under different disturbance strengths. These results are averaged over 10 trials.

A. Trajectories and Time Costs

We present the performance evaluations in terms of trajectories and time costs. The entire state space is used for evaluating all methods. A detailed comparison of using a subset of the state space with approximate policy iteration is demonstrated in Sect. V-B. We set $s = 20$, which generates four gyre areas. To simulate the uncertainty of estimations during each experimental trial, we sample Gaussian noise from Eq. (2) and add it to the calculated velocity components of the gyre model.

We examined the trajectories of the different methods under weak and strong disturbances in Fig. 4. We ran each method 10 times; the colored curves represent the accumulated trajectories of the multiple trials. For the weak disturbance, the parameter $A$ is set to 0.5 and the uncertainty parameter is set to 1 km/h. The resulting maximum ocean current speed is $(\pi/2 \pm 1)$ km/h, which is smaller than the vehicle maximum speed of 3 km/h. In this case, the vehicle is capable of moving forward against the ocean currents. In contrast, the strong ocean current gives a maximum speed of $(1.5\pi \pm 3)$ km/h, which can be larger than the vehicle maximum speed. Comparing the first row with the bottom two rows of Fig. 4, we can easily observe that under weak ocean disturbances the three methods produce trajectories that "converge" well and look similar, because the weak disturbance results in smaller vehicle motion uncertainty. The goal-oriented planner has a better path (Fig. 4c) in terms of travel distance (path length) and time cost under the weak disturbance. On the other hand, the MDP policies are able to leverage the fact that moving in the same direction as the ocean current allows the vehicle to obtain a faster net speed and a higher transition probability in the current direction. This phenomenon can be observed from Fig. 4(d) to (i). Specifically, with strong disturbances, instead of taking a more direct path towards the goal position, the MDP policies take a longer path but with a faster speed by following the direction of the ocean currents. In contrast, the goal-oriented planner is not able to reach the goal under the strong disturbance. Planning in the presence of obstacles is shown in the last row of Fig. 4. It is worth noting that the MDP policies have three circular trajectories around the top-left gyre. We attribute these occasional circular behaviors to the noisy motion driving the vehicle north towards the obstacle; the MDP policies then take the path that follows the direction of the circular ocean current to avoid possible collisions.

TABLE I: Time and trajectory costs averaged over 10 trials with different ocean current strengths A, where PI stands for classic policy iteration (the best-performing statistics per row were highlighted in bold in the original table).

A     Metric      PI (400 states)   k = 1 (400 states)   k = 2 (200 states)
0     Time cost   16.65 ± 0.135     16.51 ± 0.12         16.49 ± 0.14
      Traj. len.  50.06 ± 0.135     49.65 ± 0.48         49.19 ± 0.169
0.5   Time cost   15.15 ± 0.114     16.98 ± 0.716        14.87 ± 0.17
      Traj. len.  54.23 ± 0.55      58.69 ± 1.43         52.71 ± 1.02
1.0   Time cost   12.4 ± 0.14       12.29 ± 0.055        12.07 ± 0.232
      Traj. len.  60.57 ± 1.03      61.38 ± 0.29         60.83 ± 1.02
1.5   Time cost   11.32 ± 0.14      12.29 ± 0.055        10.83 ± 0.37
      Traj. len.  66.96 ± 0.75      68.22 ± 1.08         62.79 ± 1.03
2.0   Time cost   9.71 ± 0.34       10.18 ± 0.26         12.6 ± 1.02
      Traj. len.  68.98 ± 1.52      70.49 ± 0.86         82.97 ± 0.176

We further quantitatively evaluated the average time costs and trajectory lengths of the three methods under different disturbance strengths with fixed $\sigma_x^2 = \sigma_y^2 = 3$ km/h, as illustrated in Fig. 5. The proposed approximate policy iteration has overall performance comparable to the classic policy iteration, with the added advantage of outputting a continuous value function. The goal-oriented planner yields a slightly better result when no disturbance is present ($A = 0$). However, as soon as noticeable disturbance is present, its performance degrades dramatically. In addition, it halted before arriving at the goal because it reached the given time budget (30 h with $A \geq 1.0$).

B. Performance with Different Resolution Parameters

We also investigated the performance of the approximate policy iteration when only a subset of the state space is used to evaluate the value function. The environment parameters are the same as the ones used in the previous section.

Fig. 6: Value functions calculated by approximate policy iteration (right column) and the exact solution (left column). The environment parameters and settings are the same as the ones in Fig. 4.

Table I provides statistics comparing classic policy iteration and two different resolution parameters ($k$), where $k = 2$ means the approximate policy evaluation is conducted with only $1/2$ the size of the MDP state space. The results reveal that, in most cases, approximating the value function using half of the state space excels.

The continuous value function computed by the finite element method has the advantage of saving computational time. With an MDP of $N$ states, each policy iteration requires solving a system of $N$ linear equations, where the time complexity is roughly $O(N^3)$. In contrast, the finite element method with resolution parameter $k$ requires a time complexity of only $O((N/k)^3)$; for example, with $N = 400$ and $k = 2$, each evaluation solves a linear system at roughly one eighth of the full cost.

C. Value Function Approximation Error

We also present an analysis of the approximation error of the value function. From Fig. 6, it can be seen that the approximate value function is almost identical to the value function computed by the classic policy iteration. In contrast to the value function computed by classic policy iteration, which is discrete in nature, the value function computed by the approximation method is continuous and can be evaluated at every designated location. We can see that the last row of Fig. 6, which is the environment with random obstacles, has the largest discrepancy between the exact and approximate value functions. The reason is that adding multiple obstacles causes the true value function to be non-smooth, as shown in Fig. 6e. The non-smoothness is caused by the definition of the reward function in Eq. (12); thus, the resulting PDE solution gives a less accurate approximation. To overcome this problem, the boundary values for obstacle areas should be carefully considered. We leave this task for future work.

Fig. 7: MSE between the approximate value function and the value function computed by the classic policy iteration. Even with an increasing state space, the gap between the approximate and exact solutions remains small.

Fig. 7 shows the mean squared error (MSE) between the approximate value function and the value function computed by the exact policy iteration. The value of the exact solution is around two to three orders of magnitude larger than the error. As illustrated by the blue line, even when using a coarser resolution, i.e., $k = 2$, we still achieve a relatively small error.

D. Evaluation with Ocean Data

In addition to the gyre model, Regional Ocean Model System (ROMS) [31] data is also used to evaluate our approach. The dataset provides ocean current data in the Southern California Bight region, which allows us to use Gaussian Process Regression (GPR) to interpolate the ocean currents at every location on the 2-D ocean surface. We use one time snapshot to predict the speed of the ocean currents. The environment is discretized into $28 \times 28$ cells, each with a dimension of 2 km × 2 km. We use $k = 2$ when approximating the value function. Fig. 8 shows the trajectories of the three methods. Note that the maximum speed of the ocean currents from this dataset is around 3 km/h, which is similar to the vehicle's maximum linear speed. We can see that the trajectories of approximate policy iteration are smoother (with shorter distances and time costs) compared with those from the classic policy iteration approach.

Fig. 8: Trajectories from the three methods using ocean data. (a) Classic policy iteration; (b) approximate policy iteration; (c) goal-oriented planner.

VI. CONCLUSIONS

In this paper, we propose a solution to the autonomous vehicle planning problem using a continuous approximate value function for the infinite horizon MDP, with finite element methods as the key tool. Our method leads to an accurate and continuous form of the value function even with a subset of states from a very low-resolution state space. We achieve this by leveraging the diffusion-type approximation of the value function, where the value function satisfies a partial differential equation which can be naturally solved using finite element methods. Extensive simulations and evaluations have demonstrated the advantages of our method in providing continuous value functions, and in producing better path results in terms of path smoothness, travel distance, and time costs, even with a smaller state space.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] J. Van Den Berg, S. Patil, and R. Alterovitz, "Motion planning under uncertainty using iterative local optimization in belief space," The International Journal of Robotics Research, vol. 31, no. 11, pp. 1263–1278, 2012.
[3] Y. Luo, H. Bai, D. Hsu, and W. S. Lee, "Importance sampling for online planning under uncertainty," The International Journal of Robotics Research, p. 0278364918780322, 2016.
[4] R. He, E. Brunskill, and N. Roy, "PUMA: Planning under uncertainty with macro-actions," in AAAI, 2010.
[5] A. Braverman, I. Gurvich, and J. Huang, "On the Taylor expansion of value functions," arXiv preprint arXiv:1804.05011, 2018.
[6] J. Buchli, F. Farshidian, A. Winkler, T. Sandy, and M. Giftthaler, "Optimal and learning control for autonomous robots," arXiv preprint arXiv:1708.09342, 2017.
[7] C. Boutilier, T. Dean, and S. Hanks, "Decision-theoretic planning: Structural assumptions and computational leverage," Journal of Artificial Intelligence Research, vol. 11, pp. 1–94, 1999.
[8] W. H. Al-Sabban, L. F. Gonzalez, and R. N. Smith, "Wind-energy based path planning for unmanned aerial vehicles using Markov decision processes," in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 784–789.
[9] D. Kularatne, H. Hajieghrary, and M. A. Hsieh, "Optimal path planning in time-varying flows with forecasting uncertainties," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
[10] V. T. Huynh, M. Dunbabin, and R. N. Smith, "Predictive motion planning for AUVs subject to strong time-varying currents and forecasting uncertainties," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1144–1151.
[11] L. Liu and G. S. Sukhatme, "A solution to time-varying Markov decision processes," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1631–1638, 2018.
[12] G. Kewlani, G. Ishigami, and K. Iagnemma, "Stochastic mobility-based path planning in uncertain environments," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2009, pp. 1183–1189.
[13] J. Van Den Berg, P. Abbeel, and K. Goldberg, "LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information," The International Journal of Robotics Research, vol. 30, no. 7, pp. 895–913, 2011.
[14] R. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, pp. 181–211, 1999.
[15] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[16] R. F. Stengel, Optimal Control and Estimation. Courier Corporation, 1994.
[17] C. Szepesvári and R. Munos, "Finite time bounds for sampling based fitted value iteration," in Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005, pp. 880–887.
[18] A. Antos, C. Szepesvári, and R. Munos, "Fitted Q-iteration in continuous action-space MDPs," in Advances in Neural Information Processing Systems, 2008, pp. 9–16.
[19] D. J. Lizotte, "Convergent fitted value iteration with linear function approximation," in Advances in Neural Information Processing Systems, 2011, pp. 2537–2545.
[20] L. Baird, "Residual algorithms: Reinforcement learning with function approximation," in Machine Learning Proceedings 1995. Elsevier, 1995, pp. 30–37.
[21] I. Babuška, "Error-bounds for finite element method," Numerische Mathematik, vol. 16, no. 4, pp. 322–333, 1971.
[22] A. F. Shchepetkin and J. C. McWilliams, "The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model," Ocean Modelling, vol. 9, no. 4, pp. 347–404, 2005.
[23] A. F. Blumberg and G. L. Mellor, "A description of a three-dimensional coastal ocean circulation model," Three-Dimensional Coastal Ocean Models, vol. 4, pp. 1–16, 1987.
[24] R. N. Smith, J. Kelly, K. Nazarzadeh, and G. S. Sukhatme, "An investigation on the accuracy of regional ocean models through field trials," in 2013 IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 3436–3442.
[25] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1, no. 2. Athena Scientific.
[26] L. C. Evans, Partial Differential Equations: Second Edition (Graduate Series in Mathematics). American Mathematical Society, 2010.
[27] J. T. Oden and J. N. Reddy, An Introduction to the Mathematical Theory of Finite Elements. Courier Corporation, 2012.
[28] T. J. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analysis. Courier Corporation, 2012.
[29] K.-C. Ma, L. Liu, H. K. Heidarsson, and G. S. Sukhatme, "Data-driven learning and planning for environmental sampling," Journal of Field Robotics, vol. 35, no. 5, pp. 643–661, 2018.
[30] D. Kularatne, S. Bhattacharya, and M. A. Hsieh, "Time and energy optimal path planning in general flows," in Robotics: Science and Systems, 2016.
[31] A. F. Shchepetkin and J. C. McWilliams, "The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model," Ocean Modelling, vol. 9, no. 4, pp. 347–404, 2005.