



Optimal Control of Non-deterministic Systems for a Computationally Efficient Fragment of Temporal Logic

Eric M. Wolff, Ufuk Topcu, and Richard M. Murray

Abstract— We develop a framework for optimal control policy synthesis for non-deterministic transition systems subject to temporal logic specifications. We use a fragment of temporal logic to specify tasks such as safe navigation, response to the environment, persistence, and surveillance. By restricting specifications to this fragment, we avoid a potentially doubly-exponential automaton construction. We compute feasible control policies for non-deterministic transition systems in time polynomial in the size of the system and specification. We also compute optimal control policies for average, minimax (bottleneck), and average cost-per-task-cycle cost functions. We highlight several interesting cases when these can be computed in time polynomial in the size of the system and specification. Additionally, we make connections between computing optimal control policies for an average cost-per-task-cycle cost function and the generalized traveling salesman problem. We give simulation results for motion planning problems.

I. INTRODUCTION

The responsibilities given to robots, autonomous vehicles, and other cyberphysical systems continue to increase faster than our ability to reason about the correctness of their behavior. In safety-critical applications like autonomous driving and air traffic management, it is desirable to unambiguously specify the desired system behavior and automatically synthesize a controller that provably implements this behavior. Additionally, many applications involve non-determinism (in the system and environment) and demand efficient (not just feasible) controllers.

Linear temporal logic (LTL) is an expressive task-specification language for specifying a variety of tasks, such as safety (always avoid B), response (if A, then B), persistence (eventually always stay in A), and recurrence (infinitely often visit A). Given a finite abstraction of a dynamical system (see [1]–[7]) and an LTL specification, feasible control policies can be automatically created for non-deterministic systems [8]. Unfortunately, the complexity of synthesizing a control policy that ensures that a non-deterministic system satisfies an LTL formula is doubly-exponential in the formula length [9].

The poor worst-case complexity of controller synthesis for LTL has led to the development of fragments of LTL that are both useful and computationally efficient to reason about [10]–[15]. Notably, generalized reactivity(1) (GR(1)) [11] is an efficient fragment for synthesizing feasible control policies for non-deterministic systems. A GR(1) formula is of the form: environment assumptions imply system guarantees. This framework has been used to create reactive control protocols for robots and autonomous vehicles [7], [16]. The generalized Rabin(1) fragment subsumes GR(1) and is the largest fragment of LTL that can be solved in time polynomial in the size of the system and specification [13].

Eric M. Wolff and Richard M. Murray are with the Department of Control and Dynamical Systems, California Institute of Technology, Pasadena, CA. Ufuk Topcu is with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA. The corresponding author is [email protected]

We use the fragment of temporal logic introduced by the authors in [15]. This fragment includes the same system guarantees as generalized Rabin(1), i.e., all system guarantees in GR(1) and persistence. However, we limit the environmental liveness assumptions. In this paper, we show that for this temporal logic fragment, one can compute correct and optimal control policies for non-deterministic transition systems subject to a variety of relevant cost functions.

Prior work has focused on finding feasible control policies for systems subject to LTL specifications. For deterministic systems, control policies can be computed that optimize minimax (or bottleneck) [17], weighted average [18], and cost-per-task-cycle [12] cost functions. Chatterjee et al. [19] compute optimal policies for non-deterministic systems subject to parity constraints and an average cost function. Jing and Kress-Gazit [20] give an incomplete method for non-deterministic systems in the GR(1) framework.

Our main contributions are methods for optimal control policy synthesis for non-deterministic transition systems subject to constraints from an expressive temporal logic fragment. We compute control policies that are optimal for average, minimax (bottleneck), and average cost-per-task-cycle cost functions. Due to the non-determinism (in the system and environment), we minimize the worst case for each cost function. As our techniques do not require the so-called property automaton construction, we avoid the potentially doubly-exponential blow-up in the length of the specification. Additionally, we make close connections between control policy synthesis and dynamic programming.

We extend a special case of the minimax (bottleneck) cost function considered in [17] to non-deterministic systems. We get an exponential time improvement for deterministic systems, as we do not require a Büchi automaton construction for the LTL formula. For an average cost-per-task-cycle cost function (see [12]), we show that finding an optimal control policy is equivalent to solving a generalized traveling salesman problem, which is NP-hard, but for which many efficient heuristic algorithms exist. Finally, we improve on our previous results [15] for computing feasible control policies.

Submitted, 2013 Conference on Decision and Control (CDC)
http://www.cds.caltech.edu/~murray/papers/wtm13-cdc.html


II. PRELIMINARIES

In this section we give background on the system model and specification language that we consider. The notation follows that used in the authors' previous work [15].

An atomic proposition is a statement that is either True or False. A propositional formula is composed of only atomic propositions and propositional connectives, i.e., ∧ (and) and ¬ (not). The cardinality of a set X is denoted by |X|.

A. System model

We use non-deterministic finite transition systems to model the system behavior.

Definition 1. A non-deterministic (finite) transition system (NTS) is a tuple T = (S, A, R, s₀, AP, L, c) consisting of a finite set of states S, a finite set of actions A, a transition function R : S × A → 2^S, an initial state s₀ ∈ S, a set of atomic propositions AP, a labeling function L : S → 2^AP, and a non-negative cost function c : S × A × S → ℝ.

Let A(s) denote the set of available actions at state s. Denote the parents of the states in the set S′ ⊆ S by Parents(S′) := {s ∈ S | ∃a ∈ A(s) such that R(s, a) ∩ S′ ≠ ∅}. The set Parents(S′) includes all states in S that can (possibly) reach S′ in a single transition. We assume that the transition system is non-blocking, i.e., |R(s, a)| ≥ 1 for each state s ∈ S and action a ∈ A(s). A deterministic transition system (DTS) is a non-deterministic transition system where |R(s, a)| = 1 for each state s ∈ S and action a ∈ A(s).
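As a concrete illustration (a sketch using a hypothetical dict-based encoding of R, not code from the paper), Parents(S′) can be computed directly from this definition:

```python
# Sketch: Parents(S') for an NTS whose transition function R is stored as a
# dict mapping (state, action) -> set of successor states (non-blocking).
def parents(R, S_prime):
    """All states s with some action a such that R(s, a) intersects S'."""
    return {s for (s, a), succs in R.items() if succs & S_prime}

# Hypothetical 4-state system with a single action 0:
R = {(1, 0): {2, 3}, (2, 0): {2}, (3, 0): {4}, (4, 0): {4}}
```

For example, parents(R, {4}) returns {3, 4}: only states 3 and 4 have a transition that can possibly enter state 4.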

A memoryless control policy for a non-deterministic transition system T is a map µ : S → A, where µ(s) ∈ A(s) for state s ∈ S. A finite-memory control policy is a map µ : S × M → A × M, where the finite set M is called the memory and µ(s, m) ∈ A(s) × M for state s ∈ S and mode m ∈ M. An infinite-memory control policy is a map µ : S⁺ → A, where S⁺ is a finite sequence of states ending in state s and µ(s) ∈ A(s).

Given a state s ∈ S and action a ∈ A(s), there may be multiple possible successor states in the set R(s, a), i.e., |R(s, a)| > 1. A single successor state t ∈ R(s, a) is non-deterministically selected. We interpret this selection (or action) as an uncontrolled, adversarial environment resolving the non-determinism.

A run σ = s₀s₁s₂... of T is an infinite sequence of its states, where sᵢ ∈ S is the state of the system at index i (also denoted σᵢ) and for each i = 0, 1, ..., there exists a ∈ A(sᵢ) such that sᵢ₊₁ ∈ R(sᵢ, a). A word is an infinite sequence of labels L(σ) = L(s₀)L(s₁)L(s₂)... where σ = s₀s₁s₂... is a run. The set of runs of T with initial state s ∈ S induced by a control policy µ is denoted by T^µ(s).

Connections to graph theory: We will often consider a non-deterministic transition system as a graph, with the natural bijection between the states and transitions of the transition system and the vertices and edges of the graph. Let G = (S, R) be a directed graph (digraph) with vertices S and edges R. There is an edge e from vertex s to vertex t if and only if t ∈ R(s, a) for some a ∈ A(s). A digraph G = (S, R) is strongly connected if there exists a path between each pair of vertices s, t ∈ S no matter how the non-determinism is resolved. A digraph G′ = (S′, R′) is a subgraph of G = (S, R) if S′ ⊆ S and R′ ⊆ R. The subgraph of G restricted to states S′ ⊆ S is denoted by G|_{S′}. A digraph G′ ⊆ G is a strongly connected component if it is a maximal strongly connected subgraph of G.

B. A fragment of temporal logic

We use the fragment of temporal logic introduced in [15] to specify tasks such as safe navigation, immediate response to the environment, persistent coverage, and surveillance. For a propositional formula φ, the notation □φ means that φ is always true, ◇φ means that φ is eventually true, □◇φ means that φ is true infinitely often, and ◇□φ means that φ is eventually always true [21].

Syntax: We consider formulas of the form

φ = φ_safe ∧ φ_resp ∧ φ_per ∧ φ_task ∧ φ^ss_resp,   (1)

where

φ_safe := □ψ₁,
φ_resp := ⋀_{j∈I₂} □(ψ_{2,j} ⇒ ◯ϕ_{2,j}),
φ_per := ◇□ψ₃,
φ_task := ⋀_{j∈I₄} □◇ψ_{4,j},
φ^ss_resp := ⋀_{j∈I₅} ◇□(ψ_{5,j} ⇒ ◯ϕ_{5,j}).

Note that □ψ₁ = □⋀_{j∈I₁} ψ_{1,j} = ⋀_{j∈I₁} □ψ_{1,j} and ◇□ψ₃ = ◇□⋀_{j∈I₃} ψ_{3,j} = ⋀_{j∈I₃} ◇□ψ_{3,j}. In the above definitions, I₁, ..., I₅ are finite index sets and ψ_{i,j} and ϕ_{i,j} are propositional formulas for any i and j.

We refer to each ψ_{4,j} in φ_task as a recurrent task.

Remark 1. Guarantee and obligation, i.e., ◇ψ and □(ψ ⇒ ◇ϕ) respectively, are not included in (1). Neither are disjunctions of formulas of the form (1). This fragment of LTL is incomparable to other commonly used temporal logics, such as computation tree logic and GR(1). See Wolff et al. [15] for details.

Remark 2. Our results easily extend to include a fixed order for some or all of the tasks in φ_task, as well as ordered tasks with different state constraints between the tasks.

Semantics: We use set operations between a run σ of T = (S, A, R, s₀, AP, L, c) and subsets of S where particular propositional formulas hold to define satisfaction of a temporal logic formula [22]. We denote the set of states where propositional formula ψ holds by [[ψ]]. A run σ satisfies the temporal logic formula φ, denoted by σ ⊨ φ, if and only if certain set operations hold.

Let σ be a run of the system T, Inf(σ) denote the set of states visited infinitely often in σ, and Vis(σ) denote the set of states visited at least once in σ. Given propositional formulas ψ and ϕ, we relate satisfaction of a temporal logic formula of the form (1) with set operations as follows:
● σ ⊨ □ψ iff Vis(σ) ⊆ [[ψ]],


Fig. 1. Example of a non-deterministic transition system

● σ ⊨ ◇□ψ iff Inf(σ) ⊆ [[ψ]],
● σ ⊨ □◇ψ iff Inf(σ) ∩ [[ψ]] ≠ ∅,
● σ ⊨ □(ψ ⇒ ◯ϕ) iff σᵢ ∉ [[ψ]] or σᵢ₊₁ ∈ [[ϕ]] for all i,
● σ ⊨ ◇□(ψ ⇒ ◯ϕ) iff there exists an index j such that σᵢ ∉ [[ψ]] or σᵢ₊₁ ∈ [[ϕ]] for all i ≥ j.

A run σ satisfies a conjunction of temporal logic formulas φ = ⋀_{i=1}^{m} φᵢ, denoted by σ ⊨ φ, if and only if the set operations for each temporal logic formula φᵢ hold. A system T under control policy µ satisfies the formula φ at state s ∈ S, denoted T^µ(s) ⊨ φ, if and only if σ ⊨ φ for all σ ∈ T^µ(s). Given a system T, state s ∈ S is winning (with respect to the non-determinism) for φ if there exists a control policy µ such that T^µ(s) ⊨ φ. Let W ⊆ S denote the set of winning states.

An example is given in Figure 1. The non-deterministic transition system T has states S = {1, 2, 3, 4}; labels L(1) = {A}, L(2) = {C}, L(3) = {B}, L(4) = {B, C}; a single action called 0; and transitions R(1, 0) = {2, 3}, R(2, 0) = {2}, R(3, 0) = {4}, R(4, 0) = {4}. From the acceptance conditions, it follows that W = {2, 4} for formula □(A ∨ C), W = {2, 3, 4} for formula □(A ⇒ ◯B), W = {1, 2, 3, 4} for formula ◇□(A ⇒ ◯B), W = {1, 2, 3, 4} for formula □◇C, and W = {3, 4} for formula ◇□B. State 4 is winning for all of the above formulas.

III. PROBLEM STATEMENT

We now formally state the two main problems of the paper and give an overview of our solution approach.

Problem 1. Given a non-deterministic transition system T and a temporal logic formula φ of the form (1), determine whether there exists a control policy µ such that T^µ(s₀) ⊨ φ. Return the control policy µ if it exists.

We introduce a general cost function J to distinguish among solutions to Problem 1. Let J map a set of runs T^µ(s₀) and the corresponding control policy µ to ℝ ∪ {∞}.

Problem 2. Given a non-deterministic transition system T and a temporal logic formula φ of the form (1), determine whether there exists an optimal control policy µ* such that T^{µ*}(s₀) ⊨ φ and J(T^{µ*}(s₀)) ≤ J(T^µ(s₀)) for all feasible µ. Return the control policy µ* if it exists.

We begin by defining the value function in Section IV, which is a key component of all later algorithms. Then, we create feasible control policies (i.e., solve Problem 1) in Section V. Although feasible control policies were created in [15], we present a significantly improved algorithm. Then, we introduce average cost-per-task-cycle, minimax (bottleneck), and average cost functions in Sections VI-A, VI-B, and VI-C, respectively. We discuss procedures for computing optimal control policies for these cost functions in Section VI.

Remark 3. The restriction to the fragment in (1) is critical. For full LTL, even Problem 1 is intractable in general. Determining if there exists such a control policy takes time doubly-exponential in the length of φ [9].

IV. THE VALUE FUNCTION AND REACHABILITY

We introduce standard dynamic programming notions [23], as applied to non-deterministic systems.

We define controlled reachability in a non-deterministic transition system T with a value function. Let B ⊆ S be a set of states that the controller wants the system to reach. Let the controlled value function for system T and target set B be a map V^c_{B,T} : S → ℝ ∪ {∞}, whose value V^c_{B,T}(s) at state s ∈ S is the minimum (over all possible control policies) cost needed to reach the set B, under the worst-case resolution of the non-determinism. If the value V^c_{B,T}(s) = ∞, then the non-determinism can prevent the system from reaching set B from state s ∈ S. For example, consider the system in Figure 1 with unit cost on edges and B = {4}. Then, V^c_B(1) = ∞, V^c_B(2) = ∞, V^c_B(3) = 1, and V^c_B(4) = 0. The system cannot guarantee reaching set B from states 1 or 2.

The value function satisfies the optimality condition

V^c_{B,T}(s) = min_{a∈A(s)} max_{t∈R(s,a)} [ V^c_{B,T}(t) + c(s, a, t) ]

for all s ∈ S. An optimal control policy µ_B for reaching the set B is memoryless [23] and can be computed at each state s ∈ S as

µ_B(s) = argmin_{a∈A(s)} max_{t∈R(s,a)} [ V^c_{B,T}(t) + c(s, a, t) ].

If multiple actions achieve the minimum, select an action in this set with a minimal number of transitions to reach B.

We use the value function to define the controllable predecessor set, CPre^∞_T(B), for a given system T with target set B ⊆ S. Let CPre^∞_T(B) := {s ∈ S | V^c_{B,T}(s) < ∞} be the set of all states that can reach a state in B for any resolution of the non-determinism.

We define forced reachability similarly. Let the forced value function for system T and target set B be a map V^f_{B,T} : S → ℝ ∪ {∞}, whose value V^f_{B,T}(s) at state s ∈ S is the maximum (over all possible control policies) cost of reaching the set B. The forced value function satisfies the optimality condition

V^f_{B,T}(s) = max_{a∈A(s)} max_{t∈R(s,a)} [ V^f_{B,T}(t) + c(s, a, t) ].

For a given system T with target set B ⊆ S, the forced predecessor set FPre^∞_T(B) := {s ∈ S | V^f_{B,T}(s) < ∞} is the set of all states from which no control policy can avoid reaching a state in B.

Remark 4. We consider the case where the controller selects an action, and then the environment selects the next state. Our results easily extend to the case, used in GR(1) [11], where the environment first resolves the non-determinism (selects an action) and then the controller selects its action.

Remark 5. The value function can be computed for all states in O(|S| log |S| + |R|) time (see Algorithm 4 in the Appendix). The predecessor sets can be computed in O(|S| + |R|) time since unit costs can be assumed [15].

V. FEASIBLE CONTROL POLICY

We now solve Problem 1 by creating feasible control policies for non-deterministic transition systems that must satisfy a temporal logic formula of the form (1). Algorithm 2 summarizes the main results from [15]. New results here are the addition of the steady-state, next-step response formula φ^ss_resp and an improved algorithm for computing the winning set for the recurrent task formula φ_task. The φ^ss_resp formula is handled in a similar manner to φ_resp and is not detailed here.

Effectively, the formulas besides φ_task restrict states that can be visited or transitions that can be taken. Safety, φ_safe, limits certain states from ever being visited; next-step response, φ_resp, limits certain transitions from ever being taken; persistence, φ_per, limits certain states from being visited infinitely often; and steady-state, next-step response, φ^ss_resp, limits certain transitions from being taken infinitely often. The remaining formula is recurrence, φ_task, which constrains certain states to be visited infinitely often. One creates the subgraph that enforces all constraints except for φ_task and then computes a finite-memory control policy that repeatedly visits all φ_task constraints. Then, one relaxes the constraints that only have to be satisfied over infinite time, and computes all states that can reach the states that are part of this policy. A feasible control policy exists for all and only the states in this set. Details are in Algorithm 2.

We give an improved algorithm (compared to that in [15]) for the formula φ_task in Algorithm 1. The key insight is that the ordering of tasks does not affect feasibility, which is not the case for optimality (see Section VI).

Proposition 1. Algorithm 1 computes exactly the winning set for φ_task.

Proof: To satisfy the acceptance condition Inf(σ) ∩ [[ψ_{4,j}]] ≠ ∅ for all j ∈ I₄, there must exist non-empty sets F_j ⊆ [[ψ_{4,j}]] such that F_i ⊆ CPre^∞_T(F_j) holds for all i, j ∈ I₄, i.e., all tasks are completed infinitely often. Algorithm 1 selects an arbitrary order on the tasks, e.g., F_{i+1} ⊆ CPre^∞_T(F_i) for all i = 1, 2, ..., |I₄| − 1 and F₁ ⊆ CPre^∞_T(F_{|I₄|}), without loss of generality since tasks are completed infinitely often and the graph does not change. Starting with sets F_j := [[ψ_{4,j}]] for all j ∈ I₄, all and only the states in each F_j that do not satisfy the constraints are iteratively removed. At each iteration, at least one state in F₁ is removed or the algorithm terminates. At termination, each F_j is the largest subset of [[ψ_{4,j}]] from which φ_task can be satisfied. ∎

The outer while loop runs at most |F₁| iterations. During each iteration, CPre^∞_T is computed |I₄| times, which dominates the time required to compute the set intersections (when using a hash table). Thus, the total complexity of Algorithm 1 is O(|I₄| F_min (|S| + |R|)), where F_min = min_{j∈I₄} |F_j|.

Algorithm 1 BUCHI(T, {[[ψ_{4,j}]] for j ∈ I₄})
Input: NTS T, F_j := [[ψ_{4,j}]] ⊆ S for j ∈ I₄
Output: Winning set W ⊆ S
1: while True do
2:   for i = 1, 2, ..., |I₄| − 1 do
3:     F_{i+1} ← F_{i+1} ∩ CPre^∞_T(F_i)
4:     if F_{i+1} = ∅ then
5:       return W ← ∅, F_j ← ∅ for all j ∈ I₄
6:     end if
7:   end for
8:   if F₁ ⊆ CPre^∞_T(F_{|I₄|}) then
9:     return W ← CPre^∞_T(F_{|I₄|}), F_j for all j ∈ I₄
10:  end if
11:  F₁ ← F₁ ∩ CPre^∞_T(F_{|I₄|})
12: end while

Remark 6. Formula φ_task can be treated more efficiently for deterministic transition systems by computing strongly connected components [24] or performing nested depth-first search [25]. These O(|S| + |R|) procedures replace Algorithm 1.

We now detail feasible control policy synthesis in Algorithm 2. Compute the set W of states that are winning for φ (lines 1–7). If the initial state s₀ ∉ W, then no control policy exists. If s₀ ∈ W, compute the memoryless control policy µ_SA which reaches the set SA. Use µ_SA until a state in SA is reached. Then, compute the memoryless control policies µ_j induced from V^c_{F_j, T_safe} for all j ∈ I₄ (line 11, also see Algorithm 1). The finite-memory control policy µ is defined as follows by switching between memoryless policies depending on the current task. Let j ∈ I₄ denote the current task set F_j. The system uses µ_j until a state in F_j is visited. Then, the system updates its task to k = (j mod |I₄|) + 1 and uses control policy µ_k until a state in F_k is visited, and so on. The total complexity of the algorithm is O((|I₂| + |I₅| + |I₄| F_min)(|S| + |R|)), which is polynomial in the size of the system and specification.

VI. OPTIMAL CONTROL POLICY

We now solve Problem 2 by computing control policies for non-deterministic systems that satisfy a temporal logic formula of the form (1) and also minimize a cost function. We consider average cost-per-task-cycle, minimax, and average cost functions. The last two admit polynomial-time solutions for deterministic systems and restricted non-deterministic systems. However, we begin with the average cost-per-task-cycle cost function as it is quite natural.

Algorithm 2 Overview: Feasible synthesis for NTS
Input: Non-deterministic system T and formula φ
Output: Control policy µ
1: Compute T_resp on T
2: T_safe ← T_resp|_{S − FPre^∞(S − [[ψ₁]])}
3: T_per ← T_safe|_{S − FPre^∞(S − [[ψ₃]])}
4: Compute T^ss_resp on T_per
5: Ψ := {[[ψ_{4,j}]] for all j ∈ I₄}
6: SA, F := {F₁, ..., F_{|I₄|}} ← BUCHI(T^ss_resp, Ψ)
7: W := CPre^∞_{T_safe}(SA)
8: if s₀ ∉ W then
9:   return "no control policy exists"
10: end if
11: µ_SA ← control policy induced by V^c_{SA, T_safe}
12: µ_j ← control policy induced by V^c_{F_j, T_safe} for all j ∈ I₄
13: return Control policies µ_SA and µ_j for all j ∈ I₄

Fig. 2. A non-deterministic transition system and its task graph.

The values these cost functions take are independent of any finite sequence of states. Thus, we optimize the infinite behavior of the system, which corresponds to completing tasks specified by φ_task on a subgraph of T as constructed in Section V. We assume that we are given T^ss_resp (denoted hereafter by T_inf) and the task sets F_j ⊆ S returned by Algorithm 1 (see Algorithm 2). Note that T_inf is the largest subgraph of T where all constraints from φ_safe, φ_resp, φ_per, and φ^ss_resp hold. Each F_j is the largest set of states for the jth task that are part of a feasible control policy. The problem is now to compute a feasible (winning) control policy that also minimizes the relevant cost function.

Since only the recurrent tasks in φ_task on T_inf will matter for optimization, we construct a new graph that encodes the cost of moving between all tasks. We construct the task graph G′ = (V′, E′) which encodes the cost of optimal control policies between all tasks in φ_task (see Figure 2). Let V′ be partitioned as V′ = ⋃_{j∈I₄} V′_j, where V′_i ∩ V′_j = ∅ for all i ≠ j. Let F_j ⊆ S denote the set of states that correspond to the jth task in φ_task, as returned from Algorithm 1. Create a state v ∈ V′_j for each of the 2^{|F_j|} − 1 non-empty subsets of F_j that are reachable from the initial state. Define the map τ : V′ → 2^S from each state in V′ to subsets of states in S. For each state v ∈ V′, compute the controlled value function V^c_{τ(v), T_inf} on T_inf. For all states u ∈ V′_i and v ∈ V′_j where i ≠ j, define an edge e_uv ∈ E′. Assign a cost to edge e_uv as c_uv := max_{s∈τ(u)} V^c_{τ(v), T_inf}(s). The cost c_uv is the maximum worst-case cost of reaching a state t ∈ τ(v) from a state s ∈ τ(u), when using an optimal control policy.

It is necessary to consider all subsets of states, as the cost of reaching each subset may differ due to the non-determinism. For deterministic systems, one can simply create a state in V′_j for each state in F_j. This is because the cost of all subsets of F_j can be determined by the costs to reach the individual states in F_j.
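A sketch of the task-graph construction (hypothetical helper names; value(B, s) stands in for the controlled value function V^c_{B, T_inf}(s) of Section IV, which would be computed separately):

```python
from itertools import combinations

# Sketch: task-graph vertices are non-empty subsets of each task set F_j,
# and edge costs are c_uv = max_{s in tau(u)} V^c_{tau(v)}(s).
def nonempty_subsets(F):
    return [frozenset(c) for r in range(1, len(F) + 1)
            for c in combinations(sorted(F), r)]

def task_graph(task_sets, value):
    """task_sets: the F_j; value(B, s): worst-case optimal cost from s to B."""
    verts = [(j, u) for j, Fj in enumerate(task_sets)
             for u in nonempty_subsets(Fj)]
    edges = {((i, u), (j, v)): max(value(v, s) for s in u)
             for (i, u) in verts for (j, v) in verts if i != j}
    return verts, edges

# Toy value function (hypothetical): cost 0 if already in the target, else 1.
value = lambda B, s: 0.0 if s in B else 1.0
verts, edges = task_graph([{1, 2}, {3}], value)
```

With two task sets {1, 2} and {3}, this yields the 3 + 1 subset vertices and one edge per ordered pair of vertices in different task sets, matching the partitioned structure of V′.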

Remark 7. It may not be necessary to create the entire task graph at once. For example, one can create a task graph with |I₄| states where each state corresponds to the set F_j. This gives a control policy that leads to an upper bound on the cost of an optimal policy. Additionally, by defining edges in the task graph as the minimum worst-case cost min_{s∈τ(u)} V^c_{τ(v), T_inf}(s) between tasks, one can compute a lower bound on the cost of an optimal policy. One can use the current control policy and improve performance in an anytime manner by adding more states to the subgraph corresponding to subsets of each F_j.

Algorithm 3 Overview: Optimal synthesis for NTS
Input: NTS T, formula φ, cost function J
Output: Optimal control policy µ*
1: Compute T^ss_resp, SA, and F_j for all j ∈ I₄ (see Alg. 2)
2: Compute F*_j ⊆ F_j for all j ∈ I₄ and optimal task order
3: µ*_{F*} ← control policy from V^c_{F*, T_safe}, where F* = ⋃_{j∈I₄} F*_j
4: µ*_j ← control policy from V^c_{F*_j, T_safe} for all j ∈ I₄
5: return µ*_{F*}, µ*_j for all j ∈ I₄ and optimal task order

A. Average cost-per-task-cycle

Recall that for φ_task = ⋀_{j∈I₄} □◇ψ_{4,j}, the propositional formula ψ_{4,j} is the jth task. A run σ of system T completes the jth task at time t if and only if σ_t ∈ [[ψ_{4,j}]]. A task cycle is a sequence of states that completes each task at least once, i.e., it intersects [[ψ_j]] for each j = 1, ..., m at least once. Similarly to [12], we minimize the average cost-per-task-cycle, or equivalently the maximum cost of a task cycle in the limit. For a deterministic system, this corresponds to finding a cycle of minimal cost that completes every task.

We define the cost function over a run σ. Let σ be a run of T under control policy μ, let μ(σ) be the corresponding control input sequence, and let I_TC(t) = 1 indicate that the system completes a task cycle at time t and I_TC(t) = 0 otherwise. The average cost per task cycle of run σ is

J'_TC(σ, μ(σ)) := lim sup_{n→∞} (∑_{t=0}^{n} c(σ_t, μ(σ_t), σ_{t+1})) / (∑_{t=0}^{n} I_TC(t)),

which maps runs and control inputs of T to ℝ ∪ {∞}. This map is well defined when (i) c(σ_t, μ(σ_t), σ_{t+1}) is bounded for all t ≥ 0, and (ii) there exists a t' ∈ ℕ such that I_TC(t) = 1 for infinitely many t ≥ t'. We assume that (i) holds in the sequel and note that (ii) holds for every run that satisfies a formula φ with at least one task. If there are no tasks in φ, one can add the task □◇True so that I_TC(t) = 1 at every time instant (see Section VI-C).
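The finite-prefix approximation of J'_TC can be sketched in a few lines. The function below is purely illustrative (the encodings of the run, its costs, and its task completions are assumptions, not constructs from the paper): it counts completed task cycles along a prefix and divides the accumulated cost by that count.

```python
# Sketch: estimating J'_TC on a finite run prefix. All names are
# illustrative, not from the paper.

def avg_cost_per_task_cycle(costs, completed_tasks, tasks):
    """costs[t]          : transition cost c(sigma_t, mu(sigma_t), sigma_{t+1})
    completed_tasks[t]   : set of task indices whose proposition holds at step t
    tasks                : the index set I_4 of recurrent tasks
    Returns the finite-prefix ratio (total cost) / (# completed task cycles)."""
    pending = set(tasks)          # tasks not yet completed in the current cycle
    cycles = 0                    # sum of I_TC(t) over the prefix
    for done in completed_tasks:
        pending -= set(done)
        if not pending:           # every task seen at least once: cycle ends
            cycles += 1
            pending = set(tasks)  # start counting the next task cycle
    if cycles == 0:
        return float('inf')      # no complete task cycle on this prefix
    return sum(costs) / cycles

# Two tasks {1, 2}; each step costs 1; the tasks alternate every step,
# so each task cycle takes two steps and costs 2.
est = avg_cost_per_task_cycle([1] * 6, [{1}, {2}, {1}, {2}, {1}, {2}], {1, 2})
```

In the limit n → ∞ this ratio converges to J'_TC for any run on which task cycles complete infinitely often, matching condition (ii) above.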

We define the average per-task-cycle cost function

J_TC(T^μ(s)) := max_{σ ∈ T^μ(s)} J'_TC(σ, μ(σ))    (2)

over the set of runs of system T starting from initial state s under control policy μ. The cost function (2) does not depend on any finite behavior of the system, intuitively because any short-term costs are averaged out in the limit.

We next show that Problem 2 with cost function J_TC is at least as hard as the NP-hard generalized traveling salesman problem [26].

Generalized traveling salesman problem [26]: Let G = (V, E) be a digraph with vertices V, edges E, and a non-negative cost c_ij on each edge (i, j) ∈ E. The set V is the disjoint union of p vertex sets, i.e., V = V_1 ∪ ... ∪ V_p, where V_i ∩ V_j = ∅ for all i ≠ j. There are no edges between vertices in the same vertex set. The generalized traveling salesman problem, GTSP = ⟨(V, E), c⟩, is to find a minimum-cost cycle that includes a single vertex from each V_i for all i = 1, ..., p.

Theorem 1. Any instance of the generalized traveling salesman problem can be formulated as an equivalent instance of Problem 2 with the cost function J_TC.

Proof: The proof is by construction. Given an instance of the GTSP ⟨(V, E), c⟩, we solve Problem 2 on a deterministic transition system T = (S, A, R, s_0, AP, L, c) and formula φ. Let S = V ∪ {s_0}. Define the transitions as R(u, a_v) = v, with action a_v ∈ A(u), and costs c(u, a_v, v) = c_uv for each edge e_uv ∈ E. Label all states in vertex set V_i with atomic proposition L_i and let φ = ⋀_{i=1}^{p} □◇L_i. Finally, add transitions from s_0 to every other state s ∈ S. Although Problem 2 does not require that each task is completed only once per cycle, an optimal solution always exists that completes each task once per cycle. □
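The reduction in the proof of Theorem 1 is mechanical and can be sketched directly. The dict-based encoding and the action names `'go_v'` below are assumptions chosen for illustration; the sketch mirrors the construction: one state per GTSP vertex plus s_0, one action per edge, and label L_i on every vertex of V_i.

```python
# Sketch of the Theorem 1 construction: GTSP instance -> deterministic
# transition system. The paper's tuple (S, A, R, s0, AP, L, c) is
# represented here with plain dicts (an illustrative assumption).

def gtsp_to_ts(vertex_sets, edges):
    """vertex_sets: list of disjoint vertex lists V_1, ..., V_p
    edges       : dict (u, v) -> non-negative cost c_uv
    Returns (S, R, c, L): states, transitions, costs, labeling."""
    S = ['s0'] + [v for Vi in vertex_sets for v in Vi]
    R, c, L = {}, {}, {'s0': set()}
    for (u, v), cost in edges.items():
        R[(u, 'go_' + str(v))] = v         # action a_v from u leads to v
        c[(u, 'go_' + str(v), v)] = cost
    for i, Vi in enumerate(vertex_sets):   # label vertex set V_i with L_i
        for v in Vi:
            L[v] = {'L%d' % i}
    for v in S[1:]:                        # s0 connects to every other state
        R[('s0', 'go_' + str(v))] = v
        c[('s0', 'go_' + str(v), v)] = 0
    # The specification is phi = AND_i always-eventually L_i.
    return S, R, c, L
```

Minimizing J_TC on this system then recovers a minimum-cost GTSP cycle.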

Recall that I_4 is the index set of all recurrent tasks. We can fix an arbitrary task ordering, denoted I_4 = {1, ..., |I_4|} (with some abuse of notation), without loss of generality for feasible control policies. However, the order in which we visit tasks matters for optimal control policies. Additionally, we can select this task order ahead of time or update it during execution. We now (potentially conservatively) assume that we select the task order ahead of time. This assumption is not necessary in Sections VI-B or VI-C.

Assumption 1. A control policy attempts to complete tasks in a fixed order.

We optimize the task order over all permutations of fixed task orders. This optimization is a generalized traveling salesman problem on the task graph. While this is an NP-hard problem, practical methods exist for computing exact and approximate solutions [26]. Once the optimal ordering of tasks is computed, the finite-memory control policy switches between these tasks in a manner similar to that described in Section V.
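For a small number of tasks, the permutation optimization can be done by brute force; the sketch below is illustrative (the task-graph cost dict is a hypothetical input, and a real implementation would call a GTSP solver as in [26] rather than enumerate).

```python
from itertools import permutations

# Sketch: choosing a fixed task order (Assumption 1) by brute force over
# permutations, using worst-case task-to-task costs from the task graph.
# cost[(i, j)] is the (assumed precomputed) worst-case cost of reaching
# task j after completing task i.

def best_task_order(tasks, cost):
    best, best_cost = None, float('inf')
    for order in permutations(tasks):
        # cost of one task cycle: visit tasks in 'order', return to start
        cycle = list(order) + [order[0]]
        total = sum(cost[(cycle[k], cycle[k + 1])] for k in range(len(order)))
        if total < best_cost:
            best, best_cost = order, total
    return best, best_cost
```

This enumerates |I_4|! orders, which is only practical for a handful of tasks; it is meant to make the objective concrete, not to replace the GTSP methods cited above.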

B. Minimax (bottleneck) costs

We now consider a minimax (bottleneck) cost function, which minimizes the maximum accumulated cost between completions of tasks. The notation loosely follows [17], which considers a generalization of this cost function for deterministic transition systems with LTL. Let T_task(σ, i) be the accumulated cost at the ith completion of a task in φ_task along a run σ. The minimax cost of run σ is

J'_bot(σ, μ(σ)) := lim sup_{i→∞} (T_task(σ, i+1) − T_task(σ, i)),    (3)

which maps runs and control inputs of T to ℝ ∪ {∞}.

Define the worst-case minimax cost function as

J_bot(T^μ(s)) := max_{σ ∈ T^μ(s)} J'_bot(σ, μ(σ))    (4)

over the set of runs of system T starting from initial state s under control policy μ.

We now solve Problem 2 for the cost function J_bot. First, compute the task graph as in Section VI. The edges in the task graph correspond to the maximum cost accumulated between completions of tasks, assuming that the system uses an optimal strategy. Thus, a minimal value of J_bot can be found by minimizing the maximum edge in the task graph, subject to the constraint that a vertex corresponding to each task can be reached. Select an estimate of J_bot and remove all edges in the task graph whose costs are greater than this value. If there exists a strongly connected component of the task graph that contains a state corresponding to each task and is reachable from the initial state, then we have an upper bound on J_bot; if not, we have a lower bound. This observation leads to a simple procedure: select an estimate of J_bot as the median of the edge costs consistent with the previously computed bounds, remove all edges with costs greater than this estimate, determine whether the subgraph is strongly connected (with respect to the tasks), and then update the bounds. Each iteration requires the computation of strongly connected components and the median of edge costs, both of which can be done in linear time [24]. This procedure terminates with the correct value of J_bot in O(log |E'|) iterations. Thus, the total complexity is O(log |E'| (|V'| + |E'|)), giving a polynomial-time algorithm for deterministic transition systems and for non-deterministic transition systems with a single state per task, i.e., |F_j| = 1 for all j ∈ I_4.
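The threshold-and-check step can be sketched as follows. This is an assumption-laden illustration, not the paper's implementation: for brevity it scans the sorted edge costs linearly rather than bisecting on the median (which changes the complexity but not the answer), and the feasibility check only asks for a strongly connected component containing every task, omitting reachability from the initial state.

```python
# Sketch of the J_bot search on the task graph. Graph encoding, task
# labels, and helper names are illustrative assumptions.

def sccs(nodes, adj):
    """Kosaraju's algorithm: list of strongly connected components."""
    order, seen = [], set()
    def dfs(u, graph, out):
        stack = [(u, iter(graph.get(u, ())))]
        seen.add(u)
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(graph.get(w, ()))))
                    break
            else:
                stack.pop()
                out.append(v)          # v finished: record finish order
    for u in nodes:
        if u not in seen:
            dfs(u, adj, order)
    radj = {}                          # reverse graph
    for u in adj:
        for v in adj[u]:
            radj.setdefault(v, []).append(u)
    seen, comps = set(), []
    for u in reversed(order):          # decreasing finish time
        if u not in seen:
            comp = []
            dfs(u, radj, comp)
            comps.append(set(comp))
    return comps

def min_bottleneck(nodes, edges, task_of):
    """edges: dict (u, v) -> cost; task_of: vertex -> task id (or None).
    Smallest threshold c such that the subgraph with edge costs <= c has
    one SCC covering all tasks."""
    all_tasks = {t for t in task_of.values() if t is not None}
    def feasible(c):
        adj = {}
        for (u, v), w in edges.items():
            if w <= c:                 # keep only edges within the estimate
                adj.setdefault(u, []).append(v)
        return any(all_tasks <= {task_of.get(v) for v in comp}
                   for comp in sccs(nodes, adj))
    for c in sorted(set(edges.values())):
        if feasible(c):
            return c
    return float('inf')
```

Replacing the linear scan with bisection on the median edge cost recovers the O(log |E'|) iteration bound claimed above.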

C. Average costs

The average cost of run σ is

J'_avg(σ, μ(σ)) := lim sup_{n→∞} (∑_{t=0}^{n} c(σ_t, μ(σ_t), σ_{t+1})) / n,

which maps runs and control inputs of T to ℝ ∪ {∞}. We now define the worst-case average cost function,

J_avg(T^μ(s)) := sup_{σ ∈ T^μ(s)} J'_avg(σ, μ(σ))    (5)

over the set of runs of system T starting from initial state s under control policy μ.

We now solve Problem 2 for the cost function J_avg. Note that this cost function corresponds to J_TC with φ_task = □◇True and no additional tasks.

For non-deterministic transition systems, Problem 2 reduces to solving a mean-payoff parity game on T_inf [19]. An optimal control policy will typically require infinite memory, as opposed to the finite-memory control policies that we have considered. Such a policy alternates between a feasible control policy and an unconstrained minimum-cost control policy, spending an increasing amount of time using the unconstrained minimum-cost control policy. Given the subgraph T_inf and the feasible task sets F_j ⊆ S (see

TABLE I
COMPLEXITY OF FEASIBLE POLICY SYNTHESIS

Language      | DTS                    | NTS
Frag. in (1)  | O(|φ|(|S| + |R|))      | O(|φ| F_min (|S| + |R|))
GR(1)         | O(|φ||S||R|)           | O(|φ||S||R|)
LTL           | O(2^|φ| (|S| + |R|))   | O(2^(2^|φ|) (|S| + |R|))

Algorithm 2), one can compute an optimal control policy using the results of Chatterjee et al. [19]. For deterministic systems, extensions to a more general weighted average cost function can be found in [18].

VII. COMPLEXITY

We summarize our complexity results for feasible control policy synthesis and compare with LTL and GR(1) [11]. We assume that set membership is determined in constant time with a hash function [24]. We denote the length of a temporal logic formula by |φ|. Let |φ| = |I_2| + |I_4| + |I_5| for the fragment in (1), |φ| = mn for a GR(1) formula with m assumptions and n guarantees, and |φ| be the formula length for LTL [21]. Recall that F_min = min_{j∈I_4} |[[p_{4,j}]]|. For typical motion planning specifications, F_min ≪ |S| and |φ| is small. We use the non-symbolic complexity results for GR(1) in [11]. The results are summarized in Table I.

We now summarize the complexity of optimal control policy synthesis. The task graph G' = (V', E') has O(∑_{i∈I_4} 2^|F_i| − 1) states and can be computed in O((∑_{i∈I_4} 2^|F_i| − 1)(|S| log|S| + |R|)) time. Computing an optimal control policy for J_TC requires solving an NP-hard generalized traveling salesman problem on G'. Computing an optimal control policy for J_bot requires O(log|E'| (|V'| + |E'|)) time. An optimal control policy for J_avg can be computed in pseudo-polynomial time [19]. For deterministic systems, the task graph has O(∑_{i∈I_4} |F_i|) states and can be computed in O((∑_{i∈I_4} |F_i|)(|S| log|S| + |R|)) time. An optimal control policy for J_avg can be computed in O(|S||R|) time. Thus, we can compute optimal control policies for deterministic transition systems with cost functions J_bot and J_avg in time polynomial in the size of the system and specification. Additionally, for non-deterministic transition systems where |F_j| = 1 for all j ∈ I_4, we can compute optimal control policies for J_bot in time polynomial in the size of the system and specification.

Remark 8. The fragment in (1) is not handled well by standard approaches. Using ltl2ba [27], we constructed Büchi automata for formulas of the form φ_resp. Both the automaton size and the time to compute it increased exponentially with the number of conjunctions in φ_resp.

VIII. EXAMPLES

The following examples (based on those in [15]) demonstrate the techniques developed in Sections V and VI for tasks motivated by robot motion planning in a planar environment (see Figure 3). All computations were done in Python on a dual-core Linux desktop with 2 GB of memory. All computation times were averaged over five arbitrarily generated problem instances and include construction of the

Fig. 3. Left: Diagram of the deterministic setup (n = 10). Only white cells are labeled 'stockroom.' Right: Diagram of the non-deterministic setup (n = 10). A dynamic obstacle (obs) moves within the shaded region.

Fig. 4. Control policy synthesis times for deterministic (left) and non-deterministic (right) grids.

transition system. Due to lack of space, we only consider the average cost-per-task-cycle cost function.

A. Deterministic transition system

Consider a gridworld in which a robot occupies a single cell at a time and, at each step, can either remain in its current cell or move to one of the four adjacent cells. We consider square grids with a static obstacle density of 20 percent. The robot's task is to eventually remain in the stockroom while repeatedly visiting a pickup location P and multiple dropoff locations D0, D1, D2, D3. The robot must never collide with a static obstacle. The set of atomic propositions is {P, D0, D1, D2, D3, stockroom, obs}. This task is formalized by φ = ◇□stockroom ∧ □◇P ∧ ⋀_{j∈I_4} □◇D_j ∧ □¬obs. In all of the following results, each D_j holds at a single state in the transition system. Results for optimal control policy synthesis are shown in Figure 4 for n × n grids, where n ∈ {200, 300, 400}.

B. Non-deterministic transition system

We now consider a similar setup with a dynamically moving obstacle. The state of the system is the product of the robot's location and the obstacle's location, both of which can move as previously described for the robot. The robot selects an action, and then the obstacle moves non-deterministically. The robot's task is similar to before and is formalized as φ = □◇P ∧ ⋀_{j∈I_4} □◇D_j ∧ □¬obs. Results for optimal control policy synthesis are shown in Figure 4 for n × n grids, where n ∈ {10, 14, 18}.
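The gridworld dynamics used in both examples are straightforward to encode. The sketch below is an illustrative assumption (function and action names are made up): it builds the robot's transitions for an n × n grid with static obstacles; the non-deterministic example would take a product of two such systems, with the obstacle's move resolved adversarially.

```python
# Sketch: transitions for the Section VIII gridworld. States are free
# cells (row, col); the robot may stay or move to an adjacent free cell.
# Encoding and names are illustrative assumptions.

def grid_transitions(n, obstacles):
    """Returns R: dict (state, action) -> successor state, for an n x n
    grid with the given set of obstacle cells."""
    moves = {'stay': (0, 0), 'N': (-1, 0), 'S': (1, 0),
             'W': (0, -1), 'E': (0, 1)}
    R = {}
    for r in range(n):
        for c in range(n):
            if (r, c) in obstacles:
                continue                      # obstacle cells are not states
            for a, (dr, dc) in moves.items():
                t = (r + dr, c + dc)
                if 0 <= t[0] < n and 0 <= t[1] < n and t not in obstacles:
                    R[((r, c), a)] = t        # only in-bounds, free targets
    return R
```

Unit transition costs then make J_TC count steps per task cycle, as in the reported experiments.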

IX. CONCLUSIONS

We have presented a framework for optimal control policy synthesis for non-deterministic transition systems with specifications from a fragment of temporal logic. Our approach is simple and makes explicit connections with dynamic programming through our extensive use of value functions.

Additionally, optimal policies can be computed in polynomial time for certain combinations of cost functions and system restrictions. Future work will investigate incremental computation of the value function, extensions to Markov decision processes, and the underlying automata structure of this fragment of temporal logic.

APPENDIX

Algorithm 4 computes the controlled value function as defined in Section IV. This algorithm should be viewed as a minor extension of Dijkstra's algorithm [24] to non-deterministic transition systems. Similar reasoning applies to the forced value function. The set Q below is a priority queue with standard operations.

Algorithm 4 Value function (controlled)
Input: NTS T, set B ⊆ S
Output: The (controlled) value function V^c_{B,T}

V^c_{B,T}(s) ← ∞ for all s ∈ S − B
V^c_{B,T}(s) ← 0 for all s ∈ B
Q ← S
while Q ≠ ∅ do
    u ← EXTRACT-MIN(Q)
    if V^c_{B,T}(u) = ∞ then
        return V^c_{B,T}
    for s ∈ Parents(u) ∩ Q do
        tmp ← min_{a∈A(s)} max_{t∈R(s,a)} V^c_{B,T}(t) + c(s, a, t)
        if tmp < V^c_{B,T}(s) then
            V^c_{B,T}(s) ← tmp
            DECREASE-KEY(Q, s, V^c_{B,T}(s))
return V^c_{B,T}
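Algorithm 4 can be transcribed almost line for line. The sketch below is an illustrative assumption rather than the authors' implementation: it uses a dict-based NTS encoding, a caller-supplied `actions(s)` function, and `heapq` with lazy deletion in place of DECREASE-KEY (which Python's heap does not provide).

```python
import heapq

# Sketch of Algorithm 4 in Python. R[(s, a)] is the set of possible
# successors of s under action a (the non-determinism); c[(s, a, t)] is
# the transition cost; B is the target set. Encoding is an assumption.

def controlled_value(states, actions, R, c, B):
    V = {s: (0.0 if s in B else float('inf')) for s in states}
    done = set()                         # states removed from Q (finalized)
    heap = [(V[s], s) for s in states]
    heapq.heapify(heap)
    # parents[u]: states with some action that can reach u
    parents = {s: set() for s in states}
    for (s, a), succs in R.items():
        for t in succs:
            parents[t].add(s)
    while heap:
        d, u = heapq.heappop(heap)
        if u in done or d > V[u]:
            continue                     # stale heap entry (lazy deletion)
        if V[u] == float('inf'):
            return V                     # remaining states cannot reach B
        done.add(u)
        for s in parents[u] - done:
            # worst case over successors of each action; best action wins
            tmp = min(max(V[t] + c[(s, a, t)] for t in R[(s, a)])
                      for a in actions(s))
            if tmp < V[s]:
                V[s] = tmp
                heapq.heappush(heap, (tmp, s))  # stands in for DECREASE-KEY
    return V
```

The lazy-deletion variant keeps the O(|S| log|S| + |R|)-style behavior in practice: each push corresponds to one relaxation, and stale entries are skipped on extraction.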

Theorem 2. Algorithm 4 computes the controlled value function for all states in S in O(|S| log|S| + |R|) time.

Proof: (sketch) The proof follows that of Dijkstra's algorithm (see Ch. 24.3 in [24]), with a slight modification to account for non-determinism. Let V*(s) denote the optimal cost of reaching set B from state s ∈ S and V(s) denote the current upper bound. We show that V(u) = V*(u) whenever a state u is added to S − Q, i.e., u ← EXTRACT-MIN(Q). This is trivially true initially. By contradiction, assume state u is the first state added to S − Q with V(u) > V*(u). Thus, V*(u) < ∞ and there exists a policy to reach B. Consider a policy from u that reaches a state y ∈ Q that reaches the subset X ⊂ S − Q in a single action, i.e., there exists a ∈ A(y) with R(y, a) ⊆ X. Such a state exists since B is reachable from u. Then, V(y) = V*(y) since all states in X have optimal costs. This, together with the non-negativity of the edge weights, implies that V(y) = V*(y) ≤ V*(u) ≤ V(u). However, V(u) ≤ V(y), which is the contradiction. The algorithm runs in O(|S| log|S| + |R|) time since EXTRACT-MIN is called at most |S| times and DECREASE-KEY is called at most |R| times. □

ACKNOWLEDGEMENTS

This work was supported by an NDSEG fellowship, the Boeing Corporation, and AFOSR award FA9550-12-1-0302.

REFERENCES

[1] R. Alur, T. A. Henzinger, G. Lafferriere, and G. J. Pappas, "Discrete abstractions of hybrid systems," Proc. IEEE, vol. 88, no. 7, pp. 971–984, 2000.

[2] C. Belta and L. C. G. J. M. Habets, "Controlling a class of nonlinear systems on rectangles," IEEE Trans. on Automatic Control, vol. 51, pp. 1749–1759, 2006.

[3] A. Bhatia, M. R. Maly, L. E. Kavraki, and M. Y. Vardi, "Motion planning with complex goals," IEEE Robotics and Automation Magazine, vol. 18, pp. 55–64, 2011.

[4] L. Habets, P. J. Collins, and J. H. van Schuppen, "Reachability and control synthesis for piecewise-affine hybrid systems on simplices," IEEE Trans. on Automatic Control, vol. 51, pp. 938–948, 2006.

[5] S. Karaman and E. Frazzoli, "Sampling-based motion planning with deterministic µ-calculus specifications," in Proc. of IEEE Conf. on Decision and Control, 2009.

[6] M. Kloetzer and C. Belta, "A fully automated framework for control of linear systems from temporal logic specifications," IEEE Trans. on Automatic Control, vol. 53, no. 1, pp. 287–297, 2008.

[7] T. Wongpiromsarn, U. Topcu, and R. M. Murray, "Receding horizon temporal logic planning," IEEE Trans. on Automatic Control, 2012.

[8] B. Yordanov, J. Tumova, I. Cerna, J. Barnat, and C. Belta, "Temporal logic control of discrete-time piecewise affine systems," IEEE Trans. on Automatic Control, vol. 57, pp. 1491–1504, 2012.

[9] A. Pnueli and R. Rosner, "On the synthesis of a reactive module," in Proc. Symp. on Princp. of Prog. Lang., 1989, pp. 179–190.

[10] R. Alur and S. La Torre, "Deterministic generators and games for LTL fragments," ACM Trans. Comput. Logic, vol. 5, no. 1, pp. 1–25, 2004.

[11] R. Bloem, B. Jobstmann, N. Piterman, A. Pnueli, and Y. Sa'ar, "Synthesis of Reactive(1) designs," Journal of Computer and System Sciences, vol. 78, pp. 911–938, 2012.

[12] Y. Chen, J. Tumova, and C. Belta, "LTL robot motion control based on automata learning of environmental dynamics," in IEEE Int. Conf. on Robotics and Automation, 2012.

[13] R. Ehlers, "Generalized Rabin(1) synthesis with applications to robust system synthesis," in NASA Formal Methods. Springer, 2011.

[14] O. Maler, A. Pnueli, and J. Sifakis, "On the synthesis of discrete controllers for timed systems," in STACS 95. Springer, 1995, vol. 900, pp. 229–242.

[15] E. M. Wolff, U. Topcu, and R. M. Murray, "Efficient reactive controller synthesis for a fragment of linear temporal logic," in Proc. Int. Conf. on Robotics and Automation, 2013.

[16] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, "Temporal logic-based reactive mission and motion planning," IEEE Trans. on Robotics, vol. 25, pp. 1370–1381, 2009.

[17] S. L. Smith, J. Tumova, C. Belta, and D. Rus, "Optimal path planning for surveillance with temporal-logic constraints," The International Journal of Robotics Research, vol. 30, pp. 1695–1708, 2011.

[18] E. M. Wolff, U. Topcu, and R. M. Murray, "Optimal control with weighted average costs and temporal logic specifications," in Proc. of Robotics: Science and Systems, 2012.

[19] K. Chatterjee, T. A. Henzinger, and M. Jurdzinski, "Mean-payoff parity games," in Annual Symposium on Logic in Computer Science (LICS), 2005.

[20] G. Jing and H. Kress-Gazit, "Improving the continuous execution of reactive LTL-based controllers," in Proc. of Int. Conf. on Robotics and Automation, 2013.

[21] C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008.

[22] E. Gradel, W. Thomas, and T. Wilke, Eds., Automata, Logics, and Infinite Games: A Guide to Current Research. Springer-Verlag New York, Inc., 2002.

[23] D. P. Bertsekas, Dynamic Programming and Optimal Control (Vol. I and II). Athena Scientific, 2001.

[24] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. MIT Press, 2001.

[25] G. Holzmann, Spin Model Checker: The Primer and Reference Manual. Addison-Wesley Professional, 2003.

[26] C. E. Noon and J. C. Bean, "An efficient transformation of the generalized traveling salesman problem," INFOR, vol. 31, pp. 39–44, 1993.

[27] P. Gastin and D. Oddoux, "Fast LTL to Büchi automata translation," in Proc. of the 13th Int. Conf. on Computer Aided Verification, 2001.