
International Journal of Fuzzy Systems manuscript No. (will be inserted by the editor)

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games

Mostafa D. Awheda · Howard M. Schwartz

Received: date / Accepted: date

Abstract In this work, we propose a new fuzzy reinforcement learning algorithm for differential games that have continuous state and action spaces. The proposed algorithm uses function approximation systems whose parameters are updated differently from the updating mechanisms used in the algorithms proposed in the literature. Unlike the algorithms presented in the literature, which use the direct algorithms to update the parameters of their function approximation systems, the proposed algorithm uses the residual gradient value iteration algorithm to tune the input and output parameters of its function approximation systems. It has been shown in the literature that the direct algorithms may not converge to an answer in some cases, while the residual gradient algorithms are always guaranteed to converge to a local minimum. The proposed algorithm is called the residual gradient fuzzy actor critic learning (RGFACL) algorithm. The proposed algorithm is used to learn three different pursuit-evasion differential games. Simulation results show that the proposed RGFACL algorithm outperforms the fuzzy actor critic learning (FACL) and the Q-learning fuzzy inference system (QLFIS) algorithms in terms of convergence and speed of learning.

Keywords Fuzzy control · Reinforcement learning · Pursuit-evasion differential games · Residual gradient algorithms

1 Introduction

Fuzzy logic controllers (FLCs) have recently attracted considerable attention as intelligent controllers, and are used in engineering applications [1,2]. FLCs have been widely used to deal with plants that are nonlinear and ill-defined [3–5]. They can also deal with plants with high uncertainty in the knowledge about their environments [6,7]. However, one of the problems in adaptive fuzzy control is which mechanism should be used to tune the fuzzy controller. Several learning approaches have been developed to tune the FLCs so that the desired performance is achieved. Some of these approaches design the fuzzy systems from input-output data by using different mechanisms such as a table look-up approach, a genetic algorithm approach, a gradient-descent training approach, a recursive least squares approach, and clustering [8,33]. This type of learning is called supervised learning, where a training data set is used to learn from. However, in this type of learning, the performance of the learned FLC will depend on the performance of the expert. In addition, the training data set used in supervised learning may be hard or expensive to obtain. In such cases, we think of alternative techniques where neither a priori knowledge nor a training data set is required. In this case, reward-based learning techniques, such as reinforcement learning algorithms, can be used. The main advantage of such reinforcement learning

Mostafa D. Awheda
Department of Systems and Computer Engineering
Carleton University
1125 Colonel By Drive, Ottawa, ON, K1S 5B6, Canada
E-mail: [email protected]

Howard M. Schwartz
Department of Systems and Computer Engineering
Carleton University
1125 Colonel By Drive, Ottawa, ON, K1S 5B6, Canada
E-mail: [email protected]


algorithms is that they need neither a known model of the process nor an expert to learn from [44].

Reinforcement learning (RL) is a learning technique that maps situations to actions so that an agent learns from the experience of interacting with its environment [22,24]. Reinforcement learning has attracted attention and has been used in intelligent robot control systems [11–13,23,25,27–29,39]. Reinforcement learning has also been used for solving nonlinear optimal control problems [14–20]. Without knowing which actions to take, the reinforcement learning agent exploits and explores actions to discover which action gives the maximum reward in the long run. Different from supervised learning, which is learning from input-output data provided by an expert, reinforcement learning is adequate for learning from interaction by using very simple evaluative or critic information instead of instructive information [22]. Most traditional reinforcement learning algorithms represent the state/state-action value function as a look-up table for each state/state-action pair [22]. Despite the theoretical foundations of these algorithms and their effectiveness in many applications, these reinforcement learning approaches cannot be applied to real applications with large state and action spaces [22,23,30,36–38]. This is because of the phenomenon known as the curse of dimensionality, caused by the exponential growth in the number of states as the number of state variables increases [22]. Moreover, traditional reinforcement learning approaches cannot be applied to differential games, where the states and actions are continuous. One of the possible solutions to the problem of continuous domains is to discretize the state and action spaces. However, this discretization may also lead to the curse of dimensionality that appears when discretizing large continuous states and/or actions [22,23,39]. To overcome these issues that lead to the curse of dimensionality, one may use a function approximation system to represent the large discrete and/or continuous spaces [8,9,22,35]. Different types of function approximation systems are used in the literature, and the gradient-descent-based function approximation systems are among the most widely used ones [22]. In addition, the gradient-descent-based function approximation systems are well suited to online reinforcement learning [22].

1.1 Related work

Several fuzzy reinforcement learning approaches have been proposed in the literature to deal with differential games (that have continuous state and action spaces) by using gradient-descent-based function approximation systems [9,10,12,32,39,40]. Some of these approaches only tune the output parameters of their function approximation systems [9,12,32], where the input parameters of their function approximation systems are kept fixed. On the other hand, the other approaches tune both the input and output parameters of their function approximation systems [10,39,40].

In [9], the author proposed a fuzzy actor critic learning (FACL) algorithm that uses gradient-descent-based fuzzy inference systems (FISs) as function approximation systems. The FACL algorithm only tunes the output parameters (the consequent parameters) of its FISs (the actor and the critic) by using the temporal difference (TD) error calculated based on the state value function. However, the input parameters (the premise parameters) of its FISs are kept fixed during the learning process. In [32], the authors proposed a fuzzy reinforcement learning approach that uses gradient-descent-based FISs as function approximation systems. Their approach only tunes the output parameters of the function approximation systems based on the TD error calculated from the state value functions of the two successive states in the state transition. In [12], the authors proposed the generalized probabilistic fuzzy reinforcement learning (GPFRL) algorithm, which is a modified version of the actor-critic learning architecture. The GPFRL algorithm uses gradient-descent-based FISs as function approximation systems. The GPFRL only tunes the output parameters of its function approximation systems based on the TD error of the critic and the performance function of the actor. In [39], the authors proposed a fuzzy learning approach that uses a time delay neural network (TDNN) and a FIS as gradient-descent-based function approximation systems. Their approach tunes the input and output parameters of the function approximation systems based on the TD error calculated from the state-action value function. In [40], the authors proposed a fuzzy actor-critic reinforcement learning network (FACRLN) algorithm based on a fuzzy radial basis function (FRBF) neural network. The FACRLN uses the FRBF neural networks as function approximation systems. The FACRLN algorithm tunes the input and output parameters of the function approximation systems based on the TD error calculated by the temporal difference of the value function between the two successive states in the state transition. In [10], the authors proposed the QLFIS algorithm that uses gradient-descent-based FISs as function approximation systems. The QLFIS algorithm tunes the input and output parameters of the function approximation


systems based on the TD error of the state-action value functions of the two successive states in the state transition. However, all these fuzzy reinforcement learning algorithms [9,10,12,32,39,40] use the so-called direct algorithms described in [21] to tune the parameters of their function approximation systems. Although the direct algorithms have been widely used as the tuning mechanism for the parameters of function approximation systems, they may lead the function approximation systems to unpredictable results and, in some cases, to divergence [21,45–48].

1.2 Main contribution

In this work, we propose a new fuzzy reinforcement learning algorithm for differential games, where the states and actions are continuous. The proposed algorithm uses function approximation systems whose parameters are updated differently from the updating mechanisms used in the algorithms proposed in [9,10,12,32,39,40]. Unlike the algorithms presented in the literature, which use the direct algorithms to tune their function approximation systems, the proposed algorithm uses the residual gradient value iteration algorithm to tune the input and output parameters of its function approximation systems. The direct algorithms tune the parameters of the function approximation system based on the partial derivatives of the value function at the current state st, whereas the residual gradient algorithms tune the parameters of the function approximation system based on the partial derivatives of the value function at both the current state st and the next state st+1. The direct and residual gradient algorithms are presented in Section (2.1). The direct algorithms may not converge to an answer in some cases, while the residual gradient algorithms are always guaranteed to converge to a local minimum [21,45–48]. Furthermore, we take this opportunity to also present the complete derivation of the partial derivatives that are needed to compute both the direct and residual gradient algorithms. To the best of our knowledge, this derivation of the partial derivatives has never been explicitly shown. Table (1) shows a brief comparison among some fuzzy reinforcement learning algorithms and the proposed algorithm.

Table 1 Comparison among some fuzzy reinforcement learning algorithms and the proposed algorithm

Algorithm                     Type of gradient-     Input parameters of       Output parameters of
                              descent method        function approximation    function approximation
                                                    systems                   systems
FACL [9]                      Direct                Fixed                     Tuned
Algorithm proposed in [32]    Direct                Fixed                     Tuned
GPFRL [12]                    Direct                Tuned                     Tuned
Algorithm proposed in [39]    Direct                Tuned                     Tuned
FACRLN [40]                   Direct                Tuned                     Tuned
QLFIS [10]                    Direct                Tuned                     Tuned
The proposed algorithm        Residual              Tuned                     Tuned

We investigate the proposed algorithm on different pursuit-evasion differential games because this kind of game is considered a general problem for several other problems such as wall following, obstacle avoidance, and path planning. Moreover, pursuit-evasion games are useful for many real-world applications such as search and rescue, locating and capturing hostile intruders, localizing and neutralizing environmental threats, and surveillance and tracking [10,23]. The proposed algorithm is used to learn three different pursuit-evasion differential games. In the first game, the evader is following a simple control strategy, whereas the pursuer is learning its control strategy to capture the evader in minimum time. In the second game, it is also only the pursuer that is learning. However, the evader is following an intelligent control strategy that exploits the advantage of the maneuverability of the evader. In the third game, we make both the pursuer and the evader learn their control strategies. Therefore, the complexity of the system will increase, as learning in a multi-robot system is considered a "moving target" problem [26]. In multi-robot system learning, each robot will try to learn its control strategy by interacting with the other robot, which is also learning its control strategy at the same time. Thus, the best-response policy of each learning robot may keep changing during learning in a multi-robot system. The proposed algorithm outperforms the FACL and QLFIS algorithms proposed in [9] and [10] in terms of convergence and speed of learning when they all are used to learn the pursuit-evasion differential games considered in this work.


This paper is organized as follows: Preliminary concepts and notations are reviewed in Section 2. The QLFIS algorithm proposed in [10] is presented in Section 3. Section 4 presents the pursuit-evasion game. The proposed RGFACL algorithm is introduced in Section 5. The simulation and results are presented in Section 6.

2 Preliminary Concepts and Notations

The direct and the residual gradient algorithms and the fuzzy inference systems are presented in this section.

2.1 Direct and Residual Gradient Algorithms

Traditional reinforcement learning algorithms such as the Q-learning algorithm and the value iteration algorithm represent the value functions as lookup tables and are guaranteed to converge to optimal values [22]. These algorithms have to be combined with function-approximation systems when they are applied to real applications with large state and action spaces or with continuous state and action spaces. The direct and the residual gradient algorithms described in [21] can be used when the traditional Q-learning and value iteration algorithms are combined with function-approximation systems. In [21], the author illustrated with some examples that the direct algorithms may converge fast but may become unstable in some cases. In addition, the author showed that the residual gradient algorithms converge in those examples and are always guaranteed to converge to a local minimum. Another study presented in [47] shows that the direct algorithms are faster than the residual gradient algorithms only when the value function is represented in a tabular form. However, when function approximation systems are involved, the direct algorithms are not always faster than the residual gradient algorithms. In other words, when function approximation systems are involved, the residual gradient algorithms are considered the superior algorithms, as they are always guaranteed to converge, whereas the direct algorithms may not converge to an answer in some cases. Other studies [45,46,48] confirm the results presented in [21] in terms of the superiority of the residual gradient algorithms, as they are always shown to converge. To illustrate the difference between the direct algorithms and the residual gradient algorithms, we will give two examples: the direct value iteration algorithm (an example of the direct algorithms) and the residual gradient value iteration algorithm (an example of the residual gradient algorithms).

2.1.1 The direct value iteration algorithm

For Markov decision processes (MDPs), the value function Vt(st) at the state st for approximation of the reinforcement rewards can be defined as follows [32,34],

$V_t(s_t) = E\left[\sum_{i=t}^{\infty} \gamma^{\,i-t} r_i\right]$    (1)

where γ ∈ [0, 1) is a discount factor and ri is the immediate external reward that the learning agent gets from the learning environment.

The recursive form of Eq. (1) can be defined as follows [32,34],

$V_t(s_t) = r_t + \gamma V_t(s_{t+1})$    (2)

The temporal difference residual error, ∆t, and the mean square error, Et, between the two sides of Eq. (2) are given as follows,

$\Delta_t = \left[r_t + \gamma V_t(s_{t+1})\right] - V_t(s_t)$    (3)

$E_t = \frac{1}{2}\Delta_t^2$    (4)

For a deterministic MDP, after a transition from one state to another, the direct value iteration algorithm updates the weights of the function-approximation system as follows [21],


$\psi_{t+1} = \psi_t - \alpha\frac{\partial E_t}{\partial \psi_t}$    (5)

where ψt represents the input and output parameters of the function-approximation system that need to be tuned, α is a learning rate, and the term ∂Et/∂ψt is defined as follows,

$\frac{\partial E_t}{\partial \psi_t} = \Delta_t\frac{\partial \Delta_t}{\partial \psi_t} = -\left[r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\right]\cdot\frac{\partial}{\partial \psi_t}V_t(s_t)$    (6)

Thus,

$\psi_{t+1} = \psi_t + \alpha\left[r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\right]\cdot\frac{\partial}{\partial \psi_t}V_t(s_t)$    (7)

The direct value iteration algorithm updates the derivative ∂Et/∂ψt as in Eq. (6). This equation shows that the direct value iteration algorithm treats the value function Vt(st+1) in the temporal error ∆t as a constant. Therefore, the derivative ∂Vt(st+1)/∂ψt will be zero. However, the value function Vt(st+1) is not a constant; it is a function of the input and output parameters of the function-approximation system, ψt. Therefore, the derivative ∂Vt(st+1)/∂ψt should not be assigned to zero during the tuning of the input and output parameters of the function-approximation system.
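To make the contrast concrete, the following minimal sketch (ours, not from the paper) applies the direct update of Eq. (7) to a simple linear-in-features value approximator; the feature map and the numeric values are illustrative assumptions only.

```python
import numpy as np

def features(s):
    # Illustrative feature map for a scalar state; any differentiable
    # approximator V(s) = psi . phi(s) serves for this sketch.
    return np.array([1.0, s, s ** 2])

def direct_update(psi, s_t, r_t, s_next, alpha=0.1, gamma=0.9):
    """One direct value-iteration step, Eq. (7): V(s_{t+1}) is treated as a
    constant, so only the gradient of V(s_t) appears in the update."""
    v_t = psi @ features(s_t)
    v_next = psi @ features(s_next)              # no gradient taken through this term
    delta = r_t + gamma * v_next - v_t           # temporal-difference residual, Eq. (3)
    return psi + alpha * delta * features(s_t)   # gradient of V(s_t) only

# toy transition
psi = np.zeros(3)
psi = direct_update(psi, s_t=0.5, r_t=1.0, s_next=0.7)
```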

2.1.2 The residual gradient value iteration algorithm

The residual gradient value iteration algorithm updates the weights of the function-approximation system as follows [21],

$\psi_{t+1} = \psi_t - \alpha\frac{\partial E_t}{\partial \psi_t}$    (8)

where,

$\frac{\partial E_t}{\partial \psi_t} = \Delta_t\frac{\partial \Delta_t}{\partial \psi_t} = \left[r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\right]\cdot\left[\frac{\partial}{\partial \psi_t}\gamma V_t(s_{t+1}) - \frac{\partial}{\partial \psi_t}V_t(s_t)\right]$    (9)

Thus,

$\psi_{t+1} = \psi_t - \alpha\left[r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\right]\cdot\left[\frac{\partial}{\partial \psi_t}\gamma V_t(s_{t+1}) - \frac{\partial}{\partial \psi_t}V_t(s_t)\right]$    (10)

The residual gradient value iteration algorithm updates the derivative ∂Et/∂ψt as in Eq. (9). In this equation, the residual gradient value iteration algorithm treats the value function Vt(st+1) in the temporal error ∆t as a function of the input and output parameters of the function approximation system, ψt. Therefore, the derivative ∂Vt(st+1)/∂ψt will not be assigned to zero during the tuning of the input and output parameters of the function approximation system. This makes the residual gradient value iteration algorithm perform better than the direct value iteration algorithm in terms of convergence [21].

2.2 Fuzzy Inference Systems

Fuzzy inference systems may be used as function approximation systems so that reinforcement learning approaches can be applied to real systems with continuous domains [8,9,22]. Among the most widely used FISs are the Mamdani FIS proposed in [41] and the Takagi-Sugeno-Kang (TSK) FIS proposed in [42,43]. The FISs used in this work are zero-order TSK FISs with constant consequents. Each FIS consists of L rules. The inputs of each rule are n fuzzy variables, whereas the consequent of each rule is a constant number. Each rule l (l = 1, ..., L) has the following form,

$R^l:\ \text{IF } s_1 \text{ is } F_1^l, \ldots, \text{ and } s_n \text{ is } F_n^l \text{ THEN } z_l = k_l$    (11)

where si, (i = 1, ..., n), is the ith input state variable of the fuzzy system, n is the number of input state variables, and F_i^l is the linguistic value of the input si at the rule l. Each input si has h membership functions. The variable zl represents the output variable of the rule l, and kl is a constant that describes the consequent parameter of the rule l. In this work, Gaussian membership functions are used for the inputs, and each membership function (MF) is defined as follows,


$\mu_{F_i^l}(s_i) = \exp\left(-\left(\frac{s_i - m}{\sigma}\right)^2\right)$    (12)

where σ and m are the standard deviation and the mean, respectively.

In each FIS used in this work, the total number of the standard deviations of the membership functions of its inputs is defined as H, where H = n × h. In addition, the total number of the means of the membership functions of its inputs is H. Thus, for each FIS used in this work, the standard deviations and the means of the membership functions of the inputs are defined, respectively, as σj and mj, where j = 1, ..., H. We define the set of the parameters of the membership functions of each input, Ω(si), as follows,

$\Omega(s_1) = \{(\sigma_1, m_1), (\sigma_2, m_2), \ldots, (\sigma_h, m_h)\}$
$\Omega(s_2) = \{(\sigma_{h+1}, m_{h+1}), (\sigma_{h+2}, m_{h+2}), \ldots, (\sigma_{2h}, m_{2h})\}$
$\vdots$
$\Omega(s_n) = \{(\sigma_{(n-1)h+1}, m_{(n-1)h+1}), (\sigma_{(n-1)h+2}, m_{(n-1)h+2}), \ldots, (\sigma_H, m_H)\}$    (13)

The output of the fuzzy system is given by the following equation when we use the product inference engine with singleton fuzzifier and center-average defuzzifier [8].

$Z(s_t) = \frac{\sum_{l=1}^{L}\left[\left(\prod_{i=1}^{n}\mu_{F_i^l}(s_i)\right)k_l\right]}{\sum_{l=1}^{L}\left(\prod_{i=1}^{n}\mu_{F_i^l}(s_i)\right)} = \sum_{l=1}^{L}\Phi_l(s_t)\,k_l$    (14)

where st = (s1, ..., sn) is the state vector, µ_{F_i^l} describes the membership value of the input state variable si in the rule l, and Φl(st) is the normalized activation degree (normalized firing strength) of the rule l at the state st and is defined as follows:

$\Phi_l(s_t) = \frac{\prod_{i=1}^{n}\mu_{F_i^l}(s_i)}{\sum_{l=1}^{L}\left(\prod_{i=1}^{n}\mu_{F_i^l}(s_i)\right)} = \frac{\omega_l(s_t)}{\sum_{l=1}^{L}\omega_l(s_t)}$    (15)

where ωl(st) is the firing strength of the rule l at the state st, and it is defined as follows,

$\omega_l(s_t) = \prod_{i=1}^{n}\mu_{F_i^l}(s_i)$    (16)

We define the set of the parameters of each firing strength of each rule in each FIS, Ω(ωl), as follows,

$\Omega(\omega_1) = \{(\sigma_1, m_1), (\sigma_{h+1}, m_{h+1}), \ldots, (\sigma_{(n-1)h+1}, m_{(n-1)h+1})\}$
$\Omega(\omega_2) = \{(\sigma_1, m_1), (\sigma_{h+1}, m_{h+1}), \ldots, (\sigma_{(n-1)h+2}, m_{(n-1)h+2})\}$
$\vdots$
$\Omega(\omega_h) = \{(\sigma_1, m_1), (\sigma_{h+1}, m_{h+1}), \ldots, (\sigma_H, m_H)\}$
$\Omega(\omega_{h+1}) = \{(\sigma_1, m_1), (\sigma_{h+2}, m_{h+2}), \ldots, (\sigma_{(n-1)h+1}, m_{(n-1)h+1})\}$
$\vdots$
$\Omega(\omega_L) = \{(\sigma_h, m_h), (\sigma_{2h}, m_{2h}), \ldots, (\sigma_H, m_H)\}$    (17)
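As a concrete illustration of Eqs. (12) and (14)–(16), the short sketch below (ours, not from the paper) evaluates a zero-order TSK FIS whose rule base is the full grid of per-input MFs implied by Eq. (17); the dimensions and parameter values are placeholders of our own.

```python
import numpy as np
from itertools import product

def fis_output(s, means, sigmas, k):
    """Zero-order TSK FIS output Z(s_t) of Eq. (14).
    s: state vector of length n.
    means, sigmas: (n, h) arrays of Gaussian MF parameters per input, Eq. (12).
    k: consequents, one per rule; rules enumerate all MF combinations, Eq. (17)."""
    n, h = means.shape
    mu = np.exp(-((s[:, None] - means) / sigmas) ** 2)          # Eq. (12)
    rules = list(product(range(h), repeat=n))                   # L = h**n rules
    w = np.array([np.prod([mu[i, idx[i]] for i in range(n)])    # Eq. (16)
                  for idx in rules])
    phi = w / w.sum()                                           # Eq. (15)
    return phi @ k                                              # Eq. (14)

# placeholder example: n = 2 inputs, h = 3 MFs each, L = 9 rules
means = np.array([[-1.0, 0.0, 1.0],
                  [-1.0, 0.0, 1.0]])
sigmas = np.ones((2, 3))
k = np.linspace(-1.0, 1.0, 9)
print(fis_output(np.array([0.2, -0.3]), means, sigmas, k))
```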


Fig. 1 The QLFIS technique [10]

3 The Q-learning Fuzzy Inference System Algorithm

The QLFIS algorithm is a modified version of the algorithms presented in [32] and [39]. The QLFIS algorithm combines the Q-learning algorithm with a function approximation system, and is applied directly to games with continuous state and action spaces. The structure of the QLFIS algorithm is shown in Fig. 1. The QLFIS algorithm uses a function approximation system (FIS) to estimate the state-action value function Q(st, u). The QLFIS algorithm also uses a function approximation system (FLC) to generate the continuous action. The QLFIS algorithm tunes the input and output parameters of its function approximation systems [10], and it uses the so-called direct algorithms described in [21] as the mechanism to tune them.

3.1 Update rules for the function approximation system (FIS)

The input parameters of the FIS are the parameters of the Gaussian membership functions of its inputs: the standard deviations σj and the means mj. On the other hand, the output parameters of the FIS are the consequent (or conclusion) parts of the fuzzy rules, kl. To simplify notation, we refer to the input and output parameters of the FIS as ψQ. Thus, the update rules of the input and output parameters for the FIS of the QLFIS algorithm are given as follows [10],

$\psi^Q_{t+1} = \psi^Q_t + \rho\,\Delta_t\,\frac{\partial Q_t(s_t, u_c)}{\partial \psi^Q_t}$    (18)

where ρ is a learning rate for the FIS parameters, and ∆t is the temporal difference error that is defined as follows,

$\Delta_t = r_t + \gamma Q_t(s_{t+1}, u_c) - Q_t(s_t, u_c)$    (19)

where rt is the reward received at time t, γ is a discount factor, and Qt(st, uc) is the estimated state-action value function at the state st.

The term ∂Qt(st, uc)/∂ψQt in Eq. (18) is computed as in [10] as follows,

$\frac{\partial Q_t(s_t, u_c)}{\partial k_l} = \Phi_l(s_t)$    (20)

$\frac{\partial Q_t(s_t, u_c)}{\partial \sigma_j} = \frac{k_l - Q_t(s_t, u_c)}{\sum_l \omega_l(s_t)}\,\omega_l(s_t)\,\frac{2(s_i - m_j)^2}{(\sigma_j)^3}$    (21)

$\frac{\partial Q_t(s_t, u_c)}{\partial m_j} = \frac{k_l - Q_t(s_t, u_c)}{\sum_l \omega_l(s_t)}\,\omega_l(s_t)\,\frac{2(s_i - m_j)}{(\sigma_j)^2}$    (22)
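A sketch of how one direct-gradient step of Eqs. (18)–(22) could be coded for a single transition, reusing the illustrative grid-of-rules FIS layout from Section 2.2; the derivative with respect to each σj and mj is accumulated over the rules that actually use that membership function, in line with the ξj,l accumulation made explicit later in Section 5. This is our reading of the update, not the authors' implementation.

```python
import numpy as np
from itertools import product

n, h = 2, 3                                   # two inputs, three Gaussian MFs each
rules = list(product(range(h), repeat=n))     # L = 9 rules over all MF combinations

def fis_value(s, means, sigmas, k):
    """Q estimate of Eq. (14) plus the firing strengths needed for the gradients."""
    mu = np.exp(-((s[:, None] - means) / sigmas) ** 2)           # Eq. (12)
    w = np.array([np.prod([mu[i, idx[i]] for i in range(n)]) for idx in rules])
    return (w @ k) / w.sum(), w

def direct_fis_step(s, td_error, means, sigmas, k, rho=0.05):
    """One direct update of Eq. (18); td_error is Delta_t from Eq. (19)."""
    q, w = fis_value(s, means, sigmas, k)
    phi = w / w.sum()
    d_sigma = np.zeros_like(sigmas)
    d_mean = np.zeros_like(means)
    for i in range(n):
        for j in range(h):
            users = [l for l, idx in enumerate(rules) if idx[i] == j]
            g = sum((k[l] - q) / w.sum() * w[l] for l in users)
            d_sigma[i, j] = g * 2 * (s[i] - means[i, j]) ** 2 / sigmas[i, j] ** 3   # Eq. (21)
            d_mean[i, j] = g * 2 * (s[i] - means[i, j]) / sigmas[i, j] ** 2         # Eq. (22)
    return (means + rho * td_error * d_mean,
            sigmas + rho * td_error * d_sigma,
            k + rho * td_error * phi)                                               # Eq. (20)

means = np.array([[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]])
sigmas = np.ones((2, 3))
k = np.zeros(9)
means, sigmas, k = direct_fis_step(np.array([0.2, -0.3]), td_error=0.5,
                                   means=means, sigmas=sigmas, k=k)
```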


Fig. 2 Pursuit-evasion model

3.2 Update rules for the function approximation system (FLC)

As with the FIS, the input parameters of the FLC are the parameters of the Gaussian membership functions of its inputs: the standard deviations σj and the means mj. On the other hand, the output parameters of the FLC are the consequent (or conclusion) parts of the fuzzy rules, kl. We refer to the input and output parameters of the FLC as ψu. Thus, the update rules of the input and output parameters for the FLC of the QLFIS algorithm are given as follows [10],

$\psi^u_{t+1} = \psi^u_t + \tau\,\Delta_t\,\frac{\partial u_t}{\partial \psi^u_t}\left[\frac{u_c - u_t}{\sigma_n}\right]$    (23)

where τ is a learning rate for the FLC parameters, and uc is the output of the FLC with a random Gaussian noise added. The term ∂ut/∂ψut in Eq. (23) can be calculated by replacing Qt(st, uc) with ut in Eq. (20), Eq. (21) and Eq. (22) as follows,

$\frac{\partial u_t}{\partial k_l} = \Phi_l(s_t)$    (24)

$\frac{\partial u_t}{\partial \sigma_j} = \frac{k_l - u_t}{\sum_l \omega_l(s_t)}\,\omega_l(s_t)\,\frac{2(s_i - m_j)^2}{(\sigma_j)^3}$    (25)

$\frac{\partial u_t}{\partial m_j} = \frac{k_l - u_t}{\sum_l \omega_l(s_t)}\,\omega_l(s_t)\,\frac{2(s_i - m_j)}{(\sigma_j)^2}$    (26)

4 The Pursuit-Evasion Game

The pursuit-evasion game is defined as a differential game [49]. In this game, the pursuer's objective is to capture the evader, whereas the evader's objective is to escape from the pursuer or at least prolong the capture time. Fig. (2) shows the model of the pursuit-evasion differential game. The equations of motion of the pursuer and the evader robots can be described by the following equations [50,51],

$\dot{x}_\kappa = V_\kappa \cos(\theta_\kappa)$
$\dot{y}_\kappa = V_\kappa \sin(\theta_\kappa)$
$\dot{\theta}_\kappa = \frac{V_\kappa}{L_\kappa}\tan(u_\kappa)$    (27)

where κ represents both the pursuer "p" and the evader "e", Vκ represents robot κ's speed, θκ is the orientation of robot κ, (xκ, yκ) is the position of robot κ, Lκ represents the wheelbase of robot κ, and uκ represents robot κ's steering angle, where uκ ∈ [−uκmax, uκmax].

In this work, we assume that the pursuer is faster than the evader by making Vp > Ve. It is also assumed that the evader is more maneuverable than the pursuer by making uemax > upmax. A simple classical


control strategy that can be used to define the control strategies of the pursuer and the evader in a pursuit-evasion game can be given as follows,

$u_\kappa = \begin{cases} -u_{\kappa_{max}} & : \delta_\kappa < -u_{\kappa_{max}} \\ \delta_\kappa & : -u_{\kappa_{max}} \le \delta_\kappa \le u_{\kappa_{max}} \\ u_{\kappa_{max}} & : \delta_\kappa > u_{\kappa_{max}} \end{cases}$    (28)

and,

$\delta_\kappa = \tan^{-1}\left(\frac{y_e - y_p}{x_e - x_p}\right) - \theta_\kappa$    (29)

where δκ represents the angle difference between the direction of robot κ and the line-of-sight (LoS) to the other robot.

To capture the evader in a pursuit-evasion game when the pursuer uses the simple control strategy described by Eq. (28) and Eq. (29), the angle difference δp has to be driven to zero by the pursuer. Thus, the control strategy of the pursuer in this case is to drive this angle difference to zero. On the other hand, the control strategy of the evader is to escape from the pursuer and keep the distance between the evader and the pursuer as large as possible. The evader can do so by following the intelligent control strategy described by the following two rules [10,33,51,52]:

1. If the distance between the evader and the pursuer is greater than a specific distance d, the control strategy of the evader is defined as follows,

$u_e = \tan^{-1}\left(\frac{y_e - y_p}{x_e - x_p}\right) - \theta_e$    (30)

2. If the distance between the evader and the pursuer is less than or equal to the distance d, the evader exploits its higher maneuverability, and the control strategy for the evader in this case is given as follows,

$u_e = (\theta_p + \pi) - \theta_e$    (31)

The distance d is defined as follows,

$d = \frac{L_p}{\tan(u_{p_{max}})}$    (32)

The pursuer succeeds in capturing the evader if the distance between them is less than the capture radius dc. The distance between the pursuer and the evader is denoted by d and is given as follows,

$d = \sqrt{(x_e - x_p)^2 + (y_e - y_p)^2}$    (33)
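The pieces above are enough to simulate one pursuit-evasion episode. The sketch below is ours: it Euler-integrates the kinematics of Eq. (27) with the sampling time and robot parameters listed later in Section 6, steers the pursuer with the classical strategy of Eqs. (28)–(29) and the evader with the simple strategy of Eq. (30), and checks capture with Eq. (33). The angle-wrapping helper is an implementation detail not spelled out in the paper.

```python
import numpy as np

T = 0.05              # sampling time (Section 6)
CAPTURE_RADIUS = 0.1  # d_c (Section 6)

def wrap(angle):
    # wrap to [-pi, pi] so the steering command saturates sensibly
    return (angle + np.pi) % (2 * np.pi) - np.pi

def step(state, u, v, wheelbase):
    """One Euler step of the kinematics of Eq. (27); state = (x, y, theta)."""
    x, y, theta = state
    return np.array([x + T * v * np.cos(theta),
                     y + T * v * np.sin(theta),
                     theta + T * (v / wheelbase) * np.tan(u)])

def classical_pursuer(pursuer, evader, u_max=0.5):
    """Classical strategy of Eqs. (28)-(29): steer onto the line of sight."""
    delta = wrap(np.arctan2(evader[1] - pursuer[1], evader[0] - pursuer[0]) - pursuer[2])
    return np.clip(delta, -u_max, u_max)

def simple_evader(evader, pursuer, u_max=1.0):
    """Simple strategy of Eq. (30): run along the pursuer-to-evader line of sight."""
    delta = wrap(np.arctan2(evader[1] - pursuer[1], evader[0] - pursuer[0]) - evader[2])
    return np.clip(delta, -u_max, u_max)

pursuer = np.array([0.0, 0.0, 0.0])    # (x_p, y_p, theta_p)
evader = np.array([5.0, -10.0, 0.0])   # one of the initial positions used in Table 2
for _ in range(600):
    pursuer = step(pursuer, classical_pursuer(pursuer, evader), v=2.0, wheelbase=0.3)
    evader = step(evader, simple_evader(evader, pursuer), v=1.0, wheelbase=0.3)
    if np.hypot(*(evader[:2] - pursuer[:2])) < CAPTURE_RADIUS:   # Eq. (33)
        break
```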

5 The Proposed Algorithm

In this work, we propose a new fuzzy reinforcement learning algorithm for differential games that have continuous state and action spaces. The proposed algorithm uses FISs as function approximation systems: an actor (a fuzzy logic controller, FLC) and a critic. The critic is used to estimate the value functions Vt(st) and Vt(st+1) of the learning agent at the two states st and st+1, respectively. The values of Vt(st) and Vt(st+1) depend on the input and output parameters of the critic. Unlike the algorithms proposed in [9,10,12,32,39,40], which use the direct algorithms to tune the parameters of their function approximation systems, the proposed algorithm uses the residual gradient value iteration algorithm described in [21] to tune the input and output parameters of its function approximation systems. It has been shown in [21,45–48] that the direct algorithms may not converge to an answer in some cases, while the residual gradient algorithms are always guaranteed to converge. The proposed algorithm is called the RGFACL algorithm. The structure of the proposed RGFACL algorithm is shown in Fig. (3). The input parameters of the critic are the parameters of the MFs of its inputs, σj and mj (where j = 1, ..., H). The output parameters of the critic are the consequent parameters of its rules, kl (where l = 1, ..., L). To simplify notation, we refer to the input and output parameters of the critic as ψC. Similarly, the input parameters of the actor are the parameters of the MFs of its inputs, σj and mj, and the output parameters of the actor are the consequent parameters of its rules, kl. To simplify notation, we refer to the input and output parameters of the actor as ψA. The temporal difference residual error, ∆t, is


defined as follows,

$\Delta_t = r_t + \gamma V_t(s_{t+1}) - V_t(s_t)$    (34)

5.1 Adaptation rules for the critic

In this subsection, we derive the adaptation rules that the proposed algorithm uses to tune the input and output parameters of the critic. For the sake of completeness, we present the complete derivation of the partial derivatives that are needed by the proposed algorithm.

The mean squared error, E, of the temporal difference residual error, ∆t, is defined as follows,

$E = \frac{1}{2}\Delta_t^2$    (35)

The input and output parameters of the critic are updated based on the residual gradient method described in [21] as follows,

$\psi^C_{t+1} = \psi^C_t - \alpha\frac{\partial E}{\partial \psi^C_t}$    (36)

where ψCt represents the input and output parameters of the critic at time t, and α is a learning rate for the parameters of the critic.

The QLFIS algorithm proposed in [10] uses the so-called direct algorithms described in [21] to define the term ∂E/∂ψCt. On the other hand, the proposed algorithm defines the term ∂E/∂ψCt based on the residual gradient value iteration algorithm, which is also described in [21], as follows,

$\frac{\partial E}{\partial \psi^C_t} = \Delta_t\left[\gamma\frac{\partial V_t(s_{t+1})}{\partial \psi^C_t} - \frac{\partial V_t(s_t)}{\partial \psi^C_t}\right]$    (37)

The proposed algorithm treats the value function Vt(st+1) in the temporal difference residual error ∆t as a function of the input and output parameters of its function approximation system (the critic), ψCt. Unlike the QLFIS algorithm, the proposed algorithm will not assign a value of zero to the derivative ∂Vt(st+1)/∂ψCt during the tuning of the input and output parameters of the critic. This is because the value function Vt(st+1) is a function of the input and output parameters of the critic, ψCt, and its derivative ∂Vt(st+1)/∂ψCt should not be assigned to zero all the time.

From Eq. (37), Eq. (36) can be rewritten as follows,

$\psi^C_{t+1} = \psi^C_t - \alpha\left[r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\right]\cdot\left[\gamma\frac{\partial}{\partial \psi^C_t}V_t(s_{t+1}) - \frac{\partial}{\partial \psi^C_t}V_t(s_t)\right]$    (38)

The derivatives of the state value functions, Vt(st) and Vt(st+1), with respect to the output parameters of the critic, kl, are calculated from Eq. (14) as follows,

$\frac{\partial V_t(s_t)}{\partial k_l} = \Phi_l(s_t)$    (39)

where l = 1, ..., L.

Similarly,

$\frac{\partial V_t(s_{t+1})}{\partial k_l} = \Phi_l(s_{t+1})$    (40)

where Φl(st) and Φl(st+1) are calculated by using Eq. (15) at the states st and st+1, respectively.

We use the chain rule to calculate the derivatives ∂Vt(st)/∂σj and ∂Vt(st+1)/∂σj. We start with the derivative ∂Vt(st)/∂σj, which is calculated as follows,


$\frac{\partial V_t(s_t)}{\partial \sigma_j} = \frac{\partial V_t(s_t)}{\partial \omega_l(s_t)}\cdot\frac{\partial \omega_l(s_t)}{\partial \sigma_j}$    (41)

where j = 1, ..., H.

The term ∂Vt(st)/∂ωl(st) is calculated as follows,

$\frac{\partial V_t(s_t)}{\partial \omega_l(s_t)} = \left[\frac{\partial V_t(s_t)}{\partial \omega_1(s_t)}\ \ \frac{\partial V_t(s_t)}{\partial \omega_2(s_t)}\ \ \ldots\ \ \frac{\partial V_t(s_t)}{\partial \omega_L(s_t)}\right]$    (42)

We first calculate the term ∂Vt(st)/∂ω1(st) as follows,

$\frac{\partial V_t(s_t)}{\partial \omega_1(s_t)} = \frac{\partial}{\partial \omega_1(s_t)}\left[\frac{\sum_l \omega_l(s_t)k_l}{\sum_l \omega_l(s_t)}\right] = \frac{k_1 - V_t(s_t)}{\sum_l \omega_l(s_t)}$    (43)

Similarly, we can calculate the terms ∂Vt(st)/∂ω2(st), ..., and ∂Vt(st)/∂ωL(st). Thus, Eq. (42) can be rewritten as follows,

$\frac{\partial V_t(s_t)}{\partial \omega_l(s_t)} = \left[\frac{k_1 - V_t(s_t)}{\sum_l \omega_l(s_t)}\ \ \frac{k_2 - V_t(s_t)}{\sum_l \omega_l(s_t)}\ \ \ldots\ \ \frac{k_L - V_t(s_t)}{\sum_l \omega_l(s_t)}\right]$    (44)

On the other hand, the term ∂ωl(st)/∂σj in Eq. (41) is calculated as follows,

$\frac{\partial \omega_l(s_t)}{\partial \sigma_j} = \left[\frac{\partial \omega_1(s_t)}{\partial \sigma_j}\ \ \frac{\partial \omega_2(s_t)}{\partial \sigma_j}\ \ \ldots\ \ \frac{\partial \omega_L(s_t)}{\partial \sigma_j}\right]^T$    (45)

The derivative ∂ω1(st)/∂σj can be calculated based on the definition of ω1(st) given in Eq. (16) as follows,

$\frac{\partial \omega_1(s_t)}{\partial \sigma_j} = \frac{\partial}{\partial \sigma_j}\left[\prod_{i=1}^{n}\mu_{F_i^1}(s_i)\right] = \frac{\partial}{\partial \sigma_j}\left[\mu_{F_1^1}(s_1)\times\mu_{F_2^1}(s_2)\times\ldots\times\mu_{F_n^1}(s_n)\right]$    (46)

We then substitute Eq. (12) into Eq. (46) as follows,

$\frac{\partial \omega_1(s_t)}{\partial \sigma_j} = \frac{\partial}{\partial \sigma_j}\left[\exp\left(-\left(\frac{s_1 - m_1}{\sigma_1}\right)^2\right)\times\exp\left(-\left(\frac{s_2 - m_{h+1}}{\sigma_{h+1}}\right)^2\right)\times\ldots\times\exp\left(-\left(\frac{s_n - m_{(n-1)h+1}}{\sigma_{(n-1)h+1}}\right)^2\right)\right]$    (47)

Thus, the derivatives ∂ω1(st)/∂σ1, ∂ω1(st)/∂σ2, ..., and ∂ω1(st)/∂σH are calculated as follows,

$\frac{\partial \omega_1(s_t)}{\partial \sigma_1} = \frac{2(s_1 - m_1)^2}{\sigma_1^3}\,\omega_1(s_t)$
$\frac{\partial \omega_1(s_t)}{\partial \sigma_2} = 0$
$\vdots$
$\frac{\partial \omega_1(s_t)}{\partial \sigma_{h+1}} = \frac{2(s_2 - m_{h+1})^2}{\sigma_{h+1}^3}\,\omega_1(s_t)$
$\frac{\partial \omega_1(s_t)}{\partial \sigma_{h+2}} = 0$
$\vdots$
$\frac{\partial \omega_1(s_t)}{\partial \sigma_{(n-1)h+1}} = \frac{2(s_n - m_{(n-1)h+1})^2}{\sigma_{(n-1)h+1}^3}\,\omega_1(s_t)$
$\frac{\partial \omega_1(s_t)}{\partial \sigma_{(n-1)h+2}} = 0$
$\vdots$
$\frac{\partial \omega_1(s_t)}{\partial \sigma_H} = 0$    (48)

Thus, from Eq. (48), the derivative ∂ω1(st)/∂σj, (j = 1, ..., H), can be rewritten as follows,


$\frac{\partial \omega_1(s_t)}{\partial \sigma_j} = \begin{cases}\frac{2(s_i - m_j)^2}{\sigma_j^3}\,\omega_1(s_t) & \text{if } (\sigma_j, m_j) \in \Omega(\omega_1)\\ 0 & \text{if } (\sigma_j, m_j) \notin \Omega(\omega_1)\end{cases}$    (49)

where the term si is the ith input state variable of the state vector st and is defined as follows,

$s_i = \begin{cases}s_1 & \text{if } (\sigma_j, m_j) \in \Omega(s_1)\\ s_2 & \text{if } (\sigma_j, m_j) \in \Omega(s_2)\\ \ \vdots\\ s_n & \text{if } (\sigma_j, m_j) \in \Omega(s_n)\end{cases}$    (50)

We can rewrite Eq. (49) as follows,

$\frac{\partial \omega_1(s_t)}{\partial \sigma_j} = \xi_{j,1}\,\frac{2(s_i - m_j)^2}{\sigma_j^3}\,\omega_1(s_t)$    (51)

where,

$\xi_{j,1} = \begin{cases}1 & \text{if } (\sigma_j, m_j) \in \Omega(\omega_1)\\ 0 & \text{if } (\sigma_j, m_j) \notin \Omega(\omega_1)\end{cases}$    (52)

Similarly, we can calculate the derivative ∂ω2(st)/∂σj as follows,

$\frac{\partial \omega_2(s_t)}{\partial \sigma_j} = \frac{\partial}{\partial \sigma_j}\left[\prod_{i=1}^{n}\mu_{F_i^2}(s_i)\right] = \frac{\partial}{\partial \sigma_j}\left[\mu_{F_1^2}(s_1)\times\mu_{F_2^2}(s_2)\times\ldots\times\mu_{F_n^2}(s_n)\right]$    (53)

$\frac{\partial \omega_2(s_t)}{\partial \sigma_j} = \frac{\partial}{\partial \sigma_j}\left[\exp\left(-\left(\frac{s_1 - m_1}{\sigma_1}\right)^2\right)\times\exp\left(-\left(\frac{s_2 - m_{h+1}}{\sigma_{h+1}}\right)^2\right)\times\ldots\times\exp\left(-\left(\frac{s_n - m_{(n-1)h+2}}{\sigma_{(n-1)h+2}}\right)^2\right)\right]$    (54)

Thus, the derivatives ∂ω2(st)/∂σ1, ..., and ∂ω2(st)/∂σH are calculated as follows,

$\frac{\partial \omega_2(s_t)}{\partial \sigma_1} = \frac{2(s_1 - m_1)^2}{\sigma_1^3}\,\omega_2(s_t)$
$\frac{\partial \omega_2(s_t)}{\partial \sigma_2} = 0$
$\vdots$
$\frac{\partial \omega_2(s_t)}{\partial \sigma_{h+1}} = \frac{2(s_2 - m_{h+1})^2}{\sigma_{h+1}^3}\,\omega_2(s_t)$
$\frac{\partial \omega_2(s_t)}{\partial \sigma_{h+2}} = 0$
$\vdots$
$\frac{\partial \omega_2(s_t)}{\partial \sigma_{(n-1)h+1}} = 0$
$\frac{\partial \omega_2(s_t)}{\partial \sigma_{(n-1)h+2}} = \frac{2(s_n - m_{(n-1)h+2})^2}{\sigma_{(n-1)h+2}^3}\,\omega_2(s_t)$
$\frac{\partial \omega_2(s_t)}{\partial \sigma_{(n-1)h+3}} = 0$
$\vdots$
$\frac{\partial \omega_2(s_t)}{\partial \sigma_H} = 0$    (55)

Thus, from Eq. (55), the derivative ∂ω2(st)/∂σj, (j = 1, ..., H), can be rewritten as follows,


$\frac{\partial \omega_2(s_t)}{\partial \sigma_j} = \begin{cases}\frac{2(s_i - m_j)^2}{\sigma_j^3}\,\omega_2(s_t) & \text{if } (\sigma_j, m_j) \in \Omega(\omega_2)\\ 0 & \text{if } (\sigma_j, m_j) \notin \Omega(\omega_2)\end{cases}$    (56)

We can rewrite Eq. (56) as follows,

$\frac{\partial \omega_2(s_t)}{\partial \sigma_j} = \xi_{j,2}\,\frac{2(s_i - m_j)^2}{\sigma_j^3}\,\omega_2(s_t)$    (57)

where,

$\xi_{j,2} = \begin{cases}1 & \text{if } (\sigma_j, m_j) \in \Omega(\omega_2)\\ 0 & \text{if } (\sigma_j, m_j) \notin \Omega(\omega_2)\end{cases}$    (58)

Similarly, we can calculate the derivatives ∂ω3(st)/∂σj, ∂ω4(st)/∂σj, ..., and ∂ωL(st)/∂σj. Thus, Eq. (45) can be rewritten as follows,

$\frac{\partial \omega_l(s_t)}{\partial \sigma_j} = \left[\xi_{j,1}\,\frac{2(s_i - m_j)^2}{\sigma_j^3}\,\omega_1(s_t)\ \ \xi_{j,2}\,\frac{2(s_i - m_j)^2}{\sigma_j^3}\,\omega_2(s_t)\ \ \ldots\ \ \xi_{j,L}\,\frac{2(s_i - m_j)^2}{\sigma_j^3}\,\omega_L(s_t)\right]^T$    (59)

where,

$\xi_{j,l} = \begin{cases}1 & \text{if } (\sigma_j, m_j) \in \Omega(\omega_l)\\ 0 & \text{if } (\sigma_j, m_j) \notin \Omega(\omega_l)\end{cases}$    (60)

Hence, from Eq. (44) and Eq. (59), the derivative ∂Vt(st)/∂σj, (j = 1, ..., H), in Eq. (41) is then calculated as follows,

$\frac{\partial V_t(s_t)}{\partial \sigma_j} = \frac{2(s_i - m_j)^2}{\sigma_j^3}\times\sum_{l=1}^{L}\xi_{j,l}\,\frac{k_l - V_t(s_t)}{\sum_l \omega_l(s_t)}\,\omega_l(s_t)$    (61)

Similarly, we calculate the derivative ∂Vt(st+1)/∂σj as follows,

$\frac{\partial V_t(s_{t+1})}{\partial \sigma_j} = \frac{2(s'_i - m_j)^2}{\sigma_j^3}\times\sum_{l=1}^{L}\xi_{j,l}\,\frac{k_l - V_t(s_{t+1})}{\sum_l \omega_l(s_{t+1})}\,\omega_l(s_{t+1})$    (62)

where,

$s'_i = \begin{cases}s'_1 & \text{if } (\sigma_j, m_j) \in \Omega(s_1)\\ s'_2 & \text{if } (\sigma_j, m_j) \in \Omega(s_2)\\ \ \vdots\\ s'_n & \text{if } (\sigma_j, m_j) \in \Omega(s_n)\end{cases}$    (63)

where s′i is the ith input state variable of the state vector st+1.

We also use the chain rule to calculate the derivatives ∂Vt(st)/∂mj and ∂Vt(st+1)/∂mj. We start with the derivative ∂Vt(st)/∂mj, which is calculated as follows,

$\frac{\partial V_t(s_t)}{\partial m_j} = \frac{\partial V_t(s_t)}{\partial \omega_l(s_t)}\cdot\frac{\partial \omega_l(s_t)}{\partial m_j}$    (64)

The term ∂Vt(st)/∂ωl(st) is calculated as in Eq. (44), and the term ∂ωl(st)/∂mj is calculated as follows,

$\frac{\partial \omega_l(s_t)}{\partial m_j} = \left[\frac{\partial \omega_1(s_t)}{\partial m_j}\ \ \frac{\partial \omega_2(s_t)}{\partial m_j}\ \ \ldots\ \ \frac{\partial \omega_L(s_t)}{\partial m_j}\right]^T$    (65)

We first calculate the term ∂ω1(st)/∂mj by using Eq. (16) as follows,

$\frac{\partial \omega_1(s_t)}{\partial m_j} = \frac{\partial}{\partial m_j}\left[\prod_{i=1}^{n}\mu_{F_i^1}(s_i)\right] = \frac{\partial}{\partial m_j}\left[\mu_{F_1^1}(s_1)\times\mu_{F_2^1}(s_2)\times\ldots\times\mu_{F_n^1}(s_n)\right]$    (66)


We then substitute Eq. (12) into Eq. (66) as follows,

$\frac{\partial \omega_1(s_t)}{\partial m_j} = \frac{\partial}{\partial m_j}\left[\exp\left(-\left(\frac{s_1 - m_1}{\sigma_1}\right)^2\right)\times\exp\left(-\left(\frac{s_2 - m_{h+1}}{\sigma_{h+1}}\right)^2\right)\times\ldots\times\exp\left(-\left(\frac{s_n - m_{(n-1)h+1}}{\sigma_{(n-1)h+1}}\right)^2\right)\right]$    (67)

Thus, the derivatives ∂ω1(st)/∂m1, ∂ω1(st)/∂m2, ..., and ∂ω1(st)/∂mH are calculated as follows,

$\frac{\partial \omega_1(s_t)}{\partial m_1} = \frac{2(s_1 - m_1)}{\sigma_1^2}\,\omega_1(s_t)$
$\frac{\partial \omega_1(s_t)}{\partial m_2} = 0$
$\vdots$
$\frac{\partial \omega_1(s_t)}{\partial m_{h+1}} = \frac{2(s_2 - m_{h+1})}{\sigma_{h+1}^2}\,\omega_1(s_t)$
$\frac{\partial \omega_1(s_t)}{\partial m_{h+2}} = 0$
$\vdots$
$\frac{\partial \omega_1(s_t)}{\partial m_{(n-1)h+1}} = \frac{2(s_n - m_{(n-1)h+1})}{\sigma_{(n-1)h+1}^2}\,\omega_1(s_t)$
$\frac{\partial \omega_1(s_t)}{\partial m_{(n-1)h+2}} = 0$
$\vdots$
$\frac{\partial \omega_1(s_t)}{\partial m_H} = 0$    (68)

Thus, from Eq. (68), the derivative ∂ω1(st)/∂mj, (j = 1, ..., H), can be rewritten as follows,

$\frac{\partial \omega_1(s_t)}{\partial m_j} = \begin{cases}\frac{2(s_i - m_j)}{\sigma_j^2}\,\omega_1(s_t) & \text{if } (\sigma_j, m_j) \in \Omega(\omega_1)\\ 0 & \text{if } (\sigma_j, m_j) \notin \Omega(\omega_1)\end{cases}$    (69)

We can rewrite Eq. (69) as follows,

$\frac{\partial \omega_1(s_t)}{\partial m_j} = \xi_{j,1}\,\frac{2(s_i - m_j)}{\sigma_j^2}\,\omega_1(s_t)$    (70)

where si is defined as in Eq. (50), and ξj,1 is defined as in Eq. (52).

Similarly, we can calculate the term ∂ω2(st)/∂mj as follows,

$\frac{\partial \omega_2(s_t)}{\partial m_j} = \frac{\partial}{\partial m_j}\left[\prod_{i=1}^{n}\mu_{F_i^2}(s_i)\right] = \frac{\partial}{\partial m_j}\left[\mu_{F_1^2}(s_1)\times\mu_{F_2^2}(s_2)\times\ldots\times\mu_{F_n^2}(s_n)\right]$    (71)

$\frac{\partial \omega_2(s_t)}{\partial m_j} = \frac{\partial}{\partial m_j}\left[\exp\left(-\left(\frac{s_1 - m_1}{\sigma_1}\right)^2\right)\times\exp\left(-\left(\frac{s_2 - m_{h+1}}{\sigma_{h+1}}\right)^2\right)\times\ldots\times\exp\left(-\left(\frac{s_n - m_{(n-1)h+2}}{\sigma_{(n-1)h+2}}\right)^2\right)\right]$    (72)

Thus, the derivatives ∂ω2(st)/∂m1, ∂ω2(st)/∂m2, ..., and ∂ω2(st)/∂mH are calculated as follows,


$\frac{\partial \omega_2(s_t)}{\partial m_1} = \frac{2(s_1 - m_1)}{\sigma_1^2}\,\omega_2(s_t)$
$\frac{\partial \omega_2(s_t)}{\partial m_2} = 0$
$\vdots$
$\frac{\partial \omega_2(s_t)}{\partial m_{h+1}} = \frac{2(s_2 - m_{h+1})}{\sigma_{h+1}^2}\,\omega_2(s_t)$
$\frac{\partial \omega_2(s_t)}{\partial m_{h+2}} = 0$
$\vdots$
$\frac{\partial \omega_2(s_t)}{\partial m_{(n-1)h+1}} = 0$
$\frac{\partial \omega_2(s_t)}{\partial m_{(n-1)h+2}} = \frac{2(s_n - m_{(n-1)h+2})}{\sigma_{(n-1)h+2}^2}\,\omega_2(s_t)$
$\frac{\partial \omega_2(s_t)}{\partial m_{(n-1)h+3}} = 0$
$\vdots$
$\frac{\partial \omega_2(s_t)}{\partial m_H} = 0$    (73)

Thus, from Eq. (73), the derivative ∂ω2(st)/∂mj, (j = 1, ..., H), can be rewritten as follows,

$\frac{\partial \omega_2(s_t)}{\partial m_j} = \begin{cases}\frac{2(s_i - m_j)}{\sigma_j^2}\,\omega_2(s_t) & \text{if } (\sigma_j, m_j) \in \Omega(\omega_2)\\ 0 & \text{if } (\sigma_j, m_j) \notin \Omega(\omega_2)\end{cases}$    (74)

We can rewrite Eq. (74) as follows,

$\frac{\partial \omega_2(s_t)}{\partial m_j} = \xi_{j,2}\,\frac{2(s_i - m_j)}{\sigma_j^2}\,\omega_2(s_t)$    (75)

where si is defined as in Eq. (50), and ξj,2 is defined as in Eq. (58).

Similarly, we can calculate the terms ∂ω3(st)/∂mj, ∂ω4(st)/∂mj, ..., and ∂ωL(st)/∂mj. Thus, Eq. (65) can be rewritten as follows,

$\frac{\partial \omega_l(s_t)}{\partial m_j} = \left[\xi_{j,1}\,\frac{2(s_i - m_j)}{\sigma_j^2}\,\omega_1(s_t)\ \ \xi_{j,2}\,\frac{2(s_i - m_j)}{\sigma_j^2}\,\omega_2(s_t)\ \ \ldots\ \ \xi_{j,L}\,\frac{2(s_i - m_j)}{\sigma_j^2}\,\omega_L(s_t)\right]^T$    (76)

where ξj,l is defined as in Eq. (60).

From Eq. (44) and Eq. (76), the derivative ∂Vt(st)/∂mj in Eq. (64) is calculated as follows,

$\frac{\partial V_t(s_t)}{\partial m_j} = \frac{2(s_i - m_j)}{\sigma_j^2}\times\sum_{l=1}^{L}\xi_{j,l}\,\frac{k_l - V_t(s_t)}{\sum_l \omega_l(s_t)}\,\omega_l(s_t)$    (77)

Similarly, we can calculate the derivative ∂Vt(st+1)/∂mj as follows,

$\frac{\partial V_t(s_{t+1})}{\partial m_j} = \frac{2(s'_i - m_j)}{\sigma_j^2}\times\sum_{l=1}^{L}\xi_{j,l}\,\frac{k_l - V_t(s_{t+1})}{\sum_l \omega_l(s_{t+1})}\,\omega_l(s_{t+1})$    (78)

Hence, from Eq. (38), Eq. (61), Eq. (62), Eq. (77) and Eq. (78), the input parameters mj and σj of the critic are updated at each time step as follows,

$\sigma_{j,t+1} = \sigma_{j,t} - \alpha\left[r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\right]\cdot\left[\gamma\frac{\partial V_t(s_{t+1})}{\partial \sigma_{j,t}} - \frac{\partial V_t(s_t)}{\partial \sigma_{j,t}}\right]$    (79)


$m_{j,t+1} = m_{j,t} - \alpha\left[r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\right]\cdot\left[\gamma\frac{\partial V_t(s_{t+1})}{\partial m_{j,t}} - \frac{\partial V_t(s_t)}{\partial m_{j,t}}\right]$    (80)

Similarly, from Eq. (38), Eq. (39) and Eq. (40), the output parameters kl of the critic are updated at each time step as follows,

$k_{l,t+1} = k_{l,t} - \alpha\left[r_t + \gamma V_t(s_{t+1}) - V_t(s_t)\right]\cdot\left[\gamma\frac{\partial V_t(s_{t+1})}{\partial k_{l,t}} - \frac{\partial V_t(s_t)}{\partial k_{l,t}}\right]$    (81)
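Putting Eqs. (34), (39), (40), (61), (62) and (77)–(81) together, the critic step can be coded compactly. The sketch below is our own and again assumes the illustrative grid-of-rules FIS layout (n inputs, h Gaussian MFs per input, L = h^n rules) from Section 2.2; it is not the authors' implementation.

```python
import numpy as np
from itertools import product

n, h = 2, 3
rules = list(product(range(h), repeat=n))     # rule l uses MF idx[i] of input i

def critic_gradients(s, means, sigmas, k):
    """V(s) of Eq. (14) with dV/dk (Eq. 39), dV/dsigma (Eq. 61), dV/dm (Eq. 77)."""
    mu = np.exp(-((s[:, None] - means) / sigmas) ** 2)                 # Eq. (12)
    w = np.array([np.prod([mu[i, idx[i]] for i in range(n)]) for idx in rules])
    phi = w / w.sum()                                                  # Eq. (15)
    v = phi @ k
    d_sigma = np.zeros_like(sigmas)
    d_mean = np.zeros_like(means)
    for i in range(n):
        for j in range(h):
            users = [l for l, idx in enumerate(rules) if idx[i] == j]  # xi_{j,l} = 1
            g = sum((k[l] - v) / w.sum() * w[l] for l in users)
            d_sigma[i, j] = 2 * (s[i] - means[i, j]) ** 2 / sigmas[i, j] ** 3 * g
            d_mean[i, j] = 2 * (s[i] - means[i, j]) / sigmas[i, j] ** 2 * g
    return v, phi, d_sigma, d_mean

def critic_update(s_t, s_next, r_t, means, sigmas, k, alpha=0.05, gamma=0.95):
    """Residual-gradient critic step of Eqs. (79)-(81); returns Delta_t of Eq. (34)."""
    v_t, phi_t, dsig_t, dm_t = critic_gradients(s_t, means, sigmas, k)
    v_n, phi_n, dsig_n, dm_n = critic_gradients(s_next, means, sigmas, k)
    delta = r_t + gamma * v_n - v_t                       # Eq. (34)
    sigmas -= alpha * delta * (gamma * dsig_n - dsig_t)   # Eq. (79)
    means -= alpha * delta * (gamma * dm_n - dm_t)        # Eq. (80)
    k -= alpha * delta * (gamma * phi_n - phi_t)          # Eq. (81)
    return delta

means = np.array([[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]])
sigmas = np.ones((2, 3))
k = np.zeros(9)
delta = critic_update(np.array([0.2, -0.3]), np.array([0.1, -0.2]), r_t=1.0,
                      means=means, sigmas=sigmas, k=k)
```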

5.2 Adaptation rules for the actor

The input and output parameters of the actor, ψA, are updated as follows [10],

$\psi^A_{t+1} = \psi^A_t + \beta\,\Delta_t\,\frac{\partial u_t}{\partial \psi^A_t}\left[\frac{u_c - u_t}{\sigma_n}\right]$    (82)

where β is a learning rate for the actor parameters, and uc is the output of the actor with a random Gaussian noise added. The derivatives of the output of the FLC (the actor), ut, with respect to the input and output parameters of the FLC can be calculated by replacing Vt(st) with ut in Eq. (39), Eq. (61) and Eq. (77) as follows,

$\frac{\partial u_t}{\partial k_l} = \Phi_l(s_t)$    (83)

$\frac{\partial u_t}{\partial \sigma_j} = \frac{2(s_i - m_j)^2}{\sigma_j^3}\times\sum_{l=1}^{L}\xi_{j,l}\,\frac{k_l - u_t}{\sum_l \omega_l(s_t)}\,\omega_l(s_t)$    (84)

$\frac{\partial u_t}{\partial m_j} = \frac{2(s_i - m_j)}{\sigma_j^2}\times\sum_{l=1}^{L}\xi_{j,l}\,\frac{k_l - u_t}{\sum_l \omega_l(s_t)}\,\omega_l(s_t)$    (85)

Hence, from Eq. (82), Eq. (84) and Eq. (85), the input parameters σj and mj of the actor (FLC) are updated at each time step as follows,

$\sigma_{j,t+1} = \sigma_{j,t} + \beta\,\Delta_t\,\frac{\partial u_t}{\partial \sigma_{j,t}}\left[\frac{u_c - u_t}{\sigma_n}\right]$    (86)

$m_{j,t+1} = m_{j,t} + \beta\,\Delta_t\,\frac{\partial u_t}{\partial m_{j,t}}\left[\frac{u_c - u_t}{\sigma_n}\right]$    (87)

Similarly, from Eq. (82) and Eq. (83), the output parameters kl of the actor (FLC) are updated at each time step as follows,

$k_{l,t+1} = k_{l,t} + \beta\,\Delta_t\,\frac{\partial u_t}{\partial k_{l,t}}\left[\frac{u_c - u_t}{\sigma_n}\right]$    (88)
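The corresponding actor step of Eqs. (82)–(88) can be sketched the same way. In the sketch below, uc is the noisy action that was actually executed (step (6) of Algorithm 1) and delta is the TD error ∆t returned by the critic update, so both are passed in rather than recomputed; this is again our own sketch under the grid-of-rules assumption, not the authors' code.

```python
import numpy as np
from itertools import product

n, h = 2, 3
rules = list(product(range(h), repeat=n))

def actor_output(s, means, sigmas, k):
    """FLC output u_t of Eq. (14) plus the firing strengths for Eqs. (83)-(85)."""
    mu = np.exp(-((s[:, None] - means) / sigmas) ** 2)
    w = np.array([np.prod([mu[i, idx[i]] for i in range(n)]) for idx in rules])
    return (w @ k) / w.sum(), w

def actor_update(s_t, u_c, delta, means, sigmas, k, beta=0.05, sigma_n=0.1):
    """Actor step of Eqs. (86)-(88), driven by Delta_t and the exploration term
    (u_c - u_t) / sigma_n of Eq. (82)."""
    u_t, w = actor_output(s_t, means, sigmas, k)
    phi = w / w.sum()
    scale = beta * delta * (u_c - u_t) / sigma_n
    for i in range(n):
        for j in range(h):
            users = [l for l, idx in enumerate(rules) if idx[i] == j]
            g = sum((k[l] - u_t) / w.sum() * w[l] for l in users)
            d_sigma = 2 * (s_t[i] - means[i, j]) ** 2 / sigmas[i, j] ** 3 * g   # Eq. (84)
            d_mean = 2 * (s_t[i] - means[i, j]) / sigmas[i, j] ** 2 * g         # Eq. (85)
            sigmas[i, j] += scale * d_sigma                                     # Eq. (86)
            means[i, j] += scale * d_mean                                       # Eq. (87)
    k += scale * phi                                                            # Eq. (88)
```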

The proposed RGFACL algorithm is given in Algorithm 1.

Example: This example illustrates how to use the equations associated with the proposed algorithm to tune the input and output parameters of the FISs (actor and critic) of a learning agent. We assume that the actor of the learning agent has two inputs, st = (s1, s2). That is, n = 2. We assume that the output of the actor is a crisp output. The critic of the learning agent has the same inputs as the actor, st = (s1, s2), and also has a crisp output. Thus, the fuzzy rules of the actor and the critic can be described as in Eq. (11), as follows,

$R^l(\text{actor}):\ \text{IF } s_1 \text{ is } A_1^l \text{ and } s_2 \text{ is } A_2^l \text{ THEN } z_l = a_l$    (89)

$R^l(\text{critic}):\ \text{IF } s_1 \text{ is } C_1^l \text{ and } s_2 \text{ is } C_2^l \text{ THEN } z_l = c_l$    (90)


Fig. 3 The proposed RGFACL algorithm

Algorithm 1 The Proposed Residual Gradient Fuzzy Actor Critic Learning Algorithm:

(1) Initialize:
    (a) the input and output parameters of the critic, ψC.
    (b) the input and output parameters of the actor, ψA.
(2) For each EPISODE do:
(3) Update the learning rates α and β of the critic and actor, respectively.
(4) Initialize the position of the pursuer at (xp, yp) = (0, 0) and the position of the evader randomly at (xe, ye), and then calculate the initial state st.
(5) For each ITERATION do:
(6) Calculate the output of the actor, ut, at the state st by using Eq. (14) and then calculate the output uc = ut + n(0, σn).
(7) Calculate the output of the critic, Vt(st), at the state st by using Eq. (14).
(8) Perform the action uc and observe the next state st+1 and the reward rt.
(9) Calculate the output of the critic, Vt(st+1), at the next state st+1 by using Eq. (14).
(10) Calculate the temporal difference error, ∆t, by using Eq. (34).
(11) Update the input and output parameters of the critic, ψC, by using Eq. (79), Eq. (80) and Eq. (81).
(12) Update the input and output parameters of the actor, ψA, based on Eq. (86), Eq. (87) and Eq. (88).
(13) Set st ← st+1.
(14) Check Termination Condition.
(15) end for loop (ITERATION).
(16) end for loop (EPISODE).
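A skeleton of how Algorithm 1 might be driven in code; reset_env, env_step, actor_output, critic_value, update_critic and update_actor are placeholders for the pieces sketched earlier and for the pursuit-evasion dynamics of Section 4, not functions defined in the paper.

```python
import numpy as np

def run_rgfacl(reset_env, env_step, actor_output, critic_value,
               update_critic, update_actor,
               episodes=200, steps=600, gamma=0.95, sigma_n=0.1, seed=0):
    """Skeleton of Algorithm 1; all callables are user-supplied placeholders."""
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s_t = reset_env()                                       # step (4)
        for _ in range(steps):
            u_c = actor_output(s_t) + rng.normal(0.0, sigma_n)  # step (6)
            s_next, r_t, done = env_step(s_t, u_c)              # step (8)
            delta = r_t + gamma * critic_value(s_next) - critic_value(s_t)  # step (10)
            update_critic(s_t, s_next, r_t)                     # step (11): Eqs. (79)-(81)
            update_actor(s_t, u_c, delta)                       # step (12): Eqs. (86)-(88)
            s_t = s_next                                        # step (13)
            if done:                                            # step (14)
                break
```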

The linguistic values of the actor's inputs, A_1^l and A_2^l, are functions of the means and the standard deviations of the MFs of the actor's inputs, and al is a constant that represents the consequent part of the actor's fuzzy rule Rl. On the other hand, the linguistic values of the critic's inputs, C_1^l and C_2^l, are functions of the means and the standard deviations of the MFs of the critic's inputs, and cl represents the consequent part of the critic's fuzzy rule Rl.

We assume that each of the two inputs to the FISs (actor and critic) has three Gaussian MFs, h = 3. That is, the input parameters of each FIS are σj and mj, where j = 1, ..., H and H = n × h = 6. In addition, each FIS has nine rules, L = h^n = 9. That is, the output parameters of each FIS are nine parameters (al for the actor and cl for the critic), where l = 1, ..., 9. The parameters of the MFs of each input, Ω(si), (i = 1, 2), defined by Eq. (13) can be given as follows,

$\Omega(s_1) = \{(\sigma_1, m_1), (\sigma_2, m_2), (\sigma_3, m_3)\}$
$\Omega(s_2) = \{(\sigma_4, m_4), (\sigma_5, m_5), (\sigma_6, m_6)\}$    (91)

On the other hand, the set of the parameters of each firing strength of each rule in each FIS, Ω(ωl), defined by Eq. (17) is given as follows,


$\Omega(\omega_1) = \{(\sigma_1, m_1), (\sigma_4, m_4)\}$
$\Omega(\omega_2) = \{(\sigma_1, m_1), (\sigma_5, m_5)\}$
$\Omega(\omega_3) = \{(\sigma_1, m_1), (\sigma_6, m_6)\}$
$\Omega(\omega_4) = \{(\sigma_2, m_2), (\sigma_4, m_4)\}$
$\Omega(\omega_5) = \{(\sigma_2, m_2), (\sigma_5, m_5)\}$
$\Omega(\omega_6) = \{(\sigma_2, m_2), (\sigma_6, m_6)\}$
$\Omega(\omega_7) = \{(\sigma_3, m_3), (\sigma_4, m_4)\}$
$\Omega(\omega_8) = \{(\sigma_3, m_3), (\sigma_5, m_5)\}$
$\Omega(\omega_9) = \{(\sigma_3, m_3), (\sigma_6, m_6)\}$    (92)

The term ξj,l defined by Eq. (60) can be calculated based on the following matrix,

$\xi = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1\\ 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0\\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}_{6\times 9}$    (93)
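The indicator matrix of Eq. (93) need not be written out by hand; it follows directly from the grid-of-rules layout of Eq. (92). A short sketch (our own check, not from the paper):

```python
import numpy as np
from itertools import product

n, h = 2, 3                                  # two inputs, three MFs each
rules = list(product(range(h), repeat=n))    # rule l uses MF idx[i] of input i
H, L = n * h, h ** n

xi = np.zeros((H, L), dtype=int)
for l, idx in enumerate(rules):
    for i in range(n):
        j = i * h + idx[i]                   # global index of the pair (sigma_j, m_j)
        xi[j, l] = 1                         # Eq. (60)

print(xi)                                    # reproduces the 6x9 matrix of Eq. (93)
```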

To tune the input and output parameters of the actor and the critic, we follow the procedure described in Algorithm (1). After initializing the values of the input and output parameters of the actor and the critic, the learning rates, and the inputs, the output ut of the actor at the current state st is calculated by using Eq. (14) as follows,

$u_t = \frac{\sum_{l=1}^{9}\left[\left(\prod_{i=1}^{2}\mu_{F_i^l}(s_i)\right)a_l\right]}{\sum_{l=1}^{9}\left(\prod_{i=1}^{2}\mu_{F_i^l}(s_i)\right)}$    (94)

To solve the exploration/exploitation dilemma, a random noise n(0, σn) with a zero mean and a standard deviation σn should be added to the actor's output. Thus, the new output (action) uc will be defined as uc = ut + n(0, σn).

The output of the critic at the current state st is calculated by using Eq. (14) as follows,

$V_t(s_t) = \frac{\sum_{l=1}^{9}\left[\left(\prod_{i=1}^{2}\mu_{F_i^l}(s_i)\right)c_l\right]}{\sum_{l=1}^{9}\left(\prod_{i=1}^{2}\mu_{F_i^l}(s_i)\right)}$    (95)

The learning agent performs the action uc and observes the next state st+1 and the immediate reward rt. The output of the critic at the next state st+1 is then calculated by using Eq. (14), which is in turn used to calculate the temporal difference error ∆t by using Eq. (34). Then, the input and output parameters of the actor can be updated by using Eq. (86), Eq. (87) and Eq. (88). On the other hand, the input and output parameters of the critic can be updated by using Eq. (79), Eq. (80) and Eq. (81).

6 Simulation and Results

We evaluate the proposed RGFACL algorithm, the FACL algorithm and the QLFIS algorithm on three different pursuit-evasion games. In the first game, the evader is following a simple control strategy, whereas the pursuer is learning its control strategy to capture the evader in minimum time. In the second game, it is also only the pursuer that is learning. However, the evader in this game is following an intelligent control strategy that exploits the advantage of the maneuverability of the evader. In the third game, we make both the pursuer and the evader learn their control strategies. In multi-robot learning systems, each robot will try to learn its control strategy by interacting with the other robot, which is also learning at the same time. Therefore, the complexity of the system will increase, as learning in a multi-robot system is considered a "moving target" problem [26]. In the moving target problem, the best-response policy of each learning robot may keep changing during learning until each learning robot adopts an equilibrium policy. It is important to mention here that the pursuer, in all games, is assumed not to know the dynamics of the evader or its control strategy.


Table 2 The time that the pursuer trained by each algorithm takes to capture an evader that follows a simple control strategy. The number of episodes here is 200.

Algorithm              Evader initial position
                       (-5,9)     (-10,-6)    (7,4)     (5,-10)
Classical strategy     10.70      12.40       8.05      11.20
The proposed RGFACL    10.80      12.45       8.05      11.25
QLFIS                  10.95      12.50       8.05      11.35
FACL                   11.50      13.50       8.65      12.25

We use the same learning and exploration rates for all algorithms when they are applied to the same game. Those rates are chosen to be similar to those used in [10]. We define the angle difference between the direction of the pursuer and the line-of-sight (LoS) vector from the pursuer to the evader by δp. In all games, we define the state st for the pursuer by two input variables: the pursuer angle difference δp and its derivative δ̇p. In the third game, we define the state st for the evader by two input variables: the evader angle difference δe and its derivative δ̇e. Three Gaussian membership functions (MFs) are used to define the fuzzy sets of each input.

In all games, we assume that the pursuer is faster than the evader, and the evader is more maneuverable than the pursuer. In addition, the pursuer is assumed not to know the dynamics of the evader or its control strategy. The only information the pursuer knows about the evader is the position (location) of the evader. The parameters of the pursuer are set as follows: Vp = 2.0 m/s, Lp = 0.3 m and up ∈ [−0.5, 0.5]. The pursuer starts its motion from the position (xp, yp) = (0, 0) with an initial orientation θp = 0. On the other hand, the parameters of the evader are set as follows: Ve = 1 m/s, Le = 0.3 m and ue ∈ [−1.0, 1.0]. The evader starts its motion from a random position at each episode with an initial orientation θe = 0. The sampling time is defined as T = 0.05 s, whereas the capture radius is defined as dc = 0.1 m.

6.1 Pursuit-Evasion Game 1

In this game, the evader is following a simple control strategy defined by Eq. (30). On the other hand, the pursuer is learning its control strategy with the proposed RGFACL algorithm. We compare our results with the results obtained when the pursuer is following the classical control strategy defined by Eq. (28) and Eq. (29). We also compare our results with the results obtained when the pursuer is learning its control strategy by the FACL and the QLFIS algorithms. We define the number of episodes in this game as 200 and the number of steps (in each episode) as 600. For each algorithm (the FACL, the QLFIS and the proposed RGFACL algorithms), we ran this game 20 times and averaged the capture time of the evader over this number of trials.

Table 2 shows the time that the pursuer takes to capture the evader when the evader is following a simple control strategy and starts its motion from different initial positions. The table shows the capture time of the evader when the pursuer is following the classical control strategy and when the pursuer is learning its control strategy by the FACL algorithm, the QLFIS algorithm and the proposed RGFACL algorithm. From Table 2, we can see that the capture time of the evader when the pursuer learns its control strategy by the proposed RGFACL algorithm is very close to the capture time of the evader when the pursuer follows the classical control strategy. This shows that the proposed RGFACL algorithm achieves the performance of the classical control strategy.

6.2 Pursuit-Evasion Game 2

In this game, the evader follows the control strategy defined by Eq. (30) and Eq. (31), taking advantage of its higher maneuverability. On the other hand, the pursuer in this game learns its control strategy with the proposed RGFACL algorithm. As in game 1, we compare the results obtained when the pursuer learns by the proposed RGFACL algorithm with the results obtained when the pursuer follows the classical control strategy defined by Eq. (28) and Eq. (29). We also compare our results with the results obtained when the pursuer learns its control strategy by the FACL and the QLFIS algorithms.

Table 3 The time that the pursuer trained by each algorithm takes to capture an evader that follows an intelligent control strategy. The number of episodes here is 200. The percentage next to each time is the fraction of the 20 trials in which the pursuer captured the evader.

                        Evader initial position
Algorithm               (-9,7)          (-7,-10)        (6,9)           (3,-9)
Classical strategy      No capture      No capture      No capture      No capture
The proposed RGFACL     13.05 (100%)    14.30 (100%)    11.65 (100%)    11.15 (100%)
QLFIS                   23.20 (20%)     23.55 (20%)     25.45 (20%)     21.35 (20%)
FACL                    No capture      No capture      No capture      No capture

Table 4 The time that the pursuer trained by each algorithm takes to capture an evader that follows an intelligent control strategy. The number of episodes here is 1000. The percentage next to each time is the fraction of the 20 trials in which the pursuer captured the evader.

                        Evader initial position
Algorithm               (-9,7)          (-7,-10)        (6,9)           (3,-9)
Classical strategy      No capture      No capture      No capture      No capture
The proposed RGFACL     12.70 (100%)    13.15 (100%)    11.30 (100%)    10.90 (100%)
QLFIS                   20.25 (50%)     21.60 (50%)     19.60 (50%)     19.20 (50%)
FACL                    No capture      No capture      No capture      No capture

In [10], the velocities of the pursuer and the evader are assumed to be governed by their steering angles so that the pursuer and the evader avoid slipping during turns. This constraint makes the evader slow down whenever it makes a turn, which in turn makes it easier for the pursuer to capture the evader. Our objective is to see how the proposed algorithm and the other studied algorithms behave when the evader exploits its maneuverability advantage without any velocity constraints. Thus, in this work, we remove this velocity constraint so that both the pursuer and the evader can make fast turns. In this game, we use two different numbers of episodes (200 and 1000), whereas the number of steps in each episode is set to 3000. For each algorithm (the FACL, the QLFIS and the proposed RGFACL algorithms), we ran this game 20 times and then averaged the capture time of the evader over this number of trials.

Table 3 and Table 4 show the time that the pursuer takes to capture the evader when the evader follows the control strategy defined by Eq. (30) and Eq. (31) and takes advantage of its higher maneuverability. The number of episodes is 200 for Table 3 and 1000 for Table 4. The tables show that the pursuer fails to capture the evader when it follows the classical control strategy and when it learns by the FACL algorithm. Table 3 shows that the pursuer succeeds in capturing the evader in all 20 trials only when it learns by the proposed RGFACL algorithm. When learning by the QLFIS algorithm, the pursuer succeeds in capturing the evader in only 20% of the 20 trials. On the other hand, Table 4 shows that the pursuer always succeeds in capturing the evader only when it learns with the proposed RGFACL algorithm, whereas when learning with the QLFIS algorithm, the pursuer succeeds in capturing the evader in only 50% of the 20 trials. Table 3 and Table 4 show that the proposed RGFACL algorithm outperforms the FACL and the QLFIS algorithms, because the pursuer using the proposed RGFACL algorithm to learn its control strategy always succeeds in capturing the evader, does so in less time, and requires fewer episodes.

6.3 Pursuit-Evasion Game 3

Unlike game 1 and game 2, both the evader and the pursuer learn their control strategies in this game. In multi-robot learning systems, each robot tries to learn its control strategy by interacting with another robot that is learning its own control strategy at the same time. Thus, the complexity of the system increases in this game, as learning in a multi-robot system is considered a "moving target" problem [26]. We compare the results obtained by the proposed algorithm with the results obtained by the FACL and QLFIS algorithms. Unlike the first two pursuit-evasion games, we do not use the capture time of the evader as a criterion in our comparison in this game. This is because both the

Fig. 4 The paths of the pursuer and the evader when learning by the FACL algorithm proposed in [9] (starred lines) against the paths of the pursuer and the evader when following the classical strategy defined in Eq. (28) and Eq. (29) (dotted lines). The number of episodes used here is 200.

pursuer and the evader are learning. That is, a small capture time by a pursuer learning its control strategy with one of the learning algorithms can have two different interpretations: the first is that the learning algorithm is working well, as the pursuer succeeds in capturing the evader quickly; the second is that the learning algorithm is not working properly, as the evader does not learn how to escape from the pursuer. Therefore, we compare the paths of the pursuer and the evader learning their control strategies by the learning algorithms with the paths of the pursuer and the evader following the classical control strategy defined by Eq. (28) and Eq. (29).

In game 3, the pursuer starts its motion from the position (xp, yp) = (0, 0) with an initial orientation θp = 0. On the other hand, the evader starts its motion from a random position at each episode with an initial orientation θe = 0. We run game 3 twice. In the first run, we set the number of episodes to 200 and the number of steps in each episode to 3000. The results of the first run are shown in Fig. (4) to Fig. (6) for the case where the evader starts its motion from the position (xe, ye) = (−10,−10). These figures show the paths of the pursuer and the evader (starred lines) when learning by the FACL, the QLFIS, and the proposed RGFACL algorithms, respectively. The paths of the pursuer and the evader following the classical control strategy are also shown in the figures (dotted lines). The results of the first run show that the proposed RGFACL algorithm outperforms the FACL and the QLFIS algorithms, as the performance of the proposed algorithm is close to that of the classical control strategy. In the second run of game 3, we set the number of episodes to 500 and the number of steps in each episode to 3000. The results of the second run are shown in Fig. (7) to Fig. (9) for the case where the evader starts its motion from the position (xe, ye) = (−10,−10). The figures show that the performance of the proposed RGFACL algorithm and the performance of the QLFIS algorithm are close to the performance of the classical control strategy, and both algorithms outperform the FACL algorithm.
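
The path comparison used in game 3 can be sketched as follows, reusing the kinematic helpers `step` and `captured` from the earlier setup sketch. The controller arguments are hypothetical callables standing in for the learned policies and for the classical strategy of Eq. (28) and Eq. (29); only the roll-out and overlay logic is illustrated.

```python
import numpy as np
import matplotlib.pyplot as plt

def rollout(pursuer_ctrl, evader_ctrl, evader_start=(-10.0, -10.0), steps=3000):
    """Roll out one episode and return the (x, y) paths of both players."""
    p = np.array([0.0, 0.0, 0.0])
    e = np.array([evader_start[0], evader_start[1], 0.0])
    p_path, e_path = [p[:2].copy()], [e[:2].copy()]
    for _ in range(steps):
        p = step(p, pursuer_ctrl(p, e), V_P, L_P, U_P)
        e = step(e, evader_ctrl(e, p), V_E, L_E, U_E)
        p_path.append(p[:2].copy())
        e_path.append(e[:2].copy())
        if captured(p, e):
            break
    return np.array(p_path), np.array(e_path)

# Overlay the learned paths (starred) on the classical-strategy paths (dotted);
# the four *_ctrl callables below are hypothetical policy functions.
# for (pc, ec), style in [((learned_p_ctrl, learned_e_ctrl), "*-"),
#                         ((classical_p_ctrl, classical_e_ctrl), ":")]:
#     p_path, e_path = rollout(pc, ec)
#     plt.plot(p_path[:, 0], p_path[:, 1], style)
#     plt.plot(e_path[:, 0], e_path[:, 1], style)
# plt.show()
```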

7 Conclusion

In this work, we propose a new fuzzy reinforcement learning algorithm for differential games that have continuous state and action spaces. The proposed algorithm uses FISs as function approximation systems: an actor (fuzzy logic controller, FLC) and a critic. The proposed algorithm tunes the input and output parameters of its function approximation systems (the actor and the critic) differently from the tuning mechanisms used in the algorithms proposed in the literature. The proposed algorithm uses

Fig. 5 The paths of the pursuer and the evader when learning by the QLFIS algorithm proposed in [10] (starred lines) against the paths of the pursuer and the evader when following the classical strategy defined in Eq. (28) and Eq. (29) (dotted lines). The number of episodes used here is 200.

Fig. 6 The paths of the pursuer and the evader when learning by the proposed RGFACL algorithm (starred lines) against the paths of the pursuer and the evader when following the classical strategy defined in Eq. (28) and Eq. (29) (dotted lines). The number of episodes used here is 200.

Fig. 7 The paths of the pursuer and the evader when learning by the FACL algorithm proposed in [9] (starred lines) against the paths of the pursuer and the evader when following the classical strategy defined in Eq. (28) and Eq. (29) (dotted lines). The number of episodes used here is 500.

Fig. 8 The paths of the pursuer and the evader when learning by the QLFIS algorithm proposed in [10] (starred lines) against the paths of the pursuer and the evader when following the classical strategy defined in Eq. (28) and Eq. (29) (dotted lines). The number of episodes used here is 500.

Fig. 9 The paths of the pursuer and the evader when learning by the proposed RGFACL algorithm (starred lines) against the paths of the pursuer and the evader when following the classical strategy defined in Eq. (28) and Eq. (29) (dotted lines). The number of episodes used here is 500.

the residual gradient value iteration algorithm as the mechanism for tuning the parameters of its function approximation systems, whereas the algorithms proposed in the literature use the direct algorithms to tune the parameters of their function approximation systems. The proposed algorithm is called the RGFACL algorithm. It has been shown in the literature that the residual gradient algorithms are superior to the direct algorithms, as the residual gradient algorithms are always guaranteed to converge, whereas the direct algorithms may not converge to an answer in some cases. For ease of implementation, the complete derivation of the partial derivatives needed by the proposed algorithm is presented in this work. The proposed algorithm is used to learn three different pursuit-evasion games. We start with the game where the pursuer learns its control strategy and the evader follows a simple control strategy. In the second game, the pursuer learns its control strategy and the evader follows an intelligent control strategy that exploits its advantage of higher maneuverability. In the third game, we increase the complexity of the system by making both the pursuer and the evader learn their control strategies. Simulation results show that the proposed RGFACL algorithm outperforms the FACL and the QLFIS algorithms in terms of performance and learning time when they are all used to learn the pursuit-evasion games considered in this work.

References

1. S. Micera, A. M. Sabatini, and P. Dario, Adaptive fuzzy control of electrically stimulated muscles for arm movements, Medical & Biological Engineering & Computing, 37.6, 680-685, 1999.
2. F. Daldaban, N. Ustkoyuncu, and K. Guney, Phase inductance estimation for switched reluctance motor using adaptive neuro-fuzzy inference system, Energy Conversion and Management, 47.5, 485-493, 2005.
3. J. S. R. Jang, C. T. Sun, and E. Mizutani, Neuro-fuzzy and Soft Computing, PTR Prentice Hall, 1997.
4. S. Labiod and T. M. Guerra, Adaptive fuzzy control of a class of SISO nonaffine nonlinear systems, Fuzzy Sets and Systems, 158.10, 1126-1137, 2007.
5. H. K. Lam and F. H. F. Leung, Fuzzy controller with stability and performance rules for nonlinear systems, Fuzzy Sets and Systems, 158.2, 147-163, 2007.
6. H. Hagras, V. Callaghan, and M. Colley, Learning and adaptation of an intelligent mobile robot navigator operating in unstructured environment based on a novel online Fuzzy-Genetic system, Fuzzy Sets and Systems, 141.1, 107-160, 2004.
7. M. Mucientes, D. L. Moreno, A. Bugarín, and S. Barro, Design of a fuzzy controller in mobile robotics using genetic algorithms, Applied Soft Computing, 7.2, 540-546, 2007.
8. L. X. Wang, A Course in Fuzzy Systems and Control, Upper Saddle River, NJ: Prentice Hall, 1997.
9. L. Jouffe, Fuzzy inference system learning by reinforcement methods, IEEE Trans. Syst., Man, Cybern. C, 28.3, 338-355, 1998.

10. S. F. Desouky and H. M. Schwartz, Q(λ)-learning adaptive fuzzy logic controllers for pursuit-evasion differential games, International Journal of Adaptive Control and Signal Processing, 25.10, 910-927, 2011.
11. M. D. Awheda and H. M. Schwartz, The residual gradient FACL algorithm for differential games, Electrical and Computer Engineering (CCECE), IEEE 28th Canadian Conference on, 1006-1011, 2015.
12. W. Hinojosa, S. Nefti and U. Kaymak, Systems control with generalized probabilistic fuzzy-reinforcement learning, Fuzzy Systems, IEEE Transactions on, 19.1, 51-64, 2011.
13. M. Rodríguez, R. Iglesias, C. V. Regueiro, J. Correa and S. Barro, Autonomous and fast robot learning through motivation, Robotics and Autonomous Systems, 55.9, 735-740, Elsevier, 2007.
14. B. Luo, H.N. Wu and T. Huang, Off-policy reinforcement learning for H∞ control design, Cybernetics, IEEE Transactions on, 45.1, 65-76, IEEE, 2015.
15. B. Luo, H.N. Wu and H.X. Li, Adaptive optimal control of highly dissipative nonlinear spatially distributed processes with neuro-dynamic programming, Neural Networks and Learning Systems, IEEE Transactions on, 26.4, 684-696, IEEE, 2015.
16. B. Luo, H.N. Wu, T. Huang and D. Liu, Reinforcement learning solution for HJB equation arising in constrained optimal control problem, Neural Networks, 71, 150-158, Elsevier, 2015.
17. H. Modares, F. L. Lewis and M.B. Naghibi-Sistani, Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems, Automatica, 50.1, 193-202, Elsevier, 2014.
18. W. Dixon, Optimal adaptive control and differential games by reinforcement learning principles, Journal of Guidance, Control, and Dynamics, 37.3, 1048-1049, American Institute of Aeronautics and Astronautics, 2014.
19. B. Luo, H.N. Wu and H.X. Li, Data-based suboptimal neuro-control design with reinforcement learning for dissipative spatially distributed processes, Industrial & Engineering Chemistry Research, 53.19, 8106-8119, ACS Publications, 2014.
20. H.N. Wu and B. Luo, Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear control, Neural Networks and Learning Systems, IEEE Transactions on, 23.12, 1884-1895, IEEE, 2012.
21. L. Baird, Residual algorithms: Reinforcement learning with function approximation, In ICML, 30-37, 1995.
22. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 1.1, Cambridge: MIT Press, 1998.
23. H. M. Schwartz, Multi-agent Machine Learning: A Reinforcement Approach, John Wiley & Sons, 2014.
24. L. P. Kaelbling, M. L. Littman and A. W. Moore, Reinforcement learning: A survey, Journal of Artificial Intelligence Research, 4, 237-285, 1996.
25. W. D. Smart and L. P. Kaelbling, Effective reinforcement learning for mobile robots, IEEE International Conference on Robotics and Automation, Proceedings ICRA'02, 4, 2002.
26. M. Bowling and M. Veloso, Multiagent learning using a variable learning rate, Artificial Intelligence, 136.2, 215-250, 2002.
27. C. Ye, N. HC Yung, and D. Wang, A fuzzy controller with supervised learning assisted reinforcement learning algorithm for obstacle avoidance, Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 33.1, 17-27, 2003.
28. T. Kondo and K. Ito, A reinforcement learning with revolutionary state recruitment strategy for autonomous mobile robots control, Robotics and Autonomous Systems, 46, 111-124, 2004.
29. D. A. Gutnisky and B. S. Zanutto, Learning obstacle avoidance with an operant behavior model, Artificial Life, 10.1, 65-81, 2004.
30. R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, 3.1, 9-44, 1988.
31. J. W. Sheppard, Colearning in differential games, Machine Learning, 33, 201-233, 1998.
32. S. N. Givigi Jr, H. M. Schwartz, and X. Lu, A reinforcement learning adaptive fuzzy controller for differential games, Journal of Intelligent and Robotic Systems, 59, 3-30, 2010.
33. S. F. Desouky and H. M. Schwartz, Self-learning fuzzy logic controllers for pursuit-evasion differential games, Robotics and Autonomous Systems, 59, 22-33, 2011.
34. W. M. Van Buijtenen, G. Schram, R. Babuska, and H. B. Verbruggen, Adaptive fuzzy control of satellite attitude by reinforcement learning, Fuzzy Systems, IEEE Transactions on, 6.2, 185-194, 1998.
35. A. Bonarini, A. Lazaric, F. Montrone, and M. Restelli, Reinforcement distribution in fuzzy Q-learning, Fuzzy Sets and Systems, 160.10, 1420-1443, 2009.
36. P. Dayan and T. J. Sejnowski, TD(λ) converges with probability 1, Machine Learning, 14, 295-301, 1994.
37. P. Dayan, The convergence of TD(λ) for general λ, Machine Learning, 8.3-4, 341-362, 1992.
38. T. Jakkola, M. Jordan and S. Singh, On the convergence of stochastic iterative dynamic programming, Neural Computation, 6, 1185-1201, 1993.
39. X. Dai, C. Li, and A. B. Rad, An approach to tune fuzzy controllers based on reinforcement learning for autonomous vehicle control, IEEE Transactions on Intelligent Transportation Systems, 6.3, 285-293, 2005.
40. X.S. Wang, Y.H. Cheng, and J.Q. Yi, A fuzzy Actor-Critic reinforcement learning network, Information Sciences, 177.18, 3764-3781, 2007.
41. E. H. Mamdani and S. Assilian, An experiment in linguistic synthesis with a fuzzy logic controller, International Journal of Man-Machine Studies, 7.1, 1-13, 1975.
42. T. Takagi and M. Sugeno, Fuzzy identification of systems and its applications to modelling and control, IEEE Transactions on Systems, Man and Cybernetics, SMC, 15.1, 116-132, 1985.
43. M. Sugeno and G. Kang, Structure identification of fuzzy model, Fuzzy Sets Syst., 28, 15-33, 1988.
44. A. G. Barto, R. S. Sutton and C. W. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, Systems, Man and Cybernetics, IEEE Transactions on, 5, 834-846, 1983.
45. J. Boyan and A. W. Moore, Generalization in reinforcement learning: Safely approximating the value function, Advances in Neural Information Processing Systems, 369-376, 1995.
46. G. J. Gordon, Reinforcement learning with function approximation converges to a region, Advances in Neural Information Processing Systems, 2001.
47. R. Schoknecht and A. Merke, TD(0) converges provably faster than the residual gradient algorithm, ICML, 2003.
48. J. N. Tsitsiklis and B. V. Roy, An analysis of temporal-difference learning with function approximation, Automatic Control, IEEE Transactions on, 42.5, 674-690, 1997.

49. R. Isaacs, Differential Games, John Wiley and Sons, 1965.
50. S. M. LaValle, Planning Algorithms, Cambridge University Press, 2006.
51. S. H. Lim, T. Furukawa, G. Dissanayake and H.F.D. Whyte, A time-optimal control strategy for pursuit-evasion games problems, In International Conference on Robotics and Automation, New Orleans, LA, 2004.
52. S. F. Desouky and H. M. Schwartz, Different hybrid intelligent systems applied for the pursuit-evasion game, In 2009 IEEE International Conference on Systems, Man, and Cybernetics, 2677-2682, 2009.