Learning in the Multi-Robot Pursuit Evasion Game
by
Ahmad Al-Talabi, M.Sc.
A dissertation submitted to the
Faculty of Graduate Studies and Research
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical and Computer Engineering
Ottawa-Carleton Institute for Electrical and Computer Engineering (OCIECE)
Department of Systems and Computer Engineering
Carleton University
Ottawa, Ontario, Canada
January, 2019
© Copyright 2019, Ahmad Al-Talabi
The undersigned hereby recommends to the
Faculty of Graduate Studies and Research
acceptance of the dissertation
Learning in the Multi-Robot Pursuit Evasion Game
submitted by
Ahmad Al-Talabi, M.Sc.
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical and Computer Engineering
Professor Cecilia Zanni-Merk, External Examiner,
Mathematical Engineering department,
INSA Rouen Normandie
Professor Gabriel Wainer, Thesis Supervisor,
Department of Systems and Computer Engineering
Professor Yvan Labiche, Chair,
Department of Systems and Computer Engineering
Ottawa-Carleton Institute for Electrical and Computer Engineering (OCIECE)
Department of Systems and Computer Engineering
Carleton University
Ottawa, Ontario, Canada
January, 2019
Abstract
This thesis proposes different learning algorithms to investigate the learning issues
of mobile robots playing differential forms of the Pursuit-Evasion (PE) game. The
algorithms are used to: (1) reduce the computational requirements without affecting the overall performance of the algorithm, (2) reduce the learning time, (3) reduce the capture time and the possibility of collision among pursuers, and (4) deal with multi-robot PE games with a single-superior evader.
The computational complexity is reduced by examining four methods of pa-
rameter tuning for the Q-Learning Fuzzy Inference System (QLFIS) algorithm, to
determine both the best parameters to tune and those that have minimal impact
on performance. Two learning algorithms are then proposed to reduce the learn-
ing time. The first uses a two-stage technique that combines the Particle Swarm Optimization (PSO)-based Fuzzy
Logic Control (FLC) algorithm with the QLFIS algorithm, with the PSO algorithm
used as a global optimizer and the QLFIS as a local optimizer. The second algorithm
is a modified version of the Fuzzy Actor-Critic Learning (FACL) algorithm, known
as Fuzzy Actor-Critic Learning Automaton (FACLA). It uses the Continuous Actor-
Critic Learning Automaton (CACLA) algorithm to tune the parameters of the Fuzzy
Inference System (FIS).
Next, a decentralized learning technique is proposed that enables a group of
two or more pursuers to capture a single inferior evader. It uses the FACLA algo-
rithm together with the Kalman filter technique to reduce both the capture time
and the collision potential among the pursuers. It is assumed that there is no commu-
nication among the pursuers. Finally, a proposed decentralized learning algorithm
is applied successfully to a multi-robot PE game with a single-superior evader, in
which all players have identical speeds. A new reward function is suggested and
used to guide the pursuer to either move to the interception point with the evader
or move in parallel with the evader, depending on whether the pursuer can capture
the evader or not. Simulation results have shown the feasibility of the proposed
learning algorithms.
“ People are of two kinds. They are either your brothers in Faith or your Equal
in Humanity.”
Imam Ali ( A.S)
Acknowledgments
First and foremost, I am extremely grateful to Almighty Allah, and to Ahlul-Bayt,
peace be upon them, for their countless blessings and for the great care they have
provided at every moment of my life. Certainly, without their support and bounty, I
would never have successfully finished this chapter of my life by obtaining the PhD degree.
I would like to express my deepest gratitude to my supervisor Prof. Gabriel
Wainer for his guidance and support. He has always been available to meet with
me even with his tight time schedule. Being under his supervision is one of the best
decisions I have ever made. Thanks, Prof. Wainer!
Also, I would like to express my sincere gratitude and appreciation to my friend
and mentor, Prof. Ramy Gohary for his encouragement, guidance, advice and sup-
port. Frankly, I cannot find words that do justice to what he deserves. He has always been
available to discuss how to overcome difficulties and to explore the ways of suc-
cess. Prof. Gohary is a treasure and only a lucky person can work with such a
knowledgeable professor. Thanks Prof. Gohary!
I would like to thank my thesis committee members Prof. Paul Keen, Prof. Cecilia
Zanni-Merk, Prof. James Green, Prof. Sidney Givigi and Prof. Emil Petriu for their
comprehensive evaluation and insightful comments on my thesis.
My sincere gratitude and appreciation go to the chair of the Systems and Com-
puter Engineering Department Prof. Yvan Labiche and to Prof. Ioannis Lambadaris
for their support and invaluable assistance.
Also, I would like to thank my close friends Prof. Ahmed Qadoury Abed and Dr.
Nasir Kamat for always being there to relieve my hardships and for their support and
encouragement.
I cannot ever express my feelings towards my parents, brothers and sisters. Their
love, support, encouragement and prayers gave me more confidence to fulfill my
goals successfully. Special thanks to my brother Mohammed for his support and
patience.
Finally, I would like to dedicate this work to my beloved wife, Zainab, and my
children, Qamer, Mohammedbakir, Aya and Zahraa, for their unconditional love,
care, patience and understanding. You were the candles that lit my way
through this long journey.
Table of Contents
Abstract iii
Acknowledgments vi
List of Tables xii
List of Figures xv
Nomenclature xxiii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Goal and Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Background and Literature Review 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 The Pursuit-Evasion Game . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 The Game of Two Cars . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Fuzzy Logic Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Fuzzy Sets and Membership Functions . . . . . . . . . . . . . 18
2.3.2 Fuzzification . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Fuzzy-Rule Base . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Fuzzy-Inference Engine . . . . . . . . . . . . . . . . . . . . . 22
2.3.5 Defuzzification . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Future-Discounting Reward . . . . . . . . . . . . . . . . . . . 28
2.4.2 Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.3 Temporal Difference Learning . . . . . . . . . . . . . . . . . . 32
2.4.4 Actor-Critic Methods . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Particle Swarm Optimization . . . . . . . . . . . . . . . . . . . . . . 36
2.5.1 Particle Swarm Optimization with Inertia Weight . . . . . . . 39
2.5.2 Particle Swarm Optimization with Constriction Factor . . . . . 41
2.6 The Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6.1 The System Dynamic Model . . . . . . . . . . . . . . . . . . . 43
2.6.2 The Kalman Filtering Process . . . . . . . . . . . . . . . . . . 45
2.6.3 Fading Memory Filter (FMF) . . . . . . . . . . . . . . . . . . . 47
2.7 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3 An Investigation of Methods of Parameter Tuning for the QLFIS 56
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Pursuit-Evasion Game . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Fuzzy Logic Controller Structure . . . . . . . . . . . . . . . . . . . . 60
3.5 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6 Q-Learning Fuzzy Inference System (QLFIS) . . . . . . . . . . . . . . 64
3.6.1 The Learning Rule of the QLFIS and its Algorithm . . . . . . . 64
3.7 Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7.1 Evader Follows a Default Control Strategy . . . . . . . . . . . 69
3.7.2 Evader Using its Higher Maneuverability Advantageously . . . 73
3.7.3 Multi-Robot Learning . . . . . . . . . . . . . . . . . . . . . . 79
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4 Learning Technique Using PSO-Based FLC and QLFIS Algorithms 90
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 The PSO-based FLC algorithm . . . . . . . . . . . . . . . . . . . . . . 92
4.3 Q-learning Fuzzy Inference System (QLFIS) . . . . . . . . . . . . . . 95
4.4 The proposed Two-Stage Learning Technique . . . . . . . . . . . . . 96
4.5 Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5.1 Evader Follows a Default Control Strategy . . . . . . . . . . . 97
4.5.2 Multi-Robot Learning . . . . . . . . . . . . . . . . . . . . . . 105
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5 Kalman Fuzzy Actor-Critic Learning Automaton Algorithm for the Pursuit-
Evasion Differential Game 110
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2 Fuzzy Actor-Critic Learning Automaton (FACLA) . . . . . . . . . . . . 112
5.2.1 Evader Follows a Default Control Strategy . . . . . . . . . . . 117
5.2.2 Multi-Robot Learning . . . . . . . . . . . . . . . . . . . . . . 120
5.3 Learning in n-Pursuer One-Evader PE Differential Game . . . . . . . 121
5.3.1 Predicting the Interception Point and its Effects . . . . . . . . 124
5.4 State Estimation Based on a Kalman Filter . . . . . . . . . . . . . . . 126
5.4.1 The Design of Filter Parameters . . . . . . . . . . . . . . . . . 129
5.4.2 Kalman Filter Initialization . . . . . . . . . . . . . . . . . . . 131
5.4.3 Fuzzy Fading Memory Filter . . . . . . . . . . . . . . . . . . . 132
5.4.4 Kalman Filter Model Selection . . . . . . . . . . . . . . . . . . 133
5.5 Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.5.1 Case 1: Two-Pursuer One-Evader Game . . . . . . . . . . . . 147
5.5.2 Case 2: Three-Pursuer One-Evader Game . . . . . . . . . . . 151
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6 Multi-Player Pursuit-Evasion Differential Game with Equal Speed 156
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2 The Dynamic Equations of the Players . . . . . . . . . . . . . . . . . 158
6.3 Fuzzy Logic Controller Structure . . . . . . . . . . . . . . . . . . . . 160
6.4 Reward Function Formulation . . . . . . . . . . . . . . . . . . . . . . 160
6.5 Fuzzy Actor-Critic Learning Automaton (FACLA) . . . . . . . . . . . . 164
6.6 Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7 Conclusions and Future Work 170
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Bibliography 181
List of Tables
2.1 Fuzzy Decision Table (FDT) . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Methods of parameter tuning. . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Fuzzy logic parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3 FDTs of the pursuer and the evader before learning. . . . . . . . . . . 63
3.4 Mean and standard deviation of the capture time (s) for different
evader initial positions for the first version of the PE game. . . . . . . 70
3.5 Mean and standard deviation of the computation time (s) for the
four methods of parameter tuning for the first version of the PE game. 74
3.6 Mean and standard deviation of the capture time (s) for different
evader initial positions for the second version of the PE game. . . . . 76
3.7 Mean and standard deviation of the computation time (s) for the
four methods of parameter tuning for the second version of the PE
game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.8 Mean and standard deviation of the capture time (s) for different
evader initial positions for the third version of the PE game (Case 1). 81
3.9 Mean and standard deviation of the computation time (s) for the
four methods of parameter tuning for the third version of the PE
game (Case 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.10 Mean and standard deviation of the capture time (s) for different
pursuers’ initial positions for the third version of the PE game (Case 2). 85
3.11 Mean and standard deviation of the computation time (s) for the
four methods of parameter tuning for the third version of the PE
game (Case 2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.1 The percentage decrease in the mean value of the capture time (s)
for 1000 episodes as the number of particles increases. . . . . 100
4.2 The percentage decrease in the mean value of the capture time (s)
for 10 particles as the number of episodes increases. . . . . . . . . . 100
4.3 Mean and standard deviation of the capture time (s) for different
evader initial positions for the case of only the pursuer learning. . . . 102
4.4 Total number of episodes for the different learning algorithms . . . . 102
4.5 Mean and standard deviation of the computation time (s) for differ-
ent learning algorithms for the case of only the pursuer learning. . . 104
4.6 Mean and standard deviation of the capture time (s) for different
evader initial positions for the case of multi-robot learning. . . . . . . 106
4.7 Mean and standard deviation of the computation time (s) for differ-
ent learning algorithms for the case of multi-robot learning. . . . . . 108
5.1 Total number of episodes for the different learning algorithms. . . . . 119
5.2 Mean and standard deviation of the capture time (s) for different
evader initial positions for the case of only the pursuer learning. . . . 119
5.3 Mean and standard deviation of the computation time (s) for differ-
ent learning algorithms for the case of only the pursuer learning. . . 119
5.4 Mean and standard deviation of the capture times (s) for different
evader initial positions for the case of multi-robot learning. . . . . . . 120
5.5 Mean and standard deviation of the computation time (s) for differ-
ent learning algorithms for the case of multi-robot learning. . . . . . 121
5.6 FDT of the fuzzy FMF. . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.7 Mean and standard deviation of the RMSE (cm) for the evader’s po-
sition estimate of Example 5.1. . . . . . . . . . . . . . . . . . . . . . 136
5.8 Mean and standard deviation of the RMSE (cm) for the evader’s po-
sition estimate of Example 5.2. . . . . . . . . . . . . . . . . . . . . . 140
5.9 Mean and standard deviation of the RMSE (cm) for the evader’s po-
sition estimate of Example 5.3. . . . . . . . . . . . . . . . . . . . . . 141
5.10 Mean and standard deviation of the capture time (s) for a two-
pursuer one-evader game for different pursuers’ initial positions. . . 149
5.11 Mean and standard deviation of the capture time (s) for a three-
pursuer one-evader game for different pursuers’ initial positions. . . 151
6.1 Fuzzy decision table . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
List of Figures
2.1 The PE model for the game of two cars. . . . . . . . . . . . . . . . . 17
2.2 Fuzzy logic controller structure . . . . . . . . . . . . . . . . . . . . . 19
2.3 Membership function of input and output. . . . . . . . . . . . . . . . 21
2.4 Graphical representation of different defuzzification methods. . . . . 25
2.5 Agent-environment interaction in RL [16]. . . . . . . . . . . . . . . . 27
2.6 Actor-critic structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 Kalman filtering process. . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1 Initial MFs of pursuer and evader before learning. . . . . . . . . . . . 62
3.2 The structure of the QLFIS technique [20]. . . . . . . . . . . . . . . . 65
3.3 The PE paths on the xy-plane for the first version of the PE game,
before the pursuer starts to learn. . . . . . . . . . . . . . . . . . . . . 71
3.4 The PE paths on the xy-plane for the first version of the PE game
when the first method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 71
3.5 The PE paths on the xy-plane for the first version of the PE game
when the second method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 72
3.6 The PE paths on the xy-plane for the first version of the PE
game when the third method of parameter tuning is used versus the
PE paths when each player followed its DCS. . . . . . . . . . . . . . . 72
3.7 The PE paths on the xy-plane for the first version of the PE game
when the fourth method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 73
3.8 The PE paths on the xy-plane for the second version of the PE game,
before the pursuer starts to learn. . . . . . . . . . . . . . . . . . . . . 75
3.9 The PE paths on the xy-plane for the second version of the PE game
when the first method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 77
3.10 The PE paths on the xy-plane for the second version of the PE game
when the second method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 77
3.11 The PE paths on the xy-plane for the second version of the PE game
when the third method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 78
3.12 The PE paths on the xy-plane for the second version of the PE game
when the fourth method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 78
3.13 The PE paths on the xy-plane for the third version of the PE game
(Case 1), before the players start to learn. . . . . . . . . . . . . . . . 80
3.14 The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the first method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 81
3.15 The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the second method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 82
3.16 The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the third method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 82
3.17 The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the fourth method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 83
3.18 The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the first method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 87
3.19 The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the second method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 87
3.20 The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the third method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 88
3.21 The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the fourth method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 88
4.1 The mean values of the capture time for the PSO-based FLC algo-
rithm for different population sizes. The range bars indicate the
standard deviations over the 500 simulation runs. . . . . . . . . . . 99
4.2 The mean values of the capture time for the PSO-based FLC algo-
rithm for different episode numbers. The range bars indicate the
standard deviations over the 500 simulation runs. . . . . . . . . . . . 99
4.3 The PE paths on the xy-plane using the PSO-based FLC algorithm for
the case of only the pursuer learning versus the PE paths when each
player followed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4 The PE paths on the xy-plane using the QLFIS algorithm for the case
of only the pursuer learning versus the PE paths when each player
followed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.5 The PE paths on the xy-plane using the proposed learning algorithm
for the case of only the pursuer learning versus the PE paths when
each player followed its DCS. . . . . . . . . . . . . . . . . . . . . . . 104
4.6 The PE paths on the xy-plane using the PSO-based FLC algorithm
for the case of multi-robot learning versus the PE paths when each
player followed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.7 The PE paths on the xy-plane using the QLFIS algorithm for the case
of multi-robot learning versus the PE paths when each player fol-
lowed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.8 The PE paths on the xy-plane using the proposed learning algorithm
for the case of multi-robot learning versus the PE paths when each
player followed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1 Structure of the FACL system [17]. . . . . . . . . . . . . . . . . . . . 113
5.2 The mean values of the capture time for the FACL, RGFACL and FA-
CLA algorithms for different episode numbers. The range bars indi-
cate the standard deviations over the 500 simulation runs. . . . . . . 118
5.3 The PE differential game model with two-pursuer and one-evader. . . 122
5.4 Geometric illustration for capturing situation. . . . . . . . . . . . . . 124
5.5 Geometric illustration for capturing situation using the estimated po-
sition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.6 MFs of the inputs µ and ξ. . . . . . . . . . . . . . . . . . . . . . . . . 134
5.7 The evader’s position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 137
5.8 The evader’s x-position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 138
5.9 The evader’s y-position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 139
5.10 The evader’s position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 141
5.11 The evader’s x-position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 142
5.12 The evader’s y-position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 143
5.13 The evader’s position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 144
5.14 The evader’s x-position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 145
5.15 The evader’s y-position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 146
5.16 The PE paths using FACLA algorithm for a two-pursuer one-evader
game for different pursuers’ initial positions. . . . . . . . . . . . . . . 149
5.17 The PE paths using Kalman-FACLA algorithm for a two-pursuer one-
evader game for different pursuers’ initial positions. . . . . . . . . . . 150
5.18 The PE paths using FACLA algorithm for a three-pursuer one-evader
game for different pursuers’ initial positions. . . . . . . . . . . . . . . 152
5.19 The PE paths using Kalman-FACLA algorithm for a three-pursuer one-
evader game for different pursuers’ initial positions. . . . . . . . . . . 153
6.1 Geometric illustration for capturing situation. . . . . . . . . . . . . . 159
6.2 Geometric illustration for the pursuer moving in parallel with the
evader using PGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3 The PE paths for three-pursuer one-evader after the first learning
episode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.4 The PE paths for three-pursuer one-evader after the final episode. . . 166
6.5 The average payoff of each pursuer at the end of each learning episode. . . 167
6.6 The difference between the initial and the final LoSs between each
pursuer and the evader at each learning episode. . . . . . . . . . . . 168
6.7 The difference between the initial and the final Euclidean distances
between each pursuer and the evader at each learning episode. . . . 168
Nomenclature
FIS Fuzzy Inference System
QLFIS Q-Learning Fuzzy Inference System
PSO Particle Swarm Optimization
FLC Fuzzy Logic Control
FACL Fuzzy Actor-Critic Learning
CACLA Continuous Actor-Critic Learning Automaton
FACLA Fuzzy Actor-Critic Learning Automaton
PE Pursuit-Evasion
RL Reinforcement Learning
LoS Line-of-Sight
HJI Hamilton-Jacobi-Isaacs
GA Genetic Algorithm
MF Membership Function
TS Takagi-Sugeno
FDT Fuzzy Decision Table
TD Temporal-Difference
DP Dynamic Programming
MC Monte Carlo
FMF Fading Memory Filter
CVM Constant Velocity Model
CAM Constant Acceleration Model
Chapter 1
Introduction
1.1 Overview
Pursuit-Evasion (PE) games are multi-player games in which one or more ‘pursuers’
attempt to capture one or more ‘evaders’ in the least time, while the evaders try
to escape or maximize the capture time [1]. Hence, this can be formulated as a
zero-sum game in which a benefit for the pursuers is a loss for the evaders, and vice
versa. And since the goal of the pursuers is opposite to that of the evaders, PE can
be considered an optimization problem with conflicting objectives [2].
PE games have been used for decades, mainly in three scientific fields: neu-
roethology [3], behavioural biology [4, 5], and game theory [1, 6]. They are a key
research tool in the field of game theory, and several classes of the games have been
studied extensively due to their potential for military applications, including mis-
sile avoidance and interception, surveillance, reconnaissance and rescue operations
[1]. The concept can also be generalized to solve other applications, such as path
planning, collision avoidance, criminal pursuit and other related fields [7, 8].
PE games are differential games, which means that the system dynamics are
described by systems of differential equations. In other words, differential games
are games that have continuous state and action spaces. The solution complexity of
a differential game increases as the number of players increases. This is due
to the difficulty of modeling the interactions between players, and of enabling the players
to interact with unknown and uncertain environments.
Fuzzy logic control (FLC) is a method for dealing with processes that are ill-
defined and/or involve uncertainty or continuous change, as well as a technique
for intelligent control [9]. Fuzzy logic concepts have been widely used in the field
of autonomous mobile robots [10–14]. Designing an FLC system requires an ap-
propriate knowledge base, and this can be constructed from experts’ knowledge.
However, building a knowledge base is not a simple task, so using an optimization
method such as Particle Swarm Optimization (PSO) or a learning method such as
Reinforcement Learning (RL) to autonomously tune the FLC parameters, can be
useful.
PSO is a population-based stochastic optimization method that has been
proposed for different types of applications. In this thesis, the PSO algorithm will
act as a global optimizer to tune the parameters of the FLC. It was chosen because of
its simplicity and efficiency, and the fact that it only has a few parameters to adjust.
The PSO algorithm tunes the FLC parameters in accordance with the problem fitness
function.
RL represents a method to learn to achieve a particular goal [15] through inter-
action with the environment [16]. In RL, the player selects actions and the environ-
ment responds by producing new situations for the player. The environment also
generates numerical values (rewards) that should be maximized by a player over
time. RL has proven to be a good choice for intelligent robot control design, and it
has been successfully applied to tune FLC parameters.
1.2 Motivation
The study of PE games has received wide attention from researchers in various
fields due to the game’s extensive applicability, particularly in military applications
such as air combat, torpedo-and-ship, and tank-and-jeep scenarios [1]. There are
also many related scientific implications, and understanding the PE game concept
has fostered applications in communications, video games and robotics. For ex-
ample, several important applications of mobile robots can be formulated as PE
games, including path planning, tracking, leader-follower, collision avoidance and
search-and-rescue [7]. Thus, due to the importance of the PE game, and recent
technological developments, researchers are increasingly interested in autonomous
agents (e.g. autonomous robots) that can learn and achieve a specific goal through
experience acquired from their environment. The learning algorithm for an au-
tonomous agent is implemented in a microcontroller; thus, it is better to use a
learning algorithm that can be easily and efficiently applied.
As discussed in Section 1.1, PE games are differential, and this can make achiev-
ing solutions complicated. Issues typically increase in proportion to the number
of game participants, which is partly due to modeling participants’ interactions,
as well as how they interact with unknown and uncertain environments. Thus, this
thesis focuses on the learning approach in PE differential games; that is, each player
learns the best action to take at each instant of time, and adapts to uncertain and
unknown environments.
Recently, fuzzy-reinforcement learning methods in [17–22] have been proposed
to address learning problems in differential games. With these methods, the learn-
ing process is achieved by tuning the parameters of two main components, both
of which are Fuzzy Inference Systems (FISs) with two sets of parameters (i.e. a
premise and a consequent) that can be tuned by the learning algorithm. The first
component is used to approximate a state value function V (·) or an action value
function Q(·, ·), and the second is used as an FLC. However, the work in [17–21]
did not investigate which set of parameters had the most significant impact on the
performance of the learning algorithm. In [18–20], the learning process is achieved
by tuning all sets of the parameters, while in [17, 21] only one set of parameters
is used; namely, the consequent set. Therefore, this issue requires investigation to
determine possible reductions in the computational requirements of the learning
algorithm.
Each player in a PE game tries to learn its control strategy, and this normally
takes a specific number of episodes to achieve acceptable performance. Sometimes
the number selected is greater than necessary, which means the learning time is
longer. The length of the learning process is very important, particularly if the
learning algorithm is being implementing with a real-world game. This makes it
necessary to propose other learning algorithms that could speed up the learning
process, as demonstrated later in the thesis.
If one of the learning algorithms proposed in [17–20, 22] is applied to a multi-
pursuer single-inferior-evader PE game, there is potential for collisions among the
pursuers, particularly if they are near one another or approaching the evader. This
motivates the development of a new learning algorithm with the ability to avoid
collisions. And since there are multiple pursuers in the game, it is more effective to
propose a learning algorithm that can be implemented in a decentralized manner.
Furthermore, this thesis discusses the problem of multi-pursuer single-evader PE
games in which all players have equal capabilities. Though this game format was
previously investigated in [23], it was assumed that the speed of the evader was
known, which makes the algorithm inappropriate for practical use. Therefore, this
assumption is not made here, which provides motivation for proposing another
learning algorithm that could teach a group of pursuers how to capture a single
evader in a decentralized manner.
1.3 Goal and Objective
The overall goal of this thesis can be formulated as how to develop efficient learning
algorithms that enable one or several pursuers to capture one or several evaders in
minimal time. Efficiency of the learning algorithm can be achieved by meeting the
following objectives: reducing the computational requirements as much as possible
without affecting the overall performance of the learning algorithm, reducing the
learning time, reducing the capture time, reducing the possibility of collision among
the pursuers and dealing with multi-robot PE games with a single-superior evader.
This research is carried out in four stages:
• Stage 1: In order to reduce computational complexity, the Q-Learning Fuzzy
Inference System (QLFIS) algorithm previously proposed in [20] is consid-
ered. Four methods for tuning the parameters of the algorithm are investi-
gated to determine the ones with maximal and minimal impacts on perfor-
mance. The four methods depend on whether all the parameters of the FIS
and the FLC are tuned, or only a subset; it is more computationally efficient
to tune a subset rather than all parameters.
• Stage 2: For real-world applications the learning time is highly important,
thus two specific learning algorithms are proposed for PE differential games.
The first uses a two-stage learning technique that combines the PSO-based
FLC algorithm with the QLFIS algorithm. The resulting algorithm is called the
PSO-based FLC+QLFIS algorithm. The PSO aspect of the algorithm is used
as a global optimizer to autonomously tune the parameters of the FLC, while
the QLFIS aspect is used as a local optimizer. The second proposed learning
algorithm is a modified version of the Fuzzy Actor-Critic Learning (FACL) al-
gorithm, in which both the critic and the actor are FISs. This algorithm uses
the Continuous Actor-Critic Learning Automaton (CACLA) algorithm to tune
the parameters of the FIS, and is called the Fuzzy Actor-Critic Learning Au-
tomaton (FACLA) algorithm.
• Stage 3: A decentralized learning technique that enables a group of two or
more pursuers to capture a single inferior evader is proposed for PE differen-
tial games. Both the pursuers and the evader must learn their control strate-
gies simultaneously by interacting with one another. This learning technique
uses the FACLA algorithm and the Kalman filter technique to reduce the cap-
ture time and collision potential for the pursuers. It is assumed that there is
no communication among the pursuers, and each pursuer considers the other
pursuers as part of its environment.
• Stage 4: In this stage, a decentralized learning algorithm is proposed for
multi-robot PE games with a single-superior evader, which enables a group of
pursuers in PE differential games to learn how to capture such an evader. In
this thesis, the superiority of the evader is defined in terms of its maximum
speed. Therefore, the superior evader can be defined as an evader whose
maximum speed is equal to or exceeds the maximum speed of the fastest pur-
suer in the game [8, 24, 25]. In this work, it is assumed that the pursuers
and the evader have identical speeds. A novel idea is used to formulate a re-
ward function for the proposed learning algorithm based on a combination of
two factors: the difference in the Line-of-Sight (LoS) between each pursuer in
the game and the evader at two consecutive time instants, and the difference
between two successive Euclidean distances between individual pursuers and
the evader.
1.4 Thesis Organization
This thesis is organized in a paper-based format and presented in seven chapters,
summarized as follows:
• Chapter 1 presents the general introduction, the motivation, the research goal
and objectives and the thesis organization. In addition, it provides a summary
of contributions and a list of publications based on this study.
• Chapter 2 defines some basic theoretical concepts applied in the thesis and
provides a detailed literature review. It begins with a brief introduction of the
problem of the PE differential game, and introduces a model of the ‘game of
two cars’, since this is the model that will be mainly used to present the PE
differential game in this thesis. Then the FIS, one of the most popular function
approximation methods, is described. Here, the FIS works as either an FLC,
or as a function approximator to manage the problem of continuous state
and action spaces such as the PE differential game. A detailed background
of RL is provided, as RL is used by learning agents to find an appropriate
learning strategy in an unknown environment. PSO, one of the population
stochastic optimization methods, is then applied as a global optimizer for the
FLC parameters. Furthermore, there is a detailed discussion about the Kalman
filter, one of the most powerful estimation techniques. The chapter concludes
with a detailed literature review on previous studies related to this research.
• Chapter 3 investigates how to reduce the computational requirement by ap-
plying four methods to implement the QLFIS algorithm. The methods are
based on the sets (i.e., premise and consequent) of FIS and FLC parameters
that can be tuned, and the four methods are applied to three versions of a PE
differential game. In the first version the evader plays a well-defined strategy,
which is to run away along the LoS while the pursuer tries to learn how to
capture it using the QLFIS algorithm. The second game is similar to the first
one, except that the evader plays an intelligent control strategy and makes a
sharp turn when the distance between the evader and pursuer is less than a
specific threshold value. In the third game, the QLFIS algorithm is used by
both players, and they attempt to learn their control strategies simultaneously
by interacting with one another. Simulation results are provided to evaluate
which parameters are most suitable to tune, and which have little impact on
performance.
• Chapter 4 introduces a new two-stage technique to reduce player learning
time, which is an important factor for any learning algorithm. The first stage
is represented by the PSO-based FLC algorithm, and the second by the QLFIS
algorithm. The basic reason for using the PSO algorithm is its ability to work
as a global optimizer and to find a global solution within a few iterations.
Thus, it is first used to find effective initial parameter values for the FLC of the
learning player. Then, in the second stage, the learning agent uses the QLFIS
algorithm with the resulting initial parameter setting to quickly find its control
strategy. The two-stage learning technique is applied to different versions of
PE differential games, and simulation results are provided and discussed.
• Chapter 5 develops a new fuzzy-reinforcement learning algorithm for PE dif-
ferential games. The proposed algorithm reduces the learning time that the
players need to find their control strategies. It uses the CACLA algorithm to
tune the FIS parameters, and is called the FACLA algorithm. The proposed
algorithm is applied to two versions of the PE games, and compared by sim-
ulation with state-of-the-art fuzzy-reinforcement learning algorithms, and the
PSO-based FLC+QLFIS algorithm. This chapter also presents a decentralized
learning technique that enables a group of two or more pursuers to capture a
single evader in PE differential games. The proposed technique uses the FA-
CLA algorithm and the Kalman filter technique, which is an estimation method
to predict an evader’s next position. It can be used by all pursuers to avoid col-
lisions among them and reduce capture time. The Kalman learning technique
is applied for each player to autonomously tune the parameters of its FLC and
self-learn its control strategy. Simulation results are provided to show that the
proposed learning algorithm works as required.
• Chapter 6 introduces a type of reward function for the FACLA algorithm to
teach a group of pursuers how to capture a single evader in a decentralized
manner. It is assumed that all players have identical speed, and each pur-
suer learns to take the right actions by tuning its FLC parameters, using the
FACLA algorithm, with the suggested reward function. The reward function
depends on two factors: the difference in the LoS between each pursuer in the
game and the evader at two consecutive time instances, and the difference be-
tween two consecutive Euclidean distances between individual pursuers and
the evader. The suggested reward function is used to guide each pursuer to
move either to the interception point with the evader or in parallel with it.
The pursuer movement direction depends on whether the pursuer can cap-
ture the evader or not. Simulation results are shown to validate the FACLA
algorithm with the suggested reward function.
• Chapter 7 highlights the conclusions and presents the main contributions of
the thesis. It also discusses ideas and directions for future work.
1.5 Summary of Contributions
The main contributions of this thesis are:
1-Reducing the Computational Time by:
Proposing and investigating four methods of parameter tuning for the QLFIS
algorithm. The investigation will determine whether it is necessary to tune both
the premise and consequent parameters of the FIS and FLC, or only the consequent
parameters.
2-Reducing the Learning Time by:
Proposing two learning algorithms for PE differential games to reduce the time
a player needs to find its control strategy. The first algorithm uses a two-stage
learning technique that combines the PSO-based FLC and QLFIS algorithms and
employs the PSO algorithm as a global optimizer and the QLFIS as a local optimizer.
The second algorithm is a modified version of the FACL algorithm called the FACLA
algorithm, and it uses a CACLA algorithm to tune the parameters of the actor and
critic. Simulations and comparisons of these algorithms and the state-of-the-art
fuzzy-reinforcement learning algorithms will be determined.
3-Reducing the Capture Time and Reducing the Possibility of Collision Among
Pursuers by:
Developing a decentralized learning technique for PE differential games that en-
ables a group of two or more pursuers to capture a single evader. The algorithm is
known as the Kalman-FACLA algorithm and it uses the Kalman filter to estimate the
evader’s next position, allowing pursuers to determine the evader’s direction. To
implement the algorithm, the only information each pursuer needs is the instanta-
neous position of the evader. This learning algorithm will be used to avoid collisions
among the pursuers and reduce capture time.
4-Dealing with multi-player PE games with single-superior evader by:
Defining a type of reward function for the FACLA algorithm that can teach a
group of pursuers how to capture a single-superior evader in a decentralized man-
ner, when the speed of all the players is identical. The proposed reward function
directs each pursuer to either move to intercept the evader, or to move in parallel
with it. There is no need to calculate the capture angle for each pursuer in order to
determine its control signal.
1.6 Publications
The publications that resulted from this research are:
1. A. A. Al-Talabi and H. M. Schwartz, “An investigation of methods of parameter
tuning for Q-learning fuzzy inference system,” in Proc. of the 2014 IEEE In-
ternational Conference on Fuzzy Systems (FUZZ-IEEE 2014), (Beijing, China),
pp. 2594-2601, July 2014.
2. A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique using
PSO-based FLC and QFIS for the pursuit evasion differential game,” in Proc. of
2014 IEEE International Conference on Mechatronics and Automation (ICMA
2014), (Tianjin, China), pp. 762-769, August 2014.
3. A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique for dual
learning in the pursuit-evasion differential game,” in Proc. of the IEEE Sympo-
sium Series on Computational Intelligence (SSCI) 2014, (Orlando, Florida),
pp. 1-8, December 2014.
4. A. A. Al-Talabi and H. M. Schwartz, “Kalman fuzzy actor-critic learning au-
tomaton algorithm for the pursuit-evasion differential game,” in Proc. of the
2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016),
(Vancouver, Canada), pp. 1015-1022, July 2016.
5. A. A. Al-Talabi, “Fuzzy actor-critic learning automaton algorithm for the pursuit-
evasion differential game,” in Proc. of the 2017 IEEE International Automatic
Control Conference (CACS), (Pingtung, Taiwan), November 2017.
6. A. A. Al-Talabi, “Multi-player pursuit-evasion differential game with equal
speed,” in Proc. of the 2017 IEEE International Automatic Control Confer-
ence (CACS), (Pingtung, Taiwan), November 2017.
Note: Based on the advice of the Faculty of Graduate and Postdoctoral Affairs,
this thesis was written in paper-based format. Therefore, some content might be
presented more than once because each paper was written independently. Also, the
contents of the original papers are modified in a way that preserves the information
presented in each paper, and eliminates repetition. Moreover, additional work has
been added to some of the original papers.
Chapter 2
Background and Literature Review
2.1 Introduction
In this thesis, the Pursuit-Evasion (PE) differential game is investigated from the
learning point of view, and thus this chapter will give basic concepts and theoretical
background that are related to this issue. The chapter begins by giving a brief in-
troduction to what is called a differential game, where the game’s state and action
spaces are given in a continuous domain. Then, the PE differential game is defined,
and a mathematical model for a two-player PE differential game is given. In this
model, each player is defined as a car-like mobile robot. Since the game has con-
tinuous state and action spaces, there is a need to use some form of function
approximation, and for this reason the Fuzzy Inference System (FIS) is described.
In this thesis, the FIS is used as either a function approximator or a Fuzzy Logic
Controller (FLC). Following that, the concept and detailed theoretical background
of the Reinforcement Learning (RL) is provided. Here, the RL is used to address
the learning issue for the problem of the PE differential game. It guides each player
to learn its control strategy in an unknown environment. Next, one of the most
popular stochastic optimization methods is discussed, which is the Particle Swarm
Optimization (PSO) algorithm. In this research, the PSO algorithm is used to speed
up the learning process by finding a good initial setting for the FLC parameters.
Finally, the Kalman filter, as one of the most powerful estimation techniques, is in-
vestigated in detail. Here, the Kalman filter is used to estimate the evader’s next
position. Also, one of its generalizations called a Fading Memory Filter (FMF) is
introduced.
This chapter is organized as follows. The PE game is described in Section 2.2,
and the fuzzy logic inference system is explained in Section 2.3. The RL and PSO
algorithms are described in Section 2.4 and Section 2.5, respectively. The Kalman
filter is discussed in Section 2.6. Finally, a detailed literature review of previous
studies related to this research is provided in Section 2.7.
2.2 The Pursuit-Evasion Game
Game problems involving conflict and/or cooperation are encountered on a daily
basis in areas such as athletics, stock market dealing, political bargaining and war
games. Game theory is usually connected with these situations, as there are typ-
ically a number of rational participants, each with its own goal. Participants are
known as players, agents or decision makers, and if the dynamics of a game are
defined by differential equations, the game is referred to as ‘differential’. Differ-
ential games were initiated by Isaacs [1] in the 1950s, when he studied the opti-
mal behaviour of the PE game at the Rand Corporation. Isaacs [1] proposed the
homicidal-chauffeur problem as an example of the PE differential game. This prob-
lem is a two-player zero-sum game, where the objective of the first player is opposite
to the objective of the second player. In the game, a slow pedestrian (the evader)
who can change direction instantaneously, tries to maximize the capture time or
avoid being captured by a fast homicidal chauffeur (the pursuer). Isaacs then ex-
tended his work to more general cases of the PE game, such as the game of two cars.
Typically, a PE game has one or several pursuers attempting to capture one or more
evaders in minimal time, while the evaders try to escape or maximize the capture
time [1]. Thus, this problem can be considered as an optimization one with con-
flicting objectives [2]. In the PE game, each player attempts to learn the best action
to take at every moment, and to adapt to uncertain or changing environments.
2.2.1 The Game of Two Cars
The game of two cars represents one of the PE differential games, though it
is different from the homicidal-chauffeur problem. In this game, each player moves
with limited speed and turning ability, like a car. Figure 2.1 shows the PE model
for this game and its parameters, and the two players modeled as car-like mobile
robots.
The dynamic equations that describe the motion of the pursuer and the evader
robots in this game are [26]
$$\dot{x}_i = V_i \cos(\theta_i), \qquad \dot{y}_i = V_i \sin(\theta_i), \qquad \dot{\theta}_i = \frac{V_i}{L_i}\tan(u_i), \tag{2.1}$$
where i is e for the evader and p for the pursuer. Also, (x_i, y_i), V_i, θ_i, L_i and u_i refer to the position, velocity, orientation, wheelbase and steering angle, respectively.

Figure 2.1: The PE model for the game of two cars.
The steering angle is bounded by $-u_{i_{\max}} \leq u_i \leq u_{i_{\max}}$, where $u_{i_{\max}}$ is the maximum steering angle. When the steering angle is fixed at $u_i$, the car moves in a circular path with a radius $R_{d_i}$, and when it is fixed at $u_{i_{\max}}$, the car moves with a minimum turning radius. The turning radius can be defined by

$$R_{d_i} = \frac{L_i}{\tan(u_i)}. \tag{2.2}$$
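To make the car-like kinematics concrete, the following Python sketch integrates Equation (2.1) with a simple forward-Euler step and a naive pursuit rule in which the pursuer steers along the line of sight toward the evader. It is only an illustration: the speeds, wheelbases, steering limit, time step and capture radius below are assumed placeholder values, not parameters taken from this thesis.

```python
import math

def step(state, V, L, u, dt=0.05):
    """Advance a car-like robot (x, y, theta) by one Euler step of Equation (2.1)."""
    x, y, theta = state
    x += V * math.cos(theta) * dt
    y += V * math.sin(theta) * dt
    theta += (V / L) * math.tan(u) * dt
    return (x, y, theta)

# Illustrative values only (not taken from the thesis).
pursuer = (0.0, 0.0, 0.0)       # x, y, theta
evader = (5.0, 5.0, 0.0)
Vp, Ve = 2.0, 1.0               # the pursuer is faster in this toy example
Lp, Le = 0.3, 0.3               # wheelbases
u_max = math.radians(30)        # steering bound: -u_max <= u <= u_max

for _ in range(400):
    # The pursuer steers toward the evader along the line of sight (LoS).
    los = math.atan2(evader[1] - pursuer[1], evader[0] - pursuer[0])
    err = math.atan2(math.sin(los - pursuer[2]), math.cos(los - pursuer[2]))
    u_p = max(-u_max, min(u_max, err))
    pursuer = step(pursuer, Vp, Lp, u_p)
    evader = step(evader, Ve, Le, 0.0)  # the evader runs straight in this sketch
    if math.hypot(evader[0] - pursuer[0], evader[1] - pursuer[1]) < 0.1:
        print("captured")
        break
```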
2.3 Fuzzy Logic Control
In 1965, Zadeh [27] established the basis of FLC by introducing the concepts of
fuzzy sets and fuzzy logic. However, researchers did not understand how to use
these concepts in an application until Mamdani [28] applied them to control an
automatic steam engine in 1974. Since then, fuzzy logic has been considered one
of the most powerful methods of describing and designing control systems to deal
with complex processes in an intuitive and simple manner. Thus, FLC has trig-
gered numerous studies [29], and become an active field of research in different
application areas, including statistics [30], industrial automation [31], image and
signal processing [32], biomedicine [33], data mining [34], pattern recognition
[35], data analysis [36], power plants [37], expert systems [38] and control en-
gineering problems [39–42]. Fuzzy-logic concepts are widely used in daily life,
and companies around the world have benefitted by designing various types of
FLCs for different applications and electronic devices. For example, some of the
washing machines manufactured by Matsushita Electric Industrial are fuzzy-based
controlled [43]. FLC is considered a form of soft computing or intelligent control
that can mimic human decision making to deal with partial truth situations. As in-
dicated in Figure 2.2, the FLC structure is composed of four principal components:
fuzzifier, fuzzy rule base, fuzzy inference engine and defuzzifier. The inputs and
outputs of the FLC are real-valued crisp data. The implementation of FLC in real
applications involves the following three steps:
1. Fuzzification: to map real-valued crisp data into a fuzzy set.
2. Fuzzy Inference Process: to combine the Membership Functions (MFs) with the fuzzy rules to obtain the fuzzy output.
3. Defuzzification: to convert the fuzzy output into real-valued crisp data.
2.3.1 Fuzzy Sets and Membership Functions
The idea of fuzzy set theory is an extension of the concept of classical set theory
in which data elements are either in a set or not.

Figure 2.2: Fuzzy logic controller structure (crisp inputs → fuzzifier → fuzzy inputs → fuzzy inference engine with fuzzy rule base → fuzzy outputs → defuzzifier → crisp outputs).

For example, a classical set of all positive numbers can be defined by

$$A = \{z \mid z > 0\}.$$
It is clear that if z > 0 then z is a member of set A; otherwise, z is not a positive
number and not a member of set A. There is a mapping function with two values,
0 or 1, which is called the MF, µ(z), and can be defined by
$$\mu(z) = \begin{cases} 1 & : z \in A, \\ 0 & : z \notin A. \end{cases}$$
Fuzzy set theory allows elements to have partial membership, whereby an element
is both a member and not a member at the same time, and has a specific degree
of membership. This partial membership can be expressed in terms of MF values
between 0 and 1, and the MF is used to map each value in the input space to
a membership value within the interval [0, 1]. For example, if Z is a collection of
elements z which is the universe of discourse, then a fuzzy set A in Z can be defined
as a set of ordered pairs as follows:
$$A = \big\{(z, \mu_A(z)) \mid z \in Z\big\}, \qquad \mu_A(z) \in [0, 1],$$
where µA(z) denotes the MF of fuzzy set A. The domain for each input variable is
usually divided into several membership functions. Thus, for an input value with
several MFs, the input must be processed through each MF. There are many types
of MFs, including trapezoidal, triangular, Gaussian, bell-shaped and sigmoidal, and
the appropriate type to use depends on the problem under consideration. Control
applications commonly use trapezoidal, triangular and Gaussian MFs, and Gaus-
sian MFs are also widely used for the problems of function approximation. In this
work, Gaussian MFs are used unless otherwise stated. The Gaussian MF takes the
following form
$$\mu(z) = \exp\!\left(-\left(\frac{z - m}{\sigma}\right)^{2}\right), \tag{2.3}$$
where m and σ are the mean and the standard deviation, respectively.
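As a small numerical illustration of Equation (2.3), the snippet below evaluates Gaussian MFs and shows how a single crisp input obtains graded degrees of membership in several overlapping fuzzy sets. The means and standard deviation are arbitrary example values, not parameters used in this work.

```python
import math

def gaussian_mf(z, m, sigma):
    """Gaussian membership function of Equation (2.3)."""
    return math.exp(-((z - m) / sigma) ** 2)

# Three overlapping Gaussian sets on an arbitrary axis (example values only).
centers = (50.0, 65.0, 80.0)
for z in (55.0, 65.0, 75.0):
    print(z, [round(gaussian_mf(z, m, 8.0), 3) for m in centers])
```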
2.3.2 Fuzzification
The process of converting a classical set or crisp data to a fuzzy set is known
as fuzzification, and it involves two steps. The first step is to specify the type and
number of MFs for each input and output variable, and the second is to identify the
MFs with linguistic labels. For example, consider a simple air conditioner control
process that is controlled by a heater only [32]. The control process has one input
variable (temperature) and one output variable (motor speed). Suppose that each
variable has five triangular MFs, with the input variable having linguistic labels of
CD (cold), CL (cool), GD (good), WM (warm) and HT (hot), and the output variable
having linguistic labels of VS (very slow), SL (slow), NO (normal), FT (fast), and
VF (very fast), as shown in Figure 2.3.
Figure 2.3: Membership functions of input and output: (a) input (temperature, °F, 40–90) with MFs CD, CL, GD, WM, HT; (b) output (speed, R/M, 0–100) with MFs VS, SL, NO, FT, VF; the vertical axis of each panel is the degree of membership µ.
2.3.3 Fuzzy-Rule Base
The fuzzy-rule base represents an essential part of the FLC, and can be con-
structed from expert knowledge. It consists of a collection of linguistic control rules
in the form of fuzzy IF-THEN rules. The general form of the fuzzy IF-THEN rules is
\[
R^{l}: \text{IF } z_1 \text{ is } A_1^{l} \text{ and } z_2 \text{ is } A_2^{l} \text{ and } \ldots \text{ and } z_N \text{ is } A_N^{l} \text{ THEN } y^{l} \text{ is } C^{l}, \tag{2.4}
\]
where "$z_1$ is $A_1^{l}$ and $z_2$ is $A_2^{l}$ and ... and $z_N$ is $A_N^{l}$" represents the premise or antecedent part and "$y^{l}$ is $C^{l}$" the consequent or conclusion part. Also, $z_i$ is the $i$th input variable, $N$ is the number of input variables, $y^{l}$ is the output of rule $l$, $A_i^{l}$ is a fuzzy set of the input $z_i$ in rule $l$, and $L$ is the number of fuzzy rules. Moreover, in rule $l$, $C^{l}$ can be either a fuzzy set of the output $y^{l}$ or a linear function of the input variables, depending on the fuzzy inference method. Clearly, the inputs
are associated with the premise and the outputs are associated with the conclusion
[44].
Now, the Gaussian membership value of the input zi in rule l can be written as
follows:
\[
\mu_{A_i^{l}}(z_i) = \exp\left(-\left(\frac{z_i - m_i^{l}}{\sigma_i^{l}}\right)^{2}\right), \tag{2.5}
\]
where $m_i^{l}$ and $\sigma_i^{l}$ are the mean and the standard deviation, respectively, for the MF of the input $z_i$ in rule $l$.
2.3.4 Fuzzy-Inference Engine
The fuzzy-inference engine is used to produce the output fuzzy sets from the
input fuzzy ones, based on the information available on the fuzzy-rule-base [39],
cf. Figure 2.2. The fuzzy inference engine generates an output for each activated
rule, then it produces the final output by combining the outputs of all activated
rules.
The two most important types of fuzzy-inference engines or systems in the liter-
ature are the Mamdani and Assilian FIS [45] and the Takagi-Sugeno (TS) FIS [46].
The main difference between them is the definition of the consequent part of Equa-
tion (2.4). In rule l, Mamdani’s FIS is defined Cl in Equation (2.4) as the fuzzy set
of output yl, while TS FIS is defined Cl as a linear function of the input variables.
Hence, the linguistic rules of Mamdani’s FIS can be defined by Equation (2.4). The
TS FIS consists of linguistic rules in the form
\[
R^{l}: \text{IF } z_1 \text{ is } A_1^{l} \text{ and } \ldots \text{ and } z_N \text{ is } A_N^{l} \text{ THEN } y^{l} = K_l^{0} + K_l^{1} z_1 + \ldots + K_l^{N} z_N, \tag{2.6}
\]
where $K_l^{i}$ is a real constant. In the present work, zero-order TS rules with constant consequents are used. It consists of linguistic rules in the form
\[
R^{l}: \text{IF } z_1 \text{ is } A_1^{l} \text{ and } z_2 \text{ is } A_2^{l} \text{ and } \ldots \text{ and } z_N \text{ is } A_N^{l} \text{ THEN } y^{l} = K_l. \tag{2.7}
\]
The number of rules depends on the number of inputs and their corresponding MFs.
These rules increase exponentially with the number of inputs and their correspond-
ing MFs [44]. For example, consider a two-input zero-order TS fuzzy model in which each input has three MFs with the linguistic labels P (positive), Z (zero) and N (negative); in that case a fuzzy rule base with nine rules must be built. The fuzzy rules
can also be constructed using a Fuzzy Decision Table (FDT), as shown in Table 2.1.
Table 2.1: Fuzzy Decision Table (FDT)
                 z2
             N     Z     P
  z1    N    K1    K2    K3
        Z    K4    K5    K6
        P    K7    K8    K9
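To make the zero-order TS inference of Equation (2.7) and Table 2.1 concrete, the following Python sketch (my own illustration; the MF centres, widths and consequent constants K1–K9 are hypothetical) evaluates the nine rules with product premises and the weighted-average defuzzification described in Section 2.3.5.

import itertools
import numpy as np

def gaussian_mf(z, m, sigma):
    return np.exp(-((z - m) / sigma) ** 2)

# Three Gaussian MFs (N, Z, P) per input; values are illustrative only.
means  = np.array([-1.0, 0.0, 1.0])
sigmas = np.array([0.6, 0.6, 0.6])

# One constant consequent per rule (K1..K9 of Table 2.1), illustrative values.
K = np.array([-1.0, -0.7, -0.3,
              -0.2,  0.0,  0.2,
               0.3,  0.7,  1.0])

def ts_zero_order(z1, z2):
    # Zero-order TS inference: product premise and weighted-average defuzzification.
    mu1 = gaussian_mf(z1, means, sigmas)      # memberships of z1 in N, Z, P
    mu2 = gaussian_mf(z2, means, sigmas)      # memberships of z2 in N, Z, P
    w = np.array([mu1[i] * mu2[j] for i, j in itertools.product(range(3), range(3))])
    return np.sum(w * K) / np.sum(w)          # crisp output

print(ts_zero_order(0.4, -0.1))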
2.3.5 Defuzzification
The process of reconverting the fuzzy output to classical or crisp data is known as
defuzzification. There are several popular defuzzification methods in the literature;
including the Center of Gravity, Center-average, Mean of Maximum and Height.
Center of Gravity (COG) Defuzzification
The Center of Gravity (COG) method is one of the most widely used defuzzi-
fication methods. In the literature, it is also called the centroid or centre of area
method. Its basic principle is to find the point $y^*$ in the output universe of discourse $Y$ where a vertical line would give two equal masses, as shown in Figure 2.4(a). Mathematically, the COG technique provides a crisp output value $y^*$ based on calculation of the center of gravity of the fuzzy set, which is defined by the expression [47]
\[
y^{*} = \frac{\int \mu_C(y)\, y \, dy}{\int \mu_C(y)\, dy}, \tag{2.8}
\]
where $\int \mu_C \, dy$ refers to the area bounded by the curve $\mu_C$.
Center-Average Defuzzification
This method can only be used for fuzzy sets with symmetrical output MFs. De-
spite this restriction, it is still one of the most frequently used methods due to its
computational efficiency. It is also effective when used with the TS FIS. In order to
find the defuzzified value, each membership function is weighted by height. The
defuzzified value can be expressed as [47]
\[
y^{*} = \frac{\sum \mu_C(y)\, y}{\sum \mu_C(y)}, \tag{2.9}
\]
where y represents the centroid of each symmetric MF. Figure 2.4(b) illustrates this
operation.
Mean of Maximum (MOM) Defuzzification
This is also called the middle-of-maxima (MOM) method. The defuzzified out-
put is generated by calculating the mean or average of the fuzzy conclusions with
maximum membership values. This method is defined by the expression [48]
\[
y^{*} = \frac{\sum_{i=1}^{N} y_i}{N}, \tag{2.10}
\]
where $y_i$ is the fuzzy conclusion that has the maximum membership value, and $N$ is the number of qualified fuzzy conclusions. For example, the crisp value of the fuzzy set using MOM for the case specified by Figure 2.4(c) is given by
\[
y^{*} = \frac{y_1 + y_2}{2}.
\]
Figure 2.4: Graphical representation of different defuzzification methods: (a) Center of Gravity (COG), (b) Center-Average, (c) Mean of Maximum (MOM), (d) Max Membership Principle.
Max Membership Principle
This is also called the height method, and it is a defuzzification technique that
gives the output $y^*$ as the point in the output universe of discourse such that $\mu_C(y)$ achieves its maximum value. This method can be defined as follows [47]:
\[
\mu_C(y^{*}) \geq \mu_C(y) \quad \forall y \in Y, \tag{2.11}
\]
where $y^*$ is the point at which the output fuzzy set $C$ attains its height. This method is only
applicable when the height is unique, as shown in Figure 2.4(d).
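To illustrate how these defuzzifiers behave numerically, the sketch below (my own example, not from the thesis) samples an arbitrary aggregated output MF on a grid and computes discrete approximations of the COG and MOM values of Equations (2.8) and (2.10).

import numpy as np

# Sampled output universe of discourse and an arbitrary aggregated output MF.
y = np.linspace(0.0, 100.0, 1001)
mu_C = np.maximum(np.exp(-((y - 40.0) / 10.0) ** 2),
                  0.6 * np.exp(-((y - 70.0) / 8.0) ** 2))

# Center of Gravity, Equation (2.8), approximated with a discrete sum.
y_cog = np.sum(mu_C * y) / np.sum(mu_C)

# Mean of Maximum, Equation (2.10): average of the points with (near-)maximal membership.
max_mask = np.isclose(mu_C, mu_C.max())
y_mom = y[max_mask].mean()

print(f"COG: {y_cog:.2f}, MOM: {y_mom:.2f}")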
2.4 Reinforcement Learning
RL is considered a subfield of machine learning [49], and has attracted great in-
terest from researchers due to its ability to manage a wide range of practical appli-
cations, including artificial intelligence and control engineering problems. RL con-
trols an agent in an unknown environment, so a specific objective is accomplished
by mapping situations to actions. The agent is the learner and decision-maker,
and everything outside the agent that interacts with it is called the environment
[16]. In RL, the learner is neither told which action to take nor given its desired output, as is the case in most other machine learning techniques. Instead, it
explores various actions and, by trying them, selects those that offer the most re-
wards [50], which are numerical values that the learning agent seeks to maximize
over the long run. RL differs from supervised learning methods. In particular, in
supervised learning methods the agent learns how to achieve a goal by depending
on a set of input/output data provided by a teacher or an expert. However, with
RL the learning process is achieved through interaction between the learning agent
and the environment, without needing a training data set. The learning agent inter-
acts with the environment by taking an action that depends on its control strategy,
then the environment transitions to a new state and the action is evaluated. The
evaluation of an action is based on the received reward. The agent-environment
interaction continues to teach the agent how to achieve the intended goal. To be
more specific, assume that an agent-environment interaction is achieved at discrete
time steps, i.e., t = 0, 1, 2, · · · , such that at time step t the agent observes the state
of the environment, st ∈ S, where S refers to the set of possible states, and takes an
action at ∈ A, where A points to the set of available actions for the learning agent
in state st. As a result of its action, the agent transitions to a new state, st+1 ∈ S, at
the next time step, and receives a numerical reward, rt+1 ∈ R, from the interactive
environment. The agent-environment interaction is summarized by Figure 2.5.
Figure 2.5: Agent-environment interaction in RL [16].
In RL, the agent creates a mapping, πt, from states to the probabilities of choos-
ing each admissible action. The mapping πt is called the agent’s policy, and πt(s, a)
refers to the probability that the learning agent will select the action at = a when
the environment’s state is st = s . RL methods explain how an agent can use expe-
rience to change its policy in order to achieve the goal of getting as much reward
as it can over time.
2.4.1 Future-Discounting Reward
In RL, the agent’s goal is to maximize the long-term rewards, rather than the im-
mediate reward at each time step. If the sequence of rewards that an agent receives
after time step t is denoted by r = [rt+1, rt+2, rt+3, · · · ], then the agent should max-
imize the expected return reward Rt, which is a function of the reward sequence
r. In RL, the tasks can be grouped into two categories, depending on whether or
not the agent-environment interaction can be divided into episodes [16]. The first
category is known as episodic tasks and each episode has a terminal state, while
the second category is called continuing tasks that cannot be readily divided into
episodes. For episodic tasks, the return reward Rt can be defined as the sum of
received rewards, as follows
Rt = rt+1 + rt+2 + rt+3 + · · ·+ rτ , (2.12)
where τ is the terminal time. However, for continuing tasks the terminal time would be $\infty$ and the sum in Equation (2.12) could be infinite. According to the concept of
future-discounting [16], the agent attempts to select actions that maximize the sum
of the received discounted-future rewards. Hence, the return reward, Rt is given
by:
\[
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \tag{2.13}
\]
where $\gamma$ ($0 < \gamma \leq 1$) is the discount factor. As seen in Equation (2.13), discounting future rewards with $\gamma < 1$ ensures that the sum remains finite as long as each element in $r$ is bounded. Also, Equation (2.13) shows how the value of the
discount factor affects the contribution of each future reward on the return. De-
pending on the value of the discount factor, an agent can have either a nearsighted
or farsighted perspective on maximizing its rewards. If γ approaches 0, the agent
will be nearsighted; that is, only concerned about maximizing the immediate re-
ward rt+1. However, as γ approaches 1 the agent becomes farsighted and will take
the future rewards into account more strongly.
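The effect of the discount factor in Equation (2.13) is easy to see numerically; the short sketch below (an illustration of mine, using an arbitrary reward sequence) compares a nearsighted and a farsighted return.

def discounted_return(rewards, gamma):
    # Return R_t of Equation (2.13) for a finite reward sequence starting at t+1.
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]         # arbitrary example sequence
print(discounted_return(rewards, gamma=0.1))     # nearsighted: late rewards barely count
print(discounted_return(rewards, gamma=0.95))    # farsighted: late rewards matter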
2.4.2 Value Function
In RL, the value function is important, and most RL algorithms are based on
estimating either a state-value function V (s) or an action-value function Q(s, a)
[16]. There are three widely used value-function algorithms in RL [51]: the actor-
critic algorithm [52] to estimate V (s), Q-learning [53] and Sarsa [54] algorithms
to estimate Q(s, a). The state-value function provides an indication of how effective
it is for the learning agent to be in state s, and Q(s, a) is used to indicate how good
it is for the learning agent to take an action a in state s [16]. The value functions
are normally defined in relation to certain policies [16]. Suppose there is a policy
π that maps a state s ∈ S, and an action a ∈ A(s) to a probability value π(s, a),
then the value of state s under policy π can be expressed by V π(s), which can be
represented by the expected total reward when the learning agent starts in state s
and follows policy π thereafter. V π(s) can be given in terms of the expected sum of
discounted rewards as follows [16]:
\[
V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right\}, \tag{2.14}
\]
where $E_{\pi}\{\cdot\}$ refers to the expected value if the learning agent follows policy $\pi$. Similarly, the action-value function under policy $\pi$ can be denoted by $Q^{\pi}(s, a)$, which can be represented by the expected total reward when the learning agent takes action $a$ in state $s$, and then follows policy $\pi$. $Q^{\pi}(s, a)$ is given by
\[
Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s, a_t = a \right\}. \tag{2.15}
\]
For continuous state and action spaces, the value functions V π(s) and Qπ(s, a) can
be approximated by applying one of the function approximation methods, and the
learning agent tunes the parameters of the functions to match the estimates with
the observations [16].
The main role of the value function is to convert the return reward Rt into a
recursive formula. Therefore, the problem of maximizing the long-term reward in
the form of Equation (2.14) is converted to a problem of maximizing the long-term
reward in terms of the value function at the next time step. In particular, using
Equation (2.14), we have
\[
V^{\pi}(s) = E_{\pi}\left\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \;\middle|\; s_t = s \right\} = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}, \tag{2.16}
\]
where s ∈ S for the continuing task or s ∈ S+ for an episodic task, where S+ refers
to the set of all states, including terminal ones. As shown by Equation (2.16), the
value of the current state depends on the value of the next state. Equation (2.16) is
known as the Bellman equation for V π(s).
The main task of RL is to find the appropriate policy that maximizes the agent
reward over the long run, Rt . If the expected return of a policy π is greater than
or equal to the expected return of policy π′, then policy π is better than or equal
to policy π′. In RL, at least one policy is always better than the others; it is known
as the optimal policy and is denoted by π∗. Therefore, the optimal value of the
state-value function is given by
\[
V^{*}(s) = \max_{\pi} V^{\pi}(s), \tag{2.17}
\]
for all $s \in S$, and the optimal value of the action-value function is
\[
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \tag{2.18}
\]
for all s ∈ S and a ∈ A(s). It is possible to rewrite Q∗(s, a) in terms of V ∗(s) as
follows [16]:
Q∗(s, a) = E{rt+1 + γV ∗(st+1) | st = s, at = a}. (2.19)
It is also possible to write V ∗(s) in a form similar to the Bellman equation, as fol-
lows:
\[
V^{*}(s) = \max_{a \in A(s)} Q^{\pi^{*}}(s, a) = \max_{a} E\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \}. \tag{2.20}
\]
Equation (2.20) is known as the Bellman optimality equation for V ∗(s) [16]. On
the other hand, the Bellman optimality equation for Q∗(s, a) is given by
\[
Q^{*}(s, a) = E\left\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s, a_t = a \right\}, \tag{2.21}
\]
where a′ refers to the action selected by the learning agent at the next state st+1,
which gives the maximum reward.
2.4.3 Temporal Difference Learning
Temporal-Difference (TD) learning is a key idea in RL that combines the notions of Dynamic Programming (DP) and Monte Carlo (MC) methods. TD methods are similar to MC methods in that they learn from experience
without needing an environment dynamics model. Also, similar to DP methods,
TD learning methods update estimates by using other learned estimates and do not
need to wait for a final result [16]. TD learning methods are used to estimate the
value function when it is initially unknown and needs to be learned through agent-
environment interaction. The estimate occurs as follows: at each time step the
so-called TD-error, $\Delta_t$, is calculated, then the value function is updated to reduce the TD-error. Therefore, TD learning can be defined by
\[
V_{t+1}(s_t) = V_t(s_t) + \alpha \Delta_t, \tag{2.22}
\]
where $\alpha$ ($0 < \alpha \leq 1$) is the learning rate parameter. The TD-error is given by
\[
\Delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t). \tag{2.23}
\]
The TD learning method is considered as a bootstrapping technique, because the
Algorithm 2.1 TD Learning algorithm.
repeat (for each episode)
    Initialize s
    repeat (for each time step t)
        For state s, choose an action a based on the policy π.
        Take action a; observe the next state s′ and reward r.
        Calculate V(s) from Equation (2.22).
        Set s ← s′.
    until (s is terminal)
until (finish all episodes)
updating process is based on the estimate at the next time-step. The TD learning
algorithm is given in Algorithm 2.1 [16].
It is possible to estimate the action-value function Q(s, a) for all states and ac-
tions based on the TD-error. The estimation approach is as before, but rather than
finding V (s), Q(s, a) is calculated instead. In this respect, there are two different
algorithms for updating Q(s, a). With the first algorithm, known as Sarsa [54], the
update is based on the action at+1. The update rule for Sarsa is defined by
\[
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \Delta_t, \tag{2.24}
\]
where $\alpha$ is as defined earlier, and $\Delta_t$ is given by
\[
\Delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t). \tag{2.25}
\]
This updating rule is followed after every non-terminal state transition st. Oth-
erwise Q(st+1, at+1) is set to zero. The Sarsa algorithm is given in Algorithm 2.2
[16].
The second algorithm uses the greedy action at the next time step, instead of
Algorithm 2.2 Sarsa Learning algorithm.
Q(s, a) is initialized arbitrarily ∀ s ∈ S, a ∈ A
repeat (for each episode)
    Initialize s.
    For state s, choose an action a based on a certain policy (e.g., ε-greedy).
    repeat (for each time step t)
        Take action a; observe the next state st+1 and reward r.
        For state st+1, choose an action at+1 based on a certain policy (e.g., ε-greedy).
        Calculate Q(s, a) from Equation (2.24).
        Set s ← st+1.
        Set a ← at+1.
    until (s is terminal)
until (finish all episodes)
action at+1. This is called Q-learning and was first defined by Watkins [53], and
it is one of the most popular types of RL learning algorithms. It starts with an
arbitrary initial action-value function, then updates Q(s, a) using a set of data tuples
generated by agent-environment interaction. The data tuples include $(s_t, a_t, s_{t+1}, r_{t+1})$, and the update rule for the Q-learning method is given by Equation (2.24),
though the calculation of $\Delta_t$ is different. For the Q-learning algorithm, $\Delta_t$ is defined as follows:
\[
\Delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t). \tag{2.26}
\]
The Q-learning algorithm is given in Algorithm 2.3 [16].
For discrete state and action space cases, satisfying the following conditions
will ensure convergence of the Q-learning algorithm to the optimal value Q∗. The
conditions are [55]:
1. Visit all the state-action pairs an infinite number of times.
2. $\sum_{t=0}^{\infty} \alpha_t = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^{2} < \infty$.
Algorithm 2.3 Q-learning algorithm.
Q(s, a) is initialized arbitrarily ∀ s ∈ S, a ∈ A.
repeat (for each episode)
    Initialize s.
    repeat (for each time step t)
        For state s, choose an action a based on a certain policy (e.g., ε-greedy).
        Take action a; observe the next state st+1 and reward r.
        Calculate Q(s, a) from Equation (2.24).
        Set s ← st+1.
    until (s is terminal)
until (finish all episodes)
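For readers who prefer code, the sketch below mirrors Algorithm 2.3 in Python (a minimal illustration of mine, not part of the thesis; it assumes a generic discrete environment object env with reset() and step(a) methods that return integer states, a reward and a terminal flag).

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning following Algorithm 2.3 and Equations (2.24) and (2.26).
    Q = np.zeros((n_states, n_actions))          # arbitrary initial action-value function
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD-error of Equation (2.26); a terminal next state contributes no future value
            delta = r + (0.0 if done else gamma * np.max(Q[s_next])) - Q[s, a]
            Q[s, a] += alpha * delta             # update rule of Equation (2.24)
            s = s_next
    return Q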
2.4.4 Actor-Critic Methods
The actor-critic based methods are TD techniques in which the actor refers to
the policy structure used for action selection, and the critic refers to the estimated
value function used for criticizing the policy. The critic's evaluation is expressed as a TD-error, which is used to drive the learning processes of the actor and the critic, as shown in Figure 2.6.
Figure 2.6: Actor-critic structure.
The critic is usually a state-value function, V (s), and each time the learning
agent selects an action at while in current state st, the critic evaluates the resulting
new state st+1 to decide whether the performance of the learning agent has im-
proved or deteriorated. The evaluation is based on the TD-error, which is defined
in Equation (2.23). By applying the value of the TD-error it is possible to evaluate
the action chosen by the learning agent in its current state. If the error is positive,
the current action should be strengthened, and if negative, the action should be
weakened [16]. Actor-critic methods are widely used due to their applicability in
RL problems that have continuous state and action spaces, by using some form of
function approximation, such as FLC [56], neural networks [16] or linear approxi-
mation [57].
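One minimal tabular realization of this structure is sketched below (my own illustration, not the FACL or CACLA algorithms used later in the thesis): the critic maintains V(s), the actor maintains action preferences turned into a softmax policy, and the TD-error of Equation (2.23) strengthens or weakens the chosen action.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def select_action(prefs, s):
    # Sample an action from the softmax policy over the actor's preferences.
    p = softmax(prefs[s])
    return int(np.random.choice(len(p), p=p))

def actor_critic_step(V, prefs, s, a, r, s_next, done,
                      alpha_critic=0.1, alpha_actor=0.1, gamma=0.95):
    # One actor-critic update driven by the TD-error of Equation (2.23).
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]                 # TD-error produced by the critic
    V[s] += alpha_critic * delta          # critic update
    prefs[s, a] += alpha_actor * delta    # positive delta strengthens the action, negative weakens it
    return delta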
2.5 Particle Swarm Optimization
PSO is one of the most active research areas in the field of computational swarm
intelligence [58]. Developed by Eberhart and Kennedy [59–61] in 1995, PSO is
a population-based stochastic optimization method inspired by the social behaviour of bird flocking and fish schooling, and was originally designed for con-
tinuous optimization problems [62]. PSO has become an attractive optimization
method for solving a wide spectrum of optimization problems and other problems
that can be easily converted to optimization ones [63]. Its popularity is largely due
to its simplicity, efficiency and the fact that it has only a few parameters to adjust.
It has been successfully applied in various application areas, including pattern clas-
sification, robotic application, system design, multi-objective optimization, image
segmentation, power systems, games, system identification and electric circuitry
design [63, 64]. PSO shares many characteristics with evolutionary optimization
methods such as a Genetic Algorithm (GA), but they are different in how they ex-
plore multidimensional search spaces. It was determined that the PSO algorithm
has similar or better performance than GA [65–70]. However, both methods are
initialized with a population of random solutions in the search space, and the pop-
ulation then moves stochastically in the space to reach an optimum solution by
updating generations [71]. In PSO, the population is called a swarm and represents
candidate solutions in the search space. Each candidate solution is called a particle.
The swarm can be defined as a set of $N_p$ particles as follows:
X = {X1, X2, ..., XNp},
where Xi refers to the position of the ith particle in the search space. Typically,
the PSO algorithm starts with a swarm X, where all its particles are randomly
positioned in D-dimensional search space A ⊂ RD. Then, Xi can be represented by
Xi = (xi1, xi2, ..., xiD) ∈ A, i = 1, 2, ..., Np.
Each particle has a random velocity that directs the particle to fly across the search
space, and this is represented as
Vi = (vi1, vi2, ..., viD), i = 1, 2, ..., Np.
Furthermore, each particle has a fitness value that is calculated according to the
problem fitness function, and is used to measure how close the particle is to the
optimum solution. The PSO algorithm uses two primary updating formulas: one
to update the velocity of each particle, and the other to update their positions. To
update a particle’s velocity, each particle moving through the search space keeps
track of the two best positions in the space with their fitness values [72]; this is
also called the particle experience. The first position is the personal best Pbest that
has been found so far by the particle. The personal best position of the ith particle
is represented as Pbesti = (pbesti1, pbesti2, ..., pbestiD). The second position is the
best global position, Gbestg, that has been found so far among all particles in the
swarm. The symbol g refers to the index of the best particle such that Gbestg =
(gbestg1, gbestg2, ..., gbestgD). Therefore, each particle will depend on its experience
to calculate the new values of the particle’s velocity and position in the search space.
This type of experience mechanism does not exist in GA, or in any evolutionary
optimization methods in general. To update Pbest and Gbestg, it is necessary to
calculate the fitness value of each particle at time step $t$. Let $f$ denote the problem fitness function that should be maximized; then $Pbest_i(t)$ is calculated as follows:
\[
Pbest_i(t) =
\begin{cases}
X_i(t) & : \text{if } f(X_i(t)) \geq f(Pbest_i(t-1)), \\
Pbest_i(t-1) & : \text{otherwise.}
\end{cases} \tag{2.27}
\]
Therefore, for a swarm with $N_p$ particles, $Gbest_g(t)$ is calculated by selecting the
particle with the best Pbest. After calculating Pbest(t) and Gbestg(t), the ith particle
can update its velocity and position in dimension d using the following formulas:
vid(t+ 1) = vid(t) + c1 · r1 · (pbestid(t)− xid(t)) + c2 · r2 · (gbestgd(t)− xid(t)), (2.28)
and
xid(t+ 1) = xid(t) + vid(t+ 1), (2.29)
for i = 1, · · · , Np, and d = 1, · · · , D, where vid is the velocity at dimension d of
particle i, pbestid is the best position found so far by particle i at dimension d,
and gbestgd is the global best position found so far at dimension d. The scalars c1
and c2 are learning factors [73] or weights of the attraction to pbestid and gbestgd,
respectively, and are also called the cognitive and social learning parameters. The
second term of Equation (2.28) represents what is known as the cognition term,
and indicates the thinking of the ith particle on how to learn based on its own experience, while the third term represents what is called the social term, and indicates the
collaboration among the particles in the swarm [74, 75]. The term xid is the current
position of particle i at dimension d, and r1 and r2 are uniform random numbers
in the range [0,1] that are generated at each iteration for every particle and every
dimension to explore the search space. This process is repeated until a specific
stopping criterion is achieved, such as when the maximum number of iterations specified in the algorithm is reached, or when the velocity update falls below a specific value.
The PSO algorithm is given in Algorithm 2.4.
There are several modifications of the original PSO algorithm, and the most
widely used are the PSO with inertia weight [74] and the PSO with constriction
factor [76].
2.5.1 Particle Swarm Optimization with Inertia Weight
In 1998, Shi et al. [68, 74] modified the original PSO algorithm by introducing
a new factor called the inertia weight, which is responsible for keeping the particle
moving in the same direction as in the previous iteration. A large inertia weight is
applied to facilitate a global search capability (i.e., to encourage exploration), and a
small weight is used to assist local search capabilities [77]. Thus, the inertia weight
Algorithm 2.4 PSO Algorithm
1. Set t← 0.
2. Set the number of particles, Np, in the swarm X.
3. For each particle i ∈ 1, · · · , Np, Do
(a) Randomly initialize the position and the velocity of the ith particle, Xi
and Vi, respectively.
(b) Set the initial personal best position of the ith particle the same as its
initial position, Pbesti(t) = Xi.
4. end for
5. Repeat
(a) For each particle i ∈ 1, · · · , Np, Do
i. Find the personal best position, Pbesti, and the fitness value,
f(Xi(t)).
(b) end for
(c) Sort the entire particles according to their fitness values.
(d) Find the global best position and its fitness value.
(e) Update the velocity of each particle from Equation (2.28).
(f) Update the position of each particle from Equation (2.29).
(g) Set t← t+ 1.
6. Until (some condition is satisfied)
is responsible for maintaining a balance between global and local search abilities,
so an optimum solution can be found with fewer iterations [78]. Particles can
update their velocities according to the inertia weight method as follows [74]:
vid(t+1) = w ·vid(t)+c1 ·r1 · (pbestid(t)−xid(t))+c2 ·r2 · (gbestgd(t)−xid(t)), (2.30)
where w is the inertia weight. With this modification it is better to decrease the
value of w over time, which gives the PSO algorithm higher performance compared
to constant inertia settings. The most common way is to start with a large value
of w to encourage exploration in the early stages of the optimization process, and
then decrease it linearly towards zero to remove the oscillatory behaviour in later
stages, or to do more local search activities near the end of the optimization process
[77]. The chosen starting value of the inertia weight is usually slightly larger than
1, and the final value is close to 0 (e.g. 0.1) to prevent the first term of Equation
(2.30) from disappearing. The linearly decreasing strategy of inertial weight can be
described mathematically as follows [58, 79]:
\[
w(t) = w_{max} - (w_{max} - w_{min}) \cdot \frac{t}{t_{max}}, \tag{2.31}
\]
where t is the current iteration counter and tmax is the maximum allowed iteration
counter. Also, wmax and wmin refer to the maximum inertia weight at iteration t = 0
and the minimum inertia weight at iteration t = tmax, respectively.
There are other approaches proposed in the literature to dynamically adapt the
value of w in each iteration, such as random, chaotic random, chaotic linear de-
creasing and fuzzy adaptive [80, 81].
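A compact Python sketch of PSO with a linearly decreasing inertia weight (Equations (2.29)–(2.31)) is given below; it is my own illustration on an arbitrary sphere-like fitness function, not the FLC-tuning setup used later in the thesis.

import numpy as np

def pso(fitness, dim=2, n_particles=20, t_max=100,
        c1=2.0, c2=2.0, w_max=0.9, w_min=0.1, bounds=(-5.0, 5.0)):
    # PSO with linearly decreasing inertia weight, Equations (2.29)-(2.31).
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, dim))      # particle positions
    v = np.random.uniform(-1.0, 1.0, (n_particles, dim))   # particle velocities
    pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[np.argmax(pbest_f)].copy()
    for t in range(t_max):
        w = w_max - (w_max - w_min) * t / t_max             # Equation (2.31)
        r1 = np.random.rand(n_particles, dim)
        r2 = np.random.rand(n_particles, dim)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Equation (2.30)
        x = x + v                                           # Equation (2.29)
        f = np.array([fitness(p) for p in x])
        improved = f >= pbest_f                             # personal-best update, Equation (2.27)
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[np.argmax(pbest_f)].copy()
    return gbest

# Example: maximize an arbitrary inverted sphere function whose optimum is the origin.
print(pso(lambda p: -np.sum(p ** 2)))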
2.5.2 Particle Swarm Optimization with Constriction Factor
Similar to the PSO with inertia weight, another variant of the original PSO al-
gorithm was developed by Clerc et al. [76] in 2002. They made the two learning
factors c1 and c2 dependent, and the relation between them can be defined by using
a new coefficient called the constriction factor χ, which is given by
\[
\chi = \frac{2}{\left| 2 - \psi - \sqrt{\psi^{2} - 4\psi} \right|}, \tag{2.32}
\]
where $\psi = c_1 + c_2$ such that $\psi > 4$. In [82, 83], it was shown that using the constriction factor method is necessary to ensure convergence of the PSO algorithm.
Most implementations of this method use equal values for learning factors c1 and c2,
and c1 = c2 = 2.05 such that χ = 0.729 is considered the default parameter setting
for this variant. Particles can update their velocities according to the constriction
factor method as follows [84]:
vid = χ · [vid + c1 · r1 · (pbestid − xid) + c2 · r2 · (gbestgd − xid)]. (2.33)
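As a quick check of the default setting quoted above, the snippet below (illustrative only) evaluates Equation (2.32) for c1 = c2 = 2.05.

import math

c1 = c2 = 2.05
psi = c1 + c2                                               # psi = 4.1 > 4
chi = 2.0 / abs(2.0 - psi - math.sqrt(psi ** 2 - 4.0 * psi))
print(round(chi, 4))                                        # approximately 0.7298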
2.6 The Kalman Filter
The Kalman filter represents one of the most useful and widely used estimation
techniques for estimating the true state of interest in the presence of noise. It
provides an optimal estimate by minimizing the estimated error between the actual
and estimated state, if certain conditions are satisfied [85]. More specifically, for a
dynamical system having a state vector $x$, Kalman filtering provides an iterative method to find an estimate for $x$, denoted by the estimated state vector $\hat{x}$, by invoking consecutive data inputs or measurements [85]. In fact, the output of the Kalman filter can be considered as a Gaussian probability density function with mean $\hat{x}$ and an estimation error covariance matrix $P$. Thus, this probabilistic nature
allows the Kalman filter to consider uncertainties that may arise in the system, such
as motion model uncertainty and noisy measurements. The Kalman filtering process
starts with an initial estimate of both $x$ and $P$, which can be denoted by $\hat{x}(0)$ and $P(0)$, respectively. Then, through the iterative process, the Kalman filter reduces the variance between $x$ and $\hat{x}$, and as more and more measurements become available, $\hat{x}$ approaches $x$ very quickly. The Kalman filtering process keeps track of both $\hat{x}$ and $P$.
2.6.1 The System Dynamic Model
In Kalman filtering, the system dynamic is assumed to be linear and Gaussian.
In other words, the state evolution is a linear function of the state variables and
the control inputs, whereas the measurements are a linear function of the state
variables. Also, it is assumed that there are zero-mean white Gaussian noises in both
the process and the measurement model. Consider a dynamic system described by
a discrete-time state space model in the form [85]:
x(k + 1) = F (k)x(k) +G(k)u(k) + v(k), (2.34a)
y(k) = H(k)x(k) + w(k), (2.34b)
where the vectors $x(k) \in \mathbb{R}^{n}$, $u(k) \in \mathbb{R}^{p}$ and $y(k) \in \mathbb{R}^{m}$ refer to the system's full
state, input, and output, respectively. The matrix F (k) ∈ Rn×n is used to describe
the system’s dynamics and is called the state transition matrix or the fundamental
matrix, the matrix G(k) ∈ Rn×p is used to describe how the optional control input,
u(k), affects the state’s evolution and is named the control matrix, and H(k) ∈
Rm×n refers to the measurement or output matrix and is used to map the state
vector into the measurement. The vectors $v(k) \in \mathbb{R}^{n}$ and $w(k) \in \mathbb{R}^{m}$ are called the
process and the measurement noises, respectively. They are defined as zero-mean
white Gaussian noises with covariance matrix Qf (k) and Rf (k), respectively, and
are assumed to be independent of each other [86]:
\[
E\left[v(i)v(k)^{T}\right] =
\begin{cases}
Q_f(k) & : \text{if } i = k, \\
0 & : \text{if } i \neq k,
\end{cases} \tag{2.35a}
\]
\[
E\left[w(i)w(k)^{T}\right] =
\begin{cases}
R_f(k) & : \text{if } i = k, \\
0 & : \text{if } i \neq k,
\end{cases} \tag{2.35b}
\]
\[
E\left[v(i)w(k)^{T}\right] = 0, \quad \text{for all } i \text{ and } k. \tag{2.35c}
\]
As an example, Newton's equations of motion give a simple dynamic system model to describe the motion of an object moving in a plane with a constant velocity. In this thesis, the moving object represents the evader, whose dynamics are unknown to the pursuer. To estimate the position (i.e., the xy-coordinates) of this object using
the Kalman filter, the following model can be used
\[
x(k+1) =
\begin{bmatrix}
x_o(k+1) \\ y_o(k+1) \\ v_{xo}(k+1) \\ v_{yo}(k+1)
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & T & 0 \\
0 & 1 & 0 & T \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
x_o(k) \\ y_o(k) \\ v_{xo}(k) \\ v_{yo}(k)
\end{bmatrix}
+ v(k), \tag{2.36a}
\]
\[
y(k) =
\begin{bmatrix}
x_o(k) \\ y_o(k)
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
x_o(k) \\ y_o(k) \\ v_{xo}(k) \\ v_{yo}(k)
\end{bmatrix}
+ w(k), \tag{2.36b}
\]
where T represents the sampling time, (xo(k), yo(k)) refers to the object’s position
at time step $k$ and $(v_{xo}(k), v_{yo}(k))$ refers to its velocity components.
2.6.2 The Kalman Filtering Process
Like any probabilistic estimation method, the Kalman filtering process is a two-
step iterative algorithm of prediction and update. In the prediction step, the Kalman
filter uses the process model with the current state estimate $\hat{x}$ and covariance $P$ to predict
the new value of the system state and its covariance. The resulting predicted values
are usually imperfect due to the error in the process model. In the update step, the
Kalman filter corrects the prediction by taking into consideration the actual reading
coming from the measurement sensors by calculating what is called the Kalman
gain, K. Thus, the results of the update step represent the new estimate of the
true state of the system, together with its covariance. After that, the outputs of
the update step become the inputs to the prediction step and the iterative process
continues. Therefore, the Kalman filter estimation process can be considered as a
form of feedback control [86]. The Kalman filter equations can be divided into two
groups, and can be summarized as follows [85]:
1. Prediction step equations:
\[
\hat{x}(k+1|k) = F(k)\hat{x}(k|k) + G(k)u(k), \tag{2.37a}
\]
\[
P(k+1|k) = F(k)P(k|k)F^{T}(k) + Q_f(k), \tag{2.37b}
\]
where the notation $\hat{x}(k_1|k_2)$ with $k_1 \geq k_2$ is used to denote the estimated value of the state at time step $k_1$ given values of the measurement at all times up to time step $k_2$.

2. Update step equations:
\[
\hat{x}(k+1|k+1) = \hat{x}(k+1|k) + K(k+1)\nu(k+1), \tag{2.38a}
\]
\[
P(k+1|k+1) = P(k+1|k) - K(k+1)H(k+1)P(k+1|k), \tag{2.38b}
\]
where
\[
\nu(k+1) = y(k+1) - H(k+1)\hat{x}(k+1|k), \tag{2.39a}
\]
\[
S(k+1) = H(k+1)P(k+1|k)H^{T}(k+1) + R_f(k+1), \tag{2.39b}
\]
\[
K(k+1) = P(k+1|k)H^{T}(k+1)S^{-1}(k+1), \tag{2.39c}
\]
where ν(k + 1) is called the residual or the innovation error vector, which re-
flects the difference between the actual and predicted measurements, $y(k+1)$ and $H(k+1)\hat{x}(k+1|k)$, respectively, and $S(k+1)$ is the residual covariance. Given these equations, the two-step iterative process can be described as shown in Figure 2.7. From this figure, one can see that $K$ plays an important role in computing $\hat{x}$ and $P$, and it represents a weighting factor to decide which to trust more, the sensor reading or the predicted estimate. If $K$ is a scalar, then its value
will be between 0 and 1. If K → 1, this gives an indication that the sensor readings
are more trusted than the predicted values. So, the Kalman filter gives more weight
Figure 2.7: Kalman filtering process: starting from initial values $\hat{x}(0)$ and $P(0)$, the prediction step (Equations (2.37a) and (2.37b)) and the update step (Equations (2.38)–(2.39)) are applied iteratively.
to the measurement. On the other hand, if K → 0, this means that the predicted
value is more trustworthy than the sensor readings. Therefore, the measurements
have less weight compared to the predicted value. Accordingly, this will make the
measurements have less impact on updating the estimates.
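A minimal NumPy sketch of this prediction/update cycle for the constant-velocity tracking model of Equation (2.36) is given below (my own illustration; the sampling time, noise covariances and initial values are hypothetical, not those used in the thesis).

import numpy as np

T = 0.1                                           # sampling time (assumed)
F = np.array([[1, 0, T, 0],
              [0, 1, 0, T],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)         # state transition matrix of Equation (2.36a)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)         # measurement matrix of Equation (2.36b)
Qf = 0.01 * np.eye(4)                             # process noise covariance (assumed)
Rf = 0.25 * np.eye(2)                             # measurement noise covariance (assumed)

def kalman_step(x_hat, P, y):
    # One prediction/update cycle, Equations (2.37)-(2.39), with no control input.
    x_pred = F @ x_hat                            # Equation (2.37a) with G(k)u(k) = 0
    P_pred = F @ P @ F.T + Qf                     # Equation (2.37b)
    nu = y - H @ x_pred                           # innovation, Equation (2.39a)
    S = H @ P_pred @ H.T + Rf                     # residual covariance, Equation (2.39b)
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain, Equation (2.39c)
    x_new = x_pred + K @ nu                       # Equation (2.38a)
    P_new = P_pred - K @ H @ P_pred               # Equation (2.38b)
    return x_new, P_new

# Illustrative use: an arbitrary initial estimate updated with one noisy position measurement.
x_hat, P = np.zeros(4), 10.0 * np.eye(4)
x_hat, P = kalman_step(x_hat, P, y=np.array([1.0, 0.5]))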
2.6.3 Fading Memory Filter (FMF)
For the implementation of the Kalman filter, the system model is presumed to be
precisely known, otherwise the filter may not give an appropriate state estimate and
may diverge [87]. So, to handle the modeling error that may be present in the system model, an FMF was proposed. The FMF is considered one of the generalizations of the Kalman filter. The FMF's principal idea is to give more weight to recent measurements and less weight to past measurements. This makes the filter more responsive to the recent measurements, and will accordingly make it more
robust to the modeling uncertainty [87].
For the FMF, $Q_f(k)$ and $R_f(k)$ are replaced by $\alpha_f^{-2k} Q_f(k)$ and $\alpha_f^{-2k+2} R_f(k)$, respectively, where $\alpha_f$ is a parameter ($\alpha_f \geq 1$) and is also called the weight factor,
and it is usually selected to be close to 1. Therefore, as k increases, both Qf (k)
and Rf (k) decrease, and this will give more weight to the recent measurement. If
αf = 1, the original Kalman filter is obtained. After substituting the new values
of Qf (k) and Rf (k) into the corresponding equations, Equation (2.37b), Equation
(2.38b), and Equation (2.39c) can be written as follows:
\[
\bar{P}(k+1|k) = \alpha_f^{2} F(k)\bar{P}(k|k)F^{T}(k) + Q_f(k), \tag{2.40}
\]
\[
\bar{P}(k+1|k+1) = \bar{P}(k+1|k) - K(k+1)H(k+1)\bar{P}(k+1|k), \tag{2.41}
\]
\[
K(k+1) = \bar{P}(k+1|k)H^{T}(k+1)S^{-1}(k+1), \tag{2.42}
\]
where $\bar{P}(k+1|k) = \alpha_f^{2} P(k+1|k)$ and $\bar{P}(k+1|k+1) = \alpha_f^{2} P(k+1|k+1)$.
From Equations (2.40) – (2.42), it is obvious that the FMF can be implemented
easily by using the same Kalman filter equations, except that there is a simple mod-
ification that must be done on Equation (2.37b), and it is as follows:
\[
P(k+1|k) = \alpha_f^{2} F(k)P(k|k)F^{T}(k) + Q_f(k). \tag{2.43}
\]
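In code, this modification amounts to a one-line change to the covariance prediction of the Kalman sketch given earlier; the fragment below (illustrative, with a hypothetical weight factor) implements Equation (2.43).

alpha_f = 1.01                                    # weight factor close to 1 (assumed)

def fmf_predict_covariance(P, F, Qf):
    # Covariance prediction with fading memory, Equation (2.43).
    return alpha_f ** 2 * (F @ P @ F.T) + Qf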
2.7 Literature Review
Over the last few decades PE games have received great interest due to their practi-
cal importance, and several types of PE games have been developed, including the homicidal-
chauffeur game and the game of two cars [1]. The optimal control strategies for
the homicidal-chauffeur game and the game of two cars are given in [1, 88, 89].
Isaacs also solved two-player PE games with a slower evader analytically [1], us-
ing the tenet of transition method. This method uses a partial differential equation
known as the Hamilton-Jacobi-Isaacs (HJI) to solve two-player, zero-sum games. It
is based on backward analysis, in that it starts from the terminal state of the game
and traces the optimal trajectory of the states backward. However, current theories
for differential games are unable to deal with the problem of the multi-player PE
differential games, as it is difficult to determine the terminal states of a game. This
difficulty arises because some evaders are not captured, or, if all evaders are cap-
tured, it might have occurred at different times. It could also be due to the increase
in possible engagements between pursuers and evaders as the number of players
increases. For example, suppose that, in a three player PE game with two pursuers
and one evader, one of the pursuers catches the evader, thereby leaving the sec-
ond pursuer with an unknown terminal state and rendering the backward analysis
used by Isaacs [1] unable to solve the HJI equation. Furthermore, the problem will
become more complex when superior evaders are involved. There are four main
techniques in the literature that are commonly used to solve the problems of PE
games: optimal control, DP, RL and game theory [6].
The PE game is a differential game with continuous state and action spaces,
and its solution complexity increases as the number of players increases. Thus, it is
important to use learning techniques that help each player find its optimal strategy
through interaction with its environment, and effectively manage the continuous
state and action spaces. The RL learning method technique has been successfully
applied to differential games [17–19, 21, 90–92], and a function approximation
method such as FIS can manage continuous state and action spaces.
FLC represents a good choice to deal with processes that are ill-defined and/or
involve uncertainty or continuous change. Moreover, it is well known that FISs
can approximate any continuous function, and they are widely used as function
approximators [17, 47, 93–97]. FIS has a knowledge base consisting of a collection
of rules in linguistic form. The rules are used to specify variations in the system
parameters as inputs change [98]. Building a knowledge base is very complex,
particularly when there are many input and output parameters. To overcome this,
it is possible to use an optimization method, such as a genetic algorithm (GA) or a
PSO algorithm, to autonomously tune the parameters of the FLC [98–106]. In [98–
106], supervised learning techniques have been used to tune the FLC parameters.
However, these require a teacher or a training input/output data set which are
sometimes impossible or expensive to obtain [107]. Thus, it is preferable to use
the GA and the PSO algorithms in an unsupervised learning manner. This thesis
proposes using the PSO algorithm to tune the FLC parameters in this way. RL
methods have also been proposed to address the problem of tuning FLC parameters
in an unsupervised manner [17–21, 56, 108–110]. The use of RL with the FLC
is called fuzzy-reinforcement learning, and the technique has been proposed and
successfully applied to different types of problems [56, 108, 111–121], including
the differential game problems, such as the PE game [17–20, 122] and a guarding
territory game [21, 123, 124].
RL is one of the most widely used approaches for learning through interaction
with the environment, and it has attracted researchers in both the artificial intel-
ligence and machine learning fields [50]. Most of the RL algorithms are tabular
methods that use discrete state and action spaces, so they cannot be directly applied
to problems that have continuous state and action spaces, such as the PE differential
game problem. To solve this problem, the state and action spaces can be discretized
such that the resulting table is not too large. Some games are solved by converting
a differential game into a grid world game [125, 126], but for real-world problems
the resulting table or grid will be large. Furthermore, the discretization of state
and action spaces is not a simple task [56]. However, an appropriate function ap-
proximation method can be applied to avoid discretization and deal directly with
continuous state and action spaces [127], which means FIS can be used to gener-
alize state and action spaces. In [17], a fuzzy actor-critic RL algorithm is applied
to the PE differential game and validated experimentally, and the learning algo-
rithm is used to tune the consequent parameters of the FLC and the FIS. The FIS
is used as an approximation of the value function V (s), and it supposes that only
the pursuer learns its behaviour strategy, while the evader plays an optimal control
strategy [17]. In [21], FACL is applied to the guarding territory differential game,
and with this learning technique the consequent parameters are tuned to allow the
defender to learn its Nash equilibrium strategy. In [18–20], the QLFIS is applied
to the PE differential game, and all the premise and consequent parameters of the
FIS and the FLC are tuned. In addition, the FIS is used as an approximation of the
action-value function Q(s, a). The simulation results in [18, 20] have shown that
the QLFIS algorithm enabled the pursuer to catch the evader in minimum time. In
[19], the QLFIS was modified to teach the evader how to escape and avoid capture.
The modification added the distance between the pursuer and the evader as an in-
put to the evader’s FLC, and when the distance approached an assigned value the
evader makes sharp turns to escape if possible, or to, at least, maximize the capture
time. However, the work in [17–21] did not determine which set of parameters had
a significant impact on the performance of the learning algorithm; thus, this should
be investigated. In addition, the learning process normally requires a specific num-
ber of episodes before each learning algorithm achieves acceptable performance.
Sometimes the number selected is higher than needed, which makes the learning
time longer. This issue, as well as proposals for learning algorithms that can speed
up the learning process, are also considered in this thesis. Moreover, if the learn-
ing algorithms proposed in [17–20] are applied to a multi-pursuer single-inferior-evader PE game, the potential for collisions between pursuers increases, particu-
larly if they are near one another or approaching the evader; it will be necessary to
avoid these situations.
In multi-robot PE differential games, the environmental complexity and uncer-
tainty increases as the number of players increases. Eventually, it encounters the
so-called ‘curse of dimensionality’, where the state and action spaces grow expo-
nentially, rendering the problem intractable. In multi-robot PE differential games,
actions taken by each player depend not only on the current state of the game, but
also on the actions taken by the other players; this is called joint action. There are
several methods in the literature that have been used to solve the multi-player PE
differential game with slower evaders. For example, in [128] a hierarchical decom-
position method was used to solve a deterministic multi-player PE differential game
case. The game was decomposed into several two-player PE games, and the focus
was on minimizing the capture time (Tc) of all the evaders, where Tc = maxj{Tcj}
and Tcj denotes the required time to capture evader j. Backward analysis was used
in [128] to find the optimal strategy for each player in individual two-player PE
games. The main drawback of the hierarchical decomposition method is that the
engagement possibility between the pursuers and evaders increases exponentially
with the increase of players. Decentralized learning methods [129, 130] were re-
cently applied to the problem of the multi-player PE game, and in [129] Givigi et
al. modeled the multi-player PE game as a Markov game. Also, each player was
modeled as a learning automaton to learn how to cope with the challenging situa-
tion, and their learning algorithms converged to an equilibrium point. Simulation
results for the case of a three-player PE game in a grid-world in which two pur-
suers attempt to capture a single evader are also given to show the feasibility of
their learning algorithm. In the simulation, only the pursuers learn their behaviour
strategy while the evader follows a fixed strategy. A drawback of this learning al-
gorithm is that the computational requirements increase exponentially when the
number of players or the grid-world size increases. In [130], Desouky et al. ex-
tended their previous learning algorithm [131] from a single PE differential game
to an n-pursuer n-evader differential game. Their learning algorithm consists of
two phases: The decomposition phase decomposes the n-pursuer n-evader game
into n two-player PE games, with the Q-learning algorithm applied to learn the best
coupling among the players so each pursuer is coupled with only one evader. In the
second phase, the learning algorithm previously proposed in [131] is used to teach
each couple how to play the game and self-learn their control strategy. Simulation
results for n = 2 and n = 3 indicate that the pursuers find the best coupling, and
each player is able to learn its Default Control Strategy (DCS). The drawback to this
algorithm is that it is only applicable to continuous domain problems that can be
easily discretized, and discretized domain sizes that are not too large. To avoid this,
a fuzzy-reinforcement actor-critic structure can be used to deal with continuous
state and action spaces.
As mentioned earlier, although there are numerous papers discussing different
methods, including learning methods to solve the problem of the PE game with slow
evaders, there are only a few that address multi-player PE differential games with
superior evaders [8, 23–25, 132–136]. The authors in [8] proposed a sufficiency
condition to capture superior evaders. In [24, 132], decentralized approaches were
used to capture a superior evader in a noisy environment and with a jamming con-
frontation, respectively. In [133] formation control was applied to enable a group
of pursuers to cooperate to capture a superior evader. The focus is on the pur-
suer strategy that ensures invariant angle distribution around the evader, while it
is assumed that the superior evader follows a simple fixed strategy. In [134], a
hierarchical decomposition method was proposed to solve PE games with superior
evaders, and in [25, 135] an Apollonius circle method was used to solve the prob-
lem of a multi-player PE differential game with a superior evader. It was assumed
that each player has a simple motion, and that the game is played in an environ-
ment with perfect information; that is, the evader knows the state change of all the
pursuers, and vice versa. The problem is examined from the point of view of all
players in the game, and possible evasion and pursuit strategies are also discussed.
Most recently, Awheda et al. [23, 136] proposed two decentralized learning algo-
rithms for the PE game problem with a single-superior evader. The first algorithm
[23] enables a team of pursuers with equal speed to capture a single evader with a
similar speed. This learning algorithm is based on the condition that was proposed
in [8], as well as a specific formation control strategy. The second learning algo-
rithm [136] is based on fuzzy-reinforcement learning with Apollonius circles and a
modified formation control strategy. The goal of a superior evader is to learn how
to reach a specific target, and the goal of the pursuers is to learn how to cooperate
to capture the superior evader. However, when the distance between the evader
and the nearest pursuer is less than a specific tolerance distance, the strategy of
the evader is to run away from that pursuer. In [23, 136], there is a necessity to
calculate the capture angle to obtain the required control signal for each pursuer
to capture the superior evader. This thesis proposes a new reward function for-
mulation for the FACLA algorithm, that enables a group of pursuers to capture a
single evader in a decentralized manner and without finding the capture angle. It
is assumed that all the players have similar speeds.
Chapter 3
An Investigation of Methods of
Parameter Tuning for the QLFIS
3.1 Introduction
Fuzzy-reinforcement learning methods have recently been proposed to address the
problem of learning in differential games [17–21]. Reinforcement Learning (RL)
has been used successfully to tune the Fuzzy Logic Control (FLC) parameters for the
problems of the Pursuit-Evasion (PE) and for guarding territory differential game.
In [17], only the consequent parameters of the FLC and the Fuzzy Inference System
(FIS) are tuned in the PE game, using a Fuzzy Actor-Critic Learning (FACL) algo-
rithm. In [18–20], the Q-Learning Fuzzy Inference System (QLFIS) algorithm is
applied to the PE differential game, and all the premise and consequent parameters
of the FLC and the FIS are tuned. In [21], FACL is applied to the guarding territory
differential game, and only the consequent parameters are tuned.
In [17–21], the best parameters to tune were not investigated. Hence, this
chapter discusses whether it is necessary to tune all the premise and consequent
parameters of the FLC and the FIS, or to just tune their consequent parameters. As
is known, it would be computationally more efficient to tune only the consequent
parameters. However, the question is: would it cause a significant loss in perfor-
mance measures if only a subset of the available parameters were tuned? To address
this question, four methods of implementing the QLFIS algorithm are investigated.
The QLFIS consists of two FISs, one used as an FLC, and the other as a function
approximator to estimate the action-value function. The four methods are applied
to three versions of the PE games. In the first version only the pursuer is learning,
and in the second the evader uses its higher maneuverability and plays intelligently
against a self-learning pursuer. In the final version, all the players are learning. An
evaluation is given to determine which parameters are the best to tune and which
parameters have little impact on performance. Also, it is used to recommend the method that is most effective in reducing the computational time, an important factor for real-time applications, while still giving acceptable performance.
The results were published in [137]1.
This chapter is organized as follows: In Section 3.2, the problem statement is
discussed, and Section 3.3 describes the PE game and model. The structure of
the FLC is presented in Section 3.4. A brief introduction of RL is given in Section
3.5. The QLFIS is described in Section 3.6, and Section 3.7 presents the simulation
results. Finally, conclusions and guidelines are discussed in Section 3.8.
1 A. A. Al-Talabi and H. M. Schwartz, “An investigation of methods of parameter tuning for Q-learning fuzzy inference system,” in Proc. of the 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2014), (Beijing, China), pp. 2594–2601, IEEE, July 2014.
3.2 Problem Statement
Methods of parameter tuning for the QLFIS algorithm are investigated. Four pos-
sible methods are considered, based on whether it is necessary to tune all the pa-
rameters (i.e., premise and consequent) of the FIS and the FLC, or to tune only
consequent parameters, as explained in Table 3.1. Three PE games are considered
[20]: in the first game, the pursuer self learns its control strategy while the evader
uses a deterministic strategy, which is to escape along the Line-of-Sight (LoS); this is
defined as the Default Control Strategy (DCS). In the second game, the evader uses
its higher maneuverability and plays intelligently against a self-learning pursuer. In
the final game, two cases are taken into consideration. The first case is a single-
pursuer single-evader game, whereas the second is a multi-pursuer single-evader
game. In both cases, each pursuer interacts with the evader in order to self-learn
their control strategies simultaneously.
Table 3.1: Methods of parameter tuning.
                          The tuned parameters
             For FLC                   For FIS
 Method 1    Consequent parameters     Consequent parameters
 Method 2    Consequent parameters     All parameters
 Method 3    All parameters            Consequent parameters
 Method 4    All parameters            All parameters
3.3 Pursuit-Evasion Game
The application used for this study is the PE differential game [1]. There are two
teams in this game; each team has one or more participants (i.e., players) and
each team has its own goal. The first team is called the pursuer team, and the
second is the evader team. The goal of the first team is to capture the second
team’s participants as quickly as possible, and the second team’s goal is to run away
or prolong the capture time. Thus, this game can be considered an optimization
problem with conflicting objectives [2]. In the PE game, each player needs to learn
how to determine its control strategy at every time step, and to adapt to and interact
with uncertain or changing environments. Each player is modeled as a car-like
mobile robot, and the dynamic equations that describe its motion are given by [26]
\dot{x}_i = V_i \cos\theta_i, \qquad \dot{y}_i = V_i \sin\theta_i, \qquad \dot{\theta}_i = \frac{V_i}{L_i}\tan u_i,   (3.1)
where i is e for the evader and p for the pursuer, and (xi, yi), Vi, θi, Li and ui refer
to position, velocity, orientation, wheelbase and steering angle, respectively. The
steering angle is bounded and given by −uimax ≤ ui ≤ uimax , where uimax is the
maximum steering angle. The minimum turning radius of each robot is calculated
from
R_{d_{i_{min}}} = \frac{L_i}{\tan(u_{i_{max}})}.   (3.2)
It is assumed that the pursuer is faster than the evader (Vp > Ve), and at the same
time the evader is more maneuverable than the pursuer (upmax < uemax). As in [20],
the DCS for this game is defined by
u_i = \begin{cases} -u_{i_{max}}, & \delta_i < -u_{i_{max}}, \\ \delta_i, & -u_{i_{max}} \le \delta_i \le u_{i_{max}}, \\ u_{i_{max}}, & \delta_i > u_{i_{max}}, \end{cases}   (3.3)
where δi represents the angle difference, and is given by
\delta_i = \tan^{-1}\!\left(\frac{y_e - y_p}{x_e - x_p}\right) - \theta_i.   (3.4)
This control strategy allows the player to escape along the LoS. The distance be-
tween the pursuer and the evader at time t is given by
D(t) = \sqrt{(x_e(t) - x_p(t))^2 + (y_e(t) - y_p(t))^2},   (3.5)
and capture occurs when this distance is less than a certain value, ℓ, which is called
the capture radius.
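To make the model concrete, the following minimal Python sketch applies Equations (3.1) and (3.3)–(3.5) with Euler integration over a sampling time T. It is an illustration only (the thesis simulations were written in MATLAB); the function names are not from the thesis, and the wrap-around of the angle difference to [−π, π] is an added assumption.

```python
import math

def dcs_steering(xp, yp, xe, ye, theta_i, u_i_max):
    """Default control strategy (3.3)-(3.4): align with the line of sight from the
    pursuer to the evader, with the steering angle saturated at +/- u_i_max."""
    delta_i = math.atan2(ye - yp, xe - xp) - theta_i             # Eq. (3.4)
    delta_i = math.atan2(math.sin(delta_i), math.cos(delta_i))   # wrap to [-pi, pi] (added assumption)
    return max(-u_i_max, min(u_i_max, delta_i))                  # Eq. (3.3)

def kinematic_step(x, y, theta, u, V, L, T):
    """One Euler step of the car-like robot kinematics (3.1) over the sampling time T."""
    x_next = x + T * V * math.cos(theta)
    y_next = y + T * V * math.sin(theta)
    theta_next = theta + T * (V / L) * math.tan(u)
    return x_next, y_next, theta_next

def separation(xp, yp, xe, ye):
    """Pursuer-evader distance (3.5); capture is declared when this falls below the capture radius."""
    return math.hypot(xe - xp, ye - yp)
```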
3.4 Fuzzy Logic Controller Structure
For the problem under consideration, the fuzzy system is implemented using zero-
order Takagi-Sugeno (TS) model [46], in which each rule’s consequent is repre-
sented by a fuzzy singleton, which is a type of fuzzy set that contains only one
member whose degree of membership is unity. The fuzzy system has two inputs
and one output. For each learning agent, the two inputs are z1, and z2, which rep-
resent the angle difference δi and its derivative δi, respectively, and the output is
the steering angle ui. Each input has three Gaussian Membership Functions (MFs)
that have the linguistic values of P (positive), Z (zero) and N (negative). For the
Gaussian MF, the mean and standard deviation represent the possible tunable pa-
rameters. Thus, there are 2 × 6 = 12 parameters in all the MFs, and these are
called the premise parameters. The mean and standard deviation for the jth MF
of the input zi can be denoted by mij and σij, respectively. The number of rules
depends on the number of inputs and their corresponding MFs; in this case there
are two inputs, δi and its derivative δ̇i, and each input has three MFs. Thus, nine rules must be
built, each with one consequent parameter Kl. As a result, there are 21 parameters
(12 + 9) that can be tuned during the learning phase, as specified by Table 3.2. The
fuzzy output ui is defuzzified into a crisp output using the following center-average
defuzzification method:
u_i = \frac{\sum_{l=1}^{L} K_l \left( \prod_{i=1}^{N} \mu_{A_i^l}(z_i) \right)}{\sum_{l=1}^{L} \left( \prod_{i=1}^{N} \mu_{A_i^l}(z_i) \right)}.   (3.6)
This formula is also called the weighted average defuzzification method [138].
Table 3.2: Fuzzy logic parameters.

Premise parameters      σ11  m11
                        σ12  m12
                        σ13  m13
                        σ21  m21
                        σ22  m22
                        σ23  m23

Consequent parameters   K1
                        K2
                        ...
                        K9
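For illustration, a minimal Python sketch of this zero-order TS controller is given below. It is not the thesis code (which was written in MATLAB); the initial MF centres and spreads in the example are only approximate readings of Figure 3.1 and therefore assumptions, while the consequents follow the pursuer FDT of Table 3.3.

```python
import numpy as np

class ZeroOrderTSFLC:
    """Minimal zero-order TS fuzzy system with 3 Gaussian MFs per input and 9
    singleton consequents, defuzzified with the weighted average of Equation (3.6)."""

    def __init__(self, means, sigmas, K):
        self.means = np.asarray(means, dtype=float)    # m_ij, shape (2, 3)
        self.sigmas = np.asarray(sigmas, dtype=float)  # sigma_ij, shape (2, 3)
        self.K = np.asarray(K, dtype=float)            # K_l, shape (9,)

    def firing_strengths(self, z):
        """omega_l: product of the Gaussian memberships of the two inputs (3.19)."""
        z = np.asarray(z, dtype=float)
        mu = np.exp(-((z[:, None] - self.means) ** 2) / (self.sigmas ** 2))  # shape (2, 3)
        return np.outer(mu[0], mu[1]).ravel()                                # 9 rules

    def output(self, z):
        """Crisp output from the weighted-average defuzzification (3.6)."""
        w = self.firing_strengths(z)
        return float(np.dot(w, self.K) / np.sum(w))

# Initial pursuer FLC: consequents from Table 3.3(a); the MF centres and spreads are
# read only approximately from Figure 3.1, so these particular numbers are assumptions.
pursuer_flc = ZeroOrderTSFLC(
    means=[[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]],
    sigmas=[[0.6, 0.6, 0.6], [0.6, 0.6, 0.6]],
    K=[-0.50, -0.25, 0.00, -0.25, 0.00, 0.25, 0.00, 0.25, 0.50])
u = pursuer_flc.output([0.3, -0.1])   # steering command for (delta, delta_dot)
```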
For the problem under consideration, each learning agent starts with initial MFs
and Fuzzy Decision Table (FDT). Figure 3.1 and Table 3.3 show the MFs of the pur-
suer and the evader before learning and their FDTs. The reason for choosing these
initial settings is to prevent the pursuer from catching the evader at the beginning
of the PE game. It is possible to use different initial settings, though this will affect
the period of the learning process.
Figure 3.1: Initial MFs of pursuer and evader before learning.
3.5 Reinforcement Learning
The main concept of RL is for the agent to learn how to achieve a specific goal by
interacting with the environment. The interaction between the learning agent and
the environment occurs in a simplified manner. At each time step (t = 0, 1, 2...), the
agent selects an action and the environment responds by presenting a new situation
Table 3.3: FDTs of the pursuer and the evader before learning.

(a) FDT of the pursuer.
             δ̇p
             N       Z       P
δp    N     -0.50   -0.25    0.00
      Z     -0.25    0.00    0.25
      P      0.00    0.25    0.50

(b) FDT of the evader.
             δ̇e
             N       Z       P
δe    N     -1.00   -0.50    0.00
      Z     -0.50    0.00    0.50
      P      0.00    0.50    1.00
to the agent, which receives a scalar value rt+1 ∈ R, known as a reward. The goal of
the learning agent is to maximize the accumulated discounted-future reward over
the long run, which is defined by
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\tau} \gamma^k r_{t+k+1},   (3.7)
where γ ∈ (0, 1] is the discount factor and τ is the terminal time.
In RL, the reward function selection process is a task-dependent problem, and
choosing this function correctly enables the agent to update its value function ac-
curately. The main task of the PE game is to enable the pursuer to catch the evader
in the minimum time, and the right choice for this function is defined by Desouky
et al. [20], and is given by:
r_{t+1} = \frac{\Delta D(t)}{\Delta D_{max}},   (3.8)

where

\Delta D(t) = D(t) - D(t+1),   (3.9)

and

\Delta D_{max} = (V_p + V_e)\,T,   (3.10)
where T represents the sampling time.
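As a quick illustration (the function and variable names below are not from the thesis), the reward of Equations (3.8)–(3.10) can be computed as:

```python
def pursuit_reward(D_t, D_t_plus_1, Vp, Ve, T):
    """Reward (3.8)-(3.10): the reduction in pursuer-evader distance over one
    sampling period, normalized by the largest possible reduction (Vp + Ve) * T."""
    delta_D = D_t - D_t_plus_1          # Eq. (3.9)
    delta_D_max = (Vp + Ve) * T         # Eq. (3.10)
    return delta_D / delta_D_max        # Eq. (3.8); close to +1 when closing at full speed
```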
3.6 Q-Learning Fuzzy Inference System (QLFIS)
Desouky et al. [20] proposed a learning technique called the QLFIS and applied it
to the problem of the PE differential game1. The structure of the learning technique
is shown in Figure 3.2. The QLFIS tunes all the parameters of the FIS and the FLC,
and the FIS is used to approximate the action-value function Q(s, a), whereas the
FLC is used to determine the control signal ut. For exploration, a white Gaussian
noise with zero mean and standard deviation σn is added to the signal ut to generate
the control signal uc.
3.6.1 The Learning Rule of the QLFIS and its Algorithm
The QLFIS technique is used to tune all the parameters of the FIS and the FLC.
Let φQ(t) and φu(t) refer to the parameter vector of the FIS and the FLC, respectively,
where φ is defined as follows:
\phi = \begin{bmatrix} \sigma \\ m \\ K \end{bmatrix}.   (3.11)
1Desouky et al. [20] used eligibility traces, defined by λ, in their RL algorithm. In [139], it has been discovered that the eligibility trace had little advantage in this application, and as such it is not used in the subsequent work.
Figure 3.2: The structure of the QLFIS technique [20].
Thus, φQ(t) and φu(t) can be updated according to the following gradient-based
formulas [20]
\phi_Q(t+1) = \phi_Q(t) + \eta \, \Delta_t \, \frac{\partial Q_t(s_t, u_t)}{\partial \phi_Q},   (3.12)

\phi_u(t+1) = \phi_u(t) + \xi \, \Delta_t \left( \frac{u_c - u}{\sigma_n} \right) \frac{\partial u}{\partial \phi_u},   (3.13)
where η and ξ are the learning rate of the FIS and the FLC, respectively, and are
defined by [20]
\eta = 0.1 - 0.09 \left( \frac{i_{ep}}{\text{Max. Episodes}} \right),   (3.14)

\xi = 0.1\,\eta,   (3.15)

where iep is the current episode number. The terms ∂Qt(st, ut)/∂φQ and ∂u/∂φu are calculated
as follows [20]:
\frac{\partial Q_t(s_t, u_t)}{\partial \sigma_{ij}} = \left( \frac{2(z_i - m_{ij})^2}{\sigma_{ij}^3} \right) \frac{\big( K - Q_t(s_t, u_t) \big)}{\sum_{l=1}^{9} \omega_l} \, \omega^T,   (3.16a)

\frac{\partial Q_t(s_t, u_t)}{\partial m_{ij}} = \left( \frac{2(z_i - m_{ij})}{\sigma_{ij}^2} \right) \frac{\big( K - Q_t(s_t, u_t) \big)}{\sum_{l=1}^{9} \omega_l} \, \omega^T,   (3.16b)

\frac{\partial Q_t(s_t, u_t)}{\partial K_l} = \bar{\omega}_l,   (3.16c)
and

\frac{\partial u}{\partial \sigma_{ij}} = \left( \frac{2(z_i - m_{ij})^2}{\sigma_{ij}^3} \right) \frac{(K - u)}{\sum_{l=1}^{9} \omega_l} \, \omega^T,   (3.17a)

\frac{\partial u}{\partial m_{ij}} = \left( \frac{2(z_i - m_{ij})}{\sigma_{ij}^2} \right) \frac{(K - u)}{\sum_{l=1}^{9} \omega_l} \, \omega^T,   (3.17b)

\frac{\partial u}{\partial K_l} = \bar{\omega}_l,   (3.17c)
where \bar{\omega}_l represents the normalized firing strength of rule l and it is calculated as
follows:

\bar{\omega}_l = \frac{\omega_l}{\sum_{l=1}^{9} \omega_l},   (3.18)

and \omega_l represents the firing strength of rule l, and it is defined as follows:

\omega_l = \prod_{i=1}^{2} \mu_{A_i^l}(z_i),   (3.19)
where µAli(zi) refers to the Gaussian membership value of the input zi in rule l. The
terms K and ω are two vectors containing the consequent and strength of certain
rules, respectively. For example, the parameter σ23 represents the standard devia-
tion for the third MF of the second input z2, and it appears in rules R3, R6, and R9.
Thus, ∂Qt(st, ut)/∂σ23 can be calculated from Equation (3.16a) with K = [K3  K6  K9]
and ω = [ω3  ω6  ω9]. The QLFIS learning algorithm is given in Algorithm 3.1.
Algorithm 3.1 Learning in the QLFIS.
1. Initialize the premise and the consequent parameters of the FLC as shown in
Figure 3.1 and Table 3.3, respectively.
2. Initialize the premise parameters of the FIS with the same values as those
of the FLC and initialize the consequent parameters to zeros. Note that the
output of the FIS is a Q-value.
3. Set γ ← 0.95, and σn ← 0.08.
4. For each episode (game)
(a) Calculate η from Equation (3.14) and calculate ξ from Equation (3.15).
(b) Initialize the position of the pursuer, (xp, yp) to (0, 0).
(c) Initialize the position of the evader, (xe, ye), randomly.
(d) Calculate the initial state, s = (δi, δ̇i), from Equation (3.4).
(e) For each step (play) Do
i. Calculate the output of the FLC, u, using Equation (3.6).
ii. Calculate the output uc = u+N (0, σn).
iii. Calculate the output of the FIS, Q(s, u), using Equation (3.6).
iv. Run the game for the current step and observe the next state st+1.
v. Get the reward, rt+1.
vi. From Equation (3.6), calculate the Q-value of the next state,
Q(st+1, u′), which is the output of the FIS.
vii. Calculate the TD-error, ∆t.
viii. Calculate the gradient for the premise and the consequent param-
eters of the FIS and the FLC from Equation (3.16) and Equation
(3.17), respectively.
ix. Update the parameters of the FIS from Equation (3.12).
x. Update the parameters of the FLC from Equation (3.13).
xi. Set st ← st+1.
(f) end for
5. end for
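As a complement to Algorithm 3.1, the sketch below shows how a single play could be coded. It is limited to Method 1 (consequent parameters only) to keep it short, uses Python rather than the MATLAB used for the thesis simulations, and all helper names (env_step, firing_strengths, and so on) are assumptions carried over from the earlier FLC sketch.

```python
import numpy as np

def qlfis_play_step(flc, fis, state, env_step, gamma, eta, xi, sigma_n, rng):
    """Sketch of one play (steps i-xi of Algorithm 3.1), restricted to Method 1,
    i.e., only the consequent vectors K of the FLC and the FIS are updated.
    `flc` and `fis` are assumed to expose firing_strengths()/output() and a K vector
    (e.g., the ZeroOrderTSFLC sketch of Section 3.4), and env_step(uc) is assumed to
    advance the game by one sampling period and return (next_state, reward)."""
    u = flc.output(state)                           # step i: FLC output, Eq. (3.6)
    uc = u + rng.normal(0.0, sigma_n)               # step ii: exploration noise
    q = fis.output(state)                           # step iii: Q(s_t, u_t)
    next_state, r = env_step(uc)                    # steps iv-v: play and reward
    q_next = fis.output(next_state)                 # step vi: Q of the next state
    td_error = r + gamma * q_next - q               # step vii: TD error, Delta_t

    dq_dK = fis.firing_strengths(state)             # step viii (consequents only):
    dq_dK = dq_dK / np.sum(dq_dK)                   # Eq. (3.16c), normalized strengths
    du_dK = flc.firing_strengths(state)
    du_dK = du_dK / np.sum(du_dK)                   # Eq. (3.17c)

    fis.K = fis.K + eta * td_error * dq_dK                          # step ix, Eq. (3.12)
    flc.K = flc.K + xi * td_error * ((uc - u) / sigma_n) * du_dK    # step x, Eq. (3.13)
    return next_state                                               # step xi
```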
3.7 Computer Simulation
In order to evaluate the effect of each method of parameter tuning on the perfor-
mance of the QLFIS algorithm, it is applied to the three versions of the PE
differential game. For the purpose of simulation, all the code was written in MATLAB
and implemented on an Intel i7-3612 quad-core processor with a 2.1 GHz clock frequency
and 8 GB of RAM, so all comparisons of the computation time are based on these
specifications. For fair comparison, the
the computation time were based on these specifications. For fair comparison, the
same values used by Desouky et al. [20] are applied, unless stated otherwise. It is
assumed that the pursuer is faster than the evader with Vp = 2m/s and Ve = 1m/s,
and the evader is more maneuverable than the pursuer with −1 ≤ ue ≤ 1 and
−0.5 ≤ up ≤ 0.5. The wheelbases of the pursuer and the evader are the same
and they are equal to 0.3 m. In each episode, the pursuer starts from the
origin with an initial orientation of θp = 0, while the evader's initial position is chosen
randomly from a set of 64 different positions with θe = 0, unless stated otherwise.
The selected capture radius is ℓ = 0.1 m, except for the second game, where
ℓ = 0.05 m. The sample time is T = 0.1 s. For statistical analysis purposes, Monte
Carlo simulation is run 500 times to get sufficient information about the capture
and computation times, and each simulation run performs 1000 episodes/games.
The number of plays in each game is 600, so each game terminates when the time
exceeds 60 s, or the pursuer captures the evader. Also, a 2-tailed 2-sample t-test
is performed to show if there is any significant difference among the means of the
computation time at the 0.05 level.
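The statistical post-processing can be reproduced with standard tools. The following Python/SciPy sketch (an assumption about tooling, since the thesis analysis was done in MATLAB, and the function name is illustrative) computes the mean, standard deviation, and a 2-tailed 2-sample t-test for two sets of Monte Carlo capture or computation times.

```python
import numpy as np
from scipy import stats

def compare_methods(times_a, times_b, alpha=0.05):
    """Mean and standard deviation over the Monte Carlo runs for two tuning methods,
    plus a 2-tailed 2-sample t-test on the difference in means at level alpha."""
    t_stat, p_value = stats.ttest_ind(times_a, times_b)
    return {
        "mean_a": float(np.mean(times_a)),
        "std_a": float(np.std(times_a, ddof=1)),
        "mean_b": float(np.mean(times_b)),
        "std_b": float(np.std(times_b, ddof=1)),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }
```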
3.7.1 Evader Follows a Default Control Strategy
In this game, it is assumed that the evader plays its DCS, as defined by Equation
(3.3) and Equation (3.4). It is also assumed that the pursuer does not have any
information about the evader’s strategy. The goal is to make the pursuer self-learn
its control strategy by interacting with the evader. Furthermore, to find the best
methods of parameter tuning for this game, the four methods discussed in Section
3.2 are implemented, and the results are compared with the DCS.
As mentioned in Section 3.4, the initial MFs of the pursuer and its FDT are
selected as shown in Figure 3.1 and Table 3.3, respectively, such that the pursuer
cannot capture the evader at the beginning of the game. Figure 3.3 shows the
PE paths before the pursuer starts to learn (i.e., for the evader’s starting position
(−6, 7)), which demonstrates that the pursuer was unsuccessful in capturing the
evader.
Following the initialization step, the performance of each method of parame-
ter tuning needs to be evaluated. Since the capture time and the PE paths of the
DCS are known, the performance can be evaluated by comparing the capture time
and the PE paths that result from each method, with those resulting from the DCS.
Therefore, for each method of parameter tuning, the Monte Carlo simulation for
Algorithm 3.1 is run 500 times, and after each simulation run the capture times
are calculated for 5 different evader initial positions. Then, the mean and standard
deviation of the capture time over the 500 simulation runs are calculated, and the
results are given in Table 3.4, which shows the mean values of the capture time for
the different initial evader positions using the DCS and the four tuning methods.
The PE paths using these methods compared to the DCS are shown in Figures 3.4 –
3.7. From Table 3.4 and Figures 3.4 – 3.5 it is clear that the performance of the first
two methods is similar, and their mean values of the capture time are only slightly
different than the DCS. This observation reveals that the tuning method of the FIS
does not have much effect on the performance of the learning algorithm. Ac-
cording to Table 3.4 and Figures 3.6 – 3.7, the performance of the last two methods
is similar, and both approach the performance of the DCS with respect to the mean
values of the capture time and on the PE paths, which confirms that tuning all the
parameters of the FLC has a significant effect on the performance of the learning
algorithm. Moreover, Figure 3.7 shows that the PE paths using the fourth method
and the DCS are indistinguishable. Thus, for this version of the PE game, it can
be concluded that the performance of the learning algorithm is slightly affected by
changing the method of tuning for the FIS, and is significantly affected by changing
the method of tuning for the FLC. Furthermore, the results show that the pursuer
can learn its control strategy in all four methods, and that the last two methods
outperform the first two.
Table 3.4: Mean and standard deviation of the capture time (s) for different evader
initial positions for the first version of the PE game.
Evader initial position    (−6, 7)              (−7,−7)              (2, 4)               (3,−8)               (−4, 5)
                           Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.
DCS                        9.6      -           10.4     -           4.5      -           8.5      -           6.8      -
Method 1                   10.1304  0.1700      10.8554  0.1830      4.7946   0.0837      9.0078   0.1579      7.1934   0.1326
Method 2                   10.0144  0.1513      10.7732  0.1729      4.6938   0.0914      8.8976   0.1536      7.0804   0.1117
Method 3                   9.6196   0.0417      10.4006  0.0077      4.5004   0.0063      8.6004   0.0089      6.8030   0.0193
Method 4                   9.6214   0.0434      10.4012  0.0167      4.5004   0.0063      8.5996   0.0110      6.8020   0.0166
After comparing the performances of the four methods of parameter tuning, the
computational complexity should be measured, as it represents an important factor
for any real-time applications. In this thesis, it is measured by finding the mean and
the standard deviation of the time (i.e., computation time) that the computer will
Figure 3.3: The PE paths on the xy-plane for the first version of the PE game, before
the pursuer starts to learn.
Figure 3.4: The PE paths on the xy-plane for the first version of the PE game when
the first method of parameter tuning is used versus the PE paths when each
player followed its DCS.
Figure 3.5: The PE paths on the xy-plane for the first version of the PE game when
the second method of parameter tuning is used versus the PE paths when each
player followed its DCS.
Figure 3.6: The PE paths on the xy-plane for the first version of the PE game
when the third method of parameter tuning is used versus the PE paths when
each player followed its DCS.
Figure 3.7: The PE paths on the xy-plane for the first version of the PE game when
the fourth method of parameter tuning is used versus the PE paths when each
player followed its DCS.
take when simulating the learning algorithms only. As mentioned in Section 3.7,
the simulation of the learning algorithm is run 500 times, and is implemented using
an Intel i7-3612 quad-core processor with a 2.1 GHz clock frequency and 8 GB
of RAM. Table 3.5 shows a comparison of the computation time for each parameter
tuning method after 500 simulation runs. From this table, it can be seen that the first,
second, and third methods are, respectively, about 2.78, 1.45, and 1.52 times faster
than the fourth method. Also, when the 2-tailed 2-sample t-test is applied to the
computation time, it is found that the difference in means is statistically significant
at the 0.05 level (i.e., all differences in means resulted in a p-value below 0.0001).
3.7.2 Evader Using its Higher Maneuverability Advantageously
The second version of the PE game allows the evader to make use of its advantage
of higher maneuverability.

Table 3.5: Mean and standard deviation of the computation time (s) for the four
methods of parameter tuning for the first version of the PE game.

            Mean      Standard Deviation
Method 1    7.5943    0.0706
Method 2    14.5941   0.2539
Method 3    13.9162   0.1373
Method 4    21.1100   0.3423

The dynamic equations that describe the motions of the pursuer and the evader
robots are given by:

\dot{x}_i = v_i \cos(\theta_i), \qquad \dot{y}_i = v_i \sin(\theta_i), \qquad \dot{\theta}_i = \frac{v_i}{L_i}\tan(u_i).   (3.20)
In this game, the robot velocity, vi, is governed by its steering angle, so it slows
down in turns and avoids slippage. This velocity is defined by:
v_i = V_i \cos(u_i),   (3.21)

where Vi represents the robot's maximum velocity. Desouky et al. [20] modified
the evader’s default strategy to allow the evader to take advantage of its higher
maneuverability, as follows:
1. If the evader is far enough from the pursuer (i.e., D(t) is greater than a certain
value dc), then the evader will attempt to escape along the LoS. So, the evader
control strategy is
u_e = \tan^{-1}\!\left(\frac{y_e - y_p}{x_e - x_p}\right) - \theta_e.   (3.22)
2. If D(t) < dc, the evader uses its advantage of higher maneuverability to move
in the opposite direction to the pursuer. So, the evader control strategy
here is
u_e = (\theta_p + \pi) - \theta_e,   (3.23)

where dc is the minimum turning radius of the pursuer, R_{d_{p_{min}}}.
Figure 3.8 shows the PE paths before the pursuer learns (i.e., for the evader
position (−4, 5)).
Figure 3.8: The PE paths on the xy-plane for the second version of the PE game,
before the pursuer starts to learn.
To determine the best methods of parameter tuning for this game, the QLFIS learn-
ing process is implemented for each method, and the results are compared with
the modified DCS of this game. The mean values of the capture time for different
initial evader positions using the modified DCS and the four tuning methods are
given in Table 3.6, and the PE paths for these methods compared with the modified
DCS are shown in Figures 3.9 – 3.12. Table 3.6 and Figures 3.9 – 3.10 show that
the performance of the first two methods is similar. They also show that the first
two methods of parameter tuning cannot achieve acceptable performance, as there
are significant disparities in the mean values of the capture time compared with the
DCS, and the PE paths differ from those of the DCS. Thus, it is clear that the pursuer
does not learn well, which gives an indication that changing the method of tuning
for the FIS has less effect on the resulting performance of the QLFIS algorithm. On
the other hand, Table 3.6 and Figures 3.11 – 3.12 show that the performance of the
third method resembles the performance of the fourth, and that it outperforms the
first two methods with respect to both capture time and PE paths when compared
with those resulting from the DCS. This confirms that the FLC parameter tuning
method has significant impact on the learning algorithm performance. As a result,
for the second version of the PE game, it can be concluded that the performance of
the QLFIS algorithm is only slightly affected by changing the method of tuning for
the FIS, while it is significantly affected by changing the method of tuning for the
FLC.
Table 3.6: Mean and standard deviation of the capture time (s) for different evader
initial positions for the second version of the PE game.
Evader initial position    (−6, 7)              (−7,−7)              (2, 4)               (3,−8)               (−4, 5)
                           Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.
DCS                        18.6     -           11.9     -           4.3      -           10.2     -           10.2     -
Method 1                   15.4137  1.4949      16.7493  1.1453      9.7761   1.9525      14.0899  1.6505      8.7226   1.1525
Method 2                   15.3214  1.0682      16.5767  1.2139      9.8778   1.6904      13.7428  1.5064      8.9742   1.0761
Method 3                   18.9392  0.4601      12.0921  0.4216      4.4259   0.2496      9.5625   0.4812      10.4883  0.1790
Method 4                   18.8886  0.3218      12.1428  0.4129      4.3516   0.2453      9.6222   0.4634      10.4087  0.1828
Figure 3.9: The PE paths on the xy-plane for the second version of the PE game
when the first method of parameter tuning is used versus the PE paths when
each player followed its DCS.
Figure 3.10: The PE paths on the xy-plane for the second version of the PE game
when the second method of parameter tuning is used versus the PE paths
when each player followed its DCS.
Figure 3.11: The PE paths on the xy-plane for the second version of the PE game
when the third method of parameter tuning is used versus the PE paths when
each player followed its DCS.
Figure 3.12: The PE paths on the xy-plane for the second version of the PE game
when the fourth method of parameter tuning is used versus the PE paths when
each player followed its DCS.
Table 3.7 shows a comparison of the computation time for each method of pa-
rameter tuning, which shows that the first, second, and third methods are, respec-
tively, about 2.56, 1.18, and 1.61 times faster than the fourth method. Since the
first two methods did not give an acceptable performance when the evader uses the
advantage of its higher maneuverability, their reduction in the computation time is
not of significant importance for this game. Also, t-testing demonstrated a signifi-
cant difference among the means of the computation time at the 0.05 level (i.e., all
differences in means resulted in a p-value less than 0.0001).
Table 3.7: Mean and standard deviation of the computation time (s) for the four
methods of parameter tuning for the second version of the PE game.
Mean Standard Deviation
Method 1 12.6270 0.4631
Method 2 28.9608 1.0336
Method 3 20.1649 0.7549
Method 4 32.3712 0.6077
3.7.3 Multi-Robot Learning
The game presented in this section is called multi-robot learning because each
player in this game is trying to learn its control strategy. Two cases are presented:
the first is a two-player PE differential game (i.e., a single-pursuer single-evader
game), while the second is a multi-pursuer single-evader differential game.
Case 1: Single-Pursuer Single-Evader
In this case, it is assumed that each player does not have any information about
the other player’s strategy, and the goal is to make both players interact with each
other, and thereby self-learn their control strategies simultaneously.
The initial MFs of the pursuer and the evader and their FDTs before learning
are selected as shown in Figure 3.1 and Table 3.3. Figure 3.13 shows the PE paths
before learning (i.e., for the evader position (−6, 7)).
After learning, the mean values of the capture time for different initial evader
positions using the DCS and the four methods of parameter tuning are given in
Table 3.8. The PE paths using these methods compared to the DCS are shown in
Figures 3.14 – 3.17. Table 3.8 and Figure 3.14 show that the first method of pa-
rameter tuning for multi-robot learning is not effective enough to reach the desired
performance level, compared to the DCS. It is clear that the evader does not learn
well, and is captured too soon. As indicated, there are differences in the capture
times, and the PE paths have deviated from those of the DCS. Also, Table 3.8 shows
that the mean values of the capture time for the last three methods of parameter
tuning are slightly different than those of the DCS.

Figure 3.13: The PE paths on the xy-plane for the third version of the PE game
(Case 1), before the players start to learn.

From Figure 3.15, it is evident
that the PE paths are different than the paths of the DCS. Furthermore, Figures
3.16 – 3.17 show that the PE paths of the last two methods differ slightly from that
of the DCS. Thus, tuning all the parameters of the FLC gives the highest perfor-
mance for capture time and the PE paths.
Table 3.8: Mean and standard deviation of the capture time (s) for different evader
initial positions for the third version of the PE game (Case 1).
Evader initial position    (−6, 7)              (−7,−7)              (2, 4)               (3,−8)               (−4, 5)
                           Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.
DCS                        9.6      -           10.4     -           4.5      -           8.5      -           6.8      -
Method 1                   8.3946   0.5580      9.0094   0.6411      4.3218   0.4160      8.3382   0.5957      5.7092   0.5179
Method 2                   9.4096   0.2956      10.1420  0.3581      4.4298   0.1256      8.5698   0.2748      6.5362   0.2150
Method 3                   9.6424   0.1356      10.4448  0.0663      4.4434   0.0553      8.5850   0.1280      6.7448   0.1017
Method 4                   9.7034   0.2288      10.4346  0.3692      4.5382   0.1839      8.6496   0.2069      6.8628   0.1756
Figure 3.14: The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the first method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Table 3.9 shows the computation time when the four methods of parameter
Figure 3.15: The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the second method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.16: The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the third method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.17: The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the fourth method of parameter tuning is used versus the PE
paths when each player followed its DCS.
tuning were implemented. It shows that the first, second, and third methods are,
respectively, about 3.6, 1.62, and 1.65 times faster than the fourth method. As the
performance of the first method is not convincing, its computation speed is not that
important. But, when comparing the other methods, it seems that the third method
is the most recommended one. Also, based on t-testing, it is found that there are
significant differences in means at the 0.05 level (i.e., the resulting p-values are less
than 0.0001).
Case 2: Multi-Pursuer Single-Evader
For this case, it is assumed that there are multiple pursuers and a single evader, in
which all pursuers have the same capabilities. Also, it is assumed that each player
does not have any information about the other players' strategies, and the goal is to
enable each player to learn its control strategy.

Table 3.9: Mean and standard deviation of the computation time (s) for the four
methods of parameter tuning for the third version of the PE game (Case 1).

            Mean      Standard Deviation
Method 1    10.4277   0.2808
Method 2    23.1968   0.5221
Method 3    22.7298   0.4542
Method 4    37.5307   1.0639

For the pursuers, it is proposed that
the learning process is achieved in a decentralized manner, wherein each pursuer in-
teracts with the evader to find its control strategy by considering the other pursuers
as part of its environment. On the other hand, it is assumed that the control strategy
of the evader is to learn how to escape from the nearest pursuer. Thus, the evader’s
learning process can be achieved through the interaction between the evader and
all the pursuers in order to find its control strategy. Therefore, at each time step the
evader calculates its distances from all the pursuers to determine which pursuer is
the closest one to it. Thus, δe can be defined as the angle difference between the
LoS from the nearest pursuer to the evader and the evader’s direction, and is given
by
\delta_e = \tan^{-1}\!\left(\frac{y_e - y_{p_c}}{x_e - x_{p_c}}\right) - \theta_e,   (3.24)
where pc denotes the nearest pursuer to the evader. Also, to implement the QLFIS
algorithm for this case, the reward functions of each player should be defined.
For each pursuer the reward function is as defined by Equation (3.8), whereas the
evader’s reward function ret+1 is defined as follows:
ret+1 = −rpct+1. (3.25)
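A small Python sketch of this decentralized setup is given below; it only illustrates Equations (3.8), (3.24), and (3.25), the function and variable names are not from the thesis, and the MATLAB implementation used in the simulations is not reproduced here.

```python
import math

def evader_state_and_reward(evader, pursuers, prev_dists, Vp, Ve, T):
    """The evader measures its distance to every pursuer, builds its angle difference
    from the nearest one (3.24), and receives the negative of that pursuer's reward
    (3.25). `evader` and each entry of `pursuers` are (x, y, theta) tuples;
    `prev_dists` holds the previous step's distances. All names are illustrative."""
    xe, ye, theta_e = evader
    dists = [math.hypot(xe - xp, ye - yp) for (xp, yp, _) in pursuers]
    c = min(range(len(pursuers)), key=lambda j: dists[j])        # nearest pursuer p_c
    xpc, ypc, _ = pursuers[c]
    delta_e = math.atan2(ye - ypc, xe - xpc) - theta_e           # Eq. (3.24)
    r_pc = (prev_dists[c] - dists[c]) / ((Vp + Ve) * T)          # pursuer reward, Eq. (3.8)
    r_e = -r_pc                                                  # evader reward, Eq. (3.25)
    return delta_e, r_e, dists
```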
For simulation purposes, it is assumed that there are three pursuers and a single
evader. The evader starts its motion from the origin with an initial orientation
θe = 0, while the pursuers’ starting positions are selected randomly from a set of 64
different positions with θp1 = θp2 = θp3 = 0.
After running the Monte Carlo simulation 500 times for each method of parame-
ter tuning, the mean value and standard deviation of the capture time for 5 different
initial pursuers’ positions are calculated, and the results are as shown in Table 3.10.
The PE paths using the four tuning methods compared to those resulting from the
DCS are shown in Figures 3.18 – 3.21 (i.e., for the initial pursuers’ positions (−4, 4),
(−4,−5), (8, 4)). The results show that the performance of the first tuning method
is unacceptable, as there are large discrepancies in the mean values of the capture
time compared with those of the DCS. Also, it shows that Method 2 has a better
performance compared with Method 1, which means that tuning all the parameters
of the function approximator gives the learning player more ability to learn. On
the other hand, the results show that the performance of the last two methods are
slightly different from that of the DCS, which means that the learning processes are
completed successfully and each player is able to learn its control strategy. From this
simulation, it is evident that tuning all the parameters of the FLC has a significant
impact on the learning process.
Table 3.10: Mean and standard deviation of the capture time (s) for different
pursuers’ initial positions for the third version of the PE game (Case 2).
Pursuers' initial positions:
  (a) (−2,−10), (−3, 12), (4, 16)     (b) (−6, 12), (3, 10), (6,−12)     (c) (20,−6), (−20,−5), (4, 12)
  (d) (15, 5), (−15, 6), (7,−7)       (e) (−4, 4), (−4,−5), (8, 4)

                 (a)                  (b)                  (c)                   (d)                  (e)
                 Mean     Std. dev.   Mean     Std. dev.   Mean      Std. dev.   Mean     Std. dev.   Mean     Std. dev.
DCS              7.6      -           8.7      -           11.9      -           8.3      -           4.4      -
Method 1         7.3710   1.2687      6.9116   1.1570      9.6606    1.6043      7.2210   1.4531      4.2896   0.5471
Method 2         7.9668   0.5270      8.9962   0.7027      12.0242   0.8159      8.8184   0.6144      4.5988   0.3217
Method 3         7.8546   0.0861      8.6812   0.1035      11.9670   0.1413      8.5204   0.0910      4.4630   0.0560
Method 4         7.8334   0.1792      8.8168   0.1993      11.9152   0.2600      8.4978   0.2130      4.4386   0.1210
The computation time for each method of parameter tuning is given in Table
3.11, which shows that the first, second, and third methods are, respectively, about
3.18, 1.61, and 1.65 times faster than the fourth method. Since the first method’s
performance is unacceptable, its computation speed is insignificant. But, among the
other methods, it seems that the third method is the most recommended one. Also,
t-testing results showed a significant difference among the means of the computa-
tion time at the 0.05 level (i.e., the resulting p-values are below 0.0001).
Table 3.11: Mean and standard deviation of the computation time (s) for the four
methods of parameter tuning for the third version of the PE game (Case 2).
Mean Standard Deviation
Method 1 17.1903 0.3888
Method 2 34.0040 0.6114
Method 3 33.1321 0.4174
Method 4 54.6030 0.9466
3.8 Conclusion
Four methods of parameter tuning for the QLFIS algorithm were applied to three
versions of the PE games. In the first method only the consequent parameters of
the FLC and FIS were tuned, while in the second method only the consequent pa-
rameters of the FLC and all the parameters (i.e., the premise and the consequent pa-
rameters) of the FIS were tuned. In the third method, all the parameters of the FLC
and only the consequent parameters of the FIS were tuned. Finally, all the param-
eters of the FLC and FIS in the last method were tuned. The results show that the
performance of the QLFIS in each game depends on the parameter tuning method.
Figure 3.18: The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the first method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.19: The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the second method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.20: The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the third method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.21: The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the fourth method of parameter tuning is used versus the PE
paths when each player followed its DCS.
In the first and second games, the results demonstrate that the performance of the
learning algorithm is slightly affected by changing the method of tuning for the
FIS, while it is significantly affected by changing the method of tuning for the FLC.
Furthermore, in the first game it was shown that the pursuer could learn its control
strategy in all methods, but the last two methods outperformed the first two. This
is because the first game is so simple. In the second game, it was found that the
first two methods of parameter tuning do not yield good performance because the
pursuer does not learn well. It was shown that tuning all the parameters of the
FLC, as in the last two methods, achieves the best performance regarding capture
time and the PE paths. In the third game, the results show that the performance
of the learning algorithm is affected by changing the method of tuning for both
the FIS and FLC, and that changing the method of tuning for the FLC as in the
third and fourth methods has a significant impact on performance. In addition, the
first method of parameter tuning is not effective enough for multi-robot learning to
reach the desired performance; the evader does not learn well and gets captured
too soon. Furthermore, although changing the tuning method of the FIS, as in the
second method, can significantly enhance the capture-time performance, the PE paths
are different than the DCS paths. For all versions of the PE game discussed in this
chapter, the results indicate that the first method has the lowest computation time,
and the fourth has the longest. They also show that the computation times for
Methods 2 and 3 are between those of the first and fourth methods, and they are
almost similar except for the second version of the game. Thus, to reduce the com-
putational complexity and maintain the performance of the QLFIS algorithm, it is
best to use Method 1 for the first version of the game and Method 3 for the second
and third versions.
Chapter 4
Learning Technique Using PSO-Based
FLC and QLFIS Algorithms
4.1 Introduction
Despite the popularity of Reinforcement Learning (RL), its algorithms cannot typ-
ically be used directly for problems with continuous state and action spaces, such
as the Pursuit-Evasion (PE) game in its differential form. Thus, it is possible to use
one of the function approximation methods, such as the Fuzzy Inference System
(FIS), to generalize the state and action spaces [47]. The FIS has a knowledge
base consisting of a collection of rules in linguistic form, and building the knowl-
edge base can be complicated, particularly for systems with many input and output
parameters. In [98–101], supervised learning techniques are used to tune the FIS
parameters, and these require a teacher or training input/output dataset which is
sometimes unavailable or too expensive to obtain. The training period also depends
on the usefulness of the initial FIS parameter setting, and starting with a random
initial setting will affect the starting functionality of the FIS and the speed of con-
vergence to the final setting. For this reason, an unsupervised two-stage learning
technique that combines a Particle Swarm Optimization (PSO)-based Fuzzy Logic
Control (FLC) algorithm with the Q-Learning Fuzzy Inference System (QLFIS) al-
gorithm is proposed for the problem of the PE differential game. In the first stage,
the game runs for a few episodes and the PSO algorithm is used to tune the FLC
parameters. From an optimization viewpoint, in this stage the PSO algorithm works
as a global optimizer to find appropriate values for the FLC parameters, which rep-
resents the initial setting of the FLC parameters for the next stage. The game then
proceeds to the second stage, in which the QLFIS algorithm works as a local op-
timizer to accelerate convergence to the final setting of the FLC parameters, since
it uses the gradient descent as an updating approach. The proposed technique is
applied to two versions of the PE differential games, and the findings are presented
and discussed in this chapter. In the first game, the pursuer attempts to learn its
Default Control Strategy (DCS) from the rewards received from its environment,
while the evader plays a well-defined strategy of trying to escape along the Line-of-
Sight (LoS); the results of this game were published in [140]1. The second game
addresses dual learning in PE differential game, with both players interacting to
self-learn their control strategies simultaneously; the corresponding results were
presented in [141]2.
The organization of this chapter is as follows: In Section 4.2, the PSO-based FLC
1A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique using PSO-based FLC and QLFIS for the pursuit evasion differential game,” in Proc. of the 2014 IEEE International Conference on Mechatronics and Automation (ICMA 2014), (Tianjin, China), pp. 762-769, IEEE, August 2014.
2A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique for dual learning in the pursuit-evasion differential game,” in Proc. of the IEEE Symposium Series on Computational Intelligence (SSCI) 2014, (Orlando, USA), pp. 1-8, IEEE, December 2014.
algorithm is described. The QLFIS is briefly discussed in Section 4.3, and Section
4.4 presents the proposed learning technique. In Section 4.5, the simulation results
are presented, and finally, conclusions are presented in Section 4.6.
4.2 The PSO-based FLC algorithm
The learning algorithm that will be presented in this section is called the PSO-
based FLC algorithm because the parameters of the FLC are tuned based on the
PSO algorithm. In this algorithm, each learning agent has an FLC. The FLC is used
to determine the control signal, ui, of the learning agent, where i refers to the
learning agent. The FLC is implemented using the zero-order Takagi-Sugeno (TS)
rules with constant consequents [46]. It consists of two inputs and one output; the
two inputs are the angle difference δi and its derivative δi, and the output is the
steering angle ui. Each input has three Gaussian membership functions (MFs), with
linguistic values of P (positive), Z (zero) and N (negative). Thus, depending on the
number of inputs and their corresponding MFs, the FLC has 21 parameters that can
be tuned during the learning phase; the tunable parameters are explained in the
previous work [137]. The fuzzy output ui is defuzzified into crisp output using the
weighted average defuzzification method [138].
In this chapter, the PSO algorithm with a constriction factor [76] is used to tune
the FLC parameters of the learning agent. Using the constriction factor may become
necessary to ensure that the PSO algorithm converges [82]. The particles can
update their velocities and positions according to the constriction factor as follows
[84]:
vid = χ · [vid + c1 · r1 · (pbestid − xid) + c2 · r2 · (gbestgd − xid)], (4.1)
and
xid = xid + vid, (4.2)
where vid is the velocity at dimension d of particle i, pbestid is the best position found
so far by particle i at dimension d, and gbestgd is the global best position found so
far at dimension d. The values c1 and c2 are learning factors [73] or weights of the
attraction to the pbestid and gbestgd, respectively, and the variable xid is the current
position of particle i at dimension d. The two values r1 and r2 are uniform random
numbers in the range [0,1] generated at each iteration to explore the search space,
and χ is the constriction factor [84] given by
\chi = \frac{2}{\left| 2 - \psi - \sqrt{\psi^2 - 4\psi} \right|},   (4.3)
where ψ = c1 + c2 such that ψ > 4.
The FLC consists of 21 tunable parameters that can be coded as a particle flying
in a 21-dimensional search space. Moreover, the PSO algorithm is initialized
with a population of Np particles positioned randomly in the search
space. The PSO algorithm tunes the FLC parameters, depending on the problem
fitness function. For the problem of the PE game, the pursuer tries to maximize its
reward rt+1 at every time step to catch the evader in minimum time. The fitness
function that can be maximized by the pursuer over time is the average received
reward, which is defined by

AR = \frac{1}{\tau} \sum_{k=1}^{\tau} r_{t+k},   (4.4)
where τ is the final time step of the game. The learning algorithm is summarized
in Algorithm 4.1.
Algorithm 4.1 The PSO-based FLC
1. Initialize with a population of Np particles with random position x and velocity v.
2. For each particle Do
(a) Set the personal best position as the starting position.
(b) Set the personal best fitness values to a small value.
3. end for
4. Set the global best position as the starting position of the first particle.
5. Set the global fitness value to a small value.
6. Set the algorithm parameters c1 = c2 = 2.05.
7. Calculate the constriction factor χ from Equation (4.3).
8. For each episode (game)
(a) Initialize the position of the pursuer, (xp, yp) to (0, 0).
(b) Initialize the position of the evader, (xe, ye), randomly.
(c) Calculate the initial state, s = (δi, δ̇i).
(d) For each particle Do
i. For each step (play) Do
A. Calculate the output of the FLC, ui, using the weighted average defuzzification method.
B. Run the game for the current time step and observe the next state st+1.
C. Get the reward, rt+1.
ii. end for
iii. Calculate the fitness value from Equation (4.4).
(e) end for
(f) Sort the particles according to their fitness values.
(g) Find the personal best position for each particle and its fitness value.
(h) Find the global best position and its fitness value.
(i) Update the velocity of each particle from Equation (4.1).
(j) Update the position of each particle from Equation (4.2).
9. end for
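The particle update and fitness evaluation used by Algorithm 4.1 can be written compactly; the Python/NumPy sketch below is an illustration only (the thesis implementation was in MATLAB, and the function names are not from the thesis).

```python
import numpy as np

def constriction_factor(c1=2.05, c2=2.05):
    """Constriction factor (4.3); with c1 = c2 = 2.05, psi = 4.1 and chi is about 0.729."""
    psi = c1 + c2
    return 2.0 / abs(2.0 - psi - np.sqrt(psi ** 2 - 4.0 * psi))

def pso_update(x, v, pbest, gbest, chi, c1=2.05, c2=2.05, rng=None):
    """Velocity and position updates (4.1)-(4.2) for one particle. Here x, v, and
    pbest are 21-dimensional NumPy vectors of FLC parameters and gbest is the swarm's
    best position; uniform random numbers are drawn per dimension at each iteration."""
    rng = rng if rng is not None else np.random.default_rng()
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = chi * (v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x))
    return x + v_new, v_new

def episode_fitness(rewards):
    """Average received reward over one episode, Equation (4.4)."""
    return float(np.mean(rewards))
```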
4.3 Q-learning Fuzzy Inference System (QLFIS)
For the problem of the PE differential game, Desouky et al. [20] proposed a learning
technique called the Q-Learning Fuzzy Inference System (QLFIS) that tunes all the
parameters of the FIS and the FLC. For each learning player, the FIS is used to
approximate the action-value function, whereas the FLC is used to find its control
signal. In [137] four methods of QLFIS parameter tuning were investigated in order
to reduce the computational time without affecting the overall performance of the
learning algorithm. In the first method, only the consequent parameters of the FLC
and the FIS were tuned, while in the second method the consequent parameters of
the FLC and all the parameters (i.e., the premise and the consequent parameters)
of the FIS were tuned. In the third method, all the parameters of the FLC and only
the consequent parameters of the FIS were tuned, and in the fourth method all the
parameters of the FLC and FIS were tuned. It was found that the third method of
parameter tuning is adequate for learning in the PE game to achieve the desired
performance, and it is used in subsequent work.
Let KQ(t) represent the consequent parameter vector of the FIS, and φu(t) rep-
resent the parameter vector of the FLC. These vectors are updated according to the
following gradient based formulas [17, 142]
K_Q(t+1) = K_Q(t) + \eta \, \Delta_t \, \frac{\partial Q_t(s_t, u_t)}{\partial K_Q},   (4.5)

and

\phi_u(t+1) = \phi_u(t) + \xi \, \Delta_t \left( \frac{u_c - u}{\sigma_n} \right) \frac{\partial u}{\partial \phi_u},   (4.6)
where η and ξ are the learning rate of the FIS and FLC, respectively, and are defined
as in [137].
4.4 The proposed Two-Stage Learning Technique
The proposed learning technique is a combination of two stages: the PSO-based
FLC algorithm proposed in Section 4.2 and the QLFIS technique proposed in [20],
and is called PSO-based FLC+QLFIS algorithm. In the first stage, the PE game is run
for a few episodes using the PSO-based FLC algorithm to tune the FLC parameters.
Here, from an optimization viewpoint, the PSO algorithm in this stage works as
a global optimizer [143] to determine appropriate values for the FLC parameters,
which represent the initial setting of the FLC parameters for the next stage. In
the second stage, the game continues with the QLFIS algorithm working as a local
optimizer to speed up the convergence to the final setting of the FLC parameters,
since it uses the gradient descent as an updating approach. The two-stage learning
technique is given in Algorithm 4.2.
Algorithm 4.2 The two-stage learning technique
• Stage 1
(a) Set the number of episodes and particles for the PSO-based FLC algorithm.
(b) Run the game using the PSO-based FLC algorithm to tune the FLC parameters.
• Stage 2
(a) Set the number of episodes for the QLFIS algorithm.
(b) Initialize the FLC parameters with the same values obtained from stage 1.
(c) Continue the game using the QLFIS algorithm to continue tuning the FLC parameters.
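As a compact illustration of Algorithm 4.2 (a sketch only; the two callables and their arguments are assumptions, and the default episode counts are the ones used later for the first game in Section 4.5.1):

```python
def two_stage_learning(run_pso_stage, run_qlfis_stage,
                       pso_episodes=40, n_particles=10, qlfis_episodes=100):
    """Sketch of Algorithm 4.2: a short PSO-based FLC stage acts as the global
    optimizer, and its best particle becomes the initial FLC parameter setting for
    the QLFIS stage, which refines it by gradient descent. The two callables are
    assumed to wrap Algorithms 4.1 and 3.1, respectively."""
    gbest_flc_params = run_pso_stage(episodes=pso_episodes, particles=n_particles)
    final_flc_params = run_qlfis_stage(episodes=qlfis_episodes,
                                       init_flc=gbest_flc_params)
    return final_flc_params
```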
4.5 Computer Simulation
For the purpose of simulation, unless stated otherwise the same values used by
Desouky et al. [20] are selected. It is assumed that the pursuer is faster than the
evader with Vp = 2m/s and Ve = 1m/s, and the evader is more maneuverable than
the pursuer with −1 ≤ ue ≤ 1 and −0.5 ≤ up ≤ 0.5. The wheelbases of the
pursuer and the evader are the same and they are equal to 0.3 m. In each episode,
the pursuer’s motion starts from the origin with an initial orientation θp = 0, while
the evader's initial position is chosen randomly from a set of 64 different positions with
θe = 0. The selected capture radius is ℓ = 0.1 m and the sample time is T = 0.1 s.
For comparison, the results of the proposed learning technique are compared with
the results of the DCS, the QLFIS algorithm and the PSO-based FLC algorithm.
4.5.1 Evader Follows a Default Control Strategy
In this game, it is assumed that the evader plays its DCS as defined in [137]. It
is also assumed that the pursuer does not have any information about the evader’s
control strategy. The goal is to make the pursuer self-learn its control strategy by
interacting with the evader.
Each learning algorithm mentioned above has several parameter values that
should be set a priori. For the QLFIS algorithm, the same parameter values used
by Desouky et al. [20] are selected. The number of episodes/games is 1000, the
number of plays in each game is 600, and the game terminates when the time
exceeds 60 s or the pursuer captures the evader. In the QLFIS algorithm the learning
process is completed after 1000 episodes, and the other QLFIS parameters are γ =
0.95 and σn = 0.08. For the PSO-based FLC algorithm, it is assumed that c1 =
c2 = 2.05 which are chosen such that ψ > 4 in Equation (4.3), and the population
size selected is ten particles. This choice is based on computer simulation for the
PE game and applying the same values used by Desouky et al. [20]. Monte Carlo
simulations are run 500 times for population sizes of 1, 5, 10, 20, 30, 40, and
50, after which the mean and standard deviation of the capture time over the 500
simulation runs are calculated for each population size. Figure 4.1 shows that there
are very small differences in the mean values of the capture time for more than
10 particles. Also, it shows the range bars, which indicate the standard deviations
over the 500 simulation runs. By taking the mean value of the capture time for 50
particles as a reference and finding the percentage decrease in the mean value of
the capture time at each selected population size as given in Table 4.1, it can be
found that the mean values of the capture time decrease by less than 5% when the
population size is greater than or equal to 10; thus, it is better to use the smallest
population size (i.e. Np = 10 particles) that satisfies the desired performance with
the shortest learning time. To determine the appropriate number of episodes for
the PSO-based FLC algorithm another Monte Carlo simulation is run 500 times
for different numbers of episodes, while the population size remained constant at
ten particles. Figure 4.2 shows the mean values of the capture time for different
numbers of episodes. It is clear that the mean value of the capture time decreases
as the number of episodes increases. Also, by taking the mean value of the capture
time for 1000 episodes as a reference, it is found that the mean values of the capture
time decrease by less than 5% after 500 episodes, as given in Table 4.2. Therefore,
the number of episodes selected is 500, and the PSO-based FLC algorithm learning
process is completed after 500 episodes and 10 particles; thus, the PSO-based FLC
learning process is completed after 5000 episodes.
Figure 4.1: The mean values of the capture time for the PSO-based FLC algorithm
for different population sizes. The range bars indicate the standard deviations
over the 500 simulation runs.
Figure 4.2: The mean values of the capture time for the PSO-based FLC algorithm
for different episode numbers. The range bars indicate the standard devia-
tions over the 500 simulation runs.
Table 4.1: The percentage decrease in the mean value of the capture time (s) for
1000 episodes as the number of particles increases.
Mean % Decrease
1 Particles 48.0710 460.0266
5 Particles 17.1092 99.3220
10 Particles 8.9645 4.4363
20 Particles 8.6282 0.5184
30 Particles 8.6128 0.3390
40 Particles 8.6093 0.2982
50 Particles 8.5837 -
Table 4.2: The percentage decrease in the mean value of the capture time (s) for
10 particles as the number of episodes increases.
Mean % Decrease
10 Episodes 35.1744 292.3744
25 Episodes 22.6430 152.5852
50 Episodes 16.5029 84.0917
100 Episodes 12.8111 42.9093
150 Episodes 11.4658 27.9023
200 Episodes 10.7588 20.0156
250 Episodes 10.3246 15.1721
300 Episodes 10.0300 11.8858
400 Episodes 9.6541 7.6926
500 Episodes 9.4067 4.9328
600 Episodes 9.2724 3.4347
700 Episodes 9.1634 2.2188
800 Episodes 9.0796 1.2840
900 Episodes 9.0155 0.5689
1000 Episodes 8.9645 -
As indicated in Section 4.4, the proposed learning technique goes through two
stages. In the first stage, the game is run for 40 episodes and 10 particles using
the PSO-based FLC algorithm given in Algorithm 4.1, and from an optimization
perspective the PSO algorithm works as a global optimizer in this stage. The PSO
algorithm is used to autonomously tune the FLC parameters, a process that typically
takes a few episodes. These parameters constitute the initial settings of the FLC for
the next stage, which also uses the QLFIS algorithm. In this stage, the QLFIS algo-
rithm works as a local optimizer to reach the final setting of the FLC parameters,
since it uses gradient descent as an updating technique. The learning process in this
stage takes 100 episodes to achieve the desired performance. After finding the ap-
propriate parameter settings, the performance of each learning algorithm must be
evaluated. Knowing the capture time and the PE paths of the DCS, this will provide
measures for evaluating the performance of each learning algorithm. The perfor-
mance can be evaluated by comparing the capture time and the PE paths resulting
from the learning algorithm, with those from the DCS. Therefore, for each learning
algorithm, a Monte Carlo simulation is run 500 times, and after each simulation run
the capture times are calculated for six different initial positions of the evader. Table
4.3 shows the mean values of the capture times for different initial evader positions,
using the PSO-based FLC algorithm, the QLFIS algorithm and the proposed learning
algorithm compared with those of the DCS, and comparisons of the PE paths using
these algorithms to the DCS are shown in Figures 4.3 – 4.5 (i.e., for the evader po-
sition (−6, 7)). From Table 4.3 and Figure 4.3, it can be seen that the mean values
of the capture time and the PE paths of the PSO-based FLC algorithm are slightly
different from those of the DCS, which means that the performance of the PSO-based FLC algorithm is convincing. Also, it shows that the pursuer succeeded in finding
its control strategy. But, the main problem here is that the PSO-based FLC learning
process is very slow in the final optimization stages [143]. The learning process
takes a total of 5000 episodes to achieve acceptable performance. Also, from Table
4.3 and Figures 4.4 – 4.5, the performance of the QLFIS algorithm and the proposed
learning algorithm are similar, and they approach the performance of the DCS with
respect to both the capture time and the PE paths. Moreover, the PE paths, when
using these algorithms and the DCS, are almost identical. In the proposed learning
technique, the learning process is completed after 40 × 10 + 100 = 500 episodes.
Table 4.4 shows the number of episodes required to complete the PSO-based FLC
algorithm, the QLFIS algorithm and the proposed learning algorithm. It is clear
that the latter needs fewer episodes for learning (i.e., it has a shorter learning time);
only 10% and 50% of the episodes required for the PSO-based FLC algorithm and
the QLFIS algorithm, respectively.
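To make the flow of the two-stage technique concrete, the following sketch outlines how the stages could be chained. It is only an illustrative assumption: the helpers run_episode, pso_update and qlfis_update are hypothetical stand-ins for the game simulation, the PSO update of Algorithm 4.1 and the QLFIS gradient update, respectively.

```python
import numpy as np

def two_stage_learning(run_episode, pso_update, qlfis_update,
                       n_particles=10, pso_episodes=40, qlfis_episodes=100):
    """Sketch of the two-stage PSO-based FLC + QLFIS learning technique.

    run_episode(params) -> capture time of one PE game played with the given
    FLC parameters (the fitness to minimize); pso_update performs one PSO
    velocity/position update of the swarm; qlfis_update performs one episode
    of QLFIS gradient tuning. All three helpers are assumptions.
    """
    # Stage 1: PSO acts as a global optimizer over candidate FLC parameters.
    swarm = [np.random.uniform(-1.0, 1.0, size=21) for _ in range(n_particles)]
    g_best, g_best_fit = swarm[0], np.inf
    for _ in range(pso_episodes):                    # 40 PSO episodes
        fitness = [run_episode(p) for p in swarm]    # one game per particle
        i = int(np.argmin(fitness))
        if fitness[i] < g_best_fit:                  # keep the global best
            g_best, g_best_fit = swarm[i].copy(), fitness[i]
        swarm = pso_update(swarm, fitness)

    # Stage 2: QLFIS (gradient descent) acts as a local optimizer, starting
    # from the PSO solution, for 100 further episodes.
    params = g_best.copy()
    for _ in range(qlfis_episodes):
        params = qlfis_update(params)
    return params
```

With 10 particles and one game per particle per PSO episode, this corresponds to the 40 × 10 + 100 = 500 episodes reported above.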
Table 4.3: Mean and standard deviation of the capture time (s) for different evader
initial positions for the case of only the pursuer learning.
Evader initial position (entries are mean (standard deviation)):

                          (−6, 7)           (−7,−7)            (2, 4)            (3,−8)            (−4, 5)           (−10,−12)
DCS                       9.6 (-)           10.4 (-)           4.5 (-)           8.5 (-)           6.8 (-)           16.0 (-)
PSO-based FLC algorithm   9.8692 (0.1899)   10.6162 (0.1676)   4.5882 (0.0798)   8.7774 (0.1390)   6.9824 (0.1106)   16.3760 (0.2582)
QLFIS algorithm           9.6196 (0.0417)   10.4006 (0.0077)   4.5004 (0.0063)   8.6004 (0.0089)   6.8030 (0.0193)   16.0046 (0.0219)
PSO-based FLC+QLFIS       9.6676 (0.0910)   10.4302 (0.0775)   4.5072 (0.0259)   8.6180 (0.0617)   6.8284 (0.0610)   16.0840 (0.1321)
Table 4.4: Total number of episodes for the different learning algorithms
                                 Total number of episodes
PSO-based FLC algorithm          5000
QLFIS algorithm                  1000
PSO-based FLC+QLFIS algorithm    500
Figure 4.3: The PE paths on the xy-plane using the PSO-based FLC algorithm for
the case of only the pursuer learning versus the PE paths when each player
followed its DCS.
Figure 4.4: The PE paths on the xy-plane using the QLFIS algorithm for the case of
only the pursuer learning versus the PE paths when each player followed its
DCS.
Figure 4.5: The PE paths on the xy-plane using the proposed learning algorithm
for the case of only the pursuer learning versus the PE paths when each player
followed its DCS.
A comparison of the computation time for the different learning algorithms
is given in Table 4.5, which shows that the computation time of the PSO-based
FLC+QLFIS algorithm is 3.53 and 4.67 times faster than the QLFIS and PSO-based
FLC algorithms, respectively.
Table 4.5: Mean and standard deviation of the computation time (s) for different
learning algorithms for the case of only the pursuer learning.
Mean Standard Deviation
PSO-based FLC algorithm 18.4283 0.2659
QLFIS algorithm 13.9162 0.1373
PSO-based FLC+QLFIS 3.9472 0.9378
4.5.2 Multi-Robot Learning
In this game, it is assumed that each robot has no information about the other
robot’s strategy. The goal is to make both robots (players) interact with one another
and self-learn their control strategies simultaneously. The results of the proposed
learning technique are compared with the results of the DCS and with those of the
PSO-based FLC algorithm and the QLFIS algorithm. The PSO-based FLC algorithm, the QLFIS algorithm and the proposed learning algorithm use the same parameter values as those in Section 4.5.1.
The mean values of the capture time for different initial evader positions using
the DCS, the PSO-based FLC algorithm, the QLFIS algorithm and the proposed two-
stage learning algorithm are given in Table 4.6. The PE paths using these algorithms
compared with the DCS are shown in Figures 4.6 – 4.8 (i.e., for the evader position
(−6, 7)). From Table 4.6 and Figure 4.6, it is clear that the mean values of the
capture time and the PE paths of the PSO-based FLC algorithm are slightly different
from the DCS. But, the main problem here is that the PSO-based FLC learning
process is very slow in the final optimization stages [143]. The learning process
takes a total of 5000 episodes to obtain acceptable performance. From Table 4.6
and Figures 4.7 – 4.8, it is evident that there are very small differences between the
performance of the QLFIS algorithm and the two-stage learning algorithm. Also,
the performance of the proposed learning algorithm is very close to that of the DCS
with respect to both the capture time and the PE paths. Moreover, in the two-stage
learning technique, the learning process is completed after 40 × 10 + 100 = 500
episodes. It is clear that the proposed learning algorithm needs fewer episodes for
learning (i.e., it has a shorter learning time); that is, 10% and 50% of the number
of episodes required for the PSO-based FLC algorithm and the QLFIS algorithm,
respectively.
Table 4.6: Mean and standard deviation of the capture time (s) for different evader
initial positions for the case of multi-robot learning.
Evader initial position (entries are mean (standard deviation)):

                          (−6, 7)           (−7,−7)            (2, 4)            (3,−8)            (−4, 5)           (−10,−12)
DCS                       9.6 (-)           10.4 (-)           4.5 (-)           8.5 (-)           6.8 (-)           16.0 (-)
PSO-based FLC algorithm   9.7784 (0.0412)   10.5002 (0.0045)   4.5990 (0.0100)   8.7002 (0.0045)   6.9014 (0.0118)   16.1670 (0.0475)
QLFIS algorithm           9.6424 (0.1356)   10.4448 (0.0663)   4.4434 (0.0553)   8.5850 (0.1280)   6.7448 (0.1017)   16.0320 (0.2560)
PSO-based FLC+QLFIS       9.6880 (0.1094)   10.4220 (0.1213)   4.5394 (0.0525)   8.6496 (0.0951)   6.8618 (0.0821)   16.0646 (0.1948)
Figure 4.6: The PE paths on the xy-plane using the PSO-based FLC algorithm
for the case of multi-robot learning versus the PE paths when each player
followed its DCS.
Table 4.7 shows a comparison of the computation time, which demonstrates that
the computation time of the proposed learning algorithm is 3.33 and 5.24 times
faster than the QLFIS and PSO-based FLC algorithms, respectively.
Figure 4.7: The PE paths on the xy-plane using the QLFIS algorithm for the case of
multi-robot learning versus the PE paths when each player followed its DCS.
Figure 4.8: The PE paths on the xy-plane using the proposed learning algorithm
for the case of multi-robot learning versus the PE paths when each player
followed its DCS.
Table 4.7: Mean and standard deviation of the computation time (s) for different
learning algorithms for the case of multi-robot learning.
Mean Standard Deviation
PSO-based FLC algorithm 35.7969 0.6138
QLFIS algorithm 22.7298 0.4542
PSO-based FLC+QLFIS 6.8285 1.9889
4.6 Conclusion
In this chapter, a two-stage learning technique that combines the PSO-based FLC
algorithm with the QLFIS algorithm is used to autonomously tune the parameters
of an FLC. The proposed technique has two key benefits. First, the PSO algorithm
is used as a global optimizer to quickly determine good initial parameter settings
of the FLC, and second, the gradient descent approach in the QLFIS algorithm is
used to accelerate convergence to the final FLC parameter settings. The proposed
technique is applied to mobile robots playing a differential form of a PE game, and
two versions of the game are considered. In the first game, the pursuer learns
its DCS by using rewards received from its environment, while the evader plays a
well-defined strategy of escaping along the LoS. In the second game, both players
interact in order to self-learn their control strategies simultaneously (dual learning).
Simulation results show that both the pursuer and the evader can learn their de-
fault control strategies based on rewards received from their environments. In both
games, the results indicate that the performance of the proposed learning technique
and the QLFIS algorithm are very close, and they approach the performance of the
DCS with respect to capture times and PE paths. Moreover, the performance of both the proposed learning technique and the QLFIS algorithm differs only slightly from that of the
PSO-based FLC algorithm. Finally, the proposed learning technique outperforms
the QLFIS algorithm and the PSO-based FLC algorithm with respect to both learn-
ing time and computation time, both of which are highly important for any learning
algorithm.
Chapter 5
Kalman Fuzzy Actor-Critic Learning
Automaton Algorithm for the
Pursuit-Evasion Differential Game
5.1 Introduction
In this chapter, an efficient learning algorithm that can autonomously tune the pa-
rameters of the Fuzzy Logic Control (FLC) of a mobile robot playing a Pursuit-
Evasion (PE) differential game is proposed. The efficiency is measured by how
quickly the learning agent can determine its control strategy; that is, how to reduce
the learning time. The proposed algorithm is a modified version of the Fuzzy-Actor
Critic Learning (FACL) algorithm that was proposed in [17], in which both the
critic and the actor are Fuzzy Inference Systems (FISs). It uses the Continuous Actor-
Critic Learning Automaton (CACLA) algorithm to tune the parameters of the FIS,
and is known as the Fuzzy Actor-Critic Learning Automaton (FACLA) algorithm.
FACLA is applied to two versions of PE games, and compared through simulation
with the FACL [17], the Residual Gradient Fuzzy Actor-Critic Learning (RGFACL) algorithm proposed in [22] and the PSO-based FLC+QLFIS [140] algorithms; the sim-
ulation results were published in [144]1. Following that, a decentralized learning
technique that enables two or more pursuers to capture a single evader in PE dif-
ferential games is proposed. The pursuers and the evader interact with each other
to self-learn their control strategies simultaneously by tuning their FLC parameters,
and the tuning process is based on Reinforcement Learning (RL). The proposed
learning algorithm uses the FACLA algorithm with the Kalman filter technique, and
is known as the Kalman-FACLA algorithm. The Kalman filter is used to estimate the
evader’s next position, allowing the pursuers to determine the evader’s direction to
avoid collisions among them and reduce the capture time. Awheda et al. [145]
also used the Kalman filter to estimate the evader’s position at the next time step
to allow the pursuer to find its Line-of-Sight (LoS) to the evader at the estimated
position. Such prediction was used for the single-pursuer single-evader differential
game. In this chapter, it is assumed that each pursuer knows only the instanta-
neous position of the evader and vice versa. Also, it is assumed that there is no
communication among the pursuers and each pursuer considers other pursuers as
part of its environment. This allows cooperation among the pursuers to be done in
a decentralized manner. The simulation results were published in [146]2.
This chapter is organized as follows: Section 5.2 explains the FACLA algorithm and its implementation, and Section 5.3 describes the n-pursuer single-evader PE game. The state estimation based on the Kalman filter is presented in Section 5.4, and the simulation results in Section 5.5. Finally, conclusions are provided in Section 5.6.

1. A. A. Al-Talabi, "Fuzzy actor-critic learning automaton algorithm for the pursuit-evasion differential game," in Proc. of the 2017 CACS International Automatic Control Conference, Pingtung, Taiwan, November 2017.

2. A. A. Al-Talabi and H. M. Schwartz, "Kalman fuzzy actor-critic learning automaton algorithm for the pursuit-evasion differential game," in Proc. of the 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016), Vancouver, Canada, pp. 1015-1022, July 2016.
5.2 Fuzzy Actor-Critic Learning Automaton (FACLA)
In RL, there are three well-known and widely used value function algorithms: actor-
critic, Q-learning and Sarsa [51, 147]. The first is employed to estimate a state-
value function V (s), and the other two are used to estimate an action-value function
Q(s, a).
The structure of the actor-critic learning system consists of two parts, an actor
and a critic. The actor is used to select an action for each state, and the critic is
applied to estimate V (s). The estimated V (s) helps critique the actions taken by
the actor to evaluate whether the system performances are better or worse than
expected [16]. The critic estimates V (s) using Temporal-Difference (TD) learning
[16].
In [17], Givigi et al. proposed an actor-critic learning technique called FACL, and
applied it to PE differential games. The critic and actor are FISs for each learning
agent. The learning technique works on problems that have continuous state and
action spaces [142]. Figure 5.1 shows the structure of the FACL system.
In this figure, the actor works as an FLC to determine the control signal ut. For
exploration, white Gaussian noise with zero mean and standard deviation σn is
added to the signal ut to generate the control signal uc. The other two blocks are
critics, and are used to estimate the state-value function, V (s).
The FIS of both the actor and critic is implemented using zero-order Takagi-
Sugeno (TS) rules with constant consequents [46]. For the learning agent i, the FIS
consists of two inputs and one output. The two inputs are z1 and z2 that correspond
to the angle difference δi and its derivative δ̇i, respectively [137]. The output for
the actor is the steering angle ui, while the output of the critic is the estimate of
Vi(s), where δi is given by
$$\delta_i = \tan^{-1}\left(\frac{y_e - y_p}{x_e - x_p}\right) - \theta_i, \qquad (5.1)$$
where (xe, ye) and (xp, yp) are the positions of the evader and pursuer, and θi is the
orientation of learning agent i.
Each input has three Gaussian Membership Functions (MFs) with the following
linguistic values: P (positive), Z (zero) and N (negative). Thus, depending on the
number of inputs and their corresponding MFs, the FIS has 21 parameters that can
be tuned during the learning phase.

Figure 5.1: Structure of the FACL system [17].

The tuned parameters are the means and the
standard deviations of all the input MFs and the consequent parameters of the fuzzy
rule base, as given in [137]. The fuzzy output is defuzzified into a crisp output using
the weighted average defuzzification method [138].
In [137], four methods of parameter tuning were investigated, and it was de-
termined that tuning all the parameters of the actor and only the consequent pa-
rameters of the critic is adequate to learn in the PE game and get the desired per-
formance. Hence, in this work the learning algorithm will tune all the parameters
of the actor and only the consequent parameters of the critic. Let KC refer to the consequent parameter vector of the critic and φA denote the parameter vector of the actor; these are updated according to the following gradient-based formulas
[17, 142]:
$$K_C(t+1) = K_C(t) + \eta\,\Delta_t\,\frac{\partial V_t(s_t)}{\partial K_C}, \qquad (5.2)$$

and

$$\phi_A(t+1) = \phi_A(t) + \xi\,\Delta_t\left(\frac{u_c - u_t}{\sigma_n}\right)\frac{\partial u_t}{\partial \phi_A}, \qquad (5.3)$$
where η and ξ are the learning rates of the critic and actor, respectively. They are defined as in [137], and ∆t is the TD-error. The terms $\frac{\partial V_t(s_t)}{\partial K_{C_l}}$ and $\frac{\partial u_t}{\partial \phi_A}$ are given by
$$\frac{\partial V_t(s_t)}{\partial K_{C_l}} = \bar{\omega}_l, \qquad (5.4)$$

and

$$\frac{\partial u_t}{\partial \sigma_{ij}} = \left(\frac{2(z_i - m_{ij})^2}{\sigma_{ij}^3}\right)\frac{(K - u_t)}{\sum_{l=1}^{L}\omega_l}\,\omega^T, \qquad (5.5a)$$

$$\frac{\partial u_t}{\partial m_{ij}} = \left(\frac{2(z_i - m_{ij})}{\sigma_{ij}^2}\right)\frac{(K - u_t)}{\sum_{l=1}^{L}\omega_l}\,\omega^T, \qquad (5.5b)$$

$$\frac{\partial u_t}{\partial K_l} = \bar{\omega}_l, \qquad (5.5c)$$
where ωl and ω̄l represent the firing strength and the normalized firing strength of rule l, respectively, and K and ω are the vectors of the rule consequents and the rule firing strengths, respectively [137].
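As an illustration of Equations (5.4)–(5.5), the following sketch (an assumption about one possible implementation, not code from the thesis) computes the zero-order TS output with weighted average defuzzification and its gradients with respect to the consequents, means and standard deviations, for the two-input, three-MF structure described above.

```python
import numpy as np

def fis_forward_and_grads(z, m, sigma, K):
    """Zero-order TS FIS with two inputs and three Gaussian MFs per input.

    z : (2,) crisp inputs, m / sigma : (2, 3) MF means and standard
    deviations, K : (9,) rule consequents (one rule per MF combination).
    Returns the crisp output u and the gradients du/dK, du/dm, du/dsigma.
    """
    mu = np.exp(-((z[:, None] - m) ** 2) / (sigma ** 2))    # MF degrees (2, 3)
    # Firing strength of rule (i, j) = mu[0, i] * mu[1, j] (product inference)
    w = (mu[0][:, None] * mu[1][None, :]).ravel()            # (9,)
    w_bar = w / w.sum()                                       # normalized strengths
    u = w_bar @ K                                             # weighted-average output

    dU_dK = w_bar                                             # Eqs. (5.4) and (5.5c)
    dU_dw = (K - u) / w.sum()                                 # d u / d omega_l
    rule_idx = np.stack(np.meshgrid(np.arange(3), np.arange(3),
                                    indexing="ij"), -1).reshape(-1, 2)
    dU_dm = np.zeros_like(m)
    dU_dsigma = np.zeros_like(sigma)
    for l, (i0, i1) in enumerate(rule_idx):
        for inp, j in ((0, i0), (1, i1)):
            dmu = 2 * (z[inp] - m[inp, j]) / sigma[inp, j] ** 2       # Eq. (5.5b) factor
            dsg = 2 * (z[inp] - m[inp, j]) ** 2 / sigma[inp, j] ** 3  # Eq. (5.5a) factor
            dU_dm[inp, j] += dU_dw[l] * w[l] * dmu
            dU_dsigma[inp, j] += dU_dw[l] * w[l] * dsg
    return u, dU_dK, dU_dm, dU_dsigma
```

The rule firing strengths use product inference over the two Gaussian membership degrees, which is what the factors 2(zi − mij)/σ²ij and 2(zi − mij)²/σ³ij in Equation (5.5) presuppose.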
The authors in [148] proposed an algorithm called CACLA for the RL field to
manage problems with continuous state and action spaces. With CACLA, the pa-
rameter vector of the critic is updated as before, while the parameter vector of the
actor is updated as follows:
$$\text{IF } \Delta_t > 0:\quad \phi_A(t+1) = \phi_A(t) + \xi\left(\frac{u_c - u_t}{\sigma_n}\right)\frac{\partial u_t}{\partial \phi_A}. \qquad (5.6)$$
The main differences between the update rule in Equation (5.6) and that in Equation (5.3)
are:
• In Equation (5.6), φA is updated only when ∆t is positive.
• In Equation (5.6), the value of the TD-error is not used.
In Equation (5.6), a positive ∆t means the current action taken is better than ex-
pected and should be reinforced. If a negative update occurs, as in Equation (5.3),
the update will guide the algorithm to select an action that does not necessarily
have a better value function (i.e., leads to positive ∆t). For this reason, the FACL
algorithm is modified to update the actor parameters according to the CACLA algo-
rithm. The modified algorithm is called the FACLA. The FACLA algorithm is given
in Algorithm 5.1.
Algorithm 5.1 Learning in the FACLA.
1. The premise and consequent parameters of the actor are initialized such that the evader can escape at the beginning of the game.

2. Set all the consequent parameters of the critic to zero. The premise parameters of the critic are initialized with the same values as those of the actor.
3. Set σn ← 0.08 and γ ← 0.95.
4. For each episode (game)
(a) Calculate the values of η and ξ as in [137].
(b) Initialize the position of the pursuer, (xp, yp) to (0, 0).
(c) Initialize the position of the evader, (xe, ye), randomly.
(d) Calculate the initial state, st = (δi, δ̇i).
(e) For each step (play) Do
i. Calculate the output of the Actor, ut, using the weighted average defuzzification.
ii. Calculate the output uc = ut +N (0, σn).
iii. Calculate the output of the Critic, Vt(st), using the weighted average defuzzification.
iv. For the current time step, run the game to observe the next state st+1.
v. Get the reward, rt+1.
vi. Calculate the output of the Critic at the next state st+1, Vt(st+1), using the weighted average defuzzification.
vii. Calculate the TD-error, ∆t.
viii. Calculate the gradients $\frac{\partial V_t(s_t)}{\partial K_{C_l}}$ and $\frac{\partial u_t}{\partial \phi_A}$ from Equation (5.4) and Equation (5.5), respectively.
ix. Update the parameters of the Critic from Equation (5.2).
x. Update the parameters of the Actor from Equation (5.6).
xi. Set st ← st+1.
(f) end for
5. end for
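A compact sketch of one play of Algorithm 5.1 (steps 4(e)i–xi) is given below. It is an illustrative assumption only: actor and critic are assumed to be objects exposing output(state), grad(state) and a flat parameter vector params, while game_step and reward are hypothetical environment helpers.

```python
import numpy as np

def facla_play(state, actor, critic, game_step, reward,
               eta, xi, gamma=0.95, sigma_n=0.08):
    """One play (time step) of the FACLA algorithm, Algorithm 5.1 step 4(e)."""
    u_t = actor.output(state)                       # FLC output
    u_c = u_t + np.random.normal(0.0, sigma_n)      # exploration noise
    v_t = critic.output(state)                      # V_t(s_t)

    next_state = game_step(state, u_c)              # run the game one step
    r = reward(next_state)                          # immediate reward r_{t+1}
    v_next = critic.output(next_state)              # V_t(s_{t+1})

    td_error = r + gamma * v_next - v_t             # TD-error, Delta_t

    # Critic: Eq. (5.2), tuning only the consequents (gradient = w_bar).
    critic.params += eta * td_error * critic.grad(state)

    # Actor: CACLA-style update, Eq. (5.6) -- applied only when Delta_t > 0,
    # and without multiplying by the TD-error itself.
    if td_error > 0:
        actor.params += xi * ((u_c - u_t) / sigma_n) * actor.grad(state)

    return next_state, td_error
```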
In this section, two versions of the two-player PE differential game are presented
to demonstrate the learning performance of the FACLA algorithm. In both versions,
it is assumed that the pursuer is faster than the evader, and the evader is more
maneuverable. Let Vp = 2 m/s, Ve = 1 m/s, −1 ≤ ue ≤ 1 and −0.5 ≤ up ≤ 0.5. The
wheelbases for the pursuer and evader are the same and they are equal to 0.3 m.
In each iteration, the motion for the pursuer starts from the origin with θp = 0, and
the evader's initial position is chosen randomly from a set of 64 different positions with θe = 0. The
capture radius is ℓ = 0.1 m and the sample time is T = 0.1 s. The game consists
of 600 plays, and will be terminated once the time exceeds 60 s or the pursuer
captures the evader.
5.2.1 Evader Follows a Default Control Strategy
Assume that the evader plays a Default Control Strategy (DCS) as defined
in [137]. Also assume that the pursuer does not have any information about the
evader’s strategy. The appropriate number of episodes for the FACL, RGFACL and
FACLA algorithms are obtained by running Monte Carlo simulation for each algo-
rithm 500 times for various numbers of episodes: 10, 25, 50, 100, 150,
200, 300, 400, 500, 600, 700, 800, 900 and 1000. Figure 5.2 shows the mean
values of the capture time for the different number of episodes. It shows that these
values typically decrease as the number of episodes increases. Interestingly, the
mean values of the capture time for the FACLA decrease more quickly than those for
the FACL and RGFACL algorithms. Also, a 2-tailed 2-sample t-test showed that the
differences in means are statistically significant at the 0.05 level (i.e., all
differences in means resulted in a p-value below 0.0001). By taking the mean values
of the capture time for 1000 episodes for the FACLA, FACL and RGFACL algorithms
as references and finding the percentage decrease in the mean value of the capture
time at each selected number of episodes, it can be shown that the mean values
of the capture time decrease by less than 5% only when the numbers of episodes
are greater than or equal to 150, 300 and 500, respectively. Thus, the numbers of episodes for the FACLA, FACL and RGFACL algorithms are set to these values, respec-
tively. In [140], the PSO-based FLC+QLFIS algorithm takes about 500 episodes to
achieve acceptable performance. Table 5.1 tabulates the number of episodes re-
quired to complete PSO-based FLC+QLFIS, RGFACL, FACL and FACLA algorithms,
and it is clear that the FACLA algorithm requires fewer episodes for learning (i.e., it has a shorter learning time). After learning, the mean values of the capture time for
different initial evader positions using the DCS, PSO-based FLC+QLFIS, RGFACL,
FACL and FACLA are given in Table 5.2.
Figure 5.2: The mean values of the capture time for the FACL, RGFACL and FA-
CLA algorithms for different episode numbers. The range bars indicate the
standard deviations over the 500 simulation runs.
Table 5.1: Total number of episodes for the different learning algorithms.
                                 Total number of episodes
PSO-based FLC+QLFIS algorithm    500
RGFACL algorithm                 500
FACL algorithm                   300
FACLA algorithm                  150
Table 5.2: Mean and standard deviation of the capture time (s) for different evader
initial positions for the case of only the pursuer learning.
Evader initial position (entries are mean (standard deviation)):

                       (−6, 7)           (−7,−7)            (2, 4)            (3,−8)            (−4, 5)
DCS                    9.6 (-)           10.4 (-)           4.5 (-)           8.5 (-)           6.8 (-)
PSO-based FLC+QLFIS    9.6676 (0.0910)   10.4302 (0.0775)   4.5072 (0.0259)   8.6180 (0.0617)   6.8284 (0.0610)
RGFACL algorithm       9.8334 (0.0698)   10.5730 (0.0806)   4.5994 (0.0427)   8.7476 (0.0692)   6.9898 (0.0580)
FACL algorithm         9.6840 (0.0625)   10.4078 (0.0310)   4.5042 (0.0201)   8.6028 (0.0188)   6.8276 (0.0600)
FACLA algorithm        9.7206 (0.0405)   10.4640 (0.0509)   4.5074 (0.0262)   8.6332 (0.0480)   6.8990 (0.0241)
The computation time for the different learning algorithms is given in Table 5.3,
which shows that the FACLA algorithm needs less time compared with the other
algorithms. The FACLA algorithm is 2.11, 2.40 and 7.62 times faster than the PSO-
based FLC+QLFIS, FACL and RGFACL algorithms, respectively.
Table 5.3: Mean and standard deviation of the computation time (s) for different
learning algorithms for the case of only the pursuer learning.
Mean Standard Deviation
PSO-based FLC+QLFIS 3.9472 0.9378
RGFACL algorithm 14.2861 0.4720
FACL algorithm 4.5069 0.1020
FACLA algorithm 1.8750 0.0447
5.2.2 Multi-Robot Learning
Here, it is assumed that each robot has no information about its opponent’s
strategy, and both need to learn their control strategies at the same time. The
learning process for each of the considered algorithms is implemented using the
same values as those given in Section 5.2.1, and, after learning, the results of the
FACLA algorithm are compared with the results of the DCS and with those of the
PSO-based FLC+QLFIS, RGFACL and FACL algorithms. Table 5.4 summarizes the
mean values of the capture time for different initial evader positions using the DCS,
PSO-based FLC+QLFIS, RGFACL, FACL and FACLA algorithms.
Table 5.4: Mean and standard deviation of the capture times (s) for different
evader initial positions for the case of multi-robot learning.
Evader initial position (entries are mean (standard deviation)):

                       (−6, 7)           (−7,−7)            (2, 4)            (3,−8)            (−4, 5)
DCS                    9.6 (-)           10.4 (-)           4.5 (-)           8.5 (-)           6.8 (-)
PSO-based FLC+QLFIS    9.6880 (0.1094)   10.4220 (0.1213)   4.5394 (0.0525)   8.6496 (0.0951)   6.8618 (0.0821)
RGFACL algorithm       9.5532 (0.4594)   10.2860 (0.4010)   4.4738 (0.1731)   8.5396 (0.2858)   6.7562 (0.3551)
FACL algorithm         9.7082 (0.1234)   10.4562 (0.0919)   4.4744 (0.0476)   8.6464 (0.0973)   6.8284 (0.0895)
FACLA algorithm        9.6938 (0.1330)   10.4028 (0.1401)   4.5298 (0.0598)   8.6236 (0.1131)   6.8608 (0.1059)
Table 5.5 shows the computation time for the different learning algorithms. It
demonstrates that the FACLA algorithm is 2.32, 2.44 and 8.87 times faster than the
PSO-based FLC+QLFIS, FACL and RGFACL algorithms, respectively.
It can be concluded from Table 5.2 and Table 5.4 that the capture times of the
different learning algorithms approach those of the DCS, which means that the play-
ers are able to learn their DCSs. The advantage of FACLA over the other learning
algorithms considered is that it has the lowest learning time, as demonstrated in Ta-
ble 5.5. Also, the 2-tailed 2-sample t-test is performed comparing the computation
time for the FACLA algorithm with those of the PSO-based FLC+QLFIS, RGFACL
Table 5.5: Mean and standard deviation of the computation time (s) for different
learning algorithms for the case of multi-robot learning.
Mean Standard Deviation
PSO-based FLC+QLFIS 6.8285 1.9889
RGFACL algorithm 26.1200 1.6964
FACL algorithm 7.1903 0.2032
FACLA algorithm 2.9448 0.1223
and FACL algorithms given in Table 5.3 and Table 5.5, and it showed a statistically significant difference among the means (p-value less than 0.0001).
5.3 Learning in n-Pursuer One-Evader PE Differential
Game
In this section, the complexity of the two-player PE differential game is increased
by adding more pursuers. The new pursuers are also car-like mobile robots with
dynamic equations as defined in [26]. The FACLA algorithm is used as a learning
algorithm for this game, since it takes fewer episodes for each player to learn how to find its control strategy, as explained in Section 5.2. As a special case, the problem
of a two-pursuer one-evader PE differential game will be addressed, and a gen-
eralization to the case of the n-pursuer one-evader game will also be given. The PE
differential game model with two-pursuer and one-evader is shown in Figure 5.3.
The FLC of each player is as explained in Section 5.2. It has two inputs and one
output. In this application, it was found that three Gaussian MFs for each input are enough to attain the desired performance. For each pursuer pi, where i = 1, 2, the
two inputs are the pursuer’s angle difference δpi and its derivative δpi , where δpi is
5.3. LEARNING IN N -PURSUER ONE-EVADER PE DIFFERENTIAL GAME 122
𝐭𝐚𝐧−𝟏 (𝒚𝒆 − 𝒚𝒑𝟏𝒙𝒆 − 𝒙𝒑𝟏)
x
y
xe xp1
ye
yp1
Vp1
Ve
Pursuer1
Evader
𝜽𝒆
𝜽𝒑𝟏𝜹𝒆
𝜹𝒑𝟏
𝜽𝒑𝟐𝜹𝒑𝟐 𝐭𝐚𝐧−𝟏 (𝒚𝒆 − 𝒚𝒑𝟐𝒙𝒆 − 𝒙𝒑𝟐)
yp2
xp2
Vp2
P2E
P1E
EP
Pursuer2
Figure 5.3: The PE differential game model with two-pursuer and one-evader.
the angle difference between the pursuer’s velocity vector−→Vpi , and the LoS vector
−−→PiE from pi to the evader; the output is the pursuer’s steering angle upi . For the
evader, the two inputs are the angle difference δe and its derivative δe, where δe is
the angle difference between its velocity vector and its intended escape direction;
the output is the evader’s steering angle ue. The intended escape direction of the
evader should consider the presence of the two pursuers, and how far they are from
the evader. The escape direction can be defined by
$$\overrightarrow{EP}_{dir} = (x_{dir}, y_{dir}) = \frac{w\,\overrightarrow{P_1E} + \frac{1}{w}\,\overrightarrow{P_2E}}{\left\| w\,\overrightarrow{P_1E} + \frac{1}{w}\,\overrightarrow{P_2E} \right\|}, \qquad (5.7)$$
where w is a weighting factor that depends on the distances between the evader
and each pursuer, and it is given by
$$w = \frac{\left\|\overrightarrow{P_2E}\right\|}{\left\|\overrightarrow{P_1E}\right\|}. \qquad (5.8)$$
So, δe can be defined by

$$\delta_e = \arctan\left(\frac{y_{dir}}{x_{dir}}\right) - \theta_e. \qquad (5.9)$$
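The evader's intended escape direction and angle difference of Equations (5.7)–(5.9) can be computed as in the following sketch (illustrative only; arctan2 is used instead of the plain arctangent to avoid quadrant ambiguity, which is an implementation choice rather than part of the thesis).

```python
import numpy as np

def evader_escape_angle(evader_pos, p1_pos, p2_pos, theta_e):
    """Escape direction and angle difference for the two-pursuer game."""
    p1e = np.asarray(evader_pos) - np.asarray(p1_pos)     # vector P1E
    p2e = np.asarray(evader_pos) - np.asarray(p2_pos)     # vector P2E
    w = np.linalg.norm(p2e) / np.linalg.norm(p1e)          # weighting factor, Eq. (5.8)
    ep = w * p1e + (1.0 / w) * p2e                          # weighted escape vector
    ep_dir = ep / np.linalg.norm(ep)                        # Eq. (5.7)
    delta_e = np.arctan2(ep_dir[1], ep_dir[0]) - theta_e    # Eq. (5.9), via atan2
    return ep_dir, delta_e
```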
To implement the FACLA algorithm for the two-pursuer one-evader game, the immediate reward $r_{t+1}$ for all the players in the game must be calculated. For each pursuer $p_i$, where i = 1, 2, let $r^{p_i}_{t+1}$ represent the reward for pursuer $p_i$, which is calculated as in [137]. The evader's immediate reward $r^{e}_{t+1}$ is calculated
as follows:
$$r^{e}_{t+1} = -\sum_{i=1}^{2} r^{p_i}_{t+1}. \qquad (5.10)$$
Generally, for the n-pursuer one-evader PE differential game, it is assumed that
the control strategy of each pursuer remains the same, and the FLC structure of all
players is as mentioned previously. Also, it is assumed that the evader’s goal is to
learn how to escape from the nearest pursuer. Therefore, the evader would have to
take into consideration the presence of these pursuers and their relative distances.
Hence, at each time step the evader needs to calculate its distance from all pursuers
to determine which one is the closest. Thus, δe can be defined as the angle difference
between the evader’s direction and the LoS from the nearest pursuer to the evader,
and is calculated from Equation (5.1). Also, the evader's reward function $r^{e}_{t+1}$ can
be defined as follows:
$$r^{e}_{t+1} = -r^{p_c}_{t+1}, \qquad (5.11)$$

where $p_c$ denotes the nearest pursuer to the evader.
5.3.1 Predicting the Interception Point and its Effects
According to the pursuers’ DCSs as defined in [137], at each instant of time
the pursuers attempt to capture the evader by following their LoS to the evader.
However, if the pursuers follow this strategy the possibility of collision among them
is high, and the capture time might not be the minimum one. If each pursuer can
predict its interception point with the evader, $\hat{E}$, and move directly to this point as shown in Figure 5.4 (i.e., the pursuer redirects its LoS toward the predicted interception location along $\overrightarrow{P_i\hat{E}}$ instead of toward the instantaneous position of the evader), the capture time and the potential for collision among the pursuers can be reduced. To do that, it was assumed in [149] that the pursuers know the evader's velocity vector in order to find the modified LoS, $\overrightarrow{P_i\hat{E}}$. To find $\overrightarrow{P_i\hat{E}}$, it is necessary to first determine the values of the angles $\beta_i$ and $\alpha_i$.

Figure 5.4: Geometric illustration of the capturing situation.

The value of $\beta_i$ can be calculated as follows:
$$\beta_i = \arccos\left(-\frac{\overrightarrow{P_iE}\cdot\overrightarrow{EP}}{\left\|\overrightarrow{P_iE}\right\|\,\left\|\overrightarrow{EP}\right\|}\right), \qquad (5.12)$$
and the value of αi can be calculated according to the law of sines, as follows [150]:
$$\frac{V_e}{\sin(\alpha_i)} = \frac{V_{p_i}}{\sin(\beta_i)} \;\Rightarrow\; \alpha_i = \arcsin\left(\left(\frac{V_e}{V_{p_i}}\right)\sin(\beta_i)\right). \qquad (5.13)$$
Once the values of βi and αi are known, the magnitude of $\overrightarrow{EP}$ can be calculated from

$$\left\|\overrightarrow{EP}\right\| = \left\|\overrightarrow{P_iE}\right\|\,\frac{\sin\alpha_i}{\sin(\alpha_i + \beta_i)}, \qquad (5.14)$$
and the evader’s velocity vector−→EP can be calculated as follows:
−→EP = ‖
−→EP‖(
−→EP )dir. (5.15)
Finally, $\overrightarrow{P_i\hat{E}}$ will be determined by:

$$\overrightarrow{P_i\hat{E}} = \overrightarrow{P_iE} + \overrightarrow{EP} = (x', y'). \qquad (5.16)$$
After finding $\overrightarrow{P_i\hat{E}}$ for each pursuer, the predicted angle difference $\hat{\delta}_{p_i}$, i.e., the angle difference between $\overrightarrow{P_i\hat{E}}$ and the pursuer's velocity vector $\vec{V}_{p_i}$, can be expressed as:

$$\hat{\delta}_{p_i} = \arctan\left(\frac{y'}{x'}\right) - \theta_{p_i}. \qquad (5.17)$$
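A sketch of the interception-point prediction of Equations (5.12)–(5.17) is shown below. It is an illustrative assumption; it takes the evader's motion direction ep_dir as given (the assumption of [149] in this section) or as estimated, for example with the Kalman filter of Section 5.4.

```python
import numpy as np

def predicted_angle_difference(p_pos, e_pos, ep_dir, v_p, v_e, theta_p):
    """Predicted angle difference toward the interception point, Eqs. (5.12)-(5.17).

    p_pos, e_pos : pursuer and evader positions
    ep_dir       : unit vector of the evader's motion direction
    v_p, v_e     : pursuer and evader speeds; theta_p : pursuer orientation
    """
    pie = np.asarray(e_pos) - np.asarray(p_pos)                   # LoS vector P_iE
    beta = np.arccos(-np.dot(pie, ep_dir) / np.linalg.norm(pie))  # Eq. (5.12)
    alpha = np.arcsin((v_e / v_p) * np.sin(beta))                 # Eq. (5.13)
    ep_len = np.linalg.norm(pie) * np.sin(alpha) / np.sin(alpha + beta)  # Eq. (5.14)
    ep = ep_len * ep_dir                                           # Eq. (5.15)
    x_hat, y_hat = pie + ep                                        # Eq. (5.16): modified LoS
    return np.arctan2(y_hat, x_hat) - theta_p                      # Eq. (5.17)
```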
5.4 State Estimation Based on a Kalman Filter
As explained in Section 5.3.1, to find the modified LoS, $\overrightarrow{P_i\hat{E}}$, it was assumed in [149] that the pursuers would know the evader's velocity vector $\overrightarrow{EP}$. In this study this assumption is unnecessary, because the Kalman filter predicts the unknown states or
variables of interest. The Kalman filter estimates the states of a dynamical system
based on a linear model, which can be written in a discrete state space form as
follows [85]:
$$x(k+1) = F(k)\,x(k) + G(k)\,u(k) + v(k), \qquad (5.18)$$

and the measurement model can be described by

$$y(k) = H(k)\,x(k) + w(k). \qquad (5.19)$$
Since it is assumed that the evader moves with constant velocity, Newton’s equa-
tions of motion can give a simple dynamic system model to describe the evader’s
motion. Thus, to estimate the evader’s position (i.e., (xe(k+1), ye(k+1))) using the
Kalman filter, the following Constant Velocity Model (CVM) can be used
$$x(k+1) = \begin{bmatrix} x_e(k+1) \\ y_e(k+1) \\ v_{x_e}(k+1) \\ v_{y_e}(k+1) \end{bmatrix} = \begin{bmatrix} 1 & 0 & T & 0 \\ 0 & 1 & 0 & T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_e(k) \\ y_e(k) \\ v_{x_e}(k) \\ v_{y_e}(k) \end{bmatrix} + v(k), \qquad (5.20a)$$
$$y(k) = \begin{bmatrix} x_e(k) \\ y_e(k) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_e(k) \\ y_e(k) \\ v_{x_e}(k) \\ v_{y_e}(k) \end{bmatrix} + w(k), \qquad (5.20b)$$
where T represents the sampling time, $(x_e(k), y_e(k))$ refers to the evader's position at time step k, and $(v_{x_e}(k), v_{y_e}(k))$ refers to its velocity components.
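For reference, a minimal sketch of a standard linear Kalman filter predict/update cycle with the CVM matrices of Equation (5.20) is given below. It is an illustrative assumption; the noise covariances Q and R are designed as discussed in Section 5.4.1.

```python
import numpy as np

def cvm_matrices(T):
    """State-transition and measurement matrices of the CVM, Eq. (5.20)."""
    F = np.array([[1, 0, T, 0],
                  [0, 1, 0, T],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    return F, H

def kalman_step(x, P, z, F, H, Q, R):
    """One standard predict/update cycle of the linear Kalman filter."""
    # Prediction
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the position measurement z = (x_e, y_e)
    nu = z - H @ x_pred                          # residual (innovation)
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ nu
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, nu
```

The returned residual nu is the quantity used later by the fuzzy fading memory filter of Section 5.4.3.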
Generally, the design of an appropriate system model represents an important
issue for the Kalman filter to work properly. Therefore, if the evader moves in a
straight line or if there is a small change in its path or velocity, the CVM, described
by Equation (5.20), will enable the Kalman filter to estimate the evader’s position
accurately. But, if the evader accelerates or manoeuvres like a moving car that can accelerate/decelerate or make a turn, the CVM will fail. Therefore,
there is a necessity to change the model of Equation (5.20) to cope with such issues.
So, the new model should take the acceleration into consideration. Thus, the new
model is called Constant Acceleration Model (CAM) and can be defined by
$$x(k+1) = \begin{bmatrix} 1 & 0 & T & 0 & T^2/2 & 0 \\ 0 & 1 & 0 & T & 0 & T^2/2 \\ 0 & 0 & 1 & 0 & T & 0 \\ 0 & 0 & 0 & 1 & 0 & T \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} x(k) + v(k), \qquad (5.21a)$$
$$y(k) = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix} x(k) + w(k), \qquad (5.21b)$$
where $x(k) = [x_e(k)\ \ y_e(k)\ \ v_{x_e}(k)\ \ v_{y_e}(k)\ \ a_{x_e}(k)\ \ a_{y_e}(k)]^T$, and $(a_{x_e}(k), a_{y_e}(k))$ are the evader's acceleration components at time step k.
After selecting the appropriate Kalman filter model to accurately estimate the evader's position at the next time step, $\hat{E} = (\hat{x}_e(k+1), \hat{y}_e(k+1))$, each pursuer can predict the position where capture can take place by invoking the instantaneous position of the evader, $E = (x_e(k), y_e(k))$, and the estimated next position of the evader, $\hat{E}$, as shown in Figure 5.5. Thus, each pursuer can move in the direction of the expected capture point of the evader, rather than following its LoS to the estimated position at the next time step, $\overrightarrow{P_i\hat{E}}$ [145].

Figure 5.5: Geometric illustration of the capturing situation using the estimated position.

The evader's velocity vector $\overrightarrow{EP}$ can be calculated as follows:
$$\overrightarrow{EP} = \hat{V}_e\,(\overrightarrow{E\hat{E}})_{dir}, \qquad (5.22)$$
5.4. STATE ESTIMATION BASED ON A KALMAN FILTER 129
where
Ve =
√
(xe(k + 1)− xe(k))2 + (ye(k + 1)− ye(k))2
T, (5.23)
and
$$(\overrightarrow{E\hat{E}})_{dir} = \arctan\left(\frac{\hat{y}_e(k+1) - y_e(k)}{\hat{x}_e(k+1) - x_e(k)}\right). \qquad (5.24)$$
5.4.1 The Design of Filter Parameters
After determining the appropriate system model, the measurement and process
noise covariance matrices, Rf (k) and Qf (k), respectively, should be constituted or
selected carefully. In general, Rf (k) and Qf (k) can be considered as tuning factors.
The design of Rf(k) is quite easy compared to the design of Qf(k). Every measurement sensor has manufacturer's specifications that give an idea about the value of Rf(k), and most trusted sensors give readings close to the true values. It is also possible to find the variance of the measurement noise by
taking some off-line measurements [86]. For example, if the Kalman filter is used
to estimate the xy-coordinate of the evader (i.e., for the work presented in this
chapter, it is assumed that the xy-coordinate could be measured directly from a
position sensor, such as a camera), and if it is assumed that there is no correlation
between the noises of the two measurement sensors, then Rf (k) can be defined as
follows:
$$R_f(k) = \begin{bmatrix} \sigma^2_{x_o} & 0 \\ 0 & \sigma^2_{y_o} \end{bmatrix}, \qquad (5.25)$$
where $\sigma^2_{x_o}$ and $\sigma^2_{y_o}$ represent the variances for the x and y coordinate sensors, respectively. Hence, finding Rf(k) is generally straightforward. A natural question is: what happens if Rf(k) is larger or smaller than
its actual value. If Rf(k) is too large, the filter is told that the measurement is very noisy, so it will trust the prediction more than the sensor reading, and this may cause the filter to exhibit slow convergence
than the sensor reading, and this may cause the filter to exhibit slow convergence
or even divergence. On the other hand, if Rf (k) is small the filter will favor the
measurement over the prediction and this will cause the filter to follow the noisy
measurement [151].
In contrast with Rf (k), the design of Qf (k) represents a difficult task, because
the estimated states might not be readily observable. Qf (k) is used to take into
consideration any unmodeled disturbances that can influence the system dynam-
ics. In other words, it accounts for the uncertainty in the model itself. It can be
represented by the following equation [152]:
$$Q_f(k) = \int_0^T F(t)\,Q_c\,F^T(t)\,dt, \qquad (5.26)$$
where Qc represents the covariance of the continuous noise. As an example, for the
dynamic system described by Equation (5.21), Qf (k) can be calculated to be:
$$Q_f(k) = \sigma^2_w \begin{bmatrix} T^5/20 & 0 & T^4/8 & 0 & T^3/6 & 0 \\ 0 & T^5/20 & 0 & T^4/8 & 0 & T^3/6 \\ T^4/8 & 0 & T^3/3 & 0 & T^2/2 & 0 \\ 0 & T^4/8 & 0 & T^3/3 & 0 & T^2/2 \\ T^3/6 & 0 & T^2/2 & 0 & T & 0 \\ 0 & T^3/6 & 0 & T^2/2 & 0 & T \end{bmatrix}, \qquad (5.27)$$
where $\sigma^2_w$ is the variance of the white noise. An appropriate value for $\sigma^2_w$ is usually selected by trial and error. If this value is too large, the filter is told that the magnitude of the disturbances that can affect the state evolution is large; it then has no choice but to trust the measurements, and it will follow them even if they are significantly noisy. On the other hand, if $\sigma^2_w$ is too small, the Kalman filter will trust the prediction, because a small process noise variance means that the process model is accurate, and this might lead to filter divergence if the process model does not reflect reality.
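Assuming the symmetric form of Equation (5.27), the following sketch builds Qf(k) for the CAM by first forming the single-axis block and then interleaving the x and y components in the state ordering of Equation (5.21). The function name and structure are illustrative assumptions.

```python
import numpy as np

def cam_process_noise(T, sigma_w2):
    """Process noise covariance of the CAM, Eq. (5.27).

    sigma_w2 is the white-noise variance; the 3x3 single-axis block covers
    (position, velocity, acceleration) and is expanded to the 6x6 matrix
    ordered as [x, y, vx, vy, ax, ay].
    """
    q_axis = sigma_w2 * np.array([[T**5 / 20, T**4 / 8, T**3 / 6],
                                  [T**4 / 8,  T**3 / 3, T**2 / 2],
                                  [T**3 / 6,  T**2 / 2, T       ]])
    Qf = np.zeros((6, 6))
    for r in range(3):
        for c in range(3):
            Qf[2 * r,     2 * c]     = q_axis[r, c]   # x components
            Qf[2 * r + 1, 2 * c + 1] = q_axis[r, c]   # y components
    return Qf
```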
5.4.2 Kalman Filter Initialization
For Kalman filter implementation, the initial estimate of the state, x(0), and the initial state covariance, P(0), should be specified. Practically, if the designer has good prior knowledge of these values for the problem to be solved, this gives a good starting point for the Kalman filter to work properly. The choice of x(0) is based on the designer's knowledge, and some designers use the first measurement as part of the initial state vector. Without such knowledge, x(0) can be set to zero or any reasonable value.
On the other hand, the P (0) matrix is usually defined as a diagonal matrix, with
the variances of the estimated error in the state vector along the corresponding
diagonal elements. The selection of P (0) depends on the designer’s knowledge
about x(0). So, if there is sufficient information about x(0), then it is better to set small values along the diagonal of P(0). This setting indicates to the Kalman filter that x(0) represents a good initialization. Otherwise, each diagonal element of P(0) is set to a reasonably large number.
5.4.3 Fuzzy Fading Memory Filter
For the implementation of the Kalman filter, the system model is presumed to
be precisely known, otherwise the filter may not give an appropriate state estimate
and may diverge [87]. So, to handle the modeling error that can arise in the system
model, the Fading Memory Filter (FMF) was proposed. FMF is considered as one of
the generalizations of the Kalman filter. It can be implemented easily by using the
same Kalman filter equations, except with a modification of the error covariance
prediction equation, which is given by
$$P(k+1|k) = \alpha_f^2\,F(k)\,P(k|k)\,F^T(k) + Q_f(k), \qquad (5.28)$$
where αf ≥ 1, and is usually selected to be close to 1.
For the FMF, the value of αf is chosen to be a constant, and is selected based
on trial and error or the designer’s knowledge. A constant αf may not make the
filter respond precisely to the dynamic change of the estimated system, and may
not give the best performance. Therefore, a zero-order TS fuzzy system model is
proposed and used to find an appropriate value for αf at every time step to improve
the filter behaviour. Thus, the resulting filter can be considered as an adaptive one.
The adaptation process is based on the mean and covariance of the residuals, as the
residuals indicate to the filter how well the estimated measurements fit the actual ones. In other words, the residuals provide a degree of fitting between the estimated and actual sensor readings, such that if they are not white noise, this is a sign that the filter does not perform as required [153]. Hence, the residual information provides a measure, or a Degree of Divergence (DOD) [153].
The TS fuzzy system has two inputs and one output. The two inputs are the
DOD parameters, µ and ξ, which represent the average magnitude of the residual
and the trace of the residual covariance matrix at the current time step divided
by the number of measurements [154], respectively, and the output is the weight
factor αf . The DOD parameters are calculated as follows:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} \lvert \nu_i(k) \rvert, \qquad (5.29a)$$

$$\xi = \frac{\nu^T(k)\,\nu(k)}{m}, \qquad (5.29b)$$
where ν(k) denotes the residual vector.
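The DOD parameters of Equation (5.29) and the fading-memory covariance prediction of Equation (5.28) can be sketched as follows. The value of αf itself would come from the zero-order TS fuzzy system defined by the MFs of Figure 5.6 and the FDT of Table 5.6; that inference step is omitted here, so the snippet is only a partial, illustrative sketch.

```python
import numpy as np

def dod_parameters(nu):
    """DOD parameters of Eq. (5.29) from the residual vector nu."""
    m = len(nu)
    mu = np.mean(np.abs(nu))        # Eq. (5.29a): average residual magnitude
    xi = float(nu @ nu) / m         # Eq. (5.29b): residual covariance trace / m
    return mu, xi

def fading_memory_predict(x, P, F, Q, alpha_f):
    """Prediction step of the fading memory filter, using Eq. (5.28)."""
    x_pred = F @ x
    P_pred = (alpha_f ** 2) * (F @ P @ F.T) + Q   # inflated error covariance
    return x_pred, P_pred
```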
5.4.4 Kalman Filter Model Selection
To select an appropriate Kalman filter model that can be used to address the
problem mentioned in Section 5.3, four models are taken into consideration, which
are CVM, CAM, fuzzy FMF based on CVM, and fuzzy FMF based on CAM. The last
two, can be named in short fuzzy CVM and fuzzy CAM. Also, three examples are
given. In the first example, it is assumed that the evader’s movement is based on
its DCS, and in the second one, it is assumed that the evader’s movement is based
on the modified DCS [20]. The final example is similar to the second one, except
that the evader can turn with maximum velocity [22]. In all of these examples, it is
assumed that the Kalman filter is used to estimate the evader’s position. For each
example, the CVM, CAM, and their fuzzy FMF counterparts, are implemented and
their performances are compared. The comparison is based on finding the value of
the average Root Mean Square Error (RMSE) (i.e., the average root mean square
of the differences between the true and estimated position) by running the Monte
Carlo simulation 500 times for each example and filter model. The MFs and the
FDT of the fuzzy FMF are taken as shown in Figure 5.6 and Table 5.6.
Figure 5.6: MFs of the inputs µ and ξ.
Table 5.6: FDT of the fuzzy FMF.
          ξ = Z    ξ = S    ξ = L
µ = Z     1.06     1.03     1.04
µ = S     1.03     1.05     1.02
µ = L     1.01     1.03     1.02
Example 5.1 In this example, the Kalman filter is used to track the evader’s
movement by estimating its position at each time step. The evader’s motion is based
on its DCS. Also, the evader’s motion is assumed to start from the position (−6, 7).
So, the initial state vector x(0) is given by
$$x(0) = \begin{cases} [-6,\ 7,\ 0,\ 0]^T, & \text{for the CVM and its fuzzy FMF counterpart,} \\ [-6,\ 7,\ 0,\ 0,\ 0,\ 0]^T, & \text{for the CAM and its fuzzy FMF counterpart.} \end{cases}$$
The initial estimation error covariance matrix, P (0), for the CVM and its fuzzy
FMF counterpart is given by
$$P(0) = 3 \times \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad (5.30)$$
and for the CAM and its fuzzy FMF counterpart it is given by
$$P(0) = 3 \times \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}. \qquad (5.31)$$
For the process and measurement noise covariance matrices, Qf (k) and Rf (k),
the values of σw, σxo, and σyo are set to 0.01, 0.05, and 0.05, respectively. The
simulation results are as shown in Table 5.7 and Figures 5.7 – 5.9. Table 5.7 gives
the position average RMSE and its standard deviation, which shows that the perfor-
mance of the CAM Kalman filter is slightly better than that of the CVM Kalman filter,
as the CAM Kalman filter can track the beginning of the evader's turning path more quickly than
the CVM Kalman filter. Also, it shows that the performances of both the fuzzy CVM
and fuzzy CAM Kalman filters are better than those of the non-fuzzy ones, and the
fuzzy CVM Kalman filter is the best. Figure 5.7 demonstrates the ability of each filter to estimate the evader's position in the xy-plane, while Figures
5.8 – 5.9 show the same ability to estimate the x and y positions, separately. It can
be seen that all the filters provide an acceptable state estimation, though the Fuzzy
CVM Kalman filter is the most accurate one.
Table 5.7: Mean and standard deviation of the RMSE (cm) for the evader’s position
estimate of Example 5.1.
The position average RMSE Standard Deviation
CVM Kalman Filter 5.0549 0.6728
CAM Kalman Filter 5.0280 0.5482
Fuzzy CVM Kalman Filter 3.8403 0.4093
Fuzzy CAM Kalman Filter 4.2909 0.3442
Example 5.2 In this example, it is assumed that the evader’s motion is based
on its modified DCS, which allows the evader to take advantage of its higher
maneuverability. Also, it is assumed that all the filters are simulated based on the
information given in Example 5.1, except σw is set equal to 0.03.
Table 5.8 and Figures 5.10 – 5.12 demonstrate the position estimation accuracy
for each filter. Table 5.8 shows that the CVM and CAM Kalman filters have large
average RMSE compared with their fuzzy counterparts. Also, Figure 5.10 gives an
indication that there are modeling uncertainties in both the CVM and CAM Kalman filters.
Figure 5.7: The evader’s position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.8: The evader’s x-position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.9: The evader’s y-position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Thus, when the evader starts to take a sharp turn at time 9.0 s, as seen in
Figures 5.11 – 5.12, both filters might diverge. On the other hand, the results show
that both the fuzzy CVM and fuzzy CAM Kalman filters provide better performance
compared to the CVM and CAM Kalman filters, as they have the ability to handle the modeling uncertainties through the use of the fuzzy FMF. Also, the results show that the
fuzzy CAM Kalman filter is the best.
Table 5.8: Mean and standard deviation of the RMSE (cm) for the evader’s position
estimate of Example 5.2.
The position average RMSE Standard Deviation
CVM Kalman Filter 18.4862 0.3513
CAM Kalman Filter 18.2201 0.4244
Fuzzy CVM Kalman Filter 10.7093 0.3505
Fuzzy CAM Kalman Filter 8.5407 0.3723
Example 5.3 This example is similar to the second one, except that the evader
can turn with maximum velocity [22]. It is assumed that all filters are simulated
based on the information given in Example 5.2.
Simulation results are as shown in Table 5.9 and Figures 5.13 – 5.15. Table 5.9
shows that the position average RMSEs are large compared to the result of Example
5.2, because the filters are less able to continuously track the evader's sharp, fast turn. Also, at time 9.2 s, the results show that all filters give inaccurate
state estimation, and the non-fuzzy filters are the worst.
Figure 5.10: The evader’s position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Table 5.9: Mean and standard deviation of the RMSE (cm) for the evader’s position
estimate of Example 5.3.
The position average RMSE Standard Deviation
CVM Kalman Filter 34.4452 0.1957
CAM Kalman Filter 36.6180 0.2353
Fuzzy CVM Kalman Filter 20.1055 0.1993
Fuzzy CAM Kalman Filter 16.2257 0.2114
Figure 5.11: The evader’s x-position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.12: The evader’s y-position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.13: The evader’s position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.14: The evader’s x-position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.15: The evader’s y-position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
5.5 Computer Simulation
To compare the performance of the Kalman-FACLA algorithm with the FACLA algo-
rithm, a Monte Carlo simulation is run 500 times for each learning algorithm, and
two cases are considered. The first case is a two-pursuer one-evader game, while
the second one is a three-pursuer one-evader game. It is assumed that the pursuers
are faster than the evader (i.e. Vp1 = Vp2 = Vp3 = 0.5 m/s and Ve = 0.3 m/s),
and that −0.8 ≤ ue ≤ 0.8 and −0.5 ≤ up1 , up2 , up3 ≤ 0.5. The wheelbases of the
pursuers and the evader are the same and they are equal to 0.2 m. In each episode,
the evader’s motion starts from the origin with an initial orientation of θe = 0, while
the pursuers’ motions are chosen randomly from a set of 64 different positions with
θp1 = θp2 = θp3 = 0. The selected capture radius is ℓ = 0.1 m, and the sample time
is T = 0.1 s. The number of episodes/games is 200, and the number of plays in
each game is 600. The game terminates when the time exceeds 60 s, or one of the
pursuers captures the evader. From Example 5.1, it is found that the fuzzy CVM
Kalman filter gives the best performance compared with the other models, there-
fore the Kalman-FACLA algorithm is based on this model. All the filter parameters
are selected as in Example 5.1.
5.5.1 Case 1: Two-Pursuer One-Evader Game
For this case, it is assumed that the game is played with two pursuers attempting
to learn how to capture a single evader, and the evader is learning how to escape
or extend the capture time. It is also assumed that each player has no information
about the other players’ strategies. In addition, it is assumed that each pursuer
only knows the instantaneous position of the evader, and vice versa. The goal is
to make all players interact with each other to self-learn their control strategies
simultaneously, using either the FACLA algorithm or the Kalman-FACLA algorithm.
After the learning processes are complete, the performance of each learning
technique is tested by running the game with different sets of pursuers’ initial po-
sitions. The performance is assessed based on the capture time and the possibility
of collision between pursuers. The mean value and standard deviation of the cap-
ture time for different initial pursuer positions, using the FACLA algorithm and the
Kalman-FACLA algorithm, are given in Table 5.10. From the table it is clear that the
mean values of the capture time using the Kalman-FACLA algorithm are less than
those of the FACLA algorithm. For example, in the first test set the pursuers
took approximately 21.1 s to capture the evader using the FACLA algorithm, and
approximately 19.2 s using the Kalman-FACLA algorithm. Also, the means and
standard deviations of the capture time obtained with the Kalman-FACLA algorithm
are compared with those obtained with the FACLA algorithm using a two-tailed
two-sample t-test. For all tested pursuers' initial positions, the difference in means
was found to be statistically significant at the 0.05 level (the resulting p-values are
below 0.0001).
Figure 5.16 and Figure 5.17, and it is evident that all players are capable of learning
their control strategies by interacting with each other using either one of the two
algorithms. However, the main difference between these algorithms is the potential
for collision between pursuers. Figure 5.16 shows that this is likely using the FACLA
algorithm, but reduced or diminished using the Kalman-FACLA algorithm, as shown
in Figure 5.17.
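The significance test reported above can be reproduced with a standard two-sample t-test; a minimal sketch, assuming the per-run capture times of each algorithm have been saved to hypothetical text files, is:

    from scipy import stats
    import numpy as np

    # Hypothetical arrays of 500 Monte Carlo capture times per algorithm.
    t_facla  = np.loadtxt("capture_times_facla.txt")
    t_kfacla = np.loadtxt("capture_times_kalman_facla.txt")

    # Two-tailed two-sample t-test on the difference in mean capture time.
    t_stat, p_value = stats.ttest_ind(t_facla, t_kfacla)
    print(f"t = {t_stat:.3f}, p = {p_value:.2g}")  # significant if p < 0.05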
Table 5.10: Mean and standard deviation of the capture time (s) for a two-pursuer
one-evader game for different pursuers' initial positions.

Pursuers' initial positions   (−2,−5), (−3, 6)    (−5, 3), (−2, 6)    (−6,−2), (5, 3)     (−4, 3), (5,−2)
                              Mean     Std. dev.  Mean     Std. dev.  Mean     Std. dev.  Mean     Std. dev.
FACLA algorithm               21.0776  0.6689     28.6800  0.2516     19.5232  0.4689     18.7682  0.4228
Kalman-FACLA algorithm        19.1984  0.6094     27.2504  1.2528     16.7122  0.2557     15.5420  0.3337
[Figure 5.16: four panels of PE paths in the (x-position (m), y-position (m)) plane, each showing the trajectories of Pursuer 1, Pursuer 2 and the Evader for the initial positions (a) Pursuer 1 (−2,−5), Pursuer 2 (−3, 6); (b) Pursuer 1 (−5, 3), Pursuer 2 (−2, 6); (c) Pursuer 1 (−6,−2), Pursuer 2 (5, 3); (d) Pursuer 1 (−4, 3), Pursuer 2 (5,−2).]

Figure 5.16: The PE paths using the FACLA algorithm for a two-pursuer one-evader game for different pursuers' initial positions.
[Figure 5.17: four panels of PE paths in the (x-position (m), y-position (m)) plane, each showing the trajectories of Pursuer 1, Pursuer 2 and the Evader for the same four sets of initial positions as Figure 5.16.]

Figure 5.17: The PE paths using the Kalman-FACLA algorithm for a two-pursuer one-evader game for different pursuers' initial positions.
5.5.2 Case 2: Three-Pursuer One-Evader Game
For this case, the assumptions of the game are similar to those of the previous
case, but with three pursuers instead of two. It is also assumed that the control
strategy of the evader is to learn how to escape from the nearest pursuer. After
learning, the performance of the Kalman-FACLA algorithm is compared with the
performance of the FACLA algorithm. The comparison is based on testing each
learning algorithm by running the game with four different sets of pursuers’ initial
positions. The results are shown in Table 5.11 and Figures 5.18 – 5.19, which
demonstrate that the pursuers succeeded in finding their control strategies and
capturing the evader using either of the two learning algorithms. Also, from Table
5.11, it can be seen that using the Kalman-FACLA algorithm significantly reduces
the mean values of the capture time. For example, for the second test set, the
Kalman-FACLA algorithm reduces the mean capture time from 54.2762 s (for the
FACLA algorithm) to 43.0026 s. Furthermore, when applied to the results presented
in Table 5.11, the two-tailed two-sample t-test demonstrated a significant difference
between the means at the 0.05 level (the resulting p-values are less than 0.0001). Figure
5.18 shows that there is a possibility of collision among the pursuers when using
the FACLA algorithm, but this possibility is reduced or diminished when using the
modified algorithm, as shown in Figure 5.19.
Table 5.11: Mean and standard deviation of the capture time (s) for a three-pursuer
one-evader game for different pursuers' initial positions.

Pursuers' initial positions   (−6, 12), (3, 10), (6,−12)   (20,−6), (−20,−5), (4, 12)   (15, 5), (−15, 6), (7,−7)   (−4, 4), (−4,−5), (8, 4)
                              Mean     Std. dev.           Mean     Std. dev.           Mean     Std. dev.          Mean     Std. dev.
FACLA algorithm               39.6628  0.6605              54.2762  1.3017              39.0016  0.8454             19.6752  0.2940
Kalman-FACLA algorithm        32.8064  0.2098              43.0026  0.4314              31.8094  0.3121             16.1140  0.2063
[Figure 5.18: four panels of PE paths in the (x-position (m), y-position (m)) plane, each showing the trajectories of Pursuer 1, Pursuer 2, Pursuer 3 and the Evader for the initial positions (a) Pursuer 1 (−6, 12), Pursuer 2 (3, 10), Pursuer 3 (6,−12); (b) Pursuer 1 (20,−6), Pursuer 2 (−20,−5), Pursuer 3 (4, 12); (c) Pursuer 1 (15, 5), Pursuer 2 (−15, 6), Pursuer 3 (7,−7); (d) Pursuer 1 (−4, 4), Pursuer 2 (−4,−5), Pursuer 3 (8, 4).]

Figure 5.18: The PE paths using the FACLA algorithm for a three-pursuer one-evader game for different pursuers' initial positions.
[Figure 5.19: four panels of PE paths in the (x-position (m), y-position (m)) plane, each showing the trajectories of Pursuer 1, Pursuer 2, Pursuer 3 and the Evader for the same four sets of initial positions as Figure 5.18.]

Figure 5.19: The PE paths using the Kalman-FACLA algorithm for a three-pursuer one-evader game for different pursuers' initial positions.
5.6 Conclusion
In this chapter, a new fuzzy-reinforcement learning algorithm called FACLA is
proposed for PE differential games, to address the issue of reducing the learning
period that the players need to determine their control strategies. It is a modified
version of the FACL algorithm, and uses the CACLA algorithm to tune the parameters
of the FIS. The proposed algorithm was applied to two versions of the two-player
PE differential game and compared by computer simulation with the FACL [17],
the RGFACL [22] and the PSO-based FLC+QLFIS [140] algorithms. Simulation
results show that the FACLA algorithm allows each learning agent to reach its DCS
in less learning time than the other algorithms. Then the FACLA algorithm is
modified and applied to the problem of multi-pursuer single-evader PE differen-
tial games. The modification is accomplished by using the Kalman filter to enable
each pursuer to estimate the evader’s movement direction. By using the modifi-
cation, each pursuer can move to the expected interception point directly, rather
than following its LoS to the evader in an attempt to reduce the capture time and
collision potential among the pursuers. The modified FACLA algorithm is called the
Kalman-FACLA, and it works with PE differential games that have continuous state
and action spaces. It also works in a decentralized manner, because each pursuer
considers the other pursuers as part of the environment, and there is no communi-
cation or direct cooperation among them. The Kalman-FACLA algorithm is applied
to the problem of a multi-pursuer single-evader game with multiple pursuers at-
tempting to capture a single evader, and all players learning simultaneously. This
occurs in two cases of the game in particular: a two-pursuer one-evader game and a
three-pursuer one-evader game. In both cases, the simulation results show that the
Kalman-FACLA algorithm outperforms the FACLA algorithm by reducing the capture
time and the collision potential among pursuers.
Chapter 6
Multi-Player Pursuit-Evasion
Differential Game with Equal Speed
6.1 Introduction
This chapter focuses on the problem of multi-player Pursuit-Evasion (PE) differen-
tial games with a single-superior evader, in which all the players have equal speed.
In the literature, there are just few articles that address the multi-player PE dif-
ferential games with superior evaders [8, 25, 133–135] and without any type of
learning. However, Awheda et. al [23, 136] recently proposed two decentralized
learning algorithms for the PE game issue with a single superior-evader. The first
learning algorithm [23] was used to enable a group of pursuers with equal speed
to capture a single evader that has a speed identical to the speed of the pursuers.
This algorithm was based on the condition proposed in [8] and a specific forma-
tion control strategy. The second proposed learning algorithm [136] was used to
enable a group of equal speed pursuers to capture a single evader when its speed
is greater than or equal to the speed of the pursuers. It was based on Apollonius
circles and a modified formation control strategy. The drawback of both these al-
gorithms [23, 136] is that they must calculate the capture angle of each pursuer in
order to determine its control signal. Thus, in this chapter, a special type of reward
function is suggested for the Fuzzy Actor-Critic Learning Automaton (FACLA)
algorithm that enables a group of pursuers to capture a single evader in a
decentralized manner without knowing the capture angle. It is assumed that all players
in the game have identical speed. The game is played so each pursuer can learn
how to participate in capturing the evader by tuning its Fuzzy Logic Control (FLC)
parameters. The tuning process depends on the reward value that each pursuer
receives after each action taken. The suggested reward function depends on two
factors: the first is the difference in the Line-of-Sight (LoS) between each pursuer
in the game and the evader at two consecutive time instants, and the second is the
difference between two successive Euclidean distances between each pursuer and
the evader. The simulation results were published in [155].¹
This chapter is organized as follows: Section 6.2 describes the dynamic equa-
tions of the players, and the FLC structure is discussed briefly in Section 6.3. The
formulation of the reward function is provided in Section 6.4, and the FACLA algo-
rithm is briefly addressed in Section 6.5. In Section 6.6, the simulation results are
discussed, and conclusions are provided in Section 6.7.
¹A. A. Al-Talabi, "Multi-Player Pursuit-Evasion Differential Game with Equal Speed," in Proc. of the 2017 IEEE International Automatic Control Conference (CACS), (Pingtung, Taiwan), November 2017.
6.2 The Dynamic Equations of the Players
For the multi-player PE game presented in this chapter, it is assumed that there are
n pursuers (p1, p2, ..., pn) trying to capture a single evader e, and that all players have
identical capabilities. Let Vp and Ve denote the maximum velocities of each pursuer
and the evader, respectively, with Vp = Ve. The dynamic equations of the players are

ẋpi = Vp cos(θpi),  ẏpi = Vp sin(θpi),  (6.1)

ẋe = Ve cos(θe),  ẏe = Ve sin(θe),  (6.2)

−π ≤ θpi ≤ π,  −π ≤ θe ≤ π,
where (xpi , ypi) and (xe, ye) refer to the positions of the pursuer pi and evader e,
respectively. Also, θpi and θe refer to the pursuer, pi, and evader strategies, respec-
tively. At time t, the Euclidean distance between each pursuer pi and the evader is
defined by
Dpi(t) = √((xe(t) − xpi(t))² + (ye(t) − ypi(t))²),  i = 1, 2, ..., n.  (6.3)
The evader is captured if, at some time t, 0 ≤ t ≤ Tf, where Tf refers to the final
time, there is at least one pursuer pi such that Dpi(t) is less than a certain threshold
value ℓ, which is called the radius of capture. To ensure this, it is necessary to
satisfy the capture condition, which can be explained geometrically by Figure 6.1.
In Figure 6.1, Pi and E denote the initial positions of the pursuer pi and the evader
e, respectively. From this figure and according to the law of sines, the capture
condition can be defined by
Ve / sin(αi) = Vp / sin(βi)  ⇒  αi = βi and βi < βimax,  (6.4)

where

βimax = π/2.  (6.5)
[Figure 6.1: geometric illustration of the capture situation, showing the pursuer at Pi with velocity Vp and angle βi, the evader at E with velocity Ve and angle αi, the distance Dpi, and the maximum angle βimax.]

Figure 6.1: Geometric illustration for the capturing situation.

As shown in Figure 6.1, when the angle βi is less than βimax, it is obvious that the
pursuer pi can always find an angle αi that ensures the capture of the evader. Also,
it is clear that the pursuer pi can cover the evader's movement within an angle
of 2βi. Since βi < π/2 implies 2βi < π, the minimum number of pursuers required
to cover the full 2π angle around the evader is three. To summarize, the necessary
conditions for capturing the evader
are
1. There exist enough pursuers in the game; and,

2. At each instant of time there exists at least one pursuer satisfying Equation
(6.4); a minimal numerical check of this condition is sketched below.
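As a small illustration, condition (6.4) and the three-pursuer bound can be checked numerically; a minimal Python sketch, assuming the equal-speed case Vp = Ve of this chapter:

    import math

    def capture_possible(beta_i, Vp, Ve):
        # Condition (6.4)-(6.5): by the law of sines a capture angle alpha_i
        # exists only if Ve*sin(beta_i)/Vp is a valid sine and beta_i < pi/2;
        # with Vp = Ve this reduces to alpha_i = beta_i < pi/2.
        return abs(Ve * math.sin(beta_i) / Vp) <= 1.0 and beta_i < math.pi / 2

    def min_pursuers(beta_i):
        # Each pursuer covers an angle of 2*beta_i around the evader, so at
        # least ceil(2*pi / (2*beta_i)) pursuers are needed; since beta_i is
        # strictly below pi/2, this is never fewer than three.
        return math.ceil(2 * math.pi / (2 * beta_i))

    print(capture_possible(1.2, 1.0, 1.0))    # True (1.2 rad < pi/2)
    print(min_pursuers(math.pi / 2 - 0.01))   # 3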
6.3 Fuzzy Logic Controller Structure
An FLC with two inputs and one output is used for each learning agent pi. The
inputs are the x and y components of the Manhattan distance between the pursuer
pi and the evader, and the output is θpi . For each pursuer pi, the Manhattan distance
components are defined by:
Dxpi = xe(t) − xpi(t),  (6.6)

Dypi = ye(t) − ypi(t).  (6.7)
A two-input, one-output zero-order Takagi-Sugeno (TS) fuzzy model is used [46].
The two inputs are z1 and z2, which represent Dxpi and Dypi, respectively, and it is
assumed that each input has five triangular Membership Functions (MFs). Therefore,
it is necessary to build 25 rules, each with one consequent parameter Kl. The fuzzy
rules can be constructed using the Fuzzy Decision Table (FDT) shown in Table 6.1,
where A1, A2, ..., A5 and B1, B2, ..., B5 are the linguistic labels for the MFs of Dxpi
and Dypi, respectively. The fuzzy output θpi is defuzzified into a crisp output using
the weighted average defuzzification method [138].

Table 6.1: Fuzzy decision table

                   Dypi
            B1    B2    B3    B4    B5
Dxpi   A1   K1    K2    K3    K4    K5
       A2   K6    K7    K8    K9    K10
       A3   K11   K12   K13   K14   K15
       A4   K16   K17   K18   K19   K20
       A5   K21   K22   K23   K24   K25
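To make the rule evaluation concrete, a minimal Python sketch of this zero-order TS inference is given below; the placement of the five triangular MFs over a normalized input range and the product t-norm for the rule firing strengths are assumptions, since they are not fixed here.

    import numpy as np

    def tri_mf(x, left, peak, right):
        # Triangular membership function with feet at left/right and apex at peak.
        if x <= left or x >= right:
            return 0.0
        return (x - left) / (peak - left) if x <= peak else (right - x) / (right - peak)

    centers = np.linspace(-1.0, 1.0, 5)   # assumed MF centers, half-overlap partition

    def mf_grades(z):
        return np.array([tri_mf(z, c - 0.5, c, c + 0.5) for c in centers])

    def flc_output(z1, z2, K):
        # K is the 5x5 table of consequent parameters K_l (Table 6.1); the crisp
        # output is the weighted average of the K_l by the 25 rule firing strengths.
        w = np.outer(mf_grades(z1), mf_grades(z2))
        return float(np.sum(w * K) / (np.sum(w) + 1e-12))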
6.4 Reward Function Formulation
In this chapter, an actor-critic structure consisting of two main components (actor
and critic) is applied. Actor refers to the policy structure used to select an action
for the current system state, and critic refers to the estimated value function V (s),
that is used to criticize the action. After calculating V (s), the critic evaluates the
resulting state to determine whether the performance has improved or deteriorated
[16]. This evaluation is based on the TD-error ∆t, which is defined by

∆t = rt+1 + γVt(st+1) − Vt(st).  (6.8)

Based on ∆t, the critic can estimate V(s) as follows [16]:

Vt+1(st) = Vt(st) + α∆t.  (6.9)
From Equations (6.8) – (6.9), it is clear that the reward function plays an important
role in enabling the learning agent to accurately update its value function. The
choice of reward function depends on the problem to be addressed, and for the
problem under consideration, the main objective is to enable a team of n-pursuers
to learn how to capture a single evader by interacting with it. Thus, a special form
of reward function is suggested. The suggested reward function uses two factors to
help pursuers learn how to participate in capturing the evader. The first factor is the
difference in the LoS between each pursuer and the evader at two consecutive time
instants, ∆LoS(t), and the second factor is the difference between two successive
Euclidean distances between each pursuer and the evader ∆D(t). For the pursuer
pi, ∆LoSpi(t) is defined as follows:
∆LoSpi(t) = LoSpi(t)− LoSpi(t+ 1), (6.10)
where the LoS(t) between pi and the evader is defined by:
LoSpi(t) = tan⁻¹((ye(t) − ypi(t)) / (xe(t) − xpi(t))).  (6.11)
Also, ∆Dpi(t) is given by:
∆Dpi(t) = Dpi(t)−Dpi(t+ 1). (6.12)
The first factor ensures that the pursuers move according to the Parallel Guidance
Law (PGL), which means each pursuer that could capture the evader will move to
the capture point E, as shown in Figure 6.1. The pursuers that cannot capture
the evader will move in parallel with it, as in Figure 6.2, to ensure an invariant
angle distribution around the evader. Figure 6.2 identifies two paths along which
the pursuer can make ∆LoS approach zero: PiA and PiB. If the pursuer follows
the path PiB, the distance between it and the evader remains unchanged, but if the
pursuer follows the path PiA, the distance between them increases. Thus, the second
factor is used to give a positive reward to the pursuer that reduces this distance over
time, or at least keeps it equal to the initial distance.
[Figure 6.2: geometric illustration of a pursuer at Pi with velocity Vp and the evader at E with velocity Ve, showing the candidate paths toward points A, B and C, the angles ᾱi and βi, and the distance Dpi.]

Figure 6.2: Geometric illustration for the pursuer moving in parallel with the evader using the PGL.

According to the previous analysis, the reward function of the pursuer pi can be
defined as follows:

rpi(t+1) = 1.5 r1pi(t+1) + r2pi(t+1),  (6.13)
where r1pi(t+ 1) and r2pi(t+ 1) are defined as follows:
r1pi(t+1) = 2 e^(−∆LoSpi(t)² / 0.005) − 1,  (6.14)

and

r2pi(t+1) = ∆Dpi(t) / ∆Dpimax,  (6.15)
where ∆Dpimax refers to the maximum value of ∆Dpi and is calculated from
∆Dpimax = (Vpi + Ve)T, (6.16)
where T represents the sampling time.
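Collecting Equations (6.10) – (6.16), the reward computation for one pursuer can be sketched in Python as follows; atan2 is used in place of the plain arctangent of Equation (6.11) so that the LoS angle stays well defined in all quadrants.

    import numpy as np

    def reward(p_t, p_t1, e_t, e_t1, Vp=1.0, Ve=1.0, T=0.1):
        # p_t/p_t1 and e_t/e_t1 are the (x, y) positions of the pursuer and the
        # evader at times t and t+1.
        def los(p, e):                                 # Eq. (6.11)
            return np.arctan2(e[1] - p[1], e[0] - p[0])
        def dist(p, e):                                # Eq. (6.3)
            return np.hypot(e[0] - p[0], e[1] - p[1])
        d_los = los(p_t, e_t) - los(p_t1, e_t1)        # Eq. (6.10)
        d_d = dist(p_t, e_t) - dist(p_t1, e_t1)        # Eq. (6.12)
        r1 = 2.0 * np.exp(-d_los**2 / 0.005) - 1.0     # Eq. (6.14)
        r2 = d_d / ((Vp + Ve) * T)                     # Eqs. (6.15) - (6.16)
        return 1.5 * r1 + r2                           # Eq. (6.13)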
6.5 Fuzzy Actor-Critic Learning Automaton (FACLA)
In [144], the FACLA algorithm was proposed to address the problem of the PE
differential game; in this algorithm, both the actor and the critic are Fuzzy Inference
Systems (FISs), and the results showed that the FACLA algorithm reduces the time
players need to learn their control strategies. In this chapter, only the consequent
parameters of the
FIS are tuned. Let K^C_l and K^A_l represent the consequent parameters of the critic
and the actor in rule l, respectively. K^C_l and K^A_l are then updated according to the
following gradient-based formulas [144]:

K^C_l(t+1) = K^C_l(t) + η ∆t ∂Vt(st)/∂K^C_l,  (6.17)

IF ∆t > 0:  K^A_l(t+1) = K^A_l(t) + ξ ∆t (uc − ut) ∂ut/∂K^A_l,  (6.18)
where η and ξ are the learning parameters for the critic and actor, respectively. They
can be defined as follows:
η = 0.3 − 0.09 (iep / Max. Episodes),  (6.19)

ξ = 0.1 η,  (6.20)

where iep is the current episode. The terms ∂Vt(st)/∂K^C_l and ∂ut/∂K^A_l are given by

∂Vt(st)/∂K^C_l = ∂ut/∂K^A_l = ω̄l,  (6.21)

where ωl and ω̄l represent the firing strength and the normalized firing strength of
rule l, respectively. In Equation (6.18), a positive ∆t indicates that the current action should be
enforced. If a negative update were also allowed, the learning algorithm would not
necessarily lead to selecting an action that has a better value function (i.e., one that
leads to a positive ∆t), and thus negative updates are not considered.
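Putting Equations (6.8), (6.17) and (6.18) together, one FACLA update step can be sketched in Python as follows; the vectorized update over all rules is assumed, uc is taken to be the exploratory action actually executed while ut is the actor's output, and the fixed learning rates stand in for the schedule of Equations (6.19) – (6.20).

    import numpy as np

    def facla_step(KC, KA, wbar, r, V_s, V_s_next, u_c, u_t,
                   gamma=0.95, eta=0.3, xi=0.03):
        # KC, KA: consequent-parameter vectors of critic and actor; wbar: the
        # normalized firing strengths of all rules for the current state.
        delta = r + gamma * V_s_next - V_s         # TD-error, Eq. (6.8)
        KC += eta * delta * wbar                   # critic update, Eq. (6.17)
        if delta > 0:                              # CACLA-style actor update,
            KA += xi * delta * (u_c - u_t) * wbar  # Eq. (6.18)
        return KC, KA, delta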
6.6 Computer Simulation
For simulation purposes, it is assumed that there are three pursuers and one evader,
and their velocities are the same (Vp = Ve = 1 m/s). The selected sample time
is T = 0.1 s, and the capture radius is ℓ = 1 m. The game is played for 1000
episodes, with 200 as the maximum number of steps/plays in each game, so the
game terminates when the time exceeds 20 s, or when the capture condition is
satisfied. The goal of the simulation is to enable the pursuers to self-learn their
control strategies by interacting with the evader and continuously tuning their FLCs.
The evader's motion starts from the origin, and the pursuers' motions start from
(xp1, yp1) = (−5, 5), (xp2, yp2) = (5, 5), and (xp3, yp3) = (0,−5). The PE paths at the first
and final learning episodes are shown in Figure 6.3 and Figure 6.4, respectively.
Figure 6.3 indicates that each pursuer tries to explore the best actions to learn its
control strategy, while Figure 6.4 shows that all the pursuers can learn their control
strategies to capture the evader. As expected, it is clear that pursuer p3 moved to
the capture point, while pursuers p1 and p2 moved in parallel with the evader.
The average payoff for each pursuer in the game is shown in Figure 6.5. For
p1 and p2 it converges to 1.5, whereas it converges to 1.8 for pursuer p3. This is
because both p1 and p2 move using the PGL, but they cannot reduce their distance
to the evader over time, thus they receive less reward. However, p3 moves using the
PGL, and thereby reduces its distance to the evader over time and receives more
reward.

6.7 Conclusion

The suggested reward function is a combination of two factors. The first depends
on the difference in the LoS between each pursuer and the evader at two consecutive
time instants, which allows the
pursuers to move according to the parallel guidance law. The other factor depends
on the difference between two successive Euclidean distances between each pur-
suer and the evader, to ensure that the distance between them remains unchanged
or is reduced over time. From the computer simulations and the results, it is clear that
the FACLA algorithm with the suggested reward function enables each pursuer to
learn its control policy and participate in capturing the evader. The pursuers search
for their control strategies by interacting with the evader. The FACLA algorithm
based on the suggested reward function operates in a decentralized manner, since
each pursuer in the PE game regards the other players as part of its environment,
and does not communicate or have direct collaboration with them.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
Pursuit-Evasion (PE) type games have been used for decades, and several of them
have been studied extensively due to their potential for military application. The
concept can also be generalized to solve real-world applications. Four main tech-
niques are commonly used to solve the PE game problem: optimal control, dy-
namic programming, game theory and Reinforcement Learning (RL). The solution
complexity of the game increases rapidly with the number of players; thus, it is
better to use learning techniques that help each player find its control strategy.
Several learning algorithms to solve the problem of the PE game were proposed
previously, each with its own advantages and disadvantages. The disadvantages
are:
1- Computational requirements need to be investigated: The researchers did
not consider which set of parameters had a significant impact on the perfor-
mance of the learning algorithm. Some assumed that the learning process is
achieved by tuning all the sets of parameters, while others assumed that tun-
ing can be done using only one set of parameters. This choice will certainly
impact the computation time or computational requirements.
2- Long learning time: For each learning algorithm, the learning process re-
quires a specific number of episodes to achieve acceptable performance. Some
researchers set this number to be higher than necessary, which increases the
learning time.
3- High possibility of collision among pursuers: The implementation of these
algorithms for the problem of a PE game in which multiple pursuers try to
capture a single evader leads to a high probability of collision among the pursuers.
In addition, the capture time might not be the minimum one.
4- Knowing the speed of a superior evader: The learning algorithms previously
proposed for the case of single-superior evader assumed that the speed of the
evader is known by each pursuer.
In order to resolve these disadvantages, this thesis addresses the problem of PE
games from the learning perspective. In particular, it proposes several learning
algorithms that can be easily and efficiently applied to PE differential games. The
objectives of the proposed algorithms were:
• to investigate the possible reduction of computation time;
• to determine the possible reduction of time that the player needs to find its
control strategy (i.e., reducing the learning time);
• to determine how to avoid or reduce the possibility of collisions among the
pursuers, and how to reduce the capture time;
• to deal with the problem of the PE differential game with a superior evader, in
which the evader’s speed is equal to the maximum speed of the fastest pursuer
in the game; and,
• to determine how the learning algorithm can be implemented in a decen-
tralized manner in which each player considers the other players part of its
environment, and thus does not need to share its states or actions with other
players.
These objectives were met by investigating the previously proposed learning algo-
rithms, and implementing the newly proposed learning algorithms. The newly proposed
algorithms are based on fuzzy-reinforcement learning, the Particle Swarm Opti-
mization (PSO) algorithm, the Kalman filter and the concept of Parallel Guidance
Law (PGL). Fuzzy-reinforcement learning combines the RL with the Fuzzy Infer-
ence System (FIS), to deal with the PE game in its differential form. Here, the FIS
works as either a Fuzzy Logic Control (FLC), or a function approximator to manage
the problem of continuous state and action spaces. In this thesis, the PSO algorithm
works as a global optimizer for the FLC parameters to determine appropriate values
for the FLC parameter setting. Starting with these settings improves the initial func-
tionality of the FLC and speeds up the convergence to its final setting; thus the
learning process will be rapid. The Kalman filter was used to enable each pursuer
to estimate the evader’s velocity vector, which helps minimize both the capture time
and the potential for collision among pursuers. Finally, the concept of PGL was used
by each pursuer such that each pursuer that could capture the evader will move to
the expected capture point, while the pursuers that cannot capture the evader will
move in parallel with it, to ensure invariant angle distribution around the evader.
7.2 Contributions
The main contributions of this thesis are:
1-Reduced the Computational Time
Four methods of implementing the Q-Learning Fuzzy Inference System (QLFIS)
algorithm were proposed in Chapter 3 to analyze the possibility of reducing the
computational requirements of this algorithm without affecting its overall perfor-
mance. The analysis was based on whether it is necessary to tune all the parameters
of the FIS and the FLC, or just their consequential parameters. The four methods
were applied to three versions of PE games, and an evaluation of each game was
made to decide which parameters are the best to tune, and which have minimal
impact on performance. Simulation results in Section 3.7 showed that the perfor-
mance of the learning algorithm for each game depends on its parameter tuning
method. Also, it showed the possible reduction of computational time for each
game.
2-Reduced Learning Time
For this purpose two algorithms were proposed, as follows:
1. In Chapter 4, an unsupervised two-stage learning technique, known as the
PSO-based FLC+QLFIS, was proposed to reduce the learning time. The learn-
ing algorithm combines the PSO-based FLC algorithm with the QLFIS algo-
rithm. In the first stage, the PSO algorithm works as a global optimizer to au-
tonomously tune the parameters of the FLC, and in the second stage the QLFIS
algorithm acts as a local optimizer. For the PSO-based FLC+QLFIS learning al-
gorithm, the first stage is critical, since it provides the next stage with the best
initial parameter settings for the FLC. The PSO algorithm uses a simple for-
mula to update each particle, and it has low computational requirements. The
proposed technique was applied to the problem of the PE differential game,
and the simulation results in Section 4.5 show that the PSO-based FLC+QLFIS
learning algorithm requires fewer episodes to learn (i.e., the shortest learning
time) than the PSO-based FLC algorithm or the QLFIS algorithm. The find-
ings of this chapter can be used as a base for providing a similar learning
algorithm with a useful initial parameter setting. For example, using artificial
neural networks in an application requires an initialization step for its weights
which are typically initialized randomly, and using random weights can cause
inefficient start-up of the neural network. Therefore, it is possible to use the
PSO algorithm, as in the first stage of the proposed learning algorithm, to find
an acceptable setting for the weights in a few iterations.
2. In Chapter 5, a new fuzzy-reinforcement learning algorithm was proposed to
address the problems of PE differential games, and to reduce players’ learn-
ing time. The new algorithm uses the Continuous Actor-Critic Learning Au-
tomaton (CACLA) algorithm to tune the parameters of the FIS, and is known
as Fuzzy Actor-Critic Learning Automaton (FACLA) algorithm. The algorithm
was applied to different versions of the PE games, and it was compared by sim-
ulations to the Fuzzy-Actor Critic Learning (FACL), Residual Gradient Fuzzy
Actor-Critic Learning (RGFACL) and PSO-based FLC+QLFIS algorithms. The
simulation results demonstrated that the advantages of the FACLA algorithm
over other algorithms are due to having the shortest learning time (i.e., it
takes only 150 episodes for learning) and the lowest computation time, both
of which are important factors for any learning technique. These advantages,
as well as the fact that the FACLA algorithm can deal with the states and ac-
tions represented in continuous domains, indicate that the FACLA algorithm
can be used as a learning technique for an application with a continuous state
and action spaces.
3-Reduced Capture Time and Collision Potential Among Pursuers
In Chapter 5, a modified version of the FACLA algorithm was proposed for the
problem of multi-pursuer PE differential games with an inferior evader. The mod-
ification used the Kalman filter technique to enable each pursuer to estimate the
evader’s velocity vector. With this modification, each pursuer can move to the ex-
pected interception point directly, instead of following its line-of-sight to the evader
in an attempt to reduce the capture time and reduce the collision potential among
pursuers. In Section 5.5, the simulation results for both a two-pursuer one-evader
game and a three-pursuer one-evader game showed that the modified learning al-
gorithm outperforms the original learning algorithm by reducing both the capture
time and potential pursuer collisions.
4-Dealing with Multi-player PE Games with a Single-Superior Evader
In Chapter 6, a new reward function formulation that enables a group of pur-
suers to capture a single-superior evader in a decentralized manner, when all play-
ers have identical speeds, was proposed for the FACLA algorithm. It was used to
direct each pursuer to move to either the interception point with the evader, or in
parallel with it to ensure invariant angle distribution around the evader. Maintain-
ing an invariant angle of distribution around the evader reduces its maneuverability
and forces it toward another pursuer; thus, if the pursuers follow this strategy they
will capture the evader. In addition, there is no need to calculate the capture an-
gle of each pursuer in order to determine its control signal. The simulation results
showed that pursuers could learn their control strategies without knowing the cap-
ture angle. The results also showed how the pursuers learn to cooperate indirectly
and finally capture the evader.
7.3 Future Work
The problem of the multi-robot PE differential game is still an open research field,
particularly from a learning perspective, and there are several ideas to explore in the
near future. These ideas will focus on developing new learning algorithms to solve
the general case of the multi-player PE differential game when there are n pursuers
and m evaders, and some of the evaders are superior. Therefore, the direction of
future work can be organized as follows:
1. Exploring the benefits of learning algorithms and using the ideas in this thesis
will promote the development of learning algorithms for the general case of
PE differential games with more than one superior evader and players with
different capabilities (i.e., different speeds and maneuverabilities). The learn-
ing algorithm will find the best control strategy for all players in the game.
As mentioned in Section 2.7, the problem of multi-player PE games is diffi-
cult to solve due to the curse of dimensionality, and the problem becomes
more complex when there are multiple superior evaders in the game. A new
learning algorithm could decompose the problem into several PE games, such
as one-pursuer one-evader and multi-pursuer single evader; the latter is very
useful in the case of a superior evader. Solving the problem of PE games with
multiple superior evaders is difficult, particularly when there are two or more
superior evaders. For example, if two evaders are initially surrounded by a
number of pursuers, the evaders could collaborate to allow one of them to es-
cape. Future work could potentially address this problem and find a solution.
2. Most of the research on PE games assumes that all the players have a con-
stant velocity. This is usually not the case in real situations, where each player
can increase its maneuverability by accelerating or decelerating. In addition,
if the player is a car-like mobile robot, rollover is a problem closely related
to its motion. Thus, to avoid rollover when the player uses its maneuver-
ability to change the direction of movement, the velocity should also change
according to the rate of change of the motion direction. When a car turns,
centripetal force is generated at its center of mass, and the resulting torque
could cause the car to roll over. Depending on its design, a car rolls over when
the centripetal force exceeds the value Froll = M V²/Rturn, where M, V and
Rturn are the mass, velocity and turning radius of the moving car, respectively.
Thus, a car will not roll over if V²/Rturn is less than a certain factor Kroll, where
Kroll = Froll/M. There is a close relationship between the velocity and the turning
radius, which can be written as follows:

V(t) = √(Rturn Kroll).  (7.1)
From Equation (7.1), it is clear that it is impossible for a car to turn at a spe-
cific velocity without taking the curvature radius into consideration. There-
fore, if we assume that the minimum turning radius of the moving car is Rm
and it moves with velocity Vs(t), the car can turn safely with any turning
radius greater than or equal to Rs(t), where Rs(t) is given by

Rs(t) = max(Rm, V²s(t)/Kroll).  (7.2)
It is obvious that a moving car can rely on either its superior velocity or supe-
rior maneuverability. In PE games, the pursuer typically relies on its superior
velocity, and the evader on its superior maneuverability. Thus, if there are two
cars and one of them can turn with a velocity higher than the other, or it has
a smaller turning radius than the other assuming both have the same velocity,
this car has an advantage.
Therefore, regarding a PE game with such specifications, it would be inter-
esting to investigate how to find a learning algorithm with acceptable perfor-
mance. A small numerical illustration of Equations (7.1) – (7.2) is sketched
below.
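The values used for the minimum turning radius and the rollover factor in this Python sketch are assumed, purely for illustration:

    import math

    def safe_turn_radius(V, R_min, K_roll):
        # Eq. (7.2): the smallest radius the car can take at speed V without
        # rolling over, never below its mechanical minimum R_min.
        return max(R_min, V * V / K_roll)

    def max_safe_speed(R_turn, K_roll):
        # Eq. (7.1): the highest speed for a turn of radius R_turn.
        return math.sqrt(R_turn * K_roll)

    # Assumed values: R_min = 1 m, K_roll = 4 m/s^2.
    print(safe_turn_radius(V=3.0, R_min=1.0, K_roll=4.0))   # 2.25 m
    print(max_safe_speed(R_turn=2.25, K_roll=4.0))          # 3.0 m/s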
3. All the work presented in this thesis is based on computer simulation, and it
would be more interesting if the learning algorithms could also be validated
experimentally. Thus, future work could involve running the proposed learn-
ing algorithms in an experiment with real robots. To do this, an open source
hardware platform called TurtleBot could be used, or any other platform that
can act as a mobile robot. TurtleBot can be programmed and configured using
the Robot Operating System (ROS), with either C++ or python programming
languages. The flexible framework package in ROS is used for writing robot
software, and if a TurtleBot is powered by this software it can manage several
activities, including vision, localization, communication and mobility. For the
proposed learning algorithms, it is assumed that the pursuer instantaneously
knows the position of the evader, and vice versa. Therefore, it would be pos-
sible to use either an OptiTrack system or a stereo camera to determine robot
positions. OptiTrack is comprised of several synchronized infrared cameras,
like the one at the Royal Military College of Canada.
4. For each pursuer a two-actor structure can be proposed for the problem of
the PE games with a superior evader. The first actor is used to determine a
pursuer’s action when the evader moves within the pursuer’s capture area,
while the second actor is used to determine the pursuer’s action when the
evader moves outside the pursuer’s capture area.
5. It is possible to benefit from the concepts of intrinsic motivation [156] and
object-oriented representation [157] for designing intrinsically motivated
object-oriented PE differential games that make the learning process
manageable in large state and action spaces. Each player in this representa-
tion is considered to be an object.
6. Most RL algorithms, including the learning algorithms presented in this
work, adjust the parameters of the learning algorithm manually; that is, the
learning rates, discount factor and exploration rate. Hence, future work
could adjust these parameters using an adaptive method, and thereby make
the learning agent fully autonomous.
7. For the work presented in this thesis, it is assumed that the PE games are
played in an obstacle-free environment. Thus, it would be more interesting to
increase the complexity of the game by adding static or dynamic obstacles.
Therefore, in the future, the learning algorithm should be used to teach each
7.3. FUTURE WORK 180
player how to find its control strategy and at the same time how to avoid
obstacles.
Bibliography
[1] R. Isaacs, Differential games: A Mathematical Theory with Applications to War-
fare and Pursuit, Control and Optimization. Dover books on mathematics,
New York, NY: Dover Publ., 1999.
[2] V. D. Gesu, B. Lenzitti, G. L. Bosco, and D. Tegolo, “Comparison of differ-
ent cooperation strategies in the prey-predator problem,” in the Interna-
tional Workshop on Computer Architecture for Machine Perception and Sensing,
(Montreal, Canada), pp. 108–112, August 2006.
[3] G. Miller and D. Cliff, “Co-evolution of pursuit and evasion I: Biological and
game-theoretic foundations (tech. rep. csrp311),” tech. rep., 1994.
[4] C. Boesch, “Cooperative hunting roles among Tai chimpanzees,” Human Na-
ture, vol. 13, pp. 27–46, March 2002.
[5] D. Araiza-Illan and T. Dodd, “Biologically inspired controller for the au-
tonomous navigation of a mobile robot in an evasion task,” World Academy
of Science, Engineering and Technology, vol. 68, pp. 780–785, August 2010.
[6] S. A. Shedied, Optimal Control for a Two Player Dynamic Pursuit Evasion
Game; the Herding Problem. PhD thesis, Virginia Polytechnic Institute and
State University, January 2002.
[7] H. M. Schwartz, Multi-Agent Machine Learning: A Reinforcement Approach.
John Wiley & Sons, August 2014.
[8] M. Wei, G. Chen, J. B. J. Cruz, L. Hayes, and M. Chang, “A decentral-
ized approach to pursuer-evader games with multiple superior evaders,” in
2006 IEEE Intelligent Transportation Systems Conference, (Toronto, Canada),
pp. 1586–1591, September 2006.
[9] K. M. Passino, “Intelligent control: An overview of techniques,” Perspectives in
Control Engineering: Technologies, Applications, and New Directions, pp. 104–
133, 2001.
[10] H. R. Beom and H. S. Cho, “A sensor-based navigation for a mobile robot
using fuzzy logic and reinforcement learning,” IEEE Trans. on Systems, Man,
and Cybernetics, vol. 25, pp. 464–477, March 1995.
[11] A. Saffiotti, “The uses of fuzzy logic in autonomous robot navigation,” Soft
Computing-A Fusion of Foundations, Methodologies and Applications, vol. 1,
pp. 180–197, December 1997.
[12] A. Saffiotti, E. H. Ruspini, and K. Konolige, “Using fuzzy logic for mobile
robot control,” Practical Applications of Fuzzy Technologies, vol. 6, pp. 185–
205, 1999.
[13] E. Aguirre and A. Gonzalez, “Fuzzy behaviors for mobile robot navigation:
design, coordination and fusion,” International Journal of Approximate Rea-
soning, vol. 25, pp. 255–289, November 2000.
[14] A. Ollero, J. Ferruz, O. Sanchez, and G. Heredia, “Mobile robot path tracking
and visual target tracking using fuzzy logic,” in Fuzzy Logic Techniques for
Autonomous Vehicle Navigation, pp. 51–72, Springer, 2001.
[15] M. Sugiyama, Statistical Reinforcement Learning: Modern Machine Learning
Approaches. Chapman and Hall/CRC, March 2015.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cam-
bridge, MA: MIT Press, 1998.
[17] S. N. Givigi, H. M. Schwartz, and X. Lu, “A reinforcement learning adaptive
fuzzy controller for differential games,” Journal of Intelligent and Robotic
Systems, vol. 59, pp. 3–30, July 2010.
[18] S. F. Desouky and H. M. Schwartz, “Self-learning fuzzy logic controllers
for pursuit-evasion differential games,” Robotics and Autonomous Systems,
vol. 59, pp. 22–33, January 2011.
[19] B. M. Al Faiya and H. M. Schwartz, “Q(λ)-learning fuzzy controller for the
homicidal chauffeur differential game,” in Proc. of the 20th IEEE Mediter-
ranean Conference on Control and Automation (MED), (Barcelona, Spain),
pp. 247–252, July 2012.
[20] S. F. Desouky and H. M. Schwartz, “Q(λ)-learning adaptive fuzzy logic con-
trollers for pursuit-evasion differential games,” International Journal of Adap-
tive Control and Signal Processing, vol. 25, pp. 910–927, October 2011.
[21] X. Lu, Multi-Agent Reinforcement Learning in Games. PhD thesis, Carleton
University, March 2012.
[22] M. D. Awheda and H. M. Schwartz, “A residual gradient fuzzy reinforcement
learning algorithm for differential games,” International Journal of Fuzzy Sys-
tems, vol. 19, pp. 1058–1076, August 2017.
[23] M. D. Awheda and H. M. Schwartz, “Decentralized learning in pursuit-
evasion differential games with multi-pursuer and single-superior evader,”
in Proc. of the Annual IEEE Systems Conference (SysCon), (Orlando, USA),
pp. 1–8, April 2016.
[24] M. Wei, G. Chen, J. B. J. Cruz, L. S. Hayes, M. Chang, and E. Blasch, “A decen-
tralized approach to pursuer-evader games with multiple superior evaders
in noisy environments,” in 2007 IEEE Aerospace Conference, (Big Sky, USA),
pp. 1–10, March 2007.
[25] S. Jin and Z. Qu, “Pursuit-evasion games with multi-pursuer vs. one fast
evader,” in Proc. of the 8th World Congress on Intelligent Control and Automa-
tion (WCICA) 2010, (Jinan, China), pp. 3184–3189, July 2010.
[26] S. M. LaValle, Planning Algorithms. Cambridge University Press, May 2006.
[27] L. A. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, pp. 338–353, 1965.
[28] Y. Bai, H. Zhuang, and D. Wang, Advanced Fuzzy Logic Technologies in Indus-
trial Applications. Springer Science & Business Media, January 2007.
[29] H. Han, C. Y. Su, and Y. Stepanenko, “Adaptive control of a class of nonlinear
systems with nonlinearly parameterized fuzzy approximators,” IEEE Trans.
on Fuzzy Systems, vol. 9, pp. 315–323, April 2001.
[30] R. Coppi, M. A. Gil, and H. A. Kiers, “The fuzzy approach to statistical anal-
ysis,” Computational statistics & data analysis, vol. 51, pp. 1–14, November
2006.
[31] C. Von Altrock and J. Gebhardt, “Recent successful fuzzy logic applications
in industrial automation,” in Proc. of the Fifth IEEE International Conference
on Fuzzy Systems, vol. 3, pp. 1845–1851, September 1996.
[32] M. I. Chacon M, “Fuzzy logic for image processing: definition and applica-
tions of a fuzzy image processing scheme,” Advanced Fuzzy Logic Technologies
in Industrial Applications, pp. 101–113, 2006.
[33] A. Patel, S. K. Gupta, Q. Rehman, and M. Verma, “Application of fuzzy logic
in biomedical informatics,” Journal of Emerging Trends in Computing and
Information Sciences, vol. 4, pp. 57–62, January 2013.
[34] B. Bouchon-Meunier, “Some applications of fuzzy logic in data mining and
information retrieval,” in EUSFLAT Conference, pp. 21–21, September 2005.
[35] S. Mitra and S. K. Pal, “Fuzzy sets in pattern recognition and machine intel-
ligence,” Fuzzy sets and systems, vol. 156, pp. 381–386, December 2005.
[36] A. Salski, “Fuzzy logic approach to data analysis and ecological modelling,”
in Proc. of the European symposium on intelligent techniques (ESIT99), 1999.
[37] Z. Huang, K. Y. Lee, and R. M. Edwards, “Fuzzy logic control application in
a nuclear power plant,” IFAC Proc. Volumes, vol. 35, pp. 239–244, January
2002.
[38] C. T. Leondes, Fuzzy logic and expert systems applications, vol. 6. Elsevier,
1998.
[39] S. N. Sivanandam, S. Sumathi, and S. N. Deepa, Introduction to Fuzzy Logic
using MATLAB, vol. 1. Springer, January 2007.
[40] G. Chen and T. T. Pham, Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy
Control Systems. CRC press, November 2000.
[41] G. Feng, Analysis and Synthesis of Fuzzy Control Systems: A Model-Based Ap-
proach. CRC press, March 2010.
[42] A. Gad and M. Farooq, “Application of fuzzy logic in engineering problems,”
in IECON’01. The 27th Annual Conference of the IEEE Industrial Electronics
Society, vol. 3, pp. 2044–2049, November 2001.
[43] K. Self, “Designing with fuzzy logic,” IEEE Spectrum, vol. 27, pp. 42–44,
November 1990.
[44] K. M. Passino and S. Yurkovich, Fuzzy Control. Addison Wesley Longman,
Inc., 1998.
[45] E. H. Mamdani and S. Assilian, “An experiment in linguistic synthesis with a
fuzzy logic controller,” International Journal of Man-Machine Studies, vol. 7,
pp. 1–13, January 1975.
[46] T. Takagi and M. Sugeno, “Fuzzy identification of systems and its applica-
tions to modelling and control,” IEEE Trans. on Systems, Man, and Cybernet-
ics, vol. SMC-15, pp. 116–132, January 1985.
[47] T. J. Ross, Fuzzy Logic with Engineering Applications. John Wiley & Sons,
2004.
[48] Y. Shi and P. C. Sen, “A new defuzzification method for fuzzy control of power
converters,” in Conference Record of the 2000 IEEE Industry Applications Con-
ference. Thirty-Fifth IAS Annual Meeting and World Conference on Industrial
Applications of Electrical Energy (Cat. No.00CH37129), vol. 2, (Rome, Italy),
pp. 1202–1209, October 2000.
[49] C. Szepesvari, “Algorithms for reinforcement learning,” Synthesis Lectures on
Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.
[50] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A
survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, May
1996.
[51] M. A. Wiering, “QV (λ)-learning: A new on-policy reinforcement learning
algorithm,” in Proc. of the 7th European Workshop on Reinforcement Learning,
vol. 7, (Napoli, Italy), pp. 17–18, October 2005.
[52] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements
that can solve difficult learning control problems,” IEEE Trans. on Systems,
Man, and Cybernetics, pp. 834–846, September 1983.
[53] C. J. C. H. Watkins, Learning from delayed rewards. PhD thesis, King’s College,
Cambridge University, 1989.
[54] G. A. Rummery and M. Niranjan, “On-line Q-learning using connectionist
systems,” Tech. Rep. CUED/F-INFENF/TR 166, University of Cambridge, De-
partment of Engineering, September 1994.
[55] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8,
pp. 279–292, May 1992.
[56] X. Dai, C. K. Li, and A. B. Rad, “An approach to tune fuzzy controllers based
on reinforcement learning for autonomous vehicle control,” IEEE Trans. on
Intelligent Transportation Systems, vol. 6, pp. 285–293, September 2005.
[57] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neu-
ral Information Processing Systems, pp. 1008–1014, 2000.
[58] K. E. Parsopoulos, Particle swarm optimization and intelligence: advances and
applications. IGI global, January 2010.
[59] J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in Proc. of
the IEEE International Conference on Neural Networks, (Perth, Australia),
pp. 1942–1948, November 1995.
[60] R. C. Eberhart and J. Kennedy, “A new optimizer using particle swarm the-
ory,” in Proc. of the Sixth International Symposium on Micro Machine and
Human Science, (Nagoya, Japan), pp. 39–43, October 1995.
[61] J. Kennedy, “The particle swarm: social adaptation of knowledge,” in Proc. of
1997 IEEE International Conference on Evolutionary Computation (ICEC ’97),
(Indianapolis, USA), pp. 303–308, April 1997.
[62] W.-N. Chen, J. Zhang, H. S. Chung, W.-L. Zhong, W.-G. Wu, and Y.-H. Shi, “A
novel set-based particle swarm optimization method for discrete optimiza-
tion problems,” IEEE Trans. on Evolutionary Computation, vol. 14, pp. 278–
300, April 2010.
[63] R. C. Eberhart and Y. Shi, “Particle swarm optimization: developments, ap-
plications and resources,” in Proc. of the 2001 IEEE Congress on Evolutionary
Computation, vol. 1, (Seoul, South Korea), pp. 81–86, 2001.
[64] R. A. Vural, O. Der, and T. Yildirim, “Investigation of particle swarm opti-
mization for switching characterization of inverter design,” Expert Systems
with Applications, vol. 38, pp. 5696–5703, May 2011.
[65] J. Pugh, Y. Zhang, and A. Martinoli, “Particle swarm optimization for un-
supervised robotic learning,” in Proc. of the Swarm Intelligence Symposium,
(Pasadena, USA), pp. 92–99, June 2005.
[66] K. Veeramachaneni, T. Peram, C. Mohan, and L. A. Osadciw, “Optimization
using particle swarms with near neighbor interactions,” in Genetic and evolu-
tionary computation conference, pp. 110–121, Springer, July 2003.
[67] J. Robinson, S. Sinton, and Y. Rahmat-Samii, “Particle swarm, genetic al-
gorithm, and their hybrids: optimization of a profiled corrugated horn an-
tenna,” in Proc. of the IEEE Antennas and Propagation Society International
Symposium, vol. 1, (San Antonio, USA), pp. 314–317, June 2002.
[68] R. C. Eberhart and Y. Shi, “Comparison between genetic algorithms and par-
ticle swarm optimization,” in International conference on evolutionary pro-
gramming, pp. 611–616, Springer, March 1998.
[69] J. Kennedy and W. M. Spears, “Matching algorithms to problems: an experi-
mental test of the particle swarm and some genetic algorithms on the multi-
modal problem generator,” in Proc. of the 1998 IEEE International Conference
on Evolutionary Computation, (Anchorage, USA), pp. 78–83, May 1998.
[70] R. Martinez-Soto, O. Castillo, L. T. Aguilar, and P. Melin, “Fuzzy logic con-
trollers optimization using genetic algorithms and particle swarm optimiza-
tion,” in Mexican International Conference on Artificial Intelligence, pp. 475–
486, Springer, November 2010.
[71] A. Lazinica, Particle swarm optimization. InTech Kirchengasse, January 2009.
[72] D. Bratton and J. Kennedy, “Defining a standard for particle swarm optimiza-
tion,” in Proc. of the IEEE Swarm Intelligence Symposium, pp. 120 – 127, April
2007.
[73] Y. Dai, L. Liu, and Y. Li, “An intelligent parameter selection method for
particle swarm optimization algorithm,” in Proc. of the Fourth IEEE Interna-
tional Joint Conference on Computational Sciences and Optimization, (Yunnan,
China), pp. 960–964, April 2011.
[74] Y. Shi and R. C. Eberhart, “A modified particle swarm optimizer,” in Proc.
of the 1998 IEEE International Conference on Evolutionary Computation, (An-
chorage, USA), pp. 69–73, May 1998.
[75] W.-h. Zha, Y. Yuan, and T. Zhang, “Excitation parameter identification based
on the adaptive inertia weight particle swarm optimization,” Advanced Elec-
trical and Electronics Engineering, pp. 369–374, 2011.
[76] M. Clerc and J. Kennedy, “The particle swarm: Explosion, stability, and con-
vergence in a multi-dimensional complex space,” IEEE Trans. on Evolutionary
Computation, vol. 6, pp. 58–73, February 2002.
[77] Y. Shi and R. C. Eberhart, “Empirical study of particle swarm optimiza-
tion,” in Proc. of the 1999 IEEE Congress on Evolutionary Computation-CEC99,
vol. 3, (Washington, USA), pp. 1945–1950, July 1999.
[78] Y. Shi and R. C. Eberhart, “Parameter selection in particle swarm optimiza-
tion,” in Proc. of the 7th International Conference on Evolutionary Program-
ming VII, pp. 591–600, Springer-Verlag, March 1998.
[79] C.-H. Yang, C.-J. Hsiao, and L.-Y. Chuang, “Linearly decreasing weight parti-
cle swarm optimization with accelerated strategy for data clustering,” IAENG
International Journal of Computer Science, vol. 37, no. 3, p. 1, 2010.
[80] Y. Shi and R. C. Eberhart, “Fuzzy adaptive particle swarm optimization,” in
Proc. of the 2001 IEEE Congress on Evolutionary Computation, vol. 1, (Seoul,
South Korea), pp. 101–106, May 2001.
[81] M. A. Arasomwan and A. O. Adewumi, “On adaptive chaotic inertia weights
in particle swarm optimization,” in Proc. of the 2013 IEEE Symposium on
Swarm Intelligence (SIS), (Singapore, Singapore), pp. 72–79, April 2013.
[82] R. C. Eberhart and Y. Shi, “Comparing inertia weights and constriction fac-
tors in particle swarm optimization,” in Proc. of the 2000 IEEE Congress on
Evolutionary Computation, vol. 1, (La Jolla, USA), pp. 84–88, July 2000.
[83] R. C. Eberhart and Y. Shi, Computational Intelligence: Concepts to Implemen-
tations. Morgan Kaufmann Publishers Inc., 2007.
[84] J. Kennedy, “Some issues and practices for particle swarms,” in Proc. of the
IEEE Swarm Intelligence Symposium, (Honolulu, USA), pp. 162–169, April
2007.
[85] H. M. Choset, S. Hutchinson, K. M. Lynch, G. Kantor, W. Burgard, L. E.
Kavraki, and S. Thrun, Principles of robot motion: theory, algorithms, and
implementation. MIT press, 2005.
[86] G. Bishop and G. Welch, “An introduction to the Kalman filter,” SIGGRAPH
Course Notes, pp. 1–81, 2001.
[87] D. Simon, Optimal state estimation: Kalman, H∞, and nonlinear approaches.
John Wiley & Sons, 2006.
[88] A. W. Merz, “The homicidal chauffeur,” AIAA Journal, vol. 12, pp. 259–260,
March 1974.
[89] S. H. Lim, T. Furukawa, G. Dissanayake, and H. F. Durrant-Whyte, “A time-
optimal control strategy for pursuit-evasion games problems,” in Proc. of the
IEEE International Conference on Robotics and Automation, vol. 4, (New Or-
leans, USA), pp. 3962–3967, April 2004.
[90] M. E. Harmon, L. C. B. III, and A. H. Klopf, “Reinforcement learning applied
to a differential game,” Adaptive Behavior, vol. 4, pp. 3–28, September 1995.
[91] J. W. Sheppard, “Colearning in differential games,” Machine Learning,
vol. 33, pp. 201–233, November 1998.
[92] Y. Ishiwaka, T. Satob, and Y. Kakazu, “An approach to the pursuit problem on
a heterogeneous multiagent system using reinforcement learning,” Robotics
and Autonomous Systems, vol. 43, pp. 245–256, June 2003.
[93] L.-X. Wang and J. M. Mendel, “Fuzzy basis functions, universal approxima-
tion, and orthogonal least-squares learning,” IEEE Trans. on Neural Networks,
vol. 3, pp. 807–814, September 1992.
[94] L.-X. Wang, “Fuzzy systems are universal approximators,” in Proc. of the
IEEE International Conference on Fuzzy Systems, (San Diego, USA), pp. 1163–
1170, March 1992.
[95] H. Ying, “Sufficient conditions on general fuzzy systems as function approx-
imators,” Automatica, vol. 30, pp. 521–525, March 1994.
[96] J. L. Castro, “Fuzzy logic controllers are universal approximators,” IEEE
Trans. on Systems, Man, and Cybernetics, vol. 25, pp. 629–635, April 1995.
[97] J. L. Castro and M. Delgado, “Fuzzy systems with defuzzification are uni-
versal approximators,” IEEE Trans. on Systems, Man, and Cybernetics, Part B
(Cybernetics), vol. 26, pp. 149–152, February 1996.
[98] D. Wang, G. Wang, and R. Hu, “Parameters optimization of fuzzy controller
based on PSO,” in Proc. of the 3rd IEEE International Conference on Intelligent
System and Knowledge Engineering, vol. 1, (Xiamen, China), pp. 599–603,
November 2008.
[99] S. F. Desouky and H. M. Schwartz, “Genetic based fuzzy logic controller for
a wall-following mobile robot,” in Proc. of the 2009 IEEE American Control
Conference (ACC-09), (St. Louis, USA), pp. 3555–3560, June 2009.
[100] S. F. Desouky and H. M. Schwartz, “Hybrid intelligent systems applied to
the pursuit-evasion game,” in Proc. of the IEEE International Conference on
Systems, Man, and Cybernetics, (San Antonio, USA), pp. 2603–2608, October 2009.
[101] E. Omizegba, G. Adebayo, and A. Balewa, “Optimizing fuzzy membership
functions using particle swarm algorithm,” in Proc. of the 2009 IEEE Inter-
national Conference on Systems, Man and Cybernetics, (San Antonio, USA),
pp. 3866–3870, October 2009.
[102] A. Esmin, A. Aoki, and G. Lambert-Torres, “Particle swarm optimization for
fuzzy membership functions optimization,” in Proc. of the 2002 IEEE Inter-
national Conference on Systems, Man and Cybernetics, vol. 3, (Yasmine Ham-
mamet, Tunisia), 6 pp., October 2002.
[103] G. Fang, N. M. Kwok, and Q. Ha, “Automatic fuzzy membership function
tuning using the particle swarm optimization,” in Proc. of the 2008 IEEE
Pacific-Asia Workshop on Computational Intelligence and Industrial Applica-
tion, vol. 2, (Wuhan, China), pp. 324–328, December 2008.
[104] N. Khaehintung, A. Kunakorn, and P. Sirisuk, “A novel fuzzy logic control
technique tuned by particle swarm optimization for maximum power point
tracking for a photovoltaic system using a current-mode boost converter with
bifurcation control,” International Journal of Control, Automation and Sys-
tems, vol. 8, pp. 289–300, April 2010.
[105] Z. Bingul and O. Karahan, “A fuzzy logic controller tuned with PSO for 2 DOF
robot trajectory control,” Expert Systems with Applications, vol. 38, pp. 1017–
1031, January 2011.
[106] R. Rahmani, M. Mahmodian, S. Mekhilef, and A. Shojaei, “Fuzzy logic con-
troller optimized by particle swarm optimization for DC motor speed con-
trol,” in Proc. of the 2012 IEEE Student Conference on Research and Develop-
ment (SCOReD), (Pulau Pinang, Malaysia), pp. 109–113, December 2012.
[107] S. F. Desouky and H. M. Schwartz, “Q(λ)-learning fuzzy logic controller for
a multi-robot system,” in Proc. of the 2010 IEEE International Conference on
Systems, Man, and Cybernetics, (Istanbul, Turkey), pp. 4075–4080, October
2010.
[108] L. Jouffe, “Fuzzy inference system learning by reinforcement methods,” IEEE
Trans. on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
vol. 28, pp. 338–355, August 1998.
[109] W. M. van Buijtenen, G. Schram, R. Babuska, and H. B. Verbruggen, “Adaptive
fuzzy control of satellite attitude by reinforcement learning,” IEEE Trans. on
Fuzzy Systems, vol. 6, pp. 185–194, May 1998.
[110] M. J. Er and C. Deng, “Online tuning of fuzzy inference systems using dy-
namic fuzzy Q-learning,” IEEE Trans. on Systems, Man, and Cybernetics, Part
B (Cybernetics), vol. 34, pp. 1478–1489, June 2004.
[111] H. R. Berenji and P. Khedkar, “Learning and tuning fuzzy logic controllers
through reinforcements,” IEEE Trans. on Neural Networks, vol. 3, pp. 724–
740, September 1992.
[112] H. R. Berenji, “Fuzzy Q-learning: a new approach for fuzzy dynamic pro-
gramming,” in Proc. of the 1994 IEEE 3rd International Fuzzy Systems Confer-
ence, (Orlando, USA), pp. 486–491, June 1994.
[113] P. Y. Glorennec, “Fuzzy Q-learning and dynamical fuzzy Q-learning,” in Proc.
of the 1994 IEEE 3rd International Fuzzy Systems Conference, (Orlando, USA),
pp. 474–479, June 1994.
[114] N. H. Yung and C. Ye, “An intelligent mobile vehicle navigator based on fuzzy
logic and reinforcement learning,” IEEE Trans. on Systems, Man, and Cyber-
netics, Part B (Cybernetics), vol. 29, no. 2, pp. 314–321, 1999.
[115] C. Zhou and Q. Meng, “Dynamic balance of a biped robot using fuzzy rein-
forcement learning agents,” Fuzzy Sets and Systems, vol. 134, pp. 169–187,
February 2003.
[116] C.-K. Lin, “A reinforcement learning adaptive fuzzy controller for robots,”
Fuzzy Sets and Systems, vol. 137, pp. 339–352, August 2003.
[117] C. Ye, N. H. Yung, and D. Wang, “A fuzzy controller with supervised learn-
ing assisted reinforcement learning algorithm for obstacle avoidance,” IEEE
Trans. on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 33, pp. 17–
27, February 2003.
[118] Y. Duan and X.-H. Xu, “Fuzzy reinforcement learning and its application in
robot navigation,” in Proc. of the 2005 IEEE International Conference on Machine
Learning and Cybernetics, vol. 2, (Guangzhou, China), pp. 899–904, August
2005.
[119] X.-S. Wang, Y.-H. Cheng, and J.-Q. Yi, “A fuzzy actor–critic reinforcement
learning network,” Information Sciences, vol. 177, pp. 3764–3781, Septem-
ber 2007.
[120] V. Derhami, V. J. Majd, and M. N. Ahmadabadi, “Exploration and exploita-
tion balance management in fuzzy reinforcement learning,” Fuzzy Sets and
Systems, vol. 161, pp. 578–595, February 2010.
[121] H. Van Hasselt, Reinforcement learning in continuous state and action spaces,
pp. 207–251. Springer, 2012.
[122] M. D. Awheda and H. M. Schwartz, “The residual gradient FACL algorithm
for differential games,” in Proc. of the 28th IEEE Canadian Conference on Elec-
trical and Computer Engineering (CCECE2015), (Halifax, Canada), pp. 1006–
1011, May 2015.
[123] H. Raslan, H. M. Schwartz, and S. N. Givigi, “A learning invader for the
guarding a territory game,” Journal of Intelligent and Robotic Systems, vol. 83,
pp. 55–70, July 2016.
[124] C. V. Analikwu and H. M. Schwartz, “Reinforcement learning in the guarding
a territory game,” in Proc. of the 2016 IEEE International Conference on Fuzzy
Systems (FUZZ-IEEE 2016), (Vancouver, Canada), pp. 1007–1014, July 2016.
[125] X. Lu and H. M. Schwartz, “An investigation of guarding a territory problem
in a grid world,” in Proc. of the American Control Conference (ACC) 2010,
(Baltimore, USA), pp. 3204–3210, June 2010.
[126] M. L. Littman, “Markov games as a framework for multi-agent reinforcement
learning,” in Proc. of the 11th International Conference on Machine Learning,
pp. 157–163, 1994.
[127] K. Doya, “Reinforcement learning in continuous time and space,” Neural
Computation, vol. 12, pp. 219–245, January 2000.
[128] D. Li, J. B. J. Cruz, G. Chen, C. Kwan, and M.-H. Chang, “A hierarchical
approach to multi-player pursuit-evasion differential games,” in Proc. of the
44th IEEE Conference on Decision and Control, (Seville, Spain), pp. 5674–
5679, December 2005.
[129] S. N. Givigi and H. M. Schwartz, “Decentralized strategy selection with learn-
ing automata for multiple pursuer-evader games,” Adaptive Behavior, vol. 22,
pp. 221–234, August 2014.
[130] S. F. Desouky and H. M. Schwartz, “Learning in n-pursuer n-evader differ-
ential games,” in Proc. of the 2010 IEEE International Conference on Systems,
Man, and Cybernetics, (Istanbul, Turkey), pp. 4069–4074, October 2010.
[131] S. F. Desouky and H. M. Schwartz, “A novel hybrid learning technique ap-
plied to a self-learning multi-robot system,” in Proc. of the IEEE International
Conference on Systems, Man, and Cybernetics, (San Antonio, USA), pp. 2690–
2697, October 2009.
[132] M. Wei, G. Chen, J. B. J. Cruz, L. Haynes, K. Pham, and E. Blasch, “Multi-
pursuer multi-evader pursuit-evasion games with jamming confrontation,”
Journal of Aerospace Computing, Information, and Communication, vol. 4,
pp. 693–706, March 2007.
[133] X. Wang, J. B. J. Cruz, G. Chen, K. Pham, and E. Blasch, “Formation control
in multi-player pursuit evasion game with superior evaders,” in Proc. of the
Defense Transformation and Net-Centric Systems, vol. 6578, (Orlando, USA),
p. 657811, International Society for Optics and Photonics, May 2007.
[134] Z. S. Cai, L. N. Sun, and H. B. Gao, “A novel hierarchical decomposition
for multi-player pursuit evasion differential game with superior evaders,” in
Proc. of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computa-
tion, (Shanghai, China), pp. 795–798, June 2009.
[135] F. B. Fu, P. Q. Shu, H. B. Rong, D. Lei, Z. Q. Bo, and Z. Zhaosheng, “Research
on high speed evader vs. multi lower speed pursuers in multi pursuit-evasion
games,” Information Technology Journal, vol. 11, pp. 989–997, August 2012.
[136] M. D. Awheda and H. M. Schwartz, “A decentralized fuzzy learning algo-
rithm for pursuit-evasion differential games with superior evaders,” Journal
of Intelligent and Robotic Systems, vol. 83, pp. 35–53, July 2016.
[137] A. A. Al-Talabi and H. M. Schwartz, “An investigation of methods of pa-
rameter tuning for Q-learning fuzzy inference system,” in Proc. of the 2014
IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2014), (Beijing,
China), pp. 2594–2601, July 2014.
[138] L.-X. Wang, A Course in Fuzzy Systems and Control. Prentice-Hall Press, USA,
1997.
[139] B. M. Al Faiya, “Learning in pursuit-evasion differential games using re-
inforcement fuzzy learning,” Master’s thesis, Carleton University, February
2012.
[140] A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique using
PSO-based FLC and QFIS for the pursuit evasion differential game,” in Proc.
of the 2014 IEEE International Conference on Mechatronics and Automation
(ICMA 2014), (Tianjin, China), pp. 762–769, August 2014.
[141] A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique for dual
learning in the pursuit-evasion differential game,” in Proc. of the IEEE Sym-
posium Series on Computational Intelligence (SSCI) 2014, (Orlando, USA),
pp. 1–8, December 2014.
[142] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement learning
and dynamic programming using function approximators, vol. 39. CRC Press,
April 2010.
[143] J. F. Schutte, B. I. Koh, J. A. Reinbolt, R. T. Haftka, A. D. George, and B. J.
Fregly, “Evaluation of a particle swarm algorithm for biomechanical opti-
mization,” Journal of Biomechanical Engineering, vol. 127, pp. 465–474, June
2005.
[144] A. A. Al-Talabi, “Fuzzy actor-critic learning automaton algorithm for the
pursuit-evasion differential game,” in Proc. of the 2017 IEEE International
Automatic Control Conference (CACS), (Pingtung, Taiwan), November 2017.
[145] M. D. Awheda and H. M. Schwartz, “A fuzzy reinforcement learning algo-
rithm with a prediction mechanism,” in Proc. of the 22nd IEEE Mediterranean
Conference on Control and Automation (MED), (Palermo, Italy), pp. 593–598,
June 2014.
[146] A. A. Al-Talabi and H. M. Schwartz, “Kalman fuzzy actor-critic learning au-
tomaton algorithm for the pursuit-evasion differential game,” in Proc. of
the 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016),
(Vancouver, Canada), pp. 1015–1022, July 2016.
[147] M. A. Wiering and H. Van Hasselt, “Two novel on-policy reinforcement learn-
ing algorithms based on TD(λ)-methods,” in Proc. of the 2007 IEEE Interna-
tional Symposium on Approximate Dynamic Programming and Reinforcement
Learning (ADPRL 2007), (Honolulu, USA), pp. 280–287, April 2007.
[148] H. Van Hasselt and M. A. Wiering, “Reinforcement learning in continuous
action spaces,” in Proc. of the 2007 IEEE International Symposium on Ap-
proximate Dynamic Programming and Reinforcement Learning (ADPRL 2007),
(Honolulu, USA), pp. 272–279, April 2007.
[149] S. F. Desouky, Learning and Design of Fuzzy Logic Controllers for Pursuit-
Evasion Differential Games. PhD thesis, Carleton University, July 2010.
[150] D. Li and J. B. J. Cruz, “Better cooperative control with limited look-
ahead,” in Proc. of the 2006 American Control Conference, (Minneapolis,
USA), pp. 4914–4919, June 2006.
[151] Z. Xue, “A comparison of nonlinear filters on mobile robot pose estimation,”
Master’s thesis, Carleton University, February 2013.
[152] J. Z. Sasiadek and P. Hartana, “GPS/INS sensor fusion for accurate posi-
tioning and navigation based on Kalman filtering,” IFAC Proceedings Volumes,
vol. 37, pp. 115–120, June 2004.
[153] J. Z. Sasiadek and Q. Wang, “Low cost automation using INS/GPS data fusion
for accurate positioning,” Robotica, vol. 21, pp. 255–260, June 2003.
[154] D.-J. Jwo and F.-C. Chung, “Fuzzy adaptive unscented Kalman filter for ultra-
tight GPS/INS integration,” in Proc. of the 2010 IEEE International Sympo-
sium on Computational Intelligence and Design (ISCID), vol. 2, pp. 229–235,
October 2010.
[155] A. A. Al-Talabi, “Multi-player pursuit-evasion differential game with equal
speed,” in Proc. of the 2017 IEEE International Automatic Control Conference
(CACS), (Pingtung, Taiwan), November 2017.
[156] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg, “Intrinsically motivated
reinforcement learning: An evolutionary perspective,” IEEE Trans. on Au-
tonomous Mental Development, vol. 2, pp. 70–82, June 2010.
[157] C. Diuk, A. Cohen, and M. L. Littman, “An object-oriented representation for
efficient reinforcement learning,” in Proc. of the 25th International Conference
on Machine Learning, pp. 240–247, ACM, July 2008.