Learning in the Multi-Robot Pursuit Evasion Game
by
Ahmad Al-Talabi, M.Sc.
A dissertation submitted to the
Faculty of Graduate Studies and Research
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical and Computer Engineering
Ottawa-Carleton Institute for Electrical and Computer Engineering (OCIECE)
Department of Systems and Computer Engineering
Carleton University
Ottawa, Ontario, Canada
January, 2019
© Copyright 2019, Ahmad Al-Talabi
The undersigned hereby recommends to the
Faculty of Graduate Studies and Research
acceptance of the dissertation
Learning in the Multi-Robot Pursuit Evasion Game
submitted by
Ahmad Al-Talabi, M.Sc.
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical and Computer Engineering
Professor Cecilia Zanni-Merk, External Examiner,
Mathematical Engineering department,
INSA Rouen Normandie
Professor Gabriel Wainer, Thesis Supervisor,
Department of Systems and Computer Engineering
Professor Yvan Labiche, Chair,
Department of Systems and Computer Engineering
Ottawa-Carleton Institute for Electrical and Computer Engineering (OCIECE)
Department of Systems and Computer Engineering
Carleton University
Ottawa, Ontario, Canada
January, 2019
Abstract
This thesis proposes different learning algorithms to investigate the learning issues
of mobile robots playing differential forms of the Pursuit-Evasion (PE) game. The
algorithms are used to: (1) reduce the computational requirements without affecting the overall performance of the algorithm, (2) reduce the learning time, (3) reduce the capture time and the possibility of collision among pursuers, and (4) deal with multi-robot PE games with a single-superior evader.
The computational complexity is reduced by examining four methods of pa-
rameter tuning for the Q-Learning Fuzzy Inference System (QLFIS) algorithm, to
determine both the best parameters to tune and those that have minimal impact
on performance. Two learning algorithms are then proposed to reduce the learn-
ing time. The first uses a two-stage technique that combines the Particle Swarm Optimization (PSO)-based Fuzzy
Logic Control (FLC) algorithm with the QLFIS algorithm, with the PSO algorithm
used as a global optimizer and the QLFIS as a local optimizer. The second algorithm
is a modified version of the Fuzzy Actor-Critic Learning (FACL) algorithm, known
as Fuzzy Actor-Critic Learning Automaton (FACLA). It uses the Continuous Actor-
Critic Learning Automaton (CACLA) algorithm to tune the parameters of the Fuzzy
Inference System (FIS).
Next, a decentralized learning technique is proposed that enables a group of
two or more pursuers to capture a single inferior evader. It uses the FACLA algo-
rithm together with the Kalman filter technique to reduce both the capture time
and the collision potential among the pursuers. It is assumed that there is no commu-
nication among the pursuers. Finally, a proposed decentralized learning algorithm
is applied successfully to a multi-robot PE game with a single-superior evader, in
which all players have identical speeds. A new reward function is suggested and
used to guide the pursuer to either move to the interception point with the evader
or move in parallel with the evader, depending on whether the pursuer can capture
the evader or not. Simulation results have shown the feasibility of the proposed
learning algorithms.
“ People are of two kinds. They are either your brothers in Faith or your Equal
in Humanity.”
Imam Ali ( A.S)
Acknowledgments
First and foremost, I am extremely grateful to Almighty Allah, and to Ahlul-Bayt,
peace be upon them, for their countless blessings and for the great care they have
provided at every moment of my life. Certainly, without their support and bounty, I
would never have successfully finished this chapter of my life by obtaining the PhD degree.
I would like to express my deepest gratitude to my supervisor Prof. Gabriel
Wainer for his guidance and support. He has always been available to meet with
me even with his tight time schedule. Being under his supervision is one of the best
decisions I have ever made. Thanks, Prof. Wainer!
Also, I would like to express my sincere gratitude and appreciation to my friend
and mentor, Prof. Ramy Gohary for his encouragement, guidance, advice and sup-
port. Frankly, I cannot find words that do justice to what he deserves. He has always been
available to discuss how to overcome difficulties and to explore the ways of suc-
cess. Prof. Gohary is a treasure and only a lucky person can work with such a
knowledgeable professor. Thanks Prof. Gohary!
I would like to thank my thesis committee members Prof. Paul Keen, Prof. Cecilia
Zanni-Merk, Prof. James Green, Prof. Sidney Givigi and Prof. Emil Petriu for their
comprehensive evaluation and insightful comments on my thesis.
My sincere gratitude and appreciation go to the chair of the Systems and Com-
puter Engineering Department Prof. Yvan Labiche and to Prof. Ioannis Lambadaris
for their support and invaluable assistance.
Also, I would like to thank my close friends Prof. Ahmed Qadoury Abed and Dr.
Nasir Kamat for always being there to relieve my hardships and for their support and
encouragement.
I cannot ever express my feelings towards my parents, brothers and sisters. Their
love, support, encouragement and prayers gave me more confidence to fulfill my
goals successfully. Special thanks to my brother Mohammed for his support and
patience.
Finally, I would like to dedicate this work to my beloved wife, Zainab, and my
children, Qamer, Mohammedbakir, Aya and Zahraa, for their unconditional love,
care, patience and understanding. You were the candles that lit my way
through this long journey.
Table of Contents
Abstract iii
Acknowledgments vi
List of Tables xii
List of Figures xv
Nomenclature xxiii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Goal and Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Background and Literature Review 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 The Pursuit-Evasion Game . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 The Game of Two Cars . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Fuzzy Logic Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Fuzzy Sets and Membership Functions . . . . . . . . . . . . . 18
2.3.2 Fuzzification . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Fuzzy-Rule Base . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Fuzzy-Inference Engine . . . . . . . . . . . . . . . . . . . . . 22
2.3.5 Defuzzification . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Future-Discounting Reward . . . . . . . . . . . . . . . . . . . 28
2.4.2 Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.3 Temporal Difference Learning . . . . . . . . . . . . . . . . . . 32
2.4.4 Actor-Critic Methods . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Particle Swarm Optimization . . . . . . . . . . . . . . . . . . . . . . 36
2.5.1 Particle Swarm Optimization with Inertia Weight . . . . . . . 39
2.5.2 Particle Swarm Optimization with Constriction Factor . . . . . 41
2.6 The Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6.1 The System Dynamic Model . . . . . . . . . . . . . . . . . . . 43
2.6.2 The Kalman Filtering Process . . . . . . . . . . . . . . . . . . 45
2.6.3 Fading Memory Filter (FMF) . . . . . . . . . . . . . . . . . . . 47
2.7 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3 An Investigation of Methods of Parameter Tuning for the QLFIS 56
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Pursuit-Evasion Game . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Fuzzy Logic Controller Structure . . . . . . . . . . . . . . . . . . . . 60
3.5 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6 Q-Learning Fuzzy Inference System (QLFIS) . . . . . . . . . . . . . . 64
3.6.1 The Learning Rule of the QLFIS and its Algorithm . . . . . . . 64
3.7 Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7.1 Evader Follows a Default Control Strategy . . . . . . . . . . . 69
3.7.2 Evader Using its Higher Maneuverability Advantageously . . . 73
3.7.3 Multi-Robot Learning . . . . . . . . . . . . . . . . . . . . . . 79
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4 Learning Technique Using PSO-Based FLC and QLFIS Algorithms 90
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 The PSO-based FLC algorithm . . . . . . . . . . . . . . . . . . . . . . 92
4.3 Q-learning Fuzzy Inference System (QLFIS) . . . . . . . . . . . . . . 95
4.4 The proposed Two-Stage Learning Technique . . . . . . . . . . . . . 96
4.5 Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5.1 Evader Follows a Default Control Strategy . . . . . . . . . . . 97
4.5.2 Multi-Robot Learning . . . . . . . . . . . . . . . . . . . . . . 105
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5 Kalman Fuzzy Actor-Critic Learning Automaton Algorithm for the Pursuit-
Evasion Differential Game 110
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2 Fuzzy Actor-Critic Learning Automaton (FACLA) . . . . . . . . . . . . 112
5.2.1 Evader Follows a Default Control Strategy . . . . . . . . . . . 117
5.2.2 Multi-Robot Learning . . . . . . . . . . . . . . . . . . . . . . 120
5.3 Learning in n-Pursuer One-Evader PE Differential Game . . . . . . . 121
5.3.1 Predicting the Interception Point and its Effects . . . . . . . . 124
5.4 State Estimation Based on a Kalman Filter . . . . . . . . . . . . . . . 126
5.4.1 The Design of Filter Parameters . . . . . . . . . . . . . . . . . 129
5.4.2 Kalman Filter Initialization . . . . . . . . . . . . . . . . . . . 131
5.4.3 Fuzzy Fading Memory Filter . . . . . . . . . . . . . . . . . . . 132
5.4.4 Kalman Filter Model Selection . . . . . . . . . . . . . . . . . . 133
5.5 Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.5.1 Case 1: Two-Pursuer One-Evader Game . . . . . . . . . . . . 147
5.5.2 Case 2: Three-Pursuer One-Evader Game . . . . . . . . . . . 151
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6 Multi-Player Pursuit-Evasion Differential Game with Equal Speed 156
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2 The Dynamic Equations of the Players . . . . . . . . . . . . . . . . . 158
6.3 Fuzzy Logic Controller Structure . . . . . . . . . . . . . . . . . . . . 160
6.4 Reward Function Formulation . . . . . . . . . . . . . . . . . . . . . . 160
6.5 Fuzzy Actor-Critic Learning Automaton (FACLA) . . . . . . . . . . . . 164
6.6 Computer Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7 Conclusions and Future Work 170
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Bibliography 181
List of Tables
2.1 Fuzzy Decision Table (FDT) . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Methods of parameter tuning. . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Fuzzy logic parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3 FDTs of the pursuer and the evader before learning. . . . . . . . . . . 63
3.4 Mean and standard deviation of the capture time (s) for different
evader initial positions for the first version of the PE game. . . . . . . 70
3.5 Mean and standard deviation of the computation time (s) for the
four methods of parameter tuning for the first version of the PE game. 74
3.6 Mean and standard deviation of the capture time (s) for different
evader initial positions for the second version of the PE game. . . . . 76
3.7 Mean and standard deviation of the computation time (s) for the
four methods of parameter tuning for the second version of the PE
game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.8 Mean and standard deviation of the capture time (s) for different
evader initial positions for the third version of the PE game (Case 1). 81
3.9 Mean and standard deviation of the computation time (s) for the
four methods of parameter tuning for the third version of the PE
game (Case 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.10 Mean and standard deviation of the capture time (s) for different
pursuers’ initial positions for the third version of the PE game (Case 2). 85
3.11 Mean and standard deviation of the computation time (s) for the
four methods of parameter tuning for the third version of the PE
game (Case 2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.1 The percentage decrease in the mean value of the capture time (s)
for 1000 episodes as the number of particles increases. . . . . 100
4.2 The percentage decrease in the mean value of the capture time (s)
for 10 particles as the number of episodes increases. . . . . . . . . . 100
4.3 Mean and standard deviation of the capture time (s) for different
evader initial positions for the case of only the pursuer learning. . . . 102
4.4 Total number of episodes for the different learning algorithms . . . . 102
4.5 Mean and standard deviation of the computation time (s) for differ-
ent learning algorithms for the case of only the pursuer learning. . . 104
4.6 Mean and standard deviation of the capture time (s) for different
evader initial positions for the case of multi-robot learning. . . . . . . 106
4.7 Mean and standard deviation of the computation time (s) for differ-
ent learning algorithms for the case of multi-robot learning. . . . . . 108
5.1 Total number of episodes for the different learning algorithms. . . . . 119
5.2 Mean and standard deviation of the capture time (s) for different
evader initial positions for the case of only the pursuer learning. . . . 119
5.3 Mean and standard deviation of the computation time (s) for differ-
ent learning algorithms for the case of only the pursuer learning. . . 119
5.4 Mean and standard deviation of the capture times (s) for different
evader initial positions for the case of multi-robot learning. . . . . . . 120
5.5 Mean and standard deviation of the computation time (s) for differ-
ent learning algorithms for the case of multi-robot learning. . . . . . 121
5.6 FDT of the fuzzy FMF. . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.7 Mean and standard deviation of the RMSE (cm) for the evader’s po-
sition estimate of Example 5.1. . . . . . . . . . . . . . . . . . . . . . 136
5.8 Mean and standard deviation of the RMSE (cm) for the evader’s po-
sition estimate of Example 5.2. . . . . . . . . . . . . . . . . . . . . . 140
5.9 Mean and standard deviation of the RMSE (cm) for the evader’s po-
sition estimate of Example 5.3. . . . . . . . . . . . . . . . . . . . . . 141
5.10 Mean and standard deviation of the capture time (s) for a two-
pursuer one-evader game for different pursuers’ initial positions. . . 149
5.11 Mean and standard deviation of the capture time (s) for a three-
pursuer one-evader game for different pursuers’ initial positions. . . 151
6.1 Fuzzy decision table . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
List of Figures
2.1 The PE model for the game of two cars. . . . . . . . . . . . . . . . . 17
2.2 Fuzzy logic controller structure . . . . . . . . . . . . . . . . . . . . . 19
2.3 Membership function of input and output. . . . . . . . . . . . . . . . 21
2.4 Graphical representation of different defuzzification methods. . . . . 25
2.5 Agent-environment interaction in RL [16]. . . . . . . . . . . . . . . . 27
2.6 Actor-critic structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 Kalman filtering process. . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1 Initial MFs of pursuer and evader before learning. . . . . . . . . . . . 62
3.2 The structure of the QLFIS technique [20]. . . . . . . . . . . . . . . . 65
3.3 The PE paths on the xy-plane for the first version of the PE game,
before the pursuer starts to learn. . . . . . . . . . . . . . . . . . . . . 71
3.4 The PE paths on the xy-plane for the first version of the PE game
when the first method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 71
3.5 The PE paths on the xy-plane for the first version of the PE game
when the second method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 72
3.6 The PE paths on the xy-plane for the first version of the PE
game when the third method of parameter tuning is used versus the
PE paths when each player followed its DCS. . . . . . . . . . . . . . . 72
3.7 The PE paths on the xy-plane for the first version of the PE game
when the fourth method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 73
3.8 The PE paths on the xy-plane for the second version of the PE game,
before the pursuer starts to learn. . . . . . . . . . . . . . . . . . . . . 75
3.9 The PE paths on the xy-plane for the second version of the PE game
when the first method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 77
3.10 The PE paths on the xy-plane for the second version of the PE game
when the second method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 77
3.11 The PE paths on the xy-plane for the second version of the PE game
when the third method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 78
3.12 The PE paths on the xy-plane for the second version of the PE game
when the fourth method of parameter tuning is used versus the PE
paths when each player followed its DCS. . . . . . . . . . . . . . . . 78
3.13 The PE paths on the xy-plane for the third version of the PE game
(Case 1), before the players start to learn. . . . . . . . . . . . . . . . 80
3.14 The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the first method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 81
3.15 The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the second method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 82
3.16 The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the third method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 82
3.17 The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the fourth method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 83
3.18 The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the first method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 87
3.19 The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the second method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 87
3.20 The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the third method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 88
3.21 The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the fourth method of parameter tuning is used versus
the PE paths when each player followed its DCS. . . . . . . . . . . . 88
4.1 The mean values of the capture time for the PSO-based FLC algo-
rithm for different population sizes. The range bars indicate the
standard deviations over the 500 simulation runs. . . . . . . . . . . 99
4.2 The mean values of the capture time for the PSO-based FLC algo-
rithm for different episode numbers. The range bars indicate the
standard deviations over the 500 simulation runs. . . . . . . . . . . . 99
4.3 The PE paths on the xy-plane using the PSO-based FLC algorithm for
the case of only the pursuer learning versus the PE paths when each
player followed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4 The PE paths on the xy-plane using the QLFIS algorithm for the case
of only the pursuer learning versus the PE paths when each player
followed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.5 The PE paths on the xy-plane using the proposed learning algorithm
for the case of only the pursuer learning versus the PE paths when
each player followed its DCS. . . . . . . . . . . . . . . . . . . . . . . 104
4.6 The PE paths on the xy-plane using the PSO-based FLC algorithm
for the case of multi-robot learning versus the PE paths when each
player followed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.7 The PE paths on the xy-plane using the QLFIS algorithm for the case
of multi-robot learning versus the PE paths when each player fol-
lowed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.8 The PE paths on the xy-plane using the proposed learning algorithm
for the case of multi-robot learning versus the PE paths when each
player followed its DCS. . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1 Structure of the FACL system [17]. . . . . . . . . . . . . . . . . . . . 113
5.2 The mean values of the capture time for the FACL, RGFACL and FA-
CLA algorithms for different episode numbers. The range bars indi-
cate the standard deviations over the 500 simulation runs. . . . . . . 118
5.3 The PE differential game model with two-pursuer and one-evader. . . 122
5.4 Geometric illustration for capturing situation. . . . . . . . . . . . . . 124
5.5 Geometric illustration for capturing situation using the estimated po-
sition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.6 MFs of the inputs µ and ξ. . . . . . . . . . . . . . . . . . . . . . . . . 134
5.7 The evader’s position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 137
5.8 The evader’s x-position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 138
5.9 The evader’s y-position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 139
5.10 The evader’s position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 141
5.11 The evader’s x-position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 142
5.12 The evader’s y-position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 143
5.13 The evader’s position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 144
5.14 The evader’s x-position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 145
5.15 The evader’s y-position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter.
(d) Fuzzy CAM Kalman Filter. . . . . . . . . . . . . . . . . . . . . . . 146
5.16 The PE paths using FACLA algorithm for a two-pursuer one-evader
game for different pursuers’ initial positions. . . . . . . . . . . . . . . 149
5.17 The PE paths using Kalman-FACLA algorithm for a two-pursuer one-
evader game for different pursuers’ initial positions. . . . . . . . . . . 150
5.18 The PE paths using FACLA algorithm for a three-pursuer one-evader
game for different pursuers’ initial positions. . . . . . . . . . . . . . . 152
5.19 The PE paths using Kalman-FACLA algorithm for a three-pursuer one-
evader game for different pursuers’ initial positions. . . . . . . . . . . 153
6.1 Geometric illustration for capturing situation. . . . . . . . . . . . . . 159
6.2 Geometric illustration for the pursuer moving in parallel with the
evader using PGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3 The PE paths for three-pursuer one-evader after the first learning
episode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.4 The PE paths for three-pursuer one-evader after the final episode. . . 166
6.5 The average payoff of each pursuer at the end of each learning episode. . . 167
6.6 The difference between the initial and the final LoSs between each
pursuer and the evader at each learning episode. . . . . . . . . . . . 168
6.7 The difference between the initial and the final Euclidean distances
between each pursuer and the evader at each learning episode. . . . 168
Nomenclature
FIS Fuzzy Inference System
QLFIS Q-Learning Fuzzy Inference System
PSO Particle Swarm Optimization
FLC Fuzzy Logic Control
FACL Fuzzy Actor-Critic Learning
CACLA Continuous Actor-Critic Learning Automaton
FACLA Fuzzy Actor-Critic Learning Automaton
PE Pursuit-Evasion
RL Reinforcement Learning
LoS Line-of-Sight
HJI Hamilton-Jacobi-Isaacs
GA Genetic Algorithm
MF Membership Function
TS Takagi-Sugeno
FDT Fuzzy Decision Table
TD Temporal-Difference
DP Dynamic Programming
MC Monte Carlo
FMF Fading Memory Filter
CVM Constant Velocity Model
CAM Constant Acceleration Model
Chapter 1
Introduction
1.1 Overview
Pursuit-Evasion (PE) games are multi-player games in which one or more ‘pursuers’
attempt to capture one or more ‘evaders’ in the least time, while the evaders try
to escape or maximize the capture time [1]. Hence, this can be formulated as a
zero-sum game in which a benefit for the pursuers is a loss for the evaders, and vice
versa. And since the goal of the pursuers is opposite to that of the evaders, PE can
be considered an optimization problem with conflicting objectives [2].
PE games have been used for decades, mainly in three scientific fields: neu-
roethology [3], behavioural biology [4, 5], and game theory [1, 6]. They are a key
research tool in the field of game theory, and several classes of the games have been
studied extensively due to their potential for military applications, including mis-
sile avoidance and interception, surveillance, reconnaissance and rescue operations
[1]. The concept can also be generalized to solve other applications, such as path
planning, collision avoidance, criminal pursuit and other related fields [7, 8].
PE games are differential games, which means that the system dynamics are
described by systems of differential equations. In other words, differential games
are games that have continuous state and action spaces. The solution complexity of
a differential game increases as the number of players increases. This is due
to the difficulty of modeling the interactions between players, and of enabling the players
to interact with unknown and uncertain environments.
Fuzzy logic control (FLC) is a method for dealing with processes that are ill-
defined and/or involve uncertainty or continuous change, as well as a technique
for intelligent control [9]. Fuzzy logic concepts have been widely used in the field
of autonomous mobile robots [10–14]. Designing an FLC system requires an ap-
propriate knowledge base, and this can be constructed from experts’ knowledge.
However, building a knowledge base is not a simple task, so using an optimization
method such as Particle Swarm Optimization (PSO) or a learning method such as
Reinforcement Learning (RL) to autonomously tune the FLC parameters, can be
useful.
PSO is a population-based stochastic optimization method that has been
proposed for different types of applications. In this thesis, the PSO algorithm will
act as a global optimizer to tune the parameters of the FLC. It was chosen because of
its simplicity and efficiency, and the fact that it only has a few parameters to adjust.
The PSO algorithm tunes the FLC parameters in accordance with the problem fitness
function.
RL represents a method to learn to achieve a particular goal [15] through inter-
action with the environment [16]. In RL, the player selects actions and the environ-
ment responds by producing new situations for the player. The environment also
generates numerical values (rewards) that should be maximized by a player over
time. RL has proven to be a good choice for intelligent robot control design, and it
has been successfully applied to tune FLC parameters.
1.2 Motivation
The study of PE games has received wide attention from researchers in various
fields due to the game’s extensive applicability, particularly in military applications
such as air combat, torpedo-and-ship, and tank-and-jeep scenarios [1]. There are
also many related scientific implications, and understanding the PE game concept
has fostered applications in communications, video games and robotics. For ex-
ample, several important applications of mobile robots can be formulated as PE
games, including path planning, tracking, leader-follower, collision avoidance and
search-and-rescue [7]. Thus, due to the importance of the PE game, and recent
technological developments, researchers are increasingly interested in autonomous
agents (e.g. autonomous robots) that can learn and achieve a specific goal through
experience acquired from their environment. The learning algorithm for an au-
tonomous agent is implemented in a microcontroller; thus, it is better to use a
learning algorithm that can be easily and efficiently applied.
As discussed in Section 1.1, PE games are differential, and this can make achiev-
ing solutions complicated. Issues typically increase in proportion to the number
of game participants, which is partly due to modeling participants’ interactions,
as well as how they interact with unknown and uncertain environments. Thus, this
thesis focuses on the learning approach in PE differential games; that is, each player
learns the best action to take at each instant of time, and adapts to uncertain and
unknown environments.
Recently, fuzzy-reinforcement learning methods in [17–22] have been proposed
to address learning problems in differential games. With these methods, the learn-
ing process is achieved by tuning the parameters of two main components, both
of which are Fuzzy Inference Systems (FISs) with two sets of parameters (i.e. a
premise and a consequent) that can be tuned by the learning algorithm. The first
component is used to approximate a state value function V (·) or an action value
function Q(·, ·), and the second is used as an FLC. However, the work in [17–21]
did not investigate which set of parameters had the most significant impact on the
performance of the learning algorithm. In [18–20], the learning process is achieved
by tuning all sets of the parameters, while in [17, 21] only one set of parameters
is used; namely, the consequent set. Therefore, this issue requires investigation to
determine possible reductions in the computational requirements of the learning
algorithm.
Each player in a PE game tries to learn its control strategy, and this normally
takes a specific number of episodes to achieve acceptable performance. Sometimes
the number selected is greater than necessary, which means the learning time is
longer. The length of the learning process is very important, particularly if the
learning algorithm is being implementing with a real-world game. This makes it
necessary to propose other learning algorithms that could speed up the learning
process, as demonstrated later in the thesis.
If one of the learning algorithms proposed in [17–20, 22] is applied to a multi-
pursuer single-inferior-evader PE game, there is potential for collisions among the
pursuers, particularly if they are near one another or approaching the evader. This
motivates the development of a new learning algorithm with the ability to avoid
collisions. And since there are multiple pursuers in the game, it is more effective to
propose a learning algorithm that can be implemented in a decentralized manner.
Furthermore, this thesis discusses the problem of multi-pursuer single-evader PE
games in which all players have equal capabilities. Though this game format was
previously investigated in [23], it was assumed that the speed of the evader was
known, which makes the algorithm inappropriate for practical use. Therefore, this
assumption is not made here, which provides motivation for proposing another
learning algorithm that could teach a group of pursuers how to capture a single
evader in a decentralized manner.
1.3 Goal and Objective
The overall goal of this thesis can be formulated as how to develop efficient learning
algorithms that enable one or several pursuers to capture one or several evaders in
minimal time. Efficiency of the learning algorithm can be achieved by meeting the
following objectives: reducing the computational requirements as much as possible
without affecting the overall performance of the learning algorithm, reducing the
learning time, reducing the capture time, reducing the possibility of collision among
the pursuers and dealing with multi-robot PE games with a single-superior evader.
This research is carried out in four stages:
• Stage 1: In order to reduce computational complexity, the Q-Learning Fuzzy
Inference System (QLFIS) algorithm previously proposed in [20] is consid-
ered. Four methods for tuning the parameters of the algorithm are investi-
gated to determine the ones with maximal and minimal impacts on perfor-
mance. The four methods depend on whether all the parameters of the FIS
and the FLC are tuned, or only a subset; it is more computationally efficient
to tune a subset rather than all parameters.
• Stage 2: For real-world applications the learning time is highly important,
thus two specific learning algorithms are proposed for PE differential games.
The first uses a two-stage learning technique that combines the PSO-based
FLC algorithm with the QLFIS algorithm. The resulting algorithm is called the
PSO-based FLC+QLFIS algorithm. The PSO aspect of the algorithm is used
as a global optimizer to autonomously tune the parameters of the FLC, while
the QLFIS aspect is used as a local optimizer. The second proposed learning
algorithm is a modified version of the Fuzzy Actor-Critic Learning (FACL) al-
gorithm, in which both the critic and the actor are FISs. This algorithm uses
the Continuous Actor-Critic Learning Automaton (CACLA) algorithm to tune
the parameters of the FIS, and is called the Fuzzy Actor-Critic Learning Au-
tomaton (FACLA) algorithm.
• Stage 3: A decentralized learning technique that enables a group of two or
more pursuers to capture a single inferior evader is proposed for PE differen-
tial games. Both the pursuers and the evader must learn their control strate-
gies simultaneously by interacting with one another. This learning technique
uses the FACLA algorithm and the Kalman filter technique to reduce the cap-
ture time and collision potential for the pursuers. It is assumed that there is
no communication among the pursuers, and each pursuer considers the other
pursuers as part of its environment.
• Stage 4: In this stage, a decentralized learning algorithm is proposed for
multi-robot PE games with a single-superior evader, which enables a group of
pursuers in PE differential games to learn how to capture such an evader. In
this thesis, the superiority of the evader is defined in terms of its maximum
speed. Therefore, the superior evader can be defined as an evader whose
maximum speed is equal to or exceeds the maximum speed of the fastest pur-
suer in the game [8, 24, 25]. In this work, it is assumed that the pursuers
and the evader have identical speeds. A novel idea is used to formulate a re-
ward function for the proposed learning algorithm based on a combination of
two factors: the difference in the Line-of-Sight (LoS) between each pursuer in
the game and the evader at two consecutive time instants, and the difference
between two successive Euclidean distances between individual pursuers and
the evader.
1.4 Thesis Organization
This thesis is organized in a paper-based format and presented in seven chapters,
summarized as follows:
• Chapter 1 presents the general introduction, the motivation, the research goal
and objectives and the thesis organization. In addition, it provides a summary
of contributions and a list of publications based on this study.
• Chapter 2 defines some basic theoretical concepts applied in the thesis and
provides a detailed literature review. It begins with a brief introduction of the
problem of the PE differential game, and introduces a model of the ‘game of
two cars’, since this is the model that will be mainly used to present the PE
differential game in this thesis. Then the FIS, one of the most popular function
approximation methods, is described. Here, the FIS works as either an FLC,
or as a function approximator to manage the problem of continuous state
and action spaces such as the PE differential game. A detailed background
of RL is provided, as RL is used by learning agents to find an appropriate
learning strategy in an unknown environment. PSO, one of the population
stochastic optimization methods, is then applied as a global optimizer for the
FLC parameters. Furthermore, there is a detailed discussion about the Kalman
filter, one of the most powerful estimation techniques. The chapter concludes
with a detailed literature review on previous studies related to this research.
• Chapter 3 investigates how to reduce the computational requirement by ap-
plying four methods to implement the QLFIS algorithm. The methods are
based on the sets (i.e., premise and consequent) of FIS and FLC parameters
that can be tuned, and the four methods are applied to three versions of a PE
differential game. In the first version the evader plays a well-defined strategy,
which is to run away along the LoS while the pursuer tries to learn how to
capture it using the QLFIS algorithm. The second game is similar to the first
one, except that the evader plays an intelligent control strategy and makes a
sharp turn when the distance between the evader and pursuer is less than a
specific threshold value. In the third game, the QLFIS algorithm is used by
both players, and they attempt to learn their control strategies simultaneously
by interacting with one another. Simulation results are provided to evaluate
which parameters are most suitable to tune, and which have little impact on
performance.
• Chapter 4 introduces a new two-stage technique to reduce player learning
time, which is an important factor for any learning algorithm. The first stage
is represented by the PSO-based FLC algorithm, and the second by the QLFIS
algorithm. The basic reason for using the PSO algorithm is its ability to work
as a global optimizer and to find a global solution within a few iterations.
Thus, it is first used to find effective initial parameter values for the FLC of the
learning player. Then, in the second stage, the learning agent uses the QLFIS
algorithm with the resulting initial parameter setting to quickly find its control
strategy. The two-stage learning technique is applied to different versions of
PE differential games, and simulation results are provided and discussed.
• Chapter 5 develops a new fuzzy-reinforcement learning algorithm for PE dif-
ferential games. The proposed algorithm reduces the learning time that the
players need to find their control strategies. It uses the CACLA algorithm to
tune the FIS parameters, and is called the FACLA algorithm. The proposed
algorithm is applied to two versions of the PE games, and compared by sim-
ulation with state-of-the-art fuzzy-reinforcement learning algorithms, and the
PSO-based FLC+QLFIS algorithm. This chapter also presents a decentralized
learning technique that enables a group of two or more pursuers to capture a
single evader in PE differential games. The proposed technique uses the FA-
CLA algorithm and the Kalman filter technique, which is an estimation method
to predict an evader’s next position. It can be used by all pursuers to avoid col-
lisions among them and reduce capture time. The Kalman learning technique
is applied for each player to autonomously tune the parameters of its FLC and
self-learn its control strategy. Simulation results are provided to show that the
proposed learning algorithm works as required.
• Chapter 6 introduces a type of reward function for the FACLA algorithm to
teach a group of pursuers how to capture a single evader in a decentralized
manner. It is assumed that all players have identical speed, and each pur-
suer learns to take the right actions by tuning its FLC parameters, using the
FACLA algorithm, with the suggested reward function. The reward function
depends on two factors: the difference in the LoS between each pursuer in the
game and the evader at two consecutive time instances, and the difference be-
tween two consecutive Euclidean distances between individual pursuers and
the evader. The suggested reward function is used to guide each pursuer to
move either to the interception point with the evader or in parallel with it.
The pursuer movement direction depends on whether the pursuer can cap-
ture the evader or not. Simulation results are shown to validate the FACLA
algorithm with the suggested reward function.
• Chapter 7 highlights the conclusions and presents the main contributions of
the thesis. It also discusses ideas and directions for future work.
1.5 Summary of Contributions
The main contributions of this thesis are:
1-Reducing the Computational Time by:
Proposing and investigating four methods of parameter tuning for the QLFIS
algorithm. The investigation will determine whether it is necessary to tune both
the premise and consequent parameters of the FIS and FLC, or only the consequent
parameters.
2-Reducing the Learning Time by:
Proposing two learning algorithms for PE differential games to reduce the time
a player needs to find its control strategy. The first algorithm uses a two-stage
learning technique that combines the PSO-based FLC and QLFIS algorithms and
employs the PSO algorithm as a global optimizer and the QLFIS as a local optimizer.
The second algorithm is a modified version of the FACL algorithm called the FACLA
algorithm, and it uses a CACLA algorithm to tune the parameters of the actor and
critic. Simulations and comparisons of these algorithms and the state-of-the-art
fuzzy-reinforcement learning algorithms will be determined.
3-Reducing the Capture Time and Reducing the Possibility of Collision Among
Pursuers by:
Developing a decentralized learning technique for PE differential games that en-
ables a group of two or more pursuers to capture a single evader. The algorithm is
known as the Kalman-FACLA algorithm and it uses the Kalman filter to estimate the
evader’s next position, allowing pursuers to determine the evader’s direction. To
implement the algorithm, the only information each pursuer needs is the instanta-
neous position of the evader. This learning algorithm will be used to avoid collisions
among the pursuers and reduce capture time.
4-Dealing with multi-player PE games with single-superior evader by:
Defining a type of reward function for the FACLA algorithm that can teach a
group of pursuers how to capture a single-superior evader in a decentralized man-
ner, when the speed of all the players is identical. The proposed reward function
directs each pursuer to either move to intercept the evader, or to move in parallel
with it. There is no need to calculate the capture angle for each pursuer in order to
determine its control signal.
1.6 Publications
The publications that resulted from this research are:
1. A. A. Al-Talabi and H. M. Schwartz, “An investigation of methods of parameter
tuning for Q-learning fuzzy inference system,” in Proc. of the 2014 IEEE In-
ternational Conference on Fuzzy Systems (FUZZ-IEEE 2014), (Beijing, China),
pp. 2594-2601, July 2014.
2. A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique using
PSO-based FLC and QFIS for the pursuit evasion differential game,” in Proc. of
2014 IEEE International Conference on Mechatronics and Automation (ICMA
2014), (Tianjin, China), pp. 762-769, August 2014.
3. A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique for dual
learning in the pursuit-evasion differential game,” in Proc. of the IEEE Sympo-
sium Series on Computational Intelligence (SSCI) 2014, (Orlando, Florida),
pp. 1-8, December 2014.
4. A. A. Al-Talabi and H. M. Schwartz, “Kalman fuzzy actor-critic learning au-
tomaton algorithm for the pursuit-evasion differential game,” in Proc. of the
2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016),
(Vancouver, Canada), pp. 1015-1022, July 2016.
5. A. A. Al-Talabi, “Fuzzy actor-critic learning automaton algorithm for the pursuit-
evasion differential game,” in Proc. of the 2017 IEEE International Automatic
Control Conference (CACS), (Pingtung, Taiwan), November 2017.
6. A. A. Al-Talabi, “Multi-player pursuit-evasion differential game with equal
speed,” in Proc. of the 2017 IEEE International Automatic Control Confer-
ence (CACS), (Pingtung, Taiwan), November 2017.
Note: Based on the advice of the Faculty of Graduate and Postdoctoral Affairs,
this thesis was written in paper-based format. Therefore, some content might be
presented more than once because each paper was written independently. Also, the
contents of the original papers are modified in a way that preserves the information
presented in each paper, and eliminates repetition. Moreover, additional work has
been added to some of the original papers.
Chapter 2
Background and Literature Review
2.1 Introduction
In this thesis, the Pursuit-Evasion (PE) differential game is investigated from the
learning point of view, and thus this chapter will give basic concepts and theoretical
background that are related to this issue. The chapter begins by giving a brief in-
troduction to what is called a differential game, where the game’s state and action
spaces are given in a continuous domain. Then, the PE differential game is defined,
and a mathematical model for a two-player PE differential game is given. In this
model, each player is defined as a car-like mobile robot. Since the game has con-
tinuous state and action spaces, there is a need to use some form of function
approximation, and for this reason the Fuzzy Inference System (FIS) is described.
In this thesis, the FIS is used as either a function approximator or a Fuzzy Logic
Controller (FLC). Following that, the concept and detailed theoretical background
of the Reinforcement Learning (RL) is provided. Here, the RL is used to address
the learning issue for the problem of the PE differential game. It guides each player
to learn its control strategy in an unknown environment. Next, one of the most
popular stochastic optimization methods is discussed, which is the Particle Swarm
Optimization (PSO) algorithm. In this research, the PSO algorithm is used to speed
up the learning process by finding a good initial setting for the FLC parameters.
Finally, the Kalman filter, as one of the most powerful estimation techniques, is in-
vestigated in detail. Here, the Kalman filter is used to estimate the evader’s next
position. Also, one of its generalizations called a Fading Memory Filter (FMF) is
introduced.
This chapter is organized as follows. The PE game is described in Section 2.2,
and the fuzzy logic inference system is explained in Section 2.3. The RL and PSO
algorithms are described in Section 2.4 and Section 2.5, respectively. The Kalman
filter is discussed in Section 2.6. Finally, a detailed literature review of previous
studies related to this research is provided in Section 2.7.
2.2 The Pursuit-Evasion Game
Game problems involving conflict and/or cooperation are encountered on a daily
basis in areas such as athletics, stock market dealing, political bargaining and war
games. Game theory is usually connected with these situations, as there are typ-
ically a number of rational participants, each with its own goal. Participants are
known as players, agents or decision makers, and if the dynamics of a game are
defined by differential equations, the game is referred to as ‘differential’. Differ-
ential games were initiated by Isaacs [1] in the 1950s, when he studied the opti-
mal behaviour of the PE game at the Rand Corporation. Isaacs [1] proposed the
homicidal-chauffeur problem as an example of the PE differential game. This prob-
lem is a two-player zero-sum game, where the objective of the first player is opposite
to the objective of the second player. In the game, a slow pedestrian (the evader)
who can change direction instantaneously, tries to maximize the capture time or
avoid being captured by a fast homicidal chauffeur (the pursuer). Isaacs then ex-
tended his work to more general cases of the PE game, such as the game of two cars.
Typically, a PE game has one or several pursuers attempting to capture one or more
evaders in minimal time, while the evaders try to escape or maximize the capture
time [1]. Thus, this problem can be considered as an optimization one with con-
flicting objectives [2]. In the PE game, each player attempts to learn the best action
to take at every moment, and to adapt to uncertain or changing environments.
2.2.1 The Game of Two Cars
The game of two cars represents one of the PE differential games, though it
is different from the homicidal-chauffeur problem. In this game, each player moves
with limited speed and turning ability, like a car. Figure 2.1 shows the PE model
for this game and its parameters, and the two players modeled as car-like mobile
robots.
The dynamic equations that describe the motion of the pursuer and the evader
robots in this game are [26]
$$\dot{x}_i = V_i \cos(\theta_i), \qquad \dot{y}_i = V_i \sin(\theta_i), \qquad \dot{\theta}_i = \frac{V_i}{L_i}\tan(u_i), \tag{2.1}$$
where i is e for the evader and p for the pursuer. Also, (x_i, y_i), V_i, θ_i, L_i and u_i refer to the position, velocity, orientation, wheelbase and steering angle, respectively.

Figure 2.1: The PE model for the game of two cars.
The steering angle is bounded by $-u_{i_{\max}} \leq u_i \leq u_{i_{\max}}$, where $u_{i_{\max}}$ is the maximum steering angle. When the steering angle is fixed at $u_i$, the car moves in a circular path with a radius $R_{d_i}$, and when it is fixed at $u_{i_{\max}}$, the car moves with a minimum turning radius. The turning radius can be defined by

$$R_{d_i} = \frac{L_i}{\tan(u_i)}. \tag{2.2}$$
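To make the car-like kinematics concrete, the following Python sketch integrates Equation (2.1) with a simple forward-Euler step and a naive pursuit rule in which the pursuer steers along the line of sight toward the evader. It is only an illustration: the speeds, wheelbases, steering limit, time step and capture radius below are assumed placeholder values, not parameters taken from this thesis.

```python
import math

def step(state, V, L, u, dt=0.05):
    """Advance a car-like robot (x, y, theta) by one Euler step of Equation (2.1)."""
    x, y, theta = state
    x += V * math.cos(theta) * dt
    y += V * math.sin(theta) * dt
    theta += (V / L) * math.tan(u) * dt
    return (x, y, theta)

# Illustrative values only (not taken from the thesis).
pursuer = (0.0, 0.0, 0.0)       # x, y, theta
evader = (5.0, 5.0, 0.0)
Vp, Ve = 2.0, 1.0               # the pursuer is faster in this toy example
Lp, Le = 0.3, 0.3               # wheelbases
u_max = math.radians(30)        # steering bound: -u_max <= u <= u_max

for _ in range(400):
    # The pursuer steers toward the evader along the line of sight (LoS).
    los = math.atan2(evader[1] - pursuer[1], evader[0] - pursuer[0])
    err = math.atan2(math.sin(los - pursuer[2]), math.cos(los - pursuer[2]))
    u_p = max(-u_max, min(u_max, err))
    pursuer = step(pursuer, Vp, Lp, u_p)
    evader = step(evader, Ve, Le, 0.0)  # the evader runs straight in this sketch
    if math.hypot(evader[0] - pursuer[0], evader[1] - pursuer[1]) < 0.1:
        print("captured")
        break
```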
2.3 Fuzzy Logic Control
In 1965, Zadeh [27] established the basis of FLC by introducing the concepts of
fuzzy sets and fuzzy logic. However, researchers did not understand how to use
these concepts in an application until Mamdani [28] applied them to control an
automatic steam engine in 1974. Since then, fuzzy logic has been considered one
of the most powerful methods of describing and designing control systems to deal
with complex processes in an intuitive and simple manner. Thus, FLC has trig-
gered numerous studies [29], and become an active field of research in different
application areas, including statistics [30], industrial automation [31], image and
signal processing [32], biomedicine [33], data mining [34], pattern recognition
[35], data analysis [36], power plants [37], expert systems [38] and control en-
gineering problems [39–42]. Fuzzy-logic concepts are widely used in daily life,
and companies around the world have benefitted by designing various types of
FLCs for different applications and electronic devices. For example, some of the
washing machines manufactured by Matsushita Electric Industrial are fuzzy-based
controlled [43]. FLC is considered a form of soft computing or intelligent control
that can mimic human decision making to deal with partial truth situations. As in-
dicated in Figure 2.2, the FLC structure is composed of four principal components:
fuzzifier, fuzzy rule base, fuzzy inference engine and defuzzifier. The inputs and
outputs of the FLC are real-valued crisp data. The implementation of FLC in real
applications involves the following three steps:
1. Fuzzification: to map real-valued crisp data into a fuzzy set.
2. Fuzzy Inference Process: to combine the Membership Functions (MFs) with the fuzzy rules to obtain the fuzzy output.
3. Defuzzification: to convert the fuzzy output into real-valued crisp data.
2.3.1 Fuzzy Sets and Membership Functions
The idea of fuzzy set theory is an extension of the concept of classical set theory
in which data elements are either in a set or not.

Figure 2.2: Fuzzy logic controller structure (crisp inputs → fuzzifier → fuzzy inputs → fuzzy inference engine with fuzzy rule base → fuzzy outputs → defuzzifier → crisp outputs).

For example, a classical set of all positive numbers can be defined by

$$A = \{z \mid z > 0\}.$$
It is clear that if z > 0 then z is a member of set A; otherwise, z is not a positive
number and not a member of set A. There is a mapping function with two values,
0 or 1, which is called the MF, µ(z), and can be defined by
$$\mu(z) = \begin{cases} 1 & : z \in A, \\ 0 & : z \notin A. \end{cases}$$
Fuzzy set theory allows elements to have partial membership, whereby an element
is both a member and not a member at the same time, and has a specific degree
of membership. This partial membership can be expressed in terms of MF values
between 0 and 1, and the MF is used to map each value in the input space to
a membership value within the interval [0, 1]. For example, if Z is a collection of
elements z which is the universe of discourse, then a fuzzy set A in Z can be defined
as a set of ordered pairs as follows:
$$A = \big\{(z, \mu_A(z)) \mid z \in Z\big\}, \qquad \mu_A(z) \in [0, 1],$$
where µA(z) denotes the MF of fuzzy set A. The domain for each input variable is
usually divided into several membership functions. Thus, for an input value with
several MFs, the input must be processed through each MF. There are many types
of MFs, including trapezoidal, triangular, Gaussian, bell-shaped and sigmoidal, and
the appropriate type to use depends on the problem under consideration. Control
applications commonly use trapezoidal, triangular and Gaussian MFs, and Gaus-
sian MFs are also widely used for the problems of function approximation. In this
work, Gaussian MFs are used unless otherwise stated. The Gaussian MF takes the
following form
$$\mu(z) = \exp\!\left(-\left(\frac{z - m}{\sigma}\right)^{2}\right), \tag{2.3}$$
where m and σ are the mean and the standard deviation, respectively.
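As a small numerical illustration of Equation (2.3), the snippet below evaluates Gaussian MFs and shows how a single crisp input obtains graded degrees of membership in several overlapping fuzzy sets. The means and standard deviation are arbitrary example values, not parameters used in this work.

```python
import math

def gaussian_mf(z, m, sigma):
    """Gaussian membership function of Equation (2.3)."""
    return math.exp(-((z - m) / sigma) ** 2)

# Three overlapping Gaussian sets on an arbitrary axis (example values only).
centers = (50.0, 65.0, 80.0)
for z in (55.0, 65.0, 75.0):
    print(z, [round(gaussian_mf(z, m, 8.0), 3) for m in centers])
```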
2.3.2 Fuzzification
The process of converting a classical set or crisp data to a fuzzy set is known
as fuzzification, and it involves two steps. The first step is to specify the type and
number of MFs for each input and output variable, and the second is to identify the
MFs with linguistic labels. For example, consider a simple air conditioner control
process that is controlled by a heater only [32]. The control process has one input
variable (temperature) and one output variable (motor speed). Suppose that each
variable has five triangular MFs, with the input variable having linguistic labels of
CD (cold), CL (cool), GD (good), WM (warm) and HT (hot), and the output variable
having linguistic labels of VS (very slow), SL (slow), NO (normal), FT (fast), and
VF (very fast), as shown in Figure 2.3.
Figure 2.3: Membership functions of input and output: (a) input (temperature, °F, 40–90) with MFs CD, CL, GD, WM, HT; (b) output (speed, R/M, 0–100) with MFs VS, SL, NO, FT, VF; the vertical axis of each panel is the degree of membership µ.
2.3.3 Fuzzy-Rule Base
The fuzzy-rule base represents an essential part of the FLC, and can be con-
structed from expert knowledge. It consists of a collection of linguistic control rules
in the form of fuzzy IF-THEN rules. The general form of the fuzzy IF-THEN rules is
\[
R^{l}: \text{IF } z_1 \text{ is } A_1^{l} \text{ and } z_2 \text{ is } A_2^{l} \text{ and } \ldots \text{ and } z_N \text{ is } A_N^{l} \text{ THEN } y^{l} \text{ is } C^{l}, \tag{2.4}
\]
where "$z_1$ is $A_1^{l}$ and $z_2$ is $A_2^{l}$ and ... and $z_N$ is $A_N^{l}$" represents the premise or antecedent part and "$y^{l}$ is $C^{l}$" the consequent or conclusion part. Also, $z_i$ is the $i$th input variable, $N$ is the number of input variables, $y^{l}$ is the output of rule $l$, $A_i^{l}$ is a fuzzy set of the input $z_i$ in rule $l$, and $L$ is the number of fuzzy rules. Moreover, in rule $l$, $C^{l}$ can be either a fuzzy set of the output $y^{l}$ or a linear function of the input variables, depending on the fuzzy inference method. Clearly, the inputs
are associated with the premise and the outputs are associated with the conclusion
[44].
Now, the Gaussian membership value of the input zi in rule l can be written as
follows:
\[
\mu_{A_i^{l}}(z_i) = \exp\left(-\left(\frac{z_i - m_i^{l}}{\sigma_i^{l}}\right)^{2}\right), \tag{2.5}
\]
where $m_i^{l}$ and $\sigma_i^{l}$ are the mean and the standard deviation, respectively, for the MF of the input $z_i$ in rule $l$.
2.3.4 Fuzzy-Inference Engine
The fuzzy-inference engine is used to produce the output fuzzy sets from the
input fuzzy ones, based on the information available on the fuzzy-rule-base [39],
cf. Figure 2.2. The fuzzy inference engine generates an output for each activated
rule, then it produces the final output by combining the outputs of all activated
rules.
The two most important types of fuzzy-inference engines or systems in the liter-
ature are the Mamdani and Assilian FIS [45] and the Takagi-Sugeno (TS) FIS [46].
The main difference between them is the definition of the consequent part of Equa-
tion (2.4). In rule l, Mamdani’s FIS is defined Cl in Equation (2.4) as the fuzzy set
of output yl, while TS FIS is defined Cl as a linear function of the input variables.
Hence, the linguistic rules of Mamdani’s FIS can be defined by Equation (2.4). The
TS FIS consists of linguistic rules in the form
\[
R^{l}: \text{IF } z_1 \text{ is } A_1^{l} \text{ and } \ldots \text{ and } z_N \text{ is } A_N^{l} \text{ THEN } y^{l} = K_l^{0} + K_l^{1} z_1 + \ldots + K_l^{N} z_N, \tag{2.6}
\]
where $K_l^{i}$ is a real constant. In the present work, zero-order TS rules with constant consequents are used. It consists of linguistic rules in the form
\[
R^{l}: \text{IF } z_1 \text{ is } A_1^{l} \text{ and } z_2 \text{ is } A_2^{l} \text{ and } \ldots \text{ and } z_N \text{ is } A_N^{l} \text{ THEN } y^{l} = K_l. \tag{2.7}
\]
The number of rules depends on the number of inputs and their corresponding MFs.
These rules increase exponentially with the number of inputs and their correspond-
ing MFs [44]. For example, consider a two-input zero-order TS fuzzy model in which each input has three MFs with the linguistic labels P (positive), Z (zero) and N (negative); in that case a fuzzy rule base with nine rules must be built. The fuzzy rules
can also be constructed using a Fuzzy Decision Table (FDT), as shown in Table 2.1.
Table 2.1: Fuzzy Decision Table (FDT)
                 z2
             N     Z     P
  z1    N    K1    K2    K3
        Z    K4    K5    K6
        P    K7    K8    K9
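To make the zero-order TS inference of Equation (2.7) and Table 2.1 concrete, the following Python sketch (my own illustration; the MF centres, widths and consequent constants K1–K9 are hypothetical) evaluates the nine rules with product premises and the weighted-average defuzzification described in Section 2.3.5.

import itertools
import numpy as np

def gaussian_mf(z, m, sigma):
    return np.exp(-((z - m) / sigma) ** 2)

# Three Gaussian MFs (N, Z, P) per input; values are illustrative only.
means  = np.array([-1.0, 0.0, 1.0])
sigmas = np.array([0.6, 0.6, 0.6])

# One constant consequent per rule (K1..K9 of Table 2.1), illustrative values.
K = np.array([-1.0, -0.7, -0.3,
              -0.2,  0.0,  0.2,
               0.3,  0.7,  1.0])

def ts_zero_order(z1, z2):
    # Zero-order TS inference: product premise and weighted-average defuzzification.
    mu1 = gaussian_mf(z1, means, sigmas)      # memberships of z1 in N, Z, P
    mu2 = gaussian_mf(z2, means, sigmas)      # memberships of z2 in N, Z, P
    w = np.array([mu1[i] * mu2[j] for i, j in itertools.product(range(3), range(3))])
    return np.sum(w * K) / np.sum(w)          # crisp output

print(ts_zero_order(0.4, -0.1))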
2.3.5 Defuzzification
The process of reconverting the fuzzy output to classical or crisp data is known as
defuzzification. There are several popular defuzzification methods in the literature;
including the Center of Gravity, Center-average, Mean of Maximum and Height.
Center of Gravity (COG) Defuzzification
The Center of Gravity (COG) method is one of the most widely used defuzzi-
fication methods. In the literature, it is also called the centroid or centre of area
method. Its basic principle is to find the point $y^*$ in the output universe of discourse $Y$ where a vertical line would give two equal masses, as shown in Figure 2.4(a). Mathematically, the COG technique provides a crisp output value $y^*$ based on calculation of the center of gravity of the fuzzy set, which is defined by the expression [47]
\[
y^{*} = \frac{\int \mu_C(y)\, y \, dy}{\int \mu_C(y)\, dy}, \tag{2.8}
\]
where $\int \mu_C \, dy$ refers to the area bounded by the curve $\mu_C$.
Center-Average Defuzzification
This method can only be used for fuzzy sets with symmetrical output MFs. De-
spite this restriction, it is still one of the most frequently used methods due to its
computational efficiency. It is also effective when used with the TS FIS. In order to
find the defuzzified value, each membership function is weighted by height. The
defuzzified value can be expressed as [47]
\[
y^{*} = \frac{\sum \mu_C(y)\, y}{\sum \mu_C(y)}, \tag{2.9}
\]
where y represents the centroid of each symmetric MF. Figure 2.4(b) illustrates this
operation.
Mean of Maximum (MOM) Defuzzification
This is also called the middle-of-maxima (MOM) method. The defuzzified out-
put is generated by calculating the mean or average of the fuzzy conclusions with
maximum membership values. This method is defined by the expression [48]
\[
y^{*} = \frac{\sum_{i=1}^{N} y_i}{N}, \tag{2.10}
\]
where $y_i$ is the fuzzy conclusion that has the maximum membership value, and $N$ is the number of qualified fuzzy conclusions. For example, the crisp value of the fuzzy set using MOM for the case specified by Figure 2.4(c) is given by
\[
y^{*} = \frac{y_1 + y_2}{2}.
\]
Figure 2.4: Graphical representation of different defuzzification methods: (a) Center of Gravity (COG), (b) Center-Average, (c) Mean of Maximum (MOM), (d) Max Membership Principle.
Max Membership Principle
This is also called the height method, and it is a defuzzification technique that
gives the output $y^*$ as the point in the output universe of discourse such that $\mu_C(y)$ achieves its maximum value. This method can be defined as follows [47]:
\[
\mu_C(y^{*}) \geq \mu_C(y) \quad \forall y \in Y, \tag{2.11}
\]
where $y^*$ is the point at which the output fuzzy set $C$ attains its height. This method is only
applicable when the height is unique, as shown in Figure 2.4(d).
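To illustrate how these defuzzifiers behave numerically, the sketch below (my own example, not from the thesis) samples an arbitrary aggregated output MF on a grid and computes discrete approximations of the COG and MOM values of Equations (2.8) and (2.10).

import numpy as np

# Sampled output universe of discourse and an arbitrary aggregated output MF.
y = np.linspace(0.0, 100.0, 1001)
mu_C = np.maximum(np.exp(-((y - 40.0) / 10.0) ** 2),
                  0.6 * np.exp(-((y - 70.0) / 8.0) ** 2))

# Center of Gravity, Equation (2.8), approximated with a discrete sum.
y_cog = np.sum(mu_C * y) / np.sum(mu_C)

# Mean of Maximum, Equation (2.10): average of the points with (near-)maximal membership.
max_mask = np.isclose(mu_C, mu_C.max())
y_mom = y[max_mask].mean()

print(f"COG: {y_cog:.2f}, MOM: {y_mom:.2f}")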
2.4 Reinforcement Learning
RL is considered a subfield of machine learning [49], and has attracted great in-
terest from researchers due to its ability to manage a wide range of practical appli-
cations, including artificial intelligence and control engineering problems. RL con-
trols an agent in an unknown environment, so a specific objective is accomplished
by mapping situations to actions. The agent is the learner and decision-maker,
and everything outside the agent that interacts with it is called the environment
[16]. In RL, the learner is neither told which action to take nor given its desired output, as is the case in most other machine learning techniques. Instead, it
explores various actions and, by trying them, selects those that offer the most re-
wards [50], which are numerical values that the learning agent seeks to maximize
over the long run. RL differs from supervised learning methods. In particular, in
supervised learning methods the agent learns how to achieve a goal by depending
on a set of input/output data provided by a teacher or an expert. However, with
RL the learning process is achieved through interaction between the learning agent
and the environment, without needing a training data set. The learning agent inter-
acts with the environment by taking an action that depends on its control strategy,
then the environment transitions to a new state and the action is evaluated. The
evaluation of an action is based on the received reward. The agent-environment
interaction continues to teach the agent how to achieve the intended goal. To be
more specific, assume that an agent-environment interaction is achieved at discrete
time steps, i.e., t = 0, 1, 2, · · · , such that at time step t the agent observes the state
of the environment, st ∈ S, where S refers to the set of possible states, and takes an
action at ∈ A, where A points to the set of available actions for the learning agent
in state st. As a result of its action, the agent transitions to a new state, st+1 ∈ S, at
the next time step, and receives a numerical reward, rt+1 ∈ R, from the interactive
environment. The agent-environment interaction is summarized by Figure 2.5.
Figure 2.5: Agent-environment interaction in RL [16].
In RL, the agent creates a mapping, πt, from states to the probabilities of choos-
ing each admissible action. The mapping πt is called the agent’s policy, and πt(s, a)
refers to the probability that the learning agent will select the action at = a when
the environment’s state is st = s . RL methods explain how an agent can use expe-
rience to change its policy in order to achieve the goal of getting as much reward
as it can over time.
2.4.1 Future-Discounting Reward
In RL, the agent’s goal is to maximize the long-term rewards, rather than the im-
mediate reward at each time step. If the sequence of rewards that an agent receives
after time step t is denoted by r = [rt+1, rt+2, rt+3, · · · ], then the agent should max-
imize the expected return reward Rt, which is a function of the reward sequence
r. In RL, the tasks can be grouped into two categories, depending on whether or
not the agent-environment interaction can be divided into episodes [16]. The first
category is known as episodic tasks and each episode has a terminal state, while
the second category is called continuing tasks that cannot be readily divided into
episodes. For episodic tasks, the return reward Rt can be defined as the sum of
received rewards, as follows
Rt = rt+1 + rt+2 + rt+3 + · · ·+ rτ , (2.12)
where τ is the terminal time. However, for continuing tasks the terminal time would be $\infty$ and the sum in Equation (2.12) could be infinite. According to the concept of
future-discounting [16], the agent attempts to select actions that maximize the sum
of the received discounted-future rewards. Hence, the return reward, Rt is given
by:
\[
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \tag{2.13}
\]
where $\gamma$ ($0 < \gamma \leq 1$) is the discount factor. As seen in Equation (2.13), discounting future rewards with $\gamma < 1$ ensures that the sum remains finite as long as each element in $r$ is bounded. Also, Equation (2.13) shows how the value of the
discount factor affects the contribution of each future reward on the return. De-
pending on the value of the discount factor, an agent can have either a nearsighted
or farsighted perspective on maximizing its rewards. If γ approaches 0, the agent
will be nearsighted; that is, only concerned about maximizing the immediate re-
ward rt+1. However, as γ approaches 1 the agent becomes farsighted and will take
the future rewards into account more strongly.
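The effect of the discount factor in Equation (2.13) is easy to see numerically; the short sketch below (an illustration of mine, using an arbitrary reward sequence) compares a nearsighted and a farsighted return.

def discounted_return(rewards, gamma):
    # Return R_t of Equation (2.13) for a finite reward sequence starting at t+1.
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]         # arbitrary example sequence
print(discounted_return(rewards, gamma=0.1))     # nearsighted: late rewards barely count
print(discounted_return(rewards, gamma=0.95))    # farsighted: late rewards matter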
2.4.2 Value Function
In RL, the value function is important, and most RL algorithms are based on
estimating either a state-value function V (s) or an action-value function Q(s, a)
[16]. There are three widely used value-function algorithms in RL [51]: the actor-
critic algorithm [52] to estimate V (s), Q-learning [53] and Sarsa [54] algorithms
to estimate Q(s, a). The state-value function provides an indication of how effective
it is for the learning agent to be in state s, and Q(s, a) is used to indicate how good
it is for the learning agent to take an action a in state s [16]. The value functions
are normally defined in relation to certain policies [16]. Suppose there is a policy
π that maps a state s ∈ S, and an action a ∈ A(s) to a probability value π(s, a),
then the value of state s under policy π can be expressed by V π(s), which can be
represented by the expected total reward when the learning agent starts in state s
and follows policy π thereafter. V π(s) can be given in terms of the expected sum of
discounted rewards as follows [16]:
\[
V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right\}, \tag{2.14}
\]
where $E_{\pi}\{\cdot\}$ refers to the expected value if the learning agent follows policy $\pi$. Similarly, the action-value function under policy $\pi$ can be denoted by $Q^{\pi}(s, a)$, which can be represented by the expected total reward when the learning agent takes action $a$ in state $s$, and then follows policy $\pi$. $Q^{\pi}(s, a)$ is given by
\[
Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s, a_t = a \right\}. \tag{2.15}
\]
For continuous state and action spaces, the value functions V π(s) and Qπ(s, a) can
be approximated by applying one of the function approximation methods, and the
learning agent tunes the parameters of the functions to match the estimates with
the observations [16].
The main role of the value function is to convert the return reward Rt into a
recursive formula. Therefore, the problem of maximizing the long-term reward in
the form of Equation (2.14) is converted to a problem of maximizing the long-term
reward in terms of the value function at the next time step. In particular, using
Equation (2.14), we have
\[
V^{\pi}(s) = E_{\pi}\left\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \;\middle|\; s_t = s \right\} = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}, \tag{2.16}
\]
where s ∈ S for the continuing task or s ∈ S+ for an episodic task, where S+ refers
to the set of all states, including terminal ones. As shown by Equation (2.16), the
value of the current state depends on the value of the next state. Equation (2.16) is
known as the Bellman equation for V π(s).
The main task of RL is to find the appropriate policy that maximizes the agent
reward over the long run, Rt . If the expected return of a policy π is greater than
or equal to the expected return of policy π′, then policy π is better than or equal
to policy π′. In RL, at least one policy is always better than the others; it is known
as the optimal policy and is denoted by π∗. Therefore, the optimal value of the
state-value function is given by
\[
V^{*}(s) = \max_{\pi} V^{\pi}(s), \tag{2.17}
\]
for all $s \in S$, and the optimal value of the action-value function is
\[
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \tag{2.18}
\]
for all s ∈ S and a ∈ A(s). It is possible to rewrite Q∗(s, a) in terms of V ∗(s) as
follows [16]:
Q∗(s, a) = E{rt+1 + γV ∗(st+1) | st = s, at = a}. (2.19)
It is also possible to write V ∗(s) in a form similar to the Bellman equation, as fol-
lows:
\[
V^{*}(s) = \max_{a \in A(s)} Q^{\pi^{*}}(s, a) = \max_{a} E\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \}. \tag{2.20}
\]
Equation (2.20) is known as the Bellman optimality equation for V ∗(s) [16]. On
the other hand, the Bellman optimality equation for Q∗(s, a) is given by
\[
Q^{*}(s, a) = E\left\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s, a_t = a \right\}, \tag{2.21}
\]
where a′ refers to the action selected by the learning agent at the next state st+1,
which gives the maximum reward.
2.4.3 Temporal Difference Learning
Temporal-Difference (TD) learning is a key idea in RL that combines the notions of Dynamic Programming (DP) and Monte Carlo (MC) methods. TD methods are similar to MC methods in that they learn from experience
without needing an environment dynamics model. Also, similar to DP methods,
TD learning methods update estimates by using other learned estimates and do not
need to wait for a final result [16]. TD learning methods are used to estimate the
value function when it is initially unknown and needs to be learned through agent-
environment interaction. The estimate occurs as follows: at each time step the
so-called TD-error, $\Delta_t$, is calculated, then the value function is updated to reduce the TD-error. Therefore, TD learning can be defined by
\[
V_{t+1}(s_t) = V_t(s_t) + \alpha \Delta_t, \tag{2.22}
\]
where $\alpha$ ($0 < \alpha \leq 1$) is the learning rate parameter. The TD-error is given by
\[
\Delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t). \tag{2.23}
\]
The TD learning method is considered as a bootstrapping technique, because the
Algorithm 2.1 TD Learning algorithm.
repeat (for each episode)
    Initialize s
    repeat (for each time step t)
        For state s, choose an action a based on the policy π.
        Take action a; observe the next state s′ and reward r.
        Calculate V(s) from Equation (2.22).
        Set s ← s′.
    until (s is terminal)
until (finish all episodes)
updating process is based on the estimate at the next time-step. The TD learning
algorithm is given in Algorithm 2.1 [16].
It is possible to estimate the action-value function Q(s, a) for all states and ac-
tions based on the TD-error. The estimation approach is as before, but rather than
finding V (s), Q(s, a) is calculated instead. In this respect, there are two different
algorithms for updating Q(s, a). With the first algorithm, known as Sarsa [54], the
update is based on the action at+1. The update rule for Sarsa is defined by
\[
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \Delta_t, \tag{2.24}
\]
where $\alpha$ is as defined earlier, and $\Delta_t$ is given by
\[
\Delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t). \tag{2.25}
\]
This updating rule is followed after every non-terminal state transition st. Oth-
erwise Q(st+1, at+1) is set to zero. The Sarsa algorithm is given in Algorithm 2.2
[16].
The second algorithm uses the greedy action at the next time step, instead of
Algorithm 2.2 Sarsa Learning algorithm.
Q(s, a) is initialized arbitrarily ∀ s ∈ S, a ∈ A
repeat (for each episode)
    Initialize s.
    For state s, choose an action a based on a certain policy (e.g., ε-greedy).
    repeat (for each time step t)
        Take action a; observe the next state st+1 and reward r.
        For state st+1, choose an action at+1 based on a certain policy (e.g., ε-greedy).
        Calculate Q(s, a) from Equation (2.24).
        Set s ← st+1.
        Set a ← at+1.
    until (s is terminal)
until (finish all episodes)
action at+1. This is called Q-learning and was first defined by Watkins [53], and
it is one of the most popular types of RL learning algorithms. It starts with an
arbitrary initial action-value function, then updates Q(s, a) using a set of data tuples
generated by agent-environment interaction. The data tuples include $(s_t, a_t, s_{t+1}, r_{t+1})$, and the update rule for the Q-learning method is given by Equation (2.24),
though the calculation of $\Delta_t$ is different. For the Q-learning algorithm, $\Delta_t$ is defined as follows:
\[
\Delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t). \tag{2.26}
\]
The Q-learning algorithm is given in Algorithm 2.3 [16].
For discrete state and action space cases, satisfying the following conditions
will ensure convergence of the Q-learning algorithm to the optimal value Q∗. The
conditions are [55]:
1. Visit all the state-action pairs an infinite number of times.
2. $\sum_{t=0}^{\infty} \alpha_t = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^{2} < \infty$.
Algorithm 2.3 Q-learning algorithm.
Q(s, a) is initialized arbitrarily ∀ s ∈ S, a ∈ A.
repeat (for each episode)
    Initialize s.
    repeat (for each time step t)
        For state s, choose an action a based on a certain policy (e.g., ε-greedy).
        Take action a; observe the next state st+1 and reward r.
        Calculate Q(s, a) from Equation (2.24).
        Set s ← st+1.
    until (s is terminal)
until (finish all episodes)
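For readers who prefer code, the sketch below mirrors Algorithm 2.3 in Python (a minimal illustration of mine, not part of the thesis; it assumes a generic discrete environment object env with reset() and step(a) methods that return integer states, a reward and a terminal flag).

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning following Algorithm 2.3 and Equations (2.24) and (2.26).
    Q = np.zeros((n_states, n_actions))          # arbitrary initial action-value function
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD-error of Equation (2.26); a terminal next state contributes no future value
            delta = r + (0.0 if done else gamma * np.max(Q[s_next])) - Q[s, a]
            Q[s, a] += alpha * delta             # update rule of Equation (2.24)
            s = s_next
    return Q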
2.4.4 Actor-Critic Methods
The actor-critic based methods are TD techniques in which the actor refers to
the policy structure used for action selection, and the critic refers to the estimated
value function used for criticizing the policy. The critic's evaluation is expressed as a TD-error, which is used to drive the learning processes of the actor and the critic, as shown in Figure 2.6.
Figure 2.6: Actor-critic structure.
The critic is usually a state-value function, V (s), and each time the learning
agent selects an action at while in current state st, the critic evaluates the resulting
new state st+1 to decide whether the performance of the learning agent has im-
proved or deteriorated. The evaluation is based on the TD-error, which is defined
in Equation (2.23). By applying the value of the TD-error it is possible to evaluate
the action chosen by the learning agent in its current state. If the error is positive,
the current action should be strengthened, and if negative, the action should be
weakened [16]. Actor-critic methods are widely used due to their applicability in
RL problems that have continuous state and action spaces, by using some form of
function approximation, such as FLC [56], neural networks [16] or linear approxi-
mation [57].
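One minimal tabular realization of this structure is sketched below (my own illustration, not the FACL or CACLA algorithms used later in the thesis): the critic maintains V(s), the actor maintains action preferences turned into a softmax policy, and the TD-error of Equation (2.23) strengthens or weakens the chosen action.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def select_action(prefs, s):
    # Sample an action from the softmax policy over the actor's preferences.
    p = softmax(prefs[s])
    return int(np.random.choice(len(p), p=p))

def actor_critic_step(V, prefs, s, a, r, s_next, done,
                      alpha_critic=0.1, alpha_actor=0.1, gamma=0.95):
    # One actor-critic update driven by the TD-error of Equation (2.23).
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]                 # TD-error produced by the critic
    V[s] += alpha_critic * delta          # critic update
    prefs[s, a] += alpha_actor * delta    # positive delta strengthens the action, negative weakens it
    return delta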
2.5 Particle Swarm Optimization
PSO is one of the most active research areas in the field of computational swarm
intelligence [58]. Developed by Eberhart and Kennedy [59–61] in 1995, PSO is
a population-based stochastic optimization method inspired by the social behaviour of bird flocking and fish schooling, and was originally designed for con-
tinuous optimization problems [62]. PSO has become an attractive optimization
method for solving a wide spectrum of optimization problems and other problems
that can be easily converted to optimization ones [63]. Its popularity is largely due
to its simplicity, efficiency and the fact that it has only a few parameters to adjust.
It has been successfully applied in various application areas, including pattern clas-
sification, robotic application, system design, multi-objective optimization, image
segmentation, power systems, games, system identification and electric circuitry
design [63, 64]. PSO shares many characteristics with evolutionary optimization
methods such as a Genetic Algorithm (GA), but they are different in how they ex-
plore multidimensional search spaces. It was determined that the PSO algorithm
has similar or better performance than GA [65–70]. However, both methods are
initialized with a population of random solutions in the search space, and the pop-
ulation then moves stochastically in the space to reach an optimum solution by
updating generations [71]. In PSO, the population is called a swarm and represents
candidate solutions in the search space. Each candidate solution is called a particle.
The swarm can be defined as a set of $N_p$ particles as follows:
X = {X1, X2, ..., XNp},
where Xi refers to the position of the ith particle in the search space. Typically,
the PSO algorithm starts with a swarm X, where all its particles are randomly
positioned in D-dimensional search space A ⊂ RD. Then, Xi can be represented by
Xi = (xi1, xi2, ..., xiD) ∈ A, i = 1, 2, ..., Np.
Each particle has a random velocity that directs the particle to fly across the search
space, and this is represented as
Vi = (vi1, vi2, ..., viD), i = 1, 2, ..., Np.
Furthermore, each particle has a fitness value that is calculated according to the
problem fitness function, and is used to measure how close the particle is to the
optimum solution. The PSO algorithm uses two primary updating formulas: one
to update the velocity of each particle, and the other to update their positions. To
update a particle’s velocity, each particle moving through the search space keeps
track of the two best positions in the space with their fitness values [72]; this is
also called the particle experience. The first position is the personal best Pbest that
has been found so far by the particle. The personal best position of the ith particle
is represented as Pbesti = (pbesti1, pbesti2, ..., pbestiD). The second position is the
best global position, Gbestg, that has been found so far among all particles in the
swarm. The symbol g refers to the index of the best particle such that Gbestg =
(gbestg1, gbestg2, ..., gbestgD). Therefore, each particle will depend on its experience
to calculate the new values of the particle’s velocity and position in the search space.
This type of experience mechanism does not exist in GA, or in any evolutionary
optimization methods in general. To update Pbest and Gbestg, it is necessary to
calculate the fitness value of each particle at time step $t$. Let $f$ denote the problem fitness function that should be maximized; then $Pbest_i(t)$ is calculated as follows:
\[
Pbest_i(t) =
\begin{cases}
X_i(t) & : \text{if } f(X_i(t)) \geq f(Pbest_i(t-1)), \\
Pbest_i(t-1) & : \text{otherwise.}
\end{cases} \tag{2.27}
\]
Therefore, for a swarm with $N_p$ particles, $Gbest_g(t)$ is calculated by selecting the
particle with the best Pbest. After calculating Pbest(t) and Gbestg(t), the ith particle
can update its velocity and position in dimension d using the following formulas:
vid(t+ 1) = vid(t) + c1 · r1 · (pbestid(t)− xid(t)) + c2 · r2 · (gbestgd(t)− xid(t)), (2.28)
and
xid(t+ 1) = xid(t) + vid(t+ 1), (2.29)
for i = 1, · · · , Np, and d = 1, · · · , D, where vid is the velocity at dimension d of
particle i, pbestid is the best position found so far by particle i at dimension d,
and gbestgd is the global best position found so far at dimension d. The scalars c1
and c2 are learning factors [73] or weights of the attraction to pbestid and gbestgd,
respectively, and are also called the cognitive and social learning parameters. The
second term of Equation (2.28) represents what is known as the cognition term,
and indicates the thinking of the ith particle on how to learn based on its own experience, while the third term represents what is called the social term, and indicates the
collaboration among the particles in the swarm [74, 75]. The term xid is the current
position of particle i at dimension d, and r1 and r2 are uniform random numbers
in the range [0,1] that are generated at each iteration for every particle and every
dimension to explore the search space. This process is repeated until a specific
stopping criterion is achieved, such as when the maximum number of iterations specified in the algorithm is reached, or when the velocity update falls below a specific value.
The PSO algorithm is given in Algorithm 2.4.
There are several modifications of the original PSO algorithm, and the most
widely used are the PSO with inertia weight [74] and the PSO with constriction
factor [76].
2.5.1 Particle Swarm Optimization with Inertia Weight
In 1998, Shi et al. [68, 74] modified the original PSO algorithm by introducing
a new factor called the inertia weight, which is responsible for keeping the particle
moving in the same direction as in the previous iteration. A large inertia weight is
applied to facilitate a global search capability (i.e., to encourage exploration), and a
small weight is used to assist local search capabilities [77]. Thus, the inertia weight
Algorithm 2.4 PSO Algorithm
1. Set t← 0.
2. Set the number of particles, Np, in the swarm X.
3. For each particle i ∈ 1, · · · , Np, Do
(a) Randomly initialize the position and the velocity of the ith particle, Xi
and Vi, respectively.
(b) Set the initial personal best position of the ith particle the same as its
initial position, Pbesti(t) = Xi.
4. end for
5. Repeat
(a) For each particle i ∈ 1, · · · , Np, Do
i. Find the personal best position, Pbesti, and the fitness value,
f(Xi(t)).
(b) end for
(c) Sort the entire particles according to their fitness values.
(d) Find the global best position and its fitness value.
(e) Update the velocity of each particle from Equation (2.28).
(f) Update the position of each particle from Equation (2.29).
(g) Set t← t+ 1.
6. Until (some condition is satisfied)
is responsible for maintaining a balance between global and local search abilities,
so an optimum solution can be found with fewer iterations [78]. Particles can
update their velocities according to the inertia weight method as follows [74]:
vid(t+1) = w ·vid(t)+c1 ·r1 · (pbestid(t)−xid(t))+c2 ·r2 · (gbestgd(t)−xid(t)), (2.30)
where w is the inertia weight. With this modification it is better to decrease the
value of w over time, which gives the PSO algorithm higher performance compared
to constant inertia settings. The most common way is to start with a large value
of w to encourage exploration in the early stages of the optimization process, and
then decrease it linearly towards zero to remove the oscillatory behaviour in later
stages, or to do more local search activities near the end of the optimization process
[77]. The chosen starting value of the inertia weight is usually slightly larger than
1, and the final value is close to 0 (e.g. 0.1) to prevent the first term of Equation
(2.30) from disappearing. The linearly decreasing strategy of inertial weight can be
described mathematically as follows [58, 79]:
\[
w(t) = w_{max} - (w_{max} - w_{min}) \cdot \frac{t}{t_{max}}, \tag{2.31}
\]
where t is the current iteration counter and tmax is the maximum allowed iteration
counter. Also, wmax and wmin refer to the maximum inertia weight at iteration t = 0
and the minimum inertia weight at iteration t = tmax, respectively.
There are other approaches proposed in the literature to dynamically adapt the
value of w in each iteration, such as random, chaotic random, chaotic linear de-
creasing and fuzzy adaptive [80, 81].
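A compact Python sketch of PSO with a linearly decreasing inertia weight (Equations (2.29)–(2.31)) is given below; it is my own illustration on an arbitrary sphere-like fitness function, not the FLC-tuning setup used later in the thesis.

import numpy as np

def pso(fitness, dim=2, n_particles=20, t_max=100,
        c1=2.0, c2=2.0, w_max=0.9, w_min=0.1, bounds=(-5.0, 5.0)):
    # PSO with linearly decreasing inertia weight, Equations (2.29)-(2.31).
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, dim))      # particle positions
    v = np.random.uniform(-1.0, 1.0, (n_particles, dim))   # particle velocities
    pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[np.argmax(pbest_f)].copy()
    for t in range(t_max):
        w = w_max - (w_max - w_min) * t / t_max             # Equation (2.31)
        r1 = np.random.rand(n_particles, dim)
        r2 = np.random.rand(n_particles, dim)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Equation (2.30)
        x = x + v                                           # Equation (2.29)
        f = np.array([fitness(p) for p in x])
        improved = f >= pbest_f                             # personal-best update, Equation (2.27)
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[np.argmax(pbest_f)].copy()
    return gbest

# Example: maximize an arbitrary inverted sphere function whose optimum is the origin.
print(pso(lambda p: -np.sum(p ** 2)))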
2.5.2 Particle Swarm Optimization with Constriction Factor
Similar to the PSO with inertia weight, another variant of the original PSO al-
gorithm was developed by Clerc et al. [76] in 2002. They made the two learning
factors c1 and c2 dependent, and the relation between them can be defined by using
a new coefficient called the constriction factor χ, which is given by
\[
\chi = \frac{2}{\left| 2 - \psi - \sqrt{\psi^{2} - 4\psi} \right|}, \tag{2.32}
\]
where $\psi = c_1 + c_2$ such that $\psi > 4$. In [82, 83], it was shown that using the constriction factor method is necessary to ensure convergence of the PSO algorithm.
Most implementations of this method use equal values for learning factors c1 and c2,
and c1 = c2 = 2.05 such that χ = 0.729 is considered the default parameter setting
for this variant. Particles can update their velocities according to the constriction
factor method as follows [84]:
vid = χ · [vid + c1 · r1 · (pbestid − xid) + c2 · r2 · (gbestgd − xid)]. (2.33)
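As a quick check of the default setting quoted above, the snippet below (illustrative only) evaluates Equation (2.32) for c1 = c2 = 2.05.

import math

c1 = c2 = 2.05
psi = c1 + c2                                               # psi = 4.1 > 4
chi = 2.0 / abs(2.0 - psi - math.sqrt(psi ** 2 - 4.0 * psi))
print(round(chi, 4))                                        # approximately 0.7298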
2.6 The Kalman Filter
The Kalman filter represents one of the most useful and widely used estimation
techniques for estimating the true state of interest in the presence of noise. It
provides an optimal estimate by minimizing the estimated error between the actual
and estimated state, if certain conditions are satisfied [85]. More specifically, for a
dynamical system having a state vector $x$, Kalman filtering provides an iterative method to find an estimate for $x$, denoted by the estimated state vector $\hat{x}$, by invoking consecutive data inputs or measurements [85]. In fact, the output of the Kalman filter can be considered as a Gaussian probability density function with mean $\hat{x}$ and an estimation error covariance matrix $P$. Thus, this probabilistic nature
allows the Kalman filter to consider uncertainties that may arise in the system, such
as motion model uncertainty and noisy measurements. The Kalman filtering process
starts with an initial estimate of both $x$ and $P$, which can be denoted by $\hat{x}(0)$ and $P(0)$, respectively. Then, through the iterative process, the Kalman filter reduces the variance between $x$ and $\hat{x}$, and as more and more measurements become available, $\hat{x}$ approaches $x$ very quickly. The Kalman filtering process keeps track of both $\hat{x}$ and $P$.
2.6.1 The System Dynamic Model
In Kalman filtering, the system dynamic is assumed to be linear and Gaussian.
In other words, the state evolution is a linear function of the state variables and
the control inputs, whereas the measurements are a linear function of the state
variables. Also, it is assumed that there are zero-mean white Gaussian noises in both
the process and the measurement model. Consider a dynamic system described by
a discrete-time state space model in the form [85]:
x(k + 1) = F (k)x(k) +G(k)u(k) + v(k), (2.34a)
y(k) = H(k)x(k) + w(k), (2.34b)
where the vectors $x(k) \in \mathbb{R}^{n}$, $u(k) \in \mathbb{R}^{p}$ and $y(k) \in \mathbb{R}^{m}$ refer to the system's full
state, input, and output, respectively. The matrix F (k) ∈ Rn×n is used to describe
the system’s dynamics and is called the state transition matrix or the fundamental
matrix, the matrix G(k) ∈ Rn×p is used to describe how the optional control input,
u(k), affects the state’s evolution and is named the control matrix, and H(k) ∈
Rm×n refers to the measurement or output matrix and is used to map the state
vector into the measurement. The vectors $v(k) \in \mathbb{R}^{n}$ and $w(k) \in \mathbb{R}^{m}$ are called the
process and the measurement noises, respectively. They are defined as zero-mean
white Gaussian noises with covariance matrix Qf (k) and Rf (k), respectively, and
are assumed to be independent of each other [86]:
\[
E\left[v(i)v(k)^{T}\right] =
\begin{cases}
Q_f(k) & : \text{if } i = k, \\
0 & : \text{if } i \neq k,
\end{cases} \tag{2.35a}
\]
\[
E\left[w(i)w(k)^{T}\right] =
\begin{cases}
R_f(k) & : \text{if } i = k, \\
0 & : \text{if } i \neq k,
\end{cases} \tag{2.35b}
\]
\[
E\left[v(i)w(k)^{T}\right] = 0, \quad \text{for all } i \text{ and } k. \tag{2.35c}
\]
As an example, Newton's equations of motion give a simple dynamic system model to describe the motion of an object moving in a plane with a constant velocity. In this thesis, the moving object represents the evader, whose dynamics are unknown to the pursuer. To estimate the position (i.e., the xy-coordinates) of this object using
the Kalman filter, the following model can be used
\[
x(k+1) =
\begin{bmatrix}
x_o(k+1) \\ y_o(k+1) \\ v_{xo}(k+1) \\ v_{yo}(k+1)
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & T & 0 \\
0 & 1 & 0 & T \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
x_o(k) \\ y_o(k) \\ v_{xo}(k) \\ v_{yo}(k)
\end{bmatrix}
+ v(k), \tag{2.36a}
\]
\[
y(k) =
\begin{bmatrix}
x_o(k) \\ y_o(k)
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
x_o(k) \\ y_o(k) \\ v_{xo}(k) \\ v_{yo}(k)
\end{bmatrix}
+ w(k), \tag{2.36b}
\]
where T represents the sampling time, (xo(k), yo(k)) refers to the object’s position
at time step $k$ and $(v_{xo}(k), v_{yo}(k))$ refers to its velocity components.
2.6.2 The Kalman Filtering Process
Like any probabilistic estimation method, the Kalman filtering process is a two-
step iterative algorithm of prediction and update. In the prediction step, the Kalman
filter uses the process model with the current state estimate $\hat{x}$ and covariance $P$ to predict
the new value of the system state and its covariance. The resulting predicted values
are usually imperfect due to the error in the process model. In the update step, the
Kalman filter corrects the prediction by taking into consideration the actual reading
coming from the measurement sensors by calculating what is called the Kalman
gain, K. Thus, the results of the update step represent the new estimate of the
true state of the system, together with its covariance. After that, the outputs of
the update step become the inputs to the prediction step and the iterative process
continues. Therefore, the Kalman filter estimation process can be considered as a
form of feedback control [86]. The Kalman filter equations can be divided into two
groups, and can be summarized as follows [85]:
1. Prediction step equations:
\[
\hat{x}(k+1|k) = F(k)\hat{x}(k|k) + G(k)u(k), \tag{2.37a}
\]
\[
P(k+1|k) = F(k)P(k|k)F^{T}(k) + Q_f(k), \tag{2.37b}
\]
where the notation $\hat{x}(k_1|k_2)$ with $k_1 \geq k_2$ is used to denote the estimated value of the state at time step $k_1$ given values of the measurement at all times up to time step $k_2$.

2. Update step equations:
\[
\hat{x}(k+1|k+1) = \hat{x}(k+1|k) + K(k+1)\nu(k+1), \tag{2.38a}
\]
\[
P(k+1|k+1) = P(k+1|k) - K(k+1)H(k+1)P(k+1|k), \tag{2.38b}
\]
where
\[
\nu(k+1) = y(k+1) - H(k+1)\hat{x}(k+1|k), \tag{2.39a}
\]
\[
S(k+1) = H(k+1)P(k+1|k)H^{T}(k+1) + R_f(k+1), \tag{2.39b}
\]
\[
K(k+1) = P(k+1|k)H^{T}(k+1)S^{-1}(k+1), \tag{2.39c}
\]
where ν(k + 1) is called the residual or the innovation error vector, which re-
flects the difference between the actual and predicted measurements, $y(k+1)$ and $H(k+1)\hat{x}(k+1|k)$, respectively, and $S(k+1)$ is the residual covariance. Given these equations, the two-step iterative process can be described as shown in Figure 2.7. From this figure, one can see that $K$ plays an important role in computing $\hat{x}$ and $P$, and it represents a weighting factor to decide which to trust more, the sensor reading or the predicted estimate. If $K$ is a scalar, then its value
will be between 0 and 1. If K → 1, this gives an indication that the sensor readings
are more trusted than the predicted values. So, the Kalman filter gives more weight
Figure 2.7: Kalman filtering process: starting from initial values $\hat{x}(0)$ and $P(0)$, the prediction step (Equations (2.37a) and (2.37b)) and the update step (Equations (2.38)–(2.39)) are applied iteratively.
to the measurement. On the other hand, if K → 0, this means that the predicted
value is more trustworthy than the sensor readings. Therefore, the measurements
have less weight compared to the predicted value. Accordingly, this will make the
measurements have less impact on updating the estimates.
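A minimal NumPy sketch of this prediction/update cycle for the constant-velocity tracking model of Equation (2.36) is given below (my own illustration; the sampling time, noise covariances and initial values are hypothetical, not those used in the thesis).

import numpy as np

T = 0.1                                           # sampling time (assumed)
F = np.array([[1, 0, T, 0],
              [0, 1, 0, T],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)         # state transition matrix of Equation (2.36a)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)         # measurement matrix of Equation (2.36b)
Qf = 0.01 * np.eye(4)                             # process noise covariance (assumed)
Rf = 0.25 * np.eye(2)                             # measurement noise covariance (assumed)

def kalman_step(x_hat, P, y):
    # One prediction/update cycle, Equations (2.37)-(2.39), with no control input.
    x_pred = F @ x_hat                            # Equation (2.37a) with G(k)u(k) = 0
    P_pred = F @ P @ F.T + Qf                     # Equation (2.37b)
    nu = y - H @ x_pred                           # innovation, Equation (2.39a)
    S = H @ P_pred @ H.T + Rf                     # residual covariance, Equation (2.39b)
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain, Equation (2.39c)
    x_new = x_pred + K @ nu                       # Equation (2.38a)
    P_new = P_pred - K @ H @ P_pred               # Equation (2.38b)
    return x_new, P_new

# Illustrative use: an arbitrary initial estimate updated with one noisy position measurement.
x_hat, P = np.zeros(4), 10.0 * np.eye(4)
x_hat, P = kalman_step(x_hat, P, y=np.array([1.0, 0.5]))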
2.6.3 Fading Memory Filter (FMF)
For the implementation of the Kalman filter, the system model is presumed to be
precisely known, otherwise the filter may not give an appropriate state estimate and
may diverge [87]. So, to handle the modeling error that may be present in the system model, an FMF was proposed. The FMF is considered one of the generalizations of the Kalman filter. The FMF's principal idea is to give more weight to recent measurements and less weight to past measurements. This makes the filter more responsive to the recent measurements, and will accordingly make it more
robust to the modeling uncertainty [87].
For the FMF, $Q_f(k)$ and $R_f(k)$ are replaced by $\alpha_f^{-2k} Q_f(k)$ and $\alpha_f^{-2k+2} R_f(k)$, respectively, where $\alpha_f$ is a parameter ($\alpha_f \geq 1$) and is also called the weight factor,
and it is usually selected to be close to 1. Therefore, as k increases, both Qf (k)
and Rf (k) decrease, and this will give more weight to the recent measurement. If
αf = 1, the original Kalman filter is obtained. After substituting the new values
of Qf (k) and Rf (k) into the corresponding equations, Equation (2.37b), Equation
(2.38b), and Equation (2.39c) can be written as follows:
\[
\bar{P}(k+1|k) = \alpha_f^{2} F(k)\bar{P}(k|k)F^{T}(k) + Q_f(k), \tag{2.40}
\]
\[
\bar{P}(k+1|k+1) = \bar{P}(k+1|k) - K(k+1)H(k+1)\bar{P}(k+1|k), \tag{2.41}
\]
\[
K(k+1) = \bar{P}(k+1|k)H^{T}(k+1)S^{-1}(k+1), \tag{2.42}
\]
where $\bar{P}(k+1|k) = \alpha_f^{2} P(k+1|k)$ and $\bar{P}(k+1|k+1) = \alpha_f^{2} P(k+1|k+1)$.
From Equations (2.40) – (2.42), it is obvious that the FMF can be implemented
easily by using the same Kalman filter equations, except that there is a simple mod-
ification that must be done on Equation (2.37b), and it is as follows:
\[
P(k+1|k) = \alpha_f^{2} F(k)P(k|k)F^{T}(k) + Q_f(k). \tag{2.43}
\]
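In code, this modification amounts to a one-line change to the covariance prediction of the Kalman sketch given earlier; the fragment below (illustrative, with a hypothetical weight factor) implements Equation (2.43).

alpha_f = 1.01                                    # weight factor close to 1 (assumed)

def fmf_predict_covariance(P, F, Qf):
    # Covariance prediction with fading memory, Equation (2.43).
    return alpha_f ** 2 * (F @ P @ F.T) + Qf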
2.7 Literature Review
Over the last few decades PE games have received great interest due to their practi-
cal importance, and several types of PE games have been developed, including the homicidal-
chauffeur game and the game of two cars [1]. The optimal control strategies for
the homicidal-chauffeur game and the game of two cars are given in [1, 88, 89].
Isaacs also solved two-player PE games with a slower evader analytically [1], us-
ing the tenet of transition method. This method uses a partial differential equation
known as the Hamilton-Jacobi-Isaacs (HJI) to solve two-player, zero-sum games. It
is based on backward analysis, in that it starts from the terminal state of the game
and traces the optimal trajectory of the states backward. However, current theories
for differential games are unable to deal with the problem of the multi-player PE
differential games, as it is difficult to determine the terminal states of a game. This
difficulty arises because some evaders are not captured, or, if all evaders are cap-
tured, it might have occurred at different times. It could also be due to the increase
in possible engagements between pursuers and evaders as the number of players
increases. For example, suppose that, in a three player PE game with two pursuers
and one evader, one of the pursuers catches the evader, thereby leaving the sec-
ond pursuer with an unknown terminal state and rendering the backward analysis
used by Isaacs [1] unable to solve the HJI equation. Furthermore, the problem will
become more complex when superior evaders are involved. There are four main
techniques in the literature that are commonly used to solve the problems of PE
games: optimal control, DP, RL and game theory [6].
The PE game is a differential game with continuous state and action spaces,
and its solution complexity increases as the number of players increases. Thus, it is
important to use learning techniques that help each player find its optimal strategy
through interaction with its environment, and effectively manage the continuous
state and action spaces. The RL learning method technique has been successfully
applied to differential games [17–19, 21, 90–92], and a function approximation
method such as FIS can manage continuous state and action spaces.
FLC represents a good choice to deal with processes that are ill-defined and/or
involve uncertainty or continuous change. Moreover, it is well known that FISs
can approximate any continuous function, and they are widely used as function
approximators [17, 47, 93–97]. FIS has a knowledge base consisting of a collection
of rules in linguistic form. The rules are used to specify variations in the system
parameters as inputs change [98]. Building a knowledge base is very complex,
particularly when there are many input and output parameters. To overcome this,
it is possible to use an optimization method, such as a genetic algorithm (GA) or a
PSO algorithm, to autonomously tune the parameters of the FLC [98–106]. In [98–
106], supervised learning techniques have been used to tune the FLC parameters.
However, these require a teacher or a training input/output data set which are
sometimes impossible or expensive to obtain [107]. Thus, it is preferable to use
the GA and the PSO algorithms in an unsupervised learning manner. This thesis
proposes using the PSO algorithm to tune the FLC parameters in this way. RL
methods have also been proposed to address the problem of tuning FLC parameters
in an unsupervised manner [17–21, 56, 108–110]. The use of RL with the FLC
is called fuzzy-reinforcement learning, and the technique has been proposed and
successfully applied to different types of problems [56, 108, 111–121], including
the differential game problems, such as the PE game [17–20, 122] and a guarding
territory game [21, 123, 124].
RL is one of the most widely used approaches for learning through interaction
with the environment, and it has attracted researchers in both the artificial intel-
ligence and machine learning fields [50]. Most of the RL algorithms are tabular
methods that use discrete state and action spaces, so they cannot be directly applied
to problems that have continuous state and action spaces, such as the PE differential
game problem. To solve this problem, the state and action spaces can be discretized
such that the resulting table is not too large. Some games are solved by converting
a differential game into a grid world game [125, 126], but for real-world problems
the resulting table or grid will be large. Furthermore, the discretization of state
and action spaces is not a simple task [56]. However, an appropriate function ap-
proximation method can be applied to avoid discretization and deal directly with
continuous state and action spaces [127], which means FIS can be used to gener-
alize state and action spaces. In [17], a fuzzy actor-critic RL algorithm is applied
to the PE differential game and validated experimentally, and the learning algo-
rithm is used to tune the consequent parameters of the FLC and the FIS. The FIS
is used as an approximation of the value function V (s), and it supposes that only
the pursuer learns its behaviour strategy, while the evader plays an optimal control
strategy [17]. In [21], FACL is applied to the guarding territory differential game,
and with this learning technique the consequent parameters are tuned to allow the
defender to learn its Nash equilibrium strategy. In [18–20], the QLFIS is applied
to the PE differential game, and all the premise and consequent parameters of the
FIS and the FLC are tuned. In addition, the FIS is used as an approximation of the
action-value function Q(s, a). The simulation results in [18, 20] have shown that
the QLFIS algorithm enabled the pursuer to catch the evader in minimum time. In
[19], the QLFIS was modified to teach the evader how to escape and avoid capture.
The modification added the distance between the pursuer and the evader as an in-
put to the evader’s FLC, and when the distance approached an assigned value the
evader makes sharp turns to escape if possible, or to, at least, maximize the capture
time. However, the work in [17–21] did not determine which set of parameters had
a significant impact on the performance of the learning algorithm; thus, this should
be investigated. In addition, the learning process normally requires a specific num-
ber of episodes before each learning algorithm achieves acceptable performance.
Sometimes the number selected is higher than needed, which makes the learning
time longer. This issue, as well as proposals for learning algorithms that can speed
up the learning process, are also considered in this thesis. Moreover, if the learn-
ing algorithms proposed in [17–20] are applied to a multi-pursuer single-inferior-evader PE game, the potential for collisions between pursuers increases, particu-
larly if they are near one another or approaching the evader; it will be necessary to
avoid these situations.
In multi-robot PE differential games, the environmental complexity and uncer-
tainty increases as the number of players increases. Eventually, it encounters the
so-called ‘curse of dimensionality’, where the state and action spaces grow expo-
nentially, rendering the problem intractable. In multi-robot PE differential games,
actions taken by each player depend not only on the current state of the game, but
also on the actions taken by the other players; this is called joint action. There are
several methods in the literature that have been used to solve the multi-player PE
differential game with slower evaders. For example, in [128] a hierarchical decom-
position method was used to solve a deterministic multi-player PE differential game
case. The game was decomposed into several two-player PE games, and the focus
was on minimizing the capture time (Tc) of all the evaders, where Tc = maxj{Tcj}
and Tcj denotes the required time to capture evader j. Backward analysis was used
in [128] to find the optimal strategy for each player in individual two-player PE
games. The main drawback of the hierarchical decomposition method is that the
engagement possibility between the pursuers and evaders increases exponentially
with the increase of players. Decentralized learning methods [129, 130] were re-
cently applied to the problem of the multi-player PE game, and in [129] Givigi et
al. modeled the multi-player PE game as a Markov game. Also, each player was
modeled as a learning automaton to learn how to cope with the challenging situa-
tion, and their learning algorithms converged to an equilibrium point. Simulation
results for the case of a three-player PE game in a grid-world in which two pur-
suers attempt to capture a single evader are also given to show the feasibility of
their learning algorithm. In the simulation, only the pursuers learn their behaviour
strategy while the evader follows a fixed strategy. A drawback of this learning al-
gorithm is that the computational requirements increase exponentially when the
number of players or the grid-world size increases. In [130], Desouky et al. ex-
tended their previous learning algorithm [131] from a single PE differential game
to an n-pursuer n-evader differential game. Their learning algorithm consists of
two phases: The decomposition phase decomposes the n-pursuer n-evader game
into n two-player PE games, with the Q-learning algorithm applied to learn the best
coupling among the players so each pursuer is coupled with only one evader. In the
second phase, the learning algorithm previously proposed in [131] is used to teach
each couple how to play the game and self-learn their control strategy. Simulation
results for n = 2 and n = 3 indicate that the pursuers find the best coupling, and
each player is able to learn its Default Control Strategy (DCS). The drawback to this
algorithm is that it is only applicable to continuous domain problems that can be
easily discretized, and discretized domain sizes that are not too large. To avoid this,
a fuzzy-reinforcement actor-critic structure can be used to deal with continuous
state and action spaces.
As mentioned earlier, although there are numerous papers discussing different
methods, including learning methods to solve the problem of the PE game with slow
evaders, there are only a few that address multi-player PE differential games with
superior evaders [8, 23–25, 132–136]. The authors in [8] proposed a sufficiency
condition to capture superior evaders. In [24, 132], decentralized approaches were
used to capture a superior evader in a noisy environment and with a jamming con-
frontation, respectively. In [133] formation control was applied to enable a group
of pursuers to cooperate to capture a superior evader. The focus is on the pur-
suer strategy that ensures invariant angle distribution around the evader, while it
is assumed that the superior evader follows a simple fixed strategy. In [134], a
hierarchical decomposition method was proposed to solve PE games with superior
evaders, and in [25, 135] an Apollonius circle method was used to solve the prob-
lem of a multi-player PE differential game with a superior evader. It was assumed
that each player has a simple motion, and that the game is played in an environ-
ment with perfect information; that is, the evader knows the state change of all the
pursuers, and vice versa. The problem is examined from the point of view of all
players in the game, and possible evasion and pursuit strategies are also discussed.
Most recently, Awheda et al. [23, 136] proposed two decentralized learning algo-
rithms for the PE game problem with a single-superior evader. The first algorithm
[23] enables a team of pursuers with equal speed to capture a single evader with a
similar speed. This learning algorithm is based on the condition that was proposed
in [8], as well as a specific formation control strategy. The second learning algo-
rithm [136] is based on fuzzy-reinforcement learning with Apollonius circles and a
modified formation control strategy. The goal of a superior evader is to learn how
to reach a specific target, and the goal of the pursuers is to learn how to cooperate
to capture the superior evader. However, when the distance between the evader
and the nearest pursuer is less than a specific tolerance distance, the strategy of
the evader is to run away from that pursuer. In [23, 136], there is a necessity to
calculate the capture angle to obtain the required control signal for each pursuer
to capture the superior evader. This thesis proposes a new reward function for-
mulation for the FACLA algorithm, that enables a group of pursuers to capture a
single evader in a decentralized manner and without finding the capture angle. It
is assumed that all the players have similar speeds.
Chapter 3
An Investigation of Methods of
Parameter Tuning for the QLFIS
3.1 Introduction
Fuzzy-reinforcement learning methods have recently been proposed to address the
problem of learning in differential games [17–21]. Reinforcement Learning (RL)
has been used successfully to tune the Fuzzy Logic Control (FLC) parameters for the
problems of the Pursuit-Evasion (PE) and for guarding territory differential game.
In [17], only the consequent parameters of the FLC and the Fuzzy Inference System
(FIS) are tuned in the PE game, using a Fuzzy Actor-Critic Learning (FACL) algo-
rithm. In [18–20], the Q-Learning Fuzzy Inference System (QLFIS) algorithm is
applied to the PE differential game, and all the premise and consequent parameters
of the FLC and the FIS are tuned. In [21], FACL is applied to the guarding territory
differential game, and only the consequent parameters are tuned.
In [17–21], the best parameters to tune were not investigated. Hence, this
chapter discusses whether it is necessary to tune all the premise and consequent
parameters of the FLC and the FIS, or to just tune their consequent parameters. As
is known, it would be computationally more efficient to tune only the consequent
parameters. However, the question is: would it cause a significant loss in perfor-
mance measures if only a subset of the available parameters were tuned? To address
this question, four methods of implementing the QLFIS algorithm are investigated.
The QLFIS consists of two FISs, one used as an FLC, and the other as a function
approximator to estimate the action-value function. The four methods are applied
to three versions of the PE games. In the first version only the pursuer is learning,
and in the second the evader uses its higher maneuverability and plays intelligently
against a self-learning pursuer. In the final version, all the players are learning. An
evaluation is given to determine which parameters are the best to tune and which
parameters have little impact on performance. Also, it is used to recommend the method that is most effective in reducing the computational time, an important factor for real-time applications, while still giving acceptable performance.
The results were published in [137]1.
This chapter is organized as follows: In Section 3.2, the problem statement is
discussed, and Section 3.3 describes the PE game and model. The structure of
the FLC is presented in Section 3.4. A brief introduction of RL is given in Section
3.5. The QLFIS is described in Section 3.6, and Section 3.7 presents the simulation
results. Finally, conclusions and guidelines are discussed in Section 3.8.
1 A. A. Al-Talabi and H. M. Schwartz, “An investigation of methods of parameter tuning for Q-learning fuzzy inference system,” in Proc. of the 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2014), (Beijing, China), pp. 2594–2601, IEEE, July 2014.
3.2 Problem Statement
Methods of parameter tuning for the QLFIS algorithm are investigated. Four pos-
sible methods are considered, based on whether it is necessary to tune all the pa-
rameters (i.e., premise and consequent) of the FIS and the FLC, or to tune only
consequent parameters, as explained in Table 3.1. Three PE games are considered
[20]: in the first game, the pursuer self learns its control strategy while the evader
uses a deterministic strategy, which is to escape along the Line-of-Sight (LoS); this is
defined as the Default Control Strategy (DCS). In the second game, the evader uses
its higher maneuverability and plays intelligently against a self-learning pursuer. In
the final game, two cases are taken into consideration. The first case is a single-
pursuer single-evader game, whereas the second is a multi-pursuer single-evader
game. In both cases, each pursuer interacts with the evader in order to self-learn
their control strategies simultaneously.
Table 3.1: Methods of parameter tuning.
                          The tuned parameters
             For FLC                   For FIS
 Method 1    Consequent parameters     Consequent parameters
 Method 2    Consequent parameters     All parameters
 Method 3    All parameters            Consequent parameters
 Method 4    All parameters            All parameters
3.3 Pursuit-Evasion Game
The application used for this study is the PE differential game [1]. There are two
teams in this game; each team has one or more participants (i.e., players) and
each team has its own goal. The first team is called the pursuer team, and the
second is the evader team. The goal of the first team is to capture the second
team’s participants as quickly as possible, and the second team’s goal is to run away
or prolong the capture time. Thus, this game can be considered an optimization
problem with conflicting objectives [2]. In the PE game, each player needs to learn
how to determine its control strategy at every time step, and to adapt to and interact
with uncertain or changing environments. Each player is modeled as a car-like
mobile robot, and the dynamic equations that describe its motion are given by [26]
\dot{x}_i = V_i \cos\theta_i, \qquad \dot{y}_i = V_i \sin\theta_i, \qquad \dot{\theta}_i = \frac{V_i}{L_i}\tan u_i,   (3.1)
where i is e for the evader and p for the pursuer, and (xi, yi), Vi, θi, Li and ui refer
to position, velocity, orientation, wheelbase and steering angle, respectively. The
steering angle is bounded and given by −uimax ≤ ui ≤ uimax , where uimax is the
maximum steering angle. The minimum turning radius of each robot is calculated
from
R_{d_{i_{min}}} = \frac{L_i}{\tan(u_{i_{max}})}.   (3.2)
It is assumed that the pursuer is faster than the evader (Vp > Ve), and at the same
time the evader is more maneuverable than the pursuer (upmax < uemax). As in [20],
the DCS for this game is defined by
u_i = \begin{cases} -u_{i_{max}}, & \delta_i < -u_{i_{max}}, \\ \delta_i, & -u_{i_{max}} \le \delta_i \le u_{i_{max}}, \\ u_{i_{max}}, & \delta_i > u_{i_{max}}, \end{cases}   (3.3)
where δi represents the angle difference, and is given by
\delta_i = \tan^{-1}\!\left(\frac{y_e - y_p}{x_e - x_p}\right) - \theta_i.   (3.4)
This control strategy allows the player to escape along the LoS. The distance be-
tween the pursuer and the evader at time t is given by
D(t) = \sqrt{(x_e(t) - x_p(t))^2 + (y_e(t) - y_p(t))^2},   (3.5)
and capture occurs when this distance is less than a certain value, ℓ, which is called
the capture radius.
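To make the model concrete, the following minimal Python sketch applies Equations (3.1) and (3.3)–(3.5) with Euler integration over a sampling time T. It is an illustration only (the thesis simulations were written in MATLAB); the function names are not from the thesis, and the wrap-around of the angle difference to [−π, π] is an added assumption.

```python
import math

def dcs_steering(xp, yp, xe, ye, theta_i, u_i_max):
    """Default control strategy (3.3)-(3.4): align with the line of sight from the
    pursuer to the evader, with the steering angle saturated at +/- u_i_max."""
    delta_i = math.atan2(ye - yp, xe - xp) - theta_i             # Eq. (3.4)
    delta_i = math.atan2(math.sin(delta_i), math.cos(delta_i))   # wrap to [-pi, pi] (added assumption)
    return max(-u_i_max, min(u_i_max, delta_i))                  # Eq. (3.3)

def kinematic_step(x, y, theta, u, V, L, T):
    """One Euler step of the car-like robot kinematics (3.1) over the sampling time T."""
    x_next = x + T * V * math.cos(theta)
    y_next = y + T * V * math.sin(theta)
    theta_next = theta + T * (V / L) * math.tan(u)
    return x_next, y_next, theta_next

def separation(xp, yp, xe, ye):
    """Pursuer-evader distance (3.5); capture is declared when this falls below the capture radius."""
    return math.hypot(xe - xp, ye - yp)
```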
3.4 Fuzzy Logic Controller Structure
For the problem under consideration, the fuzzy system is implemented using zero-
order Takagi-Sugeno (TS) model [46], in which each rule’s consequent is repre-
sented by a fuzzy singleton, which is a type of fuzzy set that contains only one
member whose degree of membership is unity. The fuzzy system has two inputs
and one output. For each learning agent, the two inputs are z1, and z2, which rep-
resent the angle difference δi and its derivative δi, respectively, and the output is
the steering angle ui. Each input has three Gaussian Membership Functions (MFs)
that have the linguistic values of P (positive), Z (zero) and N (negative). For the
Gaussian MF, the mean and standard deviation represent the possible tunable pa-
rameters. Thus, there are 2 × 6 = 12 parameters in all the MFs, and these are
called the premise parameters. The mean and standard deviation for the jth MF
of the input zi can be denoted by mij and σij, respectively. The number of rules
depends on the number of inputs and their corresponding MFs; in this case there
are two inputs, δi and its derivative δ̇i, and each input has three MFs. Thus, nine rules must be
built, each with one consequent parameter Kl. As a result, there are 21 parameters
(12 + 9) that can be tuned during the learning phase, as specified by Table 3.2. The
fuzzy output ui is defuzzified into a crisp output using the following center-average
defuzzification method:
u_i = \frac{\sum_{l=1}^{L} K_l \left( \prod_{i=1}^{N} \mu_{A_i^l}(z_i) \right)}{\sum_{l=1}^{L} \left( \prod_{i=1}^{N} \mu_{A_i^l}(z_i) \right)}.   (3.6)
This formula is also called the weighted average defuzzification method [138].
Table 3.2: Fuzzy logic parameters.

Premise parameters      σ11  m11
                        σ12  m12
                        σ13  m13
                        σ21  m21
                        σ22  m22
                        σ23  m23

Consequent parameters   K1
                        K2
                        ...
                        K9
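For illustration, a minimal Python sketch of this zero-order TS controller is given below. It is not the thesis code (which was written in MATLAB); the initial MF centres and spreads in the example are only approximate readings of Figure 3.1 and therefore assumptions, while the consequents follow the pursuer FDT of Table 3.3.

```python
import numpy as np

class ZeroOrderTSFLC:
    """Minimal zero-order TS fuzzy system with 3 Gaussian MFs per input and 9
    singleton consequents, defuzzified with the weighted average of Equation (3.6)."""

    def __init__(self, means, sigmas, K):
        self.means = np.asarray(means, dtype=float)    # m_ij, shape (2, 3)
        self.sigmas = np.asarray(sigmas, dtype=float)  # sigma_ij, shape (2, 3)
        self.K = np.asarray(K, dtype=float)            # K_l, shape (9,)

    def firing_strengths(self, z):
        """omega_l: product of the Gaussian memberships of the two inputs (3.19)."""
        z = np.asarray(z, dtype=float)
        mu = np.exp(-((z[:, None] - self.means) ** 2) / (self.sigmas ** 2))  # shape (2, 3)
        return np.outer(mu[0], mu[1]).ravel()                                # 9 rules

    def output(self, z):
        """Crisp output from the weighted-average defuzzification (3.6)."""
        w = self.firing_strengths(z)
        return float(np.dot(w, self.K) / np.sum(w))

# Initial pursuer FLC: consequents from Table 3.3(a); the MF centres and spreads are
# read only approximately from Figure 3.1, so these particular numbers are assumptions.
pursuer_flc = ZeroOrderTSFLC(
    means=[[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]],
    sigmas=[[0.6, 0.6, 0.6], [0.6, 0.6, 0.6]],
    K=[-0.50, -0.25, 0.00, -0.25, 0.00, 0.25, 0.00, 0.25, 0.50])
u = pursuer_flc.output([0.3, -0.1])   # steering command for (delta, delta_dot)
```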
For the problem under consideration, each learning agent starts with initial MFs
and Fuzzy Decision Table (FDT). Figure 3.1 and Table 3.3 show the MFs of the pur-
suer and the evader before learning and their FDTs. The reason for choosing these
initial settings is to prevent the pursuer from catching the evader at the beginning
of the PE game. It is possible to use different initial settings, though this will affect
the period of the learning process.
Figure 3.1: Initial MFs of pursuer and evader before learning.
3.5 Reinforcement Learning
The main concept of RL is for the agent to learn how to achieve a specific goal by
interacting with the environment. The interaction between the learning agent and
the environment occurs in a simplified manner. At each time step (t = 0, 1, 2...), the
agent selects an action and the environment responds by presenting a new situation
Table 3.3: FDTs of the pursuer and the evader before learning.

(a) FDT of the pursuer.
             δ̇p
             N       Z       P
δp    N     -0.50   -0.25    0.00
      Z     -0.25    0.00    0.25
      P      0.00    0.25    0.50

(b) FDT of the evader.
             δ̇e
             N       Z       P
δe    N     -1.00   -0.50    0.00
      Z     -0.50    0.00    0.50
      P      0.00    0.50    1.00
to the agent, which receives a scalar value rt+1 ∈ R, known as a reward. The goal of
the learning agent is to maximize the accumulated discounted-future reward over
the long run, which is defined by
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\tau} \gamma^k r_{t+k+1},   (3.7)
where γ ∈ (0, 1] is the discount factor and τ is the terminal time.
In RL, the reward function selection process is a task-dependent problem, and
choosing this function correctly enables the agent to update its value function ac-
curately. The main task of the PE game is to enable the pursuer to catch the evader
in the minimum time, and the right choice for this function is defined by Desouky
et al. [20], and is given by:
r_{t+1} = \frac{\Delta D(t)}{\Delta D_{max}},   (3.8)

where

\Delta D(t) = D(t) - D(t+1),   (3.9)

and

\Delta D_{max} = (V_p + V_e)\,T,   (3.10)
where T represents the sampling time.
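As a quick illustration (the function and variable names below are not from the thesis), the reward of Equations (3.8)–(3.10) can be computed as:

```python
def pursuit_reward(D_t, D_t_plus_1, Vp, Ve, T):
    """Reward (3.8)-(3.10): the reduction in pursuer-evader distance over one
    sampling period, normalized by the largest possible reduction (Vp + Ve) * T."""
    delta_D = D_t - D_t_plus_1          # Eq. (3.9)
    delta_D_max = (Vp + Ve) * T         # Eq. (3.10)
    return delta_D / delta_D_max        # Eq. (3.8); close to +1 when closing at full speed
```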
3.6 Q-Learning Fuzzy Inference System (QLFIS)
Desouky et al. [20] proposed a learning technique called the QLFIS and applied it
to the problem of the PE differential game1. The structure of the learning technique
is shown in Figure 3.2. The QLFIS tunes all the parameters of the FIS and the FLC,
and the FIS is used to approximate the action-value function Q(s, a), whereas the
FLC is used to determine the control signal ut. For exploration, a white Gaussian
noise with zero mean and standard deviation σn is added to the signal ut to generate
the control signal uc.
3.6.1 The Learning Rule of the QLFIS and its Algorithm
The QLFIS technique is used to tune all the parameters of the FIS and the FLC.
Let φQ(t) and φu(t) refer to the parameter vector of the FIS and the FLC, respectively,
where φ is defined as follows:
\phi = \begin{bmatrix} \sigma \\ m \\ K \end{bmatrix}.   (3.11)
1Desouky et al. [20] used eligibility traces, defined by λ, in their RL algorithm. In [139], it has been discovered that the eligibility trace had little advantage in this application, and as such it is not used in the subsequent work.
Figure 3.2: The structure of the QLFIS technique [20].
Thus, φQ(t) and φu(t) can be updated according to the following gradient-based
formulas [20]
\phi_Q(t+1) = \phi_Q(t) + \eta \, \Delta_t \, \frac{\partial Q_t(s_t, u_t)}{\partial \phi_Q},   (3.12)

\phi_u(t+1) = \phi_u(t) + \xi \, \Delta_t \left( \frac{u_c - u}{\sigma_n} \right) \frac{\partial u}{\partial \phi_u},   (3.13)
where η and ξ are the learning rate of the FIS and the FLC, respectively, and are
defined by [20]
\eta = 0.1 - 0.09 \left( \frac{i_{ep}}{\text{Max. Episodes}} \right),   (3.14)

\xi = 0.1\,\eta,   (3.15)

where iep is the current episode number. The terms ∂Qt(st, ut)/∂φQ and ∂u/∂φu are calculated
as follows [20]:
\frac{\partial Q_t(s_t, u_t)}{\partial \sigma_{ij}} = \left( \frac{2(z_i - m_{ij})^2}{\sigma_{ij}^3} \right) \frac{\big( K - Q_t(s_t, u_t) \big)}{\sum_{l=1}^{9} \omega_l} \, \omega^T,   (3.16a)

\frac{\partial Q_t(s_t, u_t)}{\partial m_{ij}} = \left( \frac{2(z_i - m_{ij})}{\sigma_{ij}^2} \right) \frac{\big( K - Q_t(s_t, u_t) \big)}{\sum_{l=1}^{9} \omega_l} \, \omega^T,   (3.16b)

\frac{\partial Q_t(s_t, u_t)}{\partial K_l} = \bar{\omega}_l,   (3.16c)
and

\frac{\partial u}{\partial \sigma_{ij}} = \left( \frac{2(z_i - m_{ij})^2}{\sigma_{ij}^3} \right) \frac{(K - u)}{\sum_{l=1}^{9} \omega_l} \, \omega^T,   (3.17a)

\frac{\partial u}{\partial m_{ij}} = \left( \frac{2(z_i - m_{ij})}{\sigma_{ij}^2} \right) \frac{(K - u)}{\sum_{l=1}^{9} \omega_l} \, \omega^T,   (3.17b)

\frac{\partial u}{\partial K_l} = \bar{\omega}_l,   (3.17c)
where \bar{\omega}_l represents the normalized firing strength of rule l and it is calculated as
follows:

\bar{\omega}_l = \frac{\omega_l}{\sum_{l=1}^{9} \omega_l},   (3.18)

and \omega_l represents the firing strength of rule l, and it is defined as follows:

\omega_l = \prod_{i=1}^{2} \mu_{A_i^l}(z_i),   (3.19)
where µAli(zi) refers to the Gaussian membership value of the input zi in rule l. The
terms K and ω are two vectors containing the consequent and strength of certain
rules, respectively. For example, the parameter σ23 represents the standard devia-
tion for the third MF of the second input z2, and it appears in rules R3, R6, and R9.
Thus, ∂Qt(st, ut)/∂σ23 can be calculated from Equation (3.16a) with K = [K3  K6  K9]
and ω = [ω3  ω6  ω9]. The QLFIS learning algorithm is given in Algorithm 3.1.
Algorithm 3.1 Learning in the QLFIS.
1. Initialize the premise and the consequent parameters of the FLC as shown in
Figure 3.1 and Table 3.3, respectively.
2. Initialize the premise parameters of the FIS with the same values as those
of the FLC and initialize the consequent parameters to zeros. Note that the
output of the FIS is a Q-value.
3. Set γ ← 0.95, and σn ← 0.08.
4. For each episode (game)
(a) Calculate η from Equation (3.14) and calculate ξ from Equation (3.15).
(b) Initialize the position of the pursuer, (xp, yp) to (0, 0).
(c) Initialize the position of the evader, (xe, ye), randomly.
(d) Calculate the initial state, s = (δi, δ̇i), from Equation (3.4).
(e) For each step (play) Do
i. Calculate the output of the FLC, u, using Equation (3.6).
ii. Calculate the output uc = u+N (0, σn).
iii. Calculate the output of the FIS, Q(s, u), using Equation (3.6).
iv. Run the game for the current step and observe the next state st+1.
v. Get the reward, rt+1.
vi. From Equation (3.6), calculate the Q-value of the next state,
Q(st+1, u′), which is the output of the FIS.
vii. Calculate the TD-error, ∆t.
viii. Calculate the gradient for the premise and the consequent param-
eters of the FIS and the FLC from Equation (3.16) and Equation
(3.17), respectively.
ix. Update the parameters of the FIS from Equation (3.12).
x. Update the parameters of the FLC from Equation (3.13).
xi. Set st ← st+1.
(f) end for
5. end for
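As a complement to Algorithm 3.1, the sketch below shows how a single play could be coded. It is limited to Method 1 (consequent parameters only) to keep it short, uses Python rather than the MATLAB used for the thesis simulations, and all helper names (env_step, firing_strengths, and so on) are assumptions carried over from the earlier FLC sketch.

```python
import numpy as np

def qlfis_play_step(flc, fis, state, env_step, gamma, eta, xi, sigma_n, rng):
    """Sketch of one play (steps i-xi of Algorithm 3.1), restricted to Method 1,
    i.e., only the consequent vectors K of the FLC and the FIS are updated.
    `flc` and `fis` are assumed to expose firing_strengths()/output() and a K vector
    (e.g., the ZeroOrderTSFLC sketch of Section 3.4), and env_step(uc) is assumed to
    advance the game by one sampling period and return (next_state, reward)."""
    u = flc.output(state)                           # step i: FLC output, Eq. (3.6)
    uc = u + rng.normal(0.0, sigma_n)               # step ii: exploration noise
    q = fis.output(state)                           # step iii: Q(s_t, u_t)
    next_state, r = env_step(uc)                    # steps iv-v: play and reward
    q_next = fis.output(next_state)                 # step vi: Q of the next state
    td_error = r + gamma * q_next - q               # step vii: TD error, Delta_t

    dq_dK = fis.firing_strengths(state)             # step viii (consequents only):
    dq_dK = dq_dK / np.sum(dq_dK)                   # Eq. (3.16c), normalized strengths
    du_dK = flc.firing_strengths(state)
    du_dK = du_dK / np.sum(du_dK)                   # Eq. (3.17c)

    fis.K = fis.K + eta * td_error * dq_dK                          # step ix, Eq. (3.12)
    flc.K = flc.K + xi * td_error * ((uc - u) / sigma_n) * du_dK    # step x, Eq. (3.13)
    return next_state                                               # step xi
```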
3.7 Computer Simulation
In order to evaluate the effect of each method of parameter tuning on the perfor-
mance of the QLFIS algorithm, it is applied to the three versions of the PE
differential game. For the purpose of simulation, all the code was written in MATLAB
and implemented on an Intel i7-3612 quad-core processor with a 2.1 GHz clock frequency
and 8 GB of RAM, so all comparisons of the computation time are based on these
specifications. For fair comparison, the
the computation time were based on these specifications. For fair comparison, the
same values used by Desouky et al. [20] are applied, unless stated otherwise. It is
assumed that the pursuer is faster than the evader with Vp = 2m/s and Ve = 1m/s,
and the evader is more maneuverable than the pursuer with −1 ≤ ue ≤ 1 and
−0.5 ≤ up ≤ 0.5. The wheelbases of the pursuer and the evader are the same
and they are equal to 0.3 m. In each episode, the pursuer starts from the
origin with an initial orientation of θp = 0, while the evader's initial position is chosen
randomly from a set of 64 different positions with θe = 0, unless stated otherwise.
The selected capture radius is ℓ = 0.1 m, except for the second game, where
ℓ = 0.05 m. The sample time is T = 0.1 s. For statistical analysis purposes, Monte
Carlo simulation is run 500 times to get sufficient information about the capture
and computation times, and each simulation run performs 1000 episodes/games.
The number of plays in each game is 600, so each game terminates when the time
exceeds 60 s, or the pursuer captures the evader. Also, a 2-tailed 2-sample t-test
is performed to show if there is any significant difference among the means of the
computation time at the 0.05 level.
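The statistical post-processing can be reproduced with standard tools. The following Python/SciPy sketch (an assumption about tooling, since the thesis analysis was done in MATLAB, and the function name is illustrative) computes the mean, standard deviation, and a 2-tailed 2-sample t-test for two sets of Monte Carlo capture or computation times.

```python
import numpy as np
from scipy import stats

def compare_methods(times_a, times_b, alpha=0.05):
    """Mean and standard deviation over the Monte Carlo runs for two tuning methods,
    plus a 2-tailed 2-sample t-test on the difference in means at level alpha."""
    t_stat, p_value = stats.ttest_ind(times_a, times_b)
    return {
        "mean_a": float(np.mean(times_a)),
        "std_a": float(np.std(times_a, ddof=1)),
        "mean_b": float(np.mean(times_b)),
        "std_b": float(np.std(times_b, ddof=1)),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }
```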
3.7.1 Evader Follows a Default Control Strategy
In this game, it is assumed that the evader plays its DCS, as defined by Equation
(3.3) and Equation (3.4). It is also assumed that the pursuer does not have any
information about the evader’s strategy. The goal is to make the pursuer self-learn
its control strategy by interacting with the evader. Furthermore, to find the best
methods of parameter tuning for this game, the four methods discussed in Section
3.2 are implemented, and the results are compared with the DCS.
As mentioned in Section 3.4, the initial MFs of the pursuer and its FDT are
selected as shown in Figure 3.1 and Table 3.3, respectively, such that the pursuer
cannot capture the evader at the beginning of the game. Figure 3.3 shows the
PE paths before the pursuer starts to learn (i.e., for the evader’s starting position
(−6, 7)), which demonstrates that the pursuer was unsuccessful in capturing the
evader.
Following the initialization step, the performance of each method of parame-
ter tuning needs to be evaluated. Since the capture time and the PE paths of the
DCS are known, the performance can be evaluated by comparing the capture time
and the PE paths that result from each method, with those resulting from the DCS.
Therefore, for each method of parameter tuning, the Monte Carlo simulation for
Algorithm 3.1 is run 500 times, and after each simulation run the capture times
are calculated for 5 different evader initial positions. Then, the mean and standard
deviation of the capture time over the 500 simulation runs are calculated, and the
results are given in Table 3.4, which shows the mean values of the capture time for
the different initial evader positions using the DCS and the four tuning methods.
The PE paths using these methods compared to the DCS are shown in Figures 3.4 –
3.7. From Table 3.4 and Figures 3.4 – 3.5 it is clear that the performance of the first
two methods is similar, and their mean values of the capture time are only slightly
different than the DCS. This observation reveals that the tuning method of the FIS
does not have much effect on the performance of the learning algorithm. Ac-
cording to Table 3.4 and Figures 3.6 – 3.7, the performance of the last two methods
is similar, and both approach the performance of the DCS with respect to the mean
values of the capture time and on the PE paths, which confirms that tuning all the
parameters of the FLC has a significant effect on the performance of the learning
algorithm. Moreover, Figure 3.7 shows that the PE paths using the fourth method
and the DCS are indistinguishable. Thus, for this version of the PE game, it can
be concluded that the performance of the learning algorithm is slightly affected by
changing the method of tuning for the FIS, and is significantly affected by changing
the method of tuning for the FLC. Furthermore, the results show that the pursuer
can learn its control strategy in all four methods, and that the last two methods
outperform the first two.
Table 3.4: Mean and standard deviation of the capture time (s) for different evader
initial positions for the first version of the PE game.
Evader initial position    (−6, 7)              (−7,−7)              (2, 4)               (3,−8)               (−4, 5)
                           Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.
DCS                        9.6      -           10.4     -           4.5      -           8.5      -           6.8      -
Method 1                   10.1304  0.1700      10.8554  0.1830      4.7946   0.0837      9.0078   0.1579      7.1934   0.1326
Method 2                   10.0144  0.1513      10.7732  0.1729      4.6938   0.0914      8.8976   0.1536      7.0804   0.1117
Method 3                   9.6196   0.0417      10.4006  0.0077      4.5004   0.0063      8.6004   0.0089      6.8030   0.0193
Method 4                   9.6214   0.0434      10.4012  0.0167      4.5004   0.0063      8.5996   0.0110      6.8020   0.0166
After comparing the performances of the four methods of parameter tuning, the
computational complexity should be measured, as it represents an important factor
for any real-time applications. In this thesis, it is measured by finding the mean and
the standard deviation of the time (i.e., computation time) that the computer will
Figure 3.3: The PE paths on the xy-plane for the first version of the PE game, before
the pursuer starts to learn.
Figure 3.4: The PE paths on the xy-plane for the first version of the PE game when
the first method of parameter tuning is used versus the PE paths when each
player followed its DCS.
Figure 3.5: The PE paths on the xy-plane for the first version of the PE game when
the second method of parameter tuning is used versus the PE paths when each
player followed its DCS.
Figure 3.6: The PE paths on the xy-plane for the first version of the PE game
when the third method of parameter tuning is used versus the PE paths when
each player followed its DCS.
Figure 3.7: The PE paths on the xy-plane for the first version of the PE game when
the fourth method of parameter tuning is used versus the PE paths when each
player followed its DCS.
take when simulating the learning algorithms only. As mentioned in Section 3.7,
the simulation of the learning algorithm is run 500 times, and is implemented using
an Intel i7-3612 quad-core processor with a 2.1 GHz clock frequency and 8 GB
of RAM. Table 3.5 shows a comparison of the computation time for each parameter
tuning method after 500 simulation runs. From this table, it can be seen that the first,
second, and third methods are, respectively, about 2.78, 1.45, and 1.52 times faster
than the fourth method. Also, when the 2-tailed 2-sample t-test is applied to the
computation time, it is found that the difference in means is statistically significant
at the 0.05 level (i.e., all differences in means resulted in a p-value below 0.0001).
3.7.2 Evader Using its Higher Maneuverability Advantageously
The second version of the PE game allows the evader to make use of its advantage
of higher maneuverability.

Table 3.5: Mean and standard deviation of the computation time (s) for the four
methods of parameter tuning for the first version of the PE game.

            Mean      Standard Deviation
Method 1    7.5943    0.0706
Method 2    14.5941   0.2539
Method 3    13.9162   0.1373
Method 4    21.1100   0.3423

The dynamic equations that describe the motions of the pursuer and the evader
robots are given by:

\dot{x}_i = v_i \cos(\theta_i), \qquad \dot{y}_i = v_i \sin(\theta_i), \qquad \dot{\theta}_i = \frac{v_i}{L_i}\tan(u_i).   (3.20)
In this game, the robot velocity, vi, is governed by its steering angle, so it slows
down in turns and avoids slippage. This velocity is defined by:
v_i = V_i \cos(u_i),   (3.21)

where Vi represents the robot's maximum velocity. Desouky et al. [20] modified
the evader’s default strategy to allow the evader to take advantage of its higher
maneuverability, as follows:
1. If the evader is far enough from the pursuer (i.e., D(t) is greater than a certain
value dc), then the evader will attempt to escape along the LoS. So, the evader
control strategy is
u_e = \tan^{-1}\!\left(\frac{y_e - y_p}{x_e - x_p}\right) - \theta_e.   (3.22)
2. If D(t) < dc, the evader uses its advantage of higher maneuverability to move
in the opposite direction to the pursuer. So, the evader control strategy
here is
u_e = (\theta_p + \pi) - \theta_e,   (3.23)

where dc is the minimum turning radius of the pursuer, R_{d_{p_{min}}}.
Figure 3.8 shows the PE paths before the pursuer learns (i.e., for the evader
position (−4, 5)).
Figure 3.8: The PE paths on the xy-plane for the second version of the PE game,
before the pursuer starts to learn.
To determine the best methods of parameter tuning for this game, the QLFIS learn-
ing process is implemented for each method, and the results are compared with
the modified DCS of this game. The mean values of the capture time for different
initial evader positions using the modified DCS and the four tuning methods are
given in Table 3.6, and the PE paths for these methods compared with the modified
DCS are shown in Figures 3.9 – 3.12. Table 3.6 and Figures 3.9 – 3.10 show that
the performance of the first two methods is similar. They also show that the first
two methods of parameter tuning cannot achieve acceptable performance, as there
are significant disparities in the mean values of the capture time compared with the
DCS, and the PE paths differ from those of the DCS. Thus, it is clear that the pursuer
does not learn well, which gives an indication that changing the method of tuning
for the FIS has less effect on the resulting performance of the QLFIS algorithm. On
the other hand, Table 3.6 and Figures 3.11 – 3.12 show that the performance of the
third method resembles the performance of the fourth, and that it outperforms the
first two methods with respect to both capture time and PE paths when compared
with those resulting from the DCS. This confirms that the FLC parameter tuning
method has significant impact on the learning algorithm performance. As a result,
for the second version of the PE game, it can be concluded that the performance of
the QLFIS algorithm is only slightly affected by changing the method of tuning for
the FIS, while it is significantly affected by changing the method of tuning for the
FLC.
Table 3.6: Mean and standard deviation of the capture time (s) for different evader
initial positions for the second version of the PE game.
Evader initial position    (−6, 7)              (−7,−7)              (2, 4)               (3,−8)               (−4, 5)
                           Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.
DCS                        18.6     -           11.9     -           4.3      -           10.2     -           10.2     -
Method 1                   15.4137  1.4949      16.7493  1.1453      9.7761   1.9525      14.0899  1.6505      8.7226   1.1525
Method 2                   15.3214  1.0682      16.5767  1.2139      9.8778   1.6904      13.7428  1.5064      8.9742   1.0761
Method 3                   18.9392  0.4601      12.0921  0.4216      4.4259   0.2496      9.5625   0.4812      10.4883  0.1790
Method 4                   18.8886  0.3218      12.1428  0.4129      4.3516   0.2453      9.6222   0.4634      10.4087  0.1828
Figure 3.9: The PE paths on the xy-plane for the second version of the PE game
when the first method of parameter tuning is used versus the PE paths when
each player followed its DCS.
Figure 3.10: The PE paths on the xy-plane for the second version of the PE game
when the second method of parameter tuning is used versus the PE paths
when each player followed its DCS.
Figure 3.11: The PE paths on the xy-plane for the second version of the PE game
when the third method of parameter tuning is used versus the PE paths when
each player followed its DCS.
Figure 3.12: The PE paths on the xy-plane for the second version of the PE game
when the fourth method of parameter tuning is used versus the PE paths when
each player followed its DCS.
Table 3.7 shows a comparison of the computation time for each method of pa-
rameter tuning, which shows that the first, second, and third methods are, respec-
tively, about 2.56, 1.18, and 1.61 times faster than the fourth method. Since the
first two methods did not give an acceptable performance when the evader uses the
advantage of its higher maneuverability, their reduction in the computation time is
not of significant importance for this game. Also, t-testing demonstrated a signifi-
cant difference among the means of the computation time at the 0.05 level (i.e., all
differences in means resulted in a p-value less than 0.0001).
Table 3.7: Mean and standard deviation of the computation time (s) for the four
methods of parameter tuning for the second version of the PE game.
Mean Standard Deviation
Method 1 12.6270 0.4631
Method 2 28.9608 1.0336
Method 3 20.1649 0.7549
Method 4 32.3712 0.6077
3.7.3 Multi-Robot Learning
The game presented in this section is called multi-robot learning because each
player in this game is trying to learn its control strategy. Two cases are presented:
the first is a two-player PE differential game (i.e., a single-pursuer single-evader
game), while the second is a multi-pursuer single-evader differential game.
Case 1: Single-Pursuer Single-Evader
In this case, it is assumed that each player does not have any information about
the other player’s strategy, and the goal is to make both players interact with each
other, and thereby self-learn their control strategies simultaneously.
The initial MFs of the pursuer and the evader and their FDTs before learning
are selected as shown in Figure 3.1 and Table 3.3. Figure 3.13 shows the PE paths
before learning (i.e., for the evader position (−6, 7)).
After learning, the mean values of the capture time for different initial evader
positions using the DCS and the four methods of parameter tuning are given in
Table 3.8. The PE paths using these methods compared to the DCS are shown in
Figures 3.14 – 3.17. Table 3.8 and Figure 3.14 show that the first method of pa-
rameter tuning for multi-robot learning is not effective enough to reach the desired
performance level, compared to the DCS. It is clear that the evader does not learn
well, and is captured too soon. As indicated, there are differences in the capture
times, and the PE paths have deviated from those of the DCS. Also, Table 3.8 shows
that the mean values of the capture time for the last three methods of parameter
tuning are slightly different than those of the DCS.

Figure 3.13: The PE paths on the xy-plane for the third version of the PE game
(Case 1), before the players start to learn.

From Figure 3.15, it is evident
that the PE paths are different than the paths of the DCS. Furthermore, Figures
3.16 – 3.17 show that the PE paths of the last two methods differ slightly from that
of the DCS. Thus, tuning all the parameters of the FLC gives the highest perfor-
mance for capture time and the PE paths.
Table 3.8: Mean and standard deviation of the capture time (s) for different evader
initial positions for the third version of the PE game (Case 1).
Evader initial position    (−6, 7)              (−7,−7)              (2, 4)               (3,−8)               (−4, 5)
                           Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.   Mean     Std. dev.
DCS                        9.6      -           10.4     -           4.5      -           8.5      -           6.8      -
Method 1                   8.3946   0.5580      9.0094   0.6411      4.3218   0.4160      8.3382   0.5957      5.7092   0.5179
Method 2                   9.4096   0.2956      10.1420  0.3581      4.4298   0.1256      8.5698   0.2748      6.5362   0.2150
Method 3                   9.6424   0.1356      10.4448  0.0663      4.4434   0.0553      8.5850   0.1280      6.7448   0.1017
Method 4                   9.7034   0.2288      10.4346  0.3692      4.5382   0.1839      8.6496   0.2069      6.8628   0.1756
Figure 3.14: The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the first method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Table 3.9 shows the computation time when the four methods of parameter
Figure 3.15: The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the second method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.16: The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the third method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.17: The PE paths on the xy-plane for the third version of the PE game
(Case 1) when the fourth method of parameter tuning is used versus the PE
paths when each player followed its DCS.
tuning were implemented. It shows that the first, second, and third methods are,
respectively, about 3.6, 1.62, and 1.65 times faster than the fourth method. As the
performance of the first method is not convincing, its computation speed is not that
important. But, when comparing the other methods, it seems that the third method
is the most recommended one. Also, based on t-testing, it is found that there are
significant differences in means at the 0.05 level (i.e., the resulting p-values are less
than 0.0001).
Case 2: Multi-Pursuer Single-Evader
For this case, it is assumed that there are multiple pursuers and a single evader, in
which all pursuers have the same capabilities. Also, it is assumed that each player
does not have any information about the other players' strategies, and the goal is to
enable each player to learn its control strategy.

Table 3.9: Mean and standard deviation of the computation time (s) for the four
methods of parameter tuning for the third version of the PE game (Case 1).

            Mean      Standard Deviation
Method 1    10.4277   0.2808
Method 2    23.1968   0.5221
Method 3    22.7298   0.4542
Method 4    37.5307   1.0639

For the pursuers, it is proposed that
the learning process is achieved in a decentralized manner, wherein each pursuer in-
teracts with the evader to find its control strategy by considering the other pursuers
as part of its environment. On the other hand, it is assumed that the control strategy
of the evader is to learn how to escape from the nearest pursuer. Thus, the evader’s
learning process can be achieved through the interaction between the evader and
all the pursuers in order to find its control strategy. Therefore, at each time step the
evader calculates its distances from all the pursuers to determine which pursuer is
the closest one to it. Thus, δe can be defined as the angle difference between the
LoS from the nearest pursuer to the evader and the evader’s direction, and is given
by
\delta_e = \tan^{-1}\!\left(\frac{y_e - y_{p_c}}{x_e - x_{p_c}}\right) - \theta_e,   (3.24)
where pc denotes the nearest pursuer to the evader. Also, to implement the QLFIS
algorithm for this case, the reward functions of each player should be defined.
For each pursuer the reward function is as defined by Equation (3.8), whereas the
evader’s reward function ret+1 is defined as follows:
ret+1 = −rpct+1. (3.25)
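A small Python sketch of this decentralized setup is given below; it only illustrates Equations (3.8), (3.24), and (3.25), the function and variable names are not from the thesis, and the MATLAB implementation used in the simulations is not reproduced here.

```python
import math

def evader_state_and_reward(evader, pursuers, prev_dists, Vp, Ve, T):
    """The evader measures its distance to every pursuer, builds its angle difference
    from the nearest one (3.24), and receives the negative of that pursuer's reward
    (3.25). `evader` and each entry of `pursuers` are (x, y, theta) tuples;
    `prev_dists` holds the previous step's distances. All names are illustrative."""
    xe, ye, theta_e = evader
    dists = [math.hypot(xe - xp, ye - yp) for (xp, yp, _) in pursuers]
    c = min(range(len(pursuers)), key=lambda j: dists[j])        # nearest pursuer p_c
    xpc, ypc, _ = pursuers[c]
    delta_e = math.atan2(ye - ypc, xe - xpc) - theta_e           # Eq. (3.24)
    r_pc = (prev_dists[c] - dists[c]) / ((Vp + Ve) * T)          # pursuer reward, Eq. (3.8)
    r_e = -r_pc                                                  # evader reward, Eq. (3.25)
    return delta_e, r_e, dists
```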
For simulation purposes, it is assumed that there are three pursuers and a single
evader. The evader starts its motion from the origin with an initial orientation
θe = 0, while the pursuers’ starting positions are selected randomly from a set of 64
different positions with θp1 = θp2 = θp3 = 0.
After running the Monte Carlo simulation 500 times for each method of parame-
ter tuning, the mean value and standard deviation of the capture time for 5 different
initial pursuers’ positions are calculated, and the results are as shown in Table 3.10.
The PE paths using the four tuning methods compared to those resulting from the
DCS are shown in Figures 3.18 – 3.21 (i.e., for the initial pursuers’ positions (−4, 4),
(−4,−5), (8, 4)). The results show that the performance of the first tuning method
is unacceptable, as there are large discrepancies in the mean values of the capture
time compared with those of the DCS. Also, it shows that Method 2 has a better
performance compared with Method 1, which means that tuning all the parameters
of the function approximator gives the learning player more ability to learn. On
the other hand, the results show that the performance of the last two methods are
slightly different from that of the DCS, which means that the learning processes are
completed successfully and each player is able to learn its control strategy. From this
simulation, it is evident that tuning all the parameters of the FLC has a significant
impact on the learning process.
Table 3.10: Mean and standard deviation of the capture time (s) for different
pursuers’ initial positions for the third version of the PE game (Case 2).
Pursuers' initial positions:
  (a) (−2,−10), (−3, 12), (4, 16)     (b) (−6, 12), (3, 10), (6,−12)     (c) (20,−6), (−20,−5), (4, 12)
  (d) (15, 5), (−15, 6), (7,−7)       (e) (−4, 4), (−4,−5), (8, 4)

                 (a)                  (b)                  (c)                   (d)                  (e)
                 Mean     Std. dev.   Mean     Std. dev.   Mean      Std. dev.   Mean     Std. dev.   Mean     Std. dev.
DCS              7.6      -           8.7      -           11.9      -           8.3      -           4.4      -
Method 1         7.3710   1.2687      6.9116   1.1570      9.6606    1.6043      7.2210   1.4531      4.2896   0.5471
Method 2         7.9668   0.5270      8.9962   0.7027      12.0242   0.8159      8.8184   0.6144      4.5988   0.3217
Method 3         7.8546   0.0861      8.6812   0.1035      11.9670   0.1413      8.5204   0.0910      4.4630   0.0560
Method 4         7.8334   0.1792      8.8168   0.1993      11.9152   0.2600      8.4978   0.2130      4.4386   0.1210
The computation time for each method of parameter tuning is given in Table
3.11, which shows that the first, second, and third methods are, respectively, about
3.18, 1.61, and 1.65 times faster than the fourth method. Since the first method’s
performance is unacceptable, its computation speed is insignificant. But, among the
other methods, it seems that the third method is the most recommended one. Also,
t-testing results showed a significant difference among the means of the computa-
tion time at the 0.05 level (i.e., the resulting p-values are below 0.0001).
Table 3.11: Mean and standard deviation of the computation time (s) for the four
methods of parameter tuning for the third version of the PE game (Case 2).
Mean Standard Deviation
Method 1 17.1903 0.3888
Method 2 34.0040 0.6114
Method 3 33.1321 0.4174
Method 4 54.6030 0.9466
3.8 Conclusion
Four methods of parameter tuning for the QLFIS algorithm were applied to three
versions of the PE games. In the first method only the consequent parameters of
the FLC and FIS were tuned, while in the second method only the consequent pa-
rameters of the FLC and all the parameters (i.e., the premise and the consequent pa-
rameters) of the FIS were tuned. In the third method, all the parameters of the FLC
and only the consequent parameters of the FIS were tuned. Finally, all the param-
eters of the FLC and FIS in the last method were tuned. The results show that the
performance of the QLFIS in each game depends on the parameter tuning method.
Figure 3.18: The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the first method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.19: The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the second method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.20: The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the third method of parameter tuning is used versus the PE
paths when each player followed its DCS.
Figure 3.21: The PE paths on the xy-plane for the third version of the PE game
(Case 2) when the fourth method of parameter tuning is used versus the PE
paths when each player followed its DCS.
In the first and second games, the results demonstrate that the performance of the
learning algorithm is slightly affected by changing the method of tuning for the
FIS, while it is significantly affected by changing the method of tuning for the FLC.
Furthermore, in the first game it was shown that the pursuer could learn its control
strategy in all methods, but the last two methods outperformed the first two. This
is because the first game is so simple. In the second game, it was found that the
first two methods of parameter tuning do not yield good performance because the
pursuer does not learn well. It was shown that tuning all the parameters of the
FLC, as in the last two methods, achieves the best performance regarding capture
time and the PE paths. In the third game, the results show that the performance
of the learning algorithm is affected by changing the method of tuning for both
the FIS and FLC, and that changing the method of tuning for the FLC as in the
third and fourth methods has a significant impact on performance. In addition, the
first method of parameter tuning is not effective enough for multi-robot learning to
reach the desired performance; the evader does not learn well and gets captured
too soon. Furthermore, although changing the tuning method of the FIS, as in the
second method, can significantly enhance the capture-time performance, the PE paths
are different than the DCS paths. For all versions of the PE game discussed in this
chapter, the results indicate that the first method has the lowest computation time,
and the fourth has the longest. They also show that the computation times for
Methods 2 and 3 are between those of the first and fourth methods, and they are
almost similar except for the second version of the game. Thus, to reduce the com-
putational complexity and maintain the performance of the QLFIS algorithm, it is
best to use Method 1 for the first version of the game and Method 3 for the second
and third versions.
Chapter 4
Learning Technique Using PSO-Based
FLC and QLFIS Algorithms
4.1 Introduction
Despite the popularity of Reinforcement Learning (RL), its algorithms cannot typ-
ically be used directly for problems with continuous state and action spaces, such
as the Pursuit-Evasion (PE) game in its differential form. Thus, it is possible to use
one of the function approximation methods, such as the Fuzzy Inference System
(FIS), to generalize the state and action spaces [47]. The FIS has a knowledge
base consisting of a collection of rules in linguistic form, and building the knowl-
edge base can be complicated, particularly for systems with many input and output
parameters. In [98–101], supervised learning techniques are used to tune the FIS
parameters, and these require a teacher or training input/output dataset which is
sometimes unavailable or too expensive to obtain. The training period also depends
on the usefulness of the initial FIS parameter setting, and starting with a random
initial setting will affect the starting functionality of the FIS and the speed of con-
vergence to the final setting. For this reason, an unsupervised two-stage learning
technique that combines a Particle Swarm Optimization (PSO)-based Fuzzy Logic
Control (FLC) algorithm with the Q-Learning Fuzzy Inference System (QLFIS) al-
gorithm is proposed for the problem of the PE differential game. In the first stage,
the game runs for a few episodes and the PSO algorithm is used to tune the FLC
parameters. From an optimization viewpoint, in this stage the PSO algorithm works
as a global optimizer to find appropriate values for the FLC parameters, which rep-
resents the initial setting of the FLC parameters for the next stage. The game then
proceeds to the second stage, in which the QLFIS algorithm works as a local op-
timizer to accelerate convergence to the final setting of the FLC parameters, since
it uses the gradient descent as an updating approach. The proposed technique is
applied to two versions of the PE differential games, and the findings are presented
and discussed in this chapter. In the first game, the pursuer attempts to learn its
Default Control Strategy (DCS) from the rewards received from its environment,
while the evader plays a well-defined strategy of trying to escape along the Line-of-
Sight (LoS); the results of this game were published in [140]1. The second game
addresses dual learning in PE differential game, with both players interacting to
self-learn their control strategies simultaneously; the corresponding results were
presented in [141]2.
The organization of this chapter is as follows: In Section 4.2, the PSO-based FLC
1A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique using PSO-based FLC and QLFIS for the pursuit evasion differential game,” in Proc. of the 2014 IEEE International Conference on Mechatronics and Automation (ICMA 2014), (Tianjin, China), pp. 762-769, IEEE, August 2014.
2A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique for dual learning in the pursuit-evasion differential game,” in Proc. of the IEEE Symposium Series on Computational Intelligence (SSCI) 2014, (Orlando, USA), pp. 1-8, IEEE, December 2014.
algorithm is described. The QLFIS is briefly discussed in Section 4.3, and Section
4.4 presents the proposed learning technique. In Section 4.5, the simulation results
are presented, and finally, conclusions are presented in Section 4.6.
4.2 The PSO-based FLC algorithm
The learning algorithm that will be presented in this section is called the PSO-
based FLC algorithm because the parameters of the FLC are tuned based on the
PSO algorithm. In this algorithm, each learning agent has an FLC. The FLC is used
to determine the control signal, ui, of the learning agent, where i refers to the
learning agent. The FLC is implemented using the zero-order Takagi-Sugeno (TS)
rules with constant consequents [46]. It consists of two inputs and one output; the
two inputs are the angle difference δi and its derivative δi, and the output is the
steering angle ui. Each input has three Gaussian membership functions (MFs), with
linguistic values of P (positive), Z (zero) and N (negative). Thus, depending on the
number of inputs and their corresponding MFs, the FLC has 21 parameters that can
be tuned during the learning phase; the tunable parameters are explained in the
previous work [137]. The fuzzy output ui is defuzzified into crisp output using the
weighted average defuzzification method [138].
In this chapter, the PSO algorithm with a constriction factor [76] is used to tune
the FLC parameters of the learning agent. Using the constriction factor may become
necessary to ensure that the PSO algorithm converges [82]. The particles can
update their velocities and positions according to the constriction factor as follows
[84]:
vid = χ · [vid + c1 · r1 · (pbestid − xid) + c2 · r2 · (gbestgd − xid)], (4.1)
and
xid = xid + vid, (4.2)
where vid is the velocity at dimension d of particle i, pbestid is the best position found
so far by particle i at dimension d, and gbestgd is the global best position found so
far at dimension d. The values c1 and c2 are learning factors [73] or weights of the
attraction to the pbestid and gbestgd, respectively, and the variable xid is the current
position of particle i at dimension d. The two values r1 and r2 are uniform random
numbers in the range [0,1] generated at each iteration to explore the search space,
and χ is the constriction factor [84] given by
\chi = \frac{2}{\left| 2 - \psi - \sqrt{\psi^2 - 4\psi} \right|},   (4.3)
where ψ = c1 + c2 such that ψ > 4.
The FLC consists of 21 tunable parameters that can be coded as a particle flying
in a 21-dimensional search space. Moreover, the PSO algorithm is initialized
with a population of Np particles positioned randomly in the search
space. The PSO algorithm tunes the FLC parameters, depending on the problem
fitness function. For the problem of the PE game, the pursuer tries to maximize its
reward rt+1 at every time step to catch the evader in minimum time. The fitness
function that can be maximized by the pursuer over time is the average received
reward, which is defined by

AR = \frac{1}{\tau} \sum_{k=1}^{\tau} r_{t+k},   (4.4)
where τ is the final time step of the game. The learning algorithm is summarized
in Algorithm 4.1.
Algorithm 4.1 The PSO-based FLC
1. Initialize with a population of Np particles with random position x and velocity v.
2. For each particle Do
(a) Set the personal best position as the starting position.
(b) Set the personal best fitness values to a small value.
3. end for
4. Set the global best position as the starting position of the first particle.
5. Set the global fitness value to a small value.
6. Set the algorithm parameters c1 = c2 = 2.05.
7. Calculate the constriction factor χ from Equation (4.3).
8. For each episode (game)
(a) Initialize the position of the pursuer, (xp, yp) to (0, 0).
(b) Initialize the position of the evader, (xe, ye), randomly.
(c) Calculate the initial state, s = (δi, δ̇i).
(d) For each particle Do
i. For each step (play) Do
A. Calculate the output of the FLC, ui, using the weighted average defuzzification method.
B. Run the game for the current time step and observe the next state st+1.
C. Get the reward, rt+1.
ii. end for
iii. Calculate the fitness value from Equation (4.4).
(e) end for
(f) Sort the particles according to their fitness values.
(g) Find the personal best position for each particle and its fitness value.
(h) Find the global best position and its fitness value.
(i) Update the velocity of each particle from Equation (4.1).
(j) Update the position of each particle from Equation (4.2).
9. end for
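The particle update and fitness evaluation used by Algorithm 4.1 can be written compactly; the Python/NumPy sketch below is an illustration only (the thesis implementation was in MATLAB, and the function names are not from the thesis).

```python
import numpy as np

def constriction_factor(c1=2.05, c2=2.05):
    """Constriction factor (4.3); with c1 = c2 = 2.05, psi = 4.1 and chi is about 0.729."""
    psi = c1 + c2
    return 2.0 / abs(2.0 - psi - np.sqrt(psi ** 2 - 4.0 * psi))

def pso_update(x, v, pbest, gbest, chi, c1=2.05, c2=2.05, rng=None):
    """Velocity and position updates (4.1)-(4.2) for one particle. Here x, v, and
    pbest are 21-dimensional NumPy vectors of FLC parameters and gbest is the swarm's
    best position; uniform random numbers are drawn per dimension at each iteration."""
    rng = rng if rng is not None else np.random.default_rng()
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = chi * (v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x))
    return x + v_new, v_new

def episode_fitness(rewards):
    """Average received reward over one episode, Equation (4.4)."""
    return float(np.mean(rewards))
```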
4.3 Q-learning Fuzzy Inference System (QLFIS)
For the problem of the PE differential game, Desouky et al. [20] proposed a learning
technique called the Q-Learning Fuzzy Inference System (QLFIS) that tunes all the
parameters of the FIS and the FLC. For each learning player, the FIS is used to
approximate the action-value function, whereas the FLC is used to find its control
signal. In [137] four methods of QLFIS parameter tuning were investigated in order
to reduce the computational time without affecting the overall performance of the
learning algorithm. In the first method, only the consequent parameters of the FLC
and the FIS were tuned, while in the second method the consequent parameters of
the FLC and all the parameters (i.e., the premise and the consequent parameters)
of the FIS were tuned. In the third method, all the parameters of the FLC and only
the consequent parameters of the FIS were tuned, and in the fourth method all the
parameters of the FLC and FIS were tuned. It was found that the third method of
parameter tuning is adequate for learning in the PE game to achieve the desired
performance, and it is used in subsequent work.
Let KQ(t) represent the consequent parameter vector of the FIS, and φu(t) rep-
resent the parameter vector of the FLC. These vectors are updated according to the
following gradient based formulas [17, 142]
K_Q(t+1) = K_Q(t) + \eta \, \Delta_t \, \frac{\partial Q_t(s_t, u_t)}{\partial K_Q},   (4.5)

and

\phi_u(t+1) = \phi_u(t) + \xi \, \Delta_t \left( \frac{u_c - u}{\sigma_n} \right) \frac{\partial u}{\partial \phi_u},   (4.6)
where η and ξ are the learning rate of the FIS and FLC, respectively, and are defined
as in [137].
4.4 The proposed Two-Stage Learning Technique
The proposed learning technique is a combination of two stages: the PSO-based
FLC algorithm proposed in Section 4.2 and the QLFIS technique proposed in [20],
and is called PSO-based FLC+QLFIS algorithm. In the first stage, the PE game is run
for a few episodes using the PSO-based FLC algorithm to tune the FLC parameters.
Here, from an optimization viewpoint, the PSO algorithm in this stage works as
a global optimizer [143] to determine appropriate values for the FLC parameters,
which represent the initial setting of the FLC parameters for the next stage. In
the second stage, the game continues with the QLFIS algorithm working as a local
optimizer to speed up the convergence to the final setting of the FLC parameters,
since it uses the gradient descent as an updating approach. The two-stage learning
technique is given in Algorithm 4.2.
Algorithm 4.2 The two-stage learning technique
• Stage 1
(a) Set the number of episodes and particles for the PSO-based FLC algorithm.
(b) Run the game using the PSO-based FLC algorithm to tune the FLC parameters.
• Stage 2
(a) Set the number of episodes for the QLFIS algorithm.
(b) Initialize the FLC parameters with the same values obtained from stage 1.
(c) Continue the game using the QLFIS algorithm to continue tuning the FLC parameters.
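As a compact illustration of Algorithm 4.2 (a sketch only; the two callables and their arguments are assumptions, and the default episode counts are the ones used later for the first game in Section 4.5.1):

```python
def two_stage_learning(run_pso_stage, run_qlfis_stage,
                       pso_episodes=40, n_particles=10, qlfis_episodes=100):
    """Sketch of Algorithm 4.2: a short PSO-based FLC stage acts as the global
    optimizer, and its best particle becomes the initial FLC parameter setting for
    the QLFIS stage, which refines it by gradient descent. The two callables are
    assumed to wrap Algorithms 4.1 and 3.1, respectively."""
    gbest_flc_params = run_pso_stage(episodes=pso_episodes, particles=n_particles)
    final_flc_params = run_qlfis_stage(episodes=qlfis_episodes,
                                       init_flc=gbest_flc_params)
    return final_flc_params
```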
4.5 Computer Simulation
For the purpose of simulation, unless stated otherwise the same values used by
Desouky et al. [20] are selected. It is assumed that the pursuer is faster than the
evader with Vp = 2m/s and Ve = 1m/s, and the evader is more maneuverable than
the pursuer with −1 ≤ ue ≤ 1 and −0.5 ≤ up ≤ 0.5. The wheelbases of the
pursuer and the evader are the same and they are equal to 0.3 m. In each episode,
the pursuer’s motion starts from the origin with an initial orientation θp = 0, while
the evader's initial position is chosen randomly from a set of 64 different positions with
θe = 0. The selected capture radius is ℓ = 0.1 m and the sample time is T = 0.1 s.
For comparison, the results of the proposed learning technique are compared with
the results of the DCS, the QLFIS algorithm and the PSO-based FLC algorithm.
4.5.1 Evader Follows a Default Control Strategy
In this game, it is assumed that the evader plays its DCS as defined in [137]. It
is also assumed that the pursuer does not have any information about the evader’s
control strategy. The goal is to make the pursuer self-learn its control strategy by
interacting with the evader.
Each learning algorithm mentioned above has several parameter values that
should be set a priori. For the QLFIS algorithm, the same parameter values used
by Desouky et al. [20] are selected. The number of episodes/games is 1000, the
number of plays in each game is 600, and the game terminates when the time
exceeds 60 s or the pursuer captures the evader. In the QLFIS algorithm the learning
process is completed after 1000 episodes, and the other QLFIS parameters are γ =
0.95 and σn = 0.08. For the PSO-based FLC algorithm, it is assumed that c1 =
c2 = 2.05 which are chosen such that ψ > 4 in Equation (4.3), and the population
size selected is ten particles. This choice is based on computer simulation for the
PE game and applying the same values used by Desouky et al. [20]. Monte Carlo
simulations are run 500 times for population sizes of 1, 5, 10, 20, 30, 40, and
50, after which the mean and standard deviation of the capture time over the 500
simulation runs are calculated for each population size. Figure 4.1 shows that there
are very small differences in the mean values of the capture time for more than
10 particles. Also, it shows the range bars, which indicate the standard deviations
over the 500 simulation runs. By taking the mean value of the capture time for 50
particles as a reference and finding the percentage decrease in the mean value of
the capture time at each selected population size as given in Table 4.1, it can be
found that the mean values of the capture time decrease by less than 5% when the
population size is greater than or equal to 10; thus, it is better to use the smallest
population size (i.e. Np = 10 particles) that satisfies the desired performance with
the shortest learning time. To determine the appropriate number of episodes for
the PSO-based FLC algorithm another Monte Carlo simulation is run 500 times
for different numbers of episodes, while the population size remained constant at
ten particles. Figure 4.2 shows the mean values of the capture time for different
numbers of episodes. It is clear that the mean value of the capture time decreases
as the number of episodes increases. Also, by taking the mean value of the capture
time for 1000 episodes as a reference, it is found that the mean values of the capture
time decrease by less than 5% after 500 episodes, as given in Table 4.2. Therefore,
the number of episodes selected is 500, and the PSO-based FLC algorithm learning
process is completed after 500 episodes and 10 particles; thus, the PSO-based FLC
learning process is completed after 5000 episodes.
Figure 4.1: The mean values of the capture time for the PSO-based FLC algorithm
for different population sizes. The range bars indicate the standard deviations
over the 500 simulation runs.
Figure 4.2: The mean values of the capture time for the PSO-based FLC algorithm
for different episode numbers. The range bars indicate the standard devia-
tions over the 500 simulation runs.
Table 4.1: The percentage decrease in the mean value of the capture time (s) for
1000 episodes as the number of particles increases.
Mean % Decrease
1 Particles 48.0710 460.0266
5 Particles 17.1092 99.3220
10 Particles 8.9645 4.4363
20 Particles 8.6282 0.5184
30 Particles 8.6128 0.3390
40 Particles 8.6093 0.2982
50 Particles 8.5837 -
Table 4.2: The percentage decrease in the mean value of the capture time (s) for
10 particles as the number of episodes increases.
Mean % Decrease
10 Episodes 35.1744 292.3744
25 Episodes 22.6430 152.5852
50 Episodes 16.5029 84.0917
100 Episodes 12.8111 42.9093
150 Episodes 11.4658 27.9023
200 Episodes 10.7588 20.0156
250 Episodes 10.3246 15.1721
300 Episodes 10.0300 11.8858
400 Episodes 9.6541 7.6926
500 Episodes 9.4067 4.9328
600 Episodes 9.2724 3.4347
700 Episodes 9.1634 2.2188
800 Episodes 9.0796 1.2840
900 Episodes 9.0155 0.5689
1000 Episodes 8.9645 -
As indicated in Section 4.4, the proposed learning technique goes through two
stages. In the first stage, the game is run for 40 episodes and 10 particles using
the PSO-based FLC algorithm given in Algorithm 4.1, and from an optimization
perspective the PSO algorithm works as a global optimizer in this stage. The PSO
algorithm is used to autonomously tune the FLC parameters, a process that typically
takes a few episodes. These parameters constitute the initial settings of the FLC for
the next stage, which also uses the QLFIS algorithm. In this stage, the QLFIS algo-
rithm works as a local optimizer to reach the final setting of the FLC parameters,
since it uses gradient descent as an updating technique. The learning process in this
stage takes 100 episodes to achieve the desired performance. After finding the ap-
propriate parameter settings, the performance of each learning algorithm must be
evaluated. Knowing the capture time and the PE paths of the DCS, this will provide
measures for evaluating the performance of each learning algorithm. The perfor-
mance can be evaluated by comparing the capture time and the PE paths resulting
from the learning algorithm, with those from the DCS. Therefore, for each learning
algorithm, a Monte Carlo simulation is run 500 times, and after each simulation run
the capture times are calculated for six different initial positions of the evader. Table
4.3 shows the mean values of the capture times for different initial evader positions,
using the PSO-based FLC algorithm, the QLFIS algorithm and the proposed learning
algorithm compared with those of the DCS, and comparisons of the PE paths using
these algorithms to the DCS are shown in Figures 4.3 – 4.5 (i.e., for the evader po-
sition (−6, 7)). From Table 4.3 and Figure 4.3, it can be seen that the mean values
of the capture time and the PE paths of the PSO-based FLC algorithm are slightly
different from those of the DCS, which means that the performance of the PSO-based FLC algorithm is convincing. Also, it shows that the pursuer succeeded in finding
its control strategy. But, the main problem here is that the PSO-based FLC learning
process is very slow in the final optimization stages [143]. The learning process
takes a total of 5000 episodes to achieve acceptable performance. Also, from Table
4.3 and Figures 4.4 – 4.5, the performance of the QLFIS algorithm and the proposed
learning algorithm are similar, and they approach the performance of the DCS with
respect to both the capture time and the PE paths. Moreover, the PE paths, when
using these algorithms and the DCS, are almost identical. In the proposed learning
technique, the learning process is completed after 40 × 10 + 100 = 500 episodes.
Table 4.4 shows the number of episodes required to complete the PSO-based FLC
algorithm, the QLFIS algorithm and the proposed learning algorithm. It is clear
that the latter needs fewer episodes for learning (i.e., it has a shorter learning time);
only 10% and 50% of the episodes required for the PSO-based FLC algorithm and
the QLFIS algorithm, respectively.
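To make the flow of the two-stage technique concrete, the following sketch outlines how the stages could be chained. It is only an illustrative assumption: the helpers run_episode, pso_update and qlfis_update are hypothetical stand-ins for the game simulation, the PSO update of Algorithm 4.1 and the QLFIS gradient update, respectively.

```python
import numpy as np

def two_stage_learning(run_episode, pso_update, qlfis_update,
                       n_particles=10, pso_episodes=40, qlfis_episodes=100):
    """Sketch of the two-stage PSO-based FLC + QLFIS learning technique.

    run_episode(params) -> capture time of one PE game played with the given
    FLC parameters (the fitness to minimize); pso_update performs one PSO
    velocity/position update of the swarm; qlfis_update performs one episode
    of QLFIS gradient tuning. All three helpers are assumptions.
    """
    # Stage 1: PSO acts as a global optimizer over candidate FLC parameters.
    swarm = [np.random.uniform(-1.0, 1.0, size=21) for _ in range(n_particles)]
    g_best, g_best_fit = swarm[0], np.inf
    for _ in range(pso_episodes):                    # 40 PSO episodes
        fitness = [run_episode(p) for p in swarm]    # one game per particle
        i = int(np.argmin(fitness))
        if fitness[i] < g_best_fit:                  # keep the global best
            g_best, g_best_fit = swarm[i].copy(), fitness[i]
        swarm = pso_update(swarm, fitness)

    # Stage 2: QLFIS (gradient descent) acts as a local optimizer, starting
    # from the PSO solution, for 100 further episodes.
    params = g_best.copy()
    for _ in range(qlfis_episodes):
        params = qlfis_update(params)
    return params
```

With 10 particles and one game per particle per PSO episode, this corresponds to the 40 × 10 + 100 = 500 episodes reported above.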
Table 4.3: Mean and standard deviation of the capture time (s) for different evader
initial positions for the case of only the pursuer learning.
Evader initial position (entries are mean (standard deviation)):

                          (−6, 7)           (−7,−7)            (2, 4)            (3,−8)            (−4, 5)           (−10,−12)
DCS                       9.6 (-)           10.4 (-)           4.5 (-)           8.5 (-)           6.8 (-)           16.0 (-)
PSO-based FLC algorithm   9.8692 (0.1899)   10.6162 (0.1676)   4.5882 (0.0798)   8.7774 (0.1390)   6.9824 (0.1106)   16.3760 (0.2582)
QLFIS algorithm           9.6196 (0.0417)   10.4006 (0.0077)   4.5004 (0.0063)   8.6004 (0.0089)   6.8030 (0.0193)   16.0046 (0.0219)
PSO-based FLC+QLFIS       9.6676 (0.0910)   10.4302 (0.0775)   4.5072 (0.0259)   8.6180 (0.0617)   6.8284 (0.0610)   16.0840 (0.1321)
Table 4.4: Total number of episodes for the different learning algorithms
                                 Total number of episodes
PSO-based FLC algorithm          5000
QLFIS algorithm                  1000
PSO-based FLC+QLFIS algorithm    500
Figure 4.3: The PE paths on the xy-plane using the PSO-based FLC algorithm for
the case of only the pursuer learning versus the PE paths when each player
followed its DCS.
Figure 4.4: The PE paths on the xy-plane using the QLFIS algorithm for the case of
only the pursuer learning versus the PE paths when each player followed its
DCS.
Figure 4.5: The PE paths on the xy-plane using the proposed learning algorithm
for the case of only the pursuer learning versus the PE paths when each player
followed its DCS.
A comparison of the computation time for the different learning algorithms
is given in Table 4.5, which shows that the computation time of the PSO-based
FLC+QLFIS algorithm is 3.53 and 4.67 times faster than the QLFIS and PSO-based
FLC algorithms, respectively.
Table 4.5: Mean and standard deviation of the computation time (s) for different
learning algorithms for the case of only the pursuer learning.
Mean Standard Deviation
PSO-based FLC algorithm 18.4283 0.2659
QLFIS algorithm 13.9162 0.1373
PSO-based FLC+QLFIS 3.9472 0.9378
4.5.2 Multi-Robot Learning
In this game, it is assumed that each robot has no information about the other
robot’s strategy. The goal is to make both robots (players) interact with one another
and self-learn their control strategies simultaneously. The results of the proposed
learning technique are compared with the results of the DCS and with those of the
PSO-based FLC algorithm and the QLFIS algorithm. The PSO-based FLC algorithm, the QLFIS algorithm and the proposed learning algorithm use the same parameter values as those in Section 4.5.1.
The mean values of the capture time for different initial evader positions using
the DCS, the PSO-based FLC algorithm, the QLFIS algorithm and the proposed two-
stage learning algorithm are given in Table 4.6. The PE paths using these algorithms
compared with the DCS are shown in Figures 4.6 – 4.8 (i.e., for the evader position
(−6, 7)). From Table 4.6 and Figure 4.6, it is clear that the mean values of the
capture time and the PE paths of the PSO-based FLC algorithm are slightly different
from the DCS. But, the main problem here is that the PSO-based FLC learning
process is very slow in the final optimization stages [143]. The learning process
takes a total of 5000 episodes to obtain acceptable performance. From Table 4.6
and Figures 4.7 – 4.8, it is evident that there are very small differences between the
performance of the QLFIS algorithm and the two-stage learning algorithm. Also,
the performance of the proposed learning algorithm is very close to that of the DCS
with respect to both the capture time and the PE paths. Moreover, in the two-stage
learning technique, the learning process is completed after 40 × 10 + 100 = 500
episodes. It is clear that the proposed learning algorithm needs fewer episodes for
learning (i.e., it has a shorter learning time); that is, 10% and 50% of the number
of episodes required for the PSO-based FLC algorithm and the QLFIS algorithm,
respectively.
Table 4.6: Mean and standard deviation of the capture time (s) for different evader
initial positions for the case of multi-robot learning.
Evader initial position (entries are mean (standard deviation)):

                          (−6, 7)           (−7,−7)            (2, 4)            (3,−8)            (−4, 5)           (−10,−12)
DCS                       9.6 (-)           10.4 (-)           4.5 (-)           8.5 (-)           6.8 (-)           16.0 (-)
PSO-based FLC algorithm   9.7784 (0.0412)   10.5002 (0.0045)   4.5990 (0.0100)   8.7002 (0.0045)   6.9014 (0.0118)   16.1670 (0.0475)
QLFIS algorithm           9.6424 (0.1356)   10.4448 (0.0663)   4.4434 (0.0553)   8.5850 (0.1280)   6.7448 (0.1017)   16.0320 (0.2560)
PSO-based FLC+QLFIS       9.6880 (0.1094)   10.4220 (0.1213)   4.5394 (0.0525)   8.6496 (0.0951)   6.8618 (0.0821)   16.0646 (0.1948)
Figure 4.6: The PE paths on the xy-plane using the PSO-based FLC algorithm
for the case of multi-robot learning versus the PE paths when each player
followed its DCS.
Table 4.7 shows a comparison of the computation time, which demonstrates that
the computation time of the proposed learning algorithm is 3.33 and 5.24 times
faster than the QLFIS and PSO-based FLC algorithms, respectively.
Figure 4.7: The PE paths on the xy-plane using the QLFIS algorithm for the case of
multi-robot learning versus the PE paths when each player followed its DCS.
Figure 4.8: The PE paths on the xy-plane using the proposed learning algorithm
for the case of multi-robot learning versus the PE paths when each player
followed its DCS.
Table 4.7: Mean and standard deviation of the computation time (s) for different
learning algorithms for the case of multi-robot learning.
Mean Standard Deviation
PSO-based FLC algorithm 35.7969 0.6138
QLFIS algorithm 22.7298 0.4542
PSO-based FLC+QLFIS 6.8285 1.9889
4.6 Conclusion
In this chapter, a two-stage learning technique that combines the PSO-based FLC
algorithm with the QLFIS algorithm is used to autonomously tune the parameters
of an FLC. The proposed technique has two key benefits. First, the PSO algorithm
is used as a global optimizer to quickly determine good initial parameter settings
of the FLC, and second, the gradient descent approach in the QLFIS algorithm is
used to accelerate convergence to the final FLC parameter settings. The proposed
technique is applied to mobile robots playing a differential form of a PE game, and
two versions of the game are considered. In the first game, the pursuer learns
its DCS by using rewards received from its environment, while the evader plays a
well-defined strategy of escaping along the LoS. In the second game, both players
interact in order to self-learn their control strategies simultaneously (dual learning).
Simulation results show that both the pursuer and the evader can learn their de-
fault control strategies based on rewards received from their environments. In both
games, the results indicate that the performance of the proposed learning technique
and the QLFIS algorithm are very close, and they approach the performance of the
DCS with respect to capture times and PE paths. Moreover, the performance of both the proposed learning technique and the QLFIS algorithm differs only slightly from that of the
PSO-based FLC algorithm. Finally, the proposed learning technique outperforms
the QLFIS algorithm and the PSO-based FLC algorithm with respect to both learn-
ing time and computation time, both of which are highly important for any learning
algorithm.
Chapter 5
Kalman Fuzzy Actor-Critic Learning
Automaton Algorithm for the
Pursuit-Evasion Differential Game
5.1 Introduction
In this chapter, an efficient learning algorithm that can autonomously tune the pa-
rameters of the Fuzzy Logic Control (FLC) of a mobile robot playing a Pursuit-
Evasion (PE) differential game is proposed. The efficiency is measured by how
quickly the learning agent can determine its control strategy; that is, how to reduce
the learning time. The proposed algorithm is a modified version of the Fuzzy-Actor
Critic Learning (FACL) algorithm that was proposed in [17], in which both the
critic and the actor are Fuzzy Inference Systems (FISs). It uses the Continuous Actor-
Critic Learning Automaton (CACLA) algorithm to tune the parameters of the FIS,
and is known as the Fuzzy Actor-Critic Learning Automaton (FACLA) algorithm.
FACLA is applied to two versions of PE games, and compared through simulation
with the FACL [17], the Residual Gradient Fuzzy Actor-Critic Learning (RGFACL) algorithm proposed in [22] and the PSO-based FLC+QLFIS [140] algorithms; the sim-
ulation results were published in [144]1. Following that, a decentralized learning
technique that enables two or more pursuers to capture a single evader in PE dif-
ferential games is proposed. The pursuers and the evader interact with each other
to self-learn their control strategies simultaneously by tuning their FLC parameters,
and the tuning process is based on Reinforcement Learning (RL). The proposed
learning algorithm uses the FACLA algorithm with the Kalman filter technique, and
is known as the Kalman-FACLA algorithm. The Kalman filter is used to estimate the
evader’s next position, allowing the pursuers to determine the evader’s direction to
avoid collisions among them and reduce the capture time. Awheda et al. [145]
also used the Kalman filter to estimate the evader’s position at the next time step
to allow the pursuer to find its Line-of-Sight (LoS) to the evader at the estimated
position. Such prediction was used for the single-pursuer single-evader differential
game. In this chapter, it is assumed that each pursuer knows only the instanta-
neous position of the evader and vice versa. Also, it is assumed that there is no
communication among the pursuers and each pursuer considers other pursuers as
part of its environment. This allows cooperation among the pursuers to be done in
a decentralized manner. The simulation results were published in [146]2.
This chapter is organized as follows: Section 5.2 explains the FACLA algorithm and its implementation, and Section 5.3 describes the n-pursuer single-evader PE game. The state estimation based on the Kalman filter is presented in Section 5.4, and the simulation results in Section 5.5. Finally, conclusions are provided in Section 5.6.

1. A. A. Al-Talabi, "Fuzzy actor-critic learning automaton algorithm for the pursuit-evasion differential game," in Proc. of the 2017 CACS International Automatic Control Conference, Pingtung, Taiwan, November 2017.

2. A. A. Al-Talabi and H. M. Schwartz, "Kalman fuzzy actor-critic learning automaton algorithm for the pursuit-evasion differential game," in Proc. of the 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016), Vancouver, Canada, pp. 1015-1022, July 2016.
5.2 Fuzzy Actor-Critic Learning Automaton (FACLA)
In RL, there are three well-known and widely used value function algorithms: actor-
critic, Q-learning and Sarsa [51, 147]. The first is employed to estimate a state-
value function V (s), and the other two are used to estimate an action-value function
Q(s, a).
The structure of the actor-critic learning system consists of two parts, an actor
and a critic. The actor is used to select an action for each state, and the critic is
applied to estimate V (s). The estimated V (s) helps critique the actions taken by
the actor to evaluate whether the system performances are better or worse than
expected [16]. The critic estimates V (s) using Temporal-Difference (TD) learning
[16].
In [17], Givigi et al. proposed an actor-critic learning technique called FACL, and
applied it to PE differential games. The critic and actor are FISs for each learning
agent. The learning technique works on problems that have continuous state and
action spaces [142]. Figure 5.1 shows the structure of the FACL system.
In this figure, the actor works as an FLC to determine the control signal ut. For
exploration, white Gaussian noise with zero mean and standard deviation σn is
added to the signal ut to generate the control signal uc. The other two blocks are
critics, and are used to estimate the state-value function, V (s).
The FIS of both the actor and critic is implemented using zero-order Takagi-
Sugeno (TS) rules with constant consequents [46]. For the learning agent i, the FIS
consists of two inputs and one output. The two inputs are z1 and z2 that correspond
to the angle difference δi and its derivative δ̇i, respectively [137]. The output for
the actor is the steering angle ui, while the output of the critic is the estimate of
Vi(s), where δi is given by
$$\delta_i = \tan^{-1}\left(\frac{y_e - y_p}{x_e - x_p}\right) - \theta_i, \qquad (5.1)$$
where (xe, ye) and (xp, yp) are the positions of the evader and pursuer, and θi is the
orientation of learning agent i.
Each input has three Gaussian Membership Functions (MFs) with the following
linguistic values: P (positive), Z (zero) and N (negative). Thus, depending on the
number of inputs and their corresponding MFs, the FIS has 21 parameters that can
be tuned during the learning phase.

Figure 5.1: Structure of the FACL system [17].

The tuned parameters are the means and the
standard deviations of all the input MFs and the consequent parameters of the fuzzy
rule base, as given in [137]. The fuzzy output is defuzzified into a crisp output using
the weighted average defuzzification method [138].
In [137], four methods of parameter tuning were investigated, and it was de-
termined that tuning all the parameters of the actor and only the consequent pa-
rameters of the critic is adequate to learn in the PE game and get the desired per-
formance. Hence, in this work the learning algorithm will tune all the parameters
of the actor and only the consequent parameters of the critic. Let KC refer to the consequent parameter vector of the critic and φA denote the parameter vector of the actor; these are updated according to the following gradient-based formulas
[17, 142]:
$$K_C(t+1) = K_C(t) + \eta\,\Delta_t\,\frac{\partial V_t(s_t)}{\partial K_C}, \qquad (5.2)$$

and

$$\phi_A(t+1) = \phi_A(t) + \xi\,\Delta_t\left(\frac{u_c - u_t}{\sigma_n}\right)\frac{\partial u_t}{\partial \phi_A}, \qquad (5.3)$$
where η and ξ are the learning rates of the critic and actor, respectively. They are defined as in [137], and ∆t is the TD-error. The terms $\frac{\partial V_t(s_t)}{\partial K_{C_l}}$ and $\frac{\partial u_t}{\partial \phi_A}$ are given by
$$\frac{\partial V_t(s_t)}{\partial K_{C_l}} = \bar{\omega}_l, \qquad (5.4)$$

and

$$\frac{\partial u_t}{\partial \sigma_{ij}} = \left(\frac{2(z_i - m_{ij})^2}{\sigma_{ij}^3}\right)\frac{(K - u_t)}{\sum_{l=1}^{L}\omega_l}\,\omega^T, \qquad (5.5a)$$

$$\frac{\partial u_t}{\partial m_{ij}} = \left(\frac{2(z_i - m_{ij})}{\sigma_{ij}^2}\right)\frac{(K - u_t)}{\sum_{l=1}^{L}\omega_l}\,\omega^T, \qquad (5.5b)$$

$$\frac{\partial u_t}{\partial K_l} = \bar{\omega}_l, \qquad (5.5c)$$
where ωl and ω̄l represent the firing strength and the normalized firing strength of rule l, respectively, and K and ω are the vectors of the rule consequents and the rule firing strengths, respectively [137].
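As an illustration of Equations (5.4)–(5.5), the following sketch (an assumption about one possible implementation, not code from the thesis) computes the zero-order TS output with weighted average defuzzification and its gradients with respect to the consequents, means and standard deviations, for the two-input, three-MF structure described above.

```python
import numpy as np

def fis_forward_and_grads(z, m, sigma, K):
    """Zero-order TS FIS with two inputs and three Gaussian MFs per input.

    z : (2,) crisp inputs, m / sigma : (2, 3) MF means and standard
    deviations, K : (9,) rule consequents (one rule per MF combination).
    Returns the crisp output u and the gradients du/dK, du/dm, du/dsigma.
    """
    mu = np.exp(-((z[:, None] - m) ** 2) / (sigma ** 2))    # MF degrees (2, 3)
    # Firing strength of rule (i, j) = mu[0, i] * mu[1, j] (product inference)
    w = (mu[0][:, None] * mu[1][None, :]).ravel()            # (9,)
    w_bar = w / w.sum()                                       # normalized strengths
    u = w_bar @ K                                             # weighted-average output

    dU_dK = w_bar                                             # Eqs. (5.4) and (5.5c)
    dU_dw = (K - u) / w.sum()                                 # d u / d omega_l
    rule_idx = np.stack(np.meshgrid(np.arange(3), np.arange(3),
                                    indexing="ij"), -1).reshape(-1, 2)
    dU_dm = np.zeros_like(m)
    dU_dsigma = np.zeros_like(sigma)
    for l, (i0, i1) in enumerate(rule_idx):
        for inp, j in ((0, i0), (1, i1)):
            dmu = 2 * (z[inp] - m[inp, j]) / sigma[inp, j] ** 2       # Eq. (5.5b) factor
            dsg = 2 * (z[inp] - m[inp, j]) ** 2 / sigma[inp, j] ** 3  # Eq. (5.5a) factor
            dU_dm[inp, j] += dU_dw[l] * w[l] * dmu
            dU_dsigma[inp, j] += dU_dw[l] * w[l] * dsg
    return u, dU_dK, dU_dm, dU_dsigma
```

The rule firing strengths use product inference over the two Gaussian membership degrees, which is what the factors 2(zi − mij)/σ²ij and 2(zi − mij)²/σ³ij in Equation (5.5) presuppose.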
The authors in [148] proposed an algorithm called CACLA for the RL field to
manage problems with continuous state and action spaces. With CACLA, the pa-
rameter vector of the critic is updated as before, while the parameter vector of the
actor is updated as follows:
$$\text{IF } \Delta_t > 0:\quad \phi_A(t+1) = \phi_A(t) + \xi\left(\frac{u_c - u_t}{\sigma_n}\right)\frac{\partial u_t}{\partial \phi_A}. \qquad (5.6)$$
The main differences between the update rule in Equation (5.6) and that in Equation (5.3)
are:
• In Equation (5.6), φA is updated only when ∆t is positive.
• In Equation (5.6), the value of the TD-error is not used.
In Equation (5.6), a positive ∆t means the current action taken is better than ex-
pected and should be reinforced. If a negative update occurs, as in Equation (5.3),
the update will guide the algorithm to select an action that does not necessarily
have a better value function (i.e., leads to positive ∆t). For this reason, the FACL
algorithm is modified to update the actor parameters according to the CACLA algo-
rithm. The modified algorithm is called the FACLA. The FACLA algorithm is given
in Algorithm 5.1.
Algorithm 5.1 Learning in the FACLA.
1. The premise and consequent parameters of the actor are initialized such that the evader can escape at the beginning of the game.

2. Set all the consequent parameters of the critic to zero. The premise parameters of the critic are initialized with the same values as those of the actor.
3. Set σn ← 0.08 and γ ← 0.95.
4. For each episode (game)
(a) Calculate the values of η and ξ as in [137].
(b) Initialize the position of the pursuer, (xp, yp) to (0, 0).
(c) Initialize the position of the evader, (xe, ye), randomly.
(d) Calculate the initial state, st = (δi, δ̇i).
(e) For each step (play) Do
i. Calculate the output of the Actor, ut, using the weighted average defuzzification.
ii. Calculate the output uc = ut +N (0, σn).
iii. Calculate the output of the Critic, Vt(st), using the weighted average defuzzification.
iv. For the current time step, run the game to observe the next state st+1.
v. Get the reward, rt+1.
vi. Calculate the output of the Critic at the next state st+1, Vt(st+1), using the weighted average defuzzification.
vii. Calculate the TD-error, ∆t.
viii. Calculate the gradients $\frac{\partial V_t(s_t)}{\partial K_{C_l}}$ and $\frac{\partial u_t}{\partial \phi_A}$ from Equation (5.4) and Equation (5.5), respectively.
ix. Update the parameters of the Critic from Equation (5.2).
x. Update the parameters of the Actor from Equation (5.6).
xi. Set st ← st+1.
(f) end for
5. end for
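A compact sketch of one play of Algorithm 5.1 (steps 4(e)i–xi) is given below. It is an illustrative assumption only: actor and critic are assumed to be objects exposing output(state), grad(state) and a flat parameter vector params, while game_step and reward are hypothetical environment helpers.

```python
import numpy as np

def facla_play(state, actor, critic, game_step, reward,
               eta, xi, gamma=0.95, sigma_n=0.08):
    """One play (time step) of the FACLA algorithm, Algorithm 5.1 step 4(e)."""
    u_t = actor.output(state)                       # FLC output
    u_c = u_t + np.random.normal(0.0, sigma_n)      # exploration noise
    v_t = critic.output(state)                      # V_t(s_t)

    next_state = game_step(state, u_c)              # run the game one step
    r = reward(next_state)                          # immediate reward r_{t+1}
    v_next = critic.output(next_state)              # V_t(s_{t+1})

    td_error = r + gamma * v_next - v_t             # TD-error, Delta_t

    # Critic: Eq. (5.2), tuning only the consequents (gradient = w_bar).
    critic.params += eta * td_error * critic.grad(state)

    # Actor: CACLA-style update, Eq. (5.6) -- applied only when Delta_t > 0,
    # and without multiplying by the TD-error itself.
    if td_error > 0:
        actor.params += xi * ((u_c - u_t) / sigma_n) * actor.grad(state)

    return next_state, td_error
```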
In this section, two versions of the two-player PE differential game are presented
to demonstrate the learning performance of the FACLA algorithm. In both versions,
it is assumed that the pursuer is faster than the evader, and the evader is more
maneuverable. Let Vp = 2 m/s, Ve = 1 m/s, −1 ≤ ue ≤ 1 and −0.5 ≤ up ≤ 0.5. The
wheelbases for the pursuer and evader are the same and they are equal to 0.3 m.
In each iteration, the motion for the pursuer starts from the origin with θp = 0, and
the evader's initial position is chosen randomly from a set of 64 different positions with θe = 0. The
capture radius is ℓ = 0.1 m and the sample time is T = 0.1 s. The game consists
of 600 plays, and will be terminated once the time exceeds 60 s or the pursuer
captures the evader.
5.2.1 Evader Follows a Default Control Strategy
Assume that the evader plays a Default Control Strategy (DCS) as defined
in [137]. Also assume that the pursuer does not have any information about the
evader’s strategy. The appropriate number of episodes for the FACL, RGFACL and
FACLA algorithms are obtained by running Monte Carlo simulation for each algo-
rithm 500 times for various numbers of episodes: 10, 25, 50, 100, 150,
200, 300, 400, 500, 600, 700, 800, 900 and 1000. Figure 5.2 shows the mean
values of the capture time for the different number of episodes. It shows that these
values typically decrease as the number of episodes increases. Interestingly, the
mean values of the capture time for the FACLA decrease more quickly than those for
the FACL and RGFACL algorithms. Also, a 2-tailed 2-sample t-test showed that the
differences in means are statistically significant at the 0.05 level (i.e., all
differences in means resulted in a p-value below 0.0001). By taking the mean values
of the capture time for 1000 episodes for the FACLA, FACL and RGFACL algorithms
as references and finding the percentage decrease in the mean value of the capture
time at each selected number of episodes, it can be shown that the mean values
of the capture time decrease by less than 5% only when the numbers of episodes
are greater than or equal to 150, 300 and 500, respectively. Thus, the numbers of episodes for the FACLA, FACL and RGFACL algorithms are set to these values, respec-
tively. In [140], the PSO-based FLC+QLFIS algorithm takes about 500 episodes to
achieve acceptable performance. Table 5.1 tabulates the number of episodes re-
quired to complete PSO-based FLC+QLFIS, RGFACL, FACL and FACLA algorithms,
and it is clear that the FACLA algorithm requires fewer episodes for learning (i.e., it has a shorter learning time). After learning, the mean values of the capture time for
different initial evader positions using the DCS, PSO-based FLC+QLFIS, RGFACL,
FACL and FACLA are given in Table 5.2.
Figure 5.2: The mean values of the capture time for the FACL, RGFACL and FA-
CLA algorithms for different episode numbers. The range bars indicate the
standard deviations over the 500 simulation runs.
Table 5.1: Total number of episodes for the different learning algorithms.
                                 Total number of episodes
PSO-based FLC+QLFIS algorithm    500
RGFACL algorithm                 500
FACL algorithm                   300
FACLA algorithm                  150
Table 5.2: Mean and standard deviation of the capture time (s) for different evader
initial positions for the case of only the pursuer learning.
Evader initial position (entries are mean (standard deviation)):

                       (−6, 7)           (−7,−7)            (2, 4)            (3,−8)            (−4, 5)
DCS                    9.6 (-)           10.4 (-)           4.5 (-)           8.5 (-)           6.8 (-)
PSO-based FLC+QLFIS    9.6676 (0.0910)   10.4302 (0.0775)   4.5072 (0.0259)   8.6180 (0.0617)   6.8284 (0.0610)
RGFACL algorithm       9.8334 (0.0698)   10.5730 (0.0806)   4.5994 (0.0427)   8.7476 (0.0692)   6.9898 (0.0580)
FACL algorithm         9.6840 (0.0625)   10.4078 (0.0310)   4.5042 (0.0201)   8.6028 (0.0188)   6.8276 (0.0600)
FACLA algorithm        9.7206 (0.0405)   10.4640 (0.0509)   4.5074 (0.0262)   8.6332 (0.0480)   6.8990 (0.0241)
The computation time for the different learning algorithms is given in Table 5.3,
which shows that the FACLA algorithm needs less time compared with the other
algorithms. The FACLA algorithm is 2.11, 2.40 and 7.62 times faster than the PSO-
based FLC+QLFIS, FACL and RGFACL algorithms, respectively.
Table 5.3: Mean and standard deviation of the computation time (s) for different
learning algorithms for the case of only the pursuer learning.
Mean Standard Deviation
PSO-based FLC+QLFIS 3.9472 0.9378
RGFACL algorithm 14.2861 0.4720
FACL algorithm 4.5069 0.1020
FACLA algorithm 1.8750 0.0447
5.2.2 Multi-Robot Learning
Here, it is assumed that each robot has no information about its opponent’s
strategy, and both need to learn their control strategies at the same time. The
learning process for each of the considered algorithms is implemented using the
same values as those given in Section 5.2.1, and, after learning, the results of the
FACLA algorithm are compared with the results of the DCS and with those of the
PSO-based FLC+QLFIS, RGFACL and FACL algorithms. Table 5.4 summarizes the
mean values of the capture time for different initial evader positions using the DCS,
PSO-based FLC+QLFIS, RGFACL, FACL and FACLA algorithms.
Table 5.4: Mean and standard deviation of the capture times (s) for different
evader initial positions for the case of multi-robot learning.
Evader initial position (entries are mean (standard deviation)):

                       (−6, 7)           (−7,−7)            (2, 4)            (3,−8)            (−4, 5)
DCS                    9.6 (-)           10.4 (-)           4.5 (-)           8.5 (-)           6.8 (-)
PSO-based FLC+QLFIS    9.6880 (0.1094)   10.4220 (0.1213)   4.5394 (0.0525)   8.6496 (0.0951)   6.8618 (0.0821)
RGFACL algorithm       9.5532 (0.4594)   10.2860 (0.4010)   4.4738 (0.1731)   8.5396 (0.2858)   6.7562 (0.3551)
FACL algorithm         9.7082 (0.1234)   10.4562 (0.0919)   4.4744 (0.0476)   8.6464 (0.0973)   6.8284 (0.0895)
FACLA algorithm        9.6938 (0.1330)   10.4028 (0.1401)   4.5298 (0.0598)   8.6236 (0.1131)   6.8608 (0.1059)
Table 5.5 shows the computation time for the different learning algorithms. It
demonstrates that the FACLA algorithm is 2.32, 2.44 and 8.87 times faster than the
PSO-based FLC+QLFIS, FACL and RGFACL algorithms, respectively.
It can be concluded from Table 5.2 and Table 5.4 that the capture times of the
different learning algorithms approach those of the DCS, which means that the play-
ers are able to learn their DCSs. The advantage of FACLA over the other learning
algorithms considered is that it has the lowest learning time, as demonstrated in Ta-
ble 5.5. Also, the 2-tailed 2-sample t-test is performed comparing the computation
time for the FACLA algorithm with those of the PSO-based FLC+QLFIS, RGFACL
Table 5.5: Mean and standard deviation of the computation time (s) for different
learning algorithms for the case of multi-robot learning.
Mean Standard Deviation
PSO-based FLC+QLFIS 6.8285 1.9889
RGFACL algorithm 26.1200 1.6964
FACL algorithm 7.1903 0.2032
FACLA algorithm 2.9448 0.1223
and FACL algorithms given in Table 5.3 and Table 5.5, and it showed a statistically significant difference among the means (p-value less than 0.0001).
5.3 Learning in n-Pursuer One-Evader PE Differential
Game
In this section, the complexity of the two-player PE differential game is increased
by adding more pursuers. The new pursuers are also car-like mobile robots with
dynamic equations as defined in [26]. The FACLA algorithm is used as a learning
algorithm for this game, since it takes fewer episodes for each player to learn how to find its control strategy, as explained in Section 5.2. As a special case, the problem
of a two-pursuer one-evader PE differential game will be addressed, and a gen-
eralization to the case of the n-pursuer one-evader game will also be given. The PE
differential game model with two-pursuer and one-evader is shown in Figure 5.3.
The FLC of each player is as explained in Section 5.2. It has two inputs and one
output. In this application, it was found that three Gaussian MFs for each input are enough to attain the desired performance. For each pursuer pi, where i = 1, 2, the
two inputs are the pursuer’s angle difference δpi and its derivative δpi , where δpi is
5.3. LEARNING IN N -PURSUER ONE-EVADER PE DIFFERENTIAL GAME 122
𝐭𝐚𝐧−𝟏 (𝒚𝒆 − 𝒚𝒑𝟏𝒙𝒆 − 𝒙𝒑𝟏)
x
y
xe xp1
ye
yp1
Vp1
Ve
Pursuer1
Evader
𝜽𝒆
𝜽𝒑𝟏𝜹𝒆
𝜹𝒑𝟏
𝜽𝒑𝟐𝜹𝒑𝟐 𝐭𝐚𝐧−𝟏 (𝒚𝒆 − 𝒚𝒑𝟐𝒙𝒆 − 𝒙𝒑𝟐)
yp2
xp2
Vp2
P2E
P1E
EP
Pursuer2
Figure 5.3: The PE differential game model with two-pursuer and one-evader.
the angle difference between the pursuer’s velocity vector−→Vpi , and the LoS vector
−−→PiE from pi to the evader; the output is the pursuer’s steering angle upi . For the
evader, the two inputs are the angle difference δe and its derivative δe, where δe is
the angle difference between its velocity vector and its intended escape direction;
the output is the evader’s steering angle ue. The intended escape direction of the
evader should consider the presence of the two pursuers, and how far they are from
the evader. The escape direction can be defined by
$$\overrightarrow{EP}_{dir} = (x_{dir}, y_{dir}) = \frac{w\,\overrightarrow{P_1E} + \frac{1}{w}\,\overrightarrow{P_2E}}{\left\| w\,\overrightarrow{P_1E} + \frac{1}{w}\,\overrightarrow{P_2E} \right\|}, \qquad (5.7)$$
where w is a weighting factor that depends on the distances between the evader
and each pursuer, and it is given by
$$w = \frac{\left\|\overrightarrow{P_2E}\right\|}{\left\|\overrightarrow{P_1E}\right\|}. \qquad (5.8)$$
So, δe can be defined by

$$\delta_e = \arctan\left(\frac{y_{dir}}{x_{dir}}\right) - \theta_e. \qquad (5.9)$$
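The evader's intended escape direction and angle difference of Equations (5.7)–(5.9) can be computed as in the following sketch (illustrative only; arctan2 is used instead of the plain arctangent to avoid quadrant ambiguity, which is an implementation choice rather than part of the thesis).

```python
import numpy as np

def evader_escape_angle(evader_pos, p1_pos, p2_pos, theta_e):
    """Escape direction and angle difference for the two-pursuer game."""
    p1e = np.asarray(evader_pos) - np.asarray(p1_pos)     # vector P1E
    p2e = np.asarray(evader_pos) - np.asarray(p2_pos)     # vector P2E
    w = np.linalg.norm(p2e) / np.linalg.norm(p1e)          # weighting factor, Eq. (5.8)
    ep = w * p1e + (1.0 / w) * p2e                          # weighted escape vector
    ep_dir = ep / np.linalg.norm(ep)                        # Eq. (5.7)
    delta_e = np.arctan2(ep_dir[1], ep_dir[0]) - theta_e    # Eq. (5.9), via atan2
    return ep_dir, delta_e
```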
To implement the FACLA algorithm for the two-pursuer one-evader game, the immediate reward $r_{t+1}$ for all the players in the game must be calculated. For each pursuer $p_i$, where i = 1, 2, let $r^{p_i}_{t+1}$ represent the reward for pursuer $p_i$, which is calculated as in [137]. The evader's immediate reward $r^{e}_{t+1}$ is calculated
as follows:
$$r^{e}_{t+1} = -\sum_{i=1}^{2} r^{p_i}_{t+1}. \qquad (5.10)$$
Generally, for the n-pursuer one-evader PE differential game, it is assumed that
the control strategy of each pursuer remains the same, and the FLC structure of all
players is as mentioned previously. Also, it is assumed that the evader’s goal is to
learn how to escape from the nearest pursuer. Therefore, the evader would have to
take into consideration the presence of these pursuers and their relative distances.
Hence, at each time step the evader needs to calculate its distance from all pursuers
to determine which one is the closest. Thus, δe can be defined as the angle difference
between the evader’s direction and the LoS from the nearest pursuer to the evader,
and is calculated from Equation (5.1). Also, the evader's reward function $r^{e}_{t+1}$ can
be defined as follows:
$$r^{e}_{t+1} = -r^{p_c}_{t+1}, \qquad (5.11)$$

where $p_c$ denotes the nearest pursuer to the evader.
5.3.1 Predicting the Interception Point and its Effects
According to the pursuers’ DCSs as defined in [137], at each instant of time
the pursuers attempt to capture the evader by following their LoS to the evader.
However, if the pursuers follow this strategy the possibility of collision among them
is high, and the capture time might not be the minimum one. If each pursuer can
predict its interception point with the evader, $\hat{E}$, and move directly to this point as shown in Figure 5.4 (i.e., the pursuer redirects its LoS toward the predicted interception location along $\overrightarrow{P_i\hat{E}}$ instead of toward the instantaneous position of the evader), the capture time and the potential for collision among the pursuers can be reduced. To do that, it was assumed in [149] that the pursuers know the evader's velocity vector in order to find the modified LoS, $\overrightarrow{P_i\hat{E}}$. To find $\overrightarrow{P_i\hat{E}}$, it is necessary to first determine the values of the angles $\beta_i$ and $\alpha_i$.

Figure 5.4: Geometric illustration of the capturing situation.

The value of $\beta_i$ can be calculated as follows:
$$\beta_i = \arccos\left(-\frac{\overrightarrow{P_iE}\cdot\overrightarrow{EP}}{\left\|\overrightarrow{P_iE}\right\|\,\left\|\overrightarrow{EP}\right\|}\right), \qquad (5.12)$$
and the value of αi can be calculated according to the law of sines, as follows [150]:
$$\frac{V_e}{\sin(\alpha_i)} = \frac{V_{p_i}}{\sin(\beta_i)} \;\Rightarrow\; \alpha_i = \arcsin\left(\left(\frac{V_e}{V_{p_i}}\right)\sin(\beta_i)\right). \qquad (5.13)$$
Once the values of βi and αi are known, the magnitude of $\overrightarrow{EP}$ can be calculated from

$$\left\|\overrightarrow{EP}\right\| = \left\|\overrightarrow{P_iE}\right\|\,\frac{\sin\alpha_i}{\sin(\alpha_i + \beta_i)}, \qquad (5.14)$$
and the evader’s velocity vector−→EP can be calculated as follows:
−→EP = ‖
−→EP‖(
−→EP )dir. (5.15)
Finally, $\overrightarrow{P_i\hat{E}}$ will be determined by:

$$\overrightarrow{P_i\hat{E}} = \overrightarrow{P_iE} + \overrightarrow{EP} = (x', y'). \qquad (5.16)$$
After finding $\overrightarrow{P_i\hat{E}}$ for each pursuer, the predicted angle difference $\hat{\delta}_{p_i}$, i.e., the angle difference between $\overrightarrow{P_i\hat{E}}$ and the pursuer's velocity vector $\vec{V}_{p_i}$, can be expressed as:

$$\hat{\delta}_{p_i} = \arctan\left(\frac{y'}{x'}\right) - \theta_{p_i}. \qquad (5.17)$$
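A sketch of the interception-point prediction of Equations (5.12)–(5.17) is shown below. It is an illustrative assumption; it takes the evader's motion direction ep_dir as given (the assumption of [149] in this section) or as estimated, for example with the Kalman filter of Section 5.4.

```python
import numpy as np

def predicted_angle_difference(p_pos, e_pos, ep_dir, v_p, v_e, theta_p):
    """Predicted angle difference toward the interception point, Eqs. (5.12)-(5.17).

    p_pos, e_pos : pursuer and evader positions
    ep_dir       : unit vector of the evader's motion direction
    v_p, v_e     : pursuer and evader speeds; theta_p : pursuer orientation
    """
    pie = np.asarray(e_pos) - np.asarray(p_pos)                   # LoS vector P_iE
    beta = np.arccos(-np.dot(pie, ep_dir) / np.linalg.norm(pie))  # Eq. (5.12)
    alpha = np.arcsin((v_e / v_p) * np.sin(beta))                 # Eq. (5.13)
    ep_len = np.linalg.norm(pie) * np.sin(alpha) / np.sin(alpha + beta)  # Eq. (5.14)
    ep = ep_len * ep_dir                                           # Eq. (5.15)
    x_hat, y_hat = pie + ep                                        # Eq. (5.16): modified LoS
    return np.arctan2(y_hat, x_hat) - theta_p                      # Eq. (5.17)
```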
5.4 State Estimation Based on a Kalman Filter
As explained in Section 5.3.1, to find the modified LoS, $\overrightarrow{P_i\hat{E}}$, it was assumed in [149] that the pursuers would know the evader's velocity vector $\overrightarrow{EP}$. In this study this assumption is unnecessary, because the Kalman filter predicts the unknown states or
variables of interest. The Kalman filter estimates the states of a dynamical system
based on a linear model, which can be written in a discrete state space form as
follows [85]:
$$x(k+1) = F(k)\,x(k) + G(k)\,u(k) + v(k), \qquad (5.18)$$

and the measurement model can be described by

$$y(k) = H(k)\,x(k) + w(k). \qquad (5.19)$$
Since it is assumed that the evader moves with constant velocity, Newton’s equa-
tions of motion can give a simple dynamic system model to describe the evader’s
motion. Thus, to estimate the evader’s position (i.e., (xe(k+1), ye(k+1))) using the
Kalman filter, the following Constant Velocity Model (CVM) can be used
$$x(k+1) = \begin{bmatrix} x_e(k+1) \\ y_e(k+1) \\ v_{x_e}(k+1) \\ v_{y_e}(k+1) \end{bmatrix} = \begin{bmatrix} 1 & 0 & T & 0 \\ 0 & 1 & 0 & T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_e(k) \\ y_e(k) \\ v_{x_e}(k) \\ v_{y_e}(k) \end{bmatrix} + v(k), \qquad (5.20a)$$
$$y(k) = \begin{bmatrix} x_e(k) \\ y_e(k) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_e(k) \\ y_e(k) \\ v_{x_e}(k) \\ v_{y_e}(k) \end{bmatrix} + w(k), \qquad (5.20b)$$
where T represents the sampling time, $(x_e(k), y_e(k))$ refers to the evader's position at time step k, and $(v_{x_e}(k), v_{y_e}(k))$ refers to its velocity components.
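For reference, a minimal sketch of a standard linear Kalman filter predict/update cycle with the CVM matrices of Equation (5.20) is given below. It is an illustrative assumption; the noise covariances Q and R are designed as discussed in Section 5.4.1.

```python
import numpy as np

def cvm_matrices(T):
    """State-transition and measurement matrices of the CVM, Eq. (5.20)."""
    F = np.array([[1, 0, T, 0],
                  [0, 1, 0, T],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    return F, H

def kalman_step(x, P, z, F, H, Q, R):
    """One standard predict/update cycle of the linear Kalman filter."""
    # Prediction
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the position measurement z = (x_e, y_e)
    nu = z - H @ x_pred                          # residual (innovation)
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ nu
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, nu
```

The returned residual nu is the quantity used later by the fuzzy fading memory filter of Section 5.4.3.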
Generally, the design of an appropriate system model represents an important
issue for the Kalman filter to work properly. Therefore, if the evader moves in a
straight line or if there is a small change in its path or velocity, the CVM, described
by Equation (5.20), will enable the Kalman filter to estimate the evader’s position
accurately. But, if the evader accelerates or manoeuvres like a moving car that can accelerate/decelerate or make a turn, the CVM will fail. Therefore,
there is a necessity to change the model of Equation (5.20) to cope with such issues.
So, the new model should take the acceleration into consideration. Thus, the new
model is called Constant Acceleration Model (CAM) and can be defined by
$$x(k+1) = \begin{bmatrix} 1 & 0 & T & 0 & T^2/2 & 0 \\ 0 & 1 & 0 & T & 0 & T^2/2 \\ 0 & 0 & 1 & 0 & T & 0 \\ 0 & 0 & 0 & 1 & 0 & T \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} x(k) + v(k), \qquad (5.21a)$$
$$y(k) = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \end{bmatrix} x(k) + w(k), \qquad (5.21b)$$
where $x(k) = [x_e(k)\ \ y_e(k)\ \ v_{x_e}(k)\ \ v_{y_e}(k)\ \ a_{x_e}(k)\ \ a_{y_e}(k)]^T$, and $(a_{x_e}(k), a_{y_e}(k))$ are the evader's acceleration components at time step k.
After selecting the appropriate Kalman filter model to accurately estimate the evader's position at the next time step, $\hat{E} = (\hat{x}_e(k+1), \hat{y}_e(k+1))$, each pursuer can predict the position where capture can take place by invoking the instantaneous position of the evader, $E = (x_e(k), y_e(k))$, and the estimated next position of the evader, $\hat{E}$, as shown in Figure 5.5. Thus, each pursuer can move in the direction of the expected capture point of the evader, rather than following its LoS to the estimated position at the next time step, $\overrightarrow{P_i\hat{E}}$ [145].

Figure 5.5: Geometric illustration of the capturing situation using the estimated position.

The evader's velocity vector $\overrightarrow{EP}$ can be calculated as follows:
$$\overrightarrow{EP} = \hat{V}_e\,(\overrightarrow{E\hat{E}})_{dir}, \qquad (5.22)$$
5.4. STATE ESTIMATION BASED ON A KALMAN FILTER 129
where
Ve =
√
(xe(k + 1)− xe(k))2 + (ye(k + 1)− ye(k))2
T, (5.23)
and
$$(\overrightarrow{E\hat{E}})_{dir} = \arctan\left(\frac{\hat{y}_e(k+1) - y_e(k)}{\hat{x}_e(k+1) - x_e(k)}\right). \qquad (5.24)$$
5.4.1 The Design of Filter Parameters
After determining the appropriate system model, the measurement and process
noise covariance matrices, Rf (k) and Qf (k), respectively, should be constituted or
selected carefully. In general, Rf (k) and Qf (k) can be considered as tuning factors.
The design of Rf(k) is quite easy compared to the design of Qf(k). Every measurement sensor has manufacturer's specifications that give an idea about the value of Rf(k), and most trusted sensors give readings close to the true values. It is also possible to find the variance of the measurement noise by
taking some off-line measurements [86]. For example, if the Kalman filter is used
to estimate the xy-coordinate of the evader (i.e., for the work presented in this
chapter, it is assumed that the xy-coordinate could be measured directly from a
position sensor, such as a camera), and if it is assumed that there is no correlation
between the noises of the two measurement sensors, then Rf (k) can be defined as
follows:
$$R_f(k) = \begin{bmatrix} \sigma^2_{x_o} & 0 \\ 0 & \sigma^2_{y_o} \end{bmatrix}, \qquad (5.25)$$
where $\sigma^2_{x_o}$ and $\sigma^2_{y_o}$ represent the variances for the x and y coordinate sensors, respectively. Hence, finding Rf(k) is generally straightforward. A natural question is: what happens if Rf(k) is larger or smaller than
its actual value. If Rf(k) is too large, the filter is told that the measurement is very noisy, so it will trust the prediction more than the sensor reading, and this may cause the filter to exhibit slow convergence
than the sensor reading, and this may cause the filter to exhibit slow convergence
or even divergence. On the other hand, if Rf (k) is small the filter will favor the
measurement over the prediction and this will cause the filter to follow the noisy
measurement [151].
In contrast with Rf (k), the design of Qf (k) represents a difficult task, because
the estimated states might not be readily observable. Qf (k) is used to take into
consideration any unmodeled disturbances that can influence the system dynam-
ics. In other words, it accounts for the uncertainty in the model itself. It can be
represented by the following equation [152]:
$$Q_f(k) = \int_0^T F(t)\,Q_c\,F^T(t)\,dt, \qquad (5.26)$$
where Qc represents the covariance of the continuous noise. As an example, for the
dynamic system described by Equation (5.21), Qf (k) can be calculated to be:
$$Q_f(k) = \sigma^2_w \begin{bmatrix} T^5/20 & 0 & T^4/8 & 0 & T^3/6 & 0 \\ 0 & T^5/20 & 0 & T^4/8 & 0 & T^3/6 \\ T^4/8 & 0 & T^3/3 & 0 & T^2/2 & 0 \\ 0 & T^4/8 & 0 & T^3/3 & 0 & T^2/2 \\ T^3/6 & 0 & T^2/2 & 0 & T & 0 \\ 0 & T^3/6 & 0 & T^2/2 & 0 & T \end{bmatrix}, \qquad (5.27)$$
where $\sigma^2_w$ is the variance of the white noise. An appropriate value for $\sigma^2_w$ is usually selected by trial and error. If this value is too large, the filter is told that the magnitude of the disturbances that can affect the state evolution is large; it then has no choice but to trust the measurements, and it will follow them even if they are significantly noisy. On the other hand, if $\sigma^2_w$ is too small, the Kalman filter will trust the prediction, because a small process noise variance means that the process model is accurate, and this might lead to filter divergence if the process model does not reflect reality.
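Assuming the symmetric form of Equation (5.27), the following sketch builds Qf(k) for the CAM by first forming the single-axis block and then interleaving the x and y components in the state ordering of Equation (5.21). The function name and structure are illustrative assumptions.

```python
import numpy as np

def cam_process_noise(T, sigma_w2):
    """Process noise covariance of the CAM, Eq. (5.27).

    sigma_w2 is the white-noise variance; the 3x3 single-axis block covers
    (position, velocity, acceleration) and is expanded to the 6x6 matrix
    ordered as [x, y, vx, vy, ax, ay].
    """
    q_axis = sigma_w2 * np.array([[T**5 / 20, T**4 / 8, T**3 / 6],
                                  [T**4 / 8,  T**3 / 3, T**2 / 2],
                                  [T**3 / 6,  T**2 / 2, T       ]])
    Qf = np.zeros((6, 6))
    for r in range(3):
        for c in range(3):
            Qf[2 * r,     2 * c]     = q_axis[r, c]   # x components
            Qf[2 * r + 1, 2 * c + 1] = q_axis[r, c]   # y components
    return Qf
```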
5.4.2 Kalman Filter Initialization
For Kalman filter implementation, the initial estimate of the state, x(0), and the initial state covariance, P(0), should be specified. Practically, if the designer has good prior knowledge of these values for the problem to be solved, this gives a good starting point for the Kalman filter to work properly. The choice of x(0) is based on the designer's knowledge, and some designers use the first measurement as part of the initial state vector. Without such knowledge, x(0) can be set to zero or any reasonable value.
On the other hand, the P (0) matrix is usually defined as a diagonal matrix, with
the variances of the estimated error in the state vector along the corresponding
diagonal elements. The selection of P (0) depends on the designer’s knowledge
about x(0). So, if there is sufficient information about x(0), then it is better to set small values along the diagonal of P(0). This setting indicates to the Kalman filter that x(0) represents a good initialization. Otherwise, each diagonal element of P(0) is set to a reasonably large number.
5.4.3 Fuzzy Fading Memory Filter
For the implementation of the Kalman filter, the system model is presumed to
be precisely known, otherwise the filter may not give an appropriate state estimate
and may diverge [87]. So, to handle the modeling error that can arise in the system
model, the Fading Memory Filter (FMF) was proposed. FMF is considered as one of
the generalizations of the Kalman filter. It can be implemented easily by using the
same Kalman filter equations, except with a modification of the error covariance
prediction equation, which is given by
$$P(k+1|k) = \alpha_f^2\,F(k)\,P(k|k)\,F^T(k) + Q_f(k), \qquad (5.28)$$
where αf ≥ 1, and is usually selected to be close to 1.
For the FMF, the value of αf is chosen to be a constant, and is selected based
on trial and error or the designer’s knowledge. A constant αf may not make the
filter respond precisely to the dynamic change of the estimated system, and may
not give the best performance. Therefore, a zero-order TS fuzzy system model is
proposed and used to find an appropriate value for αf at every time step to improve
the filter behaviour. Thus, the resulting filter can be considered as an adaptive one.
The adaptation process is based on the mean and covariance of the residuals, as the
residuals indicate to the filter how well the estimated measurements fit the actual ones. In other words, the residuals provide a degree of fitting between the estimated and actual sensor readings, such that if they are not white noise, this is a sign that the filter does not perform as required [153]. Hence, the residual information provides a measure, or a Degree of Divergence (DOD) [153].
The TS fuzzy system has two inputs and one output. The two inputs are the
DOD parameters, µ and ξ, which represent the average magnitude of the residual
and the trace of the residual covariance matrix at the current time step divided
by the number of measurements [154], respectively, and the output is the weight
factor αf . The DOD parameters are calculated as follows:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} \lvert \nu_i(k) \rvert, \qquad (5.29a)$$

$$\xi = \frac{\nu^T(k)\,\nu(k)}{m}, \qquad (5.29b)$$
where ν(k) denotes the residual vector.
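The DOD parameters of Equation (5.29) and the fading-memory covariance prediction of Equation (5.28) can be sketched as follows. The value of αf itself would come from the zero-order TS fuzzy system defined by the MFs of Figure 5.6 and the FDT of Table 5.6; that inference step is omitted here, so the snippet is only a partial, illustrative sketch.

```python
import numpy as np

def dod_parameters(nu):
    """DOD parameters of Eq. (5.29) from the residual vector nu."""
    m = len(nu)
    mu = np.mean(np.abs(nu))        # Eq. (5.29a): average residual magnitude
    xi = float(nu @ nu) / m         # Eq. (5.29b): residual covariance trace / m
    return mu, xi

def fading_memory_predict(x, P, F, Q, alpha_f):
    """Prediction step of the fading memory filter, using Eq. (5.28)."""
    x_pred = F @ x
    P_pred = (alpha_f ** 2) * (F @ P @ F.T) + Q   # inflated error covariance
    return x_pred, P_pred
```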
5.4.4 Kalman Filter Model Selection
To select an appropriate Kalman filter model that can be used to address the
problem mentioned in Section 5.3, four models are taken into consideration, which
are CVM, CAM, fuzzy FMF based on CVM, and fuzzy FMF based on CAM. The last
two, can be named in short fuzzy CVM and fuzzy CAM. Also, three examples are
given. In the first example, it is assumed that the evader’s movement is based on
its DCS, and in the second one, it is assumed that the evader’s movement is based
on the modified DCS [20]. The final example is similar to the second one, except
that the evader can turn with maximum velocity [22]. In all of these examples, it is
assumed that the Kalman filter is used to estimate the evader’s position. For each
example, the CVM, CAM, and their fuzzy FMF counterparts, are implemented and
their performances are compared. The comparison is based on finding the value of
the average Root Mean Square Error (RMSE) (i.e., the average root mean square
of the differences between the true and estimated position) by running the Monte
Carlo simulation 500 times for each example and filter model. The MFs and the
FDT of the fuzzy FMF are taken as shown in Figure 5.6 and Table 5.6.
Figure 5.6: MFs of the inputs µ and ξ.
Table 5.6: FDT of the fuzzy FMF.
          ξ = Z    ξ = S    ξ = L
µ = Z     1.06     1.03     1.04
µ = S     1.03     1.05     1.02
µ = L     1.01     1.03     1.02
Example 5.1 In this example, the Kalman filter is used to track the evader’s
movement by estimating its position at each time step. The evader’s motion is based
on its DCS. Also, the evader’s motion is assumed to start from the position (−6, 7).
So, the initial state vector x(0) is given by
$$x(0) = \begin{cases} [-6,\ 7,\ 0,\ 0]^T, & \text{for the CVM and its fuzzy FMF counterpart,} \\ [-6,\ 7,\ 0,\ 0,\ 0,\ 0]^T, & \text{for the CAM and its fuzzy FMF counterpart.} \end{cases}$$
The initial estimation error covariance matrix, P (0), for the CVM and its fuzzy
FMF counterpart is given by
$$P(0) = 3 \times \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad (5.30)$$
and for the CAM and its fuzzy FMF counterpart it is given by
$$P(0) = 3 \times \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}. \qquad (5.31)$$
For the process and measurement noise covariance matrices, Qf (k) and Rf (k),
the values of σw, σxo, and σyo are set to 0.01, 0.05, and 0.05, respectively. The
simulation results are as shown in Table 5.7 and Figures 5.7 – 5.9. Table 5.7 gives
the position average RMSE and its standard deviation, which shows that the perfor-
mance of the CAM Kalman filter is slightly better than that of the CVM Kalman filter,
as the CAM Kalman filter can track the beginning of the evader's turning path more quickly than
the CVM Kalman filter. Also, it shows that the performances of both the fuzzy CVM
and fuzzy CAM Kalman filters are better than those of the non-fuzzy ones, and the
fuzzy CVM Kalman filter is the best. Figure 5.7 demonstrates the ability of each filter to estimate the evader's position in the xy-plane, while Figures
5.8 – 5.9 show the same ability to estimate the x and y positions, separately. It can
be seen that all the filters provide an acceptable state estimation, though the Fuzzy
CVM Kalman filter is the most accurate one.
Table 5.7: Mean and standard deviation of the RMSE (cm) for the evader’s position
estimate of Example 5.1.
The position average RMSE Standard Deviation
CVM Kalman Filter 5.0549 0.6728
CAM Kalman Filter 5.0280 0.5482
Fuzzy CVM Kalman Filter 3.8403 0.4093
Fuzzy CAM Kalman Filter 4.2909 0.3442
Example 5.2 In this example, it is assumed that the evader’s motion is based
on its modified DCS, which allows the evader to take advantage of its higher
maneuverability. Also, it is assumed that all the filters are simulated based on the
information given in Example 5.1, except σw is set equal to 0.03.
Table 5.8 and Figures 5.10 – 5.12 demonstrate the position estimation accuracy
for each filter. Table 5.8 shows that the CVM and CAM Kalman filters have large
average RMSE compared with their fuzzy counterparts. Also, Figure 5.10 gives an
indication that there are modeling uncertainties in both the CVM and CAM Kalman filters.
Figure 5.7: The evader’s position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.8: The evader’s x-position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.9: The evader’s y-position estimate for Example 5.1 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Thus, when the evader starts to take a sharp turn at time 9.0 s, as seen in
Figures 5.11 – 5.12, both filters might diverge. On the other hand, the results show
that both the fuzzy CVM and fuzzy CAM Kalman filters provide better performance
compared to the CVM and CAM Kalman filters, as they have the ability to handle the modeling uncertainties through the use of the fuzzy FMF. Also, the results show that the
fuzzy CAM Kalman filter is the best.
Table 5.8: Mean and standard deviation of the RMSE (cm) for the evader’s position
estimate of Example 5.2.
The position average RMSE Standard Deviation
CVM Kalman Filter 18.4862 0.3513
CAM Kalman Filter 18.2201 0.4244
Fuzzy CVM Kalman Filter 10.7093 0.3505
Fuzzy CAM Kalman Filter 8.5407 0.3723
Example 5.3 This example is similar to the second one, except that the evader
can turn with maximum velocity [22]. It is assumed that all filters are simulated
based on the information given in Example 5.2.
Simulation results are as shown in Table 5.9 and Figures 5.13 – 5.15. Table 5.9
shows that the position average RMSEs are large compared to the result of Example
5.2, because the filters are less able to continuously track the evader's sharp, fast turn. Also, at time 9.2 s, the results show that all filters give inaccurate
state estimation, and the non-fuzzy filters are the worst.
Figure 5.10: The evader’s position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Table 5.9: Mean and standard deviation of the RMSE (cm) for the evader’s position
estimate of Example 5.3.
The position average RMSE Standard Deviation
CVM Kalman Filter 34.4452 0.1957
CAM Kalman Filter 36.6180 0.2353
Fuzzy CVM Kalman Filter 20.1055 0.1993
Fuzzy CAM Kalman Filter 16.2257 0.2114
Figure 5.11: The evader’s x-position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.12: The evader’s y-position estimate for Example 5.2 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.13: The evader’s position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.14: The evader’s x-position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
Figure 5.15: The evader’s y-position estimate for Example 5.3 by using (a) CVM
Kalman Filter. (b) CAM Kalman Filter. (c) Fuzzy CVM Kalman Filter. (d)
Fuzzy CAM Kalman Filter.
5.5 Computer Simulation
To compare the performance of the Kalman-FACLA algorithm with the FACLA algo-
rithm, a Monte Carlo simulation is run 500 times for each learning algorithm, and
two cases are considered. The first case is a two-pursuer one-evader game, while
the second one is a three-pursuer one-evader game. It is assumed that the pursuers
are faster than the evader (i.e. Vp1 = Vp2 = Vp3 = 0.5 m/s and Ve = 0.3 m/s),
and that −0.8 ≤ ue ≤ 0.8 and −0.5 ≤ up1 , up2 , up3 ≤ 0.5. The wheelbases of the
pursuers and the evader are the same and they are equal to 0.2 m. In each episode,
the evader’s motion starts from the origin with an initial orientation of θe = 0, while
the pursuers’ motions are chosen randomly from a set of 64 different positions with
θp1 = θp2 = θp3 = 0. The selected capture radius is ℓ = 0.1 m, and the sample time
is T = 0.1 s. The number of episodes/games is 200, and the number of plays in
each game is 600. The game terminates when the time exceeds 60 s, or one of the
pursuers captures the evader. From Example 5.1, it is found that the fuzzy CVM
Kalman filter gives the best performance compared with the other models, there-
fore the Kalman-FACLA algorithm is based on this model. All the filter parameters
are selected as in Example 5.1.
5.5.1 Case 1: Two-Pursuer One-Evader Game
For this case, it is assumed that the game is played with two pursuers attempting
to learn how to capture a single evader, and the evader is learning how to escape
or extend the capture time. It is also assumed that each player has no information
about the other players’ strategies. In addition, it is assumed that each pursuer
only knows the instantaneous position of the evader, and vice versa. The goal is
to make all players interact with each other to self-learn their control strategies
simultaneously, using either the FACLA algorithm or the Kalman-FACLA algorithm.
After the learning processes are complete, the performance of each learning
technique is tested by running the game with different sets of pursuers’ initial po-
sitions. The performance is assessed based on the capture time and the possibility
of collision between pursuers. The mean value and standard deviation of the cap-
ture time for different initial pursuer positions, using the FACLA algorithm and the
Kalman-FACLA algorithm, are given in Table 5.10. From the table it is clear that the
mean values of the capture time using the Kalman-FACLA algorithm are less than
those of the FACLA algorithm. For example, in the first test set the pursuers
took approximately 21.1 s to capture the evader using the FACLA algorithm, and
approximately 19.2 s using the Kalman-FACLA algorithm. Also, the means and
standard deviations of the capture time obtained with the Kalman-FACLA algorithm
are compared with those obtained with the FACLA algorithm using a two-tailed
two-sample t-test. For all tested pursuers' initial positions, the difference in means
was found to be statistically significant at the 0.05 level (the resulting p-values are
below 0.0001).
Figure 5.16 and Figure 5.17, and it is evident that all players are capable of learning
their control strategies by interacting with each other using either one of the two
algorithms. However, the main difference between these algorithms is the potential
for collision between pursuers. Figure 5.16 shows that this is likely using the FACLA
algorithm, but reduced or diminished using the Kalman-FACLA algorithm, as shown
in Figure 5.17.
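The significance test reported above can be reproduced with a standard two-sample t-test; a minimal sketch, assuming the per-run capture times of each algorithm have been saved to hypothetical text files, is:

    from scipy import stats
    import numpy as np

    # Hypothetical arrays of 500 Monte Carlo capture times per algorithm.
    t_facla  = np.loadtxt("capture_times_facla.txt")
    t_kfacla = np.loadtxt("capture_times_kalman_facla.txt")

    # Two-tailed two-sample t-test on the difference in mean capture time.
    t_stat, p_value = stats.ttest_ind(t_facla, t_kfacla)
    print(f"t = {t_stat:.3f}, p = {p_value:.2g}")  # significant if p < 0.05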
Table 5.10: Mean and standard deviation of the capture time (s) for a two-pursuer
one-evader game for different pursuers' initial positions.

Pursuers' initial positions   (−2,−5), (−3, 6)    (−5, 3), (−2, 6)    (−6,−2), (5, 3)     (−4, 3), (5,−2)
                              Mean     Std. dev.  Mean     Std. dev.  Mean     Std. dev.  Mean     Std. dev.
FACLA algorithm               21.0776  0.6689     28.6800  0.2516     19.5232  0.4689     18.7682  0.4228
Kalman-FACLA algorithm        19.1984  0.6094     27.2504  1.2528     16.7122  0.2557     15.5420  0.3337
[Figure 5.16: four panels of PE paths in the (x-position (m), y-position (m)) plane, each showing the trajectories of Pursuer 1, Pursuer 2 and the Evader for the initial positions (a) Pursuer 1 (−2,−5), Pursuer 2 (−3, 6); (b) Pursuer 1 (−5, 3), Pursuer 2 (−2, 6); (c) Pursuer 1 (−6,−2), Pursuer 2 (5, 3); (d) Pursuer 1 (−4, 3), Pursuer 2 (5,−2).]

Figure 5.16: The PE paths using the FACLA algorithm for a two-pursuer one-evader game for different pursuers' initial positions.
[Figure 5.17: four panels of PE paths in the (x-position (m), y-position (m)) plane, each showing the trajectories of Pursuer 1, Pursuer 2 and the Evader for the same four sets of initial positions as Figure 5.16.]

Figure 5.17: The PE paths using the Kalman-FACLA algorithm for a two-pursuer one-evader game for different pursuers' initial positions.
5.5.2 Case 2: Three-Pursuer One-Evader Game
For this case, the assumptions of the game are similar to those of the previous
case, but with three pursuers instead of two. It is also assumed that the control
strategy of the evader is to learn how to escape from the nearest pursuer. After
learning, the performance of the Kalman-FACLA algorithm is compared with the
performance of the FACLA algorithm. The comparison is based on testing each
learning algorithm by running the game with four different sets of pursuers’ initial
positions. The results are shown in Table 5.11 and Figures 5.18 – 5.19, which
demonstrate that the pursuers succeeded in finding their control strategies and
capturing the evader using either of the two learning algorithms. Also, from Table
5.11, it can be seen that using the Kalman-FACLA algorithm significantly reduces
the mean values of the capture time. For example, for the second test set, the
Kalman-FACLA algorithm reduces the mean capture time from 54.2762 s (for the
FACLA algorithm) to 43.0026 s. Furthermore, when applied to the results presented
in Table 5.11, the two-tailed two-sample t-test demonstrated a significant difference
between the means at the 0.05 level (the resulting p-values are less than 0.0001). Figure
5.18 shows that there is a possibility of collision among the pursuers when using
the FACLA algorithm, but this possibility is reduced or diminished when using the
modified algorithm, as shown in Figure 5.19.
Table 5.11: Mean and standard deviation of the capture time (s) for a three-pursuer
one-evader game for different pursuers' initial positions.

Pursuers' initial positions   (−6, 12), (3, 10), (6,−12)   (20,−6), (−20,−5), (4, 12)   (15, 5), (−15, 6), (7,−7)   (−4, 4), (−4,−5), (8, 4)
                              Mean     Std. dev.           Mean     Std. dev.           Mean     Std. dev.          Mean     Std. dev.
FACLA algorithm               39.6628  0.6605              54.2762  1.3017              39.0016  0.8454             19.6752  0.2940
Kalman-FACLA algorithm        32.8064  0.2098              43.0026  0.4314              31.8094  0.3121             16.1140  0.2063
[Figure 5.18: four panels of PE paths in the (x-position (m), y-position (m)) plane, each showing the trajectories of Pursuer 1, Pursuer 2, Pursuer 3 and the Evader for the initial positions (a) Pursuer 1 (−6, 12), Pursuer 2 (3, 10), Pursuer 3 (6,−12); (b) Pursuer 1 (20,−6), Pursuer 2 (−20,−5), Pursuer 3 (4, 12); (c) Pursuer 1 (15, 5), Pursuer 2 (−15, 6), Pursuer 3 (7,−7); (d) Pursuer 1 (−4, 4), Pursuer 2 (−4,−5), Pursuer 3 (8, 4).]

Figure 5.18: The PE paths using the FACLA algorithm for a three-pursuer one-evader game for different pursuers' initial positions.
[Figure 5.19: four panels of PE paths in the (x-position (m), y-position (m)) plane, each showing the trajectories of Pursuer 1, Pursuer 2, Pursuer 3 and the Evader for the same four sets of initial positions as Figure 5.18.]

Figure 5.19: The PE paths using the Kalman-FACLA algorithm for a three-pursuer one-evader game for different pursuers' initial positions.
5.6 Conclusion
In this chapter, a new fuzzy-reinforcement learning algorithm called FACLA is
proposed for PE differential games, to address the issue of reducing the learning
period that the players need to determine their control strategies. It is a modified
version of the FACL algorithm, and uses the CACLA algorithm to tune the parameters
of the FIS. The proposed algorithm was applied to two versions of the two-player
PE differential game and compared by computer simulation with the FACL [17],
the RGFACL [22] and the PSO-based FLC+QLFIS [140] algorithms. Simulation
results show that the FACLA algorithm allows each learning agent to reach its DCS
in less learning time than the other algorithms. Then the FACLA algorithm is
modified and applied to the problem of multi-pursuer single-evader PE differen-
tial games. The modification is accomplished by using the Kalman filter to enable
each pursuer to estimate the evader’s movement direction. By using the modifi-
cation, each pursuer can move to the expected interception point directly, rather
than following its LoS to the evader in an attempt to reduce the capture time and
collision potential among the pursuers. The modified FACLA algorithm is called the
Kalman-FACLA, and it works with PE differential games that have continuous state
and action spaces. It also works in a decentralized manner, because each pursuer
considers the other pursuers as part of the environment, and there is no communi-
cation or direct cooperation among them. The Kalman-FACLA algorithm is applied
to the problem of a multi-pursuer single-evader game with multiple pursuers at-
tempting to capture a single evader, and all players learning simultaneously. This
occurs in two cases of the game in particular: a two-pursuer one-evader game and a
three-pursuer one-evader game. In both cases, the simulation results show that the
Kalman-FACLA algorithm outperforms the FACLA algorithm by reducing the capture
time and the collision potential among pursuers.
Chapter 6
Multi-Player Pursuit-Evasion
Differential Game with Equal Speed
6.1 Introduction
This chapter focuses on the problem of multi-player Pursuit-Evasion (PE) differen-
tial games with a single-superior evader, in which all the players have equal speed.
In the literature, there are just few articles that address the multi-player PE dif-
ferential games with superior evaders [8, 25, 133–135] and without any type of
learning. However, Awheda et. al [23, 136] recently proposed two decentralized
learning algorithms for the PE game issue with a single superior-evader. The first
learning algorithm [23] was used to enable a group of pursuers with equal speed
to capture a single evader that has a speed identical to the speed of the pursuers.
This algorithm was based on the condition proposed in [8] and a specific forma-
tion control strategy. The second proposed learning algorithm [136] was used to
enable a group of equal speed pursuers to capture a single evader when its speed
is greater than or equal to the speed of the pursuers. It was based on Apollonius
circles and a modified formation control strategy. The drawback of both these al-
gorithms [23, 136] is that they must calculate the capture angle of each pursuer in
order to determine its control signal. Thus, in this chapter, a special type of reward
function is suggested for the Fuzzy Actor-Critic Learning Automaton (FACLA)
algorithm that enables a group of pursuers to capture a single evader in a
decentralized manner without knowing the capture angle. It is assumed that all players
in the game have identical speed. The game is played so each pursuer can learn
how to participate in capturing the evader by tuning its Fuzzy Logic Control (FLC)
parameters. The tuning process depends on the reward value that each pursuer
receives after each action taken. The suggested reward function depends on two
factors: the first is the difference in the Line-of-Sight (LoS) between each pursuer
in the game and the evader at two consecutive time instants, and the second is the
difference between two successive Euclidean distances between each pursuer and
the evader. The simulation results were published in [155].¹
This chapter is organized as follows: Section 6.2 describes the dynamic equa-
tions of the players, and the FLC structure is discussed briefly in Section 6.3. The
formulation of the reward function is provided in Section 6.4, and the FACLA algo-
rithm is briefly addressed in Section 6.5. In Section 6.6, the simulation results are
discussed, and conclusions are provided in Section 6.7.
¹A. A. Al-Talabi, "Multi-Player Pursuit-Evasion Differential Game with Equal Speed," in Proc. of the 2017 IEEE International Automatic Control Conference (CACS), (Pingtung, Taiwan), November 2017.
6.2 The Dynamic Equations of the Players
For the multi-player PE game presented in this chapter, it is assumed that there are
n pursuers (p1, p2, ..., pn) trying to capture a single evader e, and that all players have
identical capabilities. Let Vp and Ve denote the maximum velocities of each pursuer
and the evader, respectively, with Vp = Ve. The dynamic equations of the players are

ẋpi = Vp cos(θpi),  ẏpi = Vp sin(θpi),  (6.1)

ẋe = Ve cos(θe),  ẏe = Ve sin(θe),  (6.2)

−π ≤ θpi ≤ π,  −π ≤ θe ≤ π,
where (xpi , ypi) and (xe, ye) refer to the positions of the pursuer pi and evader e,
respectively. Also, θpi and θe refer to the pursuer, pi, and evader strategies, respec-
tively. At time t, the Euclidean distance between each pursuer pi and the evader is
defined by
Dpi(t) = √((xe(t) − xpi(t))² + (ye(t) − ypi(t))²),  i = 1, 2, ..., n.  (6.3)
The evader is captured if, at some time t, 0 ≤ t ≤ Tf, where Tf refers to the final
time, there is at least one pursuer pi such that Dpi(t) is less than a certain threshold
value ℓ, which is called the radius of capture. To ensure this, it is necessary to
satisfy the capture condition, which can be explained geometrically by Figure 6.1.
In Figure 6.1, Pi and E denote the initial positions of the pursuer pi and the evader
e, respectively. From this figure and according to the law of sines, the capture
condition can be defined by
Ve / sin(αi) = Vp / sin(βi)  ⇒  αi = βi and βi < βimax,  (6.4)

where

βimax = π/2.  (6.5)
[Figure 6.1: geometric illustration of the capture situation, showing the pursuer at Pi with velocity Vp and angle βi, the evader at E with velocity Ve and angle αi, the distance Dpi, and the maximum angle βimax.]

Figure 6.1: Geometric illustration for the capturing situation.

As shown in Figure 6.1, when the angle βi is less than βimax, it is obvious that the
pursuer pi can always find an angle αi that ensures the capture of the evader. Also,
it is clear that the pursuer pi can cover the evader's movement within an angle
of 2βi. Since βi < π/2 implies 2βi < π, the minimum number of pursuers required
to cover the full 2π angle around the evader is three. To summarize, the necessary
conditions for capturing the evader
are
1. There exist enough pursuers in the game; and,

2. At each instant of time there exists at least one pursuer satisfying Equation
(6.4); a minimal numerical check of this condition is sketched below.
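As a small illustration, condition (6.4) and the three-pursuer bound can be checked numerically; a minimal Python sketch, assuming the equal-speed case Vp = Ve of this chapter:

    import math

    def capture_possible(beta_i, Vp, Ve):
        # Condition (6.4)-(6.5): by the law of sines a capture angle alpha_i
        # exists only if Ve*sin(beta_i)/Vp is a valid sine and beta_i < pi/2;
        # with Vp = Ve this reduces to alpha_i = beta_i < pi/2.
        return abs(Ve * math.sin(beta_i) / Vp) <= 1.0 and beta_i < math.pi / 2

    def min_pursuers(beta_i):
        # Each pursuer covers an angle of 2*beta_i around the evader, so at
        # least ceil(2*pi / (2*beta_i)) pursuers are needed; since beta_i is
        # strictly below pi/2, this is never fewer than three.
        return math.ceil(2 * math.pi / (2 * beta_i))

    print(capture_possible(1.2, 1.0, 1.0))    # True (1.2 rad < pi/2)
    print(min_pursuers(math.pi / 2 - 0.01))   # 3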
6.3 Fuzzy Logic Controller Structure
An FLC with two inputs and one output is used for each learning agent pi. The
inputs are the x and y components of the Manhattan distance between the pursuer
pi and the evader, and the output is θpi . For each pursuer pi, the Manhattan distance
components are defined by:
Dxpi = xe(t) − xpi(t),  (6.6)

Dypi = ye(t) − ypi(t).  (6.7)
A two-input, one-output zero-order Takagi-Sugeno (TS) fuzzy model is used [46].
The two inputs are z1 and z2, which represent Dxpi and Dypi, respectively, and it is
assumed that each input has five triangular Membership Functions (MFs). Therefore,
it is necessary to build 25 rules, each with one consequent parameter Kl. The fuzzy
rules can be constructed using the Fuzzy Decision Table (FDT) shown in Table 6.1,
where A1, A2, ..., A5 and B1, B2, ..., B5 are the linguistic labels for the MFs of Dxpi
and Dypi, respectively. The fuzzy output θpi is defuzzified into a crisp output using
the weighted average defuzzification method [138].

Table 6.1: Fuzzy decision table

                   Dypi
            B1    B2    B3    B4    B5
Dxpi   A1   K1    K2    K3    K4    K5
       A2   K6    K7    K8    K9    K10
       A3   K11   K12   K13   K14   K15
       A4   K16   K17   K18   K19   K20
       A5   K21   K22   K23   K24   K25
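To make the rule evaluation concrete, a minimal Python sketch of this zero-order TS inference is given below; the placement of the five triangular MFs over a normalized input range and the product t-norm for the rule firing strengths are assumptions, since they are not fixed here.

    import numpy as np

    def tri_mf(x, left, peak, right):
        # Triangular membership function with feet at left/right and apex at peak.
        if x <= left or x >= right:
            return 0.0
        return (x - left) / (peak - left) if x <= peak else (right - x) / (right - peak)

    centers = np.linspace(-1.0, 1.0, 5)   # assumed MF centers, half-overlap partition

    def mf_grades(z):
        return np.array([tri_mf(z, c - 0.5, c, c + 0.5) for c in centers])

    def flc_output(z1, z2, K):
        # K is the 5x5 table of consequent parameters K_l (Table 6.1); the crisp
        # output is the weighted average of the K_l by the 25 rule firing strengths.
        w = np.outer(mf_grades(z1), mf_grades(z2))
        return float(np.sum(w * K) / (np.sum(w) + 1e-12))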
6.4 Reward Function Formulation
In this chapter, an actor-critic structure consisting of two main components (actor
and critic) is applied. Actor refers to the policy structure used to select an action
for the current system state, and critic refers to the estimated value function V (s),
that is used to criticize the action. After calculating V (s), the critic evaluates the
resulting state to determine whether the performance has improved or deteriorated
[16]. This evaluation is based on the TD-error ∆t, which is defined by

∆t = rt+1 + γVt(st+1) − Vt(st).  (6.8)

Based on ∆t, the critic can estimate V(s) as follows [16]:

Vt+1(st) = Vt(st) + α∆t.  (6.9)
From Equations (6.8) – (6.9), it is clear that the reward function plays an important
role in enabling the learning agent to accurately update its value function. The
choice of reward function depends on the problem to be addressed, and for the
problem under consideration, the main objective is to enable a team of n-pursuers
to learn how to capture a single evader by interacting with it. Thus, a special form
of reward function is suggested. The suggested reward function uses two factors to
help pursuers learn how to participate in capturing the evader. The first factor is the
difference in the LoS between each pursuer and the evader at two consecutive time
instants, ∆LoS(t), and the second factor is the difference between two successive
Euclidean distances between each pursuer and the evader ∆D(t). For the pursuer
pi, ∆LoSpi(t) is defined as follows:
∆LoSpi(t) = LoSpi(t)− LoSpi(t+ 1), (6.10)
where the LoS(t) between pi and the evader is defined by:
LoSpi(t) = tan⁻¹((ye(t) − ypi(t)) / (xe(t) − xpi(t))).  (6.11)
Also, ∆Dpi(t) is given by:
∆Dpi(t) = Dpi(t)−Dpi(t+ 1). (6.12)
The first factor ensures that the pursuers move according to the Parallel Guidance
Law (PGL), which means each pursuer that could capture the evader will move to
the capture point E, as shown in Figure 6.1. The pursuers that cannot capture
the evader will move in parallel with it, as in Figure 6.2, to ensure an invariant
angle distribution around the evader. Figure 6.2 identifies two paths along which
the pursuer can make ∆LoS approach zero: PiA and PiB. If the pursuer follows
the path PiB, the distance between it and the evader remains unchanged, but if the
pursuer follows the path PiA, the distance between them increases. Thus, the second
factor is used to give a positive reward to the pursuer that reduces this distance over
time, or at least keeps it equal to the initial distance.
[Figure 6.2: geometric illustration of a pursuer at Pi with velocity Vp and the evader at E with velocity Ve, showing the candidate paths toward points A, B and C, the angles ᾱi and βi, and the distance Dpi.]

Figure 6.2: Geometric illustration for the pursuer moving in parallel with the evader using the PGL.

According to the previous analysis, the reward function of the pursuer pi can be
defined as follows:

rpi(t+1) = 1.5 r1pi(t+1) + r2pi(t+1),  (6.13)
where r1pi(t+ 1) and r2pi(t+ 1) are defined as follows:
r1pi(t+1) = 2 e^(−∆LoSpi(t)² / 0.005) − 1,  (6.14)

and

r2pi(t+1) = ∆Dpi(t) / ∆Dpimax,  (6.15)
where ∆Dpimax refers to the maximum value of ∆Dpi and is calculated from
∆Dpimax = (Vpi + Ve)T, (6.16)
where T represents the sampling time.
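Collecting Equations (6.10) – (6.16), the reward computation for one pursuer can be sketched in Python as follows; atan2 is used in place of the plain arctangent of Equation (6.11) so that the LoS angle stays well defined in all quadrants.

    import numpy as np

    def reward(p_t, p_t1, e_t, e_t1, Vp=1.0, Ve=1.0, T=0.1):
        # p_t/p_t1 and e_t/e_t1 are the (x, y) positions of the pursuer and the
        # evader at times t and t+1.
        def los(p, e):                                 # Eq. (6.11)
            return np.arctan2(e[1] - p[1], e[0] - p[0])
        def dist(p, e):                                # Eq. (6.3)
            return np.hypot(e[0] - p[0], e[1] - p[1])
        d_los = los(p_t, e_t) - los(p_t1, e_t1)        # Eq. (6.10)
        d_d = dist(p_t, e_t) - dist(p_t1, e_t1)        # Eq. (6.12)
        r1 = 2.0 * np.exp(-d_los**2 / 0.005) - 1.0     # Eq. (6.14)
        r2 = d_d / ((Vp + Ve) * T)                     # Eqs. (6.15) - (6.16)
        return 1.5 * r1 + r2                           # Eq. (6.13)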
6.5 Fuzzy Actor-Critic Learning Automaton (FACLA)
In [144], the FACLA algorithm was proposed to address the problem of the PE
differential game; in this algorithm, both the actor and the critic are Fuzzy Inference
Systems (FISs), and the results showed that the FACLA algorithm reduces the time
players need to learn their control strategies. In this chapter, only the consequent
parameters of the
FIS are tuned. Let K^C_l and K^A_l represent the consequent parameters of the critic
and the actor in rule l, respectively. K^C_l and K^A_l are then updated according to the
following gradient-based formulas [144]:

K^C_l(t+1) = K^C_l(t) + η ∆t ∂Vt(st)/∂K^C_l,  (6.17)

IF ∆t > 0:  K^A_l(t+1) = K^A_l(t) + ξ ∆t (uc − ut) ∂ut/∂K^A_l,  (6.18)
where η and ξ are the learning parameters for the critic and actor, respectively. They
can be defined as follows:
η = 0.3 − 0.09 (iep / Max. Episodes),  (6.19)

ξ = 0.1 η,  (6.20)

where iep is the current episode. The terms ∂Vt(st)/∂K^C_l and ∂ut/∂K^A_l are given by

∂Vt(st)/∂K^C_l = ∂ut/∂K^A_l = ω̄l,  (6.21)

where ωl and ω̄l represent the firing strength and the normalized firing strength of
rule l, respectively. In Equation (6.18), a positive ∆t indicates that the current action should be
enforced. If a negative update were also allowed, the learning algorithm would not
necessarily lead to selecting an action that has a better value function (i.e., one that
leads to a positive ∆t), and thus negative updates are not considered.
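Putting Equations (6.8), (6.17) and (6.18) together, one FACLA update step can be sketched in Python as follows; the vectorized update over all rules is assumed, uc is taken to be the exploratory action actually executed while ut is the actor's output, and the fixed learning rates stand in for the schedule of Equations (6.19) – (6.20).

    import numpy as np

    def facla_step(KC, KA, wbar, r, V_s, V_s_next, u_c, u_t,
                   gamma=0.95, eta=0.3, xi=0.03):
        # KC, KA: consequent-parameter vectors of critic and actor; wbar: the
        # normalized firing strengths of all rules for the current state.
        delta = r + gamma * V_s_next - V_s         # TD-error, Eq. (6.8)
        KC += eta * delta * wbar                   # critic update, Eq. (6.17)
        if delta > 0:                              # CACLA-style actor update,
            KA += xi * delta * (u_c - u_t) * wbar  # Eq. (6.18)
        return KC, KA, delta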
6.6 Computer Simulation
For simulation purposes, it is assumed that there are three pursuers and one evader,
and their velocities are the same (Vp = Ve = 1 m/s). The selected sample time
is T = 0.1 s, and the capture radius is ℓ = 1 m. The game is played for 1000
episodes, with 200 as the maximum number of steps/plays in each game, so the
game terminates when the time exceeds 20 s, or when the capture condition is
satisfied. The goal of the simulation is to enable the pursuers to self-learn their
control strategies by interacting with the evader and continuously tuning their FLCs.
The evader's motion starts from the origin, and the pursuers' motions start from
(xp1, yp1) = (−5, 5), (xp2, yp2) = (5, 5), and (xp3, yp3) = (0,−5). The PE paths at the first
and final learning episodes are shown in Figure 6.3 and Figure 6.4, respectively.
Figure 6.3 indicates that each pursuer tries to explore the best actions to learn its
control strategy, while Figure 6.4 shows that all the pursuers can learn their control
strategies to capture the evader. As expected, it is clear that pursuer p3 moved to
the capture point, while pursuers p1 and p2 moved in parallel with the evader.
The average payoff for each pursuer in the game is shown in Figure 6.5. For
p1 and p2 it converges to 1.5, whereas it converges to 1.8 for pursuer p3. This is
because both p1 and p2 move using the PGL, but they cannot reduce their distance
to the evader over time, thus they receive less reward. However, p3 moves using the
PGL, and thereby reduces its distance to the evader over time and receives more
reward.

6.7 Conclusion

The suggested reward function is a combination of two factors. The first depends
on the difference in the LoS between each pursuer and the evader at two consecutive
time instants, which allows the
pursuers to move according to the parallel guidance law. The other factor depends
on the difference between two successive Euclidean distances between each pur-
suer and the evader, to ensure that the distance between them remains unchanged
or is reduced over time. From the computer simulations and the results, it is clear that
the FACLA algorithm with the suggested reward function enables each pursuer to
learn its control policy and participate in capturing the evader. The pursuers search
for their control strategies by interacting with the evader. The FACLA algorithm
based on the suggested reward function operates in a decentralized manner, since
each pursuer in the PE game regards the other players as part of its environment,
and does not communicate or have direct collaboration with them.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
Pursuit-Evasion (PE) type games have been used for decades, and several of them
have been studied extensively due to their potential for military application. The
concept can also be generalized to solve real-world applications. Four main tech-
niques are commonly used to solve the PE game problem: optimal control, dy-
namic programming, game theory and Reinforcement Learning (RL). The solution
complexity of the game increases rapidly with the number of players; thus, it is
better to use learning techniques that help each player find its control strategy.
Several learning algorithms to solve the problem of the PE game were proposed
previously, each with its own advantages and disadvantages. The disadvantages
are:
1- Computational requirements need to be investigated: The researchers did
not consider which set of parameters had a significant impact on the perfor-
mance of the learning algorithm. Some assumed that the learning process is
achieved by tuning all the sets of parameters, while others assumed that tun-
ing can be done using only one set of parameters. This choice will certainly
impact the computation time or computational requirements.
2- Long learning time: For each learning algorithm, the learning process re-
quires a specific number of episodes to achieve acceptable performance. Some
researchers set this number to be higher than necessary, which increases the
learning time.
3- High possibility of collision among pursuers: The implementation of these
algorithms for the problem of a PE game in which multiple pursuers try to
capture a single evader leads to a high probability of collision among the pursuers.
In addition, the capture time might not be the minimum one.
4- Knowing the speed of a superior evader: The learning algorithms previously
proposed for the case of single-superior evader assumed that the speed of the
evader is known by each pursuer.
In order to resolve these disadvantages, this thesis addresses the problem of PE
games from the learning perspective. In particular, it proposes several learning
algorithms that can be easily and efficiently applied to PE differential games. The
objectives of the proposed algorithms were:
• to investigate the possible reduction of computation time;
• to determine the possible reduction of time that the player needs to find its
control strategy (i.e., reducing the learning time);
• to determine how to avoid or reduce the possibility of collisions among the
pursuers, and how to reduce the capture time;
• to deal with the problem of the PE differential game with a superior evader, in
which the evader’s speed is equal to the maximum speed of the fastest pursuer
in the game; and,
• to determine how the learning algorithm can be implemented in a decen-
tralized manner in which each player considers the other players part of its
environment, and thus does not need to share its states or actions with other
players.
These objectives were met by investigating the previously proposed learning algo-
rithms, and implementing the newly proposed learning algorithms. The newly proposed
algorithms are based on fuzzy-reinforcement learning, the Particle Swarm Opti-
mization (PSO) algorithm, the Kalman filter and the concept of Parallel Guidance
Law (PGL). Fuzzy-reinforcement learning combines the RL with the Fuzzy Infer-
ence System (FIS), to deal with the PE game in its differential form. Here, the FIS
works as either a Fuzzy Logic Control (FLC), or a function approximator to manage
the problem of continuous state and action spaces. In this thesis, the PSO algorithm
works as a global optimizer for the FLC parameters to determine appropriate values
for the FLC parameter setting. Starting with these settings improves the initial func-
tionality of the FLC and speeds up the convergence to its final setting; thus the
learning process will be rapid. The Kalman filter was used to enable each pursuer
to estimate the evader’s velocity vector, which helps minimize both the capture time
and the potential for collision among pursuers. Finally, the concept of PGL was used
by each pursuer such that each pursuer that could capture the evader will move to
the expected capture point, while the pursuers that cannot capture the evader will
move in parallel with it, to ensure invariant angle distribution around the evader.
7.2 Contributions
The main contributions of this thesis are:
1-Reduced the Computational Time
Four methods of implementing the Q-Learning Fuzzy Inference System (QLFIS)
algorithm were proposed in Chapter 3 to analyze the possibility of reducing the
computational requirements of this algorithm without affecting its overall perfor-
mance. The analysis was based on whether it is necessary to tune all the parameters
of the FIS and the FLC, or just their consequential parameters. The four methods
were applied to three versions of PE games, and an evaluation of each game was
made to decide which parameters are the best to tune, and which have minimal
impact on performance. Simulation results in Section 3.7 showed that the perfor-
mance of the learning algorithm for each game depends on its parameter tuning
method. Also, it showed the possible reduction of computational time for each
game.
2-Reduced Learning Time
For this purpose two algorithms were proposed, as follows:
1. In Chapter 4, an unsupervised two-stage learning technique, known as the
PSO-based FLC+QLFIS, was proposed to reduce the learning time. The learn-
ing algorithm combines the PSO-based FLC algorithm with the QLFIS algo-
rithm. In the first stage, the PSO algorithm works as a global optimizer to au-
tonomously tune the parameters of the FLC, and in the second stage the QLFIS
algorithm acts as a local optimizer. For the PSO-based FLC+QLFIS learning al-
gorithm, the first stage is critical, since it provides the next stage with the best
initial parameter settings for the FLC. The PSO algorithm uses a simple for-
mula to update each particle, and it has low computational requirements. The
proposed technique was applied to the problem of the PE differential game,
and the simulation results in Section 4.5 show that the PSO-based FLC+QLFIS
learning algorithm requires fewer episodes to learn (i.e., the shortest learning
time) than the PSO-based FLC algorithm or the QLFIS algorithm. The find-
ings of this chapter can be used as a base for providing a similar learning
algorithm with a useful initial parameter setting. For example, using artificial
neural networks in an application requires an initialization step for its weights
which are typically initialized randomly, and using random weights can cause
inefficient start-up of the neural network. Therefore, it is possible to use the
PSO algorithm, as in the first stage of the proposed learning algorithm, to find
an acceptable setting for the weights in a few iterations.
2. In Chapter 5, a new fuzzy-reinforcement learning algorithm was proposed to
address the problems of PE differential games, and to reduce players’ learn-
ing time. The new algorithm uses the Continuous Actor-Critic Learning Au-
tomaton (CACLA) algorithm to tune the parameters of the FIS, and is known
as Fuzzy Actor-Critic Learning Automaton (FACLA) algorithm. The algorithm
was applied to different versions of the PE games, and it was compared by sim-
ulations to the Fuzzy-Actor Critic Learning (FACL), Residual Gradient Fuzzy
Actor-Critic Learning (RGFACL) and PSO-based FLC+QLFIS algorithms. The
simulation results demonstrated that the advantages of the FACLA algorithm
over other algorithms are due to having the shortest learning time (i.e., it
takes only 150 episodes for learning) and the lowest computation time, both
of which are important factors for any learning technique. These advantages,
as well as the fact that the FACLA algorithm can deal with the states and ac-
tions represented in continuous domains, indicate that the FACLA algorithm
can be used as a learning technique for an application with a continuous state
and action spaces.
3-Reduced Capture Time and Collision Potential Among Pursuers
In Chapter 5, a modified version of the FACLA algorithm was proposed for the
problem of multi-pursuer PE differential games with an inferior evader. The mod-
ification used the Kalman filter technique to enable each pursuer to estimate the
evader’s velocity vector. With this modification, each pursuer can move to the ex-
pected interception point directly, instead of following its line-of-sight to the evader
in an attempt to reduce the capture time and reduce the collision potential among
pursuers. In Section 5.5, the simulation results for both a two-pursuer one-evader
game and a three-pursuer one-evader game showed that the modified learning al-
gorithm outperforms the original learning algorithm by reducing both the capture
time and potential pursuer collisions.
4-Dealing with Multi-player PE Games with a Single-Superior Evader
In Chapter 6, a new reward function formulation that enables a group of pur-
suers to capture a single-superior evader in a decentralized manner, when all play-
ers have identical speeds, was proposed for the FACLA algorithm. It was used to
direct each pursuer to move to either the interception point with the evader, or in
parallel with it to ensure invariant angle distribution around the evader. Maintain-
ing an invariant angle of distribution around the evader reduces its maneuverability
and forces it toward another pursuer; thus, if the pursuers follow this strategy they
will capture the evader. In addition, there is no need to calculate the capture an-
gle of each pursuer in order to determine its control signal. The simulation results
showed that pursuers could learn their control strategies without knowing the cap-
ture angle. The results also showed how the pursuers learn to cooperate indirectly
and finally capture the evader.
7.3 Future Work
The problem of the multi-robot PE differential game is still an open research field,
particularly from a learning perspective, and there are several ideas to explore in the
near future. These ideas will focus on developing new learning algorithms to solve
the general case of the multi-player PE differential game when there are n pursuers
and m evaders, and some of the evaders are superior. Therefore, the direction of
future work can be organized as follows:
1. Exploring the benefits of learning algorithms and using the ideas in this thesis
will promote the development of learning algorithms for the general case of
PE differential games with more than one superior evader and players with
different capabilities (i.e., different speeds and maneuverabilities). The learn-
ing algorithm will find the best control strategy for all players in the game.
As mentioned in Section 2.7, the problem of multi-player PE games is diffi-
cult to solve due to the curse of dimensionality, and the problem becomes
more complex when there are multiple superior evaders in the game. A new
learning algorithm could decompose the problem into several PE games, such
as one-pursuer one-evader and multi-pursuer single evader; the latter is very
useful in the case of a superior evader. Solving the problem of PE games with
multiple superior evaders is difficult, particularly when there are two or more
superior evaders. For example, if two evaders are initially surrounded by a
number of pursuers, the evaders could collaborate to allow one of them to es-
cape. Future work could potentially address this problem and find a solution.
2. Most of the research on PE games assumes that all the players have a con-
stant velocity. This is usually not the case in real situations, where each player
can increase its maneuverability by accelerating or decelerating. In addition,
if the player is a car-like mobile robot, rollover is a problem closely related
to its motion. Thus, to avoid rollover when the player uses its maneuver-
ability to change the direction of movement, the velocity should also change
according to the rate of change of the motion direction. When a car turns,
centripetal force is generated at its center of mass, and the resulting torque
could cause the car to roll over. Depending on its design, a car rolls over when
the centripetal force exceeds the value Froll = M V²/Rturn, where M, V and
Rturn are the mass, velocity and turning radius of the moving car, respectively.
Thus, a car will not roll over if V²/Rturn is less than a certain factor Kroll, where
Kroll = Froll/M. There is a close relationship between the velocity and the turning
radius, which can be written as follows:

V(t) = √(Rturn Kroll).  (7.1)
From Equation (7.1), it is clear that it is impossible for a car to turn at a spe-
cific velocity without taking the curvature radius into consideration. There-
fore, if we assume that the minimum turning radius of the moving car is Rm
and it moves with velocity Vs(t), the car can turn safely with any turning
radius greater than or equal to Rs(t), where Rs(t) is given by

Rs(t) = max(Rm, V²s(t)/Kroll).  (7.2)
It is obvious that a moving car can rely on either its superior velocity or supe-
rior maneuverability. In PE games, the pursuer typically relies on its superior
velocity, and the evader on its superior maneuverability. Thus, if there are two
cars and one of them can turn with a velocity higher than the other, or it has
a smaller turning radius than the other assuming both have the same velocity,
this car has an advantage.
Therefore, regarding a PE game with such specifications, it would be inter-
esting to investigate how to find a learning algorithm with acceptable perfor-
mance. A small numerical illustration of Equations (7.1) – (7.2) is sketched
below.
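The values used for the minimum turning radius and the rollover factor in this Python sketch are assumed, purely for illustration:

    import math

    def safe_turn_radius(V, R_min, K_roll):
        # Eq. (7.2): the smallest radius the car can take at speed V without
        # rolling over, never below its mechanical minimum R_min.
        return max(R_min, V * V / K_roll)

    def max_safe_speed(R_turn, K_roll):
        # Eq. (7.1): the highest speed for a turn of radius R_turn.
        return math.sqrt(R_turn * K_roll)

    # Assumed values: R_min = 1 m, K_roll = 4 m/s^2.
    print(safe_turn_radius(V=3.0, R_min=1.0, K_roll=4.0))   # 2.25 m
    print(max_safe_speed(R_turn=2.25, K_roll=4.0))          # 3.0 m/s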
3. All the work presented in this thesis is based on computer simulation, and it
would be more interesting if the learning algorithms could also be validated
experimentally. Thus, future work could involve running the proposed learn-
ing algorithms in an experiment with real robots. To do this, an open source
hardware platform called TurtleBot could be used, or any other platform that
can act as a mobile robot. TurtleBot can be programmed and configured using
the Robot Operating System (ROS), with either C++ or python programming
languages. The flexible framework package in ROS is used for writing robot
software, and if a TurtleBot is powered by this software it can manage several
activities, including vision, localization, communication and mobility. For the
proposed learning algorithms, it is assumed that the pursuer instantaneously
knows the position of the evader, and vice versa. Therefore, it would be pos-
sible to use either an OptiTrack system or a stereo camera to determine robot
positions. OptiTrack is comprised of several synchronized infrared cameras,
like the one at the Royal Military College of Canada.
4. For each pursuer a two-actor structure can be proposed for the problem of
the PE games with a superior evader. The first actor is used to determine a
pursuer’s action when the evader moves within the pursuer’s capture area,
while the second actor is used to determine the pursuer’s action when the
evader moves outside the pursuer’s capture area.
5. It is possible to benefit from the concepts of intrinsic motivation [156] and
object-oriented representation [157] for designing intrinsically motivated
object-oriented PE differential games that make the learning process
manageable in large state and action spaces. Each player in this representa-
tion is considered to be an object.
6. Most RL algorithms, including the learning algorithms presented in this
work, adjust the parameters of the learning algorithm manually; that is, the
learning rates, discount factor and exploration rate. Hence, future work
could adjust these parameters using an adaptive method, and thereby make
the learning agent fully autonomous.
7. For the work presented in this thesis, it is assumed that the PE games are
played in an obstacle-free environment. Thus, it would be more interesting to
increase the complexity of the game by adding static or dynamic obstacles.
Therefore, in the future, the learning algorithm should be used to teach each
7.3. FUTURE WORK 180
player how to find its control strategy and at the same time how to avoid
obstacles.
Bibliography
[1] R. Isaacs, Differential games: A Mathematical Theory with Applications to War-
fare and Pursuit, Control and Optimization. Dover books on mathematics,
New York, NY: Dover Publ., 1999.
[2] V. D. Gesu, B. Lenzitti, G. L. Bosco, and D. Tegolo, “Comparison of differ-
ent cooperation strategies in the prey-predator problem,” in the Interna-
tional Workshop on Computer Architecture for Machine Perception and Sensing,
(Montreal, Canada), pp. 108–112, August 2006.
[3] G. Miller and D. Cliff, “Co-evolution of pursuit and evasion I: Biological and
game-theoretic foundations (tech. rep. csrp311),” tech. rep., 1994.
[4] C. Boesch, “Cooperative hunting roles among Tai chimpanzees,” Human Na-
ture, vol. 13, pp. 27–46, March 2002.
[5] D. Araiza-Illan and T. Dodd, “Biologically inspired controller for the au-
tonomous navigation of a mobile robot in an evasion task,” World Academy
of Science, Engineering and Technology, vol. 68, pp. 780–785, August 2010.
[6] S. A. Shedied, Optimal Control for a Two Player Dynamic Pursuit Evasion
Game; the Herding Problem. PhD thesis, Virginia Polytechnic Institute and
State University, January 2002.
[7] H. M. Schwartz, Multi-Agent Machine Learning: A Reinforcement Approach.
John Wiley & Sons, August 2014.
[8] M. Wei, G. Chen, J. B. J. Cruz, L. Hayes, and M. Chang, “A decentral-
ized approach to pursuer-evader games with multiple superior evaders,” in
2006 IEEE Intelligent Transportation Systems Conference, (Toronto, Canada),
pp. 1586–1591, September 2006.
[9] K. M. Passino, “Intelligent control: An overview of techniques,” Perspectives in
Control Engineering: Technologies, Applications, and New Directions, pp. 104–
133, 2001.
[10] H. R. Beom and H. S. Cho, “A sensor-based navigation for a mobile robot
using fuzzy logic and reinforcement learning,” IEEE Trans. on Systems, Man,
and Cybernetics, vol. 25, pp. 464–477, March 1995.
[11] A. Saffiotti, “The uses of fuzzy logic in autonomous robot navigation,” Soft
Computing-A Fusion of Foundations, Methodologies and Applications, vol. 1,
pp. 180–197, December 1997.
[12] A. Saffiotti, E. H. Ruspini, and K. Konolige, “Using fuzzy logic for mobile
robot control,” Practical Applications of Fuzzy Technologies, vol. 6, pp. 185–
205, 1999.
[13] E. Aguirre and A. Gonzalez, “Fuzzy behaviors for mobile robot navigation:
design, coordination and fusion,” International Journal of Approximate Rea-
soning, vol. 25, pp. 255–289, November 2000.
[14] A. Ollero, J. Ferruz, O. Sanchez, and G. Heredia, “Mobile robot path tracking
and visual target tracking using fuzzy logic,” in Fuzzy Logic Techniques for
Autonomous Vehicle Navigation, pp. 51–72, Springer, 2001.
[15] M. Sugiyama, Statistical Reinforcement Learning: Modern Machine Learning
Approaches. Chapman and Hall/CRC, March 2015.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cam-
bridge, MA: MIT Press, 1998.
[17] S. N. Givigi, H. M. Schwartz, and X. Lu, “A reinforcement learning adaptive
fuzzy controller for differential games,” Journal of Intelligent and Robotic
Systems, vol. 59, pp. 3–30, July 2010.
[18] S. F. Desouky and H. M. Schwartz, “Self-learning fuzzy logic controllers
for pursuit-evasion differential games,” Robotics and Autonomous Systems,
vol. 59, pp. 22–33, January 2011.
[19] B. M. Al Faiya and H. M. Schwartz, “Q(λ)-learning fuzzy controller for the
homicidal chauffeur differential game,” in Proc. of the 20th IEEE Mediter-
ranean Conference on Control and Automation (MED), (Barcelona, Spain),
pp. 247–252, July 2012.
[20] S. F. Desouky and H. M. Schwartz, “Q(λ)-learning adaptive fuzzy logic con-
trollers for pursuit-evasion differential games,” International Journal of Adap-
tive Control and Signal Processing, vol. 25, pp. 910–927, October 2011.
[21] X. Lu, Multi-Agent Reinforcement Learning in Games. PhD thesis, Carleton
University, March 2012.
[22] M. D. Awheda and H. M. Schwartz, “A residual gradient fuzzy reinforcement
learning algorithm for differential games,” International Journal of Fuzzy Sys-
tems, vol. 19, pp. 1058–1076, August 2017.
[23] M. D. Awheda and H. M. Schwartz, “Decentralized learning in pursuit-
evasion differential games with multi-pursuer and single-superior evader,”
in Proc. of the Annual IEEE Systems Conference (SysCon), (Orlando, USA),
pp. 1–8, April 2016.
[24] M. Wei, G. Chen, J. B. J. Cruz, L. S. Hayes, M. Chang, and E. Blasch, “A decen-
tralized approach to pursuer-evader games with multiple superior evaders
in noisy environments,” in 2007 IEEE Aerospace Conference, (Big Sky, USA),
pp. 1–10, March 2007.
[25] S. Jin and Z. Qu, “Pursuit-evasion games with multi-pursuer vs. one fast
evader,” in Proc. of the 8th World Congress on Intelligent Control and Automa-
tion (WCICA) 2010, (Jinan, China), pp. 3184–3189, July 2010.
[26] S. M. LaValle, Planning Algorithms. Cambridge University Press, May 2006.
[27] L. A. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, pp. 338–353, 1965.
[28] Y. Bai, H. Zhuang, and D. Wang, Advanced Fuzzy Logic Technologies in Indus-
trial Applications. Springer Science & Business Media, January 2007.
[29] H. Han, C. Y. Su, and Y. Stepanenko, “Adaptive control of a class of nonlinear
systems with nonlinearly parameterized fuzzy approximators,” IEEE Trans.
on Fuzzy Systems, vol. 9, pp. 315–323, April 2001.
[30] R. Coppi, M. A. Gil, and H. A. Kiers, “The fuzzy approach to statistical anal-
ysis,” Computational statistics & data analysis, vol. 51, pp. 1–14, November
2006.
[31] C. Von Altrock and J. Gebhardt, “Recent successful fuzzy logic applications
in industrial automation,” in Proc. of the Fifth IEEE International Conference
on Fuzzy Systems, vol. 3, pp. 1845–1851, September 1996.
[32] M. I. Chacon M, “Fuzzy logic for image processing: definition and applica-
tions of a fuzzy image processing scheme,” Advanced Fuzzy Logic Technologies
in Industrial Applications, pp. 101–113, 2006.
[33] A. Patel, S. K. Gupta, Q. Rehman, and M. Verma, “Application of fuzzy logic
in biomedical informatics,” Journal of Emerging Trends in Computing and
Information Sciences, vol. 4, pp. 57–62, January 2013.
[34] B. Bouchon-Meunier, “Some applications of fuzzy logic in data mining and
information retrieval,” in EUSFLAT Conference, pp. 21–21, September 2005.
[35] S. Mitra and S. K. Pal, “Fuzzy sets in pattern recognition and machine intel-
ligence,” Fuzzy sets and systems, vol. 156, pp. 381–386, December 2005.
[36] A. Salski, “Fuzzy logic approach to data analysis and ecological modelling,”
in Proc. of the European symposium on intelligent techniques (ESIT99), 1999.
[37] Z. Huang, K. Y. Lee, and R. M. Edwards, “Fuzzy logic control application in
a nuclear power plant,” IFAC Proc. Volumes, vol. 35, pp. 239–244, January
2002.
[38] C. T. Leondes, Fuzzy logic and expert systems applications, vol. 6. Elsevier,
1998.
[39] S. N. Sivanandam, S. Sumathi, and S. N. Deepa, Introduction to Fuzzy Logic
using MATLAB, vol. 1. Springer, January 2007.
[40] G. Chen and T. T. Pham, Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy
Control Systems. CRC press, November 2000.
[41] G. Feng, Analysis and Synthesis of Fuzzy Control Systems: A Model-Based Ap-
proach. CRC press, March 2010.
[42] A. Gad and M. Farooq, “Application of fuzzy logic in engineering problems,”
in IECON’01. The 27th Annual Conference of the IEEE Industrial Electronics
Society, vol. 3, pp. 2044–2049, November 2001.
[43] K. Self, “Designing with fuzzy logic,” IEEE Spectrum, vol. 27, pp. 42–44,
November 1990.
[44] K. M. Passino and S. Yurkovich, Fuzzy Control. Addison Wesley Longman,
Inc., 1998.
[45] E. H. Mamdani and S. Assilian, “An experiment in linguistic synthesis with a
fuzzy logic controller,” International Journal of Man-Machine Studies, vol. 7,
pp. 1–13, January 1975.
[46] T. Takagi and M. Sugeno, “Fuzzy identification of systems and its applica-
tions to modelling and control,” IEEE Trans. on Systems, Man, and Cybernet-
ics, vol. SMC-15, pp. 116–132, January 1985.
[47] T. J. Ross, Fuzzy Logic with Engineering Applications. John Wiley & Sons,
2004.
[48] Y. Shi and P. C. Sen, “A new defuzzification method for fuzzy control of power
converters,” in Conference Record of the 2000 IEEE Industry Applications Con-
ference. Thirty-Fifth IAS Annual Meeting and World Conference on Industrial
Applications of Electrical Energy (Cat. No.00CH37129), vol. 2, (Rome, Italy),
pp. 1202–1209, October 2000.
[49] C. Szepesvari, “Algorithms for reinforcement learning,” Synthesis Lectures on
Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.
[50] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A
survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, May
1996.
[51] M. A. Wiering, “QV (λ)-learning: A new on-policy reinforcement learning
algorithm,” in Proc. of the 7th European Workshop on Reinforcement Learning,
vol. 7, (Napoli, Italy), pp. 17–18, October 2005.
[52] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements
that can solve difficult learning control problems,” IEEE Trans. on Systems,
Man, and Cybernetics, pp. 834–846, September 1983.
[53] C. J. C. H. Watkins, Learning from delayed rewards. PhD thesis, King’s College,
Cambridge University, 1989.
[54] G. A. Rummery and M. Niranjan, “On-line Q-learning using connectionist
systems,” Tech. Rep. CUED/F-INFENF/TR 166, University of Cambridge, De-
partment of Engineering, September 1994.
[55] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8,
pp. 279–292, May 1992.
[56] X. Dai, C. K. Li, and A. B. Rad, “An approach to tune fuzzy controllers based
on reinforcement learning for autonomous vehicle control,” IEEE Trans. on
Intelligent Transportation Systems, vol. 6, pp. 285–293, September 2005.
[57] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neu-
ral Information Processing Systems, pp. 1008–1014, 2000.
[58] K. E. Parsopoulos, Particle swarm optimization and intelligence: advances and
applications. IGI global, January 2010.
[59] J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in Proc. of
the IEEE International Conference on Neural Networks, (Perth, Australia),
pp. 1942–1948, November 1995.
[60] R. C. Eberhart and J. Kennedy, “A new optimizer using particle swarm the-
ory,” in Proc. of the Sixth International Symposium on Micro Machine and
Human Science, (Nagoya, Japan), pp. 39–43, October 1995.
[61] J. Kennedy, “The particle swarm: social adaptation of knowledge,” in Proc. of
1997 IEEE International Conference on Evolutionary Computation (ICEC ’97),
(Indianapolis, USA), pp. 303–308, April 1997.
[62] W.-N. Chen, J. Zhang, H. S. Chung, W.-L. Zhong, W.-G. Wu, and Y.-H. Shi, “A
novel set-based particle swarm optimization method for discrete optimiza-
tion problems,” IEEE Trans. on Evolutionary Computation, vol. 14, pp. 278–
300, April 2010.
[63] R. C. Eberhart and Y. Shi, “Particle swarm optimization: developments, ap-
plications and resources,” in Proc. of the 2001 IEEE Congress on Evolutionary
Computation, vol. 1, (Seoul, South Korea), pp. 81–86, 2001.
[64] R. A. Vural, O. Der, and T. Yildirim, “Investigation of particle swarm opti-
mization for switching characterization of inverter design,” Expert Systems
with Applications, vol. 38, pp. 5696–5703, May 2011.
[65] J. Pugh, Y. Zhang, and A. Martinoli, “Particle swarm optimization for un-
supervised robotic learning,” in Proc. of the Swarm Intelligence Symposium,
(Pasadena, USA), pp. 92–99, June 2005.
[66] K. Veeramachaneni, T. Peram, C. Mohan, and L. A. Osadciw, “Optimization
using particle swarms with near neighbor interactions,” in Genetic and evolu-
tionary computation conference, pp. 110–121, Springer, July 2003.
[67] J. Robinson, S. Sinton, and Y. Rahmat-Samii, “Particle swarm, genetic al-
gorithm, and their hybrids: optimization of a profiled corrugated horn an-
tenna,” in Proc. of the IEEE Antennas and Propagation Society International
Symposium, vol. 1, (San Antonio, USA), pp. 314–317, June 2002.
[68] R. C. Eberhart and Y. Shi, “Comparison between genetic algorithms and par-
ticle swarm optimization,” in International conference on evolutionary pro-
gramming, pp. 611–616, Springer, March 1998.
[69] J. Kennedy and W. M. Spears, “Matching algorithms to problems: an experi-
mental test of the particle swarm and some genetic algorithms on the multi-
modal problem generator,” in Proc. of the 1998 IEEE International Conference
on Evolutionary Computation, (Anchorage, USA), pp. 78–83, May 1998.
[70] R. Martinez-Soto, O. Castillo, L. T. Aguilar, and P. Melin, “Fuzzy logic con-
trollers optimization using genetic algorithms and particle swarm optimiza-
tion,” in Mexican International Conference on Artificial Intelligence, pp. 475–
486, Springer, November 2010.
[71] A. Lazinica, Particle swarm optimization. InTech Kirchengasse, January 2009.
[72] D. Bratton and J. Kennedy, “Defining a standard for particle swarm optimiza-
tion,” in Proc. of the IEEE Swarm Intelligence Symposium, pp. 120 – 127, April
2007.
[73] Y. Dai, L. Liu, and Y. Li, “An intelligent parameter selection method for
particle swarm optimization algorithm,” in Proc. of the Fourth IEEE Interna-
tional Joint Conference on Computational Sciences and Optimization, (Yunnan,
China), pp. 960–964, April 2011.
[74] Y. Shi and R. C. Eberhart, “A modified particle swarm optimizer,” in Proc.
of the 1998 IEEE International Conference on Evolutionary Computation, (An-
chorage, USA), pp. 69–73, May 1998.
[75] W.-h. Zha, Y. Yuan, and T. Zhang, “Excitation parameter identification based
on the adaptive inertia weight particle swarm optimization,” Advanced Elec-
trical and Electronics Engineering, pp. 369–374, 2011.
[76] M. Clerc and J. Kennedy, “The particle swarm: Explosion, stability, and con-
vergence in a multi-dimensional complex space,” IEEE Trans. on Evolutionary
Computation, vol. 6, pp. 58–73, February 2002.
[77] Y. Shi and R. C. Eberhart, “Empirical study of particle swarm optimiza-
tion,” in Proc. of the 1999 IEEE Congress on Evolutionary Computation-CEC99,
vol. 3, (Washington, USA), pp. 1945–1950, July 1999.
[78] Y. Shi and R. C. Eberhart, “Parameter selection in particle swarm optimiza-
tion,” in Proc. of the 7th International Conference on Evolutionary Program-
ming VII, pp. 591–600, Springer-Verlag, March 1998.
[79] C.-H. Yang, C.-J. Hsiao, and L.-Y. Chuang, “Linearly decreasing weight parti-
cle swarm optimization with accelerated strategy for data clustering,” IAENG
International Journal of Computer Science, vol. 37, no. 3, p. 1, 2010.
[80] Y. Shi and R. C. Eberhart, “Fuzzy adaptive particle swarm optimization,” in
Proc. of the 2001 IEEE Congress on Evolutionary Computation, vol. 1, (Seoul,
South Korea), pp. 101–106, May 2001.
[81] M. A. Arasomwan and A. O. Adewumi, “On adaptive chaotic inertia weights
in particle swarm optimization,” in Proc. of the 2013 IEEE Symposium on
Swarm Intelligence (SIS), (Singapore, Singapore), pp. 72–79, April 2013.
[82] R. C. Eberhart and Y. Shi, “Comparing inertia weights and constriction fac-
tors in particle swarm optimization,” in Proc. of the 2000 IEEE Congress on
Evolutionary Computation, vol. 1, (La Jolla, USA), pp. 84–88, July 2000.
[83] R. C. Eberhart and Y. Shi, Computational Intelligence: Concepts to Implemen-
tations. Morgan Kaufmann Publishers Inc., 2007.
[84] J. Kennedy, “Some issues and practices for particle swarms,” in Proc. of the
IEEE Swarm Intelligence Symposium, (Honolulu, USA), pp. 162–169, April
2007.
[85] H. M. Choset, S. Hutchinson, K. M. Lynch, G. Kantor, W. Burgard, L. E.
Kavraki, and S. Thrun, Principles of robot motion: theory, algorithms, and
implementation. MIT press, 2005.
[86] G. Bishop and G. Welch, “An introduction to the Kalman filter,” SIGGRAPH
Course Notes, pp. 1–81, 2001.
[87] D. Simon, Optimal state estimation: Kalman, H∞, and nonlinear approaches.
John Wiley & Sons, 2006.
[88] A. W. Merz, “The homicidal chauffeur,” AIAA Journal, vol. 12, pp. 259–260,
March 1974.
[89] S. H. Lim, T. Furukawa, G. Dissanayake, and H. F. Durrant-Whyte, “A time-
optimal control strategy for pursuit-evasion games problems,” in Proc. of the
IEEE International Conference on Robotics and Automation, vol. 4, (New Or-
leans, USA), pp. 3962–3967, April 2004.
[90] M. E. Harmon, L. C. B. III, and A. H. Klopf, “Reinforcement learning applied
to a differential game,” Adaptive Behavior, vol. 4, pp. 3–28, September 1995.
[91] J. W. Sheppard, “Colearning in differential games,” Machine Learning,
vol. 33, pp. 201–233, November 1998.
[92] Y. Ishiwaka, T. Satob, and Y. Kakazu, “An approach to the pursuit problem on
a heterogeneous multiagent system using reinforcement learning,” Robotics
and Autonomous Systems, vol. 43, pp. 245–256, June 2003.
[93] L.-X. Wang and J. M. Mendel, “Fuzzy basis functions, universal approxima-
tion, and orthogonal least-squares learning,” IEEE Trans. on Neural Networks,
vol. 3, pp. 807–814, September 1992.
[94] L.-X. Wang, “Fuzzy systems are universal approximators,” in Proc. of the
IEEE International Conference on Fuzzy Systems, (San Diego, USA), pp. 1163–
1170, March 1992.
[95] H. Ying, “Sufficient conditions on general fuzzy systems as function approx-
imators,” Automatica, vol. 30, pp. 521–525, March 1994.
[96] J. L. Castro, “Fuzzy logic controllers are universal approximators,” IEEE
Trans. on Systems, Man, and Cybernetics, vol. 25, pp. 629–635, April 1995.
[97] J. L. Castro and M. Delgado, “Fuzzy systems with defuzzification are uni-
versal approximators,” IEEE Trans. on Systems, Man, and Cybernetics, Part B
(Cybernetics), vol. 26, pp. 149–152, February 1996.
[98] D. Wang, G. Wang, and R. Hu, “Parameters optimization of fuzzy controller
based on PSO,” in Proc. of the 3rd IEEE International Conference on Intelligent
System and Knowledge Engineering, vol. 1, (Xiamen, China), pp. 599–603,
November 2008.
[99] S. F. Desouky and H. M. Schwartz, “Genetic based fuzzy logic controller for
a wall-following mobile robot,” in Proc. of the 2009 IEEE American Control
Conference (ACC-09), (St. Louis, USA), pp. 3555–3560, June 2009.
[100] S. F. Desouky and H. M. Schwartz, “Hybrid intelligent systems applied to
the pursuit-evasion game,” in Proc. of the IEEE International Conference on
Systems, Man, and Cybernetics, (San Antonio, USA), pp. 2603–2608, October 2009.
[101] E. Omizegba, G. Adebayo, and A. Balewa, “Optimizing fuzzy membership
functions using particle swarm algorithm,” in Proc. of the 2009 IEEE Inter-
national Conference on Systems, Man and Cybernetics, (San Antonio, USA),
pp. 3866–3870, October 2009.
[102] A. Esmin, A. Aoki, and G. Lambert-Torres, “Particle swarm optimization for
fuzzy membership functions optimization,” in Proc. of the 2002 IEEE Inter-
national Conference on Systems, Man and Cybernetics, vol. 3, (Yasmine Ham-
mamet, Tunisia), 6 pp., October 2002.
[103] G. Fang, N. M. Kwok, and Q. Ha, “Automatic fuzzy membership function
tuning using the particle swarm optimization,” in Proc. of the 2008 IEEE
Pacific-Asia Workshop on Computational Intelligence and Industrial Applica-
tion, vol. 2, (Wuhan, China), pp. 324–328, December 2008.
[104] N. Khaehintung, A. Kunakorn, and P. Sirisuk, “A novel fuzzy logic control
technique tuned by particle swarm optimization for maximum power point
tracking for a photovoltaic system using a current-mode boost converter with
bifurcation control,” International Journal of Control, Automation and Sys-
tems, vol. 8, pp. 289–300, April 2010.
[105] Z. Bingul and O. Karahan, “A fuzzy logic controller tuned with PSO for 2 DOF
robot trajectory control,” Expert Systems with Applications, vol. 38, pp. 1017–
1031, January 2011.
[106] R. Rahmani, M. Mahmodian, S. Mekhilef, and A. Shojaei, “Fuzzy logic con-
troller optimized by particle swarm optimization for DC motor speed con-
trol,” in Proc. of the 2012 IEEE Student Conference on Research and Develop-
ment (SCOReD), (Pulau Pinang, Malaysia), pp. 109–113, December 2012.
[107] S. F. Desouky and H. M. Schwartz, “Q(λ)-learning fuzzy logic controller for
a multi-robot system,” in Proc. of the 2010 IEEE International Conference on
Systems, Man, and Cybernetics, (Istanbul, Turkey), pp. 4075–4080, October
2010.
[108] L. Jouffe, “Fuzzy inference system learning by reinforcement methods,” IEEE
Trans. on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
vol. 28, pp. 338–355, August 1998.
[109] W. M. van Buijtenen, G. Schram, R. Babuska, and H. B. Verbruggen, “Adaptive
fuzzy control of satellite attitude by reinforcement learning,” IEEE Trans. on
Fuzzy Systems, vol. 6, pp. 185–194, May 1998.
[110] M. J. Er and C. Deng, “Online tuning of fuzzy inference systems using dy-
namic fuzzy Q-learning,” IEEE Trans. on Systems, Man, and Cybernetics, Part
B (Cybernetics), vol. 34, pp. 1478–1489, June 2004.
[111] H. R. Berenji and P. Khedkar, “Learning and tuning fuzzy logic controllers
through reinforcements,” IEEE Trans. on Neural Networks, vol. 3, pp. 724–
740, September 1992.
[112] H. R. Berenji, “Fuzzy Q-learning: a new approach for fuzzy dynamic pro-
gramming,” in Proc. of the 1994 IEEE 3rd International Fuzzy Systems Confer-
ence, (Orlando, USA), pp. 486–491, June 1994.
[113] P. Y. Glorennec, “Fuzzy Q-learning and dynamical fuzzy Q-learning,” in Proc.
of the 1994 IEEE 3rd International Fuzzy Systems Conference, (Orlando, USA),
pp. 474–479, June 1994.
[114] N. H. Yung and C. Ye, “An intelligent mobile vehicle navigator based on fuzzy
logic and reinforcement learning,” IEEE Trans. on Systems, Man, and Cyber-
netics, Part B (Cybernetics), vol. 29, no. 2, pp. 314–321, 1999.
[115] C. Zhou and Q. Meng, “Dynamic balance of a biped robot using fuzzy rein-
forcement learning agents,” Fuzzy Sets and Systems, vol. 134, pp. 169–187,
February 2003.
[116] C.-K. Lin, “A reinforcement learning adaptive fuzzy controller for robots,”
Fuzzy Sets and Systems, vol. 137, pp. 339–352, August 2003.
[117] C. Ye, N. H. Yung, and D. Wang, “A fuzzy controller with supervised learn-
ing assisted reinforcement learning algorithm for obstacle avoidance,” IEEE
Trans. on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 33, pp. 17–
27, February 2003.
[118] Y. Duan and X.-H. Xu, “Fuzzy reinforcement learning and its application in
robot navigation,” in Proc. of the 2005 IEEE International Conference on Machine
Learning and Cybernetics, vol. 2, (Guangzhou, China), pp. 899–904, August
2005.
[119] X.-S. Wang, Y.-H. Cheng, and J.-Q. Yi, “A fuzzy actor–critic reinforcement
learning network,” Information Sciences, vol. 177, pp. 3764–3781, Septem-
ber 2007.
[120] V. Derhami, V. J. Majd, and M. N. Ahmadabadi, “Exploration and exploita-
tion balance management in fuzzy reinforcement learning,” Fuzzy Sets and
Systems, vol. 161, pp. 578–595, February 2010.
[121] H. Van Hasselt, Reinforcement learning in continuous state and action spaces,
pp. 207–251. Springer, 2012.
[122] M. D. Awheda and H. M. Schwartz, “The residual gradient FACL algorithm
for differential games,” in Proc. of the 28th IEEE Canadian Conference on Elec-
trical and Computer Engineering (CCECE2015), (Halifax, Canada), pp. 1006–
1011, May 2015.
[123] H. Raslan, H. M. Schwartz, and S. N. Givigi, “A learning invader for the
guarding a territory game,” Journal of Intelligent and Robotic Systems, vol. 83,
pp. 55–70, July 2016.
[124] C. V. Analikwu and H. M. Schwartz, “Reinforcement learning in the guarding
a territory game,” in Proc. of the 2016 IEEE International Conference on Fuzzy
Systems (FUZZ-IEEE 2016), (Vancouver, Canada), pp. 1007–1014, July 2016.
[125] X. Lu and H. M. Schwartz, “An investigation of guarding a territory problem
in a grid world,” in Proc. of the American Control Conference (ACC) 2010,
(Baltimore, USA), pp. 3204–3210, June 2010.
[126] M. L. Littman, “Markov games as a framework for multi-agent reinforcement
learning,” in Proc. of the 11th International Conference on Machine Learning,
pp. 157–163, 1994.
[127] K. Doya, “Reinforcement learning in continuous time and space,” Neural
Computation, vol. 12, pp. 219–245, January 2000.
[128] D. Li, J. B. J. Cruz, G. Chen, C. Kwan, and M.-H. Chang, “A hierarchical
approach to multi-player pursuit-evasion differential games,” in Proc. of the
44th IEEE Conference on Decision and Control, (Seville, Spain), pp. 5674–
5679, December 2005.
[129] S. N. Givigi and H. M. Schwartz, “Decentralized strategy selection with learn-
ing automata for multiple pursuer-evader games,” Adaptive Behavior, vol. 22,
pp. 221–234, August 2014.
[130] S. F. Desouky and H. M. Schwartz, “Learning in n-pursuer n-evader differ-
ential games,” in Proc. of the 2010 IEEE International Conference on Systems,
Man, and Cybernetics, (Istanbul, Turkey), pp. 4069–4074, October 2010.
[131] S. F. Desouky and H. M. Schwartz, “A novel hybrid learning technique ap-
plied to a self-learning multi-robot system,” in Proc. of the IEEE International
Conference on Systems, Man, and Cybernetics, (San Antonio, USA), pp. 2690–
2697, October 2009.
[132] M. Wei, G. Chen, J. B. J. Cruz, L. Haynes, K. Pham, and E. Blasch, “Multi-
pursuer multi-evader pursuit-evasion games with jamming confrontation,”
Journal of Aerospace Computing, Information, and Communication, vol. 4,
pp. 693–706, March 2007.
[133] X. Wang, J. B. J. Cruz, G. Chen, K. Pham, and E. Blasch, “Formation control
in multi-player pursuit evasion game with superior evaders,” in Proc. of the
Defense Transformation and Net-Centric Systems, vol. 6578, (Orlando, USA),
p. 657811, International Society for Optics and Photonics, May 2007.
[134] Z. S. Cai, L. N. Sun, and H. B. Gao, “A novel hierarchical decomposition
for multi-player pursuit evasion differential game with superior evaders,” in
Proc. of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computa-
tion, (Shanghai, China), pp. 795–798, June 2009.
[135] F. B. Fu, P. Q. Shu, H. B. Rong, D. Lei, Z. Q. Bo, and Z. Zhaosheng, “Research
on high speed evader vs. multi lower speed pursuers in multi pursuit-evasion
games,” Information Technology Journal, vol. 11, pp. 989–997, August 2012.
[136] M. D. Awheda and H. M. Schwartz, “A decentralized fuzzy learning algo-
rithm for pursuit-evasion differential games with superior evaders,” Journal
of Intelligent and Robotic Systems, vol. 83, pp. 35–53, July 2016.
[137] A. A. Al-Talabi and H. M. Schwartz, “An investigation of methods of pa-
rameter tuning for Q-learning fuzzy inference system,” in Proc. of the 2014
IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2014), (Beijing,
China), pp. 2594–2601, July 2014.
[138] L.-X. Wang, A Course in Fuzzy Systems and Control. Prentice-Hall Press, USA,
1997.
[139] B. M. Al Faiya, “Learning in pursuit-evasion differential games using re-
inforcement fuzzy learning,” Master’s thesis, Carleton University, February
2012.
[140] A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique using
PSO-based FLC and QFIS for the pursuit evasion differential game,” in Proc.
of the 2014 IEEE International Conference on Mechatronics and Automation
(ICMA 2014), (Tianjin, China), pp. 762–769, August 2014.
[141] A. A. Al-Talabi and H. M. Schwartz, “A two stage learning technique for dual
learning in the pursuit-evasion differential game,” in Proc. of the IEEE Sym-
posium Series on Computational Intelligence (SSCI) 2014, (Orlando, USA),
pp. 1–8, December 2014.
[142] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement learning
and dynamic programming using function approximators, vol. 39. CRC Press,
April 2010.
[143] J. F. Schutte, B. I. Koh, J. A. Reinbolt, R. T. Haftka, A. D. George, and B. J.
Fregly, “Evaluation of a particle swarm algorithm for biomechanical opti-
mization,” Journal of Biomechanical Engineering, vol. 127, pp. 465–474, June
2005.
[144] A. A. Al-Talabi, “Fuzzy actor-critic learning automaton algorithm for the
pursuit-evasion differential game,” in Proc. of the 2017 IEEE International
Automatic Control Conference (CACS), (Pingtung, Taiwan), November 2017.
[145] M. D. Awheda and H. M. Schwartz, “A fuzzy reinforcement learning algo-
rithm with a prediction mechanism,” in Proc. of the 22nd IEEE Mediterranean
Conference on Control and Automation (MED), (Palermo, Italy), pp. 593–598,
June 2014.
[146] A. A. Al-Talabi and H. M. Schwartz, “Kalman fuzzy actor-critic learning au-
tomaton algorithm for the pursuit-evasion differential game,” in Proc. of
the 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016),
(Vancouver, Canada), pp. 1015–1022, July 2016.
[147] M. A. Wiering and H. Van Hasselt, “Two novel on-policy reinforcement learn-
ing algorithms based on TD(λ)-methods,” in Proc. of the 2007 IEEE Interna-
tional Symposium on Approximate Dynamic Programming and Reinforcement
Learning (ADPRL 2007), (Honolulu, USA), pp. 280–287, April 2007.
[148] H. Van Hasselt and M. A. Wiering, “Reinforcement learning in continuous
action spaces,” in Proc. of the 2007 IEEE International Symposium on Ap-
proximate Dynamic Programming and Reinforcement Learning (ADPRL 2007),
(Honolulu, USA), pp. 272–279, April 2007.
[149] S. F. Desouky, Learning and Design of Fuzzy Logic Controllers for Pursuit-
Evasion Differential Games. PhD thesis, Carleton University, July 2010.
[150] D. Li and J. B. J. Cruz, “Better cooperative control with limited look-
ahead,” in Proc. of the 2006 American Control Conference, (Minneapolis,
USA), pp. 4914–4919, June 2006.
[151] Z. Xue, “A comparison of nonlinear filters on mobile robot pose estimation,”
Master’s thesis, Carleton University, February 2013.
[152] J. Z. Sasiadek and P. Hartana, “GPS/INS sensor fusion for accurate posi-
tioning and navigation based on Kalman filtering,” IFAC Proceedings Volumes,
vol. 37, pp. 115–120, June 2004.
[153] J. Z. Sasiadek and Q. Wang, “Low cost automation using INS/GPS data fusion
for accurate positioning,” Robotica, vol. 21, pp. 255–260, June 2003.
[154] D.-J. Jwo and F.-C. Chung, “Fuzzy adaptive unscented Kalman filter for ultra-
tight GPS/INS integration,” in Proc. of the 2010 IEEE International Sympo-
sium on Computational Intelligence and Design (ISCID), vol. 2, pp. 229–235,
October 2010.
[155] A. A. Al-Talabi, “Multi-player pursuit-evasion differential game with equal
speed,” in Proc. of the 2017 IEEE International Automatic Control Conference
(CACS), (Pingtung, Taiwan), November 2017.
[156] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg, “Intrinsically motivated
reinforcement learning: An evolutionary perspective,” IEEE Trans. on Au-
tonomous Mental Development, vol. 2, pp. 70–82, June 2010.
[157] C. Diuk, A. Cohen, and M. L. Littman, “An object-oriented representation for
efficient reinforcement learning,” in Proc. of the 25th International Conference
on Machine Learning, pp. 240–247, ACM, July 2008.