A Learning-Based Semi-Autonomous Control Architecture for Robotic Exploration of Search and Rescue
Environments
by
Barzin Doroodgar
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Department of Mechanical and Industrial Engineering
University of Toronto
© Copyright by Barzin Doroodgar 2011
A Learning-Based Semi-Autonomous Control Architecture for
Robotic Exploration in Search and Rescue Environments
Barzin Doroodgar
Master of Applied Science
Department of Mechanical and Industrial Engineering
2011
Abstract
Semi-autonomous control schemes can address the limitations of both teleoperation and fully
autonomous robotic control of rescue robots in disaster environments by allowing cooperation
and task sharing between a human operator and a robot with respect to tasks such as navigation,
exploration and victim identification. Herein, a unique hierarchical reinforcement learning
(HRL)-based semi-autonomous control architecture is presented for rescue robots operating in
unknown and cluttered urban search and rescue (USAR) environments. The aim of the controller
is to allow a rescue robot to continuously learn from its own experiences in an environment in
order to improve its overall performance in exploration of unknown disaster scenes. A new
direction-based exploration technique and a rubble pile categorization technique are integrated
into the control architecture for exploration of unknown, rubble-filled environments. Both
simulations and physical experiments in USAR-like environments verify the robustness of the
proposed control architecture.
Acknowledgments
I would like to thank my supervisor, Prof. Goldie Nejat for her support and constructive input
and guidance throughout this project. I would also like to thank my M.A.Sc. thesis committee for
their time and valuable feedback. In addition, I would like to thank Babak Mobedi for his
development of the 3D sensory system, Onome Igharoro for his development of the HRI
interface, Geoffrey Louie for his development of the victim detection algorithm, and Mina
Salama, John Qi, and Edward Lin for their work on the low-level navigation control of the rescue
robot. Without the aid of these colleagues, the implementation of my project would not have
been possible. Last but not least, I would like to thank my parents and all my friends and lab mates for
their love and support throughout my two years of graduate school.
Table of Contents
Acknowledgments .......................................................................................................................... iii
Table of Contents ........................................................................................................................... iv
List of Tables ................................................................................................................................ vii
List of Figures .............................................................................................................................. viii
List of Appendices ......................................................................................................................... xi
Chapter 1 Introduction .................................................................................................................... 1
1.1 Motivation ........................................................................................................................... 1
1.2 Literature Review ................................................................................................................ 2
1.3 Problem Definition .............................................................................................................. 4
1.4 Proposed Methodology and Tasks ...................................................................................... 5
1.4.1 Literature Review .................................................................................................... 5
1.4.2 Control Architecture and MAXQ-Based Robot Decision Making ......................... 5
1.4.3 Simulations ............................................................................................................. 6
1.4.4 Implementation ....................................................................................................... 6
1.4.5 Conclusion .............................................................................................................. 7
Chapter 2 Literature Review ........................................................................................................... 8
2.1 Semi-Autonomous Control ................................................................................................. 8
2.2 Reinforcement Learning for Robot Control ...................................................................... 10
2.3 Chapter Summary ............................................................................................................. 13
Chapter 3 Proposed Learning-Based Semi-Autonomous Control Architecture ........................... 14
3.1 Semi-Autonomous Control Architecture .......................................................................... 14
3.2 MAXQ-Based Robot Decision Making ............................................................................ 15
3.2.1 MAXQ Learning Technique ................................................................................. 15
3.2.2 Proposed MAXQ Task Graph ............................................................................... 19
3.3 Chapter Summary ............................................................................................................. 42
Chapter 4 Simulations ................................................................................................................... 43
4.1 MAXQ Learning and Q-Value Convergence ................................................................... 43
4.1.1 Simulations Results ............................................................................................... 45
4.2 Performance of Overall Control Architecture ................................................................... 49
4.2.1 Simulations Results ............................................................................................... 52
4.3 Chapter Summary ............................................................................................................. 55
Chapter 5 Experiments .................................................................................................................. 56
5.1 Experimental Set-up .......................................................................................................... 56
5.1.1 HRI User Interface ................................................................................................ 58
5.1.2 Robot Control ........................................................................................................ 59
5.2 Experiment #1: Evaluation of Navigation and Exploration Modules ............................... 61
5.2.1 Experimental Results ............................................................................................ 61
5.3 Experiment #2: Teleoperated vs. Semi-Autonomous Control .......................................... 64
5.3.1 Experimental Results ............................................................................................ 64
5.4 Chapter Summary ............................................................................................................. 68
Chapter 6 Conclusion .................................................................................................................... 69
6.1 Summary of Contributions ................................................................................................ 69
6.1.1 Semi-Autonomous Control Architecture for Rescue Robots ................................ 69
6.1.2 MAXQ Learning Technique Applied to the Semi-Autonomous Control
Architecture ........................................................................................................... 69
6.1.3 Simulations ........................................................................................................... 70
6.1.4 Implementation ..................................................................................................... 71
6.2 Discussion of Future Work ............................................................................................... 71
6.3 Final Concluding Statement .............................................................................................. 72
References ..................................................................................................................................... 73
Appendices .................................................................................................................................... 77
List of Tables
Table 1: Examples of calibrated depth images ............................................................................. 40
Table 2: Exploration coefficients used in the simulations ............................................................ 51
Table 3: Exploration results with terrain information. .................................................................. 54
Table 4: Exploration results without terrain information. ............................................................ 54
Table 5: Experimental depth images and depth profile arrays ..................................................... 62
Table 6: Summary of trial times in teleoperated and semi-autonomous experiments .................. 67
List of Figures
Figure 1: Semi-autonomous control architecture. ......................................................................... 15
Figure 2: MAXQ task graph for a semi-autonomous USAR robot. ............................................. 20
Figure 3: 2D grid map of the four regions surrounding the robot. ............................................... 23
Figure 4: Scenario illustrating the contribution of the exploration coefficient, λx. ...................... 26
Figure 5: Examples of concave obstacles. .................................................................................... 27
Figure 6: Scenarios illustrating the concave obstacle detection technique: (a) Not identified as a
concave obstacle, (b) identified as a concave obstacle. ................................................................ 29
Figure 7: Scenario illustrating the concave obstacle avoidance technique. .................................. 30
Figure 8: Comparison of the robot’s path to avoid a concave obstacle: (a) path taken with a wall-
following technique, (b) path taken using the proposed concave obstacle avoidance technique. 31
Figure 9: Grid representation of the local environment surrounding a robot (the robot’s forward
direction is shown towards cell C2). ............................................................................................. 32
Figure 10: 2D Dxy array for rubble pile profile classification. The cell shading becomes darker
as the cell depth values increase. .................................................................................................. 34
Figure 11: Scenarios illustrating the rubble pile classifications: (a) open space, (b) non-climbable
obstacle, (c) uphill climbable obstacle, (d) downhill, and (e) drop. ............................................. 34
Figure 12: Contribution of the overall slope of the depth profile to rubble pile classification: (a)
open space, (b) non-climbable obstacle, (c) uphill climbable obstacle, (d) downhill. ................. 35
Figure 13: Contribution of the smoothness parameter, R, to the rubble pile classification: (a)
small R indicates a smooth traversable terrain, (b) large R indicates a rough non-traversable
terrain. ........................................................................................................................................... 36
Figure 14: Comparison of traversable open space with a drop in the terrain: (a) open space, (b)
drop where surface is observable in 3D data, (c) drop where surface is not observable in 3D data.
....................................................................................................................................................... 37
Figure 15: Example of using the position of 3D data within the Dxy array to detect non-
traversable small voids in the rubble. ........................................................................................... 37
Figure 16: Calibration plot showing the relationship between the plane parameter B and angle of
an incline. ...................................................................................................................................... 39
Figure 17: Simulation screenshots: robot exploring in randomly generated scene layouts. ......... 44
Figure 18: Convergence of the Root Q-values when exploration is optimal. ............................... 46
Figure 19: Convergence of the Root Q-values when operator’s assistance is optimal. ................ 47
Figure 20: Convergence of the Root Q-values when victim detection is optimal. ....................... 48
Figure 21: The steps taken by the robot to fully explore the simulated environment in the first
200 trials of the offline training. ................................................................................................... 48
Figure 22: Simulated USAR-like scene: (a)-(c) actual scenes, (d)-(f) 2D grid map developed of
the scenes during simulations with robot starting location and heading. ..................................... 50
Figure 23: Number of exploration steps for the robot in Scenes A, B and C. .............................. 52
Figure 24: USAR-like scene: (a) one of three scenes for experiment #1, (b) scene for experiment
#2, (c) rescue robot stuck in rubble pile, (d) victim with heat pad under rubble in experiment #1,
(e) victim on the upper level of the scene in experiment #2, and (f) red cushion detected as false
victim. ........................................................................................................................................... 57
Figure 25: Rugged rescue robot in USAR-like scene. .................................................................. 57
Figure 26: (a) 2D image of robot’s front view, (b) 2D image of robot’s rear view, (c) 3D map,
robot status (green indicates the wheels of the robot are moving), infrared visual feedback on the
robot’s perimeter (if objects are very close red rectangles appear), and control menu display, and
(d) real-time 3D information. ........................................................................................................ 58
Figure 27: Xbox 360 wireless controller used as the operator’s input device. ............................. 60
Figure 28: 2D representation of experimental scenes; (a)-(c) depict the original scene layouts 1,
2, and 3; (d)-(f) show the 2D representation of the scene as determined by the robot during the
experiments. .................................................................................................................................. 62
Figure 29: Number of exploration steps for the robot in scenes 1, 2 and 3. ................................. 63
Figure 30: Percentage of scene explored. ..................................................................................... 65
Figure 31: Number of victims identified. ..................................................................................... 65
Figure 32: Number of collisions. .................................................................................................. 66
List of Appendices
Appendix A ................................................................................................................................... 77
Appendix B ................................................................................................................................... 80
Chapter 1 Introduction
1.1 Motivation
Search and rescue operations in urban disaster scenes are extremely challenging due to the highly
cluttered and unstructured nature of the environments. In addition, in some scenarios the task of
rescuing victims from collapsed structures can be extremely hazardous due to asbestos and dust,
general instability of damaged structures, and, in some cases, the presence of toxic chemicals or
radiation in the environment [1]-[3]. Moreover, rescuing victims from collapsed structures
sometimes requires entering through small voids, which may not be possible for rescue workers
or may be very time consuming (e.g., requiring larger openings to be created by first removing
the rubble) [1]-[3]. To overcome these challenges, mobile robotic systems are being developed to
aid rescue workers in urban search and rescue (USAR) operations. Current applications of rescue
robots require a team of rescue operators to remotely guide robots in the disaster scene, i.e.,
[2],[4]. However, teleoperation of rescue robots while also searching for victims in such complex
environments can be a very stressful task, leading to both cognitive and physical fatigue [5].
Consequently, operators can suffer from reduced alertness and impaired memory and
concentration during these time-critical situations [2].
Human operators can have perceptual difficulties in trying to understand an environment via
remote visual feedback. Two common examples of perceptual difficulties that are caused by
remote perception are the keyhole effect, which causes loss of situational awareness (SA), and
scale ambiguity [6]. The keyhole effect is a result of the limited angular view that is provided by
the visual platform on rescue robots, which creates a sense of trying to understand the
surrounding environment through a small hole [6]. As a result, operators can have a hard time
trying to integrate what they see from the robot’s immersed view with their mental model of the
environment, becoming disoriented and losing situational awareness [2], [6], [7]. Studies have
shown that maintaining good situational awareness is critical; in fact, it has been noted that operators
will stop everything they are doing and spend an average of 30% of their time trying to
acquire/re-acquire SA, even when they are performing a time-sensitive search and rescue task
[8]. The need for high levels of situational awareness in USAR situations can make it difficult
for operators to safely navigate the robot and identify victims. On the other hand, scale
ambiguity deals with the difficulty of recognizing the scale of the environment through remote
visual feedback [6]. An example of this ambiguity was observed in [2]: when mobile robots were
used at the World Trade Center (WTC) to search for victims in the rubble, operators could
not perceive from the remote visual information whether a rescue robot was small enough to pass
through an opening or climb over a rubble pile. Due to the mental and physical overload an
operator can experience when having to perform the entire search task using teleoperated robots,
a minimum of two rescue personnel are recommended to control a single robot [4].
An alternative to teleoperated control is to use autonomous controllers for rescue robots, which
eliminates the need for constant human supervision. However, deploying a fully autonomous
rescue robot in a USAR scene requires addressing a number of challenges. Firstly, rescue
personnel do not generally trust an autonomous robot to perform such critical tasks without any
human supervision [3]. Secondly, rescue robots have very demanding hardware and software
requirements unlike any other robotic applications. Namely, autonomous navigation can be very
difficult to achieve due to the rough terrain conditions of these environments. Moreover, victim
identification can be particularly challenging due to the presence of dust, debris, poor lighting
conditions and extreme heat sources in the disaster environment [1], [3].
To address the challenges and limitations of both teleoperated and fully autonomous control of
rescue robots in USAR environments, recent efforts have focused on developing semi-
autonomous controllers that allow task sharing between rescue robots and human operators [9]-
[12]. Not only does task sharing reduce the stress and mental workload of operators, it also
allows a rescue robot to benefit from a human operator’s experience and knowledge.
1.2 Literature Review
Teleoperated rescue robots have been utilized in disaster scenes such as the WTC site in 2001,
earthquake in Niigata, Japan in 2004, and Hurricanes Katrina, Rita, and Wilma in the United
States in 2005 [13]. Rescue robots were used again recently to inspect the collapse structures in
the aftermath of the 8.9 magnitude earthquake that hit Japan in March of 2011 [14]. At the WTC
disaster site, out of the ten rescue robots proposed only three were qualified to inspect the
disaster environment and look for victims: the MicroTracs, MicroVGTV, and Solem [2]. All
three robots were remotely controlled by rescue operators; MicroTracs and MicroVGTV were
tethered while Solem was a wireless robot. While in theory all three robots could be operated by
one person, in practice it took two rescue operators to operate each robot. One person controlled
the robot, while the other kept the tether or safety rope from tangling. Even though the
MicroTracs and MicroVGTV had two-way audio capabilities, the operators could only use the
visual feedback since the audio headsets and protective headgear could not be used at the same
time. Thus, the operators were forced to use the limited robot’s eye view to navigate, explore and
find victims. In addition, the robots lacked mapping and localization capabilities, as well as
sufficient sensory capabilities. On average, the robots got stuck 2.1 times per minute,
requiring the operators to assist them with limited information about the robot’s status and
environment [2]. Communication dropouts were also a problem with the Solem robot, which had
a total of 21 communication dropouts in its seventh deployment [2]. Furthermore, lack of sleep,
stress and cognitive fatigue also affected the operators’ abilities to perform at their best.
The issues observed with teleoperated rescue robots have encouraged recent efforts to enhance
the autonomous capabilities of rescue robots, allowing a robot to take over some of the rescue tasks. Autonomous
rescue robots presented in [15]-[17] have autonomous capabilities in navigation,
localization and mapping, and victim identification. The IUB Rugbot presented in
[15] is a wireless tracked robot that has autonomous mapping capabilities based on a laser range
finder and Simultaneous Localization and Mapping (SLAM). The robot also utilizes an IR
camera and a shape recognition algorithm based on thermal images. Gullfaxe [16] is a wheeled
wireless robot that captures distance and angle readings from the surrounding
obstacles, using an array of ultrasonic sensors to generate a map of the environment. The robot
also utilizes its ultrasonic sensors as well as infrared sensors for autonomous localization and
obstacle avoidance. Furthermore, two pyro-electric sensors are utilized to capture heat emissions
from victims and a microphone is used to localize the victim with respect to the robot. Although
victim identification using heat signatures in [15] and [16] may be possible in some
environments, such as controlled rescue competition arenas, external heat sources, e.g. fire,
make it very difficult for IR cameras to locate victims in real disaster scenes. In addition, the
audio signal captured by microphones can be extremely noisy due to weak signals and external
noises in cluttered USAR environments [1], [3].
The robot CASTER [17] is also a rugged wireless tracked robot with autonomous capabilities.
The robot utilizes two laser scanners, a range camera, a heading/altitude sensor, and a 2D camera
for achieving autonomous navigation and localization. Autonomous victim identification is also
achieved via an IR camera, 2D cameras, a range camera and a microphone. CASTER was tested
in the 2005 RoboCup Rescue competitions, helping its team take third place.
Although some of the autonomous capabilities, e.g. victim identification, performed well during
the competitions, the robot’s heavy platform caused it to unintentionally push obstacles and stairs
around. The robot was also reported to flip once and permanently get stuck in two rounds of the
competition [17].
Due to the harsh and unpredictable nature of USAR environments, fully autonomous navigation
and victim identification in these environments are extremely challenging to achieve. The
operator’s knowledge and experience is still needed to assist the robots in unpredictable
situations. Semi-autonomous control architectures create cooperative control between the robot
and the human operator by sharing tasks between the two entities. Only a handful of semi-
autonomous controllers have been developed to date specifically for search and rescue robots to
share the navigation and localization, exploration, and victim identification tasks [9]-[12].
However, none of these control schemes incorporate learning into the decision making module of
their control architecture.
1.3 Problem Definition
The highly cluttered nature of the USAR environments and the communication limitations of
urban disaster scenes make teleoperation of rescue robots very difficult. In addition, human
operators can become disoriented, stressed and fatigued very quickly in USAR environments,
causing crucial errors in control and victim identification. Furthermore, robots operating in
these environments have no a priori information about landmarks in the scene, which makes
it extremely difficult for robots to autonomously navigate the scenes and identify victims.
Although current semi-autonomous control schemes simplify the tasks of the human operator
and the rescue robot, the level of autonomy of the robot is generally manually set by the
operator. In addition, none of these semi-autonomous control schemes incorporate learning into
their control architecture to deal with unknown and highly unpredictable USAR environments.
The focus of this thesis is to develop, for the first time, a learning-based semi-autonomous
control architecture for robotic exploration of unknown search and rescue environments. The
control architecture is to provide the robot with the decision making ability to autonomously
navigate an unknown cluttered disaster scene and identify victims, as well as to learn how to
share the tasks with the human operator for optimal results.
1.4 Proposed Methodology and Tasks
This thesis focuses on the development of a hierarchical reinforcement learning (HRL)-based
semi-autonomous control architecture for a rescue robot for the application of exploration and
victim identification in cluttered environments. In particular, the HRL control algorithm allows a
rescue robot to learn and make decisions regarding which tasks should be carried out at a given
time and whether the human or the robot should perform these tasks for optimum results. By
giving the robot the decision making ability to decide when human intervention is required, the
human operator can take advantage of the robot’s ability to continuously learn from its
surrounding environment.
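The decision step described above — the robot weighing whether to act autonomously or to request human intervention — can be sketched as a greedy selection over learned action values. This is an illustrative sketch only; the function, state, and action names are hypothetical assumptions, not taken from the thesis implementation.

```python
# Hypothetical sketch: the robot picks the subtask with the highest learned
# value for the current state; if requesting operator assistance has the
# highest value, control is handed to the human. All names and numbers
# below are illustrative placeholders.

def choose_action(q_values, state, assist_action="request_operator"):
    """Greedily select the best-valued subtask for this state.

    Returns a (who, action) pair: 'operator' if the assistance action
    wins, otherwise 'robot' acting autonomously.
    """
    actions = q_values[state]              # dict: action -> Q(state, action)
    best = max(actions, key=actions.get)   # greedy selection over Q-values
    return ("operator", best) if best == assist_action else ("robot", best)

# Example: in a state where the robot is stuck, assistance has the highest
# learned value, so control passes to the operator.
q = {"stuck_in_rubble": {"navigate": -2.0, "explore": -1.5, "request_operator": 3.0}}
print(choose_action(q, "stuck_in_rubble"))  # → ('operator', 'request_operator')
```

In the actual architecture this choice is made within a learned MAXQ hierarchy rather than a flat table; the sketch only conveys the shared-control decision itself.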
The proposed methodology and tasks for the design of the HRL-based semi-autonomous control
architecture consist of the following sections:
1.4.1 Literature Review
Chapter 2 provides a detailed literature review on the two main features of the proposed control
architecture: semi-autonomous control and reinforcement learning for robotic control. The first
section of the literature review focuses on the implementation of semi-autonomous control for
mobile robots. An overview of the advantages of semi-autonomous control over fully
teleoperated control of robots is presented. Then a more detailed discussion about semi-
autonomous control schemes is provided, which categorizes the controllers into fixed- and
variable-autonomy approaches. Lastly, semi-autonomous control schemes implemented for the specific
application of search and rescue are discussed. The second section of the literature review
provides an overview on reinforcement learning as applied to mobile robotic navigation and
exploration. This section discusses traditional reinforcement learning, hybrid reinforcement
learning, and hierarchical reinforcement learning as applied to robotic control.
1.4.2 Control Architecture and MAXQ-Based Robot Decision Making
In Chapter 3, an overview of the semi-autonomous control architecture as well as the design of
the MAXQ task hierarchy is presented. The MAXQ task hierarchy decomposes the overall task
of search and rescue into three main subtasks of exploration, navigation and victim
identification. The individual subtasks making up the hierarchy are defined and discussed in
detail. This chapter focuses on the hierarchy definition as well as other techniques that are
integrated into the MAXQ hierarchy including: (i) a direction-based exploration technique
designed for exploration of a rescue robot in cluttered and rough terrain environments, and (ii) a
rubble pile categorization technique that utilizes 3D information to group different terrain
conditions into multiple categories that facilitate navigation in USAR-like environments.
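The task decomposition described above can be illustrated with a minimal task-graph sketch. Only the three top-level subtasks (exploration, navigation, victim identification) come from the text; the class layout and the primitive actions beneath each subtask are hypothetical placeholders, not the thesis's actual hierarchy.

```python
# Minimal sketch of a MAXQ-style task graph: a root task decomposed into
# subtasks, with primitive actions at the leaves. Subtask and leaf names
# below the root are illustrative assumptions.

class Task:
    def __init__(self, name, children=None, primitive=False):
        self.name = name
        self.children = children or []   # subtasks one level down
        self.primitive = primitive       # True for leaf (primitive) actions

    def leaves(self):
        """Return all primitive actions reachable from this task."""
        if self.primitive:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]

# Root task split into the three subtasks named in the text; the primitive
# actions are hypothetical placeholders.
root = Task("SearchAndRescue", [
    Task("Explore", [Task("pick_direction", primitive=True)]),
    Task("Navigate", [Task("move", primitive=True),
                      Task("classify_rubble", primitive=True)]),
    Task("IdentifyVictim", [Task("scan_for_victim", primitive=True)]),
])

print(root.leaves())  # → ['pick_direction', 'move', 'classify_rubble', 'scan_for_victim']
```

In MAXQ, each node of such a graph carries its own value decomposition; the sketch shows only the structural decomposition that Chapter 3 defines in detail.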
1.4.3 Simulations
Chapter 4 presents the numerous simulations that were performed to test the performance of the
proposed MAXQ-based control architecture. Two sets of simulations were designed to evaluate
the overall capabilities of the developed control architecture. The first set of simulations was
designed to verify the performance of the overall MAXQ task hierarchy as well as to collect Q-
values and test for convergence of the Q-values at different levels of the hierarchy. Furthermore,
these simulations were used as offline training for the learning algorithm. The second set of
simulations was focused on evaluating the performance of the navigation and exploration
modules in several large-scale scene layouts with various degrees of complexity. The simulation
scene layouts were designed to test the robustness of the navigation and exploration modules in
larger and more complex scenes than could be tested in the available laboratory space.
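The Q-value convergence testing mentioned above can be sketched as a simple check that successive Q-value snapshots stop changing between training trials. The tolerance and the sample data are illustrative assumptions, not values from the simulations.

```python
# Hedged sketch: training is considered converged when the largest change
# in any Q-value between successive trials falls below a tolerance.
# The threshold and sample Q-tables are illustrative.

def max_q_change(q_old, q_new):
    """Largest absolute change across all (state, action) entries."""
    return max(abs(q_new[k] - q_old[k]) for k in q_new)

def has_converged(history, tol=1e-3):
    """history is a list of Q-table snapshots, one per training trial."""
    return len(history) >= 2 and max_q_change(history[-2], history[-1]) < tol

q_trials = [
    {("s0", "explore"): 0.0},
    {("s0", "explore"): 0.9},
    {("s0", "explore"): 0.9005},
]
print(has_converged(q_trials))  # → True (final change 0.0005 < 0.001)
```

A check of this form is what allows offline training trials to double as a convergence test at each level of the hierarchy.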
1.4.4 Implementation
Chapter 5 discusses the extensive experiments that were performed in order to evaluate the
proposed HRL-based semi-autonomous controller in a USAR-like experimental test bed. This
chapter discusses the experimental set-up including the design of the USAR-like environment
and an overview of the rescue robot and its sensing abilities. Two experiments are presented
which evaluate the performance of the HRL-based controller. The first set of experiments focuses
on the evaluation of the navigation and exploration modules, which give the rescue robot the
ability to safely navigate and explore an unknown USAR-like environment, thus providing the
opportunity to find victims in the scene. The second set of experiments focuses on the comparison
between the overall performance of the rescue robot in teleoperated (fully manual) and semi-
autonomous modes of control.
1.4.5 Conclusion
Chapter 6 presents the concluding remarks on the development and evaluation of the HRL-based
semi-autonomous control architecture and highlights the main contributions of the thesis as well
as a discussion of future work.
Chapter 2 Literature Review
2.1 Semi-Autonomous Control
To date, the semi-autonomous controllers that have been developed for mobile robotic
applications have mainly focused on the shared control of navigation and obstacle avoidance in
varying environments. These semi-autonomous control schemes can be categorized based on the
level of autonomy of the robot. In particular, control schemes in which the robot’s autonomy is
fixed focus on low level tasks such as collision avoidance and motor control, allowing the human
operator to concentrate on high level control and supervisory tasks which include path planning
and task specification [18]-[20]. Compared to fully teleoperated control of robots, these semi-
autonomous controllers have been shown to be more effective for robot navigation and obstacle
avoidance. Furthermore, they allow more neglect time for the operator, where neglect time is defined as the amount of time an operator can neglect a robot while attending to other tasks, such as supervising other robots in a team, reviewing remote visual information, or communicating with other rescue operators [18]. However, these semi-autonomous controllers
with fixed autonomy levels lack the flexibility that is needed for more challenging problems such
as control of rescue robots in rubble-filled USAR environments. For example, in the case of a
robot getting physically stuck in a cluttered disaster environment, the human operator may have
to take over the control of low level operations in order to assist the robot. Similarly, high level
control may be required from the autonomous robot controller when the operator cannot perform
these tasks due to task overload or communication dropout. To address these concerns, semi-
autonomous controllers with variable autonomy can be used. Namely, these controllers can
provide an operator with the option to adjust the level of autonomy of the robot based on the task
at hand, i.e., [21]-[24]. For example, in [23], an operator is given the choice of four different
levels of autonomy, ranging from fully teleoperated control, which gives full control of the
robot to the operator, to a semi-autonomous mode where most of the low level operations such as
motor control and collision avoidance as well as high level operations such as path planning are
assigned to the autonomous robot controller, while the operator only specifies the overall goal
tasks (e.g. search in a particular region). It was concluded that this latter mode resulted in the
best performance in terms of minimizing the number of collisions and decreasing the workload
of the operator when compared to lower levels of robot autonomy.
Recently, a handful of semi-autonomous controllers have been specifically developed for control
of rescue robots in USAR applications, focusing on the two main tasks of robot navigation and
victim identification [9]-[12]. In particular, the semi-autonomous controller presented in [9] can
be set by the operator to perform in Safe and Shared Modes. Safe Mode is similar to teleoperated
control, in that all navigation and victim/object detection tasks are performed by the operator; however, the robot can intervene to prevent collisions with obstacles. In Shared Mode, the robot
provides the optimal path for navigation based on the operator’s directional inputs. The results
showed that navigational autonomy in Shared Mode resulted in a much better performance in
finding pre-specified objects (including victims) in the scene. Although [9] provides the operator
with the opportunity to set the level of autonomy of the controller beforehand, an operator is not
able to change this level of autonomy on the fly during a search and rescue operation, which may
be needed in unknown environments if either the robot or operator face a situation where they
need assistance from each other. This could occur, for example, in a scenario where low level
navigation and obstacle avoidance are assigned to the robot and the high level task of goal
assignment is given to the human operator; if the robot gets stuck, the operator may need to assist
the robot by taking over the navigation task in order to get the robot out. Similarly, if the
operator has lost his/her situational awareness, the robot can provide location information and an
appropriate search direction.
On the other hand, controllers presented in [10]-[12] solve this problem by providing on the fly
adjustments of a robot’s level of autonomy either by the operator [10],[11] or automatically by
the controller [12] during USAR operations. For example, in [10], a semi-autonomous control
scheme was presented where a robot could be operated in either manual mode or semi-
autonomous mode. In semi-autonomous mode, the operator was in charge of performing a
semantic search of the scene which included dividing a large search area into smaller regions
and ordering these regions by parameters such as a region's proximity to the robot and the likelihood of finding a victim there. On the other hand, the robot was in charge of handling the more
routine and tedious systematic search, in which it searched for victims autonomously, alerted the
operator when a potential victim was found and waited for operator confirmation. In addition to
semantic and systematic search, another category of search known as opportunistic search was
also introduced, where a robot was able to operate on a reduced detection function on incoming
data without distracting the operator or slowing the navigation control. Although certain tasks
were assigned to both the operator and robot, the operator could take over the control of the robot
at any given time during the operation. In [11], a semi-autonomous controller was presented for
control of a team of rescue agents in USAR environments. The system allowed an operator's
commands to be weighted with the agents’ own autonomous commands using a mediator. In
addition to the mediator, an intervention recognition module was also utilized to identify
situations during a rescue operation where human intervention is required, such as: 1) when agents got stuck, 2) when agents got lost or were unable to complete their goals, or 3) when a potential victim was found and needed to be verified. In [12], three semi-autonomous control modes were
proposed for a rescue robot to distribute tasks between an operator and robot: i) planning-based
interaction, ii) cooperation-based interaction, and iii) operator-based interaction. In planning-
based interaction, the planning system generated a cyclic sequence of actions to be executed by
the operator and rescue robot. In cooperation-based interaction, the operator could modify the
sequence generated by the planner, and lastly, in operator-based interaction, an approach similar
to teleoperation was taken where the system only observed the operator’s actions and notified the
operator if any critical safety problems occurred. The specific mode to implement was
determined automatically by the robot based on the manner in which the operator was interacting
with the overall system during a USAR operation. For example, the interaction was adjusted
from operator-based to planning-based if the operator was idle for a certain period of time.
The aforementioned research highlights the recent efforts in semi-autonomous control for USAR
applications. However, none of the aforementioned controllers incorporate learning into their
control scheme to deal with unknown and unpredictable USAR environments. In this thesis, a
learning-based semi-autonomous controller is proposed to allow a robot to learn from its own
previous experiences in order to better adapt to unknown and cluttered disaster environments.
2.2 Reinforcement Learning for Robot Control
Reinforcement Learning (RL) algorithms have previously been used to address navigation and
obstacle avoidance problems for mobile robots [25]-[30]. In particular, RL techniques have been
utilized to learn a robot’s optimal behavior by observing the outcome of its actions as it interacts
with its environment. The most commonly used RL technique in mobile robotic applications is
Q-learning. In Q-learning, a mapping is learned from a state-action pair to a value called Q [31].
The mapping represents the reward of performing a specific action in a particular state. The
advantage of this approach is that it is model-free and exploration insensitive (the Q-values will converge to their optimal values independent of the robot's behavior during data collection). Although
RL algorithms display advantages over many alternative learning algorithms, one drawback is
that they treat the entire state space as one large search space. Hence, for larger-scale and complex robotic applications, the number of Q-values that must be learned (i.e., the size of the Q-table) grows exponentially with the size of the state space, significantly increasing the time required for learning. To address this limitation, researchers have
developed a number of hybrid RL techniques in order to reduce the size of the state space [25]-
[27]. For example, in [25], a fuzzy inference system (FIS) and Q-learning were used together for
control of a mobile robot in goal seeking and obstacle avoidance tasks. To speed up the learning
process, a fuzzy controller was used to generalize the state space. Fuzzy rules were defined to
locally represent a set of input variables. When a new state (a new combination of input variables) was encountered, the fuzzy rules competed over the new condition. If no close match to the existing rules was found, a new rule was generated that determined the action to take at that step, and a corresponding Q-value was initialized. This Q-value was then updated with further occurrences of the state-action pair. In [26], rough sets were utilized in conjunction with Q-learning for the navigation and obstacle avoidance of a mobile robot in order to speed up the
learning process of the controller. Herein, rough sets were used to reduce the effective state
space of the system by mapping the continuous state space onto a discrete set of states. Namely,
the states were grouped into predefined sets based on the distances of obstacles and the goal to
the robot. As a result, by having fewer states the overall learning time was considerably
shortened when compared to a traditional Q-learning technique. In [27], a neural network was
used with Q-learning for robotic obstacle avoidance. The overall approach was shown to train
the system in a shorter time than a traditional Q-learning technique. Utilizing neural networks
with Q-learning not only provides a more compact way of storing the Q values than a look up
table, but it can also be used to interpolate Q values for states that have not yet been visited.
Although the techniques presented in [25]-[27] have shown improvements over traditional RL
techniques, they cannot be as easily applied to more complex and larger-scale problems such as
control of rescue robots in USAR scenes. In particular, these techniques have been applied to
robot navigation and obstacle avoidance problems in highly structured simulated environments.
In addition, in both [25] and [26], prior knowledge of the robot's desired locations within the small simulated environment was needed before implementation. Overall, all of the aforementioned hybrid RL techniques attempt to reduce the entire search space by interpolating Q-values or generalizing states. Thus, even though these methods work well for simple robotic applications, they cannot fully represent the states of more complex robotic applications and environments, which can lead to inaccurate learning of the system.
In addition to the hybrid RL techniques, hierarchical RL (HRL) approaches have also been
developed to deal with some of the drawbacks of traditional RL methods, i.e., [32]-[34].
Contrary to hybrid RL techniques that reduce the entire state space by interpolating Q-values or
generalizing a set of states into a single representative state, HRL techniques reduce the number
of state-action pairs that need to be learned by using different types of abstractions.
Namely, HRL consists of techniques that utilize various types of abstractions to deal with the
exponentially growing number of parameters that need to be learned in large-scale problems by
reducing the overall task into a set of individual subtasks for faster learning [35]. Therefore, in
the hierarchical representation, the number of state-action pairs that need to be learned is reduced without compromising the accuracy of the state representation.
To date, there have only been a handful of cases where HRL has been applied to mobile robotic
applications [28]-[30]. In this work, the use of the MAXQ HRL method within the context of a
semi-autonomous controller for the robot exploration and victim identification problem in USAR
environments is explored for the first time. In USAR applications, the environments of interest
are unknown and cluttered, increasing the complexity of the learning problem. By utilizing a
MAXQ approach, the overall search and rescue task can be reduced into smaller more
manageable subtasks that can be learned concurrently. MAXQ has fewer constraints on its
policies, i.e., mapping of states to possible actions, and thus, is generally known to require less
prior knowledge about its environment than other commonly used HRL approaches such as
Options and Hierarchical Abstract Machines (HAMs) [34]. This is advantageous when dealing
with unknown USAR environments. In addition, MAXQ can support state, temporal and subtask
abstraction. State abstraction is important because, when a robot is navigating to a particular location, only the target location matters; the reason for navigating there is irrelevant and should not affect the robot's actions. The need for temporal abstraction exists in
this application since actions may take varying amounts of time to execute depending on the
complexity of the scene and the location of the robot within the scene. Subtask abstraction is
necessary because it allows subtasks to be learned only once; the solution can then be shared by
other subtasks.
2.3 Chapter Summary
The first section of the literature review presented an overview of the semi-autonomous control
schemes that have been developed for mobile robots. These can be categorized into semi-
autonomous controllers with either fixed or variable autonomy. Furthermore, recent
implementations of semi-autonomous control schemes for search and rescue robots were
discussed. The second section of the literature review focused on the application of
reinforcement learning techniques in robotic control. Traditional reinforcement learning, hybrid
reinforcement learning, and hierarchical reinforcement learning were discussed and compared.
Chapter 3 Proposed Learning-Based Semi-Autonomous Control Architecture
3.1 Semi-Autonomous Control Architecture
The proposed semi-autonomous control architecture is depicted in Figure 1. Sensory information is provided by: (i) a 3D mapping sensor providing 2D and 3D images of the environment, (ii) a thermal camera providing heat signatures from the surrounding environment, (iii) infrared sensors that provide proximity information around the robot, and (iv) 2D visual feedback for
the HRI interface. The SLAM module uses the 2D and 3D images provided by the 3D mapping
sensor to identify and match 3D landmarks in the scene and create a 3D global map of the
environment. For more details on the 3D sensor and the SLAM technique used, the reader is
referred to [36]-[39].
The Victim Identification module provides the control architecture with the ability to identify
partially obstructed victims in a cluttered USAR scene. This module utilizes thermal images
and/or a victim detection algorithm based on 2D and 3D images provided by the 3D mapping
sensor. In the latter victim identification method, colored 2D images are used for skin detection
while 3D images are used for detecting victim body parts in the scene. For more information
about the victim identification technique used, the reader is referred to [40]. The HRI interface
module provides the operator with the user interface needed for human control. The interface
allows the operator to obtain sensory information from the environment and the robot as well as
the 3D map provided by the SLAM module in order to monitor or control the robot’s motion. It
also provides the operator with potential victims in the scene for confirmation. The Deliberation
module is where the learning and decision making of the robot during semi-autonomous
deployment occurs. Real-time sensory data from the mapping sensor, the victim identification
technique as well as the generated 3D map of the explored environment (provided by the SLAM
module) are utilized as inputs into the Deliberation module. Since the robot is designed to be
semi-autonomous, it is within this module that the robot primarily decides its level of autonomy.
Namely, MAXQ is used to allow the robot to learn and make decisions regarding which tasks
should be carried out at a given time and who (the robot or human) should perform these tasks
for optimal results. If human control is prominent, then the decision making within this module
is made by the human operator via the HRI interface module. Lastly, the robot actuators module
consists of the robot’s motors and motor control boards. Within this module, robot actions
provided by the Deliberation module are translated into appropriate motor signals. The following
sub-section presents the detailed design of the MAXQ HRL technique utilized by the
Deliberation module for robot intelligence in order to achieve semi-autonomous control.
Figure 1: Semi-autonomous control architecture.
3.2 MAXQ-Based Robot Decision Making
In this work, the use of the MAXQ HRL method within the context of a semi-autonomous
controller is explored for the robot exploration and victim identification problem in USAR
environments. Namely, MAXQ will be used in the Deliberation module of the controller to solve
the overall problem of finding victims trapped in a cluttered unknown environment by having the
robot explore the disaster scene. The robot will also have the ability to hand over control to a
human operator whenever the operator can perform a function more efficiently.
3.2.1 MAXQ Learning Technique
The MAXQ learning technique attempts to create an optimal policy for an unknown Markov
Decision Process (MDP). Thus, before discussing the details of the MAXQ learning technique, the concept of an MDP and an overview of reinforcement learning are first presented.
3.2.1.1 Markov Decision Process
An MDP provides a framework for modeling a decision-making algorithm. Assuming that the
decision making agent is interacting with a fully-observable environment, the MDP models this
interaction as a five-tuple ⟨S, A, P, R, P0⟩ defined as follows [41]:
S: A finite set of states of the environment fully observable by an agent.
A: A finite set of actions available to an agent.
P: The state-transition probability function P(s' | s, a), which gives the probability of transitioning from the current state s to the next state s' by executing an action a ∈ A.
R: The expected real-valued reward R(s, a, s') received by the agent for executing an action a in state s and transitioning to the next state s'.
P0: The starting probability distribution. P0(s) is the probability of starting in state s when the
MDP is initialized.
To utilize the MDP model for a decision-making process, a policy π : S → A must be developed, which tells the agent which action a = π(s) to take in any given state s of the environment. The
objective is to find an optimal policy which maximizes the cumulative reward received over the
entire process.
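To make this concrete, the sketch below encodes a small hypothetical MDP (the states, actions, and rewards are illustrative, not from the thesis) and extracts an optimal policy by value iteration, one standard way of solving a known MDP:

```python
# Hypothetical two-state, two-action MDP (illustrative values only).
S = ["safe", "stuck"]
A = ["advance", "retreat"]
# P[s][a] = list of (next_state, probability) pairs.
P = {
    "safe":  {"advance": [("safe", 0.8), ("stuck", 0.2)],
              "retreat": [("safe", 1.0)]},
    "stuck": {"advance": [("stuck", 1.0)],
              "retreat": [("safe", 0.6), ("stuck", 0.4)]},
}
# R[(s, a, s_next)] = expected immediate reward for that transition.
R = {("safe", "advance", "safe"): 1.0, ("safe", "advance", "stuck"): -2.0,
     ("safe", "retreat", "safe"): 0.0, ("stuck", "advance", "stuck"): -1.0,
     ("stuck", "retreat", "safe"): 0.5, ("stuck", "retreat", "stuck"): -1.0}

def value_iteration(gamma=0.9, iters=200):
    """Compute an optimal policy pi: S -> A via the Bellman optimality update."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[s][a]) for a in A) for s in S}
    pi = {s: max(A, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                      for s2, p in P[s][a])) for s in S}
    return pi, V

pi, V = value_iteration()
```

Here the optimal policy advances while the robot is safe and retreats when it is stuck; RL, discussed next, recovers such a policy without knowing P and R in advance.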
3.2.1.2 Reinforcement Learning
Reinforcement Learning (RL) comprises algorithms that attempt to build an optimal policy for an unknown MDP. An RL algorithm is a continuing process of observing the current state s and executing an action a = π(s) according to a policy π, which takes the agent into the next state s'. The agent receives an immediate reward r for its action, which is used as feedback to the system for improving future policies.
As previously mentioned, the most commonly used RL algorithm is Q-Learning. Q-Learning
attempts to learn a state-action pair Q(s,a) value for every state, s, and action, a, combination.
This value is the maximum discounted reward that can be obtained by executing action, a, in
state, s, and then following an optimal policy [31]. The state-action value is estimated by taking
an action, a, in state, s, receiving an immediate reward, r, and observing the next state, s’, and
repeatedly updating its value with the following equation [31]:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ],   (1)

where α (0 < α < 1) is the time-varying learning rate, and γ (0 < γ < 1) is the discount factor which diminishes the influence of future rewards.
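The update in Eq. (1) translates directly into a few lines of code; the sketch below is a generic illustration (the state/action names and parameter values are assumptions, not the thesis's implementation):

```python
from collections import defaultdict

# Q-table: unseen (state, action) pairs default to 0.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One application of Eq. (1): move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Illustrative step: from state 0, action "fwd" earns reward 1 and lands in state 1.
actions = ["fwd", "turn"]
q_update(0, "fwd", 1.0, 1, actions)   # Q[(0, "fwd")] becomes 0.5 * 1.0 = 0.5
```

Repeated application of this update over experienced transitions drives the Q-table toward the optimal values, which is the convergence property noted above.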
3.2.1.3 MAXQ Value Decomposition
MAXQ works by decomposing a given MDP, M, into a finite set of subtasks {M0,M1,..,Mn}
which define the MAXQ hierarchy [34]. Herein, M0 represents the root subtask which defines the
overall problem, and is further decomposed into subtasks M1 to Mn.
As previously mentioned, USAR environments are unknown and cluttered, which can greatly
increase the complexity of the learning problem. Thus, by implementing a MAXQ approach, the
overall search and rescue task can be reduced into smaller more manageable subtasks that can be
learned concurrently.
In MAXQ, each subtask Mi is a three-tuple ⟨Ti, Ai, Ri⟩ defined as follows [34]:
Ti: A finite set of terminal states for subtask Mi. Generally, the total set of states S is separated into a set of active states Si and a set of terminal states Ti. Once the subtask reaches a terminal state, it exits and returns control to the parent subtask in the hierarchy that originally called it.
Ai: A set of actions available in subtask Mi. These actions can be either a primitive (one-step)
action or another subtask in the hierarchy.
Ri(s'): A reward function that specifies the immediate reward of transitioning to a terminal state s' ∈ Ti.
Furthermore, for every subtask Mi, a policy πi is defined which maps all possible states of Mi to a child task. The child task can be either a primitive action or another subtask under Mi to execute. Subsequently, a hierarchical policy π (a set containing the policies for all subtasks) is
defined for the entire task hierarchy. Also, a projected value function (Q-value) is stored for
every state and action pair in all subtasks. The projected value function is defined as the expected
cumulative reward of executing policy πi in subtask Mi, as well as all the policies of the subtasks
that would be executed under Mi until Mi terminates. The formal definition of the projected value
function is given as [34]:

Q^π(i, s, a) = V^π(a, s) + Σ_{s', N} P_i^π(s', N | s, a) γ^N Q^π(i, s', π_i(s')),   (2)

Herein, Q^π(i, s, a) is the projected value function for subtask Mi under the policy π. V^π(a, s) is the value function of executing action a in state s. N is the number of time steps that action a takes to transition to the next state s', and the rest of the terms are defined as before. Thus, the right-most part of this equation is the discounted cumulative reward of completing the task for Mi after having executed the first action a. Therefore, the value function can be defined as [34]:

Q^π(i, s, a) = V^π(a, s) + C^π(i, s, a),   (3)

where V^π(i, s) can be defined as [34]:

V^π(i, s) = { Q^π(i, s, π_i(s))                   if i is composite
            { Σ_{s'} P(s' | s, i) R(s' | s, i)    if i is primitive,   (4)

and the right-most term of (3) is separately defined as the completion function [34]:

C^π(i, s, a) = Σ_{s', N} P_i^π(s', N | s, a) γ^N Q^π(i, s', π_i(s')).   (5)

The MAXQ learning algorithm updates both V(i, s) and C(i, s, a) for all subtasks, which by definition updates Q(i, s, a) toward convergence to an optimal policy. The general algorithm with which the MAXQ learning technique is carried out for a given subtask M is as follows [34]:
float MAXQ(state s, subtask M)
    Let Total Reward = 0
    while s is not a terminal state of M do
        Choose action a according to the policy π
        Execute a
        if a is primitive, observe the one-step reward r
        else r = MAXQ(s, a), which recursively calls MAXQ for subtask a
            and returns the reward received while a was executed
        Total Reward = Total Reward + r
        Observe resulting state s'
        if a is a primitive subtask
            V(a, s) ← (1 − α) V(a, s) + α r
        else (a is a composite subtask)
            C(M, s, a) ← (1 − α) C(M, s, a) + α γ^N max_{a'} [V(a', s') + C(M, s', a')]
        s = s'
    end while
    return Total Reward
end
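The two table updates at the core of this algorithm can be sketched in a few lines of code; the V and C tables are kept as dictionaries, and the subtask names, step counts, and parameter values below are illustrative only:

```python
from collections import defaultdict

V = defaultdict(float)  # V[(a, s)]: value of primitive action a in state s
C = defaultdict(float)  # C[(M, s, a)]: completion function for subtask M

def update_primitive(a, s, r, alpha=0.25):
    """Primitive-action update: V(a,s) <- (1 - alpha) V(a,s) + alpha * r."""
    V[(a, s)] = (1 - alpha) * V[(a, s)] + alpha * r

def update_composite(M, s, a, s_next, n_steps, child_actions, alpha=0.25, gamma=0.9):
    """Composite update: C(M,s,a) <- (1 - alpha) C(M,s,a)
    + alpha * gamma^N * max_a' [V(a', s') + C(M, s', a')]."""
    best_q = max(V[(a2, s_next)] + C[(M, s_next, a2)] for a2 in child_actions)
    C[(M, s, a)] = (1 - alpha) * C[(M, s, a)] + alpha * (gamma ** n_steps) * best_q

# Toy usage: two primitive steps, then one composite update for a "Navigate" subtask.
update_primitive("forward", "s0", r=1.0)   # V[("forward","s0")] -> 0.25
update_primitive("forward", "s1", r=2.0)   # V[("forward","s1")] -> 0.5
update_composite("Navigate", "s0", "forward", "s1", 1, ["forward", "turn"])
```

Because Q(M, s, a) = V(a, s) + C(M, s, a), updating the two tables separately updates the projected value function as described above.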
3.2.2 Proposed MAXQ Task Graph
The proposed MAXQ task hierarchy for the semi-autonomous control architecture is shown in
Figure 2. The Root task represents the overall USAR task to be accomplished by the robot –
finding victims while exploring a cluttered USAR environment. This Root task can be further
decomposed into four individual subtasks defined as: Navigate to Unvisited Regions, Victim
Identification, Navigate and Human Control. The focus of this thesis is on the two main subtasks
of the MAXQ task hierarchy that are responsible for exploration and navigation in a USAR
environment: Navigate to Unvisited Regions and Navigate. The following sub-sections provide a
detailed discussion of each of the individual modules that make up the overall task hierarchy
with an emphasis on the two aforementioned subtasks. An ε-greedy policy is used in the task
hierarchy. Namely, an action is chosen randomly, with probability ε, to promote exploration of
all available actions, and the optimal action, i.e., action with the highest Q-value, is chosen with
probability 1-ε. The value of ε is reduced as the system is trained. The advantage of this policy is
that it provides sufficient exploration as well as exploitation for the learning algorithm, and
therefore, ensures convergence to optimal solutions [35].
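A minimal sketch of the ε-greedy rule described above is shown below; the decay schedule is an assumption, as the text only states that ε is reduced as the system is trained:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, otherwise the
    action with the highest Q-value. q_values maps action -> Q-value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def decayed_epsilon(episode, eps0=0.9, decay=0.99):
    """Illustrative exponential decay schedule (not specified in the text)."""
    return eps0 * decay ** episode
```

With ε near 1 early in training the robot explores broadly; as ε decays, it increasingly exploits the learned Q-values.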
Positive rewards are given to encourage transitions from the robot’s current state to desirable
states. For example, if the robot exits the USAR scene after having explored the entire scene, a
positive reward of +10 is given to the Navigate to Unvisited Regions subtask. For the Victim
Identification subtask, a positive reward of +10 is given for correctly tagging a new victim found
in the scene. For the Navigate subtask, the reward is based on three main factors: local
exploration, obstacle avoidance, and the global exploration direction. Within this subtask,
positive rewards are given to encourage desirable actions such as moving to adjacent unvisited
areas in the local environment, successfully avoiding dangerous situations such as collisions with
obstacles, and moving in the desired exploration direction. For example, moving into an
unvisited region in the desired global exploration direction is given a positive reward of +15 and
avoiding an obstacle is given a positive reward of +10. In addition, a large positive reward of
+100 is also given to the Root task when the overall USAR task is completed. Negative rewards
are given when a transition is made from the robot’s current state to an undesirable state. For
example, a negative reward of -10 would be given to Navigate to Unvisited Regions if Exit USAR
Scene is executed instead of Navigate when there are still known regions to explore. In the Victim
Identification subtask, a negative reward of -10 would be given if Tag is executed when no new
victims were identified in the scene. In Navigate, negative rewards would be given for executing
commands that lead to collisions (-20) or navigating into already visited areas (-1) of a USAR
scene. With respect to Human Control, similar positive and negative rewards are given based on
the subtask that the operator is involved in, allowing the robot to learn from the operator’s
actions.
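The reward scheme described above can be summarized as a simple lookup; the event labels below are illustrative names, while the numeric values are taken from the text:

```python
# Reward values from the text; the event labels themselves are illustrative.
REWARDS = {
    "root_task_complete":          +100,
    "exit_scene_fully_explored":    +10,  # Navigate to Unvisited Regions
    "exit_scene_prematurely":       -10,
    "tag_new_victim":               +10,  # Victim Identification
    "tag_without_new_victim":       -10,
    "move_unvisited_in_direction":  +15,  # Navigate
    "avoid_obstacle":               +10,
    "collision":                    -20,
    "revisit_area":                  -1,
}

def reward(event):
    """Immediate reward for a named transition event (0 for unlisted events)."""
    return REWARDS.get(event, 0)
```

Structuring the rewards this way makes the relative priorities explicit: collisions are penalized far more heavily than revisiting an area, while directed exploration earns the largest per-step bonus.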
Figure 2: MAXQ task graph for a semi-autonomous USAR robot.
3.2.2.1 Root Task
As previously mentioned, the Root task defines the overall goal of a rescue robot in a USAR
mission: to find victims within an unknown, cluttered disaster scene by exploring the
environment. The MAXQ state function of this task is defined as S(V, LR, Mxyz). V represents the
presence of potential victims in the environment. LR is the robot’s location with respect to the
global coordinate frame (defined to be at the location at which the robot enters the scene), and
Mxyz represents the 3D map of the USAR scene the robot is exploring. LR and Mxyz can both be
obtained by the SLAM module of the control architecture.
3.2.2.2 Navigate to Unvisited Regions
The purpose of this subtask is to have the robot explore unvisited regions within disaster
environments. The state definition for this subtask is S(LR,Mxyz), where LR and Mxyz have the same
definitions as above. The primitive action, Exit USAR Scene, is used by the Navigate to Unvisited
Regions subtask to trigger the end of exploration and guide a rescue robot out of the scene. In
order to efficiently explore an unknown environment, a direction-based exploration technique
was developed to be used in this subtask. The technique focuses on finding and exploring new
regions within the environment in order to expand the search area of the robot. This approach is
similar to frontier-based exploration, i.e., [42]-[45]; however, the proposed approach also explicitly takes into account the cluttered terrain of USAR environments.
Frontier-based exploration strategies developed to date focus on expanding the known map of an unknown environment by deploying either a single robot or a team of coordinated robots to target locations at the frontier of the map, i.e., the boundary between its explored and unexplored regions. A utility or cost function is
utilized to determine where to deploy a particular robot. For example, in [42], the frontier-based
exploration technique was introduced for a single robot. The approach determined the robot’s
next target location based on the shortest accessible path for the robot. In [43], a multi-robot
exploration strategy was adopted in which the cost for a particular robot to reach a frontier cell was
determined based on the probability of occupancy of each cell along the path of the robot to the
corresponding frontier cell, as well as the robot’s distance to that target point. In [44], both
distance and orientation to frontier points were used in the cost function for multi-robot
exploration. In [45], in addition to determining the nearest frontier point to a robot that provides
maximum coverage of unknown space, the deployment locations were chosen such that a robot at one of these locations remained visible to at least one other robot. Each robot was then deployed one at a time.
The majority of these frontier-based exploration approaches have been shown to work well in
simulated and real laboratory and/or office environments; however, they cannot be directly
applied to USAR-like environments where the robot is required to navigate over rough terrain in
cluttered scenes. Namely, in USAR environments, the terrain plays an important role in
determining a robot’s exploration path. For example, a robot may need to climb over rubble piles
in order to search a particular region or to explore new regions in an environment. Furthermore,
in rough terrain conditions, a particular cell or groups of cells may be accessible only through a
particular approach direction, due to the shape/slope of a rubble pile with respect to surrounding
cells. Hence, in this exploration technique, terrain information as well as travel distance and
coverage are considered. In addition, instead of a path-to-target approach, a direction-based strategy is utilized due to the cluttered and uncertain nature of these environments, together with a learning algorithm that helps the robot locally navigate these rubble-filled scenes. In general, the proposed direction-based technique is more robust to map
uncertainty than direct path planning techniques that require an accurate representation of the
scene in order for a robot to reach its target locations. Mainly, a search direction is defined for a
robot to follow in order to explore unknown regions. The proposed exploration method
determines one of four universal search directions, North, South, East or West, for the robot to
explore in order to navigate into unvisited regions in the scene. The chosen direction for
exploration is then incorporated into the rewarding system of the Navigate module which focuses
on implementing the primitive actions necessary to locally navigate the robot in this defined
search direction through the rubble. The search direction is updated as necessary in order for the
robot to search new regions.
To effectively choose an exploration direction, the current 3D map of the scene is decomposed
into a 2D grid map consisting of an array of equally sized cells, where each cell categorizes the
status of a predefined area of the real environment. The exploration technique uses this
information to locate potential cells within the map that have yet to be explored as well as cells
which represent the frontier of the known map of the scene, in order to expand the robot’s
knowledge about the environment. The 2D grid map is divided into four regions, where each
region represents one of the search directions, Figure 3.
Figure 3: 2D grid map of the four regions surrounding the robot.
The diagonal cells with respect to the robot’s current location in the known map are used to
determine the boundaries between the four search regions. Each region encompasses only one set
of boundary cells, defined to be in the counter-clockwise direction from the region.
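The region assignment described above can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the coordinate convention (grid index increasing to the East and to the North) and the function name `region_of` are assumptions, and the diagonal tie-breaking encodes the counter-clockwise boundary convention stated above.

```python
def region_of(cell, robot):
    """Assign a grid cell to one of the four search regions, given the
    robot's cell.  Cells are (east, north) grid indices.  Diagonal
    boundary cells are folded into the region counter-clockwise of them,
    so each region owns exactly one set of boundary cells."""
    dx = cell[0] - robot[0]          # offset to the East
    dy = cell[1] - robot[1]          # offset to the North
    if dx == 0 and dy == 0:
        return None                  # the robot's own cell belongs to no region
    if dy > 0 and -dy <= dx < dy:
        return "North"               # includes the NW diagonal
    if dx > 0 and -dx < dy <= dx:
        return "East"                # includes the NE diagonal
    if dy < 0 and dy < dx <= -dy:
        return "South"               # includes the SE diagonal
    return "West"                    # includes the SW diagonal
```

With this rule every cell other than the robot's own falls into exactly one region, which is what the utility comparison below requires.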
Prior to evaluating an exploration direction, the cells in each region are categorized into two
types: explored cells and unexplored cells. This classification is based on whether the robot has
visited a particular cell or its adjacent cells. If the robot has already explored a cell, then the
terrain or the presence of victims in that cell will be known and can be categorized accordingly.
Based on the sensing information of the robot, the cells can be identified to be:
i) Obstacle: An obstacle can be defined to be, for example, a rubble pile. All obstacles are
categorized as known or unknown obstacles. Known obstacles are further classified into
climbable and non-climbable obstacles. A robot is expected to navigate over a climbable
obstacle. The climbable obstacles category is again divided into visited and unvisited climbable
obstacles. On the other hand, the non-climbable obstacle category includes terrain conditions
such as large non-traversable rubble piles, walls, and sudden drops in the height of the rubble
pile (i.e., the edge of a cliff). Unknown obstacles are those that have been detected in the
surrounding cells of the robot, but for which not enough sensory information is available to
classify them further. The various sub-categories of obstacles are each treated differently in this subtask.
ii) Open visited: An open and traversable area that has been visited previously by the robot.
iii) Open unvisited: This cell is unvisited by the robot but has been detected as an obstacle-free
area.
iv) Victim: The cell contains a human victim.
v) Unknown: The cell information is unknown, in which case the cell has not been explored and
there is no sensory information available.
The cells categorized as visited, known or victim cells are defined to be explored cells, whereas
unexplored cells include the unknown and unvisited cells. Unexplored cells can add to the
robot’s knowledge of the scene and lead to exploring new regions. On the other hand, re-visiting
already explored cells may not necessarily provide any new information about the environment.
Therefore, herein, unexplored cells are considered as cells of interest in robot exploration and
assist in determining the exploration direction. To determine the robot’s exploration direction in
the scene, the following exploration utility function is defined:
u_j = \sum_{x=1}^{n} \omega_{xj}\,\lambda_{xj}\,\delta_{xj},  (6)
where uj is the utility function for each of the four individual regions, and j represents the
identity of the region, i.e. North, East, South, and West. x corresponds to the identity of a cell(x)
in region j, n is the total number of cells of interest in region j, and ωxj, λxj, and δxj represent three
non-zero positive coefficients for cell(x) in region j and are initially given the value 1. The
exploration utility function weighs the benefits of exploring a scene based on terrain, the number
of cells of interest in a region of the scene, and the travel distance to cells of interest using the
values of the three coefficients.
1) Terrain Coefficient: ω_xj is the coefficient given to cell(x) in region j based on the type
of terrain of that cell, in particular:
\omega_{xj} = \begin{cases}
w_l\,l_1, & \text{if cell}(x)\ \text{is an open unvisited space} \\
w_l\,l_2, & \text{if cell}(x)\ \text{is an unvisited climbable obstacle} \\
w_l\,l_3, & \text{if cell}(x)\ \text{is an unknown obstacle} \\
1.0, & \text{elsewhere,}
\end{cases}  (7)

where l_3 < l_2 < l_1 and l_m > 1 for m = 1 to 3.
Herein, w_l is a positive weighting applied to l_1, l_2, and l_3. The weighting can be used to set a
higher priority for this particular coefficient. Open unvisited cells are given the largest value since
they are obstacle-free regions and hence allow a robot to easily navigate the cell. Unvisited
climbable obstacles have the second largest value. Unknown obstacles have the lowest value due
to the uncertainty regarding the true nature of the obstacles in these cells.
2) Neighboring Cells Coefficient: λxj is the coefficient given to cell(x) in region j based on the
information in its 8 neighboring cells, namely:
\lambda_{xj} = \begin{cases}
w_p \sum_{k=1}^{8} v_x^k, & \text{when } \sum_{k=1}^{8} v_x^k > 0 \\
1.0, & \text{when } \sum_{k=1}^{8} v_x^k = 0,
\end{cases}  (8)

where:

v_x^k = \begin{cases}
p_1, & \text{if cell } k \text{ is unknown} \\
p_2, & \text{if cell } k \text{ is an open unvisited space} \\
p_3, & \text{if cell } k \text{ is an unvisited climbable obstacle} \\
p_4, & \text{if cell } k \text{ is an unknown obstacle} \\
0.0, & \text{elsewhere,}
\end{cases}  (9)

and where p_4 < p_3 < p_2 < p_1 and p_m > 1 for m = 1 to 4.
v_x^k is the exploration value of the k-th neighboring cell of cell(x), where k = 1 to 8, and w_p is a
positive weighting applied to \sum_{k=1}^{8} v_x^k. λ_x is designed to provide a higher value to cells that are adjacent to
unknown cells or other cells of interest, since exploring those cells may immediately lead to
exploration of other surrounding cells. For example, consider the case in the partial 2D grid map
shown in Figure 4, where the two cells A and B have the same type of terrain (unknown
obstacle), and therefore ω_A = ω_B. However, cell A has only one unknown obstacle among its 8
neighboring cells (λ_A = p_4), whereas cell B has two open unvisited cells and one unknown cell,
giving it the larger exploration value of λ_B = p_1 + 2p_2 when w_p = 1.
Figure 4: Scenario illustrating the contribution of the exploration coefficient, λx.
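As an illustrative sketch of Eqs. (8) and (9), the snippet below reproduces the Figure 4 example. The numeric values chosen for p_1 to p_4 are placeholders that only respect the required ordering p_4 < p_3 < p_2 < p_1 with all values above 1; the dictionary keys are hypothetical names for the cell categories defined earlier.

```python
# Placeholder values for p1..p4, chosen only so that p4 < p3 < p2 < p1 and pm > 1.
P = {"unknown": 4.0,                  # p1
     "open_unvisited": 3.0,           # p2
     "unvisited_climbable": 2.0,      # p3
     "unknown_obstacle": 1.5}         # p4

def exploration_value(cell_type):
    """v_x^k from Eq. (9): zero for any cell type that is not of interest."""
    return P.get(cell_type, 0.0)

def neighbor_coefficient(neighbor_types, w_p=1.0):
    """lambda_x from Eq. (8): weighted sum of the 8 neighbors' exploration
    values, defaulting to 1.0 when no neighbor is a cell of interest."""
    total = sum(exploration_value(t) for t in neighbor_types)
    return w_p * total if total > 0 else 1.0
```

For the Figure 4 scenario with w_p = 1, cell A (one unknown-obstacle neighbor) receives λ_A = p_4, while cell B (one unknown and two open unvisited neighbors) receives λ_B = p_1 + 2p_2.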
3) Travel Coefficient: δxj is the coefficient associated with moving to cell(x) in region j from the
robot’s current cell location, and is a function of dx which is defined as the distance of cell(x) to
the robot’s current occupied cell. As the distance to cell(x) increases, the value of travelling to
cell(x) decreases. δxj favors unknown cells closer to the robot’s current cell for exploration to
allow for more efficient search of the environment:
\delta_{xj} = \begin{cases}
w_q\,q_1, & \text{if } d_x \le X_1 \\
w_q\,q_2, & \text{if } X_1 < d_x \le X_2 \\
w_q\,q_3, & \text{if } X_2 < d_x \le X_3 \\
w_q\,q_4, & \text{if } X_3 < d_x \le X_4 \\
1.0, & \text{elsewhere,}
\end{cases}  (10)

where q_4 < q_3 < q_2 < q_1 and q_m > 1 for m = 1 to 4.
w_q is a positive weighting applied to q_m (m = 1 to 4). X_n (n = 1 to 4) represents predefined distance
thresholds that can be set by an operator. These thresholds can be increased as the size of the
known map increases to reflect more realistic distances.
Once the utility functions for all four regions are evaluated, the direction corresponding to the
region with the largest utility value is chosen as the exploration direction. The desired
exploration direction is passed to the Navigate subtask in order to locally move the robot in this
direction. A robot operator is able to select the weightings for the three coefficients based on the
rescue scenario at hand. In particular, he/she may decide to increase the weighting for one or
more of the coefficients in order to have the robot explore in a desired manner or to maximize
energy efficiency of the robot.
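The combination of the three coefficients in Eq. (6) and the selection of the exploration direction can be sketched as follows. This is only a sketch: the per-cell coefficient triples (ω, λ, δ) are assumed to have been computed already by the terrain, neighboring-cells, and travel rules above, and the function names are illustrative.

```python
def region_utility(cells):
    """u_j from Eq. (6): sum of omega * lambda * delta over a region's
    cells of interest."""
    return sum(omega * lam * delta for (omega, lam, delta) in cells)

def choose_direction(regions):
    """Return the direction whose region has the largest utility value.
    `regions` maps 'North'/'East'/'South'/'West' to lists of
    (omega, lambda, delta) triples, one triple per cell of interest."""
    return max(regions, key=lambda j: region_utility(regions[j]))
```

A region with no cells of interest simply contributes a utility of zero, so the robot is steered toward whichever direction promises the most valuable unexplored cells.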
Concave Obstacle Avoidance Technique
A concave obstacle avoidance technique was developed and integrated with the exploration
algorithm. Herein, a concave obstacle is defined as a collection of non-climbable obstacle cells
within the 2D grid map of the environment which makes a concave pattern of any shape and size.
Figure 5 illustrates two examples of concave obstacles. As can be seen from these two examples,
the concave obstacle outline can be of a simple concave pattern, Figure 5(a), or any other
arbitrary pattern, Figure 5(b). However, the common feature of these two patterns is that if the
exploration direction of the robot happens to be towards the opening of the concave obstacle, as
shown in Figure 5, the robot travelling into the concave obstacle can find itself enclosed by
non-climbable obstacles, making the exploration task considerably more challenging.
(a)
(b)
Figure 5: Examples of concave obstacles.
The proposed concave obstacle avoidance technique is comprised of two steps. The first step is
detection of the concave obstacle ahead of time: the 2D grid map information is used to search
for a continuous boundary of non-climbable obstacles in the exploration direction of the robot.
Once such an obstacle is detected, the second step is to avoid the region by guiding the robot
around the obstacle.
The robot first attempts to detect the concave obstacles utilizing the rubble pile information that
is obtained from the 2D grid map. The following two conditions must be satisfied for the robot to
detect a collection of obstacle cells as a concave obstacle which needs to be avoided:
Condition 1: The collection of obstacles must have an enclosed non-climbable boundary.
Condition 2: All the cells that make up the region within the concave obstacle (including the
non-climbable boundary of the obstacles) must be explored cells.
Figure 6 illustrates how a collection of obstacle cells forming a concave boundary are detected in
the robot’s exploration direction. The detection method is performed to detect concave obstacles
at every navigation step. The algorithm begins by scanning the path in the robot’s heading
direction until a non-climbable obstacle is found, i.e. cell (3,2) in Figure 6. Thereafter, a search
is initiated from the location of the non-climbable obstacle to the left and right of the robot’s path
of exploration, Figure 6. In the example shown in Figure 6, the searching algorithm begins at cell
(3,3) and looks at all the cells to the left and right of the robot’s path of exploration until non-
climbable obstacles are found; this procedure is repeated along the robot’s exploration path down
to cell (3,7). The algorithm stores the information gained in the searching process to estimate the
size and shape of the non-climbable obstacle boundary. During this search, the algorithm ensures
that the two aforementioned conditions for a concave obstacle are met. That is, the boundary of
the set of obstacles is checked for any openings, such as an open visited cell or a climbable
obstacle, and the inner region of the boundary is searched for unexplored cells. If either of the
two conditions is not satisfied during the searching process, the search is terminated and the
possibility of a concave collection of obstacles in the robot’s exploration direction is dismissed.
On the other hand, if the algorithm does detect an enclosed non-climbable boundary of obstacles
the approximate region of this boundary is stored for avoidance.
Figure 6(a) depicts a scenario where both conditions are satisfied, and thus the set of obstacles
need to be avoided. On the other hand, Figure 6(b) depicts a scenario where neither condition is
satisfied. Condition 1 is not met since both a climbable obstacle at (2,2) and an open visited cell
at (8,3) create openings in the boundary of the concave obstacle. Condition 2 is not satisfied
either, since the region bounded by the concave obstacle is not fully explored due to the
existence of an unvisited open cell at (6,5).
(a)
(b)
Figure 6: Scenarios illustrating the concave obstacle detection technique: (a) identified as
a concave obstacle, (b) not identified as a concave obstacle.
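A minimal sketch of the two concavity conditions is given below. It assumes the directional scan has already extracted the candidate boundary and interior cells, and the string labels are hypothetical names for the cell categories defined earlier in this chapter; the scan itself is omitted.

```python
# Cell categories treated as "explored" (visited, known, or victim cells);
# unknown and unvisited cells are deliberately excluded.
EXPLORED = {"open_visited", "non_climbable", "climbable_visited", "victim"}

def is_concave_obstacle(boundary, interior):
    """Apply the two detection conditions to a candidate obstacle region.
    `boundary` and `interior` are lists of cell-type labels."""
    # Condition 1: the enclosing boundary must consist solely of
    # non-climbable obstacle cells (any climbable or open cell is an opening).
    if any(c != "non_climbable" for c in boundary):
        return False
    # Condition 2: every cell inside the boundary must already be explored;
    # otherwise the region may still be worth entering.
    return all(c in EXPLORED for c in interior)
```

Only when both conditions hold is the region stored for avoidance, matching the dismissal rule described above.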
Once a concave obstacle is detected, the approximate region within the 2D map that contains the
obstacle is utilized to assist the robot in avoiding the region. The concave obstacle avoidance is
simply a guidance procedure that allows the robot to steer around concave obstacle regions.
There are two possible scenarios when it comes to the robot’s concave obstacle avoidance. The
first scenario is where the robot is inside the boundaries of the concave obstacle, Figure 7(a).
This scenario may happen, for example, if the robot needs to explore unknown cells within the
set of obstacles. In this scenario, the first step of the avoidance procedure is to move out of the
concave obstacle. This is done by temporarily giving the robot the opposite exploration direction,
Figure 7(b). Once the robot has moved out of the region, the second step is to clear the concave
obstacle by moving sideways to the left or right of the concave obstacle, Figure 7(c). Once both
steps are executed, the robot’s original exploration direction is assigned back, allowing it
to continue exploration as before, Figure 7(d).
The second scenario is where a concave obstacle is detected which has been previously explored.
In this scenario, the first step of the concave obstacle avoidance is skipped and the robot simply
moves sideways to clear the obstacle, see Figure 7(c).
Note that there may be cases where the robot needs to avoid multiple concave boundaries. For
example, while attempting to avoid one concave obstacle the robot may face a new concave
obstacle in a temporary direction of travel. In this case, a dynamic stack list of concave obstacle
regions that should be avoided is created, and the concave obstacle regions are avoided
consecutively.
(a)
(b)
(c)
(d)
Figure 7: Scenario illustrating the concave obstacle avoidance technique.
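The two-step avoidance can be viewed as a direction override, sketched below. The direction tables and the particular sideways direction chosen are assumptions of this sketch (the thesis allows moving to either side), and the `robot_inside` and `cleared` flags are assumed to come from the stored obstacle region.

```python
OPPOSITE = {"North": "South", "South": "North", "East": "West", "West": "East"}
SIDEWAYS = {"North": "East", "South": "West", "East": "South", "West": "North"}

def avoidance_direction(explore_dir, robot_inside, cleared):
    """Temporary travel direction while avoiding one concave region.
    Step 1 (only if the robot is inside the region): back out by
    travelling opposite to the exploration direction.
    Step 2: move sideways until the region is cleared, then restore
    the original exploration direction."""
    if robot_inside:
        return OPPOSITE[explore_dir]     # step 1: leave the region
    if not cleared:
        return SIDEWAYS[explore_dir]     # step 2: steer around it
    return explore_dir                   # resume normal exploration
```

Nested avoidance (a new concave obstacle met while avoiding another) would be handled by pushing each detected region onto the dynamic stack described above and applying this override to the region on top.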
Concave obstacles are a particular issue for an exploration technique that relies on direction of
travel: when a robot is travelling in a specified exploration direction, it can get trapped within
the boundary of a collection of obstacles. A common solution to this problem is a wall-following
technique [46], where the robot enters a concave shaped obstacle until it comes within close
proximity to the obstacle’s boundary. Once the robot reaches the obstacle, the wall-following
technique moves the robot within close proximity to the obstacles, i.e. walls, until the robot finds
its way out of the enclosed boundary. Although this technique can be simple to program and
implement in an exploration algorithm, it is time consuming, especially when dealing with very
large obstacles. To illustrate this point further, Figure 8(a) depicts a scenario where the robot
uses a wall-following technique to avoid the concave obstacle boundary. As can be seen from
this example, the number of steps taken to go around the obstacle boundary entirely depends on
the size and shape of the concave obstacle. However, by utilizing the proposed avoidance
technique the route of the robot can be considerably shortened as illustrated in Figure 8(b).
(a)
(b)
Figure 8: Comparison of the robot’s path to avoid a concave obstacle: (a) path taken with a
wall-following technique, (b) path taken using the proposed concave obstacle avoidance
technique.
3.2.2.3 Navigate
The goal of this subtask is to perform local navigation and obstacle avoidance. Its state is
defined as S(Ci, DE, Dxy, LV/R), where Ci, i = 1 to 8, represents the 2D grid map information for
the 8 surrounding cells of the robot, DE corresponds to the desired exploration direction, and Dxy
contains the depth profile information of the rubble pile in the surrounding environment.
Navigate can be performed based on two conditions: local scene exploration or navigate to a
victim location. If the objective is to navigate to a potential victim rather than just explore a
scene, LV/R (relative location of a potential victim to the robot) is used in the Navigate module to
reach a target victim location. Once the victim location is reached, the Navigate subtask
terminates and allows Victim Identification to perform the task of identifying victims.
Local exploration is performed by the Navigate subtask by utilizing the information of the 8
adjacent cells to the robot’s current cell location, i.e. Ci and not the entire 2D map. For example
in the representation depicted in Figure 9, cell C2 represents the information of the grid cell in
front of the robot (open unvisited region), whereas, C5 represents information of the cell to the
right of the robot (known obstacle). Based on the status of the robot’s surrounding cells, the
optimal primitive actions (F, B or θ) are used to rotate a robot by an angle θ, move a robot
forward (F), or backward (B) within the environment. Hence, the Navigate subtask can search
its local environment by exploring new cells and performing obstacle avoidance.
Figure 9: Grid representation of the local environment surrounding a robot (the robot’s
forward direction is shown towards cell C2).
In local navigation, the navigation reward Rn is generally based on two factors: obstacle
avoidance and local exploration. The immediate rewards are a direct function of the state
definition and the action at each time step. The rewards can be positive or negative depending on
the action that is executed while in state, s. Positive rewards can encourage desirable actions
such as moving in a desired direction or exploring a locally unvisited cell. Moreover, when
various options are available to the robot, the adjacent cells are given priority based on their ease
of accessibility. Herein, positive rewards can also encourage the robot to explore the most
accessible unexplored cell. For example, exploring an open unvisited cell directly in front of the
robot would be more accessible than an unvisited climbable obstacle that is located in a cell to
the right of the robot. The reasoning behind this is simple: in this scenario, in order to explore the
cell in front of the robot only a Forward action must be executed, whereas to explore the cell to
the right of the robot a Turn action, followed by a Forward action must be executed, where the
robot needs to climb over the rubble pile. Furthermore, the Forward action in the second case
(i.e. climbing over the rubble pile) could require more energy and be more time-consuming than
would be required if the robot moves into an obstacle-free cell. In a similar manner, negative
rewards can discourage obstacle collision as well as revisiting explored cells. Although negative
rewards can be given in many situations where an action is undesirable, the value of these
rewards can be determined based on pre-set conditions. For example, due to the physical damage
a collision with an obstacle may cause the robot, the negative reward for such a collision can be
set higher in magnitude than that for revisiting an explored cell, hence placing obstacle avoidance
as a higher priority.
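A possible encoding of this reward structure is sketched below. The magnitudes are placeholders that only preserve the priorities described above (collision penalized most, the easily accessible unexplored cell preferred); the outcome labels are hypothetical names, not the thesis's actual reward table.

```python
def navigation_reward(outcome):
    """Illustrative immediate reward for the Navigate subtask, keyed on the
    outcome of the executed primitive action."""
    rewards = {
        "moved_in_search_direction": 1.0,
        "explored_open_unvisited": 2.0,   # one Forward action, no climbing
        "explored_climbable": 1.0,        # Turn + Forward + climb: costlier
        "revisited_explored_cell": -0.5,  # mild penalty: no new information
        "collision": -2.0,                # largest penalty: risk of damage
    }
    return rewards.get(outcome, 0.0)
```

Any learning update (e.g. the HRL value updates used in this architecture) would consume these immediate rewards unchanged; only the relative ordering shown here matters for the behavior described above.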
Navigating Rubble Piles
For effective navigation in highly cluttered environments, a rescue robot will need to be able to
climb over rubble piles. Dxy represents a 2D array that contains the depth profile information of
the rubble pile that appears in the robot’s path at any given time. The depth profile information
of the rubble pile is obtained from the 3D map and is conveniently represented in Dxy. By
analyzing the depth profile information, the robot is able to classify the rubble pile into two
categories: climbable and non-climbable. By having the ability to effectively distinguish between
climbable and non-climbable obstacles, the robot can potentially shorten travelling distances by
climbing over the climbable obstacles that it may face instead of the more traditional navigation
approach of avoiding all obstacles. Thus, Dxy can assist the robot in achieving a more efficient
exploration in highly cluttered environments.
Figure 10 shows two examples of the depth profile information of a rubble pile in the robot’s
forward view, as represented by Dxy. The lighter cells indicate lower depth values (i.e., closer)
and darker cells indicate higher depth values (i.e., farther) on the surface profile of a rubble pile.
In this figure, the X axis represents the width and the Y axis the height in the 3D depth sensor’s
field of view. The scenario illustrated on the right hand side of the figure shows a rubble pile depth
profile that may be obtained from a climbable rubble pile. On the other hand, the scenario shown
on the left hand side of the figure suggests a non-climbable rubble pile due to its relatively larger
slope and height. Similar to these scenarios, various obstacles can be categorized into climbable
and non-climbable based on the height and slope information that can be extracted from their
respective 2D depth profile. The following section discusses the rubble pile categorization
technique in more detail.
(a)
(b)
Figure 10: 2D Dxy array for rubble pile profile classification. The cell shading becomes
darker as the cell depth values increase.
Rubble Pile Categorization Technique
The objective of the rubble pile categorization technique is to classify the rubble pile located in
the cell in the robot’s direction of travel into one of the following five scenarios: (i) open space,
(ii) non-climbable obstacle, (iii) uphill climbable obstacle, (iv) downhill terrain, and (v) drop,
Figure 11.
Figure 11: Scenarios illustrating the rubble pile classifications: (a) open space, (b) non-
climbable obstacle, (c) uphill climbable obstacle, (d) downhill, and (e) drop.
The rubble pile categorization technique mainly takes into account three parameters from the 3D
profile of the rubble pile to classify it into one of these five categories: i) the overall slope of the
rubble pile, ii) the smoothness of the 3D data, and iii) depth values at specific locations of the
image.
The depth values representing the profile of the rubble pile can be obtained directly from the 3D
image. To obtain the slope and smoothness of the 3D data, a plane, z = Ax + By + C, is fit to the
data points using the least squares method. Given a set of n points \{(x_i, y_i, z_i)\}, i = 1 to n, the least
squares method provides a solution for the three equation parameters A, B, and C which
minimizes the sum of the squared errors between the real depth values z_i and the z values
estimated by the plane, \hat{z}_i = Ax_i + By_i + C. The three parameters are calculated by solving the
following equation [47]:

\begin{bmatrix}
\sum x_i^2 & \sum x_i y_i & \sum x_i \\
\sum x_i y_i & \sum y_i^2 & \sum y_i \\
\sum x_i & \sum y_i & n
\end{bmatrix}
\begin{bmatrix} A \\ B \\ C \end{bmatrix} =
\begin{bmatrix} \sum x_i z_i \\ \sum y_i z_i \\ \sum z_i \end{bmatrix}  (11)
The parameter that is of most interest is B = \partial z / \partial y, which represents the rate of change of the
depth values with respect to the height of the rubble pile. Thus, B provides a rough estimate of
the overall slope and shape of the rubble pile, Figure 12. For example, this parameter can be used
to easily distinguish between a large non-climbable obstacle (e.g. a large rubble pile or a wall)
and open flat ground. It can also be used to estimate the angle of the surface via simple
calibrations which helps determine whether or not the robot is able to climb the rubble pile.
Figure 12 shows how B can be used to distinguish between different rubble pile categories.
Figure 12: Contribution of the overall slope of the depth profile on rubble pile
classification: (a) open space, (b) non-climbable obstacle, (c) uphill climbable obstacle, (d)
downhill.
An estimate of the smoothness, R, of the 3D profile can be obtained by evaluating the sum of the
squared residuals of the data points. Given n data points, R is defined as:

R = \sum_{i=1}^{n} (z_i - \hat{z}_i)^2  (12)
The smoothness of the depth profile provides useful information regarding the terrain conditions
and the ability of the robot to traverse or climb a certain rubble pile. Figure 13 illustrates the
contribution of the smoothness parameter on rubble pile classification. In the two scenarios
shown in Figure 13, the 3D profiles of the rubble piles would produce similar slope values.
However, analyzing these profiles would show that the plane fitted to the 3D data in Figure 13(a)
has a smaller cumulative residual with respect to the actual 3D data than that shown in
Figure 13(b). Thus, the smoothness parameter, R, for the case depicted in Figure 13(b) would be
larger, indicating that the terrain condition is too rough for the robot to traverse.
Figure 13: Contribution of the smoothness parameter, R, to the rubble pile classification:
(a) small R indicates a smooth traversable terrain, (b) large R indicates a rough non-
traversable terrain.
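The plane fit of Equation (11) and the smoothness measure of Equation (12) can be sketched in a few lines. Solving the 3x3 normal equations with Cramer's rule is an implementation choice of this sketch, not necessarily the thesis's; the function names are illustrative.

```python
def fit_plane(points):
    """Least-squares fit of z = A*x + B*y + C to (x, y, z) samples by
    solving the 3x3 normal equations of Eq. (11) with Cramer's rule."""
    sxx = sum(x * x for x, y, z in points)
    sxy = sum(x * y for x, y, z in points)
    syy = sum(y * y for x, y, z in points)
    sx = sum(x for x, y, z in points)
    sy = sum(y for x, y, z in points)
    sxz = sum(x * z for x, y, z in points)
    syz = sum(y * z for x, y, z in points)
    sz = sum(z for x, y, z in points)
    n = float(len(points))
    M = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    b = [sxz, syz, sz]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(M)
    sol = []
    for col in range(3):              # Cramer's rule, one column at a time
        Mc = [row[:] for row in M]
        for r in range(3):
            Mc[r][col] = b[r]
        sol.append(det3(Mc) / d)
    return tuple(sol)                 # (A, B, C)

def smoothness(points, plane):
    """R from Eq. (12): sum of squared residuals about the fitted plane."""
    A, B, C = plane
    return sum((z - (A * x + B * y + C)) ** 2 for x, y, z in points)
```

For classification, only B (the rate of change of depth with height) and R are then compared against the calibrated thresholds discussed later in this chapter.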
The last parameter used in the rubble pile categorization technique is the 3D depth values at
specific locations within the 3D image. This information is useful for detecting a drop in the
terrain, and in some cases, a non-climbable obstacle.
Figure 14(a) depicts an example of what the Dxy for a traversable open space would look like.
By knowing what range of depth values are produced in a traversable open space, and comparing
it to the 3D data produced from the rubble pile, a drop in the robot’s forward view can be easily
detected. Generally if the robot faces a drop, there can be two possibilities. The first scenario
would be if the drop is small enough that the 3D sensor can capture 3D data pertaining to the
rubble surface at the bottom of the drop, Figure 14(b). In this case, the depth values in the lower
rows of the Dxy array will have values that are considerably larger than the depth values that
would normally be produced for a traversable open space, see Figure 14(a) and (b). The second
scenario would be that the drop is so deep that the rubble at the bottom is not in the field of view
of the 3D sensor, Figure 14(c). In this scenario either the entire array will be out of range of the
3D sensor or a sudden increase in depth can be observed between two consecutive rows, as
shown in Figure 14(c).
(a)
(b)
(c)
Figure 14: Comparison of traversable open space with a drop in the terrain: (a) open space,
(b) drop where the surface is observable in the 3D data, (c) drop where the surface is not
observable in the 3D data.
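A drop check of the kind described here might be sketched as follows. The 900 mm threshold follows the calibration values reported later in this chapter; the assumption that the last rows of the Dxy array are those nearest the robot, and the function name, are illustrative.

```python
def detects_drop(dxy, drop_threshold_mm=900.0):
    """Flag a drop when the two bottom rows of the Dxy array read much
    deeper than flat ground would, i.e. the ground ahead has fallen away.
    `dxy` is a list of rows, last rows assumed nearest the robot."""
    bottom = dxy[-1] + dxy[-2]               # two rows nearest the robot
    return sum(bottom) / len(bottom) > drop_threshold_mm
```

A deep drop whose bottom is out of the sensor's range would instead show up as out-of-range readings or a sudden jump between consecutive rows, and would need a separate check.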
Furthermore, analyzing the depth values in addition to the plane parameters can also help the
rubble pile categorization technique detect obstacles that would not normally be detected from
the fitted plane parameters. For this reason, the rubble pile categorization technique checks both
the depth values and the fitted plane parameters for detection of non-climbable obstacles. For
example, in the case of highly discontinuous 3D data, the slope and smoothness parameters
cannot provide an accurate representation of the 3D profile of the rubble pile. An example of this
scenario is shown in Figure 15. In this example, the plane parameter B may not be able to
accurately represent the shape and slope of the rubble pile, but the Dxy array can determine
whether or not the gap in the rubble pile is large enough for the robot to enter. By a simple
examination of the top rows of the Dxy array it can be concluded that the rubble pile should be
categorized as a non-climbable obstacle and that the robot cannot enter this void.
Figure 15: Example of using the position of 3D data within the Dxy array to detect non-
traversable small voids in the rubble.
Calibration
To be able to successfully classify a given 3D profile into one of the five rubble pile categories,
calibration is required. Since the algorithm relies on a fitted plane parameter, B, the relative
viewing angle of the 3D sensory system on the robot plays a crucial role in the calibration
process. That is, once the system is calibrated, the position and orientation of the 3D sensory
system must remain fixed with respect to the robot. In addition, the smoothness parameter and
the depth value thresholds being used in the rubble pile categorization algorithm are unique to
the specific rescue robot being utilized. For example, the maximum allowable R value that
dictates whether or not the robot has the ability to traverse a terrain with a given slope, B,
entirely depends on the robot’s physical capabilities such as locomotion mechanism, tire traction,
clearance under the robot, motor torque, etc. In addition, the depth value thresholds used in the
classification technique are governed by physical dimensions of the robot and the distance of the
sensory system to the ground in front of the robot. Thus, calibrations are needed to find the right
threshold values for the aforementioned parameters and depth values in the algorithm being
specifically tailored to the rescue robot and its 3D sensory system.
Calibration was performed using a total of 70 3D images taken from a USAR-like environment.
The images were taken of open traversable space with various rubble piles, non-climbable
obstacles of different shapes and sizes, drops, downhill terrain, as well as uphill climbable rubble
piles. Before analyzing the data, image processing was performed to filter the masked pixel values
in the 3D images. Masked pixels are pixels in the 3D image that do not have an accurate depth
value due to inadequate illumination of the objects or errors in the 3D sensory system. The
masked pixels are given an arbitrary value (e.g. 1200) for easy identification in image
processing. The image processing simplifies the 3D image by segmenting the 640 by 480 pixels
in the image into an m by n Dxy array, where m and n are factors of 640 and 480, respectively.
The size of the array can be adjusted for higher or lower resolution as needed. In these
calibrations, it was found that a 16 by 12 array provides enough resolution for a successful
categorization of the rubble pile. Using this resolution, the image is segmented into 40 by 40
pixel blocks. Within each pixel block, the non-masked depth values were averaged and the mean
depth value for each block was assigned to the corresponding Dxy array element. This process
not only removes the masked pixels but also simplifies the 3D image into a smaller and more
uniform depth array, Dxy.
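The block-averaging step described above can be sketched as below. The masked-pixel flag value of 1200 is the arbitrary marker mentioned above; `build_dxy` is a hypothetical name, and the image is taken as a plain list of rows rather than whatever image type the real system uses.

```python
MASKED = 1200  # arbitrary flag value assigned to masked (invalid) pixels

def build_dxy(image, block=40):
    """Reduce a depth image (list of rows of depth values) to a Dxy array
    by averaging the non-masked pixels of each block x block region.
    A 640x480 image with block=40 yields the 16x12 array used herein.
    Entirely-masked blocks are left as None."""
    rows, cols = len(image), len(image[0])
    dxy = []
    for r0 in range(0, rows, block):
        out_row = []
        for c0 in range(0, cols, block):
            vals = [image[r][c]
                    for r in range(r0, r0 + block)
                    for c in range(c0, c0 + block)
                    if image[r][c] != MASKED]
            out_row.append(sum(vals) / len(vals) if vals else None)
        dxy.append(out_row)
    return dxy
```

Averaging only the non-masked pixels is what lets the Dxy array tolerate poorly illuminated spots without discarding the whole block.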
Once the 3D image is processed, Equation (11) is used to fit a plane to the Dxy depth profile and
the smoothness parameter, R, is calculated using Equation (12). The main purpose behind
analyzing the plane parameters was to obtain a good understanding of what ranges of values
should be expected for parameters B and R for traversable open space, climbable obstacles, non-
climbable obstacles, and downhill terrain.
Open Space: From calibration data, it was observed that for traversable open space, the plane
parameter, B, was normally between 2.5 and 4.5. The maximum allowable smoothness
parameter, R, was determined to be approximately 5000. Furthermore, the average depth values
obtained from the first two rows at the bottom of the Dxy array, i.e. distance of sensory system to
traversable flat ground, were found to be between 700 and 850 mm.
Climbable Obstacle: It was determined that the rescue robot was capable of climbing relatively
smooth uphill terrain with an incline angle of up to 18 degrees. To be able to relate the parameter
B to the physical slope of an incline, additional calibration was performed using the B
measurements at 9 different angles. Figure 16 shows the plotted results. A 3rd degree polynomial was fit to the data. From the obtained relationship, it was determined that the minimum allowable B value, which corresponds to an incline of 18 degrees, is about 1.36. Thus, B values lower than 1.36 were considered to be non-climbable. In addition, it was found that at a slope of 18 degrees, the robot was capable of climbing a rubble pile with a maximum smoothness parameter, R, of about 1000.
Figure 16: Calibration plot showing the relationship between the plane parameter B and the angle of an incline, θ (degrees). The fitted 3rd degree polynomial is B = -23.697θ³ + 125.08θ² - 228.27θ + 156.66.
Downhill Terrain: The plane parameter, B, for traversable downhill terrain was determined to be
greater than 4.5 and the maximum traversable smoothness parameter, R, was determined to be
about 2000. In addition, the average depth values for the first two bottom rows of the Dxy array were determined to be greater than 800 mm.
Drop: For most cases where the bottom of the drop could be captured in the 3D image, the plane
parameter, B, was found to be greater than 5.5. Also, the average depth value for the two bottom
rows of the Dxy array was found to be greater than 900 mm.
Non-Climbable Obstacle: Due to the wide range of possibilities for non-climbable obstacles, it
was more difficult to generalize the plane parameters and depth values for this category of the
rubble pile. However, by the process of elimination, non-climbable obstacles can be easily
detected. That is, if the plane parameters and depth values do not fit into any of the categories
discussed above, the rubble pile can be assumed to be a non-climbable obstacle. In addition,
there are some guidelines that can support the possibility of a non-climbable obstacle. From the
observations, it was determined that the fitted plane slope, B, for non-climbable obstacles was
normally larger than 3.0. Moreover, it was observed that all non-climbable obstacles had a
smoothness parameter larger than 1200.
By using simple logic that incorporates the aforementioned threshold ranges for the parameter and
depth values, the rubble pile could be successfully classified into the proper categories. A few
examples of the depth images used in the calibration and their respective rubble pile
classification are provided in Table 1.
Table 1: Examples of calibrated depth images
Open space Non-climbable obstacle Climbable obstacle Drop
2D image
3D grayscale
3.2.2.4 Victim Identification
The objective of the Victim Identification subtask is to identify victims in cluttered USAR scenes.
The state of the subtask is defined as S(LV/R, Mxyz), where LV/R represents the locations of
potential victims in the scene with respect to the robot’s location as obtained by the 3D mapping
sensor. When a victim is identified in the scene, the primitive action Tag is executed by this
subtask in order to tag the global location of the victim within the 3D map of the environment,
Mxyz. The Navigate subtask is executed when the robot needs to move closer to a target location,
i.e. a potential victim in the environment, in order to make a positive identification. Currently, a thermal
camera and/or a victim identification algorithm are used to detect victims. The victim
identification algorithm uses 3D object segmentation and a human body model, as well as skin
detection to identify partially obstructed victims in a rubble filled environment [40].
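The skin-detection component is only summarized here; its actual method is given in [40]. As a generic stand-in, a classical per-pixel skin rule in RGB space (the Kovac rule) can be sketched as follows, purely for illustration:

```python
import numpy as np

def skin_mask(rgb):
    """Per-pixel skin classifier using the classical Kovac RGB rule --
    a generic stand-in, not the algorithm of [40]."""
    rgb = rgb.astype(np.int32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return ((r > 95) & (g > 40) & (b > 20)
            & (rgb.max(-1) - rgb.min(-1) > 15)   # sufficient color spread
            & (np.abs(r - g) > 15) & (r > g) & (r > b))
```

In practice such a mask would be combined with the 3D object segmentation and human body model to reject skin-colored but non-human surfaces.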
3.2.2.5 Human Control
The purpose of this module is to pass over the control of the robot to the human operator in case
the robot is unable to perform any of its tasks autonomously. There are two levels in the task
hierarchy in which Human Control can be used to control the robot, Figure 2. The objective of
the 2nd level Human Control subtask is to have a human operator share with the robot the tasks
pertaining to the Victim Identification and/or Navigate to Unvisited Regions subtasks. With
respect to the Victim Identification subtask, the operator can assist in identifying victims in the
scene and tagging their location. The operator utilizes the corresponding 2D and 3D information
of the robot’s surrounding cells in the grid map, as obtained in real-time from the 3D mapping
sensor, to tag the location of a victim. Namely, the operator can input which cell(s) he/she sees a
victim(s) in based on this sensory information. This input is then utilized to update the grid. The
Navigate subtask can also be called upon to move the robot closer to a potential victim, in order
for the operator to make a positive identification and tag a victim’s location. The operator
requests that the robot move to a particular cell based on the sensory information. The distance to
the cell from the robot’s current location is defined by LV/R in the Navigate subtask. With respect
to the Navigate to Unvisited Regions subtask, the operator can assist in choosing optimal search
directions for exploration. On the other hand, the 3rd level Human Control subtask allows for
operator control of the local navigation subtask of the robot. If the robot is unable to navigate
autonomously due to noisy or insufficient sensory data, or if the robot has gotten physically stuck
due to the highly cluttered nature of the environment, this Human Control subtask allows the
operator to take over the low level navigation and obstacle avoidance tasks of the robot. While
the control is passed over to the operator, the aim is to have the robot’s sensory system still track
the motion of the robot and update the 3D map of the scene. For both levels, when Human
Control is executed, the objective is to have the appropriate subtask continue to learn by
observing the implemented actions of the operator via the information path as well as the
outcomes that result from these actions, as highlighted in Figure 2.
3.3 Chapter Summary
In this chapter, the overall semi-autonomous control architecture developed was presented. The
proposed MAXQ task graph was also presented. The main subtasks of the MAXQ hierarchy
were discussed in detail; these included: the Root subtask, the Navigate to Unvisited Regions
subtask, the Navigate subtask, the Victim Identification subtask and the Human Control subtask.
An integrated direction-based exploration technique was also presented which used the rubble
pile information in the 2D grid map to choose a global exploration direction for the robot. A
rubble pile categorization technique utilized by the navigation module of the control architecture
was also discussed.
Chapter 4 Simulations
Simulations were conducted to train the MAXQ and verify the convergence of the Q-values. A
second set of simulations was conducted to test the performance and robustness of the overall
control architecture in performing search and rescue operations in both small and large simulated
environments.
4.1 MAXQ Learning and Q-Value Convergence
Offline training of the MAXQ hierarchy was performed to verify the learning of robot actions
and convergence of the Q-values. The MAXQ hierarchy was trained in a 20 by 20 cell
environment (approximately 336 m²), with each cell occupying an area of 0.84 m². The robot
used in the simulations occupied one cell at a time. The robot’s sensory input for the simulation
included: i) 3D depth information pertaining to the cell in front of the robot for categorization as
one of the five aforementioned rubble pile categories using the Dxy profile, ii) 5 proximity
sensors (as binary values) on the sides and back of the robot, and iii) the probability and distance
of a victim in the robot’s heading. While the boundary of the scene remained the same (20 by 20
cells), the layout of the scene was automatically generated via the random sensory information to
promote learning, as the robot explored the unknown environment. Overall, a total of 1800 trials
were performed, with each trial having its own unique scene layout. Examples of the randomly
generated scenes during four trials of the simulations are shown in Figure 17.
To ensure that the MAXQ task hierarchy chooses the optimal action, the Q-values at every subtask need to converge. That is, in a given subtask i and state si, the Q-value pertaining to the optimal action must converge at a higher level, i.e. a larger numerical value, than the rest of the possible actions.
Figure 17: Simulation screenshots (a)-(d): robot exploring in randomly generated scene layouts. (Map legend: open visited, open unvisited, unknown, unknown obstacle, visited climbable obstacle, unvisited climbable obstacle, non-climbable obstacle, drop, victim, and robot location and heading.)
Simulations were performed for 1.6 million episodes to ensure the convergence of the Q-values.
Herein, one episode is considered as a single step taken by the robot, i.e. a primitive action for
the Navigate subtask. The ε-greedy exploration policy was chosen to allow sufficient exploration
and exploitation of the states and actions for proper convergence of the Q-values. The discount
factor constant was chosen to be 0.8, while the exploration policy parameter ε and the learning
rate α were both decreased at the same rate as the number of episodes increased. Both
parameters were updated according to the following relationship:
(13)
where experience is a positive number that was initialized to 1 and was incremented by 1 every
100 episodes. The experience parameter is incremented every 100 episodes to ensure that the
parameter ε slowly decreases, allowing sufficient Q-value exploration. The learning rate α also
needs to be gradually decreased to ensure that the Q-values have converged before it is
diminished to zero.
Since a total of 4288 possible Q-values exist in the MAXQ hierarchy, it would be impractical to monitor the progress of all of them. Instead, a total of 39 Q-values were selected
from all subtasks. These selected Q-values were chosen based on their importance to the search
and rescue task as well as probability of occurrence in the simulations.
4.1.1 Simulation Results
The 39 Q-values were monitored and their progress was plotted over the 1.6 million episodes.
The simulation results showed that all of the monitored Q-values at the different levels of the
hierarchy had correctly converged as expected. The convergence of the Root task is of most
importance since it defines the overall task of exploring and finding victims for the rescue robot.
For the Q-values of the Root task to converge, all of the subtasks in the hierarchy must converge
correctly. Figures 18-20 depict the progression of Q-values and their convergence at the Root
task for 3 of the monitored states.
Figure 18 depicts the Q-values for all possible actions available at the Root task. Herein, the scenario presented in the figure is one in which the robot in the simulated environment is in a state where the probability of a victim is below 0.3, meaning that the probability of a victim is too low
for the robot to consult the operator with the victim identification task. Also, the robot has no
difficulty exploring new regions of the scene and is not having any type of sensory malfunction
that would require the operator’s assistance. Thus, the optimal action is to execute the Navigate
to Unvisited Regions subtask for the robot to continue exploring the scene for more victims. As
can be seen in Figure 18, the Q-value pertaining to the Navigate to Unvisited Regions subtask
rises above the other Q-values and begins to converge after about 20,000 episodes. Note that in
this figure, the plots for the Human Control actions are overlapped at -20.0.
The reason for the Q-value fluctuation for the Navigate to Unvisited Regions subtask is that once
this subtask is executed, the reward that is received thereafter from the Navigate subtask can vary
based on the robot’s surrounding environment. Since the state definition in the Root task is
independent of those defined in the lower level hierarchy subtask, i.e. state abstraction, there will
be some fluctuation for the Q-values of parent subtasks. However, as long as the fluctuations
remain above the rest of the Q-values after convergence, a greedy policy would choose the
optimal action, i.e. subtask or primitive action.
Figure 18: Convergence of the Root Q-values when exploration is optimal.
Figure 19 depicts the Q-value for the actions available at the Root task in a different state. In this
scenario, the robot is in a state where there is no victim present. However, the robot has not been
successful in finding new unvisited cells to explore. Herein, the optimal action is for the robot to
get feedback from the operator regarding a new exploration strategy by calling the Human
Control for the Navigate to Unvisited Regions subtask. Since the Human Control subtask is a
primitive action without a child subtask of its own, the Q-value rises and converges to the
specific value of 50.0 after only a few hundred episodes and remains constant for the rest of the offline training.
Figure 19: Convergence of the Root Q-values when operator’s assistance is optimal.
The last example of Q-value progression during the offline training is depicted in Figure 20.
Herein, the robot is in a state where a victim has been detected within the cell in front of the
robot. The probability of the victim is above 0.7, meaning that the probability of the victim is
high enough for the robot to automatically tag its location. Therefore, the optimal action at the
Root task is to temporarily halt the exploration task and call Victim Identification to tag the
victim. For this given state, the Q-value pertaining to the Victim Identification subtask increases
abruptly and rises over 100 after about 15,000 episodes where it begins to converge.
In this case, the Q-value fluctuations diminish as the number of episodes increases, but are still present. The overall fluctuations, however, are smaller than those of the Q-value pertaining to the Navigate to Unvisited Regions subtask. The reason for the different fluctuation behaviors is that when the Navigate subtask is executed under Navigate to Unvisited Regions, the rewards that can be given to the system have a wider range of values, thus creating larger fluctuations. In an
exploration mode, under the Navigate to Unvisited Regions subtask, the robot can be facing
many possible scenarios with the potential existence of a variety of rubble pile classifications,
each resulting in a unique reward. On the other hand, the Navigate subtask under the Victim
Identification subtask would only move the robot forward a single cell closer to the victim, if it
were to be executed. This is due to the fact that the victim is normally positioned in front of the
robot, within the 3D sensory range, and the exploration direction is directed towards the victim.
Note that in this figure, the plots for the Human Control actions are overlapped at -20.0.
Figure 20: Convergence of the Root Q-values when victim detection is optimal.
The simulations verify the effectiveness of the developed MAXQ learning technique in training
the Q-values in all of the subtasks of the hierarchy. Figure 21 shows a plot of the robot’s
performance in fully exploring the unknown environment in the first 200 trials. As can be seen
from Figure 21, the robot’s efficiency in exploring the unknown scenes dramatically improved
during the first few consecutive trials of the offline training. The total number of exploration
steps declined for about 20 trials and became steady for the rest of the simulations. The
fluctuation in the plot is largely due to the randomly generated environment, i.e. every trial had a
unique layout.
Figure 21: The steps taken by the robot to fully explore the simulated environment in the
first 200 trials of the offline training.
4.2 Performance of Overall Control Architecture
A total of 864 simulations in cluttered USAR-like scenes were conducted to observe the
performance of the proposed MAXQ HRL technique and direction-based exploration algorithm
in exploring and navigating unknown cluttered environments. Each scene consisted of a 20 by 20
cell environment (approximately 336 m²) with individual cells having an area of 0.84 m². The
robot used in the simulations occupied only one cell at a time.
The layout of each scene consisted of a unique combination of cell categories, providing
different levels of difficulty for robot exploration. Each scene layout contained 7-8 victims
placed strategically in corners and hard to reach regions of the scene. The simulated sensory
input provided to the robot included: i) 3D depth information pertaining to the cell in front of the
robot for classification of cells as one of the five possible rubble pile categories based on the
Dxy profile; ii) distance sensing on the sides and back of the robot, indicating the presence of any
obstacles in the robot’s neighboring cells, and iii) thermal information for indicating the presence
of a victim. The performance of the MAXQ controller was evaluated based on: 1) the ability to
explore the entire scene using the exploration technique, 2) collisions (if any) with non-climbable
obstacles, and 3) the number of victims found. An initial off-line training stage for MAXQ was
implemented prior to conducting the simulations.
Herein, simulations conducted on three of these USAR-like scenes, consisting of different
complexities, are discussed. The three scenes, denoted as Scenes A, B and C, are shown in
Figure 22(a)-(c). In general, the scenes were designed to have a high density of non-climbable
obstacles, creating tight corners for the robot to navigate around in order to find potential
victims. In addition, the scene layouts included many isolated regions separated by non-
climbable obstacles in which the robot could only enter through a single open cell or by
traversing over climbable obstacles. Concave obstacle boundaries were also used in the
simulated scenes to promote changes in the robot’s exploration direction. Scene A was designed
with the largest amount of climbable and non-climbable obstacles compared to the other two
scenes. Scene B consisted of more isolated regions connected by single cell openings. Scene C
had the most difficult layout for exploration due to the sizeable rubble boundaries created in the
scene by large numbers of adjacent non-climbable cells.
Figure 22: Simulated USAR-like scene: (a)-(c) actual scenes, (d)-(f) 2D grid map developed
of the scenes during simulations with robot starting location and heading.
Four different simulations were conducted for each scene in order to determine the overall effect
of the coefficients on exploration time. In three of the simulation sets, a higher weighting (i.e., wi
= 20, for i = l, p, or q) was given to one of the three exploration coefficients, and in the fourth
set, weightings on all coefficients were set to 1. Herein, exploration time is defined as the
number of steps (i.e., the primitive navigation actions F, B, or θ) taken to explore the entire
unknown scene. Each trial was repeated 6 times and the average number of steps was
determined. The values for the coefficients utilized for these simulations are presented in Table
2. These values were chosen following the requirements outlined in Eqs. (7)-(10). The upper and
lower limits on dx were chosen to be proportional to the size of the map. As previously
mentioned, these coefficient values can be chosen by an operator in order to meet user specified
requirements for a rescue operation. To further test the performance of the navigation and
exploration modules of the MAXQ-based controller, the simulations were performed using three
different robot starting positions in each of the scenes, Figure 22.
Additional simulations were performed in order to observe the effect of taking into account the
terrain information of the environment in the exploration technique. The purpose of these
simulations was to compare the proposed algorithm to other exploration techniques which only
take into account factors such as the minimum distance to target locations and maximum
coverage of the unknown scene. To do so, the terrain information was omitted from the robot.
That is, all the cells in the 2D grid map were categorized as: 1) obstacle, 2) open space, or 3)
unknown as defined by most frontier-based exploration techniques. Furthermore, the terrain
coefficient was taken out of the exploration technique by giving it a weighting of 1 while the
other two exploration coefficients were given an equal weighting of 20. A total of 54 simulations
were performed in Scenes A, B, and C. In each scene, the robot was required to perform
the same exploration task, i.e. explore the entire scene while finding all victims, starting at three
different locations in the scenes (same as those shown in Figure 22). Each simulation was
repeated 6 times as before.
Table 2: Exploration coefficients used in the simulations
Conditions                              Sim. 1   Sim. 2   Sim. 3   Sim. 4
Cell x: open unvisited space            200      10       10       10
Cell x: unvisited climbable obstacle    100      5        5        5
Cell x: unknown obstacle                40       2        2        2
Cell k: unknown                         3        60       3        3
Cell k: open unvisited space            2        40       2        2
Cell k: unvisited climbable obstacle    1.5      30       1.5      1.5
Cell k: unknown obstacle                1        20       1        1
dx ≤ 3.0                                10       10       200      10
3.0 < dx ≤ 6.0                          5        5        100      5
6.0 < dx ≤ 9.0                          2        2        40       2
9.0 < dx ≤ 12.0                         1.5      1.5      30       1.5
* dx is obtained from the sum of v_kx as described by Equation (8).
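Scoring a single candidate exploration direction from these coefficients can be sketched as follows. Equations (7)-(10) are not reproduced here, so the lookup values are taken from the Simulation 4 (equal weighting) column of Table 2, and the additive combination of the three weighted coefficients is an assumption made only for illustration.

```python
# Baseline coefficient values (Table 2, Simulation 4 column).
TERRAIN = {"open_unvisited": 10, "unvisited_climbable": 5, "unknown_obstacle": 2}
NEIGHBOR = {"unknown": 3, "open_unvisited": 2,
            "unvisited_climbable": 1.5, "unknown_obstacle": 1}

def travel_coeff(dx):
    """Travel coefficient: nearer unexplored cells score higher."""
    if dx <= 3.0:
        return 10
    if dx <= 6.0:
        return 5
    if dx <= 9.0:
        return 2
    return 1.5   # Table 2 stops at 12.0; reused beyond that for simplicity

def direction_score(cell_x_type, neighbor_types, dx, w_l=1, w_p=1, w_q=1):
    """Assumed weighted sum of the travel (l), terrain (p) and
    neighboring cells (q) coefficients for one candidate direction."""
    p = TERRAIN.get(cell_x_type, 0)                        # terrain
    q = sum(NEIGHBOR.get(t, 0) for t in neighbor_types)    # neighboring cells
    return w_l * travel_coeff(dx) + w_p * p + w_q * q
```

Setting one of w_l, w_p or w_q to 20 while leaving the others at 1 reproduces the weighting scheme used in the four simulation sets.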
4.2.1 Simulation Results
The simulation results are shown in Figure 22(d)-(f) and Figure 23. Figure 22 shows the three
robot starting positions used in each scene layout. As depicted in Figure 22, the robot was able to
fully explore each scene from all three starting positions, providing the opportunity to find all the
victims. In addition, the robot was able to classify the different types of cells including open
cells, climbable and non-climbable obstacles. The robot was able to navigate the scenes without
any collisions. Herein, a collision is defined as moving into a non-climbable obstacle cell. The
only cells in the scene that were still unknown at the end of exploration were those fully obstructed by non-climbable cells or victim cells, as the robot was not allowed to enter these
cells.
Figure 23: Number of exploration steps for the robot in Scenes A, B and C.
In the majority of the simulations, placing a higher weighting on one of the three exploration coefficients resulted in fewer exploration steps. An analysis of variance (ANOVA) was performed on the means of the exploration steps for the four different simulation sets in the three scenes, confirming that a statistically significant difference between the means exists: F(3)=4.98,
p<0.01. Additional paired one-tail t-tests were performed under the hypothesis that placing a
higher weighting on the travel coefficient would result in a lower number of exploration steps.
The results confirmed (p<0.05) that by placing a higher weighting on the travel coefficient, the average number of steps was reduced compared to placing a higher weighting on the terrain coefficient (t(8)=3.36) or the neighboring cells coefficient (t(8)=2.44), or having equal weighting on all coefficients (t(8)=3.76). This occurred because when placing a higher weighting
on the travel coefficient, the robot would explore more cells in its surroundings before moving
on to other regions of the scene. In the cases where the weightings were the same or, either the
terrain or the neighboring cells coefficients had higher weights, the robot revisited a number of
already visited cells, therefore increasing the number of steps it took to explore an entire scene.
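The paired one-tail t statistic used above can be computed with a short stdlib-only helper. This is a sketch: the recorded step counts are not listed here, so the reported t(8) values cannot be regenerated, but with 9 paired trials per comparison the helper yields the matching 8 degrees of freedom.

```python
import math

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for samples a and b.
    A negative t supports the one-tailed hypothesis mean(a) < mean(b)."""
    d = [x - y for x, y in zip(a, b)]           # per-trial differences
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n), n - 1
```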
For Scenes A, B and C, the robot took an average of 1786 (σ = 293), 1911 (σ = 135), and 2026 (σ = 249) exploration steps to explore each scene. An additional ANOVA test was conducted to
determine whether the level of difficulty of the scene affected the number of exploration steps.
With a 95% confidence level, there was no statistically significant difference between the average number of exploration steps for these three scene layouts, emphasizing the robustness of the proposed
technique to different scene layouts. Furthermore, the results showed that the robot’s starting
location within a scene had minimal effect on the number of exploration steps needed.
In addition, a comparison study using Scenes A, B and C was performed to identify the effects of
including terrain as an exploration coefficient in determining robot exploration direction in
cluttered environments. In particular, the study compared the number of exploration steps, scene
coverage and number of victims found: i) using all three coefficients in the direction-based
exploration technique (as obtained from the aforementioned simulations), and ii) only using the
travel and neighboring cells coefficients in the direction-based exploration technique. For the
comparison, both sets of simulations placed a higher weighting on the travel coefficient as this
provided better performance in the previous simulations. In addition, the second set of simulations in the comparison study better represents the cost functions of frontier-based
exploration techniques (which determine frontier target points based on minimizing travel
distance). Simulation set 1 is based on the aforementioned simulations presented in Figure 23 for
which the higher weighting was placed on the travel coefficient. For simulation set 2, terrain
information was removed from the exploration technique, and hence, cell classification was based
on: 1) obstacle, 2) open space, or 3) unknown. The procedure for simulation set 2 was the same
as simulation set 1, namely the robot was required to perform the same exploration task, i.e.,
explore the entire scene while finding all victims, starting at the same three locations in the
scenes as previously mentioned. Each simulation was also repeated 6 times.
The results for both simulation sets are presented in Table 3 and Table 4 for easy comparison.
For the results pertaining to the percentage of scene explored, 100% represents the exploration of
all available cells to the robot (i.e., it does not include the cells that were completely obstructed
and the robot was not able to access). As previously mentioned, the robot was able to explore all
scenes and identify all victims using the proposed three coefficient based exploration method,
Table 3.
Table 3: Exploration results with terrain information.
                      Scene A                      Scene B                      Scene C
Starting Location:    (2,2)S  (3,16)N  (19,11)W    (2,2)S  (13,12)W  (5,15)E    (2,2)S  (10,8)N  (14,16)E
% of Scene Explored:  100%    100%     100%        100%    100%      100%       100%    100%     100%
Exploration Steps:    1107    1512     1631        1923    1829      1718       1745    1883     1808
Victims Found:        8/8     8/8      8/8         8/8     8/8       8/8        7/7     7/7      7/7
Table 4: Exploration results without terrain information.
                      Scene A                      Scene B                      Scene C
Starting Location:    (2,2)S  (3,16)N  (19,11)W    (2,2)S  (13,12)W  (5,15)E    (2,2)S  (10,8)N  (14,16)E
% of Scene Explored:  92.8%   92.8%    4.5%        100%    100%      100%       99.0%   99.0%    99.0%
Exploration Steps:    1896    2091     34          5519    5993      4688       2424    2595     3034
Victims Found:        7/8     7/8      1/8         8/8     8/8       8/8        7/7     7/7      7/7
As can be seen from Table 4, the robot using the direction-based exploration technique without
the terrain coefficient was not able to fully explore Scene A and as a result it was not able to find
all the victims in the environment. For starting location (19,11)W, the robot started exploring in
an isolated region of the scene, as shown in Figure 22(a), where the only way to leave/enter the
region was to traverse over climbable obstacles; however, as terrain information was not utilized
to navigate the robot, it became stuck in this region and only explored 4.5% of the scene. For
starting locations (2,2)S and (3,16)N, the robot was also not able to explore the cell regions for
which it needed to climb over climbable obstacles. However, Scene B was fully explored with all
the victims found and 99% of Scene C was explored with all victims found. In Scene C, a small
number of adjacent cells were not explored since they were obstructed by what the robot had
classified as obstacles (which were actually traversable). With respect to the number of
exploration steps, it can be seen that significantly more exploration steps were needed in
simulation set 2 for which exploration was only based on the travel and neighboring cells
coefficients than for simulation set 1 when all three coefficients were used. For example, for
Scene B, where both approaches had 100% scene coverage, the average number of exploration
steps required was 1823 for simulation set 1 and 5400 for simulation set 2. By comparing the
results from Table 3 and Table 4, it can be seen that the terrain information improved the
performance of robot exploration in these 3D cluttered environments.
4.3 Chapter Summary
Two sets of simulations were conducted to test the developed MAXQ task hierarchy and its
performance in unknown simulated environments. The first set of simulations was designed to
verify the designed MAXQ learning technique applied to the task of search and rescue, collect
Q-values and perform offline training on the system. The simulation results showed that all of
the selected Q-values converged to the optimal policy of their respective subtasks. The second
set of simulations was designed to test the performance and robustness of the overall MAXQ
controller, with a higher focus on the navigation and exploration modules of the control
architecture. The simulation results showed that the robot was able to fully explore the unknown
simulated scene in every simulation and find all the hidden victims without any collisions.
Additional simulations were performed to compare the effect of utilizing the terrain information in the controller and exploration technique. The results showed a substantial decline in the robot's performance in efficiently exploring the unknown simulated environment when the terrain information was omitted.
Chapter 5 Experiments
5.1 Experimental Set-up
Two sets of experiments were carried out to evaluate the performance of the proposed learning-
based controller for a rescue robot in a USAR-like environment. The first set of experiments
focused on evaluating the performance of the navigation and exploration modules. The second
set of experiments compared the performance of the proposed semi-autonomous controller
versus teleoperated control of the rescue robot.
The USAR-like environment was designed to mimic a disaster scene, containing different types
of objects and materials including wood, metal, plastic, brick, ceramic, concrete, paper,
cardboard, plaster, rubber-like polymers and rocks. Six victims, represented by dolls and
mannequins, were also placed in the environment. The majority of the victims were partially
obstructed by rubble in which case only a portion of their bodies were visible such as their limbs
or head. Rubble was strategically placed in the environment such that the robot would have to
explore around corners and obstacles to search for victims. Within the scene, inclines were
created to allow a robot to climb over rubble piles and navigate an elevated upper level of the
scene, Figure 24(a) and (b). In the first set of experiments, a thermal camera placed at the front
of the robot was used to identify victims. Heat pads were placed on the dolls and mannequins to
produce heat signatures to mimic human bodies, Figure 24(d). Since the focus of these
experiments was on the navigation and exploration capabilities of the system, thermal signatures
were used for easy identification of the victims. In the second set of experiments, on the other
hand, the heat pads were removed and a victim identification algorithm was implemented which
used the 3D information from the 3D mapping sensor as well as skin detection from its 2D
images to identify victims in the scene, Figure 24(e).
In both experiments, a rugged mobile robot was utilized, Figure 25. The robot was equipped with
a set of sensors including: 1) One real-time 3D mapping sensor based on a structured light
technique [36]. In the experiments, the sensing range of the sensor was set to 0.914m, i.e., the
length of one cell, in order to only capture the depth information pertaining to the cells in front of
the robot for rubble pile profile classification and updating of the 2D grid map; 2) five infrared
sensors distributed along the perimeter of the robot. The infrared sensors were used to detect the
presence of objects in the cells adjacent to the robot for both obstacle avoidance and updating of
the 2D grid map. In the first experiment, a thermal camera was also utilized to capture heat
signatures from the heat pads covering the victims.
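The role of the perimeter infrared sensors in updating the 2D grid map can be illustrated with a short sketch. This is a minimal simplification, not the implemented system: the grid indexing, cell states, and the mapping of each sensor to a (dx, dy) offset are assumptions for illustration.

```python
# Hypothetical sketch of how perimeter infrared readings could update a
# 2D occupancy grid; the sensor-to-offset mapping and cell states are
# illustrative assumptions, not the thesis implementation.
UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def update_grid(grid, robot_cell, ir_readings):
    """Mark each cell adjacent to the robot as FREE or OCCUPIED.

    ir_readings maps a (dx, dy) offset to True if the sensor on that
    side detected an object within one cell length (~0.914 m).
    """
    x, y = robot_cell
    for (dx, dy), detected in ir_readings.items():
        nx, ny = x + dx, y + dy
        if 0 <= nx < len(grid) and 0 <= ny < len(grid[0]):
            grid[nx][ny] = OCCUPIED if detected else FREE
    return grid

grid = [[UNKNOWN] * 4 for _ in range(4)]
readings = {(1, 0): True, (-1, 0): False, (0, 1): False, (0, -1): True}
update_grid(grid, (1, 1), readings)
```

Cells the sensors cannot see remain UNKNOWN, which is consistent with how unidentified cells are treated in the experimental results below.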
Figure 24: USAR-like scene: (a) one of three scenes for experiment #1, (b) scene for
experiment #2, (c) rescue robot stuck in rubble pile, (d) victim with heat pad under rubble,
in experiment #1, (e) victim on the upper level of the scene, in experiment #2, and (f) red
cushion detected as false victim.
Figure 25: Rugged rescue robot in USAR-like scene.
5.1.1 HRI User Interface
During the trials, the interface presented in Figure 26 was utilized by the operators. The interface
is based on the following sources of information: (i) 2D video provided by two wireless cameras
(one at the front and one at the back of the robot), Figure 26(a) and Figure 26 (b), (ii) an updated
3D map of the environment which is obtained by the SLAM module (readers are referred to [48]
for details regarding map reconstruction), Figure 26(c), (iii) robot status, which provides
feedback as to whether the robot has received control commands and when they have been
implemented, Figure 26(c), (iv) control menu, which allows for connecting the controller to the
interface, displaying the map information or exiting the interface, Figure 26(c), (v) infrared
sensor feedback displayed visually to show the operator the proximity of obstacles surrounding
the robot to warn if the robot is in danger, Figure 26(c), and (vi) real-time 3D information of the
robot’s frontal view obtained from the 3D mapping sensor, Figure 26(d). During the
experiments, the 2D front camera view and the real-time 3D information were continuously
shown on side-by-side monitors to the operator.
Figure 26: (a) 2D image of robot’s front view, (b) 2D image of robot’s rear view, (c) 3D
map, robot status (green indicates the wheels of the robot are moving), infrared visual
feedback on the robot’s perimeter (if objects are very close red rectangles appear), and
control menu display, and (d) real-time 3D information.
5.1.2 Robot Control
The robot was controlled through the HRI interface program. This program had wireless
communication with the robot’s onboard microcontroller, via a Bluetooth module, which in turn
executed the motor commands and sent back the updated infrared sensory data. Thus, all
commands were sent to and executed by the HRI interface program.
Generally, the robot was controlled via two modes:
1) Manual control by a human operator, in which case the input commands were taken via
an input device and received by the HRI interface.
2) Autonomous control by the MAXQ based controller, in which case the commands were
sent through an inter-process communication module between the controller and the HRI
interface program.
The following sub-sections will further discuss the operator’s input device and the inter-process
communication module.
5.1.2.1 Input Device
Operator control of the robot was achieved using an Xbox 360 wireless gamepad, Figure 27. The
gamepad includes two mini joysticks, and a set of buttons. The blue X button is used to connect
the controller to the interface, the yellow Y button is used to display the 3D map, the red B
button is used to display the rear view camera, and the green A button is used to exit the HRI
interface program. The left and right trigger buttons are used to display the robot’s status and the
control menu, respectively. The left joystick is used to move the robot forward and
backward, while the right joystick is used to turn the robot clockwise and
counterclockwise. The operator can input his/her commands via the controller to tag or reject the
presence of a victim using the left and right bumper buttons, respectively. Exploration directions
can be provided by the operator using the D-Pad button. The right and left D-Pad buttons are
used to change the robot’s exploration direction to the right or left of the robot’s current heading
and the down D-Pad button is used to choose an exploration direction directly opposite to the
robot’s current heading.
Prior to the experiments, each participant was given a tutorial on how to use the control pad and
interface, and then had a chance to control the robot outside of the scene for a duration of 10
minutes in order to familiarize him/herself with the robot.
Figure 27: Xbox 360 wireless controller used as the operator’s input device.
5.1.2.2 Inter-Process Communication
To achieve a communication channel between the robot interface program and the higher level
MAXQ based controller, a bidirectional inter-process communication stream was established via
a TCP/IP network socket. A socket provides a means of delivering and receiving data packets
between a server and a client program. The socket address, commonly shared by both the server
and client, consists of a combination of an IP address and a port number. Herein, the interface
program played the role of the server, opening up a communication thread for incoming client
calls. The MAXQ based controller, i.e. the client, called the server at every step of the process,
sending commands to the interface, and receiving the updated robot status as well as the
commands that may have been executed by the operator. The commands sent and received were
in a coded string format, i.e. a combination of letters, which were deciphered by both programs
before execution. Complete lists of the commands sent by the MAXQ based controller and
interface and their respective meanings are summarized in Table A 1-Table A 5 of Appendix A.
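The server/client exchange described above can be sketched with Python's standard socket module. The command string "MF" below is a hypothetical placeholder for one of the coded strings in Appendix A, and the single-exchange structure is a simplification of the per-step call loop.

```python
# Minimal sketch of the inter-process communication described above:
# the HRI interface acts as the TCP server, the MAXQ-based controller
# as the client. The command code "MF" is a hypothetical placeholder.
import socket
import threading

def interface_server(srv):
    """Interface side: accept one client call, decode the coded string
    command, and reply with the current robot status."""
    conn, _ = srv.accept()
    with conn:
        command = conn.recv(64).decode()
        status = "OK" if command == "MF" else "ERR"
        conn.sendall(status.encode())

# The interface program opens a listening socket for incoming calls.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))          # let the OS pick a free port
srv.listen(1)
t = threading.Thread(target=interface_server, args=(srv,))
t.start()

# The MAXQ-based controller calls the server, sends a coded command,
# and reads back the updated robot status.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect(srv.getsockname())
    cli.sendall(b"MF")
    reply = cli.recv(64).decode()
t.join()
srv.close()
```

In the actual system this exchange repeats at every step of the process, with the reply also carrying any commands executed by the operator.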
5.2 Experiment #1: Evaluation of Navigation and Exploration Modules
For the first set of experiments, three different scene layouts were used in the environments to
test the performance of the exploration and navigation modules of the controller, Figure 28(a)-
(c). Scene layout 1 contains both open spaces and non-climbable rubble piles, whereas Scene
layouts 2 and 3 also contain climbable rubble providing the robot with an opportunity to navigate
an elevated upper level of the scene. Note that in layout 2, the victim in cell (3,4) is located on
the lower level portion of the scene, i.e. hidden under the climbable rubble.
Four different experiments were conducted for each scene, with the first three experiments
having a higher weighting of 20 for one of the exploration coefficients, and the last experiment
having equal weightings of 1 for all the coefficients. In the experiments, the same coefficient
values as in Table 2 were used, with the exception that the distance limits Xn were adjusted for
the smaller physical scene to represent more realistic travelling distances. The new values were
X1=1.5, X2=3.0, X3=6.0 and X4=12.0. Similar to the simulations, the experiments for each
scene were also performed with the robot starting in three different starting positions. For each
starting position, the experiment was conducted three times.
5.2.1 Experimental Results
A few experimental depth images taken with the 3D mapping sensor in scene layouts 2 and 3
along with their corresponding Dxy arrays are depicted in Table 5. By utilizing the Dxy arrays the
robot was able to effectively categorize the simplified depth profile information as climbable
obstacle, non-climbable obstacle, or an open traversable space.
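The categorization step can be sketched with threshold rules over the heights in a Dxy array. The thresholds, array shape, and the simple max/min test below are assumptions for illustration, not the calibrated rules used in the experiments (which also distinguish downhill terrain).

```python
# Hypothetical threshold rules for categorizing a simplified depth
# profile (a Dxy array of surface heights, in metres, for the cell in
# front of the robot). All thresholds are illustrative assumptions.
CLIMB_LIMIT = 0.10   # assumed maximum height the robot can climb
DROP_LIMIT = -0.10   # assumed maximum safe step-down

def categorize_cell(dxy):
    """Return a terrain label for one cell's depth profile."""
    peak = max(max(row) for row in dxy)
    low = min(min(row) for row in dxy)
    if low < DROP_LIMIT:
        return "drop"
    if peak <= 0.02 and low >= -0.02:
        return "open space"
    if peak <= CLIMB_LIMIT:
        return "climbable obstacle"
    return "non-climbable obstacle"
```

For example, a nearly flat profile such as `[[0.0, 0.01], [0.0, 0.0]]` would be labeled open space, while a profile containing a 0.3 m peak would be labeled a non-climbable obstacle.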
The experimental results for each scene layout are presented in Figure 28(d)-(f) and Figure 29.
The results verify that the rubble pile classification correctly categorized the depth profile
information in the Dxy arrays in each experiment. The unknown cells represent cells in the scene
that could not be identified, since 3D sensory information could not be obtained from regions
obstructed by walls or rubble piles. With this accurate categorization of the different rubble
piles, the robot was able to create a correct representation of the scene. Figure 29
summarizes the average number of exploration steps for the three scene layouts and the different
robot starting locations and headings within each scene.
Table 5: Experimental depth images and depth profile arrays
Rows: depth image; Dxy array; classification (open space, non-climbable obstacle, climbable obstacle).
Figure 28: 2D representation of experimental scenes; (a)-(c) depict the original scene
layouts 1, 2, and 3; (d)-(f) show the 2D representation of the scene as determined by the
robot during the experiments.
Figure 29: Number of exploration steps for the robot in scenes 1, 2 and 3.
As can be seen in Figure 29, the weightings on the different exploration coefficients did not have
a significant effect on the exploration performance. Robot exploration, on average, took 78 (σ =
17), 84 (σ = 14) and 90 (σ = 6) steps in scene layouts 1-3, respectively. An ANOVA at the 95%
confidence level found no statistically significant difference in the average number of
exploration steps between these three scene layouts, further validating the
robustness of the technique. As can be seen in Figure 29, some of the robot's different starting
locations and headings in the scenes did affect the number of exploration steps. For example, in
Scene 1, starting location (4,6) required an average of 101 exploration steps, while starting
locations (2,2) and (5,2) only required 65 and 68 steps, respectively. In Scene 2, starting location
(3,6) required an average of 65 exploration steps while from starting locations (2,2) and (5,6) the
robot required an average of 95 and 91 exploration steps, respectively. In Scene 3, with starting
locations (2,2), (3,6) and (2,4), the robot completed exploration in 92, 96, and 82 steps,
respectively. The higher numbers of exploration steps were a result of the robot revisiting a
number of the already visited cells. Due to the small size of the overall physical scene and a
limited number of regions to explore, there were fewer exploration options for the robot when
compared to a larger more complex scene (as presented in the simulations). From these
experiments, it can be concluded that when the size of the scene is small, the effects of the
weightings are similar.
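The ANOVA comparison reported above can be reproduced in outline with a one-way F test. The per-scene step counts below are hypothetical stand-ins chosen only for illustration; they are not the recorded experimental data, and no significance conclusion should be drawn from them.

```python
# One-way ANOVA F statistic, of the kind used above to compare mean
# exploration steps across the three scene layouts. The per-scene
# step counts are hypothetical stand-ins, not the experimental data.
def f_oneway(*groups):
    """Return the F statistic for a one-way ANOVA across groups."""
    all_obs = [x for g in groups for x in g]
    grand_mean = sum(all_obs) / len(all_obs)
    # Between-group sum of squares (k - 1 degrees of freedom).
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares (N - k degrees of freedom).
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

scene1 = [65, 68, 101, 80, 76]   # hypothetical step counts
scene2 = [95, 91, 65, 85, 84]
scene3 = [92, 96, 82, 90, 90]
F = f_oneway(scene1, scene2, scene3)
```

The resulting F statistic would then be compared against the critical value of the F distribution at the chosen confidence level to decide whether the scene means differ significantly.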
5.3 Experiment #2: Teleoperated vs. Semi-Autonomous Control
The second set of experiments was conducted to verify the performance of the overall proposed
MAXQ-based semi-autonomous controller versus robot teleoperation. A new scene layout was
created for these experiments, Figure 24(b). The scene contained open space, partially obstructed
victims, non-climbable obstacles as well as climbable rubble to provide the robot with the
opportunity to explore the main lower level and also an elevated level of the scene.
Ten participants ranging in age from 19-33 years participated in the experiments. None of the
participants had any previous experience in remote robot navigation and exploration. The
operators were asked to control the robot within the scene in two different trial sets: (i) having
full teleoperated control over the robot in order to navigate and explore the scene while finding
as many victims as possible, and (ii) while the robot was in semi-autonomous mode, in which
case the MAXQ algorithm and task hierarchy were implemented, and the operator and the robot
shared tasks as needed. The following performance parameters were measured in both trials: (i)
number of victims found, (ii) percentage of the scene explored, (iii) number of robot collisions,
and (iv) total search time. For both trials, the level of difficulty traversing the scene remained the
same, however, the objects and the victims were placed in different locations to minimize human
expert knowledge of the scene.
5.3.1 Experimental Results
Figure 30-Figure 32 and Table 6 present the performance comparison results of teleoperated
control versus semi-autonomous control of the rescue robot. Figure 30 presents the percentage of
the overall rubble scene that was traversed by the robot in all ten trials. In general, during
teleoperation none of the operators were able to traverse the entire scene. Participants 4 and 7
explored the scene the most, with 88% and 94% coverage of the scene, respectively, allowing
them to find all 6 victims located in the scene. In semi-autonomous mode, the robot was able to
traverse the total area of the overall scene for all of the ten trials utilizing the proposed MAXQ
algorithm. On average approximately 4 victims were found in the teleoperated robot mode within
the trials versus all 6 in the semi-autonomous robot mode, Figure 31. False victim identification
was also made by participants 6 and 7 during teleoperation. Participant 6 defined a red cushion to
be a victim, Figure 24(f). Participant 7 also defined a red cushion and an orange plastic casing to
be a victim. Participant 8 also failed to identify a victim that was clearly displayed on the
interface via the front view camera of the robot.
Figure 30: Percentage of scene explored.
Figure 31: Number of victims identified.
Figure 32 presents the total number of collisions that occurred for both experiments. Herein, a
collision is defined as the robot unintentionally bumping into a non-climbable obstacle
(including a victim) or falling off the edge of a cliff, i.e. a drop, as classified by the rubble pile
categorization technique. As can be seen in the figure, the number of collisions was reduced in
the semi-autonomous search and rescue operation for all of the trials. It is important to note that
the few collisions that were detected in the semi-autonomous mode were a result
of the control being passed on to the operator to assist the robot in getting out of a stuck position
in the rubble. The operators would normally hit the surrounding obstacles while trying to rotate
the robot out of the stuck position. Also, some collisions occurred when the robot collided with
parts of the rubble that were not within the field of view of the sensory system. On the other
hand, during teleoperation, in addition to these types of collisions, the majority of collisions took
place even though the infrared range sensors that were presented to the operator on the interface had
detected the obstacles. In the teleoperated trials, the collisions often occurred as a result of the
operators using brute force with just the front view 2D camera to try to navigate the robot over
non-climbable rubble and through narrow passages in the rubble, instead of using all the
available sensory information, such as the infrared sensors around the robot’s perimeter or the
rear view camera to aid in obstacle avoidance. It is also interesting to note that in addition to the
object-based collisions, five out of the ten participants actually collided with one victim in the
scene. In the teleoperated control experiments, only four participants actually navigated the robot
to the upper level of the scene via an inclined climbable ramp for further exploration. Two of
these participants ended up driving the robot off of steep cliffs on the sides of the upper level
instead of navigating the robot down a gradual incline which was accessible to them. This upper
level area of the scene is shown in Figure 24(b).
Figure 32: Number of collisions.
The total time for each trial is presented in Table 6 and is defined herein as the time it took to
explore and navigate the scene, identify victims and exit the scene. The average time for
completion of each trial was determined to be 395 seconds (σ = 250) for the teleoperated mode
and 369 seconds (σ = 7) for the semi-autonomous mode. Participants 4 and 7, who covered the
largest area of the scene and found all victims, had the longest trial times at 595 seconds and 987
seconds. These trial times are significantly higher than the total exploration times that these
participants had in semi-autonomous mode to achieve the same tasks.
Table 6: Summary of trial times in teleoperated and semi-autonomous experiments
Participant 1 2 3 4 5 6 7 8 9 10
Teleoperated Trial
Times (s) 169 283 179 595 186 292 987 377 402 483
Semi-Autonomous
Trial Times (s) 366 369 376 367 364 381 370 355 373 365
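As a check, the summary statistics above follow directly from Table 6; the sketch below recomputes the means and sample standard deviations with Python's statistics module.

```python
# Recomputing the trial-time summary quoted above from Table 6.
from statistics import mean, stdev

teleop = [169, 283, 179, 595, 186, 292, 987, 377, 402, 483]
semi_auto = [366, 369, 376, 367, 364, 381, 370, 355, 373, 365]

teleop_mean, teleop_sd = mean(teleop), stdev(teleop)     # ~395 s, σ ~250
semi_mean, semi_sd = mean(semi_auto), stdev(semi_auto)   # ~369 s, σ ~7
```

The much smaller spread in the semi-autonomous trials reflects the near-constant time the robot took to explore the scene regardless of the operator.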
During the semi-autonomous trials, tasks were shared by the robot and operator. The robot
handed over control to the operator in two particular scenarios: i) when it was physically stuck in
a cell due to unsteady terrain conditions or large rubble piles on either side of the robot
obstructing the robot’s movement, and ii) when it could not identify a victim. On average these
scenarios occurred 5 times in each of the trials. In such cases the system detected that the robot
was not successfully executing the navigation commands, hence the navigation subtask was
handed over to the operator. Control was handed back to the robot once its sensors detected
that the commands executed by the operator had been performed correctly. For victim identification,
the operators were asked to either tag or reject the victim in the viewpoint of the robot using the
control pad.
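The task-sharing rule described above can be sketched as a simple mode switch: the robot yields control when its navigation commands are not being executed or a victim cannot be confirmed, and reclaims it once the operator's commands succeed. The condition checks below are hypothetical simplifications of the system's own detection logic.

```python
# Hypothetical sketch of the handover rule described above. The
# boolean condition inputs stand in for the system's own checks.
ROBOT, OPERATOR = "robot", "operator"

def update_control(mode, command_executed, victim_uncertain):
    """Return who should hold control for the next step."""
    if mode == ROBOT and (not command_executed or victim_uncertain):
        return OPERATOR   # robot is stuck or cannot classify a victim
    if mode == OPERATOR and command_executed and not victim_uncertain:
        return ROBOT      # operator's commands succeeded; hand back
    return mode
```

Keeping the handover decision in one place like this makes it easy to audit when and why control changed hands during a trial.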
With respect to the aforementioned quantitative results, it was found that semi-autonomous
control of the robot had better performance results than teleoperation. Once the experiments were
completed, the participants were also asked to complete a questionnaire reflecting their
experiences with the robot. The questionnaire consisted of questions related to their stress level
during both trials, their ability to utilize the available sensory information, and how semi-
autonomous control affected their overall experience. In regards to stress levels, 8 participants
mentioned that they felt stressed during teleoperation of the robot, while only 2 participants felt
stressed during semi-autonomous control. 8 out of the 10 participants felt that they had a more
difficult time monitoring all of the sensory information that was made available to them in the
teleoperation mode for the entire duration of the search and rescue task, and hence, 7 of them
also felt disoriented and confused about where the robot was in the scene and in which direction
they should explore during the teleoperation mode. With the robot in semi-autonomous mode, 8
of the 10 participants mentioned that they had a better overall experience compared to the
teleoperated mode. In particular, these participants found that the robot having autonomous
capabilities improved their understanding of the scene and their own decision making abilities.
The experimental and questionnaire results confirm that a semi-autonomous control architecture
utilizing a MAXQ learning approach for rescue robots has the potential of improving the
performance of the search and rescue mission in obstacle avoidance, exploration and victim
identification.
5.4 Chapter Summary
The first set of experiments was designed to test the performance of the navigation and
exploration modules of the controller. The experimental results verified the ability of the
robot to navigate the rubble pile and perform obstacle avoidance, as well as to fully explore all three
experimental USAR-like scenes. The second set of experiments compared the performance of
the proposed semi-autonomous control architecture versus full teleoperation. The experimental
results showed an overall improvement in exploration, victim identification and navigation of the
human-robot rescue team. In addition, the majority of the participants agreed that semi-
autonomous control of the robot was an overall better experience compared to teleoperated
control.
Chapter 6 Conclusion
6.1 Summary of Contributions
6.1.1 Semi-Autonomous Control Architecture for Rescue Robots
A novel hierarchical learning-based semi-autonomous control architecture was developed for
rescue robots in cluttered USAR environments. The proposed control architecture allows the
robot to learn and make decisions regarding which tasks should be carried out at a given time
and whether the human or the robot should perform these tasks for optimum results. By giving
the robot the decision making ability to decide when human intervention is required, the human
operator can take advantage of the robot’s ability to continuously learn from its surrounding
environment. Robot sensors in the control architecture include: (i) infrared range sensors on the
perimeter of the robot, (ii) a 3D sensory system providing 3D data for navigation of a rescue
robot, victim identification and mapping of an unknown environment, (iii) 2D visual feedback
for the HRI interface, and (iv) a thermal camera providing heat signature from the surrounding
environment. The SLAM module uses the 2D and 3D images provided by the 3D sensory system
to map the environment. The victim identification module uses the thermal camera and both 2D
and 3D images to detect partially obstructed victims in the robot’s view. The HRI interface
provides the operator with the sensory feedback and the ability to monitor and control the robot.
The deliberation module provides the control architecture with the decision making ability
needed to navigate, explore and perform victim identification. The deliberation module also
decides which tasks should be carried out by the human operator and robot. Finally, the robot
actuators execute the commands issued by the deliberation module.
6.1.2 MAXQ Learning Technique Applied to the Semi-Autonomous Control Architecture
A MAXQ learning technique was developed for the decision making module of the semi-
autonomous controller. In particular, the MAXQ control algorithm allows a rescue robot to learn
and make decisions regarding navigation, exploration and victim identification in a disaster
scene. Furthermore, the designed MAXQ based control architecture decomposes the entire task
of search and rescue into a hierarchy of individual subtasks including: Navigate to Unvisited
Regions for search and exploration, Victim Identification for identifying and tagging the location
of victims, and Navigate for obstacle avoidance and local exploration of the robot’s
surroundings. Each subtask is able to learn the optimal action for its respective module
separately, allowing the control architecture to learn more efficiently than traditional RL based
controllers.
6.1.2.1 Rubble Pile Categorization Technique
A rubble pile categorization technique was developed for the navigation module of the control
architecture. This technique utilizes the 3D information of the rubble pile to effectively classify
it into open traversable space, non-climbable obstacles, climbable obstacles, downhill terrain,
and drops. The detailed classification can help a rescue robot navigate a cluttered environment
more efficiently by identifying climbable obstacles, and avoid dangerous situations such as
unexpected drops and small voids in the rubble that may not be detected by range sensors.
6.1.2.2 Direction-Based Exploration Technique
A direction-based exploration method was designed and integrated within the exploration
module of the control architecture to expand the search area of a rescue robot. Namely, the
occupancy cells of the known grid map are given priority based on their distance to the robot,
coverage potential, and terrain information to effectively choose a global direction for the robot
to explore. The exploration technique also utilizes an obstacle avoidance algorithm which guides
the robot around the enclosed concave obstacle boundaries. The algorithm detects the region
containing the concave obstacle in its path of exploration and temporarily changes the
exploration direction for an efficient avoidance of the obstacles.
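The cell-priority idea can be illustrated with a weighted score over candidate cells. The weights and scoring terms below mirror the travel-distance, coverage, and terrain factors in spirit, but are illustrative assumptions rather than the coefficients used in the thesis.

```python
# Illustrative weighted scoring of candidate exploration cells, in the
# spirit of the direction-based technique above. The weights and
# scoring terms are assumptions, not the thesis coefficients.
def cell_priority(distance, coverage, terrain_cost,
                  w_travel=1.0, w_coverage=1.0, w_terrain=1.0):
    """Higher score = better cell: close by, high coverage, easy terrain."""
    return (w_coverage * coverage
            - w_travel * distance
            - w_terrain * terrain_cost)

def choose_direction(candidates, **weights):
    """Pick the candidate cell with the highest priority score.

    candidates maps a cell (x, y) to (distance, coverage, terrain_cost).
    """
    return max(candidates,
               key=lambda c: cell_priority(*candidates[c], **weights))

cells = {(2, 2): (1.5, 3, 0.0),   # close, modest coverage, open terrain
         (5, 6): (6.0, 9, 0.5),   # far, but a large unexplored region
         (4, 1): (3.0, 4, 2.0)}   # moderate, but costly climbable rubble
best = choose_direction(cells)
```

Raising one weight, e.g. `choose_direction(cells, w_travel=20.0)`, shifts the choice toward nearby cells, which parallels the weighting experiments reported in Section 5.2.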
6.1.3 Simulations
A simulation environment was developed to test the performance of the MAXQ control
architecture. A randomly generated simulation environment was designed to perform offline
training for the MAXQ controller as well as to monitor some of the Q-values in all subtasks for
convergence. Moreover, multiple simulated scenes were designed to examine the robustness of
the navigation and exploration modules of the controller in larger and more complex
environments. The simulation results showed that the robot was able to successfully explore an
unknown scene and find all hidden victims.
6.1.4 Implementation
Experiments were conducted by implementing the proposed controller with a rescue robot in a
USAR-like scene. A rubble filled environment was created to test the ability of a rescue robot
utilizing the semi-autonomous controller in navigating the rubble pile and exploring an unknown
cluttered environment. The experimental results showed that the rescue robot was able to
successfully navigate in the rubble pile by correctly identifying open traversable space,
climbable and non-climbable obstacles and drops. Moreover, the rescue robot was able to fully
explore the cluttered environment and find all the victims. Additional experiments compared
traditional teleoperated control of the rescue robot to the proposed semi-autonomous control
architecture. The results showed a significant improvement in the performance of the operator
and robot in navigating, exploring, and finding victims.
6.2 Discussion of Future Work
The proposed MAXQ-based semi-autonomous control architecture provides a unique approach
to improving the decision making and task sharing capabilities of semi-autonomous control
schemes for rescue robots. However, more extensive experiments in large-scale environments
can help further test and improve the robustness of the MAXQ-based semi-autonomous control
architecture. Due to the limited laboratory space, the direction-based exploration technique could
not be put to extensive testing in a large-scale rubble filled arena. Furthermore, the rubble pile
categorization technique and the navigation module can be tested with more varieties of rubble
pile shapes, sizes, and surface conditions (e.g. more climbable and non-climbable rubble piles).
More extensive experiments can also help improve the partnership between human and robot
control for increased performance in the overall search and rescue mission.
Furthermore, improving the sensor capabilities can greatly enhance the overall performance of
the semi-autonomous controller. For instance, improving the accuracy of the 3D mapping sensor
can reduce uncertainties in the 3D map, victim identification as well as the rubble pile
categorization technique. Increasing the sensing range of the 3D mapping sensor can also assist
in the victim identification module to identify victims that are located farther away from the
robot. Moreover, a more compact 3D mapping sensor can reduce the overall weight of the robot
and allow the robot to enter smaller voids, both leading to better navigation and exploration
capabilities of the robot. Lastly, more infrared sensors can provide the controller with a wider
field of view regarding the robot’s surrounding obstacles.
6.3 Final Concluding Statement
Overall, the proposed HRL-based semi-autonomous control architecture is the first controller of
its kind for rescue robots. Its unique design allows the robot to utilize the MAXQ hierarchical
learning technique to decompose the tasks of search and rescue operations, and learn the optimal
action for each individual subtask in order to improve its overall decision making. Moreover,
the semi-autonomous control scheme keeps the human operator in the control loop, which
greatly enhances the robustness of the controller. It is envisioned that learning-based semi-
autonomous control schemes can significantly enhance the ability of rescue robots by allowing
them to continuously learn from the environment.
References
[1] J. Casper, M. Micire, R. Murphy, “Issues in Intelligent Robots for Search and Rescue,”
Proceedings of SPIE, vol. 4024, pp. 292-302, 2000.
[2] J. Casper, R. Murphy, "Human-Robot Interactions during the Robot-Assisted Urban Search
and Rescue Response at the World Trade Center," IEEE Transactions on Systems, Man, and
Cybernetics-Part B, vol. 33, no. 3, pp. 367-385, 2003.
[3] R. Murphy, “Activities of the Rescue Robots at the World Trade Center from 11–21
September 2001,” IEEE Robotics & Automation Magazine, pp. 50-61, 2004.
[4] J. Burke, R. Murphy, "Human-Robot Interaction in USAR Technical Search: Two Heads Are
Better Than One," IEEE International Workshop on Robot and Human Interactive
Communication, Kurashiki, Japan, 2004, pp. 307-312.
[5] J. Casper, R. Murphy, “Workflow study on human-robot interaction in USAR”, IEEE Int.
Conf. on Robotics and Automation, pp. 1997-2003, 2002.
[6] D. D. Woods, J. Tittle, M. Feil, A. Roesler, “Envisioning Human–Robot Coordination in
Future Operations,” IEEE Transactions on Systems, Man, and Cybernetics Part C,
Application and Reviews, vol. 34, no. 2, 2004.
[7] J. Burke, R. Murphy, M. Coovert, and D. Riddle, “Moonlight in Miami: A field study of
human-robot interaction in the context of an urban search and rescue disaster response
training exercise,” Special Issue of Human-Computer Interaction, Vol. 19, pp. 85-116, 2004.
[8] J. L. Drury, B. Keyes, H. A. Yanco, "LASSOing HRI: Analyzing Situation Awareness in
Map-Centric and Video-Centric Interfaces”, ACM/IEEE Int. Conf. on Human-Robot
Interaction, pp. 279-286, 2007.
[9] D. J. Bruemmer et al., “I Call Shotgun!: An Evaluation of Mixed-Initiative Control for
Novice Users of a Search and Rescue Robot,” IEEE International Conference on Systems,
Man and Cybernetics, The Hague, Netherlands, 2004, pp. 2847-2852.
[10] R. L. Murphy and J. J. Sprouse, “Strategies for Searching an Area with Semi-Autonomous
Mobile Robots,” Proceedings of Robotics for Challenging Environments, Albuquerque, NM,
1996, pp. 15-21.
[11] R. Wegner and J. Anderson, “Balancing Robotic Teleoperation and Autonomy for Urban
Search and Rescue Environments,” Advances in Artificial Intelligence, A. Y. Tawfik, S. D.
Goodwin, Ed. Berlin / Heidelberg, Germany: Springer, 2004, vol. 3060, pp. 16-30.
[12] A. Finzi and A. Orlandini, “Human-Robot Interaction Through Mixed-Initiative Planning for
Rescue and Search Rovers,” Advances in Artificial Intelligence, vol. 3673, S. Bandini, S.
Manzoni, Ed. Berlin / Heidelberg, Germany: Springer, 2005, pp. 483-494.
[13] R. R. Murphy, S. Tadokoro, D. Nardi, A. Jacoff, P. Fiorini, H. Choset, and A. M. Erkmen,
“Search and Rescue Robotics,” in Springer Handbook of Robotics, B. Siciliano and O.
Khatib, Ed. Berlin: Springer, 2008, pp. 1151-1173.
[14] E. Guizzo, “Japan Earthquake: Robots Help Search For Survivors,” Internet:
http://spectrum.ieee.org/automaton/robotics/industrial-robots/japan-earthquake-robots-help-
search-for-survivors, Mar. 13, 2011 [Sep. 14, 2011].
74
[15] A. Birk, K. Pathak, S. Schwertfeger, and W. Chonnaparamutt, “The IUB Rugbot: An
Intelligent, Rugged Mobile Robot for Search and Rescue Operations,” IEEE Intl. Workshop
on Safety, Security, and Rescue Robotics (SSRR), Los Alamitos, 2006.
[16] M. Carlson, C. Christiansen, A. Persson, E. Jonsson, M. Seeman, and H. Vindahl,
“RobotCupRescue – Robot League Team RFC Uppsala, Sweden,” RoboCup 2004 Rescue
Robot League Competition, Lisbon, Portugal, 2004.
[17] M. W. Kadous, R. K.-M. Sheh, C. Sammut, “CASTER: A robot for Urban Search and
Rescue,” Australasian Conference on Robotics and Automation, Sydney, Australia, 2005, pp.
1-10.
[18] J. W. Crandall, M. A. Goodrich, “Characterizing the Efficiency of Human Robot Interaction:
A Case Study of Shared-Control Teleportation”, IEEE/RSJ Intl. Conference on Intelligent
Robots and Systems, Lausanne, Switzerland, 2002, pp. 1290-1295.
[19] J. Shen, J. Ibanez-Guzman, T. C. Ng, B. S. Chew, “A Collaborative-Shared Control System
with Safe Obstacle Avoidance Capability”, Conference on Robotics, Automation and
Mechatronics, Singapore, 2004, pp. 119-123.
[20] X. Perrin, R. Chavarriaga, F. Colas, R. Siegward, J. R. Millan, “Brain-coupled interaction for
semi-autonomous navigation of an assistive robot,” Robotics and Autonomous Systems, vol.
58, no. 12, pp. 1246-1255, 2010.
[21] C. K. Law, Y. Xu, “Shared Control for Navigation and Balance of a Dynamically Stable
Robot,” International Conference on Robotics & Automation, Washington, DC, 2002, pp.
1985-1990.
[22] D. A. Few, D. J. Bruemmer, M. C. Walton, “Improved Human-Robot Teaming through
Facilitated Initiative,” 15th IEEE International Symposium on Robot and Human Interactive
Communication (RO-MAN06), Hatfield, UK, 2006, pp. 171-176.
[23] J. L. Marble, D. J. Bruemmer, D. A. Few, “Lessons Learned from Usability Tests with a
Collaborative Cognitive Workspace for Human-Robot Teams,” IEEE International
Conference on Systems, Man and Cybernetics, vol. 1, pp. 448-453, 2003.
[24] H. Tang, X. Cao, A. Song, Y. Gua, J. Bao, “Human-Robot Collaborative Teleoperation
System for Semi-Autonomous Reconnaissance Robot,” IEEE International Conference on
Mechatronics and Automation, Changchun, China, 2009, pp. 1934-1939.
[25] P. Rishikesh, L. Janardhanan, E. M. Joo, “Goal Seeking of Mobile Robots Using Dynamic
Fuzzy Q-Learning,” Journal of The Institution of Engineers, vol. 45, no. 5, pp. 62-76, 2005.
[26] H. Yu, S. Liu, J. Liu, “A New Navigation Method Based on Reinforcement Learning and
Rough Sets,” International Conference on Machine Learning and Cybernetics, Kunming,
China, 2008, pp. 1093-1098.
[27] B. Huang, G. Cao, M. Guo, “Reinforcement Learning Neural Networks to the Problem of
Autonomous Mobile Robot Obstacle Avoidance,” Fourth International Conference on
Machine Learning and Cybernetics, Guangzhou, China, 2005, pp. 85-89.
[28] R. Makar, S. Mahadevan, M. Ghavamzadeh, “Hierarchical Multi-Agent Reinforcement
Learning," International Conference on Autonomous Agents, Montreal, Canada, 2001, pp.
246-253.
75
[29] L.A. Jeni, Z. Istenes, P. Szemes, H. Hashimoto, “Robot Navigation Framework Based on
Reinforcement Learning for Intelligent Space,” Conference on Human System Interactions,
Krakow, Poland, 2008, pp. 761-766.
[30] P. Stone, R. S. Sutton, “Scaling Reinforcement Learning Toward RoboCup Soccer,” The
Eighteenth International Conference on Machine Learning, Williamstown, MA, 2001, pp.
537-544.
[31] C. J. C. H. Watkins, “Learning from Delayed Rewards,” Ph.D. dissertation, Cambridge
University, Cambridge, England, 1989.
[32] R. S. Sutton, D. Precup, and S. Singh. “Between MDPs and Semi-MDPs: A Framework for
Temporal Abstraction in Reinforcement Learning,” Artificial Intelligence, vol. 112, no. 1-2,
pp. 181-211, 1999.
[33] R. Parr and S. Russell, “Reinforcement learning with hierarchies of machines,” Conference
on Advances in Neural Information Processing Systems 10, Denver, CO, 1998, pp. 1043-
1049.
[34] T. G. Dietterich, “Hierarchical Reinforcement Learning with MAXQ Value Function
Decomposition,” Journal of Artificial Intelligence Research, vol. 13, pp. 227-303, 2000.
[35] S. Singh, T. Jaakkola, M. L. Littman, C. Szepesvari, “Convergence Results for Single-Step
On-Policy Reinforcement-Learning Algorithms”, Machine Learning, 2000.
[36] Z. Zhang, H. Guo, G. Nejat, and P. Huang, “Finding Disaster Victims: A Sensory System for
Robot Assisted 3D Mapping of Urban Search and Rescue Environments,” IEEE, Int. Conf.
on Robotics and Automation, Rome, pp. 3889-3894, 2007.
[37] Z. Zhang and G. Nejat, “Robot-Assisted Intelligent 3D Mapping of Unknown Cluttered
Search and Rescue Environments,” IEEE International Conference on Intelligent Robots and
Systems, pp. 2115-2120, 2008.
[38] Z. Zhang and G. Nejat, “Robot-Assisted 3D Mapping of Search and Rescue Environments,”
ASME Design Engineering Technical Conferences & Computers and Information in
Engineering Conference, DECT2008-50012, pp. 1295-1304, 2008.
[39] Z. Zhang and G. Nejat, “Robot-Assisted 3D Mapping of Unknown Cluttered USAR
Environments,” the Disaster Robots Special Issue, Advanced Robotics Journal, vol. 23, pp.
1179–1198, 2009.
[40] W. G. Louie, “Victim Identification in Urban Search and Rescue Environments.” B.A.Sc.
thesis, University of Toronto, Canada, 2011.
[41] R. A. Howard, Dynamic Programming and Markov Processes. Cambridge, MA: The M.I.T.
Press, 1960.
[42] B. Yamauchi, “A Frontier-Based Approach for Autonomous Exploration,” IEEE
International Symposium on Computational Intelligence in Robotics and Automation,
Monterey, CA, 1997, pp. 146-151.
[43] W. Burgard, M. Moors, C. Stachniss, F. Schneider, “Coordinated Multi-Robot Exploration,”
IEEE Transactions on Robotics, vol. 21, no. 3, pp. 376-386, 2005.
[44] A. Haumann, K. Listmann and V. Willert, “DisCoverage: A new Paradigm for Multi-Robot
Exploration,” IEEE Int. Conf. on Robotics and Automation, Anchorage, Alaska, 2010, pp.
929-934.
76
[45] A. Howard, M. Mataric, and S. Sukhatme, “An Incremental Deployment Algorithm for
Mobile Robot Teams,” IEEE Int. Conf. Intell. Robot. Syst., Lausanne, Switzerland, 2002, pp.
2849–2854.
[46] X. Yun and K. C. Tan, “A Wall-Following Method for Escaping Local Minima in Potential
Field Based Motion Planning,” IEEE International Conference on Advanced Robotics,
Monterey, CA, 1997, pp. 421-426.
[47] P. J. Schneider, D. H. Eberly, Geometric Tools for Computer Graphics, San Francisco, CA:
Morgan Kaufmann, 2003.
[48] B. Mobedi and G. Nejat, "3D Active Sensing in Time-Critical Urban Search and Rescue
Missions," IEEE/ASME Transactions on Mechatronics, DOI
10.1109/TMECH.2011.2159388.
77
Appendices
Appendix A
Table A 1: List of commands and their meanings

Subtask                        Command   Meaning
Navigate                       n0        Only provide the robot's status without executing any actions.
Navigate                       n1        Ask the operator for a navigation input (e.g., forward) and provide the robot's updated status after the command is executed.
Navigate                       nf        Move the robot forward and provide the robot's updated status after the command is executed.
Navigate                       nb        Move the robot backward and provide the robot's updated status after the command is executed.
Navigate                       nr        Turn the robot to the right and provide the robot's updated status after the command is executed.
Navigate                       nl        Turn the robot to the left and provide the robot's updated status after the command is executed.
Victim Identification          v         Ask the operator to tag/reject a victim in the robot's view or to move the robot closer to the victim, and send back the operator's input command.
Navigate to Unvisited Regions  e         Allow the operator to change the robot's exploration direction or terminate the exploration task, and send back the operator's input command (if any). If no input is received from the operator, the robot continues exploration in the same heading.
Table A 2: List of response formats for each command

Subtask                        Command   Response Format
Navigate                       n0        B0B1B2B3B4B5
Navigate                       n1        B0B1B2B3B4B5Rn
Navigate                       nf        B0B1B2B3B4B5
Navigate                       nb        B0B1B2B3B4B5
Navigate                       nr        B0B1B2B3B4B5
Navigate                       nl        B0B1B2B3B4B5
Victim Identification          v         Rv
Navigate to Unvisited Regions  e         Re
Table A 3: List of response values and their meanings for the Navigate subtask

Response  Value  Meaning
B0        0      Robot is not stuck. The command was executed successfully.
B0        1      Robot is stuck. The command was not executed.
B1        0      There is no obstacle within the range of the back sensor.
B1        1      There is an obstacle within the range of the back sensor.
B2        0      There is no obstacle within the range of the front right sensor.
B2        1      There is an obstacle within the range of the front right sensor.
B3        0      There is no obstacle within the range of the front left sensor.
B3        1      There is an obstacle within the range of the front left sensor.
B4        0      There is no obstacle within the range of the back right sensor.
B4        1      There is an obstacle within the range of the back right sensor.
B5        0      There is no obstacle within the range of the back left sensor.
B5        1      There is an obstacle within the range of the back left sensor.
Rn        f      Robot should move forward one cell.
Rn        b      Robot should move backward one cell.
Rn        r      Robot should turn right.
Rn        l      Robot should turn left.
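The thesis does not provide an implementation of this response format; the following minimal Python sketch (all function and variable names are hypothetical, not from the thesis) shows one way an operator-side client could decode a Navigate response of the form B0B1B2B3B4B5, with the optional trailing operator command Rn that the n1 command returns (see Tables A 2 and A 3).

```python
# Hypothetical decoder for the Navigate subtask response string.
# Flag order follows Table A 3: B0 = stuck flag, B1..B5 = obstacle flags.
SENSOR_FLAGS = ["stuck", "back", "front_right", "front_left",
                "back_right", "back_left"]

def parse_navigate_response(response: str) -> dict:
    """Decode 'B0B1B2B3B4B5' plus an optional operator command Rn
    (f, b, r or l), returning each flag as a boolean."""
    bits, rest = response[:6], response[6:]
    if len(bits) != 6 or any(b not in "01" for b in bits):
        raise ValueError("malformed status string: %r" % response)
    status = {name: bit == "1" for name, bit in zip(SENSOR_FLAGS, bits)}
    # Rn is present only in responses to the n1 command.
    status["operator_command"] = rest or None
    return status
```

For example, the response "100101f" would decode to a stuck robot with obstacles at the front left and back left sensors, and an operator request to move forward.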
Table A 4: List of response values and their meanings for the Victim Identification subtask

Response  Value  Meaning
Rv        t      Operator identified a victim in the robot's view and tagged it.
Rv        r      Operator dismissed the possibility of a victim in the robot's view.
Rv        n      Operator wants to move the robot closer to the victim.
Table A 5: List of response values and their meanings for the Navigate to Unvisited Regions subtask

Response  Value  Meaning
Re        o      No input was received. The operator wants to continue exploration in the same heading.
Re        r      The operator wants to explore the region to the right of the robot's heading.
Re        l      The operator wants to explore the region to the left of the robot's heading.
Re        b      The operator wants to explore the region to the back of the robot's heading.
Re        e      The operator wants to finish the exploration and exit the USAR scene.
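Taken together, Tables A 1–A 5 define a small command/response protocol between the operator interface and the robot. As a minimal sketch only (the mappings come from the tables above, but the function and variable names are hypothetical), the command-to-subtask routing could be expressed as:

```python
# Hypothetical routing of operator commands to subtasks, per Tables A 1-A 2.
COMMAND_SUBTASK = {
    "n0": "Navigate", "n1": "Navigate", "nf": "Navigate",
    "nb": "Navigate", "nr": "Navigate", "nl": "Navigate",
    "v": "Victim Identification",
    "e": "Navigate to Unvisited Regions",
}

# Exploration responses Re, per Table A 5.
EXPLORE_RESPONSES = {
    "o": "continue in the same heading",
    "r": "explore to the right of the heading",
    "l": "explore to the left of the heading",
    "b": "explore behind the heading",
    "e": "finish exploration and exit the scene",
}

def subtask_for(command: str) -> str:
    """Return the subtask responsible for handling an operator command."""
    try:
        return COMMAND_SUBTASK[command]
    except KeyError:
        raise ValueError("unknown command: %r" % command)
```

A dispatcher built this way rejects malformed commands up front, so each subtask handler only ever sees the commands listed for it in Table A 1.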
Appendix B
List of Publications Generated from this Thesis:
[1] B. Doroodgar, M. Ficocelli, B. Mobedi and G. Nejat, “The Search for Survivors: Cooperative
Human-Robot Interaction in Search and Rescue Environments using Semi-Autonomous
Robots,” IEEE Int. Conf. on Robotics and Automation, Anchorage, Alaska, 2010, pp. 2858-
2863.
[2] B. Doroodgar and G. Nejat, “A Hierarchical Reinforcement Learning Based Control
Architecture for Semi-Autonomous Rescue Robots in Cluttered Environments,” IEEE Conf.
on Automation Science and Engineering, Toronto, Canada, 2010, pp. 948-953.