A Learning-Based Semi-Autonomous Control Architecture for Robotic Exploration of Search and Rescue
Environments
by
Barzin Doroodgar
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Department of Mechanical and Industrial Engineering
University of Toronto
© Copyright by Barzin Doroodgar 2011
A Learning-Based Semi-Autonomous Control Architecture for
Robotic Exploration in Search and Rescue Environments
Barzin Doroodgar
Master of Applied Science
Department of Mechanical and Industrial Engineering
2011
Abstract
Semi-autonomous control schemes can address the limitations of both teleoperation and fully
autonomous robotic control of rescue robots in disaster environments by allowing cooperation
and task sharing between a human operator and a robot with respect to tasks such as navigation,
exploration and victim identification. Herein, a unique hierarchical reinforcement learning
(HRL)-based semi-autonomous control architecture is presented for rescue robots operating in
unknown and cluttered urban search and rescue (USAR) environments. The aim of the controller
is to allow a rescue robot to continuously learn from its own experiences in an environment in
order to improve its overall performance in exploration of unknown disaster scenes. A new
direction-based exploration technique and a rubble pile categorization technique are integrated
into the control architecture for exploration of unknown, rubble-filled environments. Both
simulations and physical experiments in USAR-like environments verify the robustness of the
proposed control architecture.
Acknowledgments
I would like to thank my supervisor, Prof. Goldie Nejat for her support and constructive input
and guidance throughout this project. I would also like to thank my M.A.Sc. thesis committee for
their time and valuable feedback. In addition, I would like to thank Babak Mobedi for his
development of the 3D sensory system, Onome Igharoro for his development of the HRI
interface, Geoffrey Louie for his development of the victim detection algorithm, and Mina
Salama, John Qi, and Edward Lin for their work on the low-level navigation control of the rescue
robot. Without the aid of these colleagues, the implementation of my project would not have
been possible. Last but not least, I would like to thank my parents and all my friends and lab mates for
their love and support throughout my two years of graduate school.
Table of Contents
Acknowledgments .......................................................................................................................... iii
Table of Contents ........................................................................................................................... iv
List of Tables ................................................................................................................................ vii
List of Figures .............................................................................................................................. viii
List of Appendices ......................................................................................................................... xi
Chapter 1 Introduction .................................................................................................................... 1
1.1 Motivation ........................................................................................................................... 1
1.2 Literature Review ................................................................................................................ 2
1.3 Problem Definition .............................................................................................................. 4
1.4 Proposed Methodology and Tasks ...................................................................................... 5
1.4.1 Literature Review .................................................................................................... 5
1.4.2 Control Architecture and MAXQ-Based Robot Decision Making ......................... 5
1.4.3 Simulations ............................................................................................................. 6
1.4.4 Implementation ....................................................................................................... 6
1.4.5 Conclusion .............................................................................................................. 7
Chapter 2 Literature Review ........................................................................................................... 8
2.1 Semi-Autonomous Control ................................................................................................. 8
2.2 Reinforcement Learning for Robot Control ...................................................................... 10
2.3 Chapter Summary ............................................................................................................. 13
Chapter 3 Proposed Learning-Based Semi-Autonomous Control Architecture ........................... 14
3.1 Semi-Autonomous Control Architecture .......................................................................... 14
3.2 MAXQ-Based Robot Decision Making ............................................................................ 15
3.2.1 MAXQ Learning Technique ................................................................................. 15
3.2.2 Proposed MAXQ Task Graph ............................................................................... 19
3.3 Chapter Summary ............................................................................................................. 42
Chapter 4 Simulations ................................................................................................................... 43
4.1 MAXQ Learning and Q-Value Convergence ................................................................... 43
4.1.1 Simulations Results ............................................................................................... 45
4.2 Performance of Overall Control Architecture ................................................................... 49
4.2.1 Simulations Results ............................................................................................... 52
4.3 Chapter Summary ............................................................................................................. 55
Chapter 5 Experiments .................................................................................................................. 56
5.1 Experimental Set-up .......................................................................................................... 56
5.1.1 HRI User Interface ................................................................................................ 58
5.1.2 Robot Control ........................................................................................................ 59
5.2 Experiment #1: Evaluation of Navigation and Exploration Modules ............................... 61
5.2.1 Experimental Results ............................................................................................ 61
5.3 Experiment #2: Teleoperated vs. Semi-Autonomous Control .......................................... 64
5.3.1 Experimental Results ............................................................................................ 64
5.4 Chapter Summary ............................................................................................................. 68
Chapter 6 Conclusion .................................................................................................................... 69
6.1 Summary of Contributions ................................................................................................ 69
6.1.1 Semi-Autonomous Control Architecture for Rescue Robots ................................ 69
6.1.2 MAXQ Learning Technique Applied to the Semi-Autonomous Control
Architecture ........................................................................................................... 69
6.1.3 Simulations ........................................................................................................... 70
6.1.4 Implementation ..................................................................................................... 71
6.2 Discussion of Future Work ............................................................................................... 71
6.3 Final Concluding Statement .............................................................................................. 72
References ..................................................................................................................................... 73
Appendices .................................................................................................................................... 77
List of Tables
Table 1: Examples of calibrated depth images ............................................................................. 40
Table 2: Exploration coefficients used in the simulations ............................................................ 51
Table 3: Exploration results with terrain information. .................................................................. 54
Table 4: Exploration results without terrain information. ............................................................ 54
Table 5: Experimental depth images and depth profile arrays ..................................................... 62
Table 6: Summary of trial times in teleoperated and semi-autonomous experiments .................. 67
List of Figures
Figure 1: Semi-autonomous control architecture. ......................................................................... 15
Figure 2: MAXQ task graph for a semi-autonomous USAR robot. ............................................. 20
Figure 3: 2D grid map of the four regions surrounding the robot. ............................................... 23
Figure 4: Scenario illustrating the contribution of the exploration coefficient, λx. ...................... 26
Figure 5: Examples of concave obstacles. .................................................................................... 27
Figure 6: Scenarios illustrating the concave obstacle detection technique: (a) Not identified as a
concave obstacle, (b) identified as a concave obstacle. ................................................................ 29
Figure 7: Scenario illustrating the concave obstacle avoidance technique. .................................. 30
Figure 8: Comparison of the robot’s path to avoid a concave obstacle: (a) path taken with a wall-
following technique, (b) path taken using the proposed concave obstacle avoidance technique. 31
Figure 9: Grid representation of the local environment surrounding a robot (the robot’s forward
direction is shown towards cell C2). ............................................................................................. 32
Figure 10: 2D Dxy array for rubble pile profile classification. The cell shading becomes darker
as the cell depth values increase. .................................................................................................. 34
Figure 11: Scenarios illustrating the rubble pile classifications: (a) open space, (b) non-climbable
obstacle, (c) uphill climbable obstacle, (d) downhill, and (e) drop. ............................................. 34
Figure 12: Contribution of the overall slope of the depth profile to rubble pile classification: (a)
open space, (b) non-climbable obstacle, (c) uphill climbable obstacle, (d) downhill. ................. 35
Figure 13: Contribution of the smoothness parameter, R, to the rubble pile classification: (a)
small R indicates a smooth traversable terrain, (b) large R indicates a rough non-traversable
terrain. ........................................................................................................................................... 36
Figure 14: Comparison of traversable open space with a drop in the terrain: (a) open space, (b)
drop where surface is observable in 3D data, (c) drop where surface is not observable in 3D data.
....................................................................................................................................................... 37
Figure 15: Example of using the position of 3D data within the Dxy array to detect non-
traversable small voids in the rubble. ........................................................................................... 37
Figure 16: Calibration plot showing the relationship between the plane parameter B and angle of
an incline. ...................................................................................................................................... 39
Figure 17: Simulation screenshots: robot exploring in randomly generated scene layouts. ......... 44
Figure 18: Convergence of the Root Q-values when exploration is optimal. ............................... 46
Figure 19: Convergence of the Root Q-values when operator’s assistance is optimal. ................ 47
Figure 20: Convergence of the Root Q-values when victim detection is optimal. ....................... 48
Figure 21: The steps taken by the robot to fully explore the simulated environment in the first
200 trials of the offline training. ................................................................................................... 48
Figure 22: Simulated USAR-like scene: (a)-(c) actual scenes, (d)-(f) 2D grid map developed of
the scenes during simulations with robot starting location and heading. ..................................... 50
Figure 23: Number of exploration steps for the robot in Scenes A, B and C. .............................. 52
Figure 24: USAR-like scene: (a) one of three scenes for experiment #1, (b) scene for experiment
#2, (c) rescue robot stuck in rubble pile, (d) victim with heat pad under rubble in experiment #1,
(e) victim on the upper level of the scene in experiment #2, and (f) red cushion detected as false
victim. ........................................................................................................................................... 57
Figure 25: Rugged rescue robot in USAR-like scene. .................................................................. 57
Figure 26: (a) 2D image of robot’s front view, (b) 2D image of robot’s rear view, (c) 3D map,
robot status (green indicates the wheels of the robot are moving), infrared visual feedback on the
robot’s perimeter (if objects are very close red rectangles appear), and control menu display, and
(d) real-time 3D information. ........................................................................................................ 58
Figure 27: Xbox 360 wireless controller used as the operator’s input device. ............................. 60
Figure 28: 2D representation of experimental scenes; (a)-(c) depict the original scene layouts 1,
2, and 3; (d)-(f) show the 2D representation of the scene as determined by the robot during the
experiments. .................................................................................................................................. 62
Figure 29: Number of exploration steps for the robot in scenes 1, 2 and 3. ................................. 63
Figure 30: Percentage of scene explored. ..................................................................................... 65
Figure 31: Number of victims identified. ..................................................................................... 65
Figure 32: Number of collisions. .................................................................................................. 66
List of Appendices
Appendix A ................................................................................................................................... 77
Appendix B ................................................................................................................................... 80
Chapter 1 Introduction
1.1 Motivation
Search and rescue operations in urban disaster scenes are extremely challenging due to the highly
cluttered and unstructured nature of the environments. In addition, in some scenarios the task of
rescuing victims from collapsed structures can be extremely hazardous due to asbestos and dust,
general instability of damaged structures, and, in some cases, the presence of toxic chemicals or
radiation in the environment [1]-[3]. Moreover, rescuing victims from collapsed structures
sometimes requires entering through small voids, which may not be possible for rescue workers
or may be very time consuming (e.g., requiring larger openings to be created by first removing
the rubble) [1]-[3]. To overcome these challenges, mobile robotic systems are being developed to
aid rescue workers in urban search and rescue (USAR) operations. Current applications of rescue
robots require a team of rescue operators to remotely guide robots in the disaster scene, i.e.,
[2],[4]. However, teleoperation of rescue robots while also searching for victims in such complex
environments can be a very stressful task, leading to both cognitive and physical fatigue [5].
Consequently, operators can suffer from reduced alertness and impaired memory and
concentration during these time-critical situations [2].
Human operators can have perceptual difficulties in trying to understand an environment via
remote visual feedback. Two common examples of perceptual difficulties that are caused by
remote perception are the keyhole effect, which causes loss of situational awareness (SA), and
scale ambiguity [6]. The keyhole effect is a result of the limited angular view that is provided by
the visual platform on rescue robots, which creates a sense of trying to understand the
surrounding environment through a small hole [6]. As a result, operators can have a hard time
trying to integrate what they see from the robot’s immersed view with their mental model of the
environment, becoming disoriented and losing situational awareness [2], [6], [7]. Studies have
shown that maintaining good situational awareness is critical; in fact, it has been noted that operators
will stop everything they are doing and spend an average of 30% of their time trying to
acquire/re-acquire SA, even when they are performing a time-sensitive search and rescue task
[8]. The need for high levels of situational awareness in USAR situations can make it difficult
for operators to safely navigate the robot and identify victims. On the other hand, scale
ambiguity deals with the difficulty of recognizing the scale of the environment through remote
visual feedback [6]. An example of this ambiguity was observed in [2]: when mobile robots were
used at the World Trade Center (WTC) to search for victims in the rubble, operators could
not perceive from the remote visual information whether a rescue robot was small enough to pass
through an opening or climb over a rubble pile. Due to the mental and physical overload an
operator can experience when having to perform the entire search task using teleoperated robots,
a minimum of two rescue personnel are recommended to control a single robot [4].
An alternative to teleoperated control is to use autonomous controllers for rescue robots, which
eliminates the need for constant human supervision. However, deploying a fully autonomous
rescue robot in a USAR scene requires addressing a number of challenges. Firstly, rescue
personnel do not generally trust an autonomous robot to perform such critical tasks without any
human supervision [3]. Secondly, rescue robots have very demanding hardware and software
requirements unlike any other robotic applications. Namely, autonomous navigation can be very
difficult to achieve due to the rough terrain conditions of these environments. Moreover, victim
identification can be particularly challenging due to the presence of dust, debris, poor lighting
conditions and extreme heat sources in the disaster environment [1], [3].
To address the challenges and limitations of both teleoperated and fully autonomous control of
rescue robots in USAR environments, recent efforts have focused on developing semi-
autonomous controllers that allow task sharing between rescue robots and human operators [9]-
[12]. Not only does task sharing reduce the stress and mental workload of operators, it also
allows a rescue robot to benefit from a human operator’s experience and knowledge.
1.2 Literature Review
Teleoperated rescue robots have been utilized in disaster scenes such as the WTC site in 2001,
earthquake in Niigata, Japan in 2004, and Hurricanes Katrina, Rita, and Wilma in the United
States in 2005 [13]. Rescue robots were used again recently to inspect the collapse structures in
the aftermath of the 8.9 magnitude earthquake that hit Japan in March of 2011 [14]. At the WTC
disaster site, out of the ten rescue robots proposed only three were qualified to inspect the
disaster environment and look for victims: the MicroTracs, MicroVGTV, and Solem [2]. All
three robots were remotely controlled by rescue operators; MicroTracs and MicroVGTV were
tethered while Solem was a wireless robot. While in theory all three robots could be operated by
one person, in practice it took two rescue operators to operate each robot. One person controlled
the robot, while the other kept the tether or safety rope from tangling. Even though the
MicroTracs and MicroVGTV had two-way audio capabilities, the operators could only use the
visual feedback since the audio headsets and protective headgear could not be used at the same
time. Thus, the operators were forced to use the limited robot’s eye view to navigate, explore and
find victims. In addition, the robots lacked mapping and localization capabilities, as well as
sufficient sensory capabilities. On average, the robots got stuck 2.1 times per minute,
requiring the operators to assist them with limited information about the robot’s status and
environment [2]. Communication dropouts were also a problem with the Solem robot, which had
a total of 21 communication dropouts in its seventh deployment [2]. Furthermore, lack of sleep,
stress and cognitive fatigue also affected the operators’ abilities to perform at their best.
The issues observed with teleoperated rescue robots have encouraged recent efforts to enhance
the autonomous capabilities of rescue robots, allowing a robot to take over some of the rescue tasks. Autonomous
rescue robots presented in [15]-[17] have autonomous capabilities in navigation,
localization and mapping, and victim identification. The IUB Rugbot presented in
[15] is a wireless tracked robot that has autonomous mapping capabilities based on a laser range
finder and Simultaneous Localization and Mapping (SLAM). The robot also utilizes an IR
camera and a shape recognition algorithm based on thermal images. Gullfaxe [16] is a wheeled
wireless robot that captures distance and angle readings from the surrounding
obstacles, using an array of ultrasonic sensors to generate a map of the environment. The robot
also utilizes its ultrasonic sensors as well as infrared sensors for autonomous localization and
obstacle avoidance. Furthermore, two pyro-electric sensors are utilized to capture heat emissions
from victims and a microphone is used to localize the victim with respect to the robot. Although
victim identification using heat signatures in [15] and [16] may be possible in some
environments, such as controlled rescue competition arenas, external heat sources, e.g. fire,
make it very difficult for IR cameras to locate victims in real disaster scenes. In addition, the
audio signal captured by microphones can be extremely noisy due to weak signals and external
noises in cluttered USAR environments [1], [3].
The robot CASTER [17] is also a rugged wireless tracked robot with autonomous capabilities.
The robot utilizes two laser scanners, a range camera, a heading/altitude sensor, and a 2D camera
for achieving autonomous navigation and localization. Autonomous victim identification is also
achieved via an IR camera, 2D cameras, a range camera and a microphone. CASTER was tested
in the 2005 RoboCup Rescue competitions, helping its team take third place.
Although some of the autonomous capabilities, e.g. victim identification, performed well during
the competitions, the robot’s heavy platform caused it to unintentionally push obstacles and stairs
around. The robot was also reported to flip once and permanently get stuck in two rounds of the
competition [17].
Due to the harsh and unpredictable nature of USAR environments, fully autonomous navigation
and victim identification in these environments are extremely challenging to achieve. The
operator’s knowledge and experience is still needed to assist the robots in unpredictable
situations. Semi-autonomous control architectures create cooperative control between the robot
and the human operator by sharing tasks between the two entities. Only a handful of semi-
autonomous controllers have been developed to date specifically for search and rescue robots to
share the navigation and localization, exploration, and victim identification tasks [9]-[12].
However, none of these control schemes incorporate learning into the decision making module of
their control architecture.
1.3 Problem Definition
The highly cluttered nature of the USAR environments and the communication limitations of
urban disaster scenes make teleoperation of rescue robots very difficult. In addition, human
operators can become disoriented, stressed and fatigued very quickly in USAR environments,
causing crucial errors in control and victim identification. Furthermore, robots operating in
these environments have no a priori information about landmarks in the scene, which makes
it extremely difficult for robots to autonomously navigate the scenes and identify victims.
Although current semi-autonomous control schemes simplify the tasks of the human operator
and the rescue robot, the level of autonomy of the robot is generally manually set by the
operator. In addition, none of these semi-autonomous control schemes incorporate learning into
their control architecture to deal with unknown and highly unpredictable USAR environments.
The focus of this thesis is to develop, for the first time, a learning-based semi-autonomous
control architecture for robotic exploration of unknown search and rescue environments. The
control architecture is to provide the robot with the decision making ability to autonomously
navigate an unknown cluttered disaster scene and identify victims, as well as to learn how to
share the tasks with the human operator for optimal results.
1.4 Proposed Methodology and Tasks
This thesis focuses on the development of a hierarchical reinforcement learning (HRL)-based
semi-autonomous control architecture for a rescue robot for the application of exploration and
victim identification in cluttered environments. In particular, the HRL control algorithm allows a
rescue robot to learn and make decisions regarding which tasks should be carried out at a given
time and whether the human or the robot should perform these tasks for optimum results. By
giving the robot the decision making ability to decide when human intervention is required, the
human operator can take advantage of the robot’s ability to continuously learn from its
surrounding environment.
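The decision step described above — the robot weighing whether to act autonomously or to request human intervention — can be sketched as a greedy selection over learned action values. This is an illustrative sketch only; the function, state, and action names are hypothetical assumptions, not taken from the thesis implementation.

```python
# Hypothetical sketch: the robot picks the subtask with the highest learned
# value for the current state; if requesting operator assistance has the
# highest value, control is handed to the human. All names and numbers
# below are illustrative placeholders.

def choose_action(q_values, state, assist_action="request_operator"):
    """Greedily select the best-valued subtask for this state.

    Returns a (who, action) pair: 'operator' if the assistance action
    wins, otherwise 'robot' acting autonomously.
    """
    actions = q_values[state]              # dict: action -> Q(state, action)
    best = max(actions, key=actions.get)   # greedy selection over Q-values
    return ("operator", best) if best == assist_action else ("robot", best)

# Example: in a state where the robot is stuck, assistance has the highest
# learned value, so control passes to the operator.
q = {"stuck_in_rubble": {"navigate": -2.0, "explore": -1.5, "request_operator": 3.0}}
print(choose_action(q, "stuck_in_rubble"))  # → ('operator', 'request_operator')
```

In the actual architecture this choice is made within a learned MAXQ hierarchy rather than a flat table; the sketch only conveys the shared-control decision itself.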
The proposed methodology and tasks for the design of the HRL-based semi-autonomous control
architecture consist of the following sections:
1.4.1 Literature Review
Chapter 2 provides a detailed literature review on the two main features of the proposed control
architecture: semi-autonomous control and reinforcement learning for robotic control. The first
section of the literature review focuses on the implementation of semi-autonomous control for
mobile robots. An overview of the advantages of semi-autonomous control over fully
teleoperated control of robots is presented. Then a more detailed discussion about semi-
autonomous control schemes is provided, which categorizes the controllers into fixed- and
variable-autonomy approaches. Lastly, semi-autonomous control schemes implemented for the specific
application of search and rescue are discussed. The second section of the literature review
provides an overview on reinforcement learning as applied to mobile robotic navigation and
exploration. This section discusses traditional reinforcement learning, hybrid reinforcement
learning, and hierarchical reinforcement learning as applied to robotic control.
1.4.2 Control Architecture and MAXQ-Based Robot Decision Making
In Chapter 3, an overview of the semi-autonomous control architecture as well as the design of
the MAXQ task hierarchy is presented. The MAXQ task hierarchy decomposes the overall task
of search and rescue into three main subtasks of exploration, navigation and victim
identification. The individual subtasks making up the hierarchy are defined and discussed in
detail. This chapter focuses on the hierarchy definition as well as other techniques that are
integrated into the MAXQ hierarchy including: (i) a direction-based exploration technique
designed for exploration of a rescue robot in cluttered and rough terrain environments, and (ii) a
rubble pile categorization technique that utilizes 3D information to group different terrain
conditions into multiple categories that facilitate navigation in USAR-like environments.
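The task decomposition described above can be illustrated with a minimal task-graph sketch. Only the three top-level subtasks (exploration, navigation, victim identification) come from the text; the class layout and the primitive actions beneath each subtask are hypothetical placeholders, not the thesis's actual hierarchy.

```python
# Minimal sketch of a MAXQ-style task graph: a root task decomposed into
# subtasks, with primitive actions at the leaves. Subtask and leaf names
# below the root are illustrative assumptions.

class Task:
    def __init__(self, name, children=None, primitive=False):
        self.name = name
        self.children = children or []   # subtasks one level down
        self.primitive = primitive       # True for leaf (primitive) actions

    def leaves(self):
        """Return all primitive actions reachable from this task."""
        if self.primitive:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]

# Root task split into the three subtasks named in the text; the primitive
# actions are hypothetical placeholders.
root = Task("SearchAndRescue", [
    Task("Explore", [Task("pick_direction", primitive=True)]),
    Task("Navigate", [Task("move", primitive=True),
                      Task("classify_rubble", primitive=True)]),
    Task("IdentifyVictim", [Task("scan_for_victim", primitive=True)]),
])

print(root.leaves())  # → ['pick_direction', 'move', 'classify_rubble', 'scan_for_victim']
```

In MAXQ, each node of such a graph carries its own value decomposition; the sketch shows only the structural decomposition that Chapter 3 defines in detail.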
1.4.3 Simulations
Chapter 4 presents the numerous simulations that were performed to test the performance of the
proposed MAXQ-based control architecture. Two sets of simulations were designed to evaluate
the overall capabilities of the developed control architecture. The first set of simulations was
designed to verify the performance of the overall MAXQ task hierarchy as well as to collect Q-
values and test for convergence of the Q-values at different levels of the hierarchy. Furthermore,
these simulations were used as offline training for the learning algorithm. The second set of
simulations was focused on evaluating the performance of the navigation and exploration
modules in several large-scale scene layouts with various degrees of complexity. The simulation
scene layouts were designed to test the robustness of the navigation and exploration modules in
larger and more complex scenes than could be tested in the available laboratory space.
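The Q-value convergence testing mentioned above can be sketched as a simple check that successive Q-value snapshots stop changing between training trials. The tolerance and the sample data are illustrative assumptions, not values from the simulations.

```python
# Hedged sketch: training is considered converged when the largest change
# in any Q-value between successive trials falls below a tolerance.
# The threshold and sample Q-tables are illustrative.

def max_q_change(q_old, q_new):
    """Largest absolute change across all (state, action) entries."""
    return max(abs(q_new[k] - q_old[k]) for k in q_new)

def has_converged(history, tol=1e-3):
    """history is a list of Q-table snapshots, one per training trial."""
    return len(history) >= 2 and max_q_change(history[-2], history[-1]) < tol

q_trials = [
    {("s0", "explore"): 0.0},
    {("s0", "explore"): 0.9},
    {("s0", "explore"): 0.9005},
]
print(has_converged(q_trials))  # → True (final change 0.0005 < 0.001)
```

A check of this form is what allows offline training trials to double as a convergence test at each level of the hierarchy.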
1.4.4 Implementation
Chapter 5 discusses the extensive experiments that were performed in order to evaluate the
proposed HRL-based semi-autonomous controller in a USAR-like experimental test bed. This
chapter discusses the experimental set-up including the design of the USAR-like environment
and an overview of the rescue robot and its sensing abilities. Two experiments are presented
which evaluate the performance of the HRL-based controller. The first set of experiments focuses
on the evaluation of the navigation and exploration modules, which give the rescue robot the
ability to safely navigate and explore an unknown USAR-like environment, thus providing the
opportunity to find victims in the scene. The second set of experiments focuses on the comparison
between the overall performance of the rescue robot in teleoperated (fully manual) and semi-
autonomous modes of control.
1.4.5 Conclusion
Chapter 6 presents the concluding remarks on the development and evaluation of the HRL-based
semi-autonomous control architecture and highlights the main contributions of the thesis as well
as a discussion of future work.
Chapter 2 Literature Review
2.1 Semi-Autonomous Control
To date, the semi-autonomous controllers that have been developed for mobile robotic
applications have mainly focused on the shared control of navigation and obstacle avoidance in
varying environments. These semi-autonomous control schemes can be categorized based on the
level of autonomy of the robot. In particular, control schemes in which the robot’s autonomy is
fixed focus on low level tasks such as collision avoidance and motor control, allowing the human
operator to concentrate on high level control and supervisory tasks which include path planning
and task specification [18]-[20]. Compared to fully teleoperated control of robots, these semi-
autonomous controllers have been shown to be more effective for robot navigation and obstacle
avoidance. Furthermore, they allow more neglect time for the operator, where neglect time is defined as the amount of time an operator can neglect a robot while attending to other tasks, such as supervising other robots in a team, reviewing remote visual information, or communicating with other rescue operators [18]. However, these semi-autonomous controllers
with fixed autonomy levels lack the flexibility that is needed for more challenging problems such
as control of rescue robots in rubble-filled USAR environments. For example, in the case of a
robot getting physically stuck in a cluttered disaster environment, the human operator may have
to take over the control of low level operations in order to assist the robot. Similarly, high level
control may be required from the autonomous robot controller when the operator cannot perform
these tasks due to task overload or communication dropout. To address these concerns, semi-
autonomous controllers with variable autonomy can be used. Namely, these controllers can
provide an operator with the option to adjust the level of autonomy of the robot based on the task
at hand, i.e., [21]-[24]. For example, in [23], an operator is given the choice of four different
levels of autonomy, ranging from fully teleoperated control, which gives full control of the
robot to the operator, to a semi-autonomous mode where most of the low level operations such as
motor control and collision avoidance as well as high level operations such as path planning are
assigned to the autonomous robot controller, while the operator only specifies the overall goal
tasks (e.g. search in a particular region). It was concluded that this latter mode resulted in the
best performance in terms of minimizing the number of collisions and decreasing the workload
of the operator when compared to lower levels of robot autonomy.
Recently, a handful of semi-autonomous controllers have been specifically developed for control
of rescue robots in USAR applications, focusing on the two main tasks of robot navigation and
victim identification [9]-[12]. In particular, the semi-autonomous controller presented in [9] can
be set by the operator to perform in Safe and Shared Modes. Safe Mode is similar to teleoperated
control, in that all navigation and victim/object detection tasks are performed by the operator; however, the robot can intervene to prevent collisions with obstacles. In Shared Mode, the robot
provides the optimal path for navigation based on the operator’s directional inputs. The results
showed that navigational autonomy in Shared Mode resulted in a much better performance in
finding pre-specified objects (including victims) in the scene. Although [9] provides the operator
with the opportunity to set the level of autonomy of the controller beforehand, an operator is not
able to change this level of autonomy on the fly during a search and rescue operation, which may
be needed in unknown environments if either the robot or operator face a situation where they
need assistance from each other. This could occur, for example, in a scenario where low level
navigation and obstacle avoidance are assigned to the robot and the high level task of goal
assignment is given to the human operator; if the robot gets stuck, the operator may need to assist
the robot by taking over the navigation task in order to get the robot out. Similarly, if the
operator has lost his/her situational awareness, the robot can provide location information and an
appropriate search direction.
On the other hand, controllers presented in [10]-[12] solve this problem by providing on the fly
adjustments of a robot’s level of autonomy either by the operator [10],[11] or automatically by
the controller [12] during USAR operations. For example, in [10], a semi-autonomous control
scheme was presented where a robot could be operated in either manual mode or semi-
autonomous mode. In semi-autonomous mode, the operator was in charge of performing a
semantic search of the scene which included dividing a large search area into smaller regions
and ordering these regions by parameters such as a region's proximity to the robot and the likelihood of finding a victim there. On the other hand, the robot was in charge of handling the more
routine and tedious systematic search, in which it searched for victims autonomously, alerted the
operator when a potential victim was found and waited for operator confirmation. In addition to
semantic and systematic search, another category of search known as opportunistic search was
also introduced, where a robot was able to operate on a reduced detection function on incoming
data without distracting the operator or slowing the navigation control. Although certain tasks
were assigned to both the operator and robot, the operator could take over the control of the robot
at any given time during the operation. In [11], a semi-autonomous controller was presented for
control of a team of rescue agents in USAR environments. The system allowed an operator's
commands to be weighted with the agents’ own autonomous commands using a mediator. In
addition to the mediator, an intervention recognition module was also utilized to identify
situations during a rescue operation where human intervention is required, such as: 1) when agents got stuck, 2) when agents got lost or were unable to complete their goals, or 3) when a potential victim was found and needed to be verified. In [12], three semi-autonomous control modes were
proposed for a rescue robot to distribute tasks between an operator and robot: i) planning-based
interaction, ii) cooperation-based interaction, and iii) operator-based interaction. In planning-
based interaction, the planning system generated a cyclic sequence of actions to be executed by
the operator and rescue robot. In cooperation-based interaction, the operator could modify the
sequence generated by the planner, and lastly, in operator-based interaction, an approach similar
to teleoperation was taken where the system only observed the operator’s actions and notified the
operator if any critical safety problems occurred. The specific mode to implement was
determined automatically by the robot based on the manner in which the operator was interacting
with the overall system during a USAR operation. For example, the interaction was adjusted
from operator-based to planning-based if the operator was idle for a certain period of time.
The aforementioned research highlights the recent efforts in semi-autonomous control for USAR
applications. However, none of the aforementioned controllers incorporate learning into their
control scheme to deal with unknown and unpredictable USAR environments. In this thesis, a
learning-based semi-autonomous controller is proposed to allow a robot to learn from its own
previous experiences in order to better adapt to unknown and cluttered disaster environments.
2.2 Reinforcement Learning for Robot Control
Reinforcement Learning (RL) algorithms have previously been used to address navigation and
obstacle avoidance problems for mobile robots [25]-[30]. In particular, RL techniques have been
utilized to learn a robot’s optimal behavior by observing the outcome of its actions as it interacts
with its environment. The most commonly used RL technique in mobile robotic applications is
Q-learning. In Q-learning, a mapping is learned from a state-action pair to a value called Q [31].
The mapping represents the reward of performing a specific action in a particular state. The
advantage of this approach is that it is model-free and exploration insensitive (the Q-values will converge to their optimal values independent of the robot's behavior during data collection). Although
RL algorithms display advantages over many alternative learning algorithms, one drawback is
that they treat the entire state space as one large search space. Hence, for larger-scale and complex robotic applications, the number of Q-values that must be learned (i.e., the size of the Q-table) grows exponentially with the size of the state space, significantly increasing the time required for learning. To address this limitation, researchers have
developed a number of hybrid RL techniques in order to reduce the size of the state space [25]-
[27]. For example, in [25], a fuzzy inference system (FIS) and Q-learning were used together for
control of a mobile robot in goal seeking and obstacle avoidance tasks. To speed up the learning
process, a fuzzy controller was used to generalize the state space. Fuzzy rules were defined to
locally represent a set of input variables. When a new state (a new combination of input variables) was encountered, the fuzzy rules competed over the new condition. If no close match to the existing rules was found, a new rule was generated that determined the action to take at that step, and a corresponding Q-value was initialized. This Q-value was then updated with further occurrences of the state-action pair. In [26], rough sets were utilized in conjunction with Q-learning for the navigation and obstacle avoidance of a mobile robot in order to speed up the
learning process of the controller. Herein, rough sets were used to reduce the effective state
space of the system by mapping the continuous state space onto a discrete set of states. Namely,
the states were grouped into predefined sets based on the distances of obstacles and the goal to
the robot. As a result, by having fewer states the overall learning time was considerably
shortened when compared to a traditional Q-learning technique. In [27], a neural network was
used with Q-learning for robotic obstacle avoidance. The overall approach was shown to train
the system in a shorter time than a traditional Q-learning technique. Utilizing neural networks
with Q-learning not only provides a more compact way of storing the Q values than a look up
table, but it can also be used to interpolate Q values for states that have not yet been visited.
Although the techniques presented in [25]-[27] have shown improvements over traditional RL
techniques, they cannot be as easily applied to more complex and larger-scale problems such as
control of rescue robots in USAR scenes. In particular, these techniques have been applied to
robot navigation and obstacle avoidance problems in highly structured simulated environments.
In addition, in both [25] and [26], prior knowledge of the robot's desired locations within the small simulated environment was needed before implementation. Overall, all of the aforementioned hybrid RL techniques attempt to reduce the entire search space by interpolating Q-values or generalizing states. Thus, even though these methods work well for simple robotic applications, they cannot fully represent the states of more complex robotic applications and environments, which can lead to inaccurate learning of the system.
In addition to the hybrid RL techniques, hierarchical RL (HRL) approaches have also been
developed to deal with some of the drawbacks of traditional RL methods, i.e., [32]-[34].
Contrary to hybrid RL techniques that reduce the entire state space by interpolating Q-values or
generalizing a set of states into a single representative state, HRL techniques reduce the number
of state-action pairs that need to be learned by using different types of abstractions.
Namely, HRL consists of techniques that utilize various types of abstractions to deal with the
exponentially growing number of parameters that need to be learned in large-scale problems by
reducing the overall task into a set of individual subtasks for faster learning [35]. Therefore, in
the hierarchical representation, the number of state-action pairs that need to be learned is reduced without compromising the accuracy of the state representation.
To date, there have only been a handful of cases where HRL has been applied to mobile robotic
applications [28]-[30]. In this work, the use of the MAXQ HRL method within the context of a
semi-autonomous controller for the robot exploration and victim identification problem in USAR
environments is explored for the first time. In USAR applications, the environments of interest
are unknown and cluttered, increasing the complexity of the learning problem. By utilizing a
MAXQ approach, the overall search and rescue task can be reduced into smaller more
manageable subtasks that can be learned concurrently. MAXQ has fewer constraints on its
policies, i.e., mapping of states to possible actions, and thus, is generally known to require less
prior knowledge about its environment than other commonly used HRL approaches such as
Options and Hierarchical Abstract Machines (HAMs) [34]. This is advantageous when dealing
with unknown USAR environments. In addition, MAXQ can support state, temporal and subtask
abstraction. State abstraction is important because, when a robot is navigating to a particular location, only the target location matters; the reason for navigating there is irrelevant and should not affect the robot's actions. The need for temporal abstraction exists in
this application since actions may take varying amounts of time to execute depending on the
complexity of the scene and the location of the robot within the scene. Subtask abstraction is
necessary because it allows subtasks to be learned only once; the solution can then be shared by
other subtasks.
2.3 Chapter Summary
The first section of the literature review presented an overview of the semi-autonomous control
schemes that have been developed for mobile robots. These can be categorized into semi-
autonomous controllers with either fixed or variable autonomy. Furthermore, recent
implementations of semi-autonomous control schemes for search and rescue robots were
discussed. The second section of the literature review focused on the application of
reinforcement learning techniques in robotic control. Traditional reinforcement learning, hybrid
reinforcement learning, and hierarchical reinforcement learning were discussed and compared.
Chapter 3 Proposed Learning-Based Semi-Autonomous Control Architecture
3.1 Semi-Autonomous Control Architecture
The proposed semi-autonomous control architecture is depicted in Figure 1. Sensory information is provided by: (i) a 3D mapping sensor providing 2D and 3D images of the environment, (ii) a thermal camera providing heat signatures from the surrounding environment, (iii) infrared sensors that provide proximity information around the robot, and (iv) 2D visual feedback for
the HRI interface. The SLAM module uses the 2D and 3D images provided by the 3D mapping
sensor to identify and match 3D landmarks in the scene and create a 3D global map of the
environment. For more details on the 3D sensor and the SLAM technique used, the reader is
referred to [36]-[39].
The Victim Identification module provides the control architecture with the ability to identify
partially obstructed victims in a cluttered USAR scene. This module utilizes thermal images
and/or a victim detection algorithm based on 2D and 3D images provided by the 3D mapping
sensor. In the latter victim identification method, colored 2D images are used for skin detection
while 3D images are used for detecting victim body parts in the scene. For more information
about the victim identification technique used, the reader is referred to [40]. The HRI interface
module provides the operator with the user interface needed for human control. The interface
allows the operator to obtain sensory information from the environment and the robot as well as
the 3D map provided by the SLAM module in order to monitor or control the robot’s motion. It
also provides the operator with potential victims in the scene for confirmation. The Deliberation
module is where the learning and decision making of the robot during semi-autonomous
deployment occurs. Real-time sensory data from the mapping sensor, the victim identification
technique as well as the generated 3D map of the explored environment (provided by the SLAM
module) are utilized as inputs into the Deliberation module. Since the robot is designed to be
semi-autonomous, it is within this module that the robot primarily decides its level of autonomy.
Namely, MAXQ is used to allow the robot to learn and make decisions regarding which tasks
should be carried out at a given time and who (the robot or human) should perform these tasks
for optimal results. If human control is prominent, then the decision making within this module
is made by the human operator via the HRI interface module. Lastly, the robot actuators module
consists of the robot’s motors and motor control boards. Within this module, robot actions
provided by the Deliberation module are translated into appropriate motor signals. The following
sub-section presents the detailed design of the MAXQ HRL technique utilized by the
Deliberation module for robot intelligence in order to achieve semi-autonomous control.
Figure 1: Semi-autonomous control architecture.
3.2 MAXQ-Based Robot Decision Making
In this work, the use of the MAXQ HRL method within the context of a semi-autonomous
controller is explored for the robot exploration and victim identification problem in USAR
environments. Namely, MAXQ will be used in the Deliberation module of the controller to solve
the overall problem of finding victims trapped in a cluttered unknown environment by having the
robot explore the disaster scene. The robot will also have the ability to hand over control to a
human operator whenever the operator can perform a function more efficiently.
3.2.1 MAXQ Learning Technique
The MAXQ learning technique attempts to create an optimal policy for an unknown Markov
Decision Process (MDP). Thus, before discussing the details of the MAXQ learning technique, the concept of an MDP and an overview of reinforcement learning are first presented.
3.2.1.1 Markov Decision Process
An MDP provides a framework for modeling a decision-making algorithm. Assuming that the
decision making agent is interacting with a fully-observable environment, the MDP models this
interaction as a five-tuple ⟨S, A, P, R, P0⟩ defined as follows [41]:
S: A finite set of states of the environment fully observable by an agent.
A: A finite set of actions available to an agent.
P: The state-transition probability function P(s' | s, a), which gives the probability of transitioning from the current state s to the next state s' by executing an action a ∈ A.
R: The expected real-valued reward R(s, a, s') received by the agent for executing an action a in state s and transitioning to the next state s'.
P0: The starting probability distribution. P0(s) is the probability of starting in state s when the
MDP is initialized.
To utilize the MDP model for a decision-making process, a policy π : S → A must be developed, which tells the agent which action a = π(s) to take in any given state s of the environment. The
objective is to find an optimal policy which maximizes the cumulative reward received over the
entire process.
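To make this concrete, the sketch below encodes a small hypothetical MDP (the states, actions, and rewards are illustrative, not from the thesis) and extracts an optimal policy by value iteration, one standard way of solving a known MDP:

```python
# Hypothetical two-state, two-action MDP (illustrative values only).
S = ["safe", "stuck"]
A = ["advance", "retreat"]
# P[s][a] = list of (next_state, probability) pairs.
P = {
    "safe":  {"advance": [("safe", 0.8), ("stuck", 0.2)],
              "retreat": [("safe", 1.0)]},
    "stuck": {"advance": [("stuck", 1.0)],
              "retreat": [("safe", 0.6), ("stuck", 0.4)]},
}
# R[(s, a, s_next)] = expected immediate reward for that transition.
R = {("safe", "advance", "safe"): 1.0, ("safe", "advance", "stuck"): -2.0,
     ("safe", "retreat", "safe"): 0.0, ("stuck", "advance", "stuck"): -1.0,
     ("stuck", "retreat", "safe"): 0.5, ("stuck", "retreat", "stuck"): -1.0}

def value_iteration(gamma=0.9, iters=200):
    """Compute an optimal policy pi: S -> A via the Bellman optimality update."""
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[s][a]) for a in A) for s in S}
    pi = {s: max(A, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                      for s2, p in P[s][a])) for s in S}
    return pi, V

pi, V = value_iteration()
```

Here the optimal policy advances while the robot is safe and retreats when it is stuck; RL, discussed next, recovers such a policy without knowing P and R in advance.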
3.2.1.2 Reinforcement Learning
Reinforcement Learning (RL) comprises algorithms that attempt to build an optimal policy for an unknown MDP. An RL algorithm is a continuing process of observing the current state s and executing an action a = π(s) according to a policy π, which takes the agent into the next state s'. The agent receives an immediate reward r for its action, which is used as feedback to the system for improving future policies.
As previously mentioned, the most commonly used RL algorithm is Q-Learning. Q-Learning
attempts to learn a state-action pair Q(s,a) value for every state, s, and action, a, combination.
This value is the maximum discounted reward that can be obtained by executing action, a, in
state, s, and then following an optimal policy [31]. The state-action value is estimated by taking
an action, a, in state, s, receiving an immediate reward, r, and observing the next state, s’, and
repeatedly updating its value with the following equation [31]:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ],   (1)

where α (0 < α < 1) is the time-varying learning rate, and γ (0 < γ < 1) is the discount factor which diminishes the influence of future rewards.
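The update in Eq. (1) translates directly into a few lines of code; the sketch below is a generic illustration (the state/action names and parameter values are assumptions, not the thesis's implementation):

```python
from collections import defaultdict

# Q-table: unseen (state, action) pairs default to 0.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One application of Eq. (1): move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Illustrative step: from state 0, action "fwd" earns reward 1 and lands in state 1.
actions = ["fwd", "turn"]
q_update(0, "fwd", 1.0, 1, actions)   # Q[(0, "fwd")] becomes 0.5 * 1.0 = 0.5
```

Repeated application of this update over experienced transitions drives the Q-table toward the optimal values, which is the convergence property noted above.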
3.2.1.3 MAXQ Value Decomposition
MAXQ works by decomposing a given MDP, M, into a finite set of subtasks {M0,M1,..,Mn}
which define the MAXQ hierarchy [34]. Herein, M0 represents the root subtask which defines the
overall problem, and is further decomposed into subtasks M1 to Mn.
As previously mentioned, USAR environments are unknown and cluttered, which can greatly
increase the complexity of the learning problem. Thus, by implementing a MAXQ approach, the
overall search and rescue task can be reduced into smaller more manageable subtasks that can be
learned concurrently.
In MAXQ, each subtask Mi is a three-tuple ⟨Ti, Ai, Ri⟩ defined as follows [34]:
Ti: A finite set of terminal states for subtask Mi. Generally, the total set of states S is separated into a set of active states Si and a set of terminal states Ti. Once the subtask reaches a terminal state, it exits and returns control to the parent subtask in the hierarchy that originally called it.
Ai: A set of actions available in subtask Mi. These actions can be either a primitive (one-step)
action or another subtask in the hierarchy.
Ri(s'): A reward function that specifies the immediate reward of transitioning to a terminal state s' ∈ Ti.
Furthermore, for every subtask Mi, a policy πi is defined which maps all possible states of Mi to a child task. The child task can be either a primitive action or another subtask under Mi to execute. Subsequently, a hierarchical policy π (a set containing the policies for all subtasks) is
defined for the entire task hierarchy. Also, a projected value function (Q-value) is stored for
every state and action pair in all subtasks. The projected value function is defined as the expected
cumulative reward of executing policy πi in subtask Mi, as well as all the policies of the subtasks
that would be executed under Mi until Mi terminates. The formal definition of the projected value
function is given as [34]:

Q^π(i, s, a) = V^π(a, s) + Σ_{s', N} P_i^π(s', N | s, a) γ^N Q^π(i, s', π_i(s')),   (2)

Herein, Q^π(i, s, a) is the projected value function for subtask Mi under the policy π. V^π(a, s) is the value function of executing action a in state s. N is the number of time steps that action a takes to transition to the next state s', and the rest of the terms are defined as before. Thus, the right-most part of this equation is the discounted cumulative reward of completing the task for Mi after having executed the first action a. Therefore, the value function can be defined as [34]:

Q^π(i, s, a) = V^π(a, s) + C^π(i, s, a),   (3)

where V^π(i, s) can be defined as [34]:

V^π(i, s) = { Q^π(i, s, π_i(s))                   if i is composite
            { Σ_{s'} P(s' | s, i) R(s' | s, i)    if i is primitive,   (4)

and the right-most term of (3) is separately defined as the completion function [34]:

C^π(i, s, a) = Σ_{s', N} P_i^π(s', N | s, a) γ^N Q^π(i, s', π_i(s')).   (5)

The MAXQ learning algorithm updates both V(i, s) and C(i, s, a) for all subtasks, which by definition updates Q(i, s, a) toward convergence to an optimal policy. The general algorithm with which the MAXQ learning technique is carried out for a given subtask M is as follows [34]:
float MAXQ(state s, subtask M)
    Let Total Reward = 0
    while s is not a terminal state of M do
        Choose action a according to the policy π
        Execute a
        if a is primitive, observe the one-step reward r
        else r = MAXQ(s, a), which recursively calls MAXQ for subtask a
            and returns the reward received while a was executed
        Total Reward = Total Reward + r
        Observe resulting state s'
        if a is a primitive subtask
            V(a, s) ← (1 − α) V(a, s) + α r
        else (a is a composite subtask)
            C(M, s, a) ← (1 − α) C(M, s, a) + α γ^N max_{a'} [V(a', s') + C(M, s', a')]
        s = s'
    end while
    return Total Reward
end
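The two table updates at the core of this algorithm can be sketched in a few lines of code; the V and C tables are kept as dictionaries, and the subtask names, step counts, and parameter values below are illustrative only:

```python
from collections import defaultdict

V = defaultdict(float)  # V[(a, s)]: value of primitive action a in state s
C = defaultdict(float)  # C[(M, s, a)]: completion function for subtask M

def update_primitive(a, s, r, alpha=0.25):
    """Primitive-action update: V(a,s) <- (1 - alpha) V(a,s) + alpha * r."""
    V[(a, s)] = (1 - alpha) * V[(a, s)] + alpha * r

def update_composite(M, s, a, s_next, n_steps, child_actions, alpha=0.25, gamma=0.9):
    """Composite update: C(M,s,a) <- (1 - alpha) C(M,s,a)
    + alpha * gamma^N * max_a' [V(a', s') + C(M, s', a')]."""
    best_q = max(V[(a2, s_next)] + C[(M, s_next, a2)] for a2 in child_actions)
    C[(M, s, a)] = (1 - alpha) * C[(M, s, a)] + alpha * (gamma ** n_steps) * best_q

# Toy usage: two primitive steps, then one composite update for a "Navigate" subtask.
update_primitive("forward", "s0", r=1.0)   # V[("forward","s0")] -> 0.25
update_primitive("forward", "s1", r=2.0)   # V[("forward","s1")] -> 0.5
update_composite("Navigate", "s0", "forward", "s1", 1, ["forward", "turn"])
```

Because Q(M, s, a) = V(a, s) + C(M, s, a), updating the two tables separately updates the projected value function as described above.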
3.2.2 Proposed MAXQ Task Graph
The proposed MAXQ task hierarchy for the semi-autonomous control architecture is shown in
Figure 2. The Root task represents the overall USAR task to be accomplished by the robot –
finding victims while exploring a cluttered USAR environment. This Root task can be further
decomposed into four individual subtasks defined as: Navigate to Unvisited Regions, Victim
Identification, Navigate and Human Control. The focus of this thesis is on the two main subtasks
of the MAXQ task hierarchy that are responsible for exploration and navigation in a USAR
environment: Navigate to Unvisited Regions and Navigate. The following sub-sections provide a
detailed discussion of each of the individual modules that make up the overall task hierarchy
with an emphasis on the two aforementioned subtasks. An ε-greedy policy is used in the task
hierarchy. Namely, an action is chosen randomly, with probability ε, to promote exploration of
all available actions, and the optimal action, i.e., action with the highest Q-value, is chosen with
probability 1-ε. The value of ε is reduced as the system is trained. The advantage of this policy is
that it provides sufficient exploration as well as exploitation for the learning algorithm, and
therefore, ensures convergence to optimal solutions [35].
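A minimal sketch of the ε-greedy rule described above is shown below; the decay schedule is an assumption, as the text only states that ε is reduced as the system is trained:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon, otherwise the
    action with the highest Q-value. q_values maps action -> Q-value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def decayed_epsilon(episode, eps0=0.9, decay=0.99):
    """Illustrative exponential decay schedule (not specified in the text)."""
    return eps0 * decay ** episode
```

With ε near 1 early in training the robot explores broadly; as ε decays, it increasingly exploits the learned Q-values.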
Positive rewards are given to encourage transitions from the robot’s current state to desirable
states. For example, if the robot exits the USAR scene after having explored the entire scene, a
positive reward of +10 is given to the Navigate to Unvisited Regions subtask. For the Victim
Identification subtask, a positive reward of +10 is given for correctly tagging a new victim found
in the scene. For the Navigate subtask, the reward is based on three main factors: local
exploration, obstacle avoidance, and the global exploration direction. Within this subtask,
positive rewards are given to encourage desirable actions such as moving to adjacent unvisited
areas in the local environment, successfully avoiding dangerous situations such as collisions with
obstacles, and moving in the desired exploration direction. For example, moving into an
unvisited region in the desired global exploration direction is given a positive reward of +15 and
avoiding an obstacle is given a positive reward of +10. In addition, a large positive reward of
+100 is also given to the Root task when the overall USAR task is completed. Negative rewards
are given when a transition is made from the robot’s current state to an undesirable state. For
example, a negative reward of -10 would be given to Navigate to Unvisited Regions if Exit USAR
Scene is executed instead of Navigate when there are still known regions to explore. In the Victim
Identification subtask, a negative reward of -10 would be given if Tag is executed when no new
victims were identified in the scene. In Navigate, negative rewards would be given for executing
commands that lead to collisions (-20) or navigating into already visited areas (-1) of a USAR
scene. With respect to Human Control, similar positive and negative rewards are given based on
the subtask that the operator is involved in, allowing the robot to learn from the operator’s
actions.
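The reward scheme described above can be summarized as a simple lookup; the event labels below are illustrative names, while the numeric values are taken from the text:

```python
# Reward values from the text; the event labels themselves are illustrative.
REWARDS = {
    "root_task_complete":          +100,
    "exit_scene_fully_explored":    +10,  # Navigate to Unvisited Regions
    "exit_scene_prematurely":       -10,
    "tag_new_victim":               +10,  # Victim Identification
    "tag_without_new_victim":       -10,
    "move_unvisited_in_direction":  +15,  # Navigate
    "avoid_obstacle":               +10,
    "collision":                    -20,
    "revisit_area":                  -1,
}

def reward(event):
    """Immediate reward for a named transition event (0 for unlisted events)."""
    return REWARDS.get(event, 0)
```

Structuring the rewards this way makes the relative priorities explicit: collisions are penalized far more heavily than revisiting an area, while directed exploration earns the largest per-step bonus.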
Figure 2: MAXQ task graph for a semi-autonomous USAR robot.
3.2.2.1 Root Task
As previously mentioned, the Root task defines the overall goal of a rescue robot in a USAR
mission: to find victims within an unknown, cluttered disaster scene by exploring the
environment. The MAXQ state function of this task is defined as S(V, LR, Mxyz). V represents the
presence of potential victims in the environment. LR is the robot’s location with respect to the
global coordinate frame (defined to be at the location at which the robot enters the scene), and
Mxyz represents the 3D map of the USAR scene the robot is exploring. LR and Mxyz can both be
obtained by the SLAM module of the control architecture.
3.2.2.2 Navigate to Unvisited Regions
The purpose of this subtask is to have the robot explore unvisited regions within disaster
environments. The state definition for this subtask is S(LR,Mxyz), where LR and Mxyz have the same
definitions as above. The primitive action, Exit USAR Scene, is used by the Navigate to Unvisited
Regions subtask to trigger the end of exploration and guide a rescue robot out of the scene. In
order to efficiently explore an unknown environment, a direction-based exploration technique
was developed to be used in this subtask. The technique focuses on finding and exploring new
regions within the environment in order to expand the search area of the robot. This approach is
similar to frontier-based exploration, i.e., [42]-[45]; however, the proposed approach also explicitly takes into account the cluttered terrain of USAR environments.
Frontier-based exploration strategies developed to date focus on expanding the known map of an unknown environment by deploying either a single robot or a team of coordinated robots to target locations at the frontier of the map, i.e., the boundary between its explored and unexplored regions. A utility or cost function is
utilized to determine where to deploy a particular robot. For example, in [42], the frontier-based
exploration technique was introduced for a single robot. The approach determined the robot’s
next target location based on the shortest accessible path for the robot. In [43], a multi-robot
exploration strategy was adopted in which the cost for a particular robot to reach a frontier cell was
determined based on the probability of occupancy of each cell along the path of the robot to the
corresponding frontier cell, as well as the robot’s distance to that target point. In [44], both
distance and orientation to frontier points were used in the cost function for multi-robot
exploration. In [45], in addition to determining the nearest frontier point to a robot that provides
maximum coverage of unknown space, the deployment locations were chosen such that a robot at one of these locations remained visible to at least one other robot. Each robot was then deployed one at a time.
The majority of these frontier-based exploration approaches have been shown to work well in
simulated and real laboratory and/or office environments; however, they cannot be directly
applied to USAR-like environments where the robot is required to navigate over rough terrain in
cluttered scenes. Namely, in USAR environments, the terrain plays an important role in
determining a robot’s exploration path. For example, a robot may need to climb over rubble piles
in order to search a particular region or to explore new regions in an environment. Furthermore,
in rough terrain conditions, a particular cell or groups of cells may be accessible only through a
particular approach direction, due to the shape/slope of a rubble pile with respect to surrounding
cells. Hence, in this exploration technique, terrain information as well as travel distance and
coverage are considered. In addition, instead of a path-to-target approach, a direction-based strategy is utilized due to the cluttered and uncertain nature of these environments, together with a learning algorithm that helps the robot locally navigate these rubble-filled scenes. In general, the proposed direction-based technique is more robust to map
uncertainty than direct path planning techniques that require an accurate representation of the
scene in order for a robot to reach its target locations. Mainly, a search direction is defined for a
robot to follow in order to explore unknown regions. The proposed exploration method
determines one of four universal search directions, North, South, East or West, for the robot to
explore in order to navigate into unvisited regions in the scene. The chosen direction for
exploration is then incorporated into the rewarding system of the Navigate module which focuses
on implementing the primitive actions necessary to locally navigate the robot in this defined
search direction through the rubble. The search direction is updated as necessary in order for the
robot to search new regions.
To effectively choose an exploration direction, the current 3D map of the scene is decomposed
into a 2D grid map consisting of an array of equally sized cells, where each cell categorizes the
status of a predefined area of the real environment. The exploration technique uses this
information to locate potential cells within the map that have yet to be explored as well as cells
which represent the frontier of the known map of the scene, in order to expand the robot’s
knowledge about the environment. The 2D grid map is divided into four regions, where each
region represents one of the search directions, Figure 3.
Figure 3: 2D grid map of the four regions surrounding the robot.
The diagonal cells with respect to the robot’s current location in the known map are used to
determine the boundaries between the four search regions. Each region encompasses only one set
of boundary cells, defined to be in the counter-clockwise direction from the region.
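The region assignment described above can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the coordinate convention (grid index increasing to the East and to the North) and the function name `region_of` are assumptions, and the diagonal tie-breaking encodes the counter-clockwise boundary convention stated above.

```python
def region_of(cell, robot):
    """Assign a grid cell to one of the four search regions, given the
    robot's cell.  Cells are (east, north) grid indices.  Diagonal
    boundary cells are folded into the region counter-clockwise of them,
    so each region owns exactly one set of boundary cells."""
    dx = cell[0] - robot[0]          # offset to the East
    dy = cell[1] - robot[1]          # offset to the North
    if dx == 0 and dy == 0:
        return None                  # the robot's own cell belongs to no region
    if dy > 0 and -dy <= dx < dy:
        return "North"               # includes the NW diagonal
    if dx > 0 and -dx < dy <= dx:
        return "East"                # includes the NE diagonal
    if dy < 0 and dy < dx <= -dy:
        return "South"               # includes the SE diagonal
    return "West"                    # includes the SW diagonal
```

With this rule every cell other than the robot's own falls into exactly one region, which is what the utility comparison below requires.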
Prior to evaluating an exploration direction, the cells in each region are categorized into two
types: explored cells and unexplored cells. This classification is based on whether the robot has
visited a particular cell or its adjacent cells. If the robot has already explored a cell, then the
terrain or the presence of victims in that cell will be known and can be categorized accordingly.
Based on the sensing information of the robot, the cells can be identified to be:
i) Obstacle: An obstacle can be defined to be, for example, a rubble pile. All obstacles are
categorized as known or unknown obstacles. Known obstacles are further classified into
climbable and non-climbable obstacles. A robot is expected to navigate over a climbable
obstacle. The climbable obstacles category is again divided into visited and unvisited climbable
obstacles. On the other hand, the non-climbable obstacle category includes terrain conditions
such as large non-traversable rubble piles, walls, and sudden drops in the height of the rubble
pile (i.e., the edge of a cliff). Unknown obstacles are those that have been detected in the
surrounding cells of the robot, but for which not enough sensory information is available to
classify them further. The various sub-categories of obstacles are each treated differently in this subtask.
ii) Open visited: An open and traversable area that has been visited previously by the robot.
iii) Open unvisited: This cell is unvisited by the robot but has been detected as an obstacle-free
area.
iv) Victim: The cell contains a human victim.
v) Unknown: The cell information is unknown, in which case the cell has not been explored and
there is no sensory information available.
The cells categorized as visited, known or victim cells are defined to be explored cells, whereas
unexplored cells include the unknown and unvisited cells. Unexplored cells can add to the
robot’s knowledge of the scene and lead to exploring new regions. On the other hand, re-visiting
already explored cells may not necessarily provide any new information about the environment.
Therefore, herein, unexplored cells are considered as cells of interest in robot exploration and
assist in determining the exploration direction. To determine the robot’s exploration direction in
the scene, the following exploration utility function is defined:
u_j = \sum_{x=1}^{n} \omega_{xj}\,\lambda_{xj}\,\delta_{xj},  (6)
where uj is the utility function for each of the four individual regions, and j represents the
identity of the region, i.e. North, East, South, and West. x corresponds to the identity of a cell(x)
in region j, n is the total number of cells of interest in region j, and ωxj, λxj, and δxj represent three
non-zero positive coefficients for cell(x) in region j and are initially given the value 1. The
exploration utility function weighs the benefits of exploring a scene based on terrain, the number
of cells of interest in a region of the scene, and the travel distance to cells of interest using the
values of the three coefficients.
1) Terrain Coefficient: ω_xj is the coefficient given to cell(x) in region j based on the type
of terrain of that cell, in particular:
\omega_{xj} = \begin{cases}
w_l\,l_1, & \text{if cell}(x)\ \text{is an open unvisited space} \\
w_l\,l_2, & \text{if cell}(x)\ \text{is an unvisited climbable obstacle} \\
w_l\,l_3, & \text{if cell}(x)\ \text{is an unknown obstacle} \\
1.0, & \text{elsewhere,}
\end{cases}  (7)

where l_3 < l_2 < l_1 and l_m > 1 for m = 1 to 3.
Herein, w_l is a positive weighting applied to l_1, l_2, and l_3. The weighting can be used to set a
higher priority for this particular coefficient. Open unvisited cells are given the largest value since
they are obstacle-free regions and hence allow a robot to easily navigate the cell. Unvisited
climbable obstacles have the second largest value. Unknown obstacles have the lowest value due
to the uncertainty regarding the true nature of the obstacles in these cells.
2) Neighboring Cells Coefficient: λxj is the coefficient given to cell(x) in region j based on the
information in its 8 neighboring cells, namely:
\lambda_{xj} = \begin{cases}
w_p \sum_{k=1}^{8} v_x^k, & \text{when } \sum_{k=1}^{8} v_x^k > 0 \\
1.0, & \text{when } \sum_{k=1}^{8} v_x^k = 0,
\end{cases}  (8)

where:

v_x^k = \begin{cases}
p_1, & \text{if cell } k \text{ is unknown} \\
p_2, & \text{if cell } k \text{ is an open unvisited space} \\
p_3, & \text{if cell } k \text{ is an unvisited climbable obstacle} \\
p_4, & \text{if cell } k \text{ is an unknown obstacle} \\
0.0, & \text{elsewhere,}
\end{cases}  (9)

and where p_4 < p_3 < p_2 < p_1 and p_m > 1 for m = 1 to 4.
v_x^k is the exploration value of the k-th neighboring cell of cell(x), where k = 1 to 8, and w_p is a
positive weighting applied to \sum_{k=1}^{8} v_x^k. λ_x is designed to provide a higher value to cells that are adjacent to
unknown cells or other cells of interest, since exploring those cells may immediately lead to
exploration of other surrounding cells. For example, consider the case in the partial 2D grid map
shown in Figure 4, where the two cells A and B have the same type of terrain (unknown
obstacle), and therefore ω_A = ω_B. However, cell A has only one unknown obstacle among its 8
neighboring cells (λ_A = p_4), whereas cell B has two open unvisited cells and one unknown cell,
giving it the larger exploration value of λ_B = p_1 + 2p_2 when w_p = 1.
Figure 4: Scenario illustrating the contribution of the exploration coefficient, λx.
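As an illustrative sketch of Eqs. (8) and (9), the snippet below reproduces the Figure 4 example. The numeric values chosen for p_1 to p_4 are placeholders that only respect the required ordering p_4 < p_3 < p_2 < p_1 with all values above 1; the dictionary keys are hypothetical names for the cell categories defined earlier.

```python
# Placeholder values for p1..p4, chosen only so that p4 < p3 < p2 < p1 and pm > 1.
P = {"unknown": 4.0,                  # p1
     "open_unvisited": 3.0,           # p2
     "unvisited_climbable": 2.0,      # p3
     "unknown_obstacle": 1.5}         # p4

def exploration_value(cell_type):
    """v_x^k from Eq. (9): zero for any cell type that is not of interest."""
    return P.get(cell_type, 0.0)

def neighbor_coefficient(neighbor_types, w_p=1.0):
    """lambda_x from Eq. (8): weighted sum of the 8 neighbors' exploration
    values, defaulting to 1.0 when no neighbor is a cell of interest."""
    total = sum(exploration_value(t) for t in neighbor_types)
    return w_p * total if total > 0 else 1.0
```

For the Figure 4 scenario with w_p = 1, cell A (one unknown-obstacle neighbor) receives λ_A = p_4, while cell B (one unknown and two open unvisited neighbors) receives λ_B = p_1 + 2p_2.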
3) Travel Coefficient: δxj is the coefficient associated with moving to cell(x) in region j from the
robot’s current cell location, and is a function of dx which is defined as the distance of cell(x) to
the robot’s current occupied cell. As the distance to cell(x) increases, the value of travelling to
cell(x) decreases. δxj favors unknown cells closer to the robot’s current cell for exploration to
allow for more efficient search of the environment:
\delta_{xj} = \begin{cases}
w_q\,q_1, & \text{if } d_x \le X_1 \\
w_q\,q_2, & \text{if } X_1 < d_x \le X_2 \\
w_q\,q_3, & \text{if } X_2 < d_x \le X_3 \\
w_q\,q_4, & \text{if } X_3 < d_x \le X_4 \\
1.0, & \text{elsewhere,}
\end{cases}  (10)

where q_4 < q_3 < q_2 < q_1 and q_m > 1 for m = 1 to 4.
w_q is a positive weighting applied to q_m (m = 1 to 4). X_n (n = 1 to 4) represents predefined distance
thresholds that can be set by an operator. These thresholds can be increased as the size of the
known map increases to reflect more realistic distances.
Once the utility functions for all four regions are evaluated, the direction corresponding to the
region with the largest utility value is chosen as the exploration direction. The desired
exploration direction is passed to the Navigate subtask in order to locally move the robot in this
direction. A robot operator is able to select the weightings for the three coefficients based on the
rescue scenario at hand. In particular, he/she may decide to increase the weighting for one or
more of the coefficients in order to have the robot explore in a desired manner or to maximize
energy efficiency of the robot.
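The combination of the three coefficients in Eq. (6) and the selection of the exploration direction can be sketched as follows. This is only a sketch: the per-cell coefficient triples (ω, λ, δ) are assumed to have been computed already by the terrain, neighboring-cells, and travel rules above, and the function names are illustrative.

```python
def region_utility(cells):
    """u_j from Eq. (6): sum of omega * lambda * delta over a region's
    cells of interest."""
    return sum(omega * lam * delta for (omega, lam, delta) in cells)

def choose_direction(regions):
    """Return the direction whose region has the largest utility value.
    `regions` maps 'North'/'East'/'South'/'West' to lists of
    (omega, lambda, delta) triples, one triple per cell of interest."""
    return max(regions, key=lambda j: region_utility(regions[j]))
```

A region with no cells of interest simply contributes a utility of zero, so the robot is steered toward whichever direction promises the most valuable unexplored cells.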
Concave Obstacle Avoidance Technique
A concave obstacle avoidance technique was developed and integrated with the exploration
algorithm. Herein, a concave obstacle is defined as a collection of non-climbable obstacle cells
within the 2D grid map of the environment which makes a concave pattern of any shape and size.
Figure 5 illustrates two examples of concave obstacles. As can be seen from these two examples,
the concave obstacle outline can be of a simple concave pattern, Figure 5(a), or any other
arbitrary pattern, Figure 5(b). However, the common feature of these two patterns is that if the
exploration direction of the robot happens to be towards the opening of the concave obstacle, as
shown in Figure 5, the robot travelling into the concave obstacle can find itself enclosed by
non-climbable obstacles, making the exploration task considerably more challenging.
(a)
(b)
Figure 5: Examples of concave obstacles.
The proposed concave obstacle avoidance technique is comprised of two steps. The first step is
detection of the concave obstacle ahead of time: the 2D grid map information is used to search
for a continuous boundary of non-climbable obstacles in the exploration direction of the robot.
Once such an obstacle is detected, the second step is to avoid the region by guiding the robot
around the obstacle.
The robot first attempts to detect the concave obstacles utilizing the rubble pile information that
is obtained from the 2D grid map. The following two conditions must be satisfied for the robot to
detect a collection of obstacle cells as a concave obstacle which needs to be avoided:
Condition 1: The collection of obstacles must have an enclosed non-climbable boundary.
Condition 2: All the cells that make up the region within the concave obstacle (including the
non-climbable boundary of the obstacles) must be explored cells.
Figure 6 illustrates how a collection of obstacle cells forming a concave boundary are detected in
the robot’s exploration direction. The detection method is performed to detect concave obstacles
at every navigation step. The algorithm begins by scanning the path in the robot’s heading
direction until a non-climbable obstacle is found, i.e. cell (3,2) in Figure 6. Thereafter, a search
is initiated from the location of the non-climbable obstacle to the left and right of the robot’s path
of exploration, Figure 6. In the example shown in Figure 6, the searching algorithm begins at cell
(3,3) and looks at all the cells to the left and right of the robot’s path of exploration until non-
climbable obstacles are found; this procedure is repeated along the robot’s exploration path down
to cell (3,7). The algorithm stores the information gained in the searching process to estimate the
size and shape of the non-climbable obstacle boundary. During this search, the algorithm ensures
that the two aforementioned conditions for a concave obstacle are met. That is, the boundary of
the set of obstacles is checked for any openings, such as an open visited cell or a climbable
obstacle, and the inner region of the boundary is searched for unexplored cells. If either of the
two conditions is not satisfied during the searching process, the search is terminated and the
possibility of a concave collection of obstacles in the robot’s exploration direction is dismissed.
On the other hand, if the algorithm does detect an enclosed non-climbable boundary of obstacles
the approximate region of this boundary is stored for avoidance.
Figure 6(a) depicts a scenario where both conditions are satisfied, and thus the set of obstacles
need to be avoided. On the other hand, Figure 6(b) depicts a scenario where neither condition is
satisfied. Condition 1 is not met since both a climbable obstacle at (2,2) and an open visited cell
at (8,3) create openings in the boundary of the concave obstacle. Condition 2 is not satisfied
either, since the region bounded by the concave obstacle is not fully explored due to the
existence of an unvisited open cell at (6,5).
(a)
(b)
Figure 6: Scenarios illustrating the concave obstacle detection technique: (a) identified as
a concave obstacle, (b) not identified as a concave obstacle.
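A minimal sketch of the two concavity conditions is given below. It assumes the directional scan has already extracted the candidate boundary and interior cells, and the string labels are hypothetical names for the cell categories defined earlier in this chapter; the scan itself is omitted.

```python
# Cell categories treated as "explored" (visited, known, or victim cells);
# unknown and unvisited cells are deliberately excluded.
EXPLORED = {"open_visited", "non_climbable", "climbable_visited", "victim"}

def is_concave_obstacle(boundary, interior):
    """Apply the two detection conditions to a candidate obstacle region.
    `boundary` and `interior` are lists of cell-type labels."""
    # Condition 1: the enclosing boundary must consist solely of
    # non-climbable obstacle cells (any climbable or open cell is an opening).
    if any(c != "non_climbable" for c in boundary):
        return False
    # Condition 2: every cell inside the boundary must already be explored;
    # otherwise the region may still be worth entering.
    return all(c in EXPLORED for c in interior)
```

Only when both conditions hold is the region stored for avoidance, matching the dismissal rule described above.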
Once a concave obstacle is detected, the approximate region within the 2D map that contains the
obstacle is utilized to assist the robot in avoiding the region. The concave obstacle avoidance is
simply a guidance procedure that allows the robot to steer around concave obstacle regions.
There are two possible scenarios when it comes to the robot’s concave obstacle avoidance. The
first scenario is where the robot is inside the boundaries of the concave obstacle, Figure 7(a).
This scenario may happen, for example, if the robot needs to explore unknown cells within the
set of obstacles. In this scenario, the first step of the avoidance procedure is to move out of the
concave obstacle. This is done by temporarily giving the robot the opposite exploration direction,
Figure 7(b). Once the robot has moved out of the region, the second step is to clear the concave
obstacle by moving sideways to the left or right of the concave obstacle, Figure 7(c). Once both
steps are executed, the robot’s original exploration direction is assigned back, allowing it
to continue exploration as before, Figure 7(d).
The second scenario is where a concave obstacle is detected which has been previously explored.
In this scenario, the first step of the concave obstacle avoidance is skipped and the robot simply
moves sideways to clear the obstacle, see Figure 7(c).
Note that there may be cases where the robot needs to avoid multiple concave boundaries. For
example, while attempting to avoid one concave obstacle the robot may face a new concave
obstacle in a temporary direction of travel. In this case, a dynamic stack list of concave obstacle
regions that should be avoided is created, and the concave obstacle regions are avoided
consecutively.
(a)
(b)
(c)
(d)
Figure 7: Scenario illustrating the concave obstacle avoidance technique.
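The two-step avoidance can be viewed as a direction override, sketched below. The direction tables and the particular sideways direction chosen are assumptions of this sketch (the thesis allows moving to either side), and the `robot_inside` and `cleared` flags are assumed to come from the stored obstacle region.

```python
OPPOSITE = {"North": "South", "South": "North", "East": "West", "West": "East"}
SIDEWAYS = {"North": "East", "South": "West", "East": "South", "West": "North"}

def avoidance_direction(explore_dir, robot_inside, cleared):
    """Temporary travel direction while avoiding one concave region.
    Step 1 (only if the robot is inside the region): back out by
    travelling opposite to the exploration direction.
    Step 2: move sideways until the region is cleared, then restore
    the original exploration direction."""
    if robot_inside:
        return OPPOSITE[explore_dir]     # step 1: leave the region
    if not cleared:
        return SIDEWAYS[explore_dir]     # step 2: steer around it
    return explore_dir                   # resume normal exploration
```

Nested avoidance (a new concave obstacle met while avoiding another) would be handled by pushing each detected region onto the dynamic stack described above and applying this override to the region on top.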
Concave obstacles are a particular issue for an exploration technique that relies on direction of
travel: when a robot is travelling in a specified exploration direction, it can get trapped within
the boundary of a collection of obstacles. A common solution to this problem is a wall-following
technique [46], where the robot enters a concave shaped obstacle until it comes within close
proximity to the obstacle’s boundary. Once the robot reaches the obstacle, the wall-following
technique moves the robot within close proximity to the obstacles, i.e. walls, until the robot finds
its way out of the enclosed boundary. Although this technique can be simple to program and
implement in an exploration algorithm, it is time consuming, especially when dealing with very
large obstacles. To illustrate this point further, Figure 8(a) depicts a scenario where the robot
uses a wall-following technique to avoid the concave obstacle boundary. As can be seen from
this example, the number of steps taken to go around the obstacle boundary entirely depends on
the size and shape of the concave obstacle. However, by utilizing the proposed avoidance
technique the route of the robot can be considerably shortened as illustrated in Figure 8(b).
(a)
(b)
Figure 8: Comparison of the robot’s path to avoid a concave obstacle: (a) path taken with a
wall-following technique, (b) path taken using the proposed concave obstacle avoidance
technique.
3.2.2.3 Navigate
The goal of this subtask is to perform local navigation and obstacle avoidance. Its state is
defined as S(Ci, DE, Dxy, LV/R), where Ci, i = 1 to 8, represents the 2D grid map information for
the 8 surrounding cells of the robot, DE corresponds to the desired exploration direction, and Dxy
contains the depth profile information of the rubble pile in the surrounding environment.
Navigate can be performed based on two conditions: local scene exploration or navigate to a
victim location. If the objective is to navigate to a potential victim rather than just explore a
scene, LV/R (relative location of a potential victim to the robot) is used in the Navigate module to
reach a target victim location. Once the victim location is reached, the Navigate subtask
terminates and allows Victim Identification to perform the task of identifying victims.
Local exploration is performed by the Navigate subtask by utilizing the information of the 8
adjacent cells to the robot’s current cell location, i.e. Ci and not the entire 2D map. For example
in the representation depicted in Figure 9, cell C2 represents the information of the grid cell in
front of the robot (open unvisited region), whereas, C5 represents information of the cell to the
right of the robot (known obstacle). Based on the status of the robot’s surrounding cells, the
optimal primitive actions (F, B or θ) are used to rotate a robot by an angle θ, move a robot
forward (F), or backward (B) within the environment. Hence, the Navigate subtask can search
its local environment by exploring new cells and performing obstacle avoidance.
Figure 9: Grid representation of the local environment surrounding a robot (the robot’s
forward direction is shown towards cell C2).
In local navigation, the navigation reward Rn is generally based on two factors: obstacle
avoidance and local exploration. The immediate rewards are a direct function of the state
definition and the action at each time step. The rewards can be positive or negative depending on
the action that is executed while in state, s. Positive rewards can encourage desirable actions
such as moving in a desired direction or exploring a locally unvisited cell. Moreover, when
various options are available to the robot, the adjacent cells are given priority based on their ease
of accessibility. Herein, positive rewards can also encourage the robot to explore the most
accessible unexplored cell. For example, exploring an open unvisited cell directly in front of the
robot would be more accessible than an unvisited climbable obstacle that is located in a cell to
the right of the robot. The reasoning behind this is simple: in this scenario, in order to explore the
cell in front of the robot only a Forward action must be executed, whereas to explore the cell to
the right of the robot a Turn action, followed by a Forward action must be executed, where the
robot needs to climb over the rubble pile. Furthermore, the Forward action in the second case
(i.e. climbing over the rubble pile) could require more energy and be more time-consuming than
would be required if the robot moves into an obstacle-free cell. In a similar manner, negative
rewards can discourage obstacle collision as well as revisiting explored cells. Although negative
rewards can be given in many situations where an action is undesirable, the value of these
rewards can be determined based on pre-set conditions. For example, due to the physical damage
a collision with an obstacle may cause the robot, the negative reward for such a collision can be
set higher in magnitude than that for revisiting an explored cell, hence placing obstacle avoidance
as a higher priority.
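A possible encoding of this reward structure is sketched below. The magnitudes are placeholders that only preserve the priorities described above (collision penalized most, the easily accessible unexplored cell preferred); the outcome labels are hypothetical names, not the thesis's actual reward table.

```python
def navigation_reward(outcome):
    """Illustrative immediate reward for the Navigate subtask, keyed on the
    outcome of the executed primitive action."""
    rewards = {
        "moved_in_search_direction": 1.0,
        "explored_open_unvisited": 2.0,   # one Forward action, no climbing
        "explored_climbable": 1.0,        # Turn + Forward + climb: costlier
        "revisited_explored_cell": -0.5,  # mild penalty: no new information
        "collision": -2.0,                # largest penalty: risk of damage
    }
    return rewards.get(outcome, 0.0)
```

Any learning update (e.g. the HRL value updates used in this architecture) would consume these immediate rewards unchanged; only the relative ordering shown here matters for the behavior described above.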
Navigating Rubble Piles
For effective navigation in highly cluttered environments, a rescue robot will need to be able to
climb over rubble piles. Dxy represents a 2D array that contains the depth profile information of
the rubble pile that appears in the robot’s path at any given time. The depth profile information
of the rubble pile is obtained from the 3D map and is conveniently represented in Dxy. By
analyzing the depth profile information, the robot is able to classify the rubble pile into two
categories: climbable and non-climbable. By having the ability to effectively distinguish between
climbable and non-climbable obstacles, the robot can potentially shorten travelling distances by
climbing over the climbable obstacles that it may face instead of the more traditional navigation
approach of avoiding all obstacles. Thus, Dxy can assist the robot in achieving a more efficient
exploration in highly cluttered environments.
Figure 10 shows two examples of the depth profile information of a rubble pile in the robot’s
forward view, as represented by Dxy. The lighter cells indicate lower depth values (i.e., closer)
and darker cells indicate higher depth values (i.e., farther) on the surface profile of a rubble pile.
In this figure, the X axis represents the width and the Y axis the height in the 3D depth sensor’s
field of view. The scenario illustrated on the right hand side of the figure shows a rubble pile depth
profile that may be obtained from a climbable rubble pile. On the other hand, the scenario shown
on the left hand side of the figure suggests a non-climbable rubble pile due to its relatively larger
slope and height. Similar to these scenarios, various obstacles can be categorized into climbable
and non-climbable based on the height and slope information that can be extracted from their
respective 2D depth profile. The following section discusses the rubble pile categorization
technique in more detail.
(a)
(b)
Figure 10: 2D Dxy array for rubble pile profile classification. The cell shading becomes
darker as the cell depth values increase.
Rubble Pile Categorization Technique
The objective of the rubble pile categorization technique is to classify the rubble pile located in
the cell in the robot’s direction of travel into one of the following five scenarios: (i) open space,
(ii) non-climbable obstacle, (iii) uphill climbable obstacle, (iv) downhill terrain, and (v) drop,
Figure 11.
Figure 11: Scenarios illustrating the rubble pile classifications: (a) open space, (b) non-
climbable obstacle, (c) uphill climbable obstacle, (d) downhill, and (e) drop.
The rubble pile categorization technique mainly takes into account three parameters from the 3D
profile of the rubble pile to classify it into one of these five categories: i) the overall slope of the
rubble pile, ii) the smoothness of the 3D data, and iii) depth values at specific locations of the
image.
The depth values representing the profile of the rubble pile can be obtained directly from the 3D
image. To obtain the slope and smoothness of the 3D data, a plane, z = Ax + By + C, is fit to the
data points using the least squares method. Given a set of n points \{(x_i, y_i, z_i)\}, i = 1 to n, the least
squares method provides a solution for the three equation parameters A, B, and C which
minimizes the sum of the squared errors between the real depth values z_i and the z values
estimated by the plane, \hat{z}_i = Ax_i + By_i + C. The three parameters are calculated by solving the
following equation [47]:

\begin{bmatrix}
\sum x_i^2 & \sum x_i y_i & \sum x_i \\
\sum x_i y_i & \sum y_i^2 & \sum y_i \\
\sum x_i & \sum y_i & n
\end{bmatrix}
\begin{bmatrix} A \\ B \\ C \end{bmatrix} =
\begin{bmatrix} \sum x_i z_i \\ \sum y_i z_i \\ \sum z_i \end{bmatrix}  (11)
The parameter that is of most interest is B = \partial z / \partial y, which represents the rate of change of the
depth values with respect to the height of the rubble pile. Thus, B provides a rough estimate of
the overall slope and shape of the rubble pile, Figure 12. For example, this parameter can be used
to easily distinguish between a large non-climbable obstacle (e.g. a large rubble pile or a wall)
and open flat ground. It can also be used to estimate the angle of the surface via simple
calibrations which helps determine whether or not the robot is able to climb the rubble pile.
Figure 12 shows how B can be used to distinguish between different rubble pile categories.
Figure 12: Contribution of the overall slope of the depth profile on rubble pile
classification: (a) open space, (b) non-climbable obstacle, (c) uphill climbable obstacle, (d)
downhill.
An estimate of the smoothness, R, of the 3D profile can be obtained by evaluating the sum of the
squared residuals of the data points. Given n data points, R is defined as:

R = \sum_{i=1}^{n} (z_i - \hat{z}_i)^2  (12)
The smoothness of the depth profile provides useful information regarding the terrain conditions
and the ability of the robot to traverse or climb a certain rubble pile. Figure 13 illustrates the
contribution of the smoothness parameter on rubble pile classification. In the two scenarios
shown in Figure 13, the 3D profiles of the rubble piles would produce similar slope values.
However, analyzing these profiles would show that the plane fitted to the 3D data in Figure 13(a)
has a smaller cumulative residual with respect to the actual 3D data than that shown in
Figure 13(b). Thus, the smoothness parameter, R, for the case depicted in Figure 13(b) would be
larger, indicating that the terrain condition is too rough for the robot to traverse.
Figure 13: Contribution of the smoothness parameter, R, to the rubble pile classification:
(a) small R indicates a smooth traversable terrain, (b) large R indicates a rough non-
traversable terrain.
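The plane fit of Equation (11) and the smoothness measure of Equation (12) can be sketched in a few lines. Solving the 3x3 normal equations with Cramer's rule is an implementation choice of this sketch, not necessarily the thesis's; the function names are illustrative.

```python
def fit_plane(points):
    """Least-squares fit of z = A*x + B*y + C to (x, y, z) samples by
    solving the 3x3 normal equations of Eq. (11) with Cramer's rule."""
    sxx = sum(x * x for x, y, z in points)
    sxy = sum(x * y for x, y, z in points)
    syy = sum(y * y for x, y, z in points)
    sx = sum(x for x, y, z in points)
    sy = sum(y for x, y, z in points)
    sxz = sum(x * z for x, y, z in points)
    syz = sum(y * z for x, y, z in points)
    sz = sum(z for x, y, z in points)
    n = float(len(points))
    M = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    b = [sxz, syz, sz]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(M)
    sol = []
    for col in range(3):              # Cramer's rule, one column at a time
        Mc = [row[:] for row in M]
        for r in range(3):
            Mc[r][col] = b[r]
        sol.append(det3(Mc) / d)
    return tuple(sol)                 # (A, B, C)

def smoothness(points, plane):
    """R from Eq. (12): sum of squared residuals about the fitted plane."""
    A, B, C = plane
    return sum((z - (A * x + B * y + C)) ** 2 for x, y, z in points)
```

For classification, only B (the rate of change of depth with height) and R are then compared against the calibrated thresholds discussed later in this chapter.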
The last parameter used in the rubble pile categorization technique is the 3D depth values at
specific locations within the 3D image. This information is useful for detecting a drop in the
terrain, and in some cases, a non-climbable obstacle.
Figure 14(a) depicts an example of what the Dxy for a traversable open space would look like.
By knowing what range of depth values are produced in a traversable open space, and comparing
it to the 3D data produced from the rubble pile, a drop in the robot’s forward view can be easily
detected. Generally if the robot faces a drop, there can be two possibilities. The first scenario
would be if the drop is small enough that the 3D sensor can capture 3D data pertaining to the
rubble surface at the bottom of the drop, Figure 14(b). In this case, the depth values in the lower
rows of the Dxy array will have values that are considerably larger than the depth values that
would normally be produced for a traversable open space, see Figure 14(a) and (b). The second
scenario would be that the drop is so deep that the rubble at the bottom is not in the field of view
of the 3D sensor, Figure 14(c). In this scenario either the entire array will be out of range of the
3D sensor or a sudden increase in depth can be observed between two consecutive rows, as
shown in Figure 14(c).
(a)
(b)
(c)
Figure 14: Comparison of traversable open space with a drop in the terrain: (a) open space,
(b) drop where the surface is observable in the 3D data, (c) drop where the surface is not
observable in the 3D data.
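A drop check of the kind described here might be sketched as follows. The 900 mm threshold follows the calibration values reported later in this chapter; the assumption that the last rows of the Dxy array are those nearest the robot, and the function name, are illustrative.

```python
def detects_drop(dxy, drop_threshold_mm=900.0):
    """Flag a drop when the two bottom rows of the Dxy array read much
    deeper than flat ground would, i.e. the ground ahead has fallen away.
    `dxy` is a list of rows, last rows assumed nearest the robot."""
    bottom = dxy[-1] + dxy[-2]               # two rows nearest the robot
    return sum(bottom) / len(bottom) > drop_threshold_mm
```

A deep drop whose bottom is out of the sensor's range would instead show up as out-of-range readings or a sudden jump between consecutive rows, and would need a separate check.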
Furthermore, analyzing the depth values in addition to the plane parameters can also help the
rubble pile categorization technique detect obstacles that would not normally be detected from
the fitted plane parameters. For this reason, the rubble pile categorization technique checks both
the depth values and the fitted plane parameters for detection of non-climbable obstacles. For
example, in the case of highly discontinuous 3D data, the slope and smoothness parameters
cannot provide an accurate representation of the 3D profile of the rubble pile. An example of this
scenario is shown in Figure 15. In this example, the plane parameter B may not be able to
accurately represent the shape and slope of the rubble pile, but the Dxy array can determine
whether or not the gap in the rubble pile is large enough for the robot to enter. By a simple
examination of the top rows of the Dxy array it can be concluded that the rubble pile should be
categorized as a non-climbable obstacle and that the robot cannot enter this void.
Figure 15: Example of using the position of 3D data within the Dxy array to detect non-
traversable small voids in the rubble.
Calibration
To be able to successfully classify a given 3D profile into one of the five rubble pile categories,
calibration is required. Since the algorithm relies on a fitted plane parameter, B, the relative
viewing angle of the 3D sensory system on the robot plays a crucial role in the calibration
process. That is, once the system is calibrated, the position and orientation of the 3D sensory
system must remain fixed with respect to the robot. In addition, the smoothness parameter and
the depth value thresholds being used in the rubble pile categorization algorithm are unique to
the specific rescue robot being utilized. For example, the maximum allowable R value that
dictates whether or not the robot has the ability to traverse a terrain with a given slope, B,
entirely depends on the robot’s physical capabilities such as locomotion mechanism, tire traction,
clearance under the robot, motor torque, etc. In addition, the depth value thresholds used in the
classification technique are governed by physical dimensions of the robot and the distance of the
sensory system to the ground in front of the robot. Thus, calibrations are needed to find the right
threshold values for the aforementioned parameters and depth values in the algorithm being
specifically tailored to the rescue robot and its 3D sensory system.
Calibration was performed using a total of 70 3D images taken from a USAR-like environment.
The images were taken of open traversable space with various rubble piles, non-climbable
obstacles of different shapes and sizes, drops, downhill terrain, as well as uphill climbable rubble
piles. Before analyzing the data, image processing was performed to filter the masked pixel values
in the 3D images. Masked pixels are pixels in the 3D image that do not have an accurate depth
value due to inadequate illumination of the objects or errors in the 3D sensory system. The
masked pixels are given an arbitrary value (e.g. 1200) for easy identification in image
processing. The image processing simplifies the 3D image by segmenting the 640 by 480 pixels
in the image into an m by n Dxy array, where m and n are factors of 640 and 480, respectively.
The size of the array can be adjusted for higher or lower resolution as needed. In these
calibrations, it was found that a 16 by 12 array provides enough resolution for a successful
categorization of the rubble pile. Using this resolution, the image is segmented into 40 by 40
pixel blocks. Within each pixel block, the non-masked depth values were averaged and the mean
depth value for each block was assigned to the corresponding Dxy array element. This process
not only removes the masked pixels but also simplifies the 3D image into a smaller and more
uniform depth array, Dxy.
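The block-averaging step described above can be sketched as below. The masked-pixel flag value of 1200 is the arbitrary marker mentioned above; `build_dxy` is a hypothetical name, and the image is taken as a plain list of rows rather than whatever image type the real system uses.

```python
MASKED = 1200  # arbitrary flag value assigned to masked (invalid) pixels

def build_dxy(image, block=40):
    """Reduce a depth image (list of rows of depth values) to a Dxy array
    by averaging the non-masked pixels of each block x block region.
    A 640x480 image with block=40 yields the 16x12 array used herein.
    Entirely-masked blocks are left as None."""
    rows, cols = len(image), len(image[0])
    dxy = []
    for r0 in range(0, rows, block):
        out_row = []
        for c0 in range(0, cols, block):
            vals = [image[r][c]
                    for r in range(r0, r0 + block)
                    for c in range(c0, c0 + block)
                    if image[r][c] != MASKED]
            out_row.append(sum(vals) / len(vals) if vals else None)
        dxy.append(out_row)
    return dxy
```

Averaging only the non-masked pixels is what lets the Dxy array tolerate poorly illuminated spots without discarding the whole block.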
Once the 3D image is processed, Equation (11) is used to fit a plane to the Dxy depth profile and
the smoothness parameter, R, is calculated using Equation (12). The main purpose behind
analyzing the plane parameters was to obtain a good understanding of what ranges of values
should be expected for parameters B and R for traversable open space, climbable obstacles, non-
climbable obstacles, and downhill terrain.
Open Space: From calibration data, it was observed that for traversable open space, the plane
parameter, B, was normally between 2.5 and 4.5. The maximum allowable smoothness
parameter, R, was determined to be approximately 5000. Furthermore, the average depth values
obtained from the first two rows at the bottom of the Dxy array, i.e. distance of sensory system to
traversable flat ground, were found to be between 700 and 850 mm.
Climbable Obstacle: It was determined that the rescue robot was capable of climbing relatively
smooth uphill terrain with an incline angle of up to 18 degrees. To be able to relate the parameter
B to the physical slope of an incline, additional calibration was performed using the B
measurements at 9 different angles. Figure 16 shows the plotted results. A 3rd degree polynomial was fit to the data. From the obtained relationship, it was determined that the minimum allowable B value, which corresponds to an incline of 18 degrees, is about 1.36. Thus, B values lower than 1.36 were considered to be non-climbable. In addition, it was found that at a slope of 18 degrees, the robot was capable of climbing a rubble pile with a maximum smoothness parameter, R, of about 1000.
Figure 16: Calibration plot showing the relationship between the plane parameter B and the angle of an incline, θ (degrees). The fitted 3rd degree polynomial is B = -23.697θ³ + 125.08θ² - 228.27θ + 156.66.
Downhill Terrain: The plane parameter, B, for traversable downhill terrain was determined to be
greater than 4.5 and the maximum traversable smoothness parameter, R, was determined to be
about 2000. In addition, the average depth values for the first two bottom rows of the Dxy array were determined to be greater than 800 mm.
Drop: For most cases where the bottom of the drop could be captured in the 3D image, the plane
parameter, B, was found to be greater than 5.5. Also, the average depth value for the two bottom
rows of the Dxy array was found to be greater than 900 mm.
Non-Climbable Obstacle: Due to the wide range of possibilities for non-climbable obstacles, it
was more difficult to generalize the plane parameters and depth values for this category of the
rubble pile. However, by the process of elimination, non-climbable obstacles can be easily
detected. That is, if the plane parameters and depth values do not fit into any of the categories
discussed above, the rubble pile can be assumed to be a non-climbable obstacle. In addition,
there are some guidelines that can support the possibility of a non-climbable obstacle. From the
observations, it was determined that the fitted plane slope, B, for non-climbable obstacles was
normally larger than 3.0. Moreover, it was observed that all non-climbable obstacles had a
smoothness parameter larger than 1200.
By using simple logic that incorporates the aforementioned threshold ranges for the parameter and
depth values, the rubble pile could be successfully classified into the proper categories. A few
examples of the depth images used in the calibration and their respective rubble pile
classification are provided in Table 1.
Table 1: Examples of calibrated depth images
Open space Non-climbable obstacle Climbable obstacle Drop
2D image
3D grayscale
3.2.2.4 Victim Identification
The objective of the Victim Identification subtask is to identify victims in cluttered USAR scenes.
The state of the subtask is defined as S(LV/R, Mxyz), where LV/R represents the locations of
potential victims in the scene with respect to the robot’s location as obtained by the 3D mapping
sensor. When a victim is identified in the scene, the primitive action Tag is executed by this
subtask in order to tag the global location of the victim within the 3D map of the environment,
Mxyz. The Navigate subtask is executed when the robot needs to move closer to a target location,
i.e. a potential victim in the environment, in order to make a positive identification. Currently, a thermal
camera and/or a victim identification algorithm are used to detect victims. The victim
identification algorithm uses 3D object segmentation and a human body model, as well as skin
detection to identify partially obstructed victims in a rubble filled environment [40].
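The skin-detection component is only summarized here; its actual method is given in [40]. As a generic stand-in, a classical per-pixel skin rule in RGB space (the Kovac rule) can be sketched as follows, purely for illustration:

```python
import numpy as np

def skin_mask(rgb):
    """Per-pixel skin classifier using the classical Kovac RGB rule --
    a generic stand-in, not the algorithm of [40]."""
    rgb = rgb.astype(np.int32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return ((r > 95) & (g > 40) & (b > 20)
            & (rgb.max(-1) - rgb.min(-1) > 15)   # sufficient color spread
            & (np.abs(r - g) > 15) & (r > g) & (r > b))
```

In practice such a mask would be combined with the 3D object segmentation and human body model to reject skin-colored but non-human surfaces.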
3.2.2.5 Human Control
The purpose of this module is to pass over the control of the robot to the human operator in case
the robot is unable to perform any of its tasks autonomously. There are two levels in the task
hierarchy in which Human Control can be used to control the robot, Figure 2. The objective of
the 2nd level Human Control subtask is to have a human operator share with the robot the tasks
pertaining to the Victim Identification and/or Navigate to Unvisited Regions subtasks. With
respect to the Victim Identification subtask, the operator can assist in identifying victims in the
scene and tagging their location. The operator utilizes the corresponding 2D and 3D information
of the robot’s surrounding cells in the grid map, as obtained in real-time from the 3D mapping
sensor, to tag the location of a victim. Namely, the operator can input which cell(s) he/she sees a
victim(s) in based on this sensory information. This input is then utilized to update the grid. The
Navigate subtask can also be called upon to move the robot closer to a potential victim, in order
for the operator to make a positive identification and tag a victim’s location. The operator
requests that the robot move to a particular cell based on the sensory information. The distance to
the cell from the robot’s current location is defined by LV/R in the Navigate subtask. With respect
to the Navigate to Unvisited Regions subtask, the operator can assist in choosing optimal search
directions for exploration. On the other hand, the 3rd level Human Control subtask allows for
operator control of the local navigation subtask of the robot. If the robot is unable to navigate
autonomously due to noisy or insufficient sensory data, or if the robot has gotten physically stuck
due to the highly cluttered nature of the environment, this Human Control subtask allows the
operator to take over the low level navigation and obstacle avoidance tasks of the robot. While
the control is passed over to the operator, the aim is to have the robot’s sensory system still track
the motion of the robot and update the 3D map of the scene. For both levels, when Human
Control is executed, the objective is to have the appropriate subtask continue to learn by
observing the implemented actions of the operator via the information path as well as the
outcomes that result from these actions, as highlighted in Figure 2.
3.3 Chapter Summary
In this chapter, the overall semi-autonomous control architecture developed was presented. The
proposed MAXQ task graph was also presented. The main subtasks of the MAXQ hierarchy
were discussed in detail; these included: the Root subtask, the Navigate to Unvisited Regions
subtask, the Navigate subtask, the Victim Identification subtask and the Human Control subtask.
An integrated direction-based exploration technique was also presented which used the rubble
pile information in the 2D grid map to choose a global exploration direction for the robot. A
rubble pile categorization technique utilized by the navigation module of the control architecture
was also discussed.
Chapter 4 Simulations
Simulations were conducted to train the MAXQ and verify the convergence of the Q-values. A
second set of simulations was conducted to test the performance and robustness of the overall
control architecture in performing search and rescue operations in both small and large simulated
environments.
4.1 MAXQ Learning and Q-Value Convergence
Offline training of the MAXQ hierarchy was performed to verify the learning of robot actions
and convergence of the Q-values. The MAXQ hierarchy was trained in a 20 by 20 cell
environment (approximately 336 m²), with each cell occupying an area of 0.84 m². The robot
used in the simulations occupied one cell at a time. The robot’s sensory input for the simulation
included: i) 3D depth information pertaining to the cell in front of the robot for categorization as
one of the five aforementioned rubble pile categories using the Dxy profile, ii) 5 proximity
sensors (as binary values) on the sides and back of the robot, and iii) the probability and distance
of a victim in the robot’s heading. While the boundary of the scene remained the same (20 by 20
cells), the layout of the scene was automatically generated via the random sensory information to
promote learning, as the robot explored the unknown environment. Overall, a total of 1800 trials
were performed, with each trial having its own unique scene layout. Examples of the randomly
generated scenes during four trials of the simulations are shown in Figure 17.
To ensure that the MAXQ task hierarchy chooses the optimal action, the Q-values at every subtask need to converge. That is, in a given subtask i and state si, the Q-value pertaining to the optimal action must converge at a higher level, i.e. a larger numerical value, than the rest of the possible actions.
Figure 17: Simulation screenshots (a)-(d): robot exploring in randomly generated scene layouts. (Map legend: open visited, open unvisited, unknown, unknown obstacle, visited climbable obstacle, unvisited climbable obstacle, non-climbable obstacle, drop, victim, and robot location and heading.)
Simulations were performed for 1.6 million episodes to ensure the convergence of the Q-values.
Herein, one episode is considered as a single step taken by the robot, i.e. a primitive action for
the Navigate subtask. The ε-greedy exploration policy was chosen to allow sufficient exploration
and exploitation of the states and actions for proper convergence of the Q-values. The discount
factor constant was chosen to be 0.8, while the exploration policy parameter ε and the learning
rate α were both decreased at the same rate as the number of episodes increased. Both
parameters were updated according to the following relationship:
(13)
where experience is a positive number that was initialized to 1 and was incremented by 1 every
100 episodes. The experience parameter is incremented every 100 episodes to ensure that the
parameter ε slowly decreases, allowing sufficient Q-value exploration. The learning rate α also
needs to be gradually decreased to ensure that the Q-values have converged before it is
diminished to zero.
Since a total of 4288 possible Q-values exist in the MAXQ hierarchy, it would be impractical to monitor the progress of all of them. Instead, a total of 39 Q-values were selected
from all subtasks. These selected Q-values were chosen based on their importance to the search
and rescue task as well as probability of occurrence in the simulations.
4.1.1 Simulation Results
The 39 Q-values were monitored and their progress was plotted over the 1.6 million episodes.
The simulation results showed that all of the monitored Q-values at the different levels of the
hierarchy had correctly converged as expected. The convergence of the Root task is of most
importance since it defines the overall task of exploring and finding victims for the rescue robot.
For the Q-values of the Root task to converge, all of the subtasks in the hierarchy must converge
correctly. Figures 18-20 depict the progression of Q-values and their convergence at the Root
task for 3 of the monitored states.
Figure 18 depicts the Q-values for all possible actions available at the Root task. Herein, the scenario presented in the figure is one in which the robot in the simulated environment is in a state where the probability of a victim is below 0.3, meaning that the probability of a victim is too low
for the robot to consult the operator with the victim identification task. Also, the robot has no
difficulty exploring new regions of the scene and is not having any type of sensory malfunction
that would require the operator’s assistance. Thus, the optimal action is to execute the Navigate
to Unvisited Regions subtask for the robot to continue exploring the scene for more victims. As
can be seen in Figure 18, the Q-value pertaining to the Navigate to Unvisited Regions subtask
rises above the other Q-values and begins to converge after about 20,000 episodes. Note that in
this figure, the plots for the Human Control actions are overlapped at -20.0.
The reason for the Q-value fluctuation for the Navigate to Unvisited Regions subtask is that once
this subtask is executed, the reward that is received thereafter from the Navigate subtask can vary
based on the robot’s surrounding environment. Since the state definition in the Root task is
independent of those defined in the lower level hierarchy subtask, i.e. state abstraction, there will
be some fluctuation for the Q-values of parent subtasks. However, as long as the fluctuations
remain above the rest of the Q-values after convergence, a greedy policy would choose the
optimal action, i.e. subtask or primitive action.
Figure 18: Convergence of the Root Q-values when exploration is optimal.
Figure 19 depicts the Q-value for the actions available at the Root task in a different state. In this
scenario, the robot is in a state where there is no victim present. However, the robot has not been
successful in finding new unvisited cells to explore. Herein, the optimal action is for the robot to
get feedback from the operator regarding a new exploration strategy by calling the Human
Control for the Navigate to Unvisited Regions subtask. Since the Human Control subtask is a
primitive action without a child subtask of its own, the Q-value rises and converges to the
specific value of 50.0 after only a few hundred episodes and remains constant for the rest of the offline training.
Figure 19: Convergence of the Root Q-values when operator’s assistance is optimal.
The last example of Q-value progression during the offline training is depicted in Figure 20.
Herein, the robot is in a state where a victim has been detected within the cell in front of the
robot. The probability of the victim is above 0.7, meaning that the probability of the victim is
high enough for the robot to automatically tag its location. Therefore, the optimal action at the
Root task is to temporarily halt the exploration task and call Victim Identification to tag the
victim. For this given state, the Q-value pertaining to the Victim Identification subtask increases
abruptly and rises over 100 after about 15,000 episodes where it begins to converge.
In this case, the Q-value fluctuations diminish as the number of episodes increases, but are still present. The overall fluctuations, however, are smaller than those of the Q-value pertaining to the Navigate to Unvisited Regions subtask. The reason for the different fluctuation behaviors is that when the Navigate subtask is executed under Navigate to Unvisited Regions, the rewards that can be given to the system have a wider range of values, thus creating larger fluctuations. In an
exploration mode, under the Navigate to Unvisited Regions subtask, the robot can be facing
many possible scenarios with the potential existence of a variety of rubble pile classifications,
each resulting in a unique reward. On the other hand, the Navigate subtask under the Victim
Identification subtask would only move the robot forward a single cell closer to the victim, if it
were to be executed. This is due to the fact that the victim is normally positioned in front of the
robot, within the 3D sensory range, and the exploration direction is directed towards the victim.
Note that in this figure, the plots for the Human Control actions are overlapped at -20.0.
Figure 20: Convergence of the Root Q-values when victim detection is optimal.
The simulations verify the effectiveness of the developed MAXQ learning technique in training
the Q-values in all of the subtasks of the hierarchy. Figure 21 shows a plot of the robot’s
performance in fully exploring the unknown environment in the first 200 trials. As can be seen
from Figure 21, the robot’s efficiency in exploring the unknown scenes dramatically improved
during the first few consecutive trials of the offline training. The total number of exploration
steps declined for about 20 trials and became steady for the rest of the simulations. The
fluctuation in the plot is largely due to the randomly generated environment, i.e. every trial had a
unique layout.
Figure 21: The steps taken by the robot to fully explore the simulated environment in the
first 200 trials of the offline training.
4.2 Performance of Overall Control Architecture
A total of 864 simulations in cluttered USAR-like scenes were conducted to observe the
performance of the proposed MAXQ HRL technique and direction-based exploration algorithm
in exploring and navigating unknown cluttered environments. Each scene consisted of a 20 by 20
cell environment (approximately 336 m²) with individual cells having an area of 0.84 m². The
robot used in the simulations occupied only one cell at a time.
The layout of each scene consisted of a unique combination of cell categories, providing
different levels of difficulty for robot exploration. Each scene layout contained 7-8 victims
placed strategically in corners and hard to reach regions of the scene. The simulated sensory
input provided to the robot included: i) 3D depth information pertaining to the cell in front of the
robot for classification of cells as one of the five possible rubble pile categories based on the
Dxy profile; ii) distance sensing on the sides and back of the robot, indicating the presence of any
obstacles in the robot’s neighboring cells, and iii) thermal information for indicating the presence
of a victim. The performance of the MAXQ controller was evaluated based on: 1) the ability to
explore the entire scene using the exploration technique, 2) collisions (if any) with non-climbable
obstacles, and 3) the number of victims found. An initial off-line training stage for MAXQ was
implemented prior to conducting the simulations.
Herein, simulations conducted on three of these USAR-like scenes, consisting of different
complexities, are discussed. The three scenes, denoted as Scenes A, B and C, are shown in
Figure 22(a)-(c). In general, the scenes were designed to have a high density of non-climbable
obstacles, creating tight corners for the robot to navigate around in order to find potential
victims. In addition, the scene layouts included many isolated regions separated by non-
climbable obstacles in which the robot could only enter through a single open cell or by
traversing over climbable obstacles. Concave obstacle boundaries were also used in the
simulated scenes to promote changes in the robot’s exploration direction. Scene A was designed
with the largest amount of climbable and non-climbable obstacles compared to the other two
scenes. Scene B consisted of more isolated regions connected by single cell openings. Scene C
had the most difficult layout for exploration due to the sizeable rubble boundaries created in the
scene by large numbers of adjacent non-climbable cells.
Figure 22: Simulated USAR-like scene: (a)-(c) actual scenes, (d)-(f) 2D grid map developed
of the scenes during simulations with robot starting location and heading.
Four different simulations were conducted for each scene in order to determine the overall effect
of the coefficients on exploration time. In three of the simulation sets, a higher weighting (i.e., wi
= 20, for i = l, p, or q) was given to one of the three exploration coefficients, and in the fourth
set, weightings on all coefficients were set to 1. Herein, exploration time is defined as the
number of steps (i.e., the primitive navigation actions F, B, or θ) taken to explore the entire
unknown scene. Each trial was repeated 6 times and the average number of steps was
determined. The values for the coefficients utilized for these simulations are presented in Table
2. These values were chosen following the requirements outlined in Eqs. (7)-(10). The upper and
lower limits on dx were chosen to be proportional to the size of the map. As previously
mentioned, these coefficient values can be chosen by an operator in order to meet user specified
requirements for a rescue operation. To further test the performance of the navigation and
exploration modules of the MAXQ-based controller, the simulations were performed using three
different robot starting positions in each of the scenes, Figure 22.
Additional simulations were performed in order to observe the effect of taking into account the
terrain information of the environment in the exploration technique. The purpose of these
simulations was to compare the proposed algorithm to other exploration techniques which only
take into account factors such as the minimum distance to target locations and maximum
coverage of the unknown scene. To do so, the terrain information was omitted from the robot.
That is, all the cells in the 2D grid map were categorized as: 1) obstacle, 2) open space, or 3)
unknown as defined by most frontier-based exploration techniques. Furthermore, the terrain
coefficient was taken out of the exploration technique by giving it a weighting of 1 while the
other two exploration coefficients were given an equal weighting of 20. A total of 54 simulations
were performed in Scenes A, B, and C. In each scene, the robot was required to perform
the same exploration task, i.e. explore the entire scene while finding all victims, starting at three
different locations in the scenes (same as those shown in Figure 22). Each simulation was
repeated 6 times as before.
Table 2: Exploration coefficients used in the simulations
Conditions                              Sim. 1   Sim. 2   Sim. 3   Sim. 4
Cell x: open unvisited space            200      10       10       10
Cell x: unvisited climbable obstacle    100      5        5        5
Cell x: unknown obstacle                40       2        2        2
Cell k: unknown                         3        60       3        3
Cell k: open unvisited space            2        40       2        2
Cell k: unvisited climbable obstacle    1.5      30       1.5      1.5
Cell k: unknown obstacle                1        20       1        1
dx ≤ 3.0                                10       10       200      10
3.0 < dx ≤ 6.0                          5        5        100      5
6.0 < dx ≤ 9.0                          2        2        40       2
9.0 < dx ≤ 12.0                         1.5      1.5      30       1.5
* dx is obtained from the sum of v_kx as described by Equation (8).
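Scoring a single candidate exploration direction from these coefficients can be sketched as follows. Equations (7)-(10) are not reproduced here, so the lookup values are taken from the Simulation 4 (equal weighting) column of Table 2, and the additive combination of the three weighted coefficients is an assumption made only for illustration.

```python
# Baseline coefficient values (Table 2, Simulation 4 column).
TERRAIN = {"open_unvisited": 10, "unvisited_climbable": 5, "unknown_obstacle": 2}
NEIGHBOR = {"unknown": 3, "open_unvisited": 2,
            "unvisited_climbable": 1.5, "unknown_obstacle": 1}

def travel_coeff(dx):
    """Travel coefficient: nearer unexplored cells score higher."""
    if dx <= 3.0:
        return 10
    if dx <= 6.0:
        return 5
    if dx <= 9.0:
        return 2
    return 1.5   # Table 2 stops at 12.0; reused beyond that for simplicity

def direction_score(cell_x_type, neighbor_types, dx, w_l=1, w_p=1, w_q=1):
    """Assumed weighted sum of the travel (l), terrain (p) and
    neighboring cells (q) coefficients for one candidate direction."""
    p = TERRAIN.get(cell_x_type, 0)                        # terrain
    q = sum(NEIGHBOR.get(t, 0) for t in neighbor_types)    # neighboring cells
    return w_l * travel_coeff(dx) + w_p * p + w_q * q
```

Setting one of w_l, w_p or w_q to 20 while leaving the others at 1 reproduces the weighting scheme used in the four simulation sets.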
4.2.1 Simulation Results
The simulation results are shown in Figure 22(d)-(f) and Figure 23. Figure 22 shows the three
robot starting positions used in each scene layout. As depicted in Figure 22, the robot was able to
fully explore each scene from all three starting positions, providing the opportunity to find all the
victims. In addition, the robot was able to classify the different types of cells including open
cells, climbable and non-climbable obstacles. The robot was able to navigate the scenes without
any collisions. Herein, a collision is defined as moving into a non-climbable obstacle cell. The
only cells in the scene that were still unknown at the end of exploration were those fully obstructed by non-climbable cells or victim cells, as the robot was not allowed to enter these
cells.
Figure 23: Number of exploration steps for the robot in Scenes A, B and C.
In the majority of the simulations, placing a higher weighting on one of the three exploration coefficients resulted in fewer exploration steps. An analysis of variance (ANOVA) was performed on the means of the exploration steps for the four different simulation sets in the three scenes, confirming that a statistically significant difference between the means exists: F(3)=4.98,
p<0.01. Additional paired one-tail t-tests were performed under the hypothesis that placing a
higher weighting on the travel coefficient would result in a lower number of exploration steps.
The results confirmed (p<0.05) that by placing a higher weighting on the travel coefficient, the average number of steps was reduced compared to placing a higher weighting on the terrain coefficient (t(8)=3.36) or the neighboring cells coefficient (t(8)=2.44), or having equal weighting on all coefficients (t(8)=3.76). This occurred because when placing a higher weighting
on the travel coefficient, the robot would explore more cells in its surroundings before moving
on to other regions of the scene. In the cases where the weightings were the same or, either the
terrain or the neighboring cells coefficients had higher weights, the robot revisited a number of
already visited cells, therefore increasing the number of steps it took to explore an entire scene.
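The paired one-tail t statistic used above can be computed with a short stdlib-only helper. This is a sketch: the recorded step counts are not listed here, so the reported t(8) values cannot be regenerated, but with 9 paired trials per comparison the helper yields the matching 8 degrees of freedom.

```python
import math

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for samples a and b.
    A negative t supports the one-tailed hypothesis mean(a) < mean(b)."""
    d = [x - y for x, y in zip(a, b)]           # per-trial differences
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n), n - 1
```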
For Scenes A, B and C, the robot took an average of 1786 (σ = 293), 1911 (σ = 135), and 2026 (σ = 249) exploration steps to explore each scene. An additional ANOVA test was conducted to
determine whether the level of difficulty of the scene affected the number of exploration steps.
With a 95% confidence level, there was no statistically significant difference between the average number of exploration steps for these three scene layouts, emphasizing the robustness of the proposed
technique to different scene layouts. Furthermore, the results showed that the robot’s starting
location within a scene had minimal effect on the number of exploration steps needed.
In addition, a comparison study using Scenes A, B and C was performed to identify the effects of
including terrain as an exploration coefficient in determining robot exploration direction in
cluttered environments. In particular, the study compared the number of exploration steps, scene
coverage and number of victims found: i) using all three coefficients in the direction-based
exploration technique (as obtained from the aforementioned simulations), and ii) only using the
travel and neighboring cells coefficients in the direction-based exploration technique. For the
comparison, both sets of simulations placed a higher weighting on the travel coefficient as this
provided better performance in the previous simulations. In addition, the second set of simulations in the comparison study better represents the cost functions of frontier-based
exploration techniques (which determine frontier target points based on minimizing travel
distance). Simulation set 1 is based on the aforementioned simulations presented in Figure 23 for
which the higher weighting was placed on the travel coefficient. For simulation set 2, terrain
information was removed from the exploration technique, and hence, cell classification was based
on: 1) obstacle, 2) open space, or 3) unknown. The procedure for simulation set 2 was the same
as simulation set 1, namely the robot was required to perform the same exploration task, i.e.,
explore the entire scene while finding all victims, starting at the same three locations in the
scenes as previously mentioned. Each simulation was also repeated 6 times.
The results for both simulation sets are presented in Table 3 and Table 4 for easy comparison.
For the results pertaining to the percentage of scene explored, 100% represents the exploration of
all available cells to the robot (i.e., it does not include the cells that were completely obstructed
and the robot was not able to access). As previously mentioned, the robot was able to explore all
scenes and identify all victims using the proposed three coefficient based exploration method,
Table 3.
Table 3: Exploration results with terrain information.
                      Scene A                      Scene B                      Scene C
Starting Location:    (2,2)S  (3,16)N  (19,11)W    (2,2)S  (13,12)W  (5,15)E    (2,2)S  (10,8)N  (14,16)E
% of Scene Explored:  100%    100%     100%        100%    100%      100%       100%    100%     100%
Exploration Steps:    1107    1512     1631        1923    1829      1718       1745    1883     1808
Victims Found:        8/8     8/8      8/8         8/8     8/8       8/8        7/7     7/7      7/7
Table 4: Exploration results without terrain information.
                      Scene A                      Scene B                      Scene C
Starting Location:    (2,2)S  (3,16)N  (19,11)W    (2,2)S  (13,12)W  (5,15)E    (2,2)S  (10,8)N  (14,16)E
% of Scene Explored:  92.8%   92.8%    4.5%        100%    100%      100%       99.0%   99.0%    99.0%
Exploration Steps:    1896    2091     34          5519    5993      4688       2424    2595     3034
Victims Found:        7/8     7/8      1/8         8/8     8/8       8/8        7/7     7/7      7/7
As can be seen from Table 4, the robot using the direction-based exploration technique without
the terrain coefficient was not able to fully explore Scene A and as a result it was not able to find
all the victims in the environment. For starting location (19,11)W, the robot started exploring in
an isolated region of the scene, as shown in Figure 22(a), where the only way to leave/enter the
region was to traverse over climbable obstacles; however, as terrain information was not utilized
to navigate the robot, it became stuck in this region and only explored 4.5% of the scene. For
starting locations (2,2)S and (3,16)N, the robot was also not able to explore the cell regions for
which it needed to climb over climbable obstacles. However, Scene B was fully explored with all
the victims found and 99% of Scene C was explored with all victims found. In Scene C, a small
number of adjacent cells were not explored since they were obstructed by what the robot had
classified as obstacles (which were actually traversable). With respect to the number of
exploration steps, it can be seen that significantly more exploration steps were needed in
simulation set 2 for which exploration was only based on the travel and neighboring cells
coefficients than for simulation set 1 when all three coefficients were used. For example, for
Scene B, where both approaches had 100% scene coverage, the average number of exploration
steps required was 1823 for simulation set 1 and 5400 for simulation set 2. By comparing the
results from Table 3 and Table 4, it can be seen that the terrain information improved the
performance of robot exploration in these 3D cluttered environments.
4.3 Chapter Summary
Two sets of simulations were conducted to test the developed MAXQ task hierarchy and its
performance in unknown simulated environments. The first set of simulations was designed to
verify the designed MAXQ learning technique applied to the task of search and rescue, collect
Q-values and perform offline training on the system. The simulation results showed that all of
the selected Q-values converged to the optimal policy of their respective subtasks. The second
set of simulations was designed to test the performance and robustness of the overall MAXQ
controller, with a higher focus on the navigation and exploration modules of the control
architecture. The simulation results showed that the robot was able to fully explore the unknown
simulated scene in every simulation and find all the hidden victims without any collisions.
Additional simulations were performed to compare the effect of utilizing the terrain information in the controller and exploration technique. The results showed a substantial decline in the robot's performance in efficiently exploring the unknown simulated environment when the terrain information was omitted.
Chapter 5 Experiments
5.1 Experimental Set-up
Two sets of experiments were carried out to evaluate the performance of the proposed learning-
based controller for a rescue robot in a USAR-like environment. The first set of experiments
focused on evaluating the performance of the navigation and exploration modules. The second
set of experiments compared the performance of the proposed semi-autonomous controller
versus teleoperated control of the rescue robot.
The USAR-like environment was designed to mimic a disaster scene, containing different types
of objects and materials including wood, metal, plastic, brick, ceramic, concrete, paper,
cardboard, plaster, rubber-like polymers and rocks. Six victims, represented by dolls and
mannequins, were also placed in the environment. The majority of the victims were partially
obstructed by rubble in which case only a portion of their bodies were visible such as their limbs
or head. Rubble was strategically placed in the environment such that the robot would have to
explore around corners and obstacles to search for victims. Within the scene, inclines were
created to allow a robot to climb over rubble piles and navigate an elevated upper level of the
scene, Figure 24(a) and (b). In the first set of experiments, a thermal camera placed at the front
of the robot was used to identify victims. Heat pads were placed on the dolls and mannequins to
produce heat signatures to mimic human bodies, Figure 24(d). Since the focus of these
experiments was on the navigation and exploration capabilities of the system, thermal signatures
were used for easy identification of the victims. In the second set of experiments, on the other
hand, the heat pads were removed and a victim identification algorithm was implemented which
used the 3D information from the 3D mapping sensor as well as skin detection from its 2D
images to identify victims in the scene, Figure 24(e).
In both experiments, a rugged mobile robot was utilized, Figure 25. The robot was equipped with
a set of sensors including: 1) One real-time 3D mapping sensor based on a structured light
technique [36]. In the experiments, the sensing range of the sensor was set to 0.914m, i.e., the
length of one cell, in order to only capture the depth information pertaining to the cells in front of
the robot for rubble pile profile classification and updating of the 2D grid map; 2) five infrared
sensors distributed along the perimeter of the robot. The infrared sensors were used to detect the
presence of objects in the cells adjacent to the robot for both obstacle avoidance and updating of
the 2D grid map. In the first experiment, a thermal camera was also utilized to capture heat
signatures from the heat pads covering the victims.
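The role of the perimeter infrared sensors in updating the 2D grid map can be illustrated with a short sketch. This is a minimal simplification, not the implemented system: the grid indexing, cell states, and the mapping of each sensor to a (dx, dy) offset are assumptions for illustration.

```python
# Hypothetical sketch of how perimeter infrared readings could update a
# 2D occupancy grid; the sensor-to-offset mapping and cell states are
# illustrative assumptions, not the thesis implementation.
UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def update_grid(grid, robot_cell, ir_readings):
    """Mark each cell adjacent to the robot as FREE or OCCUPIED.

    ir_readings maps a (dx, dy) offset to True if the sensor on that
    side detected an object within one cell length (~0.914 m).
    """
    x, y = robot_cell
    for (dx, dy), detected in ir_readings.items():
        nx, ny = x + dx, y + dy
        if 0 <= nx < len(grid) and 0 <= ny < len(grid[0]):
            grid[nx][ny] = OCCUPIED if detected else FREE
    return grid

grid = [[UNKNOWN] * 4 for _ in range(4)]
readings = {(1, 0): True, (-1, 0): False, (0, 1): False, (0, -1): True}
update_grid(grid, (1, 1), readings)
```

Cells the sensors cannot see remain UNKNOWN, which is consistent with how unidentified cells are treated in the experimental results below.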
Figure 24: USAR-like scene: (a) one of three scenes for experiment #1, (b) scene for
experiment #2, (c) rescue robot stuck in rubble pile, (d) victim with heat pad under rubble,
in experiment #1, (e) victim on the upper level of the scene, in experiment #2, and (f) red
cushion detected as false victim.
Figure 25: Rugged rescue robot in USAR-like scene.
5.1.1 HRI User Interface
During the trials, the interface presented in Figure 26 was utilized by the operators. The interface
is based on the following sources of information: (i) 2D video provided by two wireless cameras
(one at the front and one at the back of the robot), Figure 26(a) and Figure 26 (b), (ii) an updated
3D map of the environment which is obtained by the SLAM module (readers are referred to [48]
for details regarding map reconstruction), Figure 26(c), (iii) robot status, which provides
feedback as to whether the robot has received control commands and when they have been
implemented, Figure 26(c), (iv) control menu, which allows for connecting the controller to the
interface, displaying the map information or exiting the interface, Figure 26(c), (v) infrared
sensor feedback displayed visually to show the operator the proximity of obstacles surrounding
the robot to warn if the robot is in danger, Figure 26(c), and (vi) real-time 3D information of the
robot’s frontal view obtained from the 3D mapping sensor, Figure 26(d). During the
experiments, the 2D front camera view and the real-time 3D information were continuously
shown on side-by-side monitors to the operator.
Figure 26: (a) 2D image of robot’s front view, (b) 2D image of robot’s rear view, (c) 3D
map, robot status (green indicates the wheels of the robot are moving), infrared visual
feedback on the robot’s perimeter (if objects are very close red rectangles appear), and
control menu display, and (d) real-time 3D information.
5.1.2 Robot Control
The robot was controlled through the HRI interface program. This program had wireless
communication with the robot’s onboard microcontroller, via a Bluetooth module, which in turn
executed the motor commands and sent back the updated infrared sensory data. Thus, all
commands were sent to and executed by the HRI interface program.
Generally, the robot was controlled via two modes:
1) Manual control by a human operator, in which case the input commands were taken via
an input device and received by the HRI interface.
2) Autonomous control by the MAXQ based controller, in which case the commands were
sent through an inter-process communication module between the controller and the HRI
interface program.
The following sub-sections will further discuss the operator’s input device and the inter-process
communication module.
5.1.2.1 Input Device
Operator control of the robot was achieved using an Xbox 360 wireless gamepad, Figure 27. The
gamepad includes two mini joysticks, and a set of buttons. The blue X button is used to connect
the controller to the interface, the yellow Y button is used to display the 3D map, the red B
button is used to display the rear view camera, and the green A button is used to exit the HRI
interface program. The left and right trigger buttons are used to display the robot’s status and the
control menu, respectively. The left joystick is used to move the robot forward and
backward, while the right joystick is used to turn the robot clockwise and
counterclockwise. The operator can input his/her commands via the controller to tag or reject the
presence of a victim using the left and right bumper buttons, respectively. Exploration directions
can be provided by the operator using the D-Pad button. The right and left D-Pad buttons are
used to change the robot’s exploration direction to the right or left of the robot’s current heading
and the down D-Pad button is used to choose an exploration direction directly opposite to the
robot’s current heading.
Prior to the experiments, each participant was given a tutorial on how to use the control pad and
interface, and then had a chance to control the robot outside of the scene for a duration of 10
minutes in order to familiarize him/herself with the robot.
Figure 27: Xbox 360 wireless controller used as the operator’s input device.
5.1.2.2 Inter-Process Communication
To achieve a communication channel between the robot interface program and the higher level
MAXQ based controller, a bidirectional inter-process communication stream was established via
a TCP/IP network socket. A socket provides a means of delivering and receiving data packets
between a server and a client program. The socket address, commonly shared by both the server
and client, consists of a combination of an IP address and a port number. Herein, the interface
program played the role of the server, opening up a communication thread for incoming client
calls. The MAXQ based controller, i.e. the client, called the server at every step of the process,
sending commands to the interface, and receiving the updated robot status as well as the
commands that may have been executed by the operator. The commands sent and received were
in a coded string format, i.e. a combination of letters, which were deciphered by both programs
before execution. Complete lists of the commands sent by the MAXQ based controller and
interface and their respective meanings are summarized in Table A 1-Table A 5 of Appendix A.
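The server/client exchange described above can be sketched with Python's standard socket module. The command string "MF" below is a hypothetical placeholder for one of the coded strings in Appendix A, and the single-exchange structure is a simplification of the per-step call loop.

```python
# Minimal sketch of the inter-process communication described above:
# the HRI interface acts as the TCP server, the MAXQ-based controller
# as the client. The command code "MF" is a hypothetical placeholder.
import socket
import threading

def interface_server(srv):
    """Interface side: accept one client call, decode the coded string
    command, and reply with the current robot status."""
    conn, _ = srv.accept()
    with conn:
        command = conn.recv(64).decode()
        status = "OK" if command == "MF" else "ERR"
        conn.sendall(status.encode())

# The interface program opens a listening socket for incoming calls.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))          # let the OS pick a free port
srv.listen(1)
t = threading.Thread(target=interface_server, args=(srv,))
t.start()

# The MAXQ-based controller calls the server, sends a coded command,
# and reads back the updated robot status.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect(srv.getsockname())
    cli.sendall(b"MF")
    reply = cli.recv(64).decode()
t.join()
srv.close()
```

In the actual system this exchange repeats at every step of the process, with the reply also carrying any commands executed by the operator.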
5.2 Experiment #1: Evaluation of Navigation and Exploration Modules
For the first set of experiments, three different scene layouts were used in the environments to
test the performance of the exploration and navigation modules of the controller, Figure 28(a)-
(c). Scene layout 1 contains both open spaces and non-climbable rubble piles, whereas Scene
layouts 2 and 3 also contain climbable rubble providing the robot with an opportunity to navigate
an elevated upper level of the scene. Note that in layout 2, the victim in cell (3,4) is located on
the lower level portion of the scene, i.e. hidden under the climbable rubble.
Four different experiments were conducted for each scene, with the first three experiments
having a higher weighting of 20 for one of the exploration coefficients, and the last experiment
having equal weightings of 1 for all the coefficients. In the experiments, the same coefficient
values as in Table 2 were used, with the exception that the distance limits Xn were adjusted for
the smaller physical scene to represent more realistic travelling distances. The new values were
X1=1.5, X2=3.0, X3=6.0 and X4=12.0. Similar to the simulations, the experiments for each
scene were also performed with the robot starting in three different starting positions. For each
starting position, the experiment was conducted three times.
5.2.1 Experimental Results
A few experimental depth images taken with the 3D mapping sensor in scene layouts 2 and 3
along with their corresponding Dxy arrays are depicted in Table 5. By utilizing the Dxy arrays the
robot was able to effectively categorize the simplified depth profile information as climbable
obstacle, non-climbable obstacle, or an open traversable space.
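The categorization step can be sketched with threshold rules over the heights in a Dxy array. The thresholds, array shape, and the simple max/min test below are assumptions for illustration, not the calibrated rules used in the experiments (which also distinguish downhill terrain).

```python
# Hypothetical threshold rules for categorizing a simplified depth
# profile (a Dxy array of surface heights, in metres, for the cell in
# front of the robot). All thresholds are illustrative assumptions.
CLIMB_LIMIT = 0.10   # assumed maximum height the robot can climb
DROP_LIMIT = -0.10   # assumed maximum safe step-down

def categorize_cell(dxy):
    """Return a terrain label for one cell's depth profile."""
    peak = max(max(row) for row in dxy)
    low = min(min(row) for row in dxy)
    if low < DROP_LIMIT:
        return "drop"
    if peak <= 0.02 and low >= -0.02:
        return "open space"
    if peak <= CLIMB_LIMIT:
        return "climbable obstacle"
    return "non-climbable obstacle"
```

For example, a nearly flat profile such as `[[0.0, 0.01], [0.0, 0.0]]` would be labeled open space, while a profile containing a 0.3 m peak would be labeled a non-climbable obstacle.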
The experimental results for each scene layout are presented in Figure 28(d)-(f) and Figure 29.
The results verify that the rubble pile classification correctly categorized the depth profile
information in the Dxy arrays in each experiment. The unknown cells represent cells in the scene
that could not be identified, since 3D sensory information could not be obtained from regions
obstructed by walls or rubble piles. With this accurate categorization of the different rubble
piles, the robot was able to create a correct representation of the scene. Figure 29
summarizes the average number of exploration steps for the three scene layouts and the different
robot starting locations and headings within each scene.
Table 5: Experimental depth images and depth profile arrays
Rows: depth image; Dxy array; classification (open space, non-climbable obstacle, climbable obstacle).
Figure 28: 2D representation of experimental scenes; (a)-(c) depict the original scene
layouts 1, 2, and 3; (d)-(f) show the 2D representation of the scene as determined by the
robot during the experiments.
Figure 29: Number of exploration steps for the robot in scenes 1, 2 and 3.
As can be seen in Figure 29, the weightings on the different exploration coefficients did not have
a significant effect on the exploration performance. Robot exploration, on average, took 78 (σ =
17), 84 (σ = 14) and 90 (σ = 6) steps in scene layouts 1-3, respectively. An ANOVA at the 95%
confidence level found no statistically significant difference in the average number of
exploration steps between these three scene layouts, further validating the
robustness of the technique. As can be seen in Figure 29, some of the robot's different starting
locations and headings in the scenes did affect the number of exploration steps. For example, in
Scene 1, starting location (4,6) required an average of 101 exploration steps, while starting
locations (2,2) and (5,2) only required 65 and 68 steps, respectively. In Scene 2, starting location
(3,6) required an average of 65 exploration steps while from starting locations (2,2) and (5,6) the
robot required an average of 95 and 91 exploration steps, respectively. In Scene 3, with starting
locations (2,2), (3,6) and (2,4), the robot completed exploration in 92, 96, and 82 steps,
respectively. The higher numbers of exploration steps were a result of the robot revisiting a
number of the already visited cells. Due to the small size of the overall physical scene and a
limited number of regions to explore, there were fewer exploration options for the robot when
compared to a larger more complex scene (as presented in the simulations). From these
experiments, it can be concluded that when the size of the scene is small, the effects of the
weightings are similar.
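The ANOVA comparison reported above can be reproduced in outline with a one-way F test. The per-scene step counts below are hypothetical stand-ins chosen only for illustration; they are not the recorded experimental data, and no significance conclusion should be drawn from them.

```python
# One-way ANOVA F statistic, of the kind used above to compare mean
# exploration steps across the three scene layouts. The per-scene
# step counts are hypothetical stand-ins, not the experimental data.
def f_oneway(*groups):
    """Return the F statistic for a one-way ANOVA across groups."""
    all_obs = [x for g in groups for x in g]
    grand_mean = sum(all_obs) / len(all_obs)
    # Between-group sum of squares (k - 1 degrees of freedom).
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares (N - k degrees of freedom).
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

scene1 = [65, 68, 101, 80, 76]   # hypothetical step counts
scene2 = [95, 91, 65, 85, 84]
scene3 = [92, 96, 82, 90, 90]
F = f_oneway(scene1, scene2, scene3)
```

The resulting F statistic would then be compared against the critical value of the F distribution at the chosen confidence level to decide whether the scene means differ significantly.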
5.3 Experiment #2: Teleoperated vs. Semi-Autonomous Control
The second set of experiments was conducted to verify the performance of the overall proposed
MAXQ-based semi-autonomous controller versus robot teleoperation. A new scene layout was
created for these experiments, Figure 24(b). The scene contained open space, partially obstructed
victims, non-climbable obstacles as well as climbable rubble to provide the robot with the
opportunity to explore the main lower level and also an elevated level of the scene.
Ten participants ranging in age from 19-33 years participated in the experiments. None of the
participants had any previous experience in remote robot navigation and exploration. The
operators were asked to control the robot within the scene in two different trial sets: (i) having
full teleoperated control over the robot in order to navigate and explore the scene while finding
as many victims as possible, and (ii) while the robot was in semi-autonomous mode, in which
case the MAXQ algorithm and task hierarchy were implemented, and the operator and the robot
shared tasks as needed. The following performance parameters were measured in both trials: (i)
number of victims found, (ii) percentage of the scene explored, (iii) number of robot collisions,
and (iv) total search time. For both trials, the level of difficulty traversing the scene remained the
same, however, the objects and the victims were placed in different locations to minimize human
expert knowledge of the scene.
5.3.1 Experimental Results
Figure 30-Figure 32 and Table 6 present the performance comparison results of teleoperated
control versus semi-autonomous control of the rescue robot. Figure 30 presents the percentage of
the overall rubble scene that was traversed by the robot in all ten trials. In general, during
teleoperation none of the operators were able to traverse the entire scene. Participants 4 and 7
explored the scene the most, with 88% and 94% coverage of the scene, respectively, allowing
them to find all 6 victims located in the scene. In semi-autonomous mode, the robot was able to
traverse the total area of the overall scene for all of the ten trials utilizing the proposed MAXQ
algorithm. On average approximately 4 victims were found in the teleoperated robot mode within
the trials versus all 6 in the semi-autonomous robot mode, Figure 31. False victim identification
was also made by participants 6 and 7 during teleoperation. Participant 6 defined a red cushion to
be a victim, Figure 24(f). Participant 7 also defined a red cushion and an orange plastic casing to
be a victim. Participant 8 also failed to identify a victim that was clearly displayed on the
interface via the front view camera of the robot.
Figure 30: Percentage of scene explored.
Figure 31: Number of victims identified.
Figure 32 presents the total number of collisions that occurred for both experiments. Herein, a
collision is defined as the robot unintentionally bumping into a non-climbable obstacle
(including a victim) or falling off the edge of a cliff, i.e. a drop, as classified by the rubble pile
categorization technique. As can be seen in the figure, the number of collisions was reduced in
the semi-autonomous search and rescue operation for all of the trials. It is important to note that
the few collisions that were detected in the semi-autonomous mode were a result
of the control being passed on to the operator to assist the robot in getting out of a stuck position
in the rubble. The operators would normally hit the surrounding obstacles while trying to rotate
the robot out of the stuck position. Also, some collisions occurred when the robot collided with
parts of the rubble that were not within the field of view of the sensory system. On the other
hand, during teleoperation, in addition to these types of collisions, the majority of collisions took
place even though the infrared range sensors that were presented to the operator on the interface had
detected the obstacles. In the teleoperated trials, the collisions often occurred as a result of the
operators using brute force with just the front view 2D camera to try to navigate the robot over
non-climbable rubble and through narrow passages in the rubble, instead of using all the
available sensory information, such as the infrared sensors around the robot’s perimeter or the
rear view camera to aid in obstacle avoidance. It is also interesting to note that in addition to the
object-based collisions, five out of the ten participants actually collided with one victim in the
scene. In the teleoperated control experiments, only four participants actually navigated the robot
to the upper level of the scene via an inclined climbable ramp for further exploration. Two of
these participants ended up driving the robot off of steep cliffs on the sides of the upper level
instead of navigating the robot down a gradual incline which was accessible to them. This upper
level area of the scene is shown in Figure 24(b).
Figure 32: Number of collisions.
The total time for each trial is presented in Table 6 and is defined herein as the time it took to
explore and navigate the scene, identify victims and exit the scene. The average time for
completion of each trial was determined to be 395 seconds (σ = 250) for the teleoperated mode
and 369 seconds (σ = 7) for the semi-autonomous mode. Participants 4 and 7, who covered the
largest area of the scene and found all victims, had the longest trial times at 595 seconds and 987
seconds. These trial times are significantly higher than the total exploration times that these
participants had in semi-autonomous mode to achieve the same tasks.
Table 6: Summary of trial times in teleoperated and semi-autonomous experiments
Participant 1 2 3 4 5 6 7 8 9 10
Teleoperated Trial
Times (s) 169 283 179 595 186 292 987 377 402 483
Semi-Autonomous
Trial Times (s) 366 369 376 367 364 381 370 355 373 365
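As a check, the summary statistics above follow directly from Table 6; the sketch below recomputes the means and sample standard deviations with Python's statistics module.

```python
# Recomputing the trial-time summary quoted above from Table 6.
from statistics import mean, stdev

teleop = [169, 283, 179, 595, 186, 292, 987, 377, 402, 483]
semi_auto = [366, 369, 376, 367, 364, 381, 370, 355, 373, 365]

teleop_mean, teleop_sd = mean(teleop), stdev(teleop)     # ~395 s, σ ~250
semi_mean, semi_sd = mean(semi_auto), stdev(semi_auto)   # ~369 s, σ ~7
```

The much smaller spread in the semi-autonomous trials reflects the near-constant time the robot took to explore the scene regardless of the operator.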
During the semi-autonomous trials, tasks were shared by the robot and operator. The robot
handed over control to the operator in two particular scenarios: i) when it was physically stuck in
a cell due to unsteady terrain conditions or large rubble piles on either side of the robot
obstructing the robot’s movement, and ii) when it could not identify a victim. On average these
scenarios occurred 5 times in each of the trials. In such cases the system detected that the robot
was not successfully executing the navigation commands, hence the navigation subtask was
handed over to the operator. Control was handed back to the robot once its sensors detected
that the commands executed by the operator had been performed correctly. For victim identification,
the operators were asked to either tag or reject the victim in the viewpoint of the robot using the
control pad.
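The task-sharing rule described above can be sketched as a simple mode switch: the robot yields control when its navigation commands are not being executed or a victim cannot be confirmed, and reclaims it once the operator's commands succeed. The condition checks below are hypothetical simplifications of the system's own detection logic.

```python
# Hypothetical sketch of the handover rule described above. The
# boolean condition inputs stand in for the system's own checks.
ROBOT, OPERATOR = "robot", "operator"

def update_control(mode, command_executed, victim_uncertain):
    """Return who should hold control for the next step."""
    if mode == ROBOT and (not command_executed or victim_uncertain):
        return OPERATOR   # robot is stuck or cannot classify a victim
    if mode == OPERATOR and command_executed and not victim_uncertain:
        return ROBOT      # operator's commands succeeded; hand back
    return mode
```

Keeping the handover decision in one place like this makes it easy to audit when and why control changed hands during a trial.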
With respect to the aforementioned quantitative results, it was found that semi-autonomous
control of the robot had better performance results than teleoperation. Once the experiments were
completed, the participants were also asked to complete a questionnaire reflecting their
experiences with the robot. The questionnaire consisted of questions related to their stress level
during both trials, their ability to utilize the available sensory information, and how semi-
autonomous control affected their overall experience. In regards to stress levels, 8 participants
mentioned that they felt stressed during teleoperation of the robot, while only 2 participants felt
stressed during semi-autonomous control. 8 out of the 10 participants felt that they had a more
difficult time monitoring all of the sensory information that was made available to them in the
teleoperation mode for the entire duration of the search and rescue task, and hence, 7 of them
also felt disoriented and confused about where the robot was in the scene and in which direction
they should explore during the teleoperation mode. With the robot in semi-autonomous mode, 8
of the 10 participants mentioned that they had a better overall experience compared to the
teleoperated mode. In particular, these participants found that the robot having autonomous
capabilities improved their understanding of the scene and their own decision making abilities.
The experimental and questionnaire results confirm that a semi-autonomous control architecture
utilizing a MAXQ learning approach for rescue robots has the potential of improving the
performance of the search and rescue mission in obstacle avoidance, exploration and victim
identification.
5.4 Chapter Summary
The first set of experiments was designed to test the performance of the navigation and
exploration modules of the controller. The experimental results verified the ability of the
robot to navigate the rubble pile and perform obstacle avoidance, as well as to fully explore all three
experimental USAR-like scenes. The second set of experiments compared the performance of
the proposed semi-autonomous control architecture versus full teleoperation. The experimental
results showed an overall improvement in exploration, victim identification and navigation of the
human-robot rescue team. In addition, the majority of the participants agreed that semi-
autonomous control of the robot was an overall better experience compared to teleoperated
control.
Chapter 6 Conclusion
6.1 Summary of Contributions
6.1.1 Semi-Autonomous Control Architecture for Rescue Robots
A novel hierarchical learning-based semi-autonomous control architecture was developed for
rescue robots in cluttered USAR environments. The proposed control architecture allows the
robot to learn and make decisions regarding which tasks should be carried out at a given time
and whether the human or the robot should perform these tasks for optimum results. By giving
the robot the decision making ability to decide when human intervention is required, the human
operator can take advantage of the robot’s ability to continuously learn from its surrounding
environment. Robot sensors in the control architecture include: (i) infrared range sensors on the
perimeter of the robot, (ii) a 3D sensory system providing 3D data for navigation of a rescue
robot, victim identification and mapping of an unknown environment, (iii) 2D visual feedback
for the HRI interface, and (iv) a thermal camera providing heat signature from the surrounding
environment. The SLAM module uses the 2D and 3D images provided by the 3D sensory system
to map the environment. The victim identification module uses the thermal camera and both 2D
and 3D images to detect partially obstructed victims in the robot’s view. The HRI interface
provides the operator with the sensory feedback and the ability to monitor and control the robot.
The deliberation module provides the control architecture with the decision making ability
needed to navigate, explore and perform victim identification. The deliberation module also
decides which tasks should be carried out by the human operator and robot. Finally, the robot
actuators execute the commands issued by the deliberation module.
6.1.2 MAXQ Learning Technique Applied to the Semi-Autonomous Control Architecture
A MAXQ learning technique was developed for the decision making module of the semi-
autonomous controller. In particular, the MAXQ control algorithm allows a rescue robot to learn
and make decisions regarding navigation, exploration and victim identification in a disaster
scene. Furthermore, the designed MAXQ based control architecture decomposes the entire task
of search and rescue into a hierarchy of individual subtasks including: Navigate to Unvisited
Regions for search and exploration, Victim Identification for identifying and tagging the location
of victims, and Navigate for obstacle avoidance and local exploration of the robot’s
surroundings. Each subtask is able to learn the optimal action for its respective module
separately, allowing the control architecture to learn more efficiently than traditional RL based
controllers.
6.1.2.1 Rubble Pile Categorization Technique
A rubble pile categorization technique was developed for the navigation module of the control
architecture. This technique utilizes the 3D information of the rubble pile to effectively classify
it into open traversable space, non-climbable obstacles, climbable obstacles, downhill terrain,
and drops. The detailed classification can help a rescue robot navigate a cluttered environment
more efficiently by identifying climbable obstacles, and avoid dangerous situations such as
unexpected drops and small voids in the rubble that may not be detected by range sensors.
6.1.2.2 Direction-Based Exploration Technique
A direction-based exploration method was designed and integrated within the exploration
module of the control architecture to expand the search area of a rescue robot. Namely, the
occupancy cells of the known grid map are given priority based on their distance to the robot,
coverage potential, and terrain information to effectively choose a global direction for the robot
to explore. The exploration technique also utilizes an obstacle avoidance algorithm which guides
the robot around the enclosed concave obstacle boundaries. The algorithm detects the region
containing the concave obstacle in its path of exploration and temporarily changes the
exploration direction for an efficient avoidance of the obstacles.
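The cell-priority idea can be illustrated with a weighted score over candidate cells. The weights and scoring terms below mirror the travel-distance, coverage, and terrain factors in spirit, but are illustrative assumptions rather than the coefficients used in the thesis.

```python
# Illustrative weighted scoring of candidate exploration cells, in the
# spirit of the direction-based technique above. The weights and
# scoring terms are assumptions, not the thesis coefficients.
def cell_priority(distance, coverage, terrain_cost,
                  w_travel=1.0, w_coverage=1.0, w_terrain=1.0):
    """Higher score = better cell: close by, high coverage, easy terrain."""
    return (w_coverage * coverage
            - w_travel * distance
            - w_terrain * terrain_cost)

def choose_direction(candidates, **weights):
    """Pick the candidate cell with the highest priority score.

    candidates maps a cell (x, y) to (distance, coverage, terrain_cost).
    """
    return max(candidates,
               key=lambda c: cell_priority(*candidates[c], **weights))

cells = {(2, 2): (1.5, 3, 0.0),   # close, modest coverage, open terrain
         (5, 6): (6.0, 9, 0.5),   # far, but a large unexplored region
         (4, 1): (3.0, 4, 2.0)}   # moderate, but costly climbable rubble
best = choose_direction(cells)
```

Raising one weight, e.g. `choose_direction(cells, w_travel=20.0)`, shifts the choice toward nearby cells, which parallels the weighting experiments reported in Section 5.2.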
6.1.3 Simulations
A simulation environment was developed to test the performance of the MAXQ control
architecture. A randomly generated simulation environment was designed to perform offline
training for the MAXQ controller as well as to monitor some of the Q-values in all subtasks for
convergence. Moreover, multiple simulated scenes were designed to examine the robustness of
the navigation and exploration modules of the controller in larger and more complex
environments. The simulation results showed that the robot was able to successfully explore an
unknown scene and find all hidden victims.
6.1.4 Implementation
Experiments were conducted by implementing the proposed controller with a rescue robot in a
USAR-like scene. A rubble filled environment was created to test the ability of a rescue robot
utilizing the semi-autonomous controller in navigating the rubble pile and exploring an unknown
cluttered environment. The experimental results showed that the rescue robot was able to
successfully navigate in the rubble pile by correctly identifying open traversable space,
climbable and non-climbable obstacles and drops. Moreover, the rescue robot was able to fully
explore the cluttered environment and find all the victims. Additional experiments compared
traditional teleoperated control of the rescue robot to the proposed semi-autonomous control
architecture. The results showed a significant improvement in the performance of the operator
and robot in navigating, exploring, and finding victims.
6.2 Discussion of Future Work
The proposed MAXQ-based semi-autonomous control architecture provides a unique approach
to improving the decision making and task sharing capabilities of semi-autonomous control
schemes for rescue robots. However, more extensive experiments in large-scale environments
can help further test and improve the robustness of the MAXQ-based semi-autonomous control
architecture. Due to the limited laboratory space, the direction-based exploration technique could
not be put to extensive testing in a large-scale rubble filled arena. Furthermore, the rubble pile
categorization technique and the navigation module can be tested with more varieties of rubble
pile shapes, sizes, and surface conditions (e.g. more climbable and non-climbable rubble piles).
More extensive experiments can also help improve the partnership between human and robot
control for increased performance in the overall search and rescue mission.
Furthermore, improving the sensor capabilities can greatly enhance the overall performance of
the semi-autonomous controller. For instance, improving the accuracy of the 3D mapping sensor
can reduce uncertainties in the 3D map, victim identification as well as the rubble pile
categorization technique. Increasing the sensing range of the 3D mapping sensor can also assist
in the victim identification module to identify victims that are located farther away from the
robot. Moreover, a more compact 3D mapping sensor can reduce the overall weight of the robot
and allow the robot to enter smaller voids, both leading to better navigation and exploration
capabilities of the robot. Lastly, more infrared sensors can provide the controller with a wider
field of view regarding the robot’s surrounding obstacles.
6.3 Final Concluding Statement
Overall, the proposed HRL-based semi-autonomous control architecture is the first controller of
its kind for rescue robots. Its unique design allows the robot to utilize the MAXQ hierarchical
learning technique to decompose the tasks of search and rescue operations, and learn the optimal
action for each individual subtask in order to improve its overall decision making. Moreover,
the semi-autonomous control scheme keeps the human operator in the control loop, which
greatly enhances the robustness of the controller. It is envisioned that learning-based semi-
autonomous control schemes can significantly enhance the ability of rescue robots by allowing
them to continuously learn from the environment.
References
[1] J. Casper, M. Micire, R. Murphy, “Issues in Intelligent Robots for Search and Rescue,”
Proceedings of SPIE, vol. 4024, pp. 292-302, 2000.
[2] J. Casper, R. Murphy, "Human-Robot Interactions during the Robot-Assisted Urban Search
and Rescue Response at the World Trade Center," IEEE Transactions on Systems, Man, and
Cybernetics-Part B, vol. 33, no. 3, pp. 367-385, 2003.
[3] R. Murphy, “Activities of the Rescue Robots at the World Trade Center from 11–21
September 2001,” IEEE Robotics & Automation Magazine, pp. 50-61, 2004.
[4] J. Burke, R. Murphy, "Human-Robot Interaction in USAR Technical Search: Two Heads Are
Better Than One," IEEE International Workshop on Robot and Human Interactive
Communication, Kurashiki, Japan, 2004, pp. 307-312.
[5] J. Casper, R. Murphy, “Workflow study on human-robot interaction in USAR”, IEEE Int.
Conf. on Robotics and Automation, pp. 1997-2003, 2002.
[6] D. D. Woods, J. Tittle, M. Feil, A. Roesler, “Envisioning Human–Robot Coordination in
Future Operations,” IEEE Transactions on Systems, Man, and Cybernetics Part C,
Application and Reviews, vol. 34, no. 2, 2004.
[7] J. Burke, R. Murphy, M. Coovert, and D. Riddle, “Moonlight in Miami: A field study of
human-robot interaction in the context of an urban search and rescue disaster response
training exercise,” Special Issue of Human-Computer Interaction, Vol. 19, pp. 85-116, 2004.
[8] J. L. Drury, B. Keyes, H. A. Yanco, "LASSOing HRI: Analyzing Situation Awareness in
Map-Centric and Video-Centric Interfaces”, ACM/IEEE Int. Conf. on Human-Robot
Interaction, pp. 279-286, 2007.
[9] D. J. Bruemmer et al., “I Call Shotgun!: An Evaluation of Mixed-Initiative Control for
Novice Users of a Search and Rescue Robot,” IEEE International Conference on Systems,
Man and Cybernetics, The Hague, Netherlands, 2004, pp. 2847-2852.
[10] R. L. Murphy and J. J. Sprouse, “Strategies for Searching an Area with Semi-Autonomous
Mobile Robots,” Proceedings of Robotics for Challenging Environments, Albuquerque, NM,
1996, pp. 15-21.
[11] R. Wegner and J. Anderson, “Balancing Robotic Teleoperation and Autonomy for Urban
Search and Rescue Environments,” Advances in Artificial Intelligence, A. Y. Tawfik, S. D.
Goodwin, Ed. Berlin / Heidelberg, Germany: Springer, 2004, vol. 3060, pp. 16-30.
[12] A. Finzi and A. Orlandini, “Human-Robot Interaction Through Mixed-Initiative Planning for
Rescue and Search Rovers,” Advances in Artificial Intelligence, vol. 3673, S. Bandini, S.
Manzoni, Ed. Berlin / Heidelberg, Germany: Springer, 2005, pp. 483-494.
[13] R. R. Murphy, S. Tadokoro, D. Nardi, A. Jacoff, P. Fiorini, H. Choset, and A. M. Erkmen,
“Search and Rescue Robotics,” in Springer Handbook of Robotics, B. Siciliano and O.
Khatib, Ed. Berlin: Springer, 2008, pp. 1151-1173.
[14] E. Guizzo, “Japan Earthquake: Robots Help Search For Survivors,” Internet:
http://spectrum.ieee.org/automaton/robotics/industrial-robots/japan-earthquake-robots-help-
search-for-survivors, Mar. 13, 2011 [Sep. 14, 2011].
74
[15] A. Birk, K. Pathak, S. Schwertfeger, and W. Chonnaparamutt, “The IUB Rugbot: An
Intelligent, Rugged Mobile Robot for Search and Rescue Operations,” IEEE Intl. Workshop
on Safety, Security, and Rescue Robotics (SSRR), Los Alamitos, 2006.
[16] M. Carlson, C. Christiansen, A. Persson, E. Jonsson, M. Seeman, and H. Vindahl,
“RobotCupRescue – Robot League Team RFC Uppsala, Sweden,” RoboCup 2004 Rescue
Robot League Competition, Lisbon, Portugal, 2004.
[17] M. W. Kadous, R. K.-M. Sheh, C. Sammut, “CASTER: A robot for Urban Search and
Rescue,” Australasian Conference on Robotics and Automation, Sydney, Australia, 2005, pp.
1-10.
[18] J. W. Crandall, M. A. Goodrich, “Characterizing the Efficiency of Human Robot Interaction:
A Case Study of Shared-Control Teleportation”, IEEE/RSJ Intl. Conference on Intelligent
Robots and Systems, Lausanne, Switzerland, 2002, pp. 1290-1295.
[19] J. Shen, J. Ibanez-Guzman, T. C. Ng, B. S. Chew, “A Collaborative-Shared Control System
with Safe Obstacle Avoidance Capability”, Conference on Robotics, Automation and
Mechatronics, Singapore, 2004, pp. 119-123.
[20] X. Perrin, R. Chavarriaga, F. Colas, R. Siegward, J. R. Millan, “Brain-coupled interaction for
semi-autonomous navigation of an assistive robot,” Robotics and Autonomous Systems, vol.
58, no. 12, pp. 1246-1255, 2010.
[21] C. K. Law, Y. Xu, “Shared Control for Navigation and Balance of a Dynamically Stable
Robot,” International Conference on Robotics & Automation, Washington, DC, 2002, pp.
1985-1990.
[22] D. A. Few, D. J. Bruemmer, M. C. Walton, “Improved Human-Robot Teaming through
Facilitated Initiative,” 15th IEEE International Symposium on Robot and Human Interactive
Communication (RO-MAN06), Hatfield, UK, 2006, pp. 171-176.
[23] J. L. Marble, D. J. Bruemmer, D. A. Few, “Lessons Learned from Usability Tests with a
Collaborative Cognitive Workspace for Human-Robot Teams,” IEEE International
Conference on Systems, Man and Cybernetics, vol. 1, pp. 448-453, 2003.
[24] H. Tang, X. Cao, A. Song, Y. Gua, J. Bao, “Human-Robot Collaborative Teleoperation
System for Semi-Autonomous Reconnaissance Robot,” IEEE International Conference on
Mechatronics and Automation, Changchun, China, 2009, pp. 1934-1939.
[25] P. Rishikesh, L. Janardhanan, E. M. Joo, “Goal Seeking of Mobile Robots Using Dynamic
Fuzzy Q-Learning,” Journal of The Institution of Engineers, vol. 45, no. 5, pp. 62-76, 2005.
[26] H. Yu, S. Liu, J. Liu, “A New Navigation Method Based on Reinforcement Learning and
Rough Sets,” International Conference on Machine Learning and Cybernetics, Kunming,
China, 2008, pp. 1093-1098.
[27] B. Huang, G. Cao, M. Guo, “Reinforcement Learning Neural Networks to the Problem of
Autonomous Mobile Robot Obstacle Avoidance,” Fourth International Conference on
Machine Learning and Cybernetics, Guangzhou, China, 2005, pp. 85-89.
[28] R. Makar, S. Mahadevan, M. Ghavamzadeh, “Hierarchical Multi-Agent Reinforcement
Learning," International Conference on Autonomous Agents, Montreal, Canada, 2001, pp.
246-253.
75
[29] L.A. Jeni, Z. Istenes, P. Szemes, H. Hashimoto, “Robot Navigation Framework Based on
Reinforcement Learning for Intelligent Space,” Conference on Human System Interactions,
Krakow, Poland, 2008, pp. 761-766.
[30] P. Stone, R. S. Sutton, “Scaling Reinforcement Learning Toward RoboCup Soccer,” The
Eighteenth International Conference on Machine Learning, Williamstown, MA, 2001, pp.
537-544.
[31] C. J. C. H. Watkins, “Learning from Delayed Rewards,” Ph.D. dissertation, Cambridge
University, Cambridge, England, 1989.
[32] R. S. Sutton, D. Precup, and S. Singh. “Between MDPs and Semi-MDPs: A Framework for
Temporal Abstraction in Reinforcement Learning,” Artificial Intelligence, vol. 112, no. 1-2,
pp. 181-211, 1999.
[33] R. Parr and S. Russell, “Reinforcement learning with hierarchies of machines,” Conference
on Advances in Neural Information Processing Systems 10, Denver, CO, 1998, pp. 1043-
1049.
[34] T. G. Dietterich, “Hierarchical Reinforcement Learning with MAXQ Value Function
Decomposition,” Journal of Artificial Intelligence Research, vol. 13, pp. 227-303, 2000.
[35] S. Singh, T. Jaakkola, M. L. Littman, C. Szepesvari, “Convergence Results for Single-Step
On-Policy Reinforcement-Learning Algorithms”, Machine Learning, 2000.
[36] Z. Zhang, H. Guo, G. Nejat, and P. Huang, “Finding Disaster Victims: A Sensory System for
Robot Assisted 3D Mapping of Urban Search and Rescue Environments,” IEEE, Int. Conf.
on Robotics and Automation, Rome, pp. 3889-3894, 2007.
[37] Z. Zhang and G. Nejat, “Robot-Assisted Intelligent 3D Mapping of Unknown Cluttered
Search and Rescue Environments,” IEEE International Conference on Intelligent Robots and
Systems, pp. 2115-2120, 2008.
[38] Z. Zhang and G. Nejat, “Robot-Assisted 3D Mapping of Search and Rescue Environments,”
ASME Design Engineering Technical Conferences & Computers and Information in
Engineering Conference, DECT2008-50012, pp. 1295-1304, 2008.
[39] Z. Zhang and G. Nejat, “Robot-Assisted 3D Mapping of Unknown Cluttered USAR
Environments,” the Disaster Robots Special Issue, Advanced Robotics Journal, vol. 23, pp.
1179–1198, 2009.
[40] W. G. Louie, “Victim Identification in Urban Search and Rescue Environments.” B.A.Sc.
thesis, University of Toronto, Canada, 2011.
[41] R. A. Howard, Dynamic Programming and Markov Processes. Cambridge, MA: The M.I.T.
Press, 1960.
[42] B. Yamauchi, “A Frontier-Based Approach for Autonomous Exploration,” IEEE
International Symposium on Computational Intelligence in Robotics and Automation,
Monterey, CA, 1997, pp. 146-151.
[43] W. Burgard, M. Moors, C. Stachniss, F. Schneider, “Coordinated Multi-Robot Exploration,”
IEEE Transactions on Robotics, vol. 21, no. 3, pp. 376-386, 2005.
[44] A. Haumann, K. Listmann and V. Willert, “DisCoverage: A new Paradigm for Multi-Robot
Exploration,” IEEE Int. Conf. on Robotics and Automation, Anchorage, Alaska, 2010, pp.
929-934.
76
[45] A. Howard, M. Mataric, and S. Sukhatme, “An Incremental Deployment Algorithm for
Mobile Robot Teams,” IEEE Int. Conf. Intell. Robot. Syst., Lausanne, Switzerland, 2002, pp.
2849–2854.
[46] X. Yun and K. C. Tan, “A Wall-Following Method for Escaping Local Minima in Potential
Field Based Motion Planning,” IEEE International Conference on Advanced Robotics,
Monterey, CA, 1997, pp. 421-426.
[47] P. J. Schneider, D. H. Eberly, Geometric Tools for Computer Graphics, San Francisco, CA:
Morgan Kaufmann, 2003.
[48] B. Mobedi and G. Nejat, "3D Active Sensing in Time-Critical Urban Search and Rescue
Missions," IEEE/ASME Transactions on Mechatronics, DOI
10.1109/TMECH.2011.2159388.
77
Appendices
Appendix A
Table A 1: List of commands and their meanings

Subtask                        Command   Meaning
Navigate                       n0        Only provide the robot's status without executing any actions.
Navigate                       n1        Ask the operator for a navigation input (e.g., forward) and provide the robot's updated status after the command is executed.
Navigate                       nf        Move the robot forward and provide the robot's updated status after the command is executed.
Navigate                       nb        Move the robot backward and provide the robot's updated status after the command is executed.
Navigate                       nr        Turn the robot to the right and provide the robot's updated status after the command is executed.
Navigate                       nl        Turn the robot to the left and provide the robot's updated status after the command is executed.
Victim Identification          v         Ask the operator to tag/reject a victim in the robot's view or to move the robot closer to the victim, and send back the operator's input command.
Navigate to Unvisited Regions  e         Allow the operator to change the robot's exploration direction or terminate the exploration task, and send back the operator's input command (if any). If no input is received from the operator, the robot continues exploration in the same heading.
Table A 2: List of response formats for each command

Subtask                        Command   Response Format
Navigate                       n0        B0B1B2B3B4B5
Navigate                       n1        B0B1B2B3B4B5Rn
Navigate                       nf        B0B1B2B3B4B5
Navigate                       nb        B0B1B2B3B4B5
Navigate                       nr        B0B1B2B3B4B5
Navigate                       nl        B0B1B2B3B4B5
Victim Identification          v         Rv
Navigate to Unvisited Regions  e         Re
Table A 3: List of response values and their meanings for the Navigate subtask

Response  Value  Meaning
B0        0      Robot is not stuck. The command was executed successfully.
B0        1      Robot is stuck. The command was not executed.
B1        0      There is no obstacle within the range of the back sensor.
B1        1      There is an obstacle within the range of the back sensor.
B2        0      There is no obstacle within the range of the front right sensor.
B2        1      There is an obstacle within the range of the front right sensor.
B3        0      There is no obstacle within the range of the front left sensor.
B3        1      There is an obstacle within the range of the front left sensor.
B4        0      There is no obstacle within the range of the back right sensor.
B4        1      There is an obstacle within the range of the back right sensor.
B5        0      There is no obstacle within the range of the back left sensor.
B5        1      There is an obstacle within the range of the back left sensor.
Rn        f      Robot should move forward one cell.
Rn        b      Robot should move backward one cell.
Rn        r      Robot should turn right.
Rn        l      Robot should turn left.
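The thesis does not provide an implementation of this response format; the following minimal Python sketch (all function and variable names are hypothetical, not from the thesis) shows one way an operator-side client could decode a Navigate response of the form B0B1B2B3B4B5, with the optional trailing operator command Rn that the n1 command returns (see Tables A 2 and A 3).

```python
# Hypothetical decoder for the Navigate subtask response string.
# Flag order follows Table A 3: B0 = stuck flag, B1..B5 = obstacle flags.
SENSOR_FLAGS = ["stuck", "back", "front_right", "front_left",
                "back_right", "back_left"]

def parse_navigate_response(response: str) -> dict:
    """Decode 'B0B1B2B3B4B5' plus an optional operator command Rn
    (f, b, r or l), returning each flag as a boolean."""
    bits, rest = response[:6], response[6:]
    if len(bits) != 6 or any(b not in "01" for b in bits):
        raise ValueError("malformed status string: %r" % response)
    status = {name: bit == "1" for name, bit in zip(SENSOR_FLAGS, bits)}
    # Rn is present only in responses to the n1 command.
    status["operator_command"] = rest or None
    return status
```

For example, the response "100101f" would decode to a stuck robot with obstacles at the front left and back left sensors, and an operator request to move forward.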
Table A 4: List of response values and their meanings for the Victim Identification subtask

Response  Value  Meaning
Rv        t      Operator identified a victim in the robot's view and tagged it.
Rv        r      Operator dismissed the possibility of a victim in the robot's view.
Rv        n      Operator wants to move the robot closer to the victim.
Table A 5: List of response values and their meanings for the Navigate to Unvisited Regions subtask

Response  Value  Meaning
Re        o      No input was received. The operator wants to continue exploration in the same heading.
Re        r      The operator wants to explore the region to the right of the robot's heading.
Re        l      The operator wants to explore the region to the left of the robot's heading.
Re        b      The operator wants to explore the region to the back of the robot's heading.
Re        e      The operator wants to finish the exploration and exit the USAR scene.
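Taken together, Tables A 1–A 5 define a small command/response protocol between the operator interface and the robot. As a minimal sketch only (the mappings come from the tables above, but the function and variable names are hypothetical), the command-to-subtask routing could be expressed as:

```python
# Hypothetical routing of operator commands to subtasks, per Tables A 1-A 2.
COMMAND_SUBTASK = {
    "n0": "Navigate", "n1": "Navigate", "nf": "Navigate",
    "nb": "Navigate", "nr": "Navigate", "nl": "Navigate",
    "v": "Victim Identification",
    "e": "Navigate to Unvisited Regions",
}

# Exploration responses Re, per Table A 5.
EXPLORE_RESPONSES = {
    "o": "continue in the same heading",
    "r": "explore to the right of the heading",
    "l": "explore to the left of the heading",
    "b": "explore behind the heading",
    "e": "finish exploration and exit the scene",
}

def subtask_for(command: str) -> str:
    """Return the subtask responsible for handling an operator command."""
    try:
        return COMMAND_SUBTASK[command]
    except KeyError:
        raise ValueError("unknown command: %r" % command)
```

A dispatcher built this way rejects malformed commands up front, so each subtask handler only ever sees the commands listed for it in Table A 1.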
Appendix B
List of Publications Generated from this Thesis:
[1] B. Doroodgar, M. Ficocelli, B. Mobedi and G. Nejat, “The Search for Survivors: Cooperative
Human-Robot Interaction in Search and Rescue Environments using Semi-Autonomous
Robots,” IEEE Int. Conf. on Robotics and Automation, Anchorage, Alaska, 2010, pp. 2858-
2863.
[2] B. Doroodgar and G. Nejat, “A Hierarchical Reinforcement Learning Based Control
Architecture for Semi-Autonomous Rescue Robots in Cluttered Environments,” IEEE Conf.
on Automation Science and Engineering, Toronto, Canada, 2010, pp. 948-953.