
Robotics and Autonomous Systems 15 (1995) 275-299


Reinforcement learning of goal-directed obstacle-avoiding reaction strategies in an autonomous mobile robot*

José del R. Millán¹

Institute for Systems Engineering and Informatics, European Commission, Joint Research Centre, TP 361, 21020 Ispra (VA), Italy

Abstract

In this paper we argue for building reactive autonomous mobile robots through reinforcement connectionist learning. Nevertheless, basic reinforcement learning is a slow process. This paper describes an architecture which deals effectively with complex (high-dimensional and/or continuous) situation and action spaces. This architecture is based on two main ideas. The first is to organize the reactive component into a set of modules in such a way that, roughly, each one of them codifies the prototypical action for a given cluster of situations. The second idea is to use a particular kind of planning for figuring out what part of the action space deserves attention for each cluster of situations. Salient features of the planning process are that it is grounded and that it is invoked only when the reactive component does not correctly generalize its previous experience to the new situation. We also report our experience in solving a basic task that most autonomous mobile robots must face, namely path finding.

Keywords: Reinforcement learning; Mobile robots; Neural networks; Grounded planning

1. Introduction

Our research is about building autonomous mobile robots. In particular, it deals with a basic task that most autonomous mobile robots must face, namely path finding. This problem is that of controlling the robot so that it moves from some initial location to a goal location avoiding obstacles. We define a path as feasible when it is sufficiently short and, at the same time, it has a wide clearance to the obstacles. Also, we are concerned about mobile robots that have to perform their tasks in indoor environments that are unknown, are perceived inaccurately (due to sensor limitations), and are partially dynamic.

* This paper is mainly based on the author's Ph.D. thesis at the Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain. This research has been carried out entirely at the Institute for Systems Engineering and Informatics (EC, JRC).

¹ E-mail: [email protected]

Moreover, mobile robots must make decisions in real time. Finally, we want mobile robots to tackle continuous domains. If the set of robot configurations and robot actions is discrete, then some feasible solutions may be lost and, conversely, some unfeasible solutions may be accepted as correct up to the resolution of the representation.

Traditional approaches to navigation are based on planning. Some of these approaches use computational geometry techniques [15,47,54,62], others rely on potential fields [4,11,27,51], and the rest carry out a heuristic search through a state space [12,14,19,25]. For reviews, see [31,60,66]. Putting aside the particular shortcomings of every particular approach, planning alone is "per se" unsuitable for the



instance of the path finding problem we are interested in, since planning is conceived as an off-line activity that acts on a perfect (or sufficiently good) model of the environment.

One way in which a mobile robot can efficiently navigate without any a priori knowledge is to first develop a model of the results of its actions. The robot can then employ this model to make a plan to reach the goal. Examples of this approach are the works of Atkeson [3], Mahadevan [34], Moore [43] and Thrun et al. [59]. Nevertheless, developing good enough models is generally a difficult task. This is especially true in environments with dynamic obstacles and also when the action space is continuous.

A third approach to building autonomous mobile robots is that of reactive systems. In a pure reactive system [13,18,52,53], no planning mediates between perception and action; instead, the action taken is computed immediately as a function of the current situation. In its simplest form, a reactive system constantly examines its sensors and uses their readings to index a look-up table of suitable responses. Thus the knowledge of a reactive system is represented as situation-action rules.

Basic reactive approaches are usually criticized for their computational complexity [22] and for their programming complexity. The former arguments stress that the size of the table grows exponentially with the complexity of the domain. The latter arguments highlight that the designer needs to think in advance about all possible situations the robot may find itself in.

It is our view that both of these limitations can be overcome within a connectionist learning framework in which the robot learns by doing. A connectionist system does not need to represent explicitly all possible situation-action rules as it shows good generalization capabilities. In addition, a robot that learns by doing does not need a teacher who proposes situations and suggests correct actions. The robot simply tries different actions for every situation it finds when experiencing the environment and selects the most useful ones as measured by a reinforcement or performance feedback signal. A second benefit of learning in this way is that the robot can improve its behavior continuously and can adapt itself to new environments. Furthermore, connectionist systems have been shown to deal well with noisy input data, a capability which is essential for any robot working upon information

close to the raw sensory data. This kind of trial-and-error learning is also known

as reinforcement learning [5,7-9,24,56,61,64]. In the reinforcement learning literature, the situation-action mapping is customarily called a policy. Simply stated, a reinforcement task is that of learning to associate with each situation X the action Y that maximizes reinforcement z (either present, future or cumulative), and it is solved by two simultaneous mechanisms. First, if an action is followed by high reinforcement, then the robot tends to take it in the future; otherwise its tendency to perform a different action is strengthened. Second, robots must trade off performing the best currently known action for the situation faced and performing a different, unknown one to determine whether the latter is better than the former.

Finding out which components of the connectionist network should be modified (and how much) in order to codify the correct situation-action mapping is called the structural credit assignment problem. However, this is not the only problem to be solved in a reinforcement task. If the consequences of an action can emerge later in time, then the second problem is to learn to estimate the long-term consequences of actions in order to solve the structural credit assignment problem immediately after actions are performed. This is called the temporal credit assignment problem. Reinforcement-based reactive mobile robots must deal with both the structural and the temporal credit assignment problems.

In the case of the path finding problem, it is possible to use the reinforcement signals associated with situation-action pairs to define a vector field over the environment. Then, the robot will normally perform the action that follows the gradient of the vector field. Seen in this way, our approach is related to the potential field approaches [4,11,27,50,51]. However, our approach differs from them in two fundamental aspects. First, any potential field algorithm relies on predetermining a perfect potential function that allows the robot to reach the goal efficiently. Otherwise, the robot may get trapped into local maxima different from the goal. On the other hand, a reinforcement approach like ours starts with an initial and inefficient vector field generated from rough estimates of the reinforcement signals and adapts the field to generate efficient trajectories as it solves the temporal credit assignment problem. This point is particularly illustrated in Section 5.


Second, most of the potential field approaches require a global model of the environment while our approach only needs local information obtained from the robot's sensors. The relationships of our approach with potential field approaches are further discussed in Section 6.1.

Nevertheless, searching for the appropriate action for each situation in a continuous space by using a reinforcement signal is a very hard problem. For example, if after taking an action for a given situation the reinforcement is much worse than expected, then the robot only knows that the appropriate reaction does not lie in the action subspace presently explored. This example illustrates what we believe to be the main shortcoming of reinforcement systems, namely their inability to determine rapidly where to search for a suitable action to handle each situation. If a reinforcement system fails to do that, then learning the correct situation-action rules will require an extremely long time. Furthermore, because the robot is embedded in the environment, performing bad rules may cause physical damage to the robot and/or to the environment.

Because this shortcoming is due to the limited information provided by the reinforcement signal, the way to avoid it is to develop modules that exploit other kinds of information, not only to elaborate more sophisticated learning techniques. One such module is the codification scheme used by the connectionist network. If the input codification is powerful enough to enhance the generalization capabilities of the network, then searching for the suitable action for a given situation can exploit the knowledge already acquired for other situations. In [40] this hypothesis is demonstrated empirically, since the pure reinforcement-based reactive robot described there tackles efficiently a simplified version of the path finding problem.²

We have found, however, that this pure reinforcement-based reactive approach is not sufficient when dealing with complex (high-dimensional and/or continuous) situation and action spaces [38]. In such cases, we claim that learning can be considerably accelerated if the robot uses planning techniques

2 The simplifications concern, mainly, three aspects: the robot is considered to be a point, obstacles are modelled by circles, and the input to the connectionist network is obtained after a considerable amount of geometric preprocessing of the sensory information.

for figuring out what part of the action space deserves attention for each situation that the reactive system does not know how to handle. The reactive system invokes the planner when it cannot correctly generalize its previous experience to the new situation. A salient feature of the planning process we argue for is that it is grounded, since it works upon information that is extracted from the current values of the robot's sensors.

The main contribution of this paper is an architecture that extends the basic reinforcement learning framework in several directions, the most important of them being the integration of reaction and planning. Some other extensions aim to enhance the generalization abilities of the built-in connectionist network. An autonomous robot based on this architecture then learns the suitable reactions extremely quickly, from direct interactions with its environment, so as to generate feasible paths in real time. The architecture has been explored in simulation. Ref. [39] presents first results obtained with an implementation of this architecture on a real robot.

As in many other works, an important assumption of our approach is that the goal location is known. In particular, the goal location is specified in relative Cartesian coordinates with respect to the starting location.

The rest of the paper is organized as follows. Section 2 describes the simulated robot testbed. Section 3 presents the architecture for our reinforcement-based reactive robot. Section 4 reports the simulations we have carried out. Section 5 suggests how to extend the architecture further in cases where planning does not bring much benefit to the robot. Section 6 reviews related work. Section 7 discusses how to implement our approach on a real robot. Finally, Section 8 draws some conclusions about this work.

2. The simulated robot testbed

The simulator is intended for a cylindrical mobile robot of the Nomad 200 family operating in a planar environment. This robot, shown in Fig. 1, has 16 infrared sensors, 16 sonar sensors and 20 tactile sensors evenly placed around its perimeter. Infrared and sonar sensors provide distances to the nearest obstacles the robot can perceive.


Fig. 1. The Nomad 200 mobile robot.

Infrared sensors can only detect obstacles near the robot, whereas sonar sensors are intended for distant obstacles. Tactile sensors are built into the bumpers and are binary; that is, they turn on when the robot detects contact with an obstacle.

The robot has three independent motors. The first motor moves the three wheels of the robot together. The second one steers the wheels together. The third motor rotates the turret of the robot. The infrared and sonar sensors are placed on the turret.

This robot is controlled by a reinforcement-based reactive system, whose main component is a connectionist network. This connectionist controller uses the current perceived situation to compute a spatially continuous action. Then, the controller waits until the robot has finished performing the corresponding motor command before computing the associated reinforcement signal and the next action. The next subsections describe what the input, output and reinforcement signals are.

2.1. Input information

The input to the connectionist network consists of the attraction force exerted by the goal and a perception map derived from sensor readings. The sensory information used to build up the perception map corresponds to the perfect distances to the sensed obstacles disrupted by noise. This noise is modeled by multiplying the sensed distances by a uniform random number in the interval [-0.15, 0.15].

2.1.1. The attraction force

The direction of the attraction force is that of the vector that connects the current and goal robot locations. We call this vector the shortest path vector (SPV). The intensity of the attraction force is an inverse exponential function of the distance between the current and goal locations. These two signals, direction and intensity of the attraction force, are real numbers in the interval [0, 1].

Nevertheless, the system does not use these two signals directly, but a semi-coarse codification of them. Each signal (direction or intensity) is represented as the pattern of activity over a set of processing units with localized receptive fields. That is, every unit is located at a particular point of the interval [0, 1] and is maximally active when the signal corresponds to that point; on the other hand, if the distance between the signal and the unit is greater than a given threshold, then the unit does not fire. The main reason for using this codification scheme is that, since it achieves a sort of interpolation, it offers three theoretical advantages, namely greater robustness, greater generalization ability and faster learning. A second reason is that, since the dimension of the second component of the input (i.e., the perception map) is considerably high, the connectionist network could neglect the role of the attraction force if only two units were used to codify it.

In particular, each signal (direction or intensity) is codified into 5 units and the following function is used to determine the activation level of a unit centered on the point $y_i$ when codifying the value $x$:

$s_w(y_i, x) = \begin{cases} \frac{1}{w^2}\,[x - (y_i - w)]\,[(y_i + w) - x], & \text{if } x \in [y_i - w, y_i + w], \\ 0, & \text{otherwise,} \end{cases}$   (1)

where w is the width of the receptive fields. Fig. 2 provides an example of how this function works. Note that the same kind of codification may be obtained by using other functions, such as the Gaussian.
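For concreteness, a minimal sketch of this codification follows (Python; an editorial illustration rather than the paper's Common Lisp implementation, and it assumes the 1/w² normalization of Eq. (1) so that a unit centered exactly on the signal outputs 1):

```python
def coarse_code(x, centers=(0.1, 0.3, 0.5, 0.7, 0.9), w=0.35):
    """Semi-coarse codification of a scalar x in [0, 1] (cf. Eq. (1)).

    Each unit is centered on one of `centers`; it is silent when x falls
    outside its receptive field [y_i - w, y_i + w].
    """
    pattern = []
    for y in centers:
        if y - w <= x <= y + w:
            # Parabolic bump: maximal (1.0) at x == y, zero at the edges.
            pattern.append((x - (y - w)) * ((y + w) - x) / w ** 2)
        else:
            pattern.append(0.0)
    return pattern

# The signal 0.45, for instance, mainly activates the units at 0.3 and 0.5.
print(coarse_code(0.45))
```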

As we have mentioned before, an assumption of our approach is that the robot is given the exact location


Fig. 2. Semi-coarse codification of a signal by means of 5 units. These units are centered on the points 0.1, 0.3, 0.5, 0.7, and 0.9. The width of the receptive fields of all the units is 0.35.

of the goal with respect to the starting location. However, our approach does not require a perfect odometry system that keeps track of the robot's relative position with respect to the goal. If the robot's position computed by the odometry system is not too different from the actual one, then the robot may still produce acceptable actions. As pointed out above, a coarse codification offers the robot greater robustness and greater generalization ability. One of the experiments reported in Section 4 provides empirical evidence for this statement.

2.1.2. The perception map

The perception map is a discretization of the perception range of the robot. In the simulations, the robot has a perception range of 2 units. The perception map is a grid of 21 × 21 cells, with a resolution of 0.2 units. The contents of each cell of the perception map is the probability of occupancy; i.e., the contents of a cell is 1.0 when the robot is sure that an obstacle lies in the area of the environment corresponding to that cell. The perception map is generated from sensory data using certainty grid representations. Certainty grids, also called occupancy grids, have been proposed by Moravec and Elfes [20,44,45] and used by others [11,34].

Fig. 3. The perception map.

While the robot is continuously sampling its sensors, it only uses the sensor readings obtained after performing an action to derive the next perception map. Every sensor reading provides the distance to the nearest object within a 30° cone, but it does not specify where the object is located on the reflection surface of the cone. Basically, the perception map is generated from the cones' reflection surfaces in the following way

(see Refs. [20,44] for more details). First, the cells that intersect the cone's reflection surface at its center have a higher occupancy probability than those that intersect it at the edges. Second, cells in the cone and preceding the reflection surface are considered to be empty, while cells succeeding the reflection surface are considered to be occupied. In grid-based navigation approaches it is customary to integrate new sensory data into the current certainty grid in order to compute the next certainty grid. Our approach, however, builds every perception map from the sensor readings solely. In doing so, we lose some reliability in the obstacle representation. On the other hand, this approach avoids a further dependency on the odometry system since we do not need to determine what cells of the current grid are still present in the next one.

Fig. 3 shows an example of a perception map where the contents of the cells are represented in a grey-level scale; i.e., white cells correspond to definitely non-occupied cells. This figure also illustrates two additional features of our perception map. First, the robot is located in the center of the grid (the point indicates the direction of movement). Second, the orientation of the perception map is that of the environment.

In order to generate a global perception map we exploit the physical organization of the robot. Since the sensors are mounted on the turret and the turret is


controlled by an independent motor, it is possible to keep the robot looking north while moving along any direction. The robot's turret could alternatively be oriented toward the direction of movement. However, this manner of orienting the perception map would render learning more difficult. The reason is that every time the robot reached a given location in a fixed environment following a different direction of movement, a different perception map would be generated.

The dimensionality of the perception map is reduced before the map is fed to the connectionist network. The dimensionality reduction is accomplished through two consecutive coarse codifications. In the first step, the original perception map is codified into a second grid of 10 × 10 units. As illustrated in Fig. 4, the cells of the new grid have overlapping local receptive fields; i.e., a new cell gets information from a 3 × 3 array of original cells whose external columns and rows are also connected to neighboring cells. The activation of each new cell is the average of the occupancy probabilities of the 9 cells in its receptive field. In the second step, the grid of 10 × 10 units is codified into a third grid of 4 × 4 units. In this case, the receptive field of a new cell contains a 4 × 4 array of original cells, and the receptive fields of two neighboring cells share two adjacent columns or rows of the arrays. Again, the activation of each new cell is the average of every cell's activation in its receptive field.³ The potential benefits of using this kind of coarse codification are the same as those of the codification of the attraction force. Particularly, this double coarse coding, in addition to the certainty grid representation of the robot's perception range, allows the robot to deal with noisy sensors.

The choice of this double coarse coding was made empirically as a trade-off between preserving as much information as possible and reducing the dimensionality. All the simulation experiments we have carried out have confirmed that the information resulting from this coarse coding approach allows the robot to discriminate sensory states requiring different actions.

In conclusion, the situation perceived by the robot is a vector of 26 components: 10 of them codify the attraction force and the other 16 components represent the perception map. All the components are real values in the interval [0, 1].
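The two-step reduction described above can be pictured with the following sketch (Python/NumPy; our own illustration, with the window sizes and strides inferred from the description, not given by the paper):

```python
import numpy as np

def pool(grid, window, stride):
    """Average pooling with overlapping receptive fields."""
    n = (grid.shape[0] - window) // stride + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = grid[i * stride:i * stride + window,
                             j * stride:j * stride + window].mean()
    return out

perception_map = np.random.rand(21, 21)            # occupancy probabilities
coarse = pool(perception_map, window=3, stride=2)  # 21 x 21 -> 10 x 10
coarser = pool(coarse, window=4, stride=2)         # 10 x 10 -> 4 x 4

attraction_code = np.zeros(10)   # semi-coarse coding of Section 2.1.1
situation = np.concatenate([attraction_code, coarser.ravel()])  # 26 values
```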

Fig. 4. Coarse coding of the perception map.

2.2. Output information


It was stated in the previous subsection that the orientation of the robot's turret remains fixed all along the path. So, the connectionist network only needs to control the translation and steering motors.

The "natural" choices for the angle a and the step length l along the direction determined by a are real values in the intervals [ - 1 , 1] and [0, 1], respec- tively. Nevertheless, this action codification suffers from the following shortcoming. I f c~ is a real num- ber in [ - 1, 1 ], then the actual direction of the move would be

$actual\_direction = \pi \alpha.$   (2)

This means that two symmetrical and very close directions, say π - ε and -π + ε, would correspond to two opposite α signals, say 1 - ε′ and -1 + ε′.

A way of overcoming this limitation is to codify both output signals, α and l, in the same interval [-1, 1].

³ It would also be possible to transform the original grid of 21 × 21 cells directly into the grid of 4 × 4 cells. But in this one-step coarse codification the activation of each new cell would be a weighted average of the occupancy probabilities of the cells in its receptive field. For the sake of simplicity and clarity, we have preferred to use two consecutive uniform filters rather than a single non-uniform filter.


In fact, the physical robot can translate forward and backward along the direction in which the wheels are aligned. The problem, now, is that every possible action is codified by two different combinations of output signals, namely (α, l) and (-α, -l), which renders learning more difficult.

An alternative way of codifying an action is to specify it as a move in relative Cartesian coordinates with respect to the SPV. That is, the positive x axis is this vector. Both increments, Δx and Δy, are real values in the interval [-1, 1]. These increments are postprocessed to obtain the required translation and steering motor commands. In this way, the connectionist controller can generate spatially continuous actions.
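The paper does not spell out this postprocessing step; the following sketch (Python; the conversion and all names are our own assumptions) shows one plausible reading, in which the SPV-relative move is rotated into the world frame and split into a steering direction and a step length:

```python
import math

def to_motor_commands(dx, dy, spv_angle):
    """Convert an action (dx, dy), expressed relative to the SPV, into a
    steering direction (world frame) and a step length.

    spv_angle: world-frame orientation of the SPV, in radians.
    """
    # Rotate the SPV-relative move into the world frame.
    wx = dx * math.cos(spv_angle) - dy * math.sin(spv_angle)
    wy = dx * math.sin(spv_angle) + dy * math.cos(spv_angle)
    steering = math.atan2(wy, wx)                # direction to steer the wheels
    step_length = min(math.hypot(dx, dy), 1.0)   # cf. Eq. (4) below
    return steering, step_length
```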

As a final remark, it should be noted that the actual length of the move made by the robot is computed by the following expression:

$actual\_length = \begin{cases} step\_length \times radius, & \text{if } step\_length \times radius > step_{\min}, \\ step_{\min}, & \text{otherwise,} \end{cases}$   (3)

where $step_{\min}$ is a constant that represents the minimum action and

$step\_length = \begin{cases} \sqrt{\Delta x^2 + \Delta y^2}, & \text{if } \sqrt{\Delta x^2 + \Delta y^2} < 1, \\ 1, & \text{otherwise,} \end{cases}$   (4)

$radius = \min\big(perception\_range/2,\ dist[location_{current}, location_{goal}]\big).$   (5)

2.3. Reinforcement signal

The reinforcement signal is a measure of the cost of doing a particular action in a given situation. Given our definition of feasible path, actions incur a cost which depends on both the step clearance and the step length. In particular, the reinforcement signal z is

$z = \frac{sc + sl}{2},$   (6)

where sc and sl are the contributions due to the clearance and the length of the step, respectively; sc, sl and z ∈ [-1, 0].

Concerning the step clearance, the robot is constantly updating its sensor readings while moving. Thus the step clearance is the shortest distance provided by any of the sensors while performing the action. Now,

$sc = \begin{cases} 0, & \text{if } step\_clearance > k_{clear}, \\ \dfrac{step\_clearance}{k_{clear}} - 1, & \text{otherwise,} \end{cases}$   (7)

where $k_{clear}$ is the shortest distance to the obstacles that is considered to be safe.

On the other hand, since only steps that do not follow the SPV contribute to making the path length greater than optimal, sl is proportional to the deviation of the step from the SPV:

$sl = -actual\_length \times \mathrm{angle}[\,l_t l_{t+1}, \mathrm{SPV}\,],$   (8)

where actual_length is defined as in Eq. (3), $l_t$ is the robot location before performing the action, $l_{t+1}$ is the next location, $l_t l_{t+1}$ is the vector from $l_t$ to $l_{t+1}$, and "angle" is a linear function that returns a value in [0, 1] such that it equals 0 when the robot moves along the SPV and it equals 1 when the robot moves in the opposite direction.
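A compact sketch of this cost signal (Python; an editorial illustration, with the linear "angle" term of Eq. (8) written out explicitly as an assumption):

```python
import math

def reinforcement(step_clearance, actual_length, deviation_angle, k_clear=0.1):
    """Cost z in [-1, 0] of the step just performed (cf. Eqs. (6)-(8)).

    deviation_angle is the angle (radians) between the executed move and
    the SPV; the linear "angle" function maps 0 -> 0 and pi -> 1.
    """
    # Clearance term (Eq. (7)): 0 if the step stayed safely away from obstacles.
    sc = 0.0 if step_clearance > k_clear else step_clearance / k_clear - 1.0
    # Length term (Eq. (8)): penalize deviation from the shortest path vector.
    sl = -actual_length * (deviation_angle / math.pi)
    return (sc + sl) / 2.0          # Eq. (6)
```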

3. A reinforcement-based reactive architecture

In this work, we formulate the path finding problem as a sequential decision task. That is, the robot's aim is to learn to perform those actions that maximize the total, or cumulative, amount of reinforcement obtained along the path.

To solve this formulation of the path finding problem, we will use temporal difference (TD) methods [9,57]. Roughly, TD methods allow the robot to learn to predict the cumulative reinforcement it will obtain if it performs the best currently known actions that take the robot from its current location to the goal. It should be noted that TD methods are not reinforcement learning rules, but the way in which the robot uses them for discovering the suitable set of situation-action rules is an instance of reinforcement learning.

Before describing the learning algorithms, we will investigate three issues that need to be addressed so that a reinforcement-based reactive approach may be scaled up to deal with autonomous mobile robots. We will also propose an architecture for handling all of them.


3.1. Extensions to the basic reinforcement framework

As pointed out before, a robot using the basic reinforcement learning framework might require an extremely long time to acquire the suitable reactions. In this section we will discuss three ways in which a reinforcement-based reactive robot can efficiently learn.

3.1.1. First issue: guessing where to search

The main reason for the slow convergence of basic reinforcement learning is that it is hard to determine rapidly where to search for the suitable action to handle each situation. In Section 1 we stressed that this problem can be overcome if the robot uses planning for figuring out what part of the action space deserves attention in each situation.

Our robot carries out a unique planning process characterized by the following three features. First, the planning process is grounded since it works upon information that is extracted from the current values of the robot's sensors. In particular, the planner uses the sensory information perceived from the current location to evaluate the situation the robot would face if it reached a given location. That is, once an action is "mentally" experienced, the associated reinforcement signal is estimated on the basis of the perception map corresponding to the current location. This means that the robot does not have any global model of the environment and that the robot performs a one-step planning process. Obviously, the planner will only examine states within the perception range of the robot.

Second, planning only occurs when the robot does not know how to act, i.e. when none of the current reaction rules matches the perceived situation, or when it is acting incorrectly, i.e. when the reinforcement is much worse than expected. This allows the robot to become increasingly reactive as learning proceeds.

Third, the planning process works upon a discretization of the portion of the environment perceived by the robot and not upon the continuous whole environment. This reduces significantly the complexity of the space through which the planner must search. So, the robot can always work in real time. The set of actions considered during planning is

{(Δx, Δy) | Δx, Δy = -1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0}.

It is worth noting that even if the planner deals with 81 discrete actions, the robot does deal with continuous actions. The discrete action proposed by the planner is just a starting point for the reinforcement-based reactive robot to search for the appropriate action in a continuous space. Of course, it could be the case that the planned action is the best one. But even in this case, the robot will explore alternative actions not too far from the planned action. Nevertheless, in most of the cases discrete planned actions are not suitable. The hope is that they are near the suitable ones.

The planner tries each of the 81 discrete actions and selects the one with the best estimated reinforcement signal.
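The planning step can thus be pictured as an exhaustive scoring of the 81 discrete moves (Python sketch; estimate_reinforcement stands for the grounded estimate computed from the current perception map, cf. Eq. (19) below, and its signature is our own assumption):

```python
import itertools

DISCRETE_VALUES = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0]

def plan(situation, estimate_reinforcement):
    """One-step, grounded planning: 'mentally' try the 81 discrete actions
    and return the one with the best (least negative) estimated cost."""
    best_action, best_value = None, float('-inf')
    for dx, dy in itertools.product(DISCRETE_VALUES, repeat=2):
        value = estimate_reinforcement(situation, (dx, dy))
        if value > best_value:
            best_action, best_value = (dx, dy), value
    return best_action, best_value
```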

3.1.2. Second issue: learning faster

Even if the planning process helps the robot to bias the search process toward good parts of the action space, learning could still require a long time. Two reasons call for a heavy learning phase. (These reasons concern not only reinforcement learning but almost all the existing connectionist learning methods.)

The first is that the robot is usually trained through an (approximate) gradient algorithm. So, even when the planning process suggests an action for a given situation which is a good starting point for searching for the appropriate reaction rule, the robot cannot learn this rule immediately. In the architecture that follows, we extend the basic reinforcement learning framework with a one-shot learning mechanism which allows the situation-action rules proposed by the planner to be captured in one step. These rules are then tuned through reinforcement learning.

The second reason is that when the rules are distributed over all the weights of a network, learning a new rule (or tuning an existing rule) could degrade the knowledge already acquired for other situations. This problem would disappear if a different module were devoted to each set of related rules. In this case, improvements on a module will not negatively alter other unrelated modules. In our architecture, every module codifies those rules dealing with a particular cluster of situations.

3.1.3. Third issue: improving generalization

Finally, another way of decreasing the time it takes the robot to learn is to improve the generalization capabilities of the connectionist network.


Fig. 5. Network architecture. (The figure shows action units, category units and exemplar units; the weights to and from the category units are fixed, while the remaining weights are modifiable.)

If a module has more hidden units and weights than necessary, the robot will generalize poorly. On the other hand, if the size of a module is too small, the robot cannot represent the required rules. Consequently, every module's structure must be tailored to the task it faces. By using constructive techniques, new units and their associated weights are added to the appropriate module only when necessary to improve the robot's performance. Furthermore, in the proposed architecture, modules are created as the planner discovers new "prototypical" situation-action rules.

The use of constructive techniques for building modular networks allows us to tackle the stability-plasticity dilemma [23]. The resulting connectionist network is plastic enough to learn new reaction rules, yet stable enough to prevent the modification of rules learned previously.

The constructive techniques we use are based mainly on the work of Alpaydin [1]. The main difference with respect to Alpaydin's constructive techniques is that his approach has been developed for supervised learning tasks.

3.2. Network architecture

The proposed architecture consists of three elements: a reactive system, an evaluator and a planner. Fig. 5 shows only the first two elements.

The reactive system is a modular network that receives situations and computes actions. A module consists of a category, cj, its exemplars, $e_1^j, \ldots, e_h^j$, and the links arriving at and leaving those units. Each exemplar unit is connected to only one category.

Every category unit cj is associated with two values. The first is the expected cumulative reinforcement signal, bj, that the robot will receive if it uses the module j for computing the action. The second is the prototypical action the robot should take. This prototypical action is fixed and it is codified into the weights from the category to the action units. The weights of the links from exemplars to action units are initially set to zero and evolve through reinforcement learning.

Actions are computed in two steps. First, the perceived situation is classified into one of a number of categories. Then, the action is built by using the module associated to the winning category. As pointed out before, the categories and their modules are not predefined, but are created dynamically as the robot explores its environment. Section 3.2.1 describes the categorization phase. It is worth noting that the robot can still work in real time despite this two-step process.

After reacting, the evaluator computes the reinforcement signal, z, as specified in Section 2.3. Then, the difference between z and bj is used for learning. Only the weights of the links associated to the winning module are modified.

Finally, the planner is invoked when the reactive system cannot produce an action or when the robot reacts to a situation with an unsuitable action (as measured by the reinforcement signal).

3.2.1. Categorization

The basic idea behind the categorization phase is to classify the perceived situation into the category cj that matches it acceptably and that has the highest value bj. That is, the robot will react to the perceived situation with the action for which it is confident that the expected short- and long-term consequences are the best. This categorization phase is based on the "action selection" mechanism proposed by Mahadevan and Connell [35]. Our strategy also selects the category that maximizes a combination of the expected cumulative reinforcement and the degree of matching, but it does not require collecting any kind of statistics.

The categorization process works in two steps. In the first step, every exemplar unit computes its distance to the perceived situation. The activation level of every category is the matching degree (or output) of the closest associated exemplar.


The exemplar units have localized receptive fields; i.e., each unit covers a limited area of the input space and is maximally active when it matches the situation perfectly. In particular, the output of the ith exemplar unit is

e x, i f x < kdist, ai = 0, otherwise (9)

where

x = dist[ ~bi, x] , (10)

and ~D i is the component vector of the ith exemplar and x is the perceived situation.

In the second step, the winning category is picked among those categories whose activation levels do not differ by more than a given threshold, $k_{ind}$, from the highest activation level. The winning category cj is the one with the highest bj. In the case that no category is active, i.e. the perceived situation does not match any exemplar unit, the planner is triggered and the current situation becomes a new exemplar. This new exemplar is added either to one of the existing modules or to a new module. Section 3.3 provides more details about how to create exemplars and modules.
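The two-step categorization can be sketched as follows (Python; the dictionary layout of a category is our own assumption, and Eq. (9) is used as reconstructed above):

```python
import math

def categorize(x, categories, k_dist=1.0, k_ind=0.1):
    """Two-step categorization (cf. Eqs. (9)-(10)).

    Each category is a dict holding its exemplar vectors, its expected
    cumulative reinforcement 'b' and its prototypical action.
    Returns the winning category, or None if no exemplar is active.
    """
    # Step 1: a category's activation is the output of its closest exemplar.
    activations = []
    for cat in categories:
        best = 0.0
        for phi in cat['exemplars']:
            chi = math.dist(phi, x)               # distance exemplar-situation
            if chi < k_dist:
                best = max(best, math.exp(-chi))  # Eq. (9)
        activations.append(best)
    if not activations or max(activations) == 0.0:
        return None                               # the planner must be invoked
    # Step 2: among the nearly maximally active categories, pick the highest b.
    threshold = max(activations) - k_ind
    candidates = [c for c, a in zip(categories, activations) if a >= threshold]
    return max(candidates, key=lambda c: c['b'])
```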

3.2.2. Action units

Once the situation has been classified, the output of the winning category and its associated exemplars are sent to the action units. As mentioned before, the weighted input from a category is its prototypical action.

An action unit adds the contribution from the exemplars to the prototypical action. In this manner, the action to be performed by the robot is the prototypical action associated to the category in which the arriving situation has been classified, plus a certain variation depending upon the location of the situation in the input subspace dominated by that category.

A second question concerns the nature of the action units. Like any other reinforcement system, the robot needs to explore alternative actions for the same situation in order to select the best action. The identification of the best action is partially accomplished by the planning process. After planning, the robot knows which is the most promising action subspace where to search for the solution. But, since the action space is continuous, further exploration is needed in that subspace. Moreover, because of the discretization, the planning process can even suggest a subspace far from the optimal action.

An effective way of implementing exploration techniques in continuous domains is to use stochastic action units that select with higher probability the best currently known action.

Formally, if the perceived situation is a member of the jth category, then the output of the pth action unit is

$o_p = \begin{cases} 1, & \text{if } pa_p^j + s_p^j > 1, \\ -1, & \text{if } pa_p^j + s_p^j < -1, \\ pa_p^j + s_p^j, & \text{otherwise,} \end{cases}$   (11)

where $pa_p^j$ is the pth component of the prototypical action associated to the jth category, and $s_p^j$ is the contribution of the exemplars of the jth category. The outputs of these exemplars, $a_1^j, \ldots, a_h^j$, are transformed into $s_p^j$ in three steps using the following framework introduced by Gullapalli [24].

Since the signals we are interested in are continuous, a separate control of the location being sought (mean) and the breadth of the search around that location (variance) is needed. The first step is to determine the value of these parameters. The mean $\mu_p^j$ should be an estimation of the optimal output. A simple way is to let $\mu_p^j(t)$ equal a weighted sum of the inputs $a_i^j(t)$ to the unit p:

$\mu_p^j(t) = \sum_{i=1}^{h} w_{pi}^j\, a_i^j(t),$   (12)

where $w_{pi}^j$ is the weight associated to the link between the ith exemplar of the jth category and the pth action unit.

The standard deviation $\sigma_j$ is proportional to a trace of the absolute value of the past discrepancy between the actual and expected cumulative reinforcements received:

$\sigma_j(t) = k_{\sigma} \times \bar{h}_j(t),$   (13)

$\bar{h}_j(t) = \zeta \times \mathrm{abs}\big(z(t) + b_i(t-1) - b_j(t-1)\big) + (1 - \zeta) \times \bar{h}_j(t-1).$   (14)

The difference z(t) + bi(t-1) - bj(t-1) is the error provided by the TD method and its meaning


is explained in Section 3.3.1. The basic idea here is that the standard deviation $\sigma_j$ should be small if past actions taken by the robot when using the category cj were close to the best action currently known for that category, and it should be high in the opposite case.

As a second step, the unit calculates its activation level $l_p^j(t)$, which is a normally distributed random variable:

$l_p^j(t) = N\big(\mu_p^j(t), \sigma_j(t)\big).$   (15)

Finally, the unit computes its output $s_p^j(t)$:

$s_p^j(t) = \frac{4}{1 + e^{-\beta l_p^j(t)}} - 2,$   (16)

where β is a constant in [0, 1]. As an aside, note that $s_p^j \in [-2, 2]$. So, even if the prototypical action of the jth category is a bad starting point for searching for the suitable action, the action units can discover it through reinforcement learning.
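Putting Eqs. (11)-(16) together, one output component of the reactive system can be sketched as follows (Python; our own illustration, not the paper's implementation):

```python
import math
import random

def action_unit_output(proto, weights, exemplar_outputs, sigma, beta=0.4):
    """One output component of the reactive system (cf. Eqs. (11)-(16)).

    proto:            p-th component of the prototypical action of the
                      winning category
    weights:          weights from that category's exemplars to this unit
    exemplar_outputs: activations a_i of those exemplars
    sigma:            current exploration breadth sigma_j (Eqs. (13)-(14))
    """
    mu = sum(w * a for w, a in zip(weights, exemplar_outputs))   # Eq. (12)
    activation = random.gauss(mu, sigma)                         # Eq. (15)
    s = 4.0 / (1.0 + math.exp(-beta * activation)) - 2.0         # Eq. (16)
    return max(-1.0, min(1.0, proto + s))                        # Eq. (11)
```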

3.3. Learning

There are three basic forms of learning in the proposed architecture. The first regards the topology of the network: initially, there are neither exemplar nor category units, and they are created as they are needed. The second kind of learning concerns weight modification. Finally, the third type of learning is related to the update of the bj values and is based on TD methods [57].

3.3.1. Temporal difference methods

Every value bj corresponds to an estimate of the cumulative reinforcement the robot will obtain if it performs the best currently known actions that take the robot from its current location (whose associated observed situation is classified into the jth category) to the goal.

Consequently, the value bj of the category cj should be equal to the sum of the cost z of reaching the best next category ci plus the value bi:

$b_j = \max_{y \in Actions}(z) + b_i.$   (17)

During learning, however, this equality does not hold for the value bj of every category cj because the optimal action is not always taken and, even if it is, the value bi of the next category ci has not yet converged.

A way of improving the quality of bj is to use the simplest TD method, i.e. TD(0). This procedure works as follows. If the situation perceived at time t, x(t), is classified into the category cj and, after performing the action y(t), the next situation x(t+1) belongs to the category ci and the reinforcement signal (or cost) is z(t+1), then

$b_j(t+1) = b_j(t) + \eta \times [\,z(t+1) + b_i(t) - b_j(t)\,].$   (18)

Note that even if the robot reaches the category ci at time t + 1, the value bi(t) is used to update bj. This makes sense since the category ci could be the same as the category cj.

η controls the intensity of the modification, and it takes greater values when the robot behaves better than expected than in the opposite case. These two values of η are named $\eta_r$ and $\eta_p$, respectively. The rationale for modifying bj less intensively when z(t+1) + bi(t) - bj(t) < 0 is that this error is probably due to the selection of an action different from the best currently known one for the category cj.

The problem is for the robot to figure out bi(t) when the next situation x(t+1) does not match any stored exemplar. The same problem arises during the planning process, since the perception map associated to the current location is used to evaluate the next situation. In other words, at planning time the robot does not have x(t+1) and, consequently, it cannot classify this situation into any of the existent categories.

In these cases, bi is estimated on the basis of the distance from the next location to the goal and the distances from the next location to the occupied cells of the perception map in between the robot and the goal:

$b_i = k_{length} \times \chi + k_{rep} \times \sum_n ir_n,$   (19)

where $k_{length}$ and $k_{rep}$ are constants, χ is the distance from the next location to the goal, and $ir_n$ is the repulsion force exerted by the nth cell on the next location. The nth cell does not exert any repulsion force if either its occupancy probability is null or its orientation is beyond the critical angle $k_{\alpha}$. Otherwise,


proboc c, if X < kclear, ir" = 1. x x probocc, otherwise (20)

where x is the distance from the next location to the nth cell, proboc c is the occupancy probability of that cell, and kclear is the safety distance introduced before.
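A sketch of this grounded estimate (Python; our own illustration, which simply mirrors the reconstruction of Eqs. (19)-(20) above; the original form of Eq. (20) is partly garbled in the source, so the repulsion term should be read as an assumption):

```python
import math

def estimate_b(next_loc, goal, occupied_cells,
               k_length=0.75, k_rep=0.25, k_clear=0.1):
    """Rough estimate of bi (cf. Eqs. (19)-(20)), used at planning time and
    when the next situation matches no stored exemplar.

    occupied_cells holds (position, occupancy probability) pairs for the
    cells of the current perception map lying between the robot and the
    goal and within the critical angle k_alpha (that filtering is assumed
    to be done upstream).
    """
    goal_dist = math.dist(next_loc, goal)
    repulsion = 0.0
    for cell_pos, prob in occupied_cells:
        if prob == 0.0:
            continue                      # empty cells exert no repulsion
        d = math.dist(next_loc, cell_pos)
        # Eq. (20), as reconstructed above.
        repulsion += prob if d < k_clear else prob / d
    return k_length * goal_dist + k_rep * repulsion   # Eq. (19)
```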

It should be noted that even if it is necessary to compute the distance to the goal, our approach does not require a perfect odometry system. Independently of whether it is correct or not, this distance provides just a first rough estimate of the cumulative reinforcement bi, which will approach its steady value as the robot perceives more and more situations belonging to the category cj.

Finally, the value bj of every category cj observed along a path leading to the goal is also updated in reverse chronological order. To do that, the robot stores the pairs (cj(t), z(t+1)). Then, after reaching the goal at time n+1 and updating the last value $b_{j_n}(n)$, the value $b_{j_{n-1}}(n-1)$ is updated, and so on until $b_{j_1}(1)$. As Mahadevan and Connell [35] point out, this technique only accelerates the convergence of the value bj of every category cj (especially if cj classifies situations far from the goal) but does not change the steady value of bj.
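The TD(0) update with its asymmetric learning rate, together with the reverse chronological backup just described, can be sketched as follows (Python; our own illustration, with the transition bookkeeping organized as (j, i, z) triples for convenience):

```python
def td_update(b, j, i, z, eta_r=0.75, eta_p=0.075):
    """TD(0) update of b[j] after a transition from category j to category i
    with observed cost z (Eq. (18)). The robot learns more intensively from
    good surprises (eta_r) than from bad ones (eta_p)."""
    error = z + b[i] - b[j]
    eta = eta_r if error > 0 else eta_p
    b[j] += eta * error
    return error

def backup_path(b, transitions):
    """After reaching the goal, replay the stored transitions in reverse
    chronological order to speed up convergence of the b values."""
    for j, i, z in reversed(transitions):
        td_update(b, j, i, z)
```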

3.3.2. When to learn

Let us now present the three situations in which learning takes place. The first of them arises during the categorization phase, whereas the other two situations happen after reacting:

(1) If the situation is not classified into one of the existent categories, i.e. it does not match any stored exemplar, then a one-step planning process is done upon a discretized action space.

The output of the planning process is a category defined by both the prototypical action with the highest associated cumulative reinforcement signal and this reinforcement signal.

Now, if that category already exists, then a new exemplar and its related connections are added to the corresponding module. If that category does not exist, then a new module is created. The new module consists initially of a category unit, an exemplar unit and their associated connections. The new exemplar is centered on the perceived situation.

The planned category approximately matches one of the categories in the network if


their prototypical actions are the same and their associated reinforcements do not differ by more than a threshold kz. Nevertheless, it does not make sense to compare the reinforcement of the planned category to the expected reinforcement, bj, of the existent categories, since the latter is computed after reacting, i.e. using much more information about the environment, and also converges toward the best cumulative reinforcement for the jth category. For this reason, the reinforcement of the planned category is compared to the initial cumulative reinforcement of each category, i.e. the reinforcement provided by the planning process when that category was created.

(2) If (i) the situation is classified into a category, cj, and (ii) the cumulative reinforcement signal computed after reacting, z + bi, is "not too different" from the expected cumulative reinforcement signal of the winning category, bj, then (i) the active exemplars of that category are modified to make them closer to the situation, (ii) the weights associated to the connections between exemplars and action units are modified using reinforcement learning, and (iii) bj is updated through TD(0).

"Not_too_different" means that z (t + 1 ) + b i ( t ) -

bj ( t ) >~ - k z , where kz is the same threshold used situation 1.

The coordinates of the ith exemplar are updated proportionally to how well it matches the situation:

$\phi_i^j(t+1) = \phi_i^j(t) + \epsilon \times a_i^j(t) \times [\,x(t) - \phi_i^j(t)\,],$   (21)

where $\phi_i^j$ are the coordinates of the exemplar, ε is the learning rate, and $a_i^j$ is the output (or matching degree) of the exemplar.

Simultaneously, the connections from the exemplars

of the category cj to the action units are updated so as to strengthen or weaken the probability of performing the action just taken for any situation in the input subspace dominated by the category cj. The intensity of the weight modifications depends on the relative merit of the action, which is just the error provided by the TD method. Thus, the reinforcement learning rule is

$w_{pk}^j(t+1) = w_{pk}^j(t) + \alpha \times [\,z(t+1) + b_i(t) - b_j(t)\,] \times e_{pk}^j(t),$   (22)

where $w_{pk}^j$ is the weight associated to the connection


between the kth exemplar and the pth action unit, α is the learning rate, and $e_{pk}^j$ is the eligibility factor of $w_{pk}^j$. The eligibility factor of a given weight measures how influential that weight was in choosing the action. In our experiments, $e_{pk}^j$ is computed in such a manner that the learning rule corresponds to a gradient ascent mechanism on the expected reinforcement [64]:

$e_{pk}^j(t) = \frac{\partial \ln N}{\partial w_{pk}^j} = \frac{l_p^j(t) - \mu_p^j(t)}{\sigma_j^2(t)} \times a_k^j(t),$   (23)

where N is the normal distribution function in (15). Additionally, Millán and Torras [40] have found, after a comparative study of 25 versions of the basic reinforcement learning rule, that this is the eligibility factor that best suits the path finding problem.

As a final remark, the weights $w_{pi}^j$ are modified more intensively in case of reward, i.e. when the robot behaves better than expected, than in case of penalty. These two values of α are named $\alpha_r$ and $\alpha_p$, respectively. The aim here is that the robot maintains the best situation-action rules known so far, while exploring other reaction rules.
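The learning applied in situation (2), i.e. the exemplar update of Eq. (21) and the reinforcement learning rule of Eqs. (22)-(23), can be sketched as follows (Python; our own illustration, with the module layout and argument names our own assumptions):

```python
def learn_winning_module(module, x, l, mu, sigma, td_error,
                         epsilon=0.1, alpha_r=0.2, alpha_p=0.02):
    """Update the winning module after a 'not too different' outcome.

    module['exemplars'][i]: coordinates phi_i of the i-th exemplar
    module['a'][i]:         its current matching degree a_i
    module['w'][p][i]:      weight from exemplar i to action unit p
    l[p], mu[p]:            activation and mean of action unit p (Eq. (15))
    sigma:                  sigma_j of the winning category
    td_error:               z(t+1) + b_i(t) - b_j(t)
    """
    alpha = alpha_r if td_error > 0 else alpha_p
    for i, phi in enumerate(module['exemplars']):
        a_i = module['a'][i]
        # Eq. (21): pull active exemplars toward the perceived situation x.
        module['exemplars'][i] = [c + epsilon * a_i * (xk - c)
                                  for c, xk in zip(phi, x)]
        for p in range(len(module['w'])):
            # Eq. (23): eligibility = (l_p - mu_p) / sigma^2 * a_i
            eligibility = (l[p] - mu[p]) / sigma ** 2 * a_i
            # Eq. (22): gradient-ascent step on the expected reinforcement.
            module['w'][p][i] += alpha * td_error * eligibility
```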

(3) If (i) the situation is classified into a category, cj, and (ii) z + bi is "too different" from bj, then the situation needs to be categorized again. To do that, situation 1 applies.

This means that our architecture classifies situations, in a first step, based on the similarity of their input representations. Then, it also incorporates task-specific information for classifying based on the similarity of reinforcements received. In this manner, the input space is split into consistent clusters, since a similar cumulative reinforcement corresponds to the suitable action for every situation in a given input subspace. Moreover, each one of these suitable actions differs only slightly from all the others, which speeds up learning.

4. Simulation results

Table 1 provides all the parameter values used during the computation of the signals and during learning. It is worth noting that no attempt has been made to search for the best set of parameter values. Moreover, when the parameters take values close to those

Table 1. Actual values of the parameters used to compute the signals and to learn.

Name        Value
w           0.35
step_min    0.01
k_clear     0.1
k_length    0.75
k_rep       0.25
k_α         45°
k_dist      1.0
k_ind       0.1
k_z         0.5
β           0.4
ζ           0.75
k_σ         1.0
ε           0.1
α_r         0.2
α_p         0.02
η_r         0.75
η_p         0.075

Fig. 6. The environment and solution paths for a sample of starting situations.

specified in Table 1, the behavior of the robot is very similar to that reported in this section.

The environment depicted in Fig. 6 has been employed in the simulations. There exist three obstacles (shown as sets of contiguous cells), the goal is represented by a square and the robot is illustrated as a circle (the point indicates the direction of movement). The obstacles are shown as they would be perceived by the robot if it used perfect sensors. The environment is a room of 10 × 10 units and the robot's perception range is 2 units.

In previous work with the pure reinforcement-based reactive system already mentioned [40] and with a first version of the system described in this paper [37], we used a stabilizing strategy. The main idea behind this strategy is that the stochastic action units should progressively become deterministic with learning; otherwise, the robot cannot, in feasible time, generate stable solutions after it eventually discovers them. Initially, the robot described in the preceding sections also used this strategy, with good results.

Nevertheless, we have observed that similar situation-action rules can be obtained without applying the stabilizing strategy. Besides no longer requiring any stabilizing strategy, a second advantage is that the robot discovers the suitable situation-action rules in fewer learning steps than if it used the strategy.

We explain this (somewhat) surprising fact in the following way. Firstly, since the prototypical actions are normally close to the optimal ones and the actual costs of actions taking the robot to the goal are backed up after reaching the goal (as explained at the end of Section 3.3.1), the values b_j for all the categories approach their steady values very rapidly. Secondly, the reinforcement learning rule, Eq. (22), tries to maintain the best reaction rule currently known. Thus, the standard deviation σ in the normal distribution function used to compute the activation level of the action units, Eq. (15), is very close to 0.

Figs. 6 and 7 show the behavior, after finishing the learning phase, of the robot when it does not use the stabilizing strategy. The learning phase finishes when the robot is able to produce solution paths for 50 starting locations that are representative of the whole environment. Fig. 6 illustrates the stable solution paths generated by the system for a sample of starting locations. Fig. 7 depicts instances of the situation-action rules learned. For every location considered (little circles) the move to be taken by the robot is shown. The figure shows that the robot generates solution paths from any starting location in the environment.

In order to reach this level of expertise, the robot has planned 248 times and has reacted 1424 times during the learning phase. While learning, the robot collided 74 times. It is important to note that these collisions are not physical but virtual. In fact, the robot stops and retracts when it detects an obstacle in front of it which is closer than a security distance. The resulting network consists of 29 categories and 93 exemplars. Finally, the robot performs the cycles "categorize-plan-learn" and "categorize-react-learn" in approximately 1.6 seconds and 0.04 seconds, respectively. These run times have been obtained using the final network and with an implementation of the reinforcement-based reactive controller written in Common Lisp running on a SUN SparcStation 2.

Fig. 7. Situation-action rules applied for a sample of situations covering the whole environment.

In the preceding simulations we have assumed a perfect odometry system (but not perfect sensors, see Section 2.1). In order to test the capability of our approach to work with a noisy odometry system, we have carried out the following simulation experiment. Fig. 8 illustrates the behavior of the robot when each component of the robot's position given by the odometry system, s_i, is a uniform random number in the interval [s_i − 2.0, s_i + 2.0]. Comparing this figure with Fig. 7, it is evident that the robot exhibits a good noise tolerance, since the "maps" are very similar.
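As a rough illustration, the noise model of this experiment could be simulated as follows; the function name and the use of NumPy are our own choices, not part of the original implementation.

```python
import numpy as np

def noisy_odometry(position, noise_range=2.0, rng=None):
    """Perturb each coordinate of the odometry reading with uniform noise
    in [s_i - noise_range, s_i + noise_range], as in the experiment of
    Fig. 8 (a sizeable error, given that the room is 10 x 10 units)."""
    rng = np.random.default_rng() if rng is None else rng
    position = np.asarray(position, dtype=float)
    return position + rng.uniform(-noise_range, noise_range, size=position.shape)
```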

Fig. 8. Noise tolerance exhibited by the robot when its odometry system is not perfect.

Several conclusions can be drawn from these results. First, the learning time is extremely short. Second, the robot collides virtually only a few times before learning to navigate in the environment. This is an important feature since performing bad actions may cause physical damage to the robot and/or the environment. Third, the size of the network is contained. This is an indication of the fact that the architecture only suffers partially from the problem of the "curse of dimensionality", i.e. the number of exemplars that needs to be created increases as the dimensionality of the input pattern increases. Fourth, the robot deals effectively with noisy input data.

Nevertheless, the generalization capabilities of the robot are not powerful enough to navigate in environments different from the one seen so far. Fig. 9 depicts the behavior of the robot in one such environment, where the goal is in a different location. The figure shows that the robot (i) can correctly handle some of the situations in the new environment, (ii) applies unsuitable rules to some other situations, and (iii) does not know how to act in most of the environment.

The fact that the reaction rules obtained for an environment cannot be directly applied to a different environment is a consequence of the modular architecture and, most importantly, of the way in which the perception map is codified. Each category covers a limited part of the input space. Thus, each module can only generalize locally. On the other hand, the perception map is specified with respect to the orientation of the environment and not relative to the goal. Thus previous experience with a given combination of obstacles with respect to the goal cannot be generalized to the same combination if the directions of the attraction forces in both cases are very different. In Section 7.2 we discuss how to implement a relative codification of the perception map on the real robot.

Fig. 9. Generalization abilities: Initial behavior of the robot in a new environment.

Fig. 10. Adapted behavior of the robot to the new environment.

However, this lack of powerful generalization capabilities is not as bad as it might seem at first glance. Since the robot learns the suitable situation-action rules for the first environment very rapidly, it will also adapt to the new environment in a short time. This point is illustrated in Fig. 10, which shows instances of the new situation-action rules learned by the robot for the new environment.

Fig. 11. Coping with dynamic environments: Dynamical goal.

To finish this section, we investigate the capabilities of the robot to cope with dynamic environments. Fig. 11 shows how the robot tracks the goal when it moves from its original position (that used in the simulations above) toward the left. Note that now the robot avoids the central obstacle on its left, instead of on its right as it did when the goal was fixed (see Fig. 6).

Fig. 12 shows how the robot avoids a set of dynamic obstacles that have been added to the original environment. Again, to successfully cope with this dynamic environment, the robot follows a path different from that used in Fig. 6.

In these two experiments the goal and the obstacles follow a general direction "in a random way". For instance, the goal moves toward the left of the room, but sometimes up and sometimes down. The obstacles move toward the bottom of the room, deciding randomly between left and right. The velocity of the goal and of the obstacles is also random and is at most half the maximum speed of the robot.
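A minimal sketch of such a motion model is given below. The coordinate convention (negative x toward the left wall, negative y toward the bottom) and the function names are assumptions, since the paper does not specify them.

```python
import numpy as np

def random_drift(position, main_axis, rng, max_robot_speed):
    """Move a goal or obstacle one step along its general direction
    (main_axis), wandering randomly on the other axis, with a random
    speed of at most half the robot's maximum speed."""
    direction = np.asarray(main_axis, dtype=float)
    # Random wander on the axis orthogonal to the general direction.
    wander = rng.choice([-1.0, 1.0]) * np.array([-direction[1], direction[0]])
    step = direction + wander
    step /= np.linalg.norm(step)
    speed = rng.uniform(0.0, 0.5 * max_robot_speed)
    return np.asarray(position, dtype=float) + speed * step

# Example: the goal drifts left ([-1, 0]); obstacles drift toward the bottom ([0, -1]).
rng = np.random.default_rng(0)
goal = random_drift([5.0, 5.0], [-1.0, 0.0], rng, max_robot_speed=1.0)
obstacle = random_drift([3.0, 8.0], [0.0, -1.0], rng, max_robot_speed=1.0)
```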

5. Memorizing good actions

We have also tested the robot in a more complex environment than the one used so far (see Fig. 13), in which obstacles are concave. For this environment, the robot behaved acceptably in terms of learning time, but not as well as we expected. In particular, it took about 20 000 learning steps to generate a path like that illustrated in Fig. 13 for a starting location in the concave side of the obstacle.

The complexity of the task faced by the robot arises from two main facts. The first and most important one is that all the prototypical actions suggested by the planner for the situations perceived by the robot when it is inside the concave part of the obstacle are far from the optimal ones (see footnote 4). This means that, for these situations, the robot does not benefit at all from planning, but it does not know it and continues to plan. Thus the robot fails to determine rapidly where to search for the suitable action to handle each situation, which, we believe, is the main shortcoming of reinforcement systems.

The second cause is related to how the robot learns the suitable situation-action rules. On the one hand, the learning rule (Eqs. (22) and (23)) guarantees a weight modification in the direction of the suitable actions only when they are experienced. In addition, since the weights are updated a little at a time, the robot needs to perform one of the suitable actions several times for every situation in order to codify it. On the other hand, the search for alternative actions is centered on the action currently codified. These two mechanisms together work fast when the best reaction rules presently known are close to optimal, which is not the case here. Because these rules are initially those proposed by the planner, they are far from optimal. This means that the probability of experiencing one of the desired actions for a given situation the required number of times is very low.

Footnote 4: This statement is only valid for a planner like ours, which uses local information extracted from the robot's sensors and not a global model of the environment.

Fig. 12. Coping with dynamic environments: Dynamical obstacles.

Fig. 13. Example of path generated for a starting situation in the concave side of the obstacle.

Concerning the second cause of complexity, the reactive system would benefit, in the sense of modifying the weights in the right direction, even from actions worse than expected if it knew the suitable action for every situation. If this were the case, then the reactive system could apply supervised learning every time the action performed by the robot was not the suitable one.

A straightforward way of implementing this idea in our architecture is to memorize, for each category, the best action known so far. That is, every time the robot performs an action by using the module associated with the category c_j and the cumulative reinforcement turns out to be better than b_j, then c_j memorizes that action. Formally,

if z(t+1) + b_i(t) − b_j(t) > 0.1 × |b_j(t)| then ba_j(t+1) := y(t),

where y(t) is the action performed by the robot at time t, z(t+1) is the reinforcement signal computed after performing action y(t), b_i(t) is the expected cumulative reinforcement associated with the next category, and ba_j(t) is the best action currently known for the category c_j.

Note that ba_j is updated only if the new prediction of the best cumulative reinforcement the robot will receive from the category c_j to the goal is 10% better than the current prediction. This criterion allows the system to reduce the large number of incorrect action updates at the beginning of learning. As explained above, the robot initially behaves almost randomly and can hardly reach the goal. Thus, until solution paths are generated that involve most of the categories, the b_j values are not converging to their steady values but oscillating. In addition, this mechanism also makes it possible to keep a stable value of ba_j for all the situations covered by the category c_j, instead of the slightly different values of b_j and of ba_j that would be generated otherwise.

Two additional implementation details are the following. First, when a module is created, the best initial action for that category is its prototypical action. Second, during the process of backing up the b_j values from the goal to the starting category (see Section 3.3.1), the best currently known actions are also updated. To do that, the robot stores all the triplets (c_j(t), y(t), z(t+1)) in the path.

Finally, after performing an action by using the category c_j, the reactive system adapts in the following way. If the action is better than expected, then it continues to modify its weights through reinforcement learning. Otherwise, it applies supervised learning to decrease the difference between ba_j and the action the robot would perform if the action units were deterministic. In particular, and since there is only one layer of modifiable connections from exemplars to action units, we use the LMS rule [63].
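A compact Python sketch of this hybrid adaptation might look as follows. The dictionary layout of the module and the LMS rate eta are hypothetical; the 10% memorization test, the reinforcement step of Eqs. (22)-(23) and the LMS fallback follow the text above.

```python
import numpy as np

def adapt_module(module, a, y, z, b_i, b_j, alpha_r=0.2, eta=0.1):
    """Hybrid adaptation of one module (sketch of Section 5).

    module : hypothetical dict with keys 'w' (weights from exemplars to
             action units), 'sigma' (exploration std) and 'ba' (best action)
    a      : exemplar activations;  y : action performed
    z      : reinforcement z(t+1);  b_i, b_j : category predictions
    """
    td_error = z + b_i - b_j
    mu = module["w"] @ a      # action the module would take deterministically

    # Memorize y as the best action if the prediction improves by more than 10%.
    if td_error > 0.1 * abs(b_j):
        module["ba"] = np.array(y, dtype=float)

    if td_error > 0.0:
        # Better than expected: reinforcement step (Eqs. (22)-(23)).
        eligibility = np.outer((y - mu) / module["sigma"] ** 2, a)
        module["w"] = module["w"] + alpha_r * td_error * eligibility
    else:
        # Worse than expected: LMS step toward the memorized best action.
        module["w"] = module["w"] + eta * np.outer(module["ba"] - mu, a)
    return module
```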

Fig. 13 shows the path generated by this new robot for a starting location in the concave side of the obstacle. In order to learn this solution path, the robot has planned 716 times, has reacted 2547 times, and has collided virtually 282 times; the resulting network has 55 categories and 154 exemplars.

This simulation experiment illustrates perfectly what we claimed in Section 1. Initially, the value of ba_j for every category c_j in the concave side of the obstacle is far from optimal, which makes the robot move forward and backward for a while. However, as the robot explores alternative actions it discovers the suitable reactions and, consequently, learns the correct values of b_j and of ba_j for every category c_j. Thus it is possible to state that a reinforcement approach like ours allows the robot to learn a suitable potential field function for navigating in a given environment. Contrary to potential field approaches, which associate a static potential field map with workspace models, our approach adapts an initial and inefficient potential field map to generate efficient trajectories.

As a last remark, we have tested this new controller in the environment used in the simulations reported in Section 4, and its behavior, in terms of both learning time and quality of the learned reaction rules, is similar to that reported in that section.

6. Related work

The architecture we have described in this paper bears some relation to a number of previous works. This section compares those works to ours.

6.1. Potential field methods

In Section 1 we pointed out that our approach is related to the potential field approaches and we argued that it differs from them in two fundamental respects. First, our approach learns the suitable vector field function. Second, our approach only requires local information, and not global models as most of the potential field methods do. To the best of our knowledge, the only potential field approaches which are able to work in any workspace are those requiring exact global models [4,50].

There also exist potential field approaches which only need local workspace models built out of sensory information [11]. Their knowledge requirement is of the same order as ours. Nevertheless, those approaches suffer from the existence of local minima. In particular, Borenstein and Koren's system [11] is able to escape from local minima by using additional knowledge in the form of a recovery algorithm. Furthermore, Borenstein and Koren's system does not produce efficient trajectories.

6.2. Integrated architectures

Planning on an approximate (and by no means perfect) global model and then handing local control over to the reactive component has proven to be a powerful idea [2,21,36,49]. Two common features of these systems are that they have a fixed set of reaction rules and that they rely on knowledge additional to the reaction rules, which needs to be carefully hand-coded.

On the other hand, some integrated robotic architectures learn the results of planning as reaction rules [30,41]. A requirement of these systems is an a priori and perfect (or good enough) model. Robots use this model to plan a sequence of actions that achieves the goal, then they generate an explanation that is compiled into reactive rules.

Recently, Mitchell and Thrun [42] have extended this explanation-based learning approach to also integrate inductive neural network learning. The inductive module codifies the reaction rules, and the domain theory is previously learned by a set of neural networks, one network for each discrete action the robot can perform. Then, for each actual sequence of actions taking the robot either to the goal or to a failure, the robot explains the observed example in terms of its domain theory and extracts the derivatives of the desired reactions with respect to the training example. These derivatives are used to bias the inductive learning of the reaction rules. Our grounded planner also represents prior knowledge about the task; however, it is much more elemental and is used in a different way.

Lin [32,33] combines reinforcement connectionist learning and teaching to reduce the learning time. In this framework, a human teacher shows the robot several instances of reactive sequences that achieve the task. Then, the robot learns new reaction rules from these examples. The taught reaction rules help reinforcement learning by biasing the search for suitable actions toward promising parts of the action space. In our approach, the grounded planner plays the same guidance role, requiring only a programmer instead of a teacher.

A final kind of integrated architecture comprises those systems that not only learn reactions from the results of planning, but also learn the global model on which they plan. In particular, Barto et al. [6], Lin [33] and Sutton [58] have explored the integration of reaction, planning and reinforcement learning. All these systems deal only with discrete action spaces.

Barto et al. provide a theoretical framework for studying this kind of learning system by elaborating the connections with asynchronous dynamic programming. Their theory is only valid for Q-learning [61], which is a different reinforcement learning method for solving sequential decision tasks. Our approach is related to those of Lin and Sutton, but it differs from them in several ways, as explained below.

The first difference regards when to plan. In Sutton's framework, the agent plans and reacts in parallel and continuously. As Lin notes, an inefficiency of Sutton's approach is that planning is not always necessary. In Lin's framework, as well as in ours, the agent plans only when it is not very sure about which action is best.


The second difference concerns how to update the reactive component. Sutton's agent uses all the results of planning to modify the reactive component, whereas Lin's agent only incorporates into its reactive component the results of the best planned action. Both of them update the reactive component incrementally, by modifying its weights a little bit each time. We adopt Lin's scheme. However, our robot codifies the situation-action rules proposed by the planner in one step. These rules are then tuned through reinforcement learning.

The third and final difference lies in the way the reactive component is organized. Lin's and Sutton's reactive components have a predefined structure and are "plain" networks in which the situation-action rules are distributed over all the weights. In our architecture, on the contrary, a different module is devoted to each set of related rules. Such modules are created dynamically as the planner discovers new "prototypical" situation-action rules.

6.3. Modular networks

Jacobs et al. [26], Nowlan [46], Singh [55] and Wixson [65] have also described systems that learn to decompose a given problem into distinct subtasks and assign each of them to the appropriate module. The first two approaches have been developed in the framework of supervised learning, while the other two approaches are intended for reinforcement learning systems.

Our approach is related to the first three works mentioned above, in the sense that it is the network itself which decomposes the task into subtasks. On the other hand, Wixson's approach requires a process external to the network.

Comparing our approach with those of Jacobs et al., Nowlan, and Singh, in all three cases the number of modules is predefined, whereas in our case the number of modules grows dynamically and is decided by the system itself. A second difference between these approaches and ours regards how situations are allocated to modules. They need to develop a scheduling module for this purpose, while our architecture does it automatically.

Finally, Colombetti and Dorigo [17], Mahadevan [34], and Mahadevan and Connell [35] also extend the basic reinforcement learning framework by combining several modules, each one specialized in solving a particular subtask. They adopt the subsumption architecture [13,18] to develop their agents. In addition, they have a predefined number of modules. Mahadevan [34] and Mahadevan and Connell [35] also code by hand the applicability condition of each module.

6.4. Clustering

Berns et al. [10], Chapman and Kaelbling [16], Kröse and van Dam [29] and Mahadevan and Connell [35] enhance the generalization capabilities of their agents by splitting the sensory input space into clusters. Then, actions are associated with clusters rather than with individual states. In order to create these clusters, Berns et al. as well as Kröse and van Dam use Kohonen maps [28], while Chapman and Kaelbling, as well as Mahadevan and Connell, use statistical techniques.

Our architecture also extends the basic reinforcement learning framework in a similar way, since perceived situations are classified into categories. Nevertheless, the action taken by our robot is not only determined by the category in which the arriving situation is classified, but also by the activity of the exemplars of that category.

The architectures developed by Berns et al. and by Kröse and van Dam classify situations solely on the basis of the similarity of their input representations, while the architectures developed by Chapman and Kaelbling and by Mahadevan and Connell also incorporate task-specific information for classifying on the basis of the similarity of reinforcements received. In this respect, our approach is like the two latter ones. However, those approaches are intended for binary input vectors, whereas ours works on real-valued input patterns.

Finally, Berns et al.'s architecture has a predefined number of clusters, while ours and all the other architectures build up the clusters dynamically.

7. Discussion

The main research goal that has motivated the present work is to determine whether autonomous mobile robots can be based on a reinforcement connectionist learning approach. The simulations we have carried out and reported in this paper strongly suggest an affirmative answer. However, the final confirmation of the practical interest of our approach can only be provided by experimenting with real robots in real environments. We are already experimenting with the real robot described in Section 2, and Millán [39] presents first results. In this section we will discuss how to address some potential limitations of our approach.

7.1. Odometry system and noisy sensors

The arguments provided in the paper and the simulations presented seem to show that our approach deals effectively with noisy sensors and also with an imperfect odometry system. The simulation experiments have been carried out by distorting the simulated sensors and odometry system with uniform random noise, which is not the kind of noise that appears when a real robot interacts with a real environment. In any case, the main limitation of the current implementation of the reinforcement-based reactive controller is that it requires a reliable odometry system.

However, we think that it is still possible to implement our approach on a real robot. The implementation we have in mind is based on two considerations. The first one is that once the goal can be detected by the robot, the odometry system can work reliably. So, the robot will need to be equipped with sensors especially designed to detect the goal. The second consideration is related to the use of a semi-coarse codification of the attraction force. As discussed in Section 2.1, if the encoded position of the robot is not too different from the actual one, then the robot may still produce acceptable actions.

Now, suppose the robot's task is to reach a source of sound whose relative position with respect to the starting position of the robot is approximately known. In this case, the path finding problem will be solved in two steps. In the first one, the robot will rely on its ability to navigate with approximate attraction forces. The first step will finish as soon as the robot detects the goal. In the second step, the robot will use the approach described in the paper.

Nevertheless, detecting the goal does not always mean knowing the distance from the robot to it. Thus a further step toward the implementation of the approach on a real robot would be not to use the information related to the distance to the goal. Instead, the robot could simply use information about whether it can detect the goal or not. This kind of information is required to know if the goal is located in front of or behind an obstacle. To test the feasibility of this idea, we have carried out an initial experiment. Fig. 14 shows some solution paths generated by the simulated robot when the intensity of the attraction force is not considered. In this experiment the robot has dealt with 10 initial locations. The robot has planned 94 times and has reacted 492 times during the learning phase.

Fig. 14. Solution paths when the intensity of the attraction force is not considered.

Even though this simulation experiment provides some evidence supporting the idea that distance information is not necessary, we still need to run a systematic set of experiments to confirm it.

7.2. Perception map

Currently, we are using a global perception map which is generated by not sending any command to the motor controlling the robot's turret. The only problem is that mechanical errors can make the robot's orientation drift.

A way of dealing with this source of errors is to incorporate a rate-integrating gyroscope into the robot. Rate-integrating gyroscopes, unlike compasses, do not suffer from magnetic disturbances due to metallic elements in the environment. Rate-integrating gyroscopes measure absolute angular displacements and are thus used as a fixed reference in the navigation systems of ships and aircraft. The robot itself would then realign its current orientation every time the front sensors' direction differs from the gyroscope's direction by more than a given threshold. This same technique can also be used for orienting the robot when it is turned on.

On the other hand, it was mentioned in Section 4 that a current limitation of our approach is that the perception map is not codified relative to the goal, which prevents useful generalizations. A way of achieving this relative codification in the real robot is to rotate the robot's turret in such a way that the infrared and sonar sensors, which are placed on the turret, are oriented toward the goal. We have not used this simple solution so far because the simulator we built did not provide this possibility.

If this solution were adopted, then the controller would not need the direction of the attraction force as input. However, deciding the robot's orientation would still depend on the odometry system. Again, a rate-integrating gyroscope could be used to reduce this dependency.

7.3. Complex environments

Our approach organizes the reactive component into a set of modules. Each of these modules computes the action for a particular input subspace. Each module has a category unit and a set of exemplar units covering the sensory situations in the input subspace. These modules are created incrementally as the robot experiences the environment.

Now, the larger the environment or the number of obstacles, the larger the input space will probably be. This means that as the robot perceives more different sensory situations, it will create more exemplar units and modules, and it will consequently require more memory and reaction time. Thus the limiting factor for the current implementation of our approach is the time required to find the exemplar closest to the perceived sensory situation. This time increases linearly with the number of stored exemplars.

There are three obvious ways of speeding up the search time. First, to carry out the search in parallel. Second, to use kd-trees to implement the reactive component. Omohundro [48] reviews kd-trees and other data structures for efficient implementation of learning algorithms from examples. Third, to reduce the number of stored exemplars.
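For illustration, the difference between the brute-force search and a kd-tree search could be sketched as follows, using NumPy and SciPy as stand-ins for whatever implementation would actually be used; in practice the tree would be built once and only rebuilt when new exemplars are created. The array dimensions below are examples, not values from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def closest_exemplar_linear(x, exemplars):
    """Brute-force search: cost grows linearly with the number of exemplars."""
    return int(np.argmin(np.linalg.norm(exemplars - x, axis=1)))

def closest_exemplar_kdtree(x, tree):
    """kd-tree search: the query cost grows roughly logarithmically instead."""
    _, idx = tree.query(x)
    return int(idx)

# Usage sketch: exemplars is an (n, d) array of stored exemplar coordinates.
exemplars = np.random.rand(93, 16)   # e.g. 93 exemplars of (assumed) dimension 16
tree = cKDTree(exemplars)
x = np.random.rand(16)
assert closest_exemplar_linear(x, exemplars) == closest_exemplar_kdtree(x, tree)
```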

Memory size could be reduced by eliminating "old" and/or superfluous units. An exemplar or category unit would be considered old when perceived situations have not triggered it for a long time. On the other hand, an exemplar becomes superfluous when the input subspace it dominates is also covered by the rest of the exemplars belonging to the same category. A category can also become superfluous as learning proceeds. For example, two different category units can initially be required for two perceived situations if the planner suggests the same prototypical action for both but the initial b_j values associated with this action are very different. However, if TD(0) makes these values converge to similar points, then the two categories conceptually represent only one. In this case, the two categories can be merged into a single one.
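The pruning and merging ideas above might be sketched as follows; the module fields ('last_used', 'b', 'prototype_action', 'exemplars') and the thresholds max_idle and merge_tol are all hypothetical, introduced only to make the criteria concrete.

```python
def prune_and_merge(categories, t_now, max_idle=1000, merge_tol=0.05):
    """Drop long-unused categories and merge categories whose prototypical
    action is the same and whose b values have converged to similar points."""
    # Eliminate "old" units: not triggered by any perceived situation for a long time.
    kept = [c for c in categories if t_now - c["last_used"] <= max_idle]

    merged = []
    for c in kept:
        twin = next((m for m in merged
                     if m["prototype_action"] == c["prototype_action"]
                     and abs(m["b"] - c["b"]) <= merge_tol * max(abs(m["b"]), 1.0)),
                    None)
        if twin is None:
            merged.append(c)
        else:
            # Conceptually a single category: pool the exemplars of both.
            twin["exemplars"].extend(c["exemplars"])
    return merged
```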

8. Conclusions

In this paper we have put forward the view that the two main limitations of reactive autonomous robots, namely computational and programming complexity, can be avoided by building those robots through reinforcement connectionist learning.

Nevertheless, reinforcement-based reactive autonomous robots have their own shortcoming: if robots cannot rapidly determine where to search for the suitable action to handle each situation, then learning the correct reactions will require a long time.

In order to speed up learning, we extend the basic reinforcement connectionist learning framework in several ways. Some of the extensions aim to enhance the generalization capabilities of the robot. In particular, we organize the reactive component into a set of modules in such a way that, roughly, each one of them codifies the prototypical action for a particular input subspace. This subspace corresponds to those situations that have similar representations and whose associated best reinforcements are also similar.

In addition, we claim that the robot should use grounded symbolic techniques for figuring out what part of the action space deserves attention for each cluster of situations. In particular, we propose to perform one-step planning on a discretized action space every time the robot does not generalize correctly its previous experience to the new situation. The planner uses the sensory information perceived from the current location to evaluate the next situation. The reactive component codifies the situation-action rules proposed by the planner in one step. These rules are then tuned through reinforcement learning. Integrating reaction and planning in this way allows the robot to focus on promising parts of the action space immediately, to work in real time, and to become increasingly reactive as learning proceeds.

Based on these ideas, we have developed an architecture which deals effectively with complex (high-dimensional and/or continuous) situation and action spaces.

Finally, the simulations reported in this paper show that a robot built according to this architecture learns extremely quickly, deals effectively with noisy sensors, collides virtually only a very small number of times during learning, copes with dynamic environments, and adapts easily to new environments.

Acknowledgements

I gratefully acknowledge the contributions of Carme Torras to the ideas presented here. I thank Marc Becquet, Tim Smithers and Aristide Varfis for helpful discussions. I would also like to express my gratitude to Marco Dorigo, Ben Kröse, Tim Smithers, Carme Torras and Aristide Varfis for their detailed comments on previous versions of this paper.

References

[1] E. Alpaydin, GAL: Networks that grow when they learn and shrink when they forget, Technical Report 91-032, International Computer Science Institute, Berkeley, CA (1991).

[2] R.C. Arkin, Integrating behavioral, perceptual, and world knowledge in reactive navigation, Robotics and Autonomous Systems 6 (1990) 105-122.

[3] C.G. Atkeson, Using locally weighted regression for robot learning, Proc. of the IEEE International Conference on Robotics and Automation (1991) 958-963.

[4] J. Barraquand and J.C. Latombe, Robot motion planning: A distributed representation approach, The International Journal of Robotics Research 10 (1991) 628-649.

[5] A.G. Barto and P. Anandan, Pattern-recognizing stochastic learning automata, IEEE Transactions on Systems, Man, and Cybernetics 15 (1985) 360-374.

[6] A.G. Barto, S.J. Bradtke, and S.P. Singh, Real-time learning and control using asynchronous dynamic programming, Technical Report COINS-91-57, Dept. of Computer and Information Science, University of Massachusetts, Amherst (1991).

[7] A.G. Barto, R.S. Sutton, and C.W. Anderson, Neuronlike elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics 13 (1983) 835-846.

[8] A.G. Barto, R.S. Sutton, and P.S. Brouwer, Associative search network: A reinforcement learning associative memory, Biological Cybernetics 40 ( 1981 ) 201-211.

[9] A.G. Barto, R.S. Sutton, and C.J.C.H. Watkins, Learning and sequential decision making, Technical Report COINS-89-95, Dept. of Computer and Information Science, University of Massachusetts, Amherst (1989).

[10] K. Berns, R. Dillmann, and U. Zachmann, Reinforcement-learning for the control of an autonomous mobile robot, Proc. of the 1992 IEEE/RSJ International Conference on Intelligent Robots and Systems (1992) 1808-1815.

[11] J. Borenstein and Y. Koren, Real-time obstacle avoidance for fast mobile robots, IEEE Transactions on Systems, Man, and Cybernetics 19 (1989) 1179-1187.

[12] R.A. Brooks, Solving the find-path problem by good representations of free space, IEEE Transactions on Systems, Man, and Cybernetics 13 (1983) 190-197.

[13] R.A. Brooks, A robust layered control system for a mobile robot, IEEE Journal of Robotics and Automation 2 (1986) 14-23.

[14] R.A. Brooks and T. Lozano-Pérez, A subdivision algorithm in configuration space for findpath with rotation, IEEE Transactions on Systems, Man, and Cybernetics 15 (1985) 224-233.

[15] J.F. Canny, The Complexity of Robot Motion Planning (MIT Press, Cambridge, MA, 1988).

[16] D. Chapman and L.P. Kaelbling, Learning from delayed reinforcement in a complex domain, Technical Report 90-11, Teleos Research, Palo Alto, CA (1990).

[ 17] M. Colombetti and M. Dorigo, Robot shaping: Developing situated agents through learning, Technical Report 92- 040, International Computer Science Institute, Berkeley, CA (1992).

[18] J.H. Connell, Minimalist Mobile Robotics: A Colony-Style Architecture for an Artificial Creature (Academic Press, San Diego, CA, 1990).

[ 19] B.R. Donald, A search algorithm for motion planning with six degrees of freedom, Artificial Intelligence 31 (1987) 295-353.

[20] A. Elfes, Using occupancy grids for mobile robot perception and navigation, IEEE Computer 22 (1989) 46-57.

[21 ] E. Gat, Integrating reaction and planning in a heterogeneous asynchronous architecture for mobile robot navigation, SIGART Bulletin 2 (1991) Special Section on 1991 AAAI Spring Symposium on Integrated Intelligent Architectures, 70-74.


[22] M.L. Ginsberg, Universal planning: An (almost) universally bad idea, AI Magazine 10 (1989) 40-44.

[23] S. Grossberg, Competitive learning: From interactive activation to adaptive resonance, Cognitive Science 11 (1987) 23-63.

[24] V. Gullapalli, A stochastic reinforcement learning algorithm for learning real-valued functions, Neural Networks 3 (1990) 671-692.

[25] J. Ilari and C. Torras, 2D path planning: A configuration space heuristic approach, The International Journal of Robotics Research 9 (1990) 75-91.

[26] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton, Adaptive mixtures of local experts, Neural Computation 3 (1991) 79-87.

[27] O. Khatib, Real-time obstacle avoidance for manipulators and mobile robots, The International Journal of Robotics Research 5 (1986) 90-98.

[28] T. Kohonen, Self-Organization and Associative Memory (Second Edition, Springer-Verlag, Berlin, 1988).

[29] B.J.A. Kröse and J.W.M. van Dam, Adaptive state space quantisation for reinforcement learning of collision-free navigation, Proc. of the 1992 IEEE/RSJ International Conference on Intelligent Robots and Systems (1992) 1327-1332.

[30] J.E. Laird and P.S. Rosenbloom, Integrating planning, execution, and learning in Soar for external environments, Proc. of the 8th National Conference on Artificial Intelligence (1990) 1022-1029.

[31] J.C. Latombe, Robot Motion Planning (Kluwer Academic Publishers, Boston, MA, 1991).

[32] L.-J. Lin, Programming robots using reinforcement learning and teaching, Proc. of the 9th National Conference on Artificial Intelligence ( 1991 ) 781-786.

[33] L.-J. Lin, Self-improving reactive agents based on reinforcement learning, planning and teaching, Machine Learning 8 (1992) 293-321.

[ 34 ] S. Mahadevan, Enhancing transfer in reinforcement learning by building stochastic models of robot actions, in: D. Sleeman and P. Edwards, eds., Machine Learning: Proceedings of the Ninth International Workshop, 290-299 (Morgan Kaufmann, San Mateo, CA, 1992).

[35] S. Mahadevan and J. Connell, Automatic programming of behavior-based robots using reinforcement learning, Artificial Intelligence 55 (1992) 311-365.

[36] M.J. Mataric, Integration of representation into goal-driven behavior-based robots, IEEE Transactions on Robotics and Automation 8 (1992) 304-312.

[37] J. del R. Millán, Building reactive path-finders through reinforcement connectionist learning: Three issues and an architecture, Proc. of the 10th European Conference on Artificial Intelligence (1992) 661-665.

[38] J. del R. Millán, A reinforcement connectionist learning approach to robot path finding, Ph.D. Thesis, Dept. de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain (1992).

[39] J. del R. Millán, Learning efficient reactive behavioral sequences from basic reflexes in a goal-directed autonomous robot, in: D. Cliff, P. Husbands, J.-A. Meyer, and S.W. Wilson, eds., From Animals to Animats III: Third International Conference on Simulation of Adaptive Behavior, 266-274 (MIT Press, Cambridge, MA, 1994).

[40] J. del R. Millán and C. Torras, A reinforcement connectionist approach to robot path finding in non-maze-like environments, Machine Learning 8 (1992) 363-395.

[41] T.M. Mitchell, Becoming increasingly reactive, Proc. of the 8th National Conference on Artificial Intelligence (1990) 1051-1058.

[42] T.M. Mitchell and S.B. Thrun, Explanation-based neural network learning for robot control, in: C.L. Giles, S.J. Hanson, and J.D. Cowan, eds., Advances in Neural Information Processing Systems 5, 287-294 (Morgan Kaufmann, San Mateo, CA, 1993).

[43] A.W. Moore, Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces, in: L. Birnbaum and G. Collins, eds., Machine Learning: Proceedings of the Eighth International Workshop, 333-337 (Morgan Kaufmann, San Mateo, CA, 1991).

[44] H.P. Moravec, Sensor fusion in certainty grids for mobile robots, AI Magazine 9 (1988) 61-74.

[45] H.P. Moravec and A. Elfes, High resolution maps from wide angle sonar, Proc. of the IEEE International Conference on Robotics and Automation (1985) 116-121.

[46] S.J. Nowlan, Competing experts: An experimental investigation of associative mixture models, Technical Report CRG-TR-90-5, Dept. of Computer Science, University of Toronto, Canada (1990).

[47] C. Ó'Dúnlaing and C.K. Yap, A 'retraction' method for planning the motion of a disc, Journal of Algorithms 6 (1985) 104-111.

[48] S.M. Omohundro, Efficient algorithms with neural network behavior, Complex Systems 1 (1987) 273-347.

[49] D.W. Payton, Internalized plans: A representation for action resources, Robotics and Autonomous Systems 6 (1990) 89-103.

[50] E.S. Plumer, Neural network structure for navigation using potential fields, Proc. International Joint Conference on Neural Networks, Vol. 1 (1992) 327-332.

[51] E. Rimon and D.E. Koditschek, Exact robot navigation using artificial potential functions, IEEE Transactions on Robotics and Automation 8 (1992) 501-518.

[52] S.J. Rosenschein and L.P. Kaelbling, The synthesis of machines with provable epistemic properties, Proc. of the 1986 Conference on Theoretical Aspects of Reasoning about Knowledge (1986) 83-98.

[53] M.J. Schoppers, Universal plans for reactive robots in unpredictable environments, Proc. of the 10th International Joint Conference on Artificial Intelligence (1987) 1039-1046.

[54] J.T. Schwartz and M. Sharir, On the "piano movers" problem. II. General techniques for computing topological properties of real algebraic manifolds, Advances in Applied Mathematics 4 (1983) 298-351.

[55] S.P. Singh, Transfer of learning by composing solutions of elemental sequential tasks, Machine Learning 8 (1992) 323-339.


[56] R.S. Sutton, Temporal credit assignment in reinforcement learning, Ph.D. Thesis, Dept. of Computer and Information Science, University of Massachusetts, Amherst (1984).

[57] R.S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning 3 (1988) 9-44.

[58] R.S. Sutton, Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, Proc. of the 7th International Conference on Machine Learning (1990) 216-224.

[59] S.B. Thrun, K. Möller, and A. Linden, Planning with an adaptive world model, in: R.P. Lippmann, J.E. Moody, and D.S. Touretzky, eds., Advances in Neural Information Processing Systems 3, 450-456 (Morgan Kaufmann, San Mateo, CA, 1991).

[60] C. Torras, Motion planning and control: Symbolic and neural levels of computation, Proc. of the 3rd COGNITIVA Conference (1990) 207-218.

[61] C.J.C.H. Watkins, Learning from delayed rewards, Ph.D. Thesis, Psychology Dept., Cambridge University, Cambridge, England (1989).

[62] S.H. Whitesides, Computational geometry and motion planning, in: G. Toussaint, ed., Computational Geometry (North-Holland, Amsterdam, 1985).

[63] B. Widrow and M.E. Hoff, Adaptive switching circuits, Proc. of the 1960 IRE WESCON Convention Record, Part 4 (1960) 96-104.

[64] R.J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8 (1992) 229-256.

[65] L.E. Wixson, Scaling reinforcement learning techniques via modularity, in: L. Birnbaum and G. Collins, eds., Machine Learning: Proceedings of the Eighth International Workshop, 368-372 (Morgan Kaufmann, San Mateo, CA, 1991).

[66] C.-K. Yap, Algorithmic motion planning, in: J.T. Schwartz and C.-K. Yap, eds., Advances in Robotics, Vol. 1: Algorithmic and Geometric Aspects of Robotics (Lawrence Erlbaum, Hillsdale, NJ, 1987).

José del R. Millán graduated in Computer Science from the Universitat Autònoma de Barcelona in 1985, and received his Ph.D. in Computer Science (Artificial Intelligence) from the Universitat Politècnica de Catalunya in 1992. From 1986 to 1989 he was an assistant professor at the Universitat Politècnica de Catalunya. Since 1989 he has been a researcher at the Joint Research Centre of the European Commission. His main research interests are related to autonomous learning robots.