

Robotics and Autonomous Systems 17 (1996) 287-305


Autonomous generation of reflexion-based robot controller using inductive learning

Shinichi Nakasuka*, Takehisa Yairi, Hiroyuki Wajima
Research Center for Advanced Science and Technology, University of Tokyo, 4-6-1, Komaba, Meguro-ku, Tokyo 153, Japan

Received 15 May 1995; revised 11 July 1995 Communicated by T.C. Henderson

Abstract

The paper proposes a novel architecture for autonomously generating and managing a robot control system, aimed at application to planetary rovers which will move in a partially unknown, unstructured environment. The proposed architecture is similar to the well-known subsumption architecture in that the movements are governed by a network of various reflexion patterns. The major departures are that, firstly, it utilizes inductive learning to automatically generate and modify a control architecture, which would be quite a difficult and time-consuming task for a human; secondly, it employs the concept of a "goal sensor" to deal with the system goal more explicitly; and thirdly, it compiles the planning results into a reflexion network and decision trees to maintain the strong features of a reflexion-based planner such as real-time performance, robustness and extensibility. The architecture has been applied to the movement control of a certain rover in computer simulations and simple experiments, in which its effectiveness and characteristics have been clarified.

Keywords: Autonomy; Rover; Robot control architecture; Subsumption architecture; Inductive learning

1. Introduction

It is quite clear that a high level of autonomy will be required for the unmanned planetary rovers which are expected to be launched to various planets in the next century with the objective of exploring and gathering information before humans reach there. The autonomy required of them will include navigation, path planning, task planning of various observations and experiments, management of plan executions, and fault detection, isolation and reconfiguration (FDIR).

Among these functions, the one especially required for such rovers will be the capability to generate and manage various movement plans under partially unknown, ill-structured environments. For example, path planning will be performed based on maps of the planet obtained beforehand by observation from the planetary orbit, but these maps will not be very accurate, and in many cases there will be many obstacles (such as small rocks or gaps) not represented on them. The path planning system, therefore, must be flexible enough to compensate for the inaccuracy of the maps, quickly respond to unpredicted events such as collisions with unknown obstacles, gather geographical information, and re-plan the path to the goal. This kind of flexibility will be needed in many other planning activities of planetary rovers as well.

* Corresponding author. Fax: +81-3-3481-4452; e-mail: nakasuka@space.t.u-tokyo.ac.jp.

0921-8890/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved SSDI 0921-8890(95)00073-9


This paper proposes a novel architecture for autonomously generating and managing such a control architecture for planetary rovers. The architecture is basically similar to the well-known subsumption architecture in the sense that the finally obtained decision making scheme is represented in the form of a hierarchical suppression network of primitive reflex actions such as "moving towards a prescribed point", "moving in the reverse direction when a certain touch sensor senses an obstacle", and so on. This representation of the controller is, as has been discussed in the literature, superior in robustness in the real world, real-time performance, and ease of system integration.

However, the architecture has some difficult problems to be solved before actual use, such as: (1) the reflexion network must be sophisticatedly designed by human designers so that the emergent functionality achieves the given goal, which is a far more difficult task than building a system which deals with the goal explicitly; and (2) once coded, the network is fixed during the actual operations, and therefore changes of the environment or of the system itself cannot be dealt with. We modified and enhanced this architecture in several points in order to implement the capability to pursue the overall goal while efficiently and flexibly managing various unpredicted anomalies in a partially unknown, ill-structured environment such as a planetary surface.

The characteristics of our approach can be summarized in the following three points: (1) introduction of the notion of "goal sensors" to deal with the goal more explicitly; (2) employment of inductive learning to autonomously acquire knowledge as to the cause-effect relationships of each movement; (3) compilation of the obtained knowledge into a reflexion network in order to obtain real-time performance and robustness. With these methods, actions to pursue the goal and actions to react against unpredicted or hazardous disturbances can be blended in an efficient way, which provides a flexible management system suitable for our application. Computer simulations have been performed assuming a certain rover with an example task of fetching a ball and carrying it to a prescribed goal position, in which the effectiveness of the proposed architecture has been demonstrated. We have also started experiments using an actual rover, whose first results are given as well.

2. Subsumption architecture and its enhancement

The subsumption architecture was proposed by Brooks [2] as a novel scheme for managing robot movements. His assertion is that symbolic manipulation, which has conventionally been used as the core of the planning system of robots (Fig. 1), cannot be used in many cases for robots moving in the real world, because the symbolic "world model" is not always consistent with the real world (i.e., the "Symbolic Grounding Problem"), and because quick response is hard to realize given the time-consuming symbolic manipulations required. The subsumption architecture, on the other hand, does not employ such a world model; the actuation plan is constructed in the form of a suppression/activation network of several layers representing primitive "reflex actions" (Fig. 2). Each layer has connections to the necessary sensors and actuators, and operates asynchronously. The behavior of the agent at a certain time is controlled by one or several layers, sometimes affected by activation and inhibition between layers. The most significant departure from the concept of conventional AI-based systems is that the goal of the plan is not represented explicitly, but is achieved during the course of the interactions between the reflexion network and the environment. This feature is called "emergent functionality". As a result, the system has robustness in the real world environment, real-time performance, and ease of system integration and extension; for this reason, this architecture seems quite well suited as a plan representation scheme for rovers which move in an unstructured world.

Fig. 1. Conventional robot control architecture based on AI.

Fig. 2. Simplified view of subsumption architecture.

However, it is quite difficult to construct a reflexion network whose "emergent functionality" achieves the given goals, because the goals cannot be dealt with explicitly; moreover, the constructed network cannot be changed even if the environment or the system itself has changed. Brooks later suggested the use of genetic algorithms for this problem [3].

In the field of so-called "behavior-based planning architecture", several interesting approaches have been proposed, aiming to compensate for these shortcomings of the subsumption architecture [1,3-6,10-13,19]. The most common approach is a hybrid architecture of a reactive system and a conventional symbolic planning system [1,4-6,10-13,19]. The key issue in this approach is how to realize goal-pursuing actions, which require some planning with a world model, without much degrading the strong points of reactive systems such as concurrency, robustness and real-time performance. Another typical enhancement is to employ machine learning to support the construction of the planning network or movement rules [3,8,12-14,17,19]; various learning schemes have been proposed, such as reinforcement learning [8,12], explanation-based learning [13,19], and genetic algorithms [3,17]. A more detailed overview of these approaches and a discussion of their relationships with our approach are given in a later section.

Our architecture is principally categorized as a hybrid architecture of a reactive system and a symbolic planning system, and tries to solve the above-mentioned issues by the following techniques. Firstly, in order to deal with the goal of the system more explicitly, it introduces the notion of "goal sensors", which describe the current rover situation in terms of how it differs from the goal situation, in quite the same form as other sensor information such as touch, distance or angle. With goal sensors, goal-pursuing actions and reactive actions (such as obstacle avoidance actions) can be uniformly represented in the same form of reflective actions triggered by certain sensor states. This mechanism is similar to the concept of "Logical Behavior" proposed by Henderson and Grupen [6].

Secondly, generalized cause-effect relationships of each action are acquired by inductive learning, using training data obtained through random movements. Thirdly, a subsumption-architecture-type network (called the "reflexion network" hereafter) can be generated autonomously by compiling the results of off-line planning. These two steps eliminate the difficult and time-consuming task of constructing a control system, and also make it possible to autonomously modify the reflexion network in case of changes in the environment or the system itself. The finally obtained reflexion network tells which subgoal to achieve in the current situation, both in order to pursue the goal and in order to deal with anomalous events which may happen during the goal-pursuing activities. How to achieve the specified subgoal is then dictated by decision trees, also obtained by inductive learning. With this knowledge, the system can coordinate the two sometimes conflicting objectives of pursuing the goal and handling anomalous events efficiently, without degrading real-time performance or robustness. Section 3 explains these techniques in more detail.

3. Description of the proposed architecture

3.1. Assumed problem

The following explanation assumes an example task in which a rover must fetch a ball placed at a certain position and then carry it to a prescribed goal position (Fig. 3). The rover has a map which represents the positions of the ball, the goal and some obstacles, but several obstacles are not represented in the map ("unknown obstacles"). The rover itself is assumed to have four touch sensors (each sensitive to contacts from two directions), and to be able to turn right/left and move forward/backward (represented as TR, TL, MF and MB, respectively), as illustrated in Fig. 4. These basic actions are termed "low level actions" hereafter. High level, sophisticated actions, such as "write the position of the found obstacle to the map (WO)", "plan path to the current subgoal (PP)" and "pick up ball (PB)", are also considered basic actions, so the rover has a total of seven actions. It is also assumed that the rover has a navigation system which tells it its position and orientation with respect to the map.
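To make the setup concrete, the rover's action repertoire can be written down in a few lines of code. This is an illustrative sketch only (the paper gives no implementation; the identifier names are ours, chosen to match the paper's abbreviations):

from enum import Enum

class Action(Enum):
    """The rover's seven basic actions (Section 3.1)."""
    TL = "turn left"                       # low level
    TR = "turn right"                      # low level
    MF = "move forward"                    # low level
    MB = "move backward"                   # low level
    WO = "write obstacle position to map"  # high level
    PP = "plan path to current subgoal"    # high level
    PB = "pick up ball"                    # high level

LOW_LEVEL = {Action.TL, Action.TR, Action.MF, Action.MB}
HIGH_LEVEL = {Action.WO, Action.PP, Action.PB}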

Fig. 3. Example task for the assumed rover: a map with the ball, the final goal, known and unknown obstacles (including an unknown wall), and the initial rover position.

Fig. 4. Schematic view of the assumed rover. Low level actions: Move Forward (MF), Move Backward (MB), Turn Left (TL), Turn Right (TR). High level actions: Write Obstacle Position to Map (WO), Plan Path to Goal (PP), Pickup Ball (PB).

3.2. Sensor states and goal sensors

Basic actions are triggered by changes of the sensor states. Fig. 5 lists the sensor states employed in the rover problem, among which several states (marked "g") correspond to "goal sensors". Goal sensors, unlike the usual physical sensors, are the results of certain manipulations of several sensor outputs, and denote certain aspects of the distance between the current rover situation and the goal situation.

The effects of introducing these goal sensors are twofold. Firstly, the goal-pursuing plan can be represented in the form of a combination of basic reflective actions, each of which induces a certain change of the sensor states' values. In other words, the plan to achieve the goal can be represented as the sequence of changing each of the sensor states one by one. Note that this sequence (which sensor state to change first, which to change second, etc.), as well as which action should be taken to induce each state change, need not be implemented beforehand; they are acquired autonomously during the learning phase, using the method described later.

Fig. 5. Definitions of the sensor states, including goal sensors:
#0      Previous move (1: TL, 2: TR, 3: MF, 4: MB)
#1      Tactile sensor state (-1: cannot move, 0: no touch, 1-8: touch sensed)
#2 (g)  Ball carried (0) or not (1)
#3 (g)  Target visible from the current position (0 for yes, 1 for no)
#4 (g)  Map updated (0 for yes, 1 for no)
#5      Angle between target direction and heading
#6      Direct distance to current target
#7      Possibility to rotate freely (0 for yes)
#8      Located at final goal position (0 for yes)
States marked (g) are goal sensor states.

Fig. 6. State transitions of the high level actions Pickup Ball (PB), Plan Path to Goal (PP) and Write Obstacle Position to Map (WO).

Secondly, the high level, sophisticated actions such as PB, WO and PP can also be represented as state transitions, as in Fig. 6. In these figures, "*" (asterisk) represents that the specified state can take an arbitrary value, and "**" denotes that the value of the state is not changed from that before the action. Moreover, the goal state can be represented quite easily, as in Fig. 7, and when the rover collides with an unknown obstacle, its states have only to be changed as in Fig. 8. As described later, with the aid of inductive learning, the system can acquire knowledge as to the effect of each low level action in the form of a generalized state transition. These representations are therefore especially important in order to represent the key sensor states and the effects of all the actions uniformly in the same schema, which makes the later processes much easier and more systematic.

Fig. 7. Goal state: #2 = 0 (ball carried) and #8 = 0 (located at final goal position); all other states arbitrary (*).

Fig. 8. State transition caused by a collision with an obstacle: the tactile state #1, the map-updated state #4 and the rotate-freely state #7 become 1; the other states are unchanged (**).
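This uniform representation maps naturally onto a small data structure. The sketch below is our own reading of the scheme, not the authors' code; it stores a state as a 9-tuple and a generalized transition as a (precondition, postcondition) pair using the paper's "*" and "**" wildcards:

ANY = "*"         # the state may take an arbitrary value
UNCHANGED = "**"  # the state keeps its pre-action value

def preconditions_match(pattern, state):
    """True if a concrete 9-element state satisfies a generalized pattern."""
    return all(p == ANY or p == s for p, s in zip(pattern, state))

def apply_transition(pre, post, state):
    """Predict the state after a transition whose preconditions match."""
    assert preconditions_match(pre, state)
    return tuple(s if q == UNCHANGED else q for q, s in zip(post, state))

# Fig. 8's collision pattern: tactile (#1), map-updated (#4) and
# rotate-freely (#7) become 1; every other state is untouched.
collision_pre = (ANY, 0, ANY, ANY, 0, ANY, ANY, ANY, ANY)
collision_post = (UNCHANGED, 1, UNCHANGED, UNCHANGED, 1,
                  UNCHANGED, UNCHANGED, 1, UNCHANGED)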

3.3. Learning procedure

Fig. 9 describes the overall procedure of the learning. The final outcomes of the learning are twofold: a reflexion network, which tells which sensor state should be changed in the current situation in order to achieve the goal, and decision trees (one for each state transition pattern), which tell which action should be taken to induce the required state transition. The learning proceeds in the following way.

Step 1. Acquisition of training data. The rover chooses one of the actions randomly and continues the action until at least one of the sensor states changes, by which it obtains one datum as to "action vs. state transition". A "change of the sensor state" is defined as follows: for the discrete-value type states, any change of the value; and for the continuous-value type states (such as #5 and #6), a transition of the value between positive, negative and zero. The rover iterates this process to accumulate data of this kind in a database, which will be used as training data in the later steps. The left four figures in Fig. 10 show examples of such data.
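A sketch of Step 1 in code, assuming a simulator object exposing hypothetical sense() and execute(action) methods (these names are ours); continuous states #5 and #6 are compared by sign, as the text above specifies:

import random

CONTINUOUS = {5, 6}  # angle to target (#5) and distance to target (#6)

def discretize(state):
    """Map continuous states to their sign so that 'change' is well defined."""
    sign = lambda v: (v > 0) - (v < 0)
    return tuple(sign(v) if i in CONTINUOUS else v
                 for i, v in enumerate(state))

def collect_training_data(rover, actions, n_samples):
    """Random movements: repeat a randomly chosen action until some sensor
    state changes, then record one (before, action, after) triple.
    Assumes every action eventually changes some state."""
    data = []
    while len(data) < n_samples:
        before = rover.sense()
        action = random.choice(actions)
        after = rover.execute(action)
        while discretize(after) == discretize(before):
            after = rover.execute(action)  # keep acting until a change
        data.append((before, action, after))
    return data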

Fig. 9. Overview of the learning process: random movements supply training data for inductive learning; the results are compiled into the reflexion network and decision trees which drive knowledge-oriented movements; detected inconsistencies in the cause-effect relationships trigger re-learning and re-compilation.

Step 2. Generalization of state transitions. The state transitions in the database having the same effect of changing a certain state into a certain value are grouped. For example, the left four figures in Fig. 10 have the same state transition pattern, "#1 (tactile sensor) state to 0", and so they are grouped. These groups are named "same transition pattern groups". The groups overlap, and therefore most of the data belong to more than one group. For each group, all the data belonging to it are generalized to obtain a general representation of "the movement which changes a certain state to a certain value". For generalization, first the continuous-type values (such as #5 and #6) are translated into one of plus, minus and zero; each state value which has not been changed from the one before the action is replaced with "**"; and then the dropping-condition rule, a widely utilized generalization rule, is employed to replace some values with "*". The right figure of Fig. 10 shows an example of the outcome of this process, which can be interpreted as "the generalized movement which changes the tactile state (#1 state) to 0". This generalized form represents the preconditions for a certain state change to be possible as well as its consequences.

Fig. 10. Generalization of state transition: four training data with the pattern "#1 state to 0" (left) and the generalized state transition extracted from them (right).
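Step 2 can be sketched as follows; this is our reading of the procedure, not the authors' code. Within one "same transition pattern group", a pre-action value shared by all examples is kept, values on which the examples disagree are dropped to "*" (the dropping-condition rule), and post-action values equal to the pre-action value become "**":

ANY, UNCHANGED = "*", "**"  # wildcards as in the sketch of Section 3.2

def generalize_group(examples):
    """examples: list of (before, after) discretized state tuples which all
    realize one transition pattern, e.g. '#1 state to 0'. Returns the
    generalized (precondition, postcondition) pair."""
    n = len(examples[0][0])
    pre, post = [], []
    for i in range(n):
        befores = {b[i] for b, _ in examples}
        # dropping condition rule: disagreement across examples -> '*'
        pre.append(next(iter(befores)) if len(befores) == 1 else ANY)
        if all(a[i] == b[i] for b, a in examples):
            post.append(UNCHANGED)  # value is never altered by the action
        else:
            afters = {a[i] for _, a in examples}
            post.append(next(iter(afters)) if len(afters) == 1 else ANY)
    return tuple(pre), tuple(post)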

Step 3. Learning of the actions to be taken. The other knowledge to be obtained is "which action should be taken in order to make a specific transition". This knowledge is acquired for each transition pattern in the form of a decision tree. The algorithm of [16] is utilized to build the decision tree, using the well-known "information gain" criterion to select the best decision rule at each node. Fig. 11 shows examples of the obtained decision trees for the state transitions "#1 state to 0" and "#5 state to 0", which describe the decision rules for selecting the action to be taken to make these transitions. The left tree tells that the previous move (state #0) and the current tactile sensor state (state #1) must be referred to for this decision making. (The • in the figure means that no action is needed because the #1 state is already 0 at that leaf node.)

Fig. 11. Examples of the obtained decision trees, for the transitions "#1 state to 0" (left, branching on the previous move and the tactile state) and "#5 state to 0" (right, branching on the target angle and the tactile state).
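The attribute-selection step of the tree builder follows the standard information-gain recipe (the paper cites [16] for the exact algorithm); a minimal sketch with our own helper names:

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a non-empty list of action labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """examples: list of (state_tuple, action_label); attribute: state index.
    Gain = H(labels) - sum_v |S_v|/|S| * H(labels | attribute = v)."""
    labels = [a for _, a in examples]
    by_value = {}
    for state, action in examples:
        by_value.setdefault(state[attribute], []).append(action)
    remainder = sum(len(sub) / len(examples) * entropy(sub)
                    for sub in by_value.values())
    return entropy(labels) - remainder

def best_attribute(examples, attributes):
    """The state index to branch on at the current tree node."""
    return max(attributes, key=lambda a: information_gain(examples, a))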

Step 4. Compilation of generalized state transitions. With the generalized state transitions obtained in Step 2, the sequence of state transitions from arbitrary initial states (i.e., all the states are "*") to the goal state (Fig. 7) is searched for. This can be done using a STRIPS-type task planning algorithm [14]. Fig. 12 shows a schematic view of this process when backward chaining is employed. The route from the initial state to the goal is then compiled into a subsumption-type reflexion network such as that in Fig. 13, in which "Func(#2=0)" means the action which changes the #2 state into 0.

Fig. 12. Off-line task planning to make the subgoal sequence, chaining backward from the goal state to an arbitrary initial state.

Fig. 13. Finally obtained reflexion network: layers such as "Func(#4=0)", "Func(#3=0)", "Func(#7=0)" and "Func(#5=0)" are triggered by sensor input checks (e.g. "IF #1,3,4=0") and interact through activation and suppression chains.
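A much-simplified serial sketch of the compilation, reusing the wildcard helpers above: chain backward from the goal pattern through the generalized transitions until a pattern matching any state is reached, and emit the chain as prioritized layers. The helper semantics are our assumptions; the actual network also allows several layers to be triggered simultaneously, with suppression resolving the conflict:

def achieves(pre, post, goal):
    """A transition achieves a goal pattern if every constrained goal value
    is either set by the action or already required by its precondition."""
    return all(g == ANY or q == g or (q == UNCHANGED and p == g)
               for p, q, g in zip(pre, post, goal))

def compile_reflexion_network(goal, transitions):
    """transitions: list of (subgoal_name, pre, post). Returns layers as
    (trigger_pattern, subgoal_name) pairs, ordered from the most specific
    trigger (closest to the goal, highest priority) down to the 'any state'
    fallback. Assumes a finite backward chain to 'any state' exists."""
    layers, subgoal = [], goal
    while any(v != ANY for v in subgoal):
        name, pre, post = next(t for t in transitions
                               if achieves(t[1], t[2], subgoal))
        layers.append((pre, name))  # when pre holds, pursue this transition
        subgoal = pre               # the precondition is the next subgoal
    return layers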

Step 5. Goal-pursuing movement. Using the obtained reflexion network and decision trees, the action to be taken at each time step is determined, first by identifying the state transition currently needed via the reflexion network (Fig. 13), and then by choosing an action which is expected to achieve this state transition using the corresponding decision tree (Fig. 11). This process is iterated at a certain control cycle. The sensor states are checked, one of the layers is activated as a result of the sensor state matching and the suppression-type interactions between layers, and one basic action is triggered to achieve the specified state transition. If, in this subgoal-achieving process, one of the sensor states changes in such a way that the currently activated layer is no longer activated, control moves to another layer. This function enables the system to respond quickly to disturbances such as collisions with obstacles. If the specified state transition is achieved successfully, control also moves to another layer.
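At run time, Step 5 reduces to a sense-select-act loop. In the sketch below (our construction, reusing the helpers above), layers comes from the compilation, and trees maps each transition name to its decision tree, represented for simplicity as a function from state to action; GOAL is the Fig. 7 pattern:

GOAL = (ANY, ANY, 0, ANY, ANY, ANY, ANY, ANY, 0)  # #2 = 0 and #8 = 0

def control_cycle(rover, layers, trees, max_steps=1000):
    """One goal-pursuing episode. At every tick, the highest-priority layer
    whose trigger matches the current state is activated, suppressing the
    others, and its decision tree picks one basic action."""
    for _ in range(max_steps):
        state = discretize(rover.sense())
        if preconditions_match(GOAL, state):
            return True                       # overall goal reached
        for trigger, subgoal_name in layers:  # highest priority first
            if preconditions_match(trigger, state):
                rover.execute(trees[subgoal_name](state))
                break                         # only one layer acts per tick
    return False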

During the early learning phase, the amount of training data is not sufficient to obtain enough knowledge as to the generalized state transitions to completely construct the goal-achieving sequence in Step 4. Besides, the training data may yield incorrect knowledge because of data noise or as a result of over-generalization. Therefore, Step 5 is not performed until the goal-achieving sequence is obtained in Step 4, and even when Step 5 is performed, the consequence of every movement is checked to see whether it conforms to the obtained knowledge, i.e., the decision trees and generalized state transitions. When some discrepancy is found, Steps 2-4 are performed again using all the accumulated and newly obtained training data.

If the fraction of movements which contradict the obtained knowledge is larger than a certain threshold, the system judges that the environment or the rover itself has undergone some change, and so abandons the accumulated training data and starts a new learning session from Step 1. With this technique, the system has the flexibility to adapt itself to changes of the environment or of the system itself.
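The consistency check that drives both re-learning triggers can be sketched as follows; the window length and threshold value are ours, for illustration only:

from collections import deque

class KnowledgeMonitor:
    """Tracks how often actual outcomes contradict the learned transitions.
    A single discrepancy triggers re-learning on the enlarged data set
    (Steps 2-4); a high discrepancy *rate* signals an environment or system
    change, so the old data are abandoned and learning restarts at Step 1."""
    def __init__(self, window=50, restart_threshold=0.3):
        self.outcomes = deque(maxlen=window)
        self.restart_threshold = restart_threshold

    def record(self, predicted_state, actual_state):
        self.outcomes.append(predicted_state == actual_state)

    def needs_relearning(self):
        return bool(self.outcomes) and not self.outcomes[-1]

    def needs_restart(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        bad = sum(1 for ok in self.outcomes if not ok)
        return bad / len(self.outcomes) > self.restart_threshold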


3.4. Path planning

As stated before, path planning is considered one basic action of the rover. For path planning, a hybrid algorithm utilizing both the well-known tangent graph search method and the potential method [15] has been employed. The two methods are used, respectively, for the initial, global path planning and for the near-obstacle, local path planning; which is used in which situation has been implemented beforehand in the form of a rule. The objective of the path planning system is to provide the control system with the current subgoal position, one which is directly visible from the current position (i.e., with no obstacles in between). The generated reflexion network in Fig. 13 tells that if the rover collides with an obstacle not represented on the map ("IF not #4=0") during the goal-achieving action, the position of the obstacle is written on the map ("Func(#4=0)"), which changes the map and the potential field (#3 becomes not 0), resulting in a new subgoal directly visible from the current rover position via "Func(#3=0)". With this strategy, the control system is always provided with an instantaneous subgoal which can be reached by a direct path (without avoiding obstacles) from the current rover position.
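The paper does not detail the hybrid planner (it cites [15]); as an illustration of the local, near-obstacle part only, a textbook artificial-potential computation is sketched below, with arbitrary gain values. Descending this field yields the local subgoals, while the tangent graph handles the global stage:

import math

def potential(pos, goal, obstacles, k_att=1.0, k_rep=100.0, d0=50.0):
    """Attractive potential toward the goal plus repulsive terms near
    obstacle points; a local subgoal can be chosen by descending this
    field until the target becomes directly visible."""
    x, y = pos
    gx, gy = goal
    u = 0.5 * k_att * ((x - gx) ** 2 + (y - gy) ** 2)
    for ox, oy in obstacles:
        d = math.hypot(x - ox, y - oy)
        if 0 < d < d0:  # repulsion acts only within the influence radius d0
            u += 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2
    return u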

4. Simulation study

4.1. Evaluation method

The proposed architecture has been applied to the example rover problem (Figs. 3 and 4) in computer simulations. A setting of the positions of the rover, ball, goal and obstacles is generated randomly in the computer; one setting of these parameters is called "one problem" hereafter. During the learning phase, the rover tries to achieve the goal by iterating random movements, through which it accumulates the training data. After one problem is solved, another problem is generated, also randomly, and the rover tries to solve it in turn.

The obtained knowledge is evaluated in the following manner. At several stages of the learning phase (every ten problems), the decision trees and the reflexion network are constructed from the data obtained up to that time. The rover then tries to solve 40 randomly generated problems with this knowledge, and evaluation data are generated, such as: (1) the success rate (the percentage of problems the rover solves successfully); (2) the average number of tried commands to achieve the goal (per successfully solved problem); and (3) the average number of tried actual movements to achieve the goal (per successfully solved problem).

4.2. Simulation results

Fig. 14 shows the history of the performance of the knowledge evaluated in this way; the horizontal axis gives the number of training data used for learning. During the initial learning phase the success rate is very low, and even when a problem is successfully solved, many movements are required to achieve the goal. As the learning proceeds, the success rate becomes higher, and with more than about 1250 training data the rover can solve any problem of this kind perfectly. The efficiency of solving the problems improves and converges simultaneously.

Fig. 15 depicts one example of the rover movement at a mature learning stage, i.e., when 3000 data are used for learning. In this case, the obtained reflexion network is the one described in Fig. 13, and an example of the decision trees is given in Fig. 11. In SCENE 1, the rover knows that the ball is not directly visible (#3 is not 0), so "Func(#3=0)" is triggered to make a path plan to the ball position. Then in SCENE 2, it moves along the path until it collides with an unknown obstacle. On the collision, #3 and #4 change to "not 0", which triggers the actions WO and PP to reset these values to 0 again. In SCENE 3, it iterates the actions of making a subgoal and achieving the current subgoal (the so-called "wall following actions") until it clears the obstacle. In SCENE 4, the rover iterates these obstacle-avoiding and ball-pursuing actions to reach the ball position; the collision points with unknown obstacles along its way are marked on the rover's map. In SCENE 5, the rover plans the path to the final goal, which in this case already avoids the initially unknown obstacle, and in SCENE 6, after a total of 107 movements, the rover achieves the overall goal.

Fig. 14. History of the goal-achieving performance during the learning phases: success rate and number of tried commands, averaged over 40 problems, versus the number of training data used for learning.

Fig. 16 compares the rover movements on one and the same problem between the case when 500 data are used for learning (Case A) and when 3000 data are used (Case B). In Case A, the rover can avoid the known obstacle, but when it collides with an unknown obstacle it gets stuck. This is because the training data are insufficient to coordinate the subgoal-making and subgoal-achieving actions efficiently. In Case B, on the other hand, the unknown obstacle can also be cleared successfully. Clearing unknown obstacles requires several planning skills, such as "when the rover collides with an obstacle, it must undo the previous action until it can rotate freely", or "in order to avoid unknown obstacles, it must perform wall following movements, which can be realized by sophisticated combinations of subgoal-making and subgoal-achieving actions". The important point is that these skills are obtained autonomously during the course of the interactions between the rover and the environment.

4.3. Experiment using an actual rover

We have just started a primitive indoor experiment using an actual rover in order to verify the proposed architecture in the real world. Fig. 17 shows a photograph of the rover, which has four touch sensors, each sensing contacts from two directions, an onboard camera which can move in two translational and one rotational dimensions, and a caterpillar mechanism to move forward and backward as well as to rotate the rover. The onboard camera is not used in the current experiment. The position and orientation of the rover are acquired from a bird's-eye view of the experimental site obtained by a camera placed on the ceiling. (This tentative navigation system is now being replaced by an onboard gyro-and-camera hybrid navigation system.)

The experiment has been performed as follows. The characteristic parameters of the rover and the environment are input into the computer simulator, where the learning is performed using the method described in the previous sections. The positions, shapes and sizes of the obstacles, and the positions of the rover, ball and goal, are set randomly in the learning phase. The obtained reflexion network and decision trees are then transferred to the computer which controls the actual rover, and the rover movements are tested on the experimental site with several obstacles placed and with arbitrary position settings of the rover, ball and goal.

Fig. 18 shows one example of the rover movement. The dotted lines show the paths planned to the subgoals, while the solid lines show the actual trajectories of the rover. Although the problem setting in this experiment is quite simple, the rover can successfully plan paths, clear the obstacle, and reach the ball and goal positions.

Fig. 15. Example of rover movements controlled by the learned knowledge: path planning to the ball (SCENE 1), collision with an unknown obstacle and re-path-planning (SCENE 2), achievement of the current subgoal by wall following (SCENE 3), and progress to the overall goal (SCENEs 4-6).

In the current experiment, the whole learning is performed in computer simulation, because the following problems remain unsolved:
(1) The current learning method requires a lot of trial movements as training data in order to obtain the goal-pursuing action strategy.

(2) Safety during the learning is not guaranteed; in other words, the rover may enter a dangerous state or even get damaged during the learning phase.

The former problem is important because actual rovers (especially future planetary rovers) will have limits on the time and energy available for learning. For this problem, we are now considering the following two countermeasures.
(a) Hybrid learning strategy. Most learning is performed in computer simulations, and only knowledge that is quite sensitive to the environment (for example, an obstacle avoidance strategy sensitive to the average size of the obstacles) is obtained via learning in the real world. The proposed system has a mechanism by which the knowledge obtained via simulation learning is checked during the real movements, which provides the opportunity for re-learning in the real world. Especially for planetary rovers, it is very practical to perform the whole learning once on the Earth and then let the check and re-learning mechanism further adapt the knowledge to the specific planetary environment.

Fig. 16. Comparison between the early (500 training data) and mature (3000 training data) learning stages: with 500 data the rover cannot clear the unknown obstacle; with 3000 data it succeeds after 92 tried commands and 75 movements.

Fig. 17. Photograph of the rover.

Fig. 18. Example of rover movement in the experiment (site roughly 300 cm by 300 cm): dotted lines show the planned paths to the subgoals; solid lines show the actual rover trajectory from the start, past a collision point, to the ball.

(b) Learning considering required time and energy. A compromise must be pursued between the required time and energy and the coverage and preciseness of the obtained knowledge. One method is to construct an evaluation function taking these factors into account, and to stop the learning at the point where this function begins to decline.
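As a toy rendering of this stopping rule (the weights and functional form are purely illustrative, not from the paper):

def should_stop_learning(history, w_time=0.01, w_energy=0.005):
    """history: chronological list of (success_rate, time_spent, energy_spent)
    snapshots taken during learning. Stop when the cost-discounted utility
    begins to decline."""
    def utility(snapshot):
        rate, t, e = snapshot
        return rate - w_time * t - w_energy * e
    return len(history) >= 2 and utility(history[-1]) < utility(history[-2])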

The latter problem is especially important for rovers (such as planetary rovers) which must operate autonomously without human attendance. For this problem, the notion of "safe learning" must be elaborated, which is now being studied.

We are now studying the above items so that the experiments may come very close to the expected operations and environments of planetary rovers. The hardware, as well as the software, is being enhanced so that the rover has sufficient navigation, hazard avoidance and fault tolerance capability for the planned outdoor experiments.

4.4. Discussions

In our architecture, the problem of making a plan to achieve the goal is decomposed into the following two subproblems, which are iteratively solved using the pieces of obtained knowledge corresponding to them.
(a) Which state should be changed in the current situation? This could be solved by performing STRIPS-type task planning [14] at every decision time, using the knowledge of the preconditions and consequences of each action. In fact, our system obtains the generalized state transitions carrying this information, and therefore such real-time task planning would be possible. The reason the task planning is instead performed off-line, with its result compiled into a reflexion network, is to pursue real-time performance. Besides, in order to provide the capability to handle anomalous events (such as collisions with obstacles) or failures during the course of pursuing a certain subgoal, the reflexion network is configured in such a way that several layers can be triggered at the same time, from which one is activated as a result of their suppression interactions.

(b) How can that change be achieved in the current situation? This is solved by the decision trees generated for each state transition pattern. The decision tree scheme is utilized because of its powerful inductive learning capability, its ability to deal with both discrete-type and continuous-type values, and its quite small computational load in decision making.

These two levels of knowledge can be managed separately. This is useful because the two types differ in their level of abstraction, and it is likely that knowledge (a) is less subject to environmental change than knowledge (b). In case some inconsistency between the knowledge and the actual cause-effect relationships is observed, the knowledge is updated by re-learning, which can be done more efficiently if the knowledge is separated.

The learning method to be employed is also an important issue. In our system, the training data as to the pre- and post-conditions of actions are obtained during random movements, from which generalized cause-effect relationships of actions and decision trees are extracted by inductive learning. We do not employ reinforcement learning, because it is very difficult to make credit assignments, and because knowledge obtained by reinforcement learning is not so easily modified to reflect environmental changes as knowledge based on learned cause-effect relationships.

How to avoid oscillatory behaviors is an important issue, too. In the example rover problem, oscillatory behaviors could occur in the following ways:
(1) An apparently reverse action is taken after a certain action, and this iterates infinitely. For example, the sequence of "Turn Right" and "Turn Left" actions repeats infinitely without any actual effect.
(2) When the rover is surrounded by obstacles in almost all directions, obstacle avoidance actions may enter an infinite loop.

In our system, these situations are avoided in the following ways, respectively.
(a) The first type of oscillation is avoided by introducing the sensor state "previous move" (#0). During the learning phase, as stated before, decision trees are constructed which dictate which actions to take to achieve certain subgoals. These decision trees are generated such that actions reverse to the previous ones are not taken (see Fig. 11, in which the "previous move" is referred to), because such actions would not contribute at all to the subgoal achievement.
(b) The second type of oscillation is avoided by the mechanism which memorizes the positions of the obstacles every time the tactile sensors collide with them. Thanks to this memorization, the "Plan Path to Goal" action can make path plans different from the ones generated before, also taking the recently acquired obstacle positions into account, which prevents the rover from entering an infinite loop.
In the current rover problem, no other oscillatory behaviors occur at all. This favorable situation is not, however, guaranteed for every type of problem, and therefore a more general method to avoid oscillations, with theoretical backing, will have to be elaborated. This is another important research issue.

Finally, let us discuss the a priori information to be implemented before learning. In our system, the following information must be specified: (1) definitions of the sensor states, including the goal sensor states; (2) representations of the goal in terms of state values; (3) definitions of the high level actions (such as WO, PB, PP) in state transition form; (4) various parameter settings for the generalization of state transitions and the generation of binary decision trees.

The most important knowledge is (1), which defines the maximum information which can be referred to in decision making. The sensor states are also important in the sense that some combinations of state values provide "way points" leading to the overall goal, as in Fig. 12. With these way points, the goal-pursuing activity can be decomposed into a series of subgoal-achieving actions which can be governed by simple reflexion-type controls. Goal sensor states are especially important for obtaining this feature. Therefore the sensor states must be chosen carefully so that important states are not dropped. In fact, the goal sensors in the example problem of this paper are defined in such a sophisticated way that these two objectives can be achieved. In general problems, however, this definition may be quite difficult for a human because of the complexity of the problem. In such a case, the definition of the important sensor states would better also be made autonomously, by learning or other methods if possible. This is, however, quite a tough problem observed in many areas of artificial intelligence, and it is beyond the state of the art. A practical approach is for a human to define as many states as seem relevant, so that the system can choose the important ones from among them during the learning.

5. Overview of related research and comparisons

In this section, related research is reviewed and the relationships to our approach are discussed.

Maes proposes behavior networks in which goals, perceptions and behavior modules are connected by links of activation and inhibition [11]. In this network, activation energy is spread through the cause-effect links from both the goal and the current situation; a behavior module whose activation level reaches a certain threshold is then selected and executed. This architecture has both high reactivity, based on parallel processing, and the ability to deal with goals explicitly. However, it has a difficulty in constructing the behavior network. Maes uses a kind of reinforcement learning approach to handle this problem [12], which enables each link to gradually learn its appropriate weight from experience. A notable characteristic of the algorithm is that this learning is also carried out in a distributed, decentralized way.

Arkin's AuRA architecture is intended to combine a hierarchical planner based on conventional AI with distributed reactive control, utilizing schema theory [1]. Schemata are individual agents which implement basic behavior patterns and perceptual strategies, and each of them operates concurrently. Given a certain goal, the planner makes a plan and decomposes it into a set of schemata, which operate in parallel; their outputs are then combined, resulting in the behavior of the entire system. Besides, Arkin suggests that world knowledge is very useful even in reactive control. He distinguishes between persistent knowledge, which is static information about the environment, and transitory knowledge, which is dynamically acquired. In AuRA, the former is utilized for efficient use of resources, and the latter is used only for reconfiguration of the reactive control regime when difficulties are encountered.

Mitchell's Theo-Agent architecture is also a hybrid architecture, incorporating both a stimulus-response subsystem for rapid reaction and a search-based planner for handling unanticipated situations [13]. Theo-Agent reacts to its environment following a corresponding stimulus-response rule whenever possible, i.e., whenever one of the accumulated rules applies to the current situation. Only when no rules apply is the planner invoked. A noteworthy feature of this architecture is its learning capability: whenever it is forced to plan, it invokes an explanation-based learning mechanism and acquires a new reactive rule which covers the situation. This learning strategy makes the agent increasingly reactive.

Segre and Turney describe a different type of hybrid architecture, named SEPIA, combining planning and reaction, in which the planner incrementally generates approximate plans and outputs the results in the form of values of pseudo-sensors [19]. SEPIA's reactive executive system then looks up the values of both real and pseudo-sensors and decides the action based on the rule which applies to the current situation. SEPIA also utilizes explanation-based learning to create efficient macro-operators.

Spector and Hendler's supervenience architecture is a multi-level architecture in which higher level layers for deliberative functions are incrementally put upon lower levels for reactive control [18]. Each layer contains its own local representation system, i.e., a blackboard and local operators defined over that blackboard, and communicates with the levels above and below itself through its blackboard. One of the most notable characteristics is that only the lowest level has direct access to the environment via sensors and actuators. They call this characteristic "world knowledge up, goal down".

Gat presents an alternative approach to integrating planning and reacting [4]. His architecture, called ATLANTIS, consists of three heterogeneous components operating asynchronously: the Controller, which manages primitive and reactive activities; the Deliberator, which is responsible for time-consuming computational tasks such as planning and model building; and the Sequencer, which controls sequences of activities in both the Controller and the Deliberator. Most important is that the outputs of the planner are utilized only as advice and not for direct control. This allows the architecture to reuse the heritage of classical AI, such as search-based planners and modeling techniques, without modification.

The Oxford AGV employs a hybrid architecture named DTRA (distributed real-time architecture), combining traditional functional decomposition and behavioral decomposition, in which tasks at different levels, such as the path planner, navigator and obstacle avoidance, are modularized; each module has a direct tight connection to specific sensors and actuators, and loose connections to other behavior modules for parallel operation [5]. Basically this architecture obeys the traditional hierarchical decomposition scheme, but the control tasks of the whole system are distributed among behavior experts and executed in a decentralized way.

Henderson and Grupen proposed the concept of "Logical Behavior" for organizing multi-sensor integration and control [6]. This architecture is based on behavior-based control schemes, but its originality is that a behavior is represented as a mapping from "logical sensors" to "logical actuators", where logical sensors/actuators need not be linked to physical sensors/actuators but may represent any hypothetical states and their transitions. Within this framework, planning can be represented in the same manner as actual execution, which makes it possible to construct more complex robot controllers than mere reflexive mappings from sensors to actuators.

Lyons and Hendriks proposed the "Planner-Reactor" hybrid architecture, in which the planner can iteratively improve the reactive system to contain novel, auto-generated behavior, while minimally interfering with the ability of the reactive system to react quickly to the environment [10]. The main objective of this architecture is to provide the reactive system with the ability to adapt to changes in the environment and the goal. In addition, a novel concept of "safe adaptation" has been introduced, and the constraints which ensure that the incremental adaptations converge to a desired reactor have been discussed.

As many researchers point out, it is generally hard to construct distributed parallel control systems. Kaelbling and Rosenschein propose a language system named GAPPS for this problem, which compiles the agent's behaviors, described in a symbolic and declarative way, into parallel programs composed of low-level actions [7,9]. Kaelbling later advocates the importance of adaptation to the environment and exploits a sort of reinforcement learning [8].

Steels advocates the on-line use of genetic algorithms for this adaptation problem [17]. His PDL architecture is highly subsymbolic and shares many common features with Brooks' subsumption architecture. A key difference is that Steels utilizes cooperation and competition mechanisms between modules, instead of inhibition, for behavior selection.

Let us compare these approaches and ours in two aspects.

(1) How to combine goal-pursuing planning and reactive actions. Most of the above-mentioned research tries to maintain concurrency, real-time performance and robustness by employing stimulus-response type reactive control similar to the subsumption architecture, and, in order to add goal-pursuing activity, incorporates some planning capability, resulting in a hybrid architecture. But the way the two schemata are combined differs: planning and situated actions are controlled by several behavior modules connected by cause-effect relationships in [11]; goals decomposed by the upper level planner are realized by several reactive modules in Arkin [1]; controls are made by reactive rules, generated by the planner whenever the situation is not covered by the current set of rules, in Theo-Agent [13]; results of approximate planning are transferred to the reactive system via pseudo-sensors in SEPIA [19]; and distributed planning and reactive modules are activated concurrently in ATLANTIS [4] and DTRA [5]. In [6], planning is represented as one of the generalized behaviors, and in [10], the planner adapts the reactive system in order to deal with changes in its environment and goal.

In our approach, the goal-pursuing planning is performed in an off-line fashion, and its results are compiled into a network of reflective actions. This method is in a sense similar to Mitchell's Theo-Agent approach, in which planning results are represented as reactive rules. The key characteristics of our approach are that the goal-achieving and obstacle-avoidance actions can be represented uniformly in the same form of reactive actions by introducing the notion of goal sensors, and that the decision making is decomposed into "what subgoal to achieve now", governed by the reflexion network, and "how to achieve it", governed by the decision trees.

(2) How to construct the decision making system. It has been pointed out in the literature that it is quite difficult to construct a distributed parallel control system which as a whole achieves certain objectives. In order to solve this problem, various machine learning methods have been applied. For example, Maes [12] and Kaelbling [8] use reinforcement learning, Theo-Agent [13] and SEPIA [19] use explanation-based learning, and Brooks [3] and Steels [17] employ genetic algorithms to totally or partially construct the planning and control system. In GAPPS [7,9], a language system is proposed to help humans construct a parallel program of low level actions.

In our approach, inductive machine learning is employed. The notable characteristic is that the cause-effect relationships are obtained in a generalized form which is then compiled into a distributed reflexion network. This eliminates the difficult problem of credit assignment, which is required in the case of reinforcement learning or genetic algorithms, and makes the system more adaptive to environmental changes.

6. Conclusions and future research

An architecture to control a rover under ill-structured, partially unknown environments has been proposed. The key idea is that the proposed system has the capability to autonomously and flexibly tune its goal-achieving plans to fit the environment in which it moves. This is achieved by introducing an inductive learning capability to acquire the knowledge as to "what is the current subgoal to achieve in the current situation" and "how to achieve it". Some sophisticated actions, such as the "wall following movement", can also be acquired autonomously within the same architecture. The system has been evaluated by simulations and quite simple experiments, and more realistic outdoor experiments assuming a planetary rover are now planned.

There still remain several problems to be solved. The most important one is, as discussed earlier, how to define the sensor states, including the goal sensors. This issue is similar to the problem of how to represent the real world with symbols, which has been quite a tough problem typical of many AI fields. We are now studying a learning scheme for this objective. Another problem is how to accelerate the learning. In our current system, training data near the final goal situations are hard to collect, because a combination of random movements rarely achieves the overall goal. We are now considering a meta-capability of the system to control the sequence of learning. Finally, in the practical situation of planetary rovers, most of the actions will be learned on the Earth, and on the target planet only the difference between the planet's and the Earth's surfaces should be compensated for, without trying many random movements. For these objectives, a novel method of modifying the obtained reflexion network and decision trees without performing the re-learning is required. We are continuing research on these problems, as well as testing the architecture by experiments in the real world.

References

[1] R.C. Arkin, Integrating behavioral, perceptual and world knowledge in reactive navigation, in: Designing Autonomous Agents (The MIT Press, Cambridge, MA, 1990) 105-122.

[2] R.A. Brooks, A robust layered control system for a mobile robot, IEEE Journal of Robotics and Automation 2(1) (1986) 14-23.

[3] R.A. Brooks, Artificial life and real robots, in: Toward a Practice of Autonomous Systems: Proc. of the First European Conference on Artificial Life (The MIT Press, Cambridge, MA, 1991) 3-10.

[4] E. Gat, Integrating planning and reacting in a heterogeneous asynchronous architecture for controlling real-world mobile robots, Proc. of the Tenth National Conference on Artificial Intelligence (1992) 809-815.

[5] H. Hu, Sensor-based control architecture, in: S. Cameron and P. Probert, eds., Advanced Guided Vehicles (World Scientific, Singapore, 1994) 17-35.

[6] T.C. Henderson and R. Grupen, Logical behaviors, Journal of Robotic Systems 7(3) (1990) 309-336.

[7] L.P. Kaelbling, Goals as parallel program specifications, Proc. of the Seventh National Conference on Artificial Intelligence (1988) 60-65.
[8] L.P. Kaelbling, An adaptive mobile robot, in: Toward a Practice of Autonomous Systems: Proc. of the First European Conference on Artificial Life (The MIT Press, Cambridge, MA, 1991) 41-47.

[9] L.P. Kaelbling and S.J. Rosenschein, Action and planning in embedded agents, in: Designing Autonomous Agents (The MIT Press, Cambridge, MA, 1990) 35-48.


[10] D.M. Lyons and A.J. Hendriks, Planning as incremental adaptation of a reactive system, Robotics and Autonomous Systems 14(4) (1995).

[11] P. Maes, Situated agents can have goals, in: Designing Autonomous Agents (The MIT Press, Cambridge, MA, 1990) 49-70.

[12] P. Maes, Learning behavior networks from experience, in: Toward a Practice of Autonomous Systems: Proc. of the First European Conference on Artificial Life (The MIT Press, Cambridge, MA, 1991) 48-57.

[13] T.M. Mitchell, Becoming increasingly reactive, Proc. of the Eighth National Conference on Artificial Intelligence (1990) 1051-1058.

[14] N.J. Nilsson, Principles of Artificial Intelligence (Springer, Berlin, 1980).

[15] K. Noborio and J. Hashime, A feasible path-planning algorithm for a mobile robot with a visible region in an uncertain workspace, Journal of Japan Robot Society 10(3) (1991) 378-384.

[16] S. Nakasuka and T. Yoshida, Dynamic scheduling system utilizing machine learning as a knowledge acquisition tool, International Journal of Production Research 30(2) (1992) 411-431.

[17] L. Steels, Emergent functionality in robotic agents through on-line evolution, Artificial Life IV (1994) 8-14.

[18] L. Spector and J. Hendler, Planning and reacting across supervenient levels of representation, International Journal of Intelligent and Cooperative Information Systems 1(3,4) (1992) 411-449.

[19] A. Segre and J. Turney, Planning, acting, and learning in a dynamic domain, in: S. Minton, ed., Machine Learning Methods for Planning (Morgan Kaufmann, Los Altos, CA, 1993) 125-158.

Takehisa Yairi received his B.E. from the University of Tokyo in 1994 and is currently pursuing an M.E. in the Department of Aeronautics and Astronautics at that university. His interests include robotics, machine learning and artificial intelligence. His current research concentrates on robust and fault-tolerant control architectures for autonomous robots.

Hiroyuki Wajima received the B.E. degree in aeronautics and astronautics from the University of Tokyo in 1995. He is a graduate student at the University of Tokyo. His research interests include autonomous robots and multi-agent systems.

Shinichi Nakasuka received his Master of Engineering and Ph.D. degrees in aeronautics and astronautics from the University of Tokyo in 1985 and 1988, respectively. He is now an Associate Professor at the Research Center for Advanced Science and Technology, University of Tokyo. His current research interests are autonomy and intelligence for space systems, multi-agent robotics and machine learning.