
State Space Construction for Behavior Acquisition in Multi Agent Environments with Vision and Action

Eiji Uchibe    Minoru Asada    Koh Hosoda
Dept. of Adaptive Machine Systems

Osaka University
Suita, Osaka 565, Japan

Abstract

This paper proposes a method which estimates the relationships between the learner's behaviors and those of the other agents in the environment through interactions (observation and action), using the method of system identification. In order to identify the model of each agent, Akaike's Information Criterion is applied to the results of Canonical Variate Analysis of the relationship between the observed data in terms of action and future observation. Next, reinforcement learning based on the estimated state vectors is performed to obtain the optimal behavior. The proposed method is applied to a soccer playing situation, where a rolling ball and other moving agents are well modeled and the learner's behaviors are successfully acquired by the method. Computer simulations and real experiments are shown and a discussion is given.

1 Introduction

Building a robot that learns to accomplish a task through visual information has been acknowledged as one of the major challenges facing vision, robotics, and AI. In such an agent, vision and action are tightly coupled and inseparable [2]. For instance, we human beings cannot see anything without eye movements, which suggests that actions significantly affect the vision processes and vice versa. There have been several approaches which attempt to build an autonomous agent based on tight coupling of vision (and/or other sensors) and action [15, 13, 14]. They consider that vision is not an isolated process but a component of a complicated system (a physical agent) which interacts with its environment [3, 16, 8]. This is a quite different view from the conventional computer vision approaches, which have paid little attention to physical bodies. A typical example is the problem of segmentation, which has been one of the most difficult problems in computer vision because of the historical lack of a criterion for how significant and useful the segmentation results are. Such results are difficult to evaluate without any purpose; that is, the evaluation is inherently task oriented. However, the problem is not a matter of straightforward design for special purposes, but of an approach based on physical agents capable of sensing and acting. That is, segmentation and its organization correspond to the problem of building the agent's internal representation through the interactions between the agent and its environment.

The internal representation can be regarded as state vectors from the viewpoint of control theory, because it includes the necessary and sufficient information to accomplish a given task. For the same reason it can be regarded as a state space representation in robot learning, especially in reinforcement learning, which has recently been receiving increased attention as a method requiring little or no a priori knowledge and offering a high capability of reactive and adaptive behaviors [7].

There has been little work on reinforcement learning with vision and action. To the best of our knowledge, Whitehead and Ballard proposed an active vision system [19], for which only a computer simulation has been done. Asada et al. [6] applied vision-based reinforcement learning to a real robot task. In these methods, the environment does not include independently moving agents; therefore, the complexity of the environment is not as high as one including other agents. In the case of a multi-robot environment, the internal representation would become more complex in order to accomplish the given tasks [4]. The main reason is that the other robot has perception (sensation) different from the learning robot's. This means that the learning robot may not be able to discriminate between situations which the other robot can distinguish, and vice versa. Therefore, the learner cannot predict the other robot's behaviors correctly, even if its policy is fixed, unless explicit communication is available. It is important for the learner to understand the strategies of the other robots and to predict their movements in advance in order to learn the behaviors successfully.

There are several approaches to the multiagent learning problem (e.g., [11], [17]) which utilize the current sensor outputs as states, and which therefore cannot cope with changes of the current situation. Further, they need well-defined attributes (state vectors) in order for the learning to converge correctly. However, it is generally difficult to find such attributes in advance. Therefore, a modeling architecture (state vector estimation) is required to make reinforcement learning applicable.

In this paper, we propose a method which finds the relationships between the behaviors of the learner and the other agents through interactions (observation and action), using the method of system identification. In order to construct the local predictive model of other agents, we apply Akaike's Information Criterion (AIC) [1] to the results of Canonical Variate Analysis (CVA) [10], which is widely used in the field of system identification. The local predictive model is based on the observation and action of the learner (observer). We apply the proposed method to a simple soccer-like game. The task of the robot is to shoot a ball which is passed back from the other robot. Because the environment consists of a stationary agent (the goal), a passive agent (the ball) and an active agent (the opponent), the learner has to construct appropriate models for all of these agents. After the learning robot identifies the models, reinforcement learning is applied in order to acquire purposive behaviors. The proposed method can cope with a moving ball because the state vector is estimated appropriately to predict its motion in the image. Simulation results and real experiments are shown and a discussion is given.

2 Constructing the internal model from observation and action

2.1 Local predictive model of other agents

In order to make the learning successful, it is necessary for the learning agent to estimate appropriate state vectors. However, the agent cannot obtain complete information to estimate them because of the partial observation due to the limitation of its sensing capability. What the learning agent can do, then, is to collect all the observed data together with the motor commands taken during the observation and to find the relationship between the observed agents and the learner's behaviors, so that it can take an adequate behavior even though it may not be guaranteed to be optimal. In the following, we consider utilizing a method of system identification, regarding the previously observed data and the motor commands as the input and the future observation as the output of the system.

[Figure 1: An overview of the learning architecture — the local predictive model (order estimator and state estimator) converts observations and actions into an n-dimensional state vector, which Q-learning uses, together with the reward, to select actions in the environment.]

Figure 1 shows an overview of the proposed learning architecture, consisting of an estimator of the local predictive models and a reinforcement learning method. First, the learning agent collects sequences of sensor outputs and motor commands to construct the local predictive models, which are described in section 2.2. By approximating the relationship between the learner's actions and the resultant observations, the local predictive model gives the learning agent not only the successive state of each agent but also the priority of the state vector components, that is, their validity with respect to prediction.

2.2 Canonical Variate Analysis (CVA)

A number of algorithms to identify multi-input multi-output (MIMO) combined deterministic-stochastic systems have been proposed. Among them, Larimore's Canonical Variate Analysis (CVA) [10] is one representative method, which uses canonical correlation analysis to construct a state estimator.

Let $u(t) \in \mathbb{R}^m$ and $y(t) \in \mathbb{R}^q$ be the input and output generated by the unknown system

\[
x(t+1) = A x(t) + B u(t) + w(t), \qquad
y(t) = C x(t) + D u(t) + v(t), \tag{1}
\]

with

\[
E\left\{
\begin{bmatrix} w(t) \\ v(t) \end{bmatrix}
\begin{bmatrix} w^T(\tau) & v^T(\tau) \end{bmatrix}
\right\}
=
\begin{bmatrix} Q & S \\ S^T & R \end{bmatrix} \delta_{t\tau},
\]

and $A, Q \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{q \times n}$, $D \in \mathbb{R}^{q \times m}$, $S \in \mathbb{R}^{n \times q}$, $R \in \mathbb{R}^{q \times q}$. $E\{\cdot\}$ denotes the expected value operator and $\delta_{t\tau}$ the Kronecker delta. $v(t) \in \mathbb{R}^q$ and $w(t) \in \mathbb{R}^n$ are unobserved, Gaussian-distributed, zero-mean, white noise vector sequences.
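To make Eq. (1) concrete, the following is a minimal simulation sketch of such a system; the dimensions, system matrices, and noise levels are illustrative assumptions, not values identified in the paper.

```python
# Minimal simulation sketch of the system in Eq. (1).  The dimensions,
# system matrices, and noise levels below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 4, 2, 3                          # assumed state, input, output dimensions

A = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((q, n))
D = 0.1 * rng.standard_normal((q, m))

def simulate(u_seq, w_std=0.01, v_std=0.05):
    """Return y(t) = C x(t) + D u(t) + v(t), with x(t+1) = A x(t) + B u(t) + w(t)."""
    x = np.zeros(n)
    ys = []
    for u in u_seq:
        ys.append(C @ x + D @ u + v_std * rng.standard_normal(q))
        x = A @ x + B @ u + w_std * rng.standard_normal(n)
    return np.array(ys)

T = 300
u_seq = rng.integers(-1, 2, size=(T, m)).astype(float)   # quantized commands, cf. Sec. 4
y_seq = simulate(u_seq)                                    # observation sequence, shape (300, 3)
```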

CVA uses a new vector µ, which is a linear combination of the previous input-output sequences, since it is difficult to determine the dimension of x directly. Eq. (1) is transformed as follows:

\[
\begin{bmatrix} \mu(t+1) \\ y(t) \end{bmatrix}
= \Theta
\begin{bmatrix} \mu(t) \\ u(t) \end{bmatrix}
+
\begin{bmatrix} T^{-1} w(t) \\ v(t) \end{bmatrix}, \tag{2}
\]

where

\[
\hat{\Theta} =
\begin{bmatrix} T^{-1} A T & T^{-1} B \\ C T & D \end{bmatrix}, \tag{3}
\]

and $x(t) = T\mu(t)$. The CVA method proceeds as follows.

1. For $\{u(t), y(t)\}$, $t = 1, \cdots, N$, construct the new vectors

\[
p(t) =
\begin{bmatrix}
u(t-1) \\ \vdots \\ u(t-l) \\ y(t-1) \\ \vdots \\ y(t-l)
\end{bmatrix},
\qquad
f(t) =
\begin{bmatrix}
y(t) \\ y(t+1) \\ \vdots \\ y(t+k-1)
\end{bmatrix}.
\]

2. Compute the estimated covariance matrices $\hat{\Sigma}_{pp}$, $\hat{\Sigma}_{pf}$ and $\hat{\Sigma}_{ff}$, where $\hat{\Sigma}_{pp}$ and $\hat{\Sigma}_{ff}$ are regular (nonsingular) matrices.

3. Compute the singular value decomposition

\[
\hat{\Sigma}_{pp}^{-1/2} \hat{\Sigma}_{pf} \hat{\Sigma}_{ff}^{-1/2}
= U_{aux} S_{aux} V_{aux}^T, \tag{4}
\]

\[
U_{aux} U_{aux}^T = I_{l(m+q)}, \qquad V_{aux} V_{aux}^T = I_{kq},
\]

and define $U$ as

\[
U := U_{aux}^T \hat{\Sigma}_{pp}^{-1/2}.
\]

4. The $n$-dimensional new vector $\mu(t)$ is defined as

\[
\mu(t) = \begin{bmatrix} I_n & 0 \end{bmatrix} U p(t). \tag{5}
\]

5. Estimate the parameter matrix Θ by applying the least-squares method to Eq. (2).
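A minimal numerical sketch of steps 1–5, assuming numpy and the illustrative choices l = 2, k = 2, n = 4 (in the paper, n is chosen by AIC and l by the log|R̂| test of section 2.3), could look as follows; the symmetric inverse square root used for the normalization is one possible choice.

```python
# Minimal numerical sketch of CVA steps 1-5, applied to a recorded
# input/output sequence (e.g. u_seq, y_seq from the simulation sketch).
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cva_state_sequence(u_seq, y_seq, l=2, k=2, n=4):
    N, m = u_seq.shape
    _, q = y_seq.shape
    ts = range(l, N - k + 1)

    # Step 1: stack past vectors p(t) and future vectors f(t).
    P = np.array([np.concatenate([u_seq[t - l:t][::-1].ravel(),
                                  y_seq[t - l:t][::-1].ravel()]) for t in ts])
    F = np.array([y_seq[t:t + k].ravel() for t in ts])

    # Step 2: estimated covariance matrices of past and future.
    Spp, Sff, Spf = P.T @ P / len(P), F.T @ F / len(F), P.T @ F / len(P)

    # Step 3: SVD of the normalized cross-covariance, U := Uaux^T Spp^(-1/2).
    Spp_ih, Sff_ih = inv_sqrt(Spp), inv_sqrt(Sff)
    Uaux, _, _ = np.linalg.svd(Spp_ih @ Spf @ Sff_ih)
    U = Uaux.T @ Spp_ih

    # Step 4: n-dimensional state estimate mu(t) = [I_n 0] U p(t).
    mu = (U[:n] @ P.T).T

    # Step 5: least squares for Theta in [mu(t+1); y(t)] = Theta [mu(t); u(t)].
    t_idx = np.arange(l, N - k)                       # times with both mu(t) and mu(t+1)
    lhs = np.hstack([mu[1:], y_seq[t_idx]])
    rhs = np.hstack([mu[:-1], u_seq[t_idx]])
    Theta_T, *_ = np.linalg.lstsq(rhs, lhs, rcond=None)
    return mu, Theta_T.T
```

For example, `mu, Theta = cva_state_sequence(u_seq, y_seq)` applied to the sequences from the earlier simulation sketch returns the estimated state sequence µ(t) and the parameter matrix of Eq. (2).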

Strictly speaking, all the agents do in fact interact with each other; therefore, the learning agent should construct the local predictive model taking these interactions into account. However, it is intractable to collect adequate input-output sequences and estimate a proper model, because the dimension of the state vector increases drastically. Therefore, the learning (observing) agent applies the CVA method to each (observed) agent separately.

2.3 Determining the dimension for other agents

When we apply CVA to the classification of agents, it is important to decide the dimension n of the state vector x and the lag l, which determines how much historical information is used in constructing the state vector. Although the estimation improves as l grows, more historical information must be stored; it is therefore desirable to keep l as small as possible with respect to memory size. For n, complex behaviors of other agents can be captured by choosing the order high enough.

In order to determine n, we apply Akaike's Information Criterion (AIC), which is widely used in the field of time series analysis. AIC is a method for balancing precision against computation (the number of parameters). Let the prediction error be ε and the covariance matrix of the error be

\[
\hat{R} = \frac{1}{N - k - l + 1} \sum_{t=l+1}^{N-k+1} \varepsilon(t) \varepsilon^T(t).
\]

Then AIC(n) is calculated as

\[
\mathrm{AIC}(n) = (N - k - l + 1) \log |\hat{R}| + 2 \lambda(n), \tag{6}
\]

where λ is the number of parameters. The optimal dimension n* is defined as

\[
n^* = \arg\min_n \mathrm{AIC}(n).
\]

Since the reinforcement learning algorithm is applied to the estimated state vector in order to cope with the nonlinearity and the modeling error, the learning agent does not have to construct a strict local predictive model. However, the parameter l is not determined by AIC(n). Therefore, we utilize log|R̂| to determine l, as follows.

1. Memorize the q-dimensional observation vector y(t) of the agent and the m-dimensional motor command vector u(t).

2. For l = 1, 2, ..., identify the system from the obtained data:

(a) if log|R̂| < 0, stop the procedure and determine n based on AIC(n);

(b) otherwise, increment l until condition (a) is satisfied or AIC(n) no longer decreases.
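A sketch of this selection procedure, reusing cva_state_sequence() from the previous sketch, is given below. Counting λ(n) as the number of entries of Θ, i.e. (n+q)(n+m), and checking condition (a) on the best-AIC model for each l are assumptions made for illustration; the paper only states that λ is the number of parameters.

```python
# Sketch of the order/lag selection procedure of Sec. 2.3.
import numpy as np

def select_order(u_seq, y_seq, k=2, max_l=5, max_n=8):
    N, m = u_seq.shape
    _, q = y_seq.shape
    best_n = 1
    for l in range(1, max_l + 1):
        best = None
        for n in range(1, min(max_n, l * (m + q)) + 1):
            mu, Theta = cva_state_sequence(u_seq, y_seq, l=l, k=k, n=n)
            # one-step prediction error of y(t) from [mu(t); u(t)] (last q rows of Theta)
            t_idx = np.arange(l, N - k)
            y_pred = np.hstack([mu[:-1], u_seq[t_idx]]) @ Theta[-q:].T
            eps = y_seq[t_idx] - y_pred
            R_hat = eps.T @ eps / (N - k - l + 1)
            log_det_R = np.linalg.slogdet(R_hat)[1]
            aic = (N - k - l + 1) * log_det_R + 2 * (n + q) * (n + m)
            if best is None or aic < best[0]:
                best = (aic, n, log_det_R)
        _, best_n, log_det_R = best
        if log_det_R < 0:            # condition (a): prediction error small enough
            return l, best_n
    return max_l, best_n             # condition (a) never met within max_l
```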

3 Reinforcement Learning

After estimating the state space model given by Eq. (2), the agent begins to learn behaviors using a reinforcement learning method. Q-learning [18] is a form of reinforcement learning based on stochastic dynamic programming. It provides robots with the capability of learning to act optimally in a Markovian environment. In the previous section, the appropriate dimension n of the state vector µ(t) was determined and the successive state can be predicted. Therefore, we can regard the environment as Markovian.

[Figure 2: The environment and our mobile robot (passer and shooter)]
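As a concrete illustration of this step, the following is a minimal tabular Q-learning sketch over a quantized version of the estimated state vector µ(t). The uniform-bin quantizer and the learning parameters are assumptions made for illustration; the paper leaves the quantization procedure of the estimated state vectors open (see section 6).

```python
# Minimal tabular Q-learning sketch over a quantized estimated state vector.
import numpy as np
from collections import defaultdict

N_ACTIONS = 7                        # the 7 motor commands of Sec. 4
rng = np.random.default_rng(0)
Q = defaultdict(lambda: np.zeros(N_ACTIONS))

def quantize(mu, bin_width=0.5):
    """Map a continuous estimated state vector to a discrete table key (assumed quantizer)."""
    return tuple(np.floor(np.asarray(mu) / bin_width).astype(int))

def select_action(mu_t, epsilon=0.1):
    """Epsilon-greedy action selection over the learned table."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[quantize(mu_t)]))

def q_update(mu_t, a, reward, mu_next, alpha=0.25, gamma=0.9):
    """One Q-learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    s, s_next = quantize(mu_t), quantize(mu_next)
    Q[s][a] += alpha * (reward + gamma * Q[s_next].max() - Q[s][a])
```

Here select_action() and q_update() would be called once per control step, with the reward of 1 or 0 described in section 5.1.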

4 Task and Assumptions

We apply the proposed method to a simple soccer-like game including two agents (Figure 2). Each agent has a single color TV camera and does not know the location, the size, or the weight of the ball, the other agent, any camera parameters such as focal length and tilt angle, or its own kinematics/dynamics. The agents move around using a 4-wheel steering system. As motor commands, each agent has 7 actions such as go straight, turn right, turn left, stop, and go backward. The input u is then defined as the 2-dimensional vector

\[
u^T = [v \;\; \phi], \qquad v, \phi \in \{-1, 0, 1\},
\]

where v and φ are the motor velocity and the steering angle, respectively, both of which are quantized.
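For illustration only, the following sketch enumerates the quantized command grid implied by this definition; the paper's action set contains 7 of these 9 (v, φ) combinations, and which two are excluded is not stated, so the listing below is just the candidate grid.

```python
# Candidate quantized motor commands u^T = [v, phi], v, phi in {-1, 0, 1}.
# The paper uses 7 of these 9 pairs; the excluded two are not specified here.
from itertools import product

CANDIDATE_COMMANDS = list(product((-1, 0, 1), repeat=2))   # (v, phi) pairs
print(len(CANDIDATE_COMMANDS))                             # -> 9
```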

The output (observed) vectors are shown in Figure 3. The dimensions of the observed vectors for the ball, the goal, and the other robot are 4, 11, and 5, respectively.

5 Experimental Results

5.1 Simulation Results

Table 1 shows the result of identification. In order to predict the successive situation, l = 1 is sufficient for the goal, while the ball needs 2 steps.

[Figure 3: Image features of the ball, goal, and agent (center positions, area, radius, width, height, and the goal's four corner positions in the image)]

Table 1: The estimated dimension (computer simulation)

    agent              l   n   log|R|   AIC
    goal               1   2   -0.001   121
    ball               2   4    0.232   138
    random walk        3   6    1.22    232
    move to the ball   3   6   -0.463    79

The motion of the random-walk agent cannot be correctly predicted, as a matter of course, while the move-to-the-ball agent is identified with the same dimension as the random-walk agent; its prediction error, however, is much smaller than that of the random walk.

Table 2 shows the success rates of the shooting and passing behaviors compared with the results of our previous work [6], in which only the current sensor information is used as the state vector. We assign a reward value of 1 when the robot achieves the task, and 0 otherwise. If the learning agent uses only the current information about the ball and the goal, it cannot acquire the optimal behavior when the ball is rolling. In other words, the action value function does not become stable because the state and action spaces are not consistent with each other.

Table 2: Comparison between the proposed method and using current information

    state vector       success of shooting (%)   success of passing (%)
    current position   10.2                       9.8
    using CVA          78.5                      53.2

5.2 Real Experiments

We have constructed a radio control system for the robot, following the remote-brain project by Inaba et al. [9].

[Figure 4: A configuration of the real system (soccer robot with TV camera and UHF transmitter; UHF antenna and tuner, MaxVideo/DigiColor image processing on a VME box with MC68040 and parallel I/O, SPARCstation 20, Sun workstation, monitor, and R/C transmitter)]

Table 3: The estimated dimension (real environment)

    from the shooter
        agent     l   n   log|R|   AIC
        ball      4   4    1.88     284
        goal      1   3   -1.73    -817
        passer    5   4    3.43     329

    from the passer
        agent     l   n   log|R|   AIC
        ball      4   4    1.36     173
        shooter   5   4    2.17     284

Figure 4 shows the configuration of the real mobile robot system. The image taken by the TV camera mounted on the robot is transmitted to a UHF receiver and processed by a Datacube MaxVideo 200, a real-time pipeline video image processor. In order to simplify and speed up the image processing, we painted the ball, the goal, and the opponent red, blue, and yellow, respectively. The input NTSC color video signal is first converted into HSV color components in order to make the extraction of the objects easy. The image processing and the vehicle control system run under the VxWorks OS on MC68040 CPUs, which are connected to host Sun workstations via Ethernet. The tilt angle is about −26 [deg] so that the robot can see the environment effectively. The horizontal and vertical visual angles are about 67 [deg] and 60 [deg], respectively.

The task of the passer is to pass a ball to the shooter, while the task of the shooter is to shoot the ball into the goal.

[Figure 5: Prediction errors in the real environment — error [pixel] vs. time steps [1/30 s]; (a) y position of the ball, (b) y position of the left-upper corner of the goal]

Table 3 and Figure 5 show the experimental results. The values of l for the ball and the agent are larger than in the computer simulation because of the noise in the image processing and the dynamics of the environment, due to factors such as the eccentricity of the ball's centroid. Even though the local predictive models of the same ball obtained by each agent are similar (n = 4, with only slight differences in log|R| and AIC; see Table 3), the estimated state vectors differ from each other because of differences in several factors such as the tilt angle, the motor velocity, and the steering angle. We checked what happens if the local predictive models of the passer and the shooter are exchanged: large prediction errors were observed on both sides. Therefore, the local predictive models cannot be exchanged between physical agents. Figure 6 shows a sequence of images in which the shooter shoots a ball kicked by the passer.

6 Conclusion

This paper has presented a method of behavior acquisition for applying reinforcement learning to an environment including other agents. Our method takes account of the trade-off among the precision of prediction, the dimension of the state vector, and the number of steps to predict ahead. Spatial quantization of the image into objects was easily solved by painting the objects in single colors different from each other; rather, the organization of the image features and their temporal segmentation for the purpose of task accomplishment were performed simultaneously by the method. The method is applied to a soccer playing game in which one robot shoots a rolling ball passed by the other robot.

In the proposed method, we still need a quantization procedure for the estimated state vectors. Several segmentation methods, such as the Parti-game algorithm [12] and Asada's method [5], might be promising.

[Figure 6: Acquired behavior — (a) top view of the passer and shooter over four frames; (b) obtained images (left: shooter, right: passer)]

References

[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19:716–723, 1974.

[2] Y. Aloimonos. Introduction: Active vision revisited. In Y. Aloimonos, ed., Active Perception, chapter 0, pp. 1–18. Lawrence Erlbaum Associates, Publishers, 1993.

[3] Y. Aloimonos. Reply: What I have learned. CVGIP: Image Understanding, 60(1):74–85, 1994.

[4] M. Asada. An agent and an environment: A view of "having bodies" – a case study on behavior learning for a vision-based mobile robot. In Proc. of the 1996 IROS Workshop on Towards Real Autonomy, pp. 19–24, 1996.

[5] M. Asada, S. Noda, and K. Hosoda. Action-based sensor space categorization for robot learning. In Proc. of the 1996 IEEE/RSJ International Conference on Intelligent Robots and Systems, 1996.

[6] M. Asada, S. Noda, S. Tawaratsumida, and K. Hosoda. Purposive behavior acquisition for a real robot by vision-based reinforcement learning. Machine Learning, 23:279–303, 1996.

[7] J. H. Connell and S. Mahadevan. Robot Learning. Kluwer Academic Publishers, 1993.

[8] S. Edelman. Reply: Representation without reconstruction. CVGIP: Image Understanding, 60(1):92–94, 1994.

[9] M. Inaba. Remote-brained robotics: Interfacing AI with real world behaviors. In Preprints of ISRR'93, Pittsburgh, 1993.

[10] W. E. Larimore. Canonical variate analysis in identification, filtering, and adaptive control. In Proc. of the 29th IEEE Conference on Decision and Control, pp. 596–604, Honolulu, Hawaii, December 1990.

[11] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc. of the 11th International Conference on Machine Learning, pp. 157–163, 1994.

[12] A. W. Moore and C. G. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21:199–233, 1995.

[13] T. Nakamura and M. Asada. Motion sketch: Acquisition of visual motion guided behaviors. In Proc. of the 14th International Joint Conference on Artificial Intelligence, pp. 126–132. Morgan Kaufmann, 1995.

[14] T. Nakamura and M. Asada. Stereo sketch: Stereo vision-based target reaching behavior acquisition with occlusion detection and avoidance. In Proc. of the IEEE International Conference on Robotics and Automation, pp. 1314–1319, 1996.

[15] G. Sandini. Vision during action. In Y. Aloimonos, ed., Active Perception, chapter 4, pp. 151–190. Lawrence Erlbaum Associates, Publishers, 1993.

[16] G. Sandini and E. Grosso. Reply: Why purposive vision. CVGIP: Image Understanding, 60(1):109–112, 1994.

[17] E. Uchibe, M. Asada, and K. Hosoda. Behavior coordination for a mobile robot using modular reinforcement learning. In Proc. of the 1996 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1329–1336, 1996.

[18] C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8:279–292, 1992.

[19] S. D. Whitehead and D. H. Ballard. Active perception and reinforcement learning. In Proc. of the Workshop on Machine Learning, pp. 179–188. Morgan Kaufmann, 1990.