
Robotics and Autonomous Systems 54 (2006) 403–408
www.elsevier.com/locate/robot

A cognitive framework for imitation learning

A. Chella a,*, H. Dindo a, I. Infantino b

a Dipartimento di Ingegneria Informatica, Università degli Studi di Palermo, Viale delle Scienze Ed. 6, 90128 Palermo, Italy
b Istituto di Calcolo e Reti ad Alte Prestazioni, Consiglio Nazionale delle Ricerche, Viale delle Scienze Ed. 11, 90128 Palermo, Italy

Available online 13 March 2006

Abstract

In order to have a robotic system able to effectively learn by imitation, and not merely reproduce the movements of a human teacher, the system should be capable of deeply understanding the perceived actions to be imitated. This paper deals with the development of a cognitive architecture for learning by imitation in which a rich conceptual representation of the observed actions is built. The purpose of the following discussion is to show how this Conceptual Area can be employed to efficiently organize perceptual data, to learn movement primitives from human demonstration and to generate complex actions by combining and sequencing simpler ones. The proposed architecture has been tested on a robotic system composed of a PUMA 200 industrial manipulator and an anthropomorphic robotic hand.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Imitation learning; Conceptual spaces; Cognitive robotics; Intelligent manipulation

1. Introduction

Although the control of robotic systems has reached a high level of precision and accuracy, complexity and task specificity are limiting factors for large-scale use. A promising approach towards effective robot programming is the learning by imitation paradigm (for reviews on different aspects of imitation see [16–18]). The development of robotic architectures which can successfully benefit from the learning by imitation approach involves several difficult research problems outlined in [5,9], and different aspects of the problem have been addressed in many working systems in recent years [2–4,13,19,20]. However, the main emphasis is usually given to the problem of how to map the observed movement of a teacher onto the movement apparatus of the robot. While this is a fundamental aspect of any robotic system which copes with imitation, we claim that in order to effectively learn by imitation, the system itself should have the capability to effectively “understand” its environment and the perceived actions to be imitated. In our approach, understanding involves the generation of a high-level, declarative description of the perceived world, along the lines of [11] and [15].

* Corresponding author. E-mail address: [email protected] (A. Chella).


We propose a cognitive architecture which tightly links visual perception with knowledge representation in the context of imitation learning. Our proposal is based on the hypothesis that a principled integration of the approaches of artificial vision and of symbolic knowledge requires the introduction of an intermediate level between the two. Such a role is played by conceptual spaces, according to the approach proposed by Gärdenfors [12]. The core of the architecture is a rich inner Conceptual Area where perceptual data acquired by a real-time vision system are depicted. The purpose of the following discussion is to show how this Conceptual Area can be employed to efficiently organize perceptual data, to learn movement primitives from human demonstration and to represent and generate complex tasks, which are automatically segmented into several simpler tasks.

To validate our approach we performed several experiments on an arm/hand robotic system equipped with a video camera. The robot learns simple manipulative capabilities shown by different human teachers. To allow cognitive processing of the perceived data, the system reconstructs and records all the relevant information in a scene: the teacher's external coordinates (Cartesian position and orientation) and the properties of the objects in the scene (position, shape, color). It then represents this information, through conceptual spaces, in a high-level Linguistic Area. Such representations may be employed to perform


Fig. 1. The computational areas of the proposed architecture and their relation with the external world.

high-level inferences, e.g. those needed to generate complex long-range plans, or to perform reasoning about the acquired movements.

The rest of the paper is organized as follows. In the next section the cognitive architecture is summarized and a detailed description of the conceptual representation is given. Section 3 clarifies the concept of learning in conceptual spaces with emphasis on imitation learning. Section 4 shows an implementation of the architecture on a robotic platform. Finally, Section 5 gives brief conclusions and outlines future work.

2. The cognitive architecture

The proposed architecture is organized in three computational areas as shown in Fig. 1. The Subconceptual Area is concerned with the low-level processing of perceptual data coming from vision sensors through a set of built-in visual mechanisms (color segmentation, feature extraction). It is also responsible for the control of the robotic system. We call this area subconceptual because a conceptual categorization of the information is not yet performed.

In the Conceptual Area, the information from the Subconceptual Area is organized into conceptual categories through conceptual spaces with an increasing level of complexity, in which different properties of the perceived scene are captured. However, entities in the Conceptual Area are purely quantitative and still independent from any linguistic characterization. In the Linguistic Area, representation and processing are based on a high-level symbolic language. The symbols in this area are anchored to sensory data by mapping them on appropriate representations in the different layers of the Conceptual Area.

The same architecture is able both to analyze the observed task to be imitated (observation mode) and to perform the imitation of the given task (imitation mode). For the purposes of the present discussion, we consider a simplified bidimensional world populated with various simple objects (such as squares and circles) in which observation/imitation takes place. The choice of the 2D world allows us to focus on the architectural issues and to clarify the proposed framework, while avoiding the unnecessary burden associated with the description of 3D computational processes.

During the observation mode, the user shows her/his hand while performing arbitrary tasks of manipulating objects in front of a single calibrated camera. The task is then segmented into meaningful units and its properties (object position and orientation, color and shape; action types) are represented, through the Conceptual Area, by using high-level symbolic terms. In the imitation mode, the robotic system performs the same tasks triggered by known stimuli in the external world or by a user's query.

2.1. The Subconceptual Area

In this area low-level processes are performed. Their functionalities differ during the observation and imitation modes. Basically, this area is responsible for: tracking the human hand motion; segmenting the overall action being performed by the user into meaningful atomic units; controlling the robotic system; and detecting and representing the objects in the scene in terms of 2D parametric curves.

2.1.1. The Perceptuomotor Module

The Perceptuomotor Module in the Subconceptual Area is concerned both with the reconstruction of the human hand's posture and with the mapping between human and robot movements. The system performs skin detection in order to detect the human hand and continuously tracks its position and orientation in the image plane. The homography relationship between the image and the working plane of the user is computed during an off-line calibration phase. The module outputs the stimuli relative to the human hand's trajectory in the world coordinate system:

PMM_{hand}(t) = [x_{hand}(t),\ y_{hand}(t),\ \theta_{hand}(t)]^T.    (1)
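As an illustration, the mapping from tracked image coordinates to the plane coordinates of Eq. (1) can be sketched as the application of the calibrated homography. The fragment below is a minimal sketch: the function name and the matrix H_plane are illustrative placeholders, not the paper's actual implementation, and the hand orientation is assumed to be already expressed in the plane frame.

import numpy as np

def hand_stimulus(u, v, theta, H_plane):
    """Map a tracked image point (u, v) to world coordinates on the
    working plane and return the stimulus [x, y, theta]^T of Eq. (1)."""
    p_img = np.array([u, v, 1.0])        # homogeneous pixel coordinates
    p_w = H_plane @ p_img                # apply the 3x3 image-to-plane homography
    x, y = p_w[:2] / p_w[2]              # dehomogenize
    return np.array([x, y, theta])       # orientation assumed already mapped to the plane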

As the scene evolves, changes occur in the world because objects are being grasped by the hand and moved to another location. In order to detect changes in a dynamic scene, we perform motion segmentation on the human hand data by using zero-velocity crossing detection. A zero velocity occurs either when the direction of movement changes or when no motion is present. In order to accurately detect the zero-crossings in the movement data, we first apply a low-pass filter to eliminate high-frequency noise components, and then we search for zero velocities within a threshold. The result is a list of frame indexes, named key-frames, at which object properties (shape and color) are to be processed. Fig. 2 shows the process of extracting the key-frames from the human hand trajectory data.
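A minimal sketch of this key-frame extraction step is given below. The filter order, cut-off frequency, sampling rate and velocity threshold are assumptions made for illustration, since the paper does not report them.

import numpy as np
from scipy.signal import butter, filtfilt

def extract_keyframes(traj, fs=25.0, cutoff=2.0, vel_thresh=1e-2):
    """traj: (T, 3) array of [x, y, theta] hand samples; returns key-frame indices."""
    b, a = butter(2, cutoff / (fs / 2.0))      # 2nd-order low-pass Butterworth (assumed parameters)
    smooth = filtfilt(b, a, traj, axis=0)      # zero-phase filtering of the trajectory
    vel = np.linalg.norm(np.diff(smooth, axis=0), axis=1) * fs
    low = vel < vel_thresh                     # frames where the hand is (nearly) still
    # keep one representative index per contiguous low-velocity segment
    keyframes, in_segment = [], False
    for i, flag in enumerate(low):
        if flag and not in_segment:
            keyframes.append(i)
            in_segment = True
        elif not flag:
            in_segment = False
    return keyframes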

This module is also responsible for the issues related to the mapping of the teacher's internal/external space, e.g. joint angles or Cartesian coordinates of the end-effector, to that of the robot, which is the classical correspondence problem in the imitation literature [1]. In the current implementation we convert the Cartesian coordinates (in terms of x–y position and


Fig. 2. Trajectories of the human hand during an observation phase (position and orientation of the human hand). Vertical lines correspond to extracted key-frames.

orientation) into robot joint angles by means of standard inverse kinematics methods.

2.1.2. Scene reconstruction module

The presence of objects in the external world generates a set of stimuli which are processed in this area by means of hard-wired computations. We perform color segmentation of the image in the HSV color space in order to detect areas which contain the human hand as well as all the objects present in the scene. Machine vision techniques then perform the extraction of the object features.
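The colour segmentation step can be sketched as an HSV thresholding operation. The sketch below assumes OpenCV as the image-processing library, and the HSV range in the usage comment is illustrative rather than the range actually used by the system.

import cv2
import numpy as np

def segment_hsv(frame_bgr, lower, upper):
    """Return a binary mask of pixels whose HSV values lie in [lower, upper]."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower), np.array(upper))
    # small morphological opening to suppress isolated noise pixels
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return mask

# e.g. skin_mask = segment_hsv(frame, (0, 40, 60), (25, 180, 255))   # assumed skin range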

We represent each object in a scene as a superellipse [14]. A superellipse is a closed curve defined by the following equation:

\left(\frac{x}{a}\right)^m + \left(\frac{y}{b}\right)^m = 1    (2)

where a and b are the sizes (positive real numbers) of the major and minor axes and m is a rational number which controls the shape of the curve. For example, if m = 2 and a = b, we get the equation of a circle. For larger m we get gradually more rectangular shapes, until for m → ∞ the curve takes on a rectangular shape. A superellipse in general position requires three additional parameters for expressing the rotation and translation relative to the center of the world coordinate system (x, y, θ). This module outputs a stimulus for each object detected:

SRM_{obj} = [a,\ b,\ m,\ x,\ y,\ \theta,\ H,\ S,\ V]^T    (3)

and the computation of the parameters of the superellipses is done as a curve fitting process. However, since this process is time-consuming, the detection of scene objects is performed only at the key-frames extracted in the Perceptuomotor Module.
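One possible formulation of this curve fitting is a least-squares problem over the stimulus parameters [a, b, m, x, y, θ]: contour points are expressed in the candidate object frame and the residual of the implicit curve is minimised. The sketch below is illustrative; the initialisation, bounds and optimiser choice are assumptions, not necessarily the paper's exact procedure.

import numpy as np
from scipy.optimize import least_squares

def superellipse_residuals(params, pts):
    a, b, m, xc, yc, th = params
    c, s = np.cos(th), np.sin(th)
    # express contour points (N, 2) in the object-centred frame
    x = c * (pts[:, 0] - xc) + s * (pts[:, 1] - yc)
    y = -s * (pts[:, 0] - xc) + c * (pts[:, 1] - yc)
    # residual of the implicit superellipse equation (x/a)^m + (y/b)^m - 1
    return np.abs(x / a) ** m + np.abs(y / b) ** m - 1.0

def fit_superellipse(pts):
    """pts: (N, 2) contour points; returns fitted [a, b, m, x, y, theta]."""
    x0 = [np.ptp(pts[:, 0]) / 2, np.ptp(pts[:, 1]) / 2, 2.0,
          pts[:, 0].mean(), pts[:, 1].mean(), 0.0]      # assumed initial guess
    res = least_squares(superellipse_residuals, x0, args=(pts,),
                        bounds=([1e-3, 1e-3, 0.5, -np.inf, -np.inf, -np.pi],
                                [np.inf, np.inf, 20.0, np.inf, np.inf, np.pi]))
    return res.x   # to be concatenated with the HSV colour to form the stimulus of Eq. (3)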

2.2. The Conceptual Area

The conceptual spaces are based on a geometric treatment of concepts in which concepts are not independent of each other but structured into domains [12], e.g. color, shape, kinematic and dynamic quantities. Each domain is defined in terms of a set of quality dimensions, e.g. the color domain may be described in terms of hue/saturation/value quantities while the kinematics domain is described by joint angles. In a conceptual space, objects and their temporal and spatial evolution are represented as points describing the quality dimensions of each domain. It is possible to measure the similarity between different individuals in the space by introducing suitable metrics. In the context of the present architecture we propose two conceptual spaces capturing different temporal and spatial aspects of the scene, as described in the following.
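A minimal data-structure sketch of a conceptual-space point with named domains and a similarity measure is given below. The weighted Euclidean metric and the class name are illustrative assumptions; the paper only states that a suitable metric is introduced.

import numpy as np

class ConceptualPoint:
    def __init__(self, domains):
        # domains: dict mapping a domain name (e.g. 'shape', 'color') to a
        # vector of quality-dimension values
        self.domains = {k: np.asarray(v, dtype=float) for k, v in domains.items()}

    def similarity(self, other, weights=None):
        """Weighted Euclidean distance over shared domains (smaller = more similar)."""
        weights = weights or {}
        return sum(weights.get(d, 1.0) * np.linalg.norm(self.domains[d] - other.domains[d])
                   for d in self.domains if d in other.domains)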

2.2.1. The Perceptual Space

The Perceptual Space is a metric space whose dimensions are strictly related to the quantities processed in the Subconceptual Area. Points in the Perceptual Space are geometric primitives describing the acquired scene (object shape and position descriptors, joint angles, 3D geometric parameters, etc.). A complete description of the Perceptual Space may be found in [6].

Points in the Perceptual Space are computed starting from the stimuli obtained during the subconceptual processing stage. In the current implementation, the quality dimensions of our Perceptual Space are given by the shape and color properties of the perceived objects:

PS_{obj} = [a_{obj},\ b_{obj},\ m_{obj},\ h_{obj},\ s_{obj},\ v_{obj}]^T.    (4)

Therefore, this space captures the shape and color properties of objects. This is useful in assigning high-level terms to objects, such as square, circle, red, green. Each point in the Perceptual Space is assigned a unique identifier, ps_id, together with the spatial information (x, y, θ) and the key-frame at which it has been recorded.

2.2.2. Action Space

In the Perceptual Space, moving objects should be represented as a set of points corresponding to each time sample. However, such a representation does not capture the motion in its wholeness. In order to account for the dynamic aspects of the perceived scene, we propose an Action Space in which each point represents the simple motion of entities. An action corresponds to a “scattering” from one situation to another in the conceptual space. However, the decision on which kind of motion can be considered simple is not straightforward, and it is strictly related to the problem of motion segmentation, as described in Section 2.1.1. An immediate choice is to consider “simple” any motion of the human hand occurring between key-frames.

In order to represent simple motions, it is possible to choose a set of basis functions in terms of which any simple motion can be expressed. Such functions can be associated with the axes of the dynamic conceptual space as its dimensions. Therefore, from the mathematical point of view, the resulting Action Space is a functional space. A single point in the space therefore describes a simple motion of the human hand. In the context of the present work we choose the technique of


Principal Component Analysis (PCA) in order to obtain the set of basis functions.

We first obtain a set of training trajectories, \varphi(t) = [x(t),\ y(t),\ \theta(t)]^T, performed by different users. Human hand trajectories are labelled according to the type of motion performed, e.g. move_forward, move_up. In order to account for different temporal durations of simple motions, we interpolate the data to a fixed length. We noted that generally the first six principal components account for most of the data variance, and these are used to encode the simple human hand motions.
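The construction of the Action Space basis can be sketched as follows: each training trajectory is resampled to a fixed length, the trajectories are stacked, and PCA retains the first six principal components as basis functions. The use of scikit-learn and the fixed resampling length are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

def build_action_basis(trajectories, n_samples=100, n_components=6):
    """trajectories: list of (T_i, 3) arrays [x, y, theta]; returns (basis, pca)."""
    resampled = []
    for tr in trajectories:
        t_old = np.linspace(0.0, 1.0, len(tr))
        t_new = np.linspace(0.0, 1.0, n_samples)
        # interpolate each coordinate to a common temporal length, then flatten
        resampled.append(np.column_stack(
            [np.interp(t_new, t_old, tr[:, j]) for j in range(tr.shape[1])]).ravel())
    X = np.vstack(resampled)                   # one flattened trajectory per row
    pca = PCA(n_components=n_components).fit(X)
    return pca.components_, pca                # rows play the role of the basis functions phi_i(t)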

The principal components form the basis functions, \varphi_i(t), in terms of which each hand trajectory can be represented as a linear combination:

\varphi(t) = \sum_{i=1}^{k} c_i \varphi_i(t)    (5)

where c_i are the projection coefficients, and the index k refers to the number of principal components (in our case, k = 6). Hence, the quality dimensions of the Action Space are given by the PCA coefficients:

AS = [c_1,\ c_2,\ \ldots,\ c_6]^T.    (6)

However, principal components are also useful as the basis for generating novel trajectories that resemble those seen during the training phase. This is done as a process of data interpolation: given the starting, \varphi(t_0), and ending, \varphi(t_f), hand configurations, it is possible to determine the coefficients [c_1, \ldots, c_6]^T which satisfy the boundary conditions by solving the set of six (linear) equations in six unknowns:

\varphi(t_0) = \sum_{i=1}^{k} c_i \varphi_i(t_0)

\varphi(t_f) = \sum_{i=1}^{k} c_i \varphi_i(t_f).    (7)
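Eq. (7) amounts to a 6 × 6 linear system, since each of the two boundary configurations contributes three scalar constraints (x, y, θ). A minimal sketch of solving it for the coefficients c_1, ..., c_6 and reconstructing the trajectory is given below; variable names are illustrative.

import numpy as np

def generate_trajectory(basis, start, goal):
    """basis: list of six (T, 3) basis trajectories; start, goal: length-3 boundary poses."""
    # each basis function contributes 3 values at t0 and 3 values at tf -> 6x6 system
    A = np.column_stack([np.concatenate([phi[0], phi[-1]]) for phi in basis])
    b = np.concatenate([start, goal])
    c = np.linalg.solve(A, b)                            # coefficients c_1..c_6
    return sum(ci * phi for ci, phi in zip(c, basis))    # reconstructed (T, 3) trajectory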

2.3. The Linguistic Area

The more “abstract” forms of reasoning, that are less perceptually constrained, are likely to be performed mainly within the Linguistic Area. The elements of the Linguistic Area are symbolic terms that have the role of summarizing perceptions and actions represented in the conceptual spaces previously described, i.e., linguistic terms are anchored to the structures in the conceptual spaces [7].

We adopted a first-order logic language in order to implement the Linguistic Area. Perceptions and actions are represented as first-order terms. The semantics of the language is given by structures in conceptual spaces according to Gärdenfors [12]. Individual constants are mapped into points in conceptual spaces, one-place predicates are mapped into sets of points in conceptual spaces, and two-place predicates are mapped into couples of sets of points in conceptual spaces. In particular, terms related to actions are interpreted in the Action Space, while terms related to objects are interpreted in the Perceptual Space.

Each complex task may be described as a sequence of primitive actions:

reach(ps1_id), grasp(ps1_id), transfer_relative_to(ps2_id),

in the case of relative positioning tasks; or

reach(ps_id), grasp(ps_id), transfer_absolute(x, y, θ),

in the case of absolute positioning tasks.

Reach and transfer actions are compositions of one or more simple actions: move_forward, move_left, etc.
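For illustration, such task descriptions can be represented as plain sequences of (predicate, argument) terms over Perceptual Space identifiers. The tuple encoding below is an assumed, simplified stand-in for the first-order language of the Linguistic Area; the identifiers and the target pose are placeholders.

# relative positioning task: place object ps1_id next to object ps2_id
relative_task = [
    ("reach", "ps1_id"),
    ("grasp", "ps1_id"),
    ("transfer_relative_to", "ps2_id"),
]

# absolute positioning task: place object ps_id at a given pose on the working plane
absolute_task = [
    ("reach", "ps_id"),
    ("grasp", "ps_id"),
    ("transfer_absolute", (0.30, 0.15, 0.0)),   # illustrative target (x, y, theta)
]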

3. Imitation learning

Learning and recognition capabilities of our architecture are strictly related to the structure of the conceptual spaces, which allows us to manage concepts and their relationships by using the language of geometry [12]. It is worthwhile noting that a special role in conceptual spaces is played by the so-called natural concepts, which correspond to convex regions in the space. Hence, it is possible to account for prototypical effects: given a convex region representing a natural concept, the central points of the region correspond to “more typical” instances of the category, while the peripheral points correspond to “less typical” instances.

Natural actions can be managed in conceptual spaces by generating suitable clusters in the Action Space by using clustering techniques [10]. In particular, the Action Space may be clustered in order to associate different types of actions, act_type, performed by the human hand (move_forward, move_left, etc.).
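A minimal sketch of this clustering step, using k-means over the PCA coefficient vectors (the algorithm mentioned in Section 4), is given below. The number of clusters and the library choice are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def cluster_action_space(coeffs, n_actions=4):
    """coeffs: (N, 6) array of Action Space points (PCA coefficients), one row per observed motion."""
    km = KMeans(n_clusters=n_actions, n_init=10, random_state=0).fit(coeffs)
    return km    # km.predict(new_coeffs) labels a new motion with its act_type cluster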

Learning of natural concepts creates the basis for the imitative capabilities of our system: the architecture is capable of imitating complex tasks by sequencing and combining simple ones. This can be done by symbolic reasoning in the Linguistic Area.

In addition, the shape and color properties of objects are associated with suitable clusters in the Perceptual Space.

4. The system at work

Our robotic set-up is composed of a conventional six-degrees-of-freedom PUMA 200 manipulator and an anthropomorphic four-fingered robotic hand. The experiments we performed are concerned with the problem of teaching the robotic system several manipulative tasks. This is done by observing a human teacher performing the task in a simplified bidimensional world (observation mode). The human hand is continuously tracked and its configuration (position and orientation) is recorded. The Perceptuomotor Module performs the segmentation of the task into key-frames, and the scene reconstruction module is invoked in order to obtain the objects' properties (shape, color) to be encoded into the Perceptual Space, in addition to the spatial properties (position and orientation). By using the PCA, the motion of the human hand is encoded into the Action Space and recognized as belonging


Fig. 3. Sequence of actions performed by the human user. The human hand is continuously tracked and its position and orientation are recorded. The zero-velocity crossing algorithm performs the detection of key-frames, and object properties (position, color and shape) are extracted and depicted in the Perceptual Space. The Action Space encodes the trajectory performed by the user between consecutive key-frames.

to one of the clusters. Fig. 4 shows the clusters obtained and the four action primitives identified by the k-means algorithm.

Having all this information, terms in the Linguistic Area are obtained, assigning objects and actions symbolic names as described before (Section 3). From the description in the conceptual spaces, a symbolic description of the task being performed is inferred and terms are anchored to points in the Conceptual Area.

As an example, consider the task (Fig. 3) of placing the “yellow sponge” (the left one in the figure) next to the “red sponge”. Having seen this task performed by a user, the system extracts the key-frames and encodes all scene objects and the human hand's motion in the Conceptual Area. Suppose that the “yellow sponge” is assigned the perceptual space label ps_id1 and the “red sponge” is assigned the label ps_id2. This task is automatically assigned the following description: reach(ps_id1), grasp(ps_id1), transfer_relative(ps_id1, ps_id2), and could be labelled according to the goal performed.

Since the perceptual space points hold the information about the objects involved in a task, it is possible to refer the previous description to another scene in which similar objects are present (e.g. yellow and red sponges) and perform exactly the same task. This is done in the imitation mode. If the system is not able to match the task, it prompts the user, who may then show the task again in the observation mode. Otherwise, the matched description is tailored to the particular situation, and the absolute positions and orientations to be reached by the robotic system are produced and sent to the Action Space for the trajectory computation. The whole trajectory is then sent to the Subconceptual Area and mapped to our robot's actuators. Fig. 5 shows the entire process of imitating the task seen by the user. Note that in this example, the control of the robotic hand fingers is performed as a reactive behavior and does not involve

Fig. 4. Visualization of the clustered Action Space showing the four action primitives identified by k-means, and its structural properties (quantization and topographic errors). The various experiments performed have shown a rate of correct classification for new movements of nearly 94%.

Fig. 5. (a) Top view of the world as seen by the robot. (b)–(f) Sequence of actions performed by the robot in the imitation mode: the robot executes the same task performed by the user.

any learning or grasping strategy (for a learning approach for the robotic hand, refer to [8]).


5. Conclusions

The main contribution of the paper is the proposal of an effective cognitive architecture for imitation learning in which three different representation and processing areas coexist and are integrated in a principled way: the Subconceptual, the Conceptual and the Linguistic Areas. The resulting system is able to learn natural movements and to generate complex action plans by manipulating symbols in the Linguistic Area.

It should be noted that the conceptual spaces may be employed as a simulation structure, i.e. to simulate movements and actions before effectively executing them. Current research concerns how to enhance the proposed architecture by stressing these simulation capabilities in the context of imitation learning.

Acknowledgement

This research is partially supported by MIUR (Italian Ministry of Education, University and Research) under the project RoboCare (“A Multi-Agent System with Intelligent Fixed and Mobile Robotic Components”).

References

[1] A. Alissandrakis, C.L. Nehaniv, K. Dautenhahn, Solving the correspondence problem between dissimilarly embodied robotic arms using the ALICE imitation mechanism, in: Proceedings of the Second International Symposium on Imitation in Animals and Artifacts, The Society for the Study of Artificial Intelligence and Simulation of Behaviour, 2003, pp. 79–92.

[2] C.G. Atkeson, S. Schaal, Learning tasks from a single demonstration, in: Proceedings of IEEE-ICRA, Albuquerque, New Mexico, 1997, pp. 1706–1712.

[3] A. Billard, M.J. Mataric, Learning human arm movements by imitation: Evaluation of a biologically inspired connectionist architecture, Robotics and Autonomous Systems 37 (2001) 145–160.

[4] A. Billard, Imitation: a means to enhance learning of a synthetic protolanguage in autonomous robots, in: K. Dautenhahn, C.L. Nehaniv (Eds.), Imitation in Animals and Artifacts, MIT Press, 2002, pp. 281–310.

[5] C. Breazeal, B. Scasselati, Challenges in building robots that imitate people, in: K. Dautenhahn, C.L. Nehaniv (Eds.), Imitation in Animals and Artifacts, MIT Press, 2002, pp. 363–390.

[6] A. Chella, M. Frixione, S. Gaglio, A cognitive architecture for artificial vision, Artificial Intelligence 89 (1997) 73–111.

[7] A. Chella, M. Frixione, S. Gaglio, Anchoring symbols to conceptual spaces: the case of dynamic scenarios, Robotics and Autonomous Systems 43 (2–3) (2003) 175–188 (special issue on Perceptual Anchoring).

[8] A. Chella, I. Infantino, H. Dindo, I. Macaluso, A posture sequence learning system for an anthropomorphic robotic hand, Robotics and Autonomous Systems 47 (2004) 143–152.

[9] K. Dautenhahn, C.L. Nehaniv, The agent-based perspective on imitation, in: K. Dautenhahn, C.L. Nehaniv (Eds.), Imitation in Animals and Artifacts, MIT Press, 2002, pp. 1–40.

[10] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley Interscience, 2001.

[11] H. Friedrich, O. Rogalla, R. Dillmann, N. Guil, P. Trabado, Knowledge representation and efficient image processing for high-level skill acquisition, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, Grenoble, France, September 1997.

[12] P. Gärdenfors, Conceptual Spaces, MIT Press–Bradford Books, Cambridge, MA, 2000.

[13] A.J. Ijspeert, J. Nakanishi, S. Schaal, Movement imitation with nonlinear dynamical systems in humanoid robots, in: Proceedings of the International Conference on Robotics and Automation, ICRA 2002, Washington, 2002.

[14] A. Jaklic, A. Leonardis, F. Solina, Segmentation and Recovery of Superquadrics, in: Computational Imaging and Vision, vol. 20, Kluwer, Dordrecht, 2000.

[15] Y. Kuniyoshi, M. Inaba, H. Inoue, Learning by watching: Extracting reusable task knowledge from visual observations of human performance, IEEE Transactions on Robotics and Automation 10 (1994) 799–822.

[16] J. Rittscher, A. Blake, A. Hoogs, G. Stein, Mathematical modelling of animate and intentional motion, Philosophical Transactions of the Royal Society: Biological Sciences 358 (2003) 475–490.

[17] S. Schaal, Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3 (1999) 233–242.

[18] S. Schaal, A.J. Ijspeert, A. Billard, Computational approaches to motor learning by imitation, Philosophical Transactions of the Royal Society: Biological Sciences 358 (2003) 537–547.

[19] A. Ude, T. Shibata, C.G. Atkeson, Real time visual system for interaction with a humanoid robot, Robotics and Autonomous Systems 37 (2001) 115–126.

[20] D.M. Wolpert, K. Doya, M. Kawato, A unifying computational framework for motor control and social interaction, Philosophical Transactions of the Royal Society: Biological Sciences 358 (2003) 593–602.

Antonio Chella was born in Florence, Italy, on 4 March 1961. He received his laurea degree in Electronic Engineering in 1988 and his Ph.D. in Computer Engineering in 1993 from the University of Palermo, Italy. Since 2001 he has been a professor of robotics at the University of Palermo and a scientific advisor of the ICAR (Institute for High Performance Computer Networks) of the Italian Research Council (CNR). His research interests are in the field of autonomous robotics, artificial vision, hybrid (symbolic–subsymbolic) systems and knowledge representation. He is a member of IEEE, ACM and AAAI.

Haris Dindo received his M.S. degree from the University of Palermo in 2003. He is currently a Ph.D. student at the same university. His main research interests are in the area of imitation learning and human–robot interaction.

Ignazio Infantino was born in Agrigento, Italy, on 29 January 1971. He received his laurea degree in computer science and his Ph.D. degree from the University of Palermo in 1997 and 2001, respectively. Since December 2001 he has been a researcher at the CNR-ICAR (Italian Research Council) of Palermo. His research interests lie primarily in the area of computer vision and cognitive robotics. He is a member of IEEE.