
Probing attention for insights into imitation learning mechanisms

Jonathan Maycock, Kai Essig, Daniel Prinzhorn, Thomas Schack and Helge Ritter

Abstract— Where to focus attention when being taught how to do something is an ability that children learn from a very young age. Endowing technical systems, such as robots, with this ability is a non-trivial task. An approach we are developing to better understand the involved processes is to follow the gaze patterns of humans as they observe manual interaction events taking place. This is not without precedent, but where our approach differs from previous ones is that we track 3D gaze patterns while participants observe real movements and interactions taking place (as opposed to viewing them in video format). Our setup also allows for very natural movements to be carried out due to the mobile nature of the technology employed. The tracking and aligning of all relevant components (i.e., eyes, hands, objects), both spatially and temporally, is discussed in this paper, and an experiment is presented in which participants observe how a mug of coffee and a mug of hot chocolate are prepared.

I. INTRODUCTION

A major goal in robotics, which is proving stubbornly elusive, is to create robots that can work autonomously to assist us in general environments such as our homes. It is hoped that we will be able to simply show these robots how to perform certain tasks and they will then be able to carry them out. To circumvent the need to search the enormous possible state-action space that robots with many degrees of freedom operate in, imitation learning has been suggested as a way to enable focus to be placed on only relevant parts of the movement to be performed [1].

In an important early paper on imitation, Meltzoff and Moore [2] reported on the ability of very young babies (12-21 days old), and in a later paper babies that were less than an hour old [3], to imitate both facial and manual gestures. The babies had neither seen their own faces nor been exposed to viewing the faces of other humans for any significant amount of time. This suggests an innate ability to map perceived gestures onto one's own body. There is evidence to suggest that many animals cannot learn by imitation [4], and those that can are thought to exhibit high intelligence. If robots of the future are to become useful to us in our daily lives, then the ability to imitate is a skill they surely must possess.

However, knowing where to focus attention while a movement or interaction is being demonstrated is a non-trivial task [see Fig. 1]. Mataric and Pomplun [5] investigated where focus was placed in two conditions: just watching and watching to imitate. They found no major differences between the two conditions, suggesting strong sensori-motor

The authors are cooperating within the Bielefeld Excellence Cluster Cognitive Interaction Technology (CITEC).

Jonathan Maycock and Helge Ritter are with the Neuroinformatics Group and Kai Essig, Daniel Prinzhorn and Thomas Schack are with the Neurocognition and Action Group at Bielefeld University, Germany. jmaycock/kessig@techfak.uni-bielefeld.de

Fig. 1. Experimental setup. The observer's 3D gaze pattern is tracked while a mug of coffee is prepared.

integration. Neurobiological research has further solidified the idea that the neural circuitry responsible for triggering action execution overlaps extensively with that activated when actions are observed [6]. Results from Mataric and Pomplun's experiments suggest that people typically follow the hand that is moving. They suggest that people may be capable of filling in missing details of a movement (such as how the upper part of the arm moves if attention is focused on the hand) by mapping internal movement primitives onto the observed movement or interaction. These findings provide important clues to the inherent mechanisms people employ when they learn by imitation, and their transfer to robots is a topic of ongoing research.

The importance of attention in guiding manual actions has seen renewed interest of late [7], [8]. In order to achieve good eye-hand coordination, hands and eyes must work together in smooth and efficient patterns. Johansson et al. [9] analyzed the coordination between gaze behavior, fingertip movements, and movements of a manipulated object. Obligatory gaze targets were those regions where contact occurred. They concluded that gaze supports hand movement planning by marking key positions to which the fingertips are subsequently directed. We also verified this in a recent study [10]. In this paper we have switched our attention to the observer of the action rather than the executor of the action [see Fig. 1].

Our analysis of the eye movements reveals that even though the subjects did not perform the action, their eye movements do precede the hand of the demonstrator. This eye-hand span has been found in many object manipulation studies. Our initial results, following previous work [5], show that this is not just the case when we perform actions, but also

Fig. 2. Software layers and communication

when we passively observe them. In order to calculate the 3D gaze vector from the observer of the movement or interaction, we have combined data from a mobile eye-tracking system with a motion-capture system. The Vicon-EyeTracking Visualizer (VEV) software we have developed [11] allows us not only to calculate and visualize the 3D gaze vector, but also to automatically annotate trials. In this paper we briefly describe the developed technical systems and then present an experiment in which participants observed people performing everyday actions such as pouring a mug of coffee. The VEV system opens up the possibility to conduct very realistic experiments that heretofore have not been possible. The work in this paper is part of a larger project in which we complement the current, strongly control- and physics-based approach for the synthesis of robot manual actions with an observation-driven approach [12]. To facilitate this, a multi-modal database [see Fig. 2] is being constructed whose main focus is recordings of humans performing various interactive manual tasks. We briefly discuss the dedicated lab we have built to help populate this database in the next section.

II. THE MANUAL INTELLIGENCE LAB (MILAB)

There are some freely available motion capture databases, such as the CMU Graphics Lab Motion Capture Database [13], the Karlsruhe Human Motion Library [14] or the recent TUM Kitchen Data Set [15], focusing on different aspects and settings of human motion data. However, as their common focus is on full body movements, they only allocate a very small number of degrees of freedom to the hands, enabling only a very coarse representation of manual actions. Thus, to date there does not exist a database of finely detailed human manual interaction situations, and therefore the Manual Intelligence Lab (MILAB) was purposefully built to capture such details at a high level of spatial and temporal coherence.

MILAB consists of fourteen Vicon MX3+ cameras that can track reflective markers attached to both subjects and

objects. Vicon provides a marker tracking accuracy of below 1 mm, and this provides us with good ground truth data about where objects are in the scene. In order to gain insights into where people look as they perform everyday tasks with their hands, or indeed to allow us to track where people look as they observe a task, we use an SMI IViewX (monocular) mobile eye tracking system for our experiments. It has a sampling rate of 200 Hz, with a gaze position accuracy of < 0.5°–1°. Each scene video has a resolution of 376 x 240 pixels at 25 fps. Other devices that are supported in MILAB, but not used in the current experiment, are stereo-vision cameras, Swiss Ranger time-of-flight cameras, Cyber Glove II datagloves, fingertip tactile sensors and microphones. For a more detailed discussion of MILAB please see [12]; the interested reader is also directed towards recent work carried out in the lab [10], [16], [17].

The challenges of manual action capture are many and varied, but in most cases reduce to the problem of ensuring that there is a high degree of both spatial and temporal coherence amongst the different input channels. At the core of MILAB is a Vicon system providing us with high precision 3D positional data to which the other modalities need to be aligned. For modalities not directly supported by Vicon, i.e., the mobile eye tracker used in this paper, it was necessary to develop custom-made solutions in order to ensure spatial coherence. It is also important to ensure temporal coherence across all modalities. Figure 2 gives an overview of the software architecture that was developed for MILAB. The Multiple Start Synchronizer (MSS) ensures that the recording of all data streams starts at the same time and runs for a specific duration or until a stop command is sent. MSS first checks if all computer clocks are synchronized correctly using the Network Time Protocol. Pressing start sends all relevant data (i.e., start and stop time, experiment details) via the Open Sound Control protocol to the listening clients. The client applications then wait for the starting time and begin capturing until the end time is reached or a stop signal from MSS is received. Once the experiments are finished, all data is transferred to the database. A front-end GUI allows for easy playback of experiments and provides features such as an annotation plugin that lets the experimenter associate important higher-level event data with the trials.
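To make the synchronization step concrete, the following Python sketch shows the client-side wait-and-capture logic that the description above implies. It is an illustration only: the real MSS communicates via OSC messages and relies on NTP-disciplined clocks, and the function and parameter names used here are ours, not those of the MSS code.

# Minimal sketch of the start/stop logic a capture client could follow
# (illustrative only; the actual MSS sends start/stop times via OSC and
# assumes all client clocks have been synchronized with NTP beforehand).
import time

def run_capture_client(start_time: float, stop_time: float, capture_frame, poll=0.001):
    """Wait until the agreed start time, then capture frames until stop_time.

    start_time / stop_time: epoch seconds shared by the synchronizer.
    capture_frame: callable that grabs one sample from the local device.
    """
    # Because the local clock is NTP-disciplined, all clients interpret
    # start_time as (nearly) the same physical instant.
    while time.time() < start_time:
        time.sleep(poll)                                   # idle until the common start instant

    frames = []
    while time.time() < stop_time:
        frames.append((time.time(), capture_frame()))      # timestamp every captured sample
    return frames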

III. VICON-EYE TRACKING VISUALIZER (VEV)

Although there are several approaches that combine head and eye tracking systems, they often have shortcomings or they are not easy to use [18]. For example, [19] describe a virtual reality simulator that features a binocular eye tracker built into the system's Head Mounted Display, which allows for the recording of the user's dynamic gaze vector within the virtual environment. However, this system has to be specifically designed and suffers from the limitations of virtual reality Head Mounted Displays. Allison et al. [20] used a magnetic head tracker, which allows for only a relatively small range of head movements. A third approach by [21] suffers from occlusions and a low update rate of gaze positions, making it ill-suited for large screen-based

Fig. 3. The three different coordinate systems and their locations. Red marks the world coordinate system, green the eye coordinate system and yellow the head coordinate system.

virtual reality systems. For these reasons [18] proposed a head-eye tracking system consisting of a video-based eye tracking and a magnetic-inertial head tracking system that can be used for desktop as well as immersive applications (e.g., VR setups). However, their system requires a complex three-stage calibration procedure in which the subject's head is fixed by a chin rest. An open-source library (called libGaze) was introduced by [22] to allow the computation of a viewer's position and gaze direction in real time, using data from a head-mounted eye-tracking system and a Vicon system while people walk around freely, gesture and interact with large display screens. One disadvantage of this approach is that the head-mounted eye-tracking system used does not allow the same kind of free movement that is possible with a mobile eye tracker. Furthermore, a separate, quite complex three-stage calibration process has to be performed in order to calculate the relationships between the three coordinate systems that are taken into account when calculating the 3D gaze position in real-world coordinates, and to map the eye tracker coordinates to the 3D viewing vector in the eye coordinate system. A commercial solution for tracking the point of gaze in any environment was recently developed using Vicon Bonita cameras, Nexus software and Ergoneers eye trackers [23]. In contrast to their approach, which is closed and requires specific hardware, our approach is more modular and in theory can be used with any mobile eye tracker and motion-tracking system.

The aim of our solution, VEV, is to calculate the 3D gaze vector from the data captured by the Vicon and mobile eye-tracking systems. Vicon markers are attached to the helmet of the mobile eye tracking system and to the bridge of the nose to measure spatial head alignment. Using the 2D gaze position data from the mobile eye tracker, we can calculate and visualize the 3D gaze vector within the Vicon coordinate frame. This allows us to automatically calculate the number of fixations, along with their cumulative fixation duration, on each marker in the scene. By clustering the data for the markers into particular objects, it is possible to calculate object-specific perception durations.
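As an illustration of how such object-specific statistics can be obtained once gaze and markers share the Vicon world frame, the following Python sketch assigns each fixation to the marker closest to the 3D gaze ray and accumulates fixation counts and durations per object. The distance threshold and the marker-to-object mapping are illustrative assumptions and do not reproduce the actual VEV implementation [11].

# Sketch: assign each fixation to the marker closest to the 3D gaze ray and
# aggregate per-object statistics (names and threshold are illustrative).
import numpy as np
from collections import defaultdict

def point_to_ray_distance(point, origin, direction):
    """Perpendicular distance from a marker position to the gaze ray."""
    d = direction / np.linalg.norm(direction)
    v = point - origin
    return np.linalg.norm(v - np.dot(v, d) * d)

def object_fixation_stats(fixations, markers, marker_to_object, max_dist=50.0):
    """fixations: list of (origin, direction, duration_s) in world (Vicon) coordinates.
    markers: dict marker_name -> 3D position (mm).
    Returns {object: {"numFix": n, "fixDur": seconds}}."""
    stats = defaultdict(lambda: {"numFix": 0, "fixDur": 0.0})
    for origin, direction, dur in fixations:
        # find the marker the gaze ray passes closest to
        name, dist = min(
            ((m, point_to_ray_distance(p, origin, direction)) for m, p in markers.items()),
            key=lambda t: t[1])
        if dist <= max_dist:                        # ignore fixations far from any marker
            obj = marker_to_object.get(name, name)  # cluster markers into objects
            stats[obj]["numFix"] += 1
            stats[obj]["fixDur"] += dur
    return dict(stats)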

When calculating the 3D gaze vector from the recorded data of the two different tracking systems, three different coordinate systems have to be taken into consideration [see Fig. 3]. The coordinate systems are:

• World Coordinate System: This is the coordinate system of the environment in which the 3D gaze vector resides (in our case, the Vicon coordinate system). Each marker attached to objects in the scene has a position in the world coordinate system, allowing us to calculate the distances between the gaze vector and the markers.

• Head Coordinate System: Describes the orientation of the subject's head and all marker locations relative to the head. The head coordinate system is determined by the four markers attached to the eye tracker helmet.

• Eye Coordinate System: This coordinate system originates at the subject's right or left eyeball (depending on the tracked eye). It results from shifting the head coordinate system to the nose bridge; from there, the origin is moved to the center of the left or right eyeball. A vector in the eye coordinate system represents the eye's viewing direction.

We first calculate the z-vector of the head coordinate system in world coordinates, with its origin in the head center and pointing towards the center of the field of view, using the positions of three markers on the eye tracker helmet (i.e., the left, right and front markers) [see Fig. 3]. The origin of the head coordinate system is then shifted to the nasal bridge marker and then to the center of the eye to create the eye coordinate system. The calibration procedure is used to determine how the gaze vector direction changes with shifting eye movements. After a series of straightforward geometric calculations, two angles βxz and βyz are calculated, and the eye coordinate system is then rotated by these angles. The 3D gaze vector emanating from the eye is calculated every 5 ms. Finally, a snapshot of the system is shown in Fig. 4; for a detailed account of how the system works the interested reader is directed to [11].
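The following Python sketch illustrates the geometric construction just described: building a head frame from the three helmet markers, shifting the origin via the nasal bridge to the eye centre, and rotating the straight-ahead direction by the two calibration angles. The marker naming, the eye offset, the rotation order and the sign conventions are assumptions made for illustration; the actual VEV computation is described in [11].

# Geometric sketch of the gaze-vector construction described above.
# Marker names, rotation order and sign conventions are assumptions, not VEV's code.
import numpy as np

def head_axes(left, right, front):
    """Orthonormal head coordinate frame from three helmet markers (world coordinates)."""
    x = right - left
    x /= np.linalg.norm(x)                    # ear-to-ear axis
    z = front - (left + right) / 2.0
    z -= np.dot(z, x) * x                     # forward axis, orthogonalised against x
    z /= np.linalg.norm(z)
    y = np.cross(z, x)                        # third axis completing a right-handed frame
    return np.stack([x, y, z], axis=1)        # columns are the head axes in world coordinates

def gaze_vector_world(left, right, front, nose_bridge, eye_offset, beta_xz, beta_yz):
    """3D gaze ray (origin, direction) in world coordinates.

    eye_offset: assumed eye-centre position expressed in head coordinates;
    beta_xz, beta_yz: calibration angles (radians) rotating the resting gaze direction.
    """
    R_head = head_axes(left, right, front)
    eye_origin = nose_bridge + R_head @ eye_offset        # shift origin to the eyeball centre

    gaze_eye = np.array([0.0, 0.0, 1.0])                   # straight ahead in eye coordinates
    # rotate by the two calibration angles (here: about the frame's y- and x-axes)
    c, s = np.cos(beta_xz), np.sin(beta_xz)
    R_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    c, s = np.cos(beta_yz), np.sin(beta_yz)
    R_x = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    gaze_world = R_head @ (R_y @ (R_x @ gaze_eye))
    return eye_origin, gaze_world / np.linalg.norm(gaze_world)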

IV. EXPERIMENT AND SOME INITIAL RESULTS

To test our implementation we designed a simple experiment in which subjects had to observe a human demonstrator performing two everyday tasks: preparing a mug of coffee and a mug of hot chocolate. The observer's eye movements were tracked using a mobile eye tracker and the demonstrator's

Marker position      Coffee               Chocolate
                     numFix    fixDur     numFix    fixDur
hand                   52      19.11        48      14.48
kettle                  3       0.86         6       1.96
mug                    20      10.59        24      14.55
coffee                  4       2.63         -          -
milk                   10       2.91         -          -
chocolate box           -          -        16       4.12

TABLE I
Results for the average number of fixations (numFix) and fixation duration (fixDur, in seconds) for both the coffee and chocolate preparation interaction scenes.

Fig. 4. The Vicon-EyeTracking Visualizer (VEV). By combining mobile eye tracking and body motion capture data (Vicon), the 3D gaze vector can be determined and visualized (see left image). VEV provides automatic analysis of object-specific fixation information. The 3D gaze vector is coloured yellow and the currently attended marker is coloured red. On the right the corresponding scene video from the mobile eye tracker is shown.

hand and all objects were tracked using Vicon. Prior to the experiment the motion-capture system was calibrated. Then subjects sat in front of a glass table, and the eye tracker was set up and calibrated using a standard five-point calibration procedure. The same procedure was then used to calibrate the VEV system [11]. After calibration, the panel was removed from the table and the actual experiment started. Subjects sat opposite the human demonstrator while two tasks were performed: i) preparing a mug of coffee, which involved pouring coffee powder into the mug, adding hot water and then milk, and ii) preparing a mug of hot chocolate, which involved opening the chocolate powder box, pouring powder into the cup, adding hot water to it and finally stirring it [see Fig. 1]. At the end of both tasks the demonstrator handed the mug to the subjects. Sixteen reflective markers were placed on the demonstrator's hand, and more markers were used to track the individual objects in the scene, such as the cup, milk, kettle, coffee and chocolate. Finally, markers were also attached to the eye tracker helmet and the nasal bridge for the VEV system [see Fig. 3]. The task for the subjects was to observe the interaction scenario closely and afterwards to perform the same actions as exactly as possible.

Table I shows an exemplar sample of the data we captured in our experiment. It reveals that most of the fixations are located on the hand of the demonstrator. During the second demonstration (hot chocolate preparation) there is slightly less attention on the hand, because there are more actions in the chocolate task, such as stirring with the spoon, which attracts more attention towards the mug area. There are more fixations on the coffee and milk in the coffee task than on the chocolate box in the chocolate task. The reason is that the subjects looked at the milk for quite a long time while the demonstrator was pouring milk into the cup. When watching the kettle, more attention was directed towards the spout of the kettle than towards the body of the kettle itself. We note that a better placement of markers would have included some on the spoon and some near the kettle spout, and we will correct this in future experiments. This also explains why the cumulative fixation duration on all objects is smaller than the overall trial duration of 40 seconds.

Our results closely match those of Land et al. [24] and Hayhoe et al. [25], who studied the eye movements of subjects making a cup of tea and a peanut butter and jelly sandwich, respectively. They had the following general findings: i) saccades are made almost exclusively to objects involved in the task, even when there are plenty of other objects around to grab the attention of the eye, ii) the eyes usually deal with one object at a time, and iii) the time spent on each object corresponds roughly to the duration of the manipulation of that object and may involve a number of fixations on different parts of the object. Our experiment showed slight differences with regard to the second finding, namely that the eyes only deal with one object at a time. We found at various points during observation of the task that subjects' gaze paths would jump to objects about to be acted upon and then back to a previously fixated object (e.g., when pouring hot water from the kettle into the mug, attention switched from the kettle to the mug and back again [see Fig. 5]).

We note that in the study carried out by Land et al. [24] they wanted to investigate the interaction between body movements, eye movements and manipulation movements. This was a particularly difficult task to accomplish, and they ended up having to record all of these events manually on a large 12 m roll of paper in order to perform the analysis. They found a definite temporal pattern at each step in the tea preparation process. This involved, first, a movement of the whole body towards the next object in the overall sequence; second, a gaze movement to that object; and finally the initiation of manipulation of the object by the hands. Our developed technical system allows us to automatically measure these kinds of events and reduces the need for such manual overheads.

The direct matching hypothesis postulates that action understanding stems from a mechanism that maps an observed action onto motor representations of that action [26]. Flanagan and Johansson [27] produced evidence for this hypothesis when they found that subjects observing a block stacking task rarely followed the moving hands reactively, but rather predicted the path of the hand as if they were carrying out the task themselves. In our two scenarios, which

Fig. 5. A temporal and spatial depiction of where attention was directed during an exemplar trial. The highlighted attention units, Dcup and Ekettle, reveal that the observer of an action can make attention switches to objects that are about to be acted upon, or, in the case of the kettle, to a different part of the object that is associated with a different primitive. The two Ehand attention units reveal that the object in hand is not exclusively focused on, but that the hand itself also receives attention during this movement primitive.

involved more complex actions, we found a mixture of predictive and reactive eye movements. Subjects' gazes did track the hand for large parts of the trajectory of a movement [see Tab. I and Fig. 5], but also made jumps to objects that were about to be interacted with. Ours is only a preliminary study, but our initial results suggest that more complex tasks produce different gaze trajectories in observers than occur when less complex tasks are observed.
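As a rough indication of how the eye-hand span mentioned above could be quantified from automatically annotated data, the sketch below compares, per object, the time at which gaze first reaches the object with the time at which the hand arrives; positive spans indicate predictive gaze, negative spans reactive gaze. The event extraction and the sign convention are our assumptions, not the analysis reported in this paper.

# Sketch: estimate the eye-hand span per object from time-stamped arrival events.
# The arrival times are assumed to come from automatic (VEV-style) annotation.
def eye_hand_span(gaze_arrivals, hand_arrivals):
    """gaze_arrivals / hand_arrivals: dict object -> time (s) of first arrival.

    Returns dict object -> span (s). Positive values mean gaze reached the
    object before the hand did (predictive); negative values mean the gaze
    followed the hand (reactive).
    """
    spans = {}
    for obj, t_hand in hand_arrivals.items():
        if obj in gaze_arrivals:
            spans[obj] = t_hand - gaze_arrivals[obj]
    return spans

# Example with made-up numbers: gaze reaches the mug ~0.4 s before the hand,
# but arrives at the kettle ~0.2 s after the hand.
if __name__ == "__main__":
    spans = eye_hand_span({"mug": 12.1, "kettle": 15.0},
                          {"mug": 12.5, "kettle": 14.8})
    print(spans)   # approximately {'mug': 0.4, 'kettle': -0.2}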

V. DISCUSSION

In this paper we discussed an ongoing project concerned with the synthesis of robot manual actions using an observation-driven approach. In a change to our normal experimental setup, we tracked the eye movements of the observer of an interaction scenario rather than the executor of the interaction. While investigations of this sort have been carried out before [5], until now it has not been possible to track the 3D gaze pattern of freely moving participants looking at cluttered scenes and to automatically analyse the results. Recent progress made by us [11], [12] in both spatially and temporally aligning different input sensor devices has opened up the possibility for many interesting experiments to be carried out.

Our results support the early findings of Mataric and Pomplun [5] that people typically follow the hand, or the object in the hand, that is moving. However, as our experiments involved complex interaction scenarios, such as making a mug of coffee, we also observed jumps in the focus of attention to objects that were about to be acted upon. These patterns reflect similar ones recorded from the point of view of the executor of a task and are thought to help with path planning and with finding suitable landing points for grasps [8]. The fact that observers do this too adds to the growing body of evidence for strong sensori-motor integration in the brain. In future experiments we intend to investigate this relationship by applying the SDA-M [28] to measure the cognitive representation of objects and movements in long-term memory.

Further experiments are planned to investigate the correlation in eye gaze patterns between the observer and the executor of a task; these will require some adaptations to be made to the VEV program. The raw data from these and similar experiments, along with higher-level representations of the captured interactions (events such as the spatial location of the focus of attention, when and where contact occurs, etc.), are being added to the MILAB database on a daily basis. It is our belief that this observation-driven approach places us on the correct path to eventually being able to synthesize robot manual actions in a way that realistically mimics the actions of humans. Some initial work towards this lofty goal has recently been published by our group. In the first task a jar was successfully unscrewed [29] and in the second a piece of paper was picked up [30], both times using two anthropomorphic Shadow Hands. Schack and Ritter [31] have outlined an approach to transfer the findings of studies of motor control in humans into models that can guide the implementation of cognitive robot architectures. Finding the elementary building blocks of movement, or the language of movement, and applying this gained knowledge to endow robots with sophisticated skills is the ultimate goal we are striving for.

VI. ACKNOWLEDGMENTS

This research was supported by the DFG CoE 277: Cognitive Interaction Technology (CITEC). We wish to thank Tobias Roehlig for his assistance with the experiments.

REFERENCES

[1] S. Schaal, "Is imitation learning the route to humanoid robots?" Trends in Cognitive Sciences, vol. 3, pp. 233–242, 1999.

[2] A. N. Meltzoff and M. K. Moore, "Imitation of facial and manual gestures by human neonates," Science, vol. 198, pp. 74–78, 1977.

[3] ——, "Newborn infants imitate adult facial gestures," Child Development, vol. 54, pp. 702–709, 1983.

[4] M. Tomasello, A. C. Kruger, and H. H. Ratner, "Cultural learning," Behavioral and Brain Sciences, vol. 16, pp. 495–552, 1993.

[5] M. J. Mataric and M. Pomplun, "Fixation behavior in observation and imitation of human movement," Cognitive Brain Research, vol. 7, pp. 191–202, 1998.

[6] S. Clark, F. Tremblay, and D. Ste-Marie, "Differential modulation of corticospinal excitability during observation, mental imagery and imitation of hand actions," Neuropsychologia, vol. 42, pp. 105–112, 2003.

[7] B. Rasolzadeh, M. Bjorkman, K. Huebner, and D. Kragic, "An active vision system for detecting, fixating and manipulating objects in the real world," International Journal of Robotics Research, vol. 29, no. 2–3, pp. 133–154, 2010.

[8] D. Jonikaitis and H. Deubel, "Independent allocation of attention to eye and hand targets in coordinated eye-hand movements," Psychological Science, vol. 22, pp. 339–347, 2011.

[9] R. S. Johansson, G. Westling, A. Backstrom, and J. R. Flanagan, "Eye-hand coordination in object manipulation," The Journal of Neuroscience, vol. 21, no. 17, pp. 6917–6932, 2001.

[10] J. Maycock, K. Essig, R. Haschke, T. Schack, and H. J. Ritter, "Towards an understanding of grasping using a multi-sensing approach," in IEEE International Conference on Robotics and Automation (ICRA) - Workshop on Autonomous Grasping, Shanghai, China, May 2011.

[11] K. Essig, D. Prinzhorn, J. Maycock, D. Dornbusch, H. J. Ritter, and T. Schack, "Automatic analysis of 3D gaze coordinates on scene objects using data from eye-tracking and motion capture systems," in Proceedings of the Symposium on Eye Tracking Research and Applications (ETRA 2012), USA, March 2012, pp. 37–44.

[12] J. Maycock, D. Dornbusch, C. Elbrechter, R. Haschke, T. Schack, and H. J. Ritter, "Approaching manual intelligence," KI - Künstliche Intelligenz - Issue Cognition for Technical Systems, pp. 1–8, 2010.

[13] CMU Graphics Lab Motion Capture Database. [Online]. Available: http://mocap.cs.cmu.edu

[14] P. Azad, T. Asfour, and R. Dillmann, "Toward an unified representation for imitation of human motion on humanoids," in Int. Conf. on Robotics and Automation. IEEE, 2007, pp. 2558–2563.

[15] M. Tenorth, J. Bandouch, and M. Beetz, "The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition," in Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences (ICCV), 2009.

[16] J. Maycock, J. Steffen, R. Haschke, and H. J. Ritter, "Robust tracking of human hand postures for robot teaching," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), USA, 2011, pp. 2947–2952.

[17] K. Essig, J. Maycock, H. J. Ritter, and T. Schack, "The cognitive nature of action - a bi-modal approach towards the natural grasping of known and unknown objects," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), USA, Sep 2011.

[18] H. Huang, R. S. Allison, and M. Jenkin, "Combined head-eye tracking for immersive virtual reality," in ICAT 2004, 14th International Conference on Artificial Reality and Telexistence, Seoul, Korea, 2004.

[19] A. T. Duchowski, E. Medlin, A. Gramopadhye, B. J. Melloy, and S. Nair, "Binocular eye tracking in VR for visual inspection training," in Proceedings of the Symposium on Virtual Reality Software and Technology. ACM, 2001, pp. 1–9.

[20] R. Allison, M. Eizenman, and B. Cheung, "Combined head and eye tracking system for dynamic testing of the vestibular system," IEEE Trans. on Biomedical Engineering, vol. 43, pp. 1073–1081, 1996.

[21] A. Perez, M. Cordoba, A. Garcia, R. Mendez, M. Munoz, J. L. Pedraza, and F. Sanchez, "A precise eye-gaze detection and tracking system," in Proceedings of the 11th Int. Conf. in Central Europe on Computer Graphics, Visualization and Computer Vision, 2003, pp. 1–4.

[22] S. Herholz, L. L. Chang, T. G. Tanner, H. H. Buelthoff, and R. W. Fleming, "libGaze: Real-time gaze-tracking of freely moving observers for wall-sized displays," in Proceedings of the 13th International Fall Workshop on Vision, Modeling, and Visualization (VMV 2008). IOS Press, Amsterdam, 2008, pp. 101–110.

[23] Eye Tracking Integration with Bonita. [Online]. Available: http://www.vicon.com/products/bonitavideos.html

[24] M. F. Land, N. Mennie, and J. Rusted, "The roles of vision and eye movements in the control of activities of daily living," Perception, vol. 28, pp. 1311–1328, 1999.

[25] M. M. Hayhoe, A. Srivastrava, R. Mruczec, and J. B. Pelz, "Visual memory and motor planning in a natural task," J. Vision, vol. 3, pp. 49–63, 2003.

[26] V. Gallese, L. Fadiga, L. Fogassi, and G. Rizzolatti, "Action recognition in the premotor cortex," Brain, vol. 119, pp. 593–609, 1996.

[27] J. R. Flanagan and R. S. Johansson, "Action plans used in action observation," Nature, vol. 424, pp. 769–771, 2003.

[28] T. Schack, "A method for measuring mental representation," in Handbook of Measurement in Sport. Human Kinetics, pp. 203–214, 2012.

[29] J. Steffen, C. Elbrechter, R. Haschke, and H. J. Ritter, "Bio-inspired motion strategies for a bimanual manipulation task," in International Conference on Humanoid Robots (Humanoids). IEEE, 2010.

[30] C. Elbrechter, R. Haschke, and H. J. Ritter, "Bi-manual robotic paper manipulation based on real-time marker tracking and physical modelling," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011, pp. 1427–1432.

[31] T. Schack and H. J. Ritter, "The cognitive nature of action - functional links between cognitive psychology, movement science, and robotics," Progress in Brain Research, vol. 174, pp. 231–251, 2009.