
1541-1672/12/$31.00 © 2012 IEEE. IEEE Intelligent Systems. Published by the IEEE Computer Society.

Human-Agent-Robot Teamwork

Collaborative Programming by Demonstration in a Virtual Environment
Yunsick Sung and Kyungeun Cho, Dongguk University

Collaborative learning for a robot, a software agent, and a human subject in a virtual environment reduces time, labor, and cost compared to real-life approaches.

Programming by demonstration (PbD) lets robots imitate humans without needing to be programmed.1 PbD research helps integrate interactive robots or software agents into teams that include humans. For example, programming by cooperation (PbC) focuses on making robots learn tasks from humans by interacting and cooperating with them.2

Robots and software agents must employ iterative learning to successfully interact with humans. PbD researchers investigating human-robot interactions try to reduce the time, labor, and cost associated with interactive learning, which they can achieve by carrying out the research in a virtual environment. (See the "Related Work on Programming by Demonstration" sidebar.) A virtual agent can play the role of a human subject, and robots and software agents can learn from the virtual agent. Reinforcement learning can fine-tune behaviors the robot learns by observing human subjects.3

In this article, we propose a method for robot and software agent learning that involves virtual agents interacting with a human subject in a virtual environment. We paired a human subject and a cleaning robot to clean a house. Two virtual agents were created in the virtual environment, one for the subject and the other for the robot. Interactive learning was then employed to enable them to work together.

Learning in a Virtual Environment
Our collaborative-learning approach integrates Bayesian probability research,4 PbD-based behavior-generation research,5 Q-learning,6 and finite-state-machine principles.

Collaborative Learning
The proposed process for collaborative learning has three aspects (see Figure 1). The first, human modeling, entails creating a virtual agent in a virtual environment. A human subject controls the virtual agent, called a virtual human, and trains it to learn policies that generate a human model. The virtual human executes actions based on the model. Next, another agent, called the virtual learner, observes the virtual human during behavior learning and generates the behaviors needed for collaboration with the human subject. Finally, the virtual learner learns to collaborate by interacting with a virtual human during collaborative learning. After learning in a virtual environment,

March/April 2012, www.computer.org/intelligent

we apply the results to a robot or software agent. We call this series of learning processes collaborative programming by demonstration (CPbD).

Human Modeling Algorithm
Equation 1 defines action a_n as the nth defined action involving multiple movements (m_n^1, m_n^2, and so on) taking place over duration d_n:

a_n = (m_n^1, m_n^2, …, m_n^r, d_n), (1)

in which r is the number of movements of an action. The virtual human records its actions as controlled by the human subject and then generates a human model using Bayesian probability. This method considers not only the action at time t − 1, as was the case for related research,4 but also the states from t − m to t − 1 and the actions from t − m to t − 2, as shown in Equation 2. Action a_t, goal g_t, and state s_t indicate an action, a goal, and a state, respectively, estimated at time t.

P(a_t = a | s_t = s, …, s_{t−m} = s, g_t = g, a_{t−1} = a, …, a_{t−m} = a). (2)
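A minimal sketch of this estimation, assuming a simple frequency-count implementation of Equation 2. The history length M, the state encoding, the class and method names, and the demonstration data are all illustrative assumptions, not the authors' actual implementation:

```python
from collections import defaultdict, Counter

M = 2  # consider states/actions from the last M steps (illustrative)

class HumanModel:
    def __init__(self):
        # (recent states, recent actions, goal) -> frequency of each chosen action
        self.counts = defaultdict(Counter)

    def record(self, states, actions, goal, chosen_action):
        """Record one observed decision by the human subject."""
        context = (tuple(states[-M:]), tuple(actions[-(M - 1):]), goal)
        self.counts[context][chosen_action] += 1

    def most_likely_action(self, states, actions, goal):
        """Return the highest-probability action for this context, with its probability."""
        context = (tuple(states[-M:]), tuple(actions[-(M - 1):]), goal)
        freq = self.counts[context]
        total = sum(freq.values())
        if total == 0:
            return None, 0.0
        action, n = freq.most_common(1)[0]
        return action, n / total

model = HumanModel()
# In the same context, the subject moved "up" twice and "right" once.
model.record([(0, 0), (0, 1)], ["up"], "clean", "up")
model.record([(0, 0), (0, 1)], ["up"], "clean", "up")
model.record([(0, 0), (0, 1)], ["up"], "clean", "right")
```

At execution time, the virtual human would then pick the action with the highest conditional probability for the current context.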

Behavior Learning Algorithm
A virtual learner generates behaviors using a behavior-generation framework5 as follows:

1. Movements executed by the virtual human are transferred to the virtual learner.
2. A movement filter removes movements that have not changed significantly.
3. The action generator defines actions with the movements.
4. The action collector collects actions quantized by the action quantizer.
5. The behavior generator identifies all consecutive types of actions

Figure 1. The process of collaborative programming by demonstration. Working in a virtual environment, the virtual learner learns how to collaborate with a virtual human, and a robot or software agent uses those results to collaborate with a real human.


Related Work on Programming by Demonstration

Our investigation on how teams comprising humans, robots, and software agents can perform interactive learning in a virtual environment drew on considerable previous research. First, we had to understand thoroughly any policy employed by the human subject so that the virtual agent could imitate the subject. This was achieved using a hidden Markov model as well as Bayesian probability research.1,2

Next, we needed a method to generate behaviors for the robot and software agent to permit them to act like humans. Research on programming by demonstration has yielded observations on human body movements that can be integrated into motor primitives for robots.3 In addition, there is a framework for generating behaviors with consecutive actions whereby candidate behaviors are identified and selected using a maximin selection algorithm.4 The interactions described earlier can also generate behaviors in programming-by-cooperation-based research.5

Finally, we required a policy-learning method for the robot to cooperate with the human subject. Virtual agents learn behaviors through reinforcement learning in a virtual environment, and we can apply the results to robots.6 Moreover, people can teach physical movements to humanoid robots, and genetic algorithms can optimize these motions in a virtual environment.7

References
1. S. Calinon and A. Billard, "Recognition and Reproduction of Gestures Using a Probabilistic Framework Combining PCA, ICA, and HMM," Proc. 22nd Int'l Conf. Machine Learning (ICML 05), ACM, 2005, pp. 105–112.
2. C. Thurau, T. Paczian, and C. Bauckhage, "Is Bayesian Imitation Learning the Route to Believable Gamebots," Proc. GAME-ON North America, Eurosis-ETI, 2005, pp. 3–9.
3. A. Fod, M.J. Matarić, and O.C. Jenkins, "Automated Derivation of Primitives for Movement Classification," Autonomous Robots, vol. 12, no. 1, 2002, pp. 39–54.
4. Y. Sung and K. Cho, "A Sequential Action Generation Framework for Intelligent Agents in Cyberspace," J. Internet Technology, vol. 12, no. 6, 2011, pp. 971–980.
5. J. Boucher and P.F. Dominey, "Programming by Cooperation: Perceptual-Motor Sequence Learning via Human-Robot Interaction," Proc. 2006 IEEE-RAS Int'l Conf. Humanoid Robots (Humanoids 06), IEEE CS, 2006, pp. 222–227.
6. T. Yamaguchi et al., "Propagating Learned Behaviors from a Virtual Agent to a Physical Robot in Reinforcement Learning," Proc. IEEE Int'l Conf. Evolutionary Computation (ECTA 96), IEEE CS, 1996, pp. 855–859.
7. E. Berger et al., "Towards a Simulator for Imitation Learning with Kinesthetic Bootstrapping," Workshop Proc. Int'l Conf. Simulation, Modeling and Programming for Autonomous Robots (SIMPAR 08), Springer, 2008, pp. 167–173.



that are part of a series of actions.
6. The behavior generator selects the behaviors using the maximin selection algorithm.

In the fifth step, the behavior generator defines the candidate behavior c_j as shown in Equation 3, in which a_u and a_v are the uth and vth collected actions, respectively.

c_j = a_u ∙ a_{u+1} ∙ … ∙ a_v, 1 ≤ u ≤ v ≤ number of collected actions. (3)
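The enumeration in Equation 3 can be sketched as follows: every contiguous run of collected actions is a candidate behavior. The function name and action alphabet are assumptions for the example, and the max_len cap mirrors the experiment, where a behavior consisted of up to three actions:

```python
def candidate_behaviors(collected_actions, max_len=3):
    """Enumerate deduplicated contiguous subsequences a_u ... a_v (Equation 3).

    max_len caps behavior length, matching the experiment's
    three-action limit (an illustrative assumption here).
    """
    seen = set()
    candidates = []
    n = len(collected_actions)
    for u in range(n):
        for v in range(u, min(n, u + max_len)):
            c = tuple(collected_actions[u:v + 1])  # the run a_u ... a_v
            if c not in seen:
                seen.add(c)
                candidates.append(c)
    return candidates

# Three collected actions yield five distinct candidate behaviors.
cands = candidate_behaviors(["up", "up", "left"])
```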

Finally, the behavior generator selects the behaviors using the maximin selection algorithm.5 The function Diff calculates the difference between two behaviors as shown in Equation 4, where behavior b_h is the hth selected behavior, d_h is its duration, w_i is a weight for the ith movement, m_i^{h,t} is the ith movement of the hth behavior at time t, and ρ measures the difference between two movements:

Diff(b_h, c_j) = Σ_{t=1..min(d_h, d_j)} Σ_{i=1..r} w_i × (ρ(m_i^{h,t}, m_i^{j,t}))²
  + Σ_{t=min(d_h, d_j)+1..d_h} Σ_{i=1..r} w_i × (ρ(m_i^{h,t}))²
  + Σ_{t=min(d_h, d_j)+1..d_j} Σ_{i=1..r} w_i × (ρ(m_i^{j,t}))². (4)
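A hedged sketch of maximin selection over a simplified Diff: repeatedly pick the candidate whose minimum difference to the already-selected behaviors is largest, so the selected set stays diverse. The rho() below follows the experiment's convention of 1 per 90-degree change in direction with a single unit weight; the unmatched-step penalty, seeding rule, and all names are illustrative assumptions:

```python
DIRS = {"up": 0, "right": 1, "down": 2, "left": 3}

def rho(m1, m2=None):
    """Movement difference: 1 per 90-degree change in direction.
    An unmatched step (m2 is None) gets a flat penalty of 1 (assumption)."""
    if m2 is None:
        return 1.0
    d = abs(DIRS[m1] - DIRS[m2]) % 4
    return float(min(d, 4 - d))

def diff(b, c, w=1.0):
    """Simplified Diff(b, c): squared per-step differences over the
    overlapping duration, plus penalties for each behavior's leftover steps."""
    overlap = min(len(b), len(c))
    total = sum(w * rho(b[t], c[t]) ** 2 for t in range(overlap))
    total += sum(w * rho(m) ** 2 for m in b[overlap:])
    total += sum(w * rho(m) ** 2 for m in c[overlap:])
    return total

def maximin_select(candidates, k):
    """Greedy maximin: grow the selected set with the candidate that
    maximizes its minimum Diff to everything already selected."""
    selected = [candidates[0]]  # seed with the first candidate (assumption)
    while len(selected) < k and len(selected) < len(candidates):
        best = max((c for c in candidates if c not in selected),
                   key=lambda c: min(diff(b, c) for b in selected))
        selected.append(best)
    return selected

# "down" is farthest from "up" (two 90-degree turns), so it is picked second.
behaviors = maximin_select([("up",), ("up", "up"), ("down",), ("left",)], 2)
```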

Collaborative Learning Algorithm
Once the virtual learner learns how to collaborate in a specific way, the virtual human executes the action with the highest Bayesian probability for each state based on the human model. The virtual learner learns the policies that are necessary to execute behaviors in each state using Q-learning,6 as shown in Equation 5. Here, α denotes the learning rate and r is the reward:

Q(s, b_h) ← Q(s, b_h) + α × r. (5)

The virtual learner also considers the states during the time period t − 1 to t − n and the actions during the time period t − 1 to t − n, so as to cooperate accurately with the virtual human. In addition, the virtual learner can achieve multiple goals in sequence. Therefore, it can select a current goal using a finite-state machine.
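Equation 5's simplified update (no discounted next-state term, matching the equation as printed) can be sketched as follows. The states, behavior set, epsilon-greedy choice rule, and reward values are illustrative assumptions, not the article's setup:

```python
import random
from collections import defaultdict

class BehaviorQLearner:
    def __init__(self, behaviors, alpha=0.1, epsilon=0.1):
        self.q = defaultdict(float)   # (state, behavior) -> learned value
        self.behaviors = behaviors
        self.alpha = alpha            # the learning rate α in Equation 5
        self.epsilon = epsilon        # exploration rate (assumption)

    def choose(self, state):
        """Epsilon-greedy behavior selection for the current state."""
        if random.random() < self.epsilon:
            return random.choice(self.behaviors)
        return max(self.behaviors, key=lambda b: self.q[(state, b)])

    def update(self, state, behavior, reward):
        """Equation 5: Q(s, b) <- Q(s, b) + alpha * r."""
        self.q[(state, behavior)] += self.alpha * reward

# Two rewarded executions of ("up",) in state "s0" make it the greedy choice.
learner = BehaviorQLearner([("up",), ("down",)], alpha=0.5, epsilon=0.0)
learner.update("s0", ("up",), reward=1.0)
learner.update("s0", ("up",), reward=1.0)
```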

Experiments
In our experiments, a cleaning robot interacted with a human subject to learn how to clean a house, according to the method presented earlier.

Experiment Overview
The main goal of the human subject and the cleaning robot was to clean a house together. The human subject needed to move through every space in the house so the cleaning robot could learn its path. The robot then cleaned the house to the best of its ability according to the path, while the human subject executed tasks that the cleaning robot could not, such as moving items or cleaning the tops of objects.

The cleaning robot therefore had three goals:

• try to find out if the subject was moving;
• travel to the subject; and
• follow the subject and cooperate during cleaning.

When the subject stopped moving, the cleaning robot went back to its recharging station, after which it had to try to find the subject again. We created a virtual human for the subject and a virtual learner for the cleaning robot.
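The goal cycle above (find the subject, travel to the subject, follow and cooperate, return to the station when the subject stops) can be sketched as a small finite-state machine. The event and state names are assumptions for the sketch, not from the article:

```python
# Transition table: (current goal, event) -> next goal.
TRANSITIONS = {
    ("find_subject", "subject_moving"): "travel_to_subject",
    ("travel_to_subject", "reached_subject"): "follow_and_cooperate",
    ("follow_and_cooperate", "subject_stopped"): "return_to_station",
    ("return_to_station", "docked"): "find_subject",  # try to find the subject again
}

class GoalFSM:
    def __init__(self):
        self.goal = "find_subject"

    def on_event(self, event):
        """Advance the current goal; unknown events leave the goal unchanged."""
        self.goal = TRANSITIONS.get((self.goal, event), self.goal)
        return self.goal

fsm = GoalFSM()
fsm.on_event("subject_moving")
fsm.on_event("reached_subject")
```

The virtual learner would consult the FSM's current goal when selecting which learned behaviors to execute.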

Human-Modeling Experiment
The virtual human moved in four directions in the virtual environment; we also defined a "stand still" action using one movement. We defined the state by x and y, which denoted locations in the virtual house.

To learn how to travel throughout the virtual house, the human subject controlled the virtual human (see Figure 2a). Equation 6 shows the calculation of the Bayesian probability after the human subject decided on a controlled move. The decision on action a_t took into consideration the previous state s_{t−1}.

P(a_t = a | s_t = s, s_{t−1} = s, g_t = g). (6)

As Figure 2b shows, when the virtual human entered the same state s_t, the travel direction could be classified by the previous state s_{t−1}.

Task-Learning Experiment
The virtual human executed one action every second. The cleaning

Figure 2. Actions over the entire space of the virtual house. (a) The human subject. (b) The human model. The ellipses show where the human subject taught the virtual human to move in two directions.



robot's behavior consisted of up to three actions. To generate actions, the movement filter transferred 49 movements to the action generator, which generated 49 actions. The action collector then collected the actions.

Next, the behavior generator generated 42 behaviors and eliminated duplicates. The behavior selector designated 15 behaviors by considering the number of collaborative-learning states, as shown in Table 1. When applying Equation 4, we calculated the difference between two movements to be 1 for each 90-degree change in direction, and we set the single weight value w_1 to 1.

Collaborative Learning Experiment
We employed a finite-state machine to select three goals in turn for the virtual learner, which learned through Q-learning by executing the generated behaviors. Figure 3 shows the learning results of the virtual learner when it found the virtual human's location over 30 episodes; when following the virtual human over 50,000 episodes; and when returning to the recharging station over 50 episodes.

To apply the method to more complex environments with diverse humans and agents, as in real life, we need to model multiple humans for each task to reflect the need to cooperate with different people. Behavior learning and collaborative learning are then required using each human model. Once all the learning is completed, the robots and software agents each select a task and execute behaviors depending on that task.

References
1. C. Allen, Watch What I Do: Programming by Demonstration, MIT Press, 1993.
2. J. Boucher and P.F. Dominey, "Programming by Cooperation: Perceptual-Motor Sequence Learning via Human-Robot Interaction," Proc. 2006 IEEE-RAS Int'l Conf. Humanoid Robots (Humanoids 06), IEEE CS, 2006, pp. 222–227.
3. T. Yamaguchi et al., "Propagating Learned Behaviors from a Virtual Agent to a Physical Robot in Reinforcement Learning," Proc. IEEE Int'l Conf. Evolutionary Computation (ECTA 96), IEEE CS, 1996, pp. 855–859.
4. C. Thurau, T. Paczian, and C. Bauckhage, "Is Bayesian Imitation Learning the Route to Believable Gamebots," Proc. GAME-ON North America, Eurosis-ETI, 2005, pp. 3–9.
5. Y. Sung and K. Cho, "A Sequential Action Generation Framework for Intelligent Agents in Cyberspace," J. Internet Technology, vol. 12, no. 6, 2011, pp. 971–980.
6. C.J.C.H. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, nos. 3–4, 1992, pp. 279–292.
7. M. Johnson et al., "Beyond Cooperative Robotics: The Central Role of Interdependence in Coactive Design," IEEE Intelligent Systems, vol. 26, no. 3, 2011, pp. 81–88.

Table 1. Generated behaviors.

Seq. Behavior | Seq. Behavior | Seq. Behavior | Seq. Behavior
1 ↑ | 5 ← | 9 ↓↓↓ | 13 ↓
2 ←←← | 6 →→ | 10 ↓↓→ | 14 →
3 ↓← | 7 ↑↑← | 11 ↓→→ | 15 ↑↑
4 →↓↓ | 8 ↑←↓ | 12 ←↓→ |

Figure 3. Paths of the virtual human and the virtual learner. (a) The path to the virtual human. (b) The path to the end position. (c) The path to the recharging station. The virtual learner selects a goal using a finite-state machine and learns how to execute behaviors according to the selected goal.


The Authors
Yunsick Sung is a postdoctoral researcher at the University of Florida. His research interests include human-robot interaction learning, human action representation, programming by demonstration, ubiquitous computing, and reinforcement learning. He has a DEng in game engineering from Dongguk University. Contact him at sung@dongguk.edu.

Kyungeun Cho is a professor in the department of multimedia engineering at Dongguk University. Her research interests include intelligence in robots and virtual characters and real-time computer graphics technologies. She has a DEng in computer engineering from Dongguk University. Contact her at [email protected].

Selected CS articles and columns are also available for free at http://ComputingNow.computer.org.

