Robot Motor Skill Coordination with EM-based Reinforcement Learning

DESCRIPTION

A Barrett WAM robot learns to flip pancakes by reinforcement learning. The motion is encoded in a mixture of basis force fields through an extension of Dynamic Movement Primitives (DMP) that represents the synergies across the different variables through stiffness matrices. An inverse dynamics controller with variable stiffness is used for reproduction. The skill is first demonstrated via kinesthetic teaching and then refined by the Policy learning by Weighting Exploration with the Returns (PoWER) algorithm. After 50 trials, the robot learns that the first part of the task requires a stiff behavior to throw the pancake in the air, while the second part requires the hand to be compliant in order to catch the pancake without having it bounce off the pan.

Robot Motor Skill Coordination with EM-based Reinforcement Learning

Italian Institute of Technology, Advanced Robotics Dept.

http://www.iit.it

Petar Kormushev, Sylvain Calinon, Darwin G. Caldwell

IROS 2010, October 20, 2010

Motivation

• How to learn complex motor skills which also require variable stiffness?

• How to demonstrate the required stiffness/compliance?

• How to teach highly-dynamic tasks?

Background

• Learning adaptive stiffness by extracting variability and correlation information from multiple demonstrations (Sylvain Calinon et al., IROS 2010)

Robot Motor Skill Learning

[Diagram] Learning pipeline: demonstration by human (motion capture or kinesthetic teaching) → encoding the skill (imitation learning, shared representation) → refining the skill (reinforcement learning) → reproduction.

Skill representation (encoding)

[Diagram] Skill encodings placed along a time-dependent ↔ time-independent axis: trajectory-based, via-points, DMP, GMM/GMR, DS-based.

Dynamic Movement Primitives

[Figure] A DMP encodes the demonstrated trajectory as a sequence of attractors.

$$\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t)\left[\kappa^{P}\,(\mu_i^{X} - x) - \kappa^{V}\,\dot{x}\right]$$

Ijspeert, Nakanishi, Schaal, IROS 2001
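A minimal numerical sketch of this weighted-attractor formulation (the Gaussian time basis, gains, and attractor points below are illustrative placeholders, not the values used on the robot):

```python
import numpy as np

def basis_activations(t, centers_t, width=0.05):
    """Normalized Gaussian activations h_i(t) over time (illustrative basis)."""
    h = np.exp(-((t - centers_t) ** 2) / (2.0 * width ** 2))
    return h / (h.sum() + 1e-12)

def dmp_acceleration(x, xd, t, mu_x, centers_t, kP=100.0, kV=20.0):
    """Desired acceleration: weighted sum of K attractors with scalar gains."""
    h = basis_activations(t, centers_t)
    return sum(h[i] * (kP * (mu_x[i] - x) - kV * xd) for i in range(len(mu_x)))

# Illustrative rollout by Euler integration of the desired acceleration
centers_t = np.linspace(0.0, 1.0, 5)                 # basis centers in time
mu_x = [np.array([c, 1.0 - c]) for c in centers_t]   # attractor points (2-D)
x, xd, dt = np.zeros(2), np.zeros(2), 0.01
for step in range(100):
    xdd = dmp_acceleration(x, xd, step * dt, mu_x, centers_t)
    xd = xd + xdd * dt
    x = x + xd * dt
```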

Extended DMP to include coordination

Extended formulation, where each primitive has a full coordination matrix (full stiffness matrix) $K_i^{P}$:

$$\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t)\left[K_i^{P}\,(\mu_i^{X} - x) - \kappa^{V}\,\dot{x}\right]$$

Original formulation, with a scalar stiffness gain $\kappa^{P}$:

$$\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t)\left[\kappa^{P}\,(\mu_i^{X} - x) - \kappa^{V}\,\dot{x}\right]$$

Advantages:
• capture correlations between the different motion variables
• reduce the number of primitives

Proposal: use Reinforcement learning to learn the coordination matrices
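A sketch of the extension: the scalar gain $\kappa^{P}$ is replaced by a full coordination matrix per primitive (names and shapes are assumptions for illustration):

```python
import numpy as np

def extended_dmp_acceleration(x, xd, h, mu_x, K_p, kV=20.0):
    """Weighted-attractor acceleration where each primitive i carries a full
    stiffness/coordination matrix K_p[i] instead of a scalar gain.
    h    : basis activations h_i(t) at the current time, shape (K,)
    mu_x : list of attractor centers mu_i^X
    K_p  : list of full DxD coordination matrices K_i^P
    """
    acc = np.zeros_like(x)
    for i in range(len(mu_x)):
        # Off-diagonal entries of K_p[i] couple the motion variables,
        # encoding coordination (synergies) across dimensions.
        acc += h[i] * (K_p[i] @ (mu_x[i] - x) - kV * xd)
    return acc
```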

Example: Reaching task with obstacle

[Figure] Reproduction using full coordination matrices vs. using diagonal matrices.

Reward function:

$$r(t) = \begin{cases} \frac{w_1}{T}\, e^{-\|x_t^{R} - x_t^{D}\|}, & t \neq t_e \\ w_2\, e^{-\|x_t^{R} - x^{G}\|}, & t = t_e \end{cases}$$

Expected returns of the two reproductions: 0.61 and 0.73.
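A sketch of this piecewise return computation (variable names such as x_robot, x_demo, x_goal and the weights are assumptions for illustration):

```python
import numpy as np

def reaching_return(x_robot, x_demo, x_goal, w1=0.5, w2=0.5):
    """Sum of rewards: track the demonstration at every step (t != t_e),
    reach the goal at the final step (t = t_e)."""
    T = len(x_robot)
    r = np.empty(T)
    for t in range(T - 1):
        r[t] = (w1 / T) * np.exp(-np.linalg.norm(x_robot[t] - x_demo[t]))
    r[T - 1] = w2 * np.exp(-np.linalg.norm(x_robot[-1] - x_goal))
    return r.sum()
```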

EM-based Reinforcement learning (RL)

• PoWER algorithm - Policy learning by Weighting Exploration with the Returns

• Advantages over policy-gradient-based RL:
  – no need for a learning rate
  – can use importance sampling
  – a single rollout is enough to update the policy

Jens Kober and Jan Peters, NIPS 2009

RL implementation

• Policy parameters $\theta$:
  – full coordination matrices $K_i^{P}$
  – attractor vectors $\mu_i^{X}$

• Policy update rule:

$$\theta_{n+1} = \theta_n + \frac{\left\langle (\theta_k - \theta_n)\, R(\tau_k) \right\rangle_{w(\tau_k)}}{\left\langle R(\tau_k) \right\rangle_{w(\tau_k)}}$$

• Importance sampling: uses the best $\sigma$ rollouts so far

$$\left\langle f(\theta_k, \tau_k) \right\rangle_{w(\tau_k)} = \sum_{k=1}^{\sigma} f\left(\theta_{\mathrm{ind}(k)}, \tau_{\mathrm{ind}(k)}\right)$$
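A compact sketch of the update rule and importance sampling above, where the policy parameters θ stack the flattened coordination matrices and attractor vectors (array shapes and the exploration step are illustrative, not the exact robot implementation):

```python
import numpy as np

def power_update(theta_n, rollout_thetas, rollout_returns, sigma=5):
    """PoWER-style update: reward-weighted average of parameter offsets,
    computed over the sigma best rollouts seen so far (importance sampling)."""
    idx = np.argsort(rollout_returns)[-sigma:]        # indices of best rollouts
    thetas = np.asarray(rollout_thetas)[idx]          # theta_k
    returns = np.asarray(rollout_returns)[idx]        # R(tau_k)
    num = ((thetas - theta_n) * returns[:, None]).sum(axis=0)
    return theta_n + num / (returns.sum() + 1e-12)

# One learning iteration (illustrative):
#   theta_k = theta_n + exploration_noise()   # perturb policy parameters
#   R_k     = execute_rollout(theta_k)        # return of the rollout
#   theta_n = power_update(theta_n, all_thetas, all_returns)
```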

Pancake flipping: Experimental setup

Frying pan mounted on the end-effector

Artificial pancake with 4 passive markers (more robust to occlusions)

Barrett WAM 7-DOF robot

Evaluation: Tracking of the pancake

NaturalPoint OptiTrack motion capture system: 12 cameras, 100 Hz camera frame rate, 40 Hz real-time capture

Reward function

• Reward function (terms correspond to orientation, position, and height):

$$r(t_f) = w_1\left[\frac{\arccos(v_0 \cdot v_{t_f})}{\pi}\right] + w_2\, e^{-\|x^{p} - x^{F}\|} + w_3\, x^{M}_{3}$$

• Cumulative return of a rollout:

$$R(\tau) = \sum_{t=1}^{T} r(t)$$
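A sketch of this reward, assuming v0 and v_tf are the pancake's initial and final orientation vectors, x_p its position, x_F the frying-pan center, and max_height the maximum height reached (all names and weights are illustrative):

```python
import numpy as np

def pancake_reward(v0, v_tf, x_p, x_F, max_height, w=(0.5, 0.3, 0.2)):
    """Reward combining flip orientation, landing position, and height terms."""
    w1, w2, w3 = w
    orientation = np.arccos(np.clip(np.dot(v0, v_tf), -1.0, 1.0)) / np.pi
    position = np.exp(-np.linalg.norm(x_p - x_F))
    return w1 * orientation + w2 * position + w3 * max_height

# Cumulative return of a rollout: R(tau) = sum over t of r(t).
```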

Kinesthetic demonstration of the task

Learning by trial and error

Finally learned skill

Motion capture to evaluate rollouts

Captured pancake trajectory

90° flip 180° flip

Performance

Reproduction control strategy

$$M(q)\,\ddot{q} + C(\dot{q}, q)\,\dot{q} + g(q) = \tau_G + \tau_T$$

Gravity compensation:

$$\tau_G = \sum_{i=1}^{L} J_{G,i}^{T}\, F_{G,i}$$

Task execution:

$$\tau_T = J_{T}^{T}\, F_T$$
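A sketch of this torque composition (jacobians and forces are assumed inputs; F_T would be produced by the variable-stiffness tracking law learned above):

```python
import numpy as np

def control_torques(link_jacobians, link_gravity_forces, task_jacobian, task_force):
    """Commanded joint torques = gravity compensation + task execution."""
    # Gravity compensation: tau_G = sum_i J_{G,i}^T F_{G,i}
    tau_G = sum(J.T @ F for J, F in zip(link_jacobians, link_gravity_forces))
    # Task execution: tau_T = J_T^T F_T
    tau_T = task_jacobian.T @ task_force
    return tau_G + tau_T
```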

Conclusion

• Combining imitation learning + RL to learn motor skills with variable stiffness
  – imitation used to initialize the policy
  – RL used to learn the coordination matrices
  – variable stiffness learned during reproduction

• Future work
  – other representations
  – other RL algorithms

Thanks for your attention!
