Robot Motor Skill Coordination with EM-based Reinforcement Learning
Italian Institute of Technology, Advanced Robotics Dept.
http://www.iit.it
Petar Kormushev, Sylvain Calinon, Darwin G. Caldwell
IROS 2010, October 20, 2010
Motivation
• How to learn complex motor skills which also require variable stiffness?
• How to demonstrate the required stiffness/compliance?
• How to teach highly-dynamic tasks?
Background
• Learning adaptive stiffness by extracting variability and correlation information from multiple demonstrations (Sylvain Calinon et al., IROS 2010)
Robot Motor Skill Learning
[Diagram: demonstration by human (motion capture or kinesthetic teaching) → encoding the skill (shared representation) → refining the skill → reproduction; imitation learning covers the demonstration and encoding stages, reinforcement learning the refinement]
Skill representation (encoding)
[Diagram: spectrum of skill encodings from time-dependent to time-independent — trajectory-based (via-points, DMP), GMM/GMR, DS-based]
Dynamic Movement Primitives
[Figure: a demonstrated trajectory encoded by a DMP as a sequence of attractors]

$$\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t)\left[\kappa^{P}\left(\mu_i^{X} - x\right) - \kappa^{V}\,\dot{x}\right]$$
Ijspeert, Nakanishi, Schaal, IROS 2001
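A minimal sketch of this mixture-of-attractors formulation (all names, and the Gaussian shape of the activations h_i(t), are illustrative assumptions):

```python
import numpy as np

def dmp_acceleration(x, xdot, t, mu, kP, kV, centers, width):
    """Desired acceleration from a mixture of K attractors.

    x, xdot : current position and velocity, shape (D,)
    mu      : attractor centers mu_i^X, shape (K, D)
    kP, kV  : scalar stiffness and damping gains
    centers : activation centers in time for h_i(t), shape (K,)
    width   : common width of the activation functions
    """
    # Gaussian activations h_i(t), normalized to sum to 1
    h = np.exp(-0.5 * ((t - centers) / width) ** 2)
    h /= h.sum()
    # Weighted sum of spring-damper terms toward each attractor
    return sum(h[i] * (kP * (mu[i] - x) - kV * xdot) for i in range(len(h)))
```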
Extended DMP to include coordination
$$\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t)\left[K_i^{P}\left(\mu_i^{X} - x\right) - \kappa^{V}\,\dot{x}\right]$$

Coordination matrix $K_i^P$ (full stiffness matrix), replacing the scalar stiffness gain $\kappa^P$ of the original DMP:

$$\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t)\left[\kappa^{P}\left(\mu_i^{X} - x\right) - \kappa^{V}\,\dot{x}\right]$$

Advantages:
– capture correlations between the different motion variables
– reduce the number of primitives
Proposal: use reinforcement learning to learn the coordination matrices
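A sketch of the extended formulation, where the scalar gain becomes a full matrix per primitive (same illustrative assumptions as above):

```python
import numpy as np

def coordinated_dmp_acceleration(x, xdot, t, mu, KP, kV, centers, width):
    """Extended DMP: the scalar gain kappa^P is replaced by a full
    coordination (stiffness) matrix K_i^P for each primitive.

    KP : full stiffness matrices, shape (K, D, D)
    """
    h = np.exp(-0.5 * ((t - centers) / width) ** 2)
    h /= h.sum()
    # Off-diagonal entries of KP[i] couple the motion variables, so one
    # primitive can encode correlated multi-dimensional motion
    return sum(h[i] * (KP[i] @ (mu[i] - x) - kV * xdot) for i in range(len(h)))
```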
Example: Reaching task with obstacle
[Figure: reproduced reaching trajectories using full coordination matrices and using diagonal matrices; expected returns 0.61 and 0.73]

Reward function, where $x_t^R$ is the robot position at time $t$, $x_t^D$ the demonstrated position, $x^G$ the goal, and $t_e$ the end time:

$$r(t) = \begin{cases} \dfrac{w_1}{T}\, e^{-\left\|x_t^{R} - x_t^{D}\right\|}, & t \neq t_e \\[6pt] w_2\, e^{-\left\|x_t^{R} - x^{G}\right\|}, & t = t_e \end{cases}$$
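A sketch of this piecewise reward (the division of w1 by the trajectory length T follows the formula above; names are assumptions):

```python
import numpy as np

def reaching_reward(x_robot, x_demo, x_goal, t, t_end, T, w1=0.5, w2=0.5):
    """Piecewise reward for the reaching task with obstacle."""
    if t != t_end:
        # Intermediate steps: stay close to the demonstrated path
        return (w1 / T) * np.exp(-np.linalg.norm(x_robot - x_demo))
    # Final step: land on the goal
    return w2 * np.exp(-np.linalg.norm(x_robot - x_goal))
```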
EM-based Reinforcement learning (RL)
• PoWER algorithm - Policy learning by Weighting Exploration with the Returns
• Advantages over policy-gradient-based RL:
– no need for a learning rate
– can use importance sampling
– a single rollout is enough to update the policy
Jens Kober and Jan Peters, NIPS 2009
RL implementation
• Policy parameters $\theta$:
– full coordination matrices $K_i^P$
– attractor vectors $\mu_i^X$
• Policy update rule:

$$\theta_{n+1} = \theta_n + \frac{\left\langle (\theta_k - \theta_n)\, R(\tau_k) \right\rangle_{w(\tau_k)}}{\left\langle R(\tau_k) \right\rangle_{w(\tau_k)}}$$

• Importance sampling: uses the best $\sigma$ rollouts so far

$$\left\langle f(\theta_k, \tau_k) \right\rangle_{w(\tau_k)} = \sum_{k=1}^{\sigma} f\!\left(\theta_{\mathrm{ind}(k)}, \tau_{\mathrm{ind}(k)}\right)$$
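A compact sketch of this update with the importance sampler (how rollouts are generated, e.g. by Gaussian perturbation of theta, is an assumption here; see Kober and Peters for the exact exploration strategy):

```python
import numpy as np

def power_update(theta_n, rollouts, sigma=5):
    """One PoWER-style policy update.

    theta_n  : current policy parameter vector (np.ndarray)
    rollouts : list of (theta_k, R_k) pairs gathered so far
    sigma    : number of best rollouts kept (importance sampling)
    """
    # Importance sampling: keep only the sigma best rollouts so far
    best = sorted(rollouts, key=lambda kr: kr[1], reverse=True)[:sigma]
    # Reward-weighted average of the parameter offsets
    num = sum(R_k * (theta_k - theta_n) for theta_k, R_k in best)
    den = sum(R_k for _, R_k in best)
    return theta_n + num / den
```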
Pancake flipping: Experimental setup
Frying pan mounted on the end-effector
Artificial pancake with 4 passive markers
(more robust to occlusions)
Barrett WAM 7-DOF robot
Evaluation: Tracking of the pancake
NaturalPoint OptiTrack motion capture system
12 cameras, 100 Hz camera frame rate, 40 Hz real-time capture
Reward function
• Reward function, with terms rewarding flip orientation, landing position, and height:

$$r(t_f) = w_1\left[\frac{\arccos\!\left(v_0 \cdot v_{t_f}\right)}{\pi}\right] + w_2\, e^{-\left\|x^{p} - x^{F}\right\|} + w_3\, x_3^{M}$$

• Cumulative return of a rollout:

$$R(\tau) = \sum_{t=1}^{T} r(t)$$
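A sketch of the terminal reward, assuming v_0 and v_{t_f} are unit normal vectors of the pancake before and after the flip, x^p the pancake position, x^F a target position in the pan, and x_3^M the maximum height reached:

```python
import numpy as np

def flip_reward(v0, v_tf, x_pancake, x_target, max_height,
                w1=1.0, w2=1.0, w3=1.0):
    """Terminal reward combining flip orientation, landing position,
    and maximum height reached (weights are placeholders)."""
    # Orientation: angle between pancake normals, normalized by pi
    orientation = np.arccos(np.clip(np.dot(v0, v_tf), -1.0, 1.0)) / np.pi
    # Position: exponential penalty on distance from the target point
    position = np.exp(-np.linalg.norm(x_pancake - x_target))
    return w1 * orientation + w2 * position + w3 * max_height

def rollout_return(rewards):
    """Cumulative return R(tau) = sum over t of r(t)."""
    return sum(rewards)
```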
Kinesthetic demonstration of the task
Learning by trial and error
Final learned skill
Motion capture to evaluate rollouts
Captured pancake trajectory
[Plots: captured pancake trajectories for a 90° flip and a 180° flip]
Performance
Reproduction control strategy

$$M(q)\,\ddot{q} + C(\dot{q}, q)\,\dot{q} + g(q) = \tau_G + \tau_T$$

Gravity compensation:

$$\tau_G = \sum_{i=1}^{L} J_{G,i}^{T}\, F_{G,i}$$

Task execution:

$$\tau_T = J_T^{T}\, F_T$$
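A sketch of this strategy on a torque-controlled arm such as the WAM (all variable names are assumptions):

```python
import numpy as np

def control_torques(J_task, F_task, link_jacobians, link_gravity_wrenches):
    """Commanded torques tau = tau_G + tau_T.

    J_task                : task Jacobian J_T
    F_task                : desired end-effector wrench F_T
    link_jacobians        : Jacobians J_{G,i} for the L link frames
    link_gravity_wrenches : gravity wrenches F_{G,i} of the L links
    """
    # Gravity compensation: sum of per-link contributions
    tau_G = sum(J.T @ F for J, F in zip(link_jacobians,
                                        link_gravity_wrenches))
    # Task execution: map the desired wrench into joint torques
    tau_T = J_task.T @ F_task
    return tau_G + tau_T
```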
Conclusion
• Combining imitation learning + RL to learn motor skills with variable stiffness
– imitation used to initialize the policy
– RL used to learn the coordination matrices
– variable stiffness learned during reproduction
• Future work
– other representations
– other RL algorithms
Thanks for your attention!