A Barrett WAM robot learns to flip pancakes by reinforcement learning. The motion is encoded in a mixture of basis force fields through an extension of Dynamic Movement Primitives (DMP) that represents the synergies across the different variables through stiffness matrices. An inverse dynamics controller with variable stiffness is used for reproduction. The skill is first demonstrated via kinesthetic teaching, and then refined by the Policy learning by Weighting Exploration with the Returns (PoWER) algorithm. After 50 trials, the robot learns that the first part of the task requires a stiff behavior to throw the pancake in the air, while the second part requires the hand to be compliant in order to catch the pancake without it bouncing off the pan.
Robot Motor Skill Coordination withEM-based Reinforcement Learning
Italian Institute of Technology, Advanced Robotics Dept.
http://www.iit.it
Petar Kormushev, Sylvain Calinon, Darwin G. Caldwell
IROS 2010, October 20, 2010
Petar Kormushev, Italian Institute of Technology
Motivation
• How to learn complex motor skills which also require variable stiffness?
• How to demonstrate the required stiffness/compliance?
• How to teach highly-dynamic tasks?
Background
• Learning adaptive stiffness by extracting variability and correlation information from multiple demonstrations (Sylvain Calinon et al., IROS 2010)
Robot Motor Skill Learning
• Demonstration by human: imitation learning via motion capture or kinesthetic teaching
• Encoding the skill: a shared representation used by both imitation and reinforcement learning
• Refining the skill: reinforcement learning
• Reproduction
Skill representation (encoding)
• Time-dependent: trajectory-based, via-points, DMP
• Time-independent: GMM/GMR, DS-based
Dynamic Movement Primitives
DMP
A demonstrated trajectory is encoded as a sequence of attractors.
\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t) \left[ \kappa^P (\mu_i^X - x) - \kappa^V \dot{x} \right]
Ijspeert, Nakanishi, Schaal, IROS 2001
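The mixture-of-attractors equation above can be sketched in a few lines; the function below is an illustrative minimal implementation (not the authors' code), assuming normalised Gaussian basis activations h_i(t) and scalar gains kappa_p, kappa_v:

```python
import numpy as np

def dmp_accel(x, xdot, t, mu, kappa_p, kappa_v, centers, width):
    """Desired acceleration from a mixture of K proportional-derivative
    attractors: xddot = sum_i h_i(t) [kappa_p (mu_i - x) - kappa_v xdot]."""
    # Normalised Gaussian basis activations over the time/phase variable
    h = np.exp(-0.5 * ((t - centers) / width) ** 2)
    h /= h.sum()
    # Each attractor i pulls toward its centre mu_i with scalar stiffness
    return sum(h[i] * (kappa_p * (mu[i] - x) - kappa_v * xdot)
               for i in range(len(mu)))
```

Integrating this acceleration (e.g. with Euler steps) drives the system smoothly through the sequence of attractors.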
Extended DMP to include coordination
\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t) \left[ K_i^P (\mu_i^X - x) - \kappa^V \dot{x} \right]

Here K_i^P is a coordination matrix (full stiffness matrix), replacing the scalar stiffness gain \kappa^P of the original formulation:

\hat{\ddot{x}} = \sum_{i=1}^{K} h_i(t) \left[ \kappa^P (\mu_i^X - x) - \kappa^V \dot{x} \right]

Advantages:
• captures correlations between the different motion variables
• reduces the number of primitives
Proposal: use Reinforcement learning to learn the coordination matrices
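Relative to a scalar-gain DMP, the extension only swaps the scalar \kappa^P for a full matrix per primitive; a hypothetical sketch (names illustrative):

```python
import numpy as np

def dmp_accel_coordinated(x, xdot, t, mu, K_p, kappa_v, centers, width):
    """Extended DMP: each primitive i carries a full stiffness matrix
    K_p[i], so a displacement along one variable can generate force along
    another (coordination across motion variables)."""
    # Normalised Gaussian basis activations over the time/phase variable
    h = np.exp(-0.5 * ((t - centers) / width) ** 2)
    h /= h.sum()
    # Matrix-vector product K_p[i] @ (.) replaces the scalar gain
    return sum(h[i] * (K_p[i] @ (mu[i] - x) - kappa_v * xdot)
               for i in range(len(mu)))
```

The off-diagonal entries of each K_p[i] encode the synergies; with diagonal matrices this degenerates to one independent gain per dimension.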
Example: Reaching task with obstacle
Using full coordination matrices vs. using diagonal matrices
Reward function:

r(t) = \begin{cases} \frac{w_1}{T}\, e^{-\| x_t^R - x_t^D \|}, & t \neq t_e \\ w_2\, e^{-\| x_t^R - x^G \|}, & t = t_e \end{cases}

Expected returns: 0.61 and 0.73 for the two variants.
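The piecewise reward can be transcribed directly (variable names hypothetical: x_robot tracks the demonstrated point x_demo at intermediate steps, and the goal x_goal at the final step t_e):

```python
import numpy as np

def reaching_reward(x_robot, x_demo, x_goal, t, t_end, w1, w2, T):
    """Piecewise reward: track the demonstration during the motion,
    reach the goal at the final time step."""
    if t != t_end:
        # Intermediate steps: stay close to the demonstrated trajectory
        return (w1 / T) * np.exp(-np.linalg.norm(x_robot - x_demo))
    # Final step: reward proximity to the goal instead
    return w2 * np.exp(-np.linalg.norm(x_robot - x_goal))
```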
EM-based Reinforcement learning (RL)
• PoWER algorithm - Policy learning by Weighting Exploration with the Returns
• Advantages over policy-gradient based RL:
  – no need for a learning rate
  – can use importance sampling
  – a single rollout is enough to update the policy
Jens Kober and Jan Peters, NIPS 2009
RL implementation

• Policy parameters:
  – full coordination matrices K_i^P
  – attractor vectors \mu_i^X
• Policy update rule:

\theta_{n+1} = \theta_n + \frac{\left\langle (\theta_k - \theta_n)\, R(\tau_k) \right\rangle_{w(\tau_k)}}{\left\langle R(\tau_k) \right\rangle_{w(\tau_k)}}

• Importance sampling: uses the best \sigma rollouts so far

\left\langle f(\theta_k, \tau_k) \right\rangle_{w(\tau_k)} = \sum_{k=1}^{\sigma} f\big(\theta_{\mathrm{ind}(k)}, \tau_{\mathrm{ind}(k)}\big)
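The update and importance-sampling step combine into a compact EM-style loop; a sketch under the slide's assumptions (function and variable names illustrative):

```python
import numpy as np

def power_update(theta_n, rollouts, sigma=5):
    """One PoWER update: reward-weighted average of parameter offsets,
    computed over the best-sigma rollouts seen so far.

    rollouts: list of (theta_k, R_k) pairs, R_k the return of rollout k.
    """
    # Importance sampling: keep only the sigma highest-return rollouts
    best = sorted(rollouts, key=lambda r: r[1], reverse=True)[:sigma]
    numerator = sum((theta_k - theta_n) * R for theta_k, R in best)
    denominator = sum(R for _, R in best)
    return theta_n + numerator / denominator
```

Because the update is a reward-weighted average rather than a gradient step, no learning rate is needed, and even a single rollout yields a valid update.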
Pancake flipping: Experimental setup
Frying pan mounted on the end-effector
Artificial pancake with 4 passive markers (more robust to occlusions)
Barrett WAM 7-DOF robot
Evaluation: Tracking of the pancake
NaturalPoint OptiTrack motion capture system
12 cameras at 100 fps; 40 Hz real-time capture
Reward function
• Reward function:

r(t_f) = w_1 \left[ \frac{\arccos(v_0 \cdot v_{t_f})}{\pi} \right] + w_2\, e^{-\| x^p - x^F \|} + w_3\, x_3^M

(the three terms reward flip orientation, landing position, and height)

• Cumulative return of a rollout:

R(\tau) = \sum_{t=1}^{T} r(t)
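The terminal reward transcribes directly (names hypothetical: v0 and v_tf are the pancake's initial and final normal vectors, x_pancake its landing position, max_height the peak of x_3^M):

```python
import numpy as np

def flip_reward(v0, v_tf, x_pancake, x_pan_center, max_height, w1, w2, w3):
    """Terminal reward r(t_f): flip orientation + landing position + height."""
    # Orientation: angle between initial and final normals, scaled by pi,
    # so a full 180-degree flip scores 1
    orientation = np.arccos(np.clip(np.dot(v0, v_tf), -1.0, 1.0)) / np.pi
    # Position: land the pancake close to the pan
    position = np.exp(-np.linalg.norm(x_pancake - x_pan_center))
    return w1 * orientation + w2 * position + w3 * max_height
```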
Kinesthetic demonstration of the task
Learning by trial and error
Finally learned skill
Motion capture to evaluate rollouts
Captured pancake trajectory
90° flip vs. 180° flip
Performance
Reproduction control strategy

M(q)\ddot{q} + C(\dot{q}, q)\dot{q} + g(q) = \tau_G + \tau_T

Gravity compensation: \tau_G = \sum_{i=1}^{L} J_{G,i}^T F_{G,i}

Task execution: \tau_T = J_T^T F_T
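Given the link Jacobians and forces from the robot model, the torque superposition above is only a few lines; a sketch with illustrative names:

```python
import numpy as np

def reproduction_torques(J_task, F_task, link_jacobians, gravity_forces):
    """tau = tau_G + tau_T: gravity compensation plus task execution,
    both mapped to joint space through Jacobian transposes."""
    # tau_G = sum_i J_{G,i}^T F_{G,i}: cancel gravity link by link
    tau_g = sum(J.T @ F for J, F in zip(link_jacobians, gravity_forces))
    # tau_T = J_T^T F_T: realise the desired task-space force F_T
    tau_t = J_task.T @ F_task
    return tau_g + tau_t
```

Commanding only these torques (no joint-space position feedback) is what lets the learned stiffness matrices, expressed in task space, govern the compliance of the reproduction.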
Conclusion
• Combining imitation learning + RL to learn motor skills with variable stiffness:
  – imitation used to initialize the policy
  – RL used to learn the coordination matrices
  – variable stiffness learned and applied during reproduction
• Future work:
  – other representations
  – other RL algorithms
Thanks for your attention!