Apprenticeship Learning for Robotic Control
Pieter Abbeel, Stanford University
Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley
Motivation for apprenticeship learning
Outline
Preliminary: reinforcement learning.
Apprenticeship learning algorithms.
Experimental results on various robotic platforms.
Reinforcement learning (RL)
[Diagram: at each time step t = 0, 1, …, T−1, the system dynamics Psa map the current state st and chosen action at to the next state st+1; each visited state contributes a reward, accumulating R(s0) + R(s1) + … + R(sT).]
Example reward function: R(s) = -||s - s*||, penalizing distance from a target state s*.
Goal: pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)].
Solution: a policy, which specifies an action for each possible state at every time t = 0, 1, …, T.
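To make the objective concrete, here is a minimal Python sketch (not from the talk) that estimates the expected score of a policy by Monte Carlo rollouts; `pi`, `step`, and `R` are hypothetical stand-ins for the policy, the dynamics Psa, and the reward.

```python
import numpy as np

def expected_return(pi, step, R, s0, T, n_rollouts=1000, rng=None):
    """Estimate E[R(s0) + R(s1) + ... + R(sT)] for a given policy pi.

    pi(s, t)        -> action a_t   (the control policy)
    step(s, a, rng) -> next state   (a sample from the dynamics P_sa)
    R(s)            -> scalar reward
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    total = 0.0
    for _ in range(n_rollouts):
        s = s0
        ret = R(s)                     # reward at the initial state
        for t in range(T):
            s = step(s, pi(s, t), rng) # sample the next state
            ret += R(s)
        total += ret
    return total / n_rollouts

# Example reward from the slide: penalize distance from a target state s*.
s_star = np.zeros(2)
R = lambda s: -np.linalg.norm(s - s_star)
```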
Model-based reinforcement learning
Learn a dynamics model, run the RL algorithm in the resulting simulator, and read off a control policy.
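A rough sketch of this pipeline, under strong simplifying assumptions (linear dynamics, one-dimensional action, random search standing in for the RL algorithm); `fit_linear_model` and `random_policy_search` are illustrative names, and `expected_return` is the estimator from the previous sketch.

```python
import numpy as np

def fit_linear_model(S, A, S_next):
    """Least-squares fit of s' ~ [s a] W: a crude linear dynamics model."""
    X = np.hstack([S, A])
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return lambda s, a, rng: np.hstack([s, a]) @ W   # deterministic simulator

def random_policy_search(sim, R, s0, T, n_candidates=200, seed=0):
    """Stand-in RL step: random search over linear feedback gains, scored in sim."""
    rng = np.random.default_rng(seed)
    best_K, best_val = None, -np.inf
    for _ in range(n_candidates):
        K = rng.normal(size=(1, s0.shape[0]))   # candidate gain matrix
        pi = lambda s, t, K=K: K @ s            # linear feedback policy
        val = expected_return(pi, sim, R, s0, T, n_rollouts=1, rng=rng)
        if val > best_val:
            best_K, best_val = K, val
    return best_K
```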
Apprenticeship learning algorithms use a demonstration to help us find
a good dynamics model,
a good reward function,
a good control policy.
Reinforcement learning (RL)
[Diagram: a dynamics model Psa and a reward function R feed into the reinforcement learning algorithm, which solves maxπ E[R(s0) + … + R(sT) | π] and outputs a control policy.]
Apprenticeship learning for the dynamics model
[Diagram: the same RL pipeline, with the dynamics model Psa highlighted as the component to be learned from demonstrations.]
Motivating example: obtaining an accurate dynamics model Psa
A textbook model / specification is only a starting point; to get an accurate dynamics model Psa, collect flight data and learn the model from the data.
How do we fly the helicopter for data collection? How do we ensure that the entire flight envelope is covered by the data collection process?
Learning the dynamical model
State of the art: the E3 algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
[Flowchart: have a good model of the dynamics? If NO, "explore"; if YES, "exploit".]
Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?
[ICML 2005]
Apprenticeship learning of the model
[Diagram: the learning loop.]
1. Teacher: the human pilot flies the helicopter; record the trajectory (a1, s1, a2, s2, a3, s3, …).
2. Learn Psa from all data collected so far.
3. Run reinforcement learning, maxπ E[R(s0) + … + R(sT) | π], with the reward function R and the learned dynamics model Psa to obtain a control policy.
4. Autonomous flight: execute the policy, record the new trajectory (a1, s1, a2, s2, a3, s3, …), and return to step 2.
No explicit exploration: the system always tries to fly as well as possible.
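A compact sketch of this loop, reusing the hypothetical helpers from the earlier sketches; `fly`, which executes a policy on the real system and returns the logged transitions, is likewise an assumed stand-in.

```python
import numpy as np

def apprenticeship_model_learning(teacher_data, fly, R, s0, T, n_iters=10):
    """Alternate: fit P_sa to all data so far, optimize a policy, fly it, repeat.

    teacher_data: list of (s, a, s_next) transitions from the human pilot.
    fly(policy):  execute the policy on the real system; return its transitions.
    """
    data = list(teacher_data)                    # start from the demonstration
    policy = None
    for _ in range(n_iters):
        S, A, Sn = (np.array(x) for x in zip(*data))
        sim = fit_linear_model(S, A, Sn)         # learn P_sa from all data
        K = random_policy_search(sim, R, s0, T)  # RL step in the simulator
        policy = lambda s, t, K=K: K @ s
        data += fly(policy)                      # autonomous flight adds data
    return policy
```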
Theorem. Given a polynomial number of teacher demonstrations, after a polynomial number of trials, with probability at least 1 − δ,
E[sum of rewards | policy returned by algorithm] ≥ E[sum of rewards | teacher's policy] − ε.
Here, polynomial is with respect to 1/ε, 1/δ, the horizon T, the maximum reward R, and the size of the state space.
Learning the dynamics model
Details of the algorithm for learning the dynamics model:
Exploiting structure from physics.
Lagged learning criterion.
[NIPS 2005, 2006]
Helicopter flight results
First high-speed autonomous funnels. Speed: 5 m/s; nominal pitch angle: 30°.
Autonomous nose-in funnel
Accuracy
Autonomous tail-in funnel
Key points
Unlike exploration methods, our algorithm concentrates on the task of interest.
Bootstrapping off an initial teacher demonstration is sufficient to perform the task as well as the teacher.
Apprenticeship learning: reward
[Diagram: the same RL pipeline, with the reward function R highlighted as the component to be learned from demonstrations.]
Example task: driving
Related work
Previous work: learn to predict the teacher's actions as a function of state. E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; … This assumes "policy simplicity."
Our approach: assumes "reward simplicity" and is based on inverse reinforcement learning (Ng & Russell, 2000). Similar work since: Ratliff et al., 2006, 2007.
Inverse reinforcement learning
Find R such that R is consistent with the teacher's policy π* being optimal.
Find R such that: E[R(s0) + … + R(sT) | π*] ≥ E[R(s0) + … + R(sT) | π] for all policies π.
With a linear parameterization R(s) = wᵀφ(s), find w such that: wᵀμ(π*) ≥ wᵀμ(π) for all policies π, where μ(π) = E[φ(s0) + … + φ(sT) | π] are the feature expectations of π.
These are linear constraints in w; with a quadratic objective this is a QP, but with a very large number of constraints.
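A toy instance of this QP in Python with the cvxpy solver; the feature-expectation vectors `mu_star` and `mu_others` are made-up numbers, and in practice the constraints come from the policies generated by the algorithm on the next slide.

```python
import numpy as np
import cvxpy as cp

# Hypothetical feature expectations mu(pi) = E[phi(s0) + ... + phi(sT) | pi].
mu_star = np.array([4.0, 1.0, 0.5])        # teacher's feature expectations
mu_others = [np.array([3.0, 2.0, 1.0]),    # feature expectations of a few
             np.array([2.5, 0.5, 2.0])]    # competing candidate policies

# Linear constraints in w, quadratic objective  =>  a QP.
w = cp.Variable(3)
constraints = [w @ (mu_star - mu) >= 1 for mu in mu_others]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()
print("reward weights:", w.value)
```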
Algorithm
For i = 1, 2, …
Inverse RL step: estimate a reward Rw by solving the QP, using the constraints generated by the policies π1, …, πi−1 found so far.
RL step (= constraint generation): compute the optimal policy πi for the estimated reward Rw.
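A schematic of the loop, written with the inverse RL step in its max-margin form; `rl_solver` and `feat_exp` are hypothetical stand-ins for an RL solver and a rollout-based feature-expectation estimator.

```python
import numpy as np
import cvxpy as cp

def apprenticeship_irl(mu_star, rl_solver, feat_exp, pi0, eps=0.1, max_iters=50):
    """Alternate inverse-RL (fit w) and RL (constraint generation).

    rl_solver(w) -> optimal policy for reward R_w(s) = w . phi(s)   (assumed)
    feat_exp(pi) -> feature expectations mu(pi)                     (assumed)
    """
    policies, mus = [pi0], [feat_exp(pi0)]
    n = mu_star.shape[0]
    for _ in range(max_iters):
        # Inverse RL step: max-margin reward separating teacher from policies so far.
        w, t = cp.Variable(n), cp.Variable()
        cons = [w @ (mu_star - mu) >= t for mu in mus] + [cp.norm(w, 2) <= 1]
        cp.Problem(cp.Maximize(t), cons).solve()
        if t.value <= eps:        # no reward makes the teacher look much better: done
            return policies[-1]
        # RL step (= constraint generation): best policy for the estimated reward.
        pi = rl_solver(w.value)
        policies.append(pi)
        mus.append(feat_exp(pi))
    return policies[-1]
```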
Theoretical guarantees
Theorem. After at most nT²/ε² iterations, our algorithm returns a policy π that performs as well as the teacher according to the teacher's unknown reward function R*, i.e., E[R*(s0) + … + R*(sT) | π] ≥ E[R*(s0) + … + R*(sT) | π*] − ε.
Note: our algorithm does not necessarily recover the teacher's reward function R*, which is in general impossible to recover.
[ICML 2004]
Performance guarantee intuition
Intuition by example: let R(s) = w1 φ1(s) + w2 φ2(s).
If the returned policy π matches the teacher's expected feature counts, E[Σt φ1(st) | π] = E[Σt φ1(st) | π*] and E[Σt φ2(st) | π] = E[Σt φ2(st) | π*],
then no matter what the values of w1 and w2 are, the policy performs as well as the teacher's policy π*.
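Written out, the argument is one application of linearity of expectation (a sketch, using the notation above):

```latex
% If R(s) = w_1 \phi_1(s) + w_2 \phi_2(s) and the returned policy \pi matches
% the teacher's feature expectations, the values match for every w_1, w_2:
\begin{align*}
\mathbb{E}\Big[\textstyle\sum_t R(s_t) \,\Big|\, \pi\Big]
  &= w_1\, \mathbb{E}\Big[\textstyle\sum_t \phi_1(s_t) \,\Big|\, \pi\Big]
   + w_2\, \mathbb{E}\Big[\textstyle\sum_t \phi_2(s_t) \,\Big|\, \pi\Big] \\
  &= w_1\, \mathbb{E}\Big[\textstyle\sum_t \phi_1(s_t) \,\Big|\, \pi^*\Big]
   + w_2\, \mathbb{E}\Big[\textstyle\sum_t \phi_2(s_t) \,\Big|\, \pi^*\Big]
   = \mathbb{E}\Big[\textstyle\sum_t R(s_t) \,\Big|\, \pi^*\Big].
\end{align*}
```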
Case study: highway driving
Input: driving demonstration. Output: learned behavior.
The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
More driving examples
In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.
Helicopter
[Diagram: the RL pipeline instantiated for aerobatics: the dynamics model Psa and a reward function R with 25 features feed the RL step, maxπ E[R(s0) + … + R(sT) | π], solved with differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989] to produce the control policy.]
[NIPS 2007]
Autonomous aerobatics [Show helicopter movie in Media Player.]
Quadruped
The reward function trades off:
Height differential of the terrain.
Gradient of the terrain around each foot.
Height differential between the feet.
… (25 features total for our setup)
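As an illustration of such a linear reward over terrain features, here is a toy Python sketch; the terrain function, the three feature definitions, and all names are invented stand-ins for the 25 features used in the actual system.

```python
import numpy as np

def footstep_reward(w, terrain, other_feet, candidate, eps=0.05):
    """Linear reward w . phi(candidate) for one candidate foot placement.

    terrain(x, y) -> height; other_feet: (x, y) positions of the other feet.
    The three features are illustrative stand-ins for the real feature set.
    """
    x, y = candidate
    heights = [terrain(fx, fy) for fx, fy in other_feet]
    phi = np.array([
        terrain(x, y) - np.mean(heights),                # height differential of terrain
        abs(terrain(x + eps, y) - terrain(x - eps, y))
            / (2 * eps),                                 # terrain gradient near the foot
        max(abs(terrain(x, y) - h) for h in heights),    # height differential between feet
    ])
    return float(w @ phi)
```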
Teacher demonstration for quadruped
A full teacher demonstration = a sequence of footsteps.
It is much simpler to "teach hierarchically": specify a body path, and specify the best footstep within a small area.
Hierarchical inverse RL
Quadratic programming problem (QP): quadratic objective, linear constraints.
Constraint generation for path constraints.
Experimental setup
Training: have the quadruped walk straight across a fairly simple board with fixed-spacing foot placements. Around each foot placement, label the best foot placement (about 20 labels). Label the best body path for the training board.
Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
Test on hold-out terrains: plan a path across the test board.
Quadruped on test-board
[Show movie in Media Player.]
Apprenticeship learning: RL algorithm
[Diagram: the same RL pipeline, maxπ E[R(s0) + … + R(sT) | π], with the RL algorithm highlighted. Its inputs: a (sloppy) demonstration, a (crude) model, and a small number of real-life trials.]
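One way these ingredients can combine, loosely in the spirit of correcting a crude model with a few real trials (a sketch under assumptions, not the exact published method): execute the current policy on the real system, add time-indexed bias terms so the model reproduces the observed trajectory, and take a policy-improvement step in the corrected model.

```python
def improve_with_crude_model(policy, sim, run_real, policy_update, T, n_iters=10):
    """Sketch: correct a crude model with real trajectories, then improve in it.

    sim(s, a)        -> crude model's next-state prediction            (assumed)
    run_real(policy) -> real rollout [(s_0, a_0), ..., (s_T, a_T)]     (assumed)
    policy_update(policy, model) -> improved policy, e.g. one policy-
                        gradient step evaluated inside `model`         (assumed)
    """
    for _ in range(n_iters):
        traj = run_real(policy)                   # one of the few real-life trials
        # Time-indexed bias terms: the corrected model reproduces the real rollout.
        bias = [traj[t + 1][0] - sim(traj[t][0], traj[t][1]) for t in range(T)]
        corrected = lambda s, a, t, b=bias: sim(s, a) + b[t]
        policy = policy_update(policy, corrected) # improve in the corrected model
    return policy
```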
Experiments
Two systems: an RC car and a fixed-wing flight simulator.
Control actions: throttle and steering.
RC Car: Circle
RC Car: Figure-8 Maneuver
Conclusion
Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations.
Our current work exploits teacher demonstrations to find
a good dynamics model,
a good reward function,
a good control policy.
Acknowledgments
J. Zico Kolter, Andrew Y. Ng
Morgan Quigley, Andrew Y. Ng
Andrew Y. Ng
Adam Coates, Morgan Quigley, Andrew Y. Ng