Apprenticeship Learning for Robotic Control. Pieter Abbeel, Stanford University. Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley.


Page 1

Apprenticeship Learning for Robotic Control

Pieter Abbeel, Stanford University

Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley

Page 2

Motivation for apprenticeship learning

Page 3

Outline

Preliminary: reinforcement learning.
Apprenticeship learning algorithms.
Experimental results on various robotic platforms.

Page 4

Reinforcement learning (RL)

[Diagram: starting from state s0, the system dynamics Psa map each state-action pair (s0, a0), (s1, a1), …, (sT-1, aT-1) to the next state s1, s2, …, sT; each visited state contributes reward, for a total of R(s0) + R(s1) + R(s2) + … + R(sT-1) + R(sT).]

Example reward function: R(s) = - || s – s* ||

Goal: Pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)]

Solution: a policy π which specifies an action for each possible state, for all times t = 0, 1, …, T.
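As a concrete illustration (not part of the talk), here is a minimal sketch of how such a policy can be computed by backward dynamic programming for a small discrete MDP with known dynamics Psa and reward R; the function and array names are my own.

```python
import numpy as np

def finite_horizon_policy(P, R, T):
    """Minimal sketch, assuming a small discrete MDP.
    P: shape (A, S, S), P[a, s, s2] = Pr(next state = s2 | state = s, action = a).
    R: shape (S,), reward R(s) for being in state s.
    Returns pi of shape (T, S): pi[t, s] = optimal action at time t in state s."""
    A, S, _ = P.shape
    V = R.copy()                      # V_T(s) = R(s): value at the final time step
    pi = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):      # backward induction, t = T-1, ..., 0
        Q = R[None, :] + P @ V        # Q[a, s] = R(s) + sum_s2 P[a, s, s2] * V(s2)
        pi[t] = Q.argmax(axis=0)      # best action in each state at time t
        V = Q.max(axis=0)             # V_t(s) = max_a Q[a, s]
    return pi
```

Following this policy maximizes the expected score E[R(s0) + R(s1) + … + R(sT)] under the model.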

Page 5

Model-based reinforcement learning

Run the RL algorithm in a simulator; this yields the control policy.

Page 6

Reinforcement learning (RL)

Apprenticeship learning algorithms use a demonstration to help us find
a good dynamics model,
a good reward function,
a good control policy.

[Diagram: the dynamics model Psa and the reward function R feed into reinforcement learning, max_π E[R(s0) + … + R(sT) | π], which outputs a control policy π.]

Page 7

Apprenticeship learning for the dynamics model

[Diagram: the same block diagram, with the dynamics model Psa as the component to be learned: Psa and the reward function R feed into reinforcement learning, max_π E[R(s0) + … + R(sT) | π], which outputs a control policy.]

Page 8

Motivating example: obtaining an accurate dynamics model Psa

[Diagram: start from a textbook model / specification, collect flight data, and learn the model from the data to obtain an accurate dynamics model Psa.]

How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?

Page 9

Learning the dynamical model

State-of-the-art: E3 algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)

[Flowchart: do we have a good model of the dynamics? If NO, “explore”; if YES, “exploit”.]
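To make the flowchart concrete, here is a minimal sketch (my own, not the E3 paper's pseudocode) of the underlying decision: exploit once every state-action pair has been visited often enough for its model estimate to be trusted, explore otherwise; `m_known` is a hypothetical threshold.

```python
# Minimal sketch (not the E3 paper's pseudocode) of the explore/exploit test.
def choose_mode(visit_counts, m_known=100):
    """visit_counts: dict mapping (state, action) -> number of times it was visited.
    m_known: hypothetical visit threshold beyond which the local model is trusted."""
    have_good_model = all(count >= m_known for count in visit_counts.values())
    return "exploit" if have_good_model else "explore"
```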

Page 10


Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?

Page 11

[ICML 2005]

Apprenticeship learning of the model

[Diagram: the teacher (human pilot) flies the helicopter, giving data (a1, s1, a2, s2, a3, s3, …) from which Psa is learned; the dynamics model Psa and the reward function R feed into reinforcement learning, max_π E[R(s0) + … + R(sT) | π], which outputs a control policy; autonomous flight under that policy gives new data (a1, s1, a2, s2, a3, s3, …), Psa is re-learned, and the cycle repeats.]

No explicit exploration: always try to fly as well as possible.
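A minimal sketch (not from the talk) of this loop; `fit_model`, `optimize_policy`, and `fly` are hypothetical caller-supplied functions standing in for the model-learning, RL, and flight steps.

```python
def apprenticeship_model_learning(teacher_data, fit_model, optimize_policy, fly,
                                  n_iters=10):
    """teacher_data: list of (state, action, next_state) transitions from the pilot.
    fit_model / optimize_policy / fly: hypothetical stand-ins for the three steps."""
    data = list(teacher_data)
    policy = None
    for _ in range(n_iters):
        model = fit_model(data)          # learn Psa from all data collected so far
        policy = optimize_policy(model)  # RL step: maximize E[R(s0) + ... + R(sT)] in the model
        data += fly(policy)              # autonomous flight: exploit only, no exploration bonus
    return policy
```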

Page 12

Theorem. Assuming a polynomial number of teacher demonstrations, then after a polynomial number of trials, with probability 1 - δ,

E[sum of rewards | policy returned by algorithm] ≥ E[sum of rewards | teacher's policy] - ε.

Here, polynomial is with respect to 1/ε, 1/δ, the horizon T, the maximum reward R, and the size of the state space.

Page 13

Learning the dynamics model

Details of the algorithm for learning the dynamics model:
Exploiting structure from physics.
Lagged learning criterion.

[NIPS 2005, 2006]

Page 14

Helicopter flight results

First high-speed autonomous funnels. Speed: 5 m/s. Nominal pitch angle: 30°.

Page 15

Autonomous nose-in funnel

Page 16

Accuracy

Page 17

Autonomous tail-in funnel

Page 18

Key points

Unlike exploration methods, our algorithm concentrates on the task of interest.

Bootstrapping off an initial teacher demonstration is sufficient to perform the task as well as the teacher.

Page 19

Page 20

Apprenticeship learning: reward

[Diagram: the same block diagram, with the reward function R as the component to be learned: R and the dynamics model Psa feed into reinforcement learning, max_π E[R(s0) + … + R(sT) | π], which outputs a control policy.]

Page 21

Example task: driving

Page 22

Related work

Previous work: learn to predict the teacher's actions as a function of states. E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; … This assumes “policy simplicity.”

Our approach: assumes “reward simplicity” and is based on inverse reinforcement learning (Ng & Russell, 2000). Similar work since: Ratliff et al., 2006, 2007.

Page 23

Inverse reinforcement learning

Find R s.t. R is consistent with the teacher's policy π* being optimal.

Find R s.t.: E[R(s0) + … + R(sT) | π*] ≥ E[R(s0) + … + R(sT) | π] for all policies π.

Writing R(s) = w·φ(s), find w s.t.: w·E[φ(s0) + … + φ(sT) | π*] ≥ w·E[φ(s0) + … + φ(sT) | π] for all policies π.

Linear constraints in w, quadratic objective: a QP, but with a very large number of constraints.
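A minimal sketch (not the talk's implementation) of one standard way to pose this QP: minimize ||w||² subject to the teacher's feature expectations beating those of a finite set of candidate policies by a margin. The cvxpy formulation and names are my own; μ denotes E[φ(s0) + … + φ(sT) | policy].

```python
import cvxpy as cp

def irl_qp(mu_teacher, mu_candidates, margin=1.0):
    """mu_teacher: feature expectations of the teacher's policy, shape (k,).
    mu_candidates: feature expectation vectors of candidate policies generated so far.
    Returns w, or None if no weight vector separates the teacher from the candidates."""
    k = mu_teacher.shape[0]
    w = cp.Variable(k)
    # One linear constraint per candidate policy: teacher beats it by the margin.
    constraints = [w @ (mu_teacher - mu_i) >= margin for mu_i in mu_candidates]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value  # None when the QP is infeasible
```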

Page 24

Algorithm

For i = 1, 2, …

Inverse RL step: estimate a reward Rw (i.e., a weight vector w) under which the teacher's policy outperforms all policies π1, …, πi-1 found so far.

RL step (= constraint generation): compute the optimal policy πi for the estimated reward Rw.
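A minimal sketch (not from the talk) of this loop, reusing the irl_qp sketch above; `initial_policy`, `compute_optimal_policy`, and `feature_expectations` are hypothetical caller-supplied pieces (an RL solver and policy evaluation under the model).

```python
def apprenticeship_irl(mu_teacher, initial_policy, compute_optimal_policy,
                       feature_expectations, n_iters=50):
    """Alternate inverse-RL (QP) and RL (constraint generation) steps."""
    policy = initial_policy
    mu_candidates = [feature_expectations(policy)]
    w = None
    for _ in range(n_iters):
        # Inverse RL step: weights under which the teacher beats every policy so far.
        w = irl_qp(mu_teacher, mu_candidates)
        if w is None:                 # QP infeasible: feature expectations (nearly) matched
            break
        # RL step (= constraint generation): optimal policy for the reward R_w(s) = w . phi(s).
        policy = compute_optimal_policy(w)
        mu_candidates.append(feature_expectations(policy))
    return policy, w
```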

Page 25

Theoretical guarantees [ICML 2004]

Theorem. After at most nT²/ε² iterations, our algorithm returns a policy that performs as well as the teacher according to the teacher's unknown reward function R*, i.e., E[R*(s0) + … + R*(sT) | returned policy] ≥ E[R*(s0) + … + R*(sT) | π*] - ε.

Note: our algorithm does not necessarily recover the teacher's reward function R*, which is impossible to recover.

Page 26

Performance guarantee intuition

Intuition by example: let R(s) = w1 φ1(s) + w2 φ2(s).

If the returned policy π satisfies E[φ1(s0) + … + φ1(sT) | π] = E[φ1(s0) + … + φ1(sT) | π*] and E[φ2(s0) + … + φ2(sT) | π] = E[φ2(s0) + … + φ2(sT) | π*],

then no matter what the values of w1 and w2 are, the policy π performs as well as the teacher's policy π*.
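The general linearity argument behind this intuition, written out (my notation, under the same assumption that the reward is linear in features):

```latex
% Assume R(s) = w^\top \phi(s) and define the feature expectations
% \mu(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{T} \phi(s_t) \,\middle|\, \pi\right].
% The expected sum of rewards of any policy \pi is then linear in \mu(\pi):
\mathbb{E}\!\left[\sum_{t=0}^{T} R(s_t) \,\middle|\, \pi\right]
  = \mathbb{E}\!\left[\sum_{t=0}^{T} w^\top \phi(s_t) \,\middle|\, \pi\right]
  = w^\top \mu(\pi),
% so \mu(\pi) = \mu(\pi^*) implies w^\top \mu(\pi) = w^\top \mu(\pi^*) for every
% weight vector w: the returned policy scores as well as the teacher's policy
% under any reward of this form.
```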

Page 27

Case study: Highway driving

Input: driving demonstration. Output: learned behavior.

The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.

Page 28

More driving examples

In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching that demonstration.

Page 29

Helicopter

[Diagram: the dynamics model Psa and the reward function R (25 features) feed into reinforcement learning, max_π E[R(s0) + … + R(sT) | π], solved with differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989], which outputs a control policy.]

[NIPS 2007]

Page 30

Autonomous aerobatics [Show helicopter movie in Media Player.]

Page 31

Quadruped

Page 32

Quadruped

Reward function trades off:
Height differential of terrain.
Gradient of terrain around each foot.
Height differential between feet.
… (25 features total for our setup)
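Purely as an illustration (not the talk's 25 features), a toy linear reward over a few terrain features of this kind might look as follows; the terrain representation and helper names are hypothetical.

```python
import numpy as np

def footstep_reward(w, terrain_height, feet_xy, candidate_xy, eps=0.01):
    """Toy sketch: R = w . phi for a candidate foot placement.
    w: learned weights, shape (3,).  terrain_height(x, y) -> height (hypothetical).
    feet_xy: current foot positions, shape (4, 2).  candidate_xy: proposed placement."""
    x, y = candidate_xy
    h = terrain_height(x, y)
    phi = np.array([
        h - terrain_height(*feet_xy.mean(axis=0)),               # height differential of terrain
        np.hypot(terrain_height(x + eps, y) - h,                 # gradient of terrain around the foot
                 terrain_height(x, y + eps) - h) / eps,
        max(abs(h - terrain_height(fx, fy)) for fx, fy in feet_xy),  # height differential between feet
    ])
    return float(w @ phi)
```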

Page 33

Teacher demonstration for quadruped

Full teacher demonstration = sequence of footsteps.

Much simpler to “teach hierarchically”: specify a body path, and specify the best footstep in a small area.

Page 34

Hierarchical inverse RL

Quadratic programming problem (QP): quadratic objective, linear constraints.

Constraint generation for path constraints.

Page 35

Experimental setup

Training: have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements. Around each foot placement, label the best foot placement (about 20 labels). Label the best body path for the training board. Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.

Test on hold-out terrains: plan a path across the test board.

Page 36

Quadruped on test-board

[Show movie in Media Player.]

Page 37

Page 38

Apprenticeship learning: RL algorithm

[Diagram: the same block diagram, now focusing on the RL algorithm itself, which must work from a (sloppy) demonstration, a (crude) model, and a small number of real-life trials: the dynamics model Psa and the reward function R feed into reinforcement learning, max_π E[R(s0) + … + R(sT) | π], which outputs a control policy.]

Page 39

Experiments

Two systems: RC car and fixed-wing flight simulator. Control actions: throttle and steering.

Page 40

RC Car: Circle

Page 41

RC Car: Figure-8 Maneuver

Page 42

Conclusion

Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations.

Our current work exploits teacher demonstrations to find

a good dynamics model,

a good reward function,

a good control policy.

Page 43

Acknowledgments

J. Zico Kolter, Andrew Y. Ng

Morgan Quigley, Andrew Y. Ng

Andrew Y. Ng

Adam Coates, Morgan Quigley, Andrew Y. Ng