
Apprenticeship Learning for Robotic Control

Pieter Abbeel
Stanford University

Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley


Motivation for apprenticeship learning


Outline

Preliminary: reinforcement learning.

Apprenticeship learning algorithms.

Experimental results on various robotic platforms.

Reinforcement learning (RL)

[Diagram: starting from state s0, action a0 is applied and the system dynamics Psa produce state s1; action a1 leads to s2, and so on, until action aT-1 leads to sT. The total reward collected is R(s0) + R(s1) + R(s2) + … + R(sT-1) + R(sT).]

Example reward function: R(s) = - || s – s* ||

Goal: Pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)]

Solution: a policy π, which specifies an action for each possible state, for all times t = 0, 1, …, T.
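To make the finite-horizon objective concrete, here is a minimal sketch (not from the talk) of computing such a time-dependent policy for a small discrete MDP by backward dynamic programming; the transition tensor P and reward vector R below are illustrative placeholders.

```python
import numpy as np

def optimal_finite_horizon_policy(P, R, T):
    """Backward dynamic programming for a finite-horizon MDP.

    P: array (A, S, S); P[a, s, s2] = probability of reaching s2 after action a in s.
    R: array (S,); R[s] = reward for being in state s.
    T: horizon; rewards R(s_0) + ... + R(s_T) are collected.
    Returns policy[t, s] = action for state s at time t, and V_0(s) for each start state.
    """
    A, S, _ = P.shape
    V = R.copy()                       # V_T(s) = R(s)
    policy = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):     # backwards from t = T-1 down to 0
        Q = R[None, :] + P @ V         # Q[a, s] = R(s) + sum_s2 P[a, s, s2] V(s2)
        policy[t] = Q.argmax(axis=0)
        V = Q.max(axis=0)              # V_t(s)
    return policy, V

# Tiny illustrative example: 2 actions, 3 states, horizon 5.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))   # random transition model Psa
R = np.array([0.0, 0.5, 1.0])                # example state rewards
pi, V0 = optimal_finite_horizon_policy(P, R, T=5)
print(pi[0], V0)                             # action per state at t=0, values V_0(s)
```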


Model-based reinforcement learning

Run the RL algorithm in a simulator (based on the dynamics model) to obtain a control policy.

Apprenticeship learning algorithms use a demonstration to help us find

a good dynamics model,

a good reward function,

a good control policy.

Reinforcement learning (RL)

[Diagram: the dynamics model Psa and the reward function R are the inputs to reinforcement learning, which solves max over π of E[R(s0) + … + R(sT) | π] and outputs a control policy.]

Apprenticeship learning for the dynamics model

[Diagram: same loop, now focusing on the dynamics model Psa; together with the reward function R, reinforcement learning, max over π of E[R(s0) + … + R(sT) | π], produces the control policy.]

Motivating example

An accurate dynamics model Psa can come from a textbook model / specification, or from collecting flight data and learning the model from the data.

How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?
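As a concrete, heavily simplified illustration of "learn the model from data", the sketch below fits a linear-Gaussian dynamics model s_{t+1} ≈ A s_t + B a_t + c + noise to logged state/action pairs by least squares. The real helicopter model in this work is more structured, so treat the linear form and the variable names as assumptions.

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Least-squares fit of s_{t+1} ~= A s_t + B a_t + c from logged flight data.

    states:  array (T+1, ns) of visited states.
    actions: array (T, na) of applied actions.
    Returns A (ns, ns), B (ns, na), c (ns,), and the residual (noise) covariance.
    """
    X = np.hstack([states[:-1], actions, np.ones((len(actions), 1))])  # rows [s_t, a_t, 1]
    Y = states[1:]                                                     # rows s_{t+1}
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)                          # solve X W ~= Y
    ns, na = states.shape[1], actions.shape[1]
    A, B, c = W[:ns].T, W[ns:ns + na].T, W[-1]
    residuals = Y - X @ W
    Sigma = np.cov(residuals.T)    # noise covariance of the learned stochastic model Psa
    return A, B, c, Sigma
```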

Learning the dynamical model

State-of-the-art: E3 algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)

[Flowchart: Have a good model of the dynamics? NO → "Explore". YES → "Exploit".]
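The explore-or-exploit test in E3-style algorithms is typically based on whether the current state is "known", i.e., visited often enough for the estimated transitions to be trusted. The sketch below is only a schematic of that decision; the visit-count threshold and the passed-in planner are illustrative, not the E3 implementation from the cited papers.

```python
from collections import defaultdict

visit_counts = defaultdict(int)   # counts for (state, action) pairs
KNOWN_THRESHOLD = 50              # illustrative "known state" threshold

def choose_action(state, actions, plan_in_known_model):
    """Schematic E3-style decision: exploit once the local model is known, else explore."""
    known = all(visit_counts[(state, a)] >= KNOWN_THRESHOLD for a in actions)
    if known:
        action = plan_in_known_model(state)                              # "Exploit"
    else:
        action = min(actions, key=lambda a: visit_counts[(state, a)])    # "Explore"
    visit_counts[(state, action)] += 1
    return action
```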


Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?

[ICML 2005]

Apprenticeship learning of the model

[Diagram: a human pilot (the teacher) flies the helicopter, producing data (a1, s1, a2, s2, a3, s3, …) from which Psa is learned. The learned dynamics model Psa and the reward function R feed into reinforcement learning, max over π of E[R(s0) + … + R(sT) | π], yielding a control policy. Autonomous flight with that policy produces new data (a1, s1, a2, s2, a3, s3, …), from which Psa is re-learned, and the loop repeats.]

No explicit exploration: always try to fly as well as possible.
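In pseudocode form, the loop in the diagram above might look like the sketch below; learn_dynamics, run_rl, and fly are illustrative stand-ins for the model-learning, policy-optimization, and flight steps, not the authors' actual code.

```python
def apprenticeship_model_learning(teacher_data, learn_dynamics, run_rl, fly, n_iter=10):
    """Sketch of the no-explicit-exploration loop: always fly the best current policy,
    and keep re-fitting the dynamics model Psa to all data gathered so far.

    learn_dynamics(data) -> model, run_rl(model) -> policy, fly(policy) -> new data
    are assumed helper callables, not the authors' implementation."""
    data = list(teacher_data)          # (s, a, s') triples from the human pilot
    policy = None
    for _ in range(n_iter):
        model = learn_dynamics(data)   # learn Psa from all data so far
        policy = run_rl(model)         # RL step: maximize E[R(s0) + ... + R(sT)] in the model
        data.extend(fly(policy))       # autonomous flight; no explicit exploration
    return policy
```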

Theorem. Assuming a polynomial number of teacher demonstrations, then after a polynomial number of trials, with probability 1 − δ,

E[ sum of rewards | policy returned by algorithm ] ≥ E[ sum of rewards | teacher's policy ] − ε.

Here, "polynomial" is with respect to

1/ε, 1/δ,

the horizon T,

the maximum reward R,

the size of the state space.

Learning the dynamics model

Details of the algorithm for learning the dynamics model:

Exploiting structure from physics.

Lagged learning criterion.

[NIPS 2005, 2006]

Helicopter flight results

First high-speed autonomous funnels. Speed: 5 m/s. Nominal pitch angle: 30 degrees.

Autonomous nose-in funnel


Accuracy


Autonomous tail-in funnel


Key points

Unlike exploration methods, our algorithm concentrates on the task of interest.

Bootstrapping off an initial teacher demonstration is sufficient to perform the task as well as the teacher.


Apprenticeship learning: reward

[Diagram: same loop, now with the reward function R as the quantity to be learned; reinforcement learning, max over π of E[R(s0) + … + R(sT) | π], combines the reward function R with the dynamics model Psa to produce the control policy.]

Example task: driving


Related work

Previous work: learn to predict the teacher's actions as a function of states. E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; … This assumes "policy simplicity."

Our approach: assumes "reward simplicity" and is based on inverse reinforcement learning (Ng & Russell, 2000). Similar work since: Ratliff et al., 2006, 2007.

Inverse reinforcement learning

Find R s.t. R is consistent with the teacher's policy π* being optimal.

Assume the reward is linear in known features, R_w(s) = w^T φ(s). Find w such that, for all policies π:

E[ R_w(s0) + … + R_w(sT) | π* ] ≥ E[ R_w(s0) + … + R_w(sT) | π ].

These are linear constraints in w; with a quadratic objective this is a QP, but with a very large number of constraints.
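Written with feature expectations (a standard reformulation for linear rewards, used in the ICML 2004 paper), the constraints above become linear in w; here μ(π) denotes the expected cumulative feature vector under policy π.

```latex
% Linear reward and feature expectations (sketch of the standard reformulation).
R_w(s) = w^\top \phi(s), \qquad
\mu(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{T} \phi(s_t) \,\middle|\, \pi\right]
\;\Longrightarrow\;
\mathbb{E}\!\left[\sum_{t=0}^{T} R_w(s_t) \,\middle|\, \pi\right] = w^\top \mu(\pi).

% "The teacher is optimal" then becomes: find w (e.g. with \|w\|_2 \le 1) such that
w^\top \mu(\pi^{*}) \;\ge\; w^\top \mu(\pi) \quad \text{for all policies } \pi .
```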

Algorithm

For i = 1, 2, …

Inverse RL step: estimate reward weights w for which the teacher outperforms all previously computed policies π1, …, π(i-1) by as large a margin as possible.

RL step (= constraint generation): compute the optimal policy πi for the estimated reward Rw.
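A compact sketch of this constraint-generation loop is below, using the max-margin form of the inverse RL step (solved here with cvxpy). compute_optimal_policy and feature_expectations are assumed helpers standing in for the RL step and for estimating μ(π), and the stopping tolerance is illustrative.

```python
import numpy as np
import cvxpy as cp

def apprenticeship_irl(mu_teacher, compute_optimal_policy, feature_expectations,
                       max_iter=50, eps=1e-2):
    """Sketch of the iterate-between-inverse-RL-and-RL loop.

    mu_teacher: estimated feature expectations of the teacher, shape (n,).
    compute_optimal_policy(w): RL step; optimal policy for reward R_w(s) = w . phi(s).
    feature_expectations(policy): returns mu(policy), shape (n,).
    """
    n = len(mu_teacher)
    mus = [feature_expectations(compute_optimal_policy(np.random.randn(n)))]  # initial policy
    for _ in range(max_iter):
        # Inverse RL step: find w (||w||_2 <= 1) maximizing the margin by which
        # the teacher outperforms every previously generated policy.
        w, t = cp.Variable(n), cp.Variable()
        constraints = [cp.norm(w, 2) <= 1] + [(mu_teacher - mu) @ w >= t for mu in mus]
        cp.Problem(cp.Maximize(t), constraints).solve()
        w_hat, margin = w.value, t.value
        if margin <= eps:            # the teacher is no longer clearly better: done
            break
        # RL step (= constraint generation): best policy for the current reward estimate.
        policy = compute_optimal_policy(w_hat)
        mus.append(feature_expectations(policy))
    return w_hat, mus
```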

Theoretical guarantees [ICML 2004]

Theorem. After at most nT²/ε² iterations, our algorithm returns a policy π that performs as well as the teacher according to the teacher's unknown reward function R*, i.e.,

E[ R*(s0) + … + R*(sT) | π ] ≥ E[ R*(s0) + … + R*(sT) | π* ] − ε.

Note: our algorithm does not necessarily recover the teacher's reward function R*, which is impossible to recover.

Performance guarantee intuition

Intuition by example: let the reward be linear in features, R(s) = w^T φ(s), with unknown weights w.

If the returned policy π matches the teacher's expected feature counts, E[φ(s0) + … + φ(sT) | π] = E[φ(s0) + … + φ(sT) | π*],

then no matter what the values of the weights w are, the policy performs as well as the teacher's policy π*.
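In symbols, the intuition is one line: matching the teacher's feature expectations matches the teacher's expected reward for every possible weight vector.

```latex
\mu(\pi) = \mu(\pi^{*})
\;\Longrightarrow\;
\mathbb{E}\!\left[\sum_{t=0}^{T} R_w(s_t)\,\middle|\,\pi\right]
= w^\top \mu(\pi)
= w^\top \mu(\pi^{*})
= \mathbb{E}\!\left[\sum_{t=0}^{T} R_w(s_t)\,\middle|\,\pi^{*}\right]
\quad \text{for every } w .
```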

Case study: Highway driving

[Video: input, driving demonstration (left panel); output, learned behavior (right panel).]

The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.

More driving examples

In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.


Helicopter

[Diagram: the dynamics model Psa and the reward function R (25 features) feed into reinforcement learning, max over π of E[R(s0) + … + R(sT) | π], solved with differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989], which produces the control policy.]

[NIPS 2007]

Autonomous aerobatics [Show helicopter movie in Media Player.]


Quadruped


Quadruped

Reward function trades off:

Height differential of terrain.

Gradient of terrain around each foot.

Height differential between feet.

… (25 features total for our setup; an illustrative sketch of such features follows below).
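As an illustration only (the exact 25 features are not listed in the talk), the sketch below computes footstep features of the kind named above from a terrain heightmap; the heightmap indexing, window size, and function name are assumptions.

```python
import numpy as np

def footstep_features(heightmap, foot_xy, other_feet_xy, window=2):
    """Illustrative features of the kind a footstep reward might trade off:
    local height differential, local terrain gradient, and height differences
    between this foot and the other feet. Not the 25 features used in the talk."""
    x, y = foot_xy
    patch = heightmap[x - window:x + window + 1, y - window:y + window + 1]
    height_diff_terrain = patch.max() - patch.min()      # height differential of terrain
    gx, gy = np.gradient(patch)
    gradient_mag = float(np.hypot(gx, gy).mean())        # gradient of terrain around the foot
    h = heightmap[x, y]
    height_diff_feet = [abs(h - heightmap[ox, oy]) for ox, oy in other_feet_xy]
    return np.array([height_diff_terrain, gradient_mag, *height_diff_feet])

# The learned reward is then a weighted combination of such features: R = w . phi(footstep).
```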

Teacher demonstration for quadruped

Full teacher demonstration = sequence of footsteps.

Much simpler to “teach hierarchically”: Specify a body path. Specify best footstep in a small area.


Hierarchical inverse RL

Quadratic programming problem (QP): quadratic objective, linear constraints.

Constraint generation for path constraints.


Experimental setup

Training: have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements.

Around each foot placement: label the best foot placement (about 20 labels).

Label the best body path for the training board.

Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.

Test on hold-out terrains: plan a path across the test board.

Quadruped on test-board

[Show movie in Media Player.]


Apprenticeship learning: RL algorithm

[Diagram: the reward function R and the dynamics model Psa feed into reinforcement learning, max over π of E[R(s0) + … + R(sT) | π], which outputs the control policy. Here the available inputs are a (sloppy) demonstration, a (crude) model, and a small number of real-life trials.]
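One way to read this diagram is as the following loop; this is a rough sketch under my own assumptions, not necessarily the authors' algorithm, and every function name is a placeholder: use the crude model mainly to decide how to change the policy, and use the few real-life trials to correct the model along the trajectories actually visited.

```python
def learn_with_crude_model(demo_policy, crude_model, run_real_trial, correct_model,
                           improve_policy, n_real_trials=5):
    """Rough sketch (assumed structure, not the authors' method): start from the
    demonstrated policy, then alternate a small number of real-life trials with
    policy improvement in a model corrected to match the observed trajectories."""
    policy, model = demo_policy, crude_model
    for _ in range(n_real_trials):
        trajectory = run_real_trial(policy)        # one real-life trial (expensive)
        model = correct_model(model, trajectory)   # make the crude model consistent with it
        policy = improve_policy(policy, model)     # policy improvement in the corrected model
    return policy
```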

Experiments

Two systems:

RC car.

Fixed-wing flight simulator.

Control actions: throttle and steering.

RC Car: Circle


RC Car: Figure-8 Maneuver


Conclusion

Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations.

Our current work exploits teacher demonstrations to find

a good dynamics model,

a good reward function,

a good control policy.


Acknowledgments

J. Zico Kolter, Andrew Y. Ng

Morgan Quigley, Andrew Y. Ng

Andrew Y. Ng

Adam Coates, Morgan Quigley, Andrew Y. Ng
