
Space-Indexed Dynamic Programming: Learning to Follow Trajectories


Page 1: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming: Learning to Follow Trajectories

J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway

Computer Science Department, Stanford University

July 2008, ICML

Page 2: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Outline

• Reinforcement Learning and Following Trajectories

• Space-indexed Dynamical Systems and Space-indexed Dynamic Programming

• Experimental Results

Page 3: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Reinforcement Learning and Following Trajectories

Page 4: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Trajectory Following

• Consider task of following trajectory in a vehicle such as a car or helicopter

• State space too large to discretize, can’t apply tabular RL/dynamic programming

Page 5: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Trajectory Following

• Dynamic programming algorithms with non-stationary policies seem well-suited to this task
– Policy Search by Dynamic Programming (Bagnell et al.), Differential Dynamic Programming (Jacobson and Mayne)

Page 6: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Dynamic Programming

t=1

Divide control task into discrete time steps

Page 7: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Dynamic Programming

t=1

Divide control task into discrete time steps

t=2

Page 8: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Dynamic Programming

t=1

Divide control task into discrete time steps

t=2 t=3

t=4 t=5 ...

Page 9: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Dynamic Programming

t=1 t=2 t=3

t=4 t=5 ...

Proceeding backwards in time, learn policies for

t = T, T-1, …, 2, 1
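Concretely, the backward pass can be pictured with the following minimal Python sketch (PSDP-flavored; the helpers sample_state, candidate_actions, rollout_cost, and fit_policy are hypothetical placeholders supplied by the caller, not the paper's implementation):

```python
# Minimal sketch of a backward, time-indexed policy-learning pass in the spirit
# of PSDP. All helper callables are hypothetical placeholders.
def backward_dp(T, sample_state, candidate_actions, rollout_cost, fit_policy,
                n_samples=100):
    policies = [None] * (T + 1)        # policies[t] is executed at time step t; index 0 unused
    for t in range(T, 0, -1):          # t = T, T-1, ..., 2, 1
        states, actions = [], []
        for _ in range(n_samples):
            s = sample_state(t)        # draw a state from an assumed distribution at time t
            # pick the action whose rollout under the already-learned later
            # policies incurs the lowest future cost
            a = min(candidate_actions(s),
                    key=lambda act: rollout_cost(s, act, policies[t + 1:]))
            states.append(s)
            actions.append(a)
        policies[t] = fit_policy(states, actions)   # local, time-indexed policy
    return policies
```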

Page 10: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Dynamic Programming

t=1 t=2 t=3

t=4 t=5 ...

Proceeding backwards in time, learn policies for

t = T, T-1, …, 2, 1

π5

Page 11: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Dynamic Programming

t=1 t=2 t=3

t=4 t=5 ...

Proceeding backwards in time, learn policies for

t = T, T-1, …, 2, 1

π5 π4

Page 12: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Dynamic Programming

t=1 t=2 t=3

t=4 t=5 ...

Proceeding backwards in time, learn policies for

t = T, T-1, …, 2, 1

π5 π4 π3 π2 π1

Page 13: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Dynamic Programming

t=1 t=2 t=3

t=4 t=5 ...

Key Advantage: Policies are local (only need to perform well over small

portion of state space)

π5 π4 π3 π2 π1

Page 14: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

Problem #1: Policies from traditional dynamic

programming algorithms are time-indexed

Page 15: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

π5

Suppose we learned policy π5 assuming this distribution over states

Page 16: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

π5

But, due to natural stochasticity of environment, car is actually here at t = 5

Page 17: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

π5

Resulting policy will perform very poorly

Page 18: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

π5 π4 π3 π2 π1

Partial Solution: Re-indexing

Execute policy closest to current location, regardless of time
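One way to picture this heuristic, as a rough sketch rather than the paper's code: at every control step, pick the time-indexed policy whose reference trajectory point is nearest the vehicle's current position (traj_points and policies are assumed inputs):

```python
import numpy as np

def reindexed_action(position, traj_points, policies):
    """Execute the policy learned for the trajectory point closest to the
    vehicle's current position, regardless of the current time step.

    position:    current position of the vehicle, shape (2,)
    traj_points: reference position for each time step, shape (T, 2)
    policies:    list of T time-indexed policies (callables: state -> action)
    """
    dists = np.linalg.norm(traj_points - position, axis=1)
    nearest = int(np.argmin(dists))          # time index of the closest reference point
    return policies[nearest](position)
```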

Page 19: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

Problem #2: Uncertainty over future states makes it hard to

learn any good policy

Page 20: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

Due to stochasticity, large uncertainty over states in

distant future

Dist. over states at time t = 5

Page 21: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

DP algorithms require learning policy that performs well over entire distribution

Dist. over states at time t = 5

Page 22: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

• Basic idea of Space-Indexed Dynamic Programming (SIDP):

Perform DP with respect to space indices (planes tangent to trajectory)

Page 23: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamical Systems and Dynamic Programming

Page 24: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Difficulty with SIDP

• No guarantee that taking single action will move to next plane along trajectory

• Introduce notion of space-indexed dynamical system

Page 25: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

$\dot{s} = f(s, u)$

Page 26: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

$\dot{s} = f(s, u)$

current state

Page 27: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

$\dot{s} = f(s, u)$

control action, current state

Page 28: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

$\dot{s} = f(s, u)$

control action, current state, time derivative of state

Page 29: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

$\dot{s} = f(s, u)$

Euler integration

$s_{t+\Delta t} = s_t + f(s_t, u_t)\,\Delta t$
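Read as code, the Euler step is simply the following (a minimal sketch; f is any callable returning the state derivative):

```python
import numpy as np

def euler_step(f, s, u, dt):
    """One Euler step of the time-indexed dynamics s_dot = f(s, u):
    s_{t + dt} = s_t + f(s_t, u_t) * dt."""
    return np.asarray(s) + np.asarray(f(s, u)) * dt
```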

Page 30: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamical Systems

• Creating space-indexed dynamical systems:

• Simulate forward until the vehicle hits the next tangent plane

space index d

space index d+1

Page 31: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamical Systems

• Creating space-indexed dynamical systems:

space index d, space index d+1

$\dot{s} = f(s, u)$

$s_{d+1} = s_d + f(s_d, u_d)\,\Delta t(s_d, u_d)$

Page 32: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamical Systems

• Creating space-indexed dynamical systems:

space index d, space index d+1

$\dot{s} = f(s, u)$

$s_{d+1} = s_d + f(s_d, u_d)\,\Delta t(s_d, u_d)$

$\Delta t(s, u) = \dfrac{(\dot{s}^*_{d+1})^T (s^*_{d+1} - s)}{(\dot{s}^*_{d+1})^T \dot{s}}$

(A positive solution $\Delta t(s, u)$ exists as long as the controller makes some forward progress)
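A minimal sketch of this space-indexed step (illustrative, not the paper's code): integrate the dynamics just long enough to reach the tangent plane at index d+1, defined by the reference point s*_{d+1} and the trajectory tangent ṡ*_{d+1}:

```python
import numpy as np

def space_indexed_step(f, s, u, s_star_next, sdot_star_next):
    """Step the state from space index d to d+1: integrate s_dot = f(s, u)
    just long enough to land on the tangent plane at index d+1.

    s_star_next:    reference trajectory point on the plane at index d+1
    sdot_star_next: trajectory tangent at that point (used as the plane normal)
    """
    s = np.asarray(s)
    s_star_next = np.asarray(s_star_next)
    sdot_star_next = np.asarray(sdot_star_next)
    sdot = np.asarray(f(s, u))
    # time needed for the Euler step to reach the next tangent plane; positive
    # as long as the controller makes some forward progress
    dt = sdot_star_next @ (s_star_next - s) / (sdot_star_next @ sdot)
    return s + sdot * dt                     # s_{d+1}
```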

Page 33: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamical Systems

• Result is a dynamical system indexed by spatial-index variable d rather than time

• Space-indexed dynamic programming runs DP directly on this system

$s_{d+1} = s_d + f(s_d, u_d)\,\Delta t(s_d, u_d)$

Page 34: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

Divide trajectory into discrete space planes

d=1
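One simple way to construct such planes, sketched below under the illustrative assumptions of 2-D positions and equal arc-length spacing (not necessarily how the paper spaces them): take D points along the reference trajectory and use the local tangent as each plane's normal:

```python
import numpy as np

def make_space_planes(ref_traj, D):
    """Divide a reference trajectory into D planes, each given by a reference
    point s*_d and the unit trajectory tangent there (the plane's normal).

    ref_traj: array of positions along the reference trajectory, shape (N, 2)
    """
    seg_lengths = np.linalg.norm(np.diff(ref_traj, axis=0), axis=1)
    arclen = np.concatenate(([0.0], np.cumsum(seg_lengths)))   # cumulative arc length
    planes = []
    for target in np.linspace(0.0, arclen[-1], D):             # equally spaced planes
        i = min(max(int(np.searchsorted(arclen, target)), 1), len(ref_traj) - 1)
        tangent = ref_traj[i] - ref_traj[i - 1]
        planes.append((ref_traj[i], tangent / np.linalg.norm(tangent)))
    return planes   # [(s*_1, n_1), ..., (s*_D, n_D)]
```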

Page 35: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

Divide trajectory into discrete space planes

d=1 d=2

Page 36: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

Divide trajectory into discrete space planes

d=1 d=2 d=3

d=4 d=5

Page 37: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

d=1 d=2 d=3

d=4 d=5

Proceeding backwards, learn policies for

d = D, D-1, …, 2, 1

Page 38: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

d=1 d=2 d=3

d=4 d=5

π5

Proceeding backwards, learn policies for

d = D, D-1, …, 2, 1

Page 39: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

d=1 d=2 d=3

d=4 d=5

π5 π4

Proceeding backwards, learn policies for

d = D, D-1, …, 2, 1

Page 40: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

d=1 d=2 d=3

d=4 d=5

π5 π4 π3 π2 π1

Proceeding backwards, learn policies for

d = D, D-1, …, 2, 1

Page 41: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

Problem #1: Policies from traditional dynamic

programming algorithms are time-indexed

Page 42: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

Time-indexed DP: can execute a policy learned for a different location

Space-indexed DP: always executes the policy based on the current spatial index

π5 π4

Page 43: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Problems with Dynamic Programming

Problem #2: Uncertainty over future states makes it hard to

learn any good policy

Page 44: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

Time-indexed DP: wide distribution over future states

Space-indexed DP: much tighter distribution over future states

Dist. over states at time t = 5 | Dist. over states at index d = 5

Page 45: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed Dynamic Programming

Time-indexed DP: wide distribution over future states

Space-indexed DP: much tighter distribution over future states

Dist. over states at time t = 5 | Dist. over states at index d = 5


Page 46: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Experiments

Page 47: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Experimental Domain

• Task: following race track trajectory in RC car with randomly placed obstacles

Page 48: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Experimental Setup

• Implemented a space-indexed version of the PSDP algorithm
– Policy chooses the steering angle using an SVM classifier (constant velocity); a sketch follows below
– Used a simple textbook model of the car dynamics in simulation to learn the policy

• Evaluated PSDP time-indexed, time-indexed with re-indexing, and space-indexed
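As a rough illustration of what one such space-indexed steering policy could look like (assuming scikit-learn; the feature vectors, angle discretization, and training labels are illustrative, not the paper's implementation):

```python
import numpy as np
from sklearn.svm import SVC

STEERING_ANGLES = np.linspace(-0.5, 0.5, 11)      # illustrative discretization (radians)

def fit_steering_policy(state_features, best_angle_indices):
    """Fit one space-indexed policy pi_d: a multi-class SVM mapping a state
    feature vector to one of the discretized steering angles."""
    clf = SVC(kernel="rbf")
    clf.fit(state_features, best_angle_indices)
    return lambda features: STEERING_ANGLES[int(clf.predict([features])[0])]
```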

Page 49: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Time-Indexed PSDP

Page 50: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Time-Indexed PSDP w/ Re-indexing

Page 51: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Space-Indexed PSDP

Page 52: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Empirical Evaluation

Time-indexed PSDP: Cost Infinite (no trajectory succeeds)
Time-indexed PSDP with Re-indexing: Cost 59.74
Space-indexed PSDP: Cost 49.32

Page 53: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Additional Experiments

• In the paper: additional experiments on the Stanford Grand Challenge Car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP

Page 54: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Related Work

• Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005

• Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008

• Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989

Page 55: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Summary

• Trajectory following is naturally handled by non-stationary policies, but traditional DP / RL algorithms suffer because their policies are time-indexed

• In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming

• Demonstrated usefulness of these methods on real-world control tasks.

Page 56: Space-Indexed Dynamic  Programming: Learning to Follow Trajectories

Thank you!

Videos available online at http://cs.stanford.edu/~kolter/icml08videos