Guided Policy Search

Sergey Levine [email protected]
Vladlen Koltun [email protected]

Computer Science Department, Stanford University, Stanford, CA 94305 USA

Abstract

Direct policy search can effectively scale to high-dimensional systems, but complex policies with hundreds of parameters often present a challenge for such methods, requiring numerous samples and often falling into poor local optima. We present a guided policy search algorithm that uses trajectory optimization to direct policy learning and avoid poor local optima. We show how differential dynamic programming can be used to generate suitable guiding samples, and describe a regularized importance sampled policy optimization that incorporates these samples into the policy search. We evaluate the method by learning neural network controllers for planar swimming, hopping, and walking, as well as simulated 3D humanoid running.

1. Introduction

Reinforcement learning is a powerful framework for controlling dynamical systems. Direct policy search methods are often employed in high-dimensional applications such as robotics, since they scale gracefully with dimensionality and offer appealing convergence guarantees (Peters & Schaal, 2008). However, it is often necessary to carefully choose a specialized policy class to learn the policy in a reasonable number of iterations without falling into poor local optima. Substantial improvements on real-world systems have come from specialized and innovative policy classes (Ijspeert et al., 2002). This specialization comes at a cost in generality, and can restrict the types of behaviors that can be learned. For example, a policy that tracks a single trajectory cannot choose different trajectories depending on the state. In this work, we aim to learn policies with very general and flexible representations, such as large neural networks, which can represent a broad range of behaviors. Learning such complex, nonlinear policies with standard policy gradient methods can require a huge number of iterations, and can be disastrously prone to poor local optima.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

In this paper, we show how trajectory optimization can guide the policy search away from poor local optima. Our guided policy search algorithm uses differential dynamic programming (DDP) to generate "guiding samples," which assist the policy search by exploring high-reward regions. An importance sampled variant of the likelihood ratio estimator is used to incorporate these guiding samples directly into the policy search.

We show that DDP can be modified to sample from a distribution over high-reward trajectories, making it particularly suitable for guiding policy search. Furthermore, by initializing DDP with example demonstrations, our method can perform learning from demonstration. The use of importance sampled policy search also allows us to optimize the policy with second order quasi-Newton methods for many gradient steps without requiring new on-policy samples, which can be crucial for complex, nonlinear policies.

Our main contribution is a guided policy search algorithm that uses trajectory optimization to assist policy learning. We show how to obtain suitable guiding samples, and we present a regularized importance sampled policy optimization method that can utilize guiding samples and does not require a learning rate or new samples at every gradient step. We evaluate our method on planar swimming, hopping, and walking, as well as 3D humanoid running, using general-purpose neural network policies. We also show that both the proposed sampling scheme and regularizer are essential for good performance, and that the learned policies can generalize successfully to new environments.

2. Preliminaries

Reinforcement learning aims to find a policy π to control an agent in a stochastic environment. At each time step t, the agent observes a state x_t and chooses an action according to π(u_t|x_t), producing a state transition according to p(x_{t+1}|x_t, u_t). The goal is specified by the reward r(x_t, u_t), and an optimal policy is one that maximizes the expected sum of rewards (return) from step 1 to T. We use ζ to denote a sequence of states and actions, with r(ζ) and π(ζ) being the total reward along ζ and its probability under π. We focus on finite-horizon tasks in continuous domains, though extensions to other formulations are also possible.

Policy gradient methods learn a parameterized policy π_θ by directly optimizing its expected return E[J(θ)] with respect to the parameters θ. In particular, likelihood ratio methods estimate the gradient E[∇J(θ)] using samples ζ_1, ..., ζ_m drawn from the current policy π_θ, and then improve the policy by taking a step along this gradient. The gradient can be estimated using the following equation (Peters & Schaal, 2008):

$$E[\nabla J(\theta)] = E[r(\zeta)\nabla \log \pi_\theta(\zeta)] \approx \frac{1}{m}\sum_{i=1}^m r(\zeta_i)\,\nabla \log \pi_\theta(\zeta_i),$$

where ∇ log π_θ(ζ_i) decomposes into Σ_t ∇ log π_θ(u_t|x_t), since the transition model p(x_{t+1}|x_t, u_t) does not depend on θ. Standard likelihood ratio methods require new samples from the current policy at each gradient step, do not admit off-policy samples, and require the learning rate to be chosen carefully to ensure convergence. In the next section, we discuss how importance sampling can be used to lift these constraints.
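As a concrete illustration, the following minimal sketch (not part of the original paper) estimates the likelihood ratio gradient from sampled rollouts; the rollout container and the `grad_log_pi` interface are assumptions made for this example.

```python
import numpy as np

def lr_gradient(rollouts, grad_log_pi):
    """Monte Carlo likelihood ratio gradient estimate.

    rollouts: list of (states, actions, rewards) arrays, one tuple per
        sampled trajectory zeta_i, all drawn from the current policy.
    grad_log_pi: function (x_t, u_t) -> gradient of log pi_theta(u_t | x_t)
        with respect to theta, as a flat parameter vector.
    """
    grad = None
    for states, actions, rewards in rollouts:
        total_reward = rewards.sum()                       # r(zeta_i)
        # grad log pi_theta(zeta_i) = sum_t grad log pi_theta(u_t | x_t)
        g = sum(grad_log_pi(x, u) for x, u in zip(states, actions))
        grad = g * total_reward if grad is None else grad + g * total_reward
    return grad / len(rollouts)                            # average over m samples
```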

3. Importance Sampled Policy Search

Importance sampling is a technique for estimating an expectation E_p[f(x)] with respect to p(x) using samples drawn from a different distribution q(x):

$$E_p[f(x)] = E_q\!\left[\frac{p(x)}{q(x)}f(x)\right] \approx \frac{1}{Z}\sum_{i=1}^m \frac{p(x_i)}{q(x_i)}f(x_i).$$

Importance sampling is unbiased if Z = m, though we use $Z = \sum_i \frac{p(x_i)}{q(x_i)}$ throughout as it provides lower variance. Prior work proposed estimating E[J(θ)] with importance sampling (Peshkin & Shelton, 2002; Tang & Abbeel, 2010). This allows using off-policy samples and results in the following estimator:

$$E[J(\theta)] \approx \frac{1}{Z(\theta)}\sum_{i=1}^m \frac{\pi_\theta(\zeta_i)}{q(\zeta_i)}r(\zeta_i). \quad (1)$$

The variance of this estimator can be reduced further by observing that past rewards do not depend on future actions (Sutton et al., 1999; Baxter et al., 2001):

$$E[J(\theta)] \approx \sum_{t=1}^T \frac{1}{Z_t(\theta)}\sum_{i=1}^m \frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}r(x_t^i, u_t^i), \quad (2)$$

where π_θ(ζ_{i,1:t}) denotes the probability of the first t steps of ζ_i, and Z_t(θ) normalizes the weights. To include samples from multiple distributions, we follow prior work and use a fused distribution of the form $q(\zeta) = \frac{1}{n}\sum_j q_j(\zeta)$, where each q_j is either a previous policy or a guiding distribution constructed with DDP.
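A minimal numpy sketch of the per-time-step estimator in Equation (2) is given below; it assumes the cumulative log-probabilities log π_θ(ζ_{i,1:t}) and log q(ζ_{i,1:t}) have already been computed for each sample, which is not shown here.

```python
import numpy as np

def is_return_estimate(log_pi, log_q, rewards):
    """Importance sampled return estimate, Equation (2).

    log_pi, log_q: arrays of shape (m, T) holding the cumulative
        log-probabilities of the first t steps of each sample under
        pi_theta and under the fused sampling distribution q.
    rewards: array of shape (m, T) with r(x_t^i, u_t^i).
    """
    log_w = log_pi - log_q                     # log importance weights
    # normalize per time step: w_{i,t} / Z_t(theta)
    w = np.exp(log_w - log_w.max(axis=0))      # subtract max for stability
    w /= w.sum(axis=0, keepdims=True)
    return float((w * rewards).sum())          # sum over samples and time
```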

Prior methods optimized Equation (1) or (2) directly. Unfortunately, with complex policies and long rollouts, this often produces poor results, as the estimator only considers the relative probability of each sample and does not require any of these probabilities to be high. The optimum can assign a low probability to all samples, with the best sample slightly more likely than the rest, thus receiving the only nonzero weight. Tang and Abbeel (2010) attempted to mitigate this issue by constraining the optimization by the variance of the weights. However, this is ineffective when the distributions are highly peaked, as is the case with long rollouts, because very few samples have nonzero weights, and their variance tends to zero as the sample count increases. In our experiments, we found this problem to be very common in large, high-dimensional problems.

To address this issue, we augment Equation (2) with a novel regularizing term. A variety of regularizers are possible, but we found the most effective one to be the logarithm of the normalizing constant:

$$\Phi(\theta) = \sum_{t=1}^T\left[\frac{1}{Z_t(\theta)}\sum_{i=1}^m \frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}r(x_t^i, u_t^i) + w_r \log Z_t(\theta)\right].$$

This objective is maximized using an analytic gradient derived in Appendix A of the supplement. It is easy to check that this estimator is consistent, since Z_t(θ) → 1 in the limit of infinite samples. The regularizer acts as a soft maximum over the logarithms of the weights, ensuring that at least some samples have a high probability under π_θ. Furthermore, by adaptively adjusting w_r, we can control how far the policy is allowed to deviate from the samples, which can be used to limit the optimization to regions that are better represented by the samples if it repeatedly fails to make progress.
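Building on the previous sketch, the regularized objective Φ(θ) only adds the w_r log Z_t(θ) term; a hedged illustration follows, where the array shapes are the same assumptions as before.

```python
import numpy as np

def regularized_objective(log_pi, log_q, rewards, w_r):
    """Regularized importance sampled objective Phi(theta).

    Shapes follow is_return_estimate: (m, T) arrays of cumulative
    log-probabilities and per-step rewards; w_r is the regularizer weight.
    """
    log_w = log_pi - log_q                          # log importance weights
    m = log_w.shape[0]
    log_sum = np.logaddexp.reduce(log_w, axis=0)    # log sum_i of weights, per t
    w = np.exp(log_w - log_sum)                     # weights normalized per step
    log_Z = log_sum - np.log(m)                     # Z_t -> 1 as m grows
    return float((w * rewards).sum() + w_r * log_Z.sum())
```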

4. Guiding Samples

Prior methods employed importance sampling to reuse samples from previous policies (Peshkin & Shelton, 2002; Kober & Peters, 2009; Tang & Abbeel, 2010). However, when learning policies with hundreds of parameters, local optima make it very difficult to find a good solution. In this section, we show how differential dynamic programming (DDP) can be used to supplement the sample set with off-policy guiding samples that guide the policy search to regions of high reward.

4.1. Constructing Guiding Distributions

An effective guiding distribution covers high-reward regions while avoiding large q(ζ) densities, which result in low importance weights. We posit that a good guiding distribution is an I-projection of ρ(ζ) ∝ exp(r(ζ)). An I-projection q of ρ minimizes the KL-divergence D_KL(q||ρ) = E_q[−r(ζ)] − H(q), where H denotes entropy. The first term forces q to be high only in regions of high reward, while the entropy maximization favors broad distributions. We will show that an approximate Gaussian I-projection of ρ can be computed using a variant of DDP called iterative LQR (Tassa et al., 2012), which optimizes a trajectory by repeatedly solving for the optimal policy under linear-quadratic assumptions. Given a trajectory (x̄_1, ū_1), ..., (x̄_T, ū_T) and defining x̂_t = x_t − x̄_t and û_t = u_t − ū_t, the dynamics and reward are approximated as

$$\hat{x}_{t+1} \approx f_{xt}\hat{x}_t + f_{ut}\hat{u}_t$$
$$r(x_t, u_t) \approx \hat{x}_t^T r_{xt} + \hat{u}_t^T r_{ut} + \frac{1}{2}\hat{x}_t^T r_{xxt}\hat{x}_t + \frac{1}{2}\hat{u}_t^T r_{uut}\hat{u}_t + \hat{u}_t^T r_{uxt}\hat{x}_t + r(\bar{x}_t, \bar{u}_t).$$

Subscripts denote the Jacobians and Hessians of the dynamics f and reward r, which are assumed to exist.[1] Iterative LQR recursively estimates the Q-function:

$$Q_{xxt} = r_{xxt} + f_{xt}^T V_{xx,t+1} f_{xt} \qquad Q_{xt} = r_{xt} + f_{xt}^T V_{x,t+1}$$
$$Q_{uut} = r_{uut} + f_{ut}^T V_{xx,t+1} f_{ut} \qquad Q_{ut} = r_{ut} + f_{ut}^T V_{x,t+1}$$
$$Q_{uxt} = r_{uxt} + f_{ut}^T V_{xx,t+1} f_{xt},$$

as well as the value function and linear policy terms:

$$V_{xt} = Q_{xt} - Q_{uxt}^T Q_{uut}^{-1} Q_{ut} \qquad k_t = -Q_{uut}^{-1} Q_{ut}$$
$$V_{xxt} = Q_{xxt} - Q_{uxt}^T Q_{uut}^{-1} Q_{uxt} \qquad K_t = -Q_{uut}^{-1} Q_{uxt}.$$

The deterministic optimal policy is then given by

$$g(x_t) = \bar{u}_t + k_t + K_t(x_t - \bar{x}_t). \quad (3)$$

By repeatedly computing this policy and following it to obtain a new trajectory, this algorithm converges to a locally optimal solution. We now show how the same algorithm can be used with the framework of linearly solvable MDPs (Dvijotham & Todorov, 2010) and the related concept of maximum entropy control (Ziebart, 2010) to build an approximate Gaussian I-projection of ρ(ζ) ∝ exp(r(ζ)). Under this framework, the optimal policy π_G maximizes an augmented reward, given by

$$\tilde{r}(x_t, u_t) = r(x_t, u_t) - D_{KL}(\pi_G(\cdot|x_t)\,\|\,p(\cdot|x_t)),$$

where p is a "passive dynamics" distribution. If p is uniform, the expected return of a policy π_G is

$$E_{\pi_G}[\tilde{r}(\zeta)] = E_{\pi_G}[r(\zeta)] + H(\pi_G),$$

which means that if π_G maximizes this return, it is an I-projection of ρ. Ziebart (2010) showed that the optimal policy under uniform passive dynamics is

$$\pi_G(u_t|x_t) = \exp(Q_t(x_t, u_t) - V_t(x_t)), \quad (4)$$

where V is a modified value function given by

$$V_t(x_t) = \log \int \exp(Q_t(x_t, u_t))\, du_t.$$

Under linear dynamics and quadratic rewards, it can be shown that V has the same form as the derivation above, and Equation 4 is a linear Gaussian with the mean given by g(x_t) and the covariance given by −Q_{uut}^{-1}. This stochastic policy corresponds approximately to a Gaussian distribution over trajectories. We can therefore sample from an approximate Gaussian I-projection of ρ by following the stochastic policy

$$\pi_G(u_t|x_t) = \mathcal{G}(u_t;\, g(x_t),\, -Q_{uut}^{-1}).$$

It should be noted that π_G(ζ) is only Gaussian under linear dynamics. When the dynamics are nonlinear, π_G(ζ) approximates a Gaussian around the nominal trajectory. Fortunately, the feedback term usually keeps the samples close to this trajectory, making them suitable guiding samples for the policy search.

[1] In our implementation, the dynamics are differentiated with finite differences and the reward is differentiated analytically.
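To make the backward pass concrete, here is a minimal sketch of one iterative LQR backward sweep together with the maximum entropy sampling rule described above. The interfaces (lists of per-step Jacobians and Hessians) are assumptions of this example, not the authors' implementation.

```python
import numpy as np

def maxent_ilqr_backward(fx, fu, rx, ru, rxx, ruu, rux):
    """One backward pass of iterative LQR.

    fx[t], fu[t]: dynamics Jacobians; rx, ru, rxx, ruu, rux: reward
    derivatives at the nominal trajectory, for t = 0..T-1.
    Returns feedback gains K_t, offsets k_t, and action covariances
    -Quu_t^{-1} for the stochastic policy pi_G.
    """
    T, n = len(fx), fx[0].shape[0]
    Vx, Vxx = np.zeros(n), np.zeros((n, n))    # terminal value terms
    K, k, cov = [], [], []
    for t in reversed(range(T)):
        Qx = rx[t] + fx[t].T @ Vx
        Qu = ru[t] + fu[t].T @ Vx
        Qxx = rxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = ruu[t] + fu[t].T @ Vxx @ fu[t]
        Qux = rux[t] + fu[t].T @ Vxx @ fx[t]
        Quu_inv = np.linalg.inv(Quu)
        k_t = -Quu_inv @ Qu                    # open-loop correction
        K_t = -Quu_inv @ Qux                   # linear feedback gain
        Vx = Qx - Qux.T @ Quu_inv @ Qu
        Vxx = Qxx - Qux.T @ Quu_inv @ Qux
        K.insert(0, K_t); k.insert(0, k_t); cov.insert(0, -Quu_inv)
    return K, k, cov

def sample_action(x, xbar, ubar, K_t, k_t, cov_t, rng):
    """Draw u_t ~ G(g(x_t), -Quu_t^{-1}) around the nominal trajectory."""
    mean = ubar + k_t + K_t @ (x - xbar)       # Equation (3)
    return rng.multivariate_normal(mean, cov_t)
```

Since the reward is being maximized, Quu is negative definite near a well-conditioned optimum, so −Quu^{-1} is a valid covariance matrix.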

4.2. Adaptive Guiding Distributions

The distribution in the preceding section captures high-reward regions, but does not consider the current policy π_θ. We can adapt it to π_θ by sampling from an I-projection of ρ_θ(ζ) ∝ exp(r(ζ))π_θ(ζ), which is the optimal distribution for estimating E_{π_θ}[exp(r(ζ))].[2] To construct the approximate I-projection of ρ_θ, we simply run the DDP algorithm with the reward r̄(x_t, u_t) = r(x_t, u_t) + log π_θ(u_t|x_t). The resulting distribution is then an approximate I-projection of ρ_θ(ζ) ∝ exp(r̄(ζ)).

In practice, we found that many domains do not require adaptive samples. As discussed in the evaluation, adaptation becomes necessary when the initial samples cannot be reproduced by any policy, such as when they act differently in similar states. In that case, adapted samples will avoid different actions in similar states, making them more suited for guiding the policy.

[2] While we could also consider the optimal sampler for E_{π_θ}[r(ζ)], such a sampler would also need to cover regions with very low reward, which are often very large and would not be represented well by a Gaussian I-projection.
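As an illustration, adapting the guiding distribution only changes the reward handed to the DDP/iLQR optimizer; a minimal sketch, assuming a `policy_log_prob` function for the current policy (an assumption of this example):

```python
def adapted_reward(reward_fn, policy_log_prob):
    """Wrap the task reward with the current policy's log-probability.

    Running DDP on this reward yields an approximate I-projection of
    rho_theta(zeta) proportional to exp(r(zeta)) * pi_theta(zeta).
    """
    def r_bar(x, u):
        return reward_fn(x, u) + policy_log_prob(u, x)
    return r_bar
```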

4.3. Incorporating Guiding Samples

We incorporate guiding samples into the policy search by building one or more initial DDP solutions and supplying the resulting samples to the importance sampled policy search algorithm. These solutions can be initialized with human demonstrations or with an offline planning algorithm. When learning from demonstrations, we can perform just one step of DDP starting from the example demonstration, thus constructing a Gaussian distribution around the example. If adaptive guiding distributions are used, they are constructed at each iteration of the policy search starting from the previous DDP solution.

Although our policy search component is model-free, DDP requires a model of the system dynamics. Numerous recent methods have proposed to learn the model (Abbeel et al., 2006; Deisenroth & Rasmussen, 2011; Ross & Bagnell, 2012), and if we use initial examples, only local models are required. In Section 8, we also discuss model-free alternatives to DDP.

One might also wonder why the DDP policy π_G is not itself a suitable controller. The issue is that π_G is time-varying and only valid around a single trajectory, while the final policy can be learned from many DDP solutions in many situations. Guided policy search can be viewed as transforming a collection of trajectories into a controller. This controller can adhere to any parameterization, reflecting constraints on computation or available sensors in partially observed domains. In our evaluation, we show that such a policy generalizes to situations where the DDP policy fails.

5. Guided Policy Search

Algorithm 1 summarizes our method. On line 1, we build one or more DDP solutions, which can be initialized from demonstrations. Initial guiding samples are sampled from these solutions and used on line 3 to pretrain the initial policy π_θ⋆. Since the samples are drawn from stochastic feedback policies, π_θ⋆ can already learn useful feedback rules during this pretraining stage. The sample set S is constructed on line 4 from the guiding samples and samples from π_θ⋆, and the policy search then alternates between optimizing Φ(θ) and gathering new samples from the current policy π_θk. If the sample set S becomes too big, a subset S_k is chosen on line 6. In practice, we simply choose the samples with high importance weights under the current best policy π_θ⋆, as well as the guiding samples.

Algorithm 1 Guided Policy Search

 1: Generate DDP solutions π_G1, ..., π_Gn
 2: Sample ζ_1, ..., ζ_m from q(ζ) = (1/n) Σ_i π_Gi(ζ)
 3: Initialize θ⋆ ← arg max_θ Σ_i log π_θ(ζ_i)
 4: Build initial sample set S from π_G1, ..., π_Gn, π_θ⋆
 5: for iteration k = 1 to K do
 6:   Choose current sample set S_k ⊂ S
 7:   Optimize θ_k ← arg max_θ Φ_{S_k}(θ)
 8:   Append samples from π_θk to S_k and S
 9:   Optionally generate adaptive guiding samples
10:   Estimate the values of π_θk and π_θ⋆ using S_k
11:   if π_θk is better than π_θ⋆ then
12:     Set θ⋆ ← θ_k
13:     Decrease w_r
14:   else
15:     Increase w_r
16:     Optionally, resample from π_θ⋆
17:   end if
18: end for
19: Return the best policy π_θ⋆

The objective Φ(θ) is optimized on line 7 with LBFGS. This objective can itself be susceptible to local optima: when the weight on a sample is very low, it is effectively ignored by the optimization. If the guiding samples have low weights, the optimization cannot benefit from them. To mitigate this issue, we repeat the optimization twice, once starting from the best current parameters θ⋆, and once by initializing the parameters with an optimization that maximizes the log weight on the highest-reward sample. Prior work suggested restarting the optimization from each previous policy (Tang & Abbeel, 2010), but this is very slow with complex policies, and still fails to explore the guiding samples, for which there are no known policy parameters.

Once the new policy π_θk is optimized, we add samples from π_θk to S on line 8. If we are using adaptive guiding samples, the adaptation is done on line 9 and new guiding samples are also added. We then use Equation 2 to estimate the returns of both the new policy and the current best policy π_θ⋆ on line 10. Since S_k now contains samples from both policies, we expect the estimator to be accurate. If the new policy is better, it replaces the best policy. Otherwise, the regularization weight w_r is increased. Higher regularization causes the next optimization to stay closer to the samples, making the estimated return more accurate. Once the policy search starts making progress, the weight is decreased. In practice, we clamp w_r between 10^-2 and 10^-6 and adjust it by a factor of 10. We also found that the policy sometimes failed to improve if the samples from π_θ⋆ had been unusually good by chance. To address this, we draw additional samples from π_θ⋆ on line 16 to prevent the policy search from getting stuck due to an overestimate of the best policy's value.
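The outer loop of Algorithm 1 can be summarized in code. The sketch below is a schematic rendering under assumed helper functions (ddp_solve, sample_rollouts, select_samples, optimize_phi, estimate_return, pretrain), not the authors' implementation; the w_r schedule follows the values stated above.

```python
def guided_policy_search(ddp_solve, sample_rollouts, select_samples,
                         optimize_phi, estimate_return, pretrain,
                         K=80, w_r=1e-4):
    """Schematic outer loop of guided policy search (Algorithm 1)."""
    guides = ddp_solve()                                  # lines 1-2
    guide_samples = sample_rollouts(guides)
    theta_star = pretrain(guide_samples)                  # line 3
    S = list(guide_samples) + sample_rollouts([theta_star])   # line 4
    for k in range(K):                                    # line 5
        S_k = select_samples(S, theta_star, guide_samples)    # line 6
        theta_k = optimize_phi(S_k, w_r, init=theta_star)     # line 7 (LBFGS)
        new = sample_rollouts([theta_k])                  # line 8
        S_k += new; S += new
        if estimate_return(theta_k, S_k) > estimate_return(theta_star, S_k):
            theta_star = theta_k                          # line 12
            w_r = max(w_r / 10.0, 1e-6)                   # line 13
        else:
            w_r = min(w_r * 10.0, 1e-2)                   # line 15
            S += sample_rollouts([theta_star])            # line 16
    return theta_star                                     # line 19
```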

6. Experimental Evaluation

We evaluated our method on planar swimming, hopping, and walking, as well as 3D running. Each task was simulated with the MuJoCo simulator (Todorov et al., 2012), using systems of rigid links with noisy motors at the joints.[3] The policies were represented by neural network controllers that mapped current joint angles and velocities directly to joint torques. The reward function was a sum of three terms:

$$r(x, u) = -w_u \|u\|^2 - w_v(v_x - v_x^\star)^2 - w_h(p_y - p_y^\star)^2,$$

where v_x and v_x⋆ are the current and desired horizontal velocities, p_y and p_y⋆ are the current and desired heights of the root link, and w_u, w_v, and w_h determine the weight on each objective term. The weights for each task are given in Appendix B of the supplement, along with descriptions of each simulated robot.
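Concretely, this reward is straightforward to implement; the sketch below uses the walker weights from Appendix B as an example, with the state-indexing helpers (horizontal root velocity and root height) left as assumptions.

```python
import numpy as np

def locomotion_reward(x, u, root_vel_x, root_height,
                      w_u=1e-4, w_v=1.0, w_h=10.0,
                      v_star=1.2, p_star=1.5):
    """Reward r(x, u) for the locomotion tasks (walker weights shown).

    root_vel_x, root_height: functions extracting v_x and p_y from the
    state vector x; how these are indexed depends on the model.
    """
    torque_cost = w_u * float(np.dot(u, u))             # w_u ||u||^2
    vel_cost = w_v * (root_vel_x(x) - v_star) ** 2      # velocity tracking
    height_cost = w_h * (root_height(x) - p_star) ** 2  # height tracking
    return -(torque_cost + vel_cost + height_cost)
```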

The policy was represented by a neural network with one hidden layer, with a soft rectifying nonlinearity a = log(1 + exp(z)) at the first layer and linear connections to the output layer. Gaussian noise was added to the output to create a stochastic policy. The policy search ran for 80 iterations, with 40 initial guiding samples, 10 samples added from the current policy at each iteration, and up to 100 samples selected for the active set S_k. Adaptive guiding distributions were only used in Section 6.2. Adaptation was performed in the first 40 iterations, with 50 DDP iterations each time.
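For reference, a minimal sketch of such a stochastic soft-rectifier policy is shown below; the hidden layer size, noise level, and initialization scale are placeholders rather than the exact values used in each experiment.

```python
import numpy as np

class SoftPlusPolicy:
    """One-hidden-layer policy: softplus units, linear output, Gaussian noise."""

    def __init__(self, dim_x, dim_u, hidden=50, sigma=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.01 * rng.standard_normal((hidden, dim_x))
        self.b1 = np.zeros(hidden)
        self.W2 = 0.01 * rng.standard_normal((dim_u, hidden))
        self.b2 = np.zeros(dim_u)
        self.sigma = sigma                       # output noise std deviation
        self.rng = rng

    def mean(self, x):
        a = np.logaddexp(0.0, self.W1 @ x + self.b1)   # a = log(1 + exp(z))
        return self.W2 @ a + self.b2                   # linear output layer

    def act(self, x):
        mu = self.mean(x)
        return mu + self.sigma * self.rng.standard_normal(mu.shape)
```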

The initial guiding distributions for the swimmer and hopper were generated directly with DDP. To illustrate the capacity of our method to learn from demonstration, the initial example trajectory for the walker was obtained from a prior locomotion system (Yin et al., 2007), while the 3D humanoid was initialized with example motion capture of human running.

When using examples, we regularized the reward with a tracking term equal to the squared deviation from the example states and actions, with a weight of 0.05. This term serves to keep the guiding samples close to the examples and ensures that the policy covariance is positive definite. The same term was also used with the swimmer and hopper for consistency. The tracking term was only used for the initial guiding distributions, and was not used during the policy search.

The swimmer, hopper, and walker are shown in Figure 1, with plots of the root position under the learned policies and under the initial DDP solution. The 3D humanoid is shown in Figure 5.

[3] The standard deviation of the noise at each joint was set to 10% of the standard deviation of its torques in the initial DDP-generated gait.

Figure 1. Plots of the root trajectory for the swimmer, hopper, and walker. The green lines are samples from learned policies, and the black line is the DDP solution.

6.1. Comparisons

In the first set of tests, we compared to three variants of our algorithm and three prior methods, using the planar swimmer, hopper, and walker domains. The time horizon was 500, and the policies had 50 hidden units and up to 1256 parameters. The first variant did not use the regularization term.[4] The second, referred to as "non-guided," did not use the guiding samples during the policy search. The third variant also did not use the guiding samples in the policy objective, but used the guiding distributions as "restart distributions" to specify new initial state distributions that cover states we expect to visit under a good policy, analogously to prior work (Kakade & Langford, 2002; Bagnell et al., 2003). This comparison was meant to check that the guiding samples actually aided policy learning, rather than simply focusing the policy search on "good" states. All variants initialize the policy by pretraining on the guiding samples.

The first prior method, referred to as "single example," initialized the policy using the single initial example, as in prior work on learning from demonstration, and then improved the policy with importance sampling but no guiding samples, again as in prior work (Peshkin & Shelton, 2002). The second method used standard policy gradients, with a PGT or GPOMDP-type estimator (Sutton et al., 1999; Baxter et al., 2001). The third method was DAGGER, an imitation learning algorithm that aims to find a policy that matches the "expert" DDP actions (Ross et al., 2011).

The results are shown in Figure 2 in terms of the mean reward of the 10 samples at each iteration, along with the value of the initial example and a shaded interval indicating two standard deviations of the guiding sample values. Our algorithm learned each gait, matching the reward of the initial example. The comparison to the unregularized variant shows that the proposed regularization term is crucial for obtaining a good policy. The non-guided variant, which only used the guiding samples for pretraining, sometimes found a good solution because even a partially successful initial policy may succeed on one of its initial samples, which then serves the same function as the guiding samples by indicating high-reward regions. However, a successful non-guided outcome hinged entirely on the initial policy. This is illustrated by the fourth graph in Figure 2, which shows a walking policy with 20 hidden units that is less successful initially. The full algorithm was still able to improve using the guiding samples, while the non-guided variant did not make progress. The restart distribution variant performed poorly. Wider initial state distributions greatly increased the variance of the estimator, and because the guiding distributions were only used to sample states, their ability to also point out good actions was not leveraged.

Figure 2. Comparison of guided policy search (GPS) with ablated variants and prior methods (mean reward per iteration for the swimmer, hopper, and walker with 50 hidden units, and the walker with 20 hidden units). Our method successfully learns each gait, while methods that do not use guiding samples or regularization fail to make progress. All methods use 10 rollouts per iteration. Guided variants (GPS, unregularized, restart, DAGGER) also use 40 guiding samples.

[4] We also evaluated the ESS constraint proposed by Tang and Abbeel, but found that it performed no better on our tasks than the unregularized variant.

Standard policy gradient and "single example" learning from demonstration methods failed to learn any of the gaits. This suggests that guiding samples are crucial for learning such complex behaviors, and initialization from a single example is insufficient. DAGGER also performed poorly, since it assumed that the expert could provide optimal actions in all states, while the DDP policy was actually only valid close to the example. DAGGER therefore wasted a lot of effort on states where the DDP policy was not valid, such as after the hopper or walker fell.

[Plot: RMS torque (Nm) over time steps 1-80 for the walker, comparing GPS, DDP, and the guiding samples.]

Interestingly, the GPS policies often used less torque than the initial DDP solution. The plot shows the torque magnitudes for the walking policy, the deterministic DDP policy around the example, and the guiding samples. The smoother, more compliant behavior of the learned policy can be explained in part by the fact that the neural network can produce more subtle corrections than simple linear feedback.

6.2. Adaptive Resampling

The initial trajectories in the previous section could all be reproduced by neural networks with sufficient training, making adaptive guiding distributions unnecessary. In the next experiment, we constructed a walking example that switched to another gait after 2.5s, making the initial guiding samples difficult to recreate with a stationary policy. As discussed in Section 4.2, adapting the guiding distribution to the policy can produce more suitable samples in such cases.

[Plot: mean reward per iteration for the walker adaptation test, with and without adapted guiding samples.]

The plot shows the results with and without adaptive samples. With adaptation, our algorithm quickly learned a successful gait, while without it, it made little progress.[5] The supplemental video shows that the final adapted example has a single gait that resembles both initial gaits. In practice, adaptation is useful if the examples are of low quality, or if the observations are insufficient to distinguish states where they take different actions. As the examples are adapted, the policy encourages the DDP solution to be more regular. In future work, it would be interesting to see if this iterative process can find trajectories that are too complex to be found by either method alone.

6.3. Generalization

Next, we explore how our policies generalize to new environments. We trained a policy with 100 hidden units to walk 4m on flat ground, and then climb a 10° incline. The policy could not perceive the slope, and the root height was only given relative to the ground. We then moved the start of the incline and compared our method to the initial DDP policy, as well as a simplified trajectory-based dynamic programming (TBDP) approach that aggregates local DDP policies and uses the policy of the nearest neighbor to the current state (Atkeson & Stephens, 2008). Prior TBDP methods add new trajectories until the TBDP policy achieves good results. Since the initial DDP policy already succeeds on the training environment, only this initial policy was used. Both methods therefore used the same DDP solution: TBDP used it to build a nonparametric policy, and GPS used it to generate guiding samples.

[5] Note that the adapted variant used 20 samples per iteration: 10 on-policy and 10 adaptive guiding samples.

Figure 3. Comparison of GPS, TBDP, and DDP at varying incline locations (left, mean reward vs. incline location from -4.0m to 4.0m) and plots of their rollouts at the +1.2m incline location (right). GPS and TBDP generalized to all incline locations.

Figure 4. Rollouts of GPS, TBDP, and DDP on test terrains, shown as colored trajectories, with mean rewards in brackets (GPS: -806 to -851, TBDP: -6030 to -7118, DDP: -7654 to -8016 across the four test terrains). All DDP rollouts and most TBDP rollouts fall within a few steps, and all TBDP rollouts fall before reaching the end, while GPS generalizes successfully.

As shown in Figure 3, GPS generalized to all positions, while the DDP policy, which expected the incline to start at a specific time, could only climb it in a small interval around its original location. The TBDP solution was also able to generalize successfully.

In the second experiment, we trained a walking policy on terrain consisting of 1m and 2m segments with varying slopes, and tested on four random terrains. The results in Figure 4 show that GPS generalized to the new terrains, while the nearest-neighbor TBDP policy did not, with all rollouts eventually failing on each test terrain. Unlike TBDP, the GPS neural network learned generalizable rules for balancing on sloped terrain. While these rules might not generalize to much steeper inclines without additional training, they indicate a degree of generalization significantly greater than nearest-neighbor lookup. Example rollouts from each policy are shown in the supplemental video.[6]

[6] The videos can be viewed on the project website at http://graphics.stanford.edu/projects/gpspaper/index.htm

Figure 5. Humanoid running on test terrains, with mean rewards in brackets (GPS: -157 to -184, TBDP: -3418 to -3787 across the four test terrains) (left), along with illustrations of the 3D humanoid model (right). Our approach again successfully generalized to the test terrains, while the TBDP policy is unable to maintain balance.

6.4. Humanoid Running

In our final experiment, we used our method to learn a 3D humanoid running gait. This task is highly dynamic, requires good timing and balance, and has 63 state dimensions, making it well suited for exploring the capacity of our method to scale to high-dimensional problems. Previous locomotion methods often rely on hand-crafted components to introduce prior knowledge or ensure stability (Tedrake et al., 2004; Whitman & Atkeson, 2009), while our method used only general purpose neural networks.

As in the previous section, the policy was trained on terrain of varying slope, and tested on four random terrains. Due to the complexity of this task, we used three different training terrains and a policy with 200 hidden units. The motor noise was reduced from 10% to 1%. The guiding distributions were initialized from motion capture of a human run, and DDP was used to find the torques that realized this run on each training terrain. Since the example was recorded on flat ground, we used more mild 3° slopes.

Rollouts from the learned policy on the test terrains are shown in Figure 5, with comparisons to TBDP. Our method again generalized to all test terrains, while TBDP did not. This indicates that the task required nontrivial generalization (despite the mild slopes), and that GPS was able to learn generalizable rules to maintain speed and balance. As shown in the supplemental video, the learned gait also retained the smooth, lifelike appearance of the human demonstration.

7. Previous Work

Policy gradient methods often require on-policy samples at each gradient step, do not admit off-policy samples, and cannot use line searches or higher order optimization methods such as LBFGS, which makes them difficult to use with complex policy classes (Peters & Schaal, 2008). Our approach follows prior methods that use importance sampling to address these challenges (Peshkin & Shelton, 2002; Kober & Peters, 2009; Tang & Abbeel, 2010). While these methods recycle samples from previous policies, we also introduce guiding samples, which dramatically speed up learning and help avoid poor local optima. We also regularize the importance sampling estimator, which prevents the optimization from assigning low probabilities to all samples. The regularizer controls how far the policy deviates from the samples, serving a similar function to the natural gradient, which bounds the information loss at each iteration (Peters & Schaal, 2008). Unlike Tang and Abbeel's ESS constraint (2010), our regularizer does not penalize reliance on a few samples, but does avoid policies that assign a low probability to all samples. Our evaluation shows that the regularizer can be crucial for learning effective policies.

Since the guiding distributions point out high-reward regions to the policy search, we can also consider prior methods that explore high-reward regions by using restart distributions for the initial state (Kakade & Langford, 2002; Bagnell et al., 2003). Our restart distribution variant is similar to Kakade and Langford with α = 1. In our evaluation, this approach suffers from the high variance of the restart distribution, and is outperformed significantly by GPS.

We also compare to a nonparametric trajectory-based dynamic programming method. While several TBDP methods have been proposed (Atkeson & Morimoto, 2002; Atkeson & Stephens, 2008), we used a simple nearest-neighbor variant with a single trajectory, which is suitable for episodic tasks with a single initial state. Unlike GPS, TBDP cannot learn arbitrary parametric policies, and in our experiments, GPS exhibited better generalization on rough terrain.

The guiding distributions can be built from expert demonstrations. Many prior methods use expert examples to aid learning. Imitation methods such as DAGGER directly mimic the expert (Ross et al., 2011), while our approach maximizes the actual return of the policy, making it less vulnerable to suboptimal experts. DAGGER fails to make progress on our tasks, since it assumes that the DDP actions are optimal in all states, while they are only actually valid near the example. Additional DDP optimization from every visited state could produce better actions, but would be quadratic in the trajectory length, and would still not guarantee optimal actions, since the local DDP method cannot recover from all failures.

Other previous methods follow the examples directly (Abbeel et al., 2006), or use the examples for supervised initialization of special policy classes (Ijspeert et al., 2002). The former methods usually produce nonstationary feedback policies, which have limited capacity to generalize. In the latter approach, the policy class must be chosen carefully, as supervised learning does not guarantee that the policy will reproduce the examples: a small mistake early on could cause drastic deviations. Since our approach incorporates the guiding samples into the policy search, it does not rely on supervised learning to learn the examples, and can therefore use flexible, general-purpose policy classes.

8. Discussion and Future Work

We presented a guided policy search algorithm that can learn complex policies with hundreds of parameters by incorporating guiding samples into the policy search. These samples are drawn from a distribution built around a DDP solution, which can be initialized from demonstrations. We evaluated our method using general-purpose neural networks on a range of challenging locomotion tasks, and showed that the learned policies generalize to new environments.

While our policy search is model-free, it is guided by a model-based DDP algorithm. A promising avenue for future work is to build the guiding distributions with model-free methods that either build trajectory-following policies (Ijspeert et al., 2002) or perform stochastic trajectory optimization (Kalakrishnan et al., 2011).

Our rough terrain results suggest that GPS can generalize by learning basic locomotion principles such as balance. Further investigation of generalization is an exciting avenue for future work. Generalization could be improved by training on multiple environments, or by using larger neural networks with multiple layers or recurrent connections. It would be interesting to see whether such extensions could learn more general and portable concepts, such as obstacle avoidance, perturbation recoveries, or even higher-level navigation skills.

Acknowledgements

We thank Emanuel Todorov, Tom Erez, and Yuval Tassa for providing the simulator used in our experiments. Sergey Levine was supported by NSF Graduate Research Fellowship DGE-0645962.

References

Abbeel, P., Coates, A., Quigley, M., and Ng, A. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems (NIPS 19), 2006.

Atkeson, C. and Morimoto, J. Nonparametric representation of policies and value functions: A trajectory-based approach. In Advances in Neural Information Processing Systems (NIPS 15), 2002.

Atkeson, C. and Stephens, B. Random sampling of states in dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 38(4):924-929, 2008.

Bagnell, A., Kakade, S., Ng, A., and Schneider, J. Policy search by dynamic programming. In Advances in Neural Information Processing Systems (NIPS 16), 2003.

Baxter, J., Bartlett, P., and Weaver, L. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351-381, 2001.

Deisenroth, M. and Rasmussen, C. PILCO: a model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.

Dvijotham, K. and Todorov, E. Inverse optimal control with linearly-solvable MDPs. In International Conference on Machine Learning (ICML), 2010.

Ijspeert, A., Nakanishi, J., and Schaal, S. Movement imitation with nonlinear dynamical systems in humanoid robots. In International Conference on Robotics and Automation, 2002.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), 2002.

Kalakrishnan, M., Chitta, S., Theodorou, E., Pastor, P., and Schaal, S. STOMP: stochastic trajectory optimization for motion planning. In International Conference on Robotics and Automation, 2011.

Kober, J. and Peters, J. Learning motor primitives for robotics. In International Conference on Robotics and Automation, 2009.

Peshkin, L. and Shelton, C. Learning from scarce experience. In International Conference on Machine Learning (ICML), 2002.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008.

Ross, S. and Bagnell, A. Agnostic system identification for model-based reinforcement learning. In International Conference on Machine Learning (ICML), 2012.

Ross, S., Gordon, G., and Bagnell, A. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 15:627-635, 2011.

Sutton, R., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS 11), 1999.

Tang, J. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems (NIPS 23), 2010.

Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

Tedrake, R., Zhang, T., and Seung, H. Stochastic policy gradient reinforcement learning on a simple 3D biped. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

Whitman, E. and Atkeson, C. Control of a walking biped using a combination of simple policies. In 9th IEEE-RAS International Conference on Humanoid Robots, 2009.

Yin, K., Loken, K., and van de Panne, M. SIMBICON: simple biped locomotion control. ACM Transactions on Graphics, 26(3), 2007.

Ziebart, B. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.

A. Objective Gradients

To compute the gradient ∇Φ(θ), we first write the gradient in terms of the gradients of Z_t(θ) and π_θ:

$$\nabla\Phi(\theta) = \sum_{t=1}^T\left[\frac{1}{Z_t(\theta)}\sum_{i=1}^m \frac{\nabla\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}r(x_t^i, u_t^i) - \frac{\nabla Z_t(\theta)}{Z_t(\theta)^2}\sum_{i=1}^m \frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}r(x_t^i, u_t^i) + w_r\frac{\nabla Z_t(\theta)}{Z_t(\theta)}\right].$$

From the definition of Z_t(θ), we have that

$$\frac{\nabla Z_t(\theta)}{Z_t(\theta)} = \frac{1}{Z_t(\theta)}\sum_{i=1}^m \frac{\nabla\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}.$$

Letting $J_t(\theta) = \frac{1}{Z_t(\theta)}\sum_{i=1}^m \frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}r(x_t^i, u_t^i)$, we can rewrite the gradient as

$$\nabla\Phi(\theta) = \sum_{t=1}^T\frac{1}{Z_t(\theta)}\sum_{i=1}^m \frac{\nabla\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}\left[r(x_t^i, u_t^i) - J_t(\theta) + w_r\right] = \sum_{t=1}^T\frac{1}{Z_t(\theta)}\sum_{i=1}^m \frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}\nabla\log\pi_\theta(\zeta_{i,1:t})\,\xi_t^i,$$

using the identity ∇π_θ(ζ) = π_θ(ζ)∇ log π_θ(ζ), where ξ_t^i = r(x_t^i, u_t^i) − J_t(θ) + w_r. When the policy is represented by a large neural network, it is convenient to write the gradient as a sum where the output at each state appears only once, to produce a set of errors that can be fed into a standard backpropagation algorithm. For a neural network policy with uniform output noise σ and mean µ(x_t), we have

$$\nabla\log\pi_\theta(\zeta_{i,1:t}) = \sum_{t'=1}^{t}\nabla\log\pi_\theta(u_{t'}|x_{t'}) = \sum_{t'=1}^{t}\nabla\mu(x_{t'})\frac{u_{t'} - \mu(x_{t'})}{\sigma^2},$$

and the gradient of the objective is given by

$$\nabla\Phi(\theta) = \sum_{t=1}^T\sum_{i=1}^m \nabla\mu(x_t^i)\frac{u_t^i - \mu(x_t^i)}{\sigma^2}\sum_{t'=t}^T \frac{1}{Z_{t'}(\theta)}\frac{\pi_\theta(\zeta_{i,1:t'})}{q(\zeta_{i,1:t'})}\xi_{t'}^i.$$

The gradient can now be computed efficiently by feeding the terms after ∇µ(x_t^i) into the standard backpropagation algorithm.
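As an illustration of this last step, the per-state "error" vectors that would be fed into backpropagation can be assembled as follows. This is a hedged numpy sketch of the bookkeeping only; the array interfaces are assumptions, and backpropagation through ∇µ itself is assumed to be provided by the network implementation.

```python
import numpy as np

def backprop_errors(log_pi, log_q, rewards, actions, means, sigma, w_r):
    """Per-sample, per-step error vectors for the gradient of Phi(theta).

    log_pi, log_q, rewards: (m, T) arrays as in the main text; actions and
    means: (m, T, dim_u) arrays of u_t^i and mu(x_t^i); sigma: output noise.
    Returns an (m, T, dim_u) array: the error to feed into backprop at the
    network output for state x_t^i.
    """
    log_w = log_pi - log_q
    log_Z = np.logaddexp.reduce(log_w, axis=0)        # log of sum of weights
    w = np.exp(log_w - log_Z)                         # normalized weights
    J = (w * rewards).sum(axis=0)                     # J_t(theta), shape (T,)
    xi = rewards - J[None, :] + w_r                   # xi_t^i
    # suffix sums over t' >= t of normalized_weight_{i,t'} * xi_{i,t'}
    suffix = np.cumsum((w * xi)[:, ::-1], axis=1)[:, ::-1]
    return (actions - means) / sigma**2 * suffix[:, :, None]
```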

B. Dynamic System Descriptions

This appendix describes the dynamical systems corresponding to the simulated robots in the swimming, hopping, and walking tasks. Images of each robot are provided in Figure 1 of the paper.

Swimmer: The swimmer is a 3-link snake, with 10 state dimensions for the position and angle of the head, the joint angles, and the corresponding velocities, as well as 2 action dimensions for the torques. The surrounding fluid applies a drag on each link, allowing the snake to propel itself. The simulation step is 0.05s, the reward weights are w_u = 0.0001, w_v = 1, and w_h = 0, and the desired velocity is v_x⋆ = 2 m/s.

Hopper: The hopper has 4 links: torso, upper leg, lower leg, and foot. The state has 12 dimensions, and the actions have 3. To make it easier to optimize a gait with DDP, we employed a softened contact model as proposed in (Tassa et al., 2012). The reward weights are w_u = 0.001, w_v = 1, and w_h = 10, and the desired velocity and height are v_x⋆ = 1.5 m/s and p_y⋆ = 1.5 m. A lower time step of 0.02s was used to handle contacts.

Walker: The walker has 7 links, corresponding to two legs and a torso, 18 state dimensions, and 6 torques. The reward weights are w_u = 0.0001, w_v = 1, and w_h = 10, and the desired velocity and height are v_x⋆ = 1.2 m/s and p_y⋆ = 1.5 m. The time step is 0.01s.

3D Humanoid: The humanoid consists of 13 links, with a free-floating 6 DoF base, 4 ball joints, 3 joints with 2 DoF, and 5 hinge joints, for a total of 29 degrees of freedom. Ball joints are represented by quaternions, while their velocities are represented by 3D vectors, so the entire model has 63 dimensions. The reward weights are w_u = 0.00001, w_v = 1, and w_h = 10, and the desired velocity and height are v_x⋆ = 2.5 m/s and p_y⋆ = 0.9 m. The time step is 0.01s. Due to the complexity of this model, the joint noise was reduced from 10% of example torque variance to 1%.

3D Humanoid: The humanoid consists of 13 links,with a free-floating 6 DoF base, 4 ball joints, 3 jointswith 2 DoF, and 5 hinge joints, for a total of 29 degreesof freedom. Ball joints are represented by quaternions,while their velocities are represented by 3D vectors,so the entire model has 63 dimensions. The rewardweights are are wu = 0.00001, wv = 1, and wh = 10,and the desired velocity and height are v?x = 2.5m/sand p?y = 0.9m. The time step is 0.01s. Due to thecomplexity of this model, the joint noise was reducedfrom 10% of example torque variance to 1%.