
Learning Dexterous Manipulation Policies from Experience and Imitation

Journal Title XX(X):1–17 © The Author(s) 2016

Reprints and permission: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/ToBeAssigned www.sagepub.com/

Vikash Kumar1, Abhishek Gupta2, Emanuel Todorov1 and Sergey Levine2

Abstract
We explore learning-based approaches for feedback control of a dexterous five-finger hand performing non-prehensile manipulation. First we learn local controllers that are able to perform the task starting at a predefined initial state. These controllers are constructed using trajectory optimization with respect to locally-linear time-varying models learned directly from sensor data. In some cases we initialize the optimizer with human demonstrations collected via tele-operation in a virtual environment. We demonstrate that such controllers can perform the task robustly, both in simulation and on the physical platform, for a limited range of initial conditions around the trained starting state. We then consider two interpolation methods for generalizing to a wider range of initial conditions: deep learning, and nearest neighbors. We find that nearest neighbors achieves higher performance. Nevertheless the neural network has its advantages: it uses only tactile and proprioceptive feedback but no visual feedback about the object (i.e. it performs the task blind), and learns a time-invariant policy. In contrast, the nearest neighbors method switches between time-varying local controllers based on the proximity of initial object states sensed via motion capture. While both generalization methods leave room for improvement, our work shows that (i) local trajectory-based controllers for complex non-prehensile manipulation tasks can be constructed from surprisingly small amounts of training data, and (ii) collections of such controllers can be interpolated to form more global controllers.

Keywords
Reinforcement Learning, Dexterous Manipulation, Trajectory Optimization

1 Introduction

Dexterous manipulation is among the most challenging control problems in robotics, and remains largely unsolved. This is due to a combination of factors including high dimensionality, intermittent contact dynamics, and under-actuation in the case of dynamic object manipulation. Here we describe our efforts to tackle this problem in a principled way. We do not rely on manually designed controllers. Instead we synthesize controllers automatically, by optimizing high-level cost functions, as well as by building on human-provided expert demonstrations. The resulting controllers are able to manipulate freely-moving objects, as shown in Figure 1. Such non-prehensile manipulation is challenging, since the system must reason about both the kinematics and the dynamics of the interaction (Lynch and Mason 1999). We present results for learning both local models and control policies that can succeed from a single initial state, as well as more generalizable global policies that can use limited onboard sensing to perform a complex grasping behavior. The small amount of data needed for learning each controller (around 60 trials on the physical hardware) indicates that the approach can practically be used to learn large repertoires of dexterous manipulation skills.

We use our ADROIT platform (Kumar et al. 2013), which is a ShadowHand skeleton augmented with high-performance pneumatic actuators. This system has a 100-dimensional continuous state space, including the positions and velocities of 24 joints, the pressures in 40 pneumatic cylinders, and the position and velocity of the object being manipulated. Pneumatics have non-negligible time constants (around 20 ms in our system), which is why the cylinder pressures represent additional state variables, making it difficult to apply torque-control techniques. The system also has a 40-dimensional continuous control space, namely the commands to the proportional valves regulating the flow of compressed air to the cylinders. The cylinders act on the joints through tendons. The tendons do not introduce additional state variables (since we avoid slack via pre-tensioning) but nevertheless complicate the dynamics. Overall this is a daunting system to model, let alone control.

Depending on one’s preference of terminology, our method can be classified as model-based Reinforcement Learning (RL), or as adaptive optimal control (Bellman and Kalaba 1959). While RL aims to solve the same general problem as optimal control, its uniqueness comes from the emphasis on model-free learning in stochastic domains (Sutton and Barto 1998). The idea of learning policies without having models still dominates RL, and forms the basis of the most remarkable success stories, both old (Tesauro 1994) and new (Mnih et al. 2015). However, RL with learned models has also been considered. Adaptive control, on the other hand, mostly focuses on learning the parameters of a model with predefined structure, essentially interleaving system identification with control (Astrom and Wittenmark 2013).

1 Department of Computer Science & Engineering, University of Washington, WA, USA. {vikash, todorov}@cs.washington.edu
2 Department of Electrical Engineering & Computer Sciences, University of California at Berkeley, CA, USA. {abhigupta, svlevine}@berkeley.edu

Corresponding author: Vikash Kumar, add postal address

Figure 1. Learned hand manipulation behavior involving clockwise rotation of the object

Our approach here lies somewhere in between (to fix terminology, we call it RL in subsequent sections). We rely on a model, but that model does not have any informative predefined structure. Instead, it is a time-varying linear model learned from data, using a generic prior for regularization. Related ideas have been pursued previously (Mitrovic et al. 2010; Levine and Abbeel 2014; Levine et al. 2015b). Nevertheless, as with most approaches to automatic control and computational intelligence in general, the challenge is not only in formulating ideas but also in getting them to scale to hard problems, which is our main contribution here. In particular, we demonstrate scaling from a 14-dimensional state space in (Levine et al. 2015b) to a 100-dimensional state space here. This is important in light of the curse of dimensionality. Indeed, RL has been successfully applied to a range of robotic tasks (Tedrake et al. 2004; Kober et al. 2010; Pastor et al. 2009; Deisenroth et al. 2011); however, dimensionality and sample complexity have presented major challenges (Kober et al. 2013; Deisenroth et al. 2013).

The manipulation skills we learn are initially represented as time-varying linear-Gaussian controllers. These controllers are fundamentally trajectory-centric, but otherwise are extremely flexible, since they can represent any trajectory with any linear stabilization strategy. Since the controllers are time-varying, the overall learned control law is nonlinear, but is locally linear at each time step. These types of controllers have been employed previously for controlling lower-dimensional robotic arms (Lioutikov et al. 2014; Levine et al. 2015b).

For learning more complex manipulation skills, we also explore the use of human demonstrations to initialize the controllers. Complex tasks with delayed rewards, such as grasping and in-hand repositioning of a heavy object, are difficult to learn from scratch. We show that a teleoperation system can be used to provide example demonstrations from a human operator using a glove-based interface, and that these demonstrations can be used to initialize learning for complex skills.

Finally, to move beyond local policies that can succeed from only a narrow range of initial states, we explore generalization through two distinct approaches. In the first method, we train a collection of local policies, each initialized with a different initial demonstration, and then use a nearest neighbor query to select the local policy for which the initial conditions best match the current pose of the manipulated object. In the second method, we use a deep neural network to learn to mimic all of the local policies, thus removing the need for nearest neighbor queries. Our experimental results show that the nearest neighbor approach achieves the best success rate, but at the cost of requiring the variables for the nearest neighbor queries (the pose of the object) to be provided by hand. We also show that the deep neural network can learn a time-invariant policy for performing the task without requiring knowledge of the object pose at all, using only onboard sensing on the five-finger hand to perform the task.

The work on local trajectory-based control (Sections 5 and 6) was previously described in conference proceedings (Kumar et al. 2016), while the work on generalization (Sections 7 and 8) is new and is described here for the first time.

2 Related Work

Although robotic reinforcement learning has experienced considerable progress in recent years, with successful results in domains ranging from flight (Abbeel et al. 2006) to locomotion (Tedrake et al. 2004) to manipulation (Peters et al. 2010a; Theodorou et al. 2010; Peters and Schaal 2008), comparatively few methods have been applied to control dexterous hands. van Hoof et al. (2015) report results for simple in-hand manipulation with a 3-finger hand, and our prior work reports learning of simple in-hand manipulation skills, such as rotating a cylinder, using time-varying linear-Gaussian controllers (Kumar et al. 2016). However, neither of these prior methods demonstrates generalization to conditions not seen during training. Our experiments demonstrate that our approach can learn policies for a complex precision grasping task, and can generalize to variation in the initial position of the target object. In contrast to in-hand manipulation, this task exhibits complex discontinuities at the point of contact. We overcome this challenge by combining learning from experience with imitation learning from human demonstrations, provided through a data glove teleoperation interface.

Initialization of controllers from demonstration is a widely employed technique in robotic reinforcement learning (Peters et al. 2010a; Theodorou et al. 2010). However, most prior robotic reinforcement learning methods still use a hand-specified reward or cost function to provide the goal of the task during learning. Specifying suitable cost functions for complex dexterous manipulation can be exceedingly challenging, since simple costs can lead to poor local optima, while complex shaped costs require extensive intuition about the task. In our work, we define the cost in terms of the example demonstrations. This approach resembles the work of Gupta et al. (2016), which used an EM-style algorithm to associate demonstrations with initial states in a reinforcement learning scenario. However, this prior work showed results on a simple inflatable hand with limited actuation, and did not demonstrate dexterous manipulation for complex tasks.

3 Overview

The ADROIT platform, which serves as the experimental platform for all of our dexterous manipulation experiments, is described in detail in Section 4. This system is used in three sets of experiments: the first examines learning dexterous manipulation skills from scratch using trajectory-centric reinforcement learning, the second focuses on learning more complex skills with a combination of trajectory-centric reinforcement learning and learning from demonstration, and the third examines how various approximation methods, including nearest neighbor and deep neural networks, can be used to combine multiple learned behaviors into a single generalizable skill that succeeds under various circumstances.

The trajectory-centric reinforcement learning algorithm that we use combines the linear-quadratic regulator (LQR) algorithm with learned time-varying local linear models. This algorithm, which follows previous work (Levine and Abbeel 2014), is described in Section 5. We then present results obtained with this algorithm on both a real-world and a simulated version of the ADROIT platform in Section 6. This first set of experiments focuses primarily on the capability of the trajectory-centric reinforcement learning method to efficiently learn viable and robust manipulation skills.

The second set of experiments, presented in Section 7, is aimed at evaluating how human demonstrations can be used to aid learning for more complex skills. In this section, we examine a complex grasping scenario, where trajectory-centric reinforcement learning on its own does not produce sufficiently successful behaviors, while human demonstrations alone also do not achieve a sufficient degree of robustness in the face of perturbations. We demonstrate that combining demonstrations with trajectory-centric reinforcement learning produces effective skills with a high degree of robustness to variation in the initial placement of objects in the world.

Our final set of experiments, presented in Section 8, addresses the question of generalization: can we use multiple skills, learned with a combination of imitation and trajectory-centric reinforcement learning, to acquire a single robust and generalizable dexterous manipulation policy? To that end, we explore the use of deep neural networks to achieve generalization by combining the behaviors of multiple skills from the previous section. We also compare to nearest neighbor as a baseline method. We demonstrate that deep neural networks can learn time-invariant manipulation policies that acquire the strategies represented by the time-varying controllers learned with trajectory-centric reinforcement learning, and furthermore can perform those skills using onboard sensing in a simulated experiment, without knowledge of the true position of the manipulated object.

4 System

Modularity and the ease of switching robotic platforms formed the overarching philosophy of our system design. The learning algorithm (Algorithm 1) has no dependency on the selected robotic platform except for step 3, where the policies are shipped to the robotic platform for evaluation and the resulting execution trajectories are collected. This allows the training to happen either locally (on the machine controlling the robot) or remotely (if more computational power is needed).

Manipulation strategies were studied for the two different platforms detailed below.

4.1 Hardware Platform

The ADROIT platform is described in detail in (Kumar et al. 2013). Here we summarize the features relevant to the present context. The ADROIT manipulation platform is an anthropomorphic arm-hand system actuated by custom-built, high-performance pneumatic actuation. It consists of a 24-dof hand and a 4-dof arm. As our motivation here is to understand in-hand dexterous manipulation, we mounted the 24-dof hand on a fixed base to promote finger-centric behaviors. The fixed base severely limits the workspace of the overall system, but renders the system amenable only to finger-based and wrist-based manipulation strategies. For this work, we use the term ‘ADROIT’ to refer to the fixed-base 24-dof hand setup. 20 of the 24 hand joints are independently actuated using 40 antagonistic tendons; the DIP joint of each of the 4 fingers is coupled with the respective PIP joint. The finger tendons can exert up to 42 N of force, while the wrist tendon can exert up to 120 N. Each cylinder is supplied with compressed air via a high-performance Festo valve. The cylinders are fitted with solid-state pressure sensors. The pressures, together with the joint positions and velocities (sensed by potentiometers in each joint), are provided as state variables to our controller. ADROIT’s low-level driver runs on a 12-core 3.47 GHz Intel Xeon processor with 12 GB memory running Windows x64.

The manipulation task also involves an object: a long tube filled with coffee beans, inspired by earlier work on grasping (Amend Jr et al. 2012). The object is fitted with PhaseSpace active infrared markers on each end. The markers are used to estimate the object position and velocity (both linear and angular), which are also provided as state variables. Since all our sensors have relatively low noise, we apply a minimal amount of filtering before sending the sensor data to the controller.

Figure 2. System overview. The hardware driver reads the pressure (NI, 35 kHz), tendon (NI, 9 kHz), and joint (µU, 500 Hz) sensors, filters and subsamples them into the 100-dimensional state vector (joint positions, joint velocities, cylinder pressures), and exchanges this state and the 40-dimensional valve-voltage control vector with the controller over sockets at 200 Hz; valve commands are issued through NI hardware at 200 Hz.

4.2 Simulation Platform

We model the ADROIT hand, including the antagonistic tendon transmission, joint coupling and pressure dynamics, using the MuJoCo simulator we have developed (Todorov et al. 2012). Pressure dynamics are implemented by extending the default actuation model with user callbacks. Simulating a 5 s trajectory at a 2 ms timestep takes around 0.7 s of CPU time, or around 0.3 ms of CPU time per simulation step. This includes evaluating the feedback control law (which needs to be interpolated, because the trajectory optimizer uses 50 ms time steps) and advancing the physics simulation.
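The interpolation scheme for the feedback law is not specified above; as an illustrative sketch only (not our implementation), the time-varying gains and offsets produced at 50 ms optimizer knots could be linearly interpolated onto the 2 ms simulation grid as follows. The function name and array shapes are hypothetical.

```python
import numpy as np

def interpolate_gains(K, k, t_opt, t_sim):
    """Linearly interpolate time-varying feedback gains K_t (du x dx) and
    offsets k_t (du), defined at coarse optimizer knots t_opt (e.g. 50 ms
    apart), onto the fine simulation grid t_sim (e.g. 2 ms apart)."""
    du, dx = K.shape[1], K.shape[2]
    K_fine = np.empty((len(t_sim), du, dx))
    k_fine = np.empty((len(t_sim), du))
    for i in range(du):
        k_fine[:, i] = np.interp(t_sim, t_opt, k[:, i])
        for j in range(dx):
            K_fine[:, i, j] = np.interp(t_sim, t_opt, K[:, i, j])
    return K_fine, k_fine

# Example with hypothetical dimensions: 100 knots at 50 ms, 2 ms sim steps over 5 s.
t_opt = np.arange(100) * 0.05
t_sim = np.arange(2500) * 0.002
K = np.zeros((100, 40, 100))
k = np.zeros((100, 40))
K_fine, k_fine = interpolate_gains(K, k, t_opt, t_sim)
```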

Having a fast simulator enables us to prototype and quickly evaluate candidate learning algorithms and cost function designs before testing them on the hardware. Apart from being able to run much faster than real time, the simulator can automatically reset itself to a specified initial state (which needs to be done manually on the hardware platform). Note that the actual computation time (involving GMM fitting, policy update, and network training) is practically the same for both systems, as the learning system is oblivious to the source of the observations (i.e. whether they were generated by the hardware or the simulation platform).

Ideally, the results on the simulation platform should be leveraged to either transfer behaviors or seed the learning on the hardware platform. This, however, is hard in practice (and still an active area of research in the field) due to (a) the difficulty in aligning the high-dimensional state spaces of the two platforms, and (b) the non-deterministic nature of the real world. State space alignment requires precise system identification and sensor calibration, which are otherwise not necessary, as our algorithm can learn the local state space information directly from the raw sensor values.

5 Reinforcement Learning with Local Linear Models

In this section, we describe the reinforcement learning algorithm (summarized in Algorithm 1) that we use to control our pneumatically-driven five-finger hand. The derivation in this section follows previous work (Levine and Abbeel 2014), but we describe the algorithm here for completeness. The aim of the method is to learn a time-varying linear-Gaussian controller of the form $p(u_t|x_t) = \mathcal{N}(K_t x_t + k_t, C_t)$, where $x_t$ and $u_t$ are the state and action at time step $t$. The actions in our system correspond to the input voltages of the pneumatic valves, while the state space is described in the preceding section. The aim of the algorithm is to minimize the expectation $E_{p(\tau)}[\ell(\tau)]$ over trajectories $\tau = \{x_1, u_1, \ldots, x_T, u_T\}$, where $\ell(\tau) = \sum_{t=1}^{T} \ell(x_t, u_t)$ is the total cost, and the expectation is under $p(\tau) = p(x_1)\prod_{t=1}^{T} p(x_{t+1}|x_t, u_t)\, p(u_t|x_t)$, where $p(x_{t+1}|x_t, u_t)$ is the dynamics distribution.

Figure 3. Object rotation task: end poses for the two rotation directions (a, b: clockwise; c, d: anticlockwise), comparing the learned controller (a, c) to the movement of a human who has not seen the robot perform the task (b, d).

Algorithm 1 RL with linear-Gaussian controllers
1: initialize $p(u_t|x_t)$
2: for iteration $k = 1$ to $K$ do
3:   run $p(u_t|x_t)$ to collect trajectory samples $\{\tau_i\}$
4:   fit dynamics $p(x_{t+1}|x_t, u_t)$ to $\{\tau_i\}$ using linear regression with a GMM prior
5:   fit $p = \arg\min_{p} E_{p(\tau)}[\ell(\tau)]$ s.t. $D_{\mathrm{KL}}(p(\tau) \,\|\, \bar{p}(\tau)) \leq \epsilon$, where $\bar{p}$ is the previous controller
6: end for

5.1 Optimizing Linear-Gaussian Controllers

The simple structure of time-varying linear-Gaussian controllers admits a very efficient optimization procedure that works well even under unknown dynamics. The method is summarized in Algorithm 1. At each iteration, we run the current controller $p(u_t|x_t)$ on the robot to gather $N$ samples ($N = 5$ in all of our experiments), then use these samples to fit time-varying linear-Gaussian dynamics of the form $p(x_{t+1}|x_t, u_t) = \mathcal{N}(f_{xt} x_t + f_{ut} u_t + f_{ct}, F_t)$. This is done using linear regression with a Gaussian mixture model prior, which makes it feasible to fit the dynamics even when the number of samples is much lower than the dimensionality of the system (Levine and Abbeel 2014). We also compute a second-order expansion of the cost function around each of the samples, and average the expansions together to obtain a local approximate cost function of the form

$$\ell(x_t, u_t) \approx \frac{1}{2}[x_t; u_t]^T \ell_{xu,xut} [x_t; u_t] + [x_t; u_t]^T \ell_{xut} + \text{const}.$$
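As a rough sketch of the per-time-step regression (omitting the GMM prior, on which the actual method relies when samples are scarce, and substituting a simple ridge regularizer), the dynamics fit could look like the following; the function name, array shapes, and regularization constant are hypothetical.

```python
import numpy as np

def fit_linear_dynamics(X, U, reg=1e-6):
    """Fit time-varying linear-Gaussian dynamics
       x_{t+1} ~ N(fx_t x_t + fu_t u_t + fc_t, F_t)
    from N sampled trajectories. X: (N, T+1, dx), U: (N, T, du).
    This sketch uses per-time-step ridge regression only; the method in the
    text additionally regularizes with a GMM prior over (x_t, u_t, x_{t+1})."""
    N, T, du = U.shape
    dx = X.shape[2]
    fx = np.zeros((T, dx, dx)); fu = np.zeros((T, dx, du))
    fc = np.zeros((T, dx));     F  = np.zeros((T, dx, dx))
    for t in range(T):
        Z = np.hstack([X[:, t], U[:, t], np.ones((N, 1))])   # (N, dx+du+1)
        Y = X[:, t + 1]                                      # (N, dx)
        A = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ Y).T
        fx[t], fu[t], fc[t] = A[:, :dx], A[:, dx:dx + du], A[:, -1]
        resid = Y - Z @ A.T
        F[t] = resid.T @ resid / max(N - 1, 1) + reg * np.eye(dx)
    return fx, fu, fc, F
```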


where subscripts denote derivatives, e.g. $\ell_{xut}$ is the gradient of $\ell$ with respect to $[x_t; u_t]$, while $\ell_{xu,xut}$ is the Hessian. The particular cost functions used in our experiments are described in the next section. When the cost function is quadratic and the dynamics are linear-Gaussian, the optimal time-varying linear-Gaussian controller of the form $p(u_t|x_t) = \mathcal{N}(K_t x_t + k_t, C_t)$ can be obtained by using the LQR method. This type of iterative approach can be thought of as a variant of iterative LQR (Li and Todorov 2004), where the dynamics are fitted to data. Under this model of the dynamics and cost function, the optimal policy can be computed by recursively computing the quadratic Q-function and value function, starting with the last time step. These functions are given by

$$V(x_t) = \frac{1}{2} x_t^T V_{x,xt}\, x_t + x_t^T V_{xt} + \text{const}$$
$$Q(x_t, u_t) = \frac{1}{2}[x_t; u_t]^T Q_{xu,xut} [x_t; u_t] + [x_t; u_t]^T Q_{xut} + \text{const}.$$

We can express them with the following recurrence:

$$Q_{xu,xut} = \ell_{xu,xut} + f_{xut}^T V_{x,xt+1} f_{xut}$$
$$Q_{xut} = \ell_{xut} + f_{xut}^T V_{xt+1}$$
$$V_{x,xt} = Q_{x,xt} - Q_{u,xt}^T Q_{u,ut}^{-1} Q_{u,xt}$$
$$V_{xt} = Q_{xt} - Q_{u,xt}^T Q_{u,ut}^{-1} Q_{ut},$$

which allows us to compute the optimal control law as $g(x_t) = \hat{u}_t + k_t + K_t(x_t - \hat{x}_t)$, where $K_t = -Q_{u,ut}^{-1} Q_{u,xt}$ and $k_t = -Q_{u,ut}^{-1} Q_{ut}$, and $\hat{x}_t$, $\hat{u}_t$ denote the nominal states and actions around which the expansion is computed. If we consider $p(\tau)$ to be the trajectory distribution formed by the deterministic control law $g(x_t)$ and the stochastic dynamics $p(x_{t+1}|x_t, u_t)$, LQR can be shown to optimize the standard objective

$$\min_{g(x_t)} \sum_{t=1}^{T} E_{p(x_t, u_t)}[\ell(x_t, u_t)]. \quad (1)$$
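A minimal sketch of this backward recursion is given below, assuming NumPy arrays with hypothetical shapes for the fitted dynamics and quadratic cost expansion; the small regularizer added to $Q_{u,ut}$ is a numerical convenience and not part of the recurrence above.

```python
import numpy as np

def lqr_backward(fxu, fc, Cxx, cx, dx, du):
    """Backward pass of the recurrence in Section 5.1.
    fxu[t]: (dx, dx+du) stacked [fx_t, fu_t];  fc[t]: (dx,) dynamics offset.
    Cxx[t]: (dx+du, dx+du) cost Hessian l_{xu,xu,t};  cx[t]: (dx+du,) gradient l_{xu,t}.
    Returns the feedback gains K_t and offsets k_t."""
    T = len(fc)
    Vxx, vx = np.zeros((dx, dx)), np.zeros(dx)
    K, k = np.zeros((T, du, dx)), np.zeros((T, du))
    for t in reversed(range(T)):
        Qxx = Cxx[t] + fxu[t].T @ Vxx @ fxu[t]            # Q_{xu,xu,t}
        # Q_{xu,t}; the Vxx @ fc term handles the affine dynamics offset.
        qx = cx[t] + fxu[t].T @ (vx + Vxx @ fc[t])
        Quu, Qux, qu = Qxx[dx:, dx:], Qxx[dx:, :dx], qx[dx:]
        Quu_inv = np.linalg.inv(Quu + 1e-8 * np.eye(du))  # regularized inverse
        K[t] = -Quu_inv @ Qux
        k[t] = -Quu_inv @ qu
        # Value function for the preceding time step.
        Vxx = Qxx[:dx, :dx] - Qux.T @ Quu_inv @ Qux
        vx = qx[:dx] - Qux.T @ Quu_inv @ qu
        Vxx = 0.5 * (Vxx + Vxx.T)                         # keep symmetric
    return K, k
```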

However, we can also form a time-varying linear-Gaussian controller $p(u_t|x_t)$, and optimize the following objective:

$$\min_{p(u_t|x_t)} \sum_{t=1}^{T} E_{p(x_t, u_t)}[\ell(x_t, u_t)] - \mathcal{H}(p(u_t|x_t)).$$

As shown in previous work (Levine and Koltun 2013), this objective is in fact optimized by setting $p(u_t|x_t) = \mathcal{N}(K_t x_t + k_t, C_t)$, where $C_t = Q_{u,ut}^{-1}$. While we ultimately aim to minimize the standard controller objective in Equation (1), this maximum entropy formulation will be a useful intermediate step for a practical learning algorithm trained with fitted time-varying linear dynamics.

5.2 KL-Constrained Optimization

In order for this learning method to produce good results, it is important to bound the change in the controller $p(u_t|x_t)$ at each iteration. The standard iterative LQR method can change the controller drastically at each iteration, which can cause it to visit parts of the state space where the fitted dynamics are arbitrarily incorrect, leading to divergence. Furthermore, due to the non-deterministic nature of real-world domains, line-search-based methods can be misled, leading to unreliable progress.

To address these issues, we solve the following optimization problem at each iteration:

$$\min_{p(u_t|x_t)} E_{p(\tau)}[\ell(\tau)] \quad \text{s.t.} \quad D_{\mathrm{KL}}(p(\tau) \,\|\, \bar{p}(\tau)) \leq \epsilon,$$

where $\bar{p}(\tau)$ is the trajectory distribution induced by the previous controller. Using KL-divergence constraints for controller optimization has been proposed in a number of prior works (Bagnell and Schneider 2003; Peters and Schaal 2008; Peters et al. 2010b). In the case of linear-Gaussian controllers, a simple modification to the LQR algorithm described above can be used to solve this constrained problem. Recall that the trajectory distributions are given by $p(\tau) = p(x_1)\prod_{t=1}^{T} p(x_{t+1}|x_t, u_t)\, p(u_t|x_t)$. Since the dynamics of the new and old trajectory distributions are assumed to be the same, the KL-divergence is given by

$$D_{\mathrm{KL}}(p(\tau) \,\|\, \bar{p}(\tau)) = \sum_{t=1}^{T} E_{p(x_t,u_t)}[\log p(u_t|x_t) - \log \bar{p}(u_t|x_t)] = \sum_{t=1}^{T} -E_{p(x_t,u_t)}[\log \bar{p}(u_t|x_t)] - \mathcal{H}(p(u_t|x_t)),$$

and the Lagrangian of the constrained optimization problem is given by

$$\mathcal{L}_{\mathrm{traj}}(p, \eta) = E_p[\ell(\tau)] + \eta[D_{\mathrm{KL}}(p(\tau) \,\|\, \bar{p}(\tau)) - \epsilon] = \left[\sum_t E_{p(x_t,u_t)}[\ell(x_t,u_t) - \eta \log \bar{p}(u_t|x_t)]\right] - \eta \mathcal{H}(p(\tau)) - \eta\epsilon.$$

The constrained optimization can be solved with dual gradient descent (Boyd and Vandenberghe 2004), where we alternate between minimizing the Lagrangian with respect to the primal variables, which are the parameters of $p$, and taking a subgradient step on the Lagrange multiplier $\eta$. The optimization with respect to $p$ can be performed efficiently using the LQG algorithm, by observing that the Lagrangian is simply the expectation of a quantity that does not depend on $p$ plus an entropy term. As described above, LQR can be used to solve maximum entropy control problems where the objective consists of a term that does not depend on $p$, and another term that encourages high entropy. We can convert the Lagrangian primal minimization into a problem of the form

$$\min_{p(u_t|x_t)} \sum_{t=1}^{T} E_{p(x_t, u_t)}[\tilde{\ell}(x_t, u_t)] - \mathcal{H}(p(u_t|x_t))$$

by using the cost $\tilde{\ell}(x_t, u_t) = \frac{1}{\eta}\ell(x_t, u_t) - \log \bar{p}(u_t|x_t)$.

This objective is simply obtained by dividing the Lagrangian by $\eta$. Since there is only one dual variable, dual gradient descent typically converges very quickly, usually in under 10 iterations, and because LQR is a very efficient trajectory optimization method, the entire procedure can be implemented to run very quickly.
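A sketch of the dual update loop is shown below. The helpers `lqr_solve(eta)` (which minimizes the surrogate cost above via LQR and returns a new linear-Gaussian controller) and `kl_estimate(policy)` (which evaluates $D_{\mathrm{KL}}(p\|\bar{p})$ for that controller) are hypothetical, and the multiplicative update on $\eta$ is a simplification of a true dual subgradient step.

```python
def kl_constrained_update(lqr_solve, kl_estimate, epsilon,
                          eta=1.0, eta_min=1e-4, eta_max=1e16, iters=10):
    """Adjust the dual variable eta until the KL constraint is (approximately)
    active, re-solving the maximum-entropy LQR problem at each value of eta."""
    policy = lqr_solve(eta)
    for _ in range(iters):
        kl = kl_estimate(policy)
        if abs(kl - epsilon) < 0.1 * epsilon:
            break
        # Increase eta when the constraint is violated, decrease it otherwise.
        eta = min(eta_max, eta * 2.0) if kl > epsilon else max(eta_min, eta * 0.5)
        policy = lqr_solve(eta)
    return policy, eta
```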

We initialize $p(u_t|x_t)$ with a fixed covariance $C_t$ and zero mean, to produce random actuation on the first iteration. The Gaussian noise used to sample from $p(u_t|x_t)$ is generated in advance and smoothed with a Gaussian kernel with a standard deviation of two time steps, in order to produce more temporally coherent noise.
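A sketch of this noise generation is shown below, assuming SciPy is available. Rescaling the smoothed noise back to unit variance is our assumption (so that the controller covariance still sets the exploration magnitude); the text above only specifies the smoothing.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smoothed_noise(T, du, sigma=2.0, seed=0):
    """Pre-generate exploration noise for one trajectory: white Gaussian noise
    smoothed along the time axis with a Gaussian kernel (std = 2 time steps)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((T, du))
    eps = gaussian_filter1d(eps, sigma=sigma, axis=0)
    return eps / eps.std(axis=0, keepdims=True)
```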

6 Learning Policies from Experience

In this section, we describe our first set of experiments, which uses the trajectory-centric reinforcement learning algorithm from the previous section to learn dexterous manipulation skills from scratch on both the physical and simulated ADROIT platforms. The experiments in this section are aimed at ascertaining whether we can learn complex manipulation behaviors entirely from scratch, using only high-level task definitions provided in terms of a cost function, with the controller learned at the level of valve opening and closing commands. The particular tasks are detailed in Tables 1 and 2 and shown in the accompanying video, and include both hand posing behaviors and object manipulation skills.

Figure 4. Positioning task: (a) start pose and (b) end pose for movement against gravity; (c) start pose and (d) end pose for movement assisted by gravity.

Table 1. Different hand positioning task variations learned

Platform  | # task variations learned
Hardware  | 5 (move assisted by gravity), 2 (move against gravity)
Simulated | 5 (move assisted by gravity), 3 (move against gravity)

Table 2. Different object manipulation task variations learned

Platform                                | # task variations learned
Hardware + an elongated object (Fig. 3) | 2 ({clockwise, anti-clockwise} object rotations along the vertical axis)
Simulated + 4 object variations         | 13 ({clockwise, anti-clockwise, clockwise then anti-clockwise} object rotations along the vertical axis), 8 ({clockwise, anti-clockwise} object rotations along the horizontal axis)

6.1 Hand Behaviors

In the first set of tasks, we examine how well trajectory-centric reinforcement learning can control the hand to reach target poses. The state space is given by $x = (q, \dot{q}, a)$. Here, $q$ denotes the vector of hand joint angles, $\dot{q}$ the vector of joint angular velocities, and $a$ the vector of cylinder pressures; the actions $u_t$ correspond to the valve command signals, which are real-valued and correspond to the degree to which each valve is opened at each time step. The tasks in this section require moving the hand to a specified pose from a given initial pose. We arranged the pairs of poses such that in one task set the finger motions were helped by gravity, and in another task set they had to overcome gravity, as shown in Figure 4. Note that for a system of this complexity, even achieving a desired pose can be challenging, especially since the tendon actuators are in agonist-antagonist pairs and the forces have to balance to maintain posture. The cost function at each time step is given by

$$\ell(x_t, u_t) = \|q_t - q^*\|^2 + 0.001\|u_t\|^2,$$

and the cost at the final time step $T$ emphasizes the target pose to ensure that it is reached successfully:

$$\ell(x_T, u_T) = 10\|q_T - q^*\|^2.$$
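Transcribed directly as code, with a hypothetical target pose array `q_star`:

```python
import numpy as np

def running_cost(q, u, q_star):
    """Running cost for the hand posing task: pose error plus control penalty."""
    return np.sum((q - q_star) ** 2) + 0.001 * np.sum(u ** 2)

def final_cost(q, q_star):
    """Final cost emphasizing the target pose."""
    return 10.0 * np.sum((q - q_star) ** 2)
```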

6.2 Object Manipulation Behaviors

The manipulation tasks we focused on require in-hand rotation of elongated objects. We chose this task because it involves intermittent contacts with multiple fingers and is quite dynamic, while at the same time having a certain amount of intrinsic stability. We studied different variations (Table 2) of this task with different objects: rotation clockwise (Figure 1), rotation counter-clockwise, rotation clockwise followed by rotation counter-clockwise, and rotation clockwise without using the wrist joint (to encourage finger-oriented maneuvers), which was physically locked in that condition. Figure 3 illustrates the start and end poses and object configurations in the task learned on the ADROIT hardware platform. The running cost was

$$\ell(x_t, u_t) = 0.01\|q_t - q^*\|^2 + 0.001\|u_t\|^2 + \|q^{pos}_t - q^{pos*}\|^2 + 10\|q^{rot}_t - q^{rot*}_x\|^2,$$

where $x = (q, q^{pos}, q^{rot}, \dot{q}, \dot{q}^{pos}, \dot{q}^{rot}, a)$. Here $q$ denotes the vector of hand joint angles, $q^{pos}$ the object positions, $q^{rot}$ the object rotations, $a$ the vector of cylinder pressures, and $u_t$ the vector of valve command signals. At the final time we used

$$\ell(x_t, u_t)_{t=T} = 2\left[0.01\|q_t - q^*\|^2 + \|q^{pos}_t - q^{pos*}\|^2 + 10\|q^{rot}_t - q^{rot*}_x\|^2\right].$$

Here, the cost function included an extra term for the desired object position and orientation. The final cost was scaled by a factor of 2 relative to the running cost.


Figure 5. Learning curves for the positioning (top) and manipulation (bottom) tasks, plotting cost versus iteration. Top legend: curling with gravity (hardware), curling with gravity (sim), curling against gravity (hardware), curling against gravity (sim). Bottom legend: rotate clockwise, rotate anticlockwise, rotate clockwise (WR).∗

6.3 Results

Besides designing the cost functions, minimal parameter tuning was required to learn each skill. Training consisted of around 15 iterations. In each iteration we performed 5 trials with different instantiations of the exploration noise in the controls. The progress of training as well as the final performance is illustrated in the video accompanying the submission, and in the figure at the beginning of the paper.

Here we quantify the performance and the robustness to noise. Figure 5 shows how the total cost for the movement (as measured by the cost functions defined above) decreased over iterations of the algorithm. The solid curves are data from the physical system. Note that in all tasks and task variations we observe very rapid convergence. Surprisingly, the manipulation task, which is much harder from a control viewpoint, takes about the same number of iterations to learn.

In the positioning task we also performed a systematic comparison between learning on the physical system and learning in simulation. Performance early in training was comparable, but eventually the algorithm was able to find better policies in simulation. Although it is not shown in the figure, training on the simulation platform is much faster, because the robot can only run in real time while the simulated platform runs faster than real time, and because resetting between repetitions needs to be done manually on the robot.

Figure 6. Effect of noise smoothing on learning (cost versus iteration for Sigma = 1, 2, 3, where sigma is the width of the Gaussian kernel used to smooth the exploration noise). Sigma = 1 starts slowly but maintains steady progress. Higher sigma favors a steep initial descent but fails to sustain it: the noise destabilizes the object being manipulated and the hand drops it, incurring a large cost penalty from which the descent must restart.∗

We further investigated the effect of the exploration noise magnitude injected during training. Figure 6 shows that for a relatively small amount of noise the cost decreases monotonically. As we increase the noise magnitude, we sometimes see faster improvement early on, but the behavior of the algorithm is no longer monotonic. These are data from the ADROIT hardware platform.

6.4 Delayed Robustification

Finally, we used the simulation platform to investigate robustness to perturbations more quantitatively, in the manipulation task. We wanted to quantify how robust our controllers are to changes in the initial state (recall that the controllers are local). Furthermore, we wanted to see if training with noisy initial states, in addition to the exploration noise injected in the controls, would result in more robust controllers. Naively adding initial state noise at each iteration of the algorithm (Algorithm 1) severely hindered the overall progress. However, adding initial state noise after the policy was partially learned (iteration ≥ 10 in our case) resulted in much more robust controllers. We term this strategy delayed robustification.
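A sketch of this schedule is shown below. The 2.5% noise fraction follows the discussion of the results, while the uniform noise distribution and the function signature are assumptions for illustration.

```python
import numpy as np

def sample_initial_state(x0_nominal, state_range, iteration,
                         start_iter=10, noise_frac=0.025, rng=None):
    """Delayed robustification: return the nominal initial state for early
    iterations, and a perturbed one (a fraction of each state variable's
    range) once iteration >= start_iter."""
    rng = rng or np.random.default_rng()
    if iteration < start_iter:
        return x0_nominal.copy()
    perturb = noise_frac * state_range * rng.uniform(-1, 1, size=x0_nominal.shape)
    return x0_nominal + perturb
```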

The results of these simulations are shown in Figure 7. We plot the orientation of the object around the vertical axis as a function of time. The black curve is the unperturbed trajectory. As expected, noise injected in the initial state makes the movements more variable, especially for the controller that was trained without such noise. Adding initial state noise during training substantially improved the ability of the controller to suppress perturbations in the initial state. Overall, we were surprised at how much noise we could add (up to 20% of the range of each state variable) without the hand dropping the object, in the case of the controller trained with noise. The controller trained without noise dropped the object in 4 out of 20 test trials. Thus injecting some noise in the initial state (around 2.5%) helps improve robustness. Of course, on the real robot we cannot avoid injecting such noise, because exact repositioning is very difficult.

Figure 7. Robustness to noise in the initial state (object orientation in radians versus time in seconds). Each column corresponds to a different noise level: 5, 10, 20% of the range of each state variable. The top row is a controller trained with noise in the initial state. The bottom row is a controller trained with the same initial state (no noise) for all trials.

∗ At each iteration, the current controller p(u_t|x_t) is deployed on the robot to gather N samples (N = 5 in all of our experiments).

7 Learning Policies from Experience and Imitation

The previous section highlighted the strengths of our reinforcement learning algorithm, outlined in Section 5, in synthesizing the details of dexterous manipulation strategies while still preserving sample efficiency. However, as our algorithm makes progress by optimizing a quadratic approximation of the cost over a local approximation of the dynamics, it can become stuck in local minima if the approximation of the learned dynamics is not sufficiently accurate, or when the cost is not convex. In principle, arbitrary precision can be achieved by increasing the number of Gaussian kernels used for the dynamics prior, and by increasing the number of trajectory samples N used for fitting the dynamics. In practice, collecting an arbitrary number of samples might not be feasible due to time and computational limitations, especially when physical robots are involved. Furthermore, many useful task goals are non-convex, and while the method can optimize non-convex cost functions, like all local optimization methods, it does not necessarily converge to a global optimum. Random exploration can help mitigate some of these issues. To encourage exploration, we add random noise while collecting trajectory samples (step 3 of Algorithm 1). For tasks where the reward is delayed and there are multiple local minima, randomly exploring and hoping to get lucky takes a big toll on sample efficiency. A well-known strategy to overcome some of these challenges is to imitate an expert, in order to effectively steer the learned controller towards an effective solution. However, simply following an expert-provided behavior does not necessarily produce behavior that is robust to perturbations. In this section, we describe how we can combine learning from expert teleoperation with trajectory-centric reinforcement learning to acquire more complex manipulation skills that overcome local optima by following expert demonstrations while still retaining the robustness benefits of learning from experience.

7.1 Task Details

The task in this second set of experiments consists of picking up an elongated tube from the table. This task is challenging because the cost depends on the final configuration of the tube, which depends in a discontinuous manner on the positions of the fingers. Furthermore, grasping the tube from various initial poses involves the use of multiple different strategies. Note that the hand is rigidly mounted, so the picking must be done entirely using motion of the wrist and fingers. The shape of the tube further complicates the task. Naive strategies like aligning the hand to the object’s principal axis and applying force closure will not perform well, since only the wrist is available for positioning the hand. Instead, the hand must use complex finger gaits to maneuver the object into position. The significant weight of the tube (in comparison to the lifting capability of the hand) constantly tilts the object out of the grasp. Successful strategies need to discover ways to use the thumb and fingers to act as ‘support’ and ‘pivot’ for each other, in order to reposition and reorient the object, leading to a successful grasp and pickup. Failure to establish either the support or the pivot results in the tube flying out of the workspace due to a net unbalanced force on the object.

The hand is mounted at approximately thirty degrees to the horizontal. The task starts with the hand in the zero position (Figure 8). The goal is to lift the object from a known initial condition (the object is tracked using the PhaseSpace motion capture system). The tube is considered lifted if it is fully under the control of the hand (i.e. not falling out of the hand or resting on the table) and all points on the object are above the ground by a certain height. Note that, to preserve the resemblance between the hardware and the simulated platform, the contact between the fingers and the ground plane is disabled on the simulated platform, in order to allow the fingers to curl from below the object.

7.2 Mujoco Haptix

While leveraging an expert is quite desirable, deploying and exploiting an expert is exceptionally difficult for dexterous manipulation. This is due to two principal factors. First, dexterous manipulation strategies are extremely sensitive to minor variations in the contact forces, contact locations, and object positions. Thus, minor deviations from the expert demonstration render the demonstration useless. Second, the technology to capture the details of hand manipulation is often unreliable. Unlike full-body movements, hand manipulation behaviors unfold in a compact region of space co-inhabited by the objects being manipulated. This makes motion capture difficult, due to occlusions and, in the case of passive systems, marker confusion. Manipulation also involves large numbers of contacts, including dynamic phenomena such as rolling, sliding, stick-slip, deformations, and soft contacts. The human hand takes advantage of these rich dynamics, but recording the data and interpreting it with regard to well-defined physics models is challenging.

Figure 8. Initial pose for the pickup task on (a) the hardware platform and (b) the simulation platform.

To address these challenges, we exploit the adaptation capabilities of the brain in order to move the data collection from the real world into a physically realistic simulation. The Mujoco Haptix system (Kumar and Todorov 2015) was developed to facilitate physically-consistent recording of rich hand-object interactions. This was done in the context of the DARPA HAPTIX program and was adapted for our purposes here. The simulation is based on the MuJoCo physics engine. The Haptix framework augments the simulator with real-time motion capture of arm and hand movements, and stereoscopic visualization using OpenGL projection from the viewpoint of the user’s head (which is also tracked via motion capture). The resulting system has an empirically-validated end-to-end latency of 42 ms. It creates a sense of realism which is sufficient for human users to interact with virtual objects in a natural way. Since the interaction happens in simulation, we can record every aspect of it, including joint kinematics and dynamics, contact interactions, simulated sensor readings, etc. There are no sensor technologies available today that could record such a rich dataset from hand-object interactions in the physical world. Furthermore, since the interaction is based on our simulation model of multi-joint and contact dynamics, the dataset is by definition physically-consistent.

Figure 9. Mujoco Haptix system: (a) overall setup; (b) expert interacting with the simulation using the CyberGlove system. NVIDIA 3D Vision 2 glasses are used for stereoscopic visualization, together with a BenQ GTG XL2720Z stereo monitor. An OptiTrack motion capture system is used for 3D glasses, head, and forearm tracking. A CyberGlove, calibrated for the ADROIT model, is used for tracking the finger joints.

7.3 Expert Demonstrations

We use Mujoco Haptix to capture expert demonstrations. The expert first goes through a regression-based calibration process that maps the CyberGlove sensors to the ADROIT joint space. To enable hand manipulation, we map the joint angles $q^c$ reported by the CyberGlove to actuator commands using Equation (2):

$$u_t = k\, J_{tendon}(q_t - q^c_t), \quad (2)$$

where $q_t$ is the current joint configuration of the hand, $J_{tendon}$ is the tendon Jacobian that maps the joint space to the tendon space, and $k$ is the gain vector. $u_t$ is applied as controls to the pneumatic actuators and we let the simulation evolve by stepping the physics of the world forward in time. Hand-object interactions evolve as the simulation steps forward. The expert is in a tight feedback loop with the simulation via the stereoscopic rendering. As the user interacts with the system and adjusts their strategy, hand manipulation behaviors emerge. We record the state $x_t$ and the control trajectory $u_t$ over time as the demonstration trajectory, where $x = (q, q^{pos}, q^{rot}, \dot{q}, \dot{q}^{pos}, \dot{q}^{rot}, a)$. Here $q$ denotes the vector of hand joint angles, $q^{pos}$ the object positions, $q^{rot}$ the object rotations, $a$ the vector of cylinder pressures, and $u$ the vector of valve command signals.
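Equation (2) written as code; the array shapes are hypothetical, and the gain vector $k$ is applied elementwise per tendon, which is our reading of "gain vector".

```python
import numpy as np

def glove_to_valve_commands(q, q_c, J_tendon, k):
    """Equation (2): u_t = k * J_tendon (q_t - q^c_t).
    q        : current hand joint angles, shape (n_joints,)
    q_c      : joint angles reported by the calibrated CyberGlove, shape (n_joints,)
    J_tendon : tendon Jacobian mapping joint space to tendon space,
               shape (n_tendons, n_joints)
    k        : per-tendon gain vector, shape (n_tendons,)
    Returns valve commands for the pneumatic actuators, shape (n_tendons,)."""
    return k * (J_tendon @ (q - q_c))
```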


7.4 Learning Imitation Policies

We first attempted to learn the pickup task from scratch using the following cost function:

$$\ell(x_t, u_t) = \alpha_1\|q_t - q^*\|^2 + \alpha_2\|u_t\|^2 + \alpha_3\|q^{pos}_t - q^{pos*}\|^2 + \alpha_4\|q^{rot}_t - q^{rot*}_x\|^2,$$

where $q^{pos*}$ and $q^{rot*}_x$ denote the goal pose of the object, and $q^*$ is a target joint angle configuration that corresponds to a grasp. Simple learning by experience fails to come up with a strategy that can pick up the rod, even after significant cost parameter tuning. The observed behavior is a combination of the following: (a) the hand randomly explores for a while and fails to find the object; (b) the fingers eventually find the object and knock it away from the manipulable workspace; (c) the fingers keep tapping on the top of the tube without any significant improvement. The difficulty in finding a successful pickup behavior stems from a combination of factors. First, the cost function only produces a meaningful signal toward the end of the episode, once the object is already grasped. Second, almost every strategy that can succeed has to first clear the object and only then dig the fingers into the space below the tube, with fingers on both sides to restrict the object's motion. Lastly, the high dimensionality of the ADROIT platform means that the space of successful solutions is very narrow, while a huge variety of movements are available that all fail at the task. Note that, for cases where the object is not aligned well with the palm, additional reorientation maneuvers are required in order to pick up the object, due to the hand's restricted lateral mobility.

Our learning process for these skills combines learning from demonstrations with learning from experience, with the expert demonstration used to bootstrap the learning process. During the learning-from-experience phase, we use an additional shaping cost, which is a weak cost term that prevents the learning process from deviating too far away from the expert demonstration. The assumption here is that the expert demonstration is already quite good and we do not need to deviate too far to improve the solution. The strength of the shaping cost controls the amount of deviation from the expert demonstration.

To bootstrap the learning using expert demonstrations, instead of using a random policy in step 3 of Algorithm 1 at the initial iteration, small random noise is injected into the control trajectory of the expert demonstration to collect the trajectory samples $\{\tau_i\}$. The running cost we use is

$$\ell(x_t, u_t) = \|q_t - \bar{q}_t\|^2 + 0.1\|u_t\|^2 + 50\|q^{posZ}_t - 0.12\|^2,$$

where $\bar{q}_t$ is the hand configuration of the expert at time $t$ and $q^{posZ}_t$ is the vertical height of the object at time $t$. The first term is the shaping cost that restricts the learning from deviating too far from the demonstration, the second term is the control cost, and the final term encourages picking up the object to a height of 12 cm above the ground. The final cost is the same as the running cost. Figure 10 presents a representative pickup behavior achieved using this method.
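The running cost transcribed as code; the variable names, including the expert reference `q_demo_t` and the object height `obj_height`, are hypothetical.

```python
import numpy as np

def imitation_running_cost(q, u, obj_height, q_demo_t):
    """Running cost for the pickup task: a shaping term that keeps the hand
    close to the expert demonstration at the same time step, a control
    penalty, and a term encouraging a 12 cm object lift."""
    shaping = np.sum((q - q_demo_t) ** 2)
    control = 0.1 * np.sum(u ** 2)
    lift = 50.0 * (obj_height - 0.12) ** 2
    return shaping + control + lift
```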

8 Policy Generalization

Sections 6 and 7 explore the capabilities of trajectory-centric reinforcement learning to learn robust “local” policies for dexterous manipulation of freely-moving objects. However, the resulting local policies succeed from specific initial states, and are not designed to handle variation in the initial conditions, such as different initial placement of the manipulated object. In this section, we discuss how we can use the local policies to learn a single global policy that generalizes effectively across different initial conditions.

Generalizable global policies can typically be learned using reinforcement learning with more expressive function approximators. Deep neural networks represent a particularly general and expressive class of function approximators, and have recently been used to learn skills that range from playing Atari games (Mnih et al. 2015) to vision-based robotic manipulation (Levine et al. 2015a). In this section, we evaluate how well deep neural networks can capture generalizable policies for the pickup skill discussed in the previous section, where variation in the task corresponds to changes in the initial pose of the rod. Our goal is to explore the generalization capabilities of deep neural networks along two axes: the ability to handle variability in the initial conditions, and the ability to handle partial observability and limited sensing. To that end, we evaluate deep neural network policies that perform the task either with full state observation, or with only the onboard sensing available on the ADROIT platform, without external motion capture markers to provide the pose of the rod.

8.1 Task Details

We evaluate generalization on the same pickup task as outlined in Section 7.1. To analyze generalization, we vary the orientation of the rod at the initial state. The goal is to learn a strategy for this task that succeeds for any initial rod orientation. This is particularly challenging because the robot cannot translate or reorient the wrist (since the hand is stationary), and therefore must utilize substantially different grasping strategies for different rod orientations, including the use of auxiliary finger motions to reposition the rod into the desired pose. Besides this challenge, the task also inherits all of the difficulties detailed in Section 7, including the high dimensionality of the system and the challenge of delayed rewards. To mitigate the challenge of local optima, we again use expert demonstrations. We collected a set of 10 demonstrations across 180 degrees of variation in the rod orientation. Figure 11 shows the 10 initial configurations at which an expert was engaged to provide demonstrations.

8.2 Local PoliciesBefore evaluating generalization, we first analyze theperformance of the individual local policies trained withimitation and learning from experience. For each taskexecution, we evaluate the success or failure of trialaccording to the following criteria: a successful picking trialmust result in the object being stationary, both extremely ofthe rod being above the ground by a certain height, and theentire rod aligned with the x-axis, so as to ensure a successfulgrasp into the desired goal position. Note that partiallysuccessful executions, where the tube’s center of mass islifted up but one of the end points remains on the ground,are still marked as failures. This allows us to determine not


Figure 10. Pickup strategy learned via imitation. The movement starts by aligning the palm with the object, then curling the fingers under the object, followed by an object-reorientation movement that further aligns the object with the palm. The final movement engages the thumb to squeeze hard against the palm before lifting the wrist upward.

Figure 11. The initial tube configurations of the expert demonstration set.

This allows us to determine not only whether the tube was picked up, but also whether it was positioned in the desired pose – an important condition for any subsequent manipulations that might be applied to it.
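As a concrete illustration of these criteria, the following sketch checks a single trial; the thresholds and state fields are assumptions chosen for illustration, not the exact values used in our evaluation.

    import numpy as np

    # Hypothetical thresholds; the exact values used in the evaluation are not specified here.
    MIN_END_HEIGHT = 0.10   # both rod ends must be at least this high above the ground (m)
    MAX_OBJ_SPEED = 0.01    # object must be (nearly) stationary (m/s)
    MAX_AXIS_ANGLE = 0.15   # allowed angular deviation from the x-axis (rad)

    def trial_success(end1, end2, obj_vel, rod_axis):
        """Return True if a pickup trial satisfies the success criteria.

        end1, end2 : (3,) positions of the two rod extremities
        obj_vel    : (3,) linear velocity of the rod's centre of mass
        rod_axis   : (3,) unit vector along the rod
        """
        stationary = np.linalg.norm(obj_vel) < MAX_OBJ_SPEED
        lifted = end1[2] > MIN_END_HEIGHT and end2[2] > MIN_END_HEIGHT
        # Angle between the rod axis and the world x-axis (sign-invariant).
        angle = np.arccos(np.clip(abs(np.dot(rod_axis, [1.0, 0.0, 0.0])), -1.0, 1.0))
        aligned = angle < MAX_AXIS_ANGLE
        return stationary and lifted and aligned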

In Figure 12a, we analyze the performance obtained by following the demonstrated behaviors directly, for variation in the rod angle within the local neighborhood of the rod pose for which the demonstration was generated. The numbers correspond to the conditions illustrated in Figure 11. The x-axis corresponds to variation in the orientation of the rod. Successful trials are marked as green circles, while failures are marked as red crosses. Overall, we observe that most demonstrations are somewhat successful for the particular rod angle for which they were created, but the success rate decreases sharply under variation in the rod pose, with discontinuous boundaries in the success region (particularly for conditions 9, 3, and 0). Some conditions, such as condition 6, are typically successful, but exhibit a brittle strategy that sometimes fails right at the demonstration angle, while others, such as conditions 3 and 1, fail sporadically at various rod poses.

As shown in Figure 12b, we can improve the robustness of the local policies corresponding to each condition by further training each of those policies from experience, following the method described in Section 7. Each demonstration was subjected to 10 iterations of learning from experience, and the resulting controllers were evaluated using the same procedure as in the previous paragraph. The controllers succeed in a wider neighborhood around their default rod pose, with the brittle strategy in condition 6 becoming much more robust, and the region of success for conditions 9, 3, and 0 expanding substantially on both sides. The overall costs and success rates for individual conditions are summarized in Figure 13. While we observe that the overall cost remained similar, the success rates for varying rod poses increased substantially, particularly for conditions 6, 8, 9, and 10. Figure 14 evaluates the effectiveness of the local policies for the entire range of task variation. We observe that the policies are less effective as they move away from the zero point (marked with a dark vertical line).

8.3 Nearest Neighbor

We can observe from the results in Figure 12b that, after training, each local policy succeeds in a neighborhood that extends to the boundary of the next local policy. This suggests that a relatively simple nearest-neighbor technique could in principle allow for a nonparametric strategy to expand the success region to the entire range of rod orientations. In Figure 15, we evaluate the performance of this nearest neighbor strategy, which simply deploys the local policy trained for the rod orientation that is closest to the orientation observed in the initial state, in terms of Euclidean distance. Successful trials are marked in green and failures in red, and the overall success rate is 90.8%. Although the nearest neighbor strategy is successful in this case, it requires retaining the full set of local policies and manual specification of the variables on which the nearest neighbor queries should be performed (in this case, the rod pose). Furthermore, the nearest neighbor strategy requires us to know the rod pose, which in a physical experiment would involve the use of external motion capture markers. Can we instead learn a generalizable policy for grasping the tube that does not use any external sensing, only the onboard sensing available on the hand? In the next section, we will discuss how deep neural network function approximators can leverage the local policies to learn generalizable strategies.
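The nearest neighbor strategy itself is simple enough to state in a few lines. The sketch below selects the local policy whose training rod orientation is closest to the observed initial orientation and then rolls out its time-varying feedback controller; the environment interface and the K/k controller fields are hypothetical stand-ins for illustration.

    import numpy as np

    def select_local_policy(init_rod_angle, local_policies):
        """local_policies: list of (train_angle, policy) pairs, one per condition."""
        angles = np.array([a for a, _ in local_policies])
        # Euclidean distance reduces to absolute difference for a scalar angle.
        idx = int(np.argmin(np.abs(angles - init_rod_angle)))
        return local_policies[idx][1]

    def rollout_nearest_neighbor(env, init_rod_angle, local_policies, horizon):
        policy = select_local_policy(init_rod_angle, local_policies)
        x = env.reset(rod_angle=init_rod_angle)      # hypothetical simulator interface
        for t in range(horizon):
            # Time-varying linear feedback controller: u_t = K_t x_t + k_t
            u = policy.K[t] @ x + policy.k[t]
            x = env.step(u)
        return env.success()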

8.4 Generalization with Deep Neural Networks

As discussed previously, although the nearest neighbor strategy has an extremely high success rate for the pick-up task being considered, it suffers from several limitations.


[Figure 12(a): Demo playback — testing robustness of the expert demonstrations in their local neighbourhoods. Each panel plots trajectory cost against rod angle; panel titles and success rates: condition0 (67%), condition1 (80%), condition2 (96%), condition3 (73%), condition4 (86%), condition5 (54%), condition6 (97%), condition7 (61%), condition8 (52%), condition9 (47%).]

[Figure 12(b): GPS local policies — testing robustness of the local policies trained around the demonstrations in their local neighbourhoods. Each panel plots trajectory cost against rod angle; panel titles and success rates: condition0 (63%), condition1 (87%), condition2 (93%), condition3 (75%), condition4 (91%), condition5 (86%), condition6 (99%), condition7 (81%), condition8 (87%), condition9 (63%).]

Figure 12. Different plots represent different conditions. Ticks on the x-axis mark the different rod angles corresponding to each condition where an expert demonstration was collected. The true angle of each plot is marked with a solid black line. Each condition was tested for 100 trials. Successful trials are marked in green and unsuccessful ones in red. Success percentages are given in the titles and the average costs in the y-axis labels.


[Figure 13: two bar charts comparing demo playback and GPS local policies for conditions 1–10, showing average cost (top) and success percentage (bottom).]

Figure 13. Performance comparison between the expert demonstrations and the local policies trained around the expert demonstrations, in the local neighbourhood of the task (i.e. the rod angle with the vertical).

First, nearest neighbor methods suffer from the curse of dimensionality, which limits their applicability to high-dimensional spaces. Indeed, the nearest neighbor technique described in the preceding section requires us to manually specify that the nearest neighbor queries should be performed with respect to the pose of the rod. This information may not be readily available in general robotic manipulation tasks, where the robot must choose its strategy based on high-dimensional raw sensory information, such as camera images or inputs from tactile sensors. Second, the nearest neighbor method described above still corresponds to a time-varying control strategy, which can be limiting in cases where the robot might begin the task from arbitrary initial states. Finally, it requires storing all of the individual trajectory-centric controllers.

In this section, we evaluate whether a deep neural network representation of the rod grasping manipulation skill can be learned with comparable generalization capability and limited sensing. Deep neural networks provide a powerful and flexible class of function approximators that can be readily combined with high-dimensional inputs and complex sensing. Furthermore, the use of deep neural networks as the final policy representation does not require us to manually select a small set of state variables for nearest neighbor queries, making them a more general choice that is likely to extend to a wide variety of dexterous manipulation tasks.

To evaluate the use of deep neural networks for learning generalizable dexterous manipulation strategies, we used the collection of local policies trained from initial demonstrations to produce training data for neural network training. Each of the local policies discussed in Section 8.2 was used to generate 50 sample trajectories by executing each policy from its corresponding initial state. These samples were then used to train a deep neural network with 6 fully connected layers and 150 rectified linear (ReLU) hidden units in each layer, as shown in Figure 16. The network was trained either with the full state information provided to the local policies (which includes the pose and velocity of the rod), or with partial information reflecting onboard sensing, with knowledge about the rod pose excluded from the inputs to the neural network.
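For concreteness, the sketch below builds a network of the stated shape (6 fully connected layers, 150 ReLU units each) and fits it to state-action pairs gathered from the local policy samples. This is a simplified supervised-regression stand-in: the input/output sizes, optimizer, and training schedule are our own assumptions, and the actual guided-policy-search training involves additional machinery beyond plain behavior cloning.

    import torch
    import torch.nn as nn

    def make_policy(obs_dim=100, act_dim=40, width=150, depth=6):
        """6 fully connected layers with 150 ReLU units each (sizes assumed)."""
        layers, in_dim = [], obs_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        layers.append(nn.Linear(in_dim, act_dim))
        return nn.Sequential(*layers)

    def train_policy(net, obs, act, epochs=200, lr=1e-3):
        """Regress onto the actions of the local policies.

        obs: (N, obs_dim) float tensor of states from the sampled trajectories
        act: (N, act_dim) float tensor of the corresponding controls
        """
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(net(obs), act)
            loss.backward()
            opt.step()
        return net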

The results of the neural network policy with full state observations are shown in Figure 17. This result indicates that the neural network policy trained from the local policies is generally not as successful as the nearest neighbor strategy in the fully observed case. This is not particularly surprising: the nearest neighbor strategy already uses a set of very successful local policies that can succeed up to one angle increment away, and thus form overlapping regions of effectiveness. However, the neural network must distill the distinct strategies of different local policies into a single coherent function, and therefore generally performs worse in this case. Figure 18 and Figure 19 show the performance of a large and a small neural network trained without the rod pose provided as input, instead using inputs from the hand's onboard tactile sensors in the fingertips. When the network is trained without observations of the rod pose, it achieves a success rate of 74%, nearly as high as in the fully observed condition and substantially higher than the best local policies, as shown in Figure 14.† These results also show that the larger network is essential for properly handling this condition. Interestingly, when the neural network is not provided with the tactile sensors either, as shown in Figure 20, it achieves nearly the same performance, indicating that the neural network is able to make use of proprioceptive sensing to determine which strategy to use. This is also not entirely surprising, since collisions and contacts result in motion of the fingers that can be detected from proprioception alone. These results indicate that the neural network policy that is only allowed to use onboard sensing learns a robust feedback behavior that can adjust the grasping strategy based on proprioceptive feedback, something that is not possible with the nearest neighbor strategy, which must choose the local policy to deploy at the beginning of the episode (before any proprioceptive sensing can take place). This suggests one of the benefits of deep neural networks for dexterous control.
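The difference between the fully observed and partially observed conditions amounts to which slices of the state vector (and which sensors) are fed to the network. The index layout below is an assumption for illustration; only the total dimensionality matches the system description.

    import numpy as np

    # Hypothetical index ranges within the 100-dimensional state vector.
    HAND_JOINTS = slice(0, 24)     # joint positions
    HAND_VELS   = slice(24, 48)    # joint velocities
    PRESSURES   = slice(48, 88)    # cylinder pressures
    ROD_POSE    = slice(88, 94)    # object pose (dropped in the partially observed case)
    ROD_VEL     = slice(94, 100)   # object velocity (dropped in the partially observed case)

    def observation(state, touch, full_state=True, use_touch=True):
        """Assemble the network input for a given sensing condition."""
        parts = [state[HAND_JOINTS], state[HAND_VELS], state[PRESSURES]]
        if full_state:
            parts += [state[ROD_POSE], state[ROD_VEL]]
        if use_touch:
            parts.append(touch)    # fingertip tactile readings
        return np.concatenate(parts)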

8.5 Results on the ADROIT Hardware Platform

In this section, we present our generalization results for the pickup task on the ADROIT hardware platform. The details of the task are the same as described above; however, we consider generalization in both the position and orientation of the object. The orientation generalization is over a smaller neighborhood than in the simulated experiments above, and we train a 6-layer fully connected neural network with 120 rectified linear (ReLU) units at each layer. The training samples are collected by sampling the local controllers learned around the 4 demonstrations provided by the expert user. The expert demonstrations were collected at the initial conditions shown in Figure 21. Local controllers were trained using 10 iterations of the learning-via-imitation strategy described in Section 7. The training set for the network consists of 20 samples for each condition (i.e. 80 samples in total).

†It is worth noting here that the hand is reset to the starting state of the trajectory controller with the closest angle for each rod pose. While these states are very similar, they may nonetheless provide additional cues to the network. Our ongoing experiments, which we will include in the final version, will address randomized initial poses to control for this potentially confounding factor.


[Figure 14: each panel plots trajectory cost against rod angle over 500 samples spanning the full range of rod angles. Panel titles and success rates: condition0 (45.4%), condition1 (43.2%), condition2 (43.8%), condition3 (25.2%), condition4 (35%), condition5 (39.6%), condition6 (54.2%), condition7 (38.2%), condition8 (42.6%), condition9 (46.2%).]

Figure 14. Performance of the local policies across the task variations.

[Figure 15: trajectory cost against rod angle over 500 samples for the nearest neighbor strategy on the pickup task (success: 90.8%, average cost 1.9132).]

Figure 15. Performance of the nearest neighbor policy with full state information. Test trials are collected from random initial conditions of the object, with the local policy corresponding to the nearest rod orientation used in each trial.

[Figure 16: the hardware driver/controller reads pressure (NI, 35 kHz), tendon (NI, 9 kHz) and joint (µU, 500 Hz) sensors, filters and subsamples them into the ADROIT state = [joint; joint velocity; pressure] (100 x 1), exchanges state and controls with the policy over a socket at 200 Hz, and writes the controls = valve voltages (40 x 1) to the valves (NI, 200 Hz).]

Figure 16. Overall architecture of the system using the network as controller.
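The runtime loop implied by Figure 16 is straightforward: filtered state estimates are streamed to the policy over a socket at 200 Hz, and the policy returns 40 valve voltages. The sketch below illustrates this loop; the socket helper functions and the policy object are hypothetical stand-ins, not the actual driver interface.

    import numpy as np
    import time

    CONTROL_HZ = 200            # rate at which the network controller is queried
    STATE_DIM, ACT_DIM = 100, 40

    def control_loop(recv_state, send_valves, policy):
        """recv_state() -> (100,) filtered [joint; joint_vel; pressure] vector,
        send_valves(u) pushes (40,) valve voltages to the hardware driver.
        Both helpers are hypothetical stand-ins for the socket interface."""
        period = 1.0 / CONTROL_HZ
        while True:
            t0 = time.time()
            x = recv_state()                           # filtered, subsampled state
            u = np.asarray(policy(x)).reshape(ACT_DIM) # query the network controller
            send_valves(u)
            time.sleep(max(0.0, period - (time.time() - t0)))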

In Figure 22, we cross-validate each policy for different conditions, including random conditions. We found that local controllers are partially successful in a neighborhood wider than just their own. There is no single local controller that works well for all the conditions. Overall, the neural network policy performs on par with the local controllers on their respective conditions. The neural network policy, however, generalizes better than the individual local policies, as conveyed by its higher success in picking up the object from random initial configurations.

9 Discussion and Future Work

We demonstrated learning-based control of a complex, high-dimensional, pneumatically-driven hand. Our results include simple tasks such as reaching a target pose, as well as dynamic manipulation behaviors that involve repositioning a freely-moving cylindrical object.

[Figure 17: trajectory cost against rod angle over 500 samples for the fully observed neural network policy using the large network (success: 75%, average cost 2.2729).]

Figure 17. Neural network policy performance with fully observed state, at random initial object poses. The network consisted of 6 layers with 150 hidden units each, and was trained on the local policies from conditions 1, 2, 3, 4, 5, 6, and 8.

[Figure 18: trajectory cost against rod angle over 500 samples for the partially observed neural network policy (no rod position/velocity) using the large network (success: 74%, average cost 1.886).]

Figure 18. Performance of a large neural network (6 layers, 150 units) with partial observations and touch sensors. The rod position is not provided to the network, but instead the network can use inputs from the tactile sensors on the fingertips. Note that overall performance exceeds the best single local policy.

We also explored more complex grasping tasks that require repositioning an object in the hand and handling delayed rewards, with the use of expert demonstrations to bootstrap learning. Aside from the high-level objective encoded in the cost function and the demonstrations, the learning algorithms do not use domain knowledge about the task or the hardware. The experiments show that effective manipulation strategies can be automatically discovered in this way. We further evaluated the generalization capabilities of deep neural network controllers and nearest neighbour switching controllers.


[Figure 19: trajectory cost against rod angle over 499 samples for the partially observed neural network policy (no rod position/velocity) using the small network (success: 58.5%, average cost 2.4025).]

Figure 19. Performance of a small neural network (4 layers, 80 units) with partial observations and touch sensors. The rod position is not provided to the network, but instead the network can use inputs from the tactile sensors on the fingertips. Note that the overall performance of the smaller network is substantially degraded.

[Figure 20: trajectory cost against rod angle over 1000 samples for the partially observed neural network policy (no rod position/velocity and no touch sensors) (success: 73%, average cost 4.2198).]

Figure 20. Performance of a large neural network (6 layers, 150 units) with partial observations and without touch sensors. The rod position is not provided to the network. Note that the large network performs well even without the touch sensors, indicating that the network is able to use proprioceptive inputs to determine the strategy.

Figure 21. Different conditions used for expert demonstrations.

While both methods were able to generalize to some extent, the performance of the nearest neighbour method was higher. The two methods have relative strengths and weaknesses, and can be improved and perhaps combined in future work, as described below.

The neural network controller was given access to proprioceptive and tactile sensory input, but not to vision input about the object state. Thus it was solving a harder problem, with the caveat that the initial pose of the hand was correlated with the initial pose of the object.

Figure 22. Cross-validation of different policies under different conditions on the ADROIT hardware platform.

We need to investigate in more detail to what extent the network can indeed perform the task blindly. Blind manipulation is of course not a requirement, given that many vision sensors now exist and real-time state estimators (Schmidt et al. 2015) can provide object states. It will be interesting to test the network performance with more complete sensory input.

The nearest neighbour controller is currently time-varying: the switch happens at the beginning of the movement based on the initial state of the object, and then we use the selected local feedback controller for the rest of the movement. Another approach would be to store the dataset of states, controls and feedback gains along the local trajectories, and use nearest neighbour queries at each point in time. This requires more storage and processing than evaluating a neural network, but should not be a challenge for modern computing hardware. Furthermore, it may be possible to store a reduced set of controllers as opposed to storing the states along all trajectories. For example, Simbicon (Yin et al. 2007) uses a small collection of time-invariant feedback controllers together with a suitable switching policy, and performs bipedal locomotion remarkably well. The Boston Dynamics approach to control, while not described in detail, also appears to be related.
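A minimal sketch of the per-timestep nearest neighbour variant discussed above, assuming we have stored the states, controls and feedback gains along every local trajectory (the data layout is our own assumption):

    import numpy as np

    def nn_control(x, t, bank):
        """bank: list of dicts with arrays 'X' (T, n), 'U' (T, m), 'K' (T, m, n),
        one per local trajectory. At each timestep, find the trajectory whose
        stored state at time t is closest to x and apply its feedback law."""
        dists = [np.linalg.norm(traj['X'][t] - x) for traj in bank]
        j = int(np.argmin(dists))
        traj = bank[j]
        # Linear feedback around the stored nominal state and control.
        return traj['U'][t] + traj['K'][t] @ (x - traj['X'][t])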

It may be possible to combine the benefits of the two generalization methods considered here. While deep learning is currently popular, there is no reason to limit ourselves to generic networks. Instead, we could consider a mixture-of-experts architecture (Jordan and Jacobs 1994), where the gating network corresponds to the switching mechanism in our current nearest neighbor approach, while the expert networks correspond to our local trajectory controllers. We would still want to leverage the power of trajectory optimization in training the experts, and perhaps use the current switching mechanism to pre-train the gating network.
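A sketch of such a mixture-of-experts policy, under our own assumptions (this is not an architecture evaluated in this paper; sizes and layer counts are illustrative):

    import torch
    import torch.nn as nn

    class MixtureOfExpertsPolicy(nn.Module):
        """Gating network plays the role of the nearest-neighbor switch;
        each expert plays the role of one local trajectory controller."""
        def __init__(self, obs_dim=100, act_dim=40, n_experts=10, width=150):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(obs_dim, width), nn.ReLU(),
                                      nn.Linear(width, n_experts))
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(obs_dim, width), nn.ReLU(),
                              nn.Linear(width, act_dim))
                for _ in range(n_experts)])

        def forward(self, x):
            weights = torch.softmax(self.gate(x), dim=-1)             # (B, n_experts)
            actions = torch.stack([e(x) for e in self.experts], -1)   # (B, act_dim, n_experts)
            return (actions * weights.unsqueeze(1)).sum(-1)           # (B, act_dim)

Pre-training the gate to mimic the current nearest-neighbor switch would amount to supervising its softmax output with the index of the selected local policy.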

Another direction for future work is to expand the set of tasks under consideration. While the present tasks are quite complex, they have been selected for their intrinsic stability. For example, consider our object-spinning task. If we were to attempt an identical task but with the palm facing down, our present approach would not work. This is because the object would drop before we have had time to interact with it and learn anything. In general, data-driven approaches to robotics require either an intrinsically stable task (which may be rare in practice), or a pre-existing controller that is able to collect relevant data without causing immediate failure. In this particular case we could perhaps use tele-operation to obtain such a controller, but a more general and automated solution is needed.

Finally, even though the focus of this paper is purely data-driven learning, there is no reason to take such a one-sided approach in the longer term. As with every other instance of learning from limited and noisy data, the best results are likely to be obtained when data is combined with suitable priors. In the case of robotics, physics provides a strong prior that can rule out the large majority of candidate models and control policies, thereby improving generalization from limited data on the physical system. Existing physics simulators can simulate complex systems such as the one studied here much faster than real time, and can be run in parallel. This functionality can be used for model-predictive control, aided by a neural network representing a controller and/or a value function (Zhong et al. 2013). The model itself could be a hybrid between a physics-based model with a small number of parameters learned from data, and a neural network with a larger number of parameters used to fit the residuals that the physics-based model could not explain.
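One simple form of such a hybrid model, sketched here under our own assumptions, predicts the next state as the output of a parameterized physics simulator plus a learned residual; the `physics_step` callable and the parameter count are hypothetical.

    import torch
    import torch.nn as nn

    class HybridDynamics(nn.Module):
        """x_{t+1} ≈ f_physics(x_t, u_t; θ) + f_residual(x_t, u_t).
        `physics_step` is a hypothetical differentiable simulator call with a
        small set of physical parameters θ; the residual network fits what the
        physics-based model cannot explain."""
        def __init__(self, physics_step, state_dim=100, act_dim=40, width=150):
            super().__init__()
            self.physics_step = physics_step
            self.theta = nn.Parameter(torch.ones(8))   # e.g. friction, damping, masses
            self.residual = nn.Sequential(
                nn.Linear(state_dim + act_dim, width), nn.ReLU(),
                nn.Linear(width, state_dim))

        def forward(self, x, u):
            return self.physics_step(x, u, self.theta) + self.residual(torch.cat([x, u], -1))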

Acknowledgements

This work was supported by the NIH, NSF and DARPA. The authors declare that there is no conflict of interest.

References

Abbeel P, Coates A, Quigley M and Ng A (2006) An application of reinforcement learning to aerobatic helicopter flight. In: Advances in Neural Information Processing Systems (NIPS).

Amend Jr JR, Brown E, Rodenberg N, Jaeger HM and Lipson H (2012) A positive pressure universal gripper based on the jamming of granular material. IEEE Transactions on Robotics 28(2): 341–350.

Astrom KJ and Wittenmark B (2013) Adaptive Control. Courier Corporation.

Bagnell JA and Schneider J (2003) Covariant policy search. In: International Joint Conference on Artificial Intelligence (IJCAI).

Bellman R and Kalaba R (1959) A mathematical theory of adaptive control processes. Proceedings of the National Academy of Sciences 8(8): 1288–1290.

Boyd S and Vandenberghe L (2004) Convex Optimization. New York, NY, USA: Cambridge University Press.

Deisenroth M, Neumann G and Peters J (2013) A survey on policy search for robotics. Foundations and Trends in Robotics 2(1-2): 1–142.

Deisenroth M, Rasmussen C and Fox D (2011) Learning to control a low-cost manipulator using data-efficient reinforcement learning. In: Robotics: Science and Systems (RSS).

Gupta A, Eppner C, Levine S and Abbeel P (2016) Learning dexterous manipulation for a soft robotic hand from human demonstration. In: IROS 2016.

Jordan M and Jacobs R (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6(2): 181–214.

Kober J, Bagnell JA and Peters J (2013) Reinforcement learning in robotics: A survey. International Journal of Robotics Research 32(11): 1238–1274.

Kober J, Oztop E and Peters J (2010) Reinforcement learning to adjust robot movements to new situations. In: Robotics: Science and Systems (RSS).

Kumar V and Todorov E (2015) MuJoCo HAPTIX: A virtual reality system for hand manipulation. In: Humanoids.

Kumar V, Todorov E and Levine S (2016) Optimal control with learned local models: Application to dexterous manipulation. In: International Conference on Robotics and Automation (ICRA).

Kumar V, Xu Z and Todorov E (2013) Fast, strong and compliant pneumatic actuation for dexterous tendon-driven hands. In: International Conference on Robotics and Automation (ICRA).

Levine S and Abbeel P (2014) Learning neural network policies with guided policy search under unknown dynamics. In: Advances in Neural Information Processing Systems (NIPS).

Levine S, Finn C, Darrell T and Abbeel P (2015a) End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702.

Levine S and Koltun V (2013) Guided policy search. In: International Conference on Machine Learning (ICML).

Levine S, Wagener N and Abbeel P (2015b) Learning contact-rich manipulation skills with guided policy search. In: International Conference on Robotics and Automation (ICRA).

Li W and Todorov E (2004) Iterative linear quadratic regulator design for nonlinear biological movement systems. In: ICINCO (1). pp. 222–229.

Lioutikov R, Paraschos A, Neumann G and Peters J (2014) Sample-based information-theoretic stochastic optimal control. In: International Conference on Robotics and Automation (ICRA).

Lynch K and Mason MT (1999) Dynamic nonprehensile manipulation: Controllability, planning and experiments. International Journal of Robotics Research 18(1): 64–92.

Mitrovic D, Klanke S and Vijayakumar S (2010) Adaptive optimal feedback control with learned internal dynamics models. In: From Motor Learning to Interaction Learning in Robots, volume 264. pp. 65–84.

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al. (2015) Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.

Pastor P, Hoffmann H, Asfour T and Schaal S (2009) Learning and generalization of motor skills by learning from demonstration. In: International Conference on Robotics and Automation (ICRA).

Peters J, Mulling K and Altun Y (2010a) Relative entropy policy search. In: AAAI. Atlanta.

Peters J, Mulling K and Altun Y (2010b) Relative entropy policy search. In: AAAI Conference on Artificial Intelligence.

Peters J and Schaal S (2008) Reinforcement learning of motor skills with policy gradients. Neural Networks 21(4).

Schmidt T, Newcombe R and Fox D (2015) DART: dense articulated real-time tracking with consumer depth cameras. In: Autonomous Robots.

Sutton R and Barto A (1998) Reinforcement Learning: An Introduction. MIT Press.

Tedrake R, Zhang T and Seung H (2004) Stochastic policy gradient reinforcement learning on a simple 3D biped. In: International Conference on Intelligent Robots and Systems (IROS).


Tesauro G (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2): 215–219.

Theodorou E, Buchli J and Schaal S (2010) Reinforcement learning of motor skills in high dimensions. In: International Conference on Robotics and Automation (ICRA).

Todorov E, Erez T and Tassa Y (2012) MuJoCo: A physics engine for model-based control. In: Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, pp. 5026–5033.

van Hoof H, Hermans T, Neumann G and Peters J (2015) Learning robot in-hand manipulation with tactile features. In: Humanoid Robots (Humanoids). IEEE.

Yin K, Loken K and van de Panne M (2007) Simbicon: Simple biped locomotion control. ACM Trans. Graph. 26(3): Article 105.

Zhong M, Johnson M, Tassa Y, Erez T and Todorov E (2013) Value function approximation and model-predictive control. In: IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
