


Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

Frederik Ebert*, Chelsea Finn*, Sudeep Dasari, Annie Xie, Alex Lee, Sergey Levine

Abstract—Deep reinforcement learning (RL) algorithms can learn complex robotic skills from raw sensory inputs, but have yet to achieve the kind of broad generalization and applicability demonstrated by deep learning methods in supervised domains. We present a deep RL method that is practical for real-world robotics tasks, such as robotic manipulation, and generalizes effectively to never-before-seen tasks and objects. In these settings, ground truth reward signals are typically unavailable, and we therefore propose a self-supervised model-based approach, where a predictive model learns to directly predict the future from raw sensory readings, such as camera images. At test time, we explore three distinct goal specification methods: designated pixels, where a user specifies desired object manipulation tasks by selecting particular pixels in an image and corresponding goal positions, goal images, where the desired goal state is specified with an image, and image classifiers, which define spaces of goal states. Our deep predictive models are trained using data collected autonomously and continuously by a robot interacting with hundreds of objects, without human supervision. We demonstrate that visual MPC can generalize to never-before-seen objects—both rigid and deformable—and solve a range of user-defined object manipulation tasks using the same model.

Index Terms—Deep Reinforcement Learning, Video Prediction, Robotic Manipulation, Model Predictive Control


Figure 1: Our approach trains a single model from unsupervised interaction that generalizes to a wide range of tasks and objects, while allowing flexibility in goal specification and both rigid and deformable objects not seen during training. Each row shows an example trajectory. From left to right, we show the task definition, the video predictions for the planned actions, and the actual executions. Tasks can be defined as (top) moving pixels corresponding to objects, (bottom left) providing a goal image, or (bottom right) providing a few example goals. Best viewed in PDF.

1 INTRODUCTION

Humans are faced with a stream of high-dimensional sensory inputs and minimal external supervision, and yet are able to learn a range of complex, generalizable skills and behaviors. While there has been significant progress in developing deep reinforcement learning algorithms that learn complex skills and scale to high-dimensional observation spaces, such as pixels [1], [2], [3], [4], learning behaviors that generalize to new tasks and objects remains an open problem. The key to generalization is diversity. When deployed in a narrow, closed-world environment, a reinforcement learning algorithm will recover skills that are successful only in a narrow range of settings. Learning skills in diverse environments, such as the real world, presents a number of significant challenges: external reward feedback is extremely sparse or non-existent, and the agent has only indirect access to the state of the world through its senses, which, in the case of a robot, might correspond to cameras and joint encoders.

• The first two authors contributed equally.

Manuscript received 11/22/2018

We approach the problem of learning generalizable behavior in the real world from the standpoint of sensory prediction. Prediction is often considered a fundamental component of intelligence [5]. Through prediction, it is possible to learn useful concepts about the world even from a raw stream of sensory observations, such as images from a camera. If we predict raw sensory observations directly, we do not need to assume availability of low-dimensional state information or an extrinsic reward signal. Image observations are both information-rich and high-dimensional, presenting both an opportunity and a challenge. Future observations provide a substantial amount of supervisory information for a machine learning algorithm.

arXiv:1812.00568v1 [cs.RO] 3 Dec 2018


However, the predictive model must have the capacity to predict these high-dimensional observations, and the control algorithm must be able to use such a model to effectively select actions to accomplish human-specified goals. Examples of such goals are shown in Figure 1.

We study control via prediction in the context of robotic manipulation, formulating a model-based reinforcement learning approach centered around prediction of raw sensory observations. One of the biggest challenges in learning-based robotic manipulation is generalization: how can we learn models that are useful not just for a narrow range of tasks seen during training, but that can be used to perform new tasks with new objects that were not seen previously? Collecting a training dataset that is sufficiently rich and diverse is often challenging in highly-structured robotics experiments, which depend on human intervention for reward signals, resets, and safety constraints. We instead set up a minimally structured robotic control domain, where data is collected by the robot via unsupervised interaction with a wide range of objects, making it practical to collect large amounts of interaction data. The robot collects a stream of raw sensory observations (image pixels), without any reward signal at training time, and without the ability to reset the environment between episodes. This setting is both realistic and necessary for studying RL in diverse real-world environments, as it enables automated and unattended collection of diverse interaction experience. Since the training setting affords no readily accessible reward signal, learning by prediction presents an appealing option: the supervision signal for prediction is always available even in the stream of unsupervised experience. We therefore propose to learn action-conditioned predictive models directly on raw pixel observations, and show that they can be used to accomplish a range of pixel-based manipulation tasks on a real robot in the physical world at test-time.

The main contributions of this work are as follows. We present visual MPC, a general framework for deep reinforcement learning with sensory prediction models that is suitable for learning behaviors in diverse, open-world environments (see Figure 2). We describe deep neural network architectures that are effective for predicting pixel-level observations amid occlusions and with novel objects. Unlike low-dimensional representations of state, specifying and evaluating the reward from pixel predictions at test-time is nontrivial: we present several practical methods for specifying and evaluating progress towards the goal—including distances to goal pixel positions, registration to goal images, and success classifiers—and compare their effectiveness and use-cases. Finally, our evaluation shows how these components can be combined to enable a real robot to perform a range of object manipulation tasks from raw pixel observations. Our experiments include manipulation of previously unseen objects, handling multiple objects, pushing objects around obstructions, handling clutter, manipulating deformable objects such as cloth, recovering from large perturbations, and grasping and maneuvering objects to user-specified locations in 3D space. Our results represent a significant advance in the generality of skills that can be acquired by a real robot operating on raw pixel values using a single model.

This article combines and extends material from several prior conference papers [6], [7], [8], [9], presenting them in the context of a unified system. We include additional experiments, including cloth manipulation and placing tasks, a quantitative multi-task experiment assessing the performance of our method on a wide range of distinct tasks with a single model, as well as a comprehensive, open-sourced simulation environment to facilitate future research and better reproducibility. The code and videos can be found on the project webpage1.

Figure 2: Overview of visual MPC. (top) At training time, interaction data is collected autonomously and used to train a video-prediction model. (bottom) At test time, this model is used for sampling-based planning. In this work we discuss three different choices for the planning objective.

2 RELATED WORK

Model-based reinforcement learning. Learning a model to predict the future, and then using this model to act, falls under the general umbrella of model-based reinforcement learning. Model-based RL algorithms are generally known to be more efficient than model-free methods [10], and have been used with both low-dimensional [11] and high-dimensional [12] model classes. However, model-based RL methods that directly operate on raw image frames have not been studied as extensively. Several algorithms have been proposed for simple, synthetic images [13] and video game environments [14], [15], [16], but have not been evaluated on generalization or in the real world, while other work has also studied model-based RL for individual robotic skills [17], [18], [19]. In contrast to these works, we place special emphasis on generalization, studying how predictive models can enable a real robot to manipulate previously unseen objects and solve new tasks. Several prior works have also sought to learn inverse models that map from pairs of observations to actions, which can then be used greedily to carry out short-horizon tasks [20], [21]. However, such methods do not directly construct longer-term plans, relying instead on greedy execution. In contrast, our method learns a forward model, which can be used to plan out a sequence of actions to achieve a user-specified goal.

Self-supervised robotic learning. A number of recent works have studied self-supervised robotic learning, where large-scale unattended data collection is used to learn individual skills such as grasping [22], [23], [24], [25], push-grasp synergies [26], or obstacle avoidance [27], [28].

1. For videos & code: https://sites.google.com/view/visualforesight


In contrast to these methods, our approach learns predictive models that can be used to perform a variety of manipulation skills, and does not require a success measure, event indicator, or reward function during data collection.

Sensory prediction models. We propose to leverage sensory prediction models, such as video-prediction models, to enable large-scale self-supervised learning of robotic skills. Prior work on action-conditioned video prediction has studied predicting synthetic video game images [16], [29], 3D point clouds [19], and real-world images [30], [31], [32], using both direct autoregressive frame prediction [31], [32], [33] and latent variable models [34], [35]. Several works have sought to use more complex distributions for future images, for example by using pixel autoregressive models [32], [36]. While this often produces sharp predictions, the resulting models are extremely demanding computationally. Video prediction without actions has been studied for unstructured videos [33], [37], [38] and driving [39], [40]. In this work, we extend video prediction methods that are based on predicting a transformation from the previous image [31], [40].

3 OVERVIEW

In this section, we summarize our visual model-predictive control (MPC) method, which is a model-based reinforcement learning approach to end-to-end learning of robotic manipulation skills. Our method, outlined in Figure 2, consists of three phases: unsupervised data collection, predictive model training, and planning-based control via the model at test-time.

Unsupervised data collection: At training time, data is collected autonomously by applying random actions sampled from a pre-specified distribution. It is important that this distribution allows the robot to visit parts of the state space that are relevant for solving the intended tasks. For some tasks, uniform random actions are sufficient, while for others, the design of the exploration strategy takes additional care, as detailed in Sections 7 and 9.4.

Model training: Also during training time, we train a video prediction model on the collected data. The model takes as input an image of the current timestep and a sequence of actions, and generates the corresponding sequence of future frames. This model is described in Section 4.

Test time control: At test time, we use a sampling-based, gradient-free optimization procedure, similar to a shooting method [41], to find the sequence of actions that minimizes a cost function. Further details, including the motivation for this type of optimizer, can be found in Section 6.

Depending on how the goal is specified, we use one of the following three cost functions. When the goal is provided by clicking on an object and a desired goal position, a pixel-distance cost function, detailed in Section 5.1, evaluates how far the designated pixel is from the goal pixels. We can specify the goal more precisely by providing a goal image in addition to the pixel positions and make use of image-to-image registration to compute a cost function, as discussed in Section 5.2. Finally, we show that we can specify more conceptual tasks by providing one or several examples of success and employing a classifier-based cost function as detailed in Section 5.3.

Figure 3: Computation graph of the video-prediction model. Time goes from left to right, at are the actions, ht are the hidden states of the recurrent neural network, Ft+1←t is a 2D warping field, It are real images, Ît are predicted images, and L is a pairwise training loss.

The strengths and weaknesses of the different cost functions, and the trade-offs between them, are discussed in Section 5.4.

The model is used to plan T steps into the future, and the first action of the action sequence that attained the lowest cost is executed. In order to correct for mistakes made by the model, the actions are iteratively replanned at each real-world time step2 τ ∈ {0, ..., τmax}, following the framework of model-predictive control (MPC). In the following sections, we explain the video-prediction model, the planning cost function, and the trajectory optimizer.
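The replanning loop itself can be sketched in a few lines. The following Python snippet is illustrative only; plan_actions, env_step, and get_image are hypothetical interfaces standing in for the sampling-based optimizer of Section 6 and for the robot's control and camera APIs.

import numpy as np

def visual_mpc_episode(plan_actions, env_step, get_image, tau_max, horizon):
    """Illustrative MPC outer loop: replan at every real-world time step and
    execute only the first action of the best sequence found so far."""
    for tau in range(tau_max):
        current_image = get_image()                      # latest camera observation
        actions = plan_actions(current_image, horizon)   # best (horizon, action_dim) sequence
        env_step(actions[0])                             # apply only the first action, then replan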

4 VIDEO PREDICTION FOR CONTROL

In visual MPC, we use a transformation-based video prediction architecture, first proposed by Finn et al. [31]. The advantage of using transformation-based models over a model that directly generates pixels is two-fold: (1) prediction is easier, since the appearance of objects and the background scene can be reused from previous frames, and (2) the transformations can be leveraged to obtain predictions about where pixels will move, a property that is used in several of our planning cost function formulations. The model, which is implemented as a recurrent neural network (RNN) gθ parameterized by θ, has a hidden state ht and takes in a previous image and an action at each step of the rollout. Future images Ît+1 are generated by warping the previously generated image Ît or the previous true image It, when available, according to a 2-dimensional flow field Ft+1←t. A simplified illustration of the model's structure is given in Figure 3. It is also summarized in the following two equations:

[ht+1, Ft+1←t] = gθ(at, ht, It) (1)

Ît+1 = Ft+1←t ⋄ It    (2)

Here, the bilinear sampling operator ⋄ interpolates the pixel values bilinearly with respect to a location (x, y) and its four neighbouring pixels in the image, similar to [42].
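To make the warping operator concrete, the following NumPy sketch applies a dense flow field to an image with bilinear interpolation. It is a simplified stand-in for the operator used inside the network, not the actual implementation; function and variable names are illustrative.

import numpy as np

def bilinear_warp(image, flow):
    """Warp `image` (H, W, C) with a flow field `flow` (H, W, 2) that holds, for every
    target pixel, the (x, y) displacement of the source location it samples from."""
    H, W, _ = image.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    # source coordinates for every target pixel, clipped to the image bounds
    src_x = np.clip(xs + flow[..., 0], 0, W - 1)
    src_y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(src_x).astype(int), np.floor(src_y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (src_x - x0)[..., None], (src_y - y0)[..., None]
    # interpolate between the four neighbouring source pixels
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom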

2. By real-world step we mean a time step of the real world, as opposed to predicted time steps.


Figure 4: Forward pass through the recurrent SNA model, consisting of convolutional layers with bilinear downsampling, Conv-LSTM layers, and convolutional layers with bilinear upsampling that produce the flow field, the compositing masks, and the transformations. The red arrow indicates where the image from the first time step I0 is concatenated with the transformed images Ft+1←t ⋄ Ît, multiplying each channel with a separate mask to produce the predicted frame for step t+1.

Note that, as shown in Figure 3, at the first time step the real image is transformed, whereas at later time steps previously generated images are transformed in order to generate multi-frame predictions. The model is trained with gradient descent on an ℓ2 image reconstruction loss, denoted by L in Figure 3. A forward pass of the RNN is illustrated in Figure 4. We use a series of stacked convolutional LSTMs and standard convolutional layers interleaved with average-pooling and upsampling layers. The result of this computation is the 2-dimensional flow field Ft+1←t, which is used to transform a current image It or Ît. More details on the architecture are provided in Appendix A.

Predicting pixel motion. When using visual MPC with a cost function based on start and goal pixel positions, we require a model that can effectively predict the 2D motion of the user-selected start pixels d0(1), ..., d0(P) up to T steps into the future3. More details about the cost functions are provided in Section 5. Since the model we employ is transformation-based, this motion prediction capability emerges automatically, and therefore no external pixel motion supervision is required. To predict the future positions of the designated pixel d, the same transformations used to transform the images are applied to the distribution over designated pixel locations. The warping transformation Ft+1←t can be interpreted as a stochastic transition operator allowing us to make probabilistic predictions about future locations of individual pixels:

Pt+1 = Ft+1←t ⋄ Pt    (3)

Here, Pt is a distribution over image locations which has the same spatial dimension as the image. For simplicity in notation, we will use a single designated pixel moving forward, but using multiple is straightforward. At the first time step, the distribution P0 is defined as 1 at the position of the user-selected designated pixel and zero elsewhere. The distribution Pt+1 is normalized at each prediction step.
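A minimal sketch of Equation (3), assuming a bilinear warping helper like the one above: the probability map over designated-pixel locations is warped with the same flow field and renormalized at every step. Names are illustrative.

import numpy as np

def propagate_pixel_distribution(P_t, flow, warp):
    """Advance the distribution over designated-pixel locations by one step.
    P_t: (H, W) probability map, flow: (H, W, 2) predicted flow field,
    warp: a bilinear warping function such as the one sketched earlier."""
    P_next = warp(P_t[..., None], flow)[..., 0]   # warp the single-channel map
    return P_next / (P_next.sum() + 1e-8)         # renormalize to a valid distribution

def initial_distribution(height, width, designated_pixel):
    """P_0 is 1 at the user-selected pixel and 0 elsewhere."""
    P0 = np.zeros((height, width))
    P0[designated_pixel] = 1.0
    return P0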

Since this basic model, referred to as dynamic neural advection (DNA), predicts images only based on the previous image, it is unable to recover shapes (e.g., objects) after they have been occluded, for example by the robot arm.

3. Note that when using a classifier-based cost function, we do not require the model to output transformations.

Hence, this model is only suitable for planning motions where the user-selected pixels are not occluded during the manipulation, limiting its use in cluttered environments or with multiple selected pixels. In the next section, we introduce an enhanced model, which lifts this limitation by employing temporal skip connections.

Skip connection neural advection model. To enable effective tracking of objects through occlusions, we can add temporal skip connections to the model: we now transform pixels not only from the previously generated image Ît, but from all previous images I1, ..., It, including the context image I0, which is a real image. All these transformed images can be combined to form the predicted image Ît+1 by taking a weighted sum over all transformed images, where the weights are given by masks Mt with the same size as the image and a single channel:

Ît+1 = M0 (Ft+1←0 ⋄ I0) + ∑_{j=1}^{τ} Mj (Ft+1←j ⋄ Ij).    (4)

We refer to this model as the skip connection neural advection model (SNA), since it handles occlusions by using temporal skip connections such that when a pixel is occluded, e.g., by the robot arm or by another object, it can still reappear later in the sequence. Transforming from all previous images comes with increased computational cost, since the number of masks and transformations scales with the number of time steps τ. However, we found that in practice a greatly simplified version of this model, where transformations are applied only to the previous image and the first image of the sequence I0, works equally well. Moreover, we found that transforming the first image of the sequence is not necessary, as the model uses these pixels primarily to generate the image background. Therefore, we can use the first image directly, without transformation. More details can be found in Appendix A and [7].
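The simplified compositing used in practice can be sketched as follows. This is a hedged illustration: it assumes the masks are normalized with a softmax over the two sources (the untransformed context image I0 and the warped previous prediction), which is one plausible choice rather than the exact scheme of the SNA implementation.

import numpy as np

def sna_composite(first_image, warped_prev_prediction, masks):
    """Simplified SNA compositing: a per-pixel weighted sum of the untransformed
    context image I_0 and the warped previous prediction, with weights given by
    `masks` of shape (H, W, 2), normalized here with a softmax over the last axis."""
    weights = np.exp(masks) / np.exp(masks).sum(axis=-1, keepdims=True)
    return (weights[..., 0:1] * first_image
            + weights[..., 1:2] * warped_prev_prediction)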

5 PLANNING COST FUNCTIONS

In this section, we discuss how to specify and evaluate goals for planning. One naïve approach is to use pixel-wise error, such as ℓ2 error, between a goal image and the predicted image. However, there is a severe issue with this approach: large objects in the image, i.e. the arm and shadows, dominate such a cost; therefore a common failure mode occurs when the planner matches the arm position with its position in the goal image, disregarding smaller objects. This failure motivates our use of more sophisticated mechanisms for specifying goals, which we discuss next.

5.1 Pixel Distance Cost

A convenient way to define a robot task is by choosing one or more designated pixels in the robot's camera view and choosing a destination where each pixel should be moved. For example, the user might select a pixel on an object and ask the robot to move it 10 cm to the left. This type of objective is general, in that it can define any object relocation task on the viewing plane. Further, success can be measured quantitatively, as detailed in Section 9. Given a distribution over pixel positions P0, our model predicts distributions over its positions Pt at time t ∈ {1, ..., T}.


Figure 5: Closed-loop control is achieved by registering the current image It globally to the first frame I0 and the goal image Ig. In this example registration to I0 succeeds while registration to Ig fails since the object in Ig is too far away.

One way of defining the cost per time step ct is by using the expected Euclidean distance to the goal point dg, which is straightforward to calculate from Pt and dg, as follows:

c = ∑_{t=1,...,T} ct = ∑_{t=1,...,T} E_{dt∼Pt} [ ‖dt − dg‖2 ]    (5)

The per-time-step costs ct are summed together, giving the overall planning objective c. The expected distance to the goal provides a smooth planning objective and enables longer-horizon tasks, since this cost function encourages movement of the designated objects in the right direction for each step of the execution, regardless of whether the goal position can be reached within T time steps or not. This cost also makes use of the uncertainty estimates of the predictor when computing the expected distance to the goal. For multi-objective tasks with multiple designated pixels d(i), the costs are summed together, and optionally weighted according to a scheme discussed in Section 5.2.
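A short sketch of Equation (5): the expected Euclidean distance of the designated pixel to its goal, computed from a predicted probability map and summed over the planning horizon. Names are illustrative.

import numpy as np

def expected_pixel_distance(P_t, goal_pixel):
    """Expected distance E_{d_t ~ P_t}[||d_t - d_g||_2] for one predicted time step.
    P_t: (H, W) probability map over pixel locations, goal_pixel: (row, col)."""
    H, W = P_t.shape
    rows, cols = np.mgrid[0:H, 0:W]
    dist = np.sqrt((rows - goal_pixel[0]) ** 2 + (cols - goal_pixel[1]) ** 2)
    return float((P_t * dist).sum())

def trajectory_cost(P_sequence, goal_pixel):
    """Sum of per-step expected distances over the planning horizon."""
    return sum(expected_pixel_distance(P, goal_pixel) for P in P_sequence)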

5.2 Registration-Based Cost

We now propose an improvement over using pixel distances. When using pixel distance cost functions, it is necessary to know the current location of the object, d0(1), ..., d0(P), at each replanning step, so that the model can predict the positions of this pixel from the current step forward. To update the belief of where the target object currently is, we propose to register the current image to the start and optionally also to a goal image, where the designated pixels are marked by the user. Adding a goal image can make visual MPC more precise, since when the target object is close to the goal position, registration to the goal image greatly improves the position estimate of the designated pixel. Crucially, the registration method we introduce is self-supervised, using the same exact data for training the video prediction model and for training the registration model. This allows both models to continuously improve as the robot collects more data.

Test time procedure. We will first describe the registration scheme at test time (see Figure 6(a)). We separately register the current image It to the start image I0 and to the goal image Ig by passing it into the registration network R, implemented as a fully-convolutional neural network.

Figure 6: (a) At test time the registration network registers the current image It to the start image I0 (top) and goal image Ig (bottom), inferring the flow fields F0←t and Fg←t. (b) The registration network is trained by warping images from randomly selected time steps along a trajectory to each other.

The registration network produces a flow map F0←t ∈ R^{H×W×2}, a vector field with the same size as the image that describes the relative motion for every pixel between the two frames.

F0←t = R(It, I0) Fg←t = R(It, Ig) (6)

The flow map F0←t can be used to warp the image of the current time step t to the start image I0, and Fg←t can be used to warp from It to Ig (see Figure 5 for an illustration). The warping operation is the same as the one used in the video prediction model, explained in Section 4, Equation 2:

Î0 = F0←t ⋄ It    Îg = Fg←t ⋄ It    (7)

In essence, F0←t puts the current image It in correspondence with I0, and Fg←t puts It in correspondence with Ig. The motivation for registering to both I0 and Ig is to increase accuracy and robustness; in principle, registering to either I0 or Ig alone is sufficient. While the registration network is trained to perform a global registration between the images, we only evaluate it at the points d0 and dg chosen by the user. This results in a cost function that ignores distractors. The flow map produced by the registration network is used to find the pixel locations corresponding to d0 and dg in the current frame:

d0,t = d0 + F0←t(d0) dg,t = dg + Fg←t(dg) (8)

For simplicity, we describe the case with a single designated pixel. In practice, instead of a single flow vector F0←t(d0) and Fg←t(dg), we consider a neighborhood of flow vectors around d0 and dg and take the median in the x and y directions, making the registration more stable. Figure 7 visualizes an example tracking result while the gripper is moving an object.

Registration-based pixel distance cost. Registration can fail when distances between objects in the images are large. During a motion, the registration to the first image typically becomes harder, while the registration to the goal image becomes easier. We propose a mechanism that estimates which image is registered correctly, allowing us to utilize only the successful registration for evaluating the planning cost. This mechanism gives a high weight λi to pixel distance costs ci associated with a designated pixel di,t that is tracked successfully, and a low, ideally zero, weight to a designated pixel where the registration is poor.
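The designated-pixel update of Equation (8), together with the median-over-neighborhood stabilization described above, can be sketched as follows. The window size and the flow layout (row/column displacements) are illustrative assumptions.

import numpy as np

def track_designated_pixel(flow, pixel, window=5):
    """Update a designated pixel location using a registration flow field.
    flow: (H, W, 2) flow field from the registration network, storing (row, col)
    displacements in this sketch; the flow vector at the reference pixel points to
    its location in the current frame (equation 8). pixel: integer (row, col)."""
    H, W, _ = flow.shape
    r0 = int(np.clip(pixel[0] - window // 2, 0, H - window))
    c0 = int(np.clip(pixel[1] - window // 2, 0, W - window))
    patch = flow[r0:r0 + window, c0:c0 + window]          # neighborhood of flow vectors
    offset = np.median(patch.reshape(-1, 2), axis=0)      # median in x and y directions
    return (pixel[0] + offset[0], pixel[1] + offset[1])   # location in the current frame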


Figure 7: Outputs of the registration network. The first row shows the time steps from left to right of a robot picking and moving a red bowl, the second row shows each image warped to the initial image via registration, and the third row shows the same for the goal image. A successful registration in this visualization would result in images that closely resemble the start or goal image. In the first row, the locations where the designated pixel of the start image d0 and the goal image dg are found are marked with red and blue crosses, respectively. It can be seen that the registration to the start image (red cross) is failing in the second-to-last time step, while the registration to the goal image (blue cross) succeeds for all time steps. The numbers in red in the upper left corners indicate the trade-off factors λ between the views and are used as weighting factors for the planning cost. (Best viewed in PDF)

We use the photometric distance between the true frame and the warped frame, evaluated at d0,i and dg,i, as an estimate for local registration success. A low photometric error indicates that the registration network predicted a flow vector leading to a pixel with a similar color, thus indicating warping success. However, this does not necessarily mean that the flow vector points to the correct location; for example, there could be several objects with the same color and the network could simply point to the wrong object. Letting Ii(di) denote the pixel value in image Ii at position di, and Îi(di) denote the corresponding pixel in the image warped by the registration function, we can define the general weighting factors λi as:

λi = ‖Ii(di) − Îi(di)‖2^{−1} / ∑_{j}^{N} ‖Ij(dj) − Îj(dj)‖2^{−1}    (9)

where Îi = Fi←t ⋄ It. The MPC cost is computed as the average of the costs ci weighted by λi, where each ci is the expected distance (see Equation 5) between the registered point di,t and the goal point dg,i. Hence, the cost used for planning is c = ∑i λi ci. In the case of the single-view model and a single designated pixel, the index i iterates over the start and goal image (and N = 2).
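A sketch of Equation (9) and the resulting weighted cost: registration confidences are the normalized inverse photometric errors, and the planning cost is the corresponding weighted combination of per-registration expected-distance costs. The small epsilon is an illustrative safeguard, not part of the method as described.

import numpy as np

def registration_weights(photometric_errors, eps=1e-6):
    """Equation (9): weights proportional to the inverse photometric error of each
    registration (start image, goal image, and, in the multi-view case, each view)."""
    inv = 1.0 / (np.asarray(photometric_errors) + eps)
    return inv / inv.sum()

def weighted_planning_cost(per_registration_costs, photometric_errors):
    """Combine the expected-distance costs c_i using the confidence weights lambda_i."""
    lam = registration_weights(photometric_errors)
    return float(np.dot(lam, per_registration_costs))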

The proposed weighting scheme can also be used with multiple designated pixels, as used in multi-task settings and multi-view models, which are explained in Section 8. The index i then also loops over the views and the indices of the designated pixels.

Training procedure. The registration network is trained on the same data as the video prediction model, but it does not share parameters with it.4 Our approach is similar to the optic flow method proposed by [43]. However, unlike this prior work, our method computes registrations for frames that might be many time steps apart, and the goal is not to extract optic flow, but rather to determine correspondences between potentially distant images. For training, two images are sampled at random time steps t and t + h along the trajectory and the images are warped to each other in both directions.

Ît = Ft←t+h ⋄ It+h    Ît+h = Ft+h←t ⋄ It    (10)

The network, which outputs Ft←t+h and Ft+h←t (see Figure 6(b)), is trained to minimize the photometric distance between Ît and It, and between Ît+h and It+h, in addition to a smoothness regularizer that penalizes abrupt changes in the output flow field. The details of this loss function follow prior work [43].

4. In principle, sharing parameters with the video prediction model might be beneficial, but this is left for future work.

We found that gradually increasing the temporal distance h between the images during training yielded better final accuracy, as it creates a learning curriculum. The temporal distance is linearly increased from 1 step to 8 steps over the first 20k SGD steps; in total, 60k iterations were taken.

The network R is implemented as a fully convolutional network taking in two images stacked along the channel dimension. First the inputs are passed into three convolutional layers, each followed by a bilinear downsampling operation. This is passed into three layers of convolution, each followed by a bilinear upsampling operation (all convolutions use stride 1). By using bilinear sampling for increasing or decreasing image sizes we avoid artifacts that are caused by strided convolutions and deconvolutions.
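A hedged PyTorch-style sketch of such a network: two images stacked along the channel dimension, three stride-1 convolutions each followed by bilinear downsampling, three stride-1 convolutions each followed by bilinear upsampling, and a 2-channel flow output. Channel widths and activation choices are illustrative assumptions, not the exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegistrationNet(nn.Module):
    """Fully-convolutional registration network R: takes two images stacked along the
    channel dimension and outputs a dense 2D flow field of the same spatial size."""
    def __init__(self, channels=3, width=32):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Conv2d(2 * channels, width, 3, padding=1),
            nn.Conv2d(width, 2 * width, 3, padding=1),
            nn.Conv2d(2 * width, 4 * width, 3, padding=1),
        ])
        self.up = nn.ModuleList([
            nn.Conv2d(4 * width, 2 * width, 3, padding=1),
            nn.Conv2d(2 * width, width, 3, padding=1),
            nn.Conv2d(width, 2, 3, padding=1),     # 2-channel flow field output
        ])

    def forward(self, image_a, image_b):
        x = torch.cat([image_a, image_b], dim=1)   # stack the two images channel-wise
        for conv in self.down:                      # stride-1 conv + bilinear downsampling
            x = F.relu(conv(x))
            x = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
        for i, conv in enumerate(self.up):          # stride-1 conv + bilinear upsampling
            x = conv(x)
            if i < len(self.up) - 1:
                x = F.relu(x)
            x = F.interpolate(x, scale_factor=2.0, mode="bilinear", align_corners=False)
        return x                                     # flow field with the input's H x W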

5.3 Classifier-Based Cost Functions

An alternative way to define the cost function is with a goal classifier. This type of cost function is particularly well-suited for tasks that can be completed in multiple ways. For example, for a task of rearranging a pair of objects into relative positions, i.e. pushing the first object to the left of the second object, the absolute positions of the objects do not matter, nor does the arm position. A classifier-based cost function allows the planner to discover any of the possible goal states.

Unfortunately, a typical image classifier will require a large amount of labeled examples to learn, and we do not want to collect large datasets for each and every task. Instead, we aim to learn a goal classifier from only a few positive examples, using a meta-learning approach. A few positive examples of success are easy for people to provide and are the minimal information needed to convey a goal.

Formally, we consider a goal classifier y = f(o), where o denotes the image observation, and y ∈ [0, 1] indicates the predicted probability of the observation being of a successful outcome of the task. Our objective is to infer a classifier for a new task Tj from a few positive examples of success, which are easy for a user to provide and encode the minimal information needed to convey a task. In other words, given a dataset Dj+ of K examples of successful end states for a new task Tj: Dj+ := {(ok, 1) | k = 1...K}j, our goal is to infer a classifier for task Tj.


Figure 8: We propose a framework for quickly specifying visual goals. Our goal classifier is meta-trained with positive and negative examples for diverse tasks (left), which allows it to meta-learn that some factors matter for goals (e.g., relative positions of objects), while some do not (e.g., position of the arm). At meta-test time, this classifier can learn goals for new tasks from a few examples of success (right - the goal is to place the fork to the right of the plate). The cost can be derived from the learned goal classifier for use with visual MPC.

Meta-learning for few-shot goal inference. To solve the above problem, we propose learning a few-shot classifier that can infer the goal of a new task from a small set of goal examples, allowing the user to define a task from a few examples of success. To train the few-shot classifier, we first collect a dataset of both positive and negative examples for a wide range of tasks. We then use this data to learn how to learn goal classifiers from a few positive examples. Our approach is illustrated in Figure 8.

We build upon model-agnostic meta-learning (MAML) [44], which learns initial parameters θ for model f that can efficiently adapt to a new task with one or a few steps of gradient descent. Grant et al. [45] proposed an extension of MAML, referred to as concept acquisition through meta-learning (CAML), for learning to learn new concepts from positive examples alone. We apply CAML to the setting of acquiring goal classifiers from positive examples, using meta-training data with both positive and negative examples. The result of the meta-training procedure is an initial set of parameters that can be used to learn new goal classifiers at test time.

Test time procedure. At test time, the user provides a dataset Dj+ of K examples of successful end states for a new task Tj: Dj+ := {(ok, 1) | k = 1...K}j, which are then used to infer a task-specific goal classifier Cj. In particular, the meta-learned parameters θ are updated through gradient descent to adapt to task Tj:

Cj(o) = f(o; θ′j) = f(o; θ − α ∇θ ∑_{(on,yn)∈Dj+} L(yn, f(on; θ)))

where L is the cross-entropy loss function, α is the step size, and θ′j denotes the parameters updated through gradient descent on task Tj.

During planning, the learned classifier Cj takes as input an image generated by the video prediction model and outputs the predicted probability of the goal being achieved for the task specified by the few examples of success. To convert this into a cost function, we treat the probability of success as the planning cost for that observation. To reduce the effect of false positives and mis-calibrated predictions, we use the classifier conservatively by thresholding the predictions so that reward is only given for confident successes. Below this threshold, we give a reward of 0, and above this threshold, we provide the predicted probability as the reward.

Training time procedure. During meta-training, we explicitly train for the ability to infer goal classifiers for the set of training tasks, {Ti}. We assume a small dataset Di for each task Ti, consisting of both positive and negative examples: Di := {(on, yn) | n = 1...N}i. To learn the initial parameters θ, we optimize the following objective using Adam [46]:

min_θ ∑_i ∑_{(on,yn)∈Di^{test}} L(yn, f(on; θ′i))

In our experiments, our classifier is represented by a convolutional neural network, consisting of three convolutional layers, each followed by layer normalization and a ReLU non-linearity. After the final convolutional layer, a spatial soft-argmax operation extracts spatial feature points, which are then passed through fully-connected layers.
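A hedged sketch of a classifier with this structure, together with the thresholded conversion of its output into a planning cost described above. Layer widths, the use of GroupNorm(1, C) as a stand-in for layer normalization, and the threshold value are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalClassifier(nn.Module):
    """Illustrative goal classifier: three conv layers with (layer-norm-like)
    normalization and ReLU, a spatial soft-argmax extracting feature points,
    and fully-connected layers producing a success probability."""
    def __init__(self, in_channels=3, width=32, num_points=16):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels, width, 3, padding=1),
            nn.Conv2d(width, width, 3, padding=1),
            nn.Conv2d(width, num_points, 3, padding=1),
        ])
        self.norms = nn.ModuleList([
            nn.GroupNorm(1, width), nn.GroupNorm(1, width), nn.GroupNorm(1, num_points)
        ])  # GroupNorm with one group normalizes over each whole feature map
        self.fc = nn.Sequential(nn.Linear(2 * num_points, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, image):
        x = image
        for conv, norm in zip(self.convs, self.norms):
            x = F.relu(norm(conv(x)))
        B, C, H, W = x.shape
        attn = F.softmax(x.view(B, C, H * W), dim=-1).view(B, C, H, W)  # spatial soft-argmax
        ys = torch.linspace(-1, 1, H, device=x.device).view(1, 1, H, 1)
        xs = torch.linspace(-1, 1, W, device=x.device).view(1, 1, 1, W)
        points = torch.cat([(attn * xs).sum(dim=(2, 3)), (attn * ys).sum(dim=(2, 3))], dim=1)
        return torch.sigmoid(self.fc(points)).squeeze(-1)   # predicted success probability

def classifier_cost(success_prob, threshold=0.5):
    """Thresholded conversion of the classifier output into a planning cost:
    reward 0 below the confidence threshold, the predicted probability above it."""
    reward = success_prob if success_prob > threshold else 0.0
    return -reward   # lower cost is better for the planner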

5.4 When to Use Which Cost Function?

We have introduced three different forms of cost function: pixel distance based cost functions with and without registration, as well as classifier-based cost functions. Here we discuss the relative strengths and weaknesses of each.

Pixel distance based cost functions have the advantage that they allow moving objects precisely to target locations. They are also easy to specify, without requiring any example goal images, and therefore provide an easy and fast user interface. The pixel distance based cost function also has a high degree of robustness against distractor objects and clutter, since the optimizer can ignore the values of other pixels; this is important when targeting diverse real-world environments. By incorporating an image of the goal, we can also add a registration mechanism to allow for more robust closed-loop control, at the cost of a more significant burden on the user.

The classifier-based cost function allows for solving more abstract tasks, since it can capture invariances, such as the position of the arm, and settings where the absolute position of an object is not relevant, such as positioning a cup in front of a plate, irrespective of where the plate is. Providing a few example images takes more effort than specifying pixel locations, but allows a broader range of goal sets to be specified.

6 TRAJECTORY OPTIMIZER

The role of the optimizer is to find action sequences a1:T that minimize the sum of the costs c1:T along the planning horizon T. We use a simple stochastic optimization procedure for this, based on the cross-entropy method (CEM), a gradient-free optimization procedure. CEM consists of iteratively resampling action sequences and refitting Gaussian distributions to the actions with the best predicted cost.

Although a variety of trajectory optimization methods may be suitable, one advantage of the stochastic optimization procedure is that it allows us to easily ensure that actions stay within the distribution of actions the model encountered during training. This is crucial to ensure that the model does not receive out-of-distribution inputs and makes valid predictions.


Algorithm 1 Planning in Visual MPC

1: Inputs: predictive model g, planning cost function c
2: for t = 0 ... T − 1 do
3:   for i = 0 ... niter − 1 do
4:     if i == 0 then
5:       Sample M action sequences {a(m)_{t:t+H−1}} from N(0, I) or a custom sampling distribution
6:     else
7:       Sample M action sequences a(m)_{t:t+H−1} from N(µ(i), Σ(i))
8:     Check if sampled actions are within the admissible range, otherwise resample
9:     Use g to predict future image sequences Î(m)_{t:t+H−1} and probability distributions P(m)_{t:t+H−1}
10:    Evaluate action sequences using the cost function c
11:    Fit a diagonal Gaussian to the k action samples with lowest cost, yielding µ(i), Σ(i)
12:  Apply the first action of the best action sequence to the robot

Algorithm 1 illustrates the planning process. In practice, this can be achieved by defining admissible ranges for each dimension of the action vector and rejecting a sample if it is outside of the admissible range.

In Appendix C we present a few improvements to the CEM optimizer for visual MPC.
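A minimal sketch of the CEM planner of Algorithm 1, assuming a predictive model and cost function with the interfaces described in the previous sections. Sample counts, iteration counts, and action bounds are illustrative, and out-of-range samples are clipped here rather than resampled for brevity.

import numpy as np

def cem_plan(predict, cost, current_obs, horizon, action_dim,
             num_samples=200, num_elites=10, num_iters=3, action_bounds=(-1.0, 1.0)):
    """Sampling-based planning with the cross-entropy method (CEM).
    predict(obs, actions) -> predicted observation sequence for one action sequence;
    cost(predictions) -> scalar planning cost. Returns the best action sequence found."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    best_seq, best_cost = None, np.inf
    for _ in range(num_iters):
        # sample action sequences and keep them within the admissible range
        samples = np.random.normal(mean, std, size=(num_samples, horizon, action_dim))
        samples = np.clip(samples, *action_bounds)
        costs = np.array([cost(predict(current_obs, a)) for a in samples])
        elites = samples[np.argsort(costs)[:num_elites]]
        # refit a diagonal Gaussian to the lowest-cost samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
        if costs.min() < best_cost:
            best_cost, best_seq = costs.min(), samples[np.argmin(costs)]
    return best_seq   # in MPC, only the first action best_seq[0] is executed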

7 CUSTOM ACTION SAMPLING DISTRIBUTIONS

When collecting data by sampling from simple distributions, such as a multivariate Gaussian, the skills that emerged were found to be generally restricted to pushing and dragging objects. This is because with simple distributions, it is very unlikely to visit states like picking up and placing of objects or folding cloth. Not only would the model be imprecise for these kinds of states, but also during planning it would be unlikely to find action sequences that grasp an object or fold an item of clothing. We therefore explore how the sampling distribution used both in data collection and sampling-based planning can be changed to visit these, otherwise unlikely, states more frequently, allowing more complex behavior to emerge.

To allow picking up and placing of objects as well as folding of cloth to occur more frequently, we incorporate a simple "reflex" during data collection, where the gripper automatically closes when the height of the wrist above the table is lower than a small threshold. This reflex is inspired by the palmar reflex observed in infants [47]. With this primitive, when collecting data with rigid objects, about 20% of trajectories included some sort of grasp. For deformable objects such as towels and cloth, this primitive helps increase the likelihood of encountering states where cloths are folded. We found that the primitive can be slightly adapted to avoid cloths becoming tangled up. More details are provided in Appendix B.
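A sketch of the reflex as it could be applied to sampled actions during data collection; the threshold value and the action layout are illustrative assumptions.

def apply_grasp_reflex(action, wrist_height, height_threshold=0.03):
    """Overwrite the sampled gripper command with 'close' when the wrist is close
    to the table, mimicking the palmar-reflex-inspired primitive described above.
    `action` is a dict with a 'gripper' entry; 1.0 means close, 0.0 means open."""
    if wrist_height < height_threshold:
        action = dict(action, gripper=1.0)   # reflex: close the gripper near the table
    return action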

It is worth noting that, other than this reflex, no grasping-specific or folding-specific engineering was applied to the policy, allowing a joint pushing, grasping, and folding policy to emerge through planning (see Figure 16 in the appendix).

In our experiments, we evaluate our method using data obtained both with and without the grasping reflex, evaluating both purely non-prehensile and combined prehensile and non-prehensile manipulation.

8 MULTI-VIEW VISUAL MPC

Figure 9: Robot setup, with two standard webcams arranged at different viewing angles.

The visual MPC algorithm as described so far is only able to solve manipulation tasks specified in 2D, like rearranging objects on the table. However, this can impose severe limitations; for example, a task such as lifting an object to a particular position in 3D cannot be fully specified with a single view, since it would be ambiguous. We use a combination of two views, taken from two cameras arranged appropriately, to jointly define a 3D task. Figure 9 shows the robot setup, including two standard webcams observing the workspace from different angles. The registration method described in the previous section is used separately per view to allow for dynamic retrying and solving temporally extended tasks. The planning costs from each view are combined using weighted averaging, where the weights are provided by the registration network (see Equation 9). Rows 5 and 6 of Figure 12 show a 3D object positioning task, where an object needs to be positioned at a particular point in 3D space. This task needs two views to be fully specified.

9 EXPERIMENTAL EVALUATION

In this section we present both qualitative and quantitative performance evaluations of visual MPC on various manipulation tasks, assessing the degree of generalization and comparing different prediction models and cost functions, as well as a hand-crafted baseline. In Figures 1 and 12 we present a set of qualitative experiments showing that visual MPC trained fully self-supervised is capable of solving a wide range of complex tasks. Videos for the qualitative examples are at the following webpage5. In order to perform quantitative comparisons, we define a set of tasks where the robot is required to move object(s) into a goal configuration. For measuring success, we use a distance-based evaluation where a human annotates the positions of the objects after pushing, allowing us to compute the remaining distance to the goal.

9.1 Comparing Video Prediction Architectures

We first aim to answer the question: Does visual MPC using the occlusion-aware SNA video prediction model that includes temporal skip connections outperform visual MPC with the dynamic neural advection model (DNA) [6] without temporal skip connections?

5. Videos & code: https://sites.google.com/view/visualforesight/


          moved imp. ± std err. of mean    stationary imp. ± std err. of mean
DNA [6]   0.83 ± 0.25                      -1.1 ± 0.2
SNA       10.6 ± 0.82                      -1.5 ± 0.2

Table 1: Results for multi-objective pushing on 8 object/goal configurations with 2 seen and 2 novel objects. Values indicate improvement in distance from starting position, higher is better. Units are pixels in the 64x64 images.

                                       Short    Long
Visual MPC + predictor propagation     83%      20%
Visual MPC + OpenCV tracking           83%      45%
Visual MPC + registration network      83%      66%

Table 2: Success rate for the long-distance pushing experiment with 20 different object/goal configurations and the short-distance experiment with 15 object/goal configurations. Success is defined as bringing the object closer than 15 pixels to the goal, which corresponds to around 7.5 cm.

To examine whether our skip-connection model (SNA) helps with handling occlusions, we devised a task that requires the robot to push one object while keeping another object stationary. When the stationary object is in the way, the robot must move the target object around it. This is illustrated on the left side of Figure 17 in the appendix. While pushing the target object, the gripper may occlude the stationary object, and the task can only be performed successfully if the model can make accurate predictions through this occlusion. These tasks are specified by selecting one starting pixel on the target object, a goal pixel location for the target object, and commanding the obstacle to remain stationary by selecting the same pixel on the obstacle for both start and goal.

We use four different object arrangements with two training objects and two objects that were not seen during training. We find that, in most cases, the SNA model is able to find a valid trajectory, while the DNA model, which is not able to handle occlusion, is mostly unable to find a solution. The results of our quantitative comparisons are shown in Table 1, indicating that temporal skip-connections indeed help with handling occlusion in combined pushing and obstacle avoidance tasks.

9.2 Evaluating Registration-Based Cost Functions

In this section we ask: How important is it to update the model's belief of where the target objects currently are? We first provide two qualitative examples: In examples (5)-(6) of Figure 12 the task is to bring the stuffed animal to a particular location in 3D space on the other side of the arena. To test the system's reaction to perturbations that could be encountered in open-world settings, during execution a person knocks the object out of the robot's hand (in the 3rd frame). The experiment shows that visual MPC is able to naturally perform a new grasp attempt and bring the object to the goal. This trajectory is easier to view in the supplementary video.

In Figure 15 in the appendix, the task is to push the bottle to the point marked with the green dot.

Figure 10: Object arrangement performance of our goal classifier with distractor objects and with two tasks. The left shows a subset of the 5 positive examples that are provided for inferring the goal classifier(s), while the right shows the robot executing the specified task(s) via visual planning.

In the beginning of the trajectory the object behaves differently than expected: it moves downwards instead of to the right. However, the system recovers from the initial failure and still pushes the object to the goal.

The next question we investigate is: How much does tracking the target object using the learned registration matter for short-horizon versus long-horizon tasks? In this experiment, we disable the gripper control, which requires the robot to push objects to the target. We compare two variants of updating the positions of the designated pixel when using a pixel-distance based cost function. The first is a cost function that uses our registration-based method, trained in a fully self-supervised fashion, and the second is a cost function that uses off-the-shelf tracking from OpenCV [48]. Additionally, we compare to visual MPC, which uses the video-prediction model's own prior predictions to update the current position of the designated pixel, rather than tracking the object with registration or tracking.

We evaluate our method on 20 long-distance and 15 short-distance pushing tasks. For long-distance tasks the initial distance between the object and its goal position is 30 cm, while for short-distance tasks it is 15 cm. Table 2 lists quantitative comparisons showing that on the long-distance experiment visual MPC using the registration-based cost not only outperforms prior work [7], but also outperforms the hand-designed, supervised object tracker [48]. By contrast, for the short-distance experiment, all methods perform comparably. Thus, these results demonstrate the importance of tracking the position of the target object for long-horizon tasks, while for short-horizon tasks object tracking appears to be irrelevant.

9.3 Evaluating Classifier-Based Cost Function

The goal of the classifier-based cost function is to provide an easy way to compute an objective for new tasks from a few observations of success for that task, so we compare our approach to alternative and prior methods for doing so under the same assumptions: pixel distance and latent space distance. In the latter, we measure the distance between the current and goal observations in a learned latent space, obtained by training an autoencoder (DSAE) [17] on the same data used for our classifier. Since we are considering a different form of task specification incompatible with user-specified pixels, we do not compare the classifier-based cost function to the cost function based on designated pixels.


Figure 11: Quantitative performance of visual planning for object rearrangement tasks across different goal specification methods: our meta-learned classifier, DSAE [17], and pixel error. Where possible, we break down the cause of failures into errors caused by inaccurate prediction or planning and those caused by an inaccurate goal classifier.

To collect data for meta-training the classifier, we randomly select a pair of objects from our set of training objects, and position them in many different relative positions, recording the image for each configuration. Each task corresponds to a particular relative positioning of two objects, e.g. the first object to the left of the second, and we construct positive and negative examples for each task by labeling the aforementioned images. We randomly position the arm in each image, as it is not a determiner of task success; a good classifier should ignore the position of the arm. We also include randomly-positioned distractor objects in about a third of the collected images.

We evaluate the classifier-based cost function in three different experimental settings. In the first setting, the goal is to arrange two objects into a specified relative arrangement. The second setting is the same, but with distractor objects present. In the final and most challenging setting, the goal is to achieve two tasks in sequence. We provide positive examples for both tasks, infer the classifier for both, perform MPC for the first task until completion, followed by MPC for the second task. The arrangements of the evaluation tasks were chosen among the eight principal directions (N, NE, E, SE, etc.). To evaluate the ability to generalize to new goals and settings, we use novel, held-out objects for all of the task and distractor objects in our evaluation.

We qualitatively visualize the tasks in Figure 10. On the left, we show a subset of the five images provided to illustrate the task(s), and on the right, we show the motions performed by the robot. We see that the robot is able to execute motions which lead to a correct relative positioning of the objects. We quantitatively evaluate the three cost functions across 20 tasks, including 10 unique object pairs. A task was considered successfully completed if more than half of the object was correctly positioned relative to the other. The results, shown in Figure 11, indicate that the distance-based metrics struggle to infer the goal of the task, while our approach leads to substantially more successful behavior on average.

9.4 Evaluating Multi-Task Performance

One of the key motivations for visual MPC is to build a system that can solve a wide variety of different tasks, involving completely different objects, physics, and objectives. Examples of tasks that can be solved with visual MPC are shown in Figures 1 and 12. Task 1 in Figure 1 shows a "placing task" where an object needs to be grasped and placed onto a plate without displacing the plate. Task 2 is an object rearrangement task. The example shown in Task 4 and all examples in Figure 10 show relative object rearrangement tasks. Examples 5 and 6 show the same 3D object positioning task from different views. In Task 7, the goal is to move the black object to the goal location while avoiding the obstacle in the middle, which is marked with a designated pixel and a goal pixel. We also demonstrate that visual MPC, without modifications to the algorithm, solves tasks involving deformable objects, such as a task where a towel needs to be wrapped around an object (Task 3), or folding a pair of shorts (Task 8). To the best of our knowledge, this is the first algorithm for robotic manipulation that handles both rigid and deformable objects. For a full illustration of each of these tasks, we encourage the reader to watch the supplementary video.

The generality of visual MPC mainly stems from two components: the generality of the visual dynamics model and the generality of the task definition. We found that the dynamics model often generalizes well to objects outside of the training set, if they have similar properties to the objects it was trained with. For example, Task 8 in Figure 12 shows the model predicting a pair of shorts being folded. We observed that a model which was only provided videos of towels during training generalized to shorts, although it had never seen them before. In all of the qualitative examples, the predictions are performed by the same model. We found that the model sometimes exhibits confusion about whether an object follows the dynamics of a cloth or a rigid object, which is likely caused by a lack of training data in the particular regime. To overcome this issue, we add a binary token to the state vector indicating whether the object in the bin is hard or soft. We expect that adding more training data would remove the need for this indicator and allow the model to infer material properties directly from images.
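A minimal sketch of the indicator described above, assuming the robot state is a flat vector to which a hard/soft flag is appended before being fed to the predictor (the state layout is an assumption made for illustration):

import numpy as np

def augment_state(state, object_is_soft):
    # state: (D,) robot state, e.g. end-effector pose and gripper value;
    # object_is_soft: True for cloth-like objects, False for rigid ones.
    token = np.array([1.0 if object_is_soft else 0.0])
    return np.concatenate([state, token])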

The ability to specify tasks in multiple different ways adds to the flexibility of the proposed system. Using designated pixels, object positioning tasks can be defined in 3D space, as shown in Tasks 1 and 2 in Figure 1 and Tasks 5 and 6 in Figure 12. When adding a goal image, the positioning accuracy can be improved by utilizing the registration scheme discussed in Section 5.2. For tasks where we care about relative rather than absolute positioning, a meta-learned classifier can be used, as discussed in Section 5.3.

Next, we present a quantitative evaluation to answer the following question: how does visual MPC compare to a hand-engineered baseline on a large number of diverse tasks? For this comparison, we engineered a simple trajectory generator to perform a grasp at the location of the initial designated pixel, lift the arm, and bring it to the position of the goal pixel. Camera calibration was performed to carry out the necessary conversions between image-space and robot work-space coordinates, which was not required for our visual MPC method. For simplicity, the baseline controller executes in open loop. Therefore, to allow for a fair comparison, visual MPC is also executed open-loop, i.e. no registration or tracking is used. Altogether we selected 16 tasks, including the qualitative examples presented earlier.
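The baseline can be summarized by a sketch like the following (illustrative Python, not the exact controller used in the experiments); pixel_to_workspace stands in for the camera-calibration conversion, and the waypoint format is assumed.

def baseline_trajectory(designated_pixel, goal_pixel, pixel_to_workspace,
                        lift_height=0.15):
    # Convert pixel coordinates to workspace coordinates via calibration.
    x0, y0, z_table = pixel_to_workspace(designated_pixel)
    x1, y1, _ = pixel_to_workspace(goal_pixel)
    # Open-loop waypoint sequence: grasp at the object, lift, move to goal.
    return [
        {"xyz": (x0, y0, lift_height), "gripper": "open"},   # move above object
        {"xyz": (x0, y0, z_table),     "gripper": "open"},   # descend
        {"xyz": (x0, y0, z_table),     "gripper": "close"},  # grasp
        {"xyz": (x0, y0, lift_height), "gripper": "close"},  # lift
        {"xyz": (x1, y1, lift_height), "gripper": "close"},  # carry to goal
        {"xyz": (x1, y1, z_table),     "gripper": "open"},   # release
    ]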



Figure 12: Visual MPC successfully solves a wide variety of tasks including multi-objective tasks, such as placing an object on a plate (rows 5 and 6), object positioning with obstacle avoidance (row 7) and folding shorts (row 8). Zoom in on PDF.

                                 % of Trials with Final Pixel Distance < 15
Visual MPC                       75%
Calibrated Camera Baseline       18.75%

Table 3: Results for a multi-task experiment of 10 hard object pushing and grasping tasks, along with 6 cloth folding tasks, evaluated using a single model. Values indicate the percentage of trials that ended with the object pixel closer than 15 pixels to the designated goal. Higher is better.

The quantitative comparison is shown in Table 3, illustrating that visual MPC substantially outperforms this baseline. Visual MPC succeeded for most of the tasks. While the baseline succeeded for some of the cloth folding tasks, it failed for almost all of the object relocation tasks. This indicates that an implicit understanding of physics, as captured by our video prediction models, is indeed essential for performing this diverse range of object relocation and manipulation tasks, and the model must perform non-trivial physical reasoning beyond simply placing and moving the end-effector.

9.5 Discussion of Experimental Results

Generalization to many distinct tasks in visually diverse settings is arguably one of the biggest challenges in reinforcement learning and robotics today. While deep learning has relieved us from much of the problem-specific engineering, most works either require extensive amounts of labeled data or focus on the mastery of single tasks while relying on human-provided reward signals. From the experiments with visual MPC, especially the qualitative examples and the multi-task experiment, we can conclude that visual MPC generalizes to a wide range of tasks it has never seen during training. This is in contrast to many model-free approaches for robotic control, which often struggle to perform well on novel tasks. Most of the generalization performance is likely a result of large-scale self-supervised learning, which makes it possible to acquire a rich, task-agnostic dynamics model of the environment.

10 CONCLUSION

We presented an algorithm that leverages self-supervision from visual prediction to learn a deep dynamics model on images, and showed how it can be embedded into a planning framework to solve a variety of robotic control tasks. We demonstrated that visual model-predictive control is able to successfully perform multi-object manipulation, pushing, picking and placing, and cloth-folding tasks, all within a single framework.

Limitations. The main limitation of the presented framework is that all target objects need to be visible throughout execution; it is currently not possible to handle partially observed domains. This is especially important for tasks that require objects to be brought into occlusion (or taken out of occlusion), for example putting an object in a box and closing it. Another limitation is that the tasks are still of only medium duration and usually involve only one or two objects. Longer-term planning remains an open problem. Lastly, the fidelity of object positioning is still significantly below what humans can achieve.

Possible future directions. The key advantage of a model-based deep reinforcement learning algorithm like visual MPC is that it generalizes to tasks it has never encountered before. This makes visual MPC a good candidate for a building block of future robotic manipulation systems that will be able to solve an even wider range of complex tasks with much longer horizons.

REFERENCES

[1] G. Tesauro, "Temporal difference learning and TD-Gammon," Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.

[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv:1312.5602, 2013.

[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end learning of deep visuomotor policies," Journal of Machine Learning Research (JMLR), 2016.

[4] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv:1712.01815, 2017.

[5] A. Bubic, D. Y. Von Cramon, and R. Schubotz, "Prediction, cognition and the brain," Frontiers in Human Neuroscience, 2010.

[6] C. Finn and S. Levine, "Deep visual foresight for planning robot motion," in International Conference on Robotics and Automation (ICRA), 2017.

[7] F. Ebert, C. Finn, A. X. Lee, and S. Levine, "Self-supervised visual planning with temporal skip connections," Conference on Robot Learning (CoRL), 2017.

[8] F. Ebert, S. Dasari, A. X. Lee, S. Levine, and C. Finn, "Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning," Conference on Robot Learning (CoRL), 2018.

[9] A. Xie, A. Singh, S. Levine, and C. Finn, "Few-shot goal inference for visuomotor learning and planning," Conference on Robot Learning (CoRL), 2018.

[10] M. P. Deisenroth, G. Neumann, J. Peters et al., "A survey on policy search for robotics," Foundations and Trends® in Robotics, 2013.

[11] M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in International Conference on Machine Learning (ICML), 2011.

[12] I. Lenz and A. Saxena, "DeepMPC: Learning deep latent features for model predictive control," in Robotics: Science and Systems (RSS), 2015.



[13] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, "Embed to control: A locally linear latent dynamics model for control from raw images," in Neural Information Processing Systems, 2015.

[14] A. Dosovitskiy and V. Koltun, "Learning to act by predicting the future," International Conference on Learning Representations (ICLR), 2017.

[15] D. Ha and J. Schmidhuber, "World models," arXiv:1803.10122, 2018.

[16] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, "Action-conditional video prediction using deep networks in Atari games," in Neural Information Processing Systems, 2015.

[17] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, "Deep spatial autoencoders for visuomotor learning," in International Conference on Robotics and Automation (ICRA), 2016.

[18] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. J. Johnson, and S. Levine, "SOLAR: Deep structured latent representations for model-based reinforcement learning," arXiv:1808.09105, 2018.

[19] A. Byravan and D. Fox, "SE3-Nets: Learning rigid body motion using deep neural networks," arXiv:1606.02378, 2016.

[20] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine, "Learning to poke by poking: Experiential learning of intuitive physics," in Neural Information Processing Systems, 2016.

[21] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine, "Combining self-supervised learning and imitation for vision-based rope manipulation," in International Conference on Robotics and Automation (ICRA), 2017.

[22] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," in International Conference on Robotics and Automation (ICRA), 2016.

[23] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," International Journal of Robotics Research (IJRR), 2016.

[24] R. Calandra, A. Owens, M. Upadhyaya, W. Yuan, J. Lin, E. H. Adelson, and S. Levine, "The feeling of success: Does touch sensing help predict grasp outcomes?" arXiv:1710.05512, 2017.

[25] L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, and A. Gupta, "The curious robot: Learning visual representations via physical interactions," in European Conference on Computer Vision, 2016.

[26] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser, "Learning synergies between pushing and grasping with self-supervised deep reinforcement learning," arXiv:1803.09956, 2018.

[27] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine, "Uncertainty-aware reinforcement learning for collision avoidance," arXiv:1702.01182, 2017.

[28] D. Gandhi, L. Pinto, and A. Gupta, "Learning to fly by crashing," arXiv:1704.05588, 2017.

[29] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed, "Recurrent environment simulators," arXiv:1704.02254, 2017.

[30] B. Boots, A. Byravan, and D. Fox, "Learning predictive models of a depth camera & manipulator from raw execution traces," in International Conference on Robotics and Automation (ICRA), 2014.

[31] C. Finn, I. Goodfellow, and S. Levine, "Unsupervised learning for physical interaction through video prediction," in Neural Information Processing Systems (NIPS), 2016.

[32] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, "Video pixel networks," arXiv:1610.00527, 2016.

[33] M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," International Conference on Learning Representations (ICLR), 2016.

[34] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, "Stochastic variational video prediction," International Conference on Learning Representations (ICLR), 2018.

[35] T. Kurutach, A. Tamar, G. Yang, S. Russell, and P. Abbeel, "Learning plannable representations with causal InfoGAN," arXiv:1807.09341, 2018.

[36] S. Reed, A. v. d. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, D. Belov, and N. de Freitas, "Parallel multiscale autoregressive density estimation," arXiv:1703.03664, 2017.

[37] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Neural Information Processing Systems, 2015.

[38] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," in Neural Information Processing Systems, 2016.

[39] W. Lotter, G. Kreiman, and D. Cox, "Deep predictive coding networks for video prediction and unsupervised learning," International Conference on Learning Representations (ICLR), 2017.

[40] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool, "Dynamic filter networks," in Neural Information Processing Systems (NIPS), 2016.

[41] J. T. Betts, "Survey of numerical methods for trajectory optimization," Journal of Guidance, Control, and Dynamics, 1998.

[42] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, "View synthesis by appearance flow," in European Conference on Computer Vision, 2016.

[43] S. Meister, J. Hur, and S. Roth, "UnFlow: Unsupervised learning of optical flow with a bidirectional census loss," arXiv:1711.07837, 2017.

[44] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," International Conference on Machine Learning (ICML), 2017.

[45] E. Grant, C. Finn, J. Peterson, J. Abbott, S. Levine, T. Griffiths, and T. Darrell, "Concept acquisition via meta-learning: Few-shot learning from positive examples," in NIPS Workshop on Cognitively-Informed Artificial Intelligence, 2017.

[46] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.

[47] D. Sherer, "Fetal grasping at 16 weeks' gestation," Journal of Ultrasound in Medicine, 1993.

[48] B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Computer Vision and Pattern Recognition (CVPR). IEEE, 2009.

[49] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine, "Stochastic adversarial video prediction," arXiv:1804.01523, 2018.

[50] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv:1607.08022, 2016.

[51] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016.

[52] S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive separable convolution," in International Conference on Computer Vision (ICCV), 2017.

[53] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in International Conference on Intelligent Robots and Systems (IROS), 2012.

Frederik Ebert received a BS in Mechatronics and Information Technology as well as an MS in "Robotics, Cognition, Intelligence (RCI)" from the Technical University of Munich (TUM). He is currently a PhD student at Berkeley Artificial Intelligence Research (BAIR), where he focuses on developing algorithms for robotic manipulation combining ideas from computer vision, machine learning, and control.

Chelsea Finn is a research scientist at Google and a post-doctoral scholar at UC Berkeley, and will join the Computer Science faculty at Stanford in 2019. Her research focuses on algorithms that can enable agents to autonomously learn a range of complex skills. She received a BS in Electrical Engineering and Computer Science from MIT and a PhD in Computer Science from UC Berkeley.

Sudeep Dasari is a 4th year student at UC Berkeley pursuing a BS in Electrical Engineering and Computer Science. His primary research interests are computer vision, machine learning, and robotic control.

Annie Xie is pursuing a BS degree in Electrical Engineering and Computer Science at UC Berkeley. Her research interests are in the areas of computer vision and robot learning.

Alex Lee received a BS in Electrical Engineering and Computer Science from UC Berkeley and is currently pursuing a PhD in Computer Science at UC Berkeley. His work focuses on algorithms that can enable robots to learn complex sensorimotor skills.

Sergey Levine received a BS, MS, and PhD in Computer Science from Stanford. He is currently on the faculty of the Department of Electrical Engineering and Computer Sciences at UC Berkeley. His work focuses on machine learning for decision making and control, with an emphasis on deep learning and reinforcement learning algorithms.



APPENDIX A
SKIP CONNECTION NEURAL ADVECTION MODEL

Our video prediction model, shown in Figure 3, is inspired by the dynamic neural advection (DNA) model proposed by [31] and is a variant of the SNA model proposed by [7]. The model uses a convolutional LSTM [37] to predict future frames. The prediction is initialized with an initial sequence of 2 ground truth frames, and predicts 13 future frames. The model predicts these frames by iteratively making next-frame predictions and feeding those predictions back to itself. Each predicted frame is given by a compositing layer, which composes intermediate frames with predicted compositing masks. The intermediate frames include the previous 2 frames and transformed versions of them. To easily allow various transformations for different parts of the image, we predict 8 transformed frames, 4 of which are transformed from the previous frame, and the other 4 from the frame 2 steps in the past. These intermediate frames are then combined with compositing masks, which are also predicted by the convolutional LSTM. For simplicity, we collectively refer to these transformations as a single transformation Ft+1←t in Equation 2. In addition, the first frame of the sequence is also given as one of the intermediate frames [7].
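A minimal numpy sketch of the compositing step described above, under assumed shapes (an illustration only, not the authors' implementation):

import numpy as np

def composite_frame(transformed, prev_frames, first_frame, masks):
    # transformed: list of 8 arrays (H, W, 3) produced by the predicted
    # transformations; prev_frames: the 2 previous frames; first_frame:
    # frame 0 of the sequence; masks: array (K, H, W) with
    # K = len(transformed) + 3, assumed normalized to sum to 1 per pixel.
    intermediates = list(transformed) + list(prev_frames) + [first_frame]
    assert masks.shape[0] == len(intermediates)
    out = np.zeros_like(first_frame, dtype=np.float64)
    for frame, mask in zip(intermediates, masks):
        out += mask[..., None] * frame   # weight each intermediate frame
    return out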

To enable action conditioning, the robot action at each time step is passed to all the convolutional layers of the main network, by concatenating it along the channel dimension of the inputs of these layers. Since they are vectors with no spatial dimensions, they are replicated spatially to match the spatial dimensions of the inputs.
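The following sketch illustrates this concatenation scheme with numpy, assuming channels-last feature maps (the layout is an assumption for illustration):

import numpy as np

def concat_action(feature_map, action):
    # feature_map: (H, W, C); action: (A,). Returns (H, W, C + A).
    action = np.asarray(action)
    H, W, _ = feature_map.shape
    # Replicate the action vector over the spatial dimensions.
    tiled = np.broadcast_to(action, (H, W, action.shape[0]))
    return np.concatenate([feature_map, tiled], axis=-1)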

The SNA variant that we use incorporates the architectural improvements proposed by [49]. Each convolutional layer is followed by instance normalization [50] and ReLU activations. We also use instance normalization on the LSTM pre-activations (i.e., the input, forget, and output gates, as well as the transformed and next cell of the LSTM). In addition, we modify the spatial downsampling and upsampling mechanisms. Standard subsampling and upsampling between convolutions is known to produce artifacts for dense image generation tasks [51], [52]. In the encoding layers, we reduce the spatial resolution of the feature maps by average pooling, and in the decoding layers, we increase the resolution by using bilinear interpolation. All convolutions in the generator use a stride of 1. The actions are concatenated to the inputs of all the convolutional layers of the main network, as opposed to only the bottleneck.
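For concreteness, the sketch below shows the two resampling operations in plain numpy: 2x2 average pooling for the encoder and bilinear interpolation for the decoder. It illustrates the operations only, not the network code.

import numpy as np

def avg_pool_2x2(x):
    # x: (H, W, C) with even H and W; 2x2 average pooling.
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def bilinear_upsample_2x(x):
    # x: (H, W, C); returns (2H, 2W, C) via separable linear interpolation.
    H, W, C = x.shape
    rows = np.linspace(0, H - 1, 2 * H)
    cols = np.linspace(0, W - 1, 2 * W)
    r0, c0 = np.floor(rows).astype(int), np.floor(cols).astype(int)
    r1, c1 = np.minimum(r0 + 1, H - 1), np.minimum(c0 + 1, W - 1)
    wr = (rows - r0)[:, None, None]
    wc = (cols - c0)[None, :, None]
    top = (1 - wc) * x[r0][:, c0] + wc * x[r0][:, c1]
    bot = (1 - wc) * x[r1][:, c0] + wc * x[r1][:, c1]
    return (1 - wr) * top + wr * bot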

Figure 13: The blue dot indicates the designated pixel.

We provide an example of the skip connection neural advection (SNA) model recovering from occlusion in Figure 14. In this figure, the arm is predicted to move in front of the designated pixel, marked in blue in Figure 13. The predictions of the DNA model, shown in Figure 14(b), contain incorrect motion of the marked object, as shown in the heatmaps visualizing Pd(t), although the arm actually passes in front of it. This is because the DNA model cannot recover information about an object that it has 'overwritten' during its predictions.

(a) Skip connection neural advection (SNA) does not erase or move objects in the background

(b) Standard DNA [6] exhibits undesirable movement of the distribution Pd(t) and erases the background

Figure 14: Top rows: predicted images of the arm moving in front of the green object with the designated pixel (as indicated in Figure 13). Bottom rows: predicted probability distributions Pd(t) of the designated pixel obtained by repeatedly applying transformations.

This causes the model to predict that the pixel moves with the arm. We identified this as one of the major causes of planning failure when using the DNA model. By contrast, our SNA model predicts that the occluded object will not move, as shown in Figure 14(a).

APPENDIX B
IMPROVED ACTION SAMPLING DISTRIBUTIONS FOR DATA COLLECTION

In order to collect meaningful interaction data for learning to fold deformable objects such as towels and cloth, we found that the grasping primitive can be slightly adapted to increase the likelihood of encountering states where cloths are actually folded. When using actions sampled from a simple distribution or the previously-described distribution, clothing tends to become tangled up. To improve the efficiency of folding cloths, we use an action primitive similar to the grasping primitive, but we additionally reduce lateral motion of the end-effector when the gripper is close to the table, thus reducing events where cloths become tangled.
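A hedged sketch of such a primitive is given below; the height threshold, damping factor, and noise scales are illustrative values, not the ones used for data collection.

import numpy as np

def sample_cloth_action(rng, gripper_z, z_close=0.03, lateral_scale=0.2,
                        base_std=(0.04, 0.04, 0.02)):
    # Sample a relative (dx, dy, dz) end-effector motion.
    action = rng.normal(0.0, base_std)
    if gripper_z < z_close:          # gripper is close to the table surface
        action[:2] *= lateral_scale  # damp lateral motion to avoid tangling cloth
    return action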

APPENDIX C
IMPROVEMENTS OF THE CEM OPTIMIZER

In the model-predictive control setting, the action sequences found by the optimizer can be very different between consecutive real-world time steps. For example, at one time step the optimizer might find a pushing action leading towards the goal, and in the next time step it determines a grasping action to be optimal to reach the goal. Naive replanning at every time step can then result in alternating between a pushing and a grasping attempt indefinitely, causing the agent to get stuck and not make any progress towards the goal.

We can resolve this problem by modifying the sampling distribution of the first iteration of CEM so that the optimizer commits to the plan found in the previous time step. In the simplest version of CEM, the sampling distribution at the first iteration is chosen to be a Gaussian with a diagonal covariance matrix and zero mean.



Figure 15: Applying our method to a pushing task. In the first 3 time instants the object behaves unexpectedly, moving down. The tracking then allows the robot to retry, eventually bringing the object to the goal.

Figure 16: Retrying behavior of our method combining prehensile and non-prehensile manipulation. In the first 4 time instants shown, the robot pushes the object. It then loses the object and decides to grasp it, pulling it all the way to the goal. Retrying is enabled by applying the learned registration to both camera views (here we only show the front view).

Figure 17: Left: task setup with a green dot marking the obstacle. Right, first row: the predicted frames generated by SNA. Second row: the probability distribution of the designated pixel on the moving object (brown stuffed animal). Note that part of the distribution shifts down and to the left, which is the direction of the indicated goal. Third row: the probability distribution of the designated pixel on the obstacle object (blue power drill). Although the distribution increases in entropy during the occlusion (in the middle), it then recovers and remains at its original position.

We instead use the best action sequence found in the optimization of the previous real-world time step as the mean for sampling new actions in the current real-world time step. Since this action sequence is optimized for the previous time step, we only use the values a2:T and omit the first action. To sample actions close to the action sequence from the previous time step, we reduce the entries of the diagonal covariance matrix for the first T − 1 time steps. It is crucial that the last entry of the covariance matrix at the end of the horizon is not reduced; otherwise, no exploration could happen for the last time step, causing poor performance at later time steps.
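The warm start can be summarized by the following sketch (assumed variable names, illustrative only): the previous best plan is shifted by one step and used as the sampling mean, the variance is shrunk for the first T − 1 steps, and the final step keeps its full exploration variance.

import numpy as np

def warm_start_distribution(prev_best_actions, base_var, shrink=0.25):
    # prev_best_actions: (T, A) best plan from the previous real-world step;
    # base_var: (A,) per-dimension exploration variance. Returns the mean and
    # diagonal variance used for the first CEM iteration of the current step.
    T, A = prev_best_actions.shape
    mean = np.zeros((T, A))
    mean[:-1] = prev_best_actions[1:]   # drop a_1, keep a_2..a_T
    var = np.tile(base_var, (T, 1))
    var[:-1] *= shrink                  # commit to the previously found plan
    # var[-1] keeps base_var: exploration remains possible at the horizon end.
    return mean, var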

APPENDIX D
EXPERIMENTAL SETUP

To train both our video-prediction and registration models, we collected 20,000 trajectories of pushing motions and 15,000 trajectories with gripper control, where the robot randomly picks and moves objects using the grasping reflex described in Section 7. The data collection process is fully autonomous, requiring human intervention only to change out the objects in front of the robot. The action space consisted of relative movements of the end-effector in Cartesian space along the x, y, and z axes, and for some parts of the dataset we also added azimuthal rotations of the gripper and a grasping action.
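For illustration, a random action under this action space might be sampled as follows; the ranges and the dictionary format are assumptions and do not reflect the exact limits used during data collection.

import numpy as np

def sample_random_action(rng, with_rotation_and_grasp=False):
    # Relative Cartesian motion of the end-effector.
    action = {"dx": rng.uniform(-0.05, 0.05),
              "dy": rng.uniform(-0.05, 0.05),
              "dz": rng.uniform(-0.05, 0.05)}
    if with_rotation_and_grasp:
        action["dtheta"] = rng.uniform(-np.pi / 4, np.pi / 4)  # azimuthal rotation
        action["grasp"] = float(rng.random() < 0.5)            # open/close gripper
    return action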

APPENDIX E
EXPERIMENTAL EVALUATION

Figure 17 shows an example of the SNA model successfully predicting the position of the obstacle through an occlusion and finding a trajectory that avoids the obstacle.

APPENDIX F
SIMULATED EXPERIMENTS

In order to provide a more controlled comparison, we also set up a realistic simulation environment using MuJoCo [53], which includes a robotic manipulator controlled via Cartesian position control, similar to our real-world setup, pushing randomly-generated L-shaped objects with random colors (see details in the supplementary materials). We trained the same video prediction model in this environment and set up 50 evaluation tasks where blocks must be pushed to target locations with maximum episode lengths of 120 steps. We compare our proposed registration-based method, "predictor propagation," and ground-truth registration obtained from the simulator, which provides an oracle upper bound on registration performance.