
Deep Forward and Inverse Perceptual Models for Tracking and Prediction

Alexander Lambert, Amirreza Shaban, Zhen Liu and Byron Boots
Institute for Robotics & Intelligent Machines

Georgia Institute of Technology, Atlanta, GA, USA
{alambert6, amirreza, liuzhen1994}@gatech.edu, [email protected]

Abstract—We present a non-parametric perceptual model for generating video frames with deep networks, and provide a framework for its use in tracking and prediction tasks on a real robotic system. This is shown to greatly outperform standard deconvolutional methods for image generation, producing clear, photo-realistic images. For tracking, we incorporate the sensor model into an Extended Kalman Filter and estimate robot trajectories. As a comparison, we introduce a secondary framework consisting of a discriminative, inverse model for state estimation, and compare this approach to the one based on the generative model.

I. INTRODUCTION & RELATED WORK

Several fundamental problems in robotics, such as state estimation, prediction, and motion planning, rely on accurate models that can map state to measurements (forward models) or measurements to state (inverse models). Classic examples include the measurement models for global positioning systems, inertial measurement units, or beam sensors that are frequently used in simultaneous localization and mapping [16], or the forward and inverse kinematic models that map joint configurations to workspace and vice-versa. Some of these models can be very difficult to derive analytically, and, in these cases, roboticists have often resorted to machine learning to infer accurate models directly from data. For example, complex nonlinear forward kinematics have been modeled with techniques as diverse as Bayesian networks [3] and Bezier splines [17], and many researchers have tackled the problem of learning inverse kinematics with nonparametric methods like locally weighted projection regression (LWPR) [18, 6], mixtures of experts [2], and Gaussian process regression [13]. While these techniques are able to learn accurate models, they rely heavily on prior knowledge about the kinematic relationship between the robot state space and workspace.

Despite the important role that forward and inverse models have played in robotics, there has been little progress in defining these models for very high-dimensional sensor data like images and video. This has been disappointing: cameras are a cheap, reliable source of information about the robot and its environment, but the precise relationship between a robot pose or configuration, the environment, and the generated image is extremely complex. A possible solution to this problem is to learn a forward model that directly maps the robot pose or configuration to high-dimensional perceptual space, or an inverse model that maps new images to the robot pose or configuration. Given these learned models, one could directly and accurately perform a range of important tasks including state estimation, prediction, and motion planning.

In this paper, we explore the idea of directly learning forward and inverse perceptual models that relate high-dimensional images to low-dimensional robot state. In particular, we use deep neural networks to learn both forward and inverse perceptual models for a camera pointed at the manipulation space of a Barrett WAM arm. While recent work on convolutional neural networks (CNNs) provides a fairly straightforward framework for learning inverse models that can map images to robot configurations, learning accurate generative (forward) models remains a challenge.

Generative neural networks have recently shown much promise in addressing the problem of mapping low-dimensional encodings to a high-dimensional pixel space [4, 12, 8, 14]. The generative capacity of these approaches is heavily dependent on learning a strictly parametric model to map input vectors to images. Using deconvolutional networks for learning controllable, kinematic transformations of objects has previously been demonstrated, as in [5, 15]. However, these models have difficulty reproducing clear images with matching textures, and have mainly been investigated on affine transformations of simulated objects.

Learning to predict frames has also been explored for two-dimensional robot manipulation tasks. In [7], the authors propose using an LSTM-based network to predict next-frame images, given the current frame and state-action pair. In order to model pixel transformations, the authors make use of composited convolutions with either unconstrained or affine kernels. The generated image frames produce some semblance of linear motion in the scene, but do not manage to replicate multi-degree-of-freedom dynamics. Since forward prediction is conducted by recursively feeding predicted frames back as input, the error compounds and the quality of predictions quickly degrades over future timesteps.

Instead of generating images directly after applying transformations in a low-dimensional encoding, another approach is to learn a transformation in the high-dimensional space of the output. One can then re-use pixel information from observed images to reconstruct new views of the same scene, as proposed in [19]. Here, the authors learn a model to generate a flow-field transformation from an input image-pose pair derived from synthetic data. This is then applied to a reference frame, effectively rotating the original image to a previously unseen viewpoint. Using confidence masks to combine multiple flow fields generated from various viewpoints is also proposed. However, selection of an additional reference frame is based on random local sampling.

In the current work, we propose deep forward and inverse perceptual models for a camera pointed at the manipulation space of a Barrett WAM arm. The forward model maps a 4-dimensional arm configuration to a (256×256×3)-dimensional RGB image, and the inverse model maps (256×256×3)-dimensional images to 4-dimensional configurations. A major contribution of this work is a new forward model that extends the image-warping model in [19] with a non-parametric reference-frame selection component. We show that this model can generate sharp, near photo-realistic images from never-before-seen 4-dimensional arm configurations and greatly outperforms state-of-the-art deconvolutional networks [5, 15]. We also show how to design a convolutional inverse model, and use the two models in tandem for tracking the configuration of the robot from an unknown initial configuration and for predicting future, never-before-seen images on the real robot.

II. SENSOR MODEL DESIGN

Tracking and prediction are important tasks in robotics. Consider the problem of tracking state and predicting future images, illustrated in Fig. 1. Given a history of RGB-image observations $O_t = (o_i)_{i=0}^{t}$, where $o \in \mathbb{R}^m$, we wish to track the state of the system $x \in \mathbb{R}^n$ for the corresponding sequence $X_t = (x_i)_{i=0}^{t}$. Additionally, given a designated goal image $o_T$, we wish to define a target end state $x_T$. Both of these state-estimation objectives can be accomplished with an inverse sensor model, a parametrized function $g : \mathbb{R}^m \mapsto \mathbb{R}^n$. Having mapped the planning problem into joint space, we can perform online motion planning to acquire a sequence of expected states $\mathcal{X}_t = (x_i)_{i=t+1}^{T}$. Classically, model-based motion planning would rely on cost functions computed in simulation. However, we can translate a particular motion plan into observation space using a forward sensor model, allowing future observations to be predicted given a sequence of commands and a trajectory. This also permits a cost function to be defined in observation space, which can subsequently be leveraged by the motion planner to ensure trajectories are compatible with the real-world scene.

A. Forward Sensor Model

We use a generative observation model $g$ to perform a mapping from joint-space values $x_i$ to pixel-space values $o_i$ for a trajectory sequence of future states $\mathcal{X}_t$. This is performed using image-warping layers as proposed in [19]. Given an input $x_i$ and a reference state-image pair $(x_r, o_r)$, we learn to predict the pixel flow field $\vec{h}(x_i, x_r)$ from the reference image $o_r$ to the image $o_i$. The final prediction is made by warping the reference image $o_r$ with the predicted flow field $\vec{h}$:

$g(x_i) = \mathrm{warp}(o_r, \vec{h}). \qquad (1)$

As long as there is a high correlation between the visual appearance of the reference image $o_r$ and the output image $o_i$, warping the reference image results in higher-fidelity predictions when compared to models which map input directly to RGB values [15].
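As an illustrative sketch of Eq. (1) (this is our own code, not the authors' Caffe implementation), the function below warps a reference image with a dense flow field using differentiable bilinear sampling in PyTorch. The flow convention (per-pixel (dx, dy) displacements into the reference image) and the function name are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def warp(o_r, flow):
    """Warp a reference image o_r (N, 3, H, W) with a flow field 'flow'
    (N, 2, H, W), interpreted here as per-pixel (dx, dy) displacements
    into the reference image, using differentiable bilinear sampling
    (in the spirit of the spatial-transformer sampler of [9])."""
    N, _, H, W = o_r.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=o_r.dtype),
                            torch.arange(W, dtype=o_r.dtype),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=-1).to(o_r.device)    # (H, W, 2) identity grid
    src = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)     # source pixel coordinates
    # grid_sample expects normalized coordinates in [-1, 1], ordered (x, y).
    src_x = 2.0 * src[..., 0] / (W - 1) - 1.0
    src_y = 2.0 * src[..., 1] / (H - 1) - 1.0
    grid = torch.stack((src_x, src_y), dim=-1)             # (N, H, W, 2)
    return F.grid_sample(o_r, grid, align_corners=True)    # g(x_i) = warp(o_r, h)
```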

We propose a variation of this idea that is extremely effective in practice. Our network is composed of two parts: 1) a non-parametric component, where given a state value $x_i$, a reference pair $(x_r, o_r)$ is found such that $o_r$ has an appearance similar to the output image $o_i$; and 2) a parametric component, where given the same state value $x_i$ and the reference pair $(x_r, o_r)$, the inverse flow field $\vec{h}$ is computed and used to warp the reference image to produce the output image $o_i$. The overall architecture is shown in Fig. 2.

For the non-parametric component, we take a subset of the training set $F_p \subseteq F$, consisting of configuration-image pairs, and store the data in a KD-tree. We can quickly find a reference pair by searching for the training pair closest to the current joint configuration $x_i$ in configuration space. Since the state space is low-dimensional, we can find this nearest neighbor very efficiently. To prevent over-fitting, we randomly sample one of the k nearest neighbors during the training phase.
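A minimal sketch of this reference lookup, assuming SciPy's KD-tree and our own class and method names (the exact data structures used in the paper are not specified beyond "KD-tree"):

```python
import numpy as np
from scipy.spatial import cKDTree

class ReferenceSelector:
    """Non-parametric reference lookup: store (state, image) pairs from a
    training subset F_p in a KD-tree over the low-dimensional states."""

    def __init__(self, states, images, k=5):
        self.tree = cKDTree(states)       # states: (N, 4) joint configurations
        self.states = np.asarray(states)
        self.images = images              # N corresponding reference images
        self.k = k

    def query(self, x, training=False):
        # At training time, sample one of the k nearest neighbors to
        # discourage over-fitting to a single reference pair.
        _, idx = self.tree.query(x, k=self.k if training else 1)
        idx = np.atleast_1d(idx)
        i = int(np.random.choice(idx)) if training else int(idx[0])
        return self.states[i], self.images[i]
```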

For the parametric component, we use the deep neural network architecture depicted in Figure 2. Given a current and reference state vector as input, the network is trained to produce a flow tensor to warp the reference image (similarly to [19]). The input is first mapped to a high-dimensional space using six fully-connected layers. The resulting vector is reshaped into a tensor of size 256×8×8. The spatial resolution is increased from $8 \times 8$ to $2^8 \times 2^8$ using 5 deconvolution layers. Following the deconvolutions with convolutional layers was found to qualitatively improve the output images. Finally, we use the predicted flow field of size 2×256×256 to warp the reference image. To allow end-to-end training, we used a differentiable image sampling method with a bilinear interpolation kernel in the warping layer [9]. Using an L2 prediction loss, the final optimization objective is defined as follows:

$\min_{W} \; L(W) = \sum_i \| o_i - \mathrm{warp}(o_r, \vec{h}_W(x_i, x_r)) \|^2 + \lambda \| W \|^2, \qquad (2)$

in which $W$ contains the network parameters and $\lambda$ is the regularization coefficient. Training was conducted using the Caffe [10] library, which we extended to support the warping layer. The ADAM optimization algorithm [11] was used as the solver.
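The training objective of Eq. (2) can be sketched as a single optimization step; `flow_net` is a hypothetical module standing in for the fully-connected/deconvolutional flow predictor, `warp` is the bilinear warping function sketched earlier, and the hyperparameter values are placeholders:

```python
import torch

def training_step(flow_net, optimizer, x_i, x_r, o_r, o_i, lam=1e-4):
    """One optimization step on the objective of Eq. (2). flow_net(x_i, x_r)
    is assumed to return an (N, 2, 256, 256) flow tensor; lam plays the
    role of lambda as explicit L2 regularization on the weights."""
    optimizer.zero_grad()
    flow = flow_net(x_i, x_r)
    o_hat = warp(o_r, flow)                                   # predicted image
    loss = (o_hat - o_i).pow(2).sum()                         # L2 prediction loss
    loss = loss + lam * sum(w.pow(2).sum() for w in flow_net.parameters())
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.Adam(flow_net.parameters(), lr=1e-4)   # ADAM [11]
```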

The idea of warping the nearest neighbor can be extended to warping an ensemble of k-nearest-neighbor reference images, with each neighbor contributing separately to the prediction. In addition to the flow field, each network in the ensemble also predicts a 256×256 confidence map (as shown in Fig. 2). We use these confidence maps to compute a weighted sum of the different predictions in the ensemble and obtain the final prediction.
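A sketch of this confidence-weighted combination, assuming each hypothetical `branch` returns a flow field and a confidence map, and using a per-pixel softmax over the ensemble (consistent with the softmax mask mentioned in Section III-C):

```python
import torch
import torch.nn.functional as F

def ensemble_predict(branches, x_i, references):
    """Combine the warped outputs of k forward branches, one per nearest
    neighbor. Each branch returns (flow, confidence) with confidence of
    shape (N, 1, 256, 256); confidences become per-pixel weights via a
    softmax over the ensemble dimension."""
    warped, confs = [], []
    for branch, (x_r, o_r) in zip(branches, references):
        flow, conf = branch(x_i, x_r)
        warped.append(warp(o_r, flow))        # warp() from the earlier sketch
        confs.append(conf)
    weights = F.softmax(torch.stack(confs, dim=0), dim=0)     # (k, N, 1, H, W)
    return (weights * torch.stack(warped, dim=0)).sum(dim=0)  # weighted average
```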


Fig. 1. The proposed framework makes use of an inverse sensor model (blue) for inference of a state x_i from an observed frame o_i. A motion plan can be generated given a simulated model of the system. The resulting trajectory can be mapped into image space using a forward sensor model (green).


Fig. 2. Forward sensor model architecture. Left: forward parametric branch example. Right: complete architecture, containing a Nearest Neighbor (NN) module producing (image, state)-pair outputs. These are used by individual forward branches to produce a warped reference image.

B. Tracking using the Forward Model

A common approach to state estimation is to use a Bayesian filter with a generative sensor model to provide a correction conditioned on measurement uncertainty. For instance, the forward model described in Section II-A could be used to provide such a state update, potentially allowing a single model to be used for both tracking and prediction. We therefore consider this approach as a suitable comparison to the inverse-forward framework being proposed.

In order to track a belief-state distribution, we can derive an Extended Kalman Filter from the forward sensor model. This is a straightforward process, as network models are inherently amenable to linearization. The correction step is performed by computing the Jacobian $J$ as the product of the layer-wise Jacobians:

$J = J^{(L)} \times J^{(L-1)} \times \cdots \times J^{(0)}, \qquad (3)$

where $L$ is the total number of layers in the network. The dimensionality of the observations makes it impractical to compute the Kalman gain $K = \Sigma J^{\top} (J \Sigma J^{\top} + R)^{-1}$ directly. Instead, we use a low-rank approximation of the sensor-noise covariance matrix $R$, and perform the inversion in a projected space:

$(J \Sigma J^{\top} + R)^{-1} \approx U \, (U^{\top} J \Sigma J^{\top} U + S_R)^{-1} \, U^{\top}, \qquad (4)$

where $S_R$ contains the top-most singular values of $R$ and $U$ the corresponding singular vectors. For simplicity, the state-prediction covariance $Q$ was approximated as $Q = \gamma I_{n \times n}$, with $\gamma = 10^{-6}$. A first-order transition model is used, where next-state joint-value estimates are defined as $x'_{t+1} = x_t + (x_t - x_{t-1}) \Delta t$ for a fixed time step $\Delta t$. We provide an initial prior over the state values with identical covariance, and an arbitrary mean offset from ground truth.
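A minimal sketch of this EKF cycle, under the following assumptions: `forward_model(x)` returns the flattened predicted image, `jacobian(x)` returns its (m×n) Jacobian (the layer-wise product of Eq. (3)), and `U`, `S_R` hold the top singular vectors/values of the sensor-noise covariance R used in Eq. (4). All names are ours, and the covariance prediction is simplified.

```python
import numpy as np

def ekf_step(x_t, x_tm1, Sigma, o_obs, forward_model, jacobian, U, S_R,
             dt=1.0, gamma=1e-6):
    """One EKF predict/correct cycle with the learned forward sensor model.
    The covariance prediction is simplified to Sigma + Q with Q = gamma*I."""
    n = x_t.size
    # Predict with the first-order (constant-velocity) transition model.
    x_pred = x_t + (x_t - x_tm1) * dt
    Sigma_pred = Sigma + gamma * np.eye(n)

    # Correct: linearize the forward model at the predicted state.
    J = jacobian(x_pred)                          # (m x n)
    residual = o_obs - forward_model(x_pred)      # (m,)
    # Invert (J Sigma J^T + R) approximately, in the subspace spanned by U.
    inner = U.T @ J @ Sigma_pred @ J.T @ U + np.diag(S_R)
    K = Sigma_pred @ J.T @ U @ np.linalg.inv(inner) @ U.T   # approximate gain
    x_new = x_pred + K @ residual
    Sigma_new = (np.eye(n) - K @ J) @ Sigma_pred
    return x_new, Sigma_new
```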

C. Tracking using an Inverse Sensor Model

In order to infer the latent joint states $x_i$ from observed images $o_i$, we can also define a discriminative model $g_{fwd}$. Given the capacity of convolutional neural network models to perform regression on high-dimensional input data, we used a modified version of the VGG CNN-S network [1] containing 5 convolutional layers and 4 fully-connected layers (shown in Fig. 3). The model was trained on 256×256×3 input images and corresponding joint labels, optimizing an L2 loss on joint values and regularized with dropout layers.


Fig. 3. Inverse sensor model architecture.
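An illustrative sketch of such a CNN regressor is given below. This is not the exact modified VGG CNN-S configuration used in the paper; the layer widths, kernel sizes, and the class name `InverseSensorModel` are placeholders chosen only to match the stated structure (5 convolutional and 4 fully-connected layers, dropout, L2 loss on joint values).

```python
import torch
import torch.nn as nn

class InverseSensorModel(nn.Module):
    """Illustrative CNN regressor: 5 conv layers + 4 fc layers with dropout,
    regressing joint values from a 3x256x256 image."""

    def __init__(self, n_joints=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2, padding=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(96, 256, 5, stride=2, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 8 * 8, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_joints),
        )

    def forward(self, image):
        return self.regressor(self.features(image))

# Training would minimize an L2 loss on the joint values, e.g. nn.MSELoss().
```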


Fig. 4. Inferred joint state values using (a) the inverse model, and the forward sensor model in an EKF update with a (b) 5-degree and (c) 10-degree offset from ground truth at t = 0.

III. EXPERIMENTAL RESULTS

A. Datasets

The experiments were conducted using a Barrett WAM manipulator, a cable-actuated robotic arm (see Figure). Data was captured from raw joint-encoder traces and a statically positioned camera, collecting 640×480 RGB frames at a rate of 30 fps. Due to non-linear effects arising from joint flexibility and cable stretch, large joint velocities and accelerations induce discrepancies between recorded joint values and the actual positions seen in the images. In order to mitigate aliasing, the joint velocities were kept under 10°/s. This and other practical constraints imposed limitations on obtaining an adequate sampling density of the four-dimensional joint space. As such, the training data was collected while executing randomly generated linear trajectories using only the first four joints. A total of 225,000 camera frames and corresponding joint values were captured, with 50,000 reserved for a nearest-neighbor dataset and the remainder used for training. Test data was collected for arbitrary joint trajectories with varying velocities and accelerations (without concern for joint flexibility and stretch).

B. Forward Sensor Model Evaluation

We first examine the predicted observations generated by the proposed forward sensor model (kNN-FLOW). A deconvolutional network (DECONV) similar to that used in [5] is chosen as a baseline, without the branch for predicting segmentation masks (as these are not readily available from RGB data). Qualitative comparisons between the ground truth, forward-sensor predictions, and DECONV outputs are depicted in Fig. 6 for a sequence of state-input values at various times on a pair of test trajectories. Detailed texture and features have been preserved in the kNN-FLOW predictions, and the generated robot poses closely match the ground-truth images.

Fig. 5. Inferred joint state values from a sequence of RGB images using the inverse sensor model for an arbitrary trajectory.

TABLE I
FORWARD MODEL BASELINE COMPARISON

RGB Error            Mean L1    RMS
kNN-FLOW (ours)      0.01109    0.023097
DECONV (baseline)    0.03066    0.061374

The DECONV outputs suffer from blurred reconstructions, as expected, and fail to render certain components of the robot arm (such as the end-effector). Quantitative results are shown in Table I, where it is apparent that our model outperforms the DECONV baseline in both mean L1 and RMS pixel error.

C. Tracking Task Evaluation

Given a sequence of observations $\{o_0, o_1, \ldots, o_{t-1}, o_t\}$, we wish to infer the corresponding sequence of state values $\{x_0, x_1, \ldots, x_{t-1}, x_t\}$, which includes the current state. This tracking task can be accomplished with a discriminative (inverse) sensor model, which provides deterministic subspace values independently of previously observed frames. An example of tracking accuracy, using the model described in Section II-C, is shown in Fig. 4, which demonstrates frame-by-frame tracking within a margin of 3 degrees from the ground-truth data.

Fig. 6. Sequences of generated images for two independent test trajectories. For each trajectory, the first row corresponds to ground-truth images, the middle row to the predicted output, and the bottom row to the DECONV baseline. Times shown are t = 1, 10, 20, ..., 80, from left to right.

We evaluate the accuracy of the EKF tracker on the same test trajectories as the inverse model. A single-trajectory example is shown in Figure 4 for 5- and 10-degree mean offsets at t = 0, where the ground-truth joint values are identical to those shown for the inverse model. The results demonstrate the ability of the tracker to converge its state estimate to the true trajectory over time, given an initial prior.

The stability of the tracking is highly dependent on the quality of the generated images. Discontinuities arise from the non-parametric component of the observation model, as selected nearest neighbors may abruptly change during a tracked trajectory. Although this effect is mitigated by using a softmax mask for weighting the nearest-neighbor flow outputs, these sudden jumps in the value of the Jacobian and residual terms can lead to aberrations in the tracking performance.

The comparatively high robustness of the inverse model in tracking is further demonstrated by estimating the state of an arbitrary nonlinear test trajectory, shown in Figure 5. No latent dynamics model is assumed here, and state estimates are produced independently given the currently observed frame.

IV. CONCLUSIONS

A framework for tracking and prediction has been proposed, which consists of separate models for state estimation and for the generation of future observations. Beginning with the task of frame prediction, a non-parametric approach is taken to warping reference frames to generate novel instances of the scene. Both quantitatively and qualitatively, this generative network produces improved results over the deconv-net baseline.

For state estimation and tracking, it is shown that the generative observation model can be used in an EKF-based framework to perform probabilistic inference on the underlying latent state, and to track the manipulator state over a simple trajectory.

For the given dataset, the object is essentially fully observed, with limited instances of self-occlusion. As such, it is shown that using a straightforward discriminative model for tracking outperforms the filtered approach. However, in a scene containing many instances of partial or full object occlusion, an approach which models latent-state dynamics may be favorable over the use of a discriminative, inverse model.

V. FUTURE WORK

In training our models on state-observation pairs, we have assumed that ground-truth state measurements are reliably obtained from joint encoders through low-velocity operation. In future work, training data will be collected using a high-fidelity motion-capture system to obtain ground-truth joint displacement values. New datasets will also include scenes containing partial or total occlusion of the manipulator, to further demonstrate the tracking performance of the framework.

REFERENCES

[1] Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.

[2] Bruno Damas and Jose Santos-Victor. An online algorithm for simultaneously learning forward and inverse kinematics. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 1499–1506. IEEE, 2012.

[3] Anthony Dearden and Yiannis Demiris. Learning forward models for robots. In IJCAI, volume 5, page 1440, 2005.

[4] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.

[5] Alexey Dosovitskiy, Jost Springenberg, Maxim Tatarchenko, and Thomas Brox. Learning to generate chairs, tables and cars with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[6] Aaron D'Souza, Sethu Vijayakumar, and Stefan Schaal. Learning inverse kinematics. In Intelligent Robots and Systems, 2001. Proceedings. 2001 IEEE/RSJ International Conference on, volume 1, pages 298–303. IEEE, 2001.

[7] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[9] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[10] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[12] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.

[13] Duy Nguyen-Tuong, Jan R Peters, and Matthias Seeger. Local Gaussian process regression for real time online model learning. In Advances in Neural Information Processing Systems, pages 1193–1200, 2009.

[14] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[15] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Single-view to multi-view: Reconstructing unseen views with a convolutional network. CoRR abs/1511.06702, 2015.

[16] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. MIT Press, 2005.

[17] Stefan Ulbrich, Michael Bechtel, Tamim Asfour, and Rudiger Dillmann. Learning robot dynamics with kinematic Bezier maps. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 3598–3604. IEEE, 2012.

[18] Sethu Vijayakumar and Stefan Schaal. Locally weighted projection regression: Incremental real time learning in high dimensional space. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 1079–1086. Morgan Kaufmann Publishers Inc., 2000.

[19] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In European Conference on Computer Vision, pages 286–301. Springer, 2016.