
SE3-Nets: Learning Rigid Body Motion using Deep Neural Networks Arunkumar Byravan and Dieter Fox

Robotics and State Estimation Lab; University of Washington, Seattle Contact: [email protected]

Introduction


•  The ability to predict how an environment changes based on applied forces is fundamental for a robot to achieve specific goals:
-  Optimal control techniques (iLQG, AICO) can use these “dynamics” models for planning
-  Traditional models make use of (hard to estimate) explicit physical concepts (e.g. friction, mass)

•  Prior work has looked at learning these models of “physical intuition”:
-  Predictive State Representations (PSRs) for modeling depth data of a manipulator [1]
-  Deep networks for predicting motion of a tower of blocks [2], billiards balls [3], a robot pushing objects [5]
-  Deep networks for predicting 3D rotations between pairs of images [4]

•  SE3-Nets explicitly model the motion of rigid bodies under applied forces, using raw point cloud data and ground truth data associations:
-  They segment the scene into “objects” and predict motion through SE3 transformations

Network Architecture

Fig: SE3-Net architecture. The 3D input (XYZ, 3 channels) is encoded by Conv1 (8), Conv2 (16), Conv3 (32); the n-dimensional action is encoded by FC1 (128), FC2 (256); the two encodings are concatenated (CAT). On the decoder side, Deconv1 (32), Deconv2 (16), Deconv3 (8) followed by mask penalty & normalization predict the k masks, while FC3 (256), FC4 (128), FC5 (64) predict the k SE3s. A transform layer applies the k SE3s to the 3D input, weighted by the masks, to produce the 3D output.

•  Inputs: 3D point cloud (3-channel image), Action (n-dimensional vector)

•  Encoder: Generates a joint encoding (CAT) of the inputs

•  Decoder: Estimates point-wise motion by predicting:
-  Motion (What?): k SE3 transforms, each specified by a rotation (R) & translation (t)
-  Location (Where?): a k-channel mask representing the probability of a point being affected by each of the k SE3 motions

•  Transform layer: Generates the 3D output as a weighted blend of the k transformed input points (see the equations and sketch below)

-  Enforcing Rigidity: Smoothly sharpen the masks to a binary decision over SE3s

•  Output: Transformed 3D point cloud (same number of points as the input cloud)

Rigid transform of a point: $x' = Rx + t$

Mask weights per point: $M_i = \{m_{i1}, m_{i2}, \dots, m_{ik}\}$ with $\sum_{j=1}^{k} m_{ij} = 1$

Blended output point: $y_i = \sum_{j=1}^{k} m_{ij}\,[R_j x_i + t_j]$

Mask sharpening (penalty & normalization): $m'_{ij} = \dfrac{(m_{ij} + \mathcal{N}(0, \sigma^2))^{\gamma}}{\sum_{k} m_{ik}^{\gamma}}$
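The equations above translate almost directly into code. The following is a minimal NumPy sketch of the transform layer and the mask sharpening step, written here for illustration (the names `transform_layer` and `sharpen_masks` are ours, not from the released code); it assumes row-vector points and k rotation matrices plus translations.

```python
import numpy as np

def sharpen_masks(masks, gamma=2.0, sigma=0.0, rng=None):
    """Smoothly push per-point mask weights toward a binary choice over the k
    SE3s: add optional Gaussian noise, raise to the power gamma, renormalize."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = masks + sigma * rng.standard_normal(masks.shape)
    noisy = np.clip(noisy, 1e-6, None) ** gamma
    return noisy / noisy.sum(axis=-1, keepdims=True)

def transform_layer(points, rotations, translations, masks):
    """points: (N, 3); rotations: (k, 3, 3); translations: (k, 3); masks: (N, k).
    Returns the predicted cloud (N, 3): a mask-weighted blend of the k rigid
    transforms R_j x_i + t_j applied to every input point."""
    transformed = np.einsum('kij,nj->nki', rotations, points) + translations[None]
    return (masks[..., None] * transformed).sum(axis=1)

# Tiny usage example with k = 2 transforms on a random 5-point cloud.
rng = np.random.default_rng(0)
pts = rng.standard_normal((5, 3))
R = np.stack([np.eye(3), np.eye(3)])              # identity rotations for brevity
t = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])  # second SE3 shifts along x
m = sharpen_masks(rng.random((5, 2)), gamma=4.0)
print(transform_layer(pts, R, t, m).shape)        # -> (5, 3)
```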

Datasets

•  Four simulated datasets using the physics simulator Gazebo
-  Single Box: A ball is launched to collide with a single box on a table
-  Multiple Boxes: Same as above, but with 1-3 boxes of random size/mass
-  Household Objects: Similar to above, but with 11 objects from the LineMOD dataset
-  Baxter: 14-DOF robot randomly moving 1-4 joints on its right arm

•  Preliminary tests (67 examples) on poking objects using a Baxter robot
-  Uses an attached stick to poke three objects (Cheez-It box, mustard container & Pringles can)

Comparisons / Training

•  We trained two versions of our network and three baselines
-  Ours: SE3-Net, 3 Conv/Deconv layers, ~1.4 million parameters
-  Ours (Large): SE3-Net, 5 Conv/Deconv layers, ~6.5 million parameters
-  No Penalty: Non-rigid SE3-Net, 3 Conv/Deconv layers, ~1.4 million parameters
-  Flow: Supervised flow net, 3 Conv/Deconv layers, ~1.1 million parameters
-  Flow (Large): Supervised flow net, 5 Conv/Deconv layers, ~10 million parameters

•  All networks were trained with input point clouds of 240 x 320 resolution
-  Control = 10-dim for the box datasets (ball pose, force), 14-dim for Baxter (joint velocities), as sketched below
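As a rough illustration of the input format described above (our own sketch; the exact preprocessing is not given on the poster), each training sample pairs a 3-channel XYZ image with a control vector:

```python
import numpy as np

# 240 x 320 point cloud stored as a 3-channel XYZ image, plus a control vector:
# 10-dim (ball pose + applied force) for the box datasets, 14-dim joint
# velocities for Baxter. Shapes only; the values here are placeholders.
height, width = 240, 320
xyz_image = np.zeros((3, height, width), dtype=np.float32)
control_box = np.zeros(10, dtype=np.float32)
control_baxter = np.zeros(14, dtype=np.float32)
print(xyz_image.shape, control_box.shape, control_baxter.shape)
```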

Results

Fig: Sequential prediction results for the Baxter dataset. Network output is fed back as input for 4 consecutive steps (Ground Truth, Ours (Large), Flow (Large); t = 0.15, 0.3, 0.45, 0.6, 0.75 sec).

Summary

•  SE3-Nets model rigid body motion by jointly segmenting the scene into objects and predicting SE3 motions for each distinct object
•  Tested on four simulated tasks & a real robot poking task
•  Future work:
-  Optimal control with learned rigid body motion models
-  Unsupervised training without explicit data-associations
-  Training on sequences (Recurrent Nets)

Paper: https://arxiv.org/abs/1606.02378

References

[1] Byron Boots, Arunkumar Byravan, and Dieter Fox. Learning predictive models of a depth camera & manipulator from raw execution traces. In ICRA, pages 4021–4028. IEEE, 2014.

[2] Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312, 2016.

[3] Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of physics for playing billiards. arXiv preprint arXiv:1511.07404, 2015.

[4] Jimei Yang, Scott E. Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.

[5] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157, 2016.

Fig: Predicted masks (Input, Ours, Ours (Large), No Penalty) for the Single Box, Multiple Boxes and Baxter datasets. No Penalty masks blend across multiple classes.

Fig: Prediction results for three simulated datasets (0.15 seconds forward), showing re-projected depth images (Input, Ground Truth, Ours, Ours (Large), No Penalty, Flow, Flow (Large)).

Fig. 2: Prediction results for three simulated datasets. Rows are results from the Single Box, Multiple Boxes and Baxter datasets respectively. All images (except the first column on the left) were rendered by projecting the predicted 3D point cloud to 2D using the camera parameters and rounded off to the nearest pixel without any interpolation. (From left to right) Input point cloud with applied force on the ball shown in green; ground truth; predictions generated by different networks. 3D point clouds for the flow networks were computed by adding the predicted flow to the input. Image best viewed in high resolution. For a better understanding of the results, please refer to the supplementary video.

Task            Ours    Ours (Large)    No Penalty   Flow   Flow (Large)
Single Box      3.73    1.65 ± 0.17     25.3         10.1   2.48 ± 0.22
Multiple Boxes  3.22    1.29 ± 0.14     19.7         6.2    1.69 ± 0.17
Baxter          0.074   0.057 ± 0.002   0.074        0.11   0.063 ± 0.001

TABLE I: Average per-point flow MSE (cm) across tasks and networks. Our (Large) network achieves the best flow error compared to baselines even though it is not directly trained to predict flow.
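For reference, a minimal sketch of the reported metric as we read it from the captions (per-point flow mean squared error, in cm); the function name and the convention of deriving predicted flow as the difference between predicted and input clouds are our assumptions:

```python
import numpy as np

def per_point_flow_mse(pred_cloud, input_cloud, gt_flow):
    """pred_cloud, input_cloud, gt_flow: (N, 3) arrays in cm.
    Predicted flow is taken as the per-point displacement pred_cloud - input_cloud;
    the error is the squared flow residual averaged over points."""
    pred_flow = pred_cloud - input_cloud
    return np.mean(np.sum((pred_flow - gt_flow) ** 2, axis=-1))
```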

Fig. 3: Multi-step prediction results obtained by feeding the output of the network back to the input five consecutive times. Ground truth is reset at the end of each frame.

Fig. 4: Two results from the Household Objects dataset, showing toppling and sliding motion. Our network segments the objects correctly and predicts consistent motion, leading to sharp point clouds, while the flow baseline smears the object across the image. The ball is highlighted in red.


                Depth Noise                              Data Association Noise
Task            SD = 0.75 cm      SD = 1.5 cm,           9x9 window,          15x15 window,
                (depth-scaled)    no depth scaling       threshold = ±10 cm   threshold = ±20 cm
                Ours     Flow     Ours     Flow          Ours     Flow        Ours     Flow
Single Box      2.61     6.87     2.70     4.31          1.79     3.15        2.80     5.32
Multiple Boxes  1.95     6.10     3.56     4.42          1.05     1.95        2.48     4.26
Baxter          0.073    0.066    0.44     0.63          0.10     0.15        0.30     0.43
(Ours = Ours (Large), Flow = Flow (Large))

TABLE II: Average per-point flow MSE (cm) for networks trained with noise added to depth (left four columns) and data associations (right four columns). Our network's performance degrades gracefully with increasing noise, compared to large errors for the flow baseline. SD = depth noise standard deviation.

By segmenting the scene into objects (Fig. 5) and predicting individual SE(3) transforms, our network ensures that points which belong to an object rigidly move together in an interpretable manner. This results in a sharp prediction with very little noise, as compared to the flow networks which do not have any such constraints. With increasing layer depth, the flow networks can somewhat compensate for this, but there is still a significant amount of noise in the predictions, resulting in smearing across the canvas (Fig. 2). Interestingly, we noticed that the flow networks perform quite poorly on examples where only a few points exhibit motion, such as when just the ball moves in the scene, while our networks are able to predict the ball’s motion quite accurately.

We also present mask predictions made by our networks in Fig. 5. The colors indicate that the masks predicted by our networks for the box datasets are near binary (we render the 3-channel masks directly as RGB images). Our network successfully segments the ball and box as distinct objects without any explicit supervision. In practice, we found that it is crucial to give the network examples where the ball moves independently, as this provides implicit knowledge that the ball and the box are distinct. In cases where the training examples always have the ball in contact with the box, the network had a hard time separating the objects, often masking them out together. For the Baxter dataset, depending on the motion, our network usually segments the arm into 2-3 distinct parts (Fig. 5), often with a split at the elbow.
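A trivial sketch (ours, for illustration only) of the mask visualization described above: the k = 3 mask channels are written directly into the red, green and blue channels, so near-binary masks show up as saturated single-color regions.

```python
import numpy as np

def mask_to_rgb(masks):
    """masks: (H, W, 3) array with values in [0, 1] (one channel per SE3).
    Returns a uint8 RGB image where each predicted object gets its own color."""
    return (np.clip(masks, 0.0, 1.0) * 255.0).astype(np.uint8)
```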

In comparison, the No Penalty SE3-Net rarely predicts binary masks, often blending across different SE(3)s. Unfortunately, this leads to a significant overfit on the box settings, resulting in a large flow error. On the Baxter dataset, this network achieves errors comparable to the flow network, performing significantly better. We believe that this is because blending allows the network to capture the serial dependence in the Baxter’s kinematic chain better. This hints at an interesting approach for modeling non-rigid motions and kinematic chains using the SE3-Net, which we discuss further in Sec. V.

Fig. 4 shows two representative results from testing on the household object dataset, where the network has to deal with complex shaped objects, some with holes. As is clear, the network can model the dynamics of these objects well, with the resulting predictions being significantly sharper compared to the large flow network, which does very poorly. We have also seen that our network is able to gracefully handle cases where objects topple or undergo large motions.

Finally, we test the consistency of our network in modeling sequences by allowing the network to forward propagate the scene dynamics multiple steps into the future. Fig. 3 shows these results for a Baxter sequence where we feed the network’s predictions back in as input and fix the control vector for 5 steps into the future. We compare against ground truth and see that our predictions remain consistent across time without significant noise addition. In comparison, the predictions from the large flow network degrade over time as the noise cascades. To get a better understanding of our results, we encourage the readers to look at the supplementary video attachment, where we show predictions across multiple sequences.
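The multi-step protocol described above can be summarized by a short rollout loop. This is a hedged sketch with an illustrative `predict` callable standing in for a trained SE3-Net; it simply feeds each prediction back as the next input while holding the control vector fixed.

```python
import numpy as np

def rollout(predict, init_cloud, control, steps=5):
    """predict: callable (cloud, control) -> predicted cloud.
    Feeds each prediction back in as the next input for `steps` steps."""
    cloud, outputs = init_cloud, []
    for _ in range(steps):
        cloud = predict(cloud, control)   # prediction becomes the next input
        outputs.append(cloud)
    return outputs

# Usage with a dummy predictor that shifts every point slightly along x.
dummy = lambda cloud, u: cloud + np.array([0.01, 0.0, 0.0])
preds = rollout(dummy, np.zeros((4, 3)), np.zeros(14), steps=5)
print(len(preds))  # -> 5
```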

D. Robustness

We perform two types of additional experiments to test the robustness of our algorithms to hyper-parameter choices and noisy data.

Sensitivity to number of objects: In all prior experiments, we have chosen the number of predicted SE(3)s (k) a priori with our knowledge of the datasets. To test the sensitivity of our algorithm to this parameter, we trained our networks setting k to a large number (k = 8 for the Baxter dataset and k = 6 for the rest). In most cases, we found that the network automatically segments the scene into the correct number of objects, with the remaining mask channels assigned to the identity transform. We also saw little to no performance drop in these experiments.

Robustness to depth noise: To test whether our network is capable of handling the types of noise seen in real depth sensors, we trained networks under two types of depth noise. First, we added Gaussian noise with a standard deviation (SD) of 0.75 cm and scaled the noise by the depth (farther points get more noise), as is common in commodity depth sensors. Second, we increased the noise SD to 1.5 cm without scaling by the depth. Table II shows the performance of the two large networks under both types of noise: while our performance degrades, we significantly outperform the baseline flow network. Additionally, our network is still able to segment the objects properly in most of our tests.

Robustness to noise in data association: We test how well our networks respond to uncertainty in data association by allowing spurious associations. We allow each point to be randomly associated to any other point in an m x m window around it, as long as their depth differences are no larger than a threshold. We train in two increasingly noisy settings: a 9x9 window with a threshold of ±10 cm and a 15x15 window with a threshold of ±20 cm. Table II shows the results of these tests. Our network strongly outperforms the flow baseline, with errors almost half those of the flow baseline. While this test does not simulate a systematic association bias, it still shows that our network is robust to uncertain data associations.
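The two noise models above can be reproduced approximately as follows. This is our reconstruction from the text, not the authors' code; the window handling at image borders and the exact noise scaling are assumptions.

```python
import numpy as np

def add_depth_noise(depth, sd_cm=0.75, scale_with_depth=True, rng=None):
    """depth: (H, W) depth image in cm. Gaussian noise with the given SD,
    optionally scaled with depth so farther points receive more noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.normal(0.0, sd_cm, size=depth.shape)
    if scale_with_depth:
        noise *= depth / max(depth.mean(), 1e-6)
    return depth + noise

def perturb_associations(assoc, depth, window=9, threshold_cm=10.0, rng=None):
    """assoc: (H, W, 2) pixel coordinates of each point's associated point.
    Each association may jump to a random pixel inside a window x window
    neighborhood, as long as the depth difference stays below the threshold."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = depth.shape
    half = window // 2
    out = assoc.copy()
    for r in range(h):
        for c in range(w):
            dr, dc = rng.integers(-half, half + 1, size=2)
            rr = int(np.clip(r + dr, 0, h - 1))
            cc = int(np.clip(c + dc, 0, w - 1))
            if abs(depth[rr, cc] - depth[r, c]) <= threshold_cm:
                out[r, c] = assoc[rr, cc]
    return out
```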

Fig: SE3-Net overview. Inputs: 3D point cloud + control at time t. Intermediate outputs: SE(3) predictions and “object” mask predictions. Outputs: predicted 3D point cloud at t+1 and predicted 3D flow.

Fig: Two prediction results from the household objects dataset (Input, Ground Truth, Ours (Large), Flow (Large)).

Fig: Sequential prediction results for the real world poking dataset (Ground Truth, Ours, Flow; t = 0.27, 0.54, 0.81, 1.08, 1.35 sec). The small SE3-Net (Ours) gets the ground-truth mask of the arm as additional input.

Funded in part by NSF-NRI-1227234: Collaborative Research: Purposeful Prediction: Co-robot Interaction via Understanding Intent and Goals.