
Instant 3D Object Tracking with Applications in Augmented Reality

Adel Ahmadyan∗  Tingbo Hou∗  Jianing Wei∗  Liangkai Zhang  Artsiom Ablavatski  Matthias Grundmann

Google Research, 1600 Amphitheatre Pkwy, Mountain View, CA 94043

{ahmadyan, tingbo, jianingwei, liangkai, artsiom, grundman}@google.com

∗Equal contribution

    Abstract

Tracking object poses in 3D is a crucial building block for Augmented Reality applications. We propose an instant motion tracking system that tracks an object's pose in space (represented by its 3D bounding box) in real-time on mobile devices. Our system does not require any prior sensory calibration or initialization to function. We employ a deep neural network to detect objects and estimate their initial 3D pose. The estimated pose is then tracked using a robust planar tracker. Our tracker is capable of performing relative-scale 9-DoF tracking in real-time on mobile devices. By efficiently combining the use of CPU and GPU, we achieve 26+ FPS on mobile devices.

1. Introduction

Tracking in monocular videos is a challenging and well-studied problem in computer vision. While 2D tracking is mature, with robust solutions [16, 13, 3, 4, 12, 10], 3D tracking from monocular RGB images remains an open problem. Current approaches to 3D tracking [7, 17, 18] require complex initialization procedures to estimate depth, are not very robust, and have high computational cost.

The objective of 3D tracking is to track the 3D bounding box of a rigid object across frames when both camera and object motions are present. The object pose in 3D is uniquely determined by its 3D bounding box and has 9 DoF, comprising orientation, translation, and physical size.
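To make this representation concrete, the following is a minimal sketch of how a 9-DoF pose could be stored and turned into box vertices; the class and field names are illustrative assumptions, not the paper's data structures.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Pose9DoF:
    """9-DoF object pose: 3 DoF each for orientation, translation, and size.

    Illustrative container only; not the paper's actual data structure.
    """
    rotation: np.ndarray     # (3, 3) rotation matrix (orientation, 3 DoF)
    translation: np.ndarray  # (3,) position in camera coordinates (3 DoF)
    size: np.ndarray         # (3,) box extents along each local axis (3 DoF)

    def box_vertices(self) -> np.ndarray:
        """The 8 corners of the oriented 3D bounding box, in camera space."""
        corners = np.array([[sx, sy, sz]
                            for sx in (-0.5, 0.5)
                            for sy in (-0.5, 0.5)
                            for sz in (-0.5, 0.5)])
        return (self.rotation @ (corners * self.size).T).T + self.translation
```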

In this paper, we propose a system to detect and track an object's 3D pose in real-time. Initially, we detect the object and estimate its pose using a deep neural network [6]. The detection network does not require prior knowledge of the object's shape, size, or CAD model, and can detect unseen objects of a known category. This model can run in real-time on mobile devices and can be used in a tracking-by-detection paradigm. When the model is applied to every frame, however, the detection output may suffer from jitter due to the prediction noise from the model. This jitter is undesirable for AR applications. We adopt a detection-plus-tracking framework to mitigate this issue: it removes the need to run the network on every frame, allowing the use of heavier and therefore more accurate models while keeping the pipeline real-time on mobile devices. It also retains object identity across frames and ensures that the predictions are temporally consistent, effectively reducing the jitter.

Our detection-plus-tracking system works as follows. We first obtain the object's pose from the detection network, which estimates its 3D bounding box. We then project the bounding box's 3D vertices onto the image plane and track these 2D points, which rest on a plane, using a planar tracker [16]. Finally, we lift the tracked 2D points back to 3D using the EPnP algorithm [8] to estimate the 3D bounding box in subsequent frames. This work extends our previous works on 3D object detection [6], instant motion tracking [16], and 3D object tracking [2].
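As a minimal sketch of the lift step, the snippet below recovers a box pose from tracked 2D keypoints with OpenCV's EPnP solver. The canonical unit-box vertices and the camera intrinsics are assumptions made to keep the example self-contained; the actual pipeline recovers the box up to scale without known object dimensions (see Section 3).

```python
import cv2
import numpy as np

def lift_box_to_3d(box_vertices_3d, tracked_points_2d, camera_matrix):
    """Lift tracked 2D box keypoints back to a 3D box via EPnP.

    box_vertices_3d: (8, 3) canonical box vertices (assumed known here).
    tracked_points_2d: (8, 2) pixel locations from the planar tracker.
    camera_matrix: (3, 3) pinhole intrinsics.
    Returns the (8, 3) box vertices in camera space, or None on failure.
    """
    ok, rvec, tvec = cv2.solvePnP(
        box_vertices_3d.astype(np.float32),
        tracked_points_2d.astype(np.float32),
        camera_matrix.astype(np.float32),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)  # axis-angle -> 3x3 rotation matrix
    # Transform the canonical vertices into camera space.
    return (rotation @ box_vertices_3d.T).T + tvec.reshape(1, 3)
```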

Our tracker has three properties: it is robust, instant, and real-time. The tracker is very efficient, utilizing the on-device CPU and GPU for tracking and detection, respectively, and thus achieves real-time (26+ FPS) performance on mobile devices. The detection is performed on a single RGB image and the planar tracker is instant, so our whole pipeline does not require any parallax-inducing motion to initialize and locate the object, and its overall latency is low. Finally, under the assumption that the 3D bounding box sits on a planar ground, the planar tracker is very robust in typical AR applications.

    Our main contributions are:

• We propose an end-to-end system to track the object's 3D pose (orientation, translation, and size up to a scale). This system uses a CNN to initialize the pose, then tracks it using a planar tracker.

• The proposed system is calibration-free and does not require any complex initialization sequence or any hardware beyond the camera or IMU sensors. It does not require prior knowledge of the object's shape or model and can detect and track unseen objects.


• The end-to-end system (including detection) runs in real-time on mobile devices.

To encourage researchers and developers to experiment and prototype based on our pipeline, we open-sourced our on-device ML pipeline in [1], including an end-to-end demo mobile application and our trained detector for shoes and chairs. We hope that sharing our solution with the wide research and development community will stimulate new use cases, applications, and research efforts.

2. Related work

Efficient and robust tracking is an essential component of any AR application. Object tracking has been studied and practiced extensively in computer vision. Planar and region-based trackers [13, 3, 4, 12, 10] rely on 3D geometry to estimate the camera motion and track objects. Recently, neural networks have been utilized to learn and estimate motion. Generally, a tracking system consists of two components: a) a detector, which detects the objects in each frame and estimates their 2D or 3D bounding boxes, and b) a matching algorithm, which tracks object correspondences between frames.

[7] detects an object's 3D bounding box and then estimates the 3D motion between frames to track the box. Employing a Kalman filter after 3D detection was investigated in [17] and shown to achieve good performance. In [9] and [11], tracking the 3D bounding box with stereo cameras for autonomy applications was studied. [15] used 3D cues for tracking vehicles' 2D bounding boxes. Recently, the authors of [18] proposed using a deep neural network for both 3D detection and tracking.

3. Instant 3D tracking

Figure 1 shows an overview of our 3D tracking system. Initially, frames are passed through a single-stage CNN, as shown in Figure 2, to predict the object's pose and physical size from a single RGB image.

The model backbone has an encoder-decoder architecture, built upon MobileNetV2 [14]. We employ a multi-task learning approach, jointly predicting an object's shape with detection and regression. The shape task predicts the object's shape signals depending on what ground-truth annotation is available, e.g., segmentation; it is optional if no shape annotation is present in the training data. For the detection task, we use the annotated bounding boxes and fit a Gaussian to each box, with its center at the box's centroid and standard deviations proportional to the box size. The goal of detection is then to predict this distribution, with its peak representing the object's center location [5]. The regression task estimates the 2D projections of the eight bounding-box vertices. To obtain the final 3D coordinates of the bounding box, we leverage a well-established pose estimation algorithm, EPnP [8], which can recover the 3D bounding box of an object without a priori knowledge of the object's dimensions.
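To illustrate the detection target, here is a minimal sketch of rendering such a Gaussian, with standard deviations proportional to the 2D box size; the function name and the proportionality constant are assumptions for the example.

```python
import numpy as np

def gaussian_heatmap(height, width, center, box_size, sigma_scale=0.1):
    """Detection target: a 2D Gaussian peaked at the box centroid.

    center: (cx, cy) box centroid in heatmap pixels.
    box_size: (w, h) of the 2D box; the standard deviations are
    proportional to it (sigma_scale is an illustrative choice).
    """
    cx, cy = center
    sigma_x = max(sigma_scale * box_size[0], 1.0)
    sigma_y = max(sigma_scale * box_size[1], 1.0)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 / (2.0 * sigma_x ** 2) +
                    (ys - cy) ** 2 / (2.0 * sigma_y ** 2)))
```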

    Figure 1: Overview of our 3D tracking system.

Figure 2: Our network detects the object and estimates the 3D bounding box.

Figure 3: Estimated 3D bounding box and the segmentation mask produced by our object detector network.

Given the 3D bounding box, we can easily compute the pose and size of the object. Figure 2 shows our network architecture and post-processing. The model is light enough to run in real-time on mobile devices (at 26 FPS on an Adreno 650 mobile GPU). The details of our model are described in [6].

Figure 4: Tracking a 3D bounding box with both camera motion and object movement present.

After initializing the object's pose with the object detector network, we compute the nine keypoints of the bounding box and project them onto the image. The nine keypoints consist of the bounding box's eight vertices plus its center. We track these points using a planar tracker [16]. Our tracking system consists of a motion analysis module, a robust box tracker, and a pose estimation module for calibration-free 6-DoF tracking [16]. Tracking the nine 2D points using a 6-DoF relative-scale tracker [16] is sufficient for tracking the object's 9-DoF pose. At every frame, we lift the 2D points back to 3D using the EPnP algorithm [8] to estimate the 3D bounding box (with 9 DoF) up to scale. Consequently, our model inference only needs to run every few frames, resulting in a highly efficient mobile pipeline. When a new prediction is made, we consolidate the detection result with the tracking result based on their area of overlap.
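As a sketch of how such a consolidation might work, the snippet below matches each fresh detection to the tracked boxes by 2D intersection-over-union; the dict-based representation and the 0.5 threshold are illustrative assumptions, not the paper's implementation.

```python
def iou_2d(a, b):
    """Intersection-over-union of two axis-aligned 2D boxes (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def consolidate(detections, tracks, iou_threshold=0.5):
    """Merge fresh detections into the tracked set by area of overlap.

    detections, tracks: lists of dicts carrying a 'box_2d' tuple; the
    representation and threshold here are illustrative assumptions.
    """
    for det in detections:
        overlaps = [iou_2d(det['box_2d'], t['box_2d']) for t in tracks]
        if overlaps and max(overlaps) > iou_threshold:
            # Same object: re-initialize the matching track with the fresh pose.
            tracks[overlaps.index(max(overlaps))].update(det)
        else:
            tracks.append(dict(det))  # previously unseen object: new track
    return tracks
```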

    4. Results and applications to AR

In Figure 4, we demonstrate the results of tracking multiple 3D bounding boxes in a video. The complete system is implemented in the MediaPipe framework; the model and the code are available at [1]. Our detection network predicts the 3D bounding box with an average precision of 0.59 at 0.5 3D IoU. The model weighs only 5.54 MB. The model's output also includes shape information, such as a segmentation mask. The object detector runs at 26.5 FPS on the mobile GPU, while the 3D tracking runs at 30+ FPS on a mobile CPU (a Samsung S20 device with Qualcomm's Snapdragon 865 SoC).

Figure 5: Fitting and rendering a mesh model to the detected objects for AR applications.

Figure 5 shows an example of how the tracked object's pose can be used for AR applications such as virtual shoe try-on. In each frame, we render a CAD model at the object pose using the OpenGL rendering pipeline. The same polygon mesh model is also rendered in Figure 4 inside the bounding box, as an occluder, to give a 3D effect to the visualization.
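For instance, a renderer needs a model transform built from the tracked pose; a minimal sketch of composing one from the rotation, translation, and size (assuming a column-vector convention) might look like this.

```python
import numpy as np

def model_matrix(rotation, translation, size):
    """Compose a 4x4 model transform M = T * R * S from a tracked 9-DoF pose.

    rotation: (3, 3) rotation matrix; translation: (3,); size: (3,) box
    extents used as per-axis scale. Column-vector convention is assumed.
    """
    m = np.eye(4)
    m[:3, :3] = rotation * np.asarray(size)  # equals R @ diag(size): rotate + scale
    m[:3, 3] = translation
    return m
```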

We make two key assumptions for the 3D tracking system to work properly. First, there is no gauge ambiguity in the 3D bounding box, and the detected eight keypoints are unique. For symmetric objects, e.g., a volleyball, this assumption does not hold; the detection network would predict a different orientation each time, which would cause failures when consolidating the detection and tracking results. Second, the tracking module assumes that the object's plane does not change significantly while we are tracking it. This holds for most objects as long as they do not roll; however, if the tracked object rolls and changes its plane during tracking, e.g., a volleyball, we may lose track of it.

    5. Conclusion

In conclusion, we present a system for 3D object tracking that enables real-time, instant 3D bounding-box tracking on mobile devices. Our proposed system uses a neural network to initialize the 3D pose, and then a planar surface tracker to track the object's pose across video frames. The end-to-end system runs in real-time on mobile devices.

References

[1] MediaPipe Objectron. https://github.com/google/mediapipe/blob/master/mediapipe/docs/objectron_mobile_gpu.md. Accessed: 2020-04-01.

[2] Adel Ahmadyan and Tingbo Hou. Real-Time 3D Object Detection on Mobile Devices with MediaPipe. ai.googleblog.com/2020/03/real-time-3d-object-detection-on-mobile.html, 2020.

[3] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56:221–255, 2004.

[4] S. Benhimane and E. Malis. Real-time image-based tracking of planes using efficient second-order minimization. In International Conference on Intelligent Robots and Systems (IROS), volume 1, pages 943–948, 2004.

[5] Li Ding and Lex Fridman. Object as distribution, 2019.

[6] Tingbo Hou, Adel Ahmadyan, Liangkai Zhang, Jianing Wei, and Matthias Grundmann. MobilePose: Real-time pose estimation for unseen objects with weak shape supervision. arXiv preprint arXiv:2003.03522, 2020.

[7] Hou-Ning Hu, Qi-Zhi Cai, Dequan Wang, Ji Lin, Min Sun, Philipp Krähenbühl, Trevor Darrell, and Fisher Yu. Joint monocular 3D vehicle detection and tracking, 2018.

[8] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision (IJCV), 81(2):155–166, 2009.

[9] Peiliang Li, Tong Qin, and Shaojie Shen. Stereo vision-based semantic 3D object and ego-motion tracking for autonomous driving, 2018.

[10] Pengpeng Liang, Yifan Wu, Hu Lu, Liming Wang, Chunyuan Liao, and Haibin Ling. Planar object tracking in the wild: A benchmark. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 651–658, 2018.

[11] Aljosa Osep, Wolfgang Mehner, Markus Mathias, and Bastian Leibe. Combined image- and world-space tracking in traffic scenes, 2018.

[12] Christian Pirchheim and Gerhard Reitmayr. Homography-based planar mapping and tracking for mobile phones. In IEEE International Symposium on Mixed and Augmented Reality, pages 27–36, 2011.

[13] Simon J. D. Prince, Ke Xu, and Adrian David Cheok. Augmented reality camera tracking with homographies. IEEE Computer Graphics and Applications, 22(6):39–45, 2002.

[14] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks, 2018.

[15] Sarthak Sharma, Junaid Ahmed Ansari, J. Krishna Murthy, and K. Madhava Krishna. Beyond pixels: Leveraging geometry and shape cues for online multi-object tracking, 2018.

[16] Jianing Wei, Genzhi Ye, Tyler Mullen, Matthias Grundmann, Adel Ahmadyan, and Tingbo Hou. Instant motion tracking and its applications to augmented reality. In CVPR Workshop on Computer Vision for Augmented and Virtual Reality, Long Beach, CA, 2019.

[17] Xinshuo Weng and Kris Kitani. A baseline for 3D multi-object tracking, 2019.

[18] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points, 2020.