Multi-camera DeepTAM · 2020. 2. 17. · 3D Vision 2019 Goal: Requirements / Tools: Supervisor: Description: Benchmarking local features for multi-camera garden SLAM Evaluate the

3D Vision 2019

Goal:

Requirements / Tools: Supervisor:

Description:

Multi-camera DeepTAM

Generalize DeepTAM to use a multi-camera setup

Visual Odometry methods based on classical 3D geometry have been around for years, using either indirect feature matching or direct visual error minimization. Lately, learning based methods that combine both matching and geometry estimation in a single network have shown impressive results. Classical methods have been shown to benefit from the extended field of view that can be provided by using multiple cameras, but these setups have been mostly ignored by current learning-based methods.

The goal of this project is to extend the existing DeepTAM pipeline to leverage a multi-camera setup with known geometry.

References:

[1] H. Zhou and B. Ummenhofer and T. Brox. DeepTAM: Deep Tracking and Mapping. ECCV, 2018

Marcel Geppert <[email protected]>

Viktor Larsson <[email protected]>

Required: Python, TensorFlow

3D Vision 2019

Goal:


Description:

Benchmarking local features for multi-camera garden SLAM

Evaluate the impact of different local feature types on runtime and accuracy of SLAM in garden environments

While there are many visual SLAM systems with different approaches available today, most of them fail when moving from a structured, artificial environment to cluttered environments such as gardens or open nature. One approach to improve robustness is to increase the field of view, either by using wide angle lenses or multiple cameras, to increase the amount of available information. Still, in this case processing the available data on time becomes more and more difficult.

The goal of this project is to replace the currently used SIFT features in our SLAM pipeline with multiple different features and evaluate the impact on runtime, recognition and matching, and finally the resulting pose accuracy.

Marcel Geppert <[email protected]>

C++, OpenCV, Matlab

3D Vision 2019

Goal:


Description:

3D Occupancy Prediction for Autonomous Driving

Implement a 3D occupancy prediction method for autonomous driving.

Because most real environments, such as urban streets or crowded areas, are complex and dynamic, autonomous driving requires predicting future 3D occupancy from temporal dependencies.

There are two main techniques for modeling short-term occupancy and predicting it into the future: the dynamic Gaussian process (DGP) map [1] and the spatio-temporal Hilbert map (STHM) [2]. However, both have only been studied in the 2D case. In this project, we plan to implement a method for 3D occupancy prediction for autonomous driving.

References:

[1] Callaghan et al. Gaussian process occupancy maps for dynamic environments. Experimental Robotics, 2015.

[2] Senanayake et al. Spatiotemporal Hilbert maps for continuous occupancy representation in dynamic environments. NIPS, 2016.

Zhaopeng Cui <[email protected]>

Required: C++ / Matlab

https://docs.google.com/file/d/1AOPgZgkW2VHVBNUm_nnLsvMtGckuz4u7/preview

3D Vision 2019

Goal:


Description:

Fast Dense Semantic Fusion for 3D Semantic Reconstruction

Implement a real-time dense semantic fusion method for large-scale semantic scene reconstruction

Autonomous driving needs real-time dense 3D semantic mapping in order to make intelligent decisions. As in geometric mapping, the semantic segmentation of each frame has to be fused in order to obtain a global semantic map. Vineet et al. [1] proposed an efficient mean-field inference algorithm for large-scale dense semantic fusion.

This project aims to implement a real-time dense semantic fusion algorithm for large-scale semantic reconstruction based on [1]. The existing framework [2] for real-time 3D reconstruction will be used for this project, and the KITTI stereo dataset will be used for testing.
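As a rough illustration of what "fusing the semantic segmentation of each frame" can mean, the sketch below applies a per-voxel recursive Bayesian update to class probabilities. This is a deliberately simplified stand-in and not the mean-field inference of [1]: it ignores interactions between neighbouring voxels. The function name, the three class labels, and the example probabilities are all hypothetical.

```python
def fuse_label_probs(prior, likelihood, eps=1e-6):
    """Recursive Bayesian update of one voxel's class distribution.

    prior and likelihood are lists of per-class probabilities; eps keeps a
    single confident frame from zeroing out a class permanently.
    """
    posterior = [max(p, eps) * max(l, eps) for p, l in zip(prior, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]


# Hypothetical classes: 0 = road, 1 = building, 2 = vegetation.
voxel = [1 / 3, 1 / 3, 1 / 3]  # uninformative prior
frames = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1]]
for frame_probs in frames:
    voxel = fuse_label_probs(voxel, frame_probs)

best_class = max(range(len(voxel)), key=voxel.__getitem__)
```

In a volumetric framework such as [2], a distribution like `voxel` would presumably be stored alongside the geometric data of each voxel and updated whenever a new segmented frame observes it.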

References:

[1] Vineet et al. Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction, ICRA, 2015.

[2] Prisacariu et al. InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure. arXiv, 2017.

Zhaopeng Cui <[email protected]>

Required: C++ and CUDA

3D Vision 2019

Goal:


Description:

SurfelWarp: Efficient Non-Volumetric Single View Dynamic Reconstruction

The goal is to implement a dense SLAM system for reconstruction of non-rigid deforming scenes

The reconstruction of non-rigid deforming scenes is a challenging problem. Recent work [1] has proposed a surfel-based dense SLAM system for reconstructing such scenes. The approach builds on [2], for which an implementation that works on static scenes exists online [3].

In this work, the goal is to implement the extensions from [1]. The online implementation of a dense SLAM system called InfiniTAM [3] can be used as a starting point. The resulting pipeline doesn’t have to run in real-time, i.e. a CPU-based implementation is sufficient. For testing, existing datasets [4] can be used.

References:

[1] Gao, Wei, and Russ Tedrake. "Surfelwarp: Efficient non-volumetric single view dynamic reconstruction." Robotics: Science and Systems. 2018.

[2] Keller, Maik, et al. "Real-time 3d reconstruction in dynamic scenes using point-based fusion." 3D Vision-3DV 2013, 2013 International Conference on. IEEE, 2013.

[3] Prisacariu, Victor Adrian, et al. "InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure." arXiv preprint arXiv:1708.00783 (2017).

[4] Innmann, Matthias, et al. "VolumeDeform: Real-time volumetric non-rigid reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.

Sandro Lombardi <[email protected]>

Required: C++

Image from [1]

3D Vision 2019

Goal:


Description:

Learning to Reconstruct 3D Meshes with only 2D Supervision

Implement a deep learning method for reconstructing 3D meshes from 2D images

In this project we want to learn to reconstruct 3D meshes from 2D images without using the ground truth mesh during the training stage. Recent work [1] has proposed a promising method to achieve this: it exploits shading information by means of a differentiable renderer.

The goal is to implement the method introduced by [1]. Code for a few building blocks is available online (differentiable renderer [2][3], variational autoencoder [4]). For training and testing, existing datasets [5] can be used.

References:

[1] Henderson, Paul, and Vittorio Ferrari. "Learning single-image 3D reconstruction by generative modelling of shape, pose and shading." arXiv preprint arXiv:1901.06447 (2019).

[2] https://github.com/pmh47/dirt

[3] Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3d mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[4] https://github.com/pytorch/examples/tree/master/vae

[5] Chang, Angel X., et al. "Shapenet: An information-rich 3d model repository." arXiv preprint arXiv:1512.03012 (2015).

Sandro Lombardi <[email protected]>

Required: Python

Recommended: Experience with TensorFlow, PyTorch or other deep learning frameworks

Image from [1]

3D Vision 2019

Goal:


Description:

Multi-Device Multi-Session Mobile Mapping

Build a framework for building large-scale maps from HoloLens/ARKit/ARCore trajectories from multiple sessions using maplab

The goal of the project is to build a framework that records SLAM trajectories on mobile devices using HoloLens/ARKit/ARCore [1, 2]. These recorded SLAM trajectories from different sessions and/or users should then be merged into large-scale maps using the maplab library [3]. The project involves writing small ARKit/ARCore apps to record the data (HoloLens already has research mode [4]), which should then be parsed into a maplab-compatible format. The project should then use already available maplab features to merge these trajectories into consistent large-scale maps. It will be key to evaluate the performance of the system and analyze potential failure cases. If time permits, the students should develop improvements to maplab that overcome these limitations and failure cases. Ideally, 3 different groups tackle the problem, where each group focuses on one device platform (HoloLens/iOS/Android) and the groups share collected trajectories with each other to test cross-device map merging. Since the HoloLens group can skip developing an app, they should instead try to augment the maps with dense depth data from the built-in Kinect sensor of the HoloLens.

[1] https://developer.apple.com/arkit/

[2] https://developers.google.com/ar/

[3] https://github.com/ethz-asl/maplab

[4] https://github.com/Microsoft/HoloLensForCV

Johannes Schönberger <[email protected]>

C++, iOS or Android dev (optional), ARKit/ARCore capable phones

3D Vision 2019

Goal:


Description:

Projected Virtual Windows

Using a mobile projector and cameras, project a viewer-position dependent image of a 3D scene onto a surface, creating the illusion of a virtual window.

The goal of this project is to combine projection mapping with head tracking.

Using the projector, an image is shown on a flat surface. This surface might not necessarily be orthogonal to the projector. Using a passive setup such as stereo from webcams, the user’s head position is determined and an image is computed that creates the illusion of a virtual window when viewed from the user’s position.

Challenges include the on-the-fly calibration of the projection mapping and the tracking of the user’s head, as well as registering those coordinate systems to each other and rendering an appropriate image for projection.

Daniel Thul <[email protected]>

C++, OpenGL

3D Vision 2019

Goal:


Description:

HoloLens Obstacle Avoidance

Develop a guidance system that uses the onboard sensors of the HoloLens to identify obstacles and directs the user to avoid them using visual cues as a guide.

The onboard sensing capabilities of the HoloLens could enable the device to warn the user of potential dangers in their path, including static or dynamic obstacles. A motivating use case of an obstacle avoidance system would be to help guide persons with impaired vision through cluttered environments. The project would involve utilizing a combination of sensor modalities (e.g. depth and optical cameras) and vision algorithms (e.g. optical flow) to identify collision risks along the user’s current trajectory. The goal would be to guide the user away from these obstacles by leveraging existing research from the robotics and autonomous driving domains, and to provide visual feedback to the user for how to change course in the form of holograms.

Jeff Delmerico <[email protected]>

C++

3D Vision 2019

Goal:


Description:

Deep Learning of Graph Matching

Implement a network for the graph matching task that can learn deep potentials for the task.

Graph matching is a widely used tool in combinatorial optimization and computer vision. One particularly interesting case is matching between a pair of temporally or semantically related images. The graph matching problem is usually solved with an optimization algorithm. Recent developments in deep learning, however, make it possible not only to solve the problem but also to backpropagate through the whole algorithm. This allows learning better unary and pairwise potentials for the task in an end-to-end fashion, leading to superior performance.

The goal of this project is to (re-)implement a recently proposed method [1], apply modifications to the network and the connectivity of the graph, and evaluate it on sparse optical flow and semantic matching. If time permits, the network could also be extended to deliver dense optical flow via an inpainting framework, e.g. [2].

References:

[1] Deep Learning of Graph Matching, Zanfir et al. CVPR 2018

[2] Learning energy based inpainting for optical flow, Vogel et al. ACCV 2018

Christoph Vogel <[email protected]>

Python/C++, PyTorch

3D Vision 2019

Goal:


Description:

3D Pose Motion Representation for Action Recognition

Implement and evaluate an action recognition framework based on 3D human pose features

Action recognition is one of the most fundamental problems in computer vision. Human pose features provide valuable cues for recognizing human actions. To this end, [1] recently proposed an efficient motion descriptor based on 2D pose features. Specifically, the authors first run a state-of-the-art human pose estimator and extract heatmaps for the human joints in each frame. Then, a motion descriptor is obtained by temporally aggregating these probability maps. The resulting motion descriptor is trained to recognize actions and is able to provide state-of-the-art performance even with shallow neural network architectures.

While 2D pose features are helpful in estimating the human action, they lack depth information which is crucial for recognizing fine-grained actions. Therefore, we would like to account for the depth of human joints and extend this idea to the 3D setting, where 3D pose features are aggregated temporally within a volumetric representation. The resulting motion descriptor is then going to be trained to recognize human actions using different neural network architectures and compared against the state-of-the-art.
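To make the temporal aggregation step concrete, here is a minimal sketch in the spirit of [1] for a single joint, using two time-dependent channels: one that weights early frames and one that weights late frames. It is an illustration under assumptions, not the authors' implementation. The function name is hypothetical, and any normalization of the aggregated maps is omitted. For the 3D extension proposed here, the H x W grid would be replaced by a volumetric grid.

```python
def aggregate_heatmaps(heatmaps):
    """Aggregate T per-frame heatmaps of one joint into an H x W x 2 map.

    heatmaps: list of T frames, each an H x W nested list of probabilities.
    Channel 0 weights early frames, channel 1 weights late frames, so the
    result encodes both where the joint was and when it was there.
    """
    num_frames = len(heatmaps)
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    out = [[[0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for t, frame in enumerate(heatmaps):
        w_late = t / (num_frames - 1) if num_frames > 1 else 0.0
        w_early = 1.0 - w_late
        for y in range(height):
            for x in range(width):
                out[y][x][0] += w_early * frame[y][x]
                out[y][x][1] += w_late * frame[y][x]
    return out


# Toy example: a joint moving from left to right across a 1 x 3 image.
clip = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]
descriptor = aggregate_heatmaps(clip)
```

In the toy example the left pixel ends up only in the "early" channel, the right pixel only in the "late" channel, and the middle pixel is split evenly between both.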

[1] “PoTion: Pose Motion Representation for Action Recognition”, Choutas et al. CVPR 2018

Bugra Tekin <[email protected]>

Federica Bogo <[email protected]>

Taein Kwon <[email protected]>

Python, PyTorch

3D Vision 2019

Goal:


Description:

3D Hand Shape and Pose from Images in the Wild

Implement the deep learning framework proposed in [1]

Estimating 3D hand shape and pose from single RGB images in unconstrained environments is important in many applications.

A recently proposed approach [1] achieves impressive results by combining a deep encoder-decoder architecture with a parametric 3D model of the human hand.

In this project, we will implement the network proposed by the authors, experiment with training procedures, identify shortcomings and propose possible improvements.

[1] “3D Hand Shape and Pose from Images in the Wild”, Boukhayma et al. arXiv 2019

Taein Kwon <[email protected]>

Federica Bogo <[email protected]>

Bugra Tekin <[email protected]>

Python, PyTorch

3D Vision 2019

Goal:


Description:

FAID-D: A local descriptor robust to illumination changes

Implement and train a network for local descriptor learning

The Flash and Ambient Illuminations Dataset (FAID) [1] consists of aligned flash-only and ambient-only illumination image pairs captured with mobile devices. Using a detector such as DoG or Hessian-Affine, pairs of corresponding patches can be extracted from this dataset. These patches can in turn be used for training a local descriptor (e.g. using the pipeline introduced in [2]). Finally, the obtained descriptors can be compared to the state of the art on a patch benchmark [3] or even evaluated on real-life applications such as the challenging visual localization tasks of [4].

References:

[1] - A Dataset of Flash and Ambient Illumination Pairs from the Crowd, Aksoy et al., ECCV 2018

[2] - Working hard to know your neighbor's margins: Local descriptor learning loss, Mishchuk et al., NIPS 2017

[3] - HPatches: A benchmark and evaluation of handcrafted and learned local descriptors, Balntas et al., CVPR 2017

[4] - Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions, Sattler et al., CVPR 2018

Mihai Dusmanu <[email protected]>

Python (PyTorch), MATLAB (VLFeat)

Image from [2]

3D Vision 2019

Goal:


Description:

Scene structure based localization

Localize between 3D reconstructions created from ARKit/ARCore trajectories and images.

The goal of this project is to investigate localization using scene structure. The localization problem can be split into two parts. The first part involves finding two correctly corresponding scenes. The second part is about estimating a transformation between them. In this project we primarily focus on pose estimation based on 3D structure (point clouds). The scene structure will be created using recorded SLAM trajectories (ARKit [1]/ARCore [2]) and images from mobile phones. Therefore, an iOS/Android app capable of recording the data has to be written. The final reconstruction can then be created using COLMAP [3]. The basic algorithm for pose estimation will be a variant of ICP [5] (implementations can be found in [4]). With this framework the students will be able to conduct various experiments (dense vs. sparse scene reconstruction, different ICP algorithms). If time allows, semantic ICP algorithms can be explored [6].
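For readers unfamiliar with ICP, the sketch below shows the basic point-to-point loop: match each source point to its nearest target point, solve for the rigid transform that best aligns the matches, apply it, and repeat. To keep it short and dependency-free it works in 2D, where the optimal rotation has a closed form. The project itself targets 3D point clouds, where the rotation is typically obtained from an SVD and the matching uses a spatial index, as in the implementations of [4]. All names and the toy data are hypothetical.

```python
import math


def icp_2d(source, target, iterations=10):
    """Point-to-point ICP in 2D; returns the aligned copy of `source`."""
    pts = list(source)
    n = len(pts)
    for _ in range(iterations):
        # 1. Correspondences: brute-force nearest neighbour in the target.
        matches = [min(target, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
                   for p in pts]
        # 2. Centroids of both matched sets.
        cpx = sum(p[0] for p in pts) / n
        cpy = sum(p[1] for p in pts) / n
        cqx = sum(q[0] for q in matches) / n
        cqy = sum(q[1] for q in matches) / n
        # 3. Closed-form least-squares rotation angle for centred 2D pairs.
        s_dot = s_cross = 0.0
        for p, q in zip(pts, matches):
            ax, ay = p[0] - cpx, p[1] - cpy
            bx, by = q[0] - cqx, q[1] - cqy
            s_dot += ax * bx + ay * by
            s_cross += ax * by - ay * bx
        theta = math.atan2(s_cross, s_dot)
        c, s = math.cos(theta), math.sin(theta)
        # 4. Rotate about the source centroid, then move onto the target centroid.
        pts = [(c * (p[0] - cpx) - s * (p[1] - cpy) + cqx,
                s * (p[0] - cpx) + c * (p[1] - cpy) + cqy) for p in pts]
    return pts


# Toy data: the target is the source rotated by 0.1 rad and slightly shifted.
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0), (2.0, 3.0)]
angle, shift = 0.1, (0.1, -0.05)
target = [(math.cos(angle) * x - math.sin(angle) * y + shift[0],
           math.sin(angle) * x + math.cos(angle) * y + shift[1]) for x, y in source]
aligned = icp_2d(source, target)
```

ICP only converges to the correct pose when the initial misalignment is small, as in this toy example. This is one reason why the first part of the problem, finding correctly corresponding scenes, matters.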

[1] https://developer.apple.com/arkit/

[2] https://developers.google.com/ar/

[3] https://colmap.github.io/

[4] http://pointclouds.org/

[5] https://en.wikipedia.org/wiki/Iterative_closest_point

[6] http://bmvc2018.org/contents/papers/1073.pdf

Lukas [email protected]

C++, iOS or Android dev, ARKit/ARCore capable phones

3D Vision 2019

Goal:


Description:

HoloLens Robot Controller

Implementing a Robot Control System using HoloLens

The Microsoft HoloLens is equipped with a broad range of sensors suitable for 3D localization and gesture classification, including an IMU, a depth sensor, an RGB camera, and four grayscale cameras. Similar sensor suites are also commonly used in robotics for localization and obstacle avoidance.

In this project, we will make a basic interaction system that can control the Trimbot, a gardening robot platform running a visual SLAM system, using HoloLens.

This includes:
(1) Setting up the communication between HoloLens and Trimbot using available libraries.
(2) Aligning the respective maps and coordinate systems and maintaining this alignment through map updates.
(3) Displaying the robot's map on top of the real environment in HoloLens.
(4) Classifying the user's gestures.
(5) Sending control information from HoloLens to Trimbot.

Taein Kwon <[email protected]>

Marcel Geppert <[email protected]>

Jeff Delmerico <[email protected]>

C# (Unity), ROS, C++

3D Vision 2019

Goal:


Description:

Depth map fusion with HoloLens

Depth map fusion with HoloLens

Thanks to the research mode of the HoloLens, it is possible to get direct access to the raw data obtained from the time-of-flight (ToF) camera. With this modality the accuracy is much higher, but the drawback is an increased amount of unfiltered data.

The goal of this project is to implement an algorithm that fuses multiple depth maps into a single point cloud by checking for visibility, geometric, and appearance consistency on the image plane.
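One common way to express the visibility and geometric checks is to reproject a 3D point from one depth map into a neighbouring view and compare its depth there with the depth stored at that pixel. The sketch below covers only this comparison. The function name, the returned labels, and the 1% relative tolerance are assumptions made for illustration, and the appearance check is not covered.

```python
def classify_depth_observation(point_depth, map_depth, rel_tol=0.01):
    """Compare a reprojected point with a neighbouring depth map.

    point_depth: depth of the 3D point in the neighbouring camera's frame.
    map_depth:   depth stored in the neighbouring depth map at the pixel
                 the point projects to (<= 0 means no valid measurement).
    """
    if map_depth <= 0.0:
        return "unknown"
    rel_diff = (point_depth - map_depth) / map_depth
    if abs(rel_diff) <= rel_tol:
        return "consistent"  # both views agree on the surface
    if rel_diff > 0.0:
        return "occluded"  # the point lies behind the surface seen here
    return "free_space_violation"  # the point lies in space seen as empty


def count_support(observations, rel_tol=0.01):
    """Number of neighbouring views that agree with the point."""
    return sum(
        classify_depth_observation(p, m, rel_tol) == "consistent"
        for p, m in observations
    )


# (point_depth, map_depth) pairs for one point seen in four other views.
support = count_support([(2.0, 2.01), (2.0, 1.99), (3.0, 2.0), (1.0, 0.0)])
```

A fusion step could then keep a point only if enough views support it and few report free-space violations, which is how the output can become more compact and consistent than the raw depth maps.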

The resulting point cloud will contain the same amount of 3D information but in a more compact and consistent representation.

Silvano Galliani <[email protected]>

C++

3D Vision 2019

Goal:


Description:

Incremental SfM for 1D Radial Cameras

Implement and evaluate an incremental SfM pipeline for 1D radial cameras

In this project the goal is to implement a simple incremental Structure-from-Motion pipeline for the 1D radial camera model. In contrast to the pinhole camera model, which projects 3D points to points in the image plane, the radial camera model instead projects 3D points onto radial lines. Since the camera model only considers the direction of the projection, it becomes invariant to changes in focal length as well as radial distortion. Unfortunately, it also becomes invariant to translation along the principal axis, which makes one degree of freedom in the translation unobservable. Since the initialization is trickier for radial cameras (requiring 4 cameras instead of 2), we will start by assuming pinhole camera models for the initial pair.
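The invariances described above can be checked numerically with a small sketch. It projects a point with a pinhole model plus a single radial distortion coefficient (an assumed, simplified distortion model with the principal point at the origin) and then keeps only the orientation of the radial line through the projected pixel. The function names and numeric values are hypothetical.

```python
import math


def project_pinhole_distorted(point, focal, k1):
    """Project a camera-frame 3D point using focal length and one radial
    distortion coefficient k1; the principal point is at the origin."""
    x, y = point[0] / point[2], point[1] / point[2]
    r2 = x * x + y * y
    scale = focal * (1.0 + k1 * r2)  # assumed positive for valid points
    return (scale * x, scale * y)


def radial_line_angle(pixel):
    """Orientation of the line through the principal point and the pixel."""
    return math.atan2(pixel[1], pixel[0])


point = (0.3, -0.2, 2.0)
angle_a = radial_line_angle(project_pinhole_distorted(point, focal=500.0, k1=0.0))
# A different focal length and distortion give a different pixel, same line.
angle_b = radial_line_angle(project_pinhole_distorted(point, focal=800.0, k1=-0.2))
# Moving the point along the principal axis does not change the line either.
moved = (0.3, -0.2, 5.0)
angle_c = radial_line_angle(project_pinhole_distorted(moved, focal=500.0, k1=0.0))
```

All three angles agree up to floating-point error. This is the benefit of the model (no focal length or distortion to estimate) and also its cost (depth along the principal axis is unobservable).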

References:

Thirthala & Pollefeys, Radial Multi-focal Tensors, IJCV’12

Kim et al., Multi-view 3D reconstruction from uncalibrated radially-symmetric cameras, ICCV’13

Kukelova et al., Real-time solution to the absolute pose problem with unknown radial distortion and focal length, ICCV’13

Camposeco et al., Non-Parametric Structure-Based Calibration of Radially Symmetric Cameras, ICCV’15

Viktor Larsson <[email protected]>

C++

Some familiarity with SfM

3D Vision 2019

Goal:


Description:

Deep Keypoint Detector

Implement and evaluate trained keypoint detector(s) using a deep ranking function

The goal is to implement and evaluate the keypoint detector using a variety of deep network structures and pre-trained models. An additional task will be to train several independent detectors that both minimize the ranking cost function and maximize the distances between them.

References:

Savinov et al., Quad-networks: unsupervised learning to rank for interest point detection, CVPR 2017

Lubor Ladicky <[email protected]>

C++, any deep learning framework

3D Vision 2019

Goal:


Description:

Motion blur aware camera pose tracking

Implement a coarse-to-fine camera pose tracker which is robust to image motion blur

A camera pose tracker is usually the front end of a visual odometry (VO) algorithm. Most existing works assume that the input images to VO are sharp. However, images are easily blurred if the camera moves too fast during a long exposure, which in turn causes the VO to fail.

In this project, we plan to investigate and implement an efficient motion blur aware camera pose tracker. To make the problem more tractable, we assume that the reference image is sharp and only the current image is motion blurred. Furthermore, we assume that the depth map corresponding to the reference image is already known. All required datasets can be generated with a simulation tool, which is already set up for you.

Peidong Liu <[email protected]>

PyTorch

3D Vision 2019

Goal:


Description:

MegaPatches: a dataset for training or benchmarking local descriptors

Construct a dataset of patches from large-scale 3D reconstructions

The first step of this project is to extract a new dataset for descriptor training / benchmarking from the 196 scenes of the MegaDepth [1] dataset, reconstructed using the COLMAP SfM & MVS pipeline [2]. One way to do this is by warping keypoints from one image to nearby images using the estimated camera parameters and dense depth maps (see [3]). The final objective would be to compare different state-of-the-art descriptors on this dataset (e.g. using the metrics presented in [4]).
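The keypoint warping mentioned above can be sketched as follows: back-project the keypoint with its depth, move the 3D point into the other camera's frame, and project it again. The sketch assumes undistorted pinhole cameras with intrinsics given as `(fx, fy, cx, cy)` and a relative pose `(R_ab, t_ab)` that maps coordinates from camera A to camera B. These conventions and all names are assumptions for illustration and are not taken from the COLMAP or MegaDepth file formats.

```python
def warp_keypoint(u, v, depth, K_a, K_b, R_ab, t_ab):
    """Warp pixel (u, v) with known depth from image A into image B.

    K_a, K_b: intrinsics as (fx, fy, cx, cy).
    R_ab:     3 x 3 rotation (nested lists), camera A -> camera B.
    t_ab:     translation (3 values), camera A -> camera B.
    Returns the pixel in B, or None if the point is behind camera B.
    """
    fx, fy, cx, cy = K_a
    # Back-project the pixel to a 3D point in camera A's frame.
    point_a = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Transform the point into camera B's frame.
    point_b = [sum(R_ab[i][j] * point_a[j] for j in range(3)) + t_ab[i]
               for i in range(3)]
    if point_b[2] <= 0.0:
        return None
    fx, fy, cx, cy = K_b
    return (fx * point_b[0] / point_b[2] + cx, fy * point_b[1] / point_b[2] + cy)


K = (500.0, 500.0, 320.0, 240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Camera B sits 0.1 units to the right of A, so points shift left in B's frame.
warped = warp_keypoint(320.0, 240.0, 2.0, K, K, identity, (-0.1, 0.0, 0.0))
behind = warp_keypoint(320.0, 240.0, 2.0, K, K, identity, (0.0, 0.0, -3.0))
```

Before accepting a warped keypoint as a correspondence, a pipeline would presumably also compare the point's depth in B with B's own depth map to reject occluded points.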

References:

[1] - MegaDepth: Learning Single-View Depth Prediction from Internet Photos, Li and Snavely, CVPR 2018

[2] - COLMAP - https://colmap.github.io

[3] - Brown Patches dataset - http://matthewalunbrown.com/patchdata/patchdata.html

[4] - HPatches: A benchmark and evaluation of handcrafted and learned local descriptors, Balntas et al., CVPR 2017

Mihai Dusmanu <[email protected]>

MATLAB (VLFeat), C++ understanding

Image from [3]

3D Vision 2019

Goal:


Description:

Learning to propagate variational methods

Train a network on propagation for semantic scene completion

Variational methods in computer vision refer to those methods that solve problems by posing them as functional minimizations. Such techniques can be applied for image denoising, inpainting, segmentation… In our case, we are interested in applications to semantic 3D reconstructions.

The minimization of such functionals relies on iterative algorithms, such as primal-dual, which reduce the given objective at every step until convergence. Recent work has shown that these algorithms can be implemented as neural networks (referred to here as variational networks). The main appeal of such networks is that they rely on few parameters.

For this to work, the number of iterations in the minimization algorithm must be fixed during training. Unfortunately, unlike true variational methods, when running the network for inference, adding more iterations does not improve the results, but often degrades them.

In this project, we want to explore methods for training a variational network that improves when more iterations are added. To do so, we will try implementing a different loss function that focuses more on the functional minimization, and try to use synthetic ground truth data that corresponds to different steps of the minimization.

Ian [email protected]

Python, convex optimization

3D Vision 2019

Goal:


Description:

Soccer on HoloLens Demo

Create a demo program to watch soccer in 3D on HoloLens.

Starting from an existing work [1], the goal of this project is to create a running demo program to watch a soccer game in 3D on HoloLens. Further, the original framework can be improved in several ways. Most notably, the original work only uses a single camera view, and an extension to multiple views would increase the quality of the extracted 3D surfaces for the players. Other possible improvements (e.g. output quality improvements, synchronized multi-device viewing) will be discussed and selected during the course of the project.

[1] Konstantinos Rematas, Ira Kemelmacher-Shlizerman, Brian Curless, Steve Seitz, Soccer On Your Tabletop, CVPR 2018

http://grail.cs.washington.edu/projects/soccer/

https://github.com/krematas/soccerontable

Martin [email protected]

Python, C#, Unity3D, HoloLens

3D Vision 2019

Goal:


Description:

Hybrid 2.5D / 3D Large-Scale Urban Reconstruction

Create a deep neural network approach that estimates which parts of the scene require full 3D reconstruction vs. simple 2.5D depth maps.

While 2.5D depth maps are highly efficient for urban reconstruction, they are unable to capture sophisticated building architecture and especially overhanging structures like roof overhangs, road overpasses, bridges, etc. On the other hand, we are able to reconstruct such structures in great detail with volumetric (voxel-based) approaches. However, these approaches are very resource demanding and do not scale to large urban areas.

Luckily, in the majority of cases a 2.5D reconstruction is sufficient to obtain high quality surface geometry. Therefore, the best scalable reconstruction method should be hybrid: it carefully selects the areas that require expensive full 3D and computes everything else in 2.5D. The main goal of this work is to create a selection algorithm that decides which reconstruction method should be applied where.

[1] Learning Priors for Semantic 3D Reconstruction, Cherabier et al., ECCV, 2018

[2] Olaf Ronneberger, Philipp Fischer, Thomas Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015

Martin [email protected]

Python

3D Vision 2019

Goal:


Description:

Create a synthetic motion blur dataset

Create a synthetic motion-blurred dataset with the Unreal game engine and benchmark existing deep-learning-based motion deblurring methods

Motion-blurred images affect many computer vision tasks, such as image-based motion estimation, localization, segmentation, etc. In this project, you are required to create a synthetic motion-blurred dataset based on the Unreal game engine. The software framework is already being set up and will be provided for your convenience.

A large dataset with varying foreground objects, backgrounds and camera motion should be created based on the provided software framework. Furthermore, you are also required to benchmark existing deep learning based motion deblurring algorithms (source code will be provided) with the generated dataset.

Peidong LiuCAB [email protected]

Python, C++
