3D Vision 2019
Multi-camera DeepTAM
Goal: Generalize DeepTAM to use a multi-camera setup.
Description:
Visual odometry methods based on classical 3D geometry have been around for years, using either indirect feature matching or direct minimization of the photometric error. Lately, learning-based methods that combine matching and geometry estimation in a single network have shown impressive results. Classical methods have been shown to benefit from the extended field of view provided by multiple cameras, but such setups have mostly been ignored by current learning-based methods.
The goal of this project is to extend the existing DeepTAM pipeline to leverage a multi-camera setup with known geometry.
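With a rigid rig of known geometry, the motion of every camera follows from a single rig motion, which is one way a multi-camera pipeline could tie the per-camera estimates together. The sketch below is a minimal illustration under stated assumptions, not DeepTAM code: poses are 4x4 homogeneous transforms, `T_rig_cam` (a hypothetical name) maps camera coordinates into the rig frame, and a "motion" maps coordinates at time t+1 into time t.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def invert_transform(T):
    """Invert a rigid transform using R^T instead of a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    return make_transform(R.T, -R.T @ t)

def camera_motion_from_rig_motion(rig_motion, T_rig_cam):
    """Relative motion of one camera (frame t+1 -> frame t), given the rig motion
    and the fixed camera-to-rig extrinsics: T_cam_rig @ M_rig @ T_rig_cam."""
    return invert_transform(T_rig_cam) @ rig_motion @ T_rig_cam

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical two-camera rig: camera 0 at the rig origin, camera 1 rotated
# 90 degrees about z and offset by 0.2 m along x.
extrinsics = [np.eye(4), make_transform(rot_z(np.pi / 2), [0.2, 0.0, 0.0])]
rig_motion = make_transform(rot_z(0.1), [0.5, 0.0, 0.0])
camera_motions = [camera_motion_from_rig_motion(rig_motion, E) for E in extrinsics]
```

Conversely, an estimate from any single camera can be mapped back to a rig motion with `T_rig_cam @ M_cam @ inv(T_rig_cam)`, so all cameras constrain the same six degrees of freedom.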
References:
[1] H. Zhou, B. Ummenhofer, and T. Brox. DeepTAM: Deep Tracking and Mapping. ECCV, 2018.
Supervisors: Marcel Geppert <[email protected]>, Viktor Larsson <[email protected]>
Requirements / Tools: Python, TensorFlow (required)
Benchmarking local features for multi-camera garden SLAM
Goal: Evaluate the impact of different local feature types on the runtime and accuracy of SLAM in garden environments.
Description:
Many visual SLAM systems with different approaches are available today, but most of them fail when moving from structured, man-made environments to cluttered environments such as gardens or open nature. One way to improve robustness is to enlarge the field of view, using either wide-angle lenses or multiple cameras, which increases the amount of available information. However, processing this additional data in time becomes increasingly difficult.
The goal of this project is to replace the SIFT features currently used in our SLAM pipeline with several alternative feature types and to evaluate their impact on runtime, on recognition and matching, and finally on the resulting pose accuracy.
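A benchmark of this kind needs a matching step that is identical for every feature type, plus timing. The sketch below assumes float descriptors compared with the Euclidean distance, brute-force nearest neighbours and Lowe's ratio test; it is illustrative only and not the pipeline's actual matcher (binary descriptors such as ORB would need a Hamming distance instead).

```python
import time
import numpy as np

def match_ratio_test(desc_a, desc_b, ratio=0.8):
    """Brute-force nearest-neighbour matching with Lowe's ratio test.
    desc_a: (N, D), desc_b: (M, D) float descriptors with M >= 2.
    Returns a list of (index_a, index_b) pairs."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = (np.sum(desc_a**2, axis=1)[:, None]
          + np.sum(desc_b**2, axis=1)[None, :]
          - 2.0 * desc_a @ desc_b.T)
    d = np.sqrt(np.maximum(d2, 0.0))
    nn = np.argsort(d, axis=1)[:, :2]          # best and second-best candidates
    rows = np.arange(len(desc_a))
    best, second = d[rows, nn[:, 0]], d[rows, nn[:, 1]]
    keep = best < ratio * second               # reject ambiguous matches
    return [(int(i), int(nn[i, 0])) for i in np.nonzero(keep)[0]]

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Running the same harness on descriptors from each detector/descriptor combination gives comparable match counts and timings before the matches are passed on to pose estimation.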
Supervisor: Marcel Geppert <[email protected]>
Requirements / Tools: C++, OpenCV, MATLAB
3D Occupancy Prediction for Autonomous Driving
Goal: Implement a 3D occupancy prediction method for autonomous driving.
Description:
Because most real environments, such as urban streets or crowded areas, are complex and dynamic, predicting future 3D occupancy from temporal dependencies is very important for autonomous driving.
There are two main techniques for modeling short-term occupancy and making predictions into the future: the dynamic Gaussian process (DGP) map [1] and the spatio-temporal Hilbert map (STHM) [2]. However, both methods only address the 2D case. In this project, we plan to implement a method for 3D occupancy prediction for autonomous driving.
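To make the Hilbert-map idea concrete in 3D, the sketch below treats time as a fourth input coordinate: each (x, y, z, t) sample is mapped to RBF kernel features against fixed inducing points, and a logistic regression predicts occupancy. This is a simplification for illustration, assuming hand-placed inducing points and full-batch gradient descent; it is not the DGP or STHM formulation of [1] or [2].

```python
import numpy as np

def rbf_features(points, inducing, length_scale):
    """RBF kernel features of (x, y, z, t) points w.r.t. fixed inducing points."""
    d2 = np.sum((points[:, None, :] - inducing[None, :, :]) ** 2, axis=2)
    return np.exp(-0.5 * d2 / length_scale**2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_occupancy(points, labels, inducing, length_scale=1.0, lr=0.5, steps=500):
    """Logistic regression on kernel features (labels: 1 occupied, 0 free),
    trained with full-batch gradient descent on the cross-entropy loss."""
    phi = rbf_features(points, inducing, length_scale)
    w = np.zeros(phi.shape[1])
    for _ in range(steps):
        p = sigmoid(phi @ w)
        w -= lr * phi.T @ (p - labels) / len(labels)   # cross-entropy gradient
    return w

def predict_occupancy(query, w, inducing, length_scale=1.0):
    """Occupancy probability of (x, y, z, t) query points."""
    return sigmoid(rbf_features(query, inducing, length_scale) @ w)
```

Querying the trained model at a future t is where prediction would happen; how well such a naive time coordinate extrapolates is exactly the kind of question the dedicated spatio-temporal models address.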
References:
[1] Callaghan et al. Gaussian process occupancy maps for dynamic environments. Experimental Robotics, 2015.
[2] Senanayake et al. Spatiotemporal Hilbert maps for continuous occupancy representation in dynamic environments. NIPS, 2016.
Supervisor: Zhaopeng Cui <[email protected]>
Requirements / Tools: C++ / MATLAB (required)
Fast Dense Semantic Fusion for 3D Semantic Reconstruction
Goal: Implement real-time dense semantic fusion for large-scale semantic scene reconstruction.
Description:
Making intelligent decisions in autonomous driving requires real-time dense 3D semantic mapping. As in geometric mapping, the semantic segmentation of each frame has to be fused to obtain a global semantic map. Vineet et al. [1] proposed an efficient mean-field inference algorithm for large-scale dense semantic fusion.
This project aims to implement a real-time dense semantic fusion algorithm for large-scale semantic reconstruction based on [1]. The existing real-time 3D reconstruction framework [2] will be used for this project, and the KITTI stereo dataset will be used for testing.
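The per-frame fusion step can be illustrated by accumulating class log-probabilities per voxel, which corresponds to treating observations as independent. This is a minimal sketch under that assumption, using a hypothetical `SemanticVoxelMap` keyed by integer voxel coordinates; it covers only the unary accumulation and not the mean-field CRF inference of [1] or InfiniTAM's voxel hashing.

```python
import numpy as np
from collections import defaultdict

class SemanticVoxelMap:
    """Accumulates per-voxel class log-probabilities across frames
    (independent-observation fusion; label = argmax of the summed log-probs)."""

    def __init__(self, num_classes, voxel_size):
        self.voxel_size = voxel_size
        self.log_probs = defaultdict(lambda: np.zeros(num_classes))

    def _key(self, point):
        return tuple(np.floor(np.asarray(point) / self.voxel_size).astype(int))

    def integrate(self, points, class_probs):
        """points: (N, 3) world coordinates of back-projected pixels;
        class_probs: (N, C) per-pixel softmax output of the segmentation network."""
        for point, probs in zip(points, class_probs):
            self.log_probs[self._key(point)] += np.log(np.clip(probs, 1e-6, 1.0))

    def label(self, point):
        """Fused label of the voxel containing point, or None if never observed."""
        key = self._key(point)
        if key not in self.log_probs:
            return None
        return int(np.argmax(self.log_probs[key]))
```

Summing log-probabilities lets later frames overrule earlier mistakes; a CRF such as the one in [1] would additionally enforce consistency between neighbouring voxels.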
References:
[1] Vineet et al. Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction, ICRA, 2015.
[2] Prisacariu et al. InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure. arXiv, 2017.
Supervisor: Zhaopeng Cui <[email protected]>
Requirements / Tools: C++ and CUDA (required)
SurfelWarp: Efficient Non-Volumetric Single View Dynamic Reconstruction
Goal: Implement a dense SLAM system for the reconstruction of non-rigidly deforming scenes.
Description:
Reconstructing non-rigidly deforming scenes is a challenging problem. Recent work [1] proposed a surfel-based dense SLAM system for such scenes. The approach builds on [2], for which an implementation working on static scenes is available online [3].
In this work, the goal is to implement the extensions from [1]. The online implementation of the dense SLAM system InfiniTAM [3] can be used as a starting point. The resulting pipeline does not have to run in real time, i.e., a CPU-based implementation is sufficient. For testing, existing datasets [4] can be used.
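The core of point-based fusion [2] is merging a new measurement into an existing surfel by confidence-weighted averaging. The sketch below assumes a hypothetical `Surfel` record and keeps the smaller radius on merge; that radius rule is an assumption and may differ from [1] and [2]. In SurfelWarp the surfels would first be warped by the estimated deformation before such a merge.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Surfel:
    position: np.ndarray   # (3,) world position
    normal: np.ndarray     # (3,) unit normal
    radius: float
    confidence: float

def fuse(surfel, position, normal, radius, weight=1.0):
    """Merge a new measurement into an existing surfel by confidence-weighted
    averaging of position and normal; the confidence accumulates."""
    c = surfel.confidence
    new_pos = (c * surfel.position + weight * np.asarray(position)) / (c + weight)
    n = c * surfel.normal + weight * np.asarray(normal)
    return Surfel(
        position=new_pos,
        normal=n / np.linalg.norm(n),          # re-normalize the averaged normal
        radius=min(surfel.radius, radius),     # assumption: keep the finer radius
        confidence=c + weight,
    )
```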
References:
[1] Gao, Wei, and Russ Tedrake. "Surfelwarp: Efficient non-volumetric single view dynamic reconstruction." Robotics: Science and Systems. 2018.
[2] Keller, Maik, et al. "Real-time 3d reconstruction in dynamic scenes using point-based fusion." 3D Vision-3DV 2013, 2013 International Conference on. IEEE, 2013.
[3] Prisacariu, Victor Adrian, et al. "InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure." arXiv preprint arXiv:1708.00783 (2017).
[4] Innmann, Matthias, et al. "VolumeDeform: Real-time volumetric non-rigid reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
Supervisor: Sandro Lombardi <[email protected]>
Requirements / Tools: C++ (required)
Image from [1]
Learning to Reconstruct 3D Meshes with only 2D Supervision
Goal: Implement a deep learning method for reconstructing 3D meshes from 2D images.
Description:
In this project, we want to learn to reconstruct 3D meshes from 2D images without using ground-truth meshes during training. Recent work [1] proposed a promising method to achieve this: it exploits shading information by means of a differentiable renderer.
The goal is to implement the method introduced in [1]. Code for a few building blocks is available online (differentiable renderer [2][3], variational autoencoder [4]). For training and testing, existing datasets [5] can be used.
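The shading cue can be illustrated with a simple Lambertian model evaluated per mesh face: the rendered intensity depends on the face normal, so comparing rendered and observed shading constrains the shape. The sketch below is a forward pass only, assuming a single directional light plus an ambient term; the lighting model actually used in [1] may differ, and a differentiable renderer such as [2] or [3] is what makes these quantities trainable.

```python
import numpy as np

def face_normals(vertices, faces):
    """Unit normals of triangles (counter-clockwise winding).
    vertices: (V, 3) floats; faces: (F, 3) vertex indices."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def lambertian_shading(vertices, faces, light_dir, albedo=1.0, ambient=0.2, diffuse=0.8):
    """Per-face intensity = albedo * (ambient + diffuse * max(0, n . l))."""
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)
    cos = np.clip(face_normals(vertices, faces) @ l, 0.0, None)   # back-facing -> 0
    return albedo * (ambient + diffuse * cos)
```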
References:
[1] Henderson, Paul, and Vittorio Ferrari. "Learning single-image 3D reconstruction by generative modelling of shape, pose and shading." arXiv preprint arXiv:1901.06447 (2019).
[2] https://github.com/pmh47/dirt
[3] Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3d mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[4] https://github.com/pytorch/examples/tree/master/vae
[5] Chang, Angel X., et al. "Shapenet: An information-rich 3d model repository." arXiv preprint arXiv:1512.03012 (2015).
Supervisor: Sandro Lombardi <[email protected]>
Requirements / Tools: Python (required); experience with TensorFlow, PyTorch, or other deep learning frameworks (recommended)
Image from [1]
Multi-Device Multi-Session Mobile Mapping
Goal: Build a framework for building large-scale maps from HoloLens/ARKit/ARCore trajectories recorded in multiple sessions, using maplab.
Description:
The goal of the project is to build a framework that records SLAM trajectories on mobile devices using HoloLens/ARKit/ARCore [1, 2]. These recorded SLAM trajectories from different sessions and/or users should then be merged into large-scale maps using the maplab library [3]. The project involves writing small ARKit/ARCore apps to record the data (the HoloLens already provides research mode [4]); the data should then be parsed into a maplab-compatible format. The project should then use existing maplab features to merge these trajectories into consistent large-scale maps. It will be key to evaluate the performance of the system and to analyze potential failure cases. If time permits, the students should develop improvements to maplab that overcome these limitations and failure cases. Ideally, three groups tackle the problem, each focusing on one device platform (HoloLens/iOS/Android) and sharing their collected trajectories with the others to test cross-device map merging. Since the HoloLens group can skip developing an app, it should instead try to augment the maps with dense depth data from the built-in Kinect sensor of the HoloLens.
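Part of parsing the recordings is reconciling camera conventions. ARKit's camera space has x to the right, y up and the camera looking along -z, whereas many vision tools use an OpenCV-style frame (x right, y down, looking along +z). The sketch below flips the y and z camera axes of a 4x4 camera-to-world pose; that the importer expects OpenCV-style poses, and the CSV row layout in the hypothetical `to_pose_row`, are assumptions to check against maplab's documentation (ARCore poses use a similar OpenGL-style convention, also worth verifying).

```python
import numpy as np

# Flip the camera's y and z axes: OpenGL-style (x right, y up, looking along -z)
# to OpenCV-style (x right, y down, looking along +z).
GL_TO_CV = np.diag([1.0, -1.0, -1.0, 1.0])

def arkit_to_opencv_pose(T_world_cam_arkit):
    """Convert a 4x4 camera-to-world pose from ARKit's camera convention
    to an OpenCV-style camera-to-world pose (world frame unchanged)."""
    return np.asarray(T_world_cam_arkit) @ GL_TO_CV

def to_pose_row(timestamp, T_world_cam):
    """Hypothetical CSV row: timestamp, tx, ty, tz, then the row-major 3x3 rotation."""
    t = T_world_cam[:3, 3]
    R = T_world_cam[:3, :3]
    return ",".join(f"{v:.9f}" for v in [timestamp, *t, *R.ravel()])
```

Because the flip is a proper rotation (determinant +1), the converted matrices remain valid poses.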
References:
[1] https://developer.apple.com/arkit/
[2] https://developers.google.com/ar/
[3] https://github.com/ethz-asl/maplab
[4] https://github.com/Microsoft/HoloLensForCV
Supervisor: Johannes Schönberger <[email protected]>
Requirements / Tools: C++, iOS or Android development (optional), ARKit/ARCore-capable phones
Projected Virtual Windows
Goal: Using a mobile projector and cameras, project a viewer-position-dependent image of a 3D scene onto a surface, creating the illusion of a virtual window.
Description:
The goal of this project is to combine projection mapping with head tracking.
The projector shows an image on a flat surface, which is not necessarily orthogonal to the projector. A passive setup, such as stereo from webcams, determines the user's head position, and an image is computed that creates the illusion of a virtual window when viewed from the user's position.
Challenges include the on-the-fly calibration of the projection mapping and the tracking of the user’s head, as well as registering those coordinate systems to each other and rendering an appropriate image for projection.
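Rendering the view "through" the window for a tracked head position is an off-axis perspective projection whose frustum is defined by the eye and the window corners. The sketch below follows Kooima's generalized perspective projection and returns an OpenGL-style projection times view matrix; it assumes the window corners and the eye are already expressed in a common world frame, and it does not cover the projector-to-surface mapping.

```python
import numpy as np

def off_axis_projection(pa, pb, pc, pe, near, far):
    """Generalized perspective projection for a planar window.
    pa, pb, pc: lower-left, lower-right, upper-left window corners; pe: eye.
    Returns (4x4 projection @ view matrix, frustum extents (l, r, b, t))."""
    pa, pb, pc, pe = (np.asarray(p, dtype=float) for p in (pa, pb, pc, pe))
    vr = (pb - pa) / np.linalg.norm(pb - pa)     # window right axis
    vu = (pc - pa) / np.linalg.norm(pc - pa)     # window up axis
    vn = np.cross(vr, vu)
    vn /= np.linalg.norm(vn)                     # window normal, towards the viewer
    va, vb, vc = pa - pe, pb - pe, pc - pe
    d = -np.dot(va, vn)                          # eye-to-window-plane distance
    l = np.dot(vr, va) * near / d
    r = np.dot(vr, vb) * near / d
    b = np.dot(vu, va) * near / d
    t = np.dot(vu, vc) * near / d
    P = np.array([                               # glFrustum(l, r, b, t, near, far)
        [2 * near / (r - l), 0, (r + l) / (r - l), 0],
        [0, 2 * near / (t - b), (t + b) / (t - b), 0],
        [0, 0, -(far + near) / (far - near), -2 * far * near / (far - near)],
        [0, 0, -1, 0],
    ])
    M = np.eye(4)
    M[0, :3], M[1, :3], M[2, :3] = vr, vu, vn    # rotate world into window axes
    T = np.eye(4)
    T[:3, 3] = -pe                               # move the eye to the origin
    return P @ M @ T, (l, r, b, t)
```

As the head moves, only `pe` changes, and the frustum skews so that the window corners stay fixed in the image.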
Supervisor: Daniel Thul <[email protected]>
Requirements / Tools: C++, OpenGL
HoloLens Obstacle Avoidance
Goal: Develop a guidance system that uses the onboard sensors of the HoloLens to identify obstacles and directs the user around them using visual cues.
Description:
The onboard sensing capabilities of the HoloLens could enable the device to warn the user of potential dangers in their path, including static or dynamic obstacles. A motivating use case for an obstacle avoidance system would be to help guide persons with impaired vision through cluttered environments. The project would involve combining sensor modalities (e.g., depth and optical cameras) and vision algorithms (e.g., optical flow) to identify collision risks along the user's current trajectory. The goal would be to guide the user away from these obstacles by leveraging existing research from the robotics and autonomous driving domains, and to provide visual feedback, in the form of holograms, on how to change course.
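A simple first collision check is to look for depth points inside a corridor along the walking direction. The sketch below assumes a point cloud in a head-centred frame (e.g., back-projected from the depth camera) and a known heading; the corridor radius, range and minimum point count are hypothetical parameters, and dynamic obstacles would need temporal cues such as optical flow on top.

```python
import numpy as np

def first_obstacle_distance(points, heading, corridor_radius=0.4, max_range=3.0, min_points=20):
    """Distance to the nearest obstacle inside a cylindrical corridor along the
    user's heading, or None if the corridor is clear.
    points: (N, 3) in a head-centred frame; heading: 3-vector of walking direction."""
    h = np.asarray(heading, dtype=float)
    h = h / np.linalg.norm(h)
    along = points @ h                                       # distance ahead of the user
    lateral = np.linalg.norm(points - np.outer(along, h), axis=1)
    in_corridor = (along > 0) & (along < max_range) & (lateral < corridor_radius)
    if np.count_nonzero(in_corridor) < min_points:           # ignore isolated noise
        return None
    return float(np.min(along[in_corridor]))
```

The returned distance could drive the urgency of the holographic cue, for example its colour or size.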
Supervisor: Jeff Delmerico <[email protected]>
Requirements / Tools: C++
Deep Learning of Graph Matching
Goal: Implement a network for graph matching that learns deep potentials for the task.
Description:
Graph matching is a widely used formulation in combinatorial optimization and computer vision. A particularly interesting case is matching between a pair of temporally or semantically related images. Graph matching is usually solved with an optimization algorithm. Recent developments in deep learning, however, make it possible not only to solve the problem but also to backpropagate through the whole algorithm. This allows better unary and pairwise potentials to be learned end-to-end, yielding superior performance.
The goal of this project is to (re-)implement a recently proposed method [1], apply modifications to the network and to the connectivity of the graph, and evaluate it on sparse optical flow and semantic matching. If time permits, the network could also be extended to deliver dense optical flow via an inpainting framework, e.g., [1]'s successor work [2].
References:
[1] Deep Learning of Graph Matching, Zanfir et al. CVPR 2018
[2] Learning energy based inpainting for optical flow, Vogel et al. ACCV 2018
Supervisor: Christoph Vogel <[email protected]>
Requirements / Tools: Python/C++, PyTorch
3D Pose Motion Representation for Action Recognition
Goal: Implement and evaluate an action recognition framework based on 3D human pose features.
Description:
Action recognition is one of the most fundamental problems of computer vision. Human pose features provide valuable cues for recognizing human actions. To this end, [1] recently proposed an efficient motion descriptor based on 2D pose features. Specifically, the authors first run a state-of-the-art human pose estimator and extract heatmaps for the human joints in each frame. Then, a motion descriptor is obtained by temporally aggregating these probability maps. Networks trained on the resulting motion descriptor achieve state-of-the-art performance, even with shallow architectures.
While 2D pose features are helpful for estimating human actions, they lack depth information, which is crucial for recognizing fine-grained actions. Therefore, we would like to account for the depth of human joints and extend this idea to the 3D setting, where 3D pose features are aggregated temporally within a volumetric representation. The resulting motion descriptor will then be used to train different neural network architectures for action recognition, and the results will be compared against the state of the art.
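The temporal aggregation in [1] colorizes each frame's joint heatmap with time-dependent weights and sums over time. The sketch below implements the two-channel variant (one weight rising linearly over the clip, one falling) with per-joint max normalization; because it only assumes the spatial axes follow the joint axis, the same code accepts 2D (H, W) or volumetric (D, H, W) heatmaps, which is the extension proposed here. Other PoTion variants (more channels, additional normalized descriptors) are not reproduced.

```python
import numpy as np

def potion_colorization(heatmaps, eps=1e-8):
    """Temporal aggregation of per-joint heatmaps with C = 2 colour channels.
    heatmaps: (T, J, *spatial) with T >= 2; spatial is (H, W) for 2D or
    (D, H, W) for a volumetric variant. Returns (J, 2, *spatial)."""
    T = heatmaps.shape[0]
    o1 = np.arange(T) / (T - 1)            # weight rising linearly over time
    o2 = 1.0 - o1                          # weight falling linearly over time
    weights = np.stack([o1, o2], axis=1)   # (T, 2)
    # Sum over time of heatmap * colour weight, for both channels.
    u = np.einsum("tc,tj...->jc...", weights, heatmaps)
    spatial_axes = tuple(range(2, u.ndim))
    return u / (u.max(axis=spatial_axes, keepdims=True) + eps)
```

The output is a fixed-size tensor regardless of clip length, so a 3D CNN of modest depth could be trained on it directly.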
[1] “PoTion: Pose Motion Representation for Action Recognition”, Choutas et al. CVPR 2018
Supervisors: Bugra Tekin <[email protected]>, Federica Bogo <[email protected]>, Taein Kwon <[email protected]>
Requirements / Tools: Python, PyTorch
3D Hand Shape and Pose from Images in the Wild
Goal: Implement the deep learning framework proposed in [1].
Description:
Estimating 3D hand shape and pose from single RGB images in unconstrained environments is important in many applications.
A recently proposed approach [1] achieves impressive results by combining a deep encoder-decoder architecture with a parametric 3D model of the human hand.
In this project, we will implement the network proposed by the authors, experiment with training procedures, identify shortcomings and propose possible improvements.
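Approaches that regress a parametric hand model from images are commonly supervised with 2D keypoints by projecting the predicted 3D joints with a camera estimated by the network. The sketch below assumes a weak-perspective camera (orthographic projection, isotropic scale, 2D translation) applied to joints that are already rotated into the camera frame; the parametric hand model itself and the exact losses of [1] are not reproduced.

```python
import numpy as np

def weak_perspective_project(joints_3d, scale, translation):
    """Orthographic projection followed by isotropic scaling and 2D translation.
    joints_3d: (J, 3) in camera-aligned coordinates; returns (J, 2)."""
    return scale * joints_3d[:, :2] + np.asarray(translation)

def reprojection_loss(joints_3d, keypoints_2d, scale, translation, visibility=None):
    """Mean squared 2D distance between projected joints and annotated keypoints,
    optionally restricted to visible keypoints (visibility: (J,) of 0/1)."""
    residual = weak_perspective_project(joints_3d, scale, translation) - keypoints_2d
    sq = np.sum(residual**2, axis=1)
    if visibility is not None:
        return float(np.sum(sq * visibility) / max(np.sum(visibility), 1))
    return float(np.mean(sq))
```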
[1] “3D Hand Shape and Pose from Images in the Wild”, Boukhayma et al. arXiv 2019
Supervisors: Taein Kwon <[email protected]>, Federica Bogo <[email protected]>, Bugra Tekin <[email protected]>
Requirements / Tools: Python, PyTorch
FAID-D: A local descriptor robust to illumination changes
Goal: Implement and train a network for local descriptor learning.
Description:
The Flash and Ambient Illuminations Dataset (FAID) [1] consists of aligned flash-only and ambient-only illumination image pairs captured with mobile devices. Using a detector such as DoG or Hessian-Affine, pairs of corresponding patches can be extracted from this dataset. These patches can in turn be used to train a local descriptor (e.g., using the pipeline introduced in [2]). Finally, the resulting descriptors can be compared with the state of the art on a patch benchmark [3], or even evaluated in real-life applications such as the challenging visual localization tasks of [4].
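The loss of [2] mines the hardest negative within each batch of matching patch pairs. The sketch below computes that loss in NumPy for L2-normalized descriptors, where row i of `anchors` (e.g., from the flash image) matches row i of `positives` (e.g., from the ambient image); it omits refinements of the original implementation and is meant only to show the mining rule.

```python
import numpy as np

def hardnet_loss(anchors, positives, margin=1.0):
    """Hardest-in-batch triplet margin loss in the style of [2].
    anchors, positives: (N, D) L2-normalized descriptors, row i of each matching.
    The hardest negative for pair i is the closest non-matching descriptor in
    the anchor's row or the positive's column of the distance matrix."""
    d = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)  # (N, N)
    pos = np.diag(d)
    off = d + np.eye(len(d)) * 1e6                  # mask the matching pairs
    hardest_neg = np.minimum(off.min(axis=1), off.min(axis=0))
    return float(np.mean(np.maximum(0.0, margin + pos - hardest_neg)))
```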
References:
[1] - A Dataset of Flash and Ambient Illumination Pairs from the Crowd, Aksoy et al., ECCV 2018
[2] - Working hard to know your neighbor's margins: Local descriptor learning loss, Mishchuk et al., NIPS 2017
[3] - HPatches: A benchmark and evaluation of handcrafted and learned local descriptors, Balntas et al., CVPR 2017
[4] - Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions, Sattler et al., CVPR 2018
Supervisor: Mihai Dusmanu <[email protected]>
Requirements / Tools: Python (PyTorch), MATLAB (VLFeat)
Image from [2]
Scene structure based localization
Goal: Localize between 3D reconstructions created from ARKit/ARCore trajectories and images.
Description:
The goal of this project is to investigate localization using scene structure. The localization problem can be split into two parts: the first involves finding two correctly corresponding scenes, and the second is estimating the transformation between them. In this project, we primarily focus on pose estimation based on 3D structure (point clouds). The scene structure will be created from recorded SLAM trajectories (ARKit [1] / ARCore [2]) and images from mobile phones. Therefore, an iOS/Android app capable of recording the data has to be written. The final reconstruction can then be created using COLMAP [3]. The basic algorithm for pose estimation will be a variant of ICP [5] (implementations can be found in [4]). With this framework, the students will be able to conduct various experiments (dense vs. sparse scene reconstruction, different ICP algorithms). If time allows, semantic ICP algorithms can be explored [6].
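For orientation, the basic point-to-point ICP loop alternates nearest-neighbour correspondence search with a closed-form rigid fit (Kabsch/SVD). The NumPy sketch below uses brute-force neighbours and is only suitable for small clouds; PCL [4] provides optimized and more robust variants, and a good initial alignment is assumed since ICP converges only locally.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t with R @ src_i + t ≈ dst_i (Kabsch / SVD)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def icp(src, dst, iters=30, tol=1e-10):
    """Point-to-point ICP with brute-force nearest neighbours.
    Returns R, t that align src (N, 3) to dst (M, 3)."""
    R, t = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iters):
        moved = src @ R.T + t
        d2 = np.sum((moved[:, None, :] - dst[None, :, :]) ** 2, axis=2)
        nn = np.argmin(d2, axis=1)                 # correspondences for this iteration
        R, t = best_rigid_transform(src, dst[nn])
        err = np.mean(np.min(d2, axis=1))
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R, t
```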
References:
[1] https://developer.apple.com/arkit/
[2] https://developers.google.com/ar/
[3] https://colmap.github.io/
[4] http://pointclouds.org/
[5] https://en.wikipedia.org/wiki/Iterative_closest_point
[6] http://bmvc2018.org/contents/papers/1073.pdf
Supervisor: Lukas [email protected]
Requirements / Tools: C++, iOS or Android development, ARKit/ARCore-capable phones
HoloLens Robot Controller
Goal: Implement a robot control system using the HoloLens.
Description:
The Microsoft HoloLens is equipped with a broad range of sensors suitable for 3D localization and gesture classification, including an IMU, a depth sensor, an RGB camera, and four grayscale cameras. Similar sensors are also commonly used in robotics for localization and obstacle avoidance.
In this project, we will build a basic interaction system that uses the HoloLens to control the Trimbot, a gardening robot platform running a visual SLAM system.
This includes:
(1) Setting up the communication between the HoloLens and the Trimbot using available libraries.
(2) Aligning the respective maps and coordinate systems, and maintaining this alignment through map updates.
(3) Displaying the robot's map on top of the real environment in the HoloLens.
(4) Classifying the user's gestures.
(5) Sending control information from the HoloLens to the Trimbot.
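For step (2), if a few corresponding 3D points are known in both maps (for example, shared landmarks or a marker seen by both devices, which is an assumption here), the transform between the two frames has a closed-form least-squares solution. The sketch below implements Umeyama's method with optional scale; re-running it on updated correspondences after a map update is one simple way to maintain the alignment.

```python
import numpy as np

def umeyama(src, dst, with_scale=True):
    """Similarity transform (c, R, t) minimizing sum ||dst_i - (c R src_i + t)||^2.
    src, dst: (N, 3) corresponding points, e.g. robot-map and HoloLens-map landmarks."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # enforce a proper rotation
        S[2, 2] = -1.0
    R = U @ S @ Vt
    c = np.trace(np.diag(D) @ S) / np.mean(np.sum(xs**2, axis=1)) if with_scale else 1.0
    return c, R, mu_d - c * R @ mu_s
```

With two metric SLAM systems, `with_scale=False` gives a rigid alignment; estimating the scale is a useful sanity check on whether both maps really are metric.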
Supervisors: Taein Kwon <[email protected]>, Marcel Geppert <[email protected]>, Jeff Delmerico <[email protected]>
Requirements / Tools: C# (Unity), ROS, C++
Depth map fusion with HoloLens
Goal: Depth map fusion with HoloLens.
Description:
Thanks to the research mode of the HoloLens, it is possible to directly access the raw data from the time-of-flight (ToF) camera. This raw data is much more accurate, but the drawback is a larger amount of unfiltered data.
The goal of this project is to implement an algorithm that fuses multiple depth maps into a single point cloud by checking for visibility as well as geometric and appearance consistency on the image plane.
The resulting point cloud will contain the same amount of 3D information but in a more compact and consistent representation.
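The geometric part of such a consistency check can be sketched as follows: back-project a pixel of view A with its depth, transform it into view B, and compare the projected depth with B's depth map. This minimal sketch assumes pinhole intrinsics `K` shared by both views, a known relative pose (`R_ba`, `t_ba`, mapping A's camera frame into B's), nearest-pixel lookup and a hypothetical relative tolerance; appearance consistency is not covered. A point would typically be kept only if it is consistent in several views.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Pixel (u, v) with depth -> 3D point in that camera's frame."""
    return depth * np.linalg.solve(K, np.array([u, v, 1.0]))

def project(X, K):
    """3D point in camera frame (in front of the camera) -> (u, v, depth)."""
    x = K @ X
    return x[0] / x[2], x[1] / x[2], x[2]

def is_consistent(u, v, depth_a, depth_map_b, K, R_ba, t_ba, rel_tol=0.01):
    """True if pixel (u, v) of view A with depth depth_a agrees with view B's
    depth map within a relative tolerance."""
    X_b = R_ba @ backproject(u, v, depth_a, K) + t_ba
    if X_b[2] <= 0:
        return False                                   # behind camera B
    ub, vb, zb = project(X_b, K)
    col, row = int(round(ub)), int(round(vb))
    h, w = depth_map_b.shape
    if not (0 <= row < h and 0 <= col < w):
        return False                                   # outside B's image
    observed = depth_map_b[row, col]
    return bool(observed > 0 and abs(observed - zb) <= rel_tol * zb)
```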
Supervisor: Silvano Galliani <[email protected]>
Requirements / Tools: C++
Incremental SfM for 1D Radial Cameras
Goal: Implement and evaluate an incremental SfM pipeline for 1D radial cameras.
Description:
In this project, the goal is to implement a simple incremental Structure-from-Motion pipeline for the 1D radial camera model. In contrast to the pinhole camera model, which projects 3D points to points in the image plane, the radial camera model projects 3D points onto radial lines (lines through the principal point). Since the model only considers the direction of the projection, it is invariant to changes in focal length as well as to radial distortion. Unfortunately, it is also invariant to translation along the principal axis, which makes one degree of freedom of the translation unobservable. Since initialization is trickier for radial cameras (it requires four cameras instead of two), we will start by assuming pinhole camera models for the initial pair.
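The invariances described above can be checked numerically. The sketch below computes the observed radial direction from only the first two rows of [R | t], and a residual as the perpendicular distance of an image point from that radial line. It assumes square pixels without skew, so that focal length and radially symmetric distortion only rescale the point along the line.

```python
import numpy as np

def radial_direction(R, t, X):
    """1D radial camera: only the direction of the first two camera-frame
    coordinates is observed, i.e. the radial line through the principal point."""
    x = (R @ X + t)[:2]
    return x / np.linalg.norm(x)

def radial_residual(R, t, X, observed_xy, principal_point):
    """Signed perpendicular distance (pixels) of the observed point from the
    projected radial line, via a 2D cross product."""
    p = np.asarray(observed_xy, dtype=float) - principal_point
    n = radial_direction(R, t, X)
    return float(n[0] * p[1] - n[1] * p[0])
```

Since `t[2]` never enters `radial_direction`, this is exactly the unobservable degree of freedom mentioned above; it has to be recovered later, e.g. once a full camera model is fitted.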
References:
Thirthala & Pollefeys, Radial Multi-focal Tensors, IJCV’12
Kim et al., Multi-view 3D reconstruction from uncalibrated radially-symmetric cameras, ICCV’13
Kukelova et al., Real-time solution to the absolute pose problem with unknown radial distortion and focal length, ICCV’13
Camposeco et al., Non-Parametric Structure-Based Calibration of Radially Symmetric Cameras, ICCV’15
Supervisor: Viktor Larsson <[email protected]>
Requirements / Tools: C++, some familiarity with SfM
Deep Keypoint Detector
Goal: Implement and evaluate trained keypoint detectors using a deep ranking function.
Description:
The goal is to implement and evaluate the keypoint detector using a variety of deep network architectures and pre-trained models. An additional task will be to train several independent detectors that both minimize the ranking cost function and maximize the distances between them.
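The ranking idea behind quad-networks can be illustrated with quadruples of points: two points in one image and their correspondences in a transformed image. If the detector response ranks the two points the same way in both images, the product of the response differences is positive. The sketch below penalizes that product with a squared hinge; the exact penalty used by Savinov et al. may differ and should be checked in the paper.

```python
import numpy as np

def quad_ranking_loss(h1, h2, h1_t, h2_t):
    """Ranking-consistency loss for quadruples of points.
    h1, h2: detector responses at two points of an image; h1_t, h2_t: responses
    at the corresponding points of a transformed image (arrays of shape (N,)).
    R > 0 when both images rank the two points the same way."""
    R = (h1 - h2) * (h1_t - h2_t)
    return float(np.mean(np.maximum(0.0, 1.0 - R) ** 2))   # assumed squared hinge
```

Keypoints are then taken as extrema of the learned response, so consistent ranking under transformations translates into repeatable detections.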
References:
Savinov et al., Quad-networks: unsupervised learning to rank for interest point detection, CVPR 2017
Supervisor: Lubor Ladicky <[email protected]>
Requirements / Tools: C++, any deep learning framework
Motion blur aware camera pose tracking
Goal: Implement a coarse-to-fine camera pose tracker that is robust to image motion blur.
Description:
A camera pose tracker is usually the front end of a visual odometry (VO) algorithm. Most existing works assume that the input images to VO are sharp. However, images easily become blurred if the camera moves too fast during a long exposure, which can cause VO to fail.
In this project, we plan to investigate and implement an efficient motion blur aware camera pose tracker. To make the problem more tractable, we assume that the reference image is sharp and only the current image is motion blurred. Furthermore, we assume that the depth map corresponding to the reference image is already known. All required data can be generated with a simulation tool that has already been set up for you.
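A blur-aware tracker typically models the blurred image as the average of sharp images captured along the camera trajectory during the exposure; with the reference depth known, each of those sharp images could be synthesized by warping the reference image. The sketch below covers only the trajectory part: it generates virtual poses between the exposure start and end, interpolating rotation on SO(3) and translation linearly (a constant-velocity assumption that may not match the intended formulation).

```python
import numpy as np

def skew(w):
    return np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3) + skew(w)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K

def so3_log(R):
    """Rotation matrix -> axis-angle vector (valid for angles below pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-12:
        return 0.5 * w
    return theta / (2.0 * np.sin(theta)) * w

def poses_during_exposure(R0, t0, R1, t1, n):
    """n virtual camera poses from exposure start (R0, t0) to end (R1, t1)."""
    w = so3_log(R0.T @ R1)
    return [(R0 @ so3_exp(tau * w), (1.0 - tau) * t0 + tau * t1)
            for tau in np.linspace(0.0, 1.0, n)]
```

Tracking then amounts to optimizing the end pose (and possibly the start pose) so that the average of the reference image warped to these poses matches the observed blurred image, coarse to fine over an image pyramid.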
Supervisor: Peidong Liu <[email protected]>
Requirements / Tools: PyTorch
MegaPatches: a dataset for training or benchmarking local descriptors
Goal: Construct a dataset of patches from large-scale 3D reconstructions.
Description:
The first step of this project is to extract a new dataset for descriptor training / benchmarking from the 196 scenes of the MegaDepth [1] dataset, reconstructed using the COLMAP SfM & MVS pipeline [2]. One way to do this is by warping keypoints from one image to nearby images using the estimated camera parameters and dense depth maps (see [3]). The final objective would be to compare different state-of-the-art descriptors on this dataset (e.g. using the metrics presented in [4]).
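Warping a keypoint with the reconstruction amounts to back-projecting it with its depth and reprojecting it into the other image. The sketch below assumes COLMAP's world-to-camera convention (x_cam = R(q) X + t, quaternion stored as w, x, y, z) and pinhole intrinsics; a practical pipeline would also compare the warped depth against the second image's depth map to reject occluded points.

```python
import numpy as np

def qvec_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def warp_keypoint(uv, depth, K1, q1, t1, K2, q2, t2):
    """Warp a keypoint from image 1 into image 2 using image 1's depth.
    Poses are world-to-camera: x_cam = R(q) @ X_world + t.
    Returns (u, v, depth in image 2)."""
    R1, R2 = qvec_to_rotmat(q1), qvec_to_rotmat(q2)
    X_cam1 = depth * np.linalg.solve(K1, np.array([uv[0], uv[1], 1.0]))
    X_world = R1.T @ (X_cam1 - t1)          # camera 1 -> world
    x = K2 @ (R2 @ X_world + t2)            # world -> camera 2 -> pixels
    return x[0] / x[2], x[1] / x[2], x[2]
```

Patches cut around a keypoint and its warped location then form a positive pair for descriptor training or benchmarking.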
References:
[1] - MegaDepth: Learning Single-View Depth Prediction from Internet Photos, Li and Snavely, CVPR 2018
[2] - COLMAP - https://colmap.github.io
[3] - Brown Patches dataset - http://matthewalunbrown.com/patchdata/patchdata.html
[4] - HPatches: A benchmark and evaluation of handcrafted and learned local descriptors, Balntas et al., CVPR 2017
Supervisor: Mihai Dusmanu <[email protected]>
Requirements / Tools: MATLAB (VLFeat), understanding of C++
Image from [3]
Learning to propagate variational methods
Goal: Train a network on propagation for semantic scene completion.
Description:
Variational methods in computer vision are methods that solve problems by posing them as functional minimizations. Such techniques can be applied to image denoising, inpainting, segmentation, and more. Here, we are interested in applications to semantic 3D reconstruction.
The minimization of such functionals relies on iterative algorithms, such as primal-dual methods, which iteratively minimize the given objective until convergence. Recent work has shown that these algorithms can be implemented as neural networks (referred to here as variational networks). The main appeal of such networks is that they rely on few parameters.
For this to work, the number of iterations in the minimization algorithm must be fixed during training. Unfortunately, unlike true variational methods, when running the network for inference, adding more iterations does not improve the results, but often degrades them.
In this project, we want to explore methods that make it possible to train a variational network whose results improve when more iterations are added. To do so, we will try a different loss function that focuses more on the functional minimization, and use synthetic ground-truth data that corresponds to different steps of the minimization.
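As a toy illustration of the iterations a variational network unrolls, the sketch below runs Chambolle-Pock primal-dual steps for 1D total-variation denoising, E(u) = 0.5 ||u - f||^2 + lam ||D u||_1, with a fixed iteration count. It is a stand-in chosen for brevity, not the semantic 3D reconstruction functional; in a variational network, quantities such as `lam` or the step sizes would become trained parameters. Evaluating `tv_energy` after different iteration counts is one way to test whether a model keeps improving with more iterations.

```python
import numpy as np

def grad(u):
    """Forward differences D u (length n - 1)."""
    return u[1:] - u[:-1]

def grad_adjoint(p):
    """Adjoint D^T p (length n), so that <D u, p> = <u, D^T p>."""
    return np.concatenate([[0.0], p]) - np.concatenate([p, [0.0]])

def tv_energy(u, f, lam):
    """E(u) = 0.5 ||u - f||^2 + lam * ||D u||_1."""
    return 0.5 * np.sum((u - f) ** 2) + lam * np.sum(np.abs(grad(u)))

def primal_dual_tv(f, lam, num_iters, tau=0.45, sigma=0.45):
    """Chambolle-Pock iterations for 1D TV denoising with a fixed iteration count.
    tau * sigma * ||D||^2 < 1 holds since ||D||^2 <= 4 in 1D."""
    u = f.copy()
    u_bar = f.copy()
    p = np.zeros(len(f) - 1)
    for _ in range(num_iters):
        p = np.clip(p + sigma * grad(u_bar), -lam, lam)              # dual step + projection
        u_new = (u - tau * grad_adjoint(p) + tau * f) / (1.0 + tau)  # primal proximal step
        u_bar = 2.0 * u_new - u                                      # over-relaxation
        u = u_new
    return u
```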
Requirements / Tools: Python, convex optimization
Soccer on HoloLens Demo
Goal: Create a demo program to watch soccer in 3D on the HoloLens.
Description:
Starting from existing work [1], the goal of this project is to create a running demo program to watch a soccer game in 3D on the HoloLens. Further, the original framework can be improved in several ways. Most notably, the original work uses only a single camera view; extending it to multiple views would improve the quality of the extracted 3D surfaces of the players. Other possible improvements (e.g., output quality improvements, synchronized multi-device viewing) will be discussed and selected during the course of the project.
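One basic ingredient of a multi-view extension is triangulating a point (for example, a player keypoint) observed in several calibrated, synchronized cameras, which are assumptions here since [1] works from a single view. The sketch below uses linear DLT triangulation; a real system would refine the result with a nonlinear reprojection-error minimization.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Linear (DLT) triangulation of one 3D point from two or more views.
    projections: list of 3x4 camera matrices; points_2d: list of (u, v)."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])      # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.array(rows))
    X = Vt[-1]                            # null-space vector (homogeneous point)
    return X[:3] / X[3]
```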
[1] Konstantinos Rematas, Ira Kemelmacher-Shlizerman, Brian Curless, Steve Seitz, Soccer On Your Tabletop, CVPR 2018. http://grail.cs.washington.edu/projects/soccer/, https://github.com/krematas/soccerontable
Supervisor: Martin [email protected]
Requirements / Tools: Python, C#, Unity3D, HoloLens
Hybrid 2.5D / 3D Large-Scale Urban Reconstruction
Goal: Create a deep neural network approach that estimates which parts of the scene require full 3D reconstruction vs. simple 2.5D depth maps.
Description:
While 2.5D depth maps are highly efficient for urban reconstruction, they are unable to capture sophisticated building architecture, and especially overhanging structures such as roof overhangs, road overpasses, and bridges. On the other hand, volumetric (voxel-based) approaches can reconstruct such structures in great detail. However, they are very resource-demanding and do not scale to large urban areas.
Luckily, in the majority of cases a 2.5D reconstruction is sufficient to obtain high-quality surface geometry. Therefore, the best scalable reconstruction method should be hybrid: it should carefully select the areas that require expensive full 3D reconstruction and compute everything else in 2.5D. The main goal of this work is to create a selection algorithm that decides which reconstruction method should be applied where.
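One hypothetical way to obtain training targets for such a selection network is to label each vertical column of a ground-truth voxel grid: a column is representable as a height field (2.5D) only if no occupied voxel lies above a free one. The sketch below assumes a boolean grid indexed [x, y, z] with z pointing up and the grid starting below ground; this rule is an illustrative heuristic, not the method of [1] or [2].

```python
import numpy as np

def needs_full_3d(occupancy):
    """Per-column labels from a boolean voxel grid occupancy[x, y, z] (z up).
    True where an occupied voxel lies directly above a free one (overhangs,
    bridges), i.e. where a 2.5D height field cannot represent the column."""
    occ = occupancy.astype(np.int8)
    # A free -> occupied transition going upwards shows up as a +1 difference.
    return np.any(np.diff(occ, axis=2) == 1, axis=2)
```

A 2D network such as a U-Net [2] could then be trained to predict this per-column mask from 2.5D inputs, so that expensive volumetric processing is reserved for the flagged areas.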
References:
[1] Learning Priors for Semantic 3D Reconstruction, Cherabier et al., ECCV, 2018
[2] Olaf Ronneberger, Philipp Fischer, Thomas Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015
Supervisor: Martin [email protected]
Requirements / Tools: Python
Create a synthetic motion blur dataset
Goal: Create a synthetic motion-blurred dataset with the Unreal game engine and benchmark existing deep learning based motion deblurring methods.
Description:
Motion-blurred images affect many computer vision tasks, such as image-based motion estimation, localization, and segmentation. In this project, you will create a synthetic motion-blurred dataset based on the Unreal game engine. The software framework is already being set up and will be provided for your convenience.
A large dataset with varying foreground objects, backgrounds and camera motion should be created based on the provided software framework. Furthermore, you are also required to benchmark existing deep learning based motion deblurring algorithms (source code will be provided) with the generated dataset.
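A common way to synthesize realistic blur is to render many sharp frames within one exposure at a high frame rate and average them, approximating the sensor's integration of light. The sketch below averages in linear intensity, assuming the renderer outputs sRGB-encoded frames in [0, 1] (check the provided framework), and returns the middle frame as the sharp ground truth, which is one possible choice among several.

```python
import numpy as np

def srgb_to_linear(c):
    """Inverse sRGB transfer function, c in [0, 1]."""
    return np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(c):
    """sRGB transfer function, c in [0, 1]."""
    return np.where(c <= 0.0031308, 12.92 * c, 1.055 * np.power(c, 1 / 2.4) - 0.055)

def synthesize_blur(frames_srgb):
    """Average N consecutive sharp frames (N, H, W, 3) from one exposure in
    linear intensity; return (blurred image, sharp middle frame)."""
    frames = np.asarray(frames_srgb, dtype=np.float64)
    blurred = linear_to_srgb(srgb_to_linear(frames).mean(axis=0))
    sharp_reference = frames[len(frames) // 2]
    return blurred, sharp_reference
```

Averaging gamma-encoded values directly would darken bright streaks, which is why the conversion to linear intensity is done first.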
Supervisor: Peidong Liu, CAB [email protected]
Requirements / Tools: Python, C++