CZECHTECHNICAL UNIVERSITY INPRAGUE F3 ... · Therefore low-resolution depth sen-sors(e.g. with only 4-row measurements planes) are often used in contemporary semi-autonomous cars,

CZECH TECHNICALUNIVERSITYIN PRAGUE

F3 Faculty of Electrical EngineeringDepartment of Computer Science

Bachelor’s Thesis

Deep learning for densereconstruction from sparse depthmeasurements

Dmitrii NoskovOpen Informatics, Software Systems

May 2019Supervisor: Ing. Vojtěch Šalanský

Acknowledgement / Declaration

First and foremost, I would liketo thank my supervisor Ing. VojtěchŠalanský for interesting thesis topic,his guidance and advices. Further-more, this thesis would not be possiblewithout a support of my family andfriends.

I hereby declare that I have elabo-rated this Bachelor Thesis on my ownand I have mentioned all used informa-tion sources and literature according toMethodological guidance to ethical prin-ciples in the preparation of universitytheses.

Prague, 24. 05. 2019

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

iii

Abstrakt / Abstract

3D rekonstrukce je důležitým kom-ponentem pro základní bezpečnostnífunkcí autonomního vozidla, jako jenouzové brzdění, prediktivní kontrolaaktivního tlumení, bezpečné otáčenína křižovatce, nebo odhadnutí vlastnípolohy na offline mapě. V důsledkutoho každé plně autonomní vozidlovyžaduje senzor, který poskytne mě-ření s vysokým rozlišením a 3D měřeníhloubky na dlouhé vzdálenosti. Senzorys vysokým rozlišením jsou drahé, těžké,pomalé a náchylné k mechanickémuopotřebení. Snímače hloubky s nízkýmrozlišením (např. pouze s čtyřmi řádkyměření) jsou často používány v součas-ných poloautonomních automobilecha odhaduje se místní mapa 3D rekon-strukce. Hlavním cílem této práce senaučit konvoluční neuronovou síť’ pro3D rekonstrukci mapy voxelů z řídkýchhloubkových měření

Klíčová slova: neuronová síť; 3d re-konstrukce; KITTI dataset; voxelovámapa; hluboké učení.

Překlad titulu: Hluboké učení pro re-konstrukci prostredí z řidkých hloubko-vých měření

Accurate 3D reconstruction is anessential component for many of funda-mental capabilities such as emergencybraking, predictive control for ac-tive damping, safe turning in a roadintersection or self-localization fromoffline maps. Consequently, any fully-autonomous vehicle requires a sensorproviding high resolution and longrange 3D measurements. The high-resolution sensors are expensive, heavy,slow and prone to mechanical wear.Therefore low-resolution depth sen-sors(e.g. with only 4-row measurementsplanes) are often used in contemporarysemi-autonomous cars, and the local3D reconstruction map is estimated.The main goal of this thesis is to learna deep convolutional neural networkfor dense 3D voxel map reconstructionfrom sparse depth measurement.

Keywords: neural network; 3d recon-struction; KITTI dataset; voxel map;deep learning.

iv

Contents /

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . .31.1 Motivation . . . . . . . . . . . . . . . . . . . . . . .31.2 Thesis goals . . . . . . . . . . . . . . . . . . . . . .31.3 Thesis structure . . . . . . . . . . . . . . . . . .3

2 Related work . . . . . . . . . . . . . . . . . . . . . . . .52.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . .5

2.1.1 Depth estimation . . . . . . . . . .52.1.2 3D reconstruction . . . . . . . . . .62.1.3 Semantic segmentation . . . .7

3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . .83.1 Training data . . . . . . . . . . . . . . . . . . . .8

3.1.1 KITTI dataset . . . . . . . . . . . . .83.2 Python . . . . . . . . . . . . . . . . . . . . . . . . . . .93.3 Tensorflow . . . . . . . . . . . . . . . . . . . . . . . .9

3.3.1 TensorBoard . . . . . . . . . . . . . 103.4 Convolutional Neural Net-

works . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.4.1 Input layer . . . . . . . . . . . . . . . 103.4.2 Convolutional layer . . . . . . 113.4.3 ReLU layer . . . . . . . . . . . . . . . 113.4.4 Pooling layer . . . . . . . . . . . . . 123.4.5 Deconvolutional layer . . . . 123.4.6 Training . . . . . . . . . . . . . . . . . . 123.4.7 Loss function . . . . . . . . . . . . . 12

3.5 Networks suggested for ex-periments . . . . . . . . . . . . . . . . . . . . . . . 133.5.1 CNN based on Learn-

ing for Active 3D map-ping network . . . . . . . . . . . . . 13

3.5.2 CNN based on ResNetarchitecture . . . . . . . . . . . . . . 13

4 Implementation . . . . . . . . . . . . . . . . . . 154.1 Workaround . . . . . . . . . . . . . . . . . . . . 154.2 Project structure . . . . . . . . . . . . . . . 154.3 Data processing . . . . . . . . . . . . . . . . 164.4 Building model . . . . . . . . . . . . . . . . . 16

4.4.1 Model implementationbased on [1] . . . . . . . . . . . . . . 17

4.4.2 Model implementationbased on [1] with skipconnections . . . . . . . . . . . . . . . 17

4.4.3 Model implementationbased on ResNet-50 . . . . . . 18

4.4.4 Model implementationbased on ResNet-152 . . . . 19

4.5 Training . . . . . . . . . . . . . . . . . . . . . . . . 19

4.6 Saving training model . . . . . . . . . 194.7 Drawing ROC curves . . . . . . . . . . 194.8 Visual representation of

computational graphs andloss function . . . . . . . . . . . . . . . . . . . . 20

4.9 Visual representation of 3Dreconstruction . . . . . . . . . . . . . . . . . . 204.9.1 Mayavi . . . . . . . . . . . . . . . . . . . 20

5 Testing and evaluation . . . . . . . . . . . 255.1 Evaluation routine . . . . . . . . . . . . . 255.2 ResNet networks comparison . . 255.3 MappingNet networks com-

parison . . . . . . . . . . . . . . . . . . . . . . . . . 255.3.1 Input: Velodyne

64, label: groundtruth+velodyne data . . . . 25

5.3.2 Input: Velodyne 16,label: Velodyne 64 . . . . . . . 26

5.4 Loss comparison, best per-formers comparison . . . . . . . . . . . . 26

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 326.1 Future work . . . . . . . . . . . . . . . . . . . . 32

References . . . . . . . . . . . . . . . . . . . . . . . . 33A Abbreviations . . . . . . . . . . . . . . . . . . . . . 35B DVD content . . . . . . . . . . . . . . . . . . . . . . 36

v

Tables / Figures

5.1. MappingNet with skip con-nections evaluation . . . . . . . . . . . . . 30

5.2. ResNet evaluation . . . . . . . . . . . . . . 305.3. MappingNet evaluation . . . . . . . . 31

2.1. Multi-Scale Deep Network . . . . . . .52.2. 3D-R2N2 model . . . . . . . . . . . . . . . . . .62.4. 3D shape completion archi-

tecture . . . . . . . . . . . . . . . . . . . . . . . . . . . .62.3. Architecture of mapping net-

work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72.5. Semantic segmentation ar-

chitecture example . . . . . . . . . . . . . . .73.1. KITTI dataset example . . . . . . . . . .83.2. Voxel map example . . . . . . . . . . . . . .93.3. TensorBoard example . . . . . . . . . . 103.4. Example of input data . . . . . . . . . 113.5. Residual block example . . . . . . . . 133.6. ResNet architectures . . . . . . . . . . . 144.1. Overfitting example . . . . . . . . . . . . 224.2. Voxel map represented by

16-beam data example . . . . . . . . . 234.3. Voxel map represented by

64-beam data example . . . . . . . . . 244.4. Voxel map represented by

ground truth data example . . . . 245.1. ROC curves comparison.

ResNet networks . . . . . . . . . . . . . . . 265.2. Loss function during training

on Velodyne data ResNet-152based network . . . . . . . . . . . . . . . . . . 26

5.3. Example of 3D reconstruc-tion using ResNet-152 basednetwork . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.4. ROC curves comparison.MappingNet networks withground truth labels . . . . . . . . . . . . 27

5.5. Loss function during trainingon MappingNet+skip con-nection(kernel 5x5) networkwith ground truth labels . . . . . . . 28

5.6. Example of 3D reconstruc-tion using MappingNet+skipconnection(kernel 5x5) net-work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5.7. ROC curves comparison.MappingNet networkstrained with Velodyne 64 aslabel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.8. Loss function during trainingon MappingNet(kernel3x3)

vi

network trained with Velo-dyne 64 as label . . . . . . . . . . . . . . . . 29

5.9. Example of 3D recon-struction using Map-pingNet(kernel 3x3) networktrained with Velodyne 64labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.10. ROC curves comparison be-tween best performers. . . . . . . . . . 31

vii

1

2

Chapter 1Introduction

1.1 Motivation3D reconstruction is considered one of the fundamental problems of computer visiondue to extreme complexity and relevance in many engineering applications. In recentyears, deep convolutional neural networks(CNN) have gained popularity in image andvideo processing tasks, such as classification [2], segmentation [3], recognition [4]. Men-tioned approaches proved their efficiency in learning from 2D data. How to representdata for depth estimation and 3d reconstruction is the topic of discussion in deep learn-ing(DL) community. [5] Some of the studies used a single-shot RGB image for depthmap prediction [6]. With convolutional neural networks, Eigen et al. [7] showed thatdepth estimation could perform reasonably well using single monocular image. In re-cent years, many variations of this method have been proposed [8–9]. For autonomouscar application, obtaining depth information from RGB images have disadvantages. Inreal-world datasets data from images may be noisy due to the environment(e.g low-lightscenario, reflective surfaces) [7]. Depth measurements provides the most reliable datafor accurate 3D reconstruction in case of autonomous driving [10]. This thesis proposesseveral CNN architectures with different hyperparameters evaluated on artificially syn-thesized Velodyne LiDAR laser radar data, which obtained from KITTI dataset[11].One of designed architecture extends existing approach proposed by Zimmermann etal. [1]. Their CNN outperformed Res3D-GRU-3 [12] network.

1.2 Thesis goals

. Inspect deep learning algorithms suitable for 3d reconstruction and semantic segmen-tation.. Examine and process the data received from KITTI dataset, so that they becomesuitable for training the neural network.. Design and implement several neural network architectures for the application of 3doccupancy reconstruction.. Collect the metrics from all network setups and compare the results.

1.3 Thesis structureChapter 2 presents an overview of several approaches to solving problems of depthprediction, 3D reconstruction and semantic segmentation. There is multiple methodswhich input can be used for depth learning, such as single monocular images, stereoimages, laser radar data or combining these data. Chapter 3 describes the choice ofKITTI dataset, deep learning framework TensorFlow and chosen CNN for further im-plementation. Also in this chapter I will briefly provide information about used building

3

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .blocks of convolutional neural network. Chapter 4 takes a closer look at implementationspecifics, such as CNN model building in TensorFlow and reconstructed map visual-ization in Mayavi [13]. Input data and their post-processing for further use in theimplementation is also shown in this chapter. Chapter 5 focuses on the performancecomparison in learning and reconstruction accuracy of designed networks. Finally,chapter 6 presents conclusion and suggestions for future work.

4

Chapter 2Related work

2.1 Overview3D reconstruction methods can rely on different input data. The goal of this thesis isaccurate reconstruction from sparse depth measurements, but many approaches focuson depth estimation from single or stereo RGB images.

2.1.1 Depth estimationSaxena et al. [6] presented an approach based on capturing depths and relationshipsbetween depths using linear regression and Markov Random Fields. The model usestwo types of features for absolute and relative depths. Depth information for trainingwas collected using a 3-D laser scanner. Eigen et al. [7] proposed a two CNN stack: theglobal coarse-scale network for scene understanding and the local fine-scale network,see Figure 2.1. Depth information for training provided by NYU Depth v2 [14] andKITTI datasets. Results of this method outperform Make3D system [15], which isan extension for previously mentioned work [6]. Approaches based on deep residualnetwork [4] such as [16–17] achieve higher accuracy with simpler architecture. Lainaet al. [16] extends ResNet-50 architecture. Fully connected layer has been replacedwith upsampling layers. This change reduces the number of trainable parameters andrequires less training data. Li et al. [17] have similarities with the previously mentionedapproach. Based on ResNet-152 architecture, fully connected layer has been replacedwith a convolutional layer and deconvolutional layer.

Figure 2.1. Multi-Scale Deep Network model architecture. Source [7]

5

2. Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.1.2 3D reconstruction

There are many state of the art methods in surface reconstruction and 3d prototypingfrom incomplete point clouds. Choy et al. [12] proposed 3D recurrent reconstructionneural network. This approach performs state of the art results on single or multi-viewreconstruction. In addition to the commonly known convolutional neural network in2d space, this work uses 3D Convolutional LSTM and a 3D Deconvolutional NeuralNetwork. These novel units help architecture to handle object self-occlusions whenmultiple views are fed to the network. The output from 127x127 images is voxelized3D reconstruction of size 32x32x32.

Figure 2.2. 3D Recurrent Reconstruction Neural Network model architecture. Source [12]

Zimmermann et al. [1] propose an active 3D mapping method for depth sensors,such as solid-state lidars. The layer design of CNN is similar to LeNet network [18],where convolutional layers followed by max pooling layers. To process higher resolutionvoxel maps network requires more convolutional layers, so there is 6 convolutional layersfollowed by ReLU, instead of only 2 convolutional layers, see Figure 2.3. To match sizeof output to size of input, there is deconvolutional layer in the end of the model insteadof fully connected layer.

Another example of using 3D CNN architecture is shape completion. Han et al.[19] propose shape completion through joint inference of global structure and localgeometry. Generated 3D models have a high 256x256x256 resolution in comparisonwith other approaches, such as [12]

Figure 2.4. 3D shape completion architecture. Source [19]

6

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Overview

Figure 2.3. Architecture of mapping network proposed by Zimmermann et al. [1] op: Anexample input with sparse measurements, showing only the occupied voxels. Bottom:The corresponding reconstructed dense occupancy confidence after thresholding. Right:Schema of the network architecture, composed from the convolutional layers (denotedconv), linear rectifier units (relu), pooling layers (pool), and upsampling layers (deconv).

Source [1]

2.1.3 Semantic segmentationThe main goal of this thesis is 3d reconstruction from sparse depth measurements. Noneof the experiments are contributed to semantic segmentation problem. In previouslydescribed approaches I mentioned that in tasks of depth completion or 3d reconstruc-tion, commonly used CNN architectures [20, 4] is modified by changing fully connectedlayers to upsampling layers. Upsampling layer is one of the main layers in semantic seg-mentation state of the art models [21–23]. Deconvolution layer can recover the originalinput size, which is necessary for accurate 3D reconstruction. The network proposed byNoh et al. [23] for segmentation based on VGG-16 [20] with additional deconvolutionaland unpooling is very similar to architecture proposed by Zimmermann et al. [1] whichis suitable for 3D reconstruction.

Figure 2.5. Semantic segmentation architecture with deconvolutional layers. Source [23]

7

Chapter 3Analysis

3.1 Training data

3.1.1 KITTI datasetThe main focus of the KITTI dataset is to provide free data for development in thecomputer vision industry such as stereo camera, optical flow, visual odometry, 3D objectdetection, and 3D object tracking. This dataset contains a wide range of recordings fromdifferent environments, such as city streets and crossroads, rural roads or freeways. Thedataset is recorded from a VW station wagon using four cameras with high definitionrecording and one Velodyne LIDAR 3D rotating scanner HDL-64E.

Figure 3.1. KITTI dataset development kit example from [11]. Velodyne point clouds arecombined with 2D RGB image.

Zimmermann et al. [1] use Velodyne LIDAR data from KITTI dataset to artificiallycreate voxel maps from point clouds. These voxel maps contain only occupied voxels.The single map contain volume of 64m x 64m x 6.4m discretized into 320x320x32voxels. For lack of learning speed and reconstruction accuracy(which decays over longdistances), I cut the outer part of the map, so the single map now contains volumeof 32m x 32m x 6.4m discretized into 160x160x32 voxels are suitable for further CNNlearning.

8

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Python

Figure 3.2. Adjusted data from KITTI dataset [11]. The red square shows cut the outerpart of initial 320x320x32 voxel map.

3.2 PythonBecause majority of deep learning frameworks and data science tools are for Pythonprogramming language, this is an obviuos choice. Python version 3.4.3 support all thetools necessary to achieve results

3.3 Tensorflow

There are several popular deep learning frameworks, such as TensorFlow1 , Caffe 2,PyTorch 3,. They provide many tools that make solving complex machine learningtasks possible. Those framworks can be used to wrap complex mathematical calcula-tions in one code line, to create linear models and neural networks. There are manypre-designed CNN models and training algorithms. TensorFlow fulfills all the imple-mentation requirements for this thesis and shows high performance in comparison withother DL frameworks. [24]

1 tensorflow.org, TensorFlow https://www.tensorflow.org/2 Caffe, deep learning network. https://caffe.berkeleyvision.org3 PyTorch, deep learning network. https://pytorch.org

9

3. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 3.3. Example of TensorBoard visualization. Screenshot shows change of loss scalarvalue during neural network training.

3.3.1 TensorBoardI chose TensorFlow mostly because of TensorBoard1 feature. It is a very convenient,cross-platform solution for visualizing various kinds of data, such as scalars, computa-tion graphs, images or histograms.

3.4 Convolutional Neural NetworksA convolutional neural network is a neural network architecture that was originally cre-ated and used for effective image recognition. [18] Convolutional layers (convolutions)interchange with nonlinear activation functions (ReLU) and pooling layers.

3.4.1 Input layerIn case of this thesis, the input layer is voxel map with width = 160, height = 160 anddepth = 32. Occupied voxels are represented as 1 in input file, whereas empty voxels are1 tensorflow.org, TensorBoard https://www.tensorflow.org/tensorboard

10

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Convolutional Neural Networks

Figure 3.4. Input layer for thesis experiments.

represented as 0. Further in the implementation chapter I discuss data pre-processingfor training.

3.4.2 Convolutional layer

In the convolution operation, a matrix of weights of small size is used, which is shiftedalong the entire processed layer, forming after each shift an activation signal for theneuron of the next layer with a similar position. This matrix is called the convolutionkernel. When performing a convolution operation, each fragment of image is element-wise multiplied by the convolution matrix, and the result is summed up and writtento the similar position of the output image. The convolution matrix is a visual codingof some feature. The next layer resulting from the convolution operation shows thepresence of this feature. In the convolutional neural network, there are many setsof weights that encode the elements of the images, such as edges, curves, or colors.Convolution kernels are formed during network learning. After we have the entire setof filters, the feature map is formed. Since there are many independent feature maps onone layer, the network becomes multi-channel. Also, the convolutional layer has otherparameters: Stride, or offset - determines by how many positions the filter is shifted toform the signal of the next convolutional neuron. Padding - adding neurons with zerovalues that do not affect the formation of a convolutional layer signal at the edges ofthe previous layer to adjust the dimensions of the convolutional layer of the CNN. Thedimension of the convolutional layer is determined by its parameters and the size of theprevious network layer

3.4.3 ReLU layer

The activation functions are used to generate the output signal of the convolutionalneurons in the CNN These functions transform neuron signals according to certainrules. The most used activation functions are logistic activation(sigmoid), hyperbolictangent(tanh) and ReLU. The ReLU function is a rectified linear function and currently

11

3. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .is considered a much simpler and more efficient in terms of computational complexity.The ReLU derivative is equal to either 0 or 1, which is why its use prevents the explodingand vanishing of gradients and leads to a thinning of the weights, which positively affectsthe computational ability of the CNN. f(x) = max(0, x)

3.4.4 Pooling layerThis layer of the convolutional neural network is used to reduce the dimensions of data toreduce the likelihood of rapid overtraining, as well as to reduce computational costs andmemory consumption. Usually, this layer is used after the operation of convolution andtransforms the signals of the convolutional layer with highlighting the most significantaccording to defined criteria. Pooling layer with 2x2 kernels and stride 2 can discard75 percent of the activations.

3.4.5 Deconvolutional layerDeconvolution is a transposed convolution. This operation swaps the sizes of input andoutput of a regular convolution. This operation is used for interpolation. In this thesisexperiments deconvolution is necessary for restoring input dimensions for predictionoccupancy.

3.4.6 TrainingTraining a convolution neural network is the process of adjusting the values of theweights of the connections between the neurons of the network. For unsupervisedlearning, it is important that the algorithm does not have the data to verify the neuralnetwork output. This method works on the principle of clustering, finding similarelements in the input data. Based on these categories, the method tries to sort thedata into groups, but all without any verification. The number of searched groupsdepends on the learning parameters, either defined or calculated by the algorithm.

Supervised learning, on the other hand, has data to verify network output. Thetraining set consists of a pair input-label, where the label is the class of correspondingoutput. In the case of image recognition task, the label would be the correct class name.In the case of this thesis experiments, the correct label is ground truth voxel map. Themost common routine for supervised learning:

. Pre-process input data. Create training, testing and validation data sets. Build the model of CNN. Initialize the network, usually with random numbers. Start training epoch. Pass input data to the network. Calculate the loss and update weights. After every epoch, evaluate accuracy of network on testing data set. Update learning rate to avoid overtraining

3.4.7 Loss functionLoss function works as a mathematical expression of a neural network error. Themost common loss functions are a mean squared error, L2 and cross entropy; Lossfunction should be selected depending on CNN output. For example, cross entropyis used in binary classification, so labels are assumed to be 0 or 1. For this thesisexperiment, I used weighted logistic loss, which was used in Learning for Active 3D

12

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 Networks suggested for experiments

Mapping [1]. Optimization algorithms are trying to minimalize loss function. The goalof optimization algorithms is changing the weights and biases in CNN so loss functionvalue will decrease. Gradual decrease loss function value means that CNN continueslearning.

3.5 Networks suggested for experiments

3.5.1 CNN based on Learning for Active 3D mapping networkThis work extends existing approach Learning for Active 3D mapping [1]. ProposedCNN for 3D reconstruction in Learning for 3D mapping performs better than stateof the art solution by 20 percent in recall. The goal of expanding this work is toimprove 3D reconstruction. There are many possibilities on how to achieve this goal.The final CNN design was predominantly the result of experimental research betweendifferent hyperparameters. These parameters were kernel size selection and by addingskip connections, also known as residual connections. In a retrospective of ILSVRC1

[25] winners proves that bigger convolutional kernels are less efficient that 3x3 or 5x5kernels, which is a common choice for the modern state of the art approaches [20, 12].

3.5.2 CNN based on ResNet architectureDuring training CNN based on Learning for Active 3D mapping approach, residualconnections improved occupancy prediction. Skip connection is the primary buildingblock of residual neural networks, or ResNet [4]. Many researchers have noticed that bystacking more layers the quality of such a model grows only to a certain limit [20], andthen the quality begins to fall. This problem is called the degradation problem [26],and the networks created by stacking more layers are called plain networks. He et al.[4] found such a topology where the quality of the model increases with the addition ofnew layers.

Figure 3.5. A building block of residual learning. Source [4]1 image-net.org, ImageNet Large Scale Visual Recognition Challenge http://image-net.org/challenges/LSVRC/

13

3. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .ResNet main element is residual block with shortcut-connection over which data

passes unchanged. The residual block consists of several convolutional layers withactivation, which convert the input signal x to F (x). Shortcut connections skip one ormore layers, see Figure 3.5. Unlike the state of the art model, I change fully connectedlayer in the end to upsampling(deconvolutional) layer to restore input dimensions. Iimplement ResNet-50 and ResNet-152 networks to compare output and computationalcomplexity.

Figure 3.6. All ResNet architectures. Source [4]

14

Chapter 4Implementation

After the analysis in section 3, several neural networks were proposed: based on Learn-ing for Active 3D mapping and based on ResNet architecture. The ResNet architectureis going to be realized in two modifications: ResNet-50 and ResNet-152. Besides origi-nal approach proposed by Zimmermann et al. [1], I’m going to implement this networkwith skip connections layer and different kernel sizes.

4.1 WorkaroundTo start training CNN with Tensorflow and visualize results with Mayavi it is necessaryto setup a workaround. First of all, I should install Python1 and pip2 When pipis installed, I have two options to install TensorFlow: tensorflow-gpu or tensorflowversions. Since all computation are extremely complex, GPU version is preferable thanCPU-only.

pip3 install --user --upgrade tensorflow

Voxel map samples from KITTI dataset are saved on CTU FEE Department of Cyber-netics server goedel.felk.cvut.cz. All voxel maps are in MATLAB3. Neither Python orTensorflow can not work directly with .mat files. MATLAB files are HDF5 files with adifferent extension and meta-data. The hdf5storage4 library provides all the necessaryutilities.

pip install hdf5storage

Voxel maps are variables of MATLAB files. There is an example how to access thesevariables using Python, NumPy5 and hdf5storage:

f1 = h5py.File(os.path.join(’filename.mat’))input = np.array(f1[’input’])label = np.array(f1[’label’])

4.2 Project structure/3dreconstruction/....main folder/3dreconstruction/nets/....folder with networks/3dreconstruction/nets/resnet50.py ....ResNet-50 implementation/3dreconstruction/nets/resnet152.py ....ResNet-152 implementation/3dreconstruction/nets/net.py ....Implementation based on [1]

1 python.org, Install Python https://www.python.org/downloads/2 pip.pypa.io, Install pip https://pip.pypa.io/en/stable/installing/3 mathworks.com, MATLAB https://www.mathworks.com/products/matlab.html4 pypi.org, hdf5storage https://pypi.org/project/hdf5storage/5 https://www.numpy.org, NumPy, package for scientific computing with Python

15

4. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ./3dreconstruction/nets/netskip.py ....Implementation based on [1] with skip connec-tion/3dreconstruction/utils/....folder with utils/3dreconstruction/utils/roc.py ....script for ROC curves graphs generation/3dreconstruction/utils/resultvisualization.ipynb ....Jupyter Notebook for occupancyprediction visualization

4.3 Data processingPrepared dataset from KITTI LIDAR data contains 3982 voxel maps. These mapsshould be separated to training, validation and testing subgroups. The training datasetis used to train the model.This is the largest subgroup which contains 3300 data samples.One training epoch goes through all 3300 samples. Every epoch samples are shuffled.Shuffling prevents the model overfitting. Overfitting is problem when the model predictswell only examples from the training set. Validation dataset contains 560 voxel mapsamples. The validation dataset is evaluated after every epoch. The main goal ofvalidation dataset is to update higher level hyperparameters, such as learning rate.The testing sub dataset is used to evaluate the whole model training training andcontains 122 voxel maps.

Every .mat file from dataset contains 5 variables:

. basedir contains directory to corresponding example in original KITTI dataset. Thegiven examples are artificially created. offset. static the ground truth voxel map, see Figure 4.4. velodine16 voxel map represented by 16-beam data from SSL, see Figure 4.2. velodine64 voxel map represented by 64-beam data from SSL, see Figure 4.3. disttovisible64 3d array, which contains voxel-wise distance from LIDAR to occupiedvoxel

All maps used voxels of edge size 0.2 m. For better evaluation, the weights areevaluated as follows:

f = h5py.File(os.path.join(path, listnames.pop()))inputArr = np.array(f[’velodine64’])labelArr = np.array(f[’static’]) + inputArrdtv = np.array(f[’dist_to_visible64’])weightArr = np.copy(labelArr)weightArr[weightArr>0] = 12weightArr[weightArr<0] = 1weightArr = np.where(dtv>4,0,weightArr)

For each occupied voxel corresponding weight is significantly larger due to big differencein the ratio between occupied end empty classes. By changing the weights we balancethe classes. [1]

4.4 Building modelAfter all required software installed and learning data is adjusted, I am ready to startbuilding the models. First, I set up variables, such as image size, kernel size, depthsize, batch size. For thesis experiments, only kernel size is varied between 3 and 5. As Imentioned in the analysis chapter, TensorFlow requires to create computational graph:

16

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Building model

graph = tf.Graph()with graph.as_default():

In graph scope all placeholders for input, labels, weights are defined:tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, im-

age_size, num_channels),name="tfTRAINDATA")

tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size,num_channels))

tf_weights = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size,num_channels))

tf_valid_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, im-age_size, num_channels))

tf_test_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size,num_channels))

global_step = tf.Variable(0, trainable=False)

4.4.1 Model implementation based on [1]There are many approaches on how to define the CNN model in TensorFlow. I learnedthat approach from Udacity1 course about machine learning by Google. This model wastrained with different kernel size and with different input and labels, see chapter 5. Forsake of readability, I remove Python spacing. The part of bachelor thesis is DVD withall source codes with correct indentations. The following code shows implementationof model based on Learning for Active 3D mapping.

def model(data):conv1 = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding=’SAME’)bias1 = tf.nn.relu(conv1 + layer1_biases)conv2 = tf.nn.conv2d(bias1, layer2_weights, [1, 1, 1, 1], padding=’SAME’)bias2 = tf.nn.relu(conv2 + layer2_biases)pool2 = tf.nn.max_pool(bias2, [1, 2, 2, 1], [1, 2, 2, 1], padding=’SAME’)conv3 = tf.nn.conv2d(pool2, layer3_weights, [1, 1, 1, 1], padding=’SAME’)bias3 = tf.nn.relu(conv3 + layer3_biases)pool3 = tf.nn.max_pool(bias3, [1, 2, 2, 1], [1, 2, 2, 1], padding=’SAME’)conv4 = tf.nn.conv2d(pool3, layer4_weights, [1, 1, 1, 1], padding=’SAME’)bias4 = tf.nn.relu(conv4 + layer4_biases)conv5 = tf.nn.conv2d(bias4, layer5_weights, [1, 1, 1, 1], padding=’SAME’)bias5 = tf.nn.relu(conv5 + layer5_biases)conv6 = tf.nn.conv2d(bias5, layer6_weights, [1, 1, 1, 1], padding=’SAME’)deconv1 = upsample(conv6, 32, 4, "deconv1")return deconv1

4.4.2 Model implementation based on [1] with skip connectionsThere are minor differences in comparison with previous model. Thanks to skip connec-tions the input data is directly propagated to output, so model did not lose any voxel,which positive occupancy state is obvious. The following code shows propagation ofoccupied voxels straight do the last deconvolutional layer.

zero = tf.constant(0, dtype=tf.float32)occupied = tf.not_equal(input, zero)##there is layers from previous model#deconv1 = upsample(conv6, 32, 4, "deconv1")data = tf.scalar_mul(tf.reduce_max(deconv1), input)skip_conn = tf.where(occupied, input, deconv1)

1 https://eu.udacity.com, Udacity, online courses in programming, data science, artificial intelligence

17

4. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.4.3 Model implementation based on ResNet-50

First, I should define methods for resized and non-resized layers. This methods signifi-cantly simplifying model coding.

def resize_layer(scope_name, inputs, small_size, big_size, stride=1, rate=1):with arg_scope([layers.conv2d], rate=rate):with tf.variable_scope(scope_name) as scope:conv = slim.conv2d(inputs, num_outputs=small_size, kernel_size=1,

stride=stride,activation_fn=tf.nn.relu)

conv = slim.conv2d(conv, num_outputs=small_size, kernel_size=3,stride=1,activation_fn=tf.nn.relu)

conv = slim.conv2d(conv, num_outputs=big_size, kernel_size=1,stride=1,activation_fn=None)

conv2 = slim.conv2d(inputs, num_outputs=big_size, kernel_size=1, stride=stride,activation_fn=None)

conv = conv + conv2return tf.nn.relu(conv)

def non_resize_layer(scope_name, inputs, small_size, big_size, rate=1):with arg_scope([layers.conv2d], rate=rate):with tf.variable_scope(scope_name) as scope:conv = slim.conv2d(inputs, num_outputs=small_size, scope=’conv2’ kernel_size=1, stride=1,

activation_fn=tf.nn.relu)conv = slim.conv2d(conv, num_outputs=small_size, scope=’conv3’ kernel_size=3, stride=1,

activation_fn=tf.nn.relu)conv = slim.conv2d(conv, num_outputs=big_size, scope=’conv4’, kernel_size=1, stride=1,

activation_fn=None)conv = conv + inputsreturn tf.nn.relu(conv1)

After that I can easily describe the network model.def model(data):conv = slim.conv2d(data, num_outputs=64, scope=’conv1’, kernel_size=7, stride=2,

activation_fn=tf.nn.relu)pool1 = slim.max_pool2d(conv, kernel_size=3, stride=2, scope=’pool1’)conv = resize_layer("resize1", pool1, small_size=64, big_size=256)for i in range(2):

conv = non_resize_layer("resize2-" + str(i), conv, small_size=64, big_size=256)conv = resize_layer("resize3", conv, small_size=128, big_size=512, stride=2)l1concat = convfor i in range(7):

conv = non_resize_layer("resize4-" + str(i), conv, small_size=128, big_size=512)print(conv.shape)

l2concat = convconv = resize_layer("resize5", conv, small_size=256, big_size=1024, rate=2)l3concat = convfor i in range(35):

conv = non_resize_layer("resize6-" + str(i), conv, small_size=256, big_size=1024,rate=2)

l4concat = convconv = resize_layer("resize7", conv, small_size=512, big_size=2048, rate=4)l5concat = convfor i in range(2):

conv = non_resize_layer("resize8-" + str(i), conv, small_size=512, big_size=2048,rate=4)

l6concat = convconv = tf.concat([l1concat, l2concat, l3concat, l4concat, l5concat, l6concat], axis=3)

18

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Training

conv = slim.conv2d(conv, num_outputs=32, scope=’convFinal’, kernel_size=3, stride=1,normalizer_fn=None, activation_fn=None)conv = slim.conv2d_transpose(conv, num_outputs=32, kernel_size=8, stride=8,normalizer_fn=None, activation_fn=None, scope=’deconvFinal’)return conv

4.4.4 Model implementation based on ResNet-152The only difference between this model and previous implementation is number of layerrepetitions, see Figure 3.6

4.5 TrainingFor learning all described networks, I used initial learning rate = 0.01 with continuousdecay to 0.005 and 0.001 every 10 epochs.

boundaries = [34200, 68400]values = [0.01, 0.005, 0.001]learning_rate = tf.train.piecewise_constant(global_step, boundaries,

values)

For stochastic gradient descent I used Momentum Optimizer with momentum 0.99.After many unsuccessful tries, I left that parameter the same as [1].

optimizer = tf.train.MomentumOptimizer(learning_rate, 0.99).minimize(loss,global_step=global_step)

In TensorFlow computing graphs are evaluated in sessions. The session object(tf.Session) is used for the graph execution context and invoke all necessary resources,classes, placeholders.

with tf.Session(graph=graph, config=config) as session:tf.global_variables_initializer().run()

4.6 Saving training modelTensorflow allows to simply save trained network. With tf.train.Saver() variable in thecomputational graph model can be saved during running session.

saver.save(session, "/home.stud/model.ckpt")

4.7 Drawing ROC curvesThe main metric for evaluation accuracy of occupancy prediction is ROC curve. Thiscurve is a line from (0,0) to (1,1) in the coordinates of True Positive Rate (TPR) andFalse Positive Rate (FPR). The larger area under the curve is better. Using matplolibtoolkit 1, I can easily draw multiple ROC curves:

import numpy as npimport matplotlibmatplotlib.use(’Agg’)import matplotlib.pyplot as pltx_1 = np.genfromtxt(’/Users/admin/Downloads/bp/resnet50/x.csv’, delimiter=’,’)y_1 = np.genfromtxt(’/Users/admin/Downloads/bp/resnet50/y.csv’, delimiter=’,’)

1 https://matplotlib.org, Matplotlib, https://matplotlib.org

19

4. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .x_2 = np.genfromtxt(’/Users/admin/Downloads/bp/resnet152/x.csv’, delimiter=’,’)y_2 = np.genfromtxt(’/Users/admin/Downloads/bp/resnet152/y.csv’, delimiter=’,’)

plt.figure()lines = plt.plot(x_1, y_1, x_2, y_2)plt.setp(lines[0], color=’r’, linewidth=2.0, label = ’Kernel3, input: Vel64, label:

static’)plt.setp(lines[1], color=’g’, linewidth=2.0, label = ’Kernel5, input: Vel64, label:

static’)

plt.xscale(’log’)plt.ylim([0.0, 1.05])plt.xlim([0.005, 1])plt.xlabel(’False Positive Rate’)plt.ylabel(’True Positive Rate’)plt.title(’ROC Comparsion’)plt.legend(loc="lower right")plt.savefig("/Users/admin/Downloads/resnetvs4.png")

4.8 Visual representation of computational graphsand loss function

Tensorboard is a tool that can be used to monitor the progress of model training. Definesummary writer and scalars to observe:

loss_scalar = tf.summary.scalar("loss", loss)loss_val_scalar = tf.summary.scalar("loss_val", loss)loss_tst_scalar = tf.summary.scalar("loss_tst", loss)logs_path = ’/home.stud/resnet50/lossgraph’summary_writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())summary_writer.add_summary(summary, batch_size*batch_iter + batch_step)

Then this data can be accessible with web browser in TensorBoard web application. Tostart this application on local machine, execute this command:

tensorboard --log_dir=/home.stud/resnet50/lossgraph

4.9 Visual representation of 3D reconstructionThe output of designed network is occupancy prediction 3D voxel map. The size ofthis map is 160x160x32 voxels. Matplotlib toolkit can not render such a big voxel map.Mayavi is a single tool that capable to do such task without complex rendering. Alloutput renders are created on my local machine without graphic card.

4.9.1 MayaviBelow is an example of Mayavi creating voxel map from ground truth sample, such asFigure 4.4

import mayavi.mlab as mlabf1 = h5py.File(os.path.join(’filename.mat’))output = np.array(f1[’static’])data = np.array(f1[’static’])output = np.transpose(output[...], (2,1,0))data = np.transpose(data[...], (2,1,0))output = np.where(output > 0.4, output, 0)mask=(data > 0) | (output > 0)output = np.where(data > 0, data, output)

20

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.9 Visual representation of 3D reconstruction

output = np.flip(output, axis=2)data = np.flip(data, axis=2)clrs = output*(1.0/output.max())np.count_nonzero(clrs)

clors = np.zeros(output.shape + (3,))clors[..., 0] = clrsclors[..., 1] = clrsclors[..., 2] = clrsfor i in range(0,32):

clors[:, :, i, 0] = colorlist[i][0]clors[:, :, i, 1] = colorlist[i][1]clors[:, :, i, 2] = colorlist[i][2]

n = np.count_nonzero(clrs) # number of pointsx, y, z = np.where(output > 0)rgba = np.random.randint(0, 256, size=(n, 4), dtype=np.uint8)rgba[:, -1] = 255 # no transparencyrgba[:, 0] = clors[..., 0][np.nonzero(output)].reshape(-1)*255.0 # no

transparencyrgba[:, 1] = clors[..., 1][np.nonzero(output)].reshape(-1)*255.0 # no

transparencyrgba[:, 2] = clors[..., 2][np.nonzero(output)].reshape(-1)*255.0 # no

transparencypts = mlab.pipeline.scalar_scatter(x, y, z) # plot the pointspts.add_attribute(rgba, ’colors’) # assign the colors to each pointpts.data.point_data.set_active_scalars(’colors’)g = mlab.pipeline.glyph(pts, mode = ’cube’)g.glyph.glyph.scale_factor = 1 # set scaling for all the pointsg.glyph.scale_mode = ’data_scaling_off’ # make all the points same size

mlab.show()

21

4. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 4.1. Overfitting example. Source: [27]

22

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.9 Visual representation of 3D reconstruction

Figure 4.2. Voxel map represented by 16-beam data example

23

4. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 4.3. Voxel map represented by 64-beam data example

Figure 4.4. Voxel map represented by ground truth data example

24

Chapter 5Testing and evaluation

There is 2 input data setups to test:. Velodyne 16-beam data as input, Velodyne 64-beam data as label.. Velodyne 64-beam data as input, ground truth data combined with Velodyne 64-beam data as label.These 2 setups are trained on 6 models. For sake of simplicity, the network which

was initially based on approach proposed in [1] I would call MappingNet. The list ofall models continues as follows:. MappingNet. Kernel 3x3. MappingNet. Kernel 5x5. MappingNet+skip connections. Kernel 3x3. MappingNet+skip connections. Kernel 5x5. ResNet-50. ResNet-152

5.1 Evaluation routineAll networks were trained on CTU FEE Department of Cybernetics server Goedel.Goedel is machine with 40 cores, 256GB RAM and 3 Nvidia Titan X GPU. For everytraining instance I used one GPU. The average time of model training is 8.5 hours.This includes 30 epochs of training on 3300 samples with validation after every epochon 560 samples and finally testing on 122 samples. After that, trained model is savedin /predicition/ folder for each corresponding network sample. After each evaluation,the ROC curve image is created and saved under corresponded network name. Datafor ROC curve recreation for comparison is saved in csv files x.csv and y.csv. Graphscontains training, testing and validation loss are saved under TensorBoard format filein /lossgraph/ folder.

5.2 ResNet networks comparisonThe ResNet-152 based network shows better results even in comparison with Learningfor Active 3D Mapping approach. But only when trained on Velodyne 16 data.

Average loss on testing data for outperforming ResNet-152 is 0.238

5.3 MappingNet networks comparison

5.3.1 Input: Velodyne 64, label: ground truth+velodyne dataVelodyne 64-beam data as input, ground truth data combined with Velodyne 64-beamdata as label does not improve ROC curve in comparison with state of the art approach.

Average loss on testing data for the best performed MappingNet with skip connec-tions(5x5 kernel) is 0.255

25

5. Testing and evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 5.1. ROC curves comparison between ResNet based networks and [1]

Figure 5.2. Loss function during training on Velodyne data ResNet-152 based network.Axis X: number of processed samples. Axis Y: loss value.

5.3.2 Input: Velodyne 16, label: Velodyne 64Velodyne 16-beam data as input, Velodyne 64 as label data trained on MappingNetwithout skip connections outperforms state of the art approach.

Average loss on testing data for the best performed MappingNet without skip con-nections(3x3 kernel) is 0.188

5.4 Loss comparison, best performers comparisonLoss comparion on tesing data for all setups.

26

. . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Loss comparison, best performers comparison

Figure 5.3. Example of 3D reconstruction using ResNet-152 based network. Velodyne 16beams as input. Velodyne 64 beams as label. The input shows as red voxels. Other voxels

are predicted.

Figure 5.4. ROC curves comparison between MappingNet networks with ground truthlabels and [1]

27


Figure 5.5. Loss function during training on MappingNet+skip connection(kernel 5x5)network. Axis X: number of processed samples. Axis Y: loss value.

Figure 5.6. Example of 3D reconstruction using MappingNet+skip connection(kernel 5x5)network. Velodyne 64 data as input, ground truth as label. The input shows as red voxels.

Other voxels are predicted.

28


Figure 5.7. ROC curves comparison between MappingNet networks trained with Velodyne64 as label and [1]

Figure 5.8. Loss function during training on MappingNet(kernel 3x3) network trained withVelodyne 64 as label. Axis X: number of processed samples. Axis Y: loss value.

29


Figure 5.9. Example of 3D reconstruction using MappingNet (kernel 3x3) network. Velo-dyne 16 data as input, Velodyne 64 as label. The input shows as red voxels. Other voxels

are predicted.

Model Kernel size Input data Label data Loss valueMappingNet 3x3 Velodyne 16 rays Velodyne 64 rays 0.281with skip connectionsMappingNet 5x5 Velodyne 16 rays Velodyne 64 rays 0.290with skip connectionsMappingNet 3x3 Velodyne 16 rays Velodyne 64 rays 0.264with skip connections combined with static voxelsMappingNet 5x5 Velodyne 16 rays Velodyne 64 rays 0.269with skip connections combined with static voxels

Table 5.1. MappingNet with skip connections evaluation. Number of epochs: 30. Iterationper epoch: 3800. Initial learning rate: 0.01. Learning rate decay after 10 epochs.

Model Input data Label data Loss valueResNet50 Velodyne 16 rays Velodyne 64 rays 0.301ResNet152 Velodyne 16 rays Velodyne 64 rays 0.238ResNet50 Velodyne 16 rays Velodyne 64 rays 0.298

combined with static voxelsResNet152 Velodyne 16 rays Velodyne 64 rays 0.291

combined with static voxels

Table 5.2. ResNet evaluation. Number of epochs: 30. Iteration per epoch: 3800. Initiallearning rate: 0.01. Learning rate decay after 10 epochs.

30


Model Kernel size Input data Label data Loss valueMappingNet 3x3 Velodyne 16 rays Velodyne 64 rays 0.188MappingNet 5x5 Velodyne 16 rays Velodyne 64 rays 0.195MappingNet 3x3 Velodyne 16 rays Velodyne 64 rays 0.301

combined with static voxelsMappingNet 5x5 Velodyne 16 rays Velodyne 64 rays 0.322

combined with static voxels

Table 5.3. MappingNet evaluation. Number of epochs: 30. Iteration per epoch: 3800.Initial learning rate: 0.01. Learning rate decay after 10 epochs.

Figure 5.10. ROC curves comparison between best performers and [1]

31

Chapter 6Conclusion

The main goal of this thesis was analyzing different deep learning algorithms suitablefor 3D reconstruction. In related work chapter, I reviewed different state of the art ap-proaches for depth estimation and 3D reconstruction tasks and briefly discussed similar-ities with semantic segmentation task. Data for successful learning can be representedby monocular images or sparse samples. The analysis part of the thesis explained thechoice of KITTI dataset and TensorFlow framework. Then I show the basic concept ofconvolutional neural networks. Their main building blocks were described at the begin-ning. Then I introduced several network architectures, based on the well-known stateof the art CNN, such as ResNet-50 and ResNet-152. I modified them for 3D occupancyreconstruction task by removing fully connected layers and adding a deconvolutionallayer on end. Also I recreate CNN architecture proposed by Zimmermann et al. [1] andtried to extend this approach by adding skip connections. Then I implemented themand trained with different kernel sizes and on different input/label pairs. Implementednetworks proves their ability to learn and perform reasonably well in comparison withstate of the art architectures.

6.1 Future workThis work can be continued in several ways. The architecture of networks can beextended to classify occupied voxels to dynamic and static classes such as movingcars, pedestrians, houses, road signs. KITTI dataset has these classes labeled, butonly for image data. Since my work is focused on reconstruction from sparse depthmeasurements, an extension to classes can be achieved by adding semantic segmentationarchitecture components.

32

References

[1] Karel Zimmermann, Tomas Petricek, Vojtech Salansky, and Tomás Svoboda.Learning for Active 3D Mapping. CoRR. 2017, abs/1708.02074

[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classificationwith Deep Convolutional Neural Networks. 2012, 1097–1105.

[3] E. Shelhamer, J. Long, and T. Darrell. Fully Convolutional Networks for SemanticSegmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.2017, 39 (4), 640-651. DOI 10.1109/TPAMI.2016.2572683.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learningfor Image Recognition. CoRR. 2015, abs/1512.03385

[5] Eman Ahmed, Alexandre Saint, Abd El Rahman Shabayek, Kseniya Cherenkova,Rig Das, Gleb Gusev, Djamila Aouada, and Bjorn Ottersten. ”A survey on DeepLearning Advances on Different 3D Data Representations”. arXiv e-prints. ”2018”,arXiv:1808.01462.

[6] Ashutosh Saxena, Sung H. Chung, and Andrew Y. Ng. Learning Depth from SingleMonocular Images. NIPS’05. 2005.http://dl.acm.org/citation.cfm?id=2976248.2976394.

[7] David Eigen, Christian Puhrsch, and Rob Fergus. Depth Map Prediction from aSingle Image using a Multi-Scale Deep Network. CoRR. 2014, abs/1406.2283

[8] Fangchang Ma, and Sertac Karaman. Sparse-to-Dense: Depth Prediction fromSparse Depth Samples and a Single Image. CoRR. 2017, abs/1709.07492

[9] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised Monoc-ular Depth Estimation with Left-Right Consistency. CoRR. 2016, abs/1609.03677

[10] Waleed Ali, Sherif Abdelkarim, Mohamed Zahran, Mahmoud Zidan, and AhmadEl Sallab. YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box De-tection from LiDAR Point Cloud. CoRR. 2018, abs/1808.02350

[11] A Geiger, P Lenz, C Stiller, and R Urtasun. Vision meets robotics: The KITTIdataset. The International Journal of Robotics Research. 2013, 32 (11), 1231-1237.

[12] Christopher Bongsoo Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and SilvioSavarese. 3D-R2N2: A Unified Approach for Single and Multi-view 3D ObjectReconstruction. CoRR. 2016, abs/1604.00449

[13] P. Ramachandran, and G. Varoquaux. Mayavi: 3D Visualization of Scientific Data.Computing in Science & Engineering. 2011, 13 (2), 40–51.

[14] Pushmeet Kohli Nathan Silberman, Derek Hoiem, and Rob Fergus. Indoor Seg-mentation and Support Inference from RGBD Images. In: ECCV. 2012.

[15] Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3D: Learning 3D SceneStructure from a Single Still Image. IEEE Trans. Pattern Anal. Mach. Intell..2009, 31 (5), 824–840. DOI 10.1109/TPAMI.2008.132.

33

http://dx.doi.org/10.1109/TPAMI.2016.2572683

http://dl.acm.org/citation.cfm?id=2976248.2976394

http://dx.doi.org/10.1109/TPAMI.2008.132

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .[16] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nas-

sir Navab. ”Deeper Depth Prediction with Fully Convolutional Residual Networks”.arXiv e-prints. ”2016”, arXiv:1606.00373.

[17] Bo Li, Yuchao Dai, and Mingyi He. ”Monocular Depth Estimation with Hierar-chical Fusion of Dilated CNNs and Soft-Weighted-Sum Inference”. arXiv e-prints.”2017”, arXiv:1708.02287.

[18] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-basedlearning applied to document recognition. In: Proceedings of the IEEE. 1998. 2278–2324.

[19] Xiaoguang Han, Zhen Li, Haibin Huang, Evangelos Kalogerakis, and Yizhou Yu.High-Resolution Shape Completion Using Deep Neural Networks for Global Struc-ture and Local Geometry Inference. CoRR. 2017, abs/1709.07599

[20] Karen Simonyan, and Andrew Zisserman. Very Deep Convolutional Networks forLarge-Scale Image Recognition. CoRR. 2014, abs/1409.1556

[21] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and HartwigAdam. ”Encoder-Decoder with Atrous Separable Convolution for Semantic ImageSegmentation”. arXiv e-prints. ”2018”, arXiv:1802.02611.

[22] Yuhui Yuan, and Jingdong Wang. ”OCNet: Object Context Network for SceneParsing”. arXiv e-prints. ”2018”, arXiv:1809.00916.

[23] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. ”Learning DeconvolutionNetwork for Semantic Segmentation”. arXiv e-prints. ”2015”, arXiv:1505.04366.

[24] Yanzhao Wu, Ling Liu, Calton Pu, Wenqi Cao, Semih Sahin, Wenqi Wei, andQi Zhang. A Comparative Measurement Study of Deep Learning as a ServiceFramework. CoRR. 2018, abs/1810.12210

[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein,Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual RecognitionChallenge. International Journal of Computer Vision (IJCV). 2015, 115 (3),211-252. DOI 10.1007/s11263-015-0816-y.

[26] Prasun Roy, Subhankar Ghosh, Saumik Bhattacharya, and Umapada Pal. ”Effectsof Degradations on Deep Neural Network Architectures”. arXiv e-prints. ”2018”,arXiv:1807.10108.

[27] Chabacano [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)].Overfitting example.https://upload.wikimedia.org/wikipedia/commons/1/19/Overfitting.svg.

34

http://dx.doi.org/10.1007/s11263-015-0816-y

https://upload.wikimedia.org/wikipedia/commons/1/19/Overfitting.svg

Appendix AAbbreviations

NN Neural NetworkRNN Recurrent Neural NetworkCNN Convolutional Neural NetworkGPU Graphic Processing UnitCPU Central Processing Unit

ILSVRC ImageNet Large Scale Visual Recognition ChallengeDL Deep Learning

ReLU Rectified Linear UnitSSL Solid State Lidar

LIDAR Light Detection And RangingROC Receiver Operating Characteristic curve

35

Appendix BDVD content

root/bptext – Thesis TeX sources/src – Project sources

36

Documents

CZECHTECHNICAL UNIVERSITY INPRAGUE F3 ... · Therefore low-resolution depth sen-sors(e.g. with only 4-row measurements planes) are often used in contemporary semi-autonomous cars,