
Efficient Yet Deep Convolutional Neural Networks for Semantic Segmentation

Sharif Amit Kamran
Center for Cognitive Skill Enhancement
Independent University Bangladesh
Dhaka, Bangladesh
Email: [email protected]

Muhammad Hasan
Computational Intelligence Lab
KAIST
Daejeon, Republic of Korea
Email: [email protected]

Ali Shihab Sabbir
Center for Cognitive Skill Enhancement
Independent University Bangladesh
Dhaka, Bangladesh
Email: [email protected]

Abstract— Semantic segmentation using deep convolutional neural networks poses a complex challenge for any GPU-intensive workload, as millions of parameters must be computed, resulting in huge memory consumption. Moreover, extracting finer features and conducting supervised training tends to increase the complexity further. Since the introduction of the Fully Convolutional Network, which uses finer strides and deconvolutional layers for upsampling, it has been the go-to architecture for image segmentation tasks. We propose two segmentation architectures that transfer weights from the popular classification networks VGG19 and VGG16, trained on the ImageNet classification dataset; transform all fully connected layers into convolutional layers; use dilated convolution to decrease the number of parameters; and additionally add finer strides and attach four skip architectures that are element-wise summed with the deconvolutional layers in steps. We train and test on sparse and fine datasets such as PASCAL VOC2012, PASCAL-Context, and NYUDv2, and show how well our models perform on these tasks. Furthermore, our models consume up to 10-20 percent less memory for training and testing on NVIDIA Pascal GPUs, making them more efficient and less memory-consuming architectures for pixel-wise segmentation.

Keywords— Deep Learning, Convolutional Neural Network, Semantic Image Segmentation, Skip Architectures

1 Introduction

With the introduction of convolutional neural networks, image recognition has accelerated at a great pace and has produced state-of-the-art results for classification, detection, and semantic segmentation alike. These recognition models not only yield better results for classifying the image as a whole [1], [2], [3], but also excel at extracting local features and producing finer output [4], [5].

The problem at hand is to preserve the global structure while retaining the local context [4], [5]. Most of the time the local features tend to get lost during training, and the global context dominates throughout the segmentation mask. So, to extract those fine local features and to upsample [6] and element-wise sum the coarse semantic information, skip architectures were introduced [4], [5], [7], [8] on top of the preexisting FCN architecture. With skip architectures the image representation becomes finer and less coarse.

The drawback of designing convolutional neural networks for high-level computing tasks such as object recognition or pixel-wise classification is the huge amount of memory required. Firstly, orthodox ConvNets have rather large receptive fields because of their convolutional filters and generate coarse blob-like output maps when redefined to produce pixel-wise segmentation [4]. Secondly, subsampling with max-pooling in ConvNets diminishes the chance of getting finer output [9]. Furthermore, similar labeling of neighboring pixels tends to get lost in the deeper layers where the upsampling takes place. So visual consistency and retention of spatial features are essential for producing a sharp segmentation mask. Falling short of producing such fine output can result in poor object portrayal and patch-like false regions in the segmentation mask [9], [10], [11], [12].

Using finer strides [4] and replacing vanilla convolution with dilated convolution [5], [13], we can achieve better segmentation results while keeping memory usage in check. With dilation we can increase the receptive field exponentially [5] while the convolution's filter remains the same size. So, at the expense of reducing the size of the filter and adding dilation within it, we can free up more memory for computation in the sixth convolutional layer of the FCN architecture, which is the most expensive layer.

Adding more skip architectures seems to increase memory usage for the whole end-to-end network, but because additional memory has been freed up by dilated convolutions [5], [13], we can use two extra skip architectures to upsample the local features from the first two convolutional layers. In a feed-forward network the size of the representation changes with each convolution and is downsampled by pooling, so the feature hierarchies have to be added to the upsampled [4], [6] pooling outputs in steps.

Our proposal in this paper is an efficient yet deep feed-forward neural net for a strongly supervised image segmentation task. Our work integrates both dilated and vanilla convolution to recreate an FCN architecture that generates better output while consuming less memory. In addition, we introduce four skip architectures that fetch local information otherwise lost in the bottom layers of the network, upsample it to the top layers, and element-wise sum it with the global feature map in steps, producing a better segmentation mask


while keeping GPU memory consumption in check. Most importantly, with these changes in architecture, the end-to-end deep network can be trained on any type of data using the usual back-propagation algorithm, with more efficient and finer results. Our benchmark result on the PASCAL VOC2012 reduced validation set [8], [14] is a mean IOU of 64.9 percent for Dilated FCN-2s-VGG19, compared with 63.9 percent for FCN-8s. Additionally, we evaluate the performance of our deep ConvNets on the PASCAL VOC 2012 test set [15], yielding mean IOUs of 69 percent for Dilated FCN-2s-VGG19 and 67.6 percent for Dilated FCN-2s-VGG16 [4].

2 Literature Review

The following section describes different procedures that have previously been proposed for conducting semantic segmentation with deep learning. Out of many approaches, only a few have been adopted for high-compute pixel-wise segmentation.

Our proposed model was developed based on a particular neural net used for image classification [2], [1], [3], and the weights were transferred from it [16], [17]. Transfer learning was first performed for classification tasks, afterwards it was applied to object detection, and lately it has been adopted for instance-aware segmentation [18] and image segmentation models with a powerful classifier [19], [20], [7]. We redesign and redefine the architecture and perform fine-tuning to get sparser and more accurate predictions for semantic segmentation. Furthermore, we compare different models with ours and show how it is more efficient and effective for semantic segmentation jobs.

Multi-digit recognition with a neural network [21], an extension of LeNet [22], was the first work in which an erratic range of input values was witnessed. Though the task was designed for one-dimensional data, Viterbi decoding was sufficient for it. Three years later, convolutional neural networks were extended to two-dimensional feature outputs for processing postal address data [23]. These historical breakthroughs were designed to conduct small yet powerful detection tasks. Additionally, LeCun et al. [24] developed a CNN using fully convolutional inference for sparse multi-class segmentation of embryos. We have also seen FCNs being used in many recent deep-layered nets for high-level computation. Sliding windows for integrated object detection and localization by Eigen et al. [25], recurrent neural networks for scene labeling by Pinheiro et al. [26], and restoring dirt-clad images using a ConvNet by Eigen et al. [27] are remarkable examples. Training an FCN can be difficult, but it has been used for detecting human parts and estimating pose efficiently by Tompson et al. [28].

Different approaches can be taken to obtain a finer segmentation mask by exploiting convolutional neural networks. One strategy is to develop an individual system for extracting dense features and detecting zoomed-in edges from images for finer semantic segmentation [29], [30]. A single-step process can be to extract semantic features with a ConvNet and then use superpixels to figure out the inner layout of the image. Another procedure can be to retrieve superpixels from the given image layout and then extract features from the images one by one [29], [31]. The drawback of this approach is that erroneous superpixels may result in fallacious predictions, irrespective of how powerful the feature extraction was. Zheng et al. [8], [14] designed an RNN model and used conditional random fields to obtain finer features by training an end-to-end network for semantic segmentation. They also proposed a disjointed version of the same model, with less accuracy and more memory consumption, to prove that an end-to-end network always has an upper hand over two- or even three-stage segmentation retrieval procedures.

Another strategy is to develop a model, train it on supervised image data, and output a segmentation label map for each category. Retaining the spatial information, one can replace the fully connected layers with convolutional layers in a deep ConvNet, as shown by Eigen et al. [32]. The most groundbreaking work so far is by Shelhamer and Long et al. [4], where the idea was that an FCN can be designed to harness features from the top-most layers to classify pixels, whereas the bottom layers can be used for detecting shapes, contours, and edges. By element-wise summing earlier layers with latter layers, they introduced the idea of the skip architecture. Separately, conditional random fields were used to refine semantic segmentation further [8], [14]. CRFs were also used by Snavely et al. [33] and Chen et al. [9], [34] to refine existing segmentation masks. Snavely et al. conducted material recognition and segmentation, while Chen et al. developed better ways to obtain finer semantic image segmentation. Though earlier procedures used a disjointed CRF for post-processing the segmented output, the method developed by Torr et al. [8], [14] employed the CRF as a recurrent neural network and also developed a higher-order model that extends CRF-as-RNN. Not only is this ConvNet end-to-end, it also converges faster than the previous CRF models and produces finer segmentation masks.

The difference between dilated and vanilla convolution is an extra parameter, called holes or dilation, that affects the receptive field of the convolution's filter. The whole idea of the à trous algorithm, which is based on wavelet decomposition [35], relies on the dilated filter. In [5], Fisher Yu et al. used the term "dilated convolution" instead of "convolution with a dilated filter" to make clear that no dilated filter is actually built or produced. Instead, the convolutional layer is modified to accept a new parameter, dilation, that alters the preexisting filter. In [36], Chen et al. made use of dilation to modify the architecture of Shelhamer et al. [4] to make it suitable for their task. In contrast, Yu et al. [5] developed a new kind of feed-forward neural net that exploits dilated convolutions and multi-scale context aggregation but gets rid of the preexisting skip architectures.


3 Segmentation Architecture

3.1 Transfer Learning from Classification Net

VGGnet is a well-known neural net that achieved top results in ILSVRC14 [1] for image classification. The network works on the principle of using 3 × 3 filters for feature extraction and stacking convolutions to make the receptive field bigger. We transferred weights from the VGG 19-layer network, removed the classifier from the network, and turned all the fully connected layers into convolutions, as done by Shelhamer et al. [4].

In a convolutional neural network all tensors have three dimensions of size N × H × W, where H and W are the height and width, and N is the number of color channels or feature maps. At the first layer the image is taken as input, where the pixel size is H × W and N is the three color channels of RGB. As described by Shelhamer et al. [4], the receptive field of a location in a higher layer is the set of locations in the image that are path-connected to it.

3.2 Spatial Information and Receptive Field

The output feature map of each convolution can be predefined, but its spatial dimensions depend on the filter size, stride, and padding. The FCN architecture keeps the spatial dimensions of each convolution the same before max-pooling (the exceptions being the Fc6 and Fc7 layers). Let the output spatial dimension be O_ij and the input spatial dimension be I_ij for any convolutional layer, where i, j index the spatial dimensions. The output dimension is then

O_ij = (I_ij + 2*P_ij − K_ij) / S_ij + 1    (1)

Here, P stands for padding, K for kernel (filter) size, and S for stride. We choose a filter size of 3 × 3, a padding of 1, and a stride of 1 for the convolutional layers. This lets us retain the spatial dimensions through the convolutions and rely on pooling to reduce them.

For pooling we use a stride of 2 and a 2 × 2 filter to reduce the spatial size of the tensors. Plugging these values into equation (1), the output becomes half the size of the input.
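As a quick check of equation (1), the layer settings above can be verified with a few lines of Python. This is an illustrative snippet of ours, not code from the paper, and the helper name output_size is hypothetical:

```python
def output_size(i, k, p, s):
    """Spatial output size from equation (1): O = (I + 2P - K) / S + 1."""
    return (i + 2 * p - k) // s + 1

# A 3x3 convolution with padding 1 and stride 1 preserves spatial size.
print(output_size(224, k=3, p=1, s=1))   # 224

# A 2x2 max-pool with stride 2 halves it.
print(output_size(224, k=2, p=0, s=2))   # 112
```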

As the tensors are fed forward through the net, they pass from the first convolutional layer to the second and onward. The first two convolutional layers use 3 × 3 filters, so the first has a receptive field of 3 × 3 and the second a receptive field of 5 × 5. The third to fifth sets of convolutional layers contain four convolutions each, so their receptive fields are considerably larger than in the earlier layers. Notice that the receptive field grows after each set of convolutions. Let the receptive field be r, the filter be f, and the stride be s. If the filter and stride values remain the same for consecutive convolutions, the receptive field can be computed as

r_n × r_n = (2*r_{n−1} + 1) × (2*r_{n−1} + 1)    (2)

where r_n is the receptive field of the next convolution and r_{n−1} is the receptive field of the previous convolution.

The sixth layer, which is the most expensive one, comes next. It is a fully connected layer with a 4096 × 4096 weight dimension and a filter size of 7 × 7. The next layer is also fully connected, with a 4096 × 1000 weight dimension and a filter size of 1 × 1. We convert both of these layers into convolutional layers with filter sizes of 3 × 3 and 1 × 1, as done by Shelhamer et al. [4].
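Converting a fully connected layer into an equivalent convolution amounts to reshaping its weight matrix into a 4-D filter bank. A minimal PyTorch sketch of this idea follows; it is our own illustration rather than the authors' Caffe code, and it assumes the usual VGG/FCN convention of a 512-channel, 7 × 7 pool5 output feeding Fc6 (consistent with the 4096x512x7x7 weights in Table 1):

```python
import torch
import torch.nn as nn

# Fully connected Fc6 from the classification net: (512*7*7) -> 4096.
fc6 = nn.Linear(512 * 7 * 7, 4096)

# Equivalent convolution: a 7x7 filter over 512 input channels, 4096 outputs.
conv6 = nn.Conv2d(512, 4096, kernel_size=7)
conv6.weight.data.copy_(fc6.weight.data.view(4096, 512, 7, 7))
conv6.bias.data.copy_(fc6.bias.data)

# On a 7x7 input the two layers now produce the same activations.
x = torch.randn(1, 512, 7, 7)
assert torch.allclose(fc6(x.flatten(1)), conv6(x).flatten(1), atol=1e-4)
```

Unlike the fully connected layer, the convolutional version can then be slid over inputs larger than 7 × 7, which is what makes fully convolutional inference possible.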

3.3 Decreasing Parameters using Dilation

Fisher et al. [5] combined dilation with vanilla convolution throughout the network and emphasized multi-scale context aggregation. We instead stick to the original structure of FCN [4] and include dilation only in the sixth convolution, which is the most expensive layer and accounts for the most parameter computation. Similar work was done before in ParseNet by Liu et al. [13], but they trained on a reduced version of the VGG-16 net. We train on the original VGG-19 classification net and only change the filter size and add dilation as a parameter in the Fc6 convolution (see Table 1), while retaining the same number of parameters across all other convolutions. The relation between filter size and dilation can be formulated with the following equation:

K′ = K + (K − 1)(d − 1)    (3)

where K′ is the effective filter size, K is the given filter size, and d is the amount of dilation.

A convolution with dilation 1 is the same as having no dilation. Using equation (3), the 7 × 7 filter of the sixth convolutional layer can be replaced with a 3 × 3 filter and a dilation of 3, which keeps the same effective coverage. Fisher et al. [5] also define an equation for calculating the receptive field of a stack of dilated convolutions under the same conditions as before:

r_{i+1} × r_{i+1} = (2^{i+2} − 1) × (2^{i+2} − 1)    (4)
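To make equation (3) concrete, the small snippet below (our own illustration, not the paper's code) shows that a 3 × 3 filter with dilation 3 covers the same 7 × 7 window as the original Fc6 filter while holding far fewer weights:

```python
def effective_filter_size(k, d):
    """Equation (3): K' = K + (K - 1)(d - 1)."""
    return k + (k - 1) * (d - 1)

print(effective_filter_size(k=7, d=1))  # 7 -> original dense Fc6 filter
print(effective_filter_size(k=3, d=3))  # 7 -> same coverage with 3x3 weights
```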

3.4 Deconvolution with Finer Strides

Upsampling by a factor f can be seen as convolution with a fractional input stride of 1/f [4]. If f is an integer, we can reverse the forward and backward passes to make the upsampling work by replacing vanilla convolution with transposed convolution. So we use upsampling with finer strides, as done by Shelhamer et al. [4], for end-to-end computation of the pixel-wise semantic loss. However, where the original authors used strides of 32, 16, and 8, we use a stride of 2 to upsample in steps. This is done so that the local features from the bottom layers can be element-wise summed in via skip architectures (discussed in the next section). Figure 1 shows the procedure in detail. In [17], transposed convolution layers are called "deconvolution layers". Our deconvolutional layers perform bilinear interpolation, as described in [6], rather than being learned. It was observed by Shelhamer et al. [4] and Torr et al. [8] that upsampling within an end-to-end network is faster and more effective for learning dense prediction.
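One common way to realize such a fixed bilinear-interpolation deconvolution layer is a stride-2 transposed convolution whose weights are initialized with a bilinear kernel and frozen. The PyTorch sketch below is our illustration under that assumption (the authors used Caffe); the kernel size of 4 and the weight construction are our choices, not taken from the paper:

```python
import torch
import torch.nn as nn

def bilinear_upsample_2x(channels):
    """Stride-2 transposed convolution fixed to bilinear interpolation."""
    up = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2,
                            padding=1, groups=channels, bias=False)
    # 1-D bilinear filter [1, 3, 3, 1]/4, made 2-D by an outer product.
    f = torch.tensor([1., 3., 3., 1.]) / 4.0
    kernel = torch.outer(f, f)
    for c in range(channels):
        up.weight.data[c, 0] = kernel
    up.weight.requires_grad = False   # fixed interpolation, not learned
    return up

x = torch.randn(1, 21, 16, 16)            # e.g. 21 PASCAL VOC class scores
print(bilinear_upsample_2x(21)(x).shape)  # torch.Size([1, 21, 32, 32])
```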


Fig. 1. The skeleton of the Dilated Fully Convolutional Neural Network with skip architectures. N stands for the number of categories for pixel classification. The net upsamples stride-2 predictions back to pixels in five steps. Pool-4 to pool-1 are element-wise summed with strides 2, 4, 8, and 16 in steps, in reverse order. This provides finer segmentation and more accurate pixel classification.

3.5 Multiple Skip Architectures

We adopt a procedure similar to FCN-8s-all-at-once training [4] rather than the staged FCN-8s version, because staged training is more time-consuming while predicting with nearly the same accuracy as the all-at-once version. Moreover, in the all-at-once version each skip architecture is scaled by a fixed constant, and these constants are chosen such that they equal the average feature norms across all skip architectures [4]. This decreases inference time considerably. In our case the inference time was less than 200 ms for each forward and backward pass combined, whereas FCN-8s had an inference time of 500 ms. We use a total of four skip architectures. The skip architectures also tend to consume less memory than the convolutional layers, since their only operation is element-wise summation with another layer.
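The step-wise fusion can be summarized in a few lines of Python. This is our reading of the scheme in Fig. 1, not the released model: each step upsamples the running score map by 2 and element-wise sums it with a 1 × 1-scored pool feature (the fixed per-skip scaling constants mentioned above, and the class count of 21, are assumptions for illustration):

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Upsample scores by 2, then element-wise sum with a scored pool feature."""
    def __init__(self, num_classes, pool_channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2,
                                     padding=1, bias=False)
        self.score = nn.Conv2d(pool_channels, num_classes, kernel_size=1)

    def forward(self, coarse, pool_feat):
        return self.up(coarse) + self.score(pool_feat)

# Fusing coarse Fc7 scores with pool4 features (21 PASCAL VOC classes).
fuse4 = SkipFusion(num_classes=21, pool_channels=512)
coarse = torch.randn(1, 21, 7, 7)      # scores at the deepest resolution
pool4 = torch.randn(1, 512, 14, 14)    # features from the pool4 stage
print(fuse4(coarse, pool4).shape)      # torch.Size([1, 21, 14, 14])
```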

Table 1. Parameters Comparison

                   FCN-8s          Dilated FCN-2s (ours)
Inference time     0.5 s           0.2 s
Fc6 weights        4096x512x7x7    4096x512x3x3
Dilation           1 (none)        3
Fc6 parameters     102,760,448     18,874,368
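The Fc6 parameter counts in Table 1 follow directly from the weight shapes; a quick arithmetic check (ours, not from the paper's code):

```python
# Fc6 weight tensors: out_channels * in_channels * kH * kW
fc6_vanilla = 4096 * 512 * 7 * 7   # original FCN-8s Fc6
fc6_dilated = 4096 * 512 * 3 * 3   # dilated Fc6 with a 3x3 filter

print(fc6_vanilla)                  # 102760448
print(fc6_dilated)                  # 18874368
print(fc6_vanilla // fc6_dilated)   # roughly 5x fewer parameters
```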

4 Experiments

4.1 Data-set and Procedure

Pascal VOC. Transfer learning was performed by copying weights separately from the VGG-19 and VGG-16 classification nets for our two models, Dilated FCN-2s-VGG19 and Dilated FCN-2s-VGG16. We use the back-propagation algorithm [22] to train the network end-to-end with forward and backward passes. We used dilation for our most expensive layer, Fc6, as shown in Table 1, which reduced the number of parameters, resulting in less computation and faster inference. The total time needed was 12 hours for both networks to reach their best mIOU using a single GPU. We used the PASCAL VOC 2012 training set of 1464 images. Validation was done on the reduced VOC2012 validation set of 346 images [8], on which we obtained 58 percent mean IOU.

Table 2. Evaluation on the PASCAL VOC2012 reduced validation set

Neural Nets                          Pixel Acc.   Mean Acc.   Mean IOU   FW Acc.
FCN-8s-all-at-once                   90.8         77.4        63.8       84
FCN-8s                               90.9         76.6        63.9       84
Dilated FCN-2s using VGG16 (ours)    91           78.3        64.15      84.4
Dilated FCN-2s using VGG19 (ours)    91.2         77.6        64.86      84.7

Semantic Boundaries Dataset. Additional data was used to improve the pixel accuracy and mean-intersection-over-union scores of both models, for which training was done on the Semantic Boundaries dataset [20]. The set consists of 8498 training and 2857 validation images. Training was done on both the training and validation data, totaling 11355 images. The reduced validation set was obtained by removing from the VOC2012 validation set [15] the images that also appear in the augmented VOC2012 training set, resulting in 346 images. Our model Dilated FCN-2s-VGG16 achieves a mean IOU of 64.1 percent, and Dilated FCN-2s-VGG19 scores a mean IOU of 64.9 percent. Clearly, the deeper version of the model is more precise for pixel-wise segmentation. Training was done with a learning rate of 10e−11 for 200,000 iterations.

Table 3. Evaluation on the Pascal-Context dataset

Architectures               Pixel Acc.   Mean Acc.   Mean IU   F.W. IU
O2P [37]                    -            -           18.1      -
CFM [38]                    -            -           18.1      -
FCN-32s                     65.5         49.1        36.7      50.9
FCN-16s                     66.9         51.3        38.4      52.3
FCN-8s                      67.5         52.3        39.1      53.0
CRFasRNN [8]                -            -           39.28     -
HO-CRF [14]                 -            -           41.3      -
DeepLab-LargeFOV-CRF [39]   -            -           39.6      -
Ours                        -            -           42.6      -

Fig. 2. The third and fourth images show the outputs of our models, the fourth being the more accurate Dilated FCN-2s-VGG19. The second image shows the output of the previous best method by Shelhamer et al. [4].

Pascal-Context Dataset. We also train on a sparser dataset, Pascal-Context, which has 60 classes and poses a more challenging pixel-wise prediction task [40]. The dataset consists of 10103 images. We split it into 5105 validation images, and the rest are used as the training set. Our model scores a better mean IOU, 42.6 percent, than the other state-of-the-art models. Moreover, many deeper models with higher-order CRFs as post-processing layers scored worse than our model. This clearly indicates that our model is better suited to pixel-wise prediction on sparse datasets. Training was done with a learning rate of 10e−10 for 300,000 iterations.

Table 4. Evaluation on the NYUDv2 dataset

                       Pixel Acc.   Mean Acc.   Mean IU   F.W. IU
Gupta et al. [7]       60.3         -           28.6      47
FCN-32s RGB            61.8         44.7        31.6      46
FCN-32s HHA            58.3         35.7        25.2      41.7
FCN-2s Dilated RGB     62.6         47.1        32.3      47.5
FCN-2s Dilated HHA     58.8         39.3        26.8      43.5

NYUDv2. We train on NYUD version 2, an RGB-D dataset collected with the Microsoft Kinect. It consists of 1,449 RGB-D images with pixel-wise semantic labels, which Gupta et al. [41] divided into 40 semantic classes. The data is split into 795 training images and 654 test images. Table 4 compares FCN with our model. We train Dilated FCN-2s-VGG19 on the three-channel RGB images. We then add depth information and train a new model upgraded to take four-channel RGB-D input, though performance does not increase. Long et al. [4] describe this phenomenon as a consequence of having a similar number of parameters or of the failure to propagate all the semantic gradients through the net. Following Gupta et al. [7], we next train on a three-dimensional HHA encoding of depth. The results prove to be more precise and yield better scores for our model. Training was done with a learning rate of 10e−10 for 150,000 iterations.

4.2 Metrics and Evaluation

We use four different metrics: pixel accuracy, mean accuracy, mean intersection-over-union, and frequency-weighted accuracy. As background pixels are in the majority, pixel accuracy is not preferable. For semantic segmentation and scene labeling, mean intersection-over-union is the most suitable choice for benchmarking.

pixel accuracy: ∑_i N_ii / ∑_i ∑_j N_ij

mean accuracy: (1/N_class) ∑_i N_ii / ∑_j N_ij

mean IOU: (1/N_class) ∑_i N_ii / (∑_j N_ij + ∑_j N_ji − N_ii)

frequency weighted IU: (∑_k ∑_j N_kj)^−1 ∑_i (∑_j N_ij) N_ii / (∑_j N_ij + ∑_j N_ji − N_ii)

where N_ij is the number of pixels of class i predicted to belong to class j, N_class is the number of classes, and ∑_j N_ij is the total number of pixels of class i. The data was used as provided by [15] and [20]. No pre- or post-processing or augmentation was applied to the training or validation images to enhance the accuracy of the segmentation output.
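All four metrics can be computed directly from a confusion matrix N, where N[i, j] counts pixels of class i predicted as class j. The small NumPy sketch below is our own illustration of the formulas above, not the evaluation script used in the paper:

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels of class i predicted as class j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                # N_ii
    gt = conf.sum(axis=1)             # sum_j N_ij (pixels of class i)
    pred = conf.sum(axis=0)           # sum_j N_ji (pixels predicted as i)
    union = gt + pred - tp

    pixel_acc = tp.sum() / conf.sum()
    mean_acc = np.mean(tp / gt)
    iou = tp / union
    mean_iou = np.mean(iou)
    fw_iou = (gt * iou).sum() / conf.sum()
    return pixel_acc, mean_acc, mean_iou, fw_iou

# Toy 3-class example.
conf = np.array([[50, 2, 1],
                 [3, 40, 5],
                 [0, 4, 30]])
print(segmentation_metrics(conf))
```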

Table 2 compares pixel accuracy, mean accuracy, mean IOU, and FW accuracy between our models and other FCN architectures. Both of our models outperform the preexisting FCN structure on the reduced validation set.

Table 5. PASCAL VOC 2012 test results

Neural Nets                          Mean IOU
FCN-8s [4]                           62.2
FCN-8s-heavy [4]                     67.2
DeepLab-CRF [9]                      66.4
DeepLab-CRF-MSc [9]                  67.1
VGG19 FCN [42]                       68.1
Dilated FCN-2s using VGG16 (ours)    67.6
Dilated FCN-2s using VGG19 (ours)    69

4.3 Test Results

The test results shown in Table 5 indicate that our models score better than similar FCN architectures in the PASCAL VOC2012 Segmentation Challenge. We did not train on any additional data, nor did we add any graphical model such as a CRF or MRF [8], [14] to enhance accuracy further, because that consumes about 2× more GPU memory for training. Moreover, our model scores better than the FCN models on the NYUDv2 set too (see Table 4). Figure 2 shows the segmentation masks compared with FCN-8s and the ground truth. Table 6 also demonstrates how much less memory our nets consume for training and testing on a GPU. As one can see, the reduction in memory usage for training with Dilated FCN-2s-VGG16 is nearly 18 percent.

Both models were trained and tested with Caffe [43] on an NVIDIA GTX 1060 and a GTX 1070 separately. The code for this model can be found at: https://github.com/SharifAmit/DilatedFCNSegmentation


Fig. 3. Results after testing on the Pascal-Context dataset [40] for 59 classes. The third column of images shows the output of our model, which tends to be more accurate. The second column shows the O2P model's [38], [37] output, which is wrongly predicted in many instances.

Table 6. GPU memory usage comparison

                         Training (MB)   Training + Testing (MB)
FCN-8s                   3759            4649
Dilated FCN-2s VGG16     3093            4101
Dilated FCN-2s VGG19     3367            4509

5 Conclusion

Enhancing accuracy for pixel-wise segmentation requires a huge amount of memory and time. Fully convolutional networks can be used to transfer weights from a pre-trained net, to element-wise sum different layers to improve accuracy, and to train end-to-end on entire images with extensive data. Dilation increases the receptive field while decreasing parameters and inference time. The objective was to create efficient yet deeper architectures for generating accurate output, and the proposed models have efficiently produced accurate pixel-wise segmentation.

Acknowledgment

We would like to thank Evan Shelhamer for providing the evaluation scripts and the Caffe users community for their advice and suggestions. We also acknowledge the technical support provided to us by the "Center for Cognitive Skill Enhancement".

References

[1] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[4] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.

[5] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[6] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.

[7] Saurabh Gupta, Ross Girshick, Pablo Arbelaez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, pages 345–360. Springer, 2014.

[8] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.

[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations, 2015.

[10] Zhuowen Tu. Auto-context and its application to high-level vision tasks. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

[11] Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu. Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision, 63(2):113–140, 2005.

[12] Chris Russell, Pushmeet Kohli, Philip H. S. Torr, et al. Associative hierarchical CRFs for object class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference on, pages 739–746. IEEE, 2009.

[13] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.

[14] Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, and Philip H. S. Torr. Higher order conditional random fields in deep neural networks. In European Conference on Computer Vision, pages 524–540. Springer, 2016.

[15] Mark Everingham, Luc Van Gool, Chris Williams, J. Winn, and A. Zisserman. Pascal visual object classes challenge results. Available from www.pascal-network.org, 1(6):7, 2005.

[16] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.

[17] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[18] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.

[19] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2016.

[20] Bharath Hariharan, Pablo Arbelaez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision, pages 297–312. Springer, 2014.

[21] Ofer Matan, Christopher J. C. Burges, Yann LeCun, and John S. Denker. Multi-digit recognition using a space displacement neural network. In Advances in Neural Information Processing Systems, pages 488–495, 1992.

[22] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[23] Ralph Wolf and John C. Platt. Postal address block location using a convolutional locator network. In Advances in Neural Information Processing Systems, pages 745–752, 1994.

[24] Feng Ning, Damien Delhomme, Yann LeCun, Fabio Piano, Leon Bottou, and Paolo Emilio Barbano. Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 14(9):1360–1371, 2005.

[25] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.

[26] Pedro Pinheiro and Ronan Collobert. Recurrent convolutional neural networks for scene labeling. In International Conference on Machine Learning, pages 82–90, 2014.

[27] David Eigen, Dilip Krishnan, and Rob Fergus. Restoring an image taken through a window covered with dirt or rain. In Proceedings of the IEEE International Conference on Computer Vision, pages 633–640, 2013.

[28] Jonathan J. Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.

[29] Mohammadreza Mostajabi, Payman Yadollahpour, and Gregory Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3376–3385, 2015.

[30] Pablo Arbelaez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik. Semantic segmentation using regions and parts. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3378–3385. IEEE, 2012.

[31] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.

[32] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.

[33] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the Materials in Context database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3479–3487, 2015.

[34] George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. arXiv preprint arXiv:1502.02734, 2015.

[35] Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990.

[36] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2016.

[37] Joao Carreira, Rui Caseiro, Jorge Batista, and Cristian Sminchisescu. Semantic segmentation with second-order pooling. Computer Vision–ECCV 2012, pages 430–443, 2012.

[38] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 891–898, 2014.

[39] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.

[40] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

[41] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571, 2013.

[42] Sharif Amit Kamran, Bin Khaled, Md. Asif, and Sabit Bin Kabir. Exploring deep features: Deeper fully convolutional neural network for image segmentation. Bachelor Thesis, BRAC University, 2017.

[43] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.