Enhanced Pedestrian Detection using Deep Learning based Semantic Image Segmentation · 2020. 3. 24. · for traditional machine learning methods to handle at once. Deep learning has

Enhanced Pedestrian Detection using Deep Learning based Semantic Image Segmentation

Tianrui Liu and Tania Stathaki

Department of Electrical and Electronic Engineering

Imperial College London, United Kingdom

t.liu15, [email protected]

Abstract—Pedestrian detection and semantic segmentation are highly correlated tasks which can be jointly used for better performance. In this paper, we propose a pedestrian detection method that makes use of semantic labeling to improve pedestrian detection results. A deep learning based semantic segmentation method is used to label images pixel-wise into 11 common classes. The semantic segmentation results, which encode a high-level image representation, are used as additional feature channels to be integrated with the low-level HOG+LUV features. Some false positives, such as falsely detected pedestrians located on a tree, can be eliminated more easily by making use of the semantic cues. A boosted forest is used to train the integrated feature channels in a cascaded manner for hard negative mining. Experiments on the Caltech-USA pedestrian dataset show improvements in detection accuracy from using the additional semantic cues.

Index Terms—Pedestrian detection, semantic segmentation, pixel-wise image labeling, filtered feature channels.

I. INTRODUCTION

Pedestrian detection is of great research interest in computer vision. It has many significant applications including video surveillance, robotics, automotive systems, and intelligent transportation. A substantial number of methods have been developed in order to improve detection accuracy [1], [2], [3], [4], [5], [6]. HOG [1] based detectors using the multi-scale sliding window mechanism have long been the dominant approaches for pedestrian detection. While no single hand-crafted feature has been shown to outperform HOG, combinations of HOG with other features have brought improvements by making use of complementary visual cues. A texture descriptor based on local binary patterns (LBP) [7] was combined with HOG in [8] to cope with partial occlusions. HOG descriptors are used together with LUV color features in the form of integral channel features (ICF) in [9]. The ICF detector outperforms HOG at a faster computational speed by computing integral images over feature channels. Aggregated channel features (ACF) [10] is a generalization of ICF which approximates multi-scale gradients using nearby scales. In this way, the ACF [10] detector can compute feature pyramids very quickly, enabling real-time multi-scale detection on a CPU alone.

Pedestrian detection is often challenged by great variations in human pose and appearance. It is generally difficult to reduce false positives on hard negative samples such as tree leaves, traffic lights, poles, etc. Some of these hard negatives can be removed by making use of higher-level vision cues, i.e. semantic information. For instance, it is very unlikely that a pedestrian is located on a tree. Therefore, it would be easier to eliminate a falsely detected pedestrian located on a tree, given the semantic information that there is a region of tree leaves within and/or around the detection window. This indicates that good pedestrian detectors need to be extended with a better semantic understanding of images. Semantic segmentation methods [11], [12], [13], [14] aim at classifying image pixels into semantic classes such as sky, building, road, vehicle, etc. Among these classes, backgrounds such as sky, road and buildings provide semantic context that can be used to narrow the search space for pedestrian detection, while foreground classes like pedestrians and cyclists can serve as an alternative cue for pedestrian detection.

Object detection and semantic segmentation are two strongly correlated and complementary tasks that can be combined for better performance [15], [16]. Dai and Hoiem [16] used segments extracted from each object detection hypothesis for better localization. Most recently, Costea and Nedevschi [15] proposed to make use of a semantic classification cost for pedestrian detection. In their work, decision trees are applied to classify each individual pixel into one of eight semantic classes. However, pixel-wise classification results using decision trees can be noisy and inconsistent (see Fig. 2 (b)). To tackle this problem, they relied on a Conditional Random Field (CRF) inference procedure to improve the segmentation results. Dense CRFs [17] defined over uniform 2-dimensional grids are applied over each pixel for 3 rounds in order to achieve a more consistent semantic segmentation result (see Fig. 2 (c)).

In this paper, we propose a pedestrian detection method making use of semantic context to improve detection results. The semantic channel for pedestrians provides an additional cue for their presence, so that successful detection and segmentation implicitly require agreement between the detection and segmentation predictions. Instead of obtaining semantic segmentation with decision trees and dense CRFs as in [15], we use a deep learning based semantic segmentation method which can achieve high labeling accuracy in a single round. 11 common semantic channels, e.g. sky, building, road, tree, vehicle, pedestrian, etc., are used as context information to assist pedestrian detection. As many top-performing pedestrian detectors [18], [3], [19] are based on the same 10 LUV+HOG channels used by the aggregated channel features (ACF) [10] detector, we also take the ACF detector as our baseline. In addition, we make use of the results from semantic segmentation to provide higher-level representations. The semantic information is integrated with the LUV+HOG channels, and all channels are trained jointly using a boosted forest to distinguish a pedestrian from its background. In order to enhance the feature representation capability, we use an intermediate filtering layer to filter the feature channels before boosted training with RealBoost. The semantic feature channel can be regarded as a reinforcement of the filtered channel features for pedestrian detection. Experimental results show improvements in detection rates from using the additional semantic channels.

Fig. 1. Motivation of using semantic segmentation to improve pedestrian detection.

II. PROPOSED METHOD

Fig. 1 illustrates the motivation for the proposed pedestrian detector assisted by a deep learning based semantic image segmentation method. Given an image, instead of performing pedestrian detection directly as in the upper path, we first apply a deep learning method to semantically label the image into 11 classes. The semantic information is used as a semantic feature channel that is integrated with the LUV+HOG channel features, and all channels are trained together using a boosted forest.

A. Semantic Image Labeling for Pedestrian Detection

In this part, a deep learning based semantic segmentation method is used to perform semantic image labeling for the proposed detector. Semantic image labeling aims to assign an object class label to every pixel of an image, which is challenging because it combines image segmentation and object recognition in a single process. In semantic labeling problems, there are typically too many categories of objects in an image for traditional machine learning methods to handle at once. Deep learning has recently been investigated for the semantic segmentation task [11], [20], [12] owing to its strong feature representation capability for multi-class classification.

SegNet [12] is a practical deep convolutional neural network architecture for semantic pixel-wise image labeling. The architecture of SegNet can be seen as a convolutional encoder-decoder framework followed by a pixel-wise classifier. The encoder generates feature maps, and the decoder upsamples the feature maps output by the encoder before they are fed into the softmax classifier for pixel-wise classification. As in many other advanced deep architectures for semantic segmentation [11], [20], the encoder network of SegNet is identical to the first 13 convolutional layers of the VGG16 network [21]. The encoder network consists of convolutional layers, each followed by batch normalization and a ReLU non-linearity. Some of the convolutional layers are followed by a max-pooling layer. The encoder network has strong feature representation capability thanks to its deep framework with consecutive non-linear processing layers. However, after the 5 max-pooling layers of the encoder network, the output feature maps are of very low resolution. Take an input image of size 480 × 360, for example: the SegNet encoder shrinks each image dimension by a factor of 2^5 = 32, so that the output feature maps are of size roughly 15 × 11 (480/32 = 15, 360/32 ≈ 11). Therefore, upsampling is required to map the low-resolution feature maps back to the original image size for pixel-wise semantic labeling. The decoder network of SegNet is topologically symmetric to the encoder network. It acts as a “deconvolutioner” in order to upsample the lower-resolution feature maps. In SegNet, the decoder network uses the pooling indices (i.e. the indices of the pixels retained during max-pooling) of the corresponding encoder layer to perform upsampling. Since the feature maps obtained from this non-linear upsampling process are sparse, they are then convolved with trainable filters to produce dense feature maps.
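The index-based upsampling described above can be illustrated with a minimal pure-Python sketch on a single 2-D feature map with non-overlapping 2 × 2 pooling. The function names `max_pool_with_indices` and `max_unpool` are hypothetical and are not taken from SegNet or Caffe; the sketch only shows how the retained positions are reused by the decoder.

```python
def max_pool_with_indices(fmap, k=2):
    """Non-overlapping k x k max-pooling that also records argmax positions."""
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, h - h % k, k):
        prow, irow = [], []
        for c in range(0, w - w % k, k):
            # Scan the k x k window and remember where the maximum was found.
            best_val, best_pos = fmap[r][c], (r, c)
            for dr in range(k):
                for dc in range(k):
                    if fmap[r + dr][c + dc] > best_val:
                        best_val = fmap[r + dr][c + dc]
                        best_pos = (r + dr, c + dc)
            prow.append(best_val)
            irow.append(best_pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices


def max_unpool(pooled, indices, out_h, out_w):
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0.0] * out_w for _ in range(out_h)]
    for prow, irow in zip(pooled, indices):
        for val, (r, c) in zip(prow, irow):
            out[r][c] = val
    return out


fmap = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]
pooled, indices = max_pool_with_indices(fmap)   # encoder side
sparse = max_unpool(pooled, indices, 4, 4)      # decoder side
```

The unpooled map `sparse` has a single non-zero entry per pooling window, which is why the decoder follows this step with trainable convolutions to obtain dense feature maps.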

Fig. 2. Example semantic image labeling results using the method in [15] (b)-(c) (from [15]) and using SegNet (d).

The SegNet architecture is trained using Caffe-SegNet [22], [12] on a large database combining a set of urban traffic images [23], as in [12]. 11 common semantic classes are used for training: building, tree, sky, car, sidewalk, column pole, sign-symbol, road, fence, pedestrian and bicyclist; pixels of other classes are labeled in black and are ignored during training. The network weights of the 13-layer encoder are pre-trained on ImageNet [24]. An example of a semantic labeling result obtained by SegNet is given in Fig. 2 (d). Compared to (b)-(c), which are semantic labeling results from [15], SegNet provides more clear-cut labeling results and removes some of the ambiguities between buildings and cars in the example image. The reported average per-class classification accuracy on the CamVid test set is 55.5% for [15] versus 71.2% for SegNet [12]. The detection time for semantic segmentation using SegNet is ~1ms using a single GPU. The pixel-wise semantic segmentation result is encoded as a semantic channel containing the class index of every image pixel, ranging from 0 to 11 (0 for unlabelled pixels). The semantic index map is used as an additional feature channel for the proposed pedestrian detector.
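The encoding of the segmentation output into a single index channel can be sketched as follows. This is a minimal illustration, assuming the network yields 11 per-class scores per pixel; the numeric class order (1 = building, ..., 11 = bicyclist) simply follows the order of the list in the text and is an assumption, as are the helper names `to_index_channel` and `one_hot`.

```python
# Class order follows the list in the text; the index assignment is assumed.
CLASSES = ["building", "tree", "sky", "car", "sidewalk", "column pole",
           "sign-symbol", "road", "fence", "pedestrian", "bicyclist"]


def to_index_channel(scores, labelled=None):
    """scores[r][c] holds 11 class scores for pixel (r, c).

    Returns a map with values 1..11 (best-scoring class) or 0 where the
    optional boolean mask `labelled` marks a pixel as unlabelled.
    """
    channel = []
    for r, row in enumerate(scores):
        out_row = []
        for c, pixel_scores in enumerate(row):
            if labelled is not None and not labelled[r][c]:
                out_row.append(0)  # 0 is reserved for unlabelled pixels
            else:
                best = max(range(len(pixel_scores)),
                           key=lambda k: pixel_scores[k])
                out_row.append(best + 1)
        channel.append(out_row)
    return channel


def one_hot(i):
    """Toy score vector that fully favours class position i (0-based)."""
    return [1.0 if k == i else 0.0 for k in range(len(CLASSES))]


# A 1 x 3 toy image: sky, pedestrian, and one unlabelled pixel.
scores = [[one_hot(2), one_hot(9), one_hot(7)]]
labelled = [[True, True, False]]
channel = to_index_channel(scores, labelled)
```

Because the channel stores categorical indices rather than intensities, the numeric distance between two values carries no meaning by itself; it is left to the learning stage to decide which index patterns are informative.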

B. Filtered Channel Features based Pedestrian Detector using Semantic Image Segmentation

In this part, we integrate the semantic image segmentation result with our baseline detectors [10], [3], which make use of filtered channel features. In the ACF method, an input image is transformed into a set of feature channels, i.e. 10 HOG+LUV channels which contain 6 gradient orientations, 1 gradient magnitude, and 3 color channels. The feature channels are aggregated via sum-pooling before being fed into a decision forest for feature selection. According to [3], using an intermediate layer to filter the low-level features before the boosting stage yields better performing pedestrian detectors. We adopt this filtered channel features strategy to improve the feature capacity with an additional filter layer. Several kinds of filter banks exist, such as decorrelated filters [19], Haar-like filters [18], and Checkerboards filters [3], and they influence the detection quality to different degrees. The filter bank used in [15] consists of only 3 filters for each scale, i.e. a simple pooling filter, a vertical filter and a horizontal filter. In our work, we apply the Checkerboards filters [3] (a set of filters that look like checkerboard patterns, including uniform squares and horizontal/vertical gradient filters, see Fig. 3(d)), which can achieve top performance.
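The composition of the 10 channels and their sum-pooling aggregation can be sketched in pure Python as below. This is an illustrative simplification rather than the toolbox implementation: orientations are hard-binned (no interpolation or smoothing), the LUV conversion is omitted and the three color channels are passed in directly, and the 4 × 4 pooling block as well as the names `gradient_channels`, `sum_pool` and `acf_channels` are assumptions.

```python
import math


def gradient_channels(gray, n_orient=6):
    """Gradient magnitude plus n_orient hard-binned orientation channels."""
    h, w = len(gray), len(gray[0])
    mag = [[0.0] * w for _ in range(h)]
    orient = [[[0.0] * w for _ in range(h)] for _ in range(n_orient)]
    for r in range(h):
        for c in range(w):
            # Central differences with replicated borders.
            gx = gray[r][min(c + 1, w - 1)] - gray[r][max(c - 1, 0)]
            gy = gray[min(r + 1, h - 1)][c] - gray[max(r - 1, 0)][c]
            m = math.hypot(gx, gy)
            mag[r][c] = m
            theta = math.atan2(gy, gx) % math.pi  # unsigned orientation
            b = min(int(theta / math.pi * n_orient), n_orient - 1)
            orient[b][r][c] = m  # magnitude goes to the matching bin
    return mag, orient


def sum_pool(channel, block=4):
    """Aggregate a channel by summing non-overlapping block x block cells."""
    h, w = len(channel), len(channel[0])
    return [[sum(channel[r + dr][c + dc]
                 for dr in range(block) for dc in range(block))
             for c in range(0, w - w % block, block)]
            for r in range(0, h - h % block, block)]


def acf_channels(luv, gray, block=4):
    """Stack 3 color + 1 magnitude + 6 orientation channels, then pool."""
    mag, orient = gradient_channels(gray)
    channels = list(luv) + [mag] + orient  # 3 + 1 + 6 = 10 channels
    return [sum_pool(ch, block) for ch in channels]


# Toy 4 x 4 image with a vertical edge; the same map stands in for L, U, V.
gray = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
features = acf_channels([gray, gray, gray], gray)
```

In this toy input all gradient energy falls into the first orientation bin, so only the color, magnitude and first orientation channels are non-zero after pooling.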

1) Integrating Semantic Feature for Pedestrian Detection:

Fig. 3. Structure diagram of the proposed filtered channel features based pedestrian detector using semantic image segmentation. Given an image in (a), semantic features (b) are obtained using SegNet, and 10 HOG+LUV feature channels (c) are computed using integral images. The semantic features are integrated with the 10 HOG+LUV feature channels and are filtered with the Checkerboards filter bank (d) before being fed into boosted decision forests.

The framework of a filtered channel features based detector is illustrated in Fig. 3. Given an image as in Fig. 3(a), 10 HOG+LUV feature channels (Fig. 3(c)) are computed efficiently using integral images. Semantic features (Fig. 3(b)) are obtained from the deep learning based semantic labeling method. In the experiments, the semantic channel is encoded as a channel containing the class index of every image pixel (ranging from 0 to 11); Fig. 3(b) is shown in color for visualization. The integrated feature channels are filtered with the Checkerboards filter bank (Fig. 3(d)). A Boosted Forest (BF) [25] is used to train the semantic features together with the HOG+LUV feature channels to distinguish pedestrians from background.
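The filtering of the integrated channels can be sketched as a plain 2-D correlation of every channel with every filter in a small bank. The four 2 × 2 filters below are illustrative stand-ins in the spirit of the Checkerboards family (a uniform square, two gradient filters and a checkerboard pattern) and are not the exact bank of [3]; the names `correlate_valid`, `filter_channels` and `FILTERS` are hypothetical.

```python
# Illustrative 2 x 2 filters; NOT the exact Checkerboards bank of [3].
FILTERS = {
    "uniform": [[1, 1], [1, 1]],
    "top_minus_bottom": [[1, 1], [-1, -1]],
    "left_minus_right": [[1, -1], [1, -1]],
    "checkerboard": [[1, -1], [-1, 1]],
}


def correlate_valid(channel, kernel):
    """2-D correlation restricted to positions where the kernel fits."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(channel), len(channel[0])
    return [[sum(kernel[i][j] * channel[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(w - kw + 1)]
            for r in range(h - kh + 1)]


def filter_channels(channels, filters=FILTERS):
    """Apply every filter to every channel; keys are (channel, filter)."""
    return {(i, name): correlate_valid(ch, kernel)
            for i, ch in enumerate(channels)
            for name, kernel in filters.items()}


# 10 toy HOG+LUV channels (all ones) plus one semantic index channel.
hog_luv = [[[1, 1, 1] for _ in range(3)] for _ in range(10)]
semantic = [[8, 8, 10], [8, 8, 10], [8, 8, 10]]  # class indices per pixel
filtered = filter_channels(hog_luv + [semantic])
```

With 11 input channels and 4 filters the sketch yields 44 filtered maps; each value in these maps can then be read by the classifier as a single feature.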

2) Boosted Forest for Integrated Feature Channels:

Boosted Forest is an ensemble learning method which can achieve fast and accurate classification by training on accumulated hard negative samples in a cascade. Owing to their high accuracy and low computation cost, decision forests have been widely used in computer vision tasks such as image classification [26], [27], object recognition [28], [29] and super-resolution [30], [31], [32]. During learning, each decision tree is responsible for finding the binary test parameters for feature selection and for learning the thresholds in the split nodes. Generally, a split node in a tree is a simple comparison between a feature value and a learned threshold. A non-leaf node of a decision tree stops splitting and is declared a leaf node once a stopping criterion has been met, typically when the maximum tree depth is reached or the amount of training data at the node is too small. After training, each non-leaf node stores its binary test parameters and each leaf node stores the learned prediction model.
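The test-time behaviour of such a forest can be sketched as follows: each split node compares one feature value with its learned threshold, each leaf stores a real-valued output, and the forest score is the sum over trees. The `Node` layout and the function names are hypothetical, and the leaf values in the example are made up for illustration rather than learned.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    feature: int = -1              # index of the feature tested at a split
    threshold: float = 0.0         # learned threshold of the binary test
    left: Optional["Node"] = None  # branch taken when feature < threshold
    right: Optional["Node"] = None
    value: float = 0.0             # prediction stored at a leaf

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def tree_predict(node: Node, x: Sequence[float]) -> float:
    """Follow the binary tests from the root down to a leaf."""
    while not node.is_leaf:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value


def forest_score(trees: Sequence[Node], x: Sequence[float]) -> float:
    """Boosted score: sum of the confidence-rated outputs of all trees."""
    return sum(tree_predict(t, x) for t in trees)


# Two depth-1 trees with made-up parameters.
tree_a = Node(feature=0, threshold=0.5,
              left=Node(value=-1.0), right=Node(value=1.0))
tree_b = Node(feature=1, threshold=0.5,
              left=Node(value=-0.5), right=Node(value=2.0))
score = forest_score([tree_a, tree_b], [0.2, 0.9])
```

A detection window would be accepted when the score exceeds a chosen threshold; in a cascade, evaluation can also stop early once the partial sum drops below a rejection threshold.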

By applying filter banks for feature channel filtering, the features become simple pixel lookups and there is no need for integral images. The boosted forest classifier imposes no constraint on the dimensionality of the features. Hence, we directly append the obtained semantic channel feature to the 10 feature channels. The integrated feature channels are used to train BF classifiers.

Fig. 4. False positives per image (FPPI) against miss rate. (a) Comparison of methods on the Caltech test set, trained using the original Caltech training set; (b) result of the proposed pedestrian detector on the Caltech test set, trained using the Caltech10x training set.

III. EXPERIMENTS

We train our pedestrian detector on the Caltech-USA dataset [33]. Under default settings, the training dataset contains 3880 frames, obtained by selecting every 30th frame of the videos. Positive pedestrian samples are extracted and resized into windows of size (120, 180), given as (width, height). There are 1586 positive pedestrian samples in the 3880 training images. In order to acquire more training data, each pedestrian window is mirrored along the horizontal direction to generate 3172 positive samples in total.

The work is implemented based on the publicly available toolbox [34]. To train the decision forests, we use 32, 512, 1024, 2048 and 4096 weak classifiers respectively for the 5 bootstrapping rounds, as in [3]. Initially, the training set consists of all positive examples and 10000 negative samples. In the first training stage, negative training samples are randomly generated, avoiding the regions containing pedestrians. For the other 4 stages, hard negative samples are selected using the detector trained in the previous stage. The number of hard negative samples added after each bootstrapping round is limited to 10000. The BF classifier obtained at the final stage is used for testing.
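The bootstrapping schedule described above can be sketched as a short training loop. The three callables `sample_random_negatives`, `mine_hard_negatives` and `train_forest` are hypothetical placeholders for the corresponding toolbox steps, and the toy stand-ins at the bottom exist only to make the sketch runnable; they do not train anything.

```python
def bootstrap_training(positives, sample_random_negatives,
                       mine_hard_negatives, train_forest,
                       schedule=(32, 512, 1024, 2048, 4096),
                       max_new_negatives=10000):
    """Train a boosted forest over several hard-negative mining rounds."""
    # Stage 1 starts from randomly sampled negatives.
    negatives = list(sample_random_negatives(max_new_negatives))
    detector = None
    for stage, n_weak in enumerate(schedule):
        if stage > 0:
            # Later stages add false positives of the previous detector,
            # capped at max_new_negatives per round.
            hard = mine_hard_negatives(detector, max_new_negatives)
            negatives.extend(hard[:max_new_negatives])
        detector = train_forest(positives, negatives, n_weak)
    return detector  # the final-stage classifier is used for testing


# Toy stand-ins (assumptions) that only record how they were called.
log = []


def sample_random_negatives(n):
    return ["random_negative"] * n


def mine_hard_negatives(detector, n):
    return ["hard_negative"] * 3


def train_forest(pos, neg, n_weak):
    log.append((n_weak, len(neg)))
    return {"n_weak": n_weak}


final = bootstrap_training(["positive"] * 5, sample_random_negatives,
                           mine_hard_negatives, train_forest,
                           max_new_negatives=10)
```

The negative set grows monotonically across rounds, so each successive forest is trained on all previously collected hard negatives in addition to the initial random ones.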

For evaluation, the “reasonable” testing set of Caltech-USA [33] is used. The testing set contains 4024 frames, and pedestrians above 50 pixels in height with no occlusion or only partial occlusion are used for evaluation. Following the evaluation method given in [33], we measure the log-average miss rate over the range of 10^-2 to 10^0 false positives per image (FPPI). The performance of the proposed pedestrian detector is compared to VJ [35], HOG [1], HogLbp [8], LatSVM-V2 [2], FPDW [36], ICF [9], and ACF [10] in Fig. 4. As we can see, there is a big improvement from ACF [10] to Checkerboards [3]. This improvement reveals the effectiveness of the additional filtering layer using the Checkerboards filter bank. Overall, the proposed detection method outperforms ACF [10] by about 30%, and outperforms the state-of-the-art work in [3] by 1.4%. The results reveal that using additional semantic cues is beneficial for detection accuracy. In order to see whether the proposed detector is data-hungry when trained on the original Caltech dataset, we also generate the Caltech10x dataset by sampling every 3rd frame from the Caltech videos. While the original Caltech-USA training dataset contains 3172 positive samples, the number of positive samples in Caltech10x is increased to 32752. The detection result of the proposed pedestrian detector trained using the Caltech10x set is given in Fig. 4(b). Compared with the performance in Fig. 4(a), the proposed detector trained using Caltech10x improves by around 4% thanks to the additional positive samples.
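The log-average miss rate used above can be sketched as the geometric mean of the miss rate sampled at reference FPPI values evenly spaced in log-space between 10^-2 and 10^0. The use of nine reference points follows common practice for the Caltech benchmark; treating a curve with no point at or below a reference FPPI as having miss rate 1.0, and clamping at 1e-10 before the logarithm, are assumptions of this sketch, as is the function name.

```python
import math


def log_average_miss_rate(fppi, miss_rate, n_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    fppi must be sorted in ascending order, with miss_rate[i] being the
    miss rate of the detector at fppi[i].
    """
    refs = [10.0 ** (-2.0 + 2.0 * i / (n_points - 1))
            for i in range(n_points)]
    log_sum = 0.0
    for ref in refs:
        mr = 1.0  # assumed when the curve has no point at or below ref
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr = m  # keep the last curve point not exceeding ref
            else:
                break
        log_sum += math.log(max(mr, 1e-10))
    return math.exp(log_sum / n_points)


# Toy curve: miss rate 0.4 from FPPI 0.005 and 0.1 from FPPI 0.9 onwards.
lamr = log_average_miss_rate([0.005, 0.9], [0.4, 0.1])
```

For the toy curve, eight of the nine reference points see a miss rate of 0.4 and only the last one sees 0.1, so the result is dominated by the low-FPPI part of the curve.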

IV. CONCLUSIONS

In this paper, we proposed a pedestrian detector which makes use of semantic image segmentation results in order to improve the filtered channel feature based detector. We use a deep learning method to label images pixel-wise into 11 semantic classes. The semantic image segmentation results are used jointly with the HOG+LUV features. The proposed approach relies on the learning mechanism to assess the semantic content cues for a more powerful pedestrian detector. Experiments on the Caltech-USA dataset indicate that the proposed detector improves on the baseline detector by enforcing consistency between detection and segmentation. In the future, we will consider a fast implementation of the proposed detector which makes use of GPUs.

ACKNOWLEDGMENT

This work was supported by the EU H2020 TERPSICHORE project “Transforming Intangible Folkloric Performing Arts into Tangible Choreographic Digital Objects” under the grant agreement 691218.

REFERENCES

[1] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition (CVPR), 2005 IEEE Conference on, vol. 1. IEEE, pp. 886–893.

[2] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 9, pp. 1627–1645, 2010.

[3] S. Zhang, R. Benenson, and B. Schiele, “Filtered channel features for pedestrian detection,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, pp. 1751–1760.

[4] S. Zhang, R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, “How far are we from solving pedestrian detection?” CoRR, vol. abs/1602.01237, 2016. [Online]. Available: http://arxiv.org/abs/1602.01237

[5] T. Liu and T. Stathaki, “Fast head-shoulder proposal for deformable part model based pedestrian detection,” in 2016 IEEE International Conference on Digital Signal Processing (DSP), Oct 2016, pp. 457–461.

[6] T.-R. Liu, V. Copin, and T. Stathaki, “Human detection from ground truth cameras through combined use of histogram of oriented gradients and body part models,” in Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2016, pp. 735–740.

[7] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.

[8] X. Wang, T. X. Han, and S. Yan, “An hog-lbp human detector with partial occlusion handling,” in 2009 IEEE 12th International Conference on Computer Vision, Sept 2009, pp. 32–39.

[9] P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” 2009.

[10] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1532–1545, 2014.

[11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[12] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.

[13] T. R. Liu and S. C. Chan, “A hierarchical semantic image labeling method via random forests,” in TENCON 2015 - 2015 IEEE Region 10 Conference, Nov 2015, pp. 1–5.

[14] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.

[15] A. D. Costea and S. Nedevschi, “Semantic channels for fast pedestrian detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 2360–2368.

[16] Q. Dai and D. Hoiem, “Learning to localize detected objects,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 3322–3329.

[17] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” CoRR, vol. abs/1210.5644, 2012.

[18] S. Zhang, C. Bauckhage, and A. Cremers, “Informed haar-like features improve pedestrian detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 947–954.

[19] W. Nam, P. Dollár, and J. H. Han, “Local decorrelation for improved pedestrian detection,” in Advances in Neural Information Processing Systems, 2014, pp. 424–432.

[20] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.

[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.

[23] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, vol. 30, no. 2, pp. 88–97, 2008.

[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, 2009.

[25] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine learning, vol. 37, no. 3, pp. 297–336, 1999.

[26] A. Bosch, A. Zisserman, and X. Munoz, “Image classification using random forests and ferns,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.

[27] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in Computer vision and pattern recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.

[28] J. Gall and V. Lempitsky, Class-specific hough forests for object detection. Springer, 2013, pp. 143–157.

[29] P. Wohlhart, S. Schulter, M. Köstinger, P. M. Roth, and H. Bischof, “Discriminative hough forests for object detection,” in BMVC, 2012, pp. 1–11.

[30] J.-J. Huang and W.-C. Siu, “Learning hierarchical decision trees for single image super-resolution,” IEEE Transactions on Circuits and Systems for Video Technology, 2015.

[31] J. J. Huang, W. C. Siu, and T. R. Liu, “Fast image interpolation via random forests,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3232–3245, Oct 2015.

[32] J.-J. Huang, T. Liu, P. L. Dragotti, and T. Stathaki, “Srhrf+: Self-example enhanced single image super-resolution using hierarchical random forests,” in CVPR Workshop: New Trends in Image Restoration and Enhancement workshop, 2017.

[33] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in Computer Vision and Pattern Recognition (CVPR), 2009. IEEE Conference on. IEEE, pp. 304–311.

[34] P. Dollár, “Piotr’s Computer Vision Matlab Toolbox (PMT),” https://github.com/pdollar/toolbox.

[35] P. Viola and M. Jones, “Robust real-time object detection,” International Journal of Computer Vision, vol. 4, no. 34–47, 2001.

[36] P. Dollár, S. Belongie, and P. Perona, “The fastest pedestrian detector in the west.” in BMVC, vol. 2, no. 3. Citeseer, 2010, p. 7.
