DEEP CONVEX NMF INTEGRATING WITH CONVOLUTIONAL NEURAL NETWORKS FOR ON-ROAD OBSTACLE DETECTION

1 Yu-Hsun Hsieh (謝宇勳), 1 Zong-Ying Shen (沈宗穎), 1 Min-Yu Wu (吳忞諭), 1 Li-Chen Fu (傅立成), 2 Pei-Yung Hsiao (蕭培墉), 3 Kuo-Ching Chang (張國清)

1 Dept. of Computer Science and Information Engineering, National Taiwan University, Taiwan
2 Dept. of Electrical Engineering, National University of Kaohsiung, Taiwan
3 Automotive Research and Testing Center, Taiwan

ABSTRACT

As the number of on-road accidents increases year by year, developing an
advanced driver assistance system (ADAS) is becoming critical. An ADAS applies
advanced computer technologies to alert drivers at the appropriate time,
minimizing the possibility of accidents. Its most essential function is to
detect, from captured images, any on-road obstacles that may endanger the host
vehicle, especially those in front of it. In this paper, we propose a novel
deep learning framework that incorporates our proposed Deep Convex
Non-negative Matrix Factorization (DC-NMF) technique to process camera images
for obstacle detection. DC-NMF helps to learn more sophisticated bases that
represent the original high-dimensional features. We first use this model to
extract multilayer basis matrices and then use them to improve the detection
performance of the proposed framework, called Deep Convex-NMF Net (DC-Net). To
validate the proposed work, we evaluate the average precision (AP) of our
method on the KITTI and INRIA datasets and obtain 79% and 91%, respectively.
We also establish our own urban scene dataset, on which our method achieves
95% recall/precision.

Keywords Deep Learning, Convolutional Neural

Networks, Convex-NMF, Deep Convex-NMF, Pedestrian

detection, Car Detection, Cyclist Detection, Motorcyclist

Detection

1. INTRODUCTION

In past decades, the number of vehicles and motorcycles has increased
tremendously, and the rate of traffic accidents has grown with it. As a
result, developing a driver assistance system is critical to keep this problem
from getting worse. In such a system, on-road obstacle detection is the most
significant function, and it is also a main challenge in computer vision
research. The on-road obstacles considered here are pedestrians, cars,
cyclists and motorcyclists, which are the detection targets of our work. In
computer vision, selecting features for classifying objects is very important.
There are many notable hand-crafted features, such as HOG [1] and Haar-like
[2] features. For detection, a widely adopted approach with such hand-crafted
features is to generate candidates with a sliding window and to train a
classifier with a Support Vector Machine (SVM). For more complicated objects,
such as humans, the deformable part-based model (DPM) [3] builds on the
object-part concept, using HOG subject to geometric constraints and penalties,
to improve detection performance. However, hand-crafted features such as HOG
are weak at, say, distinguishing pedestrians from trees. Furthermore, the SVM
traditionally used to train models on these features is a shallow learning
method, since the resulting model learns only one hidden layer. In the past,
on small datasets such as Caltech-101 [4], methods based on SVM learning could
obtain fairly good performance.

In recent years, several much larger datasets, including ImageNet [5] and MS
COCO [6], have become freely accessible, but the aforementioned shallow
learning methods hit a bottleneck in extracting more information as the data
keep growing. On the other hand, although the increasingly popular deep models
have the potential to learn more information from larger datasets with many
hidden layers, they have so many parameters that training them becomes a
challenge. With advances in computing technology, deep learning can now be
realized through the parallel computing ability of GPUs: deeper models can be
learned, and object detection can be accomplished with outstanding
performance. Nowadays, deep learning is widely applied in the field of
computer vision.

Fig. 1: Unexpected false positives on the KITTI dataset. (a) Car class. (b) Cyclist class.

Krizhevsky et al. [7] were the first group to successfully implement a deep
model, called AlexNet, for image recognition. Since then, many studies have
demonstrated outstanding performance on the object detection task, including a
series of state-of-the-art deep learning methods such as RCNN [8], SPP-net
[9], Fast RCNN [10] and Faster RCNN [11]. These works all apply deep learning
to detection by training deep Convolutional Neural Network (CNN) features,
since such features can describe more of the statistical regularity of the
training images. In this paper, we develop an on-road obstacle detection
system based on Faster RCNN. To this end, we train our neural network on the
KITTI dataset [12], whose sample distribution is similar to our detection
scene.

Generally speaking, CNN features are learned spontaneously in a bottom-up
manner, so the content of the learned features can hardly be predicted. It is
therefore hard to analyze and describe them, which leads to occasional false
detections. In Fig. 1, the red rectangles show that unexpected false positives
occur when CNN features trained on KITTI data are used to detect cars and
cyclists. A possible reason is that CNN features are sensitive to image
resolution, which may lead to overfitting. To solve this problem, we propose a
Deep Convex-NMF layer to filter out this type of false positive. Convex-NMF
[13] is a variant of Non-negative Matrix Factorization (NMF) [14]. In this
paper, we extend Convex-NMF to construct a novel model, called Deep Convex-NMF
(DC-NMF). Specifically, the DC-NMF layer learns multilayer basis matrices from
cropped object images and then uses these matrices to reconstruct each
candidate in the testing image.

This paper is organized as follows. Section 2 discusses state-of-the-art CNN
approaches to object detection and the application of NMF to object detection.
Section 3 describes our proposed DC-Net architecture. Section 4 introduces our
proposed model, DC-NMF, and how we apply it to the object detection task.
Sections 5 and 6 present the experimental results on different datasets and
our conclusions.

2. RELATED WORK

In this research, we propose a novel DC-Net architecture and a self-developed
DC-NMF layer for on-road obstacle detection with improved performance. Before
describing DC-Net, we first review the evolution of CNNs applied to the
detection task and the use of NMF in object detection.

2.1. Convolutional Neural Networks for object detection

Deep learning developed from neural networks, and CNNs are one way of
implementing it. After Krizhevsky et al. [7] first successfully trained a deep
convolutional network, called AlexNet, a series of object detection methods
based on this work were proposed. AlexNet exploits the parallel computing
ability of GPUs to accelerate the learning of the large number of parameters
in a CNN, together with a regularization method called “dropout”, so that
training can converge. It achieved outstanding image recognition results on
the ILSVRC-2012 dataset. Girshick et al. [8] proposed a deep learning
framework, called RCNN (regions with CNN features), that applies AlexNet to
the object detection task; it is regarded as the first implementation of
object detection based on deep CNNs. RCNN uses selective search [15] to
generate candidates, and each candidate is passed through the convolutional
layers to extract features. Finally, for every object class, a class-specific
SVM classifier is trained on the corresponding CNN features. RCNN showed
outstanding detection performance on the PASCAL VOC 2012 dataset.

Although RCNN performs very well on the detection task, it is very
time-consuming. A series of methods accelerate it by improving individual
parts of the RCNN detection framework. For example, in RCNN each candidate
must pass through the convolutional layers individually to extract features;
SPP-net [9] instead replaces the last pooling layer with a spatial pyramid
pooling (SPP) layer. The SPP layer removes the fixed-size constraint on the
input image and shortens the overall execution time by passing each image
through the convolutional layers only once. SPP-net is claimed to run 100
times faster than RCNN while maintaining the same performance on the object
detection task. Girshick [10] proposed Fast RCNN, which removes the SVM step
and replaces it with end-to-end training, constructing an ROI pooling layer to
implement the SPP-net idea. Fast RCNN also accelerates large fully connected
layers by compressing them with truncated SVD [16]. With these improvements,
Fast RCNN achieves near real-time performance on the recognition task.

After this series of accelerations of RCNN, candidate prediction becomes the
bottleneck to real-time detection. To accelerate it, Faster RCNN [11]
introduced Region Proposal Networks (RPNs), a type of deep fully convolutional
network (FCN) [17], which predict candidates using the parallel computing
ability of the GPU. With RPNs, Faster RCNN achieves near real-time performance
on the object detection task. The design of our proposed DC-Net architecture
is inspired by Faster RCNN.

2.2. Non-negative Matrix Factorization relative to object detection

Non-negative Matrix Factorization (NMF) is a multivariate analysis algorithm
[14] whose non-negativity constraint requires data to be represented by
additive combinations of components only, not subtractive ones. NMF can be
used to learn a basis matrix and can be applied to object detection. For
example, Casalino et al. [18] learned a separate set of bases from the dataset
of each object class. The learned bases of the different object classes are
then used to reconstruct each candidate proposed by a sliding window in the
testing image. Zeng et al. [19] used NMF to learn basis matrices for
pedestrians and for background, respectively. During detection, each candidate
is reconstructed with these two basis matrices to find the weights of its
linear combination, and the candidate is then classified as pedestrian or not.
Additionally, NMF has a powerful ability to learn low-dimensional
representations. Gui et al. [20] used an NMF layer to learn a low-dimensional
representation of CNN features and found that it performs better than PCA.

A wide variety of NMF algorithms have been developed over the years, such as
Convex-NMF and Semi-NMF [13]. With the recent popularity of deep learning, the
deep learning idea has also been applied to multivariate analysis. Trigeorgis
et al. [21] proposed the Deep Semi-NMF model, which learns multilayer hidden
representations for data clustering. In this paper, we bring the deep learning
concept to Convex-NMF and propose a novel model, called DC-NMF, which we use
to construct the DC-NMF layer. This layer learns multiple layers of image
basis matrices and uses them to reconstruct each candidate in the testing
image. With the DC-NMF layer, the unexpected false positives generated by CNN
features can be filtered out.

To sum up, in this paper we propose a novel model, called DC-NMF, and use the
multilayer basis matrices learned by DC-NMF to refine detection scores.

3. DEEP CONVEX-NMF CNN ARCHITECTURE

Our proposed DC-Net architecture is an on-road

obstacle detection system based on deep learning. In this

section, we introduce the DC-Net architecture and briefly

talk about the DC-NMF layer. The algorithm details of

DC-NMF are introduced in the next section.

3.1. DC-Net Architecture

DC-Net is an on-road obstacle detection system containing three modules. The
first is a Region Proposal Network (RPN), a type of FCN [17] that predicts
candidates. The second is a Fast RCNN [10] detector, which classifies each
candidate generated by the RPN. The two modules share convolutional layers,
and the RPN tells the Fast RCNN module where to look. The third is the DC-NMF
layer, which uses the learned multilayer basis matrices to refine the
detection scores generated by the second module. An overview of the proposed
DC-Net architecture is shown in Fig. 2. The details of the DC-NMF layer are
introduced in the next section.

To learn the DC-NMF model, we prepare many cropped images. The original
Convex-NMF decomposes the data matrix into two matrices, a basis matrix and a
weight matrix. Our proposed DC-NMF instead decomposes the data matrix into
multiple layers of basis matrices and weight matrices, where the basis matrix
of each layer represents different attributes of the data matrix. For each
candidate generated by the RPN, the DC-NMF layer uses the learned multilayer
basis matrices to refine the candidate's detection score according to our
defined error function. For example, when detecting pedestrians, we use the
basis matrices that DC-NMF learned from cropped pedestrian images and evaluate
the error function for each pedestrian candidate. If the error is less than a
threshold, we consider that the candidate does contain a pedestrian and give
it a bonus score. If the error is greater than the threshold, we consider the
candidate to be background and penalize it. This score refinement filters out
many unexpected false positives and improves the detection performance.

3.2. DC-Net Implementation

For the first and second modules, DC-Net pre-trains the CNN model on ImageNet
[5] with 1000 object classes at the training stage. The model is then
fine-tuned on the dataset we prepared for on-road scenes, which has 4 classes.
With alternating training, the RPN is trained first, learning to predict the
coordinates of each candidate and its corresponding objectness score. After
the RPN is trained, the foreground candidates it generates are used as input
to train the detection network with Fast RCNN. Both networks are trained
end-to-end, which shortens the training time. For the third module, DC-NMF is
trained separately on cropped images.

At the testing stage, the RPN predicts the 300 top-scoring candidates for each
testing image. These 300 candidates tell the detection network where to
extract CNN features, which are forwarded to the output layer for
classification. The RPN also tells the DC-NMF layer where to extract candidate
regions from the input image for detection score refinement. In the output
layer, each node represents the probability of one class, and the class with
the maximum probability is the classification result. The final detection
result combines the output probability of each class with the refinement score
that the DC-NMF layer derives from our defined error function.

4. DC-NMF LAYER

In this section, we introduce the details of the aforementioned DC-NMF. First,
we briefly review Convex-NMF, a variant of NMF. Then, we present the proposed
model, DC-NMF. Finally, we describe how we apply this model to detection score
refinement.

4.1. Convex-NMF

Unlike NMF, Convex-NMF [13] does not impose a non-negativity constraint on the
data matrix X. It allows X to have mixed signs, but the decomposed basis
matrix W and weight matrix G are still restricted to non-negative components.
Convex-NMF seeks the following factorization:

$$ \mathbf{X} \approx \mathbf{X}\mathbf{W}\mathbf{G}^{T} \tag{1} $$

where $\mathbf{X} \in \mathbb{R}^{p \times n}$, $\mathbf{W} \in \mathbb{R}^{n \times k}$
and $\mathbf{G} \in \mathbb{R}^{n \times k}$. Here n is the number of data
vectors, stored as the columns of X, each with p features, and k is the number
of bases that we want to find.

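Because of the non-negativity restriction on W and G, both factors tend to be
very sparse. The columns of XW can be interpreted as cluster centroids, and G
holds the corresponding coefficients. The cost function to be minimized for
the Convex-NMF factors is

$$ C_{\text{Convex-NMF}} = \left\| \mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{G}^{T} \right\|_{F}^{2} \tag{2} $$

where $\|\cdot\|_{F}$ denotes the Frobenius norm of a matrix. We minimize
$C_{\text{Convex-NMF}}$ by alternating optimization of W and G, iteratively
updating one factor while fixing the other. W and G are initialized with
random values between 0 and 1 and are updated alternately until convergence:

$$ \mathbf{W} \leftarrow \mathbf{W} \odot \sqrt{\frac{\left[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{G}\right] + \left[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}\mathbf{G}^{T}\mathbf{G}\right]}{\left[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{G}\right] + \left[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}\mathbf{G}^{T}\mathbf{G}\right]}} \tag{3} $$

$$ \mathbf{G} \leftarrow \mathbf{G} \odot \sqrt{\frac{\left[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}\right] + \left[\mathbf{G}\mathbf{W}^{T}(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}\right]}{\left[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}\right] + \left[\mathbf{G}\mathbf{W}^{T}(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}\right]}} \tag{4} $$

where the product $\odot$, the division and the square root are taken
element-wise. $\mathbf{A}^{+}$ is the matrix in which the negative elements of
$\mathbf{A}$ are replaced with 0, and $\mathbf{A}^{-}$ is the matrix that keeps
the magnitudes of the negative elements of $\mathbf{A}$ and replaces its
positive elements with 0:

$$ \mathbf{A}^{+} = \frac{|\mathbf{A}| + \mathbf{A}}{2}, \qquad \mathbf{A}^{-} = \frac{|\mathbf{A}| - \mathbf{A}}{2} \tag{5} $$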
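The update rules in Eqs. (3)-(5) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the fixed iteration count `n_iter`, the random seed, and the small constant `eps` added to the denominators to avoid division by zero are all assumptions made here.

```python
import numpy as np


def pos(A):
    """A^+ from Eq. (5): negative entries of A replaced with 0."""
    return (np.abs(A) + A) / 2.0


def neg(A):
    """A^- from Eq. (5): magnitudes of the negative entries of A, 0 elsewhere."""
    return (np.abs(A) - A) / 2.0


def reconstruction_cost(X, W, G):
    """Squared Frobenius reconstruction error of Eq. (2)."""
    return np.linalg.norm(X - X @ W @ G.T, "fro") ** 2


def convex_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize X (p x n, mixed signs allowed) as X ~ X W G^T with W, G >= 0.

    n_iter, eps and seed are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    Y = X.T @ X                      # n x n Gram matrix X^T X
    Yp, Yn = pos(Y), neg(Y)
    W = rng.random((n, k))           # random initialization in [0, 1)
    G = rng.random((n, k))
    for _ in range(n_iter):
        # Eq. (4): update G while W is fixed.
        GWt = G @ W.T
        num = Yp @ W + GWt @ (Yn @ W)
        den = Yn @ W + GWt @ (Yp @ W)
        G = G * np.sqrt(num / (den + eps))
        # Eq. (3): update W while G is fixed.
        GtG = G.T @ G
        num = Yp @ G + Yn @ W @ GtG
        den = Yn @ G + Yp @ W @ GtG
        W = W * np.sqrt(num / (den + eps))
    return W, G


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((5, 12))     # toy data: p = 5 features, n = 12 samples
    W, G = convex_nmf(X, k=3)
    print(W.shape, G.shape, reconstruction_cost(X, W, G))
```

Because every term in the numerators and denominators is non-negative, the multiplicative form keeps W and G non-negative without any explicit projection step.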

4.2. DC-NMF

In Convex-NMF, we learn from the input data matrix a basis matrix and a weight
matrix that can be used to reconstruct the images. In our proposed DC-NMF, we
learn multiple layers of basis matrices and weight matrices, which represent
multiple attributes of the data matrix.

The proposed DC-NMF model factorizes a given data matrix X into 2m+1 factors,
where m is the number of hidden layers:

$$ \mathbf{X} \approx \mathbf{X}\mathbf{W}_{1}\mathbf{G}_{1}^{T}\mathbf{W}_{2}\mathbf{G}_{2}^{T} \cdots \mathbf{W}_{m}\mathbf{G}_{m}^{T} \tag{6} $$

DC-NMF thus allows one to hierarchically learn m layers of implicit
representations of the data matrix. The weight matrix of each layer is in turn
factorized as

$$ \mathbf{G}_{m-1}^{T} \approx \mathbf{G}_{m-1}^{T}\mathbf{W}_{m}\mathbf{G}_{m}^{T} \tag{7} $$

With this hierarchical decomposition of the weight matrices, the basis matrix
of every layer represents different attributes.

To perform DC-NMF, we follow [21] and pre-train the matrices layer by layer.
First, we decompose the initial data matrix as
$\mathbf{X} \approx \mathbf{X}\mathbf{W}_{1}\mathbf{G}_{1}^{T}$, where
$\mathbf{W}_{1} \in \mathbb{R}_{\geq 0}^{n \times k_{1}}$ and
$\mathbf{G}_{1} \in \mathbb{R}_{\geq 0}^{n \times k_{1}}$. Then, we decompose
the weight matrix as
$\mathbf{G}_{1}^{T} \approx \mathbf{G}_{1}^{T}\mathbf{W}_{2}\mathbf{G}_{2}^{T}$,
where $\mathbf{W}_{2} \in \mathbb{R}_{\geq 0}^{n \times k_{2}}$ and
$\mathbf{G}_{2} \in \mathbb{R}_{\geq 0}^{n \times k_{2}}$. We repeat this
decomposition step until all layers are pre-trained. Note that k1 and k2 are
the numbers of bases that we want to construct on each layer.

Fig. 2: Overview of the proposed DC-Net architecture. The architecture can be
divided into three modules. The upper part shows the first and second modules
of DC-Net. The lower part shows the proposed DC-NMF layer, the third module,
which uses the learned multilayer basis matrices to refine detection scores.

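After the pre-training step, we fine-tune the basis and weight matrices by
alternately minimizing the two factors of each layer. The cost function that
measures the reconstruction error is

$$ C_{\text{Deep Convex-NMF}} = \left\| \mathbf{X} - \mathbf{X}\mathbf{W}_{1}\mathbf{G}_{1}^{T}\mathbf{W}_{2}\mathbf{G}_{2}^{T} \cdots \mathbf{W}_{m}\mathbf{G}_{m}^{T} \right\|_{F}^{2} \tag{8} $$

When alternately updating the decomposed factors, we fix one of the two
factors of a layer and update the other. The update rules, with the product,
the division and the square root taken element-wise, are

$$ \mathbf{W}_{i} \leftarrow \mathbf{W}_{i} \odot \sqrt{\frac{\left[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{G}_{i}\right] + \left[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}_{i}\mathbf{G}_{i}^{T}\mathbf{G}_{i}\right]}{\left[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{G}_{i}\right] + \left[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}_{i}\mathbf{G}_{i}^{T}\mathbf{G}_{i}\right]}} \tag{9} $$

$$ \mathbf{G}_{i} \leftarrow \mathbf{G}_{i} \odot \sqrt{\frac{\left[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}_{i}\right] + \left[\mathbf{G}_{i}\mathbf{W}_{i}^{T}(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}_{i}\right]}{\left[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}_{i}\right] + \left[\mathbf{G}_{i}\mathbf{W}_{i}^{T}(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}_{i}\right]}} \tag{10} $$

where i is the index of the hidden layer being optimized. In each fine-tuning
iteration, we update the factors from the top layer to the last layer, and we
repeat until the stopping criterion is met.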
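The layer-wise pre-training and the cost of Eq. (8) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `convex_nmf` is a compact restatement of the single-layer updates of Eqs. (3) and (4) with an assumed iteration count and an assumed `eps` guard against division by zero, the helper names and the layer sizes passed in `ks` are hypothetical, and the fine-tuning step of Eqs. (9) and (10) is deliberately left out.

```python
import numpy as np


def convex_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Single-layer Convex-NMF, X ~ X W G^T with W, G >= 0 (Eqs. (3)-(4))."""
    rng = np.random.default_rng(seed)
    Y = X.T @ X
    Yp, Yn = (np.abs(Y) + Y) / 2.0, (np.abs(Y) - Y) / 2.0   # Eq. (5)
    n = X.shape[1]
    W, G = rng.random((n, k)), rng.random((n, k))
    for _ in range(n_iter):
        GWt = G @ W.T
        G = G * np.sqrt((Yp @ W + GWt @ (Yn @ W)) / (Yn @ W + GWt @ (Yp @ W) + eps))
        GtG = G.T @ G
        W = W * np.sqrt((Yp @ G + Yn @ W @ GtG) / (Yn @ G + Yp @ W @ GtG + eps))
    return W, G


def pretrain_dc_nmf(X, ks):
    """Greedy pre-training: factorize X first, then each weight matrix G_i^T."""
    layers = []
    data = X                               # layer 1 factorizes X itself
    for i, k in enumerate(ks):
        W, G = convex_nmf(data, k, seed=i)
        layers.append((W, G))
        data = G.T                         # next layer: G_i^T ~ G_i^T W_{i+1} G_{i+1}^T
    return layers


def chain(layers):
    """Product W_1 G_1^T W_2 G_2^T ... W_m G_m^T, an n x n matrix."""
    n = layers[0][0].shape[0]
    M = np.eye(n)
    for W, G in layers:
        M = M @ W @ G.T
    return M


def deep_cost(X, layers):
    """Reconstruction cost of Eq. (8)."""
    return np.linalg.norm(X - X @ chain(layers), "fro") ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 15))       # toy data: p = 6, n = 15
    layers = pretrain_dc_nmf(X, ks=[5, 3]) # hypothetical k1 = 5, k2 = 3
    print([W.shape for W, _ in layers], deep_cost(X, layers))
```

Every layer has factors with n rows, so the chained product is always n x n and the deep reconstruction has the same shape as X regardless of the chosen k values.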

4.3. Detection Score Refinement

To apply DC-NMF to detection score refinement, we collect cropped images to
form the data matrix X. We then transpose the data matrix to
$\mathbf{X}^{T} \in \mathbb{R}^{n \times p}$, so that each row represents an
image vector, and use this transposed matrix to find the multilayer basis and
weight matrices with DC-NMF. With these matrices, we use our defined error
function E(∙) to classify each candidate generated by the RPN:

$$ E(\mathbf{q}) = \left\| \mathbf{q} - \mathbf{q}\mathbf{W}_{1}\mathbf{G}_{1}^{T}\mathbf{W}_{2}\mathbf{G}_{2}^{T} \cdots \mathbf{W}_{m}\mathbf{G}_{m}^{T} \right\|^{2} < \varepsilon \tag{11} $$

In the DC-NMF layer, each candidate is resized to the same size as the cropped
training images. If the reconstruction error of a candidate image q is smaller
than the threshold ε, the candidate receives a bonus on its detection score;
if the error is greater than the threshold, it receives a penalty. The image
reconstruction step is performed for every candidate, whether or not the
detector has classified it as an object.
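The refinement rule built on Eq. (11) can be sketched as below. The paper does not state the size of the bonus or penalty, the threshold value, or how an error exactly equal to the threshold is handled, so `bonus`, `penalty`, `threshold` and the tie rule are hypothetical placeholders; the norm is taken as the squared Euclidean norm, which is one reading of Eq. (11).

```python
import numpy as np


def reconstruction_error(q, layers):
    """E(q) of Eq. (11) for a candidate image flattened to a vector of length p.

    layers is a list of (W_i, G_i) pairs, each of shape (p, k_i), assumed to
    have been learned by DC-NMF from the transposed data matrix X^T.
    """
    r = np.asarray(q, dtype=float).reshape(1, -1)   # treat q as a 1 x p row vector
    rec = r
    for W, G in layers:
        rec = rec @ W @ G.T                         # q W_1 G_1^T ... W_m G_m^T
    return float(np.sum((r - rec) ** 2))


def refine_score(score, error, threshold, bonus=0.1, penalty=0.1):
    """Add a bonus when the error is below the threshold, otherwise a penalty.

    bonus and penalty are illustrative values; an error equal to the
    threshold is treated as a penalty here, which is an assumption.
    """
    if error < threshold:
        return score + bonus
    return score - penalty


if __name__ == "__main__":
    p = 4
    q = np.array([1.0, 2.0, 3.0, 4.0])
    perfect = [(np.eye(p), np.eye(p))]              # toy layer that reconstructs q exactly
    useless = [(np.zeros((p, 2)), np.zeros((p, 2)))]  # toy layer that reconstructs nothing
    for name, layers in [("perfect", perfect), ("useless", useless)]:
        err = reconstruction_error(q, layers)
        print(name, err, refine_score(0.5, err, threshold=1.0))
```

In a full pipeline one such set of layers would be learned per object class, and the refined score would be combined with the class probability from the detection network.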

5. EXPERIMENTS

This section describes the datasets used for training and evaluation. The deep
convolutional networks used in our proposed architecture are ZF-net [22] and
VGG16-net [23]. ZF-net is a small model whose detection runs in near
real-time. VGG16-net is a larger model that can learn more sophisticated
features but spends more execution time on detection. We analyze both types of
deep network.

5.1. The dataset

The KITTI dataset [12] is a challenging computer vision benchmark of on-road
scenes. We focus only on its object detection task in this paper.

The INRIA person dataset collects images of people in various scenes. It has
614 labeled images, and we use it to increase the variety of pedestrian data.

To increase the richness of the training data, we also adopt the Microsoft
COCO object detection dataset [6], which has 80 object classes with 80k
training images and 40k validation images.

Moreover, since we are interested in on-road obstacles, we collect campus and
urban scene data in our city to increase the richness of every class,
especially motorcyclists.

5.2. Experiment Results

5.1.1. The Experiment Results on KITTI Dataset.

In our DC-Net training, we use 10 scales and 7 aspect ratios instead of the 3
scales and 3 aspect ratios used in [11]. The number of anchor boxes is
therefore 70 instead of 9. We think this setting is more suitable for our
on-road scenes.

Table 1: Average Precision (AP) of KITTI car detection.

Method               Easy     Moderate  Hard
Ours                 79.07%   62.88%    52.67%
DPM-VOC+VP [24]      74.95%   64.71%    48.76%
DPM-C8B1 [25, 26]    74.33%   60.99%    47.16%
ACF-SC [27]          69.11%   58.66%    45.95%
Vote3D [28]          56.80%   47.99%    42.57%
mBoW [29]            36.02%   23.76%    18.44%
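The anchor configuration described before Table 1 (10 scales and 7 aspect ratios, giving 70 anchor boxes) can be illustrated with a short sketch. The paper gives only the counts, so the concrete scale and ratio values, the base size of 16 pixels and the function name below are hypothetical, and the area-preserving parameterization is one common choice assumed here.

```python
import math
from itertools import product


def make_anchors(base_size, scales, ratios):
    """Return (width, height) pairs, one per (scale, aspect ratio) combination.

    Each anchor has area (base_size * scale)^2 and height / width = ratio.
    """
    anchors = []
    for scale, ratio in product(scales, ratios):
        side = base_size * scale
        w = side / math.sqrt(ratio)
        h = side * math.sqrt(ratio)
        anchors.append((w, h))
    return anchors


# Hypothetical values: only the counts (10 scales, 7 ratios) come from the text.
scales = [1, 2, 3, 4, 6, 8, 12, 16, 24, 32]
ratios = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0]

anchors = make_anchors(16, scales, ratios)
print(len(anchors))                                          # 10 x 7 = 70 anchors
print(len(make_anchors(16, [8, 16, 32], [0.5, 1.0, 2.0])))   # 3 x 3 = 9 anchors
```

A denser grid of scales and ratios gives the region proposal stage more starting shapes to regress from, at the cost of more candidates to score at each location.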

The KITTI dataset provides ground truth only for the training data. For
evaluation, we need to submit our detection results on the testing data to the
KITTI website. KITTI defines three difficulty levels, easy, moderate and hard,
according to the level of occlusion and truncation. We compare several methods
with our proposed DC-Net.

Fig. 3 shows the easy, moderate and hard precision-recall curves comparing our
method with DPM-VOC+VP [24], DPM-C8B1 [25, 26], ACF-SC [27], Vote3D [28] and
mBoW [29] on car detection.

Table 2: Average Precision (AP) of KITTI pedestrian detection.

Method               Easy     Moderate  Hard
Ours                 65.16%   49.26%    45.51%
DPM-VOC+VP [24]      59.48%   44.86%    40.37%
RCNN [30]            61.61%   50.13%    44.79%
ACF-SC [27]          51.53%   44.49%    40.38%
ACF 128X64 [31]      60.11%   47.29%    42.90%
Fusion-DPM [32]      59.51%   46.67%    42.05%

Table 1 shows our AP on KITTI car detection. We obtain 79%, 62% and 52% AP on
the easy, moderate and hard levels, respectively. On the moderate level, the
AP of our method is lower than that of DPM-VOC+VP, because we mainly focus on
complete cars, which pose a potential risk of car accidents. On the easy and
hard levels, however, our method is better than DPM-VOC+VP. Our proposed
method ranks 26th on the KITTI car scoreboard.

Fig. 4 shows the easy, moderate and hard precision-recall curves comparing our
method with DPM-VOC+VP [24], RCNN [30], ACF-SC [27], ACF 128x64 [31] and
Fusion-DPM [32] on pedestrian detection.

Table 3: Average Precision (AP) of KITTI cyclist detection.

Method               Easy     Moderate  Hard
Ours                 56.14%   42.11%    37.45%
DPM-VOC+VP [24]      42.43%   31.08%    28.23%
DPM-C8B1 [25, 26]    43.49%   29.04%    26.20%
LSVM-MDPM-us [3]     38.84%   29.88%    27.31%
Vote3D [28]          41.43%   31.24%    28.60%
mBoW [29]            28.00%   21.62%    20.93%

Table 2 shows our AP on KITTI pedestrian detection. We obtain 65%, 49% and 45%
AP on the easy, moderate and hard levels, respectively. On the moderate level,
the AP of our method is lower than that of RCNN, because we mainly focus on
complete pedestrians, who may appear on the road and pose a potential risk of
car accidents. On the easy and hard levels, however, our method is better than
RCNN. Our proposed method ranks 24th on the KITTI pedestrian scoreboard.

Fig. 5 shows the easy, moderate and hard precision-recall curves comparing our
method with DPM-VOC+VP [24], DPM-C8B1 [25, 26], LSVM-MDPM-us [3], Vote3D [28]
and mBoW [29] on cyclist detection.

Table 3 shows our AP on KITTI cyclist detection. We obtain 56%, 42% and 37% AP
on the easy, moderate and hard levels, respectively. Moreover, our performance
is better than that of DPM-VOC+VP at every difficulty level in cyclist
detection. We conjecture that this is because we provide more cyclist training
data for training the deep model. Our proposed method ranks 12th on the KITTI
cyclist scoreboard.

5.1.2. The Experiment Results on INRIA Dataset.

We evaluate several scenarios on INRIA. In Fig. 6(a), we compare the Faster
RCNN model trained on PASCAL VOC with the model trained on our prepared data.
Table 4(a) shows the corresponding AP. We observe that, for both ZF-net and
VGG16-net, training data with scenes similar to the test data improve the
detection performance on the INRIA dataset. In Fig. 6(b), we compare our
trained VGG16-net model with and without the proposed Deep Convex-NMF layer.
Here k denotes the number of anchor boxes, COCO means that we use the MS COCO
dataset to increase the richness of our training data, and 12k is the number
of anchor boxes used in [11]. Table 4(b) shows the corresponding AP. We
observe that using the COCO dataset does not enhance the performance: it
contains too many categories, which may disturb the pedestrian class. In our
observation, more anchor boxes improve the performance, so we also compare
different numbers of anchor boxes. With our proposed DC-NMF layer, the AP
improves by about 1% in all scenarios on the INRIA dataset. Overall, with 70k
and our proposed method, we achieve 91.36% AP on the INRIA testing data.

Fig. 3: Precision-recall curves for KITTI car detection. (a) The easy case. (b) The moderate case. (c) The hard case.

Fig. 4: Precision-recall curves for KITTI pedestrian detection. (a) The easy case. (b) The moderate case. (c) The hard case.

Fig. 5: Precision-recall curves for KITTI cyclist detection. (a) The easy case. (b) The moderate case. (c) The hard case.

5.1.3. The Experiment Results on our urban scene

Dataset.

To validate our method in real-world scenes, we collected urban scene videos
containing pedestrians, cyclists, cars and motorcyclists. We define a region
of interest (ROI) for evaluating the performance: the region in front of the
vehicle where car accidents may potentially occur. For detecting pedestrians,
cyclists and motorcyclists, the ROI is 5 m to 30 m in front of the vehicle;
for detecting cars, it is 5 m to 50 m. The ROI is shown in Fig. 7, where the
purple dashed line marks its boundary. Our image resolution is 1280p. With the
convolutional structure of ZF-net, the system runs at about 12 fps; with that
of VGG16-net, it runs at about 7 fps.

Table 5 shows our detection performance on the campus scene. These two videos
mainly contain the pedestrian and cyclist classes. The table shows that our
detection system achieves 95% recall/precision and above.

Table 6 shows our detection performance on the city scene. These two videos
mainly contain the car and motorcyclist classes. The table shows that our
detection system achieves 90% recall/precision and above on the car class.
However, there is still room to improve the motorcyclist detection performance
by collecting more motorcyclist data. Our demo videos are publicly available
at https://goo.gl/lxKa4p.

6. CONCLUSIONS

In this paper, we proposed a novel deep learning framework, called DC-Net, for
on-road obstacle detection, together with a novel model, called DC-NMF, for
learning multilayer feature representations. DC-Net incorporates the proposed
DC-NMF layer to remove unexpected false positives. The AP of our method
reaches 79% and 91% on the KITTI and INRIA datasets, respectively, and our
method achieves 95% recall/precision and above on our urban scene dataset. In
future work, we want to use the deep features learned by the deep
convolutional network to learn more refined multilayer feature
representations.

Fig. 7 : The ROI defined for evaluating performance on our

urban scene dataset.

REFERENCES

[1] N. Dalal and B. Triggs, "Histograms of oriented gradients

for human detection," in Computer Vision and Pattern

Recognition, 2005. CVPR 2005. IEEE Computer Society

Conference on, 2005, pp. 886-893.

[2] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians

using patterns of motion and appearance," International

Journal of Computer Vision, vol. 63, pp. 153-161, 2005.

[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D.

Ramanan, "Object detection with discriminatively trained

part-based models," Pattern Analysis and Machine

Intelligence, IEEE Transactions on, vol. 32, pp. 1627-1645,

2010.

Fig. 6: Precision-recall curves for the INRIA dataset. (a) Comparing different training data. (b) Comparing our trained model with the Deep Convex-NMF layer.

Table 4: Average Precision (AP) of INRIA detection. (a) Comparing different training data. (b) Comparing our trained model with the Deep Convex-NMF layer.

(a)
Network     Trained on PASCAL VOC   Trained on our data
ZF          67.96%                  75.94%
VGG16       70.72%                  89.23%

(b)
Setting     VGG16 (ours)            VGG16 (ours) with Deep Convex-NMF layer
25k         87.60%                  88.36%
70k         90.73%                  91.36%
12k+COCO    60.39%                  61.61%

[4] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative
visual models from few training examples: An incremental
bayesian approach tested on 101 object categories,"
Computer Vision and Image Understanding, vol. 106, pp.
59-70, 2007.

[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S.

Ma, et al., "Imagenet large scale visual recognition

challenge," International Journal of Computer Vision, vol.

115, pp. 211-252, 2015.

[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D.

Ramanan, et al., "Microsoft coco: Common objects in

context," in European Conference on Computer Vision,

2014, pp. 740-755.

[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet

classification with deep convolutional neural networks," in

Advances in neural information processing systems, 2012,

pp. 1097-1105.

[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich

feature hierarchies for accurate object detection and

semantic segmentation," in Proceedings of the IEEE

conference on computer vision and pattern recognition,

2014, pp. 580-587.

[9] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid

pooling in deep convolutional networks for visual

recognition," Pattern Analysis and Machine Intelligence,

IEEE Transactions on, vol. 37, pp. 1904-1916, 2015.

[10] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE

International Conference on Computer Vision, 2015, pp.

1440-1448.

[11] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN:

Towards real-time object detection with region proposal

networks," in Advances in Neural Information Processing

Systems, 2015, pp. 91-99.

[12] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for

autonomous driving? The KITTI vision benchmark suite," in

Computer Vision and Pattern Recognition (CVPR), 2012

IEEE Conference on, 2012, pp. 3354-3361.

[13] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-

nonnegative matrix factorizations," Pattern Analysis and

Machine Intelligence, IEEE Transactions on, vol. 32, pp.

45-55, 2010.

[14] D. D. Lee and H. S. Seung, "Algorithms for non-negative

matrix factorization," in Advances in neural information

processing systems, 2001, pp. 556-562.

[15] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W.

Smeulders, "Selective search for object recognition,"

International journal of computer vision, vol. 104, pp. 154-

171, 2013.

[16] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R.

Fergus, "Exploiting linear structure within convolutional

networks for efficient evaluation," in Advances in Neural

Information Processing Systems, 2014, pp. 1269-1277.

[17] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional

networks for semantic segmentation," in Proceedings of the

IEEE Conference on Computer Vision and Pattern

Recognition, 2015, pp. 3431-3440.

[18] G. Casalino, N. Del Buono, and M. Minervini,

"Nonnegative matrix factorizations performing object

detection and localization," Applied Computational

Intelligence and Soft Computing, vol. 2012, p. 15, 2012.

[19] J.-X. Zeng, C.-Y. Lin, and W.-Y. Lin, "Human detection

using non-negative matrix factorization," in Consumer

Electronics-Taiwan (ICCE-TW), 2015 IEEE International

Conference on, 2015, pp. 370-371.

[20] L. Gui and L.-P. Morency, "Learning and Transferring

Deep ConvNet Representations with Group-Sparse

Factorization," in ICCV Workshop on Machine Learning for

Intelligent Image and Video Processing, 2015.

[21] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. W.

Schuller, "A deep matrix factorization method for learning

attribute representations," arXiv preprint arXiv:1509.03248,

2015.

[22] M. D. Zeiler and R. Fergus, "Visualizing and

understanding convolutional networks," in Computer

Vision–ECCV 2014, Springer, 2014, pp. 818-833.

[23] K. Simonyan and A. Zisserman, "Very deep convolutional

networks for large-scale image recognition," arXiv preprint

arXiv:1409.1556, 2014.

[24] B. Pepik, M. Stark, P. Gehler, and B. Schiele, "Multi-view

and 3D deformable part models," Pattern Analysis and

Machine Intelligence, IEEE Transactions on, vol. 37, pp.

2232-2245, 2015.

[25] J. J. Yebes, L. M. Bergasa, R. Arroyo, and A. Lazaro,

"Supervised learning and evaluation of KITTI's cars

detector with DPM," in Intelligent Vehicles Symposium

Proceedings, 2014 IEEE, 2014, pp. 768-773.

[26] J. J. Yebes, L. M. Bergasa, and M. García-Garrido, "Visual

object recognition with 3D-aware features in KITTI urban

scenes," Sensors, vol. 15, pp. 9228-9250, 2015.

[27] C. Cadena, A. Dick, and I. D. Reid, "A fast, modular scene

understanding system using context-aware object

detection," in Robotics and Automation (ICRA), 2015 IEEE

International Conference on, 2015, pp. 4859-4866.

[28] D. Z. Wang and I. Posner, "Voting for voting in online

point cloud object detection," Proceedings of Robotics:

Science and Systems, Rome, Italy, 2015.

[29] J. Behley, V. Steinhage, and A. Cremers, "Laser-based

segment classification using a mixture of bag-of-words," in

Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ

International Conference on, 2013, pp. 4195-4200.

[30] J. Hosang, M. Omran, R. Benenson, and B. Schiele,

"Taking a deeper look at pedestrians," in Proceedings of the

IEEE Conference on Computer Vision and Pattern

Recognition, 2015, pp. 4073-4082.

[31] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast

feature pyramids for object detection," Pattern Analysis and

Machine Intelligence, IEEE Transactions on, vol. 36, pp.

1532-1545, 2014.

[32] C. Premebida, J. Carreira, J. Batista, and U. Nunes,

"Pedestrian detection combining RGB and dense LIDAR data,"

in Intelligent Robots and Systems (IROS 2014), 2014

IEEE/RSJ International Conference on, 2014, pp. 4112-

4117.

Table 5 The detection results of pedestrians and cyclists in our urban scene dataset on campus. P: precision, R: recall.

            Pedestrian / P    Pedestrian / R    Cyclist / P    Cyclist / R
Campus 1    99.29%            96.18%            99.51%         98.09%
Campus 2    92.39%            97.77%            99.95%         99.14%
Campus 3    99.58%            96.78%            99.22%         95.49%

Table 6 The detection results of cars and motorcyclists in our urban scene dataset in the city. P: precision, R: recall.

Car / P Car / R Motorcyclist / P Motorcyclist / R

City 1 98.70% 93.03% 99.95% 92.16%

City 2 100% 97.01% 99.26% 94.87%

City 3 99.67% 95.49%

City 4 98.33% 98.63%
