DEEP CONVEX NMF INTEGRATING WITH CONVOLUTIONAL NEURAL
NETWORKS FOR ON-ROAD OBSTACLE DETECTION
1 Yu-Hsun Hsieh (謝宇勳), 1 Zong-Ying Shen (沈宗穎), 1 Min-Yu Wu (吳忞諭),
1 Li-Chen Fu (傅立成), 2 Pei-Yung Hsiao (蕭培墉), 3 Kuo-Ching Chang (張國清)
1 Dept. of Computer Science and Information Engineering, National Taiwan University, Taiwan
2 Dept. of Electrical Engineering, National University of Kaohsiung, Taiwan
3 Automotive Research and Testing Center, Taiwan
ABSTRACT
Because the number of on-road accidents increases every year, developing an advanced driver assistance system (ADAS) is becoming critical. An ADAS applies advanced computer technologies to alert drivers at the appropriate time, minimizing the possibility of accidents. The most essential part of an ADAS is to detect, from captured images, any on-road obstacles that may jeopardize the moving host vehicle, especially those in front of it. In this paper, we propose a novel deep learning framework which incorporates our proposed Deep Convex Non-negative Matrix Factorization (DC-NMF) technique to process camera images for obstacle detection. The DC-NMF model helps to learn more sophisticated bases that represent the original high-dimensional features. We first use this model to extract multilayer basis matrices and then use them to improve the detection performance of the proposed deep learning framework, called Deep Convex-NMF Net (DC-Net). To validate the proposed work, we evaluate the AP of our method on the KITTI and INRIA datasets and obtain 79% and 91%, respectively. We also establish our own urban scene dataset, on which our method achieves 95% recall/precision.
Keywords: Deep Learning, Convolutional Neural Networks, Convex-NMF, Deep Convex-NMF, Pedestrian Detection, Car Detection, Cyclist Detection, Motorcyclist Detection
1. INTRODUCTION
In past decades, the number of cars and motorcycles has increased tremendously, and the rate of traffic accidents has grown along with it. As a result, developing a driver assistance system is critical to keep this problem from getting worse. In a driver assistance system, on-road obstacle detection is the most significant function, and it is also a major challenge in computer vision research. On-road obstacles here refer to pedestrians, cars, cyclists and motorcyclists, which are the detection targets of our work. In computer vision, selecting features is very important for classifying objects. There are many notable hand-crafted features, such as HOG [1] and Haar-like [2] features. For the detection task, a widely adopted approach with such hand-crafted features is to generate candidates with a sliding window and train a classifier with a Support Vector Machine (SVM). For more complicated objects, such as humans, the deformable part-based model (DPM) [3] builds on the concept of object parts, using HOG features subject to geometric constraints and penalties in order to improve detection performance. However, hand-crafted features such as HOG have weaknesses, for example in distinguishing pedestrians from trees. Furthermore, the SVM traditionally used to train models on these features is a shallow learning method, since the resulting model learns only one hidden node. In the past, for small datasets such as Caltech-101 [4], methods based on SVM learning could obtain good performance.
In recent years, several much larger datasets, including ImageNet [5] and MS COCO [6], have become freely accessible, but the aforementioned shallow learning methods encounter a bottleneck in extracting more information as the data keep growing. On the other hand, although deep models, which have recently become more popular, have the potential to learn more information from larger datasets with many hidden nodes, they have so many parameters that learning them becomes a challenge. With advances in computing technology, deep learning can now be realized through the parallel computing ability of GPUs: not only can deeper models be learned, but object detection can also be accomplished with outstanding performance. Nowadays, deep learning is widely applied in computer vision.
Fig. 1: Unexpected false-positive detections on the KITTI dataset. (a) Car class. (b) Cyclist class.
Krizhevsky et al. [7] were the first group to successfully implement a deep model, called AlexNet, for image recognition. Since then, many studies have demonstrated outstanding performance on the object detection task, including a series of state-of-the-art deep learning methods such as RCNN [8], SPP-net [9], Fast RCNN [10] and Faster RCNN [11]. These works all apply deep learning to detection by training deep Convolutional Neural Network (CNN) features, since such features capture more of the statistical regularity of the training images. In this paper, we develop an on-road obstacle detection system based on Faster RCNN. To this end, we train our neural network on the KITTI dataset [12], whose sample distribution is similar to our detection scene.
Generally speaking, CNN features are learned bottom-up from the data, so the content of the learned features can hardly be predicted. They are therefore hard to analyze and describe, which leads to occasional false detections. In Fig. 1, the red boxes show unexpected false positives that occur when we use CNN features trained on KITTI data to detect cars and cyclists. A possible reason is that CNN features are sensitive to image resolution, which may lead to overfitting. To solve this problem, we propose a Deep Convex-NMF layer to filter this type of false positive. Convex-NMF [13] is a variant of Non-negative Matrix Factorization (NMF) [14]. In this paper, we extend Convex-NMF to construct a novel model, called Deep Convex-NMF (DC-NMF). Specifically, the DC-NMF layer learns multilayer basis matrices from cropped object images and then uses these matrices to reconstruct each candidate in the testing image.
This paper is organized as follows. In Section 2, we discuss state-of-the-art CNN approaches to object detection and the application of NMF to object detection. Section 3 describes our proposed DC-Net architecture. In Section 4, we introduce our proposed model, DC-NMF, and how we apply it to the object detection task. In Sections 5 and 6, we present experimental results on different datasets and give our conclusions.
2. RELATED WORK
In this research, we propose a novel DC-Net architecture and a self-developed DC-NMF layer for on-road obstacle detection with improved performance. Before describing the proposed DC-Net, we first discuss the evolution of CNNs applied to the detection task, and then NMF as it relates to object detection.
2.1. Convolutional Neural Networks for object detection
Deep learning is a development of neural networks, and CNNs are one way of implementing it. After Krizhevsky et al. [7] first successfully trained a deep convolutional network, called AlexNet, a series of object detection methods based on this work were proposed. AlexNet uses the parallel computing ability of GPUs to accelerate the learning of a large number of CNN parameters, together with a regularization method called “dropout”, so that training can converge. The method achieved outstanding image recognition results on the ILSVRC-2012 dataset.
Girshick et al. [8] proposed a deep learning framework that applies AlexNet to the object detection task. Called RCNN (regions with CNN features), it is regarded as the first implementation of object detection based on deep CNNs. RCNN uses selective search [15] to generate candidates, and each candidate is passed through the convolutional layers to extract features. Finally, for every object class, a specific SVM classifier is trained on the corresponding CNN features. RCNN shows outstanding detection performance on the PASCAL VOC 2012 dataset.
Although RCNN performs very well on the detection task, it is very time-consuming. A series of methods reduce the execution time by improving individual parts of the RCNN detection framework. For example, in RCNN each candidate must pass through the convolutional layers individually to extract features. SPP-net [9] addresses this by replacing the last pooling layer with a spatial pyramid pooling (SPP) layer. The SPP layer removes the fixed-size constraint on the input image and reduces the overall execution time by passing each image through the convolutional layers only once. SPP-net is claimed to run 100 times faster than RCNN while maintaining the same performance on the object detection task. Girshick [10] in turn proposed Fast RCNN, which removes the SVM step and replaces it with end-to-end training, using an ROI pooling layer to implement the SPP-net idea. Fast RCNN also accelerates large fully connected layers by compressing them with truncated SVD [16]. With these improvements, Fast RCNN can achieve near real-time performance on the recognition task.
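The truncated-SVD compression mentioned above can be illustrated with a minimal NumPy sketch. It is not taken from the Fast RCNN code: the function name `compress_fc_layer`, the layer size (256 x 512) and the rank (32) are assumptions chosen only to show how one large weight matrix can be replaced by two thinner ones with fewer parameters.

```python
import numpy as np

def compress_fc_layer(weight, rank):
    """Approximate an (out_dim x in_dim) weight matrix by two thinner factors."""
    # Thin SVD: weight = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Keep only the `rank` largest singular values.
    first = s[:rank, None] * vt[:rank]   # (rank x in_dim), applied to the input first
    second = u[:, :rank]                 # (out_dim x rank), applied second
    return first, second

rng = np.random.default_rng(0)
# Hypothetical layer with 256 outputs and 512 inputs, built to be nearly rank 32.
weight = rng.standard_normal((256, 32)) @ rng.standard_normal((32, 512))
weight += 0.01 * rng.standard_normal(weight.shape)

first, second = compress_fc_layer(weight, rank=32)

x = rng.standard_normal(512)
y_full = weight @ x
y_compressed = second @ (first @ x)      # two small products instead of one large one

params_full = weight.size                        # 256 * 512
params_compressed = first.size + second.size     # 32 * (512 + 256)
rel_error = np.linalg.norm(y_full - y_compressed) / np.linalg.norm(y_full)
```

The saving depends on the weight matrix being close to low rank; for a matrix with a flat singular value spectrum, the same truncation would lose much more accuracy.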
After this series of accelerations of RCNN, candidate prediction becomes the bottleneck to real-time detection. To accelerate candidate prediction, Faster RCNN [11] proposed Region Proposal Networks (RPNs), a type of deep fully convolutional network (FCN) [17], which predict candidates using the parallel computation ability of GPUs. With RPNs, Faster RCNN can achieve near real-time object detection. The design of our proposed DC-Net architecture is inspired by Faster RCNN.
2.2. Non-negative Matrix Factorization relative to object detection
Non-negative Matrix Factorization (NMF) is a multivariate analysis algorithm [14] whose non-negativity constraint requires data to be represented using only additive, not subtractive, combinations of components. NMF can be used to learn a basis matrix and can be applied to object detection. For example, Casalino et al. [18] learned a separate set of bases for each object class dataset. The bases learned from the different object classes are then used to reconstruct each candidate proposed by a sliding window in the testing image. Zeng et al. [19] used NMF to learn separate basis matrices for pedestrians and for background. During detection, each candidate is reconstructed with these two basis matrices to find the weights of its linear combination, and the candidate is then classified as pedestrian or not. Additionally, NMF is a powerful tool for learning low-dimensional representations. Gui et al. [20] used an NMF layer to learn a low-dimensional representation of CNN features, and its learning performance turns out to be better than that of PCA.
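As a rough illustration of learning a basis with NMF and then reconstructing a candidate from it, the sketch below uses the standard multiplicative updates for the Frobenius-norm objective from [14]. The toy data, the helper names `nmf` and `encode`, the iteration counts and the small constant `eps` are all assumptions made for this example; it is not the procedure of [18] or [19].

```python
import numpy as np

def nmf(v, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix v (p x n) as w (p x k) @ h (k x n)."""
    rng = np.random.default_rng(seed)
    p, n = v.shape
    w = rng.random((p, k)) + eps
    h = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h

def encode(q, w, n_iter=200, eps=1e-9):
    """Find non-negative weights that reconstruct vector q from a fixed basis w."""
    q = q.reshape(-1, 1)
    h = np.full((w.shape[1], 1), 0.5)
    for _ in range(n_iter):
        h *= (w.T @ q) / (w.T @ w @ h + eps)
    return h

rng = np.random.default_rng(1)
# Toy "object class" data: 50 non-negative vectors built from 3 hidden parts.
v = rng.random((30, 3)) @ rng.random((3, 50))
w, h = nmf(v, k=3)
rel_error = np.linalg.norm(v - w @ h) / np.linalg.norm(v)

# Reconstruct one candidate vector with the learned basis.
q = v[:, 0]
q_error = np.linalg.norm(q - (w @ encode(q, w)).ravel())
```

A candidate that resembles the training class should be reconstructed with a small `q_error`, while an unrelated vector generally cannot be expressed well as an additive combination of the learned bases.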
A wide variety of NMF algorithms have been developed over the years, such as Convex-NMF and Semi-NMF [13]. With the recent popularity of deep learning, the idea has also been applied to multivariate analysis. Trigeorgis et al. [21] proposed the Deep Semi-NMF model, which learns multilayer hidden representations for data clustering. In this paper, we apply the deep learning concept to propose a novel model, called DC-NMF, which we use to construct a DC-NMF layer. This layer learns multiple layers of image basis matrices and uses them to reconstruct each candidate in the testing image. With the DC-NMF layer, unexpected false positives generated by CNN features can be filtered.
To sum up, in this paper we propose a novel model, called DC-NMF, and use the multilayer basis matrices learned by DC-NMF to refine detection scores.
3. DEEP CONVEX-NMF CNN ARCHITECTURE
Our proposed DC-Net architecture is an on-road obstacle detection system based on deep learning. In this section, we introduce the DC-Net architecture and briefly describe the DC-NMF layer. The algorithmic details of DC-NMF are introduced in the next section.
3.1. DC-Net Architecture
DC-Net is an on-road obstacle detection system containing three modules. The first is a Region Proposal Network (RPN), a type of FCN [17], which predicts candidates. The second is a Fast RCNN [10] detector, which classifies each candidate generated by the RPN. The RPN module tells the Fast RCNN module where to look, and the two share convolutional layers. The third is the DC-NMF layer, which uses learned multilayer basis matrices to refine the detection scores generated by the second module. An overview of the proposed DC-Net architecture is shown in Fig. 2. The details of the DC-NMF layer are introduced in the next section. For the DC-NMF layer, we prepare many cropped images for learning DC-NMF. Convex-NMF decomposes the data matrix into two matrices, a basis matrix and a weight matrix. Our novel DC-NMF model instead decomposes the data matrix into multiple layers of basis matrices and weight matrices. Each layer of the basis matrices represents different attributes of the data matrix. For each candidate generated by the RPN, the DC-NMF layer uses the learned multilayer basis matrices to refine the candidate's detection score according to our defined error function. For example, when detecting pedestrians, we use the basis matrices that DC-NMF learned from cropped pedestrian images to refine the detection score of each pedestrian candidate, based on the error estimated by our error function. If the error is less than a threshold, we consider that the candidate does contain a pedestrian and give it a bonus score. If the error is greater than the threshold, we consider the candidate to be background and penalize it. With this detection score refinement, we filter out many unexpected false positives and improve detection performance.
3.2. DC-Net Implementation
For the first and second modules, DC-Net uses ImageNet [5], with 1000 object classes, to pre-train the CNN model at the training stage. The model is then fine-tuned on the dataset we prepared for on-road scenes, which has 4 classes. With alternating training, the RPN is trained first, learning to predict the coordinates of each candidate and its corresponding objectness score. After RPN training, the foreground candidates generated by the RPN are used as input to train the detection network with Fast RCNN. Both networks are trained end-to-end, which shortens training time. For the third module, DC-NMF is trained separately on cropped images.
At the testing stage, the RPN predicts the 300 top-scoring candidates for each testing image. These 300 candidates tell the detection network where to extract CNN features, which are forwarded to the output layer for classification. The RPN also tells the DC-NMF layer where to extract candidate regions from the input image for detection score refinement. In the output layer, each node gives the probability of one class, and the class with the maximum probability is the classification result. The final detection result considers both the output probability of each class and the refinement score produced by the DC-NMF layer with our self-defined error function.
4. DC-NMF LAYER
In this section, we introduce the details of the aforementioned DC-NMF. First, we briefly review Convex-NMF, a variant of NMF. Then, we present the proposed novel model, DC-NMF. Finally, we describe how we apply the proposed model to detection score refinement.
4.1. Convex-NMF
In general, Convex-NMF [13] does not impose the non-negativity constraint of NMF on the data matrix X. Convex-NMF allows the data matrix X to have mixed signs, but the decomposed basis matrix W and weight matrix G are still restricted to non-negative components. Convex-NMF seeks the following approximate factorization:
$$\mathbf{X} \approx \mathbf{X}\mathbf{W}\mathbf{G}^{T} \tag{1}$$
where $\mathbf{X} \in \mathbb{R}^{p \times n}$, $\mathbf{W} \in \mathbb{R}^{n \times k}$ and $\mathbf{G} \in \mathbb{R}^{n \times k}$. Note that n is the number of data vectors, stored as the columns of X, each with p features, and k is the number of bases that we want to find.
Because of the restriction on W and G, Convex-NMF has the property that both factors W and G tend to be very sparse. W gives the clustering centroids and G the corresponding coefficients. The cost function that we optimize to approximate the Convex-NMF factors is
$$C_{\text{Convex-NMF}} = \left\| \mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{G}^{T} \right\|_F^2 \tag{2}$$
where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. We optimize $C_{\text{Convex-NMF}}$ by alternating optimization of W and G, iteratively updating each factor while fixing the other. W and G are initialized with random values between 0 and 1 and are then updated alternately until convergence, as follows.
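$$\mathbf{W} \leftarrow \mathbf{W} \odot \sqrt{\frac{[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{G}] + [(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}\mathbf{G}^{T}\mathbf{G}]}{[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{G}] + [(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}\mathbf{G}^{T}\mathbf{G}]}} \tag{3}$$
$$\mathbf{G} \leftarrow \mathbf{G} \odot \sqrt{\frac{[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}] + [\mathbf{G}\mathbf{W}^{T}(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}]}{[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}] + [\mathbf{G}\mathbf{W}^{T}(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}]}} \tag{4}$$
Here the product $\odot$, the quotient and the square root are taken element-wise. $\mathbf{A}^{+}$ is the matrix obtained from $\mathbf{A}$ by replacing its negative elements with 0, and $\mathbf{A}^{-}$ is the matrix obtained by replacing its positive elements with 0 and its negative elements with their absolute values, so that $\mathbf{A} = \mathbf{A}^{+} - \mathbf{A}^{-}$. The definition is as follows:
$$\mathbf{A}^{+} = \frac{|\mathbf{A}| + \mathbf{A}}{2}, \qquad \mathbf{A}^{-} = \frac{|\mathbf{A}| - \mathbf{A}}{2} \tag{5}$$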
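The alternating updates of Eqs. (3) and (4), together with the split of Eq. (5), can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the toy data, the fixed iteration count and the small constant `eps` added to the denominators to avoid division by zero are all assumptions.

```python
import numpy as np

def pos(a):
    """A+ of Eq. (5): negative entries replaced by 0."""
    return (np.abs(a) + a) / 2.0

def neg(a):
    """A- of Eq. (5): magnitudes of the negative entries, 0 elsewhere."""
    return (np.abs(a) - a) / 2.0

def convex_nmf(x, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate x (p x n) by x @ w @ g.T with non-negative w, g (both n x k)."""
    rng = np.random.default_rng(seed)
    n = x.shape[1]
    w = rng.random((n, k))            # random initial values in [0, 1)
    g = rng.random((n, k))
    xtx = x.T @ x
    xtx_p, xtx_n = pos(xtx), neg(xtx)
    costs = [np.linalg.norm(x - x @ w @ g.T) ** 2]    # Eq. (2) at initialization
    for _ in range(n_iter):
        # Eq. (3): update W with G fixed.
        gtg = g.T @ g
        w_num = xtx_p @ g + xtx_n @ w @ gtg
        w_den = xtx_n @ g + xtx_p @ w @ gtg
        w = w * np.sqrt(w_num / (w_den + eps))
        # Eq. (4): update G with W fixed.
        g_num = xtx_p @ w + g @ (w.T @ xtx_n @ w)
        g_den = xtx_n @ w + g @ (w.T @ xtx_p @ w)
        g = g * np.sqrt(g_num / (g_den + eps))
        costs.append(np.linalg.norm(x - x @ w @ g.T) ** 2)
    return w, g, costs

rng = np.random.default_rng(0)
# Toy mixed-sign data: 40 samples with 20 features, drawn around 4 hypothetical centres.
centres = 3.0 * rng.standard_normal((20, 4))
x = centres[:, rng.integers(0, 4, size=40)] + 0.1 * rng.standard_normal((20, 40))
w, g, costs = convex_nmf(x, k=4)
```

Because both updates only multiply by non-negative factors, W and G stay non-negative throughout even though the data matrix has mixed signs; `costs` records the value of Eq. (2) after each iteration so that convergence can be inspected.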
4.2. DC-NMF
In Convex-NMF, we learn a basis matrix and a weight matrix from the input data matrix, which can be used to reconstruct the image. In our proposed DC-NMF, we learn multilayer basis matrices and weight matrices, which represent multiple attributes of the data matrix.
The proposed DC-NMF model factorizes a given data matrix X into 2m+1 factors, where m is the number of hidden layers, as follows.
$$\mathbf{X} \approx \mathbf{X}\mathbf{W}_1\mathbf{G}_1^{T}\mathbf{W}_2\mathbf{G}_2^{T} \cdots \mathbf{W}_m\mathbf{G}_m^{T} \tag{6}$$
DC-NMF allows one to hierarchically learn m layers of implicit representations of the data matrix. The weight matrix of each layer is in turn decomposed by the following factorization.
$$\mathbf{G}_{m-1}^{T} \approx \mathbf{G}_{m-1}^{T}\mathbf{W}_m\mathbf{G}_m^{T} \tag{7}$$
With this hierarchical decomposition of the weight matrices, every layer of basis matrices represents different attributes.
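To see how the layer-wise factorization of the weight matrix leads to the chained product of the full model, one can substitute one decomposition into the other. The worked case below takes m = 2 and tracks the matrix shapes, assuming each $\mathbf{W}_i$ and $\mathbf{G}_i$ has shape $n \times k_i$ as in the single-layer case.

```latex
\begin{aligned}
\mathbf{X} &\approx \mathbf{X}\,\mathbf{W}_1\,\mathbf{G}_1^{T}
  && (p \times n)(n \times k_1)(k_1 \times n) \\
\mathbf{G}_1^{T} &\approx \mathbf{G}_1^{T}\,\mathbf{W}_2\,\mathbf{G}_2^{T}
  && (k_1 \times n)(n \times k_2)(k_2 \times n) \\
\Rightarrow\quad \mathbf{X} &\approx \mathbf{X}\,\mathbf{W}_1\,\mathbf{G}_1^{T}\,\mathbf{W}_2\,\mathbf{G}_2^{T}
  && p \times n
\end{aligned}
```

With one data matrix and two factors per layer, this gives 2m+1 = 5 factors for m = 2, and every intermediate product keeps n columns, so the chain can be extended to further layers in the same way.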
Fig. 2: Overview of the proposed DC-Net architecture, which can be divided into three modules. The upper part shows the first and second modules of DC-Net. The lower part shows the proposed DC-NMF layer in the third module, which uses the learned multilayer basis matrices for detection score refinement.
To train the proposed DC-NMF, we follow [21] and pre-train the weight matrices layer by layer. First, we decompose the initial data matrix as $\mathbf{X} \approx \mathbf{X}\mathbf{W}_1\mathbf{G}_1^{T}$, where $\mathbf{W}_1 \in \mathbb{R}_0^{n \times k_1}$ and $\mathbf{G}_1 \in \mathbb{R}_0^{n \times k_1}$. Then, we further decompose the weight matrix as $\mathbf{G}_1^{T} \approx \mathbf{G}_1^{T}\mathbf{W}_2\mathbf{G}_2^{T}$, where $\mathbf{W}_2 \in \mathbb{R}_0^{n \times k_2}$ and $\mathbf{G}_2 \in \mathbb{R}_0^{n \times k_2}$. We repeat this decomposition step until all layers have been pre-trained. Note that $k_1$ and $k_2$ are the numbers of bases that we want to construct on each layer.
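The layer-wise pre-training described above can be sketched as repeated single-layer Convex-NMF, where each new layer factorizes the transposed weight matrix of the previous one. The sketch covers pre-training only, not the later fine-tuning; the helper names, the layer sizes (8 and 4), the random toy data and the constant `eps` are assumptions made for illustration.

```python
import numpy as np

def convex_nmf(x, k, n_iter=200, eps=1e-9, seed=0):
    """Single-layer Convex-NMF: x ~ x @ w @ g.T with non-negative w, g (both n x k)."""
    rng = np.random.default_rng(seed)
    n = x.shape[1]
    w, g = rng.random((n, k)), rng.random((n, k))
    xtx = x.T @ x
    xtx_p, xtx_n = (np.abs(xtx) + xtx) / 2, (np.abs(xtx) - xtx) / 2
    for _ in range(n_iter):
        gtg = g.T @ g
        w = w * np.sqrt((xtx_p @ g + xtx_n @ w @ gtg)
                        / (xtx_n @ g + xtx_p @ w @ gtg + eps))
        wtw_p, wtw_n = w.T @ xtx_p @ w, w.T @ xtx_n @ w
        g = g * np.sqrt((xtx_p @ w + g @ wtw_n)
                        / (xtx_n @ w + g @ wtw_p + eps))
    return w, g

def pretrain_dc_nmf(x, layer_sizes, n_iter=200):
    """Pre-train one (W_i, G_i) pair per layer; layer i factorizes G_{i-1}^T."""
    ws, gs = [], []
    data = x
    for k in layer_sizes:
        w, g = convex_nmf(data, k, n_iter=n_iter)
        ws.append(w)
        gs.append(g)
        data = g.T                    # (k_i x n) input for the next layer
    return ws, gs

def reconstruct(x, ws, gs):
    """Evaluate x @ W_1 @ G_1^T @ ... @ W_m @ G_m^T."""
    out = x
    for w, g in zip(ws, gs):
        out = out @ w @ g.T
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 40))     # toy data: p = 20 features, n = 40 samples
ws, gs = pretrain_dc_nmf(x, layer_sizes=[8, 4])
cost = float(np.linalg.norm(x - reconstruct(x, ws, gs)) ** 2)
```

Here `cost` is the squared Frobenius reconstruction error of the chained product after pre-training, which is the quantity a subsequent fine-tuning stage would try to reduce further.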
After the pre-training step, we fine-tune the basis and weight matrices of each layer by alternating minimization of the two factors in each layer. The cost function that we optimize is the reconstruction error
$$C_{\text{Deep Convex-NMF}} = \left\| \mathbf{X} - \mathbf{X}\mathbf{W}_1\mathbf{G}_1^{T}\mathbf{W}_2\mathbf{G}_2^{T} \cdots \mathbf{W}_m\mathbf{G}_m^{T} \right\|_F^2 \tag{8}$$
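When we alternately update the decomposed factors, we fix one of the two factors and update the other for each layer. The update rules are as follows, with the product $\odot$, the quotient and the square root taken element-wise.
$$\mathbf{W}_i \leftarrow \mathbf{W}_i \odot \sqrt{\frac{[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{G}_i] + [(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}_i\mathbf{G}_i^{T}\mathbf{G}_i]}{[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{G}_i] + [(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}_i\mathbf{G}_i^{T}\mathbf{G}_i]}} \tag{9}$$
$$\mathbf{G}_i \leftarrow \mathbf{G}_i \odot \sqrt{\frac{[(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}_i] + [\mathbf{G}_i\mathbf{W}_i^{T}(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}_i]}{[(\mathbf{X}^{T}\mathbf{X})^{-}\mathbf{W}_i] + [\mathbf{G}_i\mathbf{W}_i^{T}(\mathbf{X}^{T}\mathbf{X})^{+}\mathbf{W}_i]}} \tag{10}$$
Here i is the index of the hidden layer being optimized. In each fine-tuning iteration, we update the factors from the top layer to the last layer, until the stopping criterion is met.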
4.3. Detection Score Refinement
To apply DC-NMF to detection score refinement, we collect the cropped image data into the data matrix X. We then transpose the data matrix to $\mathbf{X}^{T} \in \mathbb{R}^{n \times p}$, so that each row represents an image vector. We use this transposed data matrix to find the multilayer basis matrices and weight matrices by DC-NMF. After obtaining the multilayer matrices, we use our defined error function E(·) to classify each candidate generated by the RPN:
$$E(\mathbf{q}) = \left\| \mathbf{q} - \mathbf{q}\mathbf{W}_1\mathbf{G}_1^{T}\mathbf{W}_2\mathbf{G}_2^{T} \cdots \mathbf{W}_m\mathbf{G}_m^{T} \right\|^2 < \varepsilon \tag{11}$$
Each candidate is resized to the same size as the cropped training images in the DC-NMF layer. If the reconstruction error of candidate image q is smaller than the threshold ε, the candidate receives a bonus on its detection score. If the reconstruction error is greater than the threshold, the candidate receives a penalty on its detection score. The image reconstruction step is performed whether the detector classifies the candidate image successfully or not.
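A minimal sketch of this refinement step is given below, assuming the multilayer factors have already been learned and that a candidate is flattened into a row vector q of length p. The paper does not state how large the bonus and penalty are or how they are combined with the detector score, so the additive form and the values 0.1 are purely hypothetical, as are the function names; an error exactly equal to the threshold is treated as a penalty here.

```python
import numpy as np

def reconstruction_error(q, ws, gs):
    """Squared error between q and q @ W_1 @ G_1^T @ ... @ W_m @ G_m^T."""
    recon = q
    for w, g in zip(ws, gs):
        recon = recon @ w @ g.T
    return float(np.sum((q - recon) ** 2))

def refine_score(score, q, ws, gs, threshold, bonus=0.1, penalty=0.1):
    """Raise the score if q is reconstructed well, lower it otherwise.

    bonus and penalty are hypothetical values, not taken from the paper.
    """
    if reconstruction_error(q, ws, gs) < threshold:
        return score + bonus
    return score - penalty

# Toy example with p = 4 pixels and a single layer.
q = np.array([1.0, 2.0, 0.0, 1.0])
perfect_w, perfect_g = np.eye(4), np.eye(4)     # reconstructs q exactly
useless_w = np.zeros((4, 4))                    # reconstructs q as all zeros

accepted = refine_score(0.5, q, [perfect_w], [perfect_g], threshold=1.0)
rejected = refine_score(0.5, q, [useless_w], [perfect_g], threshold=1.0)
```

In the toy example, the first candidate is reconstructed exactly and gains the bonus, while the second has a squared error equal to the squared norm of q and is penalized.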
5. EXPERIMENTS
In this section, we describe the datasets used for training and evaluation. The deep convolutional networks applied in our proposed architecture are ZF-net [22] and VGG16-net [23]. ZF-net is a small model whose detection can achieve near real-time speed. VGG16-net is a larger model that can learn more sophisticated features but spends more execution time on detection tasks. We analyze both types of deep network.
5.1. The dataset
The KITTI dataset [12] is a challenging computer vision benchmark of on-road scenes. In this paper, we focus only on its object detection task.
The INRIA person dataset collects images of people in various scenes. It has 614 labeled images, and we use it to increase the variety of the pedestrian data.
To increase the richness of the training data, we adopt the Microsoft COCO object detection dataset [6]. This dataset has 80 object classes, with 80k training images and 40k validation images.
Moreover, since we are interested in on-road obstacles, we collect campus and urban scene data in our city, intending to increase the richness of every class, especially motorcyclists.
5.2. Experiment Results
5.2.1. The Experiment Results on KITTI Dataset.
In our DC-Net training, we use 10 scales and 7 aspect ratios instead of 3 scales and 3 aspect ratios, so the number of anchor boxes is 70 instead of the 9 defined in [11]. We think this setting is more suitable for our on-road scene.
Table 1 The Average Precision (AP) on the KITTI car detection results.
Car                  Easy     Moderate  Hard
Ours                 79.07%   62.88%    52.67%
DPM-VOC+VP [24]      74.95%   64.71%    48.76%
DPM-C8B1 [25, 26]    74.33%   60.99%    47.16%
ACF-SC [27]          69.11%   58.66%    45.95%
Vote3D [28]          56.80%   47.99%    42.57%
mBoW [29]            36.02%   23.76%    18.44%
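The anchor count used in DC-Net training (10 scales times 7 aspect ratios, giving 70 anchors per location) can be reproduced with a short sketch. The paper does not list the actual scale and ratio values, so the ones below, along with the base size of 16 pixels and the function name `make_anchors`, are hypothetical and chosen only to show the enumeration.

```python
import numpy as np

def make_anchors(base_size, scales, ratios):
    """Return (len(ratios) * len(scales), 4) boxes (x1, y1, x2, y2) centred at the origin."""
    anchors = []
    for ratio in ratios:                  # ratio = height / width
        for scale in scales:
            area = (base_size * scale) ** 2
            width = np.sqrt(area / ratio)
            height = width * ratio        # keeps width * height equal to area
            anchors.append([-width / 2, -height / 2, width / 2, height / 2])
    return np.array(anchors)

# Hypothetical values: 10 scales and 7 aspect ratios.
scales = np.arange(1, 11)
ratios = np.array([0.33, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0])
anchors = make_anchors(base_size=16, scales=scales, ratios=ratios)

# The 3 x 3 configuration yields 9 anchors for comparison.
default_anchors = make_anchors(16, scales=[8, 16, 32], ratios=[0.5, 1.0, 2.0])
```

Each anchor keeps the area implied by its scale while its shape follows the aspect ratio, so a denser grid of scales and ratios covers more of the object sizes and shapes that appear in a road scene, at the cost of more candidates per location.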
The KITTI dataset provides ground truth only for the training data. For evaluation, we need to submit our detection results on the testing data to the KITTI website. The KITTI benchmark defines three difficulty levels: easy, moderate and hard, according to the level of occlusion and truncation. We compare our proposed DC-Net with several other methods.
Fig. 3 shows the easy, moderate and hard precision-recall curves comparing our method to DPM-VOC+VP [24], DPM-C8B1 [25, 26], ACF-SC [27], Vote3D [28] and mBoW [29] on car detection.
Table 2 The Average Precision (AP) on the KITTI pedestrian detection results.
Pedestrian           Easy     Moderate  Hard
Ours                 65.16%   49.26%    45.51%
DPM-VOC+VP [24]      59.48%   44.86%    40.37%
RCNN [30]            61.61%   50.13%    44.79%
ACF-SC [27]          51.53%   44.49%    40.38%
ACF 128X64 [31]      60.11%   47.29%    42.90%
Fusion-DPM [32]      59.51%   46.67%    42.05%
Table 1 shows our AP on KITTI car detection. We obtain 79.07%, 62.88% and 52.67% AP on the easy, moderate and hard levels, respectively. On the moderate level, the AP of our method is lower than that of DPM-VOC+VP, because we mainly focus on complete cars, which have the potential to cause a car accident. On the easy and hard levels, however, our method is better than DPM-VOC+VP. Our proposed method ranks 26th on the KITTI car scoreboard.
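For readers unfamiliar with how a precision-recall curve is summarized as a single AP value, the sketch below computes the 11-point interpolated AP in the style of the PASCAL VOC protocol. This is one common convention and is shown only as an illustration; the text does not state which AP variant produced the numbers reported here, and the function name and toy curve are assumptions.

```python
import numpy as np

def average_precision_11pt(recall, precision):
    """11-point interpolated AP: mean of the best precision at recall >= t, t = 0, 0.1, ..., 1."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        # If no operating point reaches recall t, the interpolated precision is 0.
        best = precision[mask].max() if mask.any() else 0.0
        ap += best / 11.0
    return ap

# Toy curve with two operating points.
toy_ap = average_precision_11pt(recall=[0.25, 0.45], precision=[1.0, 0.5])
```

In the toy curve, the thresholds 0, 0.1 and 0.2 see a best precision of 1.0, the thresholds 0.3 and 0.4 see 0.5, and the remaining six see 0, so the result is 4/11.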
Fig. 4 shows the easy, moderate and hard precision-
recall curves comparing our method to DPM-VOC+VP
[24], RCNN [30], ACF-SC [27], ACF 128x64 [31] and
Fusion-DPM [32] on pedestrian detection.
Table 3 The Average Precision (AP) on the KITTI cyclist detection results.
Cyclist              Easy     Moderate  Hard
Ours                 56.14%   42.11%    37.45%
DPM-VOC+VP [24]      42.43%   31.08%    28.23%
DPM-C8B1 [25, 26]    43.49%   29.04%    26.20%
LSVM-MDPM-us [3]     38.84%   29.88%    27.31%
Vote3D [28]          41.43%   31.24%    28.60%
mBoW [29]            28.00%   21.62%    20.93%
Table 2 shows our AP on KITTI pedestrian detection. We obtain 65.16%, 49.26% and 45.51% AP on the easy, moderate and hard levels, respectively. On the moderate level, the AP of our method is lower than that of RCNN, because we mainly focus on complete pedestrians, who may appear on the road with the potential for a car accident. On the easy and hard levels, however, our method is better than RCNN. Our proposed method ranks 24th on the KITTI pedestrian scoreboard.
Fig. 5 shows the easy, moderate and hard precision-
recall curves comparing our method to DPM-VOC+VP
[24], DPM-C8B1 [25, 26], LSVM-MDPM-us [3],
Vote3D [28] and mBoW [29] on cyclist detection.
Table 3 shows our AP on KITTI cyclist detection. We obtain 56.14%, 42.11% and 37.45% AP on the easy, moderate and hard levels, respectively. Moreover, our performance is better than that of DPM-VOC+VP at every difficulty level in cyclist detection. We suspect this is because we provide more cyclist training data for training the deep model. Our proposed method ranks 12th on the KITTI cyclist scoreboard.
5.2.2. The Experiment Results on INRIA Dataset.
We evaluate several scenarios on INRIA. In Fig. 6(a), we compare the Faster RCNN model trained on PASCAL VOC with the model trained on our prepared data. Table 4(a) shows the corresponding AP for each scenario. We observe that, for both ZF-net and VGG16-net, training data from a similar scene improves detection performance on the INRIA dataset. In Fig. 6(b), we compare our trained VGG16-net model with and without our proposed Deep Convex-NMF layer. The k denotes the number of anchor boxes. COCO means that we use the MS COCO dataset to increase the richness of our training data, and 12k is the number of anchor boxes used in [11]. Table 4(b) shows the corresponding AP for each scenario. We observe that using the COCO dataset does not enhance the performance; its many categories may disturb the pedestrian class. In our observation, more anchor boxes improve the performance, so we also compare different numbers of anchor boxes. With our proposed DC-NMF layer, the performance in all scenarios on the INRIA dataset improves by about 1% AP. Overall, with 70k and our proposed method, we achieve 91.36% AP on the INRIA testing data.
Fig. 3 The Precision-Recall curve for KITTI car detection results. (a) The easy case. (b) The moderate case. (c) The hard case.
Fig. 4 The Precision-Recall curve for KITTI pedestrian detection results. (a) The easy case. (b) The moderate case. (c) The hard case.
Fig. 5 The Precision-Recall curve for KITTI cyclist detection results. (a) The easy case. (b) The moderate case. (c) The hard case.
5.2.3. The Experiment Results on our urban scene Dataset.
To validate our method in real-world scenes, we collected urban scene videos that contain pedestrians, cyclists, cars and motorcyclists. We define a region of interest (ROI) for evaluating performance. This region lies in front of the vehicle, where there is potential for car accidents. For detecting pedestrians, cyclists and motorcyclists, the ROI is 5 m to 30 m in front of the vehicle. For detecting cars, the ROI is 5 m to 50 m in front of the vehicle. The ROI is shown in Fig. 7, where the purple dashed line marks our defined ROI. Our image resolution is 1280p. With the convolutional structure of ZF-net, the system runs at about 12 fps. With the convolutional structure of VGG16-net, it runs at about 7 fps.
Table 5 shows our detection performance on the campus scene. These two videos mainly contain the pedestrian and cyclist classes. The table shows that our detection system achieves 95% recall/precision and above.
Table 6 shows our detection performance on the city scene. These two videos mainly contain the car and motorcyclist classes. The table shows that our detection system achieves 90% recall/precision and above on the car class. However, there is still room to improve motorcyclist detection performance by collecting more motorcyclist data. Our demo videos are publicly available at https://goo.gl/lxKa4p.
6. CONCLUSIONS
In this paper, we propose a novel deep learning framework, called DC-Net, for on-road obstacle detection. We also propose a novel model, called DC-NMF, for learning multilayer feature representations. DC-Net incorporates the proposed DC-NMF layer to remove unexpected false positives. The AP of our proposed method on the KITTI and INRIA datasets reaches 79% and 91%, respectively. On our urban scene dataset, our method achieves 95% recall/precision and above. In future work, we want to use the deep features learned by a deep convolutional network to learn more refined multilayer feature representations.
Fig. 7 : The ROI defined for evaluating performance on our
urban scene dataset.
Fig. 6: The Precision-Recall curves for the INRIA dataset. (a) Comparing different training data. (b) Comparing our trained model with Deep Convex-NMF layer.
Table 4 The Average Precision (AP) on the INRIA detection results. (a) Comparing different training data. (b) Comparing our trained model with Deep Convex-NMF layer.
(a)          Pascal   Our
ZF           67.96%   75.94%
VGG16        70.72%   89.23%
(b)          VGG16 Ours   With Deep
25k          87.60%       88.36%
70k          90.73%       91.36%
12k+COCO     60.39%       61.61%
REFERENCES
[1] N. Dalal and B. Triggs, "Histograms of oriented gradients
for human detection," in Computer Vision and Pattern
Recognition, 2005. CVPR 2005. IEEE Computer Society
Conference on, 2005, pp. 886-893.
[2] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians
using patterns of motion and appearance," International
Journal of Computer Vision, vol. 63, pp. 153-161, 2005.
[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D.
Ramanan, "Object detection with discriminatively trained
part-based models," Pattern Analysis and Machine
Intelligence, IEEE Transactions on, vol. 32, pp. 1627-1645,
2010.
[4] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative
visual models from few training examples: An incremental
bayesian approach tested on 101 object categories,"
Computer Vision and Image Understanding, vol. 106, pp.
59-70, 2007.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S.
Ma, et al., "Imagenet large scale visual recognition
challenge," International Journal of Computer Vision, vol.
115, pp. 211-252, 2015.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D.
Ramanan, et al., "Microsoft coco: Common objects in
context," in European Conference on Computer Vision,
2014, pp. 740-755.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet
classification with deep convolutional neural networks," in
Advances in neural information processing systems, 2012,
pp. 1097-1105.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich
feature hierarchies for accurate object detection and
semantic segmentation," in Proceedings of the IEEE
conference on computer vision and pattern recognition,
2014, pp. 580-587.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid
pooling in deep convolutional networks for visual
recognition," Pattern Analysis and Machine Intelligence,
IEEE Transactions on, vol. 37, pp. 1904-1916, 2015.
[10] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE
International Conference on Computer Vision, 2015, pp.
1440-1448.
[11] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN:
Towards real-time object detection with region proposal
networks," in Advances in Neural Information Processing
Systems, 2015, pp. 91-99.
[12] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for
autonomous driving? The KITTI vision benchmark suite," in
Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, 2012, pp. 3354-3361.
[13] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-
nonnegative matrix factorizations," Pattern Analysis and
Machine Intelligence, IEEE Transactions on, vol. 32, pp.
45-55, 2010.
[14] D. D. Lee and H. S. Seung, "Algorithms for non-negative
matrix factorization," in Advances in neural information
processing systems, 2001, pp. 556-562.
[15] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W.
Smeulders, "Selective search for object recognition,"
International journal of computer vision, vol. 104, pp. 154-
171, 2013.
[16] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R.
Fergus, "Exploiting linear structure within convolutional
networks for efficient evaluation," in Advances in Neural
Information Processing Systems, 2014, pp. 1269-1277.
[17] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional
networks for semantic segmentation," in Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 3431-3440.
[18] G. Casalino, N. D. Buono, and M. Minervini,
"Nonnegative matrix factorizations performing object
detection and localization," Applied Computational
Intelligence and Soft Computing, vol. 2012, p. 15, 2012.
[19] J.-X. Zeng, C.-Y. Lin, and W.-Y. Lin, "Human detection
using non-negative matrix factorization," in Consumer
Electronics-Taiwan (ICCE-TW), 2015 IEEE International
Conference on, 2015, pp. 370-371.
[20] L. Gui and L.-P. Morency, "Learning and Transferring
Deep ConvNet Representations with Group-Sparse
Factorization," in ICCV Workshop on Machine Learning for
Intelligent Image and Video Processing, 2015.
[21] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. W.
Schuller, "A deep matrix factorization method for learning
attribute representations," arXiv preprint arXiv:1509.03248,
2015.
[22] M. D. Zeiler and R. Fergus, "Visualizing and
understanding convolutional networks," in Computer
Vision–ECCV 2014, Springer, 2014, pp. 818-833.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional
networks for large-scale image recognition," arXiv preprint
arXiv:1409.1556, 2014.
[24] B. Pepik, M. Stark, P. Gehler, and B. Schiele, "Multi-view
and 3D deformable part models," Pattern Analysis and
Machine Intelligence, IEEE Transactions on, vol. 37, pp.
2232-2245, 2015.
[25] J. J. Yebes, L. M. Bergasa, R. Arroyo, and A. Lazaro,
"Supervised learning and evaluation of KITTI's cars
detector with DPM," in Intelligent Vehicles Symposium
Proceedings, 2014 IEEE, 2014, pp. 768-773.
[26] J. J. Yebes, L. M. Bergasa, and M. García-Garrido, "Visual
object recognition with 3D-aware features in KITTI urban
scenes," Sensors, vol. 15, pp. 9228-9250, 2015.
[27] C. Cadena, A. Dick, and I. D. Reid, "A fast, modular scene
understanding system using context-aware object
detection," in Robotics and Automation (ICRA), 2015 IEEE
International Conference on, 2015, pp. 4859-4866.
[28] D. Z. Wang and I. Posner, "Voting for voting in online
point cloud object detection," in Proceedings of Robotics:
Science and Systems, Rome, Italy, 2015.
[29] J. Behley, V. Steinhage, and A. Cremers, "Laser-based
segment classification using a mixture of bag-of-words," in
Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ
International Conference on, 2013, pp. 4195-4200.
[30] J. Hosang, M. Omran, R. Benenson, and B. Schiele,
"Taking a deeper look at pedestrians," in Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 4073-4082.
[31] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast
feature pyramids for object detection," Pattern Analysis and
Machine Intelligence, IEEE Transactions on, vol. 36, pp.
1532-1545, 2014.
[32] C. Premebida, J. Carreira, J. Batista, and U. Nunes,
"Pedestrian detection combining RGB and dense LIDAR data,"
in Intelligent Robots and Systems (IROS 2014), 2014
IEEE/RSJ International Conference on, 2014, pp. 4112-
4117.
Table 5 The detection results of pedestrians and cyclists in our urban scene dataset on campus. P: precision, R: recall.
Pedestrian / P Pedestrian / R Cyclist / P Cyclist / R
Campus 1 99.29% 96.18% 99.51% 98.09%
Campus 2 92.39% 97.77% 99.95% 99.14%
Campus 3 99.58% 96.78% 99.22% 95.49%
Table 6 The detection results of cars and motorcyclists in our urban scene dataset in the city. P: precision, R: recall.
Car / P Car / R Motorcyclist / P Motorcyclist / R
City 1 98.70% 93.03% 99.95% 92.16%
City 2 100% 97.01% 99.26% 94.87%
City 3 99.67% 95.49%
City 4 98.33% 98.63%