A Neural-network Approach for Moving Objects Recognition in Color Image Sequences for Surveillance Applications
Figure 1: Typical image of the considered entrance access
Fig. 1 represents a typical scene to be monitored.
The solution is implemented as a cascade of two Multilayer Perceptron Neural Networks used as classifiers, each one devoted to a particular task. The first network is devoted to the classification of each moving object into one of three different classes, i.e. people, vehicles or other objects. The second network, which takes as input the output of the first one, is oriented to discriminate, among people, between uniformed personnel and civilian people, in order to determine the counting increment to be computed.
Fig. 2 allows a better understanding of the particular solution we have designed for this problem.
Figure 2: Proposed hierarchical approach to the moving objects classification task (color camera → objects detection and tracking system → list of detected moving areas, i.e. blobs → first neural network: vehicles-pedestrians classification → second neural network: uniformed personnel-civilian people classification)
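The two-level cascade of Fig. 2 can be sketched as routing logic; the class labels, the two classifier stubs and the counter update below are illustrative assumptions, not the paper's implementation.

```python
def update_counts(blobs, first_net, second_net, counts):
    """Route each detected blob through the two-level cascade:
    first_net assigns one of "person", "vehicle", "other"; for people
    only, second_net decides "uniformed" vs "civilian", which
    determines the counting increment."""
    for blob in blobs:
        cls = first_net(blob)        # level 1: people / vehicles / other
        if cls == "person":
            cls = second_net(blob)   # level 2: uniformed vs civilian
        counts[cls] = counts.get(cls, 0) + 1
    return counts
```

Only blobs classified as people ever reach the second network, which is the point of the hierarchical design: the harder uniform/civilian decision is made on a pre-filtered population.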
We have chosen to use the same kind of neural network at the two classification levels, differentiating the features provided to each network according to the task to be faced.
3. SYSTEM DESCRIPTION
Figure 3 shows the general architecture of the proposed surveillance system.
The following assumptions are made: (a) stationary and precalibrated camera, (b) ground-plane hypothesis, (c) known set of object and behaviour models. The system is composed of six modules: image acquisition (IA), background updating (BU), mobile object detection (MOD), object tracking (OT), object recognition (OR) and dynamic scene interpretation (DSI).
Figure 3: General System Architecture
3.1 Image acquisition and background updating
A color surveillance camera, mounted on a pole and equipped with a wide-angle lens to capture the activity over a wide-area scene, acquires the visual images that represent the input of the system. A pin-hole camera model has been selected.
A background updating procedure is used to adapt the background image BCK(x,y) to significant changes in the scene (e.g., illumination, new static objects, etc.).
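The paper does not state the actual updating rule; a common running-average scheme, sketched here under that assumption, would be:

```python
import numpy as np

def update_background(bck, frame, beta=0.05):
    """Running-average background update: slowly blends the current
    frame into BCK(x,y), so gradual illumination changes and newly
    static objects are absorbed over time. This scheme and the
    adaptation rate beta are assumptions, not the paper's rule."""
    return (1.0 - beta) * bck + beta * frame
```

A small beta makes the background robust to transient moving objects at the price of slower adaptation to genuine scene changes.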
3.2 Object detection
A change detection (CD) procedure based on a simple difference method identifies mobile objects in the scene by separating them from the static background.
Let B(x,y) be the output of the CD algorithm. B(x,y) is a binary
image where pixels representing mobile objects are set to 1 and
background pixels are set to 0. The B(x,y) image normally
contains some noisy isolated points or small spurious blobs
generated during the acquisition process.
A morphological erosion operator is applied to eliminate these
undesired effects. Let Bi be the binary blob representing the i-th
detected object in the scene.
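The difference method and the erosion clean-up can be sketched as follows; the threshold value and the 3x3 structuring element are assumptions (the paper does not specify them).

```python
import numpy as np

def change_detection(frame, bck, thr=30):
    """Simple difference method: B(x,y) = 1 where the frame deviates
    from the background by more than a threshold, 0 elsewhere."""
    return (np.abs(frame.astype(int) - bck.astype(int)) > thr).astype(np.uint8)

def erode3x3(b):
    """Morphological erosion with a 3x3 structuring element: a pixel
    survives only if its whole 3x3 neighbourhood is 1, which removes
    isolated noisy points and small spurious blobs."""
    p = np.pad(b, 1)            # zero-pad so border pixels are eroded
    out = np.ones_like(b)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):   # minimum over the 3x3 neighbourhood
            out &= p[1 + dy:1 + dy + b.shape[0], 1 + dx:1 + dx + b.shape[1]]
    return out
```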
3.3 Object tracking
The position and the dimensions of the minimum rectangles
(MBR) bounding the detected blobs on the image plane are
considered as target features and matched between two
successive frames. In particular, the displacement (dx,dy) of the
MBR centroid and the variations (dh,dl) in the MBR size are
computed.
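The matched target features can be sketched as follows; representing an MBR as a (x_min, y_min, x_max, y_max) tuple is an assumption for illustration.

```python
def mbr_features(prev, curr):
    """Features matched between two successive frames: displacement
    (dx, dy) of the MBR centroid and variations (dh, dl) of the MBR
    size. Each MBR is assumed to be (x_min, y_min, x_max, y_max)."""
    def centroid(b):
        return (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
    cx0, cy0 = centroid(prev)
    cx1, cy1 = centroid(curr)
    dh = (curr[3] - curr[1]) - (prev[3] - prev[1])  # height variation
    dl = (curr[2] - curr[0]) - (prev[2] - prev[0])  # length variation
    return cx1 - cx0, cy1 - cy0, dh, dl
```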
After that, an extended Kalman filter (EKF) estimates the
depth Zb of each object's center of gravity in a 3D general
reference system (GRS), together with the width W and the
length L of the object itself. A ground plane hypothesis is
applied to perform
2D-into-3D transformations from the image plane into the GRS.
However, the tracking task is not the focus of this paper and will not be further examined in the following.
3.4 Dynamic object recognition and behaviour
understanding
The overall purpose of a visual surveillance system is to provide an accurate description of a dynamic scene. To do this, an effective interpretation of the dynamic behaviour of 3D moving objects is required. A set of object features, extracted from the input images, is matched against the projected geometric features of the object models.
This structure has been chosen because it allows supervised learning, with each input vector having a corresponding known target output. The difference between the network's actual output and the target is computed in order to determine the error. The weights in the network are then updated, according to the back-propagation algorithm, to minimize the maximum modulus of the error. Learning is deemed finished when the error at the output has reached a visible minimum when plotted against training time (number of presentations of the training set), that is, when the weights converge to a particular solution for the training set.
For the choice of the number of layers and of units in the hidden layers, no specific rule exists: it must be made on the basis of acquired experience.
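The supervised loop described above (compare output to target, update the weights, stop at convergence) might be sketched as follows; for brevity, a single linear layer with a mean-squared-error gradient step stands in for full back-propagation through a Multilayer Perceptron, so everything here is illustrative.

```python
import numpy as np

def train_until_converged(X, T, lr=0.1, threshold=1e-3, max_epochs=10000):
    """Compute the difference between the actual output and the known
    target, use it to update the weights, and stop once the mean
    absolute error over the training set falls below a threshold."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(X.shape[1], T.shape[1]))
    for _ in range(max_epochs):
        Y = X @ W                      # actual outputs
        err = Y - T                    # output - target
        W -= lr * X.T @ err / len(X)   # gradient step on squared error
        if np.mean(np.abs(err)) < threshold:
            break                      # error has reached its minimum
    return W
```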
In our specific case, the inputs of the network correspond to a particular set of features, computed on each detected blob and significant for the particular task, while the outputs correspond to the clusters into which the blobs must be classified: three for the first level (people, vehicle and other) and two for the second level (uniformed personnel and civilian people).
The particular perceptron configuration we have chosen consists of 20 neurons in the hidden layer (see Fig. 4).
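A minimal forward-pass sketch of such a perceptron follows; sigmoid activations and the winner-take-all decision rule are assumptions, since the paper does not state them.

```python
import numpy as np

def mlp_classify(x, W1, b1, W2, b2):
    """Forward pass of a three-layer perceptron: the input features
    feed 20 hidden units, whose outputs feed one output unit per
    cluster; the predicted cluster is the index of the largest
    output (winner-take-all, an assumed decision rule)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(x @ W1 + b1)   # hidden layer, n = 20 units
    y = sigmoid(h @ W2 + b2)   # e.g. 3 outputs: people, vehicle, other
    return int(np.argmax(y))
```

For the first-level network the output size is 3; for the second-level network it would be 2.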
In particular, regular moment invariant features are considered to characterize the object shape. Each detected blob in the binary image B(x,y) represents the silhouette of a mobile object as it appears in a perspective view from an arbitrary viewpoint in the 3D scene, and the 3D object is constrained to move on the ground plane. Since the viewpoint is arbitrary, the position, the size and the orientation of the 2D blob can vary from image to image. A large set of possible perspective shapes of multiple 3D object models, e.g., cars, lorries, buses, motorcycles, pedestrians, etc., has been considered.
In the next Sections, the proposed solution for the particular kind of classifier that has been chosen and for the features that will be used as input to the classifier itself will be presented, followed by some experimental results we have obtained in the classification task.
4. THE NEURAL NETWORKS BASED
MOVING OBJECTS CLASSIFICATION
4.1 The choice of the classifier
The choice of the best classifier for the considered application should be constrained by the trade-off among the training time, the required memory, the computational complexity and the classification time, in addition to the probability of success. For the considered application, the classification should be performed as fast as possible, in order to guarantee the real-time behaviour of the system; for this reason we have considered a three-layer Multilayer Perceptron with the backpropagation learning rule: this classifier needs a long off-line training step, but it is able to process data quickly [1].

Figure 4: Neural network for classification (n=20)

The training set has been composed of a large number of patterns representative of the classes, while tests have been performed by using different initializations of the network. The parameters used in the initialization influence the weight updating during the training; they are briefly explained in the following.

Learning rate: the weights of the network are updated by means of the following relation:

w_{jk}^{p+1}(s) = w_{jk}^{p}(s) - \eta \frac{\partial E_p}{\partial w_{jk}^{p}(s)} + \alpha \left[ w_{jk}^{p}(s) - w_{jk}^{p-1}(s) \right]

where:
E_p is the global error performed by the network at step p;
w_{jk}^{p}(s) represents the weight value between units j and k, at layer s and training step p;
\eta represents the learning rate;
\alpha represents the momentum.

It is possible to notice that the learning rate plays an important role in the convergence of the algorithm, as shown in the previous equation; in fact it represents the weight-updating step: the smaller the learning rate, the slower and usually the more precise the training process. Nevertheless, a too small learning rate introduces a risk, because the algorithm may converge to a local minimum.

The momentum parameter has been introduced in order to ease the convergence of the algorithm, while the weight decay influences the speed at which the weights not influential for the training are set to zero.

During the training process, the weight initialization represents a critical issue, because a bad initial weight set could make the training process too slow or generate a large error. For this reason, it is better to consider different trials corresponding to different initial weight values; at the end, only the initial set that provides the best results is kept.

At the beginning of the training process, the weights assume random values, and the training stops according to the following criterion:

if \frac{1}{n_c \, n_p} \sum_{c=1}^{n_c} \sum_{p=1}^{n_p} \left| y_c(p) - t_c(p) \right| \le threshold, then STOP

where: n_c = number of clusters, n_p = number of patterns, y_c(p) = desired output for pattern p, t_c(p) = output for pattern p obtained with the current weight values.

At this point, the last step to be faced is the selection of the feature set used as input to the neural classifier; these features are presented in the following section.

4.2 The set of features to be used

In order to recognise each observed blob, the Multilayer Perceptron has been trained with the Hu moments, which are invariant to rotation, translation and scale changes [2]. Let f_1,..,f_7 be these invariant moments:

F_1 = m_{2,0} + m_{0,2}

F_2 = (m_{2,0} - m_{0,2})^2 + 4 m_{1,1}^2

F_3 = (m_{3,0} - 3 m_{1,2})^2 + (3 m_{2,1} - m_{0,3})^2

F_4 = (m_{3,0} + m_{1,2})^2 + (m_{2,1} + m_{0,3})^2

F_5 = (m_{3,0} - 3 m_{1,2})(m_{3,0} + m_{1,2})[(m_{3,0} + m_{1,2})^2 - 3 (m_{2,1} + m_{0,3})^2] + (3 m_{2,1} - m_{0,3})(m_{2,1} + m_{0,3})[3 (m_{3,0} + m_{1,2})^2 - (m_{2,1} + m_{0,3})^2]

F_6 = (m_{2,0} - m_{0,2})[(m_{3,0} + m_{1,2})^2 - (m_{2,1} + m_{0,3})^2] + 4 m_{1,1}(m_{3,0} + m_{1,2})(m_{2,1} + m_{0,3})

F_7 = (3 m_{2,1} - m_{0,3})(m_{3,0} + m_{1,2})[(m_{3,0} + m_{1,2})^2 - 3 (m_{2,1} + m_{0,3})^2] - (m_{3,0} - 3 m_{1,2})(m_{2,1} + m_{0,3})[3 (m_{3,0} + m_{1,2})^2 - (m_{2,1} + m_{0,3})^2]

where m_{p,q} = v_{p,q} / (v_{0,0})^{1 + (p+q)/2} (p, q = 0, 1, 2, ...) represents the normalized central moment

v_{p,q} = \sum_{(x,y) \in B_i} (x - x_0)^p (y - y_0)^q \, I(x, y)   (p, q = 0, 1, 2, ...)

computed on the area B_i.

A particular comment has to be reserved for the choice of the measure I(x,y) referred to each pixel of the image, which can be seen as a luminosity index associated with the pixel itself, the luminosity of a pixel having been considered as a discriminant criterion for the distinction between vehicles and people (e.g. humans have a different reflectivity coefficient with respect to vehicles). In previous works [3,4], which were limited to grey-level image processing, it was straightforward to use the grey level of each pixel as the reference luminosity value, but in our case this is not possible, because of the vectorial nature of the luminosity values in color images.

As scalar luminosity index we have then selected, for the first network, the Y coefficient of the YUV color space [5], which well represents a luminosity index in color images. Different is the case of the second network: in order to recognise uniformed personnel within the wider and more general class of people, a-priori knowledge about the particular uniform color (in most cases orange) has been taken into account, and this fact has led us to consider as scalar luminosity index the H (hue) value of the HLS (Hue-Luminance-Saturation) space [5]. The normalized central moments used in our case are then equal to:

v_{p,q} = \sum_{(x,y) \in B_i} (x - x_0)^p (y - y_0)^q \, Y(x, y)   (p, q = 0, 1, 2, ...)

for the first network and to:

v_{p,q} = \sum_{(x,y) \in B_i} (x - x_0)^p (y - y_0)^q \, H(x, y)   (p, q = 0, 1, 2, ...)

for the second network.

The pattern for the i-th detected object will be composed as follows: p(x,y) = [f_1, f_2, f_3, f_4, f_5, f_6, f_7], where the functions f_1,..,f_7 are computed on the blob B_i. A set of feature vectors extracted from several models, representing different objects taken from different viewpoints, is used as patterns for the training procedure.
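As a concrete illustration, the weighted central moments and the first two invariants can be sketched as follows; the uniform test blob and the restriction to F_1 and F_2 are simplifications, not the paper's implementation.

```python
import numpy as np

def hu_f1_f2(I):
    """Normalized central moments v_{p,q} weighted by a per-pixel
    measure I(x,y) (Y for the first network, H for the second), and
    the first two Hu invariants F1, F2. Pixels outside the blob B_i
    are assumed to carry I(x,y) = 0."""
    ys, xs = np.indices(I.shape)
    v00 = I.sum()
    x0 = (xs * I).sum() / v00          # blob centroid
    y0 = (ys * I).sum() / v00

    def v(p, q):                       # central moment v_{p,q}
        return ((xs - x0) ** p * (ys - y0) ** q * I).sum()

    def m(p, q):                       # normalized, exponent 1 + (p+q)/2
        return v(p, q) / v00 ** (1 + (p + q) / 2)

    F1 = m(2, 0) + m(0, 2)
    F2 = (m(2, 0) - m(0, 2)) ** 2 + 4 * m(1, 1) ** 2
    return F1, F2
```

Doubling the size of a uniform square blob leaves F1 almost unchanged (exactly so in the continuous limit), which is the scale invariance the classifier relies on.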
After the object has been recognized, a new Multilayer
Perceptron trained with different dynamic features of the
same
object taken from different consecutive frames, has been employed to generate system alarms. Each pattern is composed of information about the recognized object class and the estimated object speed and position on each frame of the sequence. Each reference model is characterized by a set of parameters that are specific to the behaviour class to which the object belongs.
The main advantage of such a feature-set selection is its independence from 3D knowledge, so the features can be used without the need for a different training depending, for example, on the particular zoom setting of the acquiring camera.
5. EXPERIMENTAL RESULTS
Experimental results have been carried on by using a set of
different sequences representing a pilot entrance access, in
presence of different illumination and traffic conditions, in order
to have a significant validation of the classification approach wehave proposed.
Let us remind that the surveillance system must be able to detect
moving objects, localize and recognise them and interpret their
behaviour in order to prevent possible dangerous situations, e.g.one or more pedestrians moving in an area completely devoted
to the vehicle traffic.
Images have been selected from a database in which sequences
acquired with different pan, tilt ant zoom of the same color video
camera and, in this paper, particular results about the recognitiontask will be presented.
The performances provided by the examined surveillance system
in terms of capabilities of object classification anddiscrimination have been measured through the percentage of
correct object recognition. For example, if within a certainsequence N areas relevant to pedestrians present in the scene
have been detected, then the percentage of correct object
recognition will be computed as:
Perc = (P_R / N) \cdot 100

where P_R indicates the number of times in which the person has been correctly detected.

In the following table, the average values of the correct recognition percentage are presented for each sequence, both for the discrimination of objects and for the discrimination, among pedestrians, between civilian people and municipality personnel:

             PERCENTAGE OF OBJECTS RECOGNITION   PERCENTAGE OF CIVILIAN DISCRIMINATION
SEQUENCE 1   94%                                 84%
SEQUENCE 2   98%                                 86%
SEQUENCE 3   89%                                 88%
SEQUENCE 4   92%                                 81%
SEQUENCE 5   87%                                 83%
SEQUENCE 6   94%                                 84%
SEQUENCE 7   88%                                 87%
SEQUENCE 8   92%                                 85%

6. CONCLUSIONS

A surveillance system for detecting dangerous situations at a road entrance access has been presented. The system is based on the use of Multilayer Perceptron neural networks to perform both object classification and scene understanding. The average correct classification rate between people and vehicles is equal to 90%, while the correct recognition percentage for uniformed personnel within the more general class of pedestrians is equal to 85%.

ACKNOWLEDGEMENTS

The present work has been partially supported by the European Commission under ESPRIT contract no. 28494 AVS-RIO (Advanced Video Surveillance: cable-television-based remote video surveillance system for protected sites monitoring).

REFERENCES

[1] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, UK, 1996.
[2] M.K. Hu, Visual pattern recognition by moment invariants, IEEE Trans. on Information Theory, Vol. 8, 1962, pp. 179-187.
[3] G.L. Foresti, A neural tree based image understanding system for advanced visual surveillance, Advanced Video-Based Surveillance Systems, Kluwer Academic Publishers, 1998, pp. 117-129.
[4] J.E. Hollis, D.J. Brown, I.C. Luckraft and C.R. Gent, Feature vectors for road vehicle scene classification, Neural Networks, Vol. 9, No. 2, 1996, pp. 337-344.
[5] B. Furht, S.W. Smoliar, H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.