A Neural-network Approach for Moving Objects Recognition in Color Image Sequences for Surveillance Applications


Figure 1: Typical image of the considered entrance access

Fig. 1 represents a typical scene to be monitored.

The solution is implemented as a cascade of two Multilayer Perceptron neural networks used as classifiers, each one devoted to a particular task. The first network is devoted to the classification of each moving object into one of three different classes, i.e. people, vehicles or other objects. The second network, which takes as input the output of the first one, discriminates, among people, between uniformed personnel and civilian people, in order to determine the counting increment to be computed.

Fig. 2 illustrates the particular solution we have designed for this problem.

[Figure 2 shows the processing chain: color camera -> objects detection and tracking system -> list of detected moving areas (i.e. blobs) -> first neural network (classification vehicles/pedestrians) -> second neural network (classification uniformed personnel/civilian people).]

Figure 2: Proposed hierarchical approach to the moving objects classification task

We have chosen to use the same kind of neural network at both classification levels, obviously differentiating the features provided to each network according to the task to be faced.

    3. SYSTEM DESCRIPTION

Figure 3 shows the general architecture of the proposed surveillance system.

The following assumptions are made: (a) stationary and precalibrated camera, (b) ground-plane hypothesis, (c) known set of object and behaviour models. The system is composed of six modules: image acquisition (IA), background updating (BU), mobile object detection (MOD), object tracking (OT), object recognition (OR) and dynamic scene interpretation (DSI).

    Figure 3: General System Architecture

    3.1 Image acquisition and background updating

A color surveillance camera, mounted on a pole and equipped with a wide-angle lens to capture the activity over a wide area, acquires the visual images that constitute the input of the system. A pin-hole camera model has been selected.

    A background updating procedure is used to adapt the

    background image BCK(x,y) to the significant changes in the

    scene (e.g., illumination, new static objects, etc.).
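The paper does not specify the updating rule itself; a minimal sketch, assuming a simple running-average update with an adaptation rate alpha (our assumption, not stated in the paper), could look as follows:

```python
import numpy as np

def update_background(bck, frame, alpha=0.05, motion_mask=None):
    """Blend the current frame into the background model BCK(x,y).

    alpha is the adaptation rate (an assumed value; the paper does not
    specify the update rule). Pixels flagged as moving can be excluded
    so that transiting objects do not pollute the background.
    """
    bck = bck.astype(np.float64)
    frame = frame.astype(np.float64)
    updated = (1.0 - alpha) * bck + alpha * frame
    if motion_mask is not None:
        # keep the old background where motion was detected
        updated[motion_mask] = bck[motion_mask]
    return updated

# Example: a static scene that is slowly brightening
bck = np.full((4, 4), 100.0)
frame = np.full((4, 4), 120.0)
new_bck = update_background(bck, frame, alpha=0.1)
```

Excluding the currently detected moving areas from the update prevents slowly moving objects from being absorbed into BCK(x,y), while illumination changes and new static objects are progressively incorporated.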

    3.2 Object detection

A change detection (CD) procedure, based on a simple difference method, identifies mobile objects in the scene by separating them from the static background.

    Let B(x,y) be the output of the CD algorithm. B(x,y) is a binary

    image where pixels representing mobile objects are set to 1 and

    background pixels are set to 0. The B(x,y) image normally

    contains some noisy isolated points or small spurious blobs

    generated during the acquisition process.

    A morphological erosion operator is applied to eliminate these

    undesired effects. Let Bi be the binary blob representing the i-th

    detected object in the scene.
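As an illustration, the difference-based change detection and the erosion step might be sketched as follows (the threshold value and the 3x3 structuring element are assumptions, not stated in the paper):

```python
import numpy as np

def change_detection(frame, bck, threshold=25):
    """Binary image B(x,y): 1 where |frame - BCK| exceeds a threshold.

    The threshold value is an assumption; the paper only states that a
    simple difference method is used.
    """
    diff = np.abs(frame.astype(np.int32) - bck.astype(np.int32))
    return (diff > threshold).astype(np.uint8)

def erode3x3(b):
    """Morphological erosion with a 3x3 structuring element: a pixel
    survives only if its whole 3x3 neighbourhood is 1. This removes
    isolated noisy points and small spurious blobs."""
    padded = np.pad(b, 1, mode="constant")
    out = np.ones_like(b)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= padded[1 + dy : 1 + dy + b.shape[0],
                          1 + dx : 1 + dx + b.shape[1]]
    return out

frame = np.zeros((6, 6), dtype=np.uint8)
frame[1:5, 1:5] = 200          # a moving object
frame[0, 5] = 255              # an isolated noisy pixel
bck = np.zeros((6, 6), dtype=np.uint8)

b = change_detection(frame, bck)
clean = erode3x3(b)            # noise removed, blob core kept
```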


    3.3 Object tracking

The positions and dimensions of the minimum bounding rectangles (MBRs) enclosing the detected blobs on the image plane are considered as target features and matched between two successive frames. In particular, the displacement (dx, dy) of the MBR centroid and the variations (dh, dl) in the MBR size are computed.
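These matched features can be computed directly from the MBRs of two successive frames; the MBR record below is a hypothetical interface, since the paper only defines the quantities (dx, dy) and (dh, dl):

```python
from dataclasses import dataclass

@dataclass
class MBR:
    """Minimum bounding rectangle of a blob: centroid plus size."""
    cx: float  # centroid x
    cy: float  # centroid y
    h: float   # height
    l: float   # length (width)

def mbr_features(prev: MBR, curr: MBR):
    """Displacement of the MBR centroid and variation of its size
    between two successive frames, as used for target matching."""
    dx = curr.cx - prev.cx
    dy = curr.cy - prev.cy
    dh = curr.h - prev.h
    dl = curr.l - prev.l
    return dx, dy, dh, dl

# A blob moving 3 pixels to the right and growing slightly
prev = MBR(cx=100.0, cy=50.0, h=40.0, l=20.0)
curr = MBR(cx=103.0, cy=50.0, h=42.0, l=21.0)
dx, dy, dh, dl = mbr_features(prev, curr)
```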

After that, an extended Kalman filter (EKF) estimates the depth Zb of each object's center of gravity in a 3D general reference system (GRS), together with the width W and the length L of the object itself. A ground-plane hypothesis is applied to perform 2D-into-3D transformations from the image plane into the GRS.
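A minimal sketch of such a 2D-into-3D transformation under the pin-hole model, assuming a calibrated camera with intrinsic matrix K and world-to-camera pose (R, t) (all values below are illustrative, not the system's actual calibration):

```python
import numpy as np

def backproject_to_ground(u, v, K, R, t):
    """Intersect the viewing ray of pixel (u, v) with the ground plane
    Z = 0 of the general reference system (GRS), under the pin-hole
    model and the ground-plane hypothesis.

    K: 3x3 intrinsic matrix; (R, t): world-to-camera rotation and
    translation.
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d = R.T @ ray_cam              # ray direction in world coordinates
    c = -R.T @ t                   # camera centre in world coordinates
    s = -c[2] / d[2]               # ray parameter where the ray meets Z = 0
    return c + s * d

# Toy calibration (hypothetical values): focal length 500 px,
# principal point (320, 240), camera axes aligned with the world axes,
# ground plane Z = 0 lying 5 m in front of the camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
p = backproject_to_ground(320.0, 240.0, K, R, t)   # principal ray
```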

However, the tracking task is not the focus of this paper and will not be examined further in the following.

3.4 Dynamic object recognition and behaviour understanding

The overall purpose of a visual surveillance system is to provide an accurate description of a dynamic scene. To do this, an effective interpretation of the dynamic behaviour of 3D moving objects is required. A set of object features, extracted from the input images, is matched against geometric features projected from the object models.

In particular, regular moment invariant features are considered to characterize the object shape. Each detected blob in the binary image B(x,y) represents the silhouette of a mobile object as it appears in a perspective view from an arbitrary viewpoint in the 3D scene, and the 3D object is constrained to move on the ground plane. Since the viewpoint is arbitrary, the position, the size and the orientation of the 2D blob can vary from image to image. A large set of possible perspective shapes of multiple 3D object models, e.g., cars, lorries, buses, motorcycles, pedestrians, etc., has been considered.

In the next Sections, the proposed solution for the particular kind of classifier that has been chosen, and for the features that will be used as input to the classifier itself, will be presented, followed by some experimental results obtained in the classification task.

4. THE NEURAL NETWORKS BASED MOVING OBJECTS CLASSIFICATION

4.1 The choice of the classifier

The choice of the best classifier for the considered application should be constrained by the trade-off among the training time, the required memory, the computational complexity and the classification time, in addition to the probability of success. For the considered application, the classification should be performed as fast as possible, in order to guarantee the real-time behaviour of the system; for this reason we have considered a three-layer Multilayer Perceptron with the backpropagation learning rule: this classifier needs a long off-line training step, but it is able to process data quickly [1].

This structure has been chosen because it allows supervised learning, with each input vector having a corresponding known target output. The difference between the network's actual output and the target is computed in order to determine the error. The weights in the network are then updated, according to the backpropagation algorithm, to minimize the maximum modulus of the error. Learning is deemed to be finished when the error at the output has reached a visible minimum when plotted against training time (number of presentations of the training set), that is, when the weights converge to a particular solution for the training set. For the choice of the number of layers and of units in the hidden layers, no specific rule exists: it must be performed on the basis of acquired experience.

In our specific case, the inputs of the network correspond to a particular set of features computed on each detected blob and significant for the particular task, while the outputs correspond to the clusters into which the blobs must be classified: three for the first level (people, vehicles and other) and two for the second level (uniformed personnel and civilian people). The particular perceptron configuration we have chosen consists of 20 neurons in the hidden layer (see Fig. 4).

[Figure 4 shows a fully connected three-layer feed-forward network: m input units, a hidden layer of n units and three output units.]

Figure 4: Neural network for classification (n = 20)

The training set has been composed of a large number of patterns representative of the classes, while tests have been performed using different initializations of the network. The parameters used in the initialization influence the weight updating during the training; they are briefly explained in the following.

The parameters used for initializing the neural network are the following:

learning rate: the weights of the network are updated by means of the following relation:

    w_jk^(p+1)(s) = w_jk^p(s) - eta * dE_p/dw_jk(s) + alpha * [w_jk^p(s) - w_jk^(p-1)(s)]

where E_p is the global error performed by the network at step p;


w_jk^p(s) represents the weight value between units j and k at layer s and training step p; eta represents the learning rate; alpha represents the momentum.

It is possible to notice that the learning rate plays an important role in the convergence of the algorithm, as the previous equation shows; in fact, it represents the weight updating step: the smaller the learning rate, the slower and usually more precise the training process. Nevertheless, using too small a learning rate introduces a risk, because the algorithm may converge to a local minimum. The momentum parameter has been introduced in order to ease the convergence of the algorithm, while the weight decay influences the speed at which the weights not influential for the training are set to zero.

During the training process, the weight initialization represents a critical issue, because a bad initial weight set could make the training process too slow or generate a large error. For this reason, it is better to consider different trials corresponding to different initial weight values; at the end, only the initial set that provides the best results is retained. At the beginning of the training process the weights assume random values, and the training stops according to the following criterion:

    if (1 / (nc * np)) * sum_{c=1..nc} sum_{p=1..np} |y_c(p) - t_c(p)| <= threshold, then STOP

where nc = number of clusters, np = number of patterns, y_c(p) = desired output for pattern p, t_c(p) = output for pattern p obtained with the current weight values.

At this point, the last step to be faced is the selection of the feature set used as input to the neural classifier; it is presented in the following section.

4.2 The set of features to be used

In order to recognise each observed blob, the Multilayer Perceptron has been trained with Hu moments, which are invariant to rotation, translation and scale changes [2]. Let f1, ..., f7 be these invariant moments:

    F1 = m_2,0 + m_0,2

    F2 = (m_2,0 - m_0,2)^2 + 4 m_1,1^2

    F3 = (m_3,0 - 3 m_1,2)^2 + (3 m_2,1 - m_0,3)^2

    F4 = (m_3,0 + m_1,2)^2 + (m_2,1 + m_0,3)^2

    F5 = (m_3,0 - 3 m_1,2)(m_3,0 + m_1,2)[(m_3,0 + m_1,2)^2 - 3 (m_2,1 + m_0,3)^2]
         + (3 m_2,1 - m_0,3)(m_2,1 + m_0,3)[3 (m_3,0 + m_1,2)^2 - (m_2,1 + m_0,3)^2]

    F6 = (m_2,0 - m_0,2)[(m_3,0 + m_1,2)^2 - (m_2,1 + m_0,3)^2]
         + 4 m_1,1 (m_3,0 + m_1,2)(m_2,1 + m_0,3)

    F7 = (3 m_2,1 - m_0,3)(m_3,0 + m_1,2)[(m_3,0 + m_1,2)^2 - 3 (m_2,1 + m_0,3)^2]
         - (m_3,0 - 3 m_1,2)(m_2,1 + m_0,3)[3 (m_3,0 + m_1,2)^2 - (m_2,1 + m_0,3)^2]

where m_p,q = v_p,q / (v_0,0)^b, with b = (p + q)/2 + 1, represents the normalized central moment, and

    v_p,q = sum_{(x,y) in Bi} (x - x0)^p (y - y0)^q I(x, y)    (p, q = 0, 1, 2, ...)

is computed on the area Bi.

A particular comment has to be reserved for the choice of the measure I(x,y) associated with each pixel of the image, which should be a sort of luminosity index for the pixel itself, the luminosity of a pixel having been considered as a discriminant criterion for the distinction between vehicles and people (e.g., humans have a different reflectivity coefficient with respect to vehicles). In previous works [3,4], which were limited to grey-level image processing, it was straightforward to use the grey level of each pixel as the reference luminosity value, but in our case this is not possible, because of the vectorial nature of the luminosity values in color images.

As scalar luminosity index we have then selected, for the first network, the Y coefficient of the YUV color space [5], which well represents a luminosity index in color images. The case of the second network is different: in order to recognise uniformed personnel within the wider and more general class of people, a-priori knowledge about the particular uniform color (e.g., in most cases orange) has been taken into account, and this has led us to consider as scalar luminosity index the H (hue) value of the HLS (Hue-Luminance-Saturation) space [5].

The normalized central moments used in our case are then equal to:

    v_p,q = sum_{(x,y) in Bi} (x - x0)^p (y - y0)^q Y(x, y)    (p, q = 0, 1, 2, ...)

for the first network and to:

    v_p,q = sum_{(x,y) in Bi} (x - x0)^p (y - y0)^q H(x, y)    (p, q = 0, 1, 2, ...)

for the second network.

The pattern for the i-th detected object is composed as follows: p(x,y) = [f1, f2, f3, f4, f5, f6, f7], where the functions f1, ..., f7 are computed on the blob Bi. A set of feature vectors extracted from several models representing different objects, taken from different viewpoints, is used as the patterns of the training procedure.
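As an illustration, the normalized central moments and the first two invariants could be computed as follows (the luminosity-weighted centroid and the standard luma weights for Y are our assumptions; F3 to F7 follow the same pattern):

```python
import numpy as np

def normalized_central_moment(p, q, I, mask):
    """m_{p,q}: central moment over the blob area Bi, weighted by the
    luminosity measure I(x, y), normalized by v_{0,0}^b with
    b = (p + q)/2 + 1 as in the paper."""
    ys, xs = np.nonzero(mask)
    w = I[ys, xs]
    x0 = np.sum(xs * w) / np.sum(w)   # luminosity-weighted centroid (assumed)
    y0 = np.sum(ys * w) / np.sum(w)
    nu = np.sum((xs - x0) ** p * (ys - y0) ** q * w)
    nu00 = np.sum(w)
    b = (p + q) / 2.0 + 1.0
    return nu / nu00 ** b

def hu_f1_f2(I, mask):
    """First two Hu invariants; F3..F7 are built the same way."""
    m20 = normalized_central_moment(2, 0, I, mask)
    m02 = normalized_central_moment(0, 2, I, mask)
    m11 = normalized_central_moment(1, 1, I, mask)
    f1 = m20 + m02
    f2 = (m20 - m02) ** 2 + 4.0 * m11 ** 2
    return f1, f2

def luma(rgb):
    """Y component of YUV, with the standard luma weights."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

# Toy blob: a uniform 4x4 square patch in an 8x8 image
rgb = np.full((8, 8, 3), 128.0)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
f1, f2 = hu_f1_f2(luma(rgb), mask)
```

For a symmetric blob such as this square, F2 vanishes, while F1 measures the luminosity-weighted spread of the blob about its centroid.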

    After the object has been recognized, a new Multilayer

    Perceptron trained with different dynamic features of the

    same


object, taken from different consecutive frames, has been employed to generate system alarms. Each pattern is composed of information about the recognized object class and about the estimated object speed and position in each frame of the sequence. Each reference model is characterized by a set of parameters that are specific to the behaviour class to which the object belongs.
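Both perceptrons are trained with the backpropagation rule with momentum described in Section 4.1; a minimal sketch on toy data follows (network sizes, learning rate and momentum are illustrative values, and biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-in data: 2 inputs, 1 output (the real networks take the
# blob features as inputs and have one output unit per class).
X = rng.normal(size=(40, 2))
T = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

n_in, n_hid, n_out = 2, 20, 1          # 20 hidden units, as in the paper
W1 = rng.normal(scale=0.5, size=(n_in, n_hid))
W2 = rng.normal(scale=0.5, size=(n_hid, n_out))
dW1_prev = np.zeros_like(W1)           # w(p) - w(p-1), for the momentum term
dW2_prev = np.zeros_like(W2)
eta, alpha = 0.2, 0.8                  # learning rate and momentum (assumed)

errors = []
for step in range(200):
    H = sigmoid(X @ W1)                # forward pass
    Y = sigmoid(H @ W2)
    errors.append(np.mean((Y - T) ** 2))
    dY = (Y - T) * Y * (1.0 - Y)       # backward pass, squared-error loss
    dH = (dY @ W2.T) * H * (1.0 - H)
    gW2 = H.T @ dY / len(X)            # mean gradients dE/dw
    gW1 = X.T @ dH / len(X)
    # w(p+1) = w(p) - eta * dE/dw + alpha * [w(p) - w(p-1)]
    dW2_prev = -eta * gW2 + alpha * dW2_prev
    dW1_prev = -eta * gW1 + alpha * dW1_prev
    W2 += dW2_prev
    W1 += dW1_prev
```

The momentum term reuses the previous weight increment, smoothing the descent exactly as in the update relation of Section 4.1.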

The main advantage of such a feature set is its independence from 3D knowledge, so that it can be used without the need for a different training depending, for example, on the particular zoom setting of the acquiring camera.

    5. EXPERIMENTAL RESULTS

Experimental results have been obtained by using a set of different sequences representing a pilot entrance access, under different illumination and traffic conditions, in order to obtain a significant validation of the proposed classification approach.

Let us recall that the surveillance system must be able to detect moving objects, localize and recognise them, and interpret their behaviour in order to prevent possibly dangerous situations, e.g. one or more pedestrians moving in an area completely devoted to vehicle traffic.

Images have been selected from a database of sequences acquired with different pan, tilt and zoom settings of the same color video camera; in this paper, particular results about the recognition task are presented.

The performance provided by the examined surveillance system, in terms of object classification and discrimination capabilities, has been measured through the percentage of correct object recognition. For example, if within a certain sequence N areas relevant to pedestrians present in the scene have been detected, then the percentage of correct object recognition is computed as:

    Perc = (PR / N) * 100

where PR indicates the number of times in which the persons have been correctly detected.

In the following table, the average values of the correct recognition percentage are presented for each sequence, both for the discrimination among object classes and for the discrimination, among pedestrians, between civilian people and municipality personnel:

                 PERCENTAGE OF OBJECTS    PERCENTAGE OF CIVILIAN
                 RECOGNITION              DISCRIMINATION
    SEQUENCE 1   94%                      84%
    SEQUENCE 2   98%                      86%
    SEQUENCE 3   89%                      88%
    SEQUENCE 4   92%                      81%
    SEQUENCE 5   87%                      83%
    SEQUENCE 6   94%                      84%
    SEQUENCE 7   88%                      87%
    SEQUENCE 8   92%                      85%

CONCLUSIONS

A surveillance system for detecting dangerous situations at a road entrance access has been presented. The system is based on the use of Multilayer Perceptron neural networks to perform both object classification and scene understanding. The average correct classification rate between people and vehicles is equal to 90%, while the correct recognition percentage for uniformed personnel within the more general class of pedestrians is equal to 85%.

ACKNOWLEDGEMENTS

The present work has been partially supported by the European Commission under ESPRIT contract no. 28494 AVS-RIO (Advanced Video Surveillance cable television based Remote Video surveillance system for protected sites monitoring).

REFERENCES

[1] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, UK, 1996.

[2] M.K. Hu, Visual pattern recognition by moment invariants, IEEE Trans. on Information Theory, Vol. 8, 1962, pp. 179-187.

[3] G.L. Foresti, A neural tree based image understanding system for advanced visual surveillance, Advanced Video Based Surveillance Systems, Kluwer Academic Publishers, 1998, pp. 117-129.

[4] J.E. Hollis, D.J. Brown, I.C. Luckraft and C.R. Gent, Feature vectors for road vehicle scene classification, Neural Networks, Vol. 9, No. 2, 1996, pp. 337-344.

[5] B. Furht, S.W. Smoliar, H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.