A Neural-network Approach for Moving Objects Recognition in Color Image Sequences for Surveillance Applications


Figure 1: Typical image of the considered entrance access

Fig. 1 represents a typical scene to be monitored.

The solution is implemented as a cascade of two Multilayer Perceptron neural networks used as classifiers, each one devoted to a particular task. The first network is devoted to the classification of each moving object into one of three different classes, i.e. people, vehicles or other objects. The second network, which takes as input the output of the first one, discriminates, among people, between uniformed personnel and civilian people, in order to determine the counting increment to be computed.

Fig. 2 illustrates the particular solution we have designed for this problem.

[Figure 2 shows the processing chain: color camera -> objects detection and tracking system -> list of detected moving areas (i.e. blobs) -> first neural network (classification vehicles/pedestrians) -> second neural network (classification uniformed personnel/civilian people).]

Figure 2: Proposed hierarchical approach to the moving objects classification task

We have chosen to use the same kind of neural network at both classification levels, obviously differentiating the features provided to each network according to the task to be faced.

    3. SYSTEM DESCRIPTION

Figure 3 shows the general architecture of the proposed surveillance system.

The following assumptions are made: (a) stationary and precalibrated camera, (b) ground-plane hypothesis, (c) known set of object and behaviour models. The system is composed of six modules: image acquisition (IA), background updating (BU), mobile object detection (MOD), object tracking (OT), object recognition (OR) and dynamic scene interpretation (DSI).

    Figure 3: General System Architecture

    3.1 Image acquisition and background updating

A color surveillance camera, mounted on a pole and equipped with a wide-angle lens to capture the activity over a wide area, acquires the visual images that constitute the input of the system. A pin-hole camera model has been selected.

    A background updating procedure is used to adapt the

    background image BCK(x,y) to the significant changes in the

    scene (e.g., illumination, new static objects, etc.).
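The paper does not specify the updating rule itself; a minimal sketch, assuming a simple running-average update with an adaptation rate alpha (our assumption, not stated in the paper), could look as follows:

```python
import numpy as np

def update_background(bck, frame, alpha=0.05, motion_mask=None):
    """Blend the current frame into the background model BCK(x,y).

    alpha is the adaptation rate (an assumed value; the paper does not
    specify the update rule). Pixels flagged as moving can be excluded
    so that transiting objects do not pollute the background.
    """
    bck = bck.astype(np.float64)
    frame = frame.astype(np.float64)
    updated = (1.0 - alpha) * bck + alpha * frame
    if motion_mask is not None:
        # keep the old background where motion was detected
        updated[motion_mask] = bck[motion_mask]
    return updated

# Example: a static scene that is slowly brightening
bck = np.full((4, 4), 100.0)
frame = np.full((4, 4), 120.0)
new_bck = update_background(bck, frame, alpha=0.1)
```

Excluding the currently detected moving areas from the update prevents slowly moving objects from being absorbed into BCK(x,y), while illumination changes and new static objects are progressively incorporated.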

    3.2 Object detection

A change detection (CD) procedure, based on a simple difference method, identifies mobile objects in the scene by separating them from the static background.

    Let B(x,y) be the output of the CD algorithm. B(x,y) is a binary

    image where pixels representing mobile objects are set to 1 and

    background pixels are set to 0. The B(x,y) image normally

    contains some noisy isolated points or small spurious blobs

    generated during the acquisition process.

    A morphological erosion operator is applied to eliminate these

    undesired effects. Let Bi be the binary blob representing the i-th

    detected object in the scene.
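As an illustration, the difference-based change detection and the erosion step might be sketched as follows (the threshold value and the 3x3 structuring element are assumptions, not stated in the paper):

```python
import numpy as np

def change_detection(frame, bck, threshold=25):
    """Binary image B(x,y): 1 where |frame - BCK| exceeds a threshold.

    The threshold value is an assumption; the paper only states that a
    simple difference method is used.
    """
    diff = np.abs(frame.astype(np.int32) - bck.astype(np.int32))
    return (diff > threshold).astype(np.uint8)

def erode3x3(b):
    """Morphological erosion with a 3x3 structuring element: a pixel
    survives only if its whole 3x3 neighbourhood is 1. This removes
    isolated noisy points and small spurious blobs."""
    padded = np.pad(b, 1, mode="constant")
    out = np.ones_like(b)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= padded[1 + dy : 1 + dy + b.shape[0],
                          1 + dx : 1 + dx + b.shape[1]]
    return out

frame = np.zeros((6, 6), dtype=np.uint8)
frame[1:5, 1:5] = 200          # a moving object
frame[0, 5] = 255              # an isolated noisy pixel
bck = np.zeros((6, 6), dtype=np.uint8)

b = change_detection(frame, bck)
clean = erode3x3(b)            # noise removed, blob core kept
```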


    3.3 Object tracking

The positions and dimensions of the minimum bounding rectangles (MBRs) enclosing the detected blobs on the image plane are considered as target features and matched between two successive frames. In particular, the displacement (dx, dy) of the MBR centroid and the variations (dh, dl) in the MBR size are computed.
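These matched features can be computed directly from the MBRs of two successive frames; the MBR record below is a hypothetical interface, since the paper only defines the quantities (dx, dy) and (dh, dl):

```python
from dataclasses import dataclass

@dataclass
class MBR:
    """Minimum bounding rectangle of a blob: centroid plus size."""
    cx: float  # centroid x
    cy: float  # centroid y
    h: float   # height
    l: float   # length (width)

def mbr_features(prev: MBR, curr: MBR):
    """Displacement of the MBR centroid and variation of its size
    between two successive frames, as used for target matching."""
    dx = curr.cx - prev.cx
    dy = curr.cy - prev.cy
    dh = curr.h - prev.h
    dl = curr.l - prev.l
    return dx, dy, dh, dl

# A blob moving 3 pixels to the right and growing slightly
prev = MBR(cx=100.0, cy=50.0, h=40.0, l=20.0)
curr = MBR(cx=103.0, cy=50.0, h=42.0, l=21.0)
dx, dy, dh, dl = mbr_features(prev, curr)
```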

After that, an extended Kalman filter (EKF) estimates the depth Zb of each object's center of gravity in a 3D general reference system (GRS), together with the width W and the length L of the object itself. A ground-plane hypothesis is applied to perform 2D-into-3D transformations from the image plane into the GRS.
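A minimal sketch of such a 2D-into-3D transformation under the pin-hole model, assuming a calibrated camera with intrinsic matrix K and world-to-camera pose (R, t) (all values below are illustrative, not the system's actual calibration):

```python
import numpy as np

def backproject_to_ground(u, v, K, R, t):
    """Intersect the viewing ray of pixel (u, v) with the ground plane
    Z = 0 of the general reference system (GRS), under the pin-hole
    model and the ground-plane hypothesis.

    K: 3x3 intrinsic matrix; (R, t): world-to-camera rotation and
    translation.
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d = R.T @ ray_cam              # ray direction in world coordinates
    c = -R.T @ t                   # camera centre in world coordinates
    s = -c[2] / d[2]               # ray parameter where the ray meets Z = 0
    return c + s * d

# Toy calibration (hypothetical values): focal length 500 px,
# principal point (320, 240), camera axes aligned with the world axes,
# ground plane Z = 0 lying 5 m in front of the camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
p = backproject_to_ground(320.0, 240.0, K, R, t)   # principal ray
```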

However, the tracking task is not the focus of this paper and will not be examined further in the following.

3.4 Dynamic object recognition and behaviour understanding

The overall purpose of a visual surveillance system is to provide an accurate description of a dynamic scene. To do this, an effective interpretation of the dynamic behaviour of 3D moving objects is required. A set of object features, extracted from the input images, is matched against geometric features projected from the object models.

In particular, regular moment invariant features are considered to characterize the object shape. Each detected blob in the binary image B(x,y) represents the silhouette of a mobile object as it appears in a perspective view from an arbitrary viewpoint in the 3D scene, and the 3D object is constrained to move on the ground plane. Since the viewpoint is arbitrary, the position, the size and the orientation of the 2D blob can vary from image to image. A large set of possible perspective shapes of multiple 3D object models, e.g., cars, lorries, buses, motorcycles, pedestrians, etc., has been considered.

In the next Sections, the proposed solution for the particular kind of classifier that has been chosen, and for the features that will be used as input to the classifier itself, will be presented, followed by some experimental results obtained in the classification task.

4. THE NEURAL NETWORKS BASED MOVING OBJECTS CLASSIFICATION

4.1 The choice of the classifier

The choice of the best classifier for the considered application should be constrained by the trade-off among the training time, the required memory, the computational complexity and the classification time, in addition to the probability of success. For the considered application, the classification should be performed as fast as possible, in order to guarantee the real-time behaviour of the system; for this reason we have considered a three-layer Multilayer Perceptron with the backpropagation learning rule: this classifier needs a long off-line training step, but it is able to process data quickly [1].

This structure has been chosen because it allows supervised learning, with each input vector having a corresponding known target output. The difference between the network's actual output and the target is computed in order to determine the error. The weights in the network are then updated, according to the backpropagation algorithm, to minimize the maximum modulus of the error. Learning is deemed to be finished when the error at the output has reached a visible minimum when plotted against training time (number of presentations of the training set), that is, when the weights converge to a particular solution for the training set. For the choice of the number of layers and of units in the hidden layers, no specific rule exists: it must be performed on the basis of acquired experience.

In our specific case, the inputs of the network correspond to a particular set of features computed on each detected blob and significant for the particular task, while the outputs correspond to the clusters into which the blobs must be classified: three for the first level (people, vehicles and other) and two for the second level (uniformed personnel and civilian people). The particular perceptron configuration we have chosen consists of 20 neurons in the hidden layer (see Fig. 4).

[Figure 4 shows a fully connected three-layer feed-forward network: m input units, a hidden layer of n units and three output units.]

Figure 4: Neural network for classification (n = 20)

The training set has been composed of a large number of patterns representative of the classes, while tests have been performed using different initializations of the network. The parameters used in the initialization influence the weight updating during the training; they are briefly explained in the following.

The parameters used for initializing the neural network are the following:

learning rate: the weights of the network are updated by means of the following relation:

    w_jk^(p+1)(s) = w_jk^p(s) - eta * dE_p/dw_jk(s) + alpha * [w_jk^p(s) - w_jk^(p-1)(s)]

where E_p is the global error performed by the network at step p;


w_jk^p(s) represents the weight value between units j and k at layer s and training step p; eta represents the learning rate; alpha represents the momentum.

It is possible to notice that the learning rate plays an important role in the convergence of the algorithm, as the previous equation shows; in fact, it represents the weight updating step: the smaller the learning rate, the slower and usually more precise the training process. Nevertheless, using too small a learning rate introduces a risk, because the algorithm may converge to a local minimum. The momentum parameter has been introduced in order to ease the convergence of the algorithm, while the weight decay influences the speed at which the weights not influential for the training are set to zero.

During the training process, the weight initialization represents a critical issue, because a bad initial weight set could make the training process too slow or generate a large error. For this reason, it is better to consider different trials corresponding to different initial weight values; at the end, only the initial set that provides the best results is retained. At the beginning of the training process the weights assume random values, and the training stops according to the following criterion:

    if (1 / (nc * np)) * sum_{c=1..nc} sum_{p=1..np} |y_c(p) - t_c(p)| <= threshold, then STOP

where nc = number of clusters, np = number of patterns, y_c(p) = desired output for pattern p, t_c(p) = output for pattern p obtained with the current weight values.

At this point, the last step to be faced is the selection of the feature set used as input to the neural classifier; it is presented in the following section.

4.2 The set of features to be used

In order to recognise each observed blob, the Multilayer Perceptron has been trained with Hu moments, which are invariant to rotation, translation and scale changes [2]. Let f1, ..., f7 be these invariant moments:

    F1 = m_2,0 + m_0,2

    F2 = (m_2,0 - m_0,2)^2 + 4 m_1,1^2

    F3 = (m_3,0 - 3 m_1,2)^2 + (3 m_2,1 - m_0,3)^2

    F4 = (m_3,0 + m_1,2)^2 + (m_2,1 + m_0,3)^2

    F5 = (m_3,0 - 3 m_1,2)(m_3,0 + m_1,2)[(m_3,0 + m_1,2)^2 - 3 (m_2,1 + m_0,3)^2]
         + (3 m_2,1 - m_0,3)(m_2,1 + m_0,3)[3 (m_3,0 + m_1,2)^2 - (m_2,1 + m_0,3)^2]

    F6 = (m_2,0 - m_0,2)[(m_3,0 + m_1,2)^2 - (m_2,1 + m_0,3)^2]
         + 4 m_1,1 (m_3,0 + m_1,2)(m_2,1 + m_0,3)

    F7 = (3 m_2,1 - m_0,3)(m_3,0 + m_1,2)[(m_3,0 + m_1,2)^2 - 3 (m_2,1 + m_0,3)^2]
         - (m_3,0 - 3 m_1,2)(m_2,1 + m_0,3)[3 (m_3,0 + m_1,2)^2 - (m_2,1 + m_0,3)^2]

where m_p,q = v_p,q / (v_0,0)^b, with b = (p + q)/2 + 1, represents the normalized central moment, and

    v_p,q = sum_{(x,y) in Bi} (x - x0)^p (y - y0)^q I(x, y)    (p, q = 0, 1, 2, ...)

is computed on the area Bi.

A particular comment has to be reserved for the choice of the measure I(x,y) associated with each pixel of the image, which should be a sort of luminosity index for the pixel itself, the luminosity of a pixel having been considered as a discriminant criterion for the distinction between vehicles and people (e.g., humans have a different reflectivity coefficient with respect to vehicles). In previous works [3,4], which were limited to grey-level image processing, it was straightforward to use the grey level of each pixel as the reference luminosity value, but in our case this is not possible, because of the vectorial nature of the luminosity values in color images.

As scalar luminosity index we have then selected, for the first network, the Y coefficient of the YUV color space [5], which well represents a luminosity index in color images. The case of the second network is different: in order to recognise uniformed personnel within the wider and more general class of people, a-priori knowledge about the particular uniform color (e.g., in most cases orange) has been taken into account, and this has led us to consider as scalar luminosity index the H (hue) value of the HLS (Hue-Luminance-Saturation) space [5].

The normalized central moments used in our case are then equal to:

    v_p,q = sum_{(x,y) in Bi} (x - x0)^p (y - y0)^q Y(x, y)    (p, q = 0, 1, 2, ...)

for the first network and to:

    v_p,q = sum_{(x,y) in Bi} (x - x0)^p (y - y0)^q H(x, y)    (p, q = 0, 1, 2, ...)

for the second network.

The pattern for the i-th detected object is composed as follows: p(x,y) = [f1, f2, f3, f4, f5, f6, f7], where the functions f1, ..., f7 are computed on the blob Bi. A set of feature vectors extracted from several models representing different objects, taken from different viewpoints, is used as the patterns of the training procedure.
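As an illustration, the normalized central moments and the first two invariants could be computed as follows (the luminosity-weighted centroid and the standard luma weights for Y are our assumptions; F3 to F7 follow the same pattern):

```python
import numpy as np

def normalized_central_moment(p, q, I, mask):
    """m_{p,q}: central moment over the blob area Bi, weighted by the
    luminosity measure I(x, y), normalized by v_{0,0}^b with
    b = (p + q)/2 + 1 as in the paper."""
    ys, xs = np.nonzero(mask)
    w = I[ys, xs]
    x0 = np.sum(xs * w) / np.sum(w)   # luminosity-weighted centroid (assumed)
    y0 = np.sum(ys * w) / np.sum(w)
    nu = np.sum((xs - x0) ** p * (ys - y0) ** q * w)
    nu00 = np.sum(w)
    b = (p + q) / 2.0 + 1.0
    return nu / nu00 ** b

def hu_f1_f2(I, mask):
    """First two Hu invariants; F3..F7 are built the same way."""
    m20 = normalized_central_moment(2, 0, I, mask)
    m02 = normalized_central_moment(0, 2, I, mask)
    m11 = normalized_central_moment(1, 1, I, mask)
    f1 = m20 + m02
    f2 = (m20 - m02) ** 2 + 4.0 * m11 ** 2
    return f1, f2

def luma(rgb):
    """Y component of YUV, with the standard luma weights."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

# Toy blob: a uniform 4x4 square patch in an 8x8 image
rgb = np.full((8, 8, 3), 128.0)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
f1, f2 = hu_f1_f2(luma(rgb), mask)
```

For a symmetric blob such as this square, F2 vanishes, while F1 measures the luminosity-weighted spread of the blob about its centroid.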

    After the object has been recognized, a new Multilayer

    Perceptron trained with different dynamic features of the

    same


object, taken from different consecutive frames, has been employed to generate system alarms. Each pattern is composed of information about the recognized object class and about the estimated object speed and position in each frame of the sequence. Each reference model is characterized by a set of parameters that are specific to the behaviour class to which the object belongs.
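Both perceptrons are trained with the backpropagation rule with momentum described in Section 4.1; a minimal sketch on toy data follows (network sizes, learning rate and momentum are illustrative values, and biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-in data: 2 inputs, 1 output (the real networks take the
# blob features as inputs and have one output unit per class).
X = rng.normal(size=(40, 2))
T = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

n_in, n_hid, n_out = 2, 20, 1          # 20 hidden units, as in the paper
W1 = rng.normal(scale=0.5, size=(n_in, n_hid))
W2 = rng.normal(scale=0.5, size=(n_hid, n_out))
dW1_prev = np.zeros_like(W1)           # w(p) - w(p-1), for the momentum term
dW2_prev = np.zeros_like(W2)
eta, alpha = 0.2, 0.8                  # learning rate and momentum (assumed)

errors = []
for step in range(200):
    H = sigmoid(X @ W1)                # forward pass
    Y = sigmoid(H @ W2)
    errors.append(np.mean((Y - T) ** 2))
    dY = (Y - T) * Y * (1.0 - Y)       # backward pass, squared-error loss
    dH = (dY @ W2.T) * H * (1.0 - H)
    gW2 = H.T @ dY / len(X)            # mean gradients dE/dw
    gW1 = X.T @ dH / len(X)
    # w(p+1) = w(p) - eta * dE/dw + alpha * [w(p) - w(p-1)]
    dW2_prev = -eta * gW2 + alpha * dW2_prev
    dW1_prev = -eta * gW1 + alpha * dW1_prev
    W2 += dW2_prev
    W1 += dW1_prev
```

The momentum term reuses the previous weight increment, smoothing the descent exactly as in the update relation of Section 4.1.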

The main advantage of such a feature set is its independence from 3D knowledge, so that it can be used without the need for a different training depending, for example, on the particular zoom setting of the acquiring camera.

    5. EXPERIMENTAL RESULTS

Experimental results have been obtained by using a set of different sequences representing a pilot entrance access, under different illumination and traffic conditions, in order to obtain a significant validation of the proposed classification approach.

Let us recall that the surveillance system must be able to detect moving objects, localize and recognise them, and interpret their behaviour in order to prevent possibly dangerous situations, e.g. one or more pedestrians moving in an area completely devoted to vehicle traffic.

Images have been selected from a database of sequences acquired with different pan, tilt and zoom settings of the same color video camera; in this paper, particular results about the recognition task are presented.

The performance provided by the examined surveillance system, in terms of object classification and discrimination capabilities, has been measured through the percentage of correct object recognition. For example, if within a certain sequence N areas relevant to pedestrians present in the scene have been detected, then the percentage of correct object recognition is computed as:

    Perc = (PR / N) * 100

where PR indicates the number of times in which the persons have been correctly detected.

In the following table, the average values of the correct recognition percentage are presented for each sequence, both for the discrimination among object classes and for the discrimination, among pedestrians, between civilian people and municipality personnel:

                 PERCENTAGE OF OBJECTS    PERCENTAGE OF CIVILIAN
                 RECOGNITION              DISCRIMINATION
    SEQUENCE 1   94%                      84%
    SEQUENCE 2   98%                      86%
    SEQUENCE 3   89%                      88%
    SEQUENCE 4   92%                      81%
    SEQUENCE 5   87%                      83%
    SEQUENCE 6   94%                      84%
    SEQUENCE 7   88%                      87%
    SEQUENCE 8   92%                      85%

CONCLUSIONS

A surveillance system for detecting dangerous situations at a road entrance access has been presented. The system is based on the use of Multilayer Perceptron neural networks to perform both object classification and scene understanding. The average correct classification rate between people and vehicles is equal to 90%, while the correct recognition percentage for uniformed personnel within the more general class of pedestrians is equal to 85%.

ACKNOWLEDGEMENTS

The present work has been partially supported by the European Commission under ESPRIT contract no. 28494 AVS-RIO (Advanced Video Surveillance cable television based Remote Video surveillance system for protected sites monitoring).

REFERENCES

[1] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, UK, 1996.

[2] M.K. Hu, Visual pattern recognition by moment invariants, IEEE Trans. on Information Theory, Vol. 8, 1962, pp. 179-187.

[3] G.L. Foresti, A neural tree based image understanding system for advanced visual surveillance, Advanced Video Based Surveillance Systems, Kluwer Academic Publishers, 1998, pp. 117-129.

[4] J.E. Hollis, D.J. Brown, I.C. Luckraft and C.R. Gent, Feature vectors for road vehicle scene classification, Neural Networks, Vol. 9, No. 2, 1996, pp. 337-344.

[5] B. Furht, S.W. Smoliar, H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.