IT 15 007
Degree project (Examensarbete), 30 hp
February 2015

Multi-person tracking system for complex outdoor environments

Cristina-Madalina Tanase

Department of Information Technology (Institutionen för informationsteknologi)


Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Phone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Multi-person tracking system for complex outdoor environments

Cristina-Madalina Tanase

This thesis presents research in the domain of modern video tracking systems and the details of the implementation of such a system. Video surveillance is an area of high interest, and it relies on robust systems that interconnect several critical modules: data acquisition, data processing, background modeling, foreground detection and multiple object tracking. The present work analyzes different state-of-the-art methods suitable for each module. The emphasis of the thesis is on the background subtraction stage, as the final accuracy and performance of the person tracking depend dramatically on it. The experimental results show the performance of four different foreground detection algorithms, including two variations of self-organizing feature maps for background modeling, a machine learning technique. The undertaken work provides a comprehensive view of the current state of research in foreground detection and multiple object tracking, and offers solutions for common problems that occur when tracking in complex scenes. The data set chosen for the experiments covers extremely different and complex scenes (outdoor environments) that allow a detailed study of the appropriate approaches and emphasize the weaknesses and strengths of each algorithm. The proposed system handles problems such as dynamic backgrounds, illumination changes, camouflage, cast shadows, frequent occlusions and crowded scenes. The tracking obtains a maximum Multiple Object Tracking Accuracy of 92.5% for the standard video sequence MWT and a minimum of 32.3% for an extremely difficult sequence that challenges every method.

Printed by (Tryckt av): Reprocentralen ITC, IT 15 007
Examiner (Examinator): Ivan Christoff
Subject reader (Ämnesgranskare): Cris Luengo
Supervisor (Handledare): Hongyu Li


Contents

1 Introduction
  1.1 Overview
  1.2 Background
    1.2.1 Existing systems
    1.2.2 Existing methods
  1.3 Problem formulation and proposed solution
2 Background Modeling
  2.1 Overview
  2.2 Gaussian mixture model (GMM)
  2.3 Block-based classifier (BBC)
  2.4 Adaptive self-organizing background (SOM)
3 Pre and Post Processing Methods
  3.1 Noise reduction
  3.2 Morphological operations
  3.3 Shadow removal
4 Tracking Objects
  4.1 Optical flow tracking methods
  4.2 Lucas-Kanade-Tomasi method
    4.2.1 Feature Tracking
    4.2.2 Feature Detection
    4.2.3 Feature Selection
  4.3 Tracking for surveillance systems
5 Experimental Results
  5.1 Data set
  5.2 Key technologies
  5.3 Method and implementation
    5.3.1 Video Handler Module
    5.3.2 Image Processing Module
    5.3.3 Background Subtraction Module
    5.3.4 Feature Tracking Module
  5.4 Background segmentation
    5.4.1 Accuracy metrics for foreground detection
    5.4.2 Qualitative results
    5.4.3 Method 1: GMM
    5.4.4 Method 2: BBC
    5.4.5 Method 3: SOM
    5.4.6 Accuracy results
  5.5 Pre and post processing
    5.5.1 Noise reduction and morphological operations results
    5.5.2 Shadow detection results
  5.6 Tracking
    5.6.1 Accuracy metrics for tracking
    5.6.2 Qualitative results
    5.6.3 Accuracy results
6 Conclusion

References


1. Introduction

This chapter gives a short overview of current methods used in the process of multiple object tracking, including references to existing systems and popular algorithms. Moreover, it introduces the problem of foreground detection and tracking, and explains the solution chosen for reaching this goal.

1.1 Overview

Video tracking is an active field of research concerned with the process of locating objects in video footage taken with a camera. This procedure is of great interest in fields like surveillance, security, human-computer interaction, video communication and compression, augmented reality, traffic control, medical imaging and video editing.

The purpose of video tracking is to identify target objects in consecutive video frames. Associating detected foreground regions can be difficult when the objects move fast relative to the frame rate. Video tracking is an expensive process due to the amount of data contained in the video, and its complexity increases when object recognition techniques are essential to the tracking. Multiple object tracking is a subfield of video tracking that deals with more subtle problems, such as following more than one object in time and space even when occlusions occur or the relative positions of the objects change. In surveillance applications it is important to register the trajectories of humans, their collisions, their stagnant phases, and the moments when they enter or leave the frame. Outdoor conditions increase the difficulty of the problem by making the foreground and background harder to segment, due to variations in light intensity and in the sizes and orientations of the shadows.

The most important subcomponent of a tracking system is foreground detection. The results of this stage greatly influence the ability of the system to track objects. An accurate segmentation of the foreground is a challenge in itself: depending on the complexity of the scene and the background's dynamics, this procedure can vary from simple to extremely complex. The study of the existing methods and their possible improvements for different sets of data is an interesting purpose and is detailed in Chapter 2.

The background of the present thesis consists of several methods that have been successfully applied in projects with similar goals. Some of the methods are specific to image analysis, such as filtering, morphological transformations, edge detection, segmentation and classification. All these methods are adapted for video frames and are compiled into more advanced intelligent systems that are able to identify and track objects in different ways: adaptive modeling of the background [40, 7], feature tracking [27], kernel-based tracking [11], contour tracking [19, 25], and visual feature matching [37].

1.2 Background

This section presents the background of the video tracking field and shows examples of existing systems and of the most popular algorithms and methods.

1.2.1 Existing systems

Both commercial and open-source human tracking applications exist. The current trend around the world is to combine different tracking methods to suit the particular problems of each setting. In general, a tracking system is composed of several components, as shown in Figure 1.1.

Figure 1.1. General description of multiple object tracking systems

This is a brief list of existing systems that perform video tracking by employing various techniques and technologies:

• POM (developed by the Computer Vision Laboratory, CVLAB) uses a generative model of background subtraction to estimate the positions of people in an individual time frame. [13]

• World-Z map (developed by the Nara Institute of Science and Technology, NAIST) performs 3D people detection and tracking with a World-Z map from a single stereo camera. [45]

• HumanEva: synchronized video and motion capture dataset and baseline algorithm for the evaluation of articulated human motion. [39]

• The Reading People Tracker (developed by Nils T. Siebel in the European Framework V research project ADVISOR) is an automatic visual surveillance system for crime detection and prevention, using state-of-the-art image processing algorithms. [38]

Current applications use multiple cameras [13] under different angles and positions to obtain more information about the tracked persons. In this way occlusion is no longer a problem, but complexity is added by the fact that an object has to be identified as the same one in frames taken from different points of view of the scene.

The current trend is not only tracking people, but also detecting their body position or their accessories (e.g. a backpack) [24, 17]. Research has also reached the point where systems can detect an individual in different settings and recognize a person who has been detected before. Systems today can reconstruct the position of a person in 3D space and reproduce it using simplified models [2]. Still, the problems of long-term occlusions, stationary persons and poor lighting remain important and hard to solve in single-camera situations.

1.2.2 Existing methods

The research undertaken in the area of video analysis, and multiple object tracking in particular, has resulted in a comprehensive overview of the available methods and technologies. This section presents a walk-through of the most important methods employed in modern systems or under current research.

The goal of the study is to find out how to manipulate the different kinds of information present in the video samples in order to successfully track the objects, in our case all persons present in the frame. Different tracking modalities have been used in the literature, the main processed data being visual or sound. We focus on the visual aspect, which is constituted by color, contours, motion and faces. These categories further expand into face detection, face recognition, foreground detection, background estimation, optical flow, frame-by-frame difference and feature points. A very intuitive and simple approach is to track persons using their colors, as some studies suggest. Different methods of filtering and tracking based on color have been tested; usually a color mixture model based on Gaussian distributions is used [40]. A metric used when tracking according to the color of an object is the color coherence, which measures the average inter-frame histogram distance of a tracked object [10, 30]. It is assumed that the object histogram remains roughly constant between image frames. This metric has low values if the segmented object keeps similar color attributes, and higher values when the color attributes differ.
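The color-coherence idea can be sketched in a few lines. The following is a minimal illustration in pure Python; the function names and the choice of L1 histogram distance are ours for illustration, not taken from the cited works:

```python
def histogram(pixels, bins=8, max_val=256):
    """Quantize grayscale pixel values into a normalized histogram."""
    counts = [0] * bins
    step = max_val / bins
    for p in pixels:
        counts[min(int(p / step), bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def color_coherence(object_patches, bins=8):
    """Average inter-frame histogram distance (here: L1) of the pixel
    patches belonging to one tracked object in consecutive frames.
    Low values: stable color attributes; high values: changing colors."""
    hists = [histogram(patch, bins) for patch in object_patches]
    dists = [sum(abs(a - b) for a, b in zip(h0, h1))
             for h0, h1 in zip(hists, hists[1:])]
    return sum(dists) / len(dists)

# An object whose appearance does not change between frames scores 0;
# a patch that flips from dark to bright scores the maximum L1 distance, 2.
print(color_coherence([[10, 12, 200, 205]] * 3))              # → 0.0
print(color_coherence([[0, 0, 0, 0], [255, 255, 255, 255]]))  # → 2.0
```

In a tracker, a sudden jump of this metric for one track is a hint that the segmented region no longer corresponds to the same person.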

Finding and tracking contours is another approach to detecting objects in a video [44]. The contour of a person can be found using a standard edge detector [14]. The main problem is that persons' contours often become partly occluded when they pass behind other objects.

Motion is another clue that can be interpreted as the presence of a person in the environment, and it is commonly used in tracking systems. The foreground method has been successfully used in several studies [2, 15, ?, 29]. In most cases a frame-by-frame difference provides a foreground, and these objects constitute the tracked persons. The drawback of this approach is that it cannot detect stationary persons in the frames. A way of solving this problem is the background subtraction method, which differentiates between a clean/neutral background and a frame with people in it.

The foreground method opens a new research path on ways of modeling the background accurately when there is no "clean" scene to use for subtraction. There are pixel-wise and region-wise methods for estimating the background that efficiently produce models capable of adapting to small dynamics of the scene. Classical methods such as classifiers or mixture models compete nowadays with neural networks that are able to successfully identify background pixels even in the most troublesome situations.
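As a toy illustration of pixel-wise background subtraction without a clean reference frame, the sketch below maintains a running-average background model. This is a deliberate simplification of the adaptive models discussed in Chapter 2; the learning rate alpha and the threshold are illustrative values, not the ones used in the thesis experiments:

```python
def update_background(bg, frame, alpha=0.05):
    """Running-average model: bg <- (1 - alpha) * bg + alpha * frame.
    The background adapts slowly, so brief foreground motion does not
    corrupt it, while gradual illumination changes are absorbed."""
    return [(1 - alpha) * b + alpha * f for b, f in zip(bg, frame)]

def foreground_mask(bg, frame, thresh=30):
    """Label pixels that deviate strongly from the model as foreground."""
    return [1 if abs(f - b) > thresh else 0 for b, f in zip(bg, frame)]

# A flat background around intensity 50; a bright "person" covers two pixels.
bg = [50.0] * 6
frame = [50, 52, 200, 200, 49, 51]
print(foreground_mask(bg, frame))   # → [0, 0, 1, 1, 0, 0]
bg = update_background(bg, frame)   # the model drifts only slightly upward
```

Unlike plain frame differencing, this model keeps flagging a person who stops moving, at least until the averaging slowly absorbs them into the background.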

Another successful method of tracking objects is following their optical flow [3]. The optical flow is the motion of all points in the scene relative to the camera, projected onto a 2D plane. By determining the optical flow of the entire image, moving objects can be separated from the background. One way to use optical flow for tracking is to identify a few particular points and then track these features. Finding relevant features is a non-trivial problem.
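The idea of tracking a feature point by its local optical flow can be sketched by solving the Lucas-Kanade normal equations over a small window. This is a minimal grayscale, single-level version; practical trackers (such as the pyramidal Lucas-Kanade-Tomasi method discussed in Chapter 4) add image pyramids and iterative refinement:

```python
def lk_flow(I0, I1, x, y, win=1):
    """Estimate the flow (u, v) at pixel (x, y) between frames I0 and I1
    by least squares over a (2*win+1)^2 window, using the brightness
    constancy constraint Ix*u + Iy*v + It = 0 at every window pixel."""
    sxx = sxy = syy = sxt = syt = 0.0
    for j in range(y - win, y + win + 1):
        for i in range(x - win, x + win + 1):
            ix = (I0[j][i + 1] - I0[j][i - 1]) / 2.0   # spatial gradients
            iy = (I0[j + 1][i] - I0[j - 1][i]) / 2.0
            it = float(I1[j][i] - I0[j][i])            # temporal gradient
            sxx += ix * ix; sxy += ix * iy; syy += iy * iy
            sxt += ix * it; syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:       # aperture problem: gradients all parallel
        return None
    u = -(syy * sxt - sxy * syt) / det   # solve the 2x2 normal equations
    v = -(sxx * syt - sxy * sxt) / det
    return u, v

# A synthetic textured frame I0[j][i] = i*j, shifted right by one pixel.
I0 = [[i * j for i in range(7)] for j in range(7)]
I1 = [[(i - 1) * j for i in range(7)] for j in range(7)]
u, v = lk_flow(I0, I1, 3, 3)   # recovers the true shift: (u, v) ≈ (1, 0)
```

The singular-determinant case is exactly why feature selection matters: a point on a uniform or one-directionally textured region cannot be tracked reliably.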

Based on the extracted data above, certain tracking algorithms are applied. One of the most common is the Kalman filter [43], considered to be the most important method for state estimation; in our case, state estimation is needed for tracking objects. The Kalman filter is an efficient recursive algorithm that estimates the internal state of a linear dynamic system from a series of noisy measurements. Research shows that this filtering technique is very efficient for tracking single and multiple moving objects in video sequences, especially in real-time systems. The filter can use different features such as color, shape, motion and edges.
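As a concrete instance of the predict-correct cycle, the sketch below smooths a noisy 1-D position track with the simplest possible (random-walk) motion model. Trackers in practice use a richer state such as [position, velocity], but the recursion is the same; the noise variances q and r here are illustrative:

```python
def kalman_1d(measurements, q=1e-3, r=1.0):
    """Scalar Kalman filter: predict (state unchanged, uncertainty p grows
    by process noise q), then correct toward each measurement z with the
    Kalman gain k, where r is the measurement noise variance."""
    x, p = measurements[0], 1.0          # initial estimate and variance
    estimates = [x]
    for z in measurements[1:]:
        p = p + q                        # predict step
        k = p / (p + r)                  # Kalman gain in [0, 1]
        x = x + k * (z - x)              # correct step
        p = (1.0 - k) * p                # posterior variance shrinks
        estimates.append(x)
    return estimates

# Noisy observations of an object sitting near position 10:
track = kalman_1d([10.5, 9.5, 10.2, 9.8, 10.1])
print(track[-1])   # final estimate close to 10, smoother than the raw data
```

The gain automatically balances trust in the model against trust in the measurements, which is why the filter works well with the noisy centroids produced by foreground detection.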

Another tracking method is the particle filtering technique with multiple cues [8] [Bra05], such as color, texture and edges, used as observation features; it is a powerful technique for tracking deformable objects in image sequences with complex backgrounds. The number of active objects and the track management are handled by means of the probabilities of the number of active objects in a given frame. These probabilities are estimated using a Monte Carlo data association algorithm.
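A bootstrap (sequential Monte Carlo) particle filter in one dimension illustrates the predict-weight-resample cycle; the multi-cue versions cited above replace the single Gaussian likelihood with a product of color, texture and edge likelihoods. All parameter values here are illustrative:

```python
import math
import random

def particle_filter_step(particles, measurement, motion_std=1.0, meas_std=2.0):
    """One cycle of a bootstrap particle filter for a 1-D position."""
    # Predict: diffuse every particle with the (random-walk) motion model.
    moved = [p + random.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each particle.
    w = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in moved]
    total = sum(w)
    # Resample: draw a new particle set proportionally to the weights.
    return random.choices(moved, weights=[x / total for x in w], k=len(moved))

random.seed(0)                      # deterministic demo
particles = [random.uniform(0.0, 100.0) for _ in range(300)]
for _ in range(5):                  # the tracked object sits near position 30
    particles = particle_filter_step(particles, 30.0)
estimate = sum(particles) / len(particles)
print(round(estimate, 1))           # the posterior mean concentrates near 30
```

Because the posterior is represented by samples rather than a single Gaussian, the filter copes with the multi-modal ambiguities that occlusions and clutter create.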


1.3 Problem formulation and proposed solution

The problem can be formulated simply as the creation of a system that can automatically identify the individuals in a surveillance area and track them from their entrance to their exit from the scene. This formulation represents a need that many surveillance companies have.

The research part of this thesis consists in finding the optimal approach to background modeling, foreground detection and tracking for multiple persons in a given setting, making it possible to uniquely identify all humans in the scene with minimal computational effort and maximum accuracy. By comparing different approaches, we search for simplified models and customized methods that obtain a useful result.

The application consists in implementing the methods found during the research on a set of surveillance footage taken in an outdoor setting, with emphasis on the foreground detection stage. For this purpose three main outdoor sequences are used. The experiment's aim is to detect and track all humans entering or exiting the view, under different lighting and motion conditions.

The framework consists of a Visual Studio project, where several methods relying on mathematical and image analysis libraries are implemented into a final People Tracker system. Several features from a distinct project were used for experimental purposes. The main modules of the People Tracker are: pre- and post-processing, including a shadow detection subsystem; a background modeling and foreground detection module; and finally a tracking module that relies on feature detection.

The framework is open for extension and provides good support for running relevant experiments. The original contribution to the field of multiple object tracking consists in the comparative evaluation of a large set of methods at different stages of processing, and in finding which ones impact the tracking accuracy the most. Even though the complexity of some methods is high, their impact on the final tracking result may be less significant than that of other, simpler, less time-expensive methods. These results, presented in the thesis, are essential for the fine-tuning of a real-time tracking system. Moreover, an intensive study with remarkably good results has been undertaken on novel machine learning techniques for background modeling and foreground detection. These algorithms have the advantage of producing extremely good results even in corner-case situations, but they raise extreme difficulties in setting up the parameters. The present report presents solutions for solving the parameter problems and easily setting up the neural networks for self-organizing backgrounds.

Finally, and possibly most importantly, a fine analysis of extremely complex and diverse scenes has been pursued. Most benchmarks and algorithm results consist of relatively simple scenes with high-quality footage. The current project tests methods for different types of environments that raise various, unexplored problems. The study shows the advantages of each method, but most importantly the weaknesses and flaws under non-standard scene conditions.


2. Background Modeling

Adaptive background estimation is a commonly used method in computer vision, and it can be found in different variations. This chapter presents a short overview and three different algorithms for modeling the background: Gaussian mixture modeling, the block-based background classifier and the adaptive self-organizing background.

2.1 Overview

An attempt at classifying background modeling and foreground segmentation techniques based on the employed algorithms is presented in Figure 2.1.

Figure 2.1. Classification of background modeling methods

Some methods [20, 32] suppose that the background has a predefined distribution, which implies making risky assumptions about the data. The parameter choice can be a sensitive issue, but the complexity of parametric algorithms is lower than that of non-parametric ones; also, the resources used for storage and computation are restricted to a few parameters per pixel or block. On the other hand, non-parametric methods [28, 15] can require a large amount of resources. These methods compensate, though, with great flexibility: they can model arbitrary distributions without any a priori information or assumptions about the data.

A recursive approach [29] evaluates one frame at a time and updates the background model accordingly. A non-recursive approach [11] maintains a buffer of frames, making it possible for the background to adapt better to temporal changes. While the space complexity is significantly higher in this case, the method avoids the persistence of early errors in the background model.
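A minimal example of the non-recursive, buffer-based idea is a per-pixel median over the last n frames (the class name and buffer size are illustrative): a transient foreground object, or an early modeling error, simply ages out of the buffer.

```python
from collections import deque
from statistics import median

class MedianBackground:
    """Non-recursive background model: buffer the last n frames and take
    the per-pixel median, so outliers never persist longer than n frames."""

    def __init__(self, buffer_size=5):
        self.frames = deque(maxlen=buffer_size)   # oldest frame drops out

    def update(self, frame):
        self.frames.append(list(frame))

    def background(self):
        # Per-pixel median across the buffered frames.
        return [median(history) for history in zip(*self.frames)]

bg = MedianBackground(buffer_size=5)
for frame in ([50, 50, 50, 50],
              [50, 200, 50, 50],    # a person briefly covers pixel 1
              [50, 50, 50, 50]):
    bg.update(frame)
print(bg.background())              # → [50, 50, 50, 50]
```

The space cost is the whole buffer per pixel, which is exactly the recursive/non-recursive trade-off described above.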

Depending on how many distributions are needed to estimate the background within a reasonable error interval, methods may employ unimodal or multimodal distributions. The performance difference is that the latter can cope with more complex, dynamic backgrounds, whereas the former cannot. The last distinction is made between pixel-based methods, which treat each pixel's information as an independent input, and block-based methods, which make use of the spatial information of a region, including the proximity of the pixels, in the modeling decision process.

In the vast majority of video surveillance systems, real-time methods for background subtraction and foreground modeling are used, so space and time complexity are considered important factors. The methods employed should be chosen according to the settings and complexity of the scene. The basic methods include static background subtraction, thresholding and chromatic filtering, but they have limited applicability. More advanced techniques include adaptive mixtures of Gaussians for each pixel [21, 20], block-based classifiers with multiple decision factors [32], the kernel density estimation background model [15], the codebook background model [?] and supervised or non-supervised learning of the foreground with neural networks [28, 29]. These advanced techniques can cope with dynamic backgrounds, changes in illumination, background bootstrapping, foreground objects similar to the background (camouflage), and resource limitations.

The chosen methods will be assessed by their capacity to achieve a set of goals. By analyzing the common situations in the data set, a list of desirable features for the results of the tested algorithms was created. For this stage the goals are:

• Mobility: the method should be able to model a background in the presence of occlusions, such as persons traversing the scene.

• Entering and leaving: the method should be able to detect whether a new object has entered or left the scene.

• Adaptability: the method should adapt to changes in the background, such as new static objects being introduced and staying still for a reasonable amount of time.

• Dynamics of the scene: the method should be able to incorporate small variations of the scene into the model: branches moving, slow illumination changes, small shifts of the camera.

• Large objects: the method should create an estimation of the background even in the presence of large objects in the scene (Sequence 1 and Sequence 2 contain frames with persons of significantly different sizes). Occlusions from 3% to 20% of the scene should be reasonably handled.

• Noise handling: the method should not be susceptible to noise.

Due to the diversity of situations encountered in the proposed video sequences, all these goals can be assessed. In the following sections, different approaches to background modeling will be discussed, together with their ability to reach these goals. A more detailed evaluation will be further discussed in Chapter 4.

2.2 Gaussian mixture model (GMM)

The Gaussian mixture background modeling method [40] is a parametric, recursive, pixel-based method that uses a variable mixture of Gaussian distributions. In statistics, a mixture model is a probabilistic model for representing subpopulations within an overall population; formally, it corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. In our case the population is represented by the entire range of pixel values, and the subpopulation by the values that a certain pixel p(x, y) takes in a sequence of n frames (observations). The distribution of each pixel is described by Equation (2.1):

p(θ) = ∑_{i=1}^{K} φ_i N(µ_i, Σ_i)        (2.1)

where:
• θ is the parameter set of the distribution of observations,
• N is the normal distribution with weight φ_i, mean µ_i and covariance matrix Σ_i,
• K is the number of mixture components.

The technique compares each pixel value with the existing distributions in order to classify it as foreground or background. The initial method creates a model for each pixel as an average of all incoming frames. This turns out to be insufficient for modeling a dynamic background that deals with sudden illumination changes or moving fragments of the scene. Further, improved versions of the classical Gaussian representation of the background have been developed. The benefit of the improved method consists in using a mixture of three to five Gaussian distributions for each pixel color. By considering the persistence and the variance of each of the Gaussians in the mixture, the Gaussians that may correspond to background colors are determined. Pixel values that do not fit the background distributions are ignored. The generic algorithm is presented in Figure 2.2.

Figure 2.2. Foreground detection with Gaussian mixture model

The parameters θ of the mixture represent how long a specific color is present in a frame. In this approach, the background colors are assumed to be the ones which stay longer and move little on the scene. To allow the model to adapt to changes in illumination, an update scheme based on selective updating is applied. Every new pixel value is checked against the existing model components in order of fitness. The first matched model component is updated. If no match is found, a new Gaussian component is added, with the mean equal to the pixel's value, a large covariance matrix and a small weighting parameter. Further on, P. KaewTraKulPong and R. Bowden [20] prove that in a crowded environment this approach is insufficient, since the background component might only become dominant very late. They used an Expectation Maximization (EM) algorithm to determine the Gaussian mixture models. The method uses a window of L frames and L-recent window update equations, where the most recent frames have greater priority. In the initialization phase, while waiting for sufficient statistics, another set of equations is used. The algorithm takes four parameters: the length of the history window of the learned background, the number of Gaussian mixtures, the background ratio and the noise strength, and can be easily configured.
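As an illustration, the selective update scheme described above can be sketched per pixel, in the spirit of the classical Stauffer-Grimson formulation. The constants ALPHA, MATCH_SIGMAS and BG_RATIO are assumed values, not parameters from the thesis, and the second learning rate is simplified to the constant ALPHA:

```python
import math

# Illustrative per-pixel sketch of selective Gaussian mixture updating.
# ALPHA, MATCH_SIGMAS and BG_RATIO are assumed values.
ALPHA = 0.05         # learning rate
MATCH_SIGMAS = 2.5   # match threshold, in standard deviations
BG_RATIO = 0.7       # cumulative weight considered background

def new_mixture(value, k=3):
    """Start every component at the first observed pixel value."""
    return [{'w': 1.0 / k, 'mu': float(value), 'var': 100.0} for _ in range(k)]

def update_pixel(mixture, value):
    """Update one pixel's mixture; return True if the value is background."""
    # rank components by fitness w / sigma, most probable first
    mixture.sort(key=lambda g: g['w'] / math.sqrt(g['var']), reverse=True)
    matched = None
    for g in mixture:
        if abs(value - g['mu']) <= MATCH_SIGMAS * math.sqrt(g['var']):
            matched = g
            break
    if matched is None:
        # no match: replace the weakest component with a wide new Gaussian
        weakest = min(mixture, key=lambda g: g['w'])
        weakest.update(w=0.05, mu=float(value), var=900.0)
    else:
        # selective update: only the first matched component adapts
        matched['w'] += ALPHA * (1.0 - matched['w'])
        matched['mu'] += ALPHA * (value - matched['mu'])
        matched['var'] += ALPHA * ((value - matched['mu']) ** 2 - matched['var'])
    total = sum(g['w'] for g in mixture)
    for g in mixture:
        g['w'] /= total                      # renormalise the weights
    # the most probable components covering BG_RATIO of the total weight
    # are taken to model the background
    mixture.sort(key=lambda g: g['w'] / math.sqrt(g['var']), reverse=True)
    acc, background = 0.0, []
    for g in mixture:
        background.append(g)
        acc += g['w']
        if acc > BG_RATIO:
            break
    return matched is not None and any(g is matched for g in background)
```

With this sketch, a pixel that keeps a stable value quickly becomes part of the background model, while a sudden, very different value fails to match any component and is reported as foreground.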

2.3 Block-based classifier (BBC)

The block-based background classifier is a parametric, recursive, block-based method that uses an unimodal distribution. The method employs a simple decision chain [VRe13]. This technique works as a binary classifier distinguishing between background blocks and foreground blocks. As opposed to other techniques which use pixel information, the present one uses contextual information about blocks of pixels. The algorithm consists of four stages: division of a frame into overlapping blocks, classification of each block (foreground or background), background model re-initialization and finally probabilistic generation of the foreground mask, as shown in Figure 2.3.

Figure 2.3. Stages of "Foreground Detection via Block-based Classifier Cascade with Probabilistic Decision Integration" algorithm

The first stage is dividing the image into blocks and creating a low-dimensional descriptor for each block. The important parameters in this stage are the block size and block overlap; their effect will be further discussed in the experimental results section. Stage two consists of a cascade of three classifiers, namely: probability measurement, cosine distance and temporal correlation check, as shown in Figure 2.4.

Figure 2.4. Cascade Classifier for Block Based Classifier for Background Segmentation

The first classifier deals with dynamic backgrounds: different objects in the background that move but are mostly present all the time (such as small shadows or small orientation changes of the camera). It uses a likelihood function of the low-dimensional descriptor d_{i,j} of each block to decide if the block is part of the background, based on a Gaussian model of each selected block:

p(d_{i,j}) = exp{ −(1/2) [d_{i,j} − µ_{i,j}]^T Σ_{i,j}^{−1} [d_{i,j} − µ_{i,j}] } / ( (2π)^{D/2} |Σ_{i,j}|^{1/2} )    (2.2)

Where:
• µ_{i,j} and Σ_{i,j} are the mean and the covariance matrix for location (i, j)
• D is the dimensionality of the descriptors

The classifier decides based on the condition:

p(d_{i,j}) ≥ p( µ_{i,j} + 2 diag(Σ_{i,j})^{1/2} )    (2.3)

If condition 2.3 is satisfied, the block is classified as background.

The second classifier deals with illumination changes and employs a distance measure between the two vectors µ_{i,j} and d_{i,j}, as defined previously, in

cosdist(d_{i,j}, µ_{i,j}) = 1 − (d_{i,j}^T µ_{i,j}) / (‖d_{i,j}‖ ‖µ_{i,j}‖)    (2.4)

The classifier decides based on the condition:

cosdist(d_{i,j}, µ_{i,j}) ≤ C_1    (2.5)

Where C_1 is an empirical value that the authors suggest should be 0.1, so as to ensure a higher probability of classifying background pixels as foreground rather than the other way around. If condition 2.5 is satisfied, the block is classified as background. The third classifier handles the temporal correlations in successive frames, thus eliminating false positives, and it consists of two conditions:

d_{i,j}^{prev} was classified as background    (2.6)

cosdist(d_{i,j}^{prev}, d_{i,j}) ≤ 0.5 C_1    (2.7)


If both conditions are satisfied, the block is classified as background. Model re-initialization is the third stage of the algorithm: it is triggered if about 70% of the image is consistently classified as foreground for a reasonable period of time (e.g., 15 frames). The last step is a correcting procedure that produces a probabilistic foreground mask. Given the fact that one block can contain both foreground and background pixels, after the classification of the blocks a pixel classification is made as well. In this stage a pixel contained in several overlapping blocks will be classified as foreground only if the majority of the blocks which contain it have been classified as foreground.
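As an illustration, the cosine-distance stage (Equations 2.4 and 2.5) can be sketched as follows. The descriptor values are invented for illustration; C_1 = 0.1 follows the value quoted above:

```python
import math

C1 = 0.1  # threshold suggested by the authors

def cosdist(d, mu):
    """Cosine distance between a block descriptor and the model mean (Eq. 2.4)."""
    dot = sum(a * b for a, b in zip(d, mu))
    nd = math.sqrt(sum(a * a for a in d))
    nm = math.sqrt(sum(b * b for b in mu))
    return 1.0 - dot / (nd * nm)

def is_background_block(d, mu):
    # a block whose descriptor keeps (almost) the same direction as the
    # model mean survives global illumination changes (Eq. 2.5)
    return cosdist(d, mu) <= C1

# a uniformly brightened block keeps its direction -> background
print(is_background_block([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))   # True
# a structurally different block -> foreground
print(is_background_block([5.0, 0.1, 0.1], [1.0, 2.0, 3.0]))   # False
```

The cosine distance ignores the magnitude of the descriptor, which is exactly why this stage is robust to illumination changes: brightening the whole block scales the descriptor without rotating it.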

2.4 Adaptive self-organizing background (SOM)

A new and very effective method with a completely different approach is to use a neural network to decide which pixels belong to the background model.

Self organizing feature maps

Artificial Neural Networks are motivated by cognitive science, modeling what happens in the cognitive capabilities of natural (human) systems. Much research has been carried out in the subfield of competitive learning. The most well-known algorithm is the Kohonen network [23], also known as the Self-Organizing Map or SOM, which typically consists of a single layer of inputs completely connected to a single layer of outputs. The general algorithm is presented in Figure 2.5.

Figure 2.5. Algorithm for self-organizing maps

The expected effect is that over time the weight vectors move towards the centers of clusters of input vectors. In the final state (convergence), one weight vector lies over the center of each cluster of input vectors. Kohonen maps or SOMs define a topographic map and a notion of neighborhood for each output unit. In the basic algorithm, all units in the neighborhood of the winner are modified. Output units are typically organized in a grid; the neighborhood then consists of those output units within a given (Euclidean) distance.
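The competitive-learning step described above can be sketched in a few lines. The grid size, learning rate and neighborhood radius below are illustrative assumptions:

```python
import math

def som_step(weights, x, lr=0.5, radius=1):
    """One Kohonen/SOM training step.
    weights: dict mapping grid coords (r, c) -> weight vector."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    # 1. the best matching unit is the weight vector closest to the input
    bmu = min(weights, key=lambda rc: dist(weights[rc], x))
    # 2. pull the BMU and every unit within `radius` grid cells towards x
    for (r, c), w in weights.items():
        if abs(r - bmu[0]) <= radius and abs(c - bmu[1]) <= radius:
            weights[(r, c)] = [wi + lr * (xi - wi) for wi, xi in zip(w, x)]
    return bmu

# a 2x2 output grid; repeated presentation of the same input pulls the
# winning weight vector onto it
w = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0],
     (1, 0): [0.0, 1.0], (1, 1): [1.0, 1.0]}
for _ in range(10):
    bmu = som_step(w, [0.9, 0.9])
print(bmu)  # (1, 1)
```

In a full implementation the learning rate and the radius both shrink over time, so that the map first organizes globally and then fine-tunes locally.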

Initial Algorithm

Self organizing maps have been successfully used to classify pixels into background or foreground regions, by creating a competitive ANN configured as a 2D grid [28]. Each node computes a weighted linear sum of the inputs, represented by the value of the pixel. Each node is described by a weight vector containing all weights connected to it. The entire set of weight vectors represents the background model. The foreground is consequently calculated as the difference between the frame and the background model.

Initially each pixel is represented by a map of 3x3 vectors, each element containing its representation in RGB or HSV. The nine weight vectors for each pixel are initialized with the pixel's initial value, thus constructing an initial background model. After this first stage, the subsequent frames are fed to the network. Each pixel is compared with the nine values from the model in order to determine the best fitting weight vector that describes it. Finding the best matching unit, in our case the weight vector that best describes the pixel, is accomplished by calculating the Euclidean distance (Equation 2.8) between the pixel in HSV representation and each model vector. The choice of the HSV representation is based on the established method used in previous studies.

d(p_i, p_j) = ‖ (v_i s_i cos(h_i), v_i s_i sin(h_i), v_i) − (v_j s_j cos(h_j), v_j s_j sin(h_j), v_j) ‖    (2.8)

d(c_m, p_t) = min_i d(c_i, p_t) ≤ ε    (2.9)

Where:
• v_i, s_i, h_i are the components of pixel i in HSV representation
• c_i is the i-th component of the model
• c_m is the best matching unit of the model
• p_t is the current sample
• ε is a fixed threshold that separates the foreground from the background

If such a matching unit c_m is found, the best matching unit (BMU) and its neighborhood (defined by the neighborhood function) are reinforced as background; otherwise the pixel is considered foreground and is not included in


Figure 2.6. Algorithm for self organizing background

the model. This behaviour can be observed in the pseudo code presented inFigure 2.6.
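The matching step of Equations 2.8 and 2.9 can be sketched in pure Python. The threshold ε below is an illustrative assumption, not a value from the thesis:

```python
import math

EPS = 0.1  # assumed matching threshold (epsilon in Eq. 2.9)

def hsv_point(h, s, v):
    """Map an HSV pixel (h in radians, s and v in [0, 1]) to the
    3D point (v*s*cos(h), v*s*sin(h), v) used by Eq. 2.8."""
    return (v * s * math.cos(h), v * s * math.sin(h), v)

def hsv_dist(p1, p2):
    """Euclidean distance between two HSV pixels (Eq. 2.8)."""
    a, b = hsv_point(*p1), hsv_point(*p2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_background(model, pixel):
    # Eq. 2.9: the pixel is background if its closest model vector
    # is within the threshold EPS
    return min(hsv_dist(c, pixel) for c in model) <= EPS

model = [(0.0, 0.2, 0.5), (0.1, 0.2, 0.5)]
print(is_background(model, (0.0, 0.2, 0.52)))  # small change: True
print(is_background(model, (3.0, 0.9, 1.0)))   # very different: False
```

Mapping HSV to this conical 3D representation before taking the Euclidean distance avoids the wrap-around problem of comparing hue angles directly.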

An improved method

This method is a modified version of the adaptive SOM that uses a fuzzy rule to update the neural network background model [29]. The fuzzy spatial coherence-based self-organizing map for background subtraction (FSOM) is a method whose fuzzy update of the background improves the model's robustness to illumination changes in the scene.

In more detail, this technique is an enhanced SOM algorithm that includes spatial coherence in the background subtraction. The authors define spatial coherence in terms of the intensity difference between locally contiguous pixels: neighboring pixels showing small intensity differences are coherent, and neighboring pixels with high intensity differences are incoherent. The research shows that including the spatial coherence of tracked objects when comparing with background pixels ensures robustness against false positives.

Moreover, changes have been made to the update phase of the background model. This method introduces an automatic and data-dependent mechanism for reinforcing the background model in future steps. The decision function is now a fuzzy rule-based procedure. Fuzzy set theory offers an appropriate way of representing knowledge and uncertainty (the threshold in the background model), which creates a very flexible decision system for the existing neural network. Concretely, learning factors are calculated at run time and then included in the update rule of the system. This further addition to the algorithm is presented in Figure 2.7.


Figure 2.7. The algorithm for self organizing background with fuzzy update rule


3. Pre and Post Processing Methods

In order to detect the foreground we previously calculated an adaptive background model. In the ideal case the background would be stationary and the model would reflect any change that happens in the scene.

After the first stage of foreground detection, a mask is obtained which contains fragments of all moving objects. Not all these objects represent actual silhouettes of persons; they also include shadows and "ghosts". Ghosts are false detected objects that appear when the extracted background is not accurate and reactive [12]. Real objects, their shadows and ghosts can all be observed in the results of foreground detection.

The aim is to improve this result by reducing the noise and also detecting the shadows and the ghosts in order to eliminate the false positives. To achieve this, and to create a more accurate mask for extracting the foreground, we will apply morphological operations as well as shadow removal techniques. The final goal is to determine which information resulting from these methods can be usefully combined to accurately detect the foreground.

The goals at this stage are:
• Creating a smooth input image for the background modeling stage
• Creating low-noise results
• Enhancing the foreground masks by creating smoother edges and compact object masks
• Reducing the artifacts from the foreground masks

The pre- and post-processing stages can be identified in Figure 3.1.

Figure 3.1. Preprocessing and post-processing stages in the video tracking system

3.1 Noise reduction

The initial frames will be smoothed in a preprocessing phase by employing two methods.

Clip

Clipping [31] is the simplest segmentation method. Unfortunately, our images are far too complex to be segmented with this technique. However, we can use a type of clipping in the preprocessing phase to reduce the very dark colors in the scene and to create a smoother transition between the color levels. For this purpose, a truncate-type clipping is used, which keeps only the first two thirds of the color spectrum, according to the following formula:

dst(i, j) = { threshold, if src(i, j) < threshold
            { src(i, j), otherwise    (3.1)

This transformation is important because it reduces the difference between background shadows and the background itself.
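Under the reading of Equation 3.1 above (pixels darker than the threshold are raised to it), the clipping can be sketched in one line; the threshold value is an illustrative choice:

```python
THRESHOLD = 85  # illustrative cut-off on an 8-bit scale

def clip_dark(row):
    """Raise every pixel darker than THRESHOLD to THRESHOLD (Eq. 3.1),
    so very dark colours such as faint shadows merge with the background."""
    return [max(v, THRESHOLD) for v in row]

print(clip_dark([10, 84, 85, 200]))  # [85, 85, 85, 200]
```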

Blur

The blur operation is a simple filtering procedure performed with the purpose of reducing the noise level. The function smooths an image using the kernel h(k, l) according to the formula:

new_image(i, j) = \sum_{k,l} src(i + k, j + l) h(k, l)    (3.2)


The normalized box filter blur outputs pixels with the mean of their kernel neighbors (all of them contribute with equal weights):

K = 1 / (K_{height} K_{width}) ·
    [ 1 ⋯ 1
      ⋮ ⋱ ⋮
      1 ⋯ 1 ]    (3.3)

The result is a smoothed image, with less noise. The normalized blur filter does not use the best kernel (the Gaussian kernel gives better results), but it is the fastest and produces acceptable results for our purposes.
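The normalized box filter of Equations 3.2-3.3 can be sketched for a single-channel image with a 3x3 kernel, ignoring the border for simplicity:

```python
def box_blur(img):
    """Apply a normalised 3x3 box filter; border pixels are left unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            # every kernel neighbour contributes with equal weight 1/9
            s = sum(img[i + k][j + l] for k in (-1, 0, 1) for l in (-1, 0, 1))
            out[i][j] = s / 9.0
    return out

img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
print(box_blur(img)[1][1])  # 1.0 -- the spike is spread over the kernel
```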

3.2 Morphological operations

As part of the post-processing of the grayscale foreground, morphological operations such as erosion and dilation are applied. These morphological operations [36], although very simple, become very important when we deal with noise. This technique is used especially when we try to create a foreground mask. The problem is that the foreground mask that we can obtain for the test images contains only fractions of the objects, due to a variety of factors: the top-down view, shadows, a low frame rate that affects the update of the background model, and finally the fact that the floor has a dark color very similar to the color of the shadow. In this case, we can extend the foreground fraction with a morphological operation, so that the mask will contain the entire surface of the object.

Erosion

The erosion function is defined as probing an image with a predefined structuring element and extracting conclusions on how this element fits or misses the shapes in the original image. Let E be a Euclidean space and A a binary image in E. The erosion of the binary image A by the structuring element B is defined by:

A ⊖ B = { z ∈ E | B_z ⊆ A }    (3.4)

Where B_z is the translation of B by the vector z. We erode the entire image with an ellipse structuring element of size 5x5 pixels. The result of this operation is a reduction of the noise and of small objects that are not important for our edge detection, since we are looking for much bigger objects in the scene.


Dilation

The dilation operation also uses a structuring element, for probing and expanding the shapes contained in the input image. Let E be a Euclidean space or an integer grid, A a binary image in E, and B a structuring element. The dilation of A by B is defined by:

A ⊕ B = ⋃_{b ∈ B} A_b    (3.5)

We use the same structuring element, an ellipse of the same 5x5 size, to dilate objects, so that the identified structures appear clearer and with smoother edges.

Morphological opening

In fact, the succession of the two steps, erosion followed by dilation, one performed with a structuring element and the other with a mirrored version of the same structuring element, is called in the literature morphological opening, and it is formally defined as:

A ∘ B = (A ⊖ B) ⊕ B^T    (3.6)

The aim of the opening operation is to remove small objects (usually noise) from the foreground of an image, placing them in the background, while closing removes small holes in the foreground, changing small islands of background into foreground.
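The three operations can be sketched on a binary mask. A 3x3 square structuring element is used here instead of the 5x5 ellipse mentioned above, purely to keep the example short:

```python
def erode(mask):
    """A pixel survives only if the whole 3x3 element fits inside the shape."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = int(all(mask[i + k][j + l]
                                for k in (-1, 0, 1) for l in (-1, 0, 1)))
    return out

def dilate(mask):
    """A pixel is set if any neighbour under the 3x3 element is set."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = int(any(mask[i + k][j + l]
                                for k in (-1, 0, 1) for l in (-1, 0, 1)
                                if 0 <= i + k < h and 0 <= j + l < w))
    return out

def opening(mask):
    # erosion followed by dilation removes isolated noise pixels while
    # restoring the bulk of larger blobs (Eq. 3.6)
    return dilate(erode(mask))
```

Applied to a mask containing a solid 3x3 blob plus one isolated pixel, the opening deletes the isolated pixel and keeps the blob, which is exactly the behavior wanted for noisy foreground masks.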

3.3 Shadow removal

Shadow removal is an important step for improving object detection and tracking. Different methods exploit different characteristics of the image, including: chromacity, physical properties, geometry and textures of large or small areas [35, 9].

The targeted shadows are of two types:
• Low-dynamics shadows are attached to elements of the background (e.g. shadows of trees). These shadow elements behave differently from the foreground or background, having a lower motion than the foreground but higher than the background, due to the composite dynamics of both the object that produces them and the illumination changes.


Figure 3.2. An overview of useful characteristics used for shadow detection

• High-dynamics shadows are attached to elements of the foreground (e.g. the cast shadow of a walking person). In this case the shadow can be as large as the foreground object itself, and it is usually classified as foreground.

Chromacity based method

Chromacity is a measure of color that is independent of intensity. The shadow detecting methods [12] that use this measure assume that shadow regions preserve their chromacity at the transition between frames. This is why such methods usually use color models that differentiate better between color and intensity, like HSV, YUV or normalized RGB.

This method analyses pixels in the HSV (Hue, Saturation, Value) color space. Three characteristics of the shadow region were observed:
• pixels in the shadow have a lower intensity value than the other background pixels (V)
• the hue is not changed on a background region that was covered by shadow (H)
• empirically, the saturation is decreased by shadow (S)

If a pixel value transits in a manner that respects these three conditions, then it is classified as a shadow pixel. The evaluation is not made pixel-wise though, but rather in a 5x5 observation window; this way the noise problem is reduced.
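These three conditions can be sketched as a per-pixel test, in the spirit of the classical HSV shadow detector of Cucchiara et al. The numeric bounds below are illustrative assumptions, not values from the thesis:

```python
# Assumed bounds for the three HSV shadow conditions.
ALPHA, BETA = 0.4, 0.93    # how much a shadow may darken V
TAU_H, TAU_S = 30.0, 0.15  # tolerated hue / saturation change

def is_shadow(fg, bg):
    """fg, bg: (h, s, v) with h in degrees, s and v in [0, 1]."""
    h_f, s_f, v_f = fg
    h_b, s_b, v_b = bg
    darker = ALPHA <= v_f / v_b <= BETA                            # V lower
    hue_ok = min(abs(h_f - h_b), 360 - abs(h_f - h_b)) <= TAU_H    # H unchanged
    sat_ok = (s_f - s_b) <= TAU_S                                  # S not raised
    return darker and hue_ok and sat_ok

# a darker pixel with the same hue over a known background -> shadow
print(is_shadow((100, 0.30, 0.40), (105, 0.35, 0.70)))  # True
# a pixel with a completely different hue -> real foreground
print(is_shadow((240, 0.80, 0.65), (105, 0.35, 0.70)))  # False
```

The lower bound ALPHA prevents very dark foreground objects from being mistaken for shadow, while BETA < 1 requires an actual drop in intensity.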


Geometrical method

The geometrical method [34] employs physical characteristics of the shadow and has two main steps. The first step is detecting the object, consisting of the contour of the person and its shadow, and determining its orientation. Keeping in mind that the target objects are persons, a high peak and a low peak of each individual blob are found, representing the head of the person and the projection of the head on the ground, which belongs to the shadow. In this particular implementation the orientation of an object is estimated from the properties of the object moments.

The second step is separating the shadow region from the actual silhouette region, done by finding the gravity center of each shadow-person pair, which is considered the point where the shadow begins. This is just an initial approximation of the shadow pixels, based on their geometrical features. Every suspect shadow pixel (below the gravity point) is included in a Gaussian model that is updated with every new shadow candidate. Each pixel initially classified as shadow is then checked against the Gaussian model and finally classified as foreground or background.

Large region texture

The large region texture shadow detection method [34] is based on the assumption that the regions under the shadow preserve the same texture as the surface they are projected on. The algorithm consists of two steps:
• First, candidate regions for shadow classification are selected.
• Second, the selected region's texture is compared, in turn, with that of the foreground and that of the background, eliminating the regions that fall in the second category.

For the first step, the implemented method uses chromaticity and intensity to detect candidate pixels for the shadow regions. Further on, all connected shadow-pixel candidates are considered to form a shadow region. In the second step the candidate texture is correlated with the object texture and also with the background. Shadow regions are expected to have a high correlation. For each shadow candidate region the gradient magnitude and the gradient direction are calculated at each point. If the correlation value is greater than an empirical threshold, the region is considered shadow and removed from the foreground. Although more computationally expensive, it is preferable to use large regions, because they are more likely to contain significant textures than small areas, which can easily be affected by noise.
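The correlation test of the second step can be sketched with gradient-magnitude vectors; the threshold value is an illustrative assumption:

```python
import math

CORR_THRESHOLD = 0.9  # assumed empirical threshold

def correlation(a, b):
    """Normalised correlation between two gradient-magnitude vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def is_shadow_region(candidate_grad, background_grad):
    # a shadow only darkens the surface, so its gradient pattern stays
    # proportional to the background's -> correlation close to 1
    return correlation(candidate_grad, background_grad) >= CORR_THRESHOLD

# a uniformly darkened copy of the background keeps its texture -> shadow
print(is_shadow_region([0.5, 1.0, 0.25], [2.0, 4.0, 1.0]))  # True
```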


4. Tracking Objects

This chapter presents the chosen method for tracking objects in a video stream, starting from a general description of optical flow tracking methods and ending with suitable variations of the algorithms for surveillance applications.

4.1 Optical flow tracking methods

Optical flow is defined as the motion of a set of points relative to the camera and the scene, projected on a 2D image plane. By determining the optical flow of the entire image, moving objects can be identified and tracked. Sequences of consecutive frames allow the estimation of motion as instantaneous image velocities or discrete image displacements. There are four major directions of research in the field of optical flow [3], as presented in Figure 4.1.

Figure 4.1. Modalities of detecting optical flow

Even though the methods are diverse, they share a three-step strategy, as in Figure 4.2: preprocessing the image, usually with a low-pass filter; extraction of features such as spatio-temporal derivatives; and finally, the correlation of these features in a two-dimensional flow field.

Probably the most explored approach is using differential methods for estimating the optical flow. Based on partial derivatives of the image signal or the


Figure 4.2. General steps for identifying optical flow

sought flow field and higher-order partial derivatives, these methods create a trajectory of objects in motion, which is essential for surveillance systems [4]. One requirement for this type of algorithm is to employ a differentiable velocity function. The first order derivatives represent the image translation, while the second order derivatives give indications about the 2D velocity of the object. Region-based matching techniques define velocity as a shift function that best describes the correlation between two regions at different points in time. This method implies finding an adequate similarity measure and a function that maximizes it for the two regions. It can be employed even when the differential methods fail, due to high rates of noise or the impossibility of defining a differentiable function because of the lack of smoothness of the space. Energy or frequency-based methods are a different category of optical-flow detectors that use a Fourier transform capable of translating 2D patterns. It is shown that all translating objects (2D patterns) produce a plane in the frequency space that intersects the origin, so they can be identified with the use of a Fourier transform.

The phase-based techniques define velocity through the phase behavior of the signal produced after applying a band-pass filter to the frames. Band-pass filters can decompose the input signal with respect to features relevant for optical flow detection, such as scale, orientation and speed. This method is very similar to the first one, since it uses first and second-order derivatives applied to the phase, as opposed to the intensity of the signal. Based on the above mentioned study [3], the most efficient methods in a wide variety of input conditions are the first-order local differential technique proposed by Lucas and Kanade [5] and the local phase-based method proposed by Fleet and Jepson [16].

The concept of feature point tracking and its application to tracking moving objects or persons is essential for surveillance systems, and it represents the core of the tracking system. The goal is to identify relevant local features in the video sequence, also referred to in the literature as corners or points of interest. The features that are ultimately selected are those which are easily recognizable and which do not change with movement and rotation.

Figure 4.3. The general steps for feature detection stage

The overall process can be generalized as in Figure 4.3: a recursive function that updates the set of detected interest points and connects them into a meaningful feature set for tracking the target objects.

4.2 Lucas-Kanade-Tomasi method

The KLT tracker was introduced in 1991 by Tomasi and Kanade and is based on a feature matching, differential algorithm developed in 1981 by Lucas and Kanade. The method is still one of the most popular in the field, due to its robustness and its capability of both detecting important features and producing an optical flow. Shi and Tomasi significantly improved the method by including an algorithm for detecting good features to track, such as the corners of objects. We describe the algorithms employed in the KLT tracker in the current section.

This method assumes that the optical flow in a three by three window of pixels is constant, meaning that the center pixel and all its neighbors have the same motion. Spatial intensity information along with time information is included in a system of nine equations, called the basic optical flow equations. These equations are solved according to the Lucas-Kanade method to satisfy the least squares criterion. The least squares solution gives the same importance to all eight neighboring pixels in the window. In practice it is better to give more weight to the pixels that are closer to the central pixel.


An important part is finding good features to track [37] that will make the optical flow detection robust to noise and spatial deformations. A widely used approach is finding the most prominent corners in the image or in a specified image region, as described in the above mentioned paper. "Corner", "interest point" and "feature" describe the same concept, namely a well-defined position that can be robustly detected, where two dominant and different edge directions exist in a local neighborhood of the point.

4.2.1 Feature Tracking

The initial algorithm proposed by Lucas and Kanade does not include a feature detection stage: its only goal is to track a template frame by frame, a procedure known as registration. The understanding of the algorithm relies entirely on the formulas that model the complex changes of the image intensities. The tracking algorithm and the formulas presented in the current section are derived from [41, 4, 5].

An image is defined in the present context as a function with spatial and temporal variables. By defining a displacement function d = (ξ, η) as the motion of the point X = (x, y) between time instants t and t + τ, we can formally state a formula that correlates two consecutive frames:

I(x, y, t + τ) = I(x − ξ, y − η, t)    (4.1)

To be more concise, the model that will be further used as the local image model does not contain the time variable, but introduces a noise component n(x), as in 4.2.

J(x) = I(x−d)+n(x) (4.2)

The aim is to find an adequate displacement vector d, so that the model is robust to frame transitions. Thus, the algorithm becomes a minimization problem for the error ε:

ε = ∬_W [J(x + d) − I(x)]² ω(x) dx    (4.3)

The double integral is applied over a window W, J is the displaced location from the original image I, and ω is a weighting function. The desired minimization can be done by the Newton-Raphson method [39], by differentiating with respect to d and then searching for zeros. When J(x + d) is approximated by its first order Taylor expansion J(x) + g^T(x) d, this differentiation can be determined:

ε ≈ ∬_W [J(x) + g^T(x) d − I(x)]² ω(x) dx    (4.4)


The vector g represents the gradient of J(x):

g(x) = ( ∂J(x)/∂x , ∂J(x)/∂y )^T    (4.5)

The limitation of this approach is that the first order Taylor expansion used to approximate J(x + d) does not always provide at each iteration a better approximation of d than the previous one, especially in the case of displacements that are large relative to the size of the window. The solution would be choosing a larger window, at the cost of losing precision. As mentioned, the algorithm's aim is to minimize the error, which is equivalent to setting the derivative of ε to zero. This yields:

∬_W [J(x) − I(x) + g^T(x) d] g(x) ω(x) dx = 0    (4.6)

If we consider the following notations:

G = ∬_W g(x) g^T(x) ω(x) dx    (4.7)

e = ∬_W [I(x) − J(x)] g(x) ω(x) dx    (4.8)

Equation 4.6 becomes the tracking equation:

Gd = e (4.9)

The tracking equation must be solved at each iteration; it is a system of two equations with two unknowns, represented by the two entries of d:

G = [ ∬_W g_x² dx      ∬_W g_x g_y dx
      ∬_W g_x g_y dx   ∬_W g_y² dx ]    (4.10)

In order to implement the algorithm, we need a discretization of the equation. The solution in the discrete space, for the two unknowns, by the least squares criterion is:

[ d_1 ]   [ Σ_W g_x²     Σ_W g_x g_y ]⁻¹ [ −Σ_W g_x g_t ]
[ d_2 ] = [ Σ_W g_x g_y  Σ_W g_y²    ]   [ −Σ_W g_y g_t ]    (4.11)

The last formula also incorporates the gradient along time, which was omitted in the initial expression for simplicity. In practical implementations the parameters are set iteratively, either by varying the window size or by varying the resolution of the images. The latter solution fixes the window size and builds resolution pyramids of the frames [6].
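Equation 4.11 can be implemented directly once the gradient sums over the window are available. The gradient samples in the example below are toy values, chosen so that the true displacement can be verified by hand:

```python
def lucas_kanade(gx, gy, gt):
    """Solve the 2x2 system G d = e of Eq. 4.11.
    gx, gy, gt: lists of gradient samples over the window W."""
    gxx = sum(x * x for x in gx)
    gxy = sum(x * y for x, y in zip(gx, gy))
    gyy = sum(y * y for y in gy)
    ex = -sum(x * t for x, t in zip(gx, gt))
    ey = -sum(y * t for y, t in zip(gy, gt))
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        return None  # G is singular: the window has no corner structure
    # Cramer's rule for the 2x2 system
    d1 = (ex * gyy - gxy * ey) / det
    d2 = (gxx * ey - gxy * ex) / det
    return d1, d2

# gradient samples consistent with a true displacement of (0.5, -0.25),
# since gt = -(gx * 0.5 + gy * (-0.25)) under the linearised model
d = lucas_kanade([1, 0, 2, 1], [0, 1, 1, 2], [-0.5, 0.25, -0.75, 0.0])
print(d)  # (0.5, -0.25)
```

The singularity check mirrors the discussion in the next section: when G has a (near-)zero eigenvalue, the window contains an edge or a flat region and the displacement cannot be recovered reliably.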


4.2.2 Feature Detection

The tracking algorithm above does not include feature detection; instead, it is applied and iterated on all regions of the image, which makes it a weak tool for practical purposes. An optimization emerges from detecting the image areas that contain motion information. A successful strategy has proven to be the selection of image regions with a rich texture that provide motion components in two directions.

To achieve this goal, several methods attempt to detect trackable features such as corners, regions with high spatial frequency content, or regions where second-order derivatives are present. The fitness of a feature can be hard to predict, which is why a robust mathematical model has been created to extend the feature tracking equation presented above. Observations show that for a feature to be easy to track, the matrix G must have large eigenvalues. Therefore, features are chosen at the locations where the eigenvalues are largest and at least above a predefined threshold λ:

\min(\lambda_1, \lambda_2) > \lambda    (4.12)

The threshold λ is chosen arbitrarily from the interval [λmin, λmax]. The lower bound λmin is calculated from the eigenvalues of image regions of uniform brightness. The upper bound is calculated from regions with various types of features, such as corners and highly textured regions.
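Since G is a symmetric 2x2 matrix, its smaller eigenvalue has a closed form, and the criterion in 4.12 is a one-line check. A small illustrative sketch (function names are ours):

```python
import math

def min_eigenvalue(gxx, gxy, gyy):
    """Smaller eigenvalue of the symmetric 2x2 matrix G = [[gxx, gxy], [gxy, gyy]]."""
    mean = 0.5 * (gxx + gyy)
    # Eigenvalues of a symmetric 2x2 matrix are mean +/- radius.
    radius = math.sqrt((0.5 * (gxx - gyy)) ** 2 + gxy ** 2)
    return mean - radius

def is_trackable(gxx, gxy, gyy, lam):
    """Criterion 4.12: accept the feature when min(lambda1, lambda2) > lambda."""
    return min_eigenvalue(gxx, gxy, gyy) > lam
```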

4.2.3 Feature Selection

Not all detected features are relevant to the tracking system, so a thorough selection takes place at this stage. A first strategy for identifying whether a feature qualifies as adequate is to calculate its dissimilarity measure. This measure can be computed either as the intensity difference between the corresponding windows that contain the feature in consecutive frames, or by using an affine transformation model [7]. Instead of comparing the first and the current frame using a plain windowed difference, the following equation is considered:

\varepsilon = \iint_W \big[ J(Ax + d) - I(x) \big]^2\, \omega(x)\, dx    (4.13)

This expression is much more general, because it compares the window from the original frame with the window of the transformed feature in the current frame, by employing a transformation matrix A. This improvement makes it easier to exclude useless features from the tracking system, such as features that have left the view of the camera or whose object has changed due to rotation or other deformations.

Other ways of selecting features from the list of tracked interest points that do not imply heavy computations have been proposed. A first method is imposing a stopping criterion for the iterative algorithm; this is motivated by the fact that if the algorithm does not converge within a certain number of iterations, the feature has most likely been occluded. Another method is assessing the value of the determinant of G: small values indicate that the system of equations cannot be reliably solved, which means that the feature point is lost. Finally, if the window exceeds a set bound while reaching towards the edges of the frame, the feature point has left the scene, so the feature is dropped. The Shi-Tomasi algorithm calculates the corner quality measure at every pixel using the minimal eigenvalue of the gradient matrices representing intensity regions of the image. Next, a non-maximum suppression is performed (only the local maxima in a 3x3 neighborhood are retained). The corners whose eigenvalue passes the threshold are kept, while the weak ones are discarded. In the last phase, the corners are sorted according to their quality measure, and a final decision is made to keep only the strongest corners within a defined area.

Figure 4.4. Tracking results for Sequence 1 (set distance 10)

4.3 Tracking for surveillance systems

The goal is to apply the tracking algorithm to surveillance footage, so a discussion of the practical implementation and set-up of the system is in place. The videos for tracking persons have several characteristics:

• Videos contain more than one trackable object
• Persons in videos do not have monotonous trajectories
• Persons might occlude each other while passing through the scene
• Persons enter and exit the scene frequently
• Objects rotate or change their shape in a 2D plane

Ideally, the system is expected to identify each person entering, exiting or already present in the scene with a sufficient number of interest points and to mark it. Further on, the mark should follow the person throughout his activity in the scene. No feature points should be detected on the background or on areas that do not represent persons.


In essence, there are two critical stages in the tracking sub-system: 1. initializing the feature points correctly for all persons; 2. removing incorrect or redundant feature points from the scene. The generic logical steps undertaken for the above-mentioned purposes can be seen in Figure 4.5.

Figure 4.5. Pseudocode for feature points tracking algorithm

We observe that at each iteration we need the previously tracked targets and a foreground mask of the frame. The foreground extraction and processing have been discussed in Chapter 2 and Chapter 3. The KLT tracking algorithm and the Shi-Tomasi good features to track have been implemented in various libraries, including OpenCV. In our experiments we employed the GoodFeaturesToTrack function, which implements the algorithm previously described in order to obtain strong corners. This step is preliminary to the actual tracking session and is used for initializing the start points. The important parameters of this method are:

• maxCorners - depending on this setting, the algorithm can return a varying number of strong feature points. If we have an intuition about the number of tracked objects in the scene, we can set this parameter accordingly; only the strongest points within this limit will be returned.

• qualityLevel - determines the minimal accepted quality of image corners. The parameter value is multiplied by the best corner quality measure, which is the minimal eigenvalue; corners with a quality measure less than the product are rejected.

• minDistance - also depends on the size and density of the objects to be tracked. If the objects are fairly small or cluttered in the scene, the distance should be small, as opposed to large, wide-apart objects.

• blockSize - size of the average block for computing the derivative covariation matrix over each pixel neighborhood.


The locations of the feature points and the future targets are refined using cornerSubPix(), a function which iterates to find the sub-pixel accurate location of corners or radial saddle points. The tracking itself is fulfilled in OpenCV by the CalcOpticalFlowPyrLK() method, which is an implementation of the Lucas-Kanade algorithm. The most important parameters are:

• prevPts - vector of 2D points for which the flow needs to be found; point coordinates must be single-precision floating-point numbers.

• nextPts - output vector of 2D points containing the calculated new positions of the input features in the following frame.

• winSize - size of the search window at each pyramid level.

• maxLevel - maximal pyramid level number.

• criteria - specifies the termination criteria of the iterative search algorithm (it stops after the specified maximum number of iterations or when the search window moves by less than a minimum value).

We will analyze the impact of the most important parameters in Chapter 5:Experimental results.


5. Experimental Results

This chapter presents the technologies used in developing the system and, more importantly, discusses the data sets and the results obtained for each set of algorithms. For each experiment, relevant quality measures are defined. The chapter contains individual results for background segmentation, pre- and post-processing methods and tracking methods, along with combined results from employing compatible methods for the best tracking results.

5.1 Data set

In order to observe the behavior and the performance of each implemented method, we chose to use three different data sets. These video sequences emphasize the weaknesses and strengths of both the classical methods and the neural network approaches.

The experimental results are relevant for qualitative and quantitative observations about the employed algorithms and also provide a good base of knowledge about setting up the SOMs. The choice of data sets enables the detailed study of the proposed methods and gives a valuable indication of the sensitivity of each parameter and the ranges in which these should be initialized under different circumstances: type of scene, type of tracked blobs, noise tolerance, etc.

For experimental purposes we used the CAVIAR Test Case Scenarios, which consist of video clips created on July 11, 2003 and January 20, 2004.

The CAVIAR project offers a collection of video clips of different scenariosand scenes: people walking alone, meeting with others, window shopping,entering and exiting shops, fighting and passing out.

Sequence 1

The first section of video clips was filmed for the CAVIAR project with a wide-angle camera lens at half-resolution PAL standard (384 x 288 pixels, 25 frames per second) and compressed using MPEG2. The file size we used for our experiments is 12 MB.

For the first comparison we used MeetWalkTogether2.mpg (MWT2) from the data set, which is an outdoor setting with bright natural illumination, in which two persons meet and walk together. We will analyze the detection accuracy of the proposed methods under several sets of parameters.

Sequence 2

The second set of data consists of footage taken from a surveillance camera at the entrance of a pool in Jiading, Shanghai. The video clips are 720x576 pixels, 15 frames per second and compressed using AVI. The file size we used for our experiments is 3.5 MB.

For the second comparison we used a sequence of an outdoor setting, with both bright and dim illumination, in which several persons enter and exit the scene. The sequence is quite complex, since some of the persons present are holding large objects in their hands, run, stop or change trajectories. In addition, the illumination changes on several occasions and the footage is quite noisy. This sequence is good input data for showing the weaknesses and strengths of each method.

Sequence 3

The third sequence has been obtained from the TUIO Scene [18] project and it represents a public space in Portugal. The video clip is 320x240 pixels, 29 frames per second and compressed using AVI. The file size we used for the experiments was 1.4 MB.

Due to the intense daylight, the sequence contains large regions of shadow, which represent a good case study for shadow detection techniques in particular. The sequence contains a varying number of persons walking and skating, as well as cars in motion.

5.2 Key technologies

The key technologies used are the C++ programming language and OpenCV (Open Source Computer Vision Library) [26]. OpenCV provides high-level functions for capturing video, working with images, and performing computer vision related calculations. The following modules are available:

• core - a compact module defining basic data structures, including the dense multi-dimensional array Mat, and basic functions used by all other modules.

• imgproc - an image processing module that includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, generic table-based remapping), color space conversion, histograms, and so on.

• video - a video analysis module that includes motion estimation, background subtraction, and object tracking algorithms.

• calib3d - basic multiple-view geometry algorithms, single and stereo camera calibration, object pose estimation, stereo correspondence algorithms, and elements of 3D reconstruction.

• features2d - salient feature detectors, descriptors, and descriptor matchers.

• objdetect - detection of objects and instances of predefined classes (for example, faces, eyes, mugs, people, cars, and so on).

• highgui - an easy-to-use interface to video capturing, image and video codecs, as well as simple UI capabilities.

Another library that has been used in one of the implementations of the background subtraction methods is Armadillo [1]. Armadillo is a C++ linear algebra library aiming towards a good balance between speed and ease of use. The syntax is similar to Matlab. The following features are available:

• Integer, floating point and complex numbers are supported

• Various matrix decompositions are provided through optional integration with LAPACK

• A delayed evaluation approach is employed (at compile time) to combine several operations and to reduce (or eliminate) temporaries; this is automatically accomplished through template meta-programming

• The library is open-source software, distributed under a license that is useful in both open-source and proprietary contexts

5.3 Method and implementation

The software application was developed for the purpose of testing the methods detailed in Chapter 2. It was implemented in C++ under the Visual Studio 2010 environment with the aid of several image processing and mathematical libraries, such as OpenCV and Armadillo. The general architecture of the tracking system consists of several interconnected modules, as shown in Figure 5.1:


Figure 5.1. Main modules of the tracking system

5.3.1 Video Handler Module

The Video Handler Module is responsible for reading the video footage, which can be compressed with various technologies, and stores it in a cv::VideoCapture object that is used for extracting the frames and sending them to the different modules in cv::Mat format. The display of the intermediate and final results of each method is also performed by this component.

5.3.2 Image Processing Module

The image processing module uses OpenCV methods for performing blur filtering, low-pass filtering and morphological operations. The shadow detectors used for the comparative experiments are adapted from [33] [43], which also use the OpenCV library for basic operations. Each shadow detector is implemented in a different class, so removeShadow instantiates one object for each method.


Figure 5.2. Class diagram for Processing Module

5.3.3 Background Subtraction Module

The most extensive work has been done on the development of the background subtraction module, since it provides three options for obtaining the background model. In practice, we use a BackgroundSubtractor object that is extended by three different classes overriding the perform method. By default, the type of the BackgroundSubtractor is set to 0 and the "perform" method has a basic implementation that considers the background model to be the initial frame of the video sequence. As in all cases, this background model is subtracted from the original frame, thus resulting in a foreground mask.

If the type of the BackgroundSubtractor is 1, then the object performs as defined in the GMM class. The implementation uses the OpenCV mixture-of-Gaussians method cv::BackgroundSubtractorMOG, which is described in Section 2.2. If the type is 2, the background subtractor uses an adaptation of the code [42] provided by the authors of the article [32]. This module uses Armadillo-defined matrices and employs the mathematical operations provided by this specialized library. The third type is the self-organizing background model, which was implemented with the aid of the open source code of the project Scene [18]. All the implementations are adapted for the C++ implementation of the current application.


Figure 5.3. Class diagram for Background Subtraction Module

5.3.4 Feature Tracking Module

The feature tracking module contains a single class, OpticalFlowDetector, that performs KLT according to the OpenCV implementation of the pyramidal algorithm, but it is also extendable to other implementations.

5.4 Background segmentation

This section presents the qualitative and quantitative results of the experimented foreground detection methods, based on a set of accuracy metrics.

5.4.1 Accuracy metrics for foreground detection

For measuring accuracy we adopted classic metrics: precision, recall and F-measure. Recall is the detection rate, which gives the percentage of detected true positives compared to the total number of true positives in the ground truth. For our experiments on Sequence 1, we calculate the recall by frame, which means that we count the identified foreground blobs and compare them with the observations made in the original frames.

\text{Recall} = \frac{TruePositives}{TruePositives + FalseNegatives}    (5.1)

Sometimes a method detects not only foreground objects, but also other background regions, which are considered false positives. In order to measure the frequency of this situation, we use another accuracy metric: precision.

\text{Precision} = \frac{TruePositives}{TruePositives + FalsePositives}    (5.2)

Also known as positive prediction, this metric gives the percentage of detected true positives compared to the total number of items detected by the method. A composed metric is the F-measure, defined as:

F = \frac{2 \times recall \times precision}{recall + precision}    (5.3)

5.4.2 Qualitative results

Experimental results for moving person detection using four different approaches have been produced for several image sequences. We analyze and describe two different sequences that represent typical situations critical for video surveillance systems, and present qualitative results obtained with all methods and different parameters.

5.4.3 Method 1: GMM

The most important parameters in this particular method are the number of Gaussian mixtures used, which determines the sensitivity of the detector, the background threshold and the noise variance. Sensitivity determines the responsiveness to changes in the background: low values enhance the detection of objects in the scene, but also make the model more sensitive to noise. The background threshold determines which of the distributions correspond to background pixels; high values are adequate for simple backgrounds, while lower values are recommended for complex backgrounds with moving objects. Noise variance sets the minimum value of the variance for the Gaussian models, and higher values are recommended for videos with noisy images. While the parameters are important for understanding the method workflow, their variation has little impact in critical situations such as sudden illumination changes or camouflage. As we can observe in Figure 5.4, varying the background threshold produces little or no difference in the result. In any case, the background model includes artifacts caused by the small dynamics of the scene objects. It is important to observe that the foreground mask is strictly dependent on the chromatics of the scene, so only those fragments of the person's image that are significantly different can be classified as foreground.
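As an illustration of the update scheme, the following is a heavily simplified single-pixel, grayscale sketch of one online mixture-of-Gaussians step (a single learning rate, a fixed variance for new components; names and constants are ours, not the OpenCV implementation used in the system):

```python
def gmm_update(pixel, mixture, alpha=0.05, match_sigmas=2.5, bg_threshold=0.8):
    """One online update of a per-pixel mixture of Gaussians (grayscale sketch).
    mixture: list of [weight, mean, variance] components, modified in place.
    Returns True if the pixel is classified as background."""
    matched = None
    for g in mixture:
        _, mu, var = g
        # A pixel matches a component if it lies within match_sigmas deviations.
        if (pixel - mu) ** 2 < (match_sigmas ** 2) * var:
            matched = g
            break
    if matched is None:
        # No match: replace the least probable component with a wide new Gaussian.
        mixture.sort(key=lambda g: g[0])
        mixture[0] = [alpha, float(pixel), 900.0]
    else:
        # Pull the matched component towards the observed value (simplified rate).
        matched[1] += alpha * (pixel - matched[1])
        matched[2] += alpha * ((pixel - matched[1]) ** 2 - matched[2])
    # Decay all weights, boost the matched one, then renormalize.
    for g in mixture:
        g[0] = (1 - alpha) * g[0] + (alpha if g is matched else 0.0)
    total = sum(g[0] for g in mixture)
    for g in mixture:
        g[0] /= total
    # The heaviest components, up to the background threshold, form the background.
    mixture.sort(key=lambda g: -g[0])
    cumulative, background = 0.0, []
    for g in mixture:
        cumulative += g[0]
        background.append(g)
        if cumulative > bg_threshold:
            break
    return matched in background
```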

Figure 5.4. GMM performed on Sequence 1: (a) initial scene (b) GMM performed with a high threshold of 0.8 (c) GMM performed with a low threshold of 0.2


Figure 5.5. GMM performed on three consecutive frames of Sequence 2: (a,b) blob similar to the background (c,d) background artifacts (e,f) partial foreground detection

Sequence 2 was chosen to perform experiments on for its high rate of noise. The mixture of Gaussians deals especially well with this issue, approximating the background in an accurate way. As we can observe in Figure 5.5, we obtain weak quality results in two situations of major importance: slow illumination changes and camouflage. In the three consecutive frames the light changes slightly, so we progressively detect more false foreground pixels from the floor, shadows and umbrella, even though the changes are not sudden. This problem could not be fixed by fine tuning the parameters, because the method is not robust to complex light changes. Another, even greater problem that many foreground detection techniques still encounter is the camouflage situation: when objects from the foreground share the same chromatic display as the background. In this particular situation, a foreground pixel will find a Gaussian distribution that samples it without exceeding the threshold, which falsely classifies it as a background pixel.

5.4.4 Method 2: BCC

The entire set of parameters and their impact on the results is discussed in detail in [32]. We mention the most important ones: number of classifiers, block advancement and block size. For the best results, all three classifiers must be used, as the authors suggest.

We ran experiments in order to determine, for the given data sets, how block size and advancement impact the quality of the results.

The advancement of the blocks represents how much two adjacent blocks overlap. This parameter influences the first and last steps of the algorithm: the division of the image into blocks (foreground and background) and the elimination of false positives for pixels that are included in the overlap area and are classified as both background and foreground. For instance, a pixel contained in several overlapping blocks will be classified as foreground only if the majority of the blocks which contain it have been classified as foreground. We can observe how sensitive the results are to this parameter and how the behavior of the method depends on the input data in Figure 5.6 and Figure 5.7. For a scene not affected by noise, the overlap works rather like a morphological opening of the blobs, making them contiguous shapes, which might be useful. On the other hand, on noisy inputs, the overlap might introduce a significant amount of false positives into the foreground, as in Figure 5.7. The accuracy of the algorithm is also lost, mostly because boundaries become thicker and harder to classify.

Note that the sitting man in the scene is correctly classified as background, as he does not change his position in this particular sequence of frames. The experiment shows that the block size should be proportional to the blobs of the foreground objects; for all input data, a size of 8x8 pixels proved to be best.
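The pixel-level majority vote over overlapping blocks described above can be sketched as follows. The callback `block_is_fg` stands in for the per-block classifier ensemble, whose details are in [32]; all names are ours:

```python
def pixel_labels(width, height, block_size, step, block_is_fg):
    """Majority vote over overlapping blocks: a pixel is foreground only if
    most of the blocks covering it were classified as foreground.
    block_is_fg(bx, by) is the per-block classifier decision (hypothetical)."""
    votes_fg = [[0] * width for _ in range(height)]
    votes_all = [[0] * width for _ in range(height)]
    # Slide the block with the given advancement (step); smaller steps overlap more.
    for by in range(0, height - block_size + 1, step):
        for bx in range(0, width - block_size + 1, step):
            fg = block_is_fg(bx, by)
            for y in range(by, by + block_size):
                for x in range(bx, bx + block_size):
                    votes_all[y][x] += 1
                    if fg:
                        votes_fg[y][x] += 1
    # Strict majority of covering blocks decides the pixel label.
    return [[votes_fg[y][x] * 2 > votes_all[y][x] for x in range(width)]
            for y in range(height)]
```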


Figure 5.6. BCC for Sequence 1, frame 217: the variation of the overlap. (a) original frame (b) block size = 8, overlap = 2 (c) block size = 8, overlap = 4 (d) block size = 8, overlap = 6

Figure 5.7. BCC for Sequence 3, frame 80: the variation of the overlap. (a) original frame (b) block size = 8, overlap = 2 (c) block size = 8, overlap = 4 (d) block size = 8, overlap = 6


Figure 5.8. BCC for Sequence 3, frame 80: the variation of the block size. (a) original frame (b)

5.4.5 Method 3: SOM

Setting up a self-organizing map can be a difficult process. In the present case, the authors of the initial paper propose a basic configuration based on empirical results. The number of weight vectors can vary without any major quality impact, so all the experiments have been executed with a 3x3 neural network, i.e. 9 weight vectors.

The distance thresholds are parameters that indicate how easily the algorithm should decide whether a pixel is included in the background model. The method uses two such values: one for the calibration phase and one for the online phase. The former should be higher, since we assume that the first frames are a good approximation of the background, while the latter should be more restrictive.

The number of sequence frames for the calibration phase is an important parameter for modeling the background. If the initial frames correspond to a clean scene, without moving foreground objects, a small number should be enough, such as 5. On the contrary, if the clean background appears later in the video, a higher value should be set. In the case of Sequence 1, we observed that no such clean sub-sequence is available, so the calibration phase may include some false background regions that will persist even in the online phase. If the initial scene is crowded with people, a higher value for the distance threshold should be employed in the calibration phase.

The learning factor should ideally be set according to the dynamics of the scene. If such information cannot be set a priori, a standard value of 1 can be used.
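The matching and update step of the self-organizing background model can be illustrated with a minimal per-pixel sketch (grayscale values, winner-only update; the full algorithm also updates the map neighbors of the winning weight with a smaller rate, which is omitted here; names are ours):

```python
def som_classify_and_update(pixel, weights, distance_threshold, learning_rate):
    """One step of a per-pixel self-organizing background model (grayscale sketch).
    weights: list of model values for this pixel (e.g. 9 for a 3x3 map).
    Returns True when the pixel matches the background model."""
    # Best matching unit: the weight closest to the incoming pixel value.
    best = min(range(len(weights)), key=lambda i: abs(pixel - weights[i]))
    if abs(pixel - weights[best]) <= distance_threshold:
        # Background: pull the winning weight towards the pixel value.
        weights[best] += learning_rate * (pixel - weights[best])
        return True
    # Foreground: the model is left untouched.
    return False
```

The distance_threshold parameter plays the role of the two thresholds discussed above: a larger value during calibration absorbs pixels into the model more easily, while a smaller one in the online phase keeps the classification restrictive.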

Figure 5.9. SOM on Sequence 1, frame 950: (a) original image (b) background model (c) foreground mask


The difference between a small sensitivity and a high sensitivity of the distance threshold parameter (in the online phase) can be observed in Figure 5.10. The SOM method is robust against noise and also succeeds in dealing with slow illumination changes, in sequences where other methods fail.

Figure 5.10. SOM on Sequence 1, frame 950: (a) original image (b) background model (c) foreground mask


Figure 5.11. Results for SOM: (a) initial image (b) background model (c) foreground mask

5.4.6 Accuracy results

The experiments executed on Sequence 1 show that all methods have a high recall rate (Table 5.1). The difference appears in the speed with which each method updates its model. The false negatives appear in this case only in the transition frames, when someone enters or exits the scene. Sequence 1 has few such situations, so errors correlated with this matter are also few. The recall is very high for the SOM algorithms, since the experiments were run, according to the explanations above, under suitable parameter configurations.

The false positives impact the value of the precision metric, which we can observe has lower values for all studied methods. We considered any detected blob larger than 4x4 pixels to be a false positive if that area was not contained in a foreground object. That is why any artifacts produced by sudden illumination or by the dynamics of the background count as false positives, to the detriment of the precision value.
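The counting rule above can be sketched with a simple connected-component pass (4-connectivity flood fill; the function name and the binary ground-truth representation are ours, purely for illustration):

```python
def count_false_positive_blobs(mask, ground_truth, min_size=4):
    """Count connected foreground blobs whose bounding box is at least
    min_size x min_size and which do not touch the ground-truth foreground."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    false_positives = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood-fill one 4-connected component starting at (x, y).
            stack, pixels = [(x, y)], []
            seen[y][x] = True
            while stack:
                cx, cy = stack.pop()
                pixels.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            xs = [p[0] for p in pixels]
            ys = [p[1] for p in pixels]
            # Blob counts only when its bounding box reaches min_size in both axes.
            big = (max(xs) - min(xs) + 1 >= min_size and
                   max(ys) - min(ys) + 1 >= min_size)
            inside_gt = any(ground_truth[py][px] for px, py in pixels)
            if big and not inside_gt:
                false_positives += 1
    return false_positives
```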

F-measure, as expected, is affected by both false positives and false negatives, yielding smaller values. In the case of SOM and FSOM no visible differences occurred, so we can conclude that both have remarkable results in this case.

Table 5.1. Accuracy results for all studied methods on Sequence 1

Method   Recall   Precision   F-measure
GMM      0.90     0.83        0.86
BCC      0.92     0.80        0.85
SOM      0.95     0.86        0.90
FSOM     0.95     0.86        0.90

Sequence 2 is more challenging for all methods, since all use an adaptive background model. As opposed to Sequence 1, in this video there are many entries and exits, and the foreground objects occupy a larger fraction of the scene, which makes transition background models prone to errors. The illumination varies from bright to very dim, which makes the presence of shadows a real problem. We used the composed metric, the F-measure, to compare the results of all methods on both sequences in Table 5.2. The results are considerably worse on the second sequence, but we observe that the SOM algorithms are more robust against noise and sudden changes. The F-measure value is significantly better for SOM than for the other two methods, producing steadily more accurate foreground masks, in spite of the noisy frames and the high density of foreground objects.

Table 5.2. F-measure results for all studied methods on Sequence 1 and Sequence 2

Method   Sequence 1   Sequence 2
GMM      0.86         0.62
BCC      0.85         0.70
SOM      0.90         0.80
FSOM     0.90         0.80

A detailed set of results for all accuracy metrics for the SOM algorithm is presented in Table 5.3.

Table 5.3. Accuracy results for the SOM method on Sequence 1 and Sequence 2

Sequence     Recall   Precision   F-measure
Sequence 1   0.95     0.86        0.90
Sequence 2   0.83     0.78        0.80

5.5 Pre and post processing

This section presents the experimental results for the pre- and post-processing methods, respectively.


5.5.1 Noise reduction and morphological operations results

Noise reduction is the simplest subsystem of the tracking ensemble; nonetheless it is an important stage that impacts the precision of the tracking. In this section we present the improvements that noise elimination and post-processing of the foreground mask bring, based on the best configuration we have found. In fact there are two separate stages: pre-processing the initial frame, and post-processing the result, namely the foreground mask obtained by the methods explained in Section 2.1. The experiment was applied to the Sequence 1 and Sequence 2 data sets, and to their respective foreground masks obtained with the most representative foreground detector: GMM. The choice is fairly obvious, since this detector provides foreground masks with defects, so the benefits of this stage become visible. The initial stage consists of pre-processing the frame by smoothing it with a blur and low-pass filter. The more interesting stage to analyze is the post-processing of the foreground mask.
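The morphological opening applied to the foreground mask (erosion followed by dilation, here with a 3x3 structuring element) can be sketched on a binary mask as below; this is an illustrative reimplementation, not the OpenCV routines used in the system:

```python
def erode(mask):
    """3x3 erosion of a binary mask (list of rows of 0/1):
    a pixel survives only if its whole neighborhood is foreground."""
    h, w = len(mask), len(mask[0])
    return [[1 if all(mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))) else 0
             for x in range(w)] for y in range(h)]

def dilate(mask):
    """3x3 dilation: a pixel is set if any neighbor is foreground."""
    h, w = len(mask), len(mask[0])
    return [[1 if any(mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))) else 0
             for x in range(w)] for y in range(h)]

def opening(mask):
    # Erosion followed by dilation: removes isolated noise pixels
    # while roughly preserving the size of the larger blobs.
    return dilate(erode(mask))
```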

Figure 5.12. (a) GMM foreground mask without processing; (b) post-processed GMM foreground mask with blur filter and morphological opening; (c) tracking result when using foreground mask (a); (d) tracking result when using foreground mask (b). KLT on Sequence 1 with the distance parameter set to 10.

Figure 5.12 presents the impact of the foreground mask post-processing on the tracking system. At each iteration the KLT algorithm receives as input the existing targets along with a foreground mask that is used for discriminating


between good features and occluded features. The aim of the tracking stage is to find sufficient interest points to describe the persons present in the scene, and no feature points for the background. In picture (c) of Figure 5.12 we see an erroneously detected corner belonging to the background. Due to the noise reduction and the more accurate background model obtained in Figure 5.12 (b), this error is avoided. The other important aspect is that the foreground objects are now denser, with clearer contours, which yields a better identification of the interest points. We observe that for the left person we improved from 3 feature points to 4, while for the right person the improvement is even more significant: from only 2 grouped corners on the feet to 5 descriptive corners along the entire body.

Figure 5.13. (a) GMM foreground mask without processing; (b) post-processed GMM foreground mask with blur filter and morphological opening; (c) tracking result when using foreground mask (a); (d) tracking result when using foreground mask (b). Sequence 2, distance 10.

Tracking in the situation of Sequence 2 raises numerous problems, first because of the noisy scene and the low frame rate. Nevertheless, more accurate results are produced when pre-processing is used: significantly fewer background targets are registered and the main feature points of the foreground are kept, so no object is lost from the track due to the erosion. Experimental results are shown in Figure 5.13.


5.5.2 Shadow detection results

The shadow detection experiment was undertaken on all three video sequences, in order to observe which method provides a better result that could be used in the tracking stage. All methods use as foreground the mask resulting from the GMM method. As a detailed study [35] shows, shadow detectors are firstly highly dependent on the scene and secondly dependent on the object size and shape. We analyze in turn the results obtained with the three studied methods on the three input data sets.

For the first sequence, the geometrical shadow detector fails, since it relies mostly on geometrical features that do not fit these particular objects. The detector splits any found blob into two regions based on the gravity point of the figure, but the angle of the camera and the light source do not respect the expectations of this method in terms of shape and object orientation.

The chromaticity method has a positive effect in that it smoothens the foreground mask and eliminates small shadows around the moving objects, but it fails to detect the small shadows that occur from background illumination variations. The method that uses the large-region texture feature for detecting shadows performs best in this situation, because it is more independent of the scene, color or shape of the objects. The texture of the background is correlated with the texture of the shadow candidates, and in most cases this results in a high value that correctly indicates a shadow region.
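The texture correlation idea can be illustrated with a toy normalized cross-correlation between a background patch and a candidate patch; the threshold and patch handling below are illustrative, not the exact detector of [35]:

```python
import numpy as np

def texture_correlation(bg_patch, cand_patch):
    """Normalized cross-correlation between a background patch and a
    candidate patch. A cast shadow darkens a region but preserves its
    texture, so shadow candidates correlate strongly with the background."""
    a = bg_patch.astype(float).ravel()
    b = cand_patch.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

rng = np.random.default_rng(0)
bg = rng.integers(50, 200, size=(8, 8))
shadow = bg * 0.5                             # darker, same texture
person = rng.integers(50, 200, size=(8, 8))   # unrelated texture

is_shadow = texture_correlation(bg, shadow) > 0.9   # illustrative threshold
is_person = texture_correlation(bg, person) > 0.9
```

A uniformly darkened copy of the background correlates perfectly with it, while an unrelated foreground texture does not, which is the property the large-region detector exploits.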

The results of all methods applied to Sequence 1 can be observed in Figure 5.14.

The results on Sequence 2 are shown in Figure 5.15. Both the chromaticity method and the geometrical one have little impact on the foreground mask, detecting just small regions of shadow and classifying valid foreground as shadow. On the other hand, the texture method has good results, being capable of eliminating a large amount of background shadows and giving a cleaner mask for the foreground. Moreover, the cast shadow of the person entering the scene is significantly diminished. The large regions of texture are in this case relevant for the classification, and the algorithm is successful. The results are nevertheless far from perfect, since the scene itself raises many problems such as high noise, camouflage, dynamic background and sudden changes of illumination.

The results on Sequence 3 (Figure 5.16) are somewhat more relevant, since the frames from this set contain many shadows from different objects, with different sizes and orientations. The geometrical method attempts to eliminate shadows, but in most cases it erases the lower half of the objects, because it considers each blob to be half person, half shadow. The orientation of the shadows in this case again does not match the assumption of the algorithm.


Figure 5.14. Performance of shadow detectors on Sequence 1, applied to the foreground mask obtained with GMM: (a) GMM foreground mask; (b) geometrical shadow detector; (c) chromaticity-based shadow detector; (d) large-region texture shadow detector.


The large-region texture-based shadow detector successfully detects most of the shadows; however, it produces many shadow false positives, which lead to the elimination of some valid foreground parts. The objects in this video are quite small, so the texture from one region can easily be wrongly classified.

Figure 5.15. Performance of shadow detectors on Sequence 2, applied to the foreground mask obtained with GMM: (a) GMM foreground mask; (b) geometrical shadow detector; (c) chromaticity-based shadow detector; (d) large-region texture shadow detector.


Figure 5.16. Performance of shadow detectors on Sequence 3, Frame 250, applied to the foreground mask obtained with GMM: (a) GMM foreground mask; (b) geometrical shadow detector; (c) chromaticity-based shadow detector; (d) large-region texture shadow detector.

In conclusion, the best shadow detector of the three methods we studied is the one using the texture from large regions as its main feature. The most visible results were obtained on Sequence 2, where the objects are fairly large compared with the entire scene and both background and foreground shadows are visible. The algorithm eliminated a large fraction of both types of shadows.

5.6 Tracking

5.6.1 Accuracy metrics for tracking

For measuring accuracy we adopted the measures proposed in the CLEAR Evaluation Workshop [22], which combine the following measurements:

• Misses (m): an object is considered missed if it was not tracked within 50 cm accuracy.

• False positives (fp): an object is a false positive if it has feature points associated with it even though it is not part of the foreground.

• Mismatches (mme): when a track belonging to an object switches to another object, this is counted as one mismatch.

These measurements are combined in a more complex metric called the Multiple Object Tracking Accuracy (MOTA).

MOTA = 1 − ∑_n (m_n + fp_n + mme_n) / ∑_n g_n    (5.4)

where:
• n is the frame number,
• g_n is the correct number of persons present in frame n.

The number of mismatches has little impact on the MOTA measure, since the error is not reported in each frame but rather once, when it occurs. The metric relies mostly on the number of false negatives (misses) and false positives.
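Equation 5.4 can be implemented directly from per-frame error counts; a minimal sketch:

```python
def mota(misses, false_positives, mismatches, ground_truth):
    """Multiple Object Tracking Accuracy (Eq. 5.4): per-frame lists of
    misses, false positives and mismatches, normalized by the total
    number of ground-truth objects over all frames."""
    errors = sum(m + fp + mme for m, fp, mme in
                 zip(misses, false_positives, mismatches))
    total = sum(ground_truth)
    return 1.0 - errors / total

# Three frames with two persons each; one miss and one false positive in total.
score = mota(misses=[1, 0, 0], false_positives=[0, 1, 0],
             mismatches=[0, 0, 0], ground_truth=[2, 2, 2])
```

In this toy example the two errors over six ground-truth targets give MOTA = 1 − 2/6 ≈ 0,667, and one can see why a single mismatch event weighs far less than a miss repeated over many frames.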

5.6.2 Qualitative results

Experimental results for moving person detection using the KLT tracking algorithm have been produced for several image sequences that each raise different problems. We analyze and describe the different sequences, which represent typical situations critical for video surveillance systems, and present qualitative results obtained with all methods and different parameters.

The first stage of the tracking is detecting the features to be followed, using the goodFeaturesToTrack function from the OpenCV library. The most important parameter is the minimum distance allowed between returned corners. If the minimum allowed distance is fairly small, feature points from different targets could coincide, affecting the mismatch rate, while if the distance is too large, targets with few feature points in the vicinity of another tracked target might not be considered.

The block size influences the execution time of the algorithm in a noticeable way. The time spent per frame is inversely proportional to the size of the block. However, using large blocks for relatively small targets yields precision loss and causes more misses. For all three tested sequences, blocks ranging from 3 to 11 are acceptable.

Sequence 1 represents the baseline of our testing, since the footage is not very noisy, the camera is stable and the persons in the foreground are easily distinguishable. Furthermore, we observed in the previous section that for this video sequence the foreground mask is the most accurate. We can clearly observe how the tracking algorithm performs in this situation: all targets are tracked, being assigned a different number of feature points according to the blockSize and the minimum allowed distance between the detected corners. Ranging from a minimum distance of 10 to 40, we obtain clear tracking for both targets. In Figure 5.17 (a) some feature points belonging to target 1 may switch to target 2 and vice versa, causing mismatches. This does not represent a real problem, but it is worth noticing that many feature points do not contribute to a better tracking session, making the targets harder to distinguish and the computations more time-consuming. An optimal distance of 30 can be set without any loss of accuracy or performance.

Figure 5.17. Variation of the minDistance parameter of the goodFeaturesToTrack function used in the tracking algorithm for Sequence 1: (a) minDistance = 10; (b) minDistance = 20; (c) minDistance = 30; (d) minDistance = 40.

Sequence 2 is a real challenge for the tracking system, as can be observed in Figure 5.18. Both the imperfect foreground mask provided by the previous methods and the low frame rate make it difficult to correlate feature points between consecutive frames. We can observe that only one of the two targets present in the scene has feature points that can be identified by the goodFeaturesToTrack function in three out of four cases. The camouflage problem cannot always be overcome in this situation. An interesting point to observe is that in this particular case, making the distance between feature points proportional to the dimension of the target objects in the scene is not effective. This can be explained by the sparse foreground mask: using a larger minimum distance would eliminate feature points that actually belong to the same target but are placed in different, weakly connected parts of the foreground mask.

The test run on Sequence 3 highlights the problems of mismatching and missing, as shown in Figure 5.19. As opposed to the previous scenes, this one is very crowded, not only with persons but also with cars moving in the background. This raises problems both for the update of the background model and for the tracking algorithm. Objects have different sizes and velocities in the scene, which makes the setting of the parameters more difficult. For this reason we cannot afford to use a block larger than 3, even at the expense of more time resources, since changes happen in small windows of the scene. Also, the setting of the algorithm should keep in mind that the mismatch problem is nowhere near as important as missing a target. This is why a small distance such as 5 or 10 is recommended. For these parameter values only one person is missed, as opposed to three or four misses when using a higher minDistance such as 20 or 40. The problem of missing or mismatching targets for the moving cars is out of the scope of this study.

Figure 5.18. Variation of the minDistance parameter of the goodFeaturesToTrack function used in the tracking algorithm for Sequence 2: (a) minDistance = 5; (b) minDistance = 10; (c) minDistance = 20; (d) minDistance = 40.


Figure 5.19. Variation of the minDistance parameter of the goodFeaturesToTrack function used in the tracking algorithm for Sequence 3: (a) minDistance = 5; (b) minDistance = 10; (c) minDistance = 20; (d) minDistance = 40.

The results of the goodFeaturesToTrack method are further passed to the cornerSubPix function, which refines the corner locations. The feature points are then used in the calcOpticalFlowPyrLK method, which implements the Lucas-Kanade algorithm. This implementation allows the usage of resolution pyramids for better estimating the displacement d that minimizes the error between two corresponding pixels. The experiment shows that the simple, non-pyramidal approach has good results, but the feature points have to be reinitialized many times. If pyramids are passed as input, the algorithm will use as many levels as the pyramids have, but no more than maxLevel, and the feature points will be steady, which leads to fewer mismatches and less deletion and creation of feature points at each iteration.

5.6.3 Accuracy results

The results presented in this section are produced by testing the tracking system with the following setup:

• Pre-processing of the initial frame by blurring and truncating the high pixel intensities by 20%

• Foreground mask produced with the GMM method with a high (0.8) threshold


• Shadow reduction with the large texture areas method
• Morphological opening using an ellipse (3,3) as the structural element
• KLT algorithm with pyramids of maxLevel 3 and a minimum distance between corners of 20.

All the other parameters have been chosen according to the previous discussions. The results are collected from the 100 most representative consecutive frames of each input sequence, avoiding the initialization part, when the algorithms may show some inconsistent behavior.
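The setup above can be grouped into a single configuration structure; the key names below are descriptive labels chosen for this sketch, not identifiers from the implementation:

```python
# Illustrative summary of the experimental setup; names are hypothetical.
TRACKING_SETUP = {
    "preprocessing": {"blur": True, "intensity_truncation": 0.20},
    "foreground": {"method": "GMM", "threshold": 0.8},
    "shadow_removal": "large_texture_areas",
    "morphology": {"operation": "opening", "element": ("ellipse", (3, 3))},
    "klt": {"max_pyramid_level": 3, "corner_distance": 20},
}
```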

For Sequence 1, most errors appear when a person enters or leaves the scene at the far end, and these errors are mostly misses. The mismatches happen when two targets get very close to each other. Even though the frames contain two persons shaking hands and walking together, this phenomenon barely happens in the current setting.

For Sequence 2 there are many artifacts remaining from the foreground detection stage that produce false positives. Also, the frequent entries and exits combined with the slow frame rate lead to the loss of many targets, because the re-initialization of the feature points cannot keep up with the very sudden changes in the scene. The results for Sequence 3 are overall satisfactory: the main misses occur for people at the very far end, who can barely be spotted even by the human eye. There is also a series of mismatches and false positives in the crowded areas. The results can be found in Table 5.4.

Table 5.4. Accuracy results for all experimental data sets

Metric          | Sequence 1 | Sequence 2 | Sequence 3
Misses          | 13         | 33         | 189
False positives | 1          | 11         | 14
Mismatches      | 1          | n/a        | 4
Total targets   | 200        | 65         | 744
MOTA            | 92,5%      | 32,30%     | 72,17%


6. Conclusion

This thesis presents a multi-person tracking system for complex scenes, such as crowded outdoor environments. The study of each method led to a set of general conclusions about each method's ability to solve frequent problems in video tracking.

With regard to the background segmentation techniques, the mixture of Gaussians is the simplest and most reliable method in regular contexts, since it does not need special tuning, and the impact of the parameters on the overall efficiency and performance is minor. However, cluttered scenes with shadow and camouflage problems, such as Sequence 3, represent a challenge that it cannot overcome. The Block-Based Classifier detects the foreground with a similar accuracy to the previously mentioned method, but it has a higher degree of complexity. One of the advantages of the method is that the identified shapes are more compact, so little post-processing is needed in this case. The neural network that provides a self-organizing background model has the best results in the case of troublesome scenes, such as Sequence 2 and Sequence 3. This technique is robust to camouflage, different object sizes, occlusions and highly dynamic backgrounds. However, due to the recursive nature of the algorithm, early mistakes in the background model persist for long periods of time, making the calibration phase very sensitive to the parameter settings.

The processing module comes into play at two stages: before the background extraction and after the foreground modeling. Simple methods like blur filters, thresholding or morphological operations greatly increase the accuracy. A more subtle post-processing method is shadow removal. Three methods, based on chromaticity, geometrical and large-texture properties, were included in the experiment, but with no major impact on the final results.

Finally, the tracking module relies on the previously obtained and post-processed foreground mask. In order to obtain the "good features to track" it is necessary to first define the minimum distance for the returned interest points. We conclude that this distance should be 10-20 pixels, even when tracking large objects. The actual tracking method uses a pyramidal implementation with a maximum pyramid level of three. The tracking obtains a maximum Multiple Object Tracking Accuracy of 92,5% for the standard video sequence MWT and a minimum of 32,3% for an extremely difficult sequence that challenges every method. In conclusion, the research provided a rich understanding of all modules of a tracking system and of the challenges of multiple object tracking in complex outdoor scenes, while the implementation reached high accuracy rates.


References

[1] Armadillo library: http://arma.sourceforge.net/. [Online; accessed 1-October-2014].

[2] Martin Andersen and Rasmus Skovgaard Andersen. Master Thesis: Multi-Camera Person Tracking using Particle Filters based on Foreground Estimation and Feature Points. Aalborg University, 2010.

[3] John L Barron, David J Fleet, and Steven S Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43–77, 1994.

[4] Jerome Berclaz, Francois Fleuret, and Pascal Fua. Multiple object tracking using flow linear programming. In Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on, pages 1–8. IEEE, 2009.

[5] Stan Birchfield. Derivation of the Kanade-Lucas-Tomasi tracking equation. http://www.ces.clemson.edu/ stb/klt/birchfield-klt-derivation.pdf, 1997. [Online; accessed 01-October-2014].

[6] Jean-Yves Bouguet. Technical report: Pyramidal implementation of the affine Lucas Kanade feature tracker, description of the algorithm. Microprocessor Research Labs, Intel Corporation, 5, 2001.

[7] Thierry Bouwmans, Fatih Porikli, Benjamin Höferlin, and Antoine Vacavant. Background Modeling and Foreground Detection for Video Surveillance. CRC Press, 2014.

[8] Paul A Brasnett, Lyudmila Mihaylova, Nishan Canagarajah, and David Bull. Particle filtering with multiple cues for object tracking in video sequences. In Electronic Imaging 2005, pages 430–441. International Society for Optics and Photonics, 2005.

[9] Chia-Jung Chang, Wen-Fong Hu, Jun-Wei Hsieh, and Yung-Sheng Chen. Shadow elimination for effective moving object detection with Gaussian models. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 2, pages 540–543. IEEE, 2002.

[10] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Real-time tracking of non-rigid objects using mean shift. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 142–149. IEEE, 2000.

[11] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(5):564–577, 2003.

[12] Rita Cucchiara, Costantino Grana, Massimo Piccardi, and Andrea Prati. Detecting moving objects, ghosts, and shadows in video streams. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(10):1337–1342, 2003.

[13] Frederic Devernay, Diana Mateus, and Matthieu Guilbert. Multi-camera scene flow by tracking 3-d points and surfels. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2203–2212. IEEE, 2006.

[14] Lijun Ding and Ardeshir Goshtasby. On the Canny edge detector. Pattern Recognition, 34(3):721–725, 2001.

[15] Ahmed Elgammal, Ramani Duraiswami, David Harwood, and Larry S Davis. Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE, 90(7):1151–1163, 2002.

[16] David J Fleet and Allan D Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5(1):77–104, 1990.

[17] Ismail Haritaoglu, David Harwood, and Larry S Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:809–830, 2000.

[18] Laurence Bender and Ignacio Guerra. Scene project, http://scene.sourceforge.net/. [Online; accessed 1-October-2014].

[19] Michael Isard and Andrew Blake. Contour tracking by stochastic propagation of conditional density. In Computer Vision - ECCV 1996, pages 343–356. Springer, 1996.

[20] Pakorn KaewTraKulPong and Richard Bowden. An improved adaptive background mixture model for real-time tracking with shadow detection. In Video-Based Surveillance Systems, pages 135–144. Springer, 2002.

[21] Pakorn KaewTraKulPong and Richard Bowden. An improved adaptive background mixture model for real-time tracking with shadow detection. In Video-Based Surveillance Systems, pages 135–144. Springer, 2002.

[22] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008, 2008.

[23] Teuvo Kohonen. Self-Organization and Associative Memory. Springer Series in Information Sciences, volume 8. Springer-Verlag, Berlin Heidelberg New York, 1988.

[24] Oswald Lanz, Paul Chippendale, and Roberto Brunelli. An appearance-based particle filter for visual tracking in smart rooms. In Multimodal Technologies for Perception of Humans, pages 57–69. Springer, 2008.

[25] Peihua Li, Tianwen Zhang, and Arthur EC Pece. Visual contour tracking based on particle filters. Image and Vision Computing, 21(1):111–123, 2003.

[26] Open Source Computer Vision Library. http://opencv.org. [Online; accessed 01-October-2014].

[27] Tony Lindeberg. Generalized Gaussian scale-space axiomatics comprising linear scale-space, affine scale-space and spatio-temporal scale-space. Journal of Mathematical Imaging and Vision, 40(1):36–81, 2011.

[28] Lucia Maddalena and Alfredo Petrosino. A self-organizing approach to background subtraction for visual surveillance applications. Image Processing, IEEE Transactions on, 17(7):1168–1177, 2008.

[29] Lucia Maddalena and Alfredo Petrosino. A fuzzy spatial coherence-based approach to background/foreground separation for moving object detection. Neural Computing and Applications, 19(2):179–186, June 2009.

[30] Katja Nummiaro, Esther Koller-Meier, and Luc Van Gool. Color features for tracking non-rigid objects. ACTA Automatica Sinica, 29(3):345–355, 2003.

[31] OpenCV. Threshold: http://docs.opencv.org/doc/tutorials/imgproc/threshold/threshold.html. [Online; accessed 1-October-2014].

[32] Vikas Reddy, Conrad Sanderson, and Brian C Lovell. Improved foreground detection via block-based classifier cascade with probabilistic decision integration. 2013.

[33] A. Sanin. NICTA: http://www.nicta.com.au/research/computer_vision/. [Online; accessed 01-October-2014].

[34] Andres Sanin, Conrad Sanderson, and Brian C Lovell. Improved shadow removal for robust person tracking in surveillance scenarios. Pattern Recognition (ICPR), 2010 20th International Conference, pages 141–144, 2010.

[35] Andres Sanin, Conrad Sanderson, and Brian C Lovell. Shadow detection: A survey and comparative evaluation of recent methods. Pattern Recognition, 45(4):1684–1695, 2012.

[36] Jean Serra. Image Analysis and Mathematical Morphology, v. 1. Academic Press, 1982.

[37] Jianbo Shi and Carlo Tomasi. Good features to track. In Computer Vision and Pattern Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference on, pages 593–600. IEEE, 1994.

[38] Nils T Siebel. Design and implementation of people tracking algorithms for visual surveillance applications. Unpublished dissertation, University of Reading, 26:37, 2003.

[39] Leonid Sigal, Alexandru O Balan, and Michael J Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1-2):4–27, 2010.

[40] C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), pages 246–252.

[41] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. Technical Report CMU-CS-91-132, School of Computer Science, Carnegie Mellon Univ., Pittsburgh, 1991.

[42] V. Reddy, C. Sanderson, and B.C. Lovell. Robust foreground estimation / background subtraction, http://arma.sourceforge.net/foreground/. [Online; accessed 01-October-2014].

[43] Greg Welch and Gary Bishop. An introduction to the Kalman filter, 1995.

[44] Alper Yilmaz, Xin Li, and Mubarak Shah. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11):1531–1536, 2004.

[45] Sofiane Yous, Hamid Laga, Kunihiro Chihara, et al. People detection and tracking with world-z map from a single stereo camera. In The Eighth International Workshop on Visual Surveillance - VS2008, 2008.