HEAD DETECTION IN DENSELY CROWDED SCENES

Submitted by

Ekambaram Rajmadhan

Department of Electrical & Computer Engineering

In partial fulfillment of the

Requirements for the Degree of

Master of Science in Electrical Engineering

National University of Singapore

2009

ABSTRACT

False alarms output by object detectors reduce their reliability and increase the

computation of subsequent processing units. In this project the Viola-Jones face

detector approach [4] is compared with the Dalal and Triggs pedestrian detector

approach [5], modified for head detection in densely crowded scenes. Based on the

results of this experiment, the Viola-Jones face detector approach is chosen for its high

detection rate. A method is then developed in this project to reduce the false alarms

obtained by using the Viola-Jones face detector when applied to the problem of

head detections in densely crowded scenes. A new type of Haar-like features called

“triangle-features” is introduced and their efficient computation is shown. The

experimental results show that “triangle-features” together with rectangle Haar-like

features reduce the false alarm rate from 26% to 18% with a small decrease in the

detection rate from 85% to 82.22%. This project also supports the multi-view face

detector approach for head detection. For multi-view detectors the training samples

need to be grouped based on viewpoint or orientation. Manual classification of

training samples based on orientation is a time consuming task. A novel method to

automate the process of classifying human head images based on their orientation

is also developed in this project.

ACKNOWLEDGEMENTS

I offer my sincere gratitude to my project supervisor, Professor Dr. Surendra

Ranganath, who has supported me throughout the project with his guidance and

encouragement. I would like to thank Sim Chern-Horng from Vision and Image

Processing Laboratory for giving me invaluable technical support, mentoring and

advice on this project. They opened up the unknown areas of computer vision,

machine learning and image processing to me, and enlightened me along the way.

Without them, this project would not have been possible. I would like to thank my

parents for their love and support. My family is always beside me and gives me

courage and motivation to proceed in this project.

CONTENTS

ABSTRACT …………………………………………………………………… i

ACKNOWLEDGEMENTS …..………………………………………………… ii

CONTENTS ….………………………………………………………………… iii

LIST OF FIGURES …………………………………………………………….. vi

LIST OF TABLES …………………………………………………………….... viii

CHAPTER 1 INTRODUCTION ……………………………………………….. 1

1.1 Background ……………………………………………………. 1

1.2 Related Works ………………………………………………… 1

1.3 Overview ………………………………………………………. 2

CHAPTER 2 HEAD DETECTIONS IN DENSELY CROWDED SCENES…4

2.1 Introduction ……………………………………………………..4

2.2 AdaBoost ………………………………………………………. 4

2.3 Viola and Jones Face Detector ……………………………….6

2.3.1 Rectangular Haar-like Features ……………..6

2.3.2 Integral Image ……………………………….. 6

2.3.3 Weak Learning Algorithm ……………………7

2.3.4 Cascade of Classifiers ……………………… 8

2.4 Head Detector ………………………………………………... 9

CHAPTER 3 HISTOGRAM OF ORIENTED GRADIENTS FOR HEAD

DETECTIONS ………………………………………………………………… 11

3.1 Introduction ………………………………………………….. 11

3.2 Histogram of Oriented Gradients …………………………. .11

3.2.1 HOG Introduction ………………………….. 11

3.2.2 HOG Descriptor ………………………… 12

3.2.3 HOG Parameters ………………………. 12

3.3 HOG for Head Detections ………………………………. 13

3.4 Experimental Details ……………………………………. 14

3.5 Testing Results ………………………………………….. 16

CHAPTER 4 TRIANGLE FEATURES FOR HEAD

DETECTIONS …………………………………………………………….. 18

4.1 Introduction ………………………………………………. 18

4.2 Triangle Feature ………………………………………… 19

4.3 Triangle Feature Evaluation ……………………………. 19

4.4 A Single Stage AdaBoost Classifier To Reduce

The False Alarms .………………………………………………………… 23

4.5 Results …………………………………………………… 24

4.6 Sample Outputs …………………………………………. 26

4.7 Triangle Feature Limitations and Possible Extensions..27

CHAPTER 5 CLASSIFICATION OF HUMAN HEAD IMAGES BASED ON

ORIENTATION ……………………………………………………………. 29

5.1 Introduction ………………………………………………. 29

5.2 Overview of Head Classification Based on Orientation..30

5.3 Formation of Image Descriptor ………………………… 30

5.4 Classification Method …………………………………… 31

5.4.1 Bhattacharyya Coefficient …………….. 31

5.4.2 Clustering ………………………………. 32

5.4.3 Classification Algorithm ……………….. 32

5.5 Observation ……………………………………………… 33

5.6 Sample Results…………………………………………. .33

CHAPTER 6 CONCLUSIONS ……………………………………………34

REFERENCES ………………………………………………………........35

LIST OF FIGURES

Figure 2.1 AdaBoost Algorithm ………………………………………………….. 5

Figure 2.2 Integral Image ………………………………………………………… 7

Figure 2.3 Sum of the pixels within rectangle can be computed with 4 memory

references………………………………………………………………………….. 7

Figure 3.1 Sample detections of a crowd facing away from the camera using HOG

descriptors and linear SVM classifier …………………………………………… 16

Figure 3.2 Sample detections for a crowd facing the camera using HOG descriptors

and linear SVM classifier ………………………………………………………… 17

Figure 4.1 Triangle feature approximation ………………………………………18

Figure 4.2 Example triangle features …………………………………………… 19

Figure 4.3 Example triangle images. The value of pixel at location 'P' is the sum of

the pixels shown in shaded region …………………………………………….. . 20

Figure 4.4 Example triangle image to show the calculation of sum of the pixels

within a triangle shown in the shaded region …………………………………. .20

Figure 4.5 Two triangular regions P1 and P2 ………………………………….21

Figure 4.6 Difference of two triangular regions P1 and P2 …………………..21

Figure 4.7 Two rectangular regions P3 and P4 ……………………………….21

Figure 4.8 Sum of pixels in the triangle can be evaluated from the triangle and

rectangular regions ………………………………………………………………22

Figure 4.9 Different triangle features …………………………………………..22

Figure 4.10 Triangle feature detector ………………………………………….23

Figure 4.11 The outputs of the detector [3] and the detections which are rejected by

Triangle Feature detector for the heads back facing the camera ………….25

Figure 4.12 The outputs of the detector [3] and the detections which are rejected by

Triangle Feature detector for the heads facing the camera ………………..26

Figure 4.13 Triangle feature depicted in a square of area n * n ………….27

Figure 4.14 Scalable Triangle feature ………………………………………28

Figure 5.1 Images which are closer to the cluster center 1 ………………33

Figure 5.2 Images which are closer to the cluster center 2 ………………33

LIST OF TABLES

Table 3.1 Comparison of descriptors used in head detections and in pedestrian detection ………… 14

Table 4.1 Detection rate and false alarm rate of detector [3] and HOG for head detections ………… 24

Table 4.2 Detection rate and false alarm rate of Triangle feature detector for head detections ………… 24

CHAPTER 1

INTRODUCTION

1.1 Background

Tracking people in densely crowded scenes is of wide interest in the computer vision

community because of its potential in surveillance applications. Tracking is necessary

to monitor actions of the individuals. All the tracking algorithms require information

about the object to track. Then object detection algorithms can be used to provide

input to the tracking algorithms. To track people in densely crowded scenes a head

detector can be used to provide input to any tracking algorithm. State of the art object

detectors are available only for face detection (Viola and Jones [1]) and sparsely

crowded pedestrians (HOG [2]). The only available detector that can be used for

head detection is [3]. The detection rate of the algorithm [3] is satisfactory but its false

alarm rate is not. In this project the false alarm rate of this detector is reduced by

developing a new type of feature. To further reduce the false alarms this project

suggests using multi-view detectors. [16] can be used only for multi-view face

detection, but not for heads. To train such a multi-view detector the first step is to

classify training samples based on view point or orientation. A novel method is

developed in this project to classify heads based on orientation.

1.2 Related works

The problem considered in this project is for surveillance application, so it is assumed

that the cameras will be mounted at an elevation. People's faces will not be fully visible

from top view in densely crowded scenes because of occlusion and their direction of

movement. People moving away from the camera will have their backs to the camera

and hence their faces will not be visible. For people moving sideways from right to left

or from left to right only partial profile view of their faces will be visible. Because of the

nature of this problem the state of the art face detector [1] cannot be used directly in

this application. A pedestrian detector like [2] is good for sparsely crowded scenes

where full body of the pedestrian is visible. But this project deals with densely crowded

scenes, where full body of the people is rarely seen because of occlusion. This makes

[2] difficult to use in this application. The pedestrian detector [11] uses Bayesian

combination of different body part detectors. [11] cannot be used in this application for

the same reasons as [2] and because of its basic concept which involves combining

different body parts detectors. [15] uses Haar-like features to detect pedestrians from a

video sequence. In this project heads should be detected from still images, so [15]

cannot be used. The detector [3] shows good results for head detections in densely

crowded scenes. But false alarms of [3] are higher compared to other detectors like [1],

[2] and [11]. In this project false alarms of [3] are reduced from 26% to 18% with a

small decrease in the detection rate from 85% to 82.22%. This project supports multi-

view face detector approach for head detections to further reduce the false alarms

without decreasing the detection rate or to even improve the detection rate. The

training for multi-view detector is a tiresome task because the training samples have to

be separated based on their view or orientation direction. [16] is a good multi-view face

detector. In [16] the authors manually classified faces based on view direction for

training. Separating human heads based on orientation is difficult and time consuming.

Inspired by [16] a method is developed to automate head orientation classification

process, which can be used to classify training samples for future multi-view head

detectors.

1.3 Overview

A new type of feature is introduced in this project to reduce false alarms of the detector

[3]. A single stage classifier is constructed using AdaBoost machine learning algorithm

with triangle Haar-like features. This project proposes a new classifier which is a

cascade of classifier [3] and the newly constructed AdaBoost triangle feature classifier.

The output of the detector [3] is the co-ordinates and size information of each of the

detections in the image. In the new classifier the sub-images corresponding to the co-

ordinates and size information obtained from detector [3] are cropped and resized to

24*24 pixels using bilinear interpolation and passed to AdaBoost triangle feature and

rectangle Haar-like feature classifier. The outputs of this new classifier are the heads in

the image. The objective of this project is to improve the detector [3] or to find a

detector which performs better than [3]. Detector [3] uses rectangle Haar-like features

to learn the similarity among positive samples and their difference from the negative

samples. It is difficult for Haar-like features to learn the similarity among positive

samples of different orientations. Since each Haar-like feature is applied over a

sub-window of an image, large structural variations among the positive samples cause its

threshold to be learned poorly. Based on intuition and on the performance of the detector

[16], this project supports a multi-view detector for head detection. Although no multi-view

detector is developed and no existing multi-view detector is tested in this project, a

novel method to automate the tiresome task of classifying training samples based on

view point or orientation for multi-view detector training is developed. To classify the

images based on orientation, normalized descriptors similar to HOG are created and

are clustered using fuzzy c-means clustering. As a result of the clustering, heads of

different orientations fall into different clusters.

The rest of the chapters are organized as follows: Chapter 2 describes how the Viola

and Jones face detection approach [1] is applied to the problem of head detections in

densely crowded scenes [3]. Chapter 3 explains how Dalal and Triggs [2] pedestrian

detector approach is applied to head detection and compares its performance with

Viola-Jones face detector approach [3]. Chapter 4 introduces a new type of feature

called “triangle features” and shows how it is used to reduce the false alarms of [3]. A

novel method for classifying objects based on orientation is developed in Chapter 5.

Conclusions and future work are in Chapter 6.

CHAPTER 2

HEAD DETECTIONS IN DENSELY CROWDED SCENES

2.1 Introduction

The Viola and Jones face detection algorithm [1] uses rectangle Haar-like features to

learn the critical visual features of the faces and these rectangle Haar-like features are

combined to form a strong classifier using a machine learning algorithm called

Adaptive Boosting (AdaBoost). In detector [3] Viola-Jones face detection approach is

used for head detections in densely crowded scenes. This chapter explains in detail

the AdaBoost machine learning algorithm in section 2.2, the Viola-Jones face

detector in section 2.3 and detector [3] in section 2.4.

2.2 AdaBoost

Boosting is a supervised machine learning algorithm. Machine learning algorithms aim

to automatically learn complex patterns and make intelligent decisions based on data.

In supervised learning, the algorithm is presented with sample inputs and outputs, and

expected to learn the association between them. So when presented with unknown

examples the algorithm is expected to classify them correctly. Boosting is a way of

combining the results of weak learners to produce a strong classifier. It is an iterative

algorithm; in each step a simple classifier is selected by a weak learning algorithm based

on a distribution over the training samples, and is added to the final classifier. A learning algorithm which selects

the simple classifiers is called a weak learner, and the chosen classifiers are expected to

be only slightly better than random guessing, i.e., their probability of correct classification

should be greater than 50%. Each of the simple classifiers selected by the weak

learners contributes a parameter (confidence value), which measures the importance

of the simple classifier, to the final classifier. The value of the parameter is based on

the classification accuracy of the training samples. In adaptive boosting (AdaBoost), a

variant of boosting algorithm, the parameter (confidence value or strength of the weak

classifier) is calculated using the probability distribution of the training samples. The

probability distribution is calculated based on the weights of the training samples. At

the start of the training weights are assigned to all the training samples. Weights are

distributed uniformly or assigned based on the importance of the training samples. The

weak classifier which produces the largest sum of weights of the correctly classified

samples is chosen in each iteration. During training the weights are updated in every

iteration after the selection of a weak classifier, based on the importance of the training

samples. The training samples which are classified correctly by a selected weak

classifier are considered less important and their weights are reduced and the

incorrectly classified samples are considered more important and their weights are

increased in every round of the iteration. The AdaBoost algorithm is presented in

Figure 2.1.

Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$

Initialize $D_1(i) = 1/m$

For $t = 1, \ldots, T$:

Train weak learner using distribution $D_t$

Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error

$\varepsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$

Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$

Update:

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases}$

$= \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$

where $Z_t$ is the normalization factor (chosen so that $D_{t+1}$ will be a

distribution)

Output the final hypothesis:

$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Figure 2.1 AdaBoost Algorithm
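
The steps of Figure 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this project: the weak learner is assumed to be an exhaustive search over single-feature threshold classifiers (decision stumps), and the names `train_stump`, `stump_predict`, `adaboost` and `predict` are hypothetical. The error is clipped before computing the confidence value to avoid a division by zero when a stump classifies every sample correctly.

```python
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Threshold classifier on feature j; returns labels in {-1, +1}."""
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def train_stump(X, y, D):
    """Assumed weak learner: the stump with the lowest weighted error under D."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = D[pred != y].sum()          # epsilon_t for this stump
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost(X, y, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = train_stump(X, y, D)
        err = np.clip(err, 1e-10, 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)     # confidence value alpha_t
        h = stump_predict(X, j, thr, pol)
        D = D * np.exp(-alpha * y * h)            # shrink correct, grow incorrect
        D /= D.sum()                              # division by Z_t
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Final hypothesis H(x): sign of the weighted vote (ties map to +1)."""
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.where(score >= 0, 1, -1)
```

The weight update uses the compact form exp(-alpha_t y_i h_t(x_i)), which is equivalent to the two-case form in the figure because y_i h_t(x_i) is +1 for a correct classification and -1 otherwise.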

2.3 Viola and Jones Face detector

2.3.1 Rectangular Haar-Like Features

The Viola and Jones face detector [1] uses rectangular Haar-like features to encode

the appearance of human faces in images. The rectangular Haar-like

features are combined to form a strong classifier using the AdaBoost machine learning

algorithm. The rectangular Haar-like features are reminiscent of Haar basis functions,

but are overcomplete. The value of a simple rectangular Haar-like feature is defined as

the difference between the sums of the pixels within two adjacent rectangles, which can be at any

position and scale within the original image. This feature set is called 2-rectangle

features. Viola and Jones [1] also defined 3-rectangle features and 4-rectangle

features. A 3-rectangle feature value is calculated as the difference between the sum of the pixels

of the two outer rectangles and the sum of the pixels of the inner rectangle. A 4-rectangle feature value is

calculated as the difference between the sum of the pixels of the two rectangles on one diagonal and that of

the two rectangles on the other diagonal. The values of these features indicate certain

characteristics of a particular area of the image. Each feature type can indicate the

existence (or not) of certain characteristics in the image, such as edges or changes in

texture. For example, a 2-rectangle feature can indicate where the border between a

dark region and a light region lies.

2.3.2 Integral Image

The rectangular Haar-like features are computationally efficient and can be computed

rapidly using integral images. The Integral image at location (x, y) contains the sum of

the pixels above and to the left of (x, y) inclusive, as shown in Figure 2.2. Using

integral images the sum of pixels of a rectangular region in an image can be computed

using 4 memory references. Another advantage of the integral image is that it avoids

computing the pyramid of images, which is used in other methods for finding objects

of different sizes. By using integral images, the rectangular Haar-like features are

scaled instead of the images and the features of all dimensions are computed using

the same number of operations. The following example shows the computation of the

sum of pixels within a rectangle as shown in Figure 2.3:

Sum of pixels within a rectangle = P1 – P2 – P3 + P4

Figure 2.2 Integral Image

Figure 2.3 Sum of the pixels within a rectangle can be computed with 4 memory references. Sum of pixels within rectangle R = P1 – P2 – P3 + P4
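
The integral image and the four-reference rectangle sum can be illustrated with a short NumPy sketch. It assumes that P1, P2, P3 and P4 denote the integral image values at the top-left, top-right, bottom-left and bottom-right corners of R, as the layout of Figure 2.3 suggests. The function names are hypothetical, and the horizontal 2-rectangle feature is only one example of the feature types described in Section 2.3.1.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over rows 0..y and columns 0..x, inclusive."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle from at most 4 integral image references."""
    bottom, right = top + height - 1, left + width - 1
    p4 = ii[bottom, right]
    p2 = ii[top - 1, right] if top > 0 else 0
    p3 = ii[bottom, left - 1] if left > 0 else 0
    p1 = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    return p1 - p2 - p3 + p4

def two_rect_feature(ii, top, left, height, width):
    """Horizontal 2-rectangle feature: left half minus right half (even width)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum
```

Because `rect_sum` costs the same regardless of `height` and `width`, a feature can be evaluated at any scale in constant time, which is why the features are scaled instead of the image.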

2.3.3 Weak Learning Algorithm

A feature together with a threshold is called a weak classifier. The weak learning

algorithm selects a feature which best classifies the positive and negative training

samples. For each feature the weak learner determines the optimum threshold, such

that the minimum number of samples is misclassified. The weak learner selects a

feature as follows. For each feature, the examples are sorted based on feature value.

The optimal threshold for that feature can be computed in a single pass over this

sorted list. For each element in this sorted list, four sums are evaluated: the total sum

of positive sample weights T+, the total sum of negative sample weights T-, the sum of

positive weights below the current example S+, the sum of negative weights below the

current example S-. The error for a threshold which splits the range between the

current and previous example in this sorted list is:

e = min( S+ + ( T- - S- ), S- + ( T+ - S+ ) ),

or the minimum of the error of labeling all examples below the current example

negative and labeling the examples above positive versus the error of the converse.

These sums are easily updated as the search proceeds. The feature which generates

the lowest error is chosen, with the corresponding threshold, as a weak classifier. The

weak classifiers selected by the weak learner are combined to form a strong classifier

as given in Figure 2.1.
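
The single-pass threshold search for one feature can be sketched as follows. This is a minimal illustration under stated assumptions: labels are in {-1, +1}, the threshold is reported as the feature value of the current example (any value between the previous and current example gives the same split), and the function name `best_threshold` and the polarity convention are hypothetical.

```python
import numpy as np

def best_threshold(values, labels, weights):
    """Return (error, threshold, polarity) for one feature.

    polarity = +1: classify as positive when value >= threshold;
    polarity = -1: classify as positive when value < threshold.
    """
    order = np.argsort(values, kind="stable")
    v, y, w = values[order], labels[order], weights[order]
    t_pos = w[y == 1].sum()            # T+: total positive weight
    t_neg = w[y == -1].sum()           # T-: total negative weight
    s_pos = s_neg = 0.0                # S+, S-: weights below the current example
    best = (np.inf, v[0], 1)
    for i in range(len(v)):
        if i == 0 or v[i] != v[i - 1]:           # split only between distinct values
            err_neg_below = s_pos + (t_neg - s_neg)   # below negative, above positive
            err_pos_below = s_neg + (t_pos - s_pos)   # the converse labeling
            if err_neg_below < best[0]:
                best = (err_neg_below, v[i], 1)
            if err_pos_below < best[0]:
                best = (err_pos_below, v[i], -1)
        if y[i] == 1:
            s_pos += w[i]
        else:
            s_neg += w[i]
    return best
```

Running this search for every feature and keeping the feature with the smallest returned error gives the weak classifier for the current boosting round.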

2.3.4 Cascade of classifiers

The advantage of Viola and Jones [1] face detector is its speed of detection. The

speed is achieved by using a cascade of classifiers. The idea is to neglect the majority

of the non-positive regions with less computation using simple classifiers and to spend

more computation on regions which have high probability of being positive using

complex classifiers. The AdaBoost algorithm described in Figure 2.1 is used to create a

strong classifier. The strong classifiers are cascaded to form the face detector. The

earlier stages of the cascade are simple classifiers built using a small number of features

and their detection rates are close to 100%. The classifiers in the higher stages are

complex and are built with large number of features to reduce the false alarms. The

detection process follows a degenerate decision tree. Only the sub-windows which are

classified as positive in an earlier stage are sent to the successive stages. The sub-

windows which are classified as negative in any one stage will be rejected

immediately. The time required to process an image relies on the amount of

computation performed over a sub window and is directly proportional to the number of

features used in the classifier. Since the majority of the sub-windows in an image are

negative, rejecting most of them at the earliest possible stage, i.e., using a small number of

features, reduces computation drastically.
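
The early-rejection behaviour of the cascade can be sketched as below. The data layout is an assumption made for illustration: each stage is a list of (weak classifier, weight) pairs together with a stage threshold, and `cascade_classify` is a hypothetical name. The function also counts how many weak classifiers were evaluated, to make the computational saving visible.

```python
from typing import Callable, List, Tuple

WeakClassifier = Tuple[Callable[[object], float], float]   # (h, alpha)
Stage = Tuple[List[WeakClassifier], float]                 # (weak classifiers, threshold)

def cascade_classify(window, stages: List[Stage]) -> Tuple[bool, int]:
    """Return (is_positive, number of weak classifiers evaluated)."""
    evaluated = 0
    for weak_classifiers, stage_threshold in stages:
        score = 0.0
        for h, alpha in weak_classifiers:
            score += alpha * h(window)
            evaluated += 1
        if score < stage_threshold:
            return False, evaluated      # rejected immediately, later stages skipped
    return True, evaluated               # passed every stage

# Toy usage: "windows" are plain numbers and weak classifiers are simple tests.
toy_stages = [
    ([(lambda w: 1.0 if w > 0 else -1.0, 1.0)], 0.0),
    ([(lambda w: 1.0 if w > 5 else -1.0, 1.0),
      (lambda w: 1.0 if w > 7 else -1.0, 1.0)], 0.0),
]
```

In the toy usage, a window rejected by the first stage costs one weak classifier evaluation, whereas a window that reaches the second stage costs three.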

2.4 Head detector

In Head detector [3], straight and tilted rectangle Haar-like features are used to learn

the critical visual features and Gentle AdaBoost algorithm is used to combine these

features to create the strong classifiers. These strong classifiers are connected in a

cascade to form the head detector, and each strong classifier is called a stage

classifier. In [1] the criterion to train a stage classifier in the cascade is the number of

features but in [3] the detection rate and the false alarm rate are used. A single stage

classifier is created by selecting and boosting the Haar-like features as mentioned in

section 2.3.3 until it classifies the training samples with a given detection rate and false

alarm rate. Since the classifiers are cascaded, the detection rate and the false alarm

rate of the detector are equal to the products of the corresponding rates of all the stage classifiers. Let $D_1, D_2, \ldots, D_N$

be the detection rates of the stage classifiers. Then the detection rate of the cascade is

$D_1 \cdot D_2 \cdots D_N$. The false alarm rate of the detector is calculated in the same way.

The detection rate and the false alarm rate required of each stage classifier

can be calculated as follows: Let $D$ and $F$ be the expected detection rate and false alarm

rate, respectively, of the detector and let $N$ be the number of stages. Then the

detection rate and the false alarm rate of all the stage classifiers can be chosen

equally to be $D^{1/N}$ and $F^{1/N}$, respectively.

Gentle AdaBoost algorithm is chosen for its performance compared to other AdaBoost

variants as mentioned in [4]. The tilted rectangle Haar-like features [18] are extensions

of the rectangle Haar-like features and are added to increase the dimension of the

feature set to improve detection. The training set consists of 4016 positive samples

and 1704 negative samples. More training samples are generated by flipping the

positive samples horizontally and negative samples both horizontally and vertically.

The resulting 8032 positive samples and 6816 negative samples are used for training.

Each stage classifier is trained with a detection rate of 99.9% and a false alarm rate of

50%. The number of stages trained is 20. The final classifier gives 85% detection rate

and 26% false alarm rate when tested with 1010 head samples from 30 images.
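
As a worked illustration of the product rule for cascaded rates, the short sketch below computes the nominal rates implied by the per-stage training settings quoted above (99.9% detection rate, 50% false alarm rate, 20 stages), and the inverse calculation of per-stage targets. The function names are hypothetical. These are nominal values on the training criterion and are not expected to match the rates measured on the test images.

```python
def cascade_rates(stage_detection, stage_false_alarm, n_stages):
    """Overall rates of a cascade whose stages all meet the same per-stage rates."""
    return stage_detection ** n_stages, stage_false_alarm ** n_stages

def per_stage_rates(target_detection, target_false_alarm, n_stages):
    """Equal per-stage rates D^(1/N) and F^(1/N) for overall targets D and F."""
    return (target_detection ** (1.0 / n_stages),
            target_false_alarm ** (1.0 / n_stages))

d, f = cascade_rates(0.999, 0.5, 20)
print(f"nominal detection rate:   {d:.4f}")
print(f"nominal false alarm rate: {f:.2e}")
```

With these settings the nominal detection rate is about 0.98 and the nominal false alarm rate is about one in a million, which shows how strongly a per-stage false alarm rate of 50% compounds over 20 stages.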

CHAPTER 3

HISTOGRAM OF ORIENTED GRADIENTS FOR HEAD DETECTIONS

3.1 Introduction

Tracking of people in video can be performed by using initial detections from a head

detector. For such an application the detection rate of the detector should be high with

low false alarm rate. A suitable head detector that can be used for this application is [3].

The detection rate of the detector [3] is good but its false alarm rate needs to be

reduced for its use in tracking application. Before reducing the false alarms of this

detector it is also necessary to check for the performance of other available detectors.

Potential state-of-the-art detectors for head detection in densely crowded scenes are [2], [11]

and [15]. However, the detectors [11] and [15] cannot be used for this application,

as explained in the related works in Section 1.2. Hence, we chose to experiment with

detector [2].

3.2 Histogram of Oriented Gradients

3.2.1 HOG Introduction

The detector [2] is based on the idea that local object appearance and shape can be

characterized by the distribution of local intensity gradients or edge directions, even without precise

knowledge of the corresponding gradient or edge positions. In this detector the

orientation of the gradient is computed and histogram of this is calculated for

overlapping image blocks. The calculated histogram is used as descriptor to detect

pedestrians in an image using linear SVM classifier.

3.2.2 HOG Descriptor

A descriptor is created for each of the training samples using orientation of gradient

information. A linear SVM classifier is trained to classify the descriptors of positive and

negative training samples. The descriptor for a training sample is created as follows:

Gradient of an image is calculated using a simple centered mask like [-1 0 1] in both x

and y direction. Orientation of each of the gradient elements is calculated using these

gradient images. A gradient image is divided into non-overlapping rectangles of size

m*n. Each of these rectangles is called a cell. A histogram of size h bins is calculated

for each of these cells, with each pixel voting for its gradient orientation bin with a weight given by the gradient magnitude. The cells are

grouped to form blocks of size p*q. The histograms of all the cells in a block are

concatenated and normalized using L1 or L2 norm. The normalized histograms of all

the blocks are concatenated to form an image descriptor. The blocks are overlapping,

meaning a cell contributes to more than one block. Though each cell contributes to several

blocks, the values it contributes differ because of block normalization.
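
The descriptor construction described above can be sketched in NumPy as follows. This is a simplified illustration, not the detector of [2]: it assumes a grayscale image, square cells and blocks, unsigned orientations in 0°-180°, hard assignment of each pixel to a single bin (no interpolation of votes), plain L2 block normalization, and a block stride of one cell. The function name and default parameter values are hypothetical.

```python
import numpy as np

def hog_descriptor(img, cell=6, block=3, bins=9, eps=1e-6):
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # centered mask [-1 0 1] in x
    gy[1:-1, :] = img[2:, :] - img[:-2, :]        # centered mask [-1 0 1] in y
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)

    # One magnitude-weighted orientation histogram per non-overlapping cell.
    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            rows = slice(cy * cell, (cy + 1) * cell)
            cols = slice(cx * cell, (cx + 1) * cell)
            hist[cy, cx] = np.bincount(bin_idx[rows, cols].ravel(),
                                       weights=mag[rows, cols].ravel(),
                                       minlength=bins)

    # Overlapping blocks (stride of one cell), each normalized separately.
    blocks = []
    for by in range(n_cy - block + 1):
        for bx in range(n_cx - block + 1):
            v = hist[by:by + block, bx:bx + block].ravel()
            blocks.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))
    return np.concatenate(blocks)
```

Because every block is normalized on its own, the same cell histogram appears in the descriptor several times with different scaling, which is the effect described at the end of the section.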

3.2.3 HOG Parameters

The following parameters yield the best results for the pedestrian detector: cell size 8 * 8

pixels, linear gradient voting into 9 orientation bins in 0°-180°, block size 2 * 2 cells,

L2-Hys (Lowe-style clipped L2 norm) block normalization, 64 * 128 pixel detection

window and linear SVM classifier. While forming the histogram to reduce aliasing,

votes are interpolated bilinearly between the neighboring bin centers - in both

orientation and position. Image gradients are usually calculated after smoothing the

image using a Gaussian filter. But for the pedestrian detector the authors found by

experiment that the Gaussian smoothing decreases performance when σ is increased

from 0 to 2, so no Gaussian smoothing is performed. Gradient strengths vary over a

wide range owing to local variations in illumination and foreground-background

contrast, so effective local contrast normalization turns out to be essential for good

performance. In this experiment block normalization acts as local contrast

normalization. The gradient is calculated for each of the color channels in

either the RGB or LAB color space, and the gradient of the channel with the largest norm is used

in the histogram.

3.3 HOG for Head Detections

For head detections not all the steps/ideas of HOG pedestrian detector are used.

Some of the ideas are not valid for head detections in densely crowded scenes. In

pedestrian detection local contrast normalization seems to be a good idea and the

results of [2] show its importance in improving the performance of the classifier.

Local contrast normalization may not suit all types of objects. Particularly for head detection it may

not improve performance. In pedestrian detection, the camera field of view includes the

entire pedestrian, and occlusion and self-shadowing affect the features extracted for

classification. However, for head detections in densely crowded scenes “heads”

occupy a small portion of the image. If there is a shadow, it will usually cover the whole

head. So, local contrast normalization which is of critical importance in the pedestrian

detection is of little use in head detection. To increase the performance of HOG for

head detections, we experimented by increasing the negative samples. But the

obtained results are contrary to the expectations. When the number of negative training

samples is increased, linear SVM is not able to find a classifier. Heuristically this

suggests that the HOG descriptors created for human heads are not sufficiently distinctive

for classification using linear SVM and share characteristics with natural/man-made

objects which are used as negative samples.

3.4 Experiment details

Table 3.1 compares the differences in parameters used and results between head

detection and pedestrian detection using HOG.

Data | Head detection | Pedestrian detection
Number of histogram bins | 9 | 9
Cell size (pixels) | 6 * 6 | 8 * 8
Block size (cells) | 3 * 3 | 2 * 2
Image size (pixels) | 30 * 30 | 64 * 128
Total number of blocks in an image (assuming window stride is by one cell) | 9 | 105
Number of positive training samples | 8032 | 2478
Number of negative training samples | 6816 | 12180
Detection rate | 66% | 84-89%
False alarm rate | 39% | 10^-4 false positives per window

Table 3.1 Comparison of descriptors used in head detections and in pedestrian detection

Image descriptor dimension = Number of histogram bins * Number of cells per block * Total number of

blocks

Descriptor dimension in pedestrian detection = 9 * 4 * 105 = 3780 features

Descriptor dimension in our problem = 9 * 9 * 9 = 729 features
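
The block count and descriptor dimension can be checked with a few lines of Python. The sketch assumes square cells and blocks and a block stride of one cell, and the function name is hypothetical. Under these assumptions a 30 * 30 window with 6 * 6 cells and 3 * 3 blocks contains 3 * 3 = 9 blocks, and a 64 * 128 window with 8 * 8 cells and 2 * 2 blocks contains 7 * 15 = 105 blocks.

```python
def hog_dimension(width, height, cell, block, bins=9):
    """Return (number of blocks, descriptor length) for a stride of one cell."""
    cells_x, cells_y = width // cell, height // cell
    blocks_x = cells_x - block + 1
    blocks_y = cells_y - block + 1
    n_blocks = blocks_x * blocks_y
    return n_blocks, bins * block * block * n_blocks

head_blocks, head_dim = hog_dimension(30, 30, cell=6, block=3)
ped_blocks, ped_dim = hog_dimension(64, 128, cell=8, block=2)
```

The head descriptor is therefore roughly five times shorter than the pedestrian descriptor, even though each head block covers more cells.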

From Table 3.1, it is observed that the performance of HOG descriptors for head

detection is not as good as for pedestrian detection and is also lower than that of the

detector [3]. So the Viola and Jones approach for head detection is preferred over HOG for

head detection in densely crowded scenes.

3.5 Testing Results

Figures 3.1 and 3.2 show the performance of HOG on different images. From

these figures it can be observed that there are many false and missed detections.

Figure 3.1 Sample detections of a crowd facing away from the camera using HOG descriptors and linear SVM classifier

Figure 3.2 Sample detections for a crowd facing the camera using HOG descriptors and linear SVM classifier

CHAPTER 4

TRIANGLE FEATURES FOR HEAD DETECTIONS

4.1 Introduction

Features can encode ad-hoc information that is difficult to learn from raw pixels with a small number of training samples. For a feature-based detector to run in real time, its features must be computationally efficient. One of the reasons for the success of the face detector [1] is its speed of computation. In this chapter we introduce a new type of feature, called the triangle feature, which is simple and fast to compute and is a logical extension of the rectangle Haar-like features in [1]. The motivation is that triangles approximate diagonal curves better than rectangles do, as can be observed from Figure 4.1: the area under the curve can be approximated by the triangle below the diagonal of the square. This chapter shows by experiment that combining the triangle features with the rectangle Haar-like features improves the performance of the detector [3].

Figure 4.1 Triangle feature approximation

4.2 Triangle Feature

The value of a triangle feature is the difference between the sums of the pixels within the two triangular regions of a square. The pixels of a square cannot be divided equally between the two triangular regions, so the sum of the pixels of the smaller triangle is multiplied by the ratio of the two areas before it is subtracted from the sum of the pixels of the larger triangle. Example triangle features are shown in Figure 4.2.

Figure 4.2 Example triangle features

Value of triangle feature = Sum of pixels in area A1 - m * Sum of pixels in area A2,

where A1 is the larger triangle, A2 is the smaller triangle, and 'm' is the ratio of the number of pixels in A1 to the number of pixels in A2.
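As an illustration of this definition, the feature value can be computed by brute force for a single square window. This is only a sketch: it assumes that A1 consists of the pixels on or above the anti-diagonal of the window and A2 of the pixels below it, which is one of several possible orientations, and the function name is hypothetical.

```python
from fractions import Fraction


def triangle_feature_value(window):
    """Brute-force triangle feature on an n x n window (list of rows, n >= 2).

    Assumption: A1 = pixels on or above the anti-diagonal (x + y <= n - 1),
    A2 = pixels below it.
    """
    n = len(window)
    sum_a1 = sum_a2 = 0
    count_a1 = count_a2 = 0
    for y in range(n):
        for x in range(n):
            if x + y <= n - 1:
                sum_a1 += window[y][x]
                count_a1 += 1
            else:
                sum_a2 += window[y][x]
                count_a2 += 1
    m = Fraction(count_a1, count_a2)  # ratio of pixel counts, A1 to A2
    return sum_a1 - m * sum_a2


# A window that is bright in A1 and dark in A2 gives a large positive value.
window = [
    [9, 9, 9, 1],
    [9, 9, 1, 1],
    [9, 1, 1, 1],
    [1, 1, 1, 1],
]
print(triangle_feature_value(window))  # 48
```

Because of the factor 'm', a window of constant intensity gives a feature value of zero, as it would for a rectangle feature with two equal halves.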

4.3 Triangle Feature Evaluation

To calculate the sum of the pixels in a triangular region of an image, an intermediate representation called the triangle image, similar to the integral image in [1], is used. The triangle image makes the features cheap to evaluate and hence increases the speed of detection. The triangle image at location (x, y) contains the sum of the pixels that lie on or above the line drawn upwards from that pixel at 45° to the right. Example triangle images are shown in Figure 4.3.

Figure 4.3 Example triangle images. The value of the pixel at location 'P' is the sum of the pixels in the shaded region.

The triangle image can be calculated efficiently in one pass over the image using the following pair of recurrences:

s(x,y) = s(x,y-1) + i(x,y)

ti(x,y) = s(x,y) + ti(x+1,y-1)

where i(x,y) is the pixel value of the input image, s(x,y) is the cumulative column sum of the input image and ti(x,y) is the pixel value of the triangle image. Terms that refer to locations outside the image are taken as zero. The recurrences should be evaluated from top to bottom and from right to left of the image.
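The recurrences can be sketched in a few lines. The code below is an illustration rather than the implementation used in this project; it stores images as lists of rows (indexed as img[y][x]) and treats every term outside the image as zero.

```python
def triangle_image(img):
    """Triangle image of `img` (list of rows, indexed img[y][x]).

    ti[y][x] is the sum of the pixels on or above the 45 degree line that
    runs up and to the right from (x, y).
    """
    h, w = len(img), len(img[0])
    s = [[0] * w for _ in range(h)]   # cumulative column sums
    ti = [[0] * w for _ in range(h)]
    for y in range(h):                    # top to bottom
        for x in range(w - 1, -1, -1):    # right to left
            above = s[y - 1][x] if y > 0 else 0
            s[y][x] = above + img[y][x]
            diag = ti[y - 1][x + 1] if (y > 0 and x + 1 < w) else 0
            ti[y][x] = s[y][x] + diag
    return ti


img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# Bottom-left pixel: (1 + 4 + 7) + (2 + 5) + 3 = 22
print(triangle_image(img)[2][0])  # 22
```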

The evaluation of triangle features requires both the triangle image and the integral image. The computation of the integral image is explained in Section 2.3.2. The sum of the pixels within a triangle is calculated using 4 array accesses: two in the triangle image and two in the integral image. The following steps show how to find the sum of the pixels within the triangle shown in Figure 4.4.

Figure 4.4 Example triangle image to show the calculation of sum of the pixels within a triangle shown in the shaded region.

Step 1

Using the triangle image, obtain the sums of the pixels in the two triangular regions P1 and P2, where P1 is the smaller triangle and P2 is the larger triangle.

Figure 4.5 Two triangular regions P1 and P2

Step 2

Find the sum of the pixels in (P2 - P1). This gives the intermediate region shown in Figure 4.6.

Figure 4.6 Difference of two triangular regions P1 and P2

Step 3

Using the integral image, obtain the sums of the pixels in the two rectangular regions shown in Figure 4.7 and denote them P3 and P4.

Figure 4.7 Two rectangular regions P3 and P4

Step 4

Using the triangular and rectangular regions from the steps above, the sum of the pixels in the triangle in Figure 4.4 is evaluated as:

Sum of pixels in the triangle = sum of pixels in triangular region (P2 - P1) - sum of pixels in rectangular region (P3 - P4)

Figure 4.8 Sum of pixels in the triangle can be evaluated from the triangle and rectangular regions
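The four steps can be sketched as follows. The exact geometry of Figure 4.4 is not recoverable here, so the sketch assumes one concrete layout: a right triangle with its right angle at the top-left corner (x0, y_top), a vertical leg down to (x0, y0), and a hypotenuse running up and to the right at 45°. The function and parameter names are illustrative, and the triangle is assumed to lie completely inside the image.

```python
def integral_image(img):
    """ii[y][x] = sum of img over rows 0..y and columns 0..x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def triangle_image(img):
    """ti[y][x] = sum of pixels on or above the 45 degree line up-right of (x, y)."""
    h, w = len(img), len(img[0])
    s = [[0] * w for _ in range(h)]
    ti = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w - 1, -1, -1):
            s[y][x] = (s[y - 1][x] if y > 0 else 0) + img[y][x]
            ti[y][x] = s[y][x] + (ti[y - 1][x + 1] if y > 0 and x + 1 < w else 0)
    return ti


def triangle_sum(ti, ii, x0, y_top, y0):
    """Sum of the pixels (x, y) with x >= x0, y >= y_top and x + y <= x0 + y0."""
    w = len(ti[0])

    def at(table, x, y):
        # Reads outside the image count as zero.
        if x < 0 or y < 0 or x >= w:
            return 0
        return table[y][x]

    x1 = x0 + (y0 - y_top) + 1          # first column right of the triangle
    p2 = at(ti, x0, y0)                 # Step 1: larger triangular region
    p1 = at(ti, x1, y_top - 1)          # Step 1: smaller triangular region
    p3 = at(ii, x1 - 1, y_top - 1)      # Step 3: rectangle up to column x1 - 1
    p4 = at(ii, x0 - 1, y_top - 1)      # Step 3: rectangle up to column x0 - 1
    return (p2 - p1) - (p3 - p4)        # Steps 2 and 4


ones = [[1] * 5 for _ in range(5)]
# Triangle with rows of 3, 2 and 1 pixels, so the sum over an all-ones image is 6.
print(triangle_sum(triangle_image(ones), integral_image(ones), x0=1, y_top=1, y0=3))  # 6
```

Once the two tables are available, only the four reads p1 to p4 are needed for each triangle, whatever its size.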

Finding the sum of the pixels in a triangle using the triangle image and the integral image takes the same number of operations as finding the sum of the pixels in a rectangle using the integral image. Computing the triangle image is, however, an additional step that is not required for rectangle Haar-like features. Figure 4.9 shows the different orientations of the triangle features. These features can be evaluated in the same way as explained above by using flipped input images.

Figure 4.9 Different triangle features

4.4 A Single stage AdaBoost classifier to reduce the false alarms

A simple discrete AdaBoost classifier, explained in Chapter 2.2, is built with the rectangle Haar-like features and the triangle features as weak classifiers to reduce the false alarms. The single stage classifier is trained using the same positive samples that were used for training the head detector [3], together with more negative samples. During head detection, the head detector [3] is used as the initial detector; its outputs are resized to 24*24 pixels using bilinear interpolation and passed as inputs to the triangle feature classifier. The output of this chain gives the heads in the image, as shown in Figure 4.10.

Figure 4.10 Triangle feature detector

A cascade of classifiers is normally used to speed up the detection process: negatives are rejected with few computations, and more computations are spent on sub-windows that have a high probability of being positive. The objective of this classifier is only to reduce the false alarms of the detector [3]; it is not involved in the initial detection process in the image. The number of inputs to this classifier is therefore usually very small compared to the number of sub-windows in an image, and hence a single stage classifier is sufficient.
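The second stage of the chain can be sketched as below. The sketch assumes the usual Viola-Jones form of a discrete AdaBoost classifier (thresholded weak classifiers with a polarity, combined by a weighted vote); the thresholds, polarities and weights shown are made-up values, and all names are hypothetical. Each candidate is assumed to be already resized to 24*24 pixels and reduced to its selected feature values.

```python
def weak_classify(feature_value, threshold, polarity):
    """Thresholded weak classifier: 1 (head) or 0 (non-head)."""
    return 1 if polarity * feature_value < polarity * threshold else 0


def strong_classify(feature_values, weak_learners):
    """Weighted vote of the weak classifiers.

    weak_learners: list of (threshold, polarity, alpha), one per feature value.
    """
    score = sum(
        alpha * weak_classify(value, threshold, polarity)
        for value, (threshold, polarity, alpha) in zip(feature_values, weak_learners)
    )
    return score >= 0.5 * sum(alpha for _, _, alpha in weak_learners)


def filter_detections(candidates, feature_fn, weak_learners):
    """Keep only the candidates of the first detector that the single stage
    classifier accepts; the rejected ones are treated as false alarms."""
    return [c for c in candidates if strong_classify(feature_fn(c), weak_learners)]


# Toy example: two weak classifiers, candidates given directly as feature values.
weak_learners = [(10.0, 1, 0.8), (5.0, -1, 0.4)]
candidates = [[3.0, 7.0], [12.0, 7.0]]
kept = filter_detections(candidates, lambda c: c, weak_learners)
print(kept)  # [[3.0, 7.0]]
```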

4.5 Results

The following tables show the detection rate and false alarm rate of the detector [3], HOG and the triangle feature detector for head detection in densely crowded scenes:

Method                             Detection %    False Alarm %
AdaBoost rectangle Haar feature    85             26
HOG                                66             39

Table 4.1 Detection rate and false alarm rate of detector [3] and HOG for head detections

Number of features    Detection %    False Alarm %
3                     82.22          18
10                    83.94          21.63
30                    84.51          22.67
50                    84.46          22.76

Table 4.2 Detection rate and false alarm rate of Triangle feature detector for head detections

4.6 Sample outputs

Figures 4.11 and 4.12 show the outputs of the detector [3] and the detections that are removed by the triangle feature classifier.

Figure 4.11 The red and blue circles are the detections of the detector [3]. The blue circles are the detections rejected by the triangle feature detector, for heads facing away from the camera.

Figure 4.12 The red and blue circles are the detections of the detector [3]. The blue circles are the detections rejected by the triangle feature detector, for heads facing the camera.

4.7 Triangle Feature Limitations and Possible Extensions

In the face detector [1] the rectangle features are scaled, instead of the image, to find faces of different scales. This is one of the main reasons for the speed of the face detector [1]. When a rectangle feature is scaled, the threshold associated with it is scaled as well. The disadvantage of the triangle features is that they cannot be scaled in this way, because the threshold associated with the feature cannot be scaled. When a triangle feature is scaled by a factor 'a', the proportionality constant 'm' (the ratio of the number of pixels in A1 to the number of pixels in A2) is not scaled by the same factor, because the areas of the two triangles inside the square do not scale by the same value. Instead, 'm' has to be recalculated for the triangle feature of the new dimension. The triangle feature obtained by scaling is therefore a different feature, and its threshold cannot be obtained from the scale factor 'a'. This is shown for the triangle feature in Figure 4.13 by the derivation below:

Figure 4.13 Triangle feature depicted in a square of area n * n

Number of pixels in the square = n * n

Number of pixels in triangle A1 = n * (n + 1) / 2

Number of pixels in triangle A2 = n * (n - 1) / 2

m = Area of triangle A1 / Area of triangle A2

= Number of pixels in triangle A1 / Number of pixels in triangle A2

⇒ m = (n + 1) / (n - 1)

Let 'a' be the scale factor for the feature.

Number of pixels in the scaled square = (a * n) * (a * n)

Number of pixels in scaled triangle A1 = (a * n) * ((a * n) + 1) / 2

Number of pixels in scaled triangle A2 = (a * n) * ((a * n) - 1) / 2

m_scaled = ((a * n) + 1) / ((a * n) - 1)

m_scaled / m = (((a * n) + 1) * (n - 1)) / (((a * n) - 1) * (n + 1)) ≠ a

The relation m_scaled / m ≠ a shows that the threshold of the triangle feature cannot simply be scaled to obtain the threshold of the scaled triangle feature, and hence the triangle feature in Figure 4.13 is not scalable.
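A small numerical check of this derivation is given below for one arbitrarily chosen case, n = 4 and a = 2. The function name is illustrative; exact fractions are used so that the ratios are not affected by rounding.

```python
from fractions import Fraction


def pixel_ratio(n):
    """m = (pixels in A1) / (pixels in A2) for an n x n triangle feature."""
    a1 = n * (n + 1) // 2
    a2 = n * (n - 1) // 2
    return Fraction(a1, a2)


n, a = 4, 2
m = pixel_ratio(n)             # (n + 1) / (n - 1)
m_scaled = pixel_ratio(a * n)  # (a*n + 1) / (a*n - 1)
print(m, m_scaled, m_scaled / m)  # 5/3 9/7 27/35
```

The constant changes from 5/3 to 9/7 when the feature is doubled in size, and the ratio 27/35 is not equal to the scale factor 2.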

Triangle features can be made scalable if they are represented as shown in Figure 4.14. From the figure it can be observed that the triangles A1 and A2 contain the same number of pixels, so there is no need for the proportionality constant 'm'. The threshold for this feature can therefore be scaled and used in the same way as for the rectangle features in [1].

Figure 4.14 Scalable Triangle feature

CHAPTER 5

CLASSIFICATION OF HUMAN HEAD IMAGES BASED ON ORIENTATION

5.1 Introduction

Feature-based detectors evaluate each feature by applying it at the same pixel locations over all the training samples, which have the same resolution. For a feature to classify the training samples properly, the ranges of its values for the positive and the negative samples should be separated. If the feature values vary greatly over the positive training samples, the feature values of the negative samples may lie between the feature values of the positive samples, and such a feature cannot separate the positive and negative samples properly. Structural variations among the training samples, e.g. heads of different orientations, cause the values of the majority of the features to vary greatly, and in such cases no good threshold exists for the selected feature. If the positive samples can be divided based on structural similarity and trained separately, features can be selected independently for each group, which helps to increase the detection rate and reduce the false alarm rate. An example of such a detector is the multi-view face detector [16]. To train such a detector, the training samples have to be classified based on structural similarity, which can be defined depending on the object to be detected. For example in [16], structural similarity is defined based on the out-of-plane rotation angle, and the human faces are divided into 5 categories: full left profile, half left profile, frontal, half right profile and full right profile. Similarly, human heads can be coarsely classified based on their orientation (facing right, facing left, facing the camera and back of the head), and can be divided even more finely based on the angle of orientation. Instead of specifying requirements for manual classification, a new approach to classify the heads automatically is developed in this project. This automatic classification greatly reduces the time required and also helps to gain an understanding of the problem, i.e., how close or far apart heads of different orientations are.

Gradient histogram information seems to be suitable for classifying objects based on orientation, so the HOG descriptor is tested for head classification based on orientation. To calculate a HOG descriptor, an image is divided into blocks, and an orientation histogram is calculated for each block separately; the histograms are concatenated to form the image descriptor. When an image is divided into blocks, structurally different regions of the object can be separated. When images with slightly different orientations are divided into blocks, approximately the same structural region of the object therefore falls into a particular block in all the images. This helps in selecting features that classify the training samples properly. In this experiment the number of orientations is not fixed in advance; natural grouping is preferred.

5.2 Overview of Head Classification Based on Orientation

Though the results obtained from HOG are not impressive for head detection, the gradient orientation information is useful for classifying heads. The proposed approach uses gradient orientation histograms and the Bhattacharyya distance to classify heads based on orientation. Image descriptors are formed for all the training samples using gradient orientation histograms and are clustered using the Bhattacharyya distance. Each cluster is expected to contain only heads of a particular orientation.

5.3 Formation of Image Descriptor

From the surveillance videos, it is observed that the size of a head ranges between 20*20 and 40*40 pixels. Before the image descriptors are created, all the images are resized to an equal size. Intuitively the midpoint of the range appears to be a good choice, so all the training images are resized to 30*30 pixels. Each resized image is divided into four quadrants of 15*15 pixels, and a gradient orientation histogram of 9 bins is computed for each quadrant. Based on the analysis done for the HOG descriptor (for a detailed description refer to Chapter 3.4), computing overlapping blocks will not improve the performance for human head classification. The object is also very small in this application, so we choose to divide the image into only 4 non-overlapping blocks. Each block then contributes 225 pixels to its histogram; a contribution of fewer than 225 pixels seems too low to form a histogram of 9 bins. After each histogram is computed, it is normalized to be a probability density function (pdf). The pdf is chosen to compensate for brightness variation between images. If the pdf were created after the concatenation of the block histograms, a histogram peak in one block would affect the other blocks, so each block histogram is normalized before concatenation. The concatenation of the normalized histograms of the four blocks forms the descriptor.
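The descriptor formation can be sketched as follows. Several details are not stated above and are assumptions of this sketch: gradients are taken by central differences with replicated borders, orientations are unsigned (0° to 180°), each pixel votes with its gradient magnitude, and a block with no gradient at all is given a uniform pdf. The function names are illustrative.

```python
import math


def block_histogram(img, x0, y0, size, bins=9):
    """Gradient orientation histogram of one block, normalized to sum to 1."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            # Central differences; border pixels are replicated (assumption).
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Unsigned orientation in [0, 180) degrees (assumption).
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle / (180.0 / bins)), bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    if total == 0:
        return [1.0 / bins] * bins  # flat block: uniform pdf (assumption)
    return [value / total for value in hist]


def head_descriptor(img):
    """Concatenate the normalized histograms of the four quadrants.

    For a 30 * 30 image this gives four 15 * 15 blocks and 4 * 9 = 36 values.
    """
    half = len(img) // 2
    descriptor = []
    for y0 in (0, half):
        for x0 in (0, half):
            descriptor.extend(block_histogram(img, x0, y0, half))
    return descriptor


# Synthetic 30 * 30 test image (not real data).
image = [[(7 * x + 13 * y) % 256 for x in range(30)] for y in range(30)]
print(len(head_descriptor(image)))  # 36
```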

5.4 Classification Method

5.4.1 Bhattacharyya Coefficient

The Bhattacharyya distance measures the similarity of two discrete probability distributions. It is normally used to measure the separability of classes in classification. For discrete probability distributions p and q over the same domain X, it is defined as:

D_B(p,q) = -ln(BC(p,q)),

where

BC(p,q) = Σ_{x ∈ X} (p(x) q(x))^{0.5}

The Bhattacharyya coefficient BC(p,q) is an approximate measurement of the amount of overlap between two statistical samples. The coefficient can be used to determine the relative closeness of the two samples being considered.
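Both quantities follow directly from the definitions and can be sketched as below for two histograms given as lists of probabilities. Returning an infinite distance when the two distributions do not overlap at all is a choice made for this sketch.

```python
import math


def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum over x of sqrt(p(x) * q(x)); 1 for identical pdfs."""
    return sum(math.sqrt(px * qx) for px, qx in zip(p, q))


def bhattacharyya_distance(p, q):
    """D_B(p, q) = -ln(BC(p, q)); infinite when the pdfs do not overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0 else -math.log(bc)


p = [0.25, 0.75]
q = [0.75, 0.25]
print(bhattacharyya_coefficient(p, p))  # 1.0
print(bhattacharyya_coefficient(p, q) < 1.0)  # True
```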

5.4.2 Clustering

In pattern recognition applications, one of the ways to find natural groupings is clustering. Clustering can be defined as “the process of organizing objects into groups whose members are similar in some way”. The members of a group or cluster are similar to one another and dissimilar to the members of other groups. A “similarity measure” quantifies how much more alike the samples in one cluster are to one another than to the samples in other clusters. No existing “similarity measure” can be used for all problems; the user must define the “similarity measure” based on the problem at hand. Different clustering algorithms are available: K-means, Fuzzy C-means (FCM), Hierarchical Clustering, Mixture of Gaussians, etc. Fuzzy C-means is a method of clustering that allows one piece of data to belong to two or more clusters, with a degree of membership for each of the clusters it belongs to.

5.4.3 Classification Algorithm

Image descriptors are formed for all the samples as described in Section 5.3. Since the objective of this experiment is to find natural groupings, clustering is a reasonable choice. Fuzzy C-means clustering with the Bhattacharyya coefficient as the similarity measure is used for classification. The descriptor is a concatenation of four pdfs, so the maximum Bhattacharyya coefficient is 4, one for each pdf. The basic assumption of this method is that the Bhattacharyya coefficient between training samples of the same orientation is large and between samples of different orientations is small. The output of clustering is the cluster centers and the degree of membership of every training sample in each cluster. Each cluster is expected to contain samples of a specific orientation. The number of clusters has to be decided iteratively based on the experimental results.
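The membership computation of one Fuzzy C-means iteration can be sketched as below. How the similarity is converted into a distance is not specified above; the sketch assumes distance = 4 - (summed Bhattacharyya coefficient), uses the standard FCM membership formula with a fuzziness exponent of 2, and leaves out the update of the cluster centers. The names and the toy descriptors (four blocks of two bins each) are illustrative only.

```python
import math


def descriptor_similarity(d1, d2):
    """Sum of the Bhattacharyya coefficients of the concatenated block pdfs.

    For descriptors made of four pdfs the value lies between 0 and 4.
    """
    return sum(math.sqrt(a * b) for a, b in zip(d1, d2))


def memberships(sample, centers, fuzziness=2.0, blocks=4):
    """Degrees of membership of one sample in every cluster (they sum to 1)."""
    # Assumed distance: number of blocks minus the summed coefficient.
    dists = [max(blocks - descriptor_similarity(sample, c), 0.0) for c in centers]
    zero = [i for i, d in enumerate(dists) if d < 1e-12]
    if zero:
        # The sample coincides with one or more centers.
        return [1.0 / len(zero) if i in zero else 0.0 for i in range(len(centers))]
    power = 2.0 / (fuzziness - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


# Toy descriptors: four blocks with two bins each (not real data).
center_1 = [0.9, 0.1] * 4
center_2 = [0.1, 0.9] * 4
sample = [0.8, 0.2] * 4
u = memberships(sample, [center_1, center_2])
print(u[0] > u[1])  # True
```

A sample whose memberships are nearly equal for all clusters, as reported for part of the data in the next section, is one whose distances to all cluster centers are nearly the same.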

5.5 Observation

The training is started with 4 clusters and the degree of membership is observed for each of the training samples. For 13% of the samples the degree of membership is shared equally among all the clusters, and for 26% of the samples it is shared equally among 3 of the clusters. As the number of clusters is increased, the degree of membership is shared approximately equally between more clusters. When the number of clusters is reduced to 2, a more reliable degree of membership is obtained. This suggests that only 2 natural groups can be found using the Bhattacharyya coefficient similarity measure for this data set. The samples belong to two groups, frontal heads and back heads, as shown in Figures 5.1 and 5.2, respectively.

5.6 Sample Results

The following figures show how the algorithm classifies human head images into

different clusters based on orientation.

Figure 5.1 Images which are closer to the cluster center 1

Figure 5.2 Images which are closer to the cluster center 2

CHAPTER 6

CONCLUSIONS

Tracking people in densely crowded scenes is a complicated task in the field of computer vision and has large potential in surveillance applications. With the advances in the computing power of processors, this task can be done with reasonable accuracy. As a step towards this task, head detection is explored in this project. The current best performing head detection algorithm [3] is chosen and improved by reducing its false alarms. We also discussed the possible improvement of the detector [3] using the multi-view face detector approach [16], and an initial step was taken by developing a novel method for head classification based on orientation.

The current state-of-the-art object detectors are analyzed for head detection in densely crowded scenes. A modified version of the pedestrian detector [2] was tested and compared with the available head detector [3]. Based on the testing results, [3] was chosen and improved by reducing its false alarms. A new type of feature called the “triangle feature”, similar to the rectangle Haar-like features, was developed. The usefulness of this feature was shown experimentally by verifying the reduction in the false alarms of the detector [3] from 26% to 18%.

The appearance of human heads is structurally different from different viewpoints or orientations, so a multi-view object detector is very useful for head detection in densely crowded scenes. A novel method based on HOG was developed to automate the initial process of classifying training samples based on orientation for training a multi-view detector. The performance of this method was shown by classifying the human head images into frontal and rear heads. This generic method can be adapted for classifying other object groups.

REFERENCES

1. Paul Viola, Michael J. Jones: Robust Real-Time Face Detection. In: ICCV, vol. 20(11), pp. 1254-1259 (2001)

2. Navneet Dalal and Bill Triggs: Histograms of oriented gradients for human detection. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2 (2005)

3. Chern-Horng Sim, Ekambaram Rajmadhan and Surendra Ranganath: A Two-Step Approach for Detecting Individuals within Dense Crowds. In: AMDO: 166-174 (2008)

4. Jerome Friedman, Trevor Hastie, Robert Tibshirani: Additive Logistic Regression: A statistical view of boosting. Technical Report, Department of Statistics, Stanford University.

5. Yoav Freund and Robert E. Schapire: A Short Introduction to Boosting. In: Journal of Japanese Society for Artificial Intelligence, vol. 14(5), pp. 771–780 (September 1999)

6. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 878–885 (2005)

7. Rittscher, J., Tu, P.H., Krahnstoever, N.: Simultaneous estimation of segmentation and shape. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 486–493 (2005)

8. Zhu, Q., Yeh, M.C., Cheng, K.T., Avidan, S.: Fast human detection using a cascade of histograms of oriented gradients. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1491–1498 (2006)

9. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004)

10. Tuzel, O., Porikli, F.M., Meer, P.: Human detection via classification on Riemannian manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1 (2007)

11. Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. In: International Journal of Computer Vision, vol. 75(2), pp. 247–266 (2007)

12. Casas, J.R., Sitjes, A.P., Folch, P.P.: Mutual feedback scheme for face detection and tracking aimed at density estimation in demonstrations. In: Vision, Image and Signal Processing, vol. 152(3), pp. 334–346 (2005)

13. Brostow, G.J., Cipolla, R.: Unsupervised Bayesian detection of independent motion in crowds. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 594–601 (2006)

14. Rabaud, V., Belongie, S.: Counting crowded moving objects. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 705–711 (2006)

15. Viola, P., Jones, M.J., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. In: ICCV, vol. 1, pp. 734–741 (2003)

16. Chang Huang, Haizhou Ai, Bo Wu, Shihong Lao: Boosting Nested Cascade Detector for Multi-View Face Detection. In: ICPR, vol. 2, pp. 415-418 (2004)

17. http://wapedia.mobi/en/Haar-like_features

18. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: ICIP, pp. 900-903 (2002)
